Kyoushi Testbed Environment: A Model-driven Simulation Framework to Generate Open Log Data Sets for Security Evaluations

by Max Landauer, Florian Skopik, Markus Wurzenberger and Wolfgang Hotwagner (AIT)

Cyber security leverages intrusion detection systems that analyse log data and network traffic to disclose suspicious activities and protect networks against cyberattacks. Verifying the functionality and measuring the effectiveness of these detection systems is not trivial, since it is usually not desirable to launch actual attacks in an organisation’s production infrastructure. Therefore, such evaluations are often carried out in isolated testbeds, i.e., simulated networks comprising components and applications that are representative of their real-world counterparts in terms of configuration, scale, and utilisation. However, setting up and maintaining such testbeds is complex and labour-intensive, particularly when experiments are required to be reproducible and adaptable. To alleviate these issues, we developed the Kyoushi Testbed Environment, an open-source simulation framework that enables automatic and parallel testbed instantiation through model-driven design, simulation of normal user activities to generate a baseline workload, injection of attack scenarios with variations, and labelling of collected log data.

Despite a great need, there are hardly any publicly available log data sets that are suitable for security exercises, such as evaluations of attack detection and classification solutions. The main problems with existing data sets are outdated or oversimplified use-cases, processed or anonymised logs, incomplete documentation, and missing reproducibility. Understandably, organisations are reluctant to make log data collected at their premises publicly available, as they likely contain traces of sensitive information, such as usernames, network configurations, setup structures, asset information, or software versions [1]. In addition, adversaries could gain insights into deployed solutions and configurations from log data and target them in attacks.

Testbeds do not suffer from such issues, as they are isolated from the production environment and therefore allow the launching of attacks against services without the fear of any adverse consequences. Another advantage is the fact that simulated normal activities are clearly discernible from attack manifestations as all activities that are expected to occur are known beforehand. This facilitates generation of a ground truth table that specifies attack times and malicious events, and is essential for computing detection accuracies in evaluations. On top of that, analysts have full control over all settings of the simulation running on a testbed, meaning that they can arbitrarily adjust simulation parameters such as the network size or average utilisation of services [2].

To further ease and automate the adaptation of simulations, the Kyoushi Testbed Environment incorporates concepts from model-driven engineering that select relevant testbed parameters from predefined dictionaries or distributions. For example, usernames are randomly chosen from thesauri and executed activities are randomly selected based on probability distributions. Designing testbeds from such abstract models introduces variations in the resulting log traces, which is advantageous for several reasons: (i) it facilitates log collection from an arbitrary number of testbeds that represent different technical environments, (ii) it increases the robustness of results when evaluating intrusion detection systems, and (iii) it enables repeated executions of similar attacks for evaluating alert aggregation approaches [3].
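
The parameter selection described above can be illustrated with a minimal sketch. It is not the Kyoushi implementation: the name list, the activity weights, and the function `sample_testbed_parameters` are hypothetical and serve only to show how a seeded random generator turns an abstract model into one concrete, repeatable parameter set.

```python
import random

# Hypothetical dictionary and distribution; not taken from the Kyoushi models.
FIRST_NAMES = ["alice", "bob", "carol", "dave", "erin"]
ACTIVITY_WEIGHTS = {"write_email": 0.5, "browse_web": 0.3, "run_command": 0.2}


def sample_testbed_parameters(seed, n_users):
    """Draw one concrete parameter set from the abstract model."""
    rng = random.Random(seed)  # seeded, so the same instance can be regenerated
    users = rng.sample(FIRST_NAMES, k=n_users)  # distinct usernames
    activities = list(ACTIVITY_WEIGHTS)
    weights = list(ACTIVITY_WEIGHTS.values())
    # Each user gets a short plan of activities drawn from the distribution.
    plans = {u: rng.choices(activities, weights=weights, k=3) for u in users}
    return {"users": users, "plans": plans}


testbed_a = sample_testbed_parameters(seed=1, n_users=3)
testbed_b = sample_testbed_parameters(seed=2, n_users=3)
print(testbed_a)
print(testbed_b)
```

Calling the function with different seeds yields different testbed variants, while reusing a seed reproduces a variant exactly, which is one simple way to combine variation with reproducibility.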

Technical Overview of the Kyoushi Testbed Environment
The Kyoushi Testbed Environment is a modular framework for testbed generation, behaviour simulation, and data handling. Figure 1 shows a conceptual overview of all involved components. The left side depicts the kyoushi-environment, which is the main component that defines the scope of the testbed and provides all models required to set up the technical infrastructure of the simulated network. Models are templated scripts that do not specify testbed parameters, such as usernames and IP addresses, and are therefore referred to as testbed-independent models (TIM). The kyoushi-generator then ingests these models and fills out all missing information. The resulting scripts are stored in the local environment and allow deployment of a specific testbed instance, and are accordingly referred to as testbed-specific models (TSM). Provisioning tools, such as Terraform, are then capable of creating the network and machines on virtualisation platforms, such as OpenStack, in a fully automatic process.
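
The transformation from TIM to TSM can be sketched as template filling. The example below uses Python's standard `string.Template` purely as a stand-in, since the templating engine of the kyoushi-generator is not described here; the template text, the parameter names, and `generate_tsm` are assumptions made for illustration.

```python
from string import Template

# Hypothetical testbed-independent model (TIM): placeholders instead of values.
TIM = Template("host: $hostname\nip: $ip_address\nadmin_user: $username\n")


def generate_tsm(tim, parameters):
    """Fill in every placeholder; a missing parameter raises KeyError."""
    return tim.substitute(parameters)


# Hypothetical parameter values for one testbed instance.
tsm = generate_tsm(
    TIM,
    {"hostname": "webserver", "ip_address": "192.0.2.10", "username": "alice"},
)
print(tsm)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a testbed-specific model with an unfilled placeholder should fail at generation time and not at deployment time.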

In addition to hardware provisioning scripts, the configurations of the user simulation and testbed are generated as part of the transformation of TIMs to TSMs. The testbed configuration comprises setup and initialisation scripts for services and applications, for example, databases and content management systems, that are suitable for automatic deployment with software provisioning tools, such as Ansible.

Figure 1: Overview of the Kyoushi Testbed Environment components and log data generation process.

To generate a baseline of normal workload on the network, the kyoushi-statemachines module provides TIMs for all possible activities carried out by users and attackers, such as writing emails, browsing the Internet, executing commands, etc. In the kyoushi-simulation component, these state machines are combined with configurations that specify parameters, such as state transition probabilities, to yield simulation TSMs. These TSMs are executed by the simulation runner, e.g., a web automation framework such as Selenium.
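
A state machine with transition probabilities, as used for the behaviour simulation, can be sketched as follows. The states, the probabilities, and the function `simulate` are hypothetical and do not reproduce the kyoushi-statemachines models; a real runner would additionally perform an action, such as driving a browser, on each transition.

```python
import random

# Hypothetical states and transition probabilities for one simulated user.
TRANSITIONS = {
    "idle": {"write_email": 0.4, "browse_web": 0.5, "idle": 0.1},
    "write_email": {"idle": 1.0},
    "browse_web": {"browse_web": 0.6, "idle": 0.4},
}


def simulate(start, steps, seed):
    """Walk the state machine and return the visited states."""
    rng = random.Random(seed)
    state, trace = start, [start]
    for _ in range(steps):
        targets = list(TRANSITIONS[state])
        weights = list(TRANSITIONS[state].values())
        # Pick the next state according to the configured probabilities.
        state = rng.choices(targets, weights=weights, k=1)[0]
        trace.append(state)
    return trace


print(simulate("idle", steps=10, seed=42))
```

Keeping the probabilities in a separate table mirrors the split described above, where the state machine itself is generic and the configuration turns it into a testbed-specific simulation.
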
The simulation may be stopped at any desired time, typically after several hours or days. A script is used to gather various log files from all hosts in the testbed, including authentication logs, access logs, error logs, application logs, syslog, network traffic, etc. The kyoushi-dataset module then stores the logs in a database and uses labelling rules generated alongside the TSMs to identify log events related to attacks.
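
Rule-based labelling of this kind can be illustrated with a small sketch. The rule format below (a time window plus a regular expression) is an assumption and not the actual kyoushi-dataset rule syntax; the timestamps, the pattern, and `label_event` are likewise hypothetical.

```python
import re
from datetime import datetime

# Hypothetical rule: the attack window would be known from the simulation.
RULES = [
    {
        "label": "password_cracking",
        "start": datetime(2022, 1, 1, 10, 0, 0),
        "end": datetime(2022, 1, 1, 10, 5, 0),
        "pattern": re.compile(r"Failed password"),
    }
]


def label_event(timestamp, message):
    """Return the labels of all rules whose window and pattern both match."""
    return [
        rule["label"]
        for rule in RULES
        if rule["start"] <= timestamp <= rule["end"]
        and rule["pattern"].search(message)
    ]


inside = label_event(datetime(2022, 1, 1, 10, 2, 0), "Failed password for root")
outside = label_event(datetime(2022, 1, 1, 9, 0, 0), "Failed password for root")
print(inside, outside)
```

Requiring both the time window and the pattern to match keeps a failed login by a simulated normal user outside the attack window from being labelled as malicious.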

The Kyoushi Testbed Environment was used to instantiate eight enterprise IT networks comprising web servers, cloud shares, groupware, etc., that vary in terms of network size, configuration, and utilisation. As part of the simulation, several attacks, such as security scans, data exfiltration, exploits, and password cracking, were launched against the servers. The data collected from these testbeds were labelled, which makes them suitable for forensic security evaluations, and were made available as open source [L1, L2].

Links:
[L1] https://zenodo.org/record/5789064
[L2] https://github.com/ait-aecid/kyoushi-environment

References:
[1] R. Uetz, et al.: “Reproducible and Adaptable Log Data Generation for Sound Cybersecurity Experiments”, in Proc. of the Annual Computer Security Applications Conference, pp. 690-705. ACM, 2021.
[2] F. Skopik, et al.: “Semi-synthetic Data Set Generation for Security Software Evaluation”, in Proc. of the Annual International Conference on Privacy, Security and Trust, pp. 156-163. IEEE, 2014.
[3] M. Landauer, et al.: “Have It Your Way: Generating Customized Log Data Sets with a Model-driven Simulation Testbed”, IEEE Transactions on Reliability, Vol. 70, Issue 1, pp. 402-415. IEEE, 2021.

Please contact:
Max Landauer
AIT Austrian Institute of Technology, Austria
+43 664 88256012