by Patrick Kochberger, Sebastian Schrittwieser (University of Vienna) and Edgar R. Weippl (SBA Research)
In cybercrime, malware plays a major role, and malware authors rely heavily on code obfuscation techniques such as packing, virtualisation and control flow transformations, as well as other anti-analysis methods, to hide malicious functionality in binary code. With thousands of new malware samples emerging every day, efficient analysis is crucial for fighting malware-based cybercrime. We present a novel meta-framework for malware analysis that helps find the optimal analysis strategy for a malware sample. The research was conducted in a joint project with Ghent University in Belgium [L1].
Code obfuscation [1] is widely used for protecting benign software, but also for hiding malicious functionality in malware. The basic idea of obfuscation is to intentionally modify code in such a way that its (malicious) functionality is more difficult to detect and analysis becomes more time-consuming. In malware identification, potentially malicious code samples are often analysed in dynamic malware sandboxes, which observe their functionality and interaction with the operating system at runtime. Malware, however, can often detect that it is running in a sandboxed environment and then suspends its malicious activities to avoid detection.
The second important malware analysis methodology is automated static code analysis, which gives quick insight into the functionality contained in a binary. The state of the art in static code analysis has made great strides in speed and coverage in recent years [2]. Generally speaking, static code analysis aims at approximating a program's behaviour without actually executing it. Because a static analysis only has incomplete information, it has to make assumptions, and different tools make different ones. Their results are therefore approximations that can differ from tool to tool. Especially for obfuscated binaries, this means that multiple static analysis tools can produce very different approximations: some tools might deliver usable results, while others might fail completely. In addition, the static code analysis landscape is highly diverse, with methodologies ranging from formal model checking through data flow analysis to machine-learning-based approaches. The results of all these methodologies depend heavily on the analysed binaries, their structure and the applied code obfuscations.
To efficiently find the optimal static code analysis methodology for a given malware binary, we developed a novel, fully automatic meta-framework that runs multiple analyses in parallel and compares their results.
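The idea of running several analyses side by side and comparing their output can be illustrated with a minimal Python sketch. It is not the framework's actual code: the names `run_in_parallel`, `jaccard` and `pairwise_agreement` are hypothetical, each analyser is assumed to be a callable that returns a set of function start addresses, and Jaccard similarity is just one plausible way to measure agreement.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

# Assumption: every analyser takes a file path and returns the set of
# function start addresses it recovered from that binary.
Analyzer = Callable[[str], Set[int]]


def run_in_parallel(
    analyzers: Dict[str, Analyzer], binary_path: str
) -> Tuple[Dict[str, Set[int]], List[str]]:
    """Run all analysers concurrently; return (results, names of failed tools)."""
    results: Dict[str, Set[int]] = {}
    failed: List[str] = []
    # Threads are enough here if the real tools run as external processes.
    with ThreadPoolExecutor(max_workers=max(1, len(analyzers))) as pool:
        futures = {name: pool.submit(fn, binary_path) for name, fn in analyzers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                # A tool that crashes on an obfuscated sample is recorded, not fatal.
                failed.append(name)
    return results, failed


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Similarity of two result sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_agreement(results: Dict[str, Set[int]]) -> Dict[Tuple[str, str], float]:
    """Compare every pair of successful tools."""
    return {
        (x, y): jaccard(results[x], results[y])
        for x, y in combinations(sorted(results), 2)
    }
```

Low agreement between tools, or a tool appearing in the failure list, would be a hint that the sample needs a closer look or a different methodology.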
The framework consists of so-called modules and actions (see Figure 1). Actions represent reverse engineering or analysis tasks (e.g., listing functions, disassembling bytes, or reconstructing a control flow graph). The tool-specific modules translate the actions selected in the configuration of an analysis run into the specific parameters required by the various binary analysis frameworks. In our implementation of the framework, a module consists of a simple shell script for setup and a Python class for the actual translation tasks, which derives from a generic analysis base class. For each framework, the class connects to its API, calls the individual tools and collects the output for a certain task. The output is then cleaned and normalised, and can be used for further analysis tasks or for comparison with the results from other tools.
Figure 1: A fully automatic, novel meta-framework.
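To make the module and action split more concrete, the following Python sketch shows how a tool-specific class could derive from a generic base class, translate an action into tool parameters and normalise the tool's output. The class names, method names and action names (`AnalysisModule`, `command_for`, `normalise`, `"disassemble"`, `"list_sections"`) are assumptions made for illustration and are not taken from the framework itself. Only the objdump flags `-d` (disassemble) and `-h` (section headers) are real.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List


class AnalysisModule(ABC):
    """Hypothetical generic base class for tool-specific modules."""

    name = "generic"

    @abstractmethod
    def command_for(self, action: str, binary_path: str) -> List[str]:
        """Translate an abstract action into a tool-specific command line."""

    @abstractmethod
    def normalise(self, action: str, raw_output: str) -> List[Dict[str, object]]:
        """Clean the tool output into records comparable across tools."""


class ObjdumpModule(AnalysisModule):
    """Illustrative module for objdump (action names are assumptions)."""

    name = "objdump"
    _FLAGS = {"disassemble": "-d", "list_sections": "-h"}
    # Matches lines such as "  401000:<TAB>55   <TAB>push   %rbp".
    _INSN = re.compile(r"^\s*([0-9a-f]+):\t((?:[0-9a-f]{2} )+)\s*\t(\S+)\s*(.*)$")

    def command_for(self, action: str, binary_path: str) -> List[str]:
        if action not in self._FLAGS:
            raise ValueError(f"{self.name} does not support action {action!r}")
        return ["objdump", self._FLAGS[action], binary_path]

    def normalise(self, action: str, raw_output: str) -> List[Dict[str, object]]:
        if action != "disassemble":
            raise NotImplementedError(f"no normaliser for {action!r} in this sketch")
        records: List[Dict[str, object]] = []
        for line in raw_output.splitlines():
            match = self._INSN.match(line)
            if match is None:
                continue  # skip headers, symbol labels and byte-only lines
            address, raw_bytes, mnemonic, operands = match.groups()
            records.append({
                "address": int(address, 16),
                "bytes": raw_bytes.replace(" ", ""),
                "mnemonic": mnemonic,
                "operands": operands.strip(),
            })
        return records
```

In such a design, a run would pass the result of `command_for` to `subprocess.run` and feed the captured text to `normalise`, so that records from different tools share one shape and can be compared directly.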
The binary analysis frameworks that we integrated in the meta-framework are freely available, not cloud-based, and provide an API or scriptable interface of some kind. Currently, the framework includes the static binary analysis tools radare2, rizin, angr, AMOCO, BARF, Capstone, Distorm3, Ghidra, objdump and Jakstab. Besides basic information on a sample (MIME-type, extension, architecture, etc.), the framework allows extracting the control flow graph, functions, sections, and the disassembly of the code.
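As an illustration of the "basic information" step, the sketch below reads the architecture, word size and byte order from an ELF header using only the Python standard library. It is a simplified example under stated assumptions: the function name `basic_elf_info` is hypothetical, only ELF files are handled (a real setup would also need to cover formats such as PE), and only a handful of machine codes are mapped.

```python
import struct
from pathlib import Path
from typing import Dict

# A few e_machine values from the ELF specification.
_MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def basic_elf_info(data: bytes, filename: str = "") -> Dict[str, str]:
    """Extract basic facts about a sample from its ELF header."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: "32", 2: "64"}.get(data[4], "unknown")   # EI_CLASS
    endian = {1: "<", 2: ">"}.get(data[5])              # EI_DATA
    if endian is None:
        raise ValueError("unknown byte order")
    # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64.
    (machine,) = struct.unpack_from(endian + "H", data, 18)
    return {
        "extension": Path(filename).suffix,
        "format": "ELF",
        "bits": bits,
        "byte_order": "little" if endian == "<" else "big",
        "architecture": _MACHINES.get(machine, f"unknown ({machine})"),
    }
```

Only the first 20 bytes of the file are needed, so this kind of check can run cheaply before any heavier analysis tool is started.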
The meta-framework for static binary analysis introduced here is a first step towards making large-scale malware analysis more efficient, as its results indicate to a human analyst the most promising methodology for further analysis tasks. In the future, we aim to use the framework for collecting large-scale datasets of analysis runs from different types of binaries (e.g., built using different compilers and obfuscations). With the help of machine learning, we then want to identify which types of binaries are best analysed with which static code analysis methodology, even before running any of the analysis tools.
The project EMRESS [L2] is funded by the Austrian Science Fund (FWF) under grant I 3646-N31. The financial support by the Austrian Federal Ministry for Digital and Economic Affairs and the National Foundation for Research, Technology and Development and the Christian Doppler Research Association is gratefully acknowledged.
[1] S. Schrittwieser et al.: “Protecting Software through Obfuscation: Can It Keep Pace with Progress in Code Analysis?”, ACM Computing Surveys 2017. https://doi.org/10.1145/2886012
[2] C. Pang et al.: “SoK: All You Ever Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask”, IEEE SP 2021. https://doi.org/10.1109/SP40001.2021.00012
Patrick Kochberger
University of Vienna, Austria
Sebastian Schrittwieser
University of Vienna, Austria
Edgar R. Weippl
SBA Research, Austria