Gorille: Efficient and Relevant Software Comparisons

by Philippe Antoine, Guillaume Bonfante and Jean-Yves Marion (Loria)

Binary code analysis is a complex process that can only be performed by skilled cybersecurity experts whose workload just keeps increasing. Gorille greatly speeds up their daily routines, while providing them with more in-depth knowledge.

During our work on the complexity of algorithms and on computational models, we developed an interest in malware and viruses. Malicious software, which has been increasingly discussed in the mainstream media with the rise of advanced cyber attacks, represents a practical case in which hackers push cybersecurity tools to their limits by placing them in the worst-case scenario. Currently, there simply aren’t enough skilled engineers available to cope with the ever-increasing amount of data required for cyber defence. Since the threats are relatively new and constantly evolving, professionals in this area still lack automated tools that would allow them to engage in a process of continuous improvement. This is true for cybersecurity generally, but particularly for the specific branch of binary code analysis known as reverse engineering.

Binary code analysis has become a major topic of research in the last decade. Use cases include vulnerability detection, testing, clustering, classification and malware analysis. Perhaps one of the best-known tools so far is BinDiff [1], a comparison tool for binary files designed to quickly find differences in similar software (before and after a patch, for instance) using their disassembled code. Our system Gorille takes a different approach, looking for similarities instead of differences. This design is more efficient when the compared binary files differ more than they share code. This philosophy permits us to build a scalable solution, capable of running comparisons against a database with millions of samples in seconds on a regular laptop.
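
To illustrate why searching for similarities can scale to a large database, here is a minimal sketch of an inverted index. It is not Gorille's actual implementation: the class `SiteIndex`, its methods and the notion of a "site" (any hashable fingerprint of a small piece of code) are assumptions made for the example. The point is only that a query visits the entries for its own fingerprints rather than every stored sample.

```python
from collections import Counter, defaultdict


class SiteIndex:
    """Hypothetical inverted index: fingerprint -> names of samples containing it."""

    def __init__(self):
        self._samples_by_site = defaultdict(set)

    def add(self, sample_name, sites):
        """Register every fingerprint of a known sample."""
        for site in sites:
            self._samples_by_site[site].add(sample_name)

    def query(self, sites):
        """Return (sample, shared fingerprint count) pairs, best match first."""
        hits = Counter()
        for site in set(sites):
            # Only the buckets for the query's own fingerprints are visited.
            for sample in self._samples_by_site.get(site, ()):
                hits[sample] += 1
        return hits.most_common()


# Usage with made-up sample names and fingerprints.
index = SiteIndex()
index.add("sample_A", ["f1", "f2", "f3"])
index.add("sample_B", ["f3", "f4"])
ranking = index.query(["f1", "f2", "f3", "f9"])
```

In this toy example the unknown file shares three fingerprints with `sample_A` and one with `sample_B`, so `sample_A` is ranked first.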

Figure 1: 3D representation of a control flow graph.

Furthermore, Gorille's refined output, which connects almost-identical pieces of code, can be used in several ways for reverse engineering, as shown in [2]: function identification or the classification of malware into families, for example. Thanks to the automation of the process, the addition of new samples to a knowledge database is simple, making it painless to share information before the next version of the malware pops up.

In order to achieve meaningful results, Gorille and other solutions strive towards a high level of semantics for the binary code. Control flow graphs provide a fair level of abstraction for the binary code they represent. This structure is widely used in state-of-the-art papers [3] and is the basic input for Gorille. After applying graph rewriting rules to normalise these graphs, our software tackles the subgraph search problem in a way that is both efficient and well suited to that kind of graph. This technique is described as morphological analysis since it recognizes the whole shape of the malware.
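
The two steps described above, normalisation by rewriting and subgraph search, can be sketched as follows. This is a simplified illustration under stated assumptions, not Gorille's algorithm: the single rewriting rule (merging straight-line chains), the breadth-first "site" fingerprint and all function names are hypothetical. A control flow graph is represented as a dictionary mapping each node to the ordered list of its successors.

```python
from collections import defaultdict


def merge_linear_chains(cfg):
    """Illustrative rewriting rule: merge b into a whenever a -> b is the
    only edge leaving a and the only edge entering b."""
    # Copy the graph and make sure every node appears as a key.
    succ = {node: list(outs) for node, outs in cfg.items()}
    for outs in list(succ.values()):
        for node in outs:
            succ.setdefault(node, [])

    changed = True
    while changed:
        changed = False
        preds = defaultdict(set)
        for node, outs in succ.items():
            for target in outs:
                preds[target].add(node)
        for a, outs in list(succ.items()):
            if len(outs) == 1 and outs[0] != a and preds[outs[0]] == {a}:
                succ[a] = succ.pop(outs[0])  # a inherits b's successors
                changed = True
                break  # predecessor sets are stale, recompute them
    return succ


def site_signature(cfg, root, size):
    """Fingerprint the subgraph made of the first `size` nodes reached
    breadth-first from `root`. Nodes are renumbered by discovery order, so
    the result depends on the shape only, not on node names."""
    index = {root: 0}
    queue = [root]
    edges = []
    position = 0
    while position < len(queue):
        node = queue[position]
        position += 1
        for child in cfg.get(node, []):
            if child not in index:
                if len(index) >= size:
                    continue  # outside the site: ignore node and edge
                index[child] = len(index)
                queue.append(child)
            edges.append((index[node], index[child]))
    return tuple(edges) if len(index) == size else None


def sites(cfg, size):
    """All fingerprints of `size`-node sites found in a graph."""
    found = (site_signature(cfg, root, size) for root in cfg)
    return {signature for signature in found if signature is not None}


# Two made-up graphs with different node names and padding, same diamond shape.
known = merge_linear_chains({"entry": ["a"], "a": ["b"], "b": ["c", "d"],
                             "c": ["exit"], "d": ["exit"], "exit": []})
unknown = merge_linear_chains({"s": ["t", "u"], "t": ["v"], "u": ["v"],
                               "v": ["w"], "w": []})
shared = sites(known, 4) & sites(unknown, 4)
```

After normalisation both graphs reduce to the same four-node diamond, so they share a site even though their original node names and chain lengths differ.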

That being said, some pitfalls still need to be considered. First, the output will only be as good as the input data. It is known that static disassembly cannot produce a perfect control flow graph: software running on a Turing-complete machine is able to modify itself, which makes this problem undecidable. As a matter of fact, malware makes heavy use of obfuscation techniques such as opaque predicates to hide its payload and confuse analyses. Dynamic analysis should then be used along with static disassembly to combine their strengths.
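
To make the notion of an opaque predicate concrete, here is a minimal sketch; the function names are hypothetical and the example is written in Python for readability, whereas real obfuscation operates on machine code. The condition is always true because the product of two consecutive integers is even, yet a static tool that cannot prove this must keep both branches in the control flow graph.

```python
def opaque_predicate(x: int) -> bool:
    # x * (x + 1) is a product of two consecutive integers, hence always even.
    return (x * (x + 1)) % 2 == 0


def obfuscated_step(x: int) -> str:
    if opaque_predicate(x):
        return "real code"
    # Never executed, but a static disassembler that cannot evaluate the
    # predicate still records this branch as reachable.
    return "decoy branch"
```

A dynamic trace only ever observes the first branch, which is one reason combining both kinds of analysis helps.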

Another dangerous pitfall feared by every expert is the ‘false positive rate’: false alarms that result in precious time being wasted assessing the reality of the threat. Shared binary code is not always relevant, as software often embeds statically linked standard libraries. Gorille’s solution to this issue lies in graph rewriting. By rewriting classic subgraphs into configuration-based special nodes, we obtain an even higher abstraction of the control flow graph.
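
The idea of rewriting a well-known subgraph into a single special node can be sketched as below. This is only an illustration: the function `collapse_region`, the label format and the assumption that the nodes of the library routine have already been identified are all hypothetical, and how Gorille configures its own rewriting rules is not shown here.

```python
def collapse_region(cfg, region, label):
    """Replace all nodes in `region` by one node named `label`.

    Edges entering the region are redirected to `label`, edges leaving it
    start from `label`, and edges internal to the region are dropped.
    """
    region = set(region)
    rewritten = {label: []}
    for node, successors in cfg.items():
        source = label if node in region else node
        out = rewritten.setdefault(source, [])
        for successor in successors:
            target = label if successor in region else successor
            if source == label and target == label:
                continue  # internal edge of the collapsed routine
            if target not in out:
                out.append(target)
    return rewritten


# Made-up graph in which two nodes belong to a statically linked copy routine.
cfg = {
    "main": ["copy_0"],
    "copy_0": ["copy_1"],
    "copy_1": ["copy_0", "after"],
    "after": [],
}
abstracted = collapse_region(cfg, {"copy_0", "copy_1"}, "LIB:copy")
```

Once the routine is a single node, its internal loop can no longer produce a match between two otherwise unrelated programs.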

There is enormous potential to get more out of Gorille: for instance, by building more knowledge databases to recognize packers; by making it compatible with different kinds of binary code such as Java or ARM; or by following process executions to check they do not leave their usual control flow graph (which would mean a bug is being exploited). The next breakthrough we are working on relates to data flows, which will bring additional useful information to control flows, thanks to brand new results in abstract interpretation or data tainting.
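
The execution-monitoring idea mentioned above can be pictured with a very small sketch. It assumes, hypothetically, that a trace is available as a list of visited node names and that the usual control flow graph is a dictionary of allowed successors; the function `first_violation` is an illustration, not an existing Gorille feature.

```python
def first_violation(cfg, trace):
    """Return (position, source, target) for the first step of `trace`
    that is not an edge of `cfg`, or None if the whole trace is allowed."""
    for position, (source, target) in enumerate(zip(trace, trace[1:]), start=1):
        if target not in cfg.get(source, ()):
            return position, source, target
    return None


# Made-up graph and traces.
usual = {"entry": ["check"], "check": ["ok", "fail"], "ok": [], "fail": []}
normal_run = first_violation(usual, ["entry", "check", "ok"])
hijacked_run = first_violation(usual, ["entry", "check", "injected"])
```

Here the second trace jumps from `check` to a node that is not one of its known successors, which is the kind of deviation such a monitor would flag.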

The research institute LORIA in Nancy, France, aims to do more than science with Gorille. As part of its involvement in the local economy, a spinoff called Simorfo will be created in 2016 to put the morphological analysis technology to the test on the market. This will be one more achievement after other collaborations in cybersecurity such as joint work with TRACIP, a local cyber-forensics company, or lybero.net, the latest spinoff from the cryptography team, created in March. Simorfo aims to become a European leader in cybersecurity, providing innovative solutions and measurable returns on investment to its clients through the various skills of its team.

Link:
http://www.lhs.loria.fr/wp/?page_id=96

References:
[1] H. Flake: “Structural Comparison of Executable Objects”, DIMVA, 2004, http://www.zynamics.com/downloads/dimva_paper2.pdf.
[2] G. Bonfante, J.-Y. Marion, F. Sabatier: “Gorille sniffs code similarities, the case study of Qwerty versus Regin”, 10th Int. Conference on Malicious and Unwanted Software, Oct 2015, https://hal.inria.fr/hal-01263123/
[3] A. Abraham et al.: “GroddDroid: a Gorilla for Triggering Malicious Behaviors”, 10th Int. Conference on Malicious and Unwanted Software, Oct 2015, https://hal.inria.fr/hal-01201743

Please contact:
Philippe Antoine
LORIA, France