by Jeroen van den Bos and Tijs van der Storm
Recovering evidence of criminal activities from digital devices is often difficult, time-consuming and prone to errors. The Software Analysis and Transformation group at CWI designed Derric, a domain-specific language (DSL) that enables the efficient construction of scalable digital forensics tools.
An important part of digital forensics is recovering evidence from digital devices. This typically includes recovery of images, text documents and email messages relevant to a forensic investigation. Currently, such investigations depend crucially on custom-made software, which often has to be modified on a case-by-case basis. This software also needs to scale to datasets in the terabyte range. CWI applies state-of-the-art language engineering tools and techniques to make the construction and maintenance of such software less error-prone and less time-consuming.
An application area for digital forensics software is “file carving”, the process of recovering files from a digital device without the use of file system metadata. File carving is used, for instance, to recover child pornography images, even though the suspect may have tried to delete them. Moreover, because of fragmentation, a file may be spread over a device in multiple fragments. File carvers therefore match sequences of bytes against the structure of a certain file format and attempt to reconstruct the original file.
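The basic idea of carving without file system metadata can be illustrated with a deliberately naive sketch in Python. It scans a raw byte buffer for the JPEG start-of-image and end-of-image markers and reports candidate byte ranges. The function name `carve_jpeg_candidates` and the `max_size` limit are assumptions made for this illustration; the sketch ignores fragmentation and does no structural validation, so it is far simpler than the carvers discussed in this article.

```python
# Naive header/footer carving: an illustration only, not how Derric works.
SOI = b"\xff\xd8\xff"  # JPEG start-of-image marker plus the first byte of the next marker
EOI = b"\xff\xd9"      # JPEG end-of-image marker


def carve_jpeg_candidates(image: bytes, max_size: int = 10 * 1024 * 1024) -> list:
    """Return (start, end) byte offsets of candidate JPEG files in a raw image.

    Assumes each file is stored contiguously; fragmented files are missed.
    """
    candidates = []
    pos = 0
    while True:
        start = image.find(SOI, pos)
        if start == -1:
            break
        # Look for the closing marker within max_size bytes of the start.
        end = image.find(EOI, start + len(SOI), start + max_size)
        if end != -1:
            candidates.append((start, end + len(EOI)))
            pos = end + len(EOI)
        else:
            # No footer in range: skip this header and keep scanning.
            pos = start + 1
    return candidates


# Usage: a toy "disk image" with one embedded candidate between zero padding.
raw = b"\x00" * 16 + SOI + b"\xe0" + b"payload" + EOI + b"\x00" * 8
found = carve_jpeg_candidates(raw)
print(found)
```

Because the marker bytes can also occur by chance inside unrelated data, a scan like this produces false positives, which is one reason carvers need detailed knowledge of the file format's structure.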
File formats, such as JPEG (images), ZIP (archives) and DOC (documents), play a crucial role in file carving. They define the structure necessary to determine if a raw file fragment might be part of a complete file of a certain type. File formats exist in many versions and vendor-specific variants. In the current state of practice, file format knowledge is often intertwined with complex, highly optimized file carving algorithms for reassembling parts of fragmented files. This lack of “separation of concerns” makes changing forensic software error-prone and time-consuming.
Derric: a DSL for file formats
Derric is a domain-specific language (DSL) designed by CWI that can be used to describe file formats. Such descriptions are then input to a code generator that generates high-performance file carvers. This way, knowledge about file formats is isolated from the algorithmic file carving code. Forensic investigators can focus on maintaining and evolving file format descriptions, whereas software engineers can focus on optimization of the runtime system.
A Derric description consists of three parts: a configuration header, the sequence section, and a list of structure definitions. The configuration header declares file type metadata, such as endianness, signedness and string encodings. The sequence section then describes the high-level structure of a file using a regular expression. Finally, the tokens used in the regular expression are defined in the structure section. Each structure is identified by a name and contains one or more fields. The contents and length of a field may be arbitrarily constrained in order to guide the matching process. In our experience, Derric is expressive enough to describe a wide range of file formats.
Figure 1: An example of the Derric description of JPEG.
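The three-part layout described above (configuration, sequence, structures) can be mimicked in plain Python to make the separation of concerns concrete. The sketch below is not Derric syntax and does not reproduce the figure. The toy format (a `TOY1` header, `R` records and an `END` footer), the class names and the greedy tokenizer are all assumptions made for this illustration. Derric generates carver code from a description, whereas this sketch simply interprets one.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    name: str
    length: int                       # field length in bytes
    expected: Optional[bytes] = None  # optional constraint on the contents


@dataclass
class Structure:
    name: str
    fields: List[Field]

    def match(self, data: bytes, offset: int) -> Optional[int]:
        """Return the offset just past this structure, or None on mismatch."""
        for field in self.fields:
            chunk = data[offset:offset + field.length]
            if len(chunk) < field.length:
                return None
            if field.expected is not None and chunk != field.expected:
                return None
            offset += field.length
        return offset


# Hypothetical toy format, laid out in the same three parts as a description.
TOY_FORMAT = {
    # Part 1: configuration (unused by this tiny matcher, shown for shape only).
    "config": {"endian": "big", "strings": "ascii"},
    # Part 2: sequence, a regular expression over space-terminated structure names.
    "sequence": r"(Header )(Record )*(Footer )",
    # Part 3: structure definitions with constrained fields.
    "structures": [
        Structure("Header", [Field("magic", 4, b"TOY1")]),
        Structure("Record", [Field("tag", 1, b"R"), Field("payload", 2)]),
        Structure("Footer", [Field("magic", 3, b"END")]),
    ],
}


def tokenize(data: bytes, structures: List[Structure]):
    """Greedily split data into structure names; stop at the first unmatched offset."""
    names, offset = [], 0
    while offset < len(data):
        for structure in structures:
            end = structure.match(data, offset)
            if end is not None:
                names.append(structure.name)
                offset = end
                break
        else:
            break  # no structure matched at this offset
    return names, offset


def validates(data: bytes, description: dict) -> bool:
    """Check that data consists of known structures in the order the sequence allows."""
    names, offset = tokenize(data, description["structures"])
    if offset != len(data):
        return False
    token_string = "".join(name + " " for name in names)
    return re.fullmatch(description["sequence"], token_string) is not None


good = b"TOY1" + b"Rab" + b"Rcd" + b"END"
bad = b"TOY1" + b"END" + b"Rab"  # footer appears before a record
print(validates(good, TOY_FORMAT), validates(bad, TOY_FORMAT))
```

In this arrangement, supporting a format variant means editing only the `TOY_FORMAT` data, while `tokenize` and `validates` stay untouched. This is the kind of separation between format knowledge and matching logic that the article attributes to Derric.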
We have evaluated Derric by comparing generated file carvers to existing file carvers that are used in forensic practice [1]. Our results show that the Derric-based file carvers perform as well as the best existing file carvers, and sometimes even better. Derric is implemented in Rascal, a metaprogramming language, and its implementation is very small: around 2000 lines of Rascal and a runtime library of 4200 lines of Java code. As a result, the overhead of maintaining the DSL implementation is acceptable.
An additional advantage of declaratively describing file formats using Derric is that the descriptions can be transformed before passing them to the code generator. Source-to-source transformation can be applied to configure the trade-off between runtime performance and accuracy. We have implemented three such transformations, which yield successively more efficient carvers. In certain forensics cases, it may be more effective to compromise on accuracy in order to obtain results more quickly. Since the transformations are fully automated, this trade-off can be made without having to change any code. We have evaluated the effect of the transformations on a 1 TB test image. Our results show that performance gains of up to a factor of three can be achieved, at the expense of up to 8% in precision and 5% in recall.
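For readers unfamiliar with the two accuracy measures, the following Python sketch shows how precision and recall are conventionally computed for a carver's output. The file names and the two hypothetical carvers are invented for illustration and are unrelated to the evaluation reported above.

```python
def precision_recall(recovered: set, relevant: set) -> tuple:
    """Compute (precision, recall) for a set of recovered items.

    precision = correctly recovered / everything recovered
    recall    = correctly recovered / everything that should have been recovered
    """
    true_positives = len(recovered & relevant)
    precision = true_positives / len(recovered) if recovered else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up ground truth and outputs of two hypothetical carver configurations.
ground_truth = {"a.jpg", "b.jpg", "c.jpg", "d.jpg"}
strict_carver = {"a.jpg", "b.jpg", "c.jpg"}             # misses one file, no false hits
relaxed_carver = {"a.jpg", "b.jpg", "x.bin", "y.bin"}   # misses two files, two false hits

strict = precision_recall(strict_carver, ground_truth)
relaxed = precision_recall(relaxed_carver, ground_truth)
print(strict, relaxed)
```

A drop in precision means more recovered items turn out not to be valid files of the target type, while a drop in recall means more genuine files are missed.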
Conclusion
Digital forensics, now more than ever, is crucially dependent on software. DSLs can help untangle the concerns that are at play in the domain of forensics. Derric is an important step in this direction: by separating file format descriptions from how they are used in implementation, forensic tools become easier to modify. Moreover, model transformation provides opportunities for configuring trade-offs that would otherwise be cast in stone.
Links:
Derric: http://www.derric-lang.org/
Rascal: http://www.rascal-mpl.org
References:
[1] J. van den Bos and T. van der Storm, “Bringing Domain-Specific Languages to Digital Forensics”, in: Proc. of the 33rd International Conference on Software Engineering (ICSE'11), Software Engineering in Practice, ACM, 2011
[2] J. van den Bos and T. van der Storm, “Domain-Specific Optimization in Digital Forensics”, in: Proc. of the 5th International Conference on Model Transformation (ICMT'12), 2012
Please contact:
Jeroen van den Bos, Tijs van der Storm
CWI, The Netherlands
E-mail: