Open and Shared Infrastructure for Software Research

by Jurgen Vinju (CWI and TU Eindhoven)

Research software engineers, scientific programmers, PhD students and postdocs alike spend their time and energy on engineering the software for their research infrastructure. Sometimes this can be reused in follow-up research projects, but most often valuable software output falls like trees in the forest, without an audience, and without users. Surprisingly, even in the academic field of software engineering, research software infrastructure is not sustained, leaving a wide gap of opportunity for more reuse and more impact.

Software has penetrated every aspect of society to the point where it has become critical to its day-to-day functioning. We must learn to better understand software: how to construct it, maintain it, check it and control it. Based on an increased understanding we can learn to better control the risk factors of software (from financial risk to personal safety risk) and we can learn to innovate with higher quality and more agility.

Understanding the complexity of software requires excellent observational instruments that enable top-quality empirical research methods. Since 2009 the Rascal metaprogramming language [L1] has grown into an infrastructure for empirical research in software engineering. By providing both language-agnostic and language-specific intermediate data representations of source code and of data around software processes (versions, issues, discussions), it has enabled new research that covers programming languages such as Lua, PHP, Java, C and C++ in a familiar and consistent environment. Rascal and its core supporting language front-ends were developed in the context of both national and European collaborations, such as NWO MERITS, FP7 OSSMETER and H2020 CROSSMINER.

However, there is ample room for growth and improvement. The field is still hampered by a lack of up-to-date, easy-to-use, and easy-to-combine (integrated) instruments for collecting data about software and the software development process. The instruments that do exist are scattered, isolated and incompatible. Therefore, we are extending the Rascal platform (and community) for software analysis and transformation with many new high-quality data sources and accurate instruments for software data acquisition and analysis. By design, these instruments will be easy to integrate, so that data about software can be linked in new and unforeseen ways. For example, this year the Ada-air front-end was added to enable the analysis of high-tech components programmed in the Ada language.

On the one hand, all these instruments are comparable to radio telescopes: they acquire the raw data which are essential for knowledge discovery. On the other hand, dissecting software is more like microbiology: every little piece has a different shape and size and requires specialised instruments. Even though these science analogies fail to explain the whole story, the need to learn from accurate observations of software is no less urgent. Software has grown exponentially in the last decades, making manual analysis impossible.

The automated extraction of data from source code, and of information about the software development process from sources such as version control systems, discussion forums and design diagrams, is burdened by countless heterogeneous details. Furthermore, such data acquisition instruments must avoid error and bias, and the information from different data sources must also be linked. Relative to the downstream research, validated and well-integrated instruments are very expensive to obtain. Moreover, new programming languages, frameworks, and collaboration tools emerge at an alarming rate. Researchers can easily spend more than 50% of their time building and connecting their research instruments. Making high-quality software data acquisition and linking instruments reusable for a wide audience of researchers presents an opportunity to increase the impact of software research.

Figure 1: The Rascal Lab.

The “Rascal Lab”, as described above and depicted in Figure 1, is a vision of a more complete and sustainable laboratory. It is based on the existing Rascal metaprogramming community [1] and supported by a broad consortium of researchers from all Dutch universities. The vision comprises three parts: the construction of many more reusable and accurate data acquisition instruments, the acquisition of several (curated) corpora of software data using those instruments, and instruments to link, integrate, analyse and report on the extracted data. This is an ongoing and growing effort, and we regularly welcome new participants to the community.

If you are interested in participating, please contact the author or connect on GitHub [L2].

Links:
[L1] http://www.rascal-mpl.org
[L2] https://github.com/usethesource/rascal

References:
[1] P. Klint, T. van der Storm and J. J. Vinju, “Rascal, 10 years later,” in IEEE International Working Conference on Source Code Analysis and Manipulation, 2019.

Please contact:
Jurgen Vinju, CWI, The Netherlands