Mining Open Software Repositories

by Jesús Alonso Abad, Carlos López Nozal and Jesús M. Maudes Raedo

With the recent boom in data mining and the growth in processing power, software repository mining now represents a promising tool for developing better software. Open software repositories, with their availability and wide spectrum of data attributes, are an exciting testing ground for software repository mining and quality assessment research. In this project, the aim was to improve software development processes, specifically change control, release planning, test recording, code review and project planning.

In recent years, scientists and engineers have turned their attention to the field of software repository mining. The ability to examine not only static snapshots of software but also the way it has evolved over time is opening up new and exciting lines of research towards the goal of enhancing the quality assessment process. Descriptive statistics (e.g., mean, median, mode, quartiles of the data-set, variance and standard deviation) are not enough to generalize specific behaviours such as how prone a file is to change [1]. Data mining analyses (e.g., clustering and regression) based on the newly accessible information from software repositories (e.g., contributors, commits, code frequency, active issues and active pull requests) must be developed with the aim of proactively improving software quality, rather than only reactively responding to issues.

Open source software repositories like SourceForge and GitHub provide a rich and varied source of data to mine. Their open nature welcomes contributors with very different skill sets and experience levels, and the absence or low level of standardized workflow enforcement makes them reflect ‘close-to-extreme’ cases (as opposed to the more structured workflow patterns experienced when using, for instance, a branch-per-task branching policy). In addition, they provide easily accessible data sources for scientists to experiment with. The collection of these massive amounts of data has been supported by Qualitas Corpus [2] and GHTorrent [3], both of which have made multiple efforts to gather and offer datasets to the scientific community.

The project workflow, undertaken by our research team at the University of Burgos, Spain, included the following steps (Figure 1):

1. Obtain the data collected by GHTorrent from GitHub and put it into MongoDB databases.
2. Filter the data according to needs and expand the data where possible (e.g., downloading source code files or calculating measurements such as the number of commits, number of issues opened, etc.). Some pre-processing of the data using JavaScript was completed during the database querying step, and a number of Node.js scripts were used for several operations afterwards (e.g., file downloading or calculating static code metrics such as the number of lines of code, McCabe’s complexity, etc.).
3. Define an experiment with the aim of improving the software development process and pack the expanded data into a data table that will be supplied to a data mining tool, where it can be used with a range of techniques including regression and clustering.
4. Evaluate the data mining results and prepare experiments to validate new hypotheses based on those results.
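
The filtering and table-building steps above can be sketched in a few lines. The sketch below is only illustrative and is written in Python rather than the Node.js scripts used in the project; the record layout (`sha`, `author`, `files`), the `.js` filter and the function names are assumptions made for the example, not the GHTorrent schema or the team's code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified commit records. Real GHTorrent documents
# carry many more fields and use a different layout.
COMMITS = [
    {"sha": "a1", "author": "ana", "files": ["src/app.js", "README.md"]},
    {"sha": "b2", "author": "luis", "files": ["src/app.js"]},
    {"sha": "c3", "author": "ana", "files": ["src/util.js"]},
]


def build_file_table(commits, extension=".js"):
    """Return one row per file with its commit count and distinct-author count."""
    commit_count = defaultdict(int)
    authors = defaultdict(set)
    for commit in commits:
        for path in commit["files"]:
            if not path.endswith(extension):  # filtering step (assumed criterion)
                continue
            commit_count[path] += 1
            authors[path].add(commit["author"])
    return [
        {"file": path, "commits": commit_count[path], "authors": len(authors[path])}
        for path in sorted(commit_count)
    ]


def to_csv(rows):
    """Serialize the rows as CSV text, a format most data mining tools can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["file", "commits", "authors"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


TABLE = build_file_table(COMMITS)
print(to_csv(TABLE))
```

Each row of the resulting table describes one file, so further columns (for example lines of code or a complexity measure) could be added in the same way before the table is handed to a regression or clustering tool.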

Figure 1: The process of mining data from an open software repository.

Despite the benefits of using such repositories, it is important to remember that a lack of standardization in the integration process can sometimes result in unformatted or missing commit messages, or in frequent unstable commits. These and other constraints (not discussed here) can make mining these repositories more difficult and/or lead to sub-optimal results.
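
As a rough illustration of how such commits might be flagged before mining, the following Python sketch labels commit messages as missing or uninformative. The placeholder word list, the minimum length of 10 characters and the function name `message_problem` are all assumptions chosen for the example; they are not criteria used in the project.

```python
import re

# Assumed list of placeholder messages that carry no useful information.
PLACEHOLDER = re.compile(r"^(wip|fix|update|changes?|\.+|-+)$", re.IGNORECASE)


def message_problem(message, min_length=10):
    """Return 'missing', 'uninformative', or None if the message looks usable."""
    text = (message or "").strip()
    if not text:
        return "missing"
    if PLACEHOLDER.match(text) or len(text) < min_length:
        return "uninformative"
    return None


# Hypothetical sample messages for demonstration only.
SAMPLES = [None, "   ", "wip", "Fix null check in parser"]
REPORT = [message_problem(sample) for sample in SAMPLES]
print(REPORT)
```

Commits flagged in this way could be excluded from an experiment or counted as a quality indicator in their own right, depending on the hypothesis being tested.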

Until now, software quality assessment has focused on single snapshots taken throughout the life of the software. Thus, the assessments have not been able to take the time variable into account. The use of software repositories allows researchers to address this shortcoming. Consequently, future software repository mining will play a key role in enhancing the software development process, allowing developers to detect weak points, predict future issues and provide optimized processes and development cycles. Open software repositories offer a number of future research opportunities.

Links:
http://sourceforge.net/
https://github.com/
http://qualitascorpus.com/
http://ghtorrent.org/

References:
[1] I. S. Wiese et al.: “Comparing communication and development networks for predicting file change proneness: An exploratory study considering process and social metrics,” Electron. Commun. EASST - proc. of SQM 2014, vol. 65, 2014.
[2] E. Tempero et al.: “The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies,” 2010 Asia Pacific Softw. Eng. Conf., 2010.
[3] G. Gousios, D. Spinellis: “GHTorrent: Github’s data from a firehose,” 9th IEEE MSR, 2012.

Please contact:
Jesús Alonso Abad
University of Burgos, Spain
Tel: +34 600813116

Carlos López Nozal
University of Burgos, Spain
Tel: +34 947258989

Jesús M. Maudes Raedo
University of Burgos, Spain
Tel: +34 947259358
