A System for the Discovery and Selection of FLOSS Projects

by Francesca Arcelli Fontana, Riccardo Roveda and Marco Zanoni

Developing software systems by reusing components is a common practice. FLOSS (Free, Libre and Open Source Software) components represent a significant part of the reusable components available. The selection of suitable FLOSS components raises important issues both for software companies and research institutions. RepoFinder supports a keyword-based discovery process for FLOSS components, and applies well-known software metrics and analyses to compare the structural aspects of different components.

FLOSS (Free, Libre and Open Source Software) components are widely used in industry. They are managed on dedicated web platforms, called Code Forges, designed to collect, promote, and categorize FLOSS projects, and to support development teams. Some examples are Launchpad, Google Code, Sourceforge, Github, Bitbucket and GNU Savannah. Code Forges provide tools and services for distributed development teams, e.g., version control systems, bug tracking systems and web hosting space. Currently, Github hosts more than ten million software components, and is probably the largest. Sourceforge, created in 1999, is one of the first Code Forges; it hosts more than 425,000 components, all of them well categorized, while Google Code hosts about 250,000 components.

To find a component satisfying specific requirements, we can start from a simple query on a generic search engine or on each Code Forge. The risk is that we receive many useless answers, because FLOSS project names are often ambiguous and difficult to recover. Even retrieving the source code of a particular project can be difficult, because it may be managed by different portals, with different versioning technologies (e.g., Git, SVN, Mercurial (HG), CVS, GNU Bazaar).
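As an illustration of why retrieval differs from portal to portal, the following minimal sketch maps each versioning technology mentioned above to the standard client command that fetches a working copy. The function name `checkout_command` and its parameters are hypothetical and are not part of RepoFinder; the sketch only builds the command and does not run it.

```python
from typing import List, Optional


def checkout_command(vcs: str, url: str, dest: str,
                     module: Optional[str] = None) -> List[str]:
    """Return the client invocation that fetches a working copy into dest.

    Hypothetical helper: the name and signature are illustrative only.
    """
    vcs = vcs.lower()
    if vcs == "git":
        return ["git", "clone", url, dest]
    if vcs == "svn":
        return ["svn", "checkout", url, dest]
    if vcs in ("hg", "mercurial"):
        return ["hg", "clone", url, dest]
    if vcs in ("bzr", "bazaar"):
        return ["bzr", "branch", url, dest]
    if vcs == "cvs":
        # CVS addresses a module inside a repository root, so the URL alone
        # is not enough to identify what to check out.
        if module is None:
            raise ValueError("CVS needs a module name in addition to the root")
        return ["cvs", "-d", url, "checkout", "-d", dest, module]
    raise ValueError(f"unsupported version control system: {vcs}")
```

The returned list can be passed to `subprocess.run` as is; keeping command construction separate from execution makes the mapping easy to check without network access.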

RepoFinder is implemented as a web application and has been designed to retrieve and analyze the source code from open source software repositories. It performs these tasks through the following steps:
1. Crawling of widely known Code Forges (Github, Sourceforge, Launchpad and Google Code) through their APIs or by parsing HTML pages; during this step, projects are tagged with labels extracted from the crawled web pages and from the other available metadata (e.g., tags, forks, stars).
2. Analysis of the components, by integrating external software for the computation of metrics and the detection of code smells in Java code. We integrated the following tools: Checkstyle, JDepend, NICAD, PMD, Understand and another tool for metrics computation, called DFMC4J, developed at our Lab.
3. Integration of the data gathered by the code analyzers in a dataset, to be used for statistical analyses.
4. Browsing of the data through a search engine based on simple Google-like queries.
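The label-based browsing described in the steps above can be sketched as a small inverted index from labels to project names, queried with space-separated keywords. This is only an assumption about how such a search could work, not RepoFinder's actual implementation; `build_index`, `search` and the sample projects are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(projects: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map every (lower-cased) label to the set of projects tagged with it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, labels in projects.items():
        for label in labels:
            index[label.lower()].add(name)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the projects that carry every keyword of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get avoids creating empty entries in the defaultdict for unknown terms.
    matches = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*matches))


# Invented sample data, for illustration only.
projects = {
    "alpha-parser": ["java", "parser", "xml"],
    "beta-orm": ["java", "database", "orm"],
    "gamma-xml": ["xml", "java"],
}
index = build_index(projects)
print(search(index, "java xml"))  # ['alpha-parser', 'gamma-xml']
```

Treating the query as a conjunction of keywords keeps the result set small; a ranking step based on metadata such as stars or forks could be layered on top.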

We have indexed more than 90,000 projects using RepoFinder, and found more than 2,000 labels. Analyses can only be applied to Java systems with a working Maven configuration. Maven ensures that all relevant sources are included in the analyses. It also retrieves all dependencies needed to compile the systems and to gather the required metrics correctly. In fact, Maven automatically downloads library dependencies and, depending on the configuration, can execute code generation tools (e.g., ANTLR, JOOQ), select only the source code without test cases and, if required, compile the system. RepoFinder can also visualize the results gathered for one system and compare the results of different systems.
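To make the role of the Maven configuration concrete, the sketch below reads the dependencies declared in a `pom.xml` and separates test-only dependencies from the rest. It is a simplified illustration under stated assumptions: it only looks at the top-level `dependencies` element, ignores parent POMs, profiles and dependency management, and the sample POM and function names are invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace used by Maven POM files (model version 4.0.0).
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Invented sample POM, for illustration only.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
      <version>4.5</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>"""


def list_dependencies(pom_xml: str) -> List[Tuple[str, str, str]]:
    """Return (groupId, artifactId, scope) for each directly declared dependency."""
    root = ET.fromstring(pom_xml)
    deps = []
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS)
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS)
        # Maven treats a missing <scope> as "compile".
        scope = dep.findtext("m:scope", default="compile", namespaces=POM_NS)
        deps.append((group, artifact, scope))
    return deps


def non_test_dependencies(pom_xml: str) -> List[Tuple[str, str, str]]:
    """Keep only the dependencies that are not restricted to test code."""
    return [d for d in list_dependencies(pom_xml) if d[2] != "test"]


print(non_test_dependencies(SAMPLE_POM))
```

In practice the full resolution is best left to Maven itself, since only Maven applies inheritance and transitive dependencies; the sketch merely shows which information the configuration exposes.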

The overall architecture of RepoFinder is represented in Figure 1.

Figure 1: RepoFinder Architecture

We aim to apply and extend RepoFinder in the following directions:

Analysis of software history. Software repositories contain important information about the development team and system maturity. Many analyses can be performed in this context; for example, to understand whether the quality of a software system has improved over time, and whether new releases tend to be corrected quickly, which suggests low maturity. This kind of data can be evaluated for an individual project or compared across different projects.
Language support. Currently, RepoFinder analyzes Java systems configured to be built by Maven. Many other languages and build systems are available and widely exploited in open source development. Examples of other build systems for Java are Eclipse, Ant, Gradle, Ivy, Grape and Buildr. Analyzing additional languages could require the integration of further analysis tools. Other tools also exist that could retrieve useful information, like Cobertura (for systems with large test suites), or FindBugs and other code smell detection tools.
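One possible way to quantify the "new releases corrected quickly" signal mentioned under the analysis of software history is to measure the interval between consecutive releases and report the share of releases that follow their predecessor within a few days. The sketch below is an assumption for illustration only: the seven-day threshold, the function names and the release dates are all invented, and RepoFinder does not necessarily compute this measure.

```python
from datetime import date
from typing import List, Tuple

Release = Tuple[str, date]  # (version label, release date)


def days_between_releases(releases: List[Release]) -> List[Tuple[str, int]]:
    """For each release after the first, the days elapsed since the previous one."""
    ordered = sorted(releases, key=lambda release: release[1])
    gaps = []
    for (_, previous_day), (version, day) in zip(ordered, ordered[1:]):
        gaps.append((version, (day - previous_day).days))
    return gaps


def quick_fix_ratio(releases: List[Release], threshold_days: int = 7) -> float:
    """Share of releases published within threshold_days of their predecessor.

    The default threshold is an arbitrary, hypothetical choice.
    """
    gaps = days_between_releases(releases)
    if not gaps:
        return 0.0
    quick = sum(1 for _, gap in gaps if gap <= threshold_days)
    return quick / len(gaps)


# Invented release history, for illustration only.
history = [
    ("1.0", date(2014, 1, 1)),
    ("1.0.1", date(2014, 1, 4)),
    ("1.1", date(2014, 3, 1)),
    ("1.1.1", date(2014, 3, 3)),
]
print(days_between_releases(history))
print(quick_fix_ratio(history))
```

Because the ratio is normalized by the number of releases, it can be compared across projects with different release counts, which matches the idea of evaluating the data for one project or across several.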

Other applications with similar functionalities exist, such as SonarQube and Ohloh. However, RepoFinder differs in that it combines discovery functionalities, closer to those of Ohloh, with analysis functionalities, which are the domain of SonarQube. There are a number of datasets that gather data coming from different Code Forges, which could be integrated in the crawling and analysis processes of RepoFinder. However, their availability is variable and, unfortunately, some of these data-collection projects are closed.

Please contact:
Francesca Arcelli Fontana and Marco Zanoni
University of Milan Bicocca, Italy
E-mail: {arcelli, zanoni}@disco.unimib.it