Carrot²: Making Sense of the Haystack

Looking for a piece of information in large collections of documents (such as the Internet) is very much like looking for a needle in a haystack. One can limit the number of matching documents returned from a search engine with the right choice of keywords, but this usually requires some initial knowledge of the context in which the information in question may appear. When this context is unclear, or if this very context is the resource sought, information retrieval becomes a great challenge.

Carrot² is a collection of algorithms and tools designed to help humans explore the thematic context of documents retrieved from a text collection. A set of documents (for example, a list of results retrieved from a search engine) is analysed and dynamically grouped into clusters, each related to a common topic (see Figure 1). Typical examples used to demonstrate this technique show the context of broad and ambiguous queries like 'apache' (helicopter, Indian tribe or software organization) or 'salsa' (dance or food). However, the gains from explicit visualization of context in information retrieval go far beyond simple queries. Companies such as Google, Amazon and Vivísimo already employ techniques for context exploration and visualization to improve their search products. Carrot² offers best-of-breed algorithms and contains demonstration applications for clustering data from multiple sources, including search engines like Google, Yahoo! and MSN, and data repositories like Wikipedia or PubMed.

Figure 1: Information flow inside Carrot². A set of search results is clustered into topic groups and then shown back to the user in a variety of ways (hierarchy of topics, graph of relationships, etc.).

Carrot² was established in 2001 by Dawid Weiss and Stanisław Osiński, who at the time were students at Poznań University of Technology in Poznań, Poland. From its inception the project was meant to be open and provide value to both the research and commercial communities (BSD licence). The source code was published at SourceForge and further development took place in the open.

From a research point of view, the task of text clustering presents a great challenge, especially in a multilingual context. While a number of document-clustering techniques exist, they all lack the fundamental ability to provide sensible descriptions (labels) of the output document groups. This has been the primary focus of the Carrot² project: to extract sensible groups of documents on related topics and, above all, to provide a short, comprehensive description of these clusters.
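To make the idea of labelled clusters concrete, the following Python fragment is a minimal sketch of grouping search-result snippets under a shared descriptive term. It is not one of the Carrot² algorithms; the function names (`tokenize`, `cluster_by_label`), the stopword list and the sample snippets are all hypothetical and chosen only for illustration. The sketch picks candidate labels first (terms that occur in at least two snippets, excluding the query itself) and then assigns documents to them.

```python
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not taken from Carrot²).
STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "to", "with", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?()'\"").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def cluster_by_label(snippets, query, min_size=2):
    """Group snippet indices under terms shared by at least `min_size` snippets.

    Each shared term doubles as the cluster label. Query terms are ignored,
    since they occur in (almost) every result and would not discriminate.
    """
    query_terms = set(tokenize(query))
    by_term = defaultdict(set)
    for idx, snippet in enumerate(snippets):
        for term in set(tokenize(snippet)) - query_terms:
            by_term[term].add(idx)
    clusters = {t: sorted(ids) for t, ids in by_term.items() if len(ids) >= min_size}
    # Largest clusters first; ties broken alphabetically for stable output.
    return dict(sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0])))


# Hypothetical search results for the ambiguous query 'apache'.
snippets = [
    "Apache helicopter specifications",
    "Apache HTTP server download",
    "Apache attack helicopter",
    "Configuring the Apache web server",
    "Apache tribe history",
    "Culture of the Apache tribe",
]

clusters = cluster_by_label(snippets, "apache")
for label, ids in clusters.items():
    print(label, ids)
```

On this sample the three senses of the query separate into groups labelled 'helicopter', 'server' and 'tribe'. Single-word labels break down quickly on real data, which is why producing short, readable cluster descriptions is treated as the hard part of the problem.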

At the time of writing, the project includes a number of original text-clustering algorithms and auxiliary components for text processing. In 2004, Carrot² was awarded a special prize for research tools in the finals of the European Academic Software Award competition. The rough-set-based clustering algorithm included in the project received the best paper award at the 2005 IEEE Web Intelligence conference. We are also proud to have a number of deployments worldwide, many references in research literature and a few sibling open-source projects using Carrot² components. Constantly growing commercial interest in text-clustering services and algorithms led to the establishment in 2005 of a spin-off company, Carrot Search, which took over the maintenance and further development of the project.

Links:
http://www.carrot2.org
http://www.carrot-search.com

Please contact:
Dawid Weiss
Institute of Computer Science, Poznan University of Technology, Poland
E-mail: dawid.weiss@cs.put.poznan.pl