by Lars Bröcker
The semantic Web offers exciting opportunities for scientific communities: knowledge bases with underlying ontologies promote the process of collaborative knowledge creation and facilitate research based on the published body of work. However, since most scientific communities do not possess the knowledge required to build an ontology, these opportunities tend not to be taken up. The WIKINGER project aims to provide tools that largely automate the creation of an ontology from a domain-specific document collection.
There is a multitude of interesting content on the Web that would greatly benefit from inclusion into the semantic Web, eg cultural heritage collections, educational resources and scientific digital libraries. These collections offer a plethora of possibilities for knowledge applications if only they could be made accessible through ontologies. However, as long as domain experts require the assistance of ontology engineers to even begin the process of building a shallow ontology for their own domains, this is unlikely to happen.
Although there are tools available that guide the creation of ontologies, they are pitched at the level of an experienced ontology engineer. Domain experts wanting to create an ontology must therefore become ontology-creation experts before they can make use of these tools. A more sensible approach would be to provide tools that guide the ontological layperson through the process of creating an ontology for their domain, automating the creation as much as possible.
WIKINGER is a joint project involving the Department of Computational Linguistics at the University of Duisburg-Essen, the Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme (IAIS), and the Commission for Contemporary History (KfZG). The goal is to create ontologies from collections of scientific documents; these are accessed using a Wiki containing the entities and concepts relevant to the domain, as well as qualified associations connecting them.
Named entity recognition (NER) is employed to automatically extract entities of different concept classes from the text collection. Domain experts provide examples for these concept classes using WALU, the WIKINGER Annotation and Learning Environment. Figure 1 shows a typical screenshot of the tool in action. Entities are marked up in different colours in the text, and are thus easily discerned at a glance. They are then assigned to a concept class by pressing the corresponding button in the lower part of the editor. WALU contains helper functions that mark up entities automatically, eg date/time information, previous annotations or entities provided in list form. The results of this step are sent back to the WIKINGER server, which then trains the extraction algorithms to extract entities across the whole document collection.
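The automatic helper functions can be illustrated with a minimal Python sketch. It is not WALU's actual implementation: the function name `pre_annotate`, the `Annotation` record, the date pattern (German-style day.month.year) and the example entity lists are all assumptions made for illustration. The sketch simply marks up dates with a regular expression and looks up entities supplied in list form.

```python
import re
from typing import NamedTuple

class Annotation(NamedTuple):
    start: int      # character offset where the entity begins
    end: int        # character offset just past the entity
    text: str       # the marked-up surface string
    concept: str    # concept class assigned to the entity

# Assumed date format: day.month.year, eg 5.1.1951
DATE_PATTERN = re.compile(r"\b\d{1,2}\.\s?\d{1,2}\.\s?\d{4}\b")

def pre_annotate(text, entity_lists):
    """Mark up dates and list-provided entities (hypothetical helper).

    entity_lists maps a concept class name to a list of known entity names.
    Returns annotations sorted by their position in the text.
    """
    annotations = []
    for match in DATE_PATTERN.finditer(text):
        annotations.append(
            Annotation(match.start(), match.end(), match.group(), "Date"))
    for concept, names in entity_lists.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                annotations.append(
                    Annotation(match.start(), match.end(), match.group(), concept))
    return sorted(annotations)

if __name__ == "__main__":
    sample = "Konrad Adenauer visited Cologne on 5.1.1951."
    lists = {"Person": ["Konrad Adenauer"], "Place": ["Cologne"]}
    for annotation in pre_annotate(sample, lists):
        print(annotation)
```

In a real annotation tool, such pre-annotations would only be suggestions that the domain expert confirms or corrects before they are used as training examples.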
Whereas this step provides the nodes of the semantic network, the following step provides the edges connecting these nodes. These edges represent the relations between the different concept classes. They are determined in a relation discovery module that first identifies all statistically relevant relations between classes in a training part of the corpus. Association rule mining and clustering algorithms are applied to this task: the association rules highlight interesting concept combinations in the text collection, and clustering algorithms then separate the different relations present in the occurrences of each rule.
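The association rule step can be sketched as follows. This is a simplified illustration rather than the project's actual module: it assumes that each sentence has already been reduced to the set of concept classes occurring in it, considers only rules between pairs of classes, and uses the standard support and confidence measures with thresholds chosen arbitrarily. The function name `mine_pair_rules` is hypothetical, and the subsequent clustering of rule occurrences is not shown.

```python
from collections import Counter
from itertools import permutations

def mine_pair_rules(sentences, min_support=0.3, min_confidence=0.6):
    """Find rules 'class A -> class B' over per-sentence concept class sets.

    support    = share of sentences containing both A and B
    confidence = share of sentences containing A that also contain B
    Returns a sorted list of (A, B, support, confidence) tuples.
    """
    total = len(sentences)
    if total == 0:
        return []
    single_counts = Counter()
    pair_counts = Counter()
    for classes in sentences:
        present = set(classes)
        single_counts.update(present)
        # Ordered pairs, so A -> B and B -> A are evaluated separately
        pair_counts.update(permutations(sorted(present), 2))
    rules = []
    for (antecedent, consequent), count in pair_counts.items():
        support = count / total
        confidence = count / single_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)

if __name__ == "__main__":
    # Toy input: concept classes found in four sentences (invented data)
    corpus = [
        {"Person", "Organisation"},
        {"Person", "Organisation"},
        {"Person", "Place"},
        {"Date"},
    ]
    for rule in mine_pair_rules(corpus):
        print(rule)
```

A rule such as Organisation -> Person only says that the two classes co-occur often enough to be interesting; which relations actually hold between them is what the clustering of the rule's occurrences is meant to reveal.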
The relation discovery process can be governed by domain experts using a Java application. It generates the association rules, performs the clustering and allows the selection and labelling of relations that are to be included in the ontology. The different clusters can be reviewed on screen: they can be merged with other similar clusters, discarded in the event of mistakes, and finally labelled if they fit into the ontology that is to be created. Results are sent back to the server, which applies them to the automatically extracted entities in order to populate the ontology. The server then provides access to the ontology for Web applications or via SPARQL (SPARQL Protocol and RDF Query Language).
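To give an impression of SPARQL access, the following Python sketch prepares a query request in the way the SPARQL protocol allows, with the query passed as the `query` parameter of an HTTP GET. The endpoint URL, the `ex:` namespace, and the class and property names (`ex:Person`, `ex:memberOf`) are placeholders invented for this example; the actual WIKINGER endpoint and vocabulary are not described here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; replace with the address of a real SPARQL service
ENDPOINT = "http://example.org/wikinger/sparql"

# Hypothetical vocabulary: persons and the organisations they belong to
QUERY = """
PREFIX ex: <http://example.org/wikinger/ontology#>
SELECT ?person ?organisation
WHERE {
  ?person a ex:Person ;
          ex:memberOf ?organisation .
}
LIMIT 10
""".strip()

def build_sparql_request(endpoint, query):
    """Build an HTTP GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results in the standard SPARQL JSON results format
    return Request(url, headers={"Accept": "application/sparql-results+json"})

if __name__ == "__main__":
    request = build_sparql_request(ENDPOINT, QUERY)
    # The request is only printed, not sent, since the endpoint is a placeholder
    print(request.full_url)
```

Against a real endpoint, the request could be sent with `urllib.request.urlopen` and the response parsed as JSON.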
The pilot application for our system focuses on the domain of contemporary history, in particular the social and political history of German Catholicism. Historians from the KfZG are already using the tools described above for the creation of annotation examples and the subsequent selection of appropriate relations for the ontology describing their domain. They are reporting no difficulties in adapting to these new tools. The metaphors used in WALU imitate known behaviours used in marking up text passages, and thus integrate easily into a familiar pattern of workflow. Slightly more effort is required to become acquainted with the relation discovery application, since the actions performed here are new to the experts.
The prototype of the system is in its final development stages, and its release to the scientific community is planned for early 2008. Work conducted on different data sets shows the transferability of the approach: WALU for example achieved second place in the EVALITA shared task on NER in Italian newspapers in 2007.
The project is funded by the German Federal Ministry of Education and Research in the programme 'e-Science und Wissensvernetzung' (e-Science and Knowledge Networking). Work on the project began in October 2005, and will be completed in September 2008.
Lars Broecker, Fraunhofer-Institut Intelligente Analyse- und Informationssysteme, Germany
Tel: +49 2241 14 1993