by Pierre-Yves Vandenbussche and Bernard Vatant
The “Web of Data” has recently undergone rapid growth with the publication of large datasets, often as Linked Data, by public institutions around the world. One of the major barriers to the deployment of Linked Data is the difficulty data publishers have in determining which vocabularies to use to describe the semantics of their data. The Linked Open Vocabularies (LOV) initiative stands as an innovative observatory for the ecosystem of reusable linked vocabularies. The initiative goes beyond collecting and highlighting vocabulary metadata: it now plays a major social role in promoting good practice and improving overall ecosystem quality.
The last few years have seen the emergence of a “Web of Data”. Open government transparency initiatives, such as data.gov (US) and data.gov.uk (UK), have played a key role in its emergence, together with a diverse range of players including crowdsourcing projects (e.g. DBpedia), heritage organizations (e.g. Europeana, Library of Congress) and Web companies (e.g. schema.org). This development has been facilitated by Semantic Web technologies and standards for exposing, sharing and connecting data. In particular, the adoption of Linked Data best practices has bridged the gap between separately maintained data silos describing people, places, music, movies, books, companies, etc. Publishing data on the Web as Linked Data makes it easy for other organizations and data providers to create detailed links to your data (and vice versa) and makes your data interoperable in other contexts, so that it becomes more visible and reusable.
Initiated in March 2011, within the framework of the DataLift research project [1] hosted by the Open Knowledge Foundation, the Linked Open Vocabularies (LOV) initiative now stands as an innovative observatory of the vocabulary ecosystem. It gathers and makes visible indicators that have not previously been harvested, such as interconnection between vocabularies, versioning history and maintenance policy, and, where relevant, past and current referents (individual or organization).
LOV’s features include:
- Documentation: the best way to publish information about a vocabulary is to formally declare the metadata in the vocabulary itself [2]. The documentation helps users understand the semantics of each vocabulary term and therefore of the data using it. For instance, information about the creator and publisher is a key indication for a vocabulary user who needs help or clarification from the author, or who wants to assess the stability of that artefact. About 55% of vocabularies specify at least one creator, contributor or editor. We augmented this with manually gathered information that is not formally declared in the vocabularies, so that over 85% of vocabularies in LOV now include data about the creator.
- Versions: the LOV database stores every version of a vocabulary since its first issue. For each version, a user can access the file (even if the original online file is no longer available) and a log of modifications since the previous version.
- Dependencies: the Web is by nature distributed and uncontrolled. To embrace the complexity of the vocabulary ecosystem and assess the impact of a modification, one needs to know in which vocabularies and datasets a particular vocabulary term is referenced. For the first time, LOV provides such a view.
- Search: the LOV search feature queries a repository which contains the entire vocabulary ecosystem along with LOV metadata and metrics of vocabulary terms used in the Linked Open Data cloud. To help users select a vocabulary term, the results are ordered by a ranking algorithm based on the term's popularity in the LOD datasets and in the LOV ecosystem.
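To illustrate the kind of popularity-based ordering described for the search feature, here is a minimal sketch in Python. It is not LOV's actual ranking algorithm, which is not specified here: the `TermStats` fields, the logarithmic damping, the `lod_weight` parameter and all the counts are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TermStats:
    """Hypothetical popularity figures for one vocabulary term."""
    uri: str
    lod_occurrences: int  # assumed: uses of the term in LOD datasets
    lov_reuses: int       # assumed: LOV vocabularies referencing the term


def popularity_score(stats: TermStats, lod_weight: float = 0.5) -> float:
    """Blend the two popularity signals into a single score.

    log1p damps very large counts so that a heavily used term does not
    completely drown out the second signal. This blend is an assumption.
    """
    lod = math.log1p(stats.lod_occurrences)
    lov = math.log1p(stats.lov_reuses)
    return lod_weight * lod + (1.0 - lod_weight) * lov


def rank_terms(candidates: list[TermStats], lod_weight: float = 0.5) -> list[TermStats]:
    """Return candidate terms ordered from most to least popular."""
    return sorted(
        candidates,
        key=lambda s: popularity_score(s, lod_weight),
        reverse=True,
    )


# Made-up counts, for illustration only.
candidates = [
    TermStats("http://example.org/vocab#name", 10, 0),
    TermStats("http://xmlns.com/foaf/0.1/name", 1_000_000, 40),
    TermStats("http://purl.org/dc/terms/title", 500_000, 60),
]
ranked = rank_terms(candidates)
for term in ranked:
    print(f"{popularity_score(term):6.2f}  {term.uri}")
```

With these invented figures, the rarely used `example.org` term ends up last. Changing `lod_weight` shifts the balance between usage in datasets and reuse by other vocabularies.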
All data produced within the LOV initiative are published and openly available to the community. LOV has opened new paths in using vocabularies for Linked Open Data representation by offering new search features based on rich metadata, by providing social support, and by fostering the “long tail” of vocabularies that have thus far remained unknown despite their high quality and potential usefulness. Our results reveal the high diversity of practices, both technical and social, taking place in the life cycle of vocabularies [3]. They highlight both the healthy interconnectivity of organic growth, and a certain number of pitfalls and potential points of failure in the ecosystem. Furthermore, results show encouraging signs of a growing awareness in the community of the importance of keeping the whole ecosystem alive and sustainable, through ways of governance which, for the most part, are yet to be invented.
Links:
http://lov.okfn.org/dataset/lov/
http://datalift.org/en/
http://okfn.org/
References:
[1] F. Scharffe et al.: “Enabling linked-data publication with the datalift platform” in proc. AAAI workshop on semantic cities, 2012, http://www.aaai.org/ocs/index.php/WS/AAAIW12/paper/view/5349/5678
[2] P.-Y. Vandenbussche, B. Vatant: “Metadata recommendations for linked open data vocabularies”, white paper v1.1, 2012, http://lov.okfn.org/dataset/lov/Recommendations_Vocabulary_Design.pdf
[3] M. C. Suárez-Figueroa, A. Gómez-Pérez: “NeOn methodology for building ontology networks: a scenario-based methodology”, in proc. of the International Conference on Software, Services & Semantic Technologies, 2009, http://oa.upm.es/5475/1/INVE_MEM_2009_64399.pdf
Please contact:
Pierre-Yves Vandenbussche
Fujitsu Limited, Ireland