by Pierre Maillot, Catherine Faron, Fabien Gandon and Franck Michel (Inria)
Where can we find the data we need for a given task? IndeGx aims to create an index of knowledge graphs published on the web, in the form of linked open datasets, for humans and machines alike. This framework provides descriptions of the knowledge graphs to draw a picture of their content, quality and compliance with the standards. A companion webpage is also provided to support data providers in the description of their datasets in compliance with the latest best practices.
In recent years, a large number of knowledge graphs have been built and published on the web in the form of RDF datasets, in fields as diverse as linguistics and life sciences, along with general-purpose datasets such as DBpedia or Wikidata. The reliable exploitation of these datasets requires specific knowledge about their content, access points and commonalities. Yet a clear and up-to-date description is usually hard to obtain: most datasets have neither a machine-readable nor a human-readable description, and not all access points can handle the complex queries required to generate such descriptions automatically. Data providers are commonly regarded as responsible for describing the datasets they publish. However, this requires specific efforts, costs, and skills that some providers, who are not necessarily experts in semantic web technologies, do not have or cannot afford. In particular, these descriptions rely on a deep understanding and joint use of specialised vocabularies, and there is no standard model or tool for generating and updating them.
IndeGx is a transparent, declarative, collaborative and extensible framework designed to generate the description of a knowledge graph based solely on information that can be extracted from the SPARQL endpoint that serves that knowledge graph. It is part of the ongoing effort to help humans and machines use knowledge graphs on the web. It creates an open repository of descriptions to guide agents in selecting knowledge graphs, and supports a variety of use cases. For instance, the descriptions could fuel the faceted search of a dataset catalogue meant for human agents, or a query federation engine could leverage the statistics on the usage of classes and properties in each dataset to efficiently rewrite a query over multiple graphs.
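As an illustration of the federation use case, the following minimal sketch selects the endpoints worth querying from per-class usage statistics. The endpoint URLs, the class names, the shape of the `DESCRIPTIONS` dictionary and the `select_endpoints` function are all hypothetical; this is not the method of any particular federation engine, nor the format in which IndeGx publishes its descriptions.

```python
# Hypothetical statistics: for each endpoint, the number of instances per class.
DESCRIPTIONS = {
    "https://example.org/a/sparql": {"Person": 120, "City": 5},
    "https://example.org/b/sparql": {"Person": 10},
    "https://example.org/c/sparql": {"Person": 500, "City": 50},
}


def select_endpoints(descriptions, required_classes):
    """Return the endpoints that report instances for every required class,
    ordered from the largest to the smallest total number of instances."""
    candidates = []
    for endpoint, class_counts in descriptions.items():
        # Keep an endpoint only if all the classes of the query are used in it.
        if all(class_counts.get(cls, 0) > 0 for cls in required_classes):
            total = sum(class_counts[cls] for cls in required_classes)
            candidates.append((total, endpoint))
    return [endpoint for _, endpoint in sorted(candidates, reverse=True)]


if __name__ == "__main__":
    print(select_endpoints(DESCRIPTIONS, ["Person", "City"]))
```

In this toy setting, a query involving both classes would only be sent to the two endpoints that actually describe instances of both, which is the kind of pruning that such statistics make possible.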
IndeGx relies on generation rules expressed in standard RDF vocabularies and SPARQL to yield common metadata features such as provenance information and lists of classes and properties, as well as new kinds of metadata features not covered by previous approaches, such as quality indicators, lists of vocabularies and new statistics. Furthermore, the processing of a dataset at different points in time makes it possible to track its evolution. The descriptions generated by IndeGx are represented in RDF and published as a regular, public knowledge graph.
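To make the idea of a SPARQL-based generation rule more concrete, here is a minimal sketch of one possible rule: a query that counts the instances of each class, and a function that turns its result into a description using the VoID vocabulary. The query, the function names and the sample response are assumptions made for illustration; the actual IndeGx rules are available in the repository [L1] and may differ substantially.

```python
import json

# Hypothetical rule: count the distinct instances of each class in a dataset.
CLASS_COUNT_QUERY = """
SELECT ?class (COUNT(DISTINCT ?s) AS ?count)
WHERE { ?s a ?class }
GROUP BY ?class
"""

# Made-up endpoint response in the SPARQL 1.1 Query Results JSON Format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["class", "count"]},
    "results": {"bindings": [
        {"class": {"type": "uri", "value": "http://xmlns.com/foaf/0.1/Person"},
         "count": {"type": "literal", "value": "2",
                   "datatype": "http://www.w3.org/2001/XMLSchema#integer"}},
        {"class": {"type": "uri",
                   "value": "http://www.w3.org/2004/02/skos/core#Concept"},
         "count": {"type": "literal", "value": "7",
                   "datatype": "http://www.w3.org/2001/XMLSchema#integer"}},
    ]},
})


def parse_class_counts(sparql_json):
    """Read a SPARQL JSON result set into a {class IRI: instance count} dict."""
    document = json.loads(sparql_json)
    return {
        row["class"]["value"]: int(row["count"]["value"])
        for row in document["results"]["bindings"]
    }


def to_void_turtle(dataset_iri, counts):
    """Serialise the counts as VoID class partitions, in Turtle."""
    lines = ["@prefix void: <http://rdfs.org/ns/void#> ."]
    for class_iri, count in sorted(counts.items()):
        lines.append(
            f"<{dataset_iri}> void:classPartition "
            f"[ void:class <{class_iri}> ; void:entities {count} ] ."
        )
    return "\n".join(lines)


if __name__ == "__main__":
    counts = parse_class_counts(SAMPLE_RESPONSE)
    print(to_void_turtle("https://example.org/dataset", counts))
```

The sketch works on a stored response rather than a live endpoint so that it stays self-contained; in practice, `CLASS_COUNT_QUERY` would be sent to the endpoint of the dataset being described.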
The rules that generate the descriptions are written with the same vocabulary as the one used by the W3C to describe test suites in RDF. Combined with the execution traces, which are also kept and represented in RDF, these rules make the process of generating a dataset description fully transparent and traceable. Moreover, the rules and the IndeGx application are available under open licences [L1], and anyone can replicate them, extend them or create their own set of rules, as long as they use the same vocabulary.
The KartoGraphI website [L2] relies on IndeGx to draw a picture of the open sources of the semantic web. At the time of writing, it provides different visualisations computed to monitor, over an 8-month period, 339 datasets with SPARQL endpoints retrieved from well-established dataset catalogues such as the Linked Data Cloud website and Wikidata. The details of the results of that experiment can be found in [1]. Despite multiple technical constraints, we obtained detailed statistics on 54% of the datasets. Such statistics serve as the basis for exploration, selection and federation methods. For instance, on average, the datasets’ endpoints support 85% of the SPARQL 1.1 features, and knowing this allows for advanced usage of these datasets.
Figure 1: Geolocalisation of the endpoints described in IndeGx as shown on the KartoGraphI website.
Figure 1 shows the geolocalisation of the endpoints of the datasets based on their URL. Figure 2 shows the graph of endpoints and vocabularies. The three green nodes in the top-left part of the graph, connected to a group of vocabularies around them, are the three mirror sites of Linked Open Vocabularies. The vocabularies that surround them are listed in the Linked Open Vocabularies dataset but not used in any other dataset. The blue nodes grouped at the centre of the graph are the most-used vocabularies; they are surrounded by the majority of the endpoints. Vocabularies used in a single dataset, or datasets using a single vocabulary, radiate from this middle group. Likewise, Figure 3 shows which endpoint uses which (meta-)vocabulary (e.g. SKOS, SPIN, SHACL), thus providing an idea of the type of content that can be found in the corresponding dataset.
Figure 2: Graph of the endpoints (in green) and vocabularies (in blue) of the 339 datasets monitored.
Figure 3: Graph of the endpoints connected to the different (meta-)vocabularies.
Only 10% of the analysed datasets contain provenance information. To encourage data providers to write provenance metadata in RDF, and rich descriptive metadata more generally, we created the Metadatamatic webpage [L3]. It allows users to generate an RDF description by filling out an interactive form. The description is guaranteed to follow the latest best practices for dataset description, and some of the features can be automatically extracted from the dataset if it is served by a running endpoint.
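The general idea of turning form fields into an RDF description can be sketched as follows. The form keys, the `describe_dataset` function and the selection of properties (taken from the Dublin Core Terms, VoID and PROV-O vocabularies) are assumptions made for this example; they do not reproduce the actual output of Metadatamatic.

```python
def escape_literal(text):
    """Escape the characters that cannot appear raw in a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def describe_dataset(form):
    """Build a small Turtle description from a dict of (hypothetical) form fields."""
    lines = [
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "",
        f"<{form['iri']}> a void:Dataset ;",
        f'    dcterms:title "{escape_literal(form["title"])}" ;',
        f"    dcterms:license <{form['license']}> ;",
        # Provenance: who is responsible for the dataset.
        f"    prov:wasAttributedTo <{form['creator']}> ;",
        f"    void:sparqlEndpoint <{form['endpoint']}> .",
    ]
    return "\n".join(lines)


# Made-up form content, for illustration only.
EXAMPLE_FORM = {
    "iri": "https://example.org/dataset",
    "title": 'A "small" example dataset',
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": "https://example.org/people/alice",
    "endpoint": "https://example.org/sparql",
}

if __name__ == "__main__":
    print(describe_dataset(EXAMPLE_FORM))
```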
Together, IndeGx, KartoGraphI, and Metadatamatic provide a set of tools to encourage, support and monitor the description of linked open datasets.
Links:
[L1] https://github.com/Wimmics/dekalog
[L2] http://prod-dekalog.inria.fr/
[L3] https://wimmics.github.io/voidmatic/
References:
[1] P. Maillot, O. Corby, C. Faron, F. Gandon and F. Michel, “IndeGx: A model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits,” Journal of Web Semantics, 2023, https://doi.org/10.1016/j.websem.2023.100775, https://hal.inria.fr/hal-03946680
[2] P. Maillot, O. Corby, C. Faron, F. Gandon and F. Michel, “KartoGraphI: Drawing a map of linked data,” in ESWC 2022 - 19th European Semantic Web Conferences, May 2022, Hersonissos, Greece, https://hal.inria.fr/hal-03652865
Please contact:
Pierre Maillot, University Côte d’Azur, Inria, CNRS, I3S, France