SPARQL: A Gateway to Open Data on the Web?

by Pierre-Yves Vandenbussche, Aidan Hogan, Jürgen Umbrich and Carlos Buil Aranda

Hundreds of datasets on the Web can now be queried through public, freely available SPARQL services. These datasets contain billions of facts on a wide range of topics and are hosted by a variety of publishers, including household names such as the UK and US governments, the BBC and the Nobel Prize Committee. A Web client using SPARQL could, for example, ask about the winners of Nobel Prizes from Iceland, or about national electric power consumption per capita in Taiwan, or about homologs found in eukaryotic genomes, or about Pokémon that are particularly susceptible to water attacks. But are these novel SPARQL services ready for use in mainstream Web applications? We investigate further.

With a wealth of Linked Data now available on the Web, and more on the way, the robustness of SPARQL technology (a W3C Recommendation [1]) for accessing and querying these data is of key importance. While SPARQL has undeniable advantages in query expressivity and interoperability, publishing data using this technology comes at a price. SPARQL services are usually offered free of charge to arbitrary clients over the Web, and quality of service often suffers. Endpoints may go offline, return only partial results, or take longer to answer than a user is willing to wait. As a result, these endpoints may not be usable for mainstream applications.

The SPARQLES (SPARQL Endpoint Status) project aims to clarify the current state of public SPARQL endpoints deployed on the Web by monitoring their health and performance. The project is an ongoing collaboration between Fujitsu Labs; DCC, Universidad de Chile; PUC, Chile (Grant NC120004); and INSIGHT@NUI Galway. Hosting of the project is provided by the not-for-profit Open Knowledge Foundation (OKFN).

The SPARQLES project currently monitors 442 public SPARQL endpoints registered in Datahub (a community-based data catalogue). The results from the monitoring system are continuously updated on a public website (see links) that provides information along the following dimensions:

Discoverability – how can an agent discover a SPARQL endpoint and what data/metadata does it store? Of the two methods available to describe an endpoint, SPARQL 1.1 Service Descriptions are used by only 10% of the endpoints and VoID descriptions by 30% [2]. SPARQLES indicates whether these descriptions are provided for monitored endpoints, offering direct access where available.
Interoperability – which SPARQL functionalities are supported? As with any database, the implementation of the SPARQL standards (versions 1.0 and 1.1) can vary from one vendor/tool to another. The SPARQLES system assesses the compliance of each endpoint with the SPARQL standard and presents any exceptions that occur to the user. We generally find good compliance with the SPARQL 1.0 standard, but support for recently standardised SPARQL 1.1 features is still sparse [2].
Performance – what general query performance can be expected? Is the endpoint’s performance good enough for a particular application? SPARQLES runs timed experiments over the Web against each endpoint, testing the speed of various operations, such as simple lookups, streaming results and performing joins. A detailed breakdown of the performance of the endpoint is then published on the SPARQLES website. Across all endpoints, the median time for answering a simple lookup is 0.25 seconds, for streaming 100,000 results is 72 seconds, and for running a join with 1,000 intermediate results is 1 second [2]. However, the performance of individual endpoints can vary by up to three orders of magnitude for comparable tasks.
Availability – what is the average uptime based on hourly pings? Which SPARQL endpoints can we trust to be online when we need to query them? For the past three years, SPARQLES has been issuing hourly queries to each public SPARQL endpoint to test whether it is online. From these data, the system computes the availability of an endpoint for a given period as the ratio of the number of requests that succeed to the total number of requests made. Looking at monthly availability, we found that 14.4% of endpoints are available 95% to 99% of the time, 32.2% of endpoints are available 99% to 100% of the time, while the remainder are available less than 95% of the time [2]. The SPARQLES website shows each endpoint’s availability during the last 24 hours and during the last seven days, so application developers can make a more informed decision about whether or not they can rely on an endpoint.
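
The availability measure described above can be illustrated with a minimal Python sketch. It is not the SPARQLES implementation: the function names, the sample data and the treatment of the exact 95% and 99% boundaries are assumptions made purely for illustration.

```python
from typing import Iterable


def availability(ping_results: Iterable[bool]) -> float:
    """Return the share of successful requests (0.0 to 1.0) for a period.

    Each element is True if the hourly request succeeded, False otherwise.
    """
    results = list(ping_results)
    if not results:
        raise ValueError("no requests recorded for this period")
    return sum(results) / len(results)


def availability_band(ratio: float) -> str:
    """Map a ratio onto the three bands quoted in the text.

    Assumption: a ratio of exactly 0.99 falls in the top band and
    exactly 0.95 in the middle band; the article does not say.
    """
    if ratio >= 0.99:
        return "99-100%"
    if ratio >= 0.95:
        return "95-99%"
    return "<95%"


# Hypothetical endpoint: a 30-day month (720 hourly requests), 12 of which failed.
month = [True] * 708 + [False] * 12
ratio = availability(month)
print(f"{ratio:.4f}", availability_band(ratio))
```

With this made-up month, 708 of 720 requests succeed, so the endpoint would fall into the 95% to 99% band.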

As a whole, SPARQLES contributes to the adoption of SPARQL technology by giving the community a complete view of the health of available endpoints [3]. Furthermore, for the first time, this project provides a tool to monitor the service provided by data publishers, creating an incentive for publishers to maintain a high-quality service. Future work will include packaging the tool (already openly available on GitHub) as a standalone version, which will make it easy for anyone to monitor their endpoint locally. This next step will include an alerts feature that notifies users when errors occur.

Links:
http://sparqles.okfn.org/
https://github.com/pyvandenbussche/sparqles
http://datahub.io/
http://okfn.org/

References:
[1] E. Prud’hommeaux, A. Seaborne: “SPARQL query language for RDF”, W3C Recommendation, 2008, http://www.w3.org/TR/rdf-sparql-query
[2] C. Buil-Aranda et al.: “SPARQL Web-Querying Infrastructure: Ready for Action?”, in The Semantic Web–ISWC, 2013, http://vmwebsrv01.deri.ie/sites/default/files/publications/paperiswc.pdf

Please contact:
Pierre-Yves Vandenbussche,
Fujitsu (Ireland) Limited