by Phil Archer (W3C)
Sharing data between researchers, whether openly or not, requires effort, particularly concerning its metadata. What is the minimum metadata needed to aid discovery? Once data has been discovered, what metadata is needed in order to be able to evaluate its usefulness? And, since it’s not realistic to expect everyone to use the same metadata standard to describe data, how can different systems interoperate with the metadata that is commonly provided? These topics and more were discussed in Amsterdam in late 2016.
The Smart Descriptions & Smarter Vocabularies Workshop (SDSVoc) was organized by ERCIM/W3C under the EU-funded VRE4EIC project and hosted by CWI in Amsterdam. Of the 106 people who registered, an estimated 85-90 attended. The event comprised a series of sessions in which thematically related presentations were followed by Q&A with the audience, which notably included representatives from both the scientific research and government open data communities.
Participants of the Smart Descriptions & Smarter Vocabularies Workshop.
The workshop began with a series of presentations of different approaches to dataset description, including the CERIF [L1] standard used by VRE4EIC, followed by a closely related set of experiences of using the W3C’s Data Catalog Vocabulary, DCAT [L2]. It was very clear from these talks that DCAT needs to be extended to cover gaps that practitioners have found different ways to fill. High on the list are versioning and the relationships between datasets, but other features, such as descriptions of APIs and a link to a representative sample of the data, are also missing.
High-level descriptions of any dataset are likely to be very similar (title, licence, creator etc.) but, to be useful to another user, the metadata will need to include domain-specific information. A high-profile case is data related to specific locations in time and space, and one can expect this to be part of the general descriptive regime. On the other hand, highly specialized data, such as details of experiments conducted at CERN, will always need esoteric descriptions.
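As an illustration of the general-purpose layer described above, the following minimal sketch builds a DCAT-style description as JSON-LD. The dataset, the example.org identifiers and the helper `missing_general_properties` are all hypothetical; only the `dcat:` and `dct:` terms come from the published vocabularies, and domain-specific properties would be added alongside them.

```python
import json

# Hypothetical dataset description; every example.org URL and value is invented.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/air-quality-2016",
    "@type": "dcat:Dataset",
    # General-purpose metadata that most descriptions share.
    "dct:title": "Air quality measurements 2016",
    "dct:creator": {"@id": "https://example.org/org/environment-agency"},
    # Attached to the dataset for brevity; DCAT descriptions commonly
    # state the licence on each distribution instead.
    "dct:license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
    # Location in space and time (placeholder identifiers).
    "dct:spatial": {"@id": "https://example.org/place/amsterdam"},
    "dct:temporal": {"@id": "https://example.org/period/2016"},
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "https://example.org/files/air-quality-2016.csv"},
            "dcat:mediaType": "text/csv",
        }
    ],
}

# Properties treated as the shared, high-level layer in this sketch (an assumption).
GENERAL_PROPERTIES = ("dct:title", "dct:creator", "dct:license")


def missing_general_properties(description):
    """Return the general-purpose properties that the description lacks."""
    return [prop for prop in GENERAL_PROPERTIES if prop not in description]


if __name__ == "__main__":
    print(json.dumps(dataset, indent=2))
    print("Missing:", missing_general_properties(dataset))
```

Nothing in this sketch covers versioning, relationships between datasets, API descriptions or sample data, which are the gaps the speakers reported.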
An important distinction needs to be made between data discovery, for which the very general approach taken by schema.org is appropriate, and dataset evaluation. In the latter case, an individual needs to know details such as a dataset's provenance and structure before they can evaluate its suitability for a given task.
A descriptive vocabulary is only a beginning. In order to achieve real-world interoperability, the way a vocabulary is used must also be specified. Application Profiles define cardinality constraints, enumerated lists of allowed values for given properties and so on, and it is the use of these that allows data to be validated and shared with confidence. Several speakers talked about their validation tools and it’s clear that a variety of techniques are used. Validation techniques as such were out of scope for the workshop, although there were many references to the emerging SHACL [L3] standard. Definitely in scope, however, was how clients and servers might exchange data according to a declared profile; that is, how to apply the concept of content negotiation not just to content type (CSV, JSON, RDF or whatever) but also to the profile used. Demand for this kind of functionality has been voiced for many years, and proposals were made to meet that demand in future standardization work.
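To make the idea of an application profile concrete, here is a minimal sketch in plain Python of checking a description against cardinality constraints and an enumerated list of allowed values. It is not SHACL and does not reproduce any tool presented at the workshop; the `PROFILE` structure, the `validate` function and the choice of constraints are assumptions made for illustration.

```python
# Hypothetical application profile: for each property, a minimum and maximum
# number of values and, optionally, an enumerated set of allowed values.
PROFILE = {
    "dct:title": {"min": 1, "max": 1},
    "dct:creator": {"min": 1, "max": None},  # None means no upper bound
    "dct:license": {
        "min": 1,
        "max": 1,
        "allowed": {
            "https://creativecommons.org/licenses/by/4.0/",
            "https://creativecommons.org/publicdomain/zero/1.0/",
        },
    },
}


def as_list(value):
    """Normalise a missing, single or repeated value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def validate(description, profile):
    """Return a list of violations; an empty list means the description conforms."""
    problems = []
    for prop, rule in profile.items():
        values = as_list(description.get(prop))
        count = len(values)
        if count < rule["min"]:
            problems.append(
                f"{prop}: expected at least {rule['min']} value(s), found {count}"
            )
        if rule.get("max") is not None and count > rule["max"]:
            problems.append(
                f"{prop}: expected at most {rule['max']} value(s), found {count}"
            )
        allowed = rule.get("allowed")
        if allowed is not None:
            for value in values:
                if value not in allowed:
                    problems.append(f"{prop}: value {value!r} is not in the allowed list")
    return problems


if __name__ == "__main__":
    candidate = {
        "dct:title": ["Air quality 2016", "Luchtkwaliteit 2016"],  # too many titles
        "dct:license": "https://example.org/custom-licence",       # not enumerated
    }
    for problem in validate(candidate, PROFILE):
        print(problem)
```

Two publishers can use the same vocabulary and still fail each other's profiles, which is why declaring the profile in use matters for exchange.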
The workshop concluded that a new W3C Working Group should be formed to:
- revise and extend DCAT;
- provide guidance and exemplars of its use;
- standardize, or support the standardization elsewhere of, content negotiation by profile.
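Content negotiation by profile, the last item above, had no agreed mechanism at the time of the workshop. The sketch below shows one way a server might choose a representation from both a media type preference and a profile preference. The `Accept-Profile`-style header value, the profile URIs and the `negotiate` function are assumptions made for illustration, not a description of any specification.

```python
# Representations a hypothetical server can offer, keyed by
# (media type, profile URI). All URIs and file names are invented.
AVAILABLE = {
    ("application/ld+json", "https://example.org/profile/dcat-basic"): "dataset.basic.jsonld",
    ("application/ld+json", "https://example.org/profile/dcat-extended"): "dataset.extended.jsonld",
    ("text/csv", "https://example.org/profile/dcat-basic"): "dataset.basic.csv",
}


def parse_preferences(header):
    """Split a comma-separated header value into items ordered by descending q weight.

    Simplification: assumes the values themselves contain no commas or semicolons.
    """
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        value = parts[0].strip("<>")  # profile URIs may be written as <uri>
        if not value:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, raw = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = 0.0
        if q > 0:
            # Sort key: higher q first, then the order given by the client.
            prefs.append((-q, position, value))
    return [value for _, _, value in sorted(prefs)]


def negotiate(accept, accept_profile, available=AVAILABLE):
    """Return (media type, profile, document) for the best match, or None."""
    for profile in parse_preferences(accept_profile):
        for media_type in parse_preferences(accept):
            document = available.get((media_type, profile))
            if document is not None:
                return media_type, profile, document
    return None


if __name__ == "__main__":
    print(negotiate(
        "text/csv;q=0.5, application/ld+json",
        "<https://example.org/profile/dcat-extended>, "
        "<https://example.org/profile/dcat-basic>;q=0.8",
    ))
```

This sketch ranks the profile above the media type and ignores wildcards such as `*/*`; both are design choices that a real specification would have to settle.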
A full report on the event is published by W3C [L4] along with the agenda that links to all papers and slides, and a complete list of attendees. The W3C membership is being consulted on the formation of the new working group, expected to begin its work in May 2017.