Understanding Metadata to Exploit Life Sciences Open Data Datasets

by Paulo Carvalho, Patrik Hitzelberger and Gilles Venturini

Open data (OD) contributes to the spread and publication of life sciences data on the Web. Searching and filtering OD datasets, however, can be challenging since the metadata that accompany the datasets are often incomplete or even non-existent. Even when metadata are present and complete, interpretation can be complicated by their quantity, their variety and the languages in which they are written. We present a visual solution to help users understand existing metadata in order to exploit and reuse OD datasets – in particular, OD life sciences datasets.

The Environmental Research and Innovation Department (ERIN) of the Luxembourg Institute of Science and Technology (LIST) conducts research and activities in environmental science (including biology, ecology and other areas of study) using advanced tools for big data analytics and visualization. ERIN’s e-science unit is currently investigating how open data (OD) datasets containing data pertaining to environmental science and related areas may be reused. Although metadata are essential for OD reuse [1], the metadata that accompany such datasets are often missing, incomplete or of low quality. Furthermore, there is no single common standard for specifying metadata: each data provider chooses how metadata are represented, and may or may not follow a specific standard such as the Dublin Core Metadata Initiative (DCMI), the Data Catalog Vocabulary (DCAT) or the Metadata Object Description Schema (MODS). This lack of consistency makes metadata harder to exploit, especially when several data sources are involved, potentially from different countries and with metadata expressed in different languages.

We propose a visual solution to address these problems, helping users to understand metadata and to search for and find specific datasets. Why a visual solution? Data visualization lets people rapidly assimilate and recognize large amounts of information [2]. The datasets and their metadata are stored in a database that is populated beforehand by an OD dataset downloader script. We assume that this task has been completed successfully: it is not the focus of our work. The metadata-mapper offers an overview of every piece of metadata obtained from the datasets. The information is displayed on three levels:

The first level organises the metadata according to whether or not they are assigned to a search criterion. Search criteria are what allow the user to search for datasets, e.g., by theme, by date of publication, by language, etc.
The second level refers to the type of the metadata value. Each metadata item is a key-value pair, and the values can be of different types: a number, a name, a date, an e-mail address, etc.
Finally, to improve visualization and understanding of all metadata keys, the metadata are organised by name. Several groups exist, each covering a set of letters: a group contains the metadata whose key starts with one of its letters.
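
The three levels described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the metadata-mapper's actual implementation: the function names, the type heuristics (regular expressions and an ISO date format) and the four letter ranges are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict
from datetime import datetime


def value_type(value: str) -> str:
    """Guess a coarse type for a metadata value (hypothetical heuristic)."""
    v = value.strip()
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v):
        return "email"
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", v):
        return "number"
    try:
        # Assumption: dates are written as YYYY-MM-DD.
        datetime.strptime(v, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def letter_group(key: str, groups=("A-F", "G-L", "M-R", "S-Z")) -> str:
    """Return the letter range that the first character of the key falls in."""
    first = key.strip()[:1].upper()
    for group in groups:
        low, high = group.split("-")
        if low <= first <= high:
            return group
    return "other"


def organise(metadata, linked_keys):
    """Arrange (key, value) pairs on three levels.

    Level 1: is the key assigned to a search criterion?
    Level 2: the type of the value.
    Level 3: the letter group of the key.
    """
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for key, value in metadata:
        level1 = "assigned" if key in linked_keys else "unassigned"
        tree[level1][value_type(value)][letter_group(key)].add(key)
    return tree
```

For example, calling `organise([("Language", "English"), ("Contact", "a@b.org")], {"Language"})` places "Language" under assigned / text / G-L and "Contact" under unassigned / email / A-F.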

In addition to providing a better understanding of existing metadata, our metadata-mapper solution permits the user to establish links between metadata and search criteria. This means that specific metadata are used to perform a search based on a precise criterion. For instance, imagine we have a search criterion “Language”. The user can create a link between this criterion and the metadata “Language” and “Country”. The solution will convert this relation into a query similar to: “SELECT * FROM Datasets WHERE Language = ? UNION SELECT * FROM Datasets WHERE Country = ?”.
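
As a minimal sketch of how such a link could be turned into the query quoted above, the following Python code uses the standard `sqlite3` module. The schema is an assumption: it supposes a `Datasets` table in which each linked metadata key is a column. The function names `build_query` and `search` and the `links` dictionary are hypothetical and are not taken from the metadata-mapper itself.

```python
import re
import sqlite3


def build_query(metadata_keys, table="Datasets"):
    """Build one SELECT per linked metadata key and join them with UNION."""
    if not metadata_keys:
        raise ValueError("a criterion must be linked to at least one metadata key")
    # Column and table names cannot be bound as parameters, so only plain
    # identifiers are accepted here to avoid SQL injection.
    for name in (table, *metadata_keys):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return " UNION ".join(
        f"SELECT * FROM {table} WHERE {key} = ?" for key in metadata_keys
    )


def search(conn, links, criterion, value):
    """Run the search for one criterion; the value is bound once per SELECT."""
    keys = links[criterion]
    sql = build_query(keys)
    return conn.execute(sql, [value] * len(keys)).fetchall()
```

With `links = {"Language": ["Language", "Country"]}`, `build_query(links["Language"])` yields the query given in the text, and `search(conn, links, "Language", "French")` returns every dataset whose Language or Country column equals "French". Because UNION removes duplicate rows, a dataset matching on both columns is returned only once.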

Figure 1: Initial state of the metadata-mapper.

Figure 2: The metadata of 543 datasets obtained from http://data.gov.au (search term: "biology").

Figure 2 shows the chart of all the metadata related to the datasets found and downloaded from the http://data.gov.au OD portal after searching for the term “biology”. A total of 543 different datasets were found and downloaded, together with 17,435 metadata items. These 17,435 items use 34 different keys.

If the metadata associated with datasets cannot be understood and used, the datasets themselves are effectively useless. It is vital that we are able to unlock the potential and value of OD datasets. The metadata-mapper is a support tool both for understanding the metadata delivered with datasets and for linking metadata to search criteria. This is only a first step towards making metadata more usable and enabling the reuse of OD datasets. The solution gives the user a global picture of all existing metadata, their meaning, and how they can be used to find datasets, and the visual approach allows large amounts of metadata to be shown at one time. However, the solution still has to be tested with larger numbers of datasets to determine whether the methodology can cope with massive amounts of metadata or whether further modifications are needed.

References:
[1] N. Houssos, B. Jörg, B. Matthews: “A multi-level metadata approach for a Public Sector Information data infrastructure”, In Proc. of the 11th International Conference on Current Research Information Systems, pp. 19-31, 2012.
[2] S. Card, J. D. Mackinlay, B. Shneiderman: “Information visualization”, Human-computer interaction: design issues, solutions, and applications, 181, 2009.

Please contact:
Paulo Carvalho
Luxembourg Institute of Science and Technology