Special Theme

ERCIM News 100
January 2015
Special theme: Scientific Data Sharing and Re-use

Guest editors:
- Costantino Thanos, ISTI-CNR, Italy
- Andreas Rauber, TU Vienna, Austria

This issue in PDF (56 pages)

Back Issues Online

Contents

Scientific Data Sharing and Re-use - Introduction to the Special Theme

by Costantino Thanos and Andreas Rauber

Research data are essential to all scientific endeavours. Openness in the sharing of research results is one of the norms of modern science. The assumption behind this openness is that scientific progress requires results to be shared within the scientific community as early as possible in the discovery process.

Creating the Culture and Technology for a Global Data Infrastructure

by Mark A. Parsons

The Research Data Alliance implements data sharing infrastructure by building social and technical bridges across cultures, scales and technologies.

If Data Sharing is the Answer, What is the Question?

by Christine L. Borgman

Data sharing has become policy enforced by governments, funding agencies, journals, and other stakeholders. Arguments in favor include leveraging investments in research, reducing the need to collect new data, addressing new research questions by reusing or combining extant data, and reproducing research, which would lead to greater accountability, transparency, and less fraud. Arguments against data sharing rarely are expressed in public fora, so popular is the idea. Much of the scholarship on data practices attempts to understand the socio-technical barriers to sharing, with goals to design infrastructures, policies, and cultural interventions that will overcome these barriers.

Enhancing the Value of Research Data in Australia

by Andrew Treloar, Ross Wilkinson, and the ANDS team

Over the last seven years, Australia has had a strong investment in research infrastructure, and data infrastructure is a core part of that investment.

Beyond Data: Process Sharing and Reuse

by Tomasz Miksa and Andreas Rauber

Sharing and reuse of data is just an intermediate step on the way to reproducible computational science. The next step, sharing and reuse of processes that transform data, is enabled by process management plans, which benefit multiple stakeholders at all stages of research.

Open Data – What do Research Communities Really Think about it?

by Marie Sandberg, Rob Baxter, Damien Lecarpentier and Paweł Kamocki

Facilitating open access to research data is a principle endorsed by an increasing number of countries and international organizations, and one of the priorities flagged in the European Commission’s Horizon 2020 funding framework [1][2]. But what do researchers themselves think about it? How do they perceive the increasing demand for open access and what are they doing about it? What problems do they face, and what sort of help are they looking for?

Providing Research Infrastructures with Data Publishing

by Massimiliano Assante, Leonardo Candela, Paolo Manghi, Pasquale Pagano, and Donatella Castelli

The purpose of data publishing is to release research data for others to use. However, its implementation remains an open issue. ‘Science 2.0 Repositories’ (SciRepos) address the publishing requirements arising in Science 2.0 by blurring the distinction between research life-cycle and research publishing. SciRepos interface with the ICT services of research infrastructures to intercept and publish research products while providing researchers with social networking tools for discovery, notification, sharing, discussion, and assessment of research products.

Sailing Towards Open Marine Data: the RITMARE Data Policy

by Anna Basoni, Stefano Menegon and Alessandro Sarretta

A thorough understanding of marine and ocean phenomena calls for synergic multidisciplinary data provision. Unfortunately, much scientific data is still kept in drawers, and in many cases scientists and stakeholders are unaware of its existence. At the same time, researchers lament the time-consuming nature of data collection and delivery. To overcome barriers to data access, the RITMARE project issued a data policy document, an agreement among participants on how to share the data and products either generated by the project activities or derived from previous activities, with the aim of recognizing the effort involved.

RDA: The Importance of Metadata

by Keith G. Jeffery and Rebecca Koskela

RDA is all about helping researchers to use data (including scholarly publications and grey literature used as data). This encompasses data collection, data validation, data management (including preservation/curation), data analysis, data simulation/modelling, data mining, data visualisation and interoperation of data. Metadata are the key to all of these activities because they present to persons, organisations, computer systems and research equipment a representation of the dataset so that the dataset can be acted upon.

RDA: Brokering with Metadata

by Stefano Nativi, Keith G. Jeffery and Rebecca Koskela

RDA is about interoperation for dataset re-use. Datasets exist over many nodes. Those described by metadata can be discovered; those cited by publications or datasets have navigational information. Consequently, two major forms of access request exist: (1) download of complete datasets based on citation or (query over) metadata and (2) retrieval of relevant parts of dataset instances by querying across datasets.

Asking the Right Questions - Query-Based Data Citation to Precisely Identify Subsets of Data

by Stefan Pröll and Andreas Rauber

Data underpins most scientific endeavours. However, the question of how to enable scalable and precise citation of arbitrary subsets of static and specifically dynamic data still constitutes a non-trivial challenge.

Capturing the Experimental Context via Research Objects

by Catherine Jones, Brian Matthews and Antony Wilson

Data publication and sharing are becoming accepted parts of the data ecosystem to support research, and this is becoming recognised in the field of ‘facilities science’. We define facilities science as that undertaken at large-scale scientific facilities, in particular neutron and synchrotron x-ray sources, although similar characteristics can also apply to large telescopes, particle physics institutes, environmental monitoring centres and satellite observation platforms. In facilities science, a centrally managed set of specialized and high-value scientific instruments is made accessible to users to run experiments which require the particular characteristics of those instruments.

Engineering the Lifecycle of Data Sharing Agreements

by Mirko Manea and Marinella Petrocchi

Sharing data among groups of organizations and/or individuals is essential in a modern web-based society, being at the very core of scientific and business transactions. Data sharing, however, poses several problems including trust, privacy, data misuse and/or abuse, and uncontrolled propagation of data. We describe an approach to preserving privacy while sharing data, based on scientific Data Sharing Agreements (DSA).

Cross-disciplinary Data Sharing and Reuse via gCube

by Leonardo Candela and Pasquale Pagano

Data sharing has been an emerging topic since the 1980s. Science evolution – e.g. data-intensive, open science, science 2.0 – is revamping this discussion and calling for data infrastructures capable of properly managing data sharing and promoting extensive reuse. ‘gCube’, a software system that promotes the development of data infrastructures, boasts the distinguishing feature of providing its users with Virtual Research Environments where data sharing and reuse actually happen.

Toward Automatic Data Curation for Open Data

by Thilo Stadelmann, Mark Cieliebak and Kurt Stockinger

In recent years, large amounts of data have been made publicly available: literally thousands of open data sources exist, with genome data, temperature measurements, stock market prices, population and income statistics etc. However, accessing and combining data from different data sources is both non-trivial and very time-consuming. These tasks typically take up to 80% of the time of data scientists. Automatic integration and curation of open data can facilitate this process.

An Interactive Tool for Transparent Data Preprocessing

by Olivier Parisot and Thomas Tamisier

We propose a visual tool to assist data scientists in data preprocessing. The tool interactively shows the transformation impacts and information loss, while keeping track of the applied preprocessing tasks.

e-Infrastructure across Photon and Neutron Sources

by Juan Bicarregui and Brian Matthews

Today’s scientific research is conducted not just by single experiments but rather by sequences of related experiments or projects linked by a common theme that lead to a greater understanding of the structure, properties and behaviour of the physical world. This is particularly true of research carried out on large-scale facilities such as neutron and photon sources where there is a growing need for a comprehensive data infrastructure across these facilities to enhance the productivity of their science.

Understanding Open Data CSV File Structures for Reuse

by Paulo Carvalho, Patrik Hitzelberger and Gilles Venturini

Open Data (OD) is one of the most active movements contributing to the spread of information over the web. However, there is no common standard to publish datasets. Data is made available by different kinds of entities (private and public), in various formats and according to different cost models. Even if the information is accessible, it does not mean it can be reused. Before being able to use it, an aspiring user must have knowledge about its structure, location of meaningful fields and other variables. Information visualization can help the user to understand the structure of OD datasets.

How Openness can Change Scientific Practice

by Robert Viseur and Nicolas Devos

The term ‘open data’ refers to “information that has been made technically and legally available for reuse”. Open data is currently a hot topic for a number of reasons, namely: the move of the scientific community towards reproducible research and sharing of experimental data; the enthusiasm, especially within the scientific community, for the semantic web and linked data; the publication of datasets in the public sector (e.g., geographical information); and the emergence of online communities (e.g., OpenStreetMap). The open data movement engages the public sector, as well as business and academia. The motivation for opening data, however, varies among interest groups.