by Stefan Pröll and Andreas Rauber
Data underpins most scientific endeavours. However, the question of how to enable scalable and precise citation of arbitrary subsets of static and specifically dynamic data still constitutes a non-trivial challenge.
Although data has been the source of new knowledge for centuries, it has never received the same attention as the publications about the derived discoveries. Only recently has it been recognized as a first-class citizen in science, earning equal merit (see Link JDDCP below). Beyond mere referencing for the purpose of acknowledging the creators, it is the underlying basis and evidence for many scientific discoveries. With the increasing focus on repeatability and verifiability in the experimental sciences, providing access to the underlying data is becoming essential.
Data used to be condensed into human-readable form by aggregating source data into tables and graphs. Alternatively, the specific subset and version of data used in a study was deposited in a repository for later re-use. With the arrival of data-driven science [1], the increasing amount of data processed and the increasing dynamics of data, these conventional approaches are no longer scalable.
Research datasets can be huge in terms of contained records. Scientists are often interested in a particular view of their data, using subsets tailored to a specific research question. An experiment can only be reproduced and verified if the same subset can be retrieved later. Depositing the specific subset in a repository does not scale to big data settings. Providing the metadata that helps users to find, interpret and access specific datasets again can also be a challenge. Textual descriptions of subsets are rarely precise enough: re-creating the dataset requires human intervention and interpretation of ambiguous descriptions, which limits the reproducibility of experiments and re-use in meta-studies.
Furthermore, many research datasets are highly dynamic: new data is added continuously via data streams, rendering conventional versioning approaches useless unless one wants to revert to old-fashioned time-delayed batch releases of annual or quarterly versions of the data. Additional dynamics arise from the need to correct errors in the data, remove erroneous data values, or re-calibrate and thus re-compute values at later points in time. Researchers therefore require a mechanism to retrieve a specific state of the data again, in order to compare the results of previous iterations of an experiment. Storing each individual modification as a new version in a data repository does not scale for large data sources with high change frequencies.
Thus we need to devise new ways of precisely identifying specific subsets of data in potentially highly dynamic settings that do not require human intervention to interpret and assist with identifying the data as used in a specific study. The method needs to be scalable to high-volume data settings, be machine-actionable and resilient to technological changes in order to support long-term access to data. Figure 1 shows the requirements for data citation. This will support reproduction and verification of scientific experiments as well as re-use in follow-up or meta-studies, while providing citability for giving credit to the creators.
Figure 1: Requirements for data citation.
To meet these requirements we introduced a query-centric view on data citation based on the tools already used by scientists during the subset creation process [2]. Our approach is based on versioned and timestamped data, where all inserts, updates and deletes of records within a dataset are recorded. Thus any state of the data source can be retrieved later. Such timestamping and versioning is already in place in many data centres dealing with dynamic data.
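The versioning idea can be illustrated with a minimal sketch. It is not the authors' implementation: the table name `readings`, the columns `valid_from` and `valid_to`, the helper functions and the use of integers as logical timestamps are all assumptions made for illustration. Each record version carries a validity interval, updates and deletes close the current version instead of overwriting it, and any earlier state is retrieved by filtering on a timestamp.

```python
import sqlite3


def open_store():
    """Create an in-memory table in which every record version has a validity interval."""
    con = sqlite3.connect(":memory:")
    con.execute(
        """CREATE TABLE readings (
               id INTEGER NOT NULL,
               value REAL NOT NULL,
               valid_from INTEGER NOT NULL,  -- logical timestamp of the insert
               valid_to INTEGER              -- NULL while this version is current
           )"""
    )
    return con


def insert(con, rec_id, value, ts):
    """Record a new version that becomes valid at ts."""
    con.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (rec_id, value, ts))


def delete(con, rec_id, ts):
    """Close the current version at ts; the row itself is never removed."""
    con.execute(
        "UPDATE readings SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (ts, rec_id),
    )


def update(con, rec_id, value, ts):
    """An update closes the current version and opens a new one at ts."""
    delete(con, rec_id, ts)
    insert(con, rec_id, value, ts)


def state_at(con, ts):
    """Return the records exactly as they were visible at timestamp ts."""
    return con.execute(
        "SELECT id, value FROM readings "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY id",
        (ts, ts),
    ).fetchall()


con = open_store()
insert(con, 1, 20.5, ts=10)
insert(con, 2, 21.0, ts=20)
update(con, 1, 19.8, ts=30)   # later correction of record 1
delete(con, 2, ts=40)         # erroneous record 2 removed
print(state_at(con, 25))      # [(1, 20.5), (2, 21.0)]
print(state_at(con, 45))      # [(1, 19.8)]
```

Because no version is ever physically overwritten, the state at timestamp 25 remains retrievable after the later correction and deletion.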
To support subset identification we record the query which is used to create the subset by selecting only specific records from a data source. We trace all selection, filtering and sorting operations in an automated fashion and store these parameters together with the execution time of the query. By re-executing the timestamped query on the timestamped and versioned data, the exact same subset can be retrieved at a later point in time. By assigning a persistent identifier to each query, every dataset can be uniquely referenced.
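A query store of this kind could be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the tables `readings` and `query_store`, the functions `cite_query` and `resolve`, the integer timestamps and the hash-derived `pid:` identifier are all hypothetical, and a production system would use a proper persistent identifier service.

```python
import hashlib
import json
import sqlite3

# Temporal predicate added to every citable query (assumed column names).
AS_OF = "valid_from <= :as_of AND (valid_to IS NULL OR valid_to > :as_of)"


def open_store():
    """Create a versioned data table and a table that stores cited queries."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE readings (
            id INTEGER, station TEXT, value REAL,
            valid_from INTEGER NOT NULL, valid_to INTEGER);
        CREATE TABLE query_store (
            pid TEXT PRIMARY KEY, sql TEXT NOT NULL,
            params TEXT NOT NULL, executed_at INTEGER NOT NULL);
        """
    )
    return con


def cite_query(con, sql, params, executed_at):
    """Store the query text, its parameters and its execution time; return an identifier."""
    params_json = json.dumps(params, sort_keys=True)
    payload = json.dumps([sql, params_json, executed_at])
    # Hypothetical identifier scheme: a hash prefix stands in for a real PID.
    pid = "pid:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    con.execute(
        "INSERT OR IGNORE INTO query_store VALUES (?, ?, ?, ?)",
        (pid, sql, params_json, executed_at),
    )
    return pid


def resolve(con, pid):
    """Re-execute the stored query against the data as of its original execution time."""
    row = con.execute(
        "SELECT sql, params, executed_at FROM query_store WHERE pid = ?", (pid,)
    ).fetchone()
    if row is None:
        raise KeyError(pid)
    sql, params_json, executed_at = row
    bound = {**json.loads(params_json), "as_of": executed_at}
    return con.execute(sql, bound).fetchall()


con = open_store()
con.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", 20.5, 10, 30),    # version superseded at t=30
        (1, "A", 19.8, 30, None),  # corrected value, current
        (2, "A", 21.0, 20, None),
        (3, "B", 18.2, 20, None),
    ],
)
sql = f"SELECT id, value FROM readings WHERE station = :station AND {AS_OF} ORDER BY id"
pid = cite_query(con, sql, {"station": "A"}, executed_at=25)
print(resolve(con, pid))  # [(1, 20.5), (2, 21.0)]
```

Although record 1 was corrected at timestamp 30, resolving the identifier returns the subset as it was at timestamp 25, because the stored execution time is bound to the temporal predicate on every re-execution.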
This approach not only allows retrieval of the original dataset, but also identifies which steps have been performed in order to create a dataset in the first place. It thus implicitly provides valuable provenance information on the dataset. It furthermore allows re-execution of the query against the current timestamp, re-creating an earlier dataset while incorporating all corrections made or new data added since the original study, whilst still satisfying the original selection criteria. This also supports analysis of any differences in information available between the original study and a later re-evaluation - features that cannot be easily achieved with conventional deposit-based approaches.
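The comparison between the original subset and a later re-evaluation can be sketched in a few lines. The function name `diff_subsets` and the representation of a result set as (record id, value) pairs are assumptions made for illustration only.

```python
def diff_subsets(original, current):
    """Compare two result sets given as iterables of (record_id, value) pairs.

    `original` is the subset as of the first execution time; `current` is the
    result of re-executing the same query against the current timestamp.
    """
    old, new = dict(original), dict(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


original = [(1, 20.5), (2, 21.0)]   # subset used in the original study
current = [(1, 19.8), (3, 18.2)]    # same query, re-executed later
print(diff_subsets(original, current))
# {'added': [3], 'removed': [2], 'changed': [1]}
```

Here record 1 was corrected, record 2 was removed and record 3 was added after the original study, while both result sets satisfy the same selection criteria.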
We first analyzed the requirements for research data stores capable of dynamic data citation in the context of the APARSEN project. We implemented a reference framework in the TIMBUS Project [3], where our approach was used for describing datasets used in reports for critical infrastructure. In the SCAPE Project, we adapted an existing query interface to store the user input and provide citation texts which could be directly used in publications. These concepts are further elaborated in the Data Citation Working Group of the Research Data Alliance (RDA). In focused workshops and pilots the approach is now being validated in diverse settings for different types of data, ranging from structured SQL via XML and linked data to comma-separated value files.
Links:
JDDCP - Joint Declaration of Data Citation Principles: https://www.force11.org/datacitation
APARSEN Project: http://www.alliancepermanentaccess.org/index.php/aparsen/
TIMBUS Project: http://timbusproject.net/
SCAPE Project: http://www.scape-project.eu/
References:
[1] T. Hey, S. Tansley, K. Tolle, eds.: “The Fourth Paradigm: Data-intensive Scientific Discovery”, Microsoft Research 2009.
[2] S. Proell, A. Rauber: “Citable by Design - A Model for Making Data in Dynamic Environments Citable”, in DATA 2013, 2013.
[3] S. Proell, A. Rauber: “Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation”, in IEEE BigData 2013, 2013.
Please contact:
Stefan Proell, SBA Research, Austria
Tel: +43150536881803
E-mail: