by George Papastefanatos and Yannis Stavrakas
The recent development of Linked Open Data (LOD) technologies has enabled large-scale exploitation of previously isolated public, scientific or enterprise data silos. Given the wide availability and value of these knowledge bases, a fundamental issue arises regarding their long-term accessibility: how do we record their evolution and how do we preserve them for future use? Until now, traditional preservation techniques have kept information in fixed data sets, “pickled” and “locked away” for future use. Given the complexity, interlinking and dynamic nature of current data, especially Linked Open Data, radically new methods are needed.
In this respect, several challenges arise when preserving Linked Open Data:
- How can we monitor changes in third-party LOD datasets released in the past (the evolution tracking problem), and how can ongoing data analysis processes consider newly released versions (the change synchronization problem)?
- How can we understand the evolution of LOD datasets with respect to the real world entities they describe (the provenance problem), and how can we repair various data imperfections, e.g., granularity inconsistencies (the curation problem)?
- How can we assess the quality of harvested LOD datasets in order to decide which and how many versions of them deserve to be further preserved (the appraisal problem)?
- How can we cite a particular revision of a LOD dataset (the citation problem), and how can we later retrieve that revision when looking up a reference, in the form in which we saw it rather than in its most recently available version (the archiving problem)?
- How can we distribute preservation costs to ensure long-term access even when the initial motivation for publishing has changed (the sustainability problem)?
The DIACHRON project aims to tackle these problems with an innovative approach that integrates the preservation processes into the traditional production-processing-consumption lifecycle of LOD. LOD should be preserved by keeping them constantly accessible and integrated into a larger framework of open, evolving data on the Web. This approach calls for effective and efficient techniques to manage the full lifecycle of LOD. It requires enriching LOD with temporal and provenance annotations, which are produced while tracking LOD re-use in complex value-making chains. According to this vision, both the data and metadata become diachronic, and the need for third-party preservation (e.g., by memory institutions) is greatly reduced. We expect that this paradigm will contribute towards a self-preserving Data Web or Data Intranets.
To this end, DIACHRON’s main artefact will be a platform for diachronic linked data. The platform is not intended to replace existing standards and tools, but rather to complement, integrate, and co-exist with them, as shown in Figure 1. Notably, we foresee four groups of services for long-term LOD accessibility and usability: acquisition, annotation, evolution, and archiving services.
The acquisition module is responsible for harvesting LOD datasets published on the Data Web and assessing their quality with regard to critical dimensions such as accuracy, completeness, temporal consistency or coverage. It includes services for:
- Ranking LOD datasets according to various quality dimensions.
- Crawling datasets on the Web or Intranets based on their quality criteria.
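As a rough illustration of quality-based ranking and selection, the sketch below combines per-dimension scores into a weighted sum. It is not DIACHRON's actual method: the weights, the `DatasetQuality` type and the function names are hypothetical, and each score is assumed to be already normalised to the range 0 to 1.

```python
from dataclasses import dataclass

# Illustrative weights only; the dimension names follow the text,
# the numeric values are assumptions made for this sketch.
WEIGHTS = {
    "accuracy": 0.4,
    "completeness": 0.3,
    "temporal_consistency": 0.2,
    "coverage": 0.1,
}


@dataclass(frozen=True)
class DatasetQuality:
    """Hypothetical record: a dataset URI plus its per-dimension scores."""
    uri: str
    scores: dict  # dimension name -> score in [0, 1]


def overall_score(quality, weights=WEIGHTS):
    """Weighted sum of the scores; a missing dimension counts as 0."""
    return sum(w * quality.scores.get(dim, 0.0) for dim, w in weights.items())


def rank(datasets, weights=WEIGHTS):
    """Return the datasets ordered from highest to lowest overall score."""
    return sorted(datasets, key=lambda q: overall_score(q, weights), reverse=True)


def select_for_harvest(datasets, threshold, weights=WEIGHTS):
    """Keep only the datasets whose overall score reaches the threshold."""
    return [q for q in rank(datasets, weights)
            if overall_score(q, weights) >= threshold]
```

A weighted sum is only one possible aggregation; a real appraisal policy might instead require a minimum score on every dimension.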
The annotation module is responsible for enriching LOD with superimposed information regarding temporal validity and provenance of the acquired datasets. It consists of services for:
- Diachronic citations based on persistent URIs of LOD datasets, i.e., references to data and their metadata that do not “break” when those data are modified or removed over time.
- Temporal and provenance annotations. Given that LOD datasets change without any notification while they get freely replicated on the Data Web, understanding where a piece of data (or metadata) came from and why and how it has obtained its current form is also crucial for appraisal.
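One way to picture a citation that does not break is to pair a persistent URI with the date on which the data was cited, and to resolve that pair to the version that was current at the time. The sketch below makes that assumption; `Version`, `CitationResolver` and their fields are hypothetical names, and the provenance annotation is reduced to a single `derived_from` string.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Version:
    """Hypothetical annotated version of a dataset."""
    released: date       # temporal annotation: when this version appeared
    derived_from: str    # provenance annotation: where it came from
    location: str        # where the archived copy can be fetched


class CitationResolver:
    """Maps (persistent URI, citation date) to the version seen at that date."""

    def __init__(self):
        self._versions = {}  # persistent URI -> versions sorted by release date

    def register(self, uri, version):
        versions = self._versions.setdefault(uri, [])
        versions.append(version)
        versions.sort(key=lambda v: v.released)

    def resolve(self, uri, cited_on):
        """Return the latest version released on or before `cited_on`."""
        versions = self._versions.get(uri, [])
        idx = bisect_right([v.released for v in versions], cited_on)
        if idx == 0:
            raise LookupError(f"no version of {uri} existed on {cited_on}")
        return versions[idx - 1]
```

Because the lookup goes through the resolver rather than the live dataset, a later modification or removal of the data does not change what an older citation returns.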
The evolution module is responsible for detecting, managing and propagating changes in LOD datasets monitored on the Data Web. It provides services for:
- Cleaning and repairing LOD datasets. DIACHRON intends to deal with LOD inconsistencies arising from evolving information (e.g., changes in scientific knowledge), revisions to their intended usage or simply errors introduced by data replication between repositories.
- Change recognition and propagation by monitoring and comparing snapshots of LOD datasets. DIACHRON will pay particular attention to the LOD change language used to produce deltas that can be interpreted both by humans and machines.
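The simplest form of change recognition between two snapshots is a set difference over triples, sketched below. This is a low-level delta only, not the DIACHRON change language: triples are assumed to be plain `(subject, predicate, object)` string tuples, blank nodes are not handled, and the function names are hypothetical.

```python
def diff(old_snapshot, new_snapshot):
    """Compute a low-level delta: triples added and triples deleted."""
    old, new = frozenset(old_snapshot), frozenset(new_snapshot)
    return {"added": new - old, "deleted": old - new}


def apply_delta(snapshot, delta):
    """Propagate a delta to a replica holding the old snapshot."""
    return (frozenset(snapshot) - delta["deleted"]) | delta["added"]


def describe(delta):
    """Render the delta as sorted, human-readable lines."""
    lines = [f"- {s} {p} {o}" for s, p, o in sorted(delta["deleted"])]
    lines += [f"+ {s} {p} {o}" for s, p, o in sorted(delta["added"])]
    return lines
```

A higher-level change language would group such triple-level additions and deletions into meaningful operations, for example reporting one renamed label rather than one deletion plus one addition.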
The archiving module is responsible for storing and accessing multiple versions of annotated LOD datasets as presented in the previous modules and services. It comprises services for:
- Multi-version archiving of LOD datasets that is amenable to compression of inherently redundant information, as well as to querying of the evolution history of LOD. The archived data will be replicated across several nodes in order to enable community-based preservation of LOD.
- Longitudinal querying featuring complex conditions on the recorded provenance and change information of archived LOD datasets.
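To make the idea of compressing redundant information concrete, the sketch below stores each version as a delta against its predecessor instead of as a full copy, and answers a basic longitudinal query from those deltas. It is one possible storage strategy under simplifying assumptions (in-memory, a single node, triples as string tuples), and `DeltaArchive` with its methods is a hypothetical name.

```python
class DeltaArchive:
    """Stores versions as (label, added, deleted) deltas from an empty start."""

    def __init__(self):
        self._deltas = []  # one entry per committed version, in order

    def commit(self, label, snapshot):
        """Archive a new version by recording only what changed."""
        snapshot = frozenset(snapshot)
        previous = self.checkout(len(self._deltas))
        self._deltas.append((label, snapshot - previous, previous - snapshot))

    def checkout(self, n):
        """Rebuild the dataset as it was after the first n commits."""
        state = set()
        for _, added, deleted in self._deltas[:n]:
            state -= deleted
            state |= added
        return frozenset(state)

    def history(self, subject):
        """Longitudinal query: every recorded change that touches `subject`."""
        events = []
        for label, added, deleted in self._deltas:
            events += [(label, "deleted", t) for t in sorted(deleted) if t[0] == subject]
            events += [(label, "added", t) for t in sorted(added) if t[0] == subject]
        return events
```

Storing deltas saves space when consecutive versions overlap heavily, at the cost of replaying changes on checkout; a real archive might add periodic full snapshots to bound that cost.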
The results of DIACHRON will be evaluated in three large-scale use cases focusing on open governmental data, enterprise data and scientific data ecosystems. DIACHRON is an FP7 IP project that started in April 2013 and will run for 36 months. The consortium comprises academic institutions (Institute for the Management of Information Systems/"ATHENA" Research Center, Greece; FORTH, Greece; University of Bonn, Germany; and University of Edinburgh, UK), companies (INTRASOFT, Belgium; DATA PUBLICA, France; DATA MARKET, Iceland; HANZO ARCHIVES, UK; BROX IT SOLUTIONS, Germany) as well as user communities (European Bioinformatics Institute, EMBL, UK).
S. Auer et al.: "Diachronic linked data: towards long-term preservation of structured interrelated information", in Proc. of WOD '12, Nantes, France, 2012, dx.doi.org/10.1145/2422604.2422610
Athena Research Centre, Greece