by Laurent Romary
Owing to the growing interest in digital methods within the humanities, an understanding of the tenets of digitally based scholarship and the nature of specific data management issues in the humanities is required. To this end the ESFRI roadmap on European infrastructures has been seminal in identifying the need for a coordinating e-infrastructure in the humanities - DARIAH - whose data policy is outlined in this paper.
Scholarly data in the humanities is a heterogeneous notion. Data creation, i.e. the transcription of a primary document, the annotation of existing sources or the compilation of observations across collections of objects, is inherent to scholarly activity and is thus strongly dependent on the researcher's actual hypotheses or theoretical background. There is little notion of a data centre in the humanities, since data production and enrichment are anchored in the individuals performing the research.
DARIAH’s goal is to create a sound and solid infrastructure to ensure the long-term stability of digital assets, as well as the development of a wide range of thus-far unanticipated services to carry out research on these assets. This comprises technical (identification, preservation), editorial (curation, standards) and sociological (openness, scholarly recognition) aspects.
This vision is underpinned by the notion of digital surrogates: information structures intended to identify, document or represent a primary source used in a scholarly work. Surrogates can be metadata records, a scanned image of a document, digital photographs, a transcription of a textual source, or any kind of extract or transformation (e.g. the spectral analysis of a recorded speech signal) of existing data. A surrogate acts as a stable reference for further scholarly work, replacing or complementing the original physical source that it represents or describes. Moreover, a surrogate can act as a primary source for the creation of further surrogates, thus forming a network that reflects the various steps of the scholarly workflow, where sources are combined and enriched before being disseminated to a wider community.
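The network of surrogates described above can be pictured as a small graph in which each surrogate records the surrogates it was derived from. The following Python sketch is purely illustrative: the class name `Surrogate`, its fields and the `example:` identifiers are assumptions made for this example and do not describe any actual DARIAH data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Surrogate:
    """Hypothetical model of a digital surrogate (illustration only)."""

    identifier: str  # assumed identifier scheme, not a real one
    kind: str        # e.g. "image", "transcription", "annotation"
    derived_from: list[Surrogate] = field(default_factory=list)

    def lineage(self) -> list[str]:
        """Identifiers of all surrogates this one builds on, nearest first."""
        seen: list[str] = []
        queue = list(self.derived_from)
        while queue:
            current = queue.pop(0)
            if current.identifier not in seen:
                seen.append(current.identifier)
                queue.extend(current.derived_from)
        return seen


# A scanned page, its transcription, and an annotation of the transcription.
scan = Surrogate("example:scan-1", "image")
transcription = Surrogate("example:tei-1", "transcription", [scan])
annotation = Surrogate("example:annot-1", "annotation", [transcription])

print(annotation.lineage())  # ['example:tei-1', 'example:scan-1']
```

Keeping the derivation links explicit is what would allow a later reader to trace an enriched dataset back to the surrogates, and ultimately the primary source, on which it rests.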
Such a unified data landscape for humanities research necessitates a clear policy on standards and good practices. Scholars should both benefit from strong initiatives such as the Text Encoding Initiative (TEI) and consolidate their own experience by participating in the development of standards, in collaboration with other stakeholders (publishers, cultural heritage institutions, libraries).
The vision also impacts on the technical priorities for DARIAH, namely:
- deploying a repository infrastructure where researchers can deposit their work transparently and with confidence, comprising permanent identification and access, targeted dissemination (private, restricted and public) and rights management, possibly in a semi-centralized way allowing efficiency, reliability and evolution (cf. http://hal.archives-ouvertes.fr/hal-00399881);
- defining standardized interfaces for accessing data through such repositories, but also through third-party data sources, with facilities such as threading, searching, selecting, visualizing and importing data;
- experimenting with the agile development of virtual research spaces based on such services, integrating community-based research workflows (see http://hal.inria.fr/inria-00593677).
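To make the first of these priorities more concrete, the sketch below shows one way a deposit could combine a permanent identifier with the three dissemination levels mentioned above (private, restricted, public). It is a minimal Python illustration under stated assumptions: the `Repository` and `Deposit` classes, the `can_read` rule and the use of a random UUID as a stand-in for a permanent identifier are all hypothetical and are not taken from any existing DARIAH service.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """The three dissemination levels named in the text."""

    PRIVATE = "private"
    RESTRICTED = "restricted"
    PUBLIC = "public"


@dataclass(frozen=True)
class Deposit:
    identifier: str                  # stand-in for a permanent identifier
    owner: str
    access: Access
    allowed: frozenset = frozenset() # users admitted to a restricted deposit


class Repository:
    """Hypothetical in-memory repository (illustration only)."""

    def __init__(self) -> None:
        self._items: dict[str, Deposit] = {}

    def deposit(self, owner: str, access: Access, allowed=()) -> str:
        # Assumption: a random UUID plays the role of the permanent identifier.
        identifier = f"example:{uuid.uuid4()}"
        self._items[identifier] = Deposit(
            identifier, owner, access, frozenset(allowed)
        )
        return identifier

    def can_read(self, identifier: str, user: str) -> bool:
        item = self._items[identifier]
        if item.access is Access.PUBLIC or user == item.owner:
            return True
        return item.access is Access.RESTRICTED and user in item.allowed


repo = Repository()
pid = repo.deposit("alice", Access.RESTRICTED, allowed={"bob"})
print(repo.can_read(pid, "bob"), repo.can_read(pid, "carol"))  # True False
```

A real infrastructure would of course need persistent storage, a proper identifier scheme and explicit rights metadata; the point here is only that identification and access policy travel together with each deposited object.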
Beyond the technical aspects, an adequate licensing policy must be defined to assert the legal conditions under which data assets can be disseminated. This should be a compromise between making all publicly financed scholarly productions available in open access and preventing the adoption of heterogeneous reuse constraints and/or licensing models. We contemplate encouraging the early dissemination of digital assets in the scholarly process and recommend, when applicable, the use of the Creative Commons Attribution (CC-BY) license, which supports systematic attribution (and thus citation) of the source.
From a political point of view, we need to discuss with potential data providers (cultural heritage entities, libraries or even private sector stakeholders such as Google) methods of creating a seamless data landscape where the following issues should be jointly tackled:
- general reuse agreements for scholars, comprising usage in publications, presentation on web sites, integration (or referencing) in digital editions, etc.;
- definition of standardized formats and APIs that could make access to one or the other data provider more transparent;
- identification of scenarios covering the archiving of the original records as well as of the enrichments created by scholars. For example, TEI transcriptions made by scholars could be archived in the library where the primary source is held.
As a whole, DARIAH should contribute to excellence in research by being seminal in the establishment of a broad, coherent and accessible data space for the humanities. Whether acting at the level of standards, education or core IT services, we should keep this vision in mind when setting priorities in areas that will impact the sustainability of the future digital ecology of scholars.
European report on scientific data: cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
Text Encoding Initiative: http://www.tei-c.org