by George Bruseker, Martin Doerr and Chryssoula Bekiari (ICS-FORTH)
In the era of big data, digital humanities faces the ongoing challenge of formulating a long-term and complete strategy for creating and managing interoperable and accessible datasets to support its research aims. Semantics and formal ontologies, properly understood and implemented, provide a powerful potential solution to this problem. A long-term research programme is contributing to a complete methodology and toolset for managing the semantic data lifecycle.
The field of semantics and the use of formal ontologies for representing research data are key areas of research in the digital humanities, aiming to support the sustainable development of interoperable and accessible datasets for use by research communities in the era of big data. The term "semantic data", coined by Berners-Lee, refers to data which is both machine processable and human readable. Formal ontologies provide an explicit and disciplined means of producing such data, to ensure its wide compatibility and clear interpretability.
At the Centre for Cultural Informatics (CCI) of ICS-FORTH, based in Heraklion, Greece, we have been researching a comprehensive solution to the complete lifecycle of semantic data use supported by formal ontologies. Our research is nearly at a point where it can be applied within digital humanities by bringing together the basic methods and tools for the distinct steps of this cycle: data modelling, mapping and transformation, querying and management. In particular, research at the CCI focuses on filling gaps and improving methodologies in these key steps.
Before semantic data can be integrated, an adequate model for the domain must be elaborated. Semantic data modelling typically follows one of two basic strategies: the elaboration of complex, all-inclusive models, or restriction to the modelling of a highly focussed domain. Both produce highly useful models that nevertheless display certain limitations for broad interoperability. Complex models (e.g. INSPIRE) offer powerful, general integration but less detailed integration at the leaf level, whereas more compact models (e.g. FOAF) offer extremely tight integration of data but lack a relation to the general. As a coordinating member of the international collaborative CIDOC CRM Special Interest Group (SIG) [L1], working under the aegis of the International Council of Museums (ICOM), CCI follows a different approach.
Adopting a bottom-up development process that works from actual data structures, the CRM SIG has produced an ontology for cultural heritage and e-sciences which provides the general integrative functions of a base ontology. The base model of CIDOC CRM is currently in the sixth revision of its community version, with the fifth revision standing as the base of the last ISO release in 2014. The long-term success in uptake of this model has laid the foundation for community collaboration with experts from various disciplines to create harmonised extensions including: FRBRoo and PRESSoo for library data; CRMdig, CRMinf, and CRMsci, which collectively provide provenance data in the respective areas of digitisation, argumentation and observation sciences; and CRMarchaeo and CRMba, which support reasoning over archaeological practice. The innovation of this extension development process is the collaborative work with specialist communities to elaborate harmonised extensions to the base model, which enable the representation of special domains of research while maintaining compatibility with the top-level model.
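What "compatibility with the top-level model" can mean in practice may be illustrated with a minimal Python sketch: if extension classes are declared as subclasses of base classes, a query posed at the level of the base model also retrieves data recorded with the extensions. The small hierarchy fragment below is an assumption written from memory for illustration only and should be checked against the published CRM, CRMsci, CRMdig and CRMarchaeo definitions; the instance names are hypothetical.

```python
# Illustrative fragment of a class hierarchy: extension classes (A*, S*, D*)
# specialise base CRM classes (E*). ASSUMPTION: simplified from memory, not an
# authoritative copy of the published definitions.
SUBCLASS_OF = {
    "A1_Excavation_Process_Unit": ["S4_Observation"],   # CRMarchaeo -> CRMsci
    "S4_Observation": ["E13_Attribute_Assignment"],     # CRMsci -> CRM base
    "E13_Attribute_Assignment": ["E7_Activity"],        # CRM base
    "D1_Digital_Object": ["E73_Information_Object"],    # CRMdig -> CRM base
}

# Hypothetical instance data, each typed with its most specific class.
INSTANCES = {
    "trench_7": "A1_Excavation_Process_Unit",
    "scan_42": "D1_Digital_Object",
    "dating_3": "E13_Attribute_Assignment",
}


def superclasses(cls):
    """Return cls together with all of its direct and indirect superclasses."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, []))
    return seen


def instances_of(cls):
    """Return all instances whose class is cls or a specialisation of cls."""
    return sorted(name for name, typed_as in INSTANCES.items()
                  if cls in superclasses(typed_as))


# A query phrased only in base-model terms still finds the excavation record.
print(instances_of("E7_Activity"))
print(instances_of("E73_Information_Object"))
```

The point of the sketch is the design choice rather than the specific classes: specialist data keeps its full detail, yet remains reachable through the general classes of the base model.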
Having a general ontological framework available to express their data, researchers require a means to translate existing data into the common expression. Development work at CCI has created the X3ML Suite, which provides an innovative language, database and data mapping tool for generating completely declarative mappings from any XML data source to any RDFS-encoded ontology. This suite of functionalities allows domain specialists to carry out and track mapping processes to the CRM or other suitable ontologies on their own, without having to rely on the mediation of computer science specialists. Together with a tool for easily viewing and reviewing RDF data (see the article by Minadakis et al. in the section “Research and Innovation” of this issue), this suite provides a platform for managing large-scale semantic mapping processes without restriction to a specific schema.
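The idea of a declarative mapping can be sketched in a few lines of Python. The example below does not use X3ML syntax or the X3ML engine; it is a simplified stand-in in which the mapping is plain data (a dictionary) and a small generic function applies it to an XML source to produce triples. The source record structure, the `BASE` URI and the mapping layout are all hypothetical; only the CIDOC CRM class and property names are taken from the model.

```python
import xml.etree.ElementTree as ET

CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/"  # hypothetical base URI for generated resources

# Hypothetical source data in an ad hoc XML schema.
SOURCE_XML = """
<collection>
  <object id="obj1"><title>Amphora</title><material>clay</material></object>
  <object id="obj2"><title>Kylix</title></object>
</collection>
"""

# The mapping is pure data: which source element becomes which CRM class,
# and which child elements become which CRM properties and target classes.
MAPPING = {
    "domain": {"path": "object", "id_attr": "id", "class": "E22_Man-Made_Object"},
    "links": [
        {"path": "title", "property": "P102_has_title", "range_class": "E35_Title"},
        {"path": "material", "property": "P45_consists_of", "range_class": "E57_Material"},
    ],
}


def apply_mapping(xml_text, mapping):
    """Apply a declarative mapping to an XML document and return triples."""
    root = ET.fromstring(xml_text.strip())
    domain = mapping["domain"]
    triples = []
    for node in root.iter(domain["path"]):
        subject = BASE + node.get(domain["id_attr"])
        triples.append((subject, RDF_TYPE, CRM + domain["class"]))
        for link in mapping["links"]:
            for index, child in enumerate(node.findall(link["path"])):
                # Mint a URI for the target node and describe it.
                target = f"{subject}/{link['path']}/{index}"
                triples.append((subject, CRM + link["property"], target))
                triples.append((target, RDF_TYPE, CRM + link["range_class"]))
                triples.append((target, RDFS_LABEL, (child.text or "").strip()))
    return triples


for triple in apply_mapping(SOURCE_XML, MAPPING):
    print(triple)
```

Because the mapping is data rather than code, it can be stored, compared, reviewed and revised by a domain specialist without touching the transformation logic, which is the property the declarative approach is meant to secure.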
Once data is expressed in a common, but still complex, semantic format, the challenge remains of providing researchers with a tool to query this semantic network without having to learn complex query languages such as SPARQL or the nuances of a large ontology. Methodological work at the CCI has produced a theory of Fundamental Categories and Relations, which describes how to specify generalist queries over a complex model that will bring back relevant results to researchers by providing an intuitive and semantically consistent abstraction over the complexities of the ontology. This methodology has been taken up and developed as a key tool by the ResearchSpace project in its development of an open source platform for semantic data in the cultural heritage and digital humanities domains [L2].
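A rough sketch of the underlying idea, under stated assumptions, is given below in Python: one intuitive relation between two broad categories (here "Thing from Place") stands for several alternative property paths in the ontology, and a query is answered by the union of the results over all of those paths. The particular paths, the toy data and the function names are hypothetical illustrations and are not taken from the technical report or from ResearchSpace.

```python
# Toy triple store using CIDOC CRM property names; the data is hypothetical.
TRIPLES = {
    ("vase1", "P108i_was_produced_by", "production1"),
    ("production1", "P7_took_place_at", "Athens"),
    ("vase2", "P53_has_former_or_current_location", "Athens"),
    ("vase3", "P108i_was_produced_by", "production3"),
    ("production3", "P7_took_place_at", "Corinth"),
}

# ASSUMPTION: an illustrative expansion of one fundamental relation into
# alternative property paths; the real expansions are defined by the method.
FUNDAMENTAL_RELATIONS = {
    ("Thing", "from", "Place"): [
        ["P108i_was_produced_by", "P7_took_place_at"],  # made at the place
        ["P53_has_former_or_current_location"],         # kept at the place
    ],
}


def follow(nodes, prop, triples):
    """Return every node reachable from `nodes` through one `prop` step."""
    return {o for (s, p, o) in triples if p == prop and s in nodes}


def query(relation, target, triples):
    """Return subjects linked to `target` by any path of the relation."""
    hits = set()
    for subject in {s for (s, _, _) in triples}:
        for path in FUNDAMENTAL_RELATIONS[relation]:
            frontier = {subject}
            for prop in path:
                frontier = follow(frontier, prop, triples)
            if target in frontier:
                hits.add(subject)
    return hits


# The researcher asks one simple question; the expansion handles the paths.
print(sorted(query(("Thing", "from", "Place"), "Athens", TRIPLES)))
```

The researcher only has to choose two categories and a relation between them; the knowledge of which ontology paths realise that relation is held once, in the expansion table, rather than in every query.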
Figure 1: User interface of X3ML declarative mapping tool.
Figure 2: UI of ResearchSpace fundamental categories and relations query tool, ©ResearchSpace.
Finally, to ensure the sustainable development and use of semantically encoded resources, a complete strategy for the semantic data lifecycle must be elaborated. CCI presently participates in the Parthenos project [L3], developing a conceptual data model and architecture for long-term semantic data integration and curation that aims to model the integration process itself and thereby support long-term, on-demand integration tasks and their monitoring.
To meet the challenges and take advantage of the benefits of semantic data in digital humanities in the era of big data, a complete strategy and set of tools covering the basic elements of the semantic data lifecycle is essential. With the maturation of a base model, the creation of a declarative mapping tool and language, a generalising query function, and a model and method for managing integration processes, we believe that the key elements for meeting these challenges are now in place.
 ISO: ISO 21127: 2014, Information and documentation – a reference ontology for the interchange of cultural heritage information, 2nd edn., 2014.
 N. Minadakis et al.: “X3ML Framework: An effective suite for supporting data mappings”, in Proceedings of the Workshop on Extending, Mapping and Focusing the CRM, co-located with the 19th International Conference on Theory and Practice of Digital Libraries (2015), Poznań, Poland, September 17, 2015, CEUR Workshop Proceedings 1656, 1-12, 2016.
 K. Tzompanaki & M. Doerr: Fundamental Categories and Relationships for Intuitive querying CIDOC-CRM based repositories, Technical Report, 2012.
Centre for Cultural Informatics