by Manolis Terrovitis (ATHENA RC) and Dominik Ślęzak (University of Warsaw)
Data infrastructures are critical for the efficient handling of large-scale datasets in any domain, and they are required for industrial, commercial, scientific and policy-making purposes. Robust data infrastructures ensure data security and integrity, which is essential for compliance and for maintaining the trust of users and stakeholders. They provide the scalability and flexibility needed to handle the constant influx of new data and changing data processing pipelines. A well-designed data infrastructure ensures that data is properly integrated, standardised, and secure, which is crucial for accurate and reliable analysis.
The last decade has witnessed an explosion in data infrastructures offered by private parties. We can also see bold initiatives by the European Commission to offer data infrastructures for scientific purposes (European Open Science Cloud, OpenAIRE, European Data Infrastructure etc.) and regulatory approaches to create common data spaces in the EU (e.g., Common European Health Data Space). The development of data infrastructures poses multiple scientific, governance and social challenges that have attracted great interest not only from the scientific community but also from regulators. This special issue features contributions addressing these challenges, grouped into the following areas.
Experiences from Existing Data Infrastructures
Candela et al. present virtual research environments (VREs) of the D4Science infrastructure. The VREs are web-based, collaborative enablers of open science, allowing seamless access to datasets and services. Bardi et al. present the ARIADNE infrastructure, which provides archaeological researchers with a suite of tools and services to support different phases of the research life cycle. These include a knowledge base of 3.5 million archaeological resources, virtual research environments for data analysis, data sharing and collaboration, as well as tools for data visualisation. Bardi and Benassi describe how the IPERION HS (heritage science) research infrastructure makes its outputs open and accessible to the community. Its impact is monitored with the help of services of the OpenAIRE infrastructure, such as Zenodo, Connect and Monitor. They let IPERION HS store research resources in a FAIR (findable, accessible, interoperable, reusable) way, make research outputs discoverable, and track the success of its Open Science strategy. Avramo et al. present the data portal of the European Plate Observing System (EPOS), which integrates different data, metadata, software and services into one platform for solid-earth sciences in Europe. The data portal is built around the FAIR principles and is designed to ensure sustainable and universal use and reuse of multidisciplinary solid-earth science data.
Privacy and Security for Research Infrastructures
Salant describes the development of a privacy-aware framework for fine-grained data access that enables secure, policy-driven data exchange with sophisticated access control. Krenn et al. discuss the benefits of joint computations on data from various sources. The authors developed verifiably privacy-preserving protocols based on multi-party computation, which allows parties to jointly evaluate an arbitrary function without revealing anything about the input data. Spanakis et al. describe a secure and trustworthy platform for cross-discipline federation of data using self-sovereign technologies and homomorphic encryption. The article presents the platform’s architecture and highlights its key layers. Albanese et al. present the E-CORRIDOR framework for multi-modal transportation systems. It provides secure services to passengers and transport operators by implementing collaborative privacy-aware edge-enabled information sharing, analysis and protection as a service. The framework is based on the concept of a data sharing agreement, which is a digital contract that defines a set of data sharing constraints. Anciaux and Bouganim propose a three-layer logical architecture for extensive and secure Personal Data Management Systems. It allows individuals to manage and control their personal data while allowing third-party applications to access it. The paper also discusses challenges related to handling large volumes of personal data, protecting the data of a community of users and retrieving third-party data. Carreras et al. present the components of a trusted ecosystem for sharing medical data in a secure and privacy-preserving manner. In particular, the data anonymisation and functional encryption modules are discussed.
Artificial Intelligence and Data Infrastructures
Hemker et al. propose a modularised approach for reducing complexity in the AI life cycle and describing its data-related components. The platform follows the principles of the Unix philosophy, solving problems with small and effective tools while retaining full control over all parts of the process, treating all data as files, and using Data Version Control to run pipeline stages in the correct order and to avoid redundant calculations on unchanged stages. Hoseini and Quix describe SEDAR, a semantic data lake for the integration of heterogeneous datasets and machine learning (ML). SEDAR’s key element is semantic metadata management. The system’s generic ingestion interface can deal with any external data source and incorporates data capture with data versioning and automatic metadata extraction, while ML-related artefacts are embedded into the lake to allow for the coherent development of data preparation and ML pipelines. Ballhausen et al. present a new approach to managing high-quality digital cultural content throughout its life cycle, called curatorial companionship. The approach combines domain knowledge from diverse fields to refine and select digital cultural artefacts, and adopts a structured, iterative process for the creation of art by generative AI. Hummel et al. present a research agenda on network intelligence to support new digitised applications while also being sustainable. The paper discusses novel networked applications, network intelligence methods, and major challenges in the field. Trasarti et al. present a research infrastructure that provides data and facilities to researchers, and services to firms and public administrations to develop tools based on ethics and fairness principles. It encourages interdisciplinary studies and promotes European principles in social analysis, offering an innovative and free platform that combines AI and social issues to perform large-scale social mining experiments within a legal and ethical framework of responsible data science. Renault and Hitzelberger describe the establishment of a high-performance data analytics and AI testing facility in Luxembourg with the aim of supporting digitisation and Industry 4.0 projects. The facility, which is based on research and technology infrastructure, provides a test-before-invest approach that allows companies to determine the worthiness of the technology for their specific business purposes, tailor their AI and data analytics projects, and make informed investment decisions.
Data Management and Governance
Massa et al. propose a data-aware and declarative solution for determining service-based application placements over the Cloud-IoT continuum while meeting functional and non-functional application requirements. The solution considers the characteristics of the data processed by the application, such as security needs, volume, velocity, transmission rate, and sources and targets, and uses a continuous reasoning approach to reduce the size of the placement problem instances at runtime. Vinju introduces the Rascal metaprogramming language as a solution to the need for up-to-date, easy-to-use and easy-to-combine instruments for collecting data about software and software development processes. Maillot et al. present the IndeGx framework, which is designed to create an index of knowledge graphs in the form of linked open datasets and to provide descriptions of them so that humans and machines can understand their content, quality and compliance with standards. Those descriptions are generated by extraction from a SPARQL endpoint and represented in RDF, providing a transparent, declarative, collaborative and extensible framework to be used in various use cases. Stefanidis et al. present a federated data sharing/trading and monetisation platform for the secure, trusted and controlled exchange and usage of proprietary data assets and data-driven intelligence. The platform employs federated data discovery, distributed ledger technologies, data non-fungible tokens and AI-driven data quality assessment to build trust among data providers, data owners and data consumers. Marazakis and Louloudakis highlight the contribution of the RISER project, which aims to develop the first all-European RISC-V cloud server infrastructure, enhancing Europe’s strategic autonomy in open-source technologies. RISER will leverage and validate open hardware high-speed interfaces and a fully featured operating system environment, enabling the integration of low-power components, including RISC-V processor chips, in an energy-efficient cloud architecture.
In conclusion, this special theme highlights the advancements made in the field of data management through the exploration of novel research and technologies. The projects showcased in this issue contribute to the development of effective data management environments and techniques that are necessary for addressing the increasing complexity of the data generated in today’s world. The papers provide valuable insights and promote further research on data management and infrastructure carried out in Europe.
ATHENA RC, Greece
University of Warsaw, Poland