by Zhiming Zhao, Paul Martin (University of Amsterdam) and Keith G. Jeffery (ERCIM)
Environmental Research Infrastructures (RIs) face common challenges regarding data management and how best to support the activities of scientists at all stages of the research data and experimentation lifecycle. The ENVRIPLUS ‘Data for Science’ theme aims to design and develop a suite of standard solutions to those common problems based on the reference model of research infrastructures (ENVRI-RM) and the e-VRE architecture proposed by VRE4EIC.
Environmental science research is increasingly dependent on the collection and analysis of large volumes of data gathered via wide-scale deployments of sensors and other observation sources. To study the development of earthquakes or volcanoes, for example, one needs continuous observation of the surrounding geographic regions and their underlying strata in order to obtain the data necessary to model various seismological processes and their interactions. Depending on the problem scale and geographical focus, these observations can only be provided by sources distributed across different countries, institutions and data centres. Moreover, such research activities often require advanced computing and storage infrastructure in order to analyse, process and model the data, and to perform simulations. Advanced research support environments (i.e., specialised infrastructure to support research) are clearly needed to better enable researchers to access data, software tools and services from different sources, and to integrate them into cohesive experimental investigations with well-defined, replicable workflows for processing data and recording the provenance of results for peer review.
A recent publication identified several kinds of support environment that must be made to work together to support data-centric research:
- computing, storage and network infrastructures, e.g., provided via EGI [L1], EUDAT [L2] and GEANT [L3], also called e-Infrastructures (e-Is);
- services for accessing, searching and processing research data within different scientific domains, called Research Infrastructures (RIs), e.g., ICOS [L4], EPOS [L5] and EURO-ARGO [L6] for the atmospheric, earth and marine sciences respectively; and
- environments for providing user-centred support for discovering and selecting data and software services from different sources, and composing and executing application workflows based on them, called Virtual Research Environments (VREs) [L7] or Science Gateways (SGs).
These different types of supporting environments often overlap with each other, as shown in Figure 1.
Figure 1: Different research support environments and the role they play in ICT activities initiated by user communities.
Figure 2: Metadata superset recommendation in ENVRIPLUS to enable future interface to overarching enhanced virtual research environments (e-VRE).
Within the EU Horizon 2020 project ENVRIPLUS [L8], the ‘Data for Science’ theme investigates and develops interoperable solutions to common problems that environmental RIs face in managing data and supporting the activities of scientists throughout the research data and experimentation lifecycle. It also encompasses work on the development of common data services within a common semantic framework. Problems being addressed include:
- how to identify and cite data from different sites or infrastructures;
- how to control the quality of nearly real-time data from sensors and annotate them;
- how to catalogue the data and to allow users to search and access data from different sites or infrastructures;
- how to support scientists in performing experiments using data, software tools and resources from different remote infrastructures;
- how to effectively manage the infrastructure resources in the scientific experiments and allow scientists to achieve their goals more quickly; and
- how to effectively record the events and results generated during experiments so that scientists can reproduce them independently.
Sharing solutions to those common problems will not only reduce development costs but also promote interoperability between different infrastructures.
Besides being important pillars for user communities in their respective domains, environmental RIs are also intended to support interdisciplinary research as well as contribute directly to cross-domain initiatives such as Copernicus [L9] (contributing to GEOSS [L10]). This requires standard policies, models and e-infrastructure to improve technology reuse and ensure coordination, harmonisation, integration and interoperability of data, applications and other services.
The ‘Data for Science’ theme follows a ‘Reference Model guided’ approach. It builds upon abstract concepts derived from the analysis of common operations of RIs and subsequently defines an architectural reference model for environmental RIs in general. Early results from specific RIs under construction have been carefully reviewed in order to identify good technology candidates for the realisation of the various common services needed, and a number of interactions have been carried out at various levels with computational e-infrastructures (such as EGI), data infrastructures (such as EUDAT), and other initiatives (such as VRE4EIC [L11]) that work on related issues. The e-VRE reference architecture in the VRE4EIC project, for example, is being used to guide the development of interfaces to access data and software resources from ENVRIPLUS RIs. Figure 2 shows the basic idea.
ENVRIPLUS, a four-year project, is approaching the end of its second year. Version 2 of ENVRI-RM is available; it has been used to guide the design of new identification/citation, processing, optimisation, curation and cataloguing services. CERIF and CKAN have been recommended for prototyping a cross-RI catalogue service. The Open Information Linking for Environmental RIs (OIL-E) framework developed in ENVRIPLUS has also been aligned with CERIF in collaboration with the metadata team in VRE4EIC. Furthermore, a recommendation on how to use metadata catalogues as a basis for constructing federated services at VRE level for interacting with individual RIs and underlying e-infrastructure has been produced and introduced to the ENVRIPLUS RI community (see Figure 2).
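The idea of using a metadata catalogue as the basis for federated, VRE-level services can be illustrated with a minimal sketch. It assumes two hypothetical RI catalogues ("RI-A" and "RI-B") with different native record layouts, maps both into a small common record, and runs one keyword search over the merged result. The record fields, mapping functions and sample data are invented for illustration only; they are not the CERIF or OIL-E schema, nor the CKAN API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CommonRecord:
    """Hypothetical common ('superset') metadata record shared by all RIs."""
    identifier: str
    title: str
    source_ri: str
    keywords: List[str] = field(default_factory=list)


def from_ri_a(raw: Dict) -> CommonRecord:
    """Map a record from the hypothetical RI-A layout (tags as a list)."""
    return CommonRecord(
        identifier=raw["pid"],
        title=raw["name"],
        source_ri="RI-A",
        keywords=[tag.lower() for tag in raw.get("tags", [])],
    )


def from_ri_b(raw: Dict) -> CommonRecord:
    """Map a record from the hypothetical RI-B layout (subjects as 'a; b')."""
    subjects = raw.get("subject", "").split(";")
    return CommonRecord(
        identifier=raw["doi"],
        title=raw["title"],
        source_ri="RI-B",
        keywords=[s.strip().lower() for s in subjects if s.strip()],
    )


def federated_search(records: List[CommonRecord], keyword: str) -> List[CommonRecord]:
    """Return every record, from any RI, that carries the given keyword."""
    wanted = keyword.lower()
    return [record for record in records if wanted in record.keywords]


# Invented sample data standing in for two separate RI catalogues.
catalogue_a = [
    {"pid": "a-001", "name": "CO2 flux time series", "tags": ["Carbon", "Atmosphere"]},
]
catalogue_b = [
    {"doi": "b-001", "title": "Seismic waveform archive", "subject": "seismology; solid earth"},
    {"doi": "b-002", "title": "Ocean carbon profiles", "subject": "carbon; marine"},
]

# Harmonise both catalogues into the common form, then search once.
merged = [from_ri_a(r) for r in catalogue_a] + [from_ri_b(r) for r in catalogue_b]
hits = federated_search(merged, "carbon")

for hit in hits:
    print(hit.source_ri, hit.identifier, hit.title)
```

In this sketch the per-RI mapping functions carry all the knowledge about native formats, so the search itself only ever sees the common record. A real cross-RI catalogue would need a much richer model and persistent identifiers, but the division of labour would plausibly be similar.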
Zhiming Zhao, et al.: “Time critical requirements and technical considerations for advanced support environments for data-intensive research”, in Proc. of IT4RIS workshop, Porto, 29 November-2 December 2016.
Mark A. Miller, Wayne Pfeiffer, Terri Schwartz: “The CIPRES science gateway: enabling high-impact science for phylogenetics researchers with limited resources”, in Proc. of XSEDE '12, ACM, New York, NY, USA, 2012.
University of Amsterdam, The Netherlands