E-Infrastructures for Big Data: Opportunities and Challenges

by Kostas Glinos, European Commission, DG Information Society and Media, Head of GEANT and e-Infrastructure Unit

The management of extremely large and growing volumes of data has for many years been a challenge for the large scientific facilities located in Europe, such as CERN or ESA, without clear long-term solutions. The problem will become even more acute as new ESFRI facilities come on-stream in the near future. The advent of “big data science”, however, is not limited to large facilities or to some fields of science. Big data science is emerging as a new paradigm for scientific discovery that reflects the increasing value of observational, experimental and computer-generated data in virtually all domains, from physics to the humanities and social sciences.

The volume of information produced by the “data factories” is a problem for sustainable access and preservation, but it is not the only problem. Diversity of data, formats, metadata, semantics, access rights and associated computing and software tools for simulation and visualization add to the complexity and scale of the challenge.

Big Data and e-Science: challenges and opportunities
ICT empowers science by making possible massive interdisciplinary collaboration between people and computers, on a global scale. The capacity and know-how to compute and simulate, to extract meaning out of vast data quantities and to access scientific resources are central in this new way of co-creating knowledge. Making efficient use of scientific data is a critical issue in this new paradigm and has to be tackled in different dimensions: creation of data, access and preservation for re-use, interoperability to allow cross-disciplinary exploration and efficient computation, intellectual property, etc.

ICT infrastructures for scientific data are increasingly being developed world-wide. However, many barriers still exist across countries and disciplines, making interoperability and sustainability difficult to achieve. To cope with the extremely large or complex datasets generated and used in research, it is essential to take a global approach to interoperability and discoverability of scientific information resources. International cooperation to achieve joint governance, compatible legal frameworks and coordinated funding is also necessary.

Data-intensive science needs to be reproducible and therefore requires that all research inputs and outcomes are made available to researchers. Open access to scholarly papers, trusted and secure access to data resources and associated software codes, and interlinking of resources with publications all support reproducible and verifiable e-science. In some areas the storage and processing of large datasets may have implications for data protection, which need to be investigated together with access to data by the public.

In all fields of science we encounter similar technical problems when using extremely large and heterogeneous datasets. Data may have different structures or may not be well structured at all. Analytical tools to extract meaningful information from the huge amounts of data being produced are lagging behind. Technical problems are often more complex in interdisciplinary research, which is also the research that pays the highest rewards. When the amounts of data to be processed are large, they cannot easily be moved around the network. Novel solutions are therefore needed, and in some cases storage and data analysis resources might need to move to where the data is produced.

First, a significant part of the global effort should focus on increasing trust (e.g. through international certification) and enhancing interoperability so that data can be more readily shared across borders and disciplines. Second, we need to develop new tools that can create meaningful, high-quality analytical results from large distributed data sets. These tools and techniques are also needed to select the data that is most valuable for future analysis and storage. This leads to a third focus of effort: financial and environmental sustainability. The rate of global data production per year has already exceeded the rate of increase in global data storage capacity; this gap is widening all the time, making it increasingly important to understand what data has an intrinsic value that should not be lost and what data is “transient” and could eventually be thrown away [Richard Baraniuk, “More is Less: Signal Processing and the Data Deluge”, Science 2011 (331), p. 717].

European Commission activities in scientific data
Through the 7th Framework Programme for research, the Commission, in coordination with Member States, promotes and funds ICT infrastructures for research (e-infrastructures) enabling the transition to e-science. The Commission has invested more than 100 M€ in the scientific data infrastructure over the last few years, covering domains ranging from geospatial information and seismology to genomics, biodiversity and linguistics. The development of e-Infrastructures is part of the Digital Agenda flagship initiative, envisioned as a means to connect researchers, instruments, data and computation resources throughout Europe. Furthermore, the 2009 Communication of the Commission on ICT infrastructures for e-science highlighted the strategic role of IT in the scientific discovery process and sought to increase adoption of ICT in all phases of this process. The Communication expressed the urgency of developing a coherent strategy to overcome the fragmentation in infrastructures and to enable research communities to better manage, use, share and preserve data. In its conclusions of December 2009, the Competitiveness Council of the European Union invited Member States and the Commission to broaden access to scientific data and open repositories and to ensure coherent approaches to data access and curation.

More recently, in October 2010, the High Level Expert Group on Scientific Data submitted its final report to the Commission. The main conclusion of the report is that there is a need for a “collaborative data infrastructure” for science in Europe and globally. The vision this infrastructure would enable is described in the following terms:

“Our vision is a scientific e-infrastructure that supports seamless access, use, re-use, and trust of data. In a sense, the physical and technical infrastructure becomes invisible and the data themselves become the infrastructure, a valuable asset, on which science, technology, the economy and society can advance.”

A complementary vision was developed by the Commission co-funded project GRDI2020. It envisions a Research Data Infrastructure that enables integration between data management systems, digital data libraries, research libraries, data collections, data tools and communities of research.

These efforts are expected to create a seamless knowledge territory or “online European Research Area” where knowledge and technology move freely thanks to digital means. Furthermore, it is essential to take a global approach to promote interoperability, discoverability and mutual access of scientific information resources.

Financial support for this policy is expected to come from the next framework programme for research and innovation. The Commission has included data e-infrastructure as a priority in its proposals for the Horizon 2020 programme, covering the period from 2014 to 2020. Coordination with funding sources and policy initiatives in Member States of the EU is also necessary, as much of the e-infrastructure in Europe obtains financing and responds to needs at national level.

In summary, data should become an invisible and trusted e-infrastructure that enables the progress of science and technology. Beyond overcoming technical hurdles, this requires a European (and global) research communication system that enables and encourages a culture of sharing and open science, ensures long-term preservation of scientific information, and is financially and environmentally sustainable.

The views expressed are those of the author and do not necessarily represent the official view of the European Commission on the subject.
