Data Stewardship in the Age of Big Data

by Daniel E. Atkins

As evidenced by a large and growing number of reports from research communities, research funding agencies, and academia, there is increasing acceptance of the assertion that science is becoming more and more data-centric.

Data is pushed to the center by the scale and diversity of data from computational models, observatories, sensor networks, and the trails of social engagement in our current age of Internet-based connection. It is pulled to the center by technologies and methods, now called “the big-data movement” or, by some, a “fourth paradigm for discovery,” that enable extracting knowledge from these data and then acting upon it. Vivid examples of data analytics and its potential to convert data to knowledge and then to action in many fields are found at http://www.cra.org/ccc/dan.php. Note that I am using the phrase “big data” to include both the stewardship of the data and the system of facilities and methods (including massive data centers) to extract knowledge from data.

The focus of these comments is the fact that our current infrastructure - technologies, organizations, and sustainability strategies - for the stewardship of digital data and codes is far from adequate to support the vision of transformative data-intensive discovery. I include both data and programming codes because to the extent that both are critical to research, both need to be curated and preserved to sustain the fundamental tradition of the reproducibility of science. See for example an excellent case study about reproducible research in the digital age at http://stanford.edu/~vcs/papers/RRCiSE-STODDEN2009.pdf.

The power of data mining and analytics increases the opportunity cost of not preserving data for reuse, particularly for inherently longitudinal research such as the study of global climate change. Multi-scale and multi-disciplinary research often requires complex data federation, and in some fields careful vetting and credentialing of data is critical. Appraisal and curation are, at present at least, expensive and labor-intensive. Government research-funding agencies are declaring data from research to be a public good and requiring that it be made openly available to the public. But where will these data be stored and stewarded?

On the campuses of research universities there is widespread and growing demand by researchers to create university-level, shared, professionally managed data-storage services and associated services for data management. This is being driven by:

the general increase in the scale of data and new methods for extracting information and knowledge from data, including that produced by other people
policies by research funders requiring that data be an open resource, available at no or low cost to others over long periods of time
privacy and export regulations on data with which individual researchers lack the capability to comply on their own
the growing need to situate data and computational resources together to make it easier for researchers to develop scientific applications, increasingly as web services, on top of the rich data store. They could potentially use their own data, as well as shared data and tools from others, to accelerate discovery, democratize resources, and yield more bang for the buck from research funding.

Although there are numerous successful repository services for the scholarly literature, most do not accommodate research data and codes. Furthermore, as noted by leaders of an emerging Digital Preservation Network (DPN) being incubated by Internet2, a US-based research and education network consortium, even the scholarship that is being produced today is at serious risk of being lost forever to future generations. There are many digital collections with a smattering of aggregation, but all are susceptible to multiple single points of failure. DPN aspires to do something about this risk, including by providing archival services for research data.

No research funding agency, at least in the US, has provided or is likely to provide the enormous funding on a sustained basis required to create and maintain an adequate cyberinfrastructure for big data. We must approach it as a shared enterprise involving academia, government, for-profit and non-profit organizations, with multiple institutions playing complementary roles within an incentive system that is sustainable both financially and technically.

At the federal government level, major research funding agencies including the National Science Foundation, the National Institutes of Health, and the Department of Energy, together with several mission-based agencies, are developing, with encouragement from the White House, a coordinated inter-agency response to “big data.” Although details will not be available for several months, the goals will be strategic and will include four linked components: foundational research, infrastructure, transformative applications to research, and related training and education.

Commercial partners could play multiple roles in the big-data movement, especially by providing cloud-based platforms for storing and processing big data. The commercial sector has provided and will likely continue to provide resources at a scale well beyond what can be provided by an individual or even a university consortium. Major cloud service providers in the US have strategies to build a line of business to provide the cloud as a platform for big data, and there is growing interest within both universities and the federal government in exploring sustainable public-private partnerships.

Initial efforts at collaboration among academia, government, and industry are encouraging, but great challenges remain in nurturing the infrastructure necessary to achieve the promise of the big-data movement.

Please contact:
Daniel E. Atkins
University of Michigan, USA