If Data Sharing is the Answer, What is the Question?

by Christine L. Borgman

Data sharing has become policy enforced by governments, funding agencies, journals, and other stakeholders. Arguments in favor include leveraging investments in research, reducing the need to collect new data, addressing new research questions by reusing or combining extant data, and reproducing research, which would lead to greater accountability, transparency, and less fraud. Arguments against data sharing rarely are expressed in public fora, so popular is the idea. Much of the scholarship on data practices attempts to understand the socio-technical barriers to sharing, with goals to design infrastructures, policies, and cultural interventions that will overcome these barriers.

However, data sharing and reuse are common practice in only a few fields. Astronomy and genomics in the sciences, survey research in the social sciences, and archaeology in the humanities are the typical exemplars, which remain the exceptions rather than the rule. The lack of success of data sharing policies, despite accelerating enforcement over the last decade, indicates the need not just for a much deeper understanding of the roles of data in contemporary science but also for developing new models of scientific practice. Science progressed for centuries without data sharing policies. Why is data sharing deemed so important to scientific progress now? How might scientific practice be different if these policies had been in place several generations ago?

Enthusiasm for “big data” and for data sharing is obscuring the complexity of data in scholarship and the challenges for stewardship [1]. Data practices are local, varying from field to field, individual to individual, and country to country. Studying data is a means to observe how rapidly the landscape of scholarly work in the sciences, social sciences, and the humanities is changing. Inside the black box of data is a plethora of research, technology, and policy issues. Data are best understood as representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship. Rarely do they stand alone, separable from software, protocols, lab and field conditions, and other context. The lack of agreement on what constitutes data underlies the difficulties in sharing, releasing, or reusing research data.

Concerns for data sharing and open access raise broader questions about what data to keep, what to share, when, how, and with whom. Open data is sometimes viewed simply as releasing data without payment of fees. In research contexts, open data may pose complex issues of licensing, ownership, responsibility, standards, interoperability, and legal harmonization. To scholars, data can be assets, liabilities, or both. Data have utilitarian value as evidence, but they also serve social and symbolic purposes for control, barter, credit, and prestige. Incentives for scientific advancement often run counter to those for sharing data.

To librarians and archivists, data are scholarly products to curate for future users. However, data are more difficult to manage than publications and most other kinds of evidence. Rarely are data self-describing, and rarely can they be interpreted outside their original context without extensive documentation. Interpreting scientific data often requires access to papers, protocols, analytical tools, instruments, software, workflows, and other components of research practice – and access to the people with whom those data originated. Sharing data may have little practical benefit if the associated hardware, software, protocols, and other technologies are proprietary, unavailable, or obsolete and if the people associated with the origins of the data cannot be consulted [2, 3].

Claims that data and publications deserve equal status in scholarly communication for the purposes of citation raise a host of theoretical, methodological, and practical problems for bibliometrics. For example, what unit should be cited, how, when, and why? As argued in depth elsewhere, data are not publications [1]. The “data publication” metaphor, commonly used in promoting open access to data and encouraging data citation, similarly muddies the waters. Transferring bibliographic citation principles to data must be done carefully and selectively, lest the problems associated with citation practice be exacerbated and new ones introduced. Determining how to cite data is a non-trivial matter.

Rather than assume that data sharing is almost always a “good thing” and that doing so will promote the progress of science, more critical questions should be asked: What are the data? What is the utility of sharing or releasing data, and to whom? Who invests the resources in releasing those data and in making them useful to others? When, how, why, and how often are those data reused? Who benefits from what kinds of data transfer, when, and how? What resources must potential re-users invest in discovering, interpreting, processing, and analyzing data to make them reusable? Which data are most important to release, when, by what criteria, to whom, and why? What investments must be made in knowledge infrastructures, including people, institutions, technologies, and repositories, to sustain access to data that are released? Who will make those investments, and for whose benefit?

Only when these questions are addressed by scientists, scholars, data professionals, librarians, archivists, funding agencies, repositories, publishers, policy makers, and other stakeholders in research will satisfactory answers arise to the problems of data sharing [1].

References:
[1] C.L. Borgman: "Big Data, Little Data, No Data: Scholarship in the Networked World". MIT Press, 2015.
[2] C.L. Borgman et al.: "The Ups and Downs of Knowledge Infrastructures in Science: Implications for Data Management", ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014) and International Conference on Theory and Practice in Digital Libraries (TPDL 2014), London, 2014.
[3] J.C. Wallis et al.: "If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology", PLoS ONE 8(7), e67332, 2013.

Please contact:
Christine L. Borgman
University of California, Los Angeles, USA
