Love Your Data—And Let Others Love It, Too • Into Oblivion


Projects is a Digital Science initiative: a desktop app that lets you organise and manage the data you produce as your research projects progress. The rationale behind Projects is that scientific data needs to be properly managed and preserved if we want it to last. The trend is indeed worrisome: the amount of research data being generated increases by 30% every year, and yet a massive 80% of scientific data is lost within two decades.

Projects and the open science data-sharing platform figshare have published an impressive and pretty telling infographic on science data preservation and chronic mismanagement [scroll down to see it]. What struck me when looking at these numbers is neither the high-throughput data production nor the overall funds it requires – 1.5 trillion USD spent on R&D! – but how little information there is on public policies aimed at solving the problem.

Why open data?

It’d be a mistake to consider that access to the research paper is enough. A publication is a summary, a scholarly advertisement of sorts. On its own, it can in no way substitute for the raw data, the protocol and experiment details, and – when applicable – the source code of the software used to run the analysis. We do see an ever-increasing number of journals opening up scientific publications. Still, researchers and their respective institutions lag behind when it comes to sharing scientific data. Such laziness is not harmless: the infographic highlights that 80% (!) of datasets over 20 years old are not available.

Such a staggering figure is still just the tip of the iceberg. Every time we produce data, we also generate metadata (“data about data”) and protocols (descriptions of methods, analysis and conclusions). Guess what: as files quickly pile up and are mismanaged, all that stuff falls into oblivion.

We need more open research data

This also means that the data we produce today is not accessible to the broader research community. A large number of experiments give negative or neutral results, which do not confirm the working hypotheses. This is an issue on two counts. First, we waste our time, energy and brains on redoing what colleagues have already done and found not to work, because that data is not shared. So, we joyfully dive into writing grants to ask for money to eventually produce data that will not end up in a paper… as publications today only report ‘positive’ results (i.e., those supporting the working hypotheses).

The second issue with withholding data is that it becomes impossible to repeat or even statistically verify a published study. Being able to do so has a name: reproducible research. We have all heard about the shocking outcome of Glenn Begley’s survey of 53 landmark cancer research publications. (Hint: only six of them could be independently reproduced.)

The infographic below shows a slightly different yet equally frightening picture: 54% of the resources used across 238 published studies could not be identified, making verification impossible. Twisting the knife further, the infographic also highlights that the number of retractions due to error and fraud has grown fivefold since 1990. This complements another estimate showing that the number of retracted papers has grown tenfold since 2000.

Public policy to the rescue

We need public policies to come to the rescue. Funding bodies and various other institutions are starting to demand improved data management, the infographic says; it cites the “Declaration on Access to Research Data from Public Funding” and the NIH, MRC and Wellcome Trust. These now require that data management plans be part of applications.

The EU has also committed to considering data from publicly-funded studies as ‘public data’, aligning its sharing with that of other public sector data in a broader Open Data move. Accordingly, the European Commission has launched a Pilot on Open Research Data in Horizon 2020.

P.S. And in case you need additional incentives for data sharing, have a read.

P.P.S. From what I’ve heard, people at Projects would like to hear your views on data availability and how you manage your own data. So get in touch on Twitter @projects.

This post is cross-posted on The Aggregator at SciLogs.com.