Thoughts on Open Data

Open Data stickers
Photo by Jonathan Gray (CC BY-SA 2.0)

In the last few days, the Public Library of Science (PLoS) has announced that they will shortly require every article published in one of their journals to be accompanied by a “data availability statement”, stating how the raw data underlying the manuscript can be accessed by readers. All of the data will have to be made “publicly available, without restriction, immediately upon publication of the article”. This constitutes PLoS’s formalised implementation of a so-called Open Data policy, which is philosophically a part of the broader Open Access movement.

The reaction has been very mixed, varying from incredulity that anyone would have a problem with the policy to vitriolic opposition to PLoS’s “waccaloonery” (a word which, incidentally, I have filed away for future use!). A cursory search on Twitter suggests that users there generally approve of the move, although tweets are so short that it is hard to gauge what motivates opinions one way or the other.

So, should PLoS be applauded, or not? Why have they made this move? Does it reflect the beginning of a broad shift in academic publishing, and if so, what does that mean for researchers?

I will start by saying that I think Open Data has a bigger down-side than open access to published papers, a trend which is now quite firmly established and increasingly widely supported. Open access to papers involves a shift in the academic publishing model from a reader-pays to an author-pays arrangement, but journals function in more or less the same way, and article authors just take on a fairly simple extra step in paying the publication fees. Open Data, on the other hand, poses a logistical challenge. Many datasets are too large to be simply attached to an article as a spreadsheet, and some are pretty huge even for data repositories. There are thousands of proprietary file formats out there. Public repositories, of the sort that PLoS stipulates should be used, aren’t necessarily available for all sorts of data. (I’m not aware of one for the medical imaging data I work with.) To be seriously useful, datasets will tend to require a significant amount of tidying up and annotation with metadata, all of which ideally needs to be standardised in domain-specific ways so that interested researchers can easily find the data they’re looking for. That requires time, money and infrastructure.

Another question is how much of this stuff will actually be accessed again by other researchers. PLoS’s stated ambition is that data be “available to other academic researchers who wish to replicate, reanalyse, or build upon the findings published in our journals”. But replication and reanalysis studies can be thankless, and one may reasonably prefer to build a new study around data whose provenance one fully understands, and which was designed to capture the information relevant to the research question under study. As any statistician will tell you, data acquisition strategy and experimental design are as much a part of a research project as anything else. In my field, and I suspect many others, data isn’t a standardised commodity that can be easily plugged into any new study.

That’s not to say that PLoS’s aims aren’t worthy. But if only a small proportion of the data that researchers archive so carefully ever gets reused, is the effort justified? Moreover, is it not likely that the fields that would benefit most are those in which well-established data repositories already exist?

Finally, the Open Data principle is hard to swallow for the many researchers for whom acquiring the data is the biggest challenge they face. Clinical imaging datasets are hard-won: patients (and usually healthy controls too) must be identified, contacted and invited to participate in the study, screened for any contraindications, scanned, perhaps interviewed, and compensated for their time. This is incredibly labour-intensive, and researchers typically acquire a lot of data at once, while they have the chance, and then investigate different aspects of it over the course of several years. But if the researcher is compelled to “give away” the data when they first publish results from it, the motivation to acquire such datasets will surely be undermined. In this context, fully embracing Open Data must entail a much broader shift in how such studies are approached, and how people are recognised and rewarded for their contributions. Perhaps centralised “data farms” at a handful of sites are the natural end-point, with everyone else just making requests and analysing what they get back. Maybe that is a good thing, but let us not underestimate the upheaval involved.

For PLoS, which has always championed Open Access in the broader sense, this new policy is a natural progression. But I think it’s too early to say whether this is the beginning of an irrevocable trend. I would imagine that other publishers will be watching closely to see how it plays out.