Publishing Your Research Data

You've invested significant time and resources into preparing your final publication. So, after peer review, you're done, right? Not necessarily. You may also want (or be required) to publish the data underlying your research.

 

Co-authorship network map of physicians publishing on Hepatitis C

Why Should We Care About Publishing Data?

Sharing research data promotes transparency, reproducibility, and progress. In some fields, it can spur new discoveries on a daily basis. It’s not atypical for geneticists, for example, to sequence by day and post research results the same evening—allowing others to begin using their datasets in nearly real time (see, for example, Pisani & AbouZahr’s paper). The datasets researchers share can inform business or regulatory policymaking, legislation, government or social services, and more.

Publishing your research data can also increase the impact of your research, and with it, your scholarly profile. Depositing datasets in a repository makes them both visible and citable. You can include them in your CV and grant application biosketches. In turn, scholars around the world can begin working with your data and crediting you. Sharing detailed research data can therefore be associated with increased citation rates (check out this Piwowar et al. study, among others).

Publishing your data may also be required. Federal funders (e.g. National Institutes of Health), granting agencies (e.g. Bill & Melinda Gates Foundation), and journal publishers (e.g. PLOS) increasingly require that datasets be made publicly available, often immediately upon publication of the associated article.

 

How Do We Publish Data?

Merely uploading your dataset to a personal or departmental website won’t achieve these aims of promoting knowledge and progress. Datasets should be able to link seamlessly to any research articles they support. Their metadata should be compatible with bibliographic management and citation systems (e.g. CrossRef or RefWorks), and be formatted for crawling by abstracting and indexing services. After all, you want to be able to find other people’s datasets, manage them in your own reference manager, and cite them as appropriate. So you’ll want your own dataset to be positioned for the same discoverability and ease of use.

How can you achieve all this? It sounds daunting, but it’s actually straightforward. You’ll want to select a data publishing tool or repository that is built around both preservation and discoverability. It should:

  • Offer you a stable location or DOI (which will provide a persistent link to your data’s location),
  • Help you create sufficient metadata to facilitate transparency and reproducibility, and
  • Optimize the metadata for search engines.
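
To make the three criteria above more concrete, here is a minimal Python sketch of a machine-readable dataset description. It assumes the schema.org Dataset vocabulary, which is one common way of exposing dataset metadata to search engines. Repositories typically generate this kind of record for you, so treat it as an illustration rather than something you need to write yourself. The function name `build_dataset_jsonld` is hypothetical, and the title, DOI, and license values are made-up placeholders (the ORCID shown is the example iD used in ORCID's own documentation).

```python
import json


def build_dataset_jsonld(title, description, doi, creator_name, orcid, license_url):
    """Return a schema.org Dataset description as a JSON-serializable dict.

    Hypothetical helper for illustration only; a real repository will
    usually collect these fields through its deposit form.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": description,
        # A DOI resolves through doi.org, giving a persistent link.
        "identifier": f"https://doi.org/{doi}",
        "creator": {
            "@type": "Person",
            "name": creator_name,
            # An ORCID iD ties the dataset to the researcher.
            "identifier": f"https://orcid.org/{orcid}",
        },
        # States the reuse terms in machine-readable form.
        "license": license_url,
    }


if __name__ == "__main__":
    # All values below are placeholders, not a real dataset.
    record = build_dataset_jsonld(
        title="Example survey responses",
        description="Anonymized responses collected for an example study.",
        doi="10.5072/example-dataset",
        creator_name="Josiah Carberry",
        orcid="0000-0002-1825-0097",
        license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    )
    print(json.dumps(record, indent=2))
```

A record like this covers the persistent identifier, the descriptive metadata, and the search-engine-readable format in one place.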

You can learn about a variety of specific tools through the Research Data Management program website, on their Data Preservation & Archiving page. Briefly, here are some good options:

Sample Tools

  • Dash: Dash is an open-source, self-service toolkit for managing, openly publishing, and effectively describing data for access and reuse. Dash supports geolocation metadata as well as ORCID, DOI, and FundRef identifiers, and generates a citation for each of your datasets. Additionally, Dash allows you to set a timed release of data while the associated article undergoes peer review. UC Berkeley Dash is administered and operated by the UC Berkeley Library and the UC Curation Center (UC3) at the California Digital Library.
  • Figshare: Figshare is a multidisciplinary repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. Figshare allows users to upload any file format and have it made viewable in the browser, so that figures, datasets, media, papers, posters, presentations, and filesets can be disseminated. Figshare uses DataCite DOIs for persistent data citation. Users may upload files up to 5 GB in size and have 20 GB of free private space. Figshare uses Amazon Web Services; backups are performed daily and kept for 5 days.
  • re3data: re3data.org is a global registry of research data repositories from different academic disciplines. It presents repositories for the permanent storage and access of datasets to researchers, funding bodies, publishers, and scholarly institutions. re3data.org promotes a culture of sharing, increased access, and better visibility of research data. The registry went live in autumn 2012 and is funded by the German Research Foundation (DFG).

To explore others, check out OpenDOAR, the Directory of Open Access Repositories.

We also recommend that, if your chosen publishing tool enables it, you include your ORCID (a persistent digital identifier) with your datasets, just as with all your other research. This way, your research and scholarly output will be collected in one place, and it will become easier for others to discover and credit your work.
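
Because a mistyped ORCID would attach your dataset to the wrong record (or to none), it can be worth checking the identifier before you deposit. The sketch below is a minimal Python check, assuming the standard hyphenated 16-character form; the function names are hypothetical. It uses the ISO 7064 MOD 11-2 checksum that ORCID documents for its final character, and it only confirms that an iD is well formed, not that it is registered or that it belongs to you.

```python
import re

# Standard display form: four groups of four, last character may be "X".
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the string is a well-formed ORCID iD with a valid checksum."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]


if __name__ == "__main__":
    # Example iD from ORCID's own documentation.
    print(is_valid_orcid("0000-0002-1825-0097"))
```

A check like this catches transposed or mistyped digits, which is the most likely error when copying an iD into a deposit form by hand.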

 

What Does it Mean to License Your Data For Reuse?

Uploading a dataset—with good metadata, of course!—to a repository is not the end of the road for shepherding one’s research. We must also consider what we are permitting other researchers to do with our data. And, what rights do we, ourselves, have to grant such permissions—particularly if we got the data from someone else, or the datasets were licensed to us for a particular use?

To better understand these issues, we first have to distinguish between attribution and licensing.

Citing datasets, or providing attribution to the creator, is an essential scholarly practice.

The issue of someone properly citing your data is separate, however, from the question of whether it’s permissible for them to reproduce and publish the data in the first place. That is, what license for reuse have you applied to the dataset?

The type of reuse we can grant depends on whether we own our research data and hold copyright in it. There can be a number of possibilities here.

  • Sometimes the terms of contracts we’ve entered into (e.g. funder/grant agreements, website terms of use, etc.) dictate data ownership and copyright. We must bear these terms in mind when determining what rights to grant others for using our data.
  • Often, our employers own our research data under our employment contracts or university policies (e.g. the research data is “work-for-hire”).

Remember, the dataset might not be copyrightable to begin with if it does not constitute original expression. We could complicate things if we try to grant licenses to data for which we don’t actually hold copyrights. For an excellent summary addressing these “Who owns your data?” questions, including copyright issues, check out this blog post by Katie Fortney written for the UC system-wide Office of Scholarly Communication.

 

What’s the Right License or Designation for Your Data?

To streamline ownership and copyright questions and to promote data reuse, data repositories often simply apply a particular “Creative Commons” license or public domain designation to all deposited datasets. For instance:

  • Dryad and BioMed Central repositories apply a Creative Commons Zero (CC0) designation to deposited data—meaning that, by depositing in those repositories, you are not reserving any copyright that you might have. Someone using your dataset still should cite the dataset to comply with scholarly norms, but you cannot mandate that they attribute you and cannot pursue copyright claims against them.

Otherwise, it’s worth considering what your goals are for sharing the data to begin with, and selecting a designation or license that both meets your needs and fits within whatever ownership and use rights you have over the data. We can help you with this.

Ambiguity surrounding the ability to reuse data inhibits the pace of research. So, try to identify clearly for potential users what rights are being granted in the dataset you publish.

 

Questions?

Please contact us at schol-comm@berkeley.edu.