Rapid release of prepublication data has served the field of genomics well. Attendees at a workshop in Toronto recommend extending the practice to other biological data sets.
Open discussion of ideas and full disclosure of supporting facts are the bedrock for scientific discourse and new developments. Traditionally, published papers combine the salient ideas and the supporting facts in a single discrete 'package'. With the advent of methods for large-scale and high-throughput data analyses, the generation and transmission of the underlying facts are often replaced by an electronic process that involves sending information to and from scientific databases. For such data-intensive projects, the standard requirement is that all relevant data must be made available at a publicly accessible website at the time of a paper's publication.
One of the lessons from the Human Genome Project (HGP) was the recognition that making data broadly available prior to publication can be profoundly valuable to the scientific enterprise and lead to public benefits. This is particularly the case when there is a community of scientists that can productively use the data quickly — beyond what the data producers could do themselves in a similar time period, and sometimes for scientific purposes outside the original goals of the project.
The principles for rapid release of genome-sequence data from the HGP were first formulated at a meeting held in Bermuda in 1996; these were then implemented as policy by several funding agencies. In exchange for 'early release' of their data, the international sequencing centers retained the right to be the first to describe and analyze their complete datasets in peer-reviewed publications. The draft human genome sequence was the highest profile dataset rapidly released before publication, with sequence assemblies greater than 1,000 base pairs usually released within 24 hours of generation. This experience ultimately demonstrated that the broad and early availability of sequence data greatly benefited life sciences research by leading to many new insights and discoveries, including new information on 30 disease genes published prior to the draft sequence.
At a time when advances in DNA sequencing technologies mean that many more laboratories can produce massive datasets, and when an ever-growing number of fields (beyond genome sequencing) are grappling with their own data sharing policies, a Data Release Workshop was convened in Toronto in May 2009 by Genome Canada and other funding agencies. The meeting brought together a diverse and international group of scientists, ethicists, lawyers, journal editors, and funding representatives. The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.
By design, the Toronto meeting continued policy discussions from previous meetings, in particular the Bermuda meetings (1996, 1997 and 1998) and the 2003 Fort Lauderdale meeting. The Fort Lauderdale meeting first recommended that rapid pre-publication release be applied to other datasets whose primary utility was as a resource for the scientific community, and it also established the responsibilities of the resource producers, resource users, and the funding agencies. A similar 2008 Amsterdam meeting extended the principle of rapid data release to proteomics data. Although the recommendations of these earlier meetings can apply to many genomics and proteomics projects, many outside the major sequencing centers and funding agencies remain unaware of the details of these policies, and so one goal of the Toronto meeting was to reaffirm the existing principles for early data release with a wider group of stakeholders.
In Toronto, attendees endorsed the value of rapid pre-publication data release for large reference datasets in biology and medicine that have broad utility and agreed that pre-publication data release should go beyond genomics and proteomics studies to other datasets – including chemical structure, metabolomic, and RNAi datasets, and annotated clinical resources (cohorts, tissue banks, and case-control studies). In each of these domains, there are diverse data types and study designs, ranging from the large-scale 'community resource projects' first identified at Fort Lauderdale (for which meeting participants endorsed pre-publication data release) to investigator-led hypothesis-testing projects (for which the minimum standard must be the release of generated data at the time of publication).
Several issues discussed at previous data release meetings were not revisited, as they were considered fundamental to all types of data release (whether pre-publication or publication-associated). These included: specified quality standards for all data; database designs that meet the needs of data producers and users alike; archiving of raw data in a retrievable form; housing of both 'finished' and 'unfinished' data in databases; and provision of long-term support for databases by funding agencies. New issues that were addressed include the importance of simultaneously releasing metadata (such as environmental/experimental conditions and phenotypes) that will enable users to fully exploit the data, as well as the complexities associated with clinical data owing to concerns about privacy and confidentiality.
To access the full article: https://www.nature.com/articles/461168a