Abstract
The launch of the US BRAIN and European Human Brain Projects coincides with growing international efforts toward transparency and increased access to publicly funded research in the neurosciences. The need for data-sharing standards and neuroinformatics infrastructure is more pressing than ever. However, ‘big science’ efforts are not the only drivers of data-sharing needs, as neuroscientists across the full spectrum of research grapple with the overwhelming volume of data being generated daily and a scientific environment that is increasingly focused on collaboration. In this commentary, we consider the issue of sharing the richly diverse and heterogeneous small data sets produced by individual neuroscientists, so-called long-tail data. We discuss the utility of these data, the diversity of repositories and options available for sharing such data, and emerging best practices. We provide use cases in which aggregating and mining diverse long-tail data convert numerous small data sources into big data for improved knowledge about neuroscience-related disorders.
The premise that neuroscience will benefit from routine and universal data sharing has been around since the early days of the Internet. Calls to develop shared data repositories similar to those of the genomics and protein structure communities were instantiated through the US Human Brain Project in the early 1990s, funded by the US National Institutes of Health (NIH)1. Part of the motivation was the idea that an understanding of the brain would require cooperative efforts to integrate information across scales and modalities2, combining data generated with the different techniques practiced across the various disciplines of neuroscience.
Through 2005 (refs. 3,4), the US Human Brain Project funded many software tools and databases for diverse data types, including neuroimaging, microscopy, physiology and computational modeling. As databases and community data repositories for neuroscience have continued to accrue, the Neuroscience Information Framework (NIF, http://neuinfo.org) has been charged with surveying, cataloging and federating public resources since 2008. NIF currently lists hundreds of neuroscience-specific databases comprising millions of records in its resource registry and data federation. Well-known examples of public data in neuroscience include the Allen Brain Atlas and consortia such as the Alzheimer's Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org/) and the Human Connectome Project (http://www.humanconnectomeproject.org/). The utility of such resources is clear, as hundreds of publications have used these data (Supplementary Table 1). With the newly funded European Human Brain Project (https://www.humanbrainproject.eu/) and the US Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative (http://www.whitehouse.gov/share/brain-initiative), the amount of public data in neuroscience will continue to increase.
In the context of astronomy and high-energy physics, the aforementioned projects might be termed big science5 projects, characterized by large, coordinated teams and extensive instrumentation6. Although these new initiatives clearly argue for open data resources in neuroscience, they do not address the issue of routine data sharing by individual neuroscience researchers. The myriad data sets produced by individual small-scale studies have come to be known as long-tail data6 (Fig. 1): each data set may be small, but collectively they represent the vast majority of scientific data. Historically, raw long-tail data have been treated as a “supplement to the written record of science”6 rather than as a primary research product to be formally shared. Investments in open data repositories, defined as databases or infrastructure that accept data contributions from the community at large for distributed reuse, have been driven by the premise that making such research data available benefits science. Data sharing in the long tail is viewed as essential for increasing transparency, mitigating known biases in publication and increasing data reuse by third parties6,7. Yet the value and effect of sharing non-standardized, heterogeneous data sets across neuroscience disciplines remain an open question. In this commentary, we review current practices and mechanisms for sharing long-tail neuroscience data. We distinguish long-tail data from big science initiatives such as the Allen Brain Atlas, whose mission is to produce data for the public domain, and from large consortia such as ADNI or the Human Connectome Project, in which an agreement is in place to make the data arising from these initiatives publicly available (that is, prospective data sharing). We focus instead on the discrete, unique data sets produced during the course of neuroscience research by individual researchers. We address issues such as data-sharing infrastructure, best practices and incentives, and present case studies in which sharing long-tail data has yielded clear benefits.
Figure 1.
Schematic illustration of long-tail data. Studies that have plotted data set size against the number of data sources reliably uncover a skewed distribution. Big science efforts featuring homogeneous, well-organized data represent only a small proportion of the total data collected by scientists. A very large proportion of scientific data falls in the long tail of the distribution, with numerous small independent research efforts yielding a rich variety of specialty research data sets. The extreme right portion of the long tail includes data that are unpublished, such as siloed databases, null findings, laboratory notes and animal care records. These dark data hold a potential wealth of knowledge but are often inaccessible to the outside world.
What are long-tail data and why share them?
Long-tail data in neuroscience can be defined as small, granular data sets collected by individual laboratories in the course of day-to-day research. These data consist of small publishable units (for example, targeted endpoints), as well as alternative endpoints, parametric data, results from pilot studies and metadata about published data (Fig. 1). The long tail also includes ‘dark data’: unpublished data such as results from failed experiments and records viewed as ancillary to published studies (for example, veterinary care logs). Although these data may not be considered useful in the traditional sense, data-sharing efforts may illuminate important information and findings hidden in this long tail.
An analysis of the literature provides three historical arguments for increased access to long-tail research data in neuroscience. The first wave of calls for neuroscience data sharing was driven by the computational neuroscience and neuroimaging communities, which were interested in data integration for modeling brain function1,8–10. The imaging community, particularly human neuroimaging, has recently renewed calls for data sharing, driven in part by gaps in what is currently available within any single research center11,12. This second call emphasizes the development of large aggregated data sets to increase subpopulation sizes and thereby improve the analytical potential of participant-level data13. A third call, extending across biomedicine, shifts the focus from altruism and discovery to issues of transparency, reproducibility and waste7,14. Like many communities across biomedicine, neuroscience is grappling with issues of data quality and reproducibility11,14–16. Proponents of open data sharing contend that no scientific field is immune from errors or methodological limitations, and that making primary data available for re-analysis serves to uncover and correct errors more quickly than current practices do. In a meta-analysis of psychology papers, however, Wicherts et al.17 found that studies with accessible data tended to have fewer errors and more robust statistical effects than those without, suggesting that when researchers know that data will be made public, more care is taken in data management and/or reporting.
Driven in part by the current squeeze in funding, biomedical communities are also raising concerns that insufficient data sharing has led to waste across the medical enterprise7. Recent estimates indicate that more than 50% of completed studies in biomedicine go unreported, often because results do not conform to authors' hypotheses, with the findings residing instead in file drawers and personal hard drives7. This so-called ‘file-drawer phenomenon’ dominates “the long-tail of dark data”18. Leaving dark data unpublished undermines the entire scientific research enterprise: it leaves an incomplete and biased record7, causes needless duplication of effort (as previous attempts are unknown) and contributes to failures in scientific replication and translation19. These issues impose high costs on stakeholders beyond the data producers, from patients and taxpayers to policy makers, and point to wide-ranging inefficiencies in the current system of scholarly communication.
The potential benefits of sharing long-tail neuroscience data, including dark data, can be exemplified by experiences in the neurotrauma field. Stroke, traumatic brain injury (TBI) and spinal cord injury (SCI) collectively affect over 2.5 million people every year in the US. Given the prevalence of these disorders, substantial public resources have been dedicated to the discovery of new therapeutics. Numerous high-profile clinical trials have failed, despite promising published findings from animal models20. In response to these failures, neurotrauma communities have dedicated substantial time and resources to standardizing study design at both the preclinical and clinical levels21–23. For example, the Stroke Therapy Academic Industry Roundtable (STAIR) standards were implemented in 1999 (and updated in 2009) to create a set of guidelines for testing neuroprotective therapies in preclinical stroke models24,25. However, reproducibility failures continue, even at the preclinical level16.
More recently, both the TBI and SCI communities engaged in substantial efforts to acquire and harmonize existing data sets through efforts such as IMPACT26 for human TBI data and VISION-SCI23 for animal research (Box 1). A tangible outcome of these activities is the emergence of new data standards, including a set of common data elements for prospective studies and powerful new prognostic statistical models for predicting neurological recovery. These case studies directly address the question of whether analyzing pooled long-tail data from multiple sources has value for human health. In the case of the IMPACT study, the answer is clearly yes, as aggregation of data from 43,000 patients has led to the development of common data elements (CDEs) for clinical TBI studies and more accurate diagnostic/prognostic statistical models27 (Box 1). These CDEs, in turn, standardize data collection to ensure that new studies produce long-tail data that can be more easily pooled across multiple centers and trials and tested for features common to TBI. Emerging methods for TBI neuroimaging, genetics and proteomic biomarkers are being further developed as part of newly announced international big-data discovery trials for TBI precision medicine, the CENTER-TBI and TRACK-TBI projects28,29. Early results of these international efforts have already identified previously unknown magnetic resonance imaging (MRI) and molecular biomarkers that predict long-term neurocognitive outcome after TBI30,31.
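To make this pooling-and-modeling workflow concrete, the sketch below shows in Python how harmonized records from several centers might be concatenated and fit with a simple logistic-regression prognostic model, in the spirit of the IMPACT calculator. This is a minimal sketch under stated assumptions: the file names, column names and outcome coding are hypothetical placeholders, not the actual IMPACT schema or method.

```python
# Minimal sketch: pool harmonized TBI records from several centers and fit a
# simple prognostic model in the spirit of the IMPACT calculator.
# File names, column names and outcome coding are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Each center exports its long-tail data mapped to shared common data elements.
centers = ["center_a.csv", "center_b.csv", "center_c.csv"]
pooled = pd.concat([pd.read_csv(f) for f in centers], ignore_index=True)

# Predictors named in the text: GCS, pupil reactivity, blood work and CT
# findings; the outcome is an unfavorable six-month result (hypothetical coding).
predictors = ["gcs_motor", "pupils_reactive", "glucose", "ct_abnormal"]
X = pooled[predictors]
y = pooled["unfavorable_6mo"]  # 1 = unfavorable outcome at six months

model = LogisticRegression(max_iter=1000)
print("cross-validated AUC:",
      cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
model.fit(X, y)  # refit on all pooled data to serve as a 'prognostic calculator'
```

The point of the sketch is that the modeling step is routine; the hard, valuable work is the upstream harmonization of many small data sets to shared data elements.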
Box 1. Successful long-tail data sharing: case studies from translational neurotrauma.
After several failed clinical trials for TBI, a multinational consortium launched IMPACT. IMPACT gathered long-tail data from all of the major clinical trials in TBI conducted over the preceding 20 years into a single database (http://www.tbi-impact.org)26. The IMPACT database now contains data from 43,243 patients with TBI, and (re)analysis of these long-tail data has produced 62 publications to date57, including more accurate prognostic models of outcome. As an example, when data from about 8,700 patients with TBI were mined and reanalyzed, researchers found that combining information from the Glasgow Coma Scale (GCS), pupil reactivity, blood work and CT imaging improved outcome prediction30. This drove development of a publicly available statistical ‘prognostic calculator’ (http://www.tbi-impact.org/?p=impact/calc) with unprecedented precision for predicting TBI recovery, providing clear guidance for tailoring patient care. These efforts also contributed to the creation of the NIH and National Institute of Neurological Disorders and Stroke data-reporting standards for TBI, known as the NINDS TBI Common Data Elements (TBI CDEs). These long-tail data-sharing efforts provided proof of concept leading to ∼$60 million in investments by the US and Europe as part of the International Initiative for Traumatic Brain Injury Research (InTBIR). Early results suggest that these long-tail data-sharing efforts will help usher in a new era of TBI precision medicine.
Replication failures in the SCI research community16 have given rise to grass-roots preclinical data-sharing efforts known as Minimum Information about an SCI Experiment (MIASCI)22 and Visualized Syndromic Information and Outcomes for Neurotrauma-SCI (VISION-SCI)23. Multivariate (re)analysis of long-tail data is revealing multidimensional syndromic patterns that translate across SCI injury models, laboratories and species23. For example, by combining data from multiple studies in cervical SCI models and performing multivariate statistical analysis, we identified a previously unknown set of overlapping measures of motor recovery that co-vary with gray matter and white matter lesion pathology both in rats32 and between rats and monkeys23. Statistical correction for this multivariate relationship revealed that coordinated weight bearing during locomotion is more sensitive to transection injuries, whereas stride length during locomotion is more sensitive to contusive injuries of the spinal cord32. By identifying conserved features expressed in multidimensional (syndromic) space, long-tail data sharing is now helping to screen for mechanistically precise therapeutic effects conserved across models and species to accelerate translation.
In preclinical SCI, similar wide-scale attempts to harmonize legacy long-tail data from dozens of laboratories, including unpublished data and ‘background data’6 (for example, animal care records), are helping to develop a more complete picture of SCI by deriving the computationally defined SCI syndromic space23,32. As the STAIR experience shows, it is difficult to completely standardize and control for initial conditions in models across laboratories. Thus, each individual animal study provides an incomplete glimpse into the syndrome across the full spectrum of possible injury conditions and outcome metrics. By piecing these studies together and harnessing big-data analytics to look across multiple endpoints, both those traditionally reported and those residing in the file drawers (for example, postoperative and veterinary care logs), we can characterize the complex network of interactions among motor, sensory, autonomic and inflammatory responses following SCI. Big-data analytics on SCI long-tail data have uncovered pathophysiological targets that translate not only between injury paradigms32, but also between species23 (Box 1). Thus, in the neurotrauma field, aggregation of existing data is allowing us to look across the full spectrum of research results, improving our prospective data-gathering efforts and making inroads into the complexity of nervous system disorders.
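As a rough illustration of how a syndromic space can be derived from pooled outcomes, the sketch below applies ordinary principal component analysis so that many partially overlapping endpoints collapse into a few latent scores. The endpoint names, input file and mean-imputation step are assumptions for illustration only; the published VISION-SCI analyses used more sophisticated multivariate pipelines than plain PCA.

```python
# Minimal sketch of deriving a 'syndromic space' from pooled SCI outcomes.
# Endpoint names and the input file are hypothetical placeholders.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pooled = pd.read_csv("pooled_sci_outcomes.csv")  # rows = animals, cols = endpoints
endpoints = ["open_field_score", "stride_length", "weight_support",
             "gray_matter_sparing", "white_matter_sparing"]

# Long-tail data are sparse: not every study measured every endpoint, so a
# simple imputation step stands in here for more careful harmonization.
X = SimpleImputer(strategy="mean").fit_transform(pooled[endpoints])
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # each animal's position in 'syndromic space'
print("variance explained:", pca.explained_variance_ratio_)
print(pd.DataFrame(pca.components_, columns=endpoints))  # endpoint loadings
```

The loadings show which endpoints co-vary across studies, which is the sense in which features that no single experiment could reveal become visible in the pooled, multidimensional space.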
Potential caveats of data sharing
The above case studies support the arguments that the scientific enterprise benefits when data are shared and point to a role for data repositories and curators in aggregating and harmonizing these data to support meaningful re-use. However, it remains controversial whether sharing of long-tail data is uniformly beneficial to science. If transparency and reuse are considered benefits, what are the drawbacks?
Researchers often cite the fear that re-analysis of poor-quality data sets, or even good data sets, by non-experts will lead to a flood of bad science in the literature33. Although this is certainly a concern, advocates of data sharing point to the increased human capital made available for analyzing data in new ways. We also know that the current literature, as evidenced by the lack of reproducibility, is rife with examples of poor data, poor experimental design and faulty analysis. Another objection concerns the costs associated with managing, hosting and curating data. As these activities are largely supported by research dollars, they are viewed as having a potentially negative effect on research funding. However, this concern must be balanced against the years of failed translational and clinical studies and the cost of generating new data34.
But as science is a human enterprise, arguments for and against data sharing often focus less on perceived benefits to science as a whole and more on the effect on individual researchers. Does data sharing benefit or harm scientific careers? Interviews and studies of attitudes toward data sharing clearly show that many researchers perceive the latter to be true35. An oft-stated reason for not sharing data is the desire to continue to mine the data and the fear of being scooped if the data are made public36. Historically, scientists have had a proprietary relationship with their data and, until recently, few have challenged this relationship37. Even among researchers willing to share their data, the time and resources required to prepare the data for use by others represent major disincentives. Scientists dedicate enormous time to preparing papers for publication, as publications serve as the primary index of career success. Metrics such as citation rates quantify the impact of these publications using a well-developed citation system, but no such metrics or norms exist for reuse of data. Given that data are considered supplements to the scientific record6 rather than primary products of research to be credited, cited and tracked, researchers must weigh the time and resources required to prepare data for release against the benefits they are likely to accrue.
Data sharing may also raise the fear that others will uncover errors in the data or question the validity of the analysis35. Unlike in the open source software community, where error correction is encouraged and welcomed, uncovering errors in scientific data may be perceived as an attack on a researcher's reputation. In the hypercompetitive environment of biomedicine, such attacks may lead to hard feelings, finger pointing and competitive disadvantage. In recognition of such potential abuses, data-sharing advocates have called for the development of normative practices governing how researchers raise questions about errors in a manner that encourages open dialog. Such replication etiquette38 might include contacting the data contributor and inviting them to review findings, to co-author a study when new findings result, or to help reveal and correct errors in a collaborative, collegial fashion. A recent example of this approach comes from neurophysiology, wherein re-analysis of pooled data from the CRCNS repository (http://crcns.org) by a third party yielded a high-impact paper reporting that distributed hippocampal local field potentials encode a rat's position in space39. The original data donors were directly engaged during re-analysis and served as coauthors (F. Sommer, personal communication), demonstrating that data sharing can benefit data donors and data end-users alike.
Although many discussions on incentives for data sharing focus on the harm done to the researcher sharing the data, fewer have focused on the harm done to researchers when data are not shared. Economic estimates indicate that lack of transparency and data inaccessibility in biomedicine cost tens of billions of dollars annually worldwide7. But there is likely a human cost as well. How many young scientists or graduate students are derailed by trying to replicate studies that essentially reported cherry-picked results or results based on faulty data or tools? One author19 refers to these findings as “occasional happy mistakes” and provides an anecdote about a frustrated graduate student who might not have attempted to replicate a finding had all the original data and results been available. Although difficult to quantify, conversations with colleagues and our own experiences suggest that such avoidable dead ends exact a human cost, discouraging scientists and perhaps driving some of them from science altogether.
A matter of incentives: how credit may be given for data sharing
Surveys on data-sharing practices6,35,40,41 find evidence of peer-to-peer sharing, where researchers barter for something in exchange for data, such as authorship or the goodwill of colleagues. The incentives here are personal and controlled, and time requirements are minimized with a simple file transfer. But preparing data for hosting in a repository often involves more effort, and the benefits from reuse of these data may not flow back to the contributor. For example, statistics on reuse of data from the Cell Centered Database42, a database of high-resolution imaging data (now part of the Cell Image Library), reveal that the majority of published studies come from computational scientists reusing the data to create models or algorithms for image analysis and segmentation43. This result is not surprising, given that computational scientists may lack the skills or instruments for acquiring such data de novo. Yet data producers must expend considerable time and effort to make data discoverable and useful to third parties. Reuse of these data clearly benefits the third party, who gets a publication, and the resource provider, as reuse provides justification for further funding. Without a system of citation and reward for data reuse, the benefits to the data contributor are unclear.
Creating systematic incentives for individual contributors will be critical for making data sharing routine in neuroscience. Development of a scholarly system of credit attribution for data, equivalent to our current system of literature citation, is underway. Currently, two main approaches are receiving attention. The first is the launch of ‘data papers’ or full ‘data journals’, that is, publications designed to describe a data set rather than an analysis of data. Data papers serve two purposes: to provide sufficient metadata and description of the data so that they can be reused, and to co-opt our current paper-based reward system to credit researchers who make data available. Data papers require that data be deposited into a managed repository and that a stable identifier such as a digital object identifier (DOI) be assigned as a handle for identifying the data44. A set of guidelines has emerged for data papers specifically to promote data sharing in neuroscience45, with an exemplar data paper in the journal Gigascience46 and linked data deposited in the OpenFMRI repository (OpenfMRI: ds000114; https://openfmri.org/dataset/ds000114).
The second approach is the creation of a citation and tracking system for data sets themselves, independent of whether a data paper appears. This developing citation system does not require the production of a separate paper, but supports formal referencing of data sets in articles. It has the advantage of making data citations machine-readable, improving the tracking and mining of data citations. Several standards and principles for data citation currently exist, including those of the CODATA/ICSTI Task Force on Data Citation47 and the Joint Declaration of Data Citation Principles48. Task forces in several communities, including the Research Data Alliance49 and FORCE11, are developing recommendations for a data citation system. Although the above solutions create mechanisms through which researchers can gain credit for data, research communities themselves will have to determine the relative value of a given data set in terms of academic promotions, funding and careers.
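As one concrete pattern for machine-readable data citation, a DOI assigned to a data set can be resolved to structured metadata through standard DOI content negotiation. The sketch below shows the general idea in Python; the DOI is a hypothetical placeholder, and the content-negotiation media type should be checked against current DataCite/Crossref documentation rather than taken as definitive.

```python
# Minimal sketch: retrieve machine-readable metadata for a dataset DOI via
# standard DOI content negotiation. The DOI below is a hypothetical placeholder.
import requests

doi = "10.1234/example-dataset"  # substitute a real dataset DOI
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
resp.raise_for_status()
meta = resp.json()

# Structured fields such as title and publisher support automated citation
# tracking without requiring a separate 'data paper'.
print(meta.get("title"), "--", meta.get("publisher"))
```

The same identifier that makes the data set citable in an article thus also makes it trackable by machines, which is the property the emerging citation systems depend on.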
Some funding bodies, such as the NIH, have successfully instituted targeted data-sharing requirements, requiring communities to deposit data in a shared repository as a condition of funding. Notable examples include the National Database for Autism Research (NDAR) and the Federal Interagency Traumatic Brain Injury Research (FITBIR) informatics system. These focused efforts have implemented standards and tools for tracking compliance and have sustained support from the NIH, the US Department of Defense Congressionally Directed Medical Research Programs and the US Army Medical Research and Materiel Command, among others. Coupled with such support mechanisms, this infrastructure provides a model for sustained long-tail data sharing.
Data sharing in the neuroscience community: attitudes and best practices
Given the current reward system and the perceived disincentives to data sharing, do we have any evidence that neuroscientists are ready to share data? We believe that the answer is yes, although an examination of current repositories and communities yields some interesting observations on when, where and how. A sampling of community data repositories reveals that most public neuroscience data repositories are minimally populated relative to the total amount of data produced and the number of laboratories producing them (Supplementary Table 1). This sparse population suggests that a wide-scale culture of routine long-tail data sharing does not yet exist. Nevertheless, considerable variability exists across these resources, with some, for example, NeuroMorpho.org, NDAR and the Cell Image Library, being well populated with contributions from multiple laboratories (Supplementary Table 1). Searching FigShare for ‘neuroscience’ also returns thousands of data sets, suggesting that, in some communities, data sharing is occurring through third-party repositories.
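That last observation can be checked programmatically. The sketch below assumes FigShare's public v2 search API; the endpoint, request body and response fields are assumptions that should be verified against FigShare's current documentation before being relied on.

```python
# Minimal sketch: list public FigShare items matching 'neuroscience'.
# Assumes the public FigShare v2 API; verify the endpoint and fields against
# the current documentation. Only the first page of results is retrieved here;
# counting all matches would require paginating through further pages.
import requests

resp = requests.post(
    "https://api.figshare.com/v2/articles/search",
    json={"search_for": "neuroscience", "page_size": 100},
    timeout=30,
)
resp.raise_for_status()
items = resp.json()

print(f"first page returned {len(items)} public items")
for item in items[:5]:
    print(item.get("title"))
```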
Surveys and studies of data sharing across science also indicate that attitudes in the research community toward routine data sharing are not uniformly negative, but are varied and in flux. For example, the neuroimaging community has undergone a substantial change in attitude about data sharing since the early 2000s, when the Journal of Cognitive Neuroscience requirement to deposit fMRI data into the fMRI Data Center prompted a loud and angry response12. By 2014, the ADNI and Human Connectome50 projects had made large amounts of neuroimaging data available. But beyond these large, institutionally coordinated consortia, there are notable proponents of long-tail data sharing in the neuroimaging community. Grass-roots projects are making data sets freely available for reuse, including the 1000 Functional Connectomes project, now known as the International Neuroimaging Data-sharing Initiative (INDI)51. The NeuroImaging Tools and Research Clearinghouse (NITRC, http://nitrc.org) lists 89 data resources, and NIF lists 56 databases for data sharing in neuroimaging.
Developing a tracking and reward system for data will require that researchers themselves pay more attention to managing and sharing their data44,52. Although private data sharing via an individual laboratory webpage or as supplementary information to a publication is convenient, public repositories maintained by independent organizations can better ensure that long-tail data sets are maintained, archived, searchable and visible. Although there is a perception that we lack data repositories for the types of data neuroscientists produce, NIF and other registries in fact list over 350 data repositories covering a variety of data types. We have also seen the emergence of generic repositories such as FigShare and Dryad that can accommodate multiple data types. More importantly, each community in neuroscience will need to agree on and adopt best practices and standards so that long-tail data are transparent and informative to others (Box 2).
Box 2. Data-sharing best practices for long-tail data.
Discoverable
Data must be modeled and hosted in a way that they can be discovered through search. Many data, particularly those in dynamic databases, are considered part of the ‘hidden web’; that is, they are opaque to search engines such as Google. Authors should make their metadata and data understandable and searchable (for example, use recognized standards when possible and avoid special characters and non-standard abbreviations), ensure the integrity of all links and provide a persistent identifier (for example, a DOI).
Accessible
When discovered, data can be interrogated. Data and related materials should be available through a variety of methods, including download and computational access via the cloud or web services. Access rights to data should be clearly specified, ideally in machine-readable form.
Intelligible
Data can be read and understood by both humans and machines. Sufficient metadata and context description should be provided to facilitate reuse decisions. Standard nomenclature should be used, ideally derived from a community or domain ontology, to make the data machine-readable.
Assessable
The reliability of data sources can be evaluated. Authors should ensure that repositories and data links contain sufficient provenance information so that a user can verify the source of the data.
Useable
Data can be reused. Authors should ensure that the data are actionable, for example, that they are in a format in which they can be used without conversion or that they can readily be converted. In general, PDF is not a good format for sharing data. Licenses should make data available to researchers with as few restrictions as possible. Data in the laboratory should be managed as if they are meant to be shared; many research libraries now have data-management programs that can help. A minimal machine-readable record illustrating these practices is sketched after this box.
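To tie these practices together, the Python sketch below assembles a machine-readable data set description, loosely following the schema.org Dataset vocabulary. All identifiers, URLs and the license choice are hypothetical placeholders, and a real record would be validated against the repository's own metadata requirements.

```python
# Minimal sketch: a machine-readable dataset description covering the practices
# above (persistent identifier, access rights, provenance, format, license).
# All identifiers and URLs are hypothetical placeholders.
import json

record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Cervical SCI locomotor outcomes (example)",
    "identifier": "https://doi.org/10.1234/example-dataset",    # Discoverable
    "conditionsOfAccess": "open; no account required",          # Accessible
    "variableMeasured": ["stride length (cm)",
                         "weight support (score)"],             # Intelligible
    "isBasedOn": "https://doi.org/10.1234/source-study",        # Assessable
    "license": "https://creativecommons.org/licenses/by/4.0/",  # Useable
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://repository.example.org/ds0001.csv",
        "encodingFormat": "text/csv",  # an actionable format, not PDF
    },
}
print(json.dumps(record, indent=2))
```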
Next steps in sharing long-tail data
From the above discussion, we see that a basic infrastructure, a set of best practices and a system of citation for sharing research data are starting to take shape in neuroscience. With these tools, funding agencies, scientific journals and research institutions could take a more active role in developing policies about when, where, what and how data should be shared. Given the early stages of big-data science and the extreme variations in data set size and type, we don't think that a one-size-fits-all policy or infrastructure will work. In our modern networked world, it is unnecessary for all data to be in a single location; rather, the development of stable identifiers and reference systems for data sets allows dynamic indices to be maintained that connect the required pieces together. Thus, different types of data, even if they derive from a single study, may be deposited into different resources as a cost-effective solution. For example, institutional repositories might be used for much of the data produced during a study, especially dark data, whereas community repositories might host more specialized or curated subsets. Identifiers for data and a system of data tracking would also allow funding agencies and journals to monitor compliance with data-sharing policies, something that is currently difficult to do. And finally, communities will have to develop the normative practices to reuse data in a cooperative and ethical manner, as well as procedures that attribute and credit data contributors appropriately.
We don't want to give the impression that all of the challenges in routine data sharing have been addressed. Perhaps the biggest unknown is who will pay for it all. Thus far, funding agencies, institutions, publishers and researchers have engaged in something of a game of ‘hot potato’, with each passing the burden for sustainability of such resources on to someone else. Although a wide range of revenue models exists across the digital content marketplace, there are many uncertainties about whether these will work for publicly funded basic research that is years away from affecting human health, that is, where the immediate value of the data is unknown. Databases such as NDAR pass the cost for data curation, ingestion and storage to the data providers, who must include these funds in their grant applications. Such models work when deposition of data is a condition of funding, but are an uncertain revenue stream without this mandate. What is clear is that accommodating data as a primary product of research will require that new funding models and market options be explored for both the content and the services these resources provide. For example, the idea of ‘data persistence insurance’53 was recently proposed; although the viability of this proposal is uncertain, the call for creative engagement with the commercial sector around scientific data is timely.
Despite these many challenges, emerging evidence suggests that long-tail neuroscience data are of value. Individual data sets can be reanalyzed for new insights, and multiple data sets can be aggregated and meaningfully analyzed when databases are broadly populated and well curated, enabling researchers to ask questions across the data space that are not addressable with a single study23,54,55. Integrating curated data across scales of analysis through data links and data federation engines will continue to accelerate data-driven discovery in neuroscience56. Although big science is expected to produce big data, the scientific community already has vast, not yet fully archived big data, particularly when we consider all of the long-tail data sitting in desk drawers in every laboratory and office. It's time that we take advantage of all that long-tail data have to offer.
Supplementary Material
Supplementary Table 1. A sample of neuroscience-centered data repositories available to the community. Emphasis was placed on those hosting a variety of different data types relevant to neuroscience. Databases not accepting contributions from the community were not included, nor were well-established genetic and molecular resources such as GEO or the Protein Data Bank. Information was collected from internet searches on 7/31/14. Data repositories are rank-ordered by number of datasets or data elements (e.g., subjects, images) from largest to smallest. Resources were classified as either “Open” or “Account” depending on whether the resource provided downloads of datasets without a required account, or whether an account and login were required to gain access to downloadable data, respectively. Most of the “Account” resources are available following a few simple steps (e.g., INCF Dataspace, CARMEN), whereas others require submission of institutional credentials and proposal review for use of data (e.g., NDAR). The year each resource was created was determined either from information on the website or from the date of the earliest publication on the resource. Publications citing each resource were determined from web pages created by the resource authors (in these cases the links lead to publication pages on the resource website) or from text-mining results for the URL, alternative URLs, old URLs, redirected URLs, resource names and resource abbreviations. In all cases URL mentions were accepted, and resource-name results were added only if the first 10 references were accurate for the resource. For example, CRCNS text-mining results retrieve 110 publications, but most acknowledge the funding mechanism rather than the data resource. PubMed URL searches are limited to the first 24 references to the resource owing to URL length restrictions. More detailed information for each resource, including resource identification (ID), webpage/URL, funding agencies and related conditions, can be searched and downloaded from the NIF registry (https://neuinfo.org/mynif/search.php?t=indexable&nif=nlx_144509-1). Blank sections (-) represent data not found at the time of the search. Downloads are represented as either millions (M) or thousands (k).
Acknowledgments
We thank the NIF staff, especially B. Ozyurt, whose text-mining expertise and tools contributed substantially to Supplementary Table 1. The Neuroscience Information Framework is supported by a contract from the NIH Neuroscience Blueprint (HHSN271200800035C) via the National Institute on Drug Abuse. VISION-SCI is supported by NIH grants NS067092 (A.R.F.) and NS079030 (J.L.N.), the Craig H. Neilsen Foundation (A.R.F.) and the Wings for Life Foundation (A.R.F.). This material is based on work supported while M.H.C. was serving at the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Science Foundation.
Abbreviations
- ADNI
Alzheimer's Disease Neuroimaging Initiative
- BIRN
Biomedical Informatics Research Network
- CARMEN
Code Analysis, Repository and Modelling for e-Neuroscience
- CCDB
Cell Centered Database
- CIL
Cell Image Library
- CRCNS
Collaborative Research in Computational Neuroscience
- DTI
diffusion tensor imaging
- ECoG
electrocorticography
- EEG
electroencephalography
- EMSCI
European Multicenter Study about Spinal Cord Injury
- FITBIR
Federal Interagency Traumatic Brain Injury Research informatics system
- fMRI
functional magnetic resonance imaging
- INCF
International Neuroinformatics Coordinating Facility
- IMPACT
International Mission for Prognosis and Analysis of Clinical Trials in TBI
- INDI
International Neuroimaging Data-sharing Initiative/1000 Functional Connectomes Project
- LONI
Laboratory of Neuro Imaging
- MEG
magnetoencephalography
- MP-RAGE
magnetization-prepared rapid gradient-echo
- NDAR
National Database for Autism Research
- NIRS
near-infrared spectroscopy
- PET
positron emission tomography
- PPMI
Parkinson's Progression Markers Initiative
- TBI
traumatic brain injury
Footnotes
Note: Any Supplementary Information and Source Data files are available in the online version of the paper.
Competing Financial Interests: The authors declare competing financial interests: details are available in the online version of the paper.
CFI statement: M.E. Martone is the principal investigator of the Neuroscience Information Framework. A.E. Bandrowski is the NIF Project Leader. A.R. Ferguson, J.L. Nielson and M.H. Cragin are not affiliated with NIF.
References
- 1. Huerta MF, Koslow SH, Leshner AI. Trends Neurosci. 1993;16:436–438. doi: 10.1016/0166-2236(93)90069-x.
- 2. Roysam B, Shain W, Ascoli GA. Neuroinformatics. 2009;7:1–5. doi: 10.1007/s12021-008-9043-9.
- 3. National Institutes of Health. NIH Program Announcement NOT-MH-05-014. 2005. http://grants.nih.gov/grants/guide/notice-files/NOT-MH-05-014.html.
- 4. Shepherd GM, et al. Trends Neurosci. 1998;21:460–468. doi: 10.1016/s0166-2236(98)01300-9.
- 5. Weinberg AM. Science. 1961;134:161–164. doi: 10.1126/science.134.3473.161.
- 6. Wallis JC, Rolando E, Borgman CL. PLoS ONE. 2013;8:e67332. doi: 10.1371/journal.pone.0067332.
- 7. Chan AW, et al. Lancet. 2014;383:257–266. doi: 10.1016/S0140-6736(13)62296-5.
- 8. Ascoli GA, Donohue DE, Halavi M. J Neurosci. 2007;27:9247–9251. doi: 10.1523/JNEUROSCI.2055-07.2007.
- 9. Gardner D, et al. Neuroinformatics. 2008;6:149–160. doi: 10.1007/s12021-008-9024-z.
- 10. Gardner D, et al. Neuroinformatics. 2003;1:289–295. doi: 10.1385/NI:1:3:289.
- 11. Boline J, Lee EF, Toga AW. Front Neurosci. 2008;2:100–106. doi: 10.3389/neuro.01.012.2008.
- 12. Van Horn JD, Gazzaniga MS. Neuroimage. 2013;82:677–682. doi: 10.1016/j.neuroimage.2012.11.010.
- 13. Perrino T, et al. Perspect Psychol Sci. 2013;8:433–444. doi: 10.1177/1745691613491579.
- 14. Poline JB, Poldrack RA. Front Neurosci. 2012;6:96. doi: 10.3389/fnins.2012.00096.
- 15. Poldrack RA, et al. Front Neuroinform. 2013;7:12. doi: 10.3389/fninf.2013.00012.
- 16. Steward O, Popovich PG, Dietrich WD, Kleitman N. Exp Neurol. 2012;233:597–605. doi: 10.1016/j.expneurol.2011.06.017.
- 17. Wicherts JM, Bakker M, Molenaar D. PLoS ONE. 2011;6:e26828. doi: 10.1371/journal.pone.0026828.
- 18. Heidorn PB. Libr Trends. 2008;57:280–299.
- 19. Mueck L. Nat Nanotechnol. 2013;8:693–695. doi: 10.1038/nnano.2013.204.
- 20. Sena ES, van der Worp HB, Bath PM, Howells DW, Macleod MR. PLoS Biol. 2010;8:e1000344. doi: 10.1371/journal.pbio.1000344.
- 21. Fawcett JW, et al. Spinal Cord. 2007;45:190–205. doi: 10.1038/sj.sc.3102007.
- 22. Lemmon VP, et al. J Neurotrauma. 2014;31:1354–1361. doi: 10.1089/neu.2014.3400.
- 23. Nielson JL, et al. J Neurotrauma. 2014 Jul 31. doi: 10.1089/neu.2014.3399.
- 24. Fisher M, et al. Stroke. 2009;40:2244–2250. doi: 10.1161/STROKEAHA.108.541128.
- 25. Kwon BK, Hillyer J, Tetzlaff W. J Neurotrauma. 2010;27:21–33. doi: 10.1089/neu.2009.1048.
- 26. Marmarou A, et al. J Neurotrauma. 2007;24:239–250. doi: 10.1089/neu.2006.0036.
- 27. Maas AI, et al. J Neurotrauma. 2011;28:177–187. doi: 10.1089/neu.2010.1617.
- 28. Manley GT, Maas AI. J Am Med Assoc. 2013;310:473–474. doi: 10.1001/jama.2013.169158.
- 29. Yue JK, et al. J Neurotrauma. 2013;30:1831–1844. doi: 10.1089/neu.2013.2970.
- 30. Steyerberg EW, et al. PLoS Med. 2008;5:e165. doi: 10.1371/journal.pmed.0050165.
- 31. Yuh EL, et al. Ann Neurol. 2013;73:224–235. doi: 10.1002/ana.23783.
- 32. Ferguson AR, et al. PLoS ONE. 2013;8:e59712. doi: 10.1371/journal.pone.0059712.
- 33. Turner CF, et al. Database (Oxford). 2011;2011:bar043. doi: 10.1093/database/bar043.
- 34. Turner JA, et al. Front Neuroinform. 2010;4:10. doi: 10.3389/fninf.2010.00010.
- 35. Tenopir C, et al. PLoS ONE. 2011;6:e21101. doi: 10.1371/journal.pone.0021101.
- 36. Roche DG, et al. PLoS Biol. 2014;12:e1001779. doi: 10.1371/journal.pbio.1001779.
- 37. Boulton G, Rawlins M, Vallance P, Walport M. Lancet. 2011;377:1633–1635. doi: 10.1016/S0140-6736(11)60647-8.
- 38. Bohannon J. Science. 2014;344:788–789. doi: 10.1126/science.344.6186.788.
- 39. Agarwal G, et al. Science. 2014;344:626–630. doi: 10.1126/science.1250444.
- 40. Cragin MH, Palmer CL, Carlson JR, Witt M. Philos Trans A Math Phys Eng Sci. 2010;368:4023–4038. doi: 10.1098/rsta.2010.0165.
- 41. Halavi M, Hamilton KA, Parekh R, Ascoli GA. Front Neurosci. 2012;6:49. doi: 10.3389/fnins.2012.00049.
- 42. Martone ME, et al. J Struct Biol. 2002;138:145–155. doi: 10.1016/s1047-8477(02)00006-0.
- 43. Fernandez JJ. BMC Bioinformatics. 2009;10:178. doi: 10.1186/1471-2105-10-178.
- 44. Goodman A, et al. PLoS Comput Biol. 2014;10:e1003542. doi: 10.1371/journal.pcbi.1003542.
- 45. Gorgolewski KJ, Margulies DS, Milham MP. Front Neurosci. 2013;7:9. doi: 10.3389/fnins.2013.00009.
- 46. Gorgolewski KJ, et al. Gigascience. 2013;2:6. doi: 10.1186/2047-217X-2-6.
- 47. Klein T, et al. Data Sci J. 2013;12:1–9.
- 48. The Future of Research Communications and e-Scholarship (FORCE11). Joint Declaration of Data Citation Principles–FINAL. 2013. https://www.force11.org/datacitation.
- 49. Research Data Alliance. Research data sharing without barriers. 2014. https://rd-alliance.org/group/data-citation-wg.html.
- 50. Van Essen DC, et al. Neuroimage. 2013;80:62–79.
- 51. Mennes M, Biswal BB, Castellanos FX, Milham MP. Neuroimage. 2013;82:683–691. doi: 10.1016/j.neuroimage.2012.10.064.
- 52. The Royal Society. Science as an open enterprise. 2012. https://royalsociety.org/policy/projects/science-public-enterprise/Report/.
- 53. Kennedy DN. Neuroinformatics. 2014;12:361–363. doi: 10.1007/s12021-014-9239-0.
- 54. Costa LF, Zawadzki K, Miazaki M, Viana MP, Taraskin SN. Front Comput Neurosci. 2010;4:150. doi: 10.3389/fncom.2010.00150.
- 55. Hansen MB, Jespersen SN, Leigland LA, Kroenke CD. Front Integr Neurosci. 2013;7:31. doi: 10.3389/fnint.2013.00031.
- 56. Martone ME, Gupta A, Ellisman MH. Nat Neurosci. 2004;7:467–472. doi: 10.1038/nn1229.
- 57. Maas AI, et al. Lancet Neurol. 2013;12:1200–1210. doi: 10.1016/S1474-4422(13)70234-5.