Abstract
Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities.
Keywords: dataset, curation, metadata, citation, omics, reuse
INTRODUCTION
While the value of omics-scale datasets for hypothesis generation, model testing, and data validation is unquestionable, infrastructure supporting such reuse is underdeveloped. For example, public deposition of such datasets is historically far from routine (40–45% in the case of transcriptomic datasets)1,2 and those datasets that are present in repositories are frequently annotated sporadically or inconsistently. As a result, omics datasets exist for the most part in a state that is refractory to routine interrogation and citation by bench scientists lacking informatics training. In this case report, we describe a systematic approach by the Nuclear Receptor Signaling Atlas (NURSA) consortium,3 under the auspices of the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program, to improve the discoverability, accessibility, and citability of transcriptomic datasets in the field of nuclear receptor and coregulator signaling.4,5 Although developed initially for 1 type of omics dataset in a specific field, the principles developed and lessons learned are broadly applicable to solving problems of dataset accessibility, discoverability, and citability in any omics field.
RESULTS
We previously embarked upon a project to systematically aggregate, process, and annotate published NR signaling pathway-relevant transcriptomic datasets and expose them through a user-friendly query tool, Transcriptomine.6 Raw files from archived primary datasets were processed to map features to gene symbols, and a consistent industry-standard pipeline was used to calculate relative abundances (for perturbant vs control) and their associated significance values.7 In the Transcriptomine model, individual fold-change data points were mapped to individual experiments (equivalent to a perturbant vs control contrast). Given that the experiments were not organized into discrete datasets, they were not citable by users of Transcriptomine, which denied the authors of the original study attribution as creators of the experiments. Accordingly, the purpose of this study was to organize the Transcriptomine experiments into discrete datasets that would allow linkage from their associated journal articles, discovery through linkage from small molecule and gene knowledge bases and dataset citation indexes, and one-click citation of the datasets in a user’s reference manager of choice.
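As a purely illustrative sketch of the relationship described above between a data point and its experiment, the following Python fragment computes a log2 fold change for one gene in one perturbant vs control contrast. The function name, the replicate values, and the use of a simple ratio of means are assumptions made for illustration. The actual NURSA pipeline is described in reference 7 and also reports significance values, which are omitted here.

```python
import math
from statistics import mean


def log2_fold_change(perturbant, control):
    """Return log2(mean perturbant abundance / mean control abundance)."""
    if not perturbant or not control:
        raise ValueError("both groups need at least one replicate")
    return math.log2(mean(perturbant) / mean(control))


# Hypothetical replicate abundances for one gene in one experiment
# (one experiment = one perturbant vs control contrast).
experiment = {
    "gene": "GREB1",
    "perturbant": [820.0, 780.0, 800.0],
    "control": [100.0, 95.0, 105.0],
}

data_point = log2_fold_change(experiment["perturbant"], experiment["control"])
print(f"{experiment['gene']}: log2 fold change = {data_point:.2f}")
```

A positive value indicates induction by the perturbant relative to control, and a negative value indicates repression.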
Datasets: definition, core metadata elements, and persistent identification
The first step was to organize Transcriptomine experiments as children of individual datasets; these were clearly identified as secondary datasets to distinguish them from the records in the primary dataset repository. Through the Joint Declaration on Data Citation Principles and related initiatives, the FORCE11 and DataCite organizations have recommended a set of core dataset landing page metadata elements that give effect to the dataset sharing and citation ideals embodied in the FAIR (findable, accessible, interoperable, and reusable) principles.8–11 Accordingly, we first set out to design a comprehensive annotation procedure that would ensure strong compliance of the secondary datasets with these principles. The elements we included on the dataset landing page are dataset identifier, title, description, creators (authors), publisher, publication and release dates, version, and provenance. Fields such as title and description were curated de novo, whereas the list of creators was extracted from the contributor fields in the associated primary dataset deposition or, in the case of datasets that were not archived, from the AU fields in the MEDLINE record of the original associated primary research article. See Supplementary File 1 for a full discussion of these and other metadata elements to which NURSA datasets and experiments are mapped.
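The core landing page elements listed above can be pictured as a simple record type. The sketch below is an assumption about how such a record might be modeled in Python, not the NURSA schema itself (see Supplementary File 1 for the actual mapping). The class name, field names, and placeholder values are hypothetical; the identifier, title, creators, and year are taken from reference 22.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DatasetLandingPage:
    """Hypothetical container for the core landing page metadata elements."""
    identifier: str          # persistent identifier (a DOI in this report)
    title: str
    description: str
    creators: List[str]
    publisher: str
    publication_date: str
    release_date: str
    version: str
    provenance: str


def missing_elements(record):
    """Return the names of core elements that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = DatasetLandingPage(
    identifier="10.1621/T2pyrsMPAB",
    title="Age dependence analysis of the R5020-regulated transcriptome "
          "in mouse mammary gland",
    description="placeholder description",   # curated de novo in practice
    creators=["Santos SJ", "Aupperlee MD", "Xie J"],
    publisher="Nuclear Receptor Signaling Atlas",
    publication_date="2009",
    release_date="",                          # left empty to show the check
    version="placeholder version",
    provenance="placeholder provenance",
)

print(missing_elements(record))
```

A completeness check of this kind is one way a curator could confirm that every core element is populated before a dataset page is released.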
The vast majority of biomedical research datasets originate in association with a primary research article in whose biological context they are inextricably embedded. In addition to the core Joint Declaration on Data Citation Principles–compliant metadata elements described above, we included an additional field on the dataset landing page, Associated Article. To achieve the twin aims of conveying the biological provenance of the dataset to the user and visually promoting citation of the dataset (see one-click citation of datasets below), we separated the Associated Article information from the core dataset metadata elements on the dataset landing page using a horizontal line.
We next associated the datasets with persistent, actionable identifiers to ensure the stability of their integration with the biomedical digital ecosystem. Given its familiarity to the research community as an identifier for journal articles, we selected the digital object identifier (DOI) standard to stably identify our secondary datasets. An automated process was developed that assigns to each dataset, and mints with CrossRef, a DOI whose suffix is a unique, random 10-character alphanumerical string (eg, 10.1621/p5Ve2SlK43) and which resolves to the dataset landing page. Individual experiments are then uniquely identified within a dataset by appending an experiment number to the dataset DOI (10.1621/p5Ve2SlK43.1, 10.1621/p5Ve2SlK43.2, etc.).
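The identifier scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the alphabet is assumed to be ASCII letters and digits, and the registration step with CrossRef is deliberately left out because its details are not given here.

```python
import secrets
import string

DOI_PREFIX = "10.1621"  # prefix seen in the example identifiers above
ALPHABET = string.ascii_letters + string.digits  # assumed character set


def new_dataset_doi(existing):
    """Draw random 10-character suffixes until one is not already in use."""
    while True:
        suffix = "".join(secrets.choice(ALPHABET) for _ in range(10))
        doi = f"{DOI_PREFIX}/{suffix}"
        if doi not in existing:
            return doi


def experiment_id(dataset_doi, number):
    """Append an experiment number (starting at 1) to the dataset DOI."""
    if number < 1:
        raise ValueError("experiment numbers start at 1")
    return f"{dataset_doi}.{number}"


print(experiment_id("10.1621/p5Ve2SlK43", 2))  # prints 10.1621/p5Ve2SlK43.2
```

Checking each candidate against the set of identifiers already issued is what guarantees uniqueness; randomness alone only makes collisions unlikely.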
Dataset landing page functionality
Datasets are listed in a dataset directory that can be filtered using dropdown menus to restrict the listing to any combination of NR signaling pathway and physiological system or organ. In brief, the dataset landing page provides for browsing, searching and sorting of gene lists, and discovery of datasets related by RNA source or regulatory molecule. In addition, a link to Transcriptomine runs a query across all of the approximately 40 000 000 transcriptomic data points in the NURSA database, rapidly affording the user a sophisticated perspective on the tissue-specific NR pathways impinging on the gene or genes of interest. Gene lists can be browsed on the website or downloaded along with associated metadata in comma-separated values (CSV) format from the dataset landing page. A more comprehensive discussion of the dataset directory and individual dataset pages is beyond the scope of this case report and is the subject of another manuscript (Becnel et al., in preparation).
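To illustrate how a downloaded gene list might be mined offline, the sketch below filters and sorts a small CSV file with the Python standard library. The column names, the gene symbols GENE_A and GENE_B, and all numeric values are invented for illustration and are not the layout of the actual NURSA export.

```python
import csv
import io
import math

# Hypothetical CSV export; column names and values are assumptions.
CSV_TEXT = """gene,fold_change,p_value
GREB1,8.0,0.001
GENE_A,0.25,0.01
GENE_B,1.5,0.2
"""


def top_genes(csv_text, max_p=0.05):
    """Keep rows at or below max_p, sorted by magnitude of log2 fold change."""
    rows = csv.DictReader(io.StringIO(csv_text))
    significant = [r for r in rows if float(r["p_value"]) <= max_p]
    return sorted(
        significant,
        key=lambda r: abs(math.log2(float(r["fold_change"]))),
        reverse=True,
    )


for row in top_genes(CSV_TEXT):
    print(row["gene"], row["fold_change"])
```

Sorting on the absolute log2 value ranks strongly repressed genes alongside strongly induced ones, rather than listing only the induced genes first.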
Engagement of publisher partners
Although many journals currently link from articles to primary archives of transcriptomic datasets, these datasets are often presented in a manner that is not easily interrogated by bench scientists. Reasoning that journal users might benefit from routine access to a resource where they can not only easily mine a dataset relevant to a specific article, but discover biology in other related datasets, we next engaged a publisher partner within whose articles links to NURSA dataset pages would be embedded. Based upon its Supported Data Repository program,12 which provides for the establishment of programmatic bidirectional links between articles and datasets in recognized data repositories, Elsevier was chosen as the initial publishing partner. Article DOIs were identified during the dataset curation process and used to establish a web service that culminated in the appearance of DOI-based links between a NURSA dataset page and its associated Elsevier journal article (Figure 1).
Figure 1.
Link to NURSA dataset landing page from associated research article.
Integration with small molecule and gene-centric knowledge bases
Despite the profound impact of NR ligands and other small molecules on cellular transcriptomes, links from small molecule knowledge bases to data points documenting these effects are underdeveloped. Reciprocally, there is a dearth of dataset-derived information in gene-centric databases on regulation of these genes by small molecules. Accordingly, we next established access points to the NURSA datasets and/or Transcriptomine from databases that curate information on small molecules (ChEBI) and their genomic targets (NCBI Entrez Gene). Programmatic processes were established to automate the embedding of links to NURSA in appropriate locations in these resources using, respectively, the ChEBI Automatic Xrefs13 and NCBI LinkOut14 features (Figures 2A and B). These links serve to increase the visibility of NURSA transcriptomic resources to researchers in disparate communities with no explicit connection to NR signaling pathways.
Figure 2.
Exposure of NURSA datasets in community knowledge bases and a dataset metadata index. (A) The LinkOut section of the NCBI Entrez Gene page for (in this example) GREB1 links to a page in the Transcriptomine tool displaying data points documenting regulation of GREB1 by NR signaling pathways. The datasets underlying the data points can be cited in Transcriptomine in the same manner as on the dataset landing pages. A full discussion of this functionality can be found elsewhere (Becnel et al., in preparation). (B) NURSA transcriptomic datasets mapped to 17β-estradiol, the physiological ligand for the estrogen receptors, are displayed in the Gene Expression section of the page for this molecule in ChEBI, the European Bioinformatics Institute’s bioactive compound resource. (C) NURSA transcriptomic dataset record in the Thomson Reuters Data Citation Index, a dataset metadata index.
Deposition in dataset metadata index
The next component of the model was to ensure visibility of the datasets through metadata indexes and search engines. Given that the central NIH-supported dataset metadata index bioCADDIE15 is not yet in production, we established proof of principle of dataset metadata deposition using the Data Citation Index (DCI), a service of Thomson Reuters’s Web of Science. Deposition of dataset metadata with DCI was achieved via the Open Archives Initiative Protocol for Metadata Harvesting standard,16 which supports interoperability between digital archives and has been adopted by numerous open access publishers. Due to space constraints, a full discussion of the procedure for this standard-based deposition of our dataset metadata with DCI can be found in Supplementary File 2, and the code is freely available for reuse on the NURSA Github page at https://github.com/BCM-DLDCC/nursa.17 An example of a NURSA dataset record in DCI is shown in Figure 2C.
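For readers unfamiliar with the Open Archives Initiative Protocol for Metadata Harvesting, a harvester typically issues a request such as verb=ListRecords with metadataPrefix=oai_dc and receives records whose metadata are expressed in unqualified Dublin Core. The sketch below builds one such Dublin Core payload with the Python standard library. It is an illustration only and is not the code referenced in Supplementary File 2; the function name and the choice of fields are assumptions, and the example values are taken from reference 22.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def dublin_core_record(title, creators, publisher, year, doi):
    """Serialize dataset metadata as an oai_dc:dc XML element (hypothetical helper)."""
    root = ET.Element(f"{{{OAI_DC}}}dc")

    def add(tag, text):
        ET.SubElement(root, f"{{{DC}}}{tag}").text = text

    add("title", title)
    for creator in creators:
        add("creator", creator)
    add("publisher", publisher)
    add("date", year)
    add("type", "Dataset")
    add("identifier", f"https://doi.org/{doi}")
    return ET.tostring(root, encoding="unicode")


print(dublin_core_record(
    "Age dependence analysis of the R5020-regulated transcriptome "
    "in mouse mammary gland",
    ["Santos SJ", "Aupperlee MD", "Xie J"],
    "Nuclear Receptor Signaling Atlas",
    "2009",
    "10.1621/T2pyrsMPAB",
))
```

In a complete repository response, this element would sit inside the metadata section of a record, alongside a header carrying the record identifier and datestamp.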
One-click citation of datasets
Broad consensus has been established on the importance of dataset citation in publications and grant proposals in establishing datasets as first-class research objects.9,11,18–21 To lower the usability barrier to citing our secondary datasets, we next created a button on the NURSA dataset landing pages that facilitates one-click addition of a dataset citation in RIS format to a user’s citation manager software. A survey of the most recent versions of the 4 major reference managers (Endnote, Mendeley, Zotero, and Papers3) established that Endnote and Papers3 explicitly recognized a dataset-like reference type (Dataset in Endnote, Data File in Papers3) and, accordingly, Endnote was chosen as the initial implementation. Space constraints prevent full discussion of the procedures for mapping the NURSA dataset metadata elements to RIS tags, which are provided in Supplementary File 3. An example of a citation is included in the references.22 Although the Mendeley, Zotero, and Papers applications do not currently specifically support dataset as a reference type, we verified that the file was consumable by each of them (Supplementary File 3). Moreover, since few, if any, journals have any guidelines on citation of datasets, and their Endnote styles do not specify dataset as a reference type, additional manual configuration of Endnote, described in detail in Supplementary File 3, is required in order to display dataset citations in an appropriate format.
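The kind of file that such a button might deliver can be sketched as follows. This is a minimal illustration, not the mapping documented in Supplementary File 3: the function name and the choice of tags are assumptions, although the tags themselves (TY, AU, TI, PY, PB, DO, UR, ER) and the DATA type code are standard RIS. The example values come from reference 22, and only the three creators named there are listed.

```python
def dataset_to_ris(authors, title, year, publisher, doi):
    """Render a dataset citation as RIS text (hypothetical helper)."""
    lines = ["TY  - DATA"]                      # RIS type code for a data file
    lines += [f"AU  - {author}" for author in authors]
    lines += [
        f"TI  - {title}",
        f"PY  - {year}",
        f"PB  - {publisher}",
        f"DO  - {doi}",
        f"UR  - https://doi.org/{doi}",
        "ER  - ",                               # end-of-record marker
    ]
    return "\n".join(lines) + "\n"


ris = dataset_to_ris(
    ["Santos, SJ", "Aupperlee, MD", "Xie, J"],
    "Age dependence analysis of the R5020-regulated transcriptome "
    "in mouse mammary gland",
    "2009",
    "Nuclear Receptor Signaling Atlas",
    "10.1621/T2pyrsMPAB",
)
print(ris)
```

Serving text of this form with a .ris file extension is generally enough for a desktop reference manager to offer to import it, which is what makes a one-click workflow possible.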
DISCUSSION
Despite the transformative impact of transcriptomic datasets on discovery-driven research, modest rates of public archiving and sporadic compliance with metadata annotation standards have restricted their full potential in the field of eukaryotic cellular signaling. The net result has been to promote in this field a research information cycle from which transcriptomic datasets have been largely excluded as independently citable resources. We previously designed a searchable database of transcriptomic experiments in the field of nuclear receptor signaling, Transcriptomine. Although Transcriptomine to some extent addressed the problem of discoverability of these data points for reuse by scientists, the fact that the data points were not organized into discrete datasets prevented their citation by users of our website, denying the authors of the original article attribution as creators of the dataset. Accordingly, we set out in this study to organize Transcriptomine data points into discrete secondary datasets, add functionality that supported their routine citation by end users, and expose them to the broader research ecosystem. Our model repackages primary archived omics-scale datasets into a FAIR principle–compliant format that retains their essential findings, thereby promoting their reuse as first-class research objects. Annotation of these datasets using ontologies and standard persistent identifiers facilitates their distribution across key digital biomedical research resources, including journal articles, gene-centric and small molecule knowledge bases, end user citation managers, and metadata indexes (Figure 3).
Such efforts promote the discoverability and reuse of these datasets by diverse research communities, and help broaden the exposure of previously informatically isolated and underused datasets.23 Work is currently ongoing to extend NURSA transcriptomic dataset interoperability with additional collaborators, including gene-centric resources such as GeneCards24 and PharmGKB,25 small molecule resources such as PubChem26 and DrugBank,27 journal publishers (the Endocrine Society), and dataset metadata indexes such as dkNET.28 Moreover, through our involvement with the BD2K bioCADDIE project, we will provide for discovery of our secondary datasets in the public dataset discovery index currently under development by this consortium.
Figure 3.
Overview of Case Report. Metadata enrichment and standardization processes, coupled to automated distribution of dataset metadata and data points, afford transcriptomic datasets in the field of NR signaling a greater level of exposure to multiple diverse research disciplines than has previously been achieved.
A major goal in achieving and maintaining high levels of discoverability of the secondary transcriptomic datasets was to establish enduring, low-maintenance linkage from multiple different electronic sources to the NURSA datasets. Given its well-established infrastructure, broad community adoption, and familiarity to researchers, the DOI standard was selected to support the creation of these linkages. Indeed, NURSA was the first NIH-supported database to mint DOIs for its primary research datasets, through its 2005 membership with CrossRef. Our adoption of the DOI standard does not necessarily preclude our adoption of any future alternative dataset identifier standards that might emerge.
Several current data science initiatives at NIH suggest that parity of esteem for datasets within the biomedical research information life cycle is a realistic goal. These include establishment of the NIH Commons,29 development of a central dataset discovery index, bioCADDIE15 (Bourne, 2015), and revision of the NIH Genomic Data Sharing Policy to mandate the deposition of a broader set of omics-scale dataset types.30 These welcome developments notwithstanding, a variety of cultural and technical barriers persist that discourage reuse of existing public datasets. A strong interdependence exists between the deposition of fully and accurately annotated datasets and their downstream discovery and citation by researchers: put another way, poorly or inconsistently annotated datasets are less likely to be found, reused, or cited. Unfortunately, the prevailing academic credit and scientific publishing models in general do little to incentivize the deposition of fully and accurately annotated datasets in appropriate repositories. Indeed, our own experience during our dataset annotation work has been that a significant percentage of archived primary transcriptomic datasets have plainly not been subject to a level of peer review comparable to that of their associated articles. Developing tools to facilitate annotation of datasets using ontologies and structured vocabularies and, albeit more speculatively, formally incorporating datasets into the peer-review process will go a long way toward redressing this situation and ensuring that archived datasets are optimally positioned for downstream reuse and citation by the scientific community.
Even assuming public availability of well-annotated datasets, enthusiastic participation of investigators and journal publishers is essential in encouraging a culture of dataset citation. In the cell signaling field at least, publicly archived datasets almost always originate from research described in an associated primary research article. Accordingly, steps must be taken to ensure that citation of a dataset translates to tangible academic credit for the original authors of its associated article. Equally, publishers whose financial health is linked to the prominence of their journal titles, such as society-based publishers, should be reassured that citation of a dataset will not adversely impact, and rather would benefit, the status of the publication (ie, article plus dataset). These 2 aims could be achieved by ensuring, on a technical level, that citations of both the article and its associated dataset count toward a single unified metric and, on a cultural level, that this more inclusive metric be taken to reflect the overall value of the publication (ie, article plus dataset) to the community.
The absence of a culture of dataset citation in the biomedical research community hampers the seamless integration of datasets with the research reporting process that, for research articles at least, is taken for granted. For example, although our RIS download feature removes some of the logistical impediments to citation of a dataset, not all citation managers currently recognize datasets in their reference types. Moreover, we were able to identify no journals, even among those with more progressive positions on dataset reuse and citation, that included dataset in their Endnote bibliography templates. The practical consequence of this is that for a dataset citation to appear in an article in a format that communicates the minimal DataCite and FORCE11 recommended metadata (authors, title, year, repository, identifier, and version), additional manual configuration is required on the part of the end user. Updating citation managers to instantiate datasets as a reference type and updating their journal styles to formally incorporate datasets into their reference templates represent substantial logistical undertakings that will require close coordination of the activities of publishers, citation managers, and dataset metadata index stakeholders. Feedback from end users during beta-testing sessions of the NURSA interface suggests that the very notion that a dataset should be cited, let alone how this might be achieved, is foreign to most researchers (McKenna, personal communication). Evidently, considerable obstacles persist in this area, and a joint effort is currently ongoing between the bioCADDIE and FORCE11 organizations to address these and other issues surrounding dataset citation.
To the best of our knowledge, the model we have developed – exposing secondary, derivative transcriptomic datasets for querying as a biological continuum, and emphasizing the provenance of the underlying data points in an individual citable dataset, as opposed to a research article – is unique in the field of cellular signaling. Despite their already considerable contributions in supporting cutting-edge research, we believe that transcriptomic datasets have an enduring value that extends far beyond the purpose envisaged by their original creators. Our model extends to other discovery-scale platforms in the field of cellular signaling, such as interactomics, post-translatomics, and metabolomics, which document the impact of signaling pathways on the cellular universes of, respectively, protein interactions, protein residue covalent modifications, and metabolic intermediates. Mapping and exposing these resources using consistent identifiers for experimental perturbants and biosources holds out the possibility of greater insights for bench biologists into the coordination and integration of cellular processes by signal transduction pathways. We hope that our model will serve the dual purpose of catalyzing novel insights into the tissue-specific biology of nuclear receptor signaling and providing a blueprint for effective distribution and attribution of secondary omics-scale datasets.
ACKNOWLEDGMENTS
We thank Dan Auld and Sharon Ennis of the Thomson Reuters Corporation for helpful discussions. We appreciate the input of the NURSA program officers, Corinne Silva (National Institute of Diabetes and Digestive and Kidney Diseases [NIDDK]) and Dr Koji Yoshinaga (Eunice Kennedy Shriver National Institute of Child Health and Human Development [NICHD]). Special thanks are due to the NURSA scientific officer at the time this research was carried out, Dr Ron Margolis (NIDDK).
CONTRIBUTIONS
Planning and design: N.J.M., L.B.B.
Project management: Y.D.
Software development: A.N., A.M., W.K.
Manuscript drafting: N.J.M., L.B.B., Y.D.
FUNDING
Work described in this manuscript was supported by a National Institutes of Health Big Data to Knowledge (BD2K) program supplement (DK097748-S1) and a Cancer Center support grant to BCM (CA125123). The NURSA consortium is funded by an award from NIDDK and NICHD (DK097748).
COMPETING INTERESTS
The authors declare no competing interests.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
REFERENCES
- 1. Ochsner SA, Steffen DL, Stoeckert CJ, Jr, McKenna NJ. Much room for improvement in deposition rates of expression microarray datasets. Nat Methods 2008;5(12):991.
- 2. Witwer KW. Data submission and quality in microarray-based microRNA profiling. Clin Chem 2013;59(2):392–400.
- 3. Becnel LB, Darlington YF, Ochsner SA, et al. Nuclear receptor signaling atlas: opening access to the biology of nuclear receptor signaling pathways. PLoS One 2015;10(9):e0135615.
- 4. Mangelsdorf DJ, Thummel C, Beato M, et al. The nuclear receptor superfamily: the second decade. Cell 1995;83(6):835–839.
- 5. McKenna NJ, Lanz RB, O'Malley BW. Nuclear receptor coregulators: cellular and molecular biology. Endocrine Rev 1999;20(3):321–344.
- 6. Ochsner SA, Watkins CM, McOwiti A, et al. Transcriptomine, a web resource for nuclear receptor signaling transcriptomes. Physiol Genomics 2012;44(17):853–863.
- 7. NURSA. Transcriptomine FAQs. 2016. https://www.nursa.org/nursa/about/faqs.jsf#Transcriptomine3. Accessed May 5, 2016.
- 8. FORCE11. The FAIR Data Principles. 2015. https://www.force11.org/group/fairgroup/fairprinciples. Accessed May 5, 2016.
- 9. Martone ME. Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. San Diego, FORCE11 2014. https://www.force11.org/datacitation. Accessed May 5, 2016.
- 10. Starr J, Castro E, Crosas M, et al. Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Comput Sci 2015;1:e1.
- 11. DataCite. Cite Your Data. 2015. https://www.datacite.org/services/cite-your-data.html. Accessed May 5, 2016.
- 12. Elsevier. Database Linking Tool. 2015. https://www.elsevier.com/?a=57755. Accessed May 5, 2016.
- 13. ChEBI. ChEBI User Manual. 2015. http://www.ebi.ac.uk/ena/about/cross-references. Accessed May 5, 2016.
- 14. NCBI. NCBI LinkOut. 2015. http://www.ncbi.nlm.nih.gov/projects/linkout/. Accessed May 5, 2016.
- 15. bioCADDIE. Biomedical and healthCAre Data Discovery Index Ecosystem. 2015. www.biocaddie.org. Accessed May 5, 2016.
- 16. OAI. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). 2015. https://www.openarchives.org/pmh/. Accessed May 5, 2016.
- 17. NURSA. NURSA Github Account. 2015. https://github.com/BCM-DLDCC/nursa. Accessed May 5, 2016.
- 18. Margolis R, Derr L, Dunn M, et al. The National Institutes of Health's Big Data to Knowledge (BD2K) initiative: capitalizing on biomedical big data. J Am Med Inform Assoc 2014;21(6):957–958.
- 19. Borgman CL. Why are the attribution and citation of scientific data important? 2011. http://sites.nationalacademies.org/PGA/brdi/PGA_064019. Accessed May 5, 2016.
- 20. CODATA-ICSTI Task Group on Data Citation Standards and Practices. Out of cite, out of mind: the current state of practice, policy, and technology for the citation of data. Data Sci J 2013;12:CIDCR1–CIDCR75.
- 21. Goodman L, Lawrence R, Ashley K. Data-set visibility: cite links to data in reference lists. Nature 2012;492(7429):356.
- 22. Santos SJ, Aupperlee MD, Xie J, et al. Age dependence analysis of the R5020-regulated transcriptome in mouse mammary gland. Nuclear Receptor Signaling Atlas Datasets 2009. doi: 10.1621/T2pyrsMPAB.
- 23. Bourne PE, Lorsch JR, Green ED. Perspective: sustaining the big-data ecosystem. Nature 2015;527(7576):S16–S17.
- 24. Stelzer G, Dalah I, Stein TI, et al. In-silico human genomics with GeneCards. Human Genomics 2011;5(6):709–717.
- 25. Thorn CF, Klein TE, Altman RB. PharmGKB: the pharmacogenomics knowledge base. Methods Mol Biol 2013;1015:311–320.
- 26. Kim S, Thiessen PA, Bolton EE, et al. PubChem substance and compound databases. Nucleic Acids Res 2015;44(D1):D1202–D1213.
- 27. Law V, Knox C, Djoumbou Y, et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res 2014;42(Database issue):D1091–D1097.
- 28. Whetzel PL, Grethe JS, Banks DE, Martone ME. The NIDDK Information Network: a community portal for finding data, materials, and tools for researchers studying diabetes, digestive, and kidney diseases. PLoS One 2015;10(9):e0136206.
- 29. NIH. NIH Commons. 2015. https://datascience.nih.gov/commons. Accessed May 5, 2016.
- 30. NIH. NIH Genomic Data Sharing Policy. 2015. https://gds.nih.gov/03policy2.html. Accessed May 5, 2016.