Big data and data science investigations hold great promise for making efficient use of data generated in the course of daily life, from social media transactions, news, and a variety of apps used by a large portion of the world’s population to data generated for health care and life sciences research. Data science brings new insights when large-scale datasets are brought together to characterize and address complex problems. The past decade has seen a plethora of federal and private investments in the collection, organization, and analysis of biomedical data, including the National Institutes of Health’s Big Data to Knowledge program, the Patient-Centered Outcomes Research Institute’s PCORnet, and investments from various industries. The work is maturing, and interesting and exciting results are emerging.
Biomedical data science offers new and powerful tools to better understand health and disease through insights gleaned from data. Linking data science advances with knowledge representation and clinical information understanding, which have been traditional topics in the biomedical informatics field since its early days, has the potential to accelerate data-driven discovery. Biomedical informatics has also been addressing data-driven discovery. However, until this decade, examples where big data were available for this type of pursuit were limited. Biomedical informatics has thus evolved and overlaps significantly with biomedical data science, the subfield of data science that is concerned with discoveries using primarily clinical and other health-relevant data. All data science investigations must address important and interesting questions that are relevant to the areas they are applied to, have access to comprehensible datasets, and devise and apply methods robust enough to cope with complex unstructured observations.
The 8 papers in this JAMIA special issue illustrate the methods, motivations, data, and analytics used to make sense of, and draw biomedical and health implications from, a wide range of observations about life sciences phenomena relevant to the study of health and disease. These articles epitomize the intersection between biomedical informatics and data science. For example, Estiri et al. describe an open-source data quality assessment tool for evaluating and visualizing the completeness and conformance of electronic health record (EHR) data repositories, which is an important step toward addressing challenges to integrating clinical data across distributed networks, as conceptualized in the Big Data to Knowledge initiative.1 Hribar et al. illustrate the application of EHR audit log data, which are generated in the course of routine clinical care, for workflow analysis, with the goal of understanding and optimizing clinical workflow efficiency.2 Kasthurirathne et al. explore social and behavioral domains of health, introducing a novel data type, indicators of social and behavioral domains, and showing how these indicators can be integrated into a process of better understanding clinical phenotypes based on life experiences known to influence long-term health outcomes.3
Any science is predicated on reproducibility, and data science is no exception. Johnson et al.4 describe how a very popular data repository has been extended to also disseminate code that enables reproducibility in critical care research, and Yeung et al.5 focus on the reproducibility of bioinformatics workflows by describing the role of interactive notebooks and containers. Reproducibility relies on reusing data, software, and processes. To achieve reuse, it is important that all these digital objects be organized and standardized so that they can be found and accessed. Gonzalez-Beltran et al.6 report on lessons learned from the development and implementation of Data Tag Suite (DATS), a common metadata model. Xia et al.7 report on the importance of calibration when sharing temporally sensitive biomedical data.
These papers, and others that will appear in future monthly issues of JAMIA, illustrate a wide range of methodologies with the potential to accelerate data-driven discovery. They also illustrate some of the challenges at the intersection of biomedical informatics and data science. For example, the multiple formats in which metadata can be represented (or sometimes not represented) in various biomedical datasets pose a major barrier to data discoverability and reuse. The papers also illustrate some thorny problems that remain unresolved, such as issues with data quality, improper characterization of workflows, and difficulties in representing health-relevant data that are nonclinical in nature. Data quality continues to pose challenges, and although emerging methods can be robust in handling data that are not well behaved, it remains difficult to document workflows and ensure that they are reproducible. The challenges of curation persist.
Yet the articles in this special-focus issue of JAMIA offer promise and opportunities for the new generation of biomedical informatics professionals. They document the existing platforms for data science investigation to accelerate reuse of existing data. There is ample opportunity to develop new methods. In particular, there is a broadening of the areas where knowledge representation and metadata strategies can provide real value.
JAMIA encourages submissions that report original data science investigations, particularly when the informatics challenges and solutions are highlighted in a manner that can be applied in this intersecting area. We are committed to promoting the biomedical informatics developments that have the potential to accelerate data-driven discoveries. In addition to predictive analytics, of particular interest are papers reporting the integration of clinical data with research data, the efficient structure of data repositories and directory services to locate them, and novel ways of linking datasets to articles.
Stay tuned for additional biomedical data science articles that will appear in our monthly journal issues, in addition to outstanding biomedical informatics articles that describe enabling technologies and the many creative uses of methods and models developed by biomedical data scientists.
REFERENCES
- 1. Estiri et al. Exploring completeness in clinical data research networks with DQe-c. J Am Med Inform Assoc. 2018;25(1):17–24.
- 2. Hribar et al. Secondary use of electronic health record data for clinical workflow analysis. J Am Med Inform Assoc. 2018;25(1):40–6.
- 3. Kasthurirathne et al. Assessing the capacity of social determinants of health data to augment predictive models identifying patients in need of wraparound social services. J Am Med Inform Assoc. 2018;25(1):47–53.
- 4. Johnson et al. User needs analysis and usability assessment of DataMED – a biomedical data discovery index. J Am Med Inform Assoc. doi: 10.1093/jamia/ocx134.
- 5. Yeung et al. Reproducible Bioconductor workflows using browser-based interactive notebooks and containers. J Am Med Inform Assoc. 2018;25(1):4–12.
- 6. Gonzalez-Beltran et al. Data discovery with DATS: exemplar adoptions and lessons learned. J Am Med Inform Assoc. 2018;25(1):13–6.
- 7. Xia et al. It’s all in the timing: calibrating temporal penalties for biomedical data sharing. J Am Med Inform Assoc. 2018;25(1):25–31.