2016 Apr 23;23(4):791–795. doi: 10.1093/jamia/ocv213

An informatics research agenda to support precision medicine: seven key areas

Jessica D Tenenbaum 1, Paul Avillach 2, Marge Benham-Hutchins 3, Matthew K Breitenstein 4, Erin L Crowgey 5, Mark A Hoffman 6, Xia Jiang 7, Subha Madhavan 8, John E Mattison 9, Radhakrishnan Nagarajan 10, Bisakha Ray 11, Dmitriy Shin 12, Shyam Visweswaran 13, Zhongming Zhao 14, Robert R Freimuth 4
PMCID: PMC4926738  PMID: 27107452


The recent announcement of the Precision Medicine Initiative by President Obama has brought precision medicine (PM) to the forefront for healthcare providers, researchers, regulators, innovators, and funders alike. As technologies continue to evolve and datasets grow in magnitude, a strong computational infrastructure will be essential to realize PM’s vision of improved healthcare derived from personal data. In addition, informatics research and innovation afford a tremendous opportunity to drive the science underlying PM. The informatics community must lead the development of technologies and methodologies that will increase the discovery and application of biomedical knowledge through close collaboration between researchers, clinicians, and patients. This perspective highlights seven key areas that are in need of further informatics research and innovation to support the realization of PM.

Keywords: precision medicine, informatics, biomarkers, data sharing

The recent announcement of the Precision Medicine (PM) Initiative by President Obama1 has brought PM to the forefront for healthcare providers, researchers, regulators, and funders alike. In order for PM to be fully realized, we must move toward a Learning Healthcare System model that extends evidence-based practice to practice-based evidence by using data generated through clinical care to inform research (Figure 1).2 The leadership and members of the American Medical Informatics Association Genomics and Translational Bioinformatics Working Group have identified seven key areas that informatics research should explore to enable PM’s vision.

Figure 1:

Informatics methodology enables precision medicine (PM) throughout the Learning Healthcare System cycle. Patients – past, present, and future – are at the beginning and end of the cycle. Both healthcare and research participation result in the generation of data. Informatics methods and tools help turn data into information, and information into knowledge. That knowledge, in turn, influences individuals’ behavior and informs patient care. Informatics plays a key role in enabling each stage and transition of this cycle.

Patients: Past, Present, and Future

Stakeholders in the biomedical enterprise include researchers, providers, payers, and patients. But nearly everyone has been or will be a patient at some point. Patients thus are, and must remain, at the heart of the biomedical enterprise.

Key Area One: Facilitate Electronic Consent and Specimen Tracking

In the era of PM, research studies produce more data than they can possibly use and, paradoxically, would benefit from more data than they can possibly generate. As genomic sequencing becomes increasingly available, using de-identified biospecimens for research becomes more nuanced.3 Research participants may be asked to give broad consent to the future use of their data and biospecimens, and to acknowledge the possible, though unlikely, prospect of sequence-based re-identification.4,5 To maximize data and biospecimen reuse while protecting study participants’ privacy and adhering to their wishes, it is essential to develop machine-readable consent forms that enable electronic queries.6 As large biorepositories linked to electronic health records (EHRs) become more common, informatics will enable researchers to identify cohorts – both intra- and interinstitutionally – that meet their study criteria and have given the requisite consent. Proper local management of specimens and derived samples enables accurate tracking of chain of custody, sample derivations, processing/handling, and quality control – all of which are key elements of rigorous and reproducible research.7 Structured and electronically available consent forms can empower study participants by allowing them to access, review, and modify their preferences. A number of large-scale initiatives, including Sage Bionetworks, the Genetic Alliance, and the Global Alliance for Genomics and Health, are making progress in this area.

Areas of informatics that can facilitate study participant consent and sample tracking include the development of structured consent forms and the adoption of relevant ontologies,6,8 user interface design, and infrastructure to enable participant engagement after the point of enrollment. Developing an infrastructure to perform role-based distributed queries over cohorts and sample collections, such as those provided by OpenSpecimen, the Shared Health Research Information Network (SHRINE), and PopMedNet, will also be important.9–11
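To make the idea of a machine-readable consent form that supports electronic queries more concrete, the following is a minimal sketch in Python. It is an illustration only: the `ConsentRecord` fields, the use categories, and the `eligible_participants` helper are hypothetical and are not drawn from any published consent ontology or from the initiatives named above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


# Hypothetical consent record. Field names and use categories are
# illustrative assumptions, not an existing standard.
@dataclass(frozen=True)
class ConsentRecord:
    participant_id: str
    permitted_uses: FrozenSet[str]          # e.g. {"general_research"}
    allows_genomic_sequencing: bool = False
    allows_external_sharing: bool = False


def eligible_participants(
    records: Iterable[ConsentRecord],
    required_use: str,
    needs_sequencing: bool = False,
    needs_sharing: bool = False,
) -> List[str]:
    """Return IDs of participants whose consent covers the requested use."""
    return [
        r.participant_id
        for r in records
        if required_use in r.permitted_uses
        and (r.allows_genomic_sequencing or not needs_sequencing)
        and (r.allows_external_sharing or not needs_sharing)
    ]


records = [
    ConsentRecord("P001", frozenset({"general_research"}), True, True),
    ConsentRecord("P002", frozenset({"cancer_research"}), True, False),
    ConsentRecord("P003", frozenset({"general_research"}), False, False),
]

# Which participants consented to general research that involves sequencing?
print(eligible_participants(records, "general_research", needs_sequencing=True))
```

Because each preference is a discrete field rather than free text, a participant's later change of preference would only require updating the record, and subsequent cohort queries would reflect it automatically.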

Data to Knowledge

The promise of PM can only be realized by aggregating (virtually or otherwise) and analyzing data from multiple sources. A recent report by the National Academy of Sciences calls for the development of an information commons (IC) that amasses medical, molecular, social, environmental, and health outcomes data for large numbers of individual patients.12 The IC would be continuously updated, enable data analyses, and serve as the foundation for a knowledge base (KB) (see Key Area Five). Creating an IC would require informatics expertise to develop data standards, ensure data security, standardize processing pipelines, and establish data provenance.

Key Area Two: Develop, Deploy, and Adopt Data Standards to Ensure Data Privacy, Security, and Integrity, and to Facilitate Data Integration and Exchange

Transparency, reciprocity, respecting study participant preferences, data quality/integrity, and security are key to obtaining and maintaining the massive data stores needed for the advancement of PM.13 Data security does not mean data lock-down. Data-sharing can allow a study to proceed despite low numbers of eligible participants at any single institution, and can enable data reuse or meta-analyses. Data and metadata standards are required for data integration and exchange to be successful, but the lack of such standards or inconsistent use of existing standards are frequent barriers to this goal, especially in emergent “omics” disciplines.14 Data gaps are often discovered when existing standards are adopted for other purposes. Rather than creating yet another standard, those seeking to adopt an existing standard should work with its owners to help extend its scope. Conversely, funders and standards owners should place more emphasis on outreach and education/training for potential adopters of existing data standards. A number of initiatives are working to tackle different aspects of this challenge, including BioSharing, the Center for Expanded Data Annotation and Retrieval (CEDAR), the Biomedical and Healthcare Data Discovery Index Ecosystem (bioCADDIE), and Integrating Data for Analysis, Anonymization, and Sharing (iDASH).15–18

Although there have been significant efforts to share molecular datasets publicly, less progress has been made on sharing healthcare data. An emerging strategy is the development of clinical research networks in which EHR-derived data is stored locally, mapped to a common data model, and queried by proxy for members of a consortium or collaboration. Sharing queries rather than data resolves many of the issues involved in data standardization, harmonization, and governance, as well as the legal and privacy concerns surrounding other federated or aggregation models. This strategy has been adopted by initiatives such as MiniSentinel, Observational Health Data Sciences and Informatics (OHDSI), and the National Patient-Centered Clinical Research Network (PCORNet).19–21 Building on these networks to include genomic and other “omics” data, environmental data, and social data is one way forward in the development of ICs for PM.
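The pattern of sharing queries rather than data can be sketched in a few lines of Python. This is a simplified, in-memory stand-in and not the implementation used by any of the networks named above: the `Site` class, the patient dictionary keys, and the `min_cell_size` suppression threshold are all assumptions introduced for illustration, and the diagnosis code is used only as an example value.

```python
from typing import Any, Callable, Dict, List

# A patient row in a hypothetical common data model; keys are illustrative.
Patient = Dict[str, Any]
Query = Callable[[Patient], bool]


class Site:
    """A network member that keeps patient-level data local."""

    def __init__(self, name: str, patients: List[Patient]) -> None:
        self.name = name
        self._patients = patients  # patient-level rows never leave the site

    def run_count(self, query: Query, min_cell_size: int = 0) -> int:
        """Run the query locally and return only an aggregate count.

        Counts below min_cell_size are reported as 0 (an assumed
        small-cell suppression rule, added here for illustration).
        """
        n = sum(1 for p in self._patients if query(p))
        return n if n >= min_cell_size else 0


def federated_count(
    sites: List[Site], query: Query, min_cell_size: int = 0
) -> Dict[str, int]:
    """Send the same query to every site and collect per-site counts."""
    return {s.name: s.run_count(query, min_cell_size) for s in sites}


def adults_with_e11(p: Patient) -> bool:
    """Example cohort definition: adults with diagnosis code E11."""
    return p["age"] >= 18 and "E11" in p["dx"]


sites = [
    Site("site_a", [
        {"age": 54, "dx": {"E11"}},
        {"age": 17, "dx": {"E11"}},
        {"age": 63, "dx": {"I10"}},
    ]),
    Site("site_b", [
        {"age": 40, "dx": {"E11", "I10"}},
        {"age": 71, "dx": {"E11"}},
    ]),
]

counts = federated_count(sites, adults_with_e11)
print(counts, "total:", sum(counts.values()))
```

The requester sees only per-site counts, so feasibility questions (for example, whether enough eligible participants exist across the network) can be answered without any institution releasing patient-level records.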

Work on data and metadata standards should be recognized and incentivized by the organizations that use and benefit from them, including academia, industry, government regulators, and funding agencies. New methods of encrypting and sharing genomic data in a way that enables collaborative research without compromising patient privacy are needed.

Key Area Three: Advance Methods for Biomarker Discovery and Translation

A primary goal of PM is to uncover subphenotypes defined by the distinct molecular mechanisms that underlie variations in disease manifestations and outcomes.12 One step toward defining subphenotypes is to establish agreed-upon phenotype definitions for existing disease classifications, a surprisingly complex task.22 A number of different initiatives (eg, the Electronic Medical Records and Genomics [eMERGE] Network and the National Institutes of Health [NIH] Collaboratory) are working to make phenotype definitions computationally tractable and reproducible between sites.23,24 Although some progress in sub-phenotyping has been made, new methods, including analyses of high-dimensional data,25 integration of different types of data (eg, “omics,” imaging, clinical, environmental),26,27 and simulating disease behaviors across multiple biological scales in space and time,28 are needed to address a number of challenges.

Although molecular biomarkers can help elucidate underlying physiological mechanisms of disease, only a minority of currently known biomarkers are clinically actionable. Moreover, critical disease subtype distinctions may be impacted by nonmolecular factors, such as socioeconomic status.29 Many questions must be answered before a potentially actionable biomarker can become part of a clinical guideline and translated into practice.30 Information that is necessary for bridging this gap might include the functional characterization of genes and pathways related to the biomarker, the level of evidence, and data about economic feasibility. Clinical decision making regarding actionable biomarkers would be facilitated by a framework for presenting different levels of evidence regarding whether and how a molecular abnormality, genomic or otherwise, might represent a therapeutically relevant biomarker.31,32 Variant annotations with actionable clinical information will enable decision support systems to provide interpretable and actionable patient-specific reports.33–35

Immediate areas for informatics research to focus on include computational phenotyping, biomarker discovery based on heterogeneous data sources, and frameworks for evaluating clinical actionability and utility.

Key Area Four: Implement and Enforce Protocols and Provenance

Scaling up PM requires complex processing and analytic steps applied to large, heterogeneous datasets. With so many “moving parts,” there are many opportunities for errors in the analysis, interpretation, or exchange of information. It is important that both final results and intermediate steps be well documented and fully reproducible. Protocols, and deviations from them, must also be documented. Software versions, analytical parameters, and reference database builds must all be captured as readily available metadata. Although spreadsheets and documents can be useful for informal data exploration, they do not constitute an adequate data management system.
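As one illustration of capturing software versions, analytical parameters, and reference database builds as metadata, the sketch below writes a small provenance record for a single processing step. The record layout, the `provenance_record` function, the step name, and the tool name and version are hypothetical; only Python standard-library calls are used.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from typing import Any, Dict


def provenance_record(
    step: str,
    input_bytes: bytes,
    parameters: Dict[str, Any],
    reference_build: str,
    tool_versions: Dict[str, str],
) -> Dict[str, Any]:
    """Describe one processing step so that it can be reproduced later.

    The field names are illustrative assumptions, not a provenance standard.
    """
    return {
        "step": step,
        "run_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "tool_versions": dict(tool_versions),
        "parameters": dict(parameters),
        "reference_build": reference_build,
        # A checksum ties the record to the exact input that was processed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


record = provenance_record(
    step="variant_calling",                      # hypothetical step name
    input_bytes=b"example input contents",       # stand-in for a real file
    parameters={"min_quality": 30},              # hypothetical parameter
    reference_build="GRCh38",
    tool_versions={"example_caller": "1.2.3"},   # hypothetical tool
)

# Serialize the record so it can be stored alongside the results.
print(json.dumps(record, indent=2, sort_keys=True))
```

Writing such a record automatically at every step, rather than reconstructing it afterward from notes or spreadsheets, is what allows another researcher to rerun the analysis without guessing at versions or settings.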

Large projects often share data between groups and may last several years, during which time key personnel may change institutions. All data processing and analysis for final results should be automated and documented so that another researcher can reproduce the work without making assumptions about what was done. There are various tools that enable this approach, including Taverna, preconfigured virtual machines, and Sage Bionetworks’s Synapse Platform.36–38 Though new challenges will always require novel and innovative solutions, the adoption of standard operating procedures when appropriate will facilitate consistency and improve interoperability. In addition, policies must be enacted and enforced to ensure responsible, reproducible, and reusable science.

Processes and protocols for capturing and exchanging metadata and data provenance must be established, standardized, and widely adopted. Furthermore, this information must be considered to be as important as the primary data it describes, and funding agencies and publishers should insist that it be included with any dataset that is produced and released publicly.

Knowledge to Action

Clinical decision making requires the consolidation of PM knowledge and the development of clinical decision support (CDS) tools, which, together with individual patient data, will provide actionable information at the point of care.

Key Area Five: Build a Precision Medicine Knowledge Base

A comprehensive KB will contain information about disease subtypes, disease risk, diagnosis, therapy, and prognosis that emerges from the ongoing analysis of data in an IC. Such a KB must be flexible, scalable, and extensible. Current KBs (eg, on genomic variants) are isolated from one another and do not support federated querying. Informatics solutions are needed for data-sharing and building a consensus on clinical interpretations of disparate, multiscale data. This KB must be machine-readable, as well as human-readable. Knowledge management technologies must enable effective ontological modeling, knowledge provenance, and new methodologies for updating and maintaining the integrated KB. Novel computational reasoning approaches must be utilized to allow efficient federated queries to be run across billions of knowledge units, enabling causal inference and decision support.

New methods and processes must be developed to organize biomedical knowledge into integrated and interconnected KBs that will enable precision diagnostics and therapeutics based on the latest genomic discoveries and clinical evidence. Such KBs must provide federated queries and flexible computational analytics capabilities tailored for use by physicians and researchers.

Key Area Six: Enhance EHRs to Promote Precision Medicine

Commercial EHRs enable CDS for PM that is focused on information about a single gene variant.39 Informatics challenges for CDS include integrating next-generation CDS with PM KBs to provide genome-based risk predictions, prognoses, and drug dosing at the point of care, as well as representing discrete genomic findings and interpretations in a machine-readable format (vs a free-text pathologist or geneticist report). Masys et al.40 proposed a framework for integrating genome-level data (stored external to the EHR) in which decision support systems are implemented through the EHR. EHRs will need to better aggregate, structure, and visually display the heterogeneous data and knowledge available for each patient. Open interfaces should be provided that facilitate modular development of genomic CDS outside of monolithic EHR vendor systems, enabling unencumbered parallel innovation and evolution of each element.

EHR systems must provide standards-based programming interfaces that enable the integration of external data and knowledge sources as well as the development of tools that support custom workflows, novel analytics, data visualization, and data aggregation. The informatics community must partner with EHR vendors to author use cases and develop interfaces, such that both parties benefit from the collaboration.
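The contrast between a discrete, machine-readable genomic finding and a free-text report can be illustrated with a short Python sketch of single-variant decision support. Everything named here is a placeholder: `GENE_X`, `drug_y`, the variant label, the advice text, and the in-memory `KNOWLEDGE_BASE` are hypothetical and carry no clinical meaning, and a real system would consult an external, curated KB through a standards-based interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GenomicFinding:
    """A discrete finding, as opposed to a sentence in a free-text report."""
    gene: str
    variant: str
    interpretation: str  # e.g. "reduced_function"


# Hypothetical knowledge base keyed by (gene, interpretation, drug).
# Entries are placeholders with no clinical meaning.
KNOWLEDGE_BASE: Dict[Tuple[str, str, str], str] = {
    ("GENE_X", "reduced_function", "drug_y"):
        "Consider dose adjustment; review with a pharmacist.",
}


def decision_support(findings: List[GenomicFinding], ordered_drug: str) -> List[str]:
    """Return alerts for findings that are relevant to the ordered drug."""
    alerts = []
    for f in findings:
        advice = KNOWLEDGE_BASE.get((f.gene, f.interpretation, ordered_drug))
        if advice is not None:
            alerts.append(f"{f.gene} {f.variant} ({f.interpretation}): {advice}")
    return alerts


patient_findings = [
    GenomicFinding("GENE_X", "variant_1", "reduced_function"),
    GenomicFinding("GENE_Z", "variant_2", "normal_function"),
]

for alert in decision_support(patient_findings, "drug_y"):
    print(alert)
```

Because the knowledge base is separate from the patient's findings, it can be updated or replaced without touching the stored results, which mirrors the modular arrangement described above in which genomic data and knowledge sources sit outside the EHR.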

Key Area Seven: Facilitate Consumer Engagement

PM includes more than the medical care administered in a provider’s office. Most of the population spends far more time outside of the doctor’s office than in it. PM will require explicit acknowledgement of this fact as well as deeper consumer participation, which will involve making consumers aware of their own ongoing health status and engaging them in healthcare decision making. It will also involve collecting more information about a person’s environment and lifestyle choices between visits to the doctor – eg, activity level, nutrition information, exposure, and sleep patterns – and incorporating that information into targeted therapeutic and preventive treatments.

Consumer access to genetic testing will increase as provider-ordered and direct-to-consumer genetic tests become more comprehensive and less expensive. The recent announcement from 23andMe that the company will once again offer health-related information, along with Ancestry’s launch of AncestryHealth,41 increases the importance of ensuring that consumers understand basic genetic principles and the implications of genetic testing, can trust the accuracy of genetic tests, and understand how these results, together with family history, will influence treatment decisions.

User-friendly interfaces for the collection, visualization, and integration of consumer data with healthcare information will be key to realizing the potential value of nontraditional data sources. Standards for new consumer data types, as well as patient engagement around ethical, legal, and social issues, will also be important.


Conclusion

The emergence of PM as a priority in biomedical research and healthcare emphasizes the importance of informatics’ contributions to PM. This brief overview highlights essential research directions for both informatics researchers and funding organizations.


Acknowledgments

We thank our colleagues in the Genomics and Translational Bioinformatics Working Group. Their contributions to discussions online, during formal Working Group meetings, and in casual encounters, both at our home institutions and at annual conferences, have helped shape our thoughts and perspectives as reflected in this manuscript. We also thank Joseph Romano, Peggy Peissig, Carolyn Petersen, Li Lang, and Alexis B. Carter, who participated in early discussions of these ideas. Finally, we thank the reviewers, whose insightful questions and thoughtful suggestions helped to significantly improve the manuscript.


Contributors

All authors contributed to overall intellectual content and specific sections of writing. JDT and RRF edited the manuscript for coherence.


Funding

This work was funded in part by NIGMS U19 GM61388 (the Pharmacogenomics Research Network) (RRF); NCATS UL1-TR001117 (JDT); U19-GM61388-13 and R25-CA092049 (MKB); NLM R01-LM012095 (SV); NLM R01-LM011177 (ZZ); R00-LM010822 and R01-LM011663 (XJ); Delaware INBRE #P20 GM103446 (EC); NCATS UL1-TR000117 (RN); and NCI P30-CA51008 and NCATS UL1-TR001409 (SM). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Competing interests

P.A. is a paid consultant for Claritas Genomics.


