The $48 billion U.S. investment in electronic health record (EHR) adoption1 was predicated on a promise: that data stored electronically rather than on paper would be used not only for care and maximization of revenue, but also to underpin precision medicine through research from the molecular to the population scale.2 Instrumenting the health system for discovery will enable identification of drug targets, repositioning of medications, partitioning of populations and their variants for personalized medicine, quantification of the impact of environment, comparison of treatment effectiveness, and postmarket surveillance of therapeutics. EHRs have indeed spawned big data on patients, their diagnoses, notes, laboratory results, and medications. In the next wave, genome sequence, mobile device output, and other innovative data types will enter clinical workflows. And under appropriate consent, patients may enrich purely clinical data sources by permitting linkage to additional data from self-report, mobile apps, pharmacies, and social media.3
But to achieve their full power, these data must be combined across settings.4 Rare disease research needs a pool of millions of patients to find enough who match study criteria.5 Genomic studies quantifying weak effects of myriad genetic variants require hundreds of thousands of participants.6 Accountable care risk calculations need data stratified by demographics and clinical characteristics. Similarly, clinical trials, device development, and quality improvement often require multiple sites for statistical power.
Recently, hundreds of millions of U.S. federal dollars have been allocated to centrally managed research and public health networks dependent on data from and participation by hundreds of healthcare organizations. Physicians, researchers, department chairs, and health system chief executive officers and chief information officers (CIOs) are grappling with a mounting array of data access requests from myriad disparate governmental, scientific, and commercial constituencies. A hospital CIO already manages dozens of mandated outbound data feeds to federal and state agencies, the Joint Commission, and numerous payers. Each requires a different format, and the CIO must invest resources to respond accurately to each distinct request. Network participation presents a real opportunity but also new, expensive, and complex requirements. Because hospital IT departments are the final common pathway to big data for medical discovery, the sociotechnical approach taken to engaging health system participation will dictate the pace, scale, and cost of discovery. Network organizers who understand the health system perspective and conform to health system needs will drive biomedical progress.
Instrumenting the healthcare system for research
There will be countervailing pressures on the coordinating organization for a network, which might well desire data in a highly specified format to meet the perceived needs of a study. The status quo in the research enterprise has been to generate data de novo for each particular project. In traditional clinical trials or disease-based registries, the heavy lift of data collection is generally funded by the pharmaceutical industry or by an infrastructure grant from the National Institutes of Health. Most often, those approaches have produced expensive datasets that are not reused in future studies. In this context, the most direct and simple approach to acquiring data in a prospective multicenter study has been to define a comprehensive common data model for that study and then require that each clinical site hew closely to that model.
The fundamental idea behind using health system data to drive research is that a preponderance of data needed for discovery and improvement is already collected through routine processes in the delivery system. There are tremendous efficiencies to be realized in using these existing data, rather than collecting all data de novo for each study (Table 1). An alternative approach—reshaping data acquisition across the diverse and unruly delivery system to collect, at the point-of-care, the data we want for research—is a harder battle—worth fighting, but a battle that will not be won quickly.
Table 1.
Instrumented Health System Study vs. Traditional Trial or Registry
| | Traditional clinical trial or registry | Instrumented health system study |
|---|---|---|
| Data source | All data generated during and for the trial. | Electronic health records, bio-specimen banks, laboratory information systems, payor claims, e-prescribing data, inpatient pharmacy data |
| Data specifications | Data formats are fully specified but traditionally are specific to the particular study, rather than universal | Highly varied clinical data formats with slowly increasing federal specification by CMS and other agencies |
| Data acquisition | Meticulously collected by trained personnel according to well-specified standard operating procedures | Collected during the course of routine care by non-standardized systems including the “free text” dictation of physician notes. |
| Study design | Study design fully specified, including data types acquired | No preexisting nationwide standard of data from laboratory systems, or annotations such as clinical notes |
| Study hypotheses | Small number of hypotheses tested—e.g., is drug a superior to drug b. Often no secondary analysis planned | Myriad questions and hypotheses to be asked and tested in the future, not specified at the time of data acquisition |
| Cost | High cost for data standardization and collection | Low cost for acquisition, but variable cost for transformation and transmission |
So for now, in forming a data network, each node (a hospital, a health system, or a practice) collects data for routine care and then vends it to a variety of customers, internal and external. Achieving efficiency in reuse of these data for research means conforming research use cases to the data we have. Research questions that can only be answered with fully standardized, complete, and perfect data of the sort collected during a traditional pharmaceutical trial are not the right ones to ask in a trial dependent on a contemporary instrumented health system.
So while a coordinating organization designing a health-system study might initially assume that fully specifying all aspects of the data transactions for the particular study is simpler and less risky, any data model the network articulates that does not closely adhere to the formats in which the data were initially collected will require transformations that take time, are expensive, and often lose information. Further, each transformation required will reduce the likelihood that a health system participates and will erode the efficiencies gained from using health system data in a format close to its native format.
The healthcare organization perspective
By what principles should a healthcare organization choose to share data, by what tenets should those efforts be governed, and what technologies are most cost-effective? With each request, organizations weigh concerns regarding privacy, leakage of business intelligence, and cost against a local benefit to the organization or a public good. The value of participation by healthcare organizations in data research networks must be skillfully framed if the healthcare system is to nimbly use clinically generated data to learn, discover and improve.
Data federation and network design
To network data collected during routine care requires a priori agreement on standards for data exchange, a process called federation. Federated data enable identification of cohorts suitable for hypothesis testing, for example, listing all patients in the network with ulcerative colitis who have been prescribed infliximab. If the network captures robust longitudinal trajectories, outcome and epidemiology studies are possible, for example, measuring the incidence of lymphoma in children on infliximab. At start-up of a clinical trial or creation of a disease registry, leaders agree upon standard formats for data collection. Ideally, data collected for routine care by EHRs could be used in the native format. But in practice, every installation of every brand of EHR generally stores data in a unique, proprietary format, and those data need to be extracted from the EHR for meaningful analysis inside another software system.
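The cohort-identification example above can be sketched in a few lines. This is a minimal illustration, not the architecture of any particular network: it assumes each site has already expressed diagnoses and medications in agreed shared terms, and the names `Patient`, `SiteNode`, and `federated_count` are hypothetical. Each site evaluates the query against its own records and returns only an aggregate count.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Patient:
    patient_id: str
    diagnoses: Set[str]
    medications: Set[str]


@dataclass
class SiteNode:
    """One participating institution holding its own patient records."""
    name: str
    patients: List[Patient] = field(default_factory=list)

    def count_matching(self, diagnosis: str, medication: str) -> int:
        # The query runs locally; only the aggregate count leaves the site.
        return sum(
            1
            for p in self.patients
            if diagnosis in p.diagnoses and medication in p.medications
        )


def federated_count(sites: List[SiteNode], diagnosis: str, medication: str) -> Dict[str, int]:
    """Send the same query to every site and collect per-site counts."""
    return {site.name: site.count_matching(diagnosis, medication) for site in sites}


# Toy data, invented purely for illustration.
sites = [
    SiteNode("hospital_a", [
        Patient("a1", {"ulcerative colitis"}, {"infliximab"}),
        Patient("a2", {"ulcerative colitis"}, {"mesalamine"}),
    ]),
    SiteNode("hospital_b", [
        Patient("b1", {"ulcerative colitis"}, {"infliximab"}),
        Patient("b2", {"crohn disease"}, {"infliximab"}),
        Patient("b3", {"ulcerative colitis"}, {"infliximab", "mesalamine"}),
    ]),
]

per_site = federated_count(sites, "ulcerative colitis", "infliximab")
total = sum(per_site.values())
print(per_site, total)
```

Returning counts rather than patient lists is one possible design choice; a real network would also need authentication, consent handling, and a shared vocabulary, none of which is shown here.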
The most salient, defining decision by a federated research network is whether to combine data or to keep data separate and ultimately controlled by those responsible for the care of the patients described by these data. The latter can be accomplished by keeping data at each site of care in a traditional distributed model. In a less familiar but powerful network model, the data are stored centrally but are not combined and remain controlled by each local institution.7 In general, when healthcare data are combined centrally, they are de-identified. Another critical decision is whether the system that stores the local data and answers queries also facilitates local workflows, or simply vends data to a central network, serving no local purpose.
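One way to picture data that stay separate and locally controlled is a node that answers a query only for studies its institution has approved, and that records every request. The sketch below is an assumption-laden illustration: `LocalNode`, its per-project approval set, and its audit log are hypothetical names, not features of any system cited here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class LocalNode:
    """A site-controlled data store that decides which projects it answers."""
    name: str
    records: List[Dict[str, str]]
    approved_projects: Set[str] = field(default_factory=set)
    audit_log: List[str] = field(default_factory=list)

    def approve(self, project: str) -> None:
        self.approved_projects.add(project)

    def withdraw(self, project: str) -> None:
        self.approved_projects.discard(project)

    def answer(self, project: str,
               predicate: Callable[[Dict[str, str]], bool]) -> Optional[int]:
        # Every request is logged, whether or not it is answered.
        if project not in self.approved_projects:
            self.audit_log.append(f"{project}: declined")
            return None
        count = sum(1 for r in self.records if predicate(r))
        self.audit_log.append(f"{project}: answered count={count}")
        return count


# Toy records and a hypothetical project name, for illustration only.
node = LocalNode("clinic", [
    {"dx": "ulcerative colitis", "rx": "infliximab"},
    {"dx": "ulcerative colitis", "rx": "mesalamine"},
])


def on_infliximab(record: Dict[str, str]) -> bool:
    return record["rx"] == "infliximab"


before = node.answer("registry-1", on_infliximab)   # not yet approved
node.approve("registry-1")
during = node.answer("registry-1", on_infliximab)   # approved
node.withdraw("registry-1")
after = node.answer("registry-1", on_infliximab)    # approval withdrawn
print(before, during, after, node.audit_log)
```

The same class could run at the site of care or inside a centrally hosted but logically separate store; what matters in this sketch is that the approval decision belongs to the institution.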
Local data, local benefit
The distributed model in which each institution’s data are maintained separately not only affords local control over data and participation in studies, but also enables member institutions to develop important local applications that use their own data plus network-derived intelligence. For research studies, local processes include patient contact, patient consent, record review, and patient-permissioned linkage to external data sources; for example, outreach to a cohort of patients on infliximab to solicit patient-reported measures of side effects including shortness of breath or numbness. Other examples are enrollment of ulcerative colitis patients in a pragmatic randomized trial testing efficacy of antibiotics during a flare or collection of biospecimens to assess interleukin-15 receptor α expression after infliximab in a sample of patients. Clinical uses of network data and intelligence are almost exclusively local. Distributed data networks can have benefits beyond simply providing local control over a queryable database. A thoughtfully designed network permits a full end-to-end informatics system for research with federated data.8 That is, rather than merely informing investigators, local control of a decentralized network enables policy to be effected by those with direct care responsibility for the patients whose data are monitored.
Success of decentralized healthcare networks
In 1994, when the World Wide Web was only two years old, a Boston collaborative was the first to use web protocols9 to federate EHR data from disparate systems; patient medication and problem lists from five hospitals were assembled on the fly. In the next decade, we introduced a public health outbreak detection system using a distributed architecture with robust institutional controls.10 More recently, CARRANet, a thriving federated, distributed rheumatology network of 60 children’s hospitals, has begun to blend data for discovery and care improvement, offering full institutional control over participation and data contribution on a project-by-project basis.7
Early successes in distributed data networks have encouraged major federal programs, including the Food and Drug Administration’s Mini-Sentinel Network and the National Institutes of Health-funded Shared Health Research Information Network (SHRINE). These efforts and others involving payors, pharma, the Patient-Centered Outcomes Research Institute, foundations, and public health agencies now vie for clinical data from the clinical care sites.
Success and failures in the distributed model
The Internet itself should serve as the model of a distributed system: every webserver can be a peer and can pull data from other servers. And in the health system, every participating hospital or practice should be able to use the network as a peer. Our proposed model maximizes benefit while minimizing costs through “on-the-fly” translation of EHR data into established data standards. The data are then securely conveyed across the web to authorized users. This approach mirrors the successful and incremental evolution of the web standards themselves (e.g., HTML and HTTP) to create an international system of data sharing transcending any single use case.
Unfortunately, recent memory holds massive failures of large-scale, costly, federally funded federated healthcare networks that ignored the founding principles of the Internet and World Wide Web. The US National Cancer Institute’s $300M caBIG implemented a top-down model, developing tools that found limited adoption by healthcare enterprises.11 The CDC’s Biosense, an emergency department biosurveillance network, conducted analyses centrally and returned limited value to participants, while competing with existing efforts and losing the engagement of network members.
To foster participation and sustainability of distributed networks, what incentives should be offered to organizations and how should the path to sharing be facilitated? We make recommendations, based on a history of successful low-marginal-cost multi-institutional data sharing systems.
Principles for stakeholder engagement in federated networks
Transparency. Local institutions need full insight into use of their data by others.
Representation. Participating institutions must be allowed to take part in the design, selection and approval of studies.
Local benefit. Networks should support participating organizations with data and analytic tools to advance local research agendas and clinical improvements. Local needs can be distinct—one size does not fit all populations or healthcare systems, so the tools must provide flexibility in data access and a range of analytic capabilities. An informatics capability for easy exploration—interactive queries with real time response—particularly empowers local users, as does capacity to implement local clinical research workflows such as data analysis, consenting, patient contact and trial matching.
Right to reassortment. The mesh of participating healthcare institutions should be able to organize and reorganize into opportunistic and productive networks, including for short-lived projects. Each network’s infrastructure should support participating institutions in readily joining multiple networks with low overhead (Figure 1).
Cost-neutrality. Participation should be cost neutral, either because participating brings monetary value (for example, gaining insight toward lower cost care pathways) or because payment is commensurate for work required.
Access. Investigators at participating institutions should be able to use the network as a peer. Experts gaining access across the federated network will advance science, clinical care and public health, generating wholly different and transformative sets of questions that no single committee can achieve.
Parsimony of data storage standards. Emerging federated networks should engage health systems by understanding their needs and then helping to simplify the handling of outbound data. Networks should avoid requiring expensive, one-off data transformations. Health systems have no choice but to comply with requirements of the Centers for Medicare and Medicaid Services (CMS). Because CMS requires nearly all healthcare organizations to transform EHR data into standardized formats for health information exchange, divergent costly transformations should be avoided. Instead, if a network requires a specific format, a simple transformation from the CMS-required format, which can be performed on the fly—as needed for the query—should be defined. Also, other emerging clinical formats (e.g., the Blue Button Initiative for data access by patients, or SMART Platforms12 for exchange of data with apps) should be investigated as a lingua franca for inter-institutional and intersystem data transfer.
Figure 1. A Self-Organizing Federated Data Research Network.

Each institution becomes a node able to join diverse networks by extracting data from the EHR and, as allowable under consent, linking to external data sources. Since different networks may require different datasets and formats, each node may attach more than one “network adapter” (here shown in blue and orange), which enables on the fly data transformation into the network format. This network design enables a health system to (a) invest in a single informatics resource to serve potentially dozens of “customers” of its data; and (b) to use an informatics resource that serves local data and workflow needs (e.g., patient contact, consent, trial matching, analysis) beyond responding to query for any given data requestor.
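The “network adapter” idea in Figure 1 can be illustrated with a small sketch. Everything here is hypothetical: the two target formats, the field names, and the `Node` class are invented for illustration and do not correspond to any real network specification. The point is only that one local data resource can serve several networks by translating records at query time rather than storing a separate copy per network.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Adapter = Callable[[Record], Record]


def blue_adapter(rec: Record) -> Record:
    # Hypothetical format for one network: diagnosis only, upper-cased.
    return {"patient": rec["id"], "condition": rec["diagnosis"].upper()}


def orange_adapter(rec: Record) -> Record:
    # Hypothetical format for a second network: abbreviated keys, with medication.
    return {"pid": rec["id"], "dx": rec["diagnosis"], "rx": rec["medication"]}


class Node:
    """A local data resource that can join or leave networks by swapping adapters."""

    def __init__(self, local_records: List[Record]) -> None:
        self._records = local_records
        self._adapters: Dict[str, Adapter] = {}

    def attach(self, network: str, adapter: Adapter) -> None:
        self._adapters[network] = adapter

    def detach(self, network: str) -> None:
        self._adapters.pop(network, None)

    def respond(self, network: str) -> List[Record]:
        # Transformation happens on the fly; local records are stored once.
        adapter = self._adapters[network]  # raises KeyError if not attached
        return [adapter(r) for r in self._records]


# Toy record, invented for illustration.
node = Node([{"id": "p1", "diagnosis": "ulcerative colitis", "medication": "infliximab"}])
node.attach("blue", blue_adapter)
node.attach("orange", orange_adapter)

blue_out = node.respond("blue")
orange_out = node.respond("orange")
print(blue_out)
print(orange_out)
```

Leaving a network is a matter of calling `detach`, which reflects the low-overhead joining and leaving described in the caption; a production adapter would of course map between real, agreed data standards.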
Conclusion
The U.S. founding fathers debated federalist principles, balancing federal authority against states’ rights. As healthcare attempts to become a data-driven enterprise that learns from itself,13 and drives toward the practice of precision medicine,14 it must instrument the delivery system for discovery research and cost-effective federation of data. How a research network will nimbly balance essential centralized functions with local participation and authority must be the subject of an essential national discussion.
The investment in instrumenting healthcare with information technology to better understand expenses, reimbursement, quality, disease evolution, public health processes and the genetic basis of disease, could underpin an unprecedentedly fertile ecosystem of studies with multiple layers of validation and reproducibility testing.
But there is a risk of overwhelming the already over-extended information technology staff of healthcare institutions. Engaging the health system means not treating it as a square peg being forced into the round hole of traditional prospective trial design. To collectively achieve sustainable national-scale data federation, institutions must be able to fluidly join or leave an evolving assortment of networks. The burden of data provision should be minimized by harmonizing research requests with mandated clinical data formats. Participating clinical organizations should be first-class peers in the network, both responding to and issuing queries (Figure 1) and deriving meaningful local benefit. The opportunity is to create a flexible, dynamic resource that serves not only current national and local needs, but also future functions not yet imagined. Instrumenting the healthcare system for research means leveraging existing care delivery processes to create the big data needed for discovery and improvement.
Acknowledgements
We thank Rachel Eastwood for graphic design and conceptual work on the figure.
Footnotes
Neither author reports any conflict of interest.
Contributor Information
Kenneth D. Mandl, Children’s Hospital Informatics Program, Boston Children’s Hospital, Center for Biomedical Informatics, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, 617-355-4145, Kenneth_Mandl@Harvard.edu.
Isaac S. Kohane, Harvard Medical School Center for Biomedical Informatics, Boston Children’s Hospital Informatics Program.
References
- 1. Blumenthal D. Launching HITECH. The New England journal of medicine. 2010;362:382–385. doi: 10.1056/NEJMp0912825.
- 2. Kohane IS, Drazen JM, Campion EW. A glimpse of the next 100 years in medicine. The New England journal of medicine. 2012;367:2538–2539. doi: 10.1056/NEJMe1213371.
- 3. Weber GM, Mandl KD, Kohane IS. Finding the Missing Link for Big Biomedical Data. JAMA: the journal of the American Medical Association. 2014. doi: 10.1001/jama.2014.4228.
- 4. Dolgin E. Trial networks move beyond single-disease strategies. Nature medicine. 2011;17:1525. doi: 10.1038/nm1211-1525.
- 5. Patten IS, Rana S, Shahul S, et al. Cardiac angiogenic imbalance leads to peripartum cardiomyopathy. Nature. 2012;485:333–338. doi: 10.1038/nature11040.
- 6. Berndt SI, Gustafsson S, Magi R, et al. Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nature genetics. 2013;45:501–512. doi: 10.1038/ng.2606.
- 7. Natter MD, Quan J, Ortiz DM, et al. An i2b2-based, generalizable, open source, self-scaling chronic disease registry. Journal of the American Medical Informatics Association: JAMIA. 2013;20:172–179. doi: 10.1136/amiajnl-2012-001042.
- 8. Mandl KD, Kohane IS, McFadden D, et al. Scalable Collaborative Infrastructure for a Learning Healthcare System (SCILHS): architecture. Journal of the American Medical Informatics Association: JAMIA. 2014;21:615–620. doi: 10.1136/amiajnl-2014-002727.
- 9. Kohane IS, Greenspun P, Fackler J, Cimino C, Szolovits P. Building national electronic medical record systems via the World Wide Web. Journal of the American Medical Informatics Association: JAMIA. 1996;3:191–207. doi: 10.1136/jamia.1996.96310633.
- 10. McMurry AJ, Gilbert CA, Reis BY, Chueh HC, Kohane IS, Mandl KD. A self-scaling, distributed information architecture for public health, research, and clinical care. Journal of the American Medical Informatics Association: JAMIA. 2007;14:527–533. doi: 10.1197/jamia.M2371.
- 11. Masys DR, Harris PA, Fearn PA, Kohane IS. Designing a public square for research computing. Science translational medicine. 2012;4:149fs32. doi: 10.1126/scitranslmed.3004032.
- 12. Mandl KD, Mandel JC, Murphy SN, et al. The SMART Platform: early experience enabling substitutable applications for electronic health records. Journal of the American Medical Informatics Association: JAMIA. 2012;19:597–603. doi: 10.1136/amiajnl-2011-000622.
- 13. Friedman CP, Wong AK, Blumenthal D. Achieving a nationwide learning health system. Science translational medicine. 2010;2:57cm29. doi: 10.1126/scitranslmed.3001456.
- 14. National Research Council (US) Committee on A Framework for Developing a New Taxonomy of Disease. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington (DC): National Academies Press; 2011.
