Abstract
Federated networks of clinical research data repositories are rapidly growing in size from a handful of sites to true national networks with more than 100 hospitals. This study creates a conceptual framework for predicting how various properties of these systems will scale as they continue to expand. Starting with actual data from Harvard's four-site Shared Health Research Information Network (SHRINE), the framework is used to imagine a future 4000 site network, representing the majority of hospitals in the United States. From this it becomes clear that several common assumptions of small networks fail to scale to a national level, such as all sites being online at all times or containing data from the same date range. On the other hand, a large network enables researchers to select subsets of sites that are most appropriate for particular research questions. Developers of federated clinical data networks should be aware of how the properties of these networks change at different scales and design their software accordingly.
Keywords: Algorithms, Hospital Shared Services, Medical Record Linkage, Medical Records Systems, Computerized, Search Engine
1. INTRODUCTION
Federated query tools enable researchers to search the medical records of millions of patients across multiple hospitals, while allowing the hospitals to retain control over their data. In 2008, the Shared Health Research Information Network (SHRINE) gave investigators, for the first time, access to the full patient populations at four Harvard-affiliated hospitals. Since then, multiple hospital networks have emerged throughout the United States based on SHRINE and similar platforms like PopMedNet and FACE [1-3]. The Patient-centered Outcomes Research Institute (PCORI) has accelerated the growth of these networks by recently awarding $100 million to 29 health data networks to create PCORnet: The National Patient-Centered Clinical Research Network, which will connect around 100 hospitals across the country [4-15]. By giving investigators unprecedented access to large populations, these networks are already having an impact on biomedical research [16,17].
There is no reason to think that the growth of federated data networks will end with PCORnet. As an increasing number of health centers adopt electronic health records, someday soon nearly all 5,700 hospitals in the United States may be connected to a data network. However, is the software powering these networks ready for such growth? SHRINE was originally created for four hospitals. Today, even the largest networks have only a few dozen sites. Are future networks with 100- or 1000-fold as many sites simply bigger versions of what we currently have, or will we need to approach such networks in a fundamentally different way? This study seeks to answer this question by first defining a set of attributes for evaluating federated clinical data networks, and then using these as a conceptual framework for predicting what a future 4000-site network would look like. The starting point is actual data from a four-site SHRINE network at Harvard. The current Harvard SHRINE sites are Partners Healthcare (Brigham and Women's Hospital and Massachusetts General Hospital), Beth Israel Deaconess Medical Center, Boston Children's Hospital, and Dana-Farber Cancer Institute.
2. MATERIALS AND METHODS
2.1. Conceptual Framework
The purpose of the conceptual framework is not to evaluate the performance of any particular software program in terms of speed or resource requirements, but rather to determine if certain fundamental properties of a network change as the number of sites increases, which could affect how the networks are built or used. Eight properties are considered in this study:
1. Functional Equivalence
Sites in a network are functionally equivalent if they can process the same types of queries, such as temporal queries or queries that require natural language processing.
2. Temporal Equivalence
Sites that are temporally equivalent have patient data covering the same date range. “Complete coverage” means that all data for those patients are available for that date range. In other words, the patients did not receive care at facilities outside the network during that time.
3. Data Release Cycle Synchronicity
Typically, hospitals do not connect their live clinical systems directly to the federated research networks. The data are first copied into separate research data repositories, which are then exposed to the network. Unless all sites update their repositories at the same time, some sites will have more recent data than others.
4. Ontological Equivalence
Sites that are ontologically equivalent can map their local coding systems to a shared ontology (e.g., standard vocabularies).
5. Semantic Discernibility
Even when sites use the same ontology, they might use a given code in different ways. For example, there might be a preference to use one billing code over another at a particular site, or a diagnosis date might be when the code was recorded rather than when the patient was seen. The semantic discernibility of a network describes whether these differences can be detected, either directly from the ontology or indirectly from analysis of the results.
6. System Availability
The availability of a network is the fraction of time when sites are running properly.
7. Population Overlap
Different hospitals might have data about the same patient. This can either lead to over-counting the number of patients in a network (e.g., two hospitals count the same patient) or under-counting (e.g., a patient matches a complex query, but no single site has enough data to know it.) The more the patient populations in a network overlap, the greater the uncertainty in the results [18].
8. Data Access Restrictions
A researcher can query all sites in a network only if he or she meets all the requirements needed to access those sites (e.g., human subjects training).
2.2. Data from the Harvard SHRINE Network
Data from the Harvard SHRINE network were used to predict what a future national network would look like. It is certainly a great leap to use data from only four sites to envision a network with four thousand hospitals. However, the fact that Harvard SHRINE, as one of the earliest federated networks, has had more than five years to mature means that it may be one of the best available sources from which to predict a future national network.
To study temporal equivalence, the Harvard SHRINE query tool was used to determine the number of patients with any of 40 common International Classification of Diseases (ICD-9) codes at each site by year from 2000 through 2013. The codes, which are listed in Table 1, correspond to the most frequent diagnosis categories as reported in the National Ambulatory Medical Care Survey. Because the codes cover a wide range of diseases, including both adult and pediatric diagnoses, the fraction of patients with these codes should be relatively stable over short periods of time. Therefore, if sites had complete data and were temporally equivalent, then the number of patients at each site matching the 40 codes would roughly follow population growth, which was only about 10% in Boston from 2000 to 2013 [19]. Note that the purpose of this query is to estimate data completeness across all diseases over time; it does not reflect the typical use of SHRINE, which is to study a single disease.
Table 1.
Top 40 ICD-9 diagnosis codes.
| Diagnosis Group | ICD-9 Code 1 | ICD-9 Code 2 |
|---|---|---|
| Acute upper respiratory infections, excluding pharyngitis | 465.9 | 466.0 |
| Allergic rhinitis | 477.9 | 477.0 |
| Arthropathies and related disorders | 719.46 | 719.41 |
| Asthma | 493.90 | 493.92 |
| Benign neoplasms | 211.3 | 216.9 |
| Cataract | 366.9 | 366.16 |
| Diabetes mellitus | 250.00 | 250.01 |
| Disorders of lipoid metabolism | 272.0 | 272.4 |
| Essential hypertension | 401.9 | 401.1 |
| Follow up examination | V67.09 | V67.2 |
| General medical examination | V70.0 | V70.7 |
| Gynecological examination | V72.31 | V72.32 |
| Heart disease, excluding ischemic | 424.0 | 427.31 |
| Malignant neoplasms | 174.9 | 185 |
| Normal pregnancy | V22.1 | V22.0 |
| Otitis media and eustachian tube disorders | 382.9 | 381.81 |
| Rheumatism, excluding back | 729.5 | 729.1 |
| Routine infant or child health check | V20.2 | V20.0 |
| Specific procedures and aftercare | V50.2 | V58.66 |
| Spinal disorders | 724.2 | 724.5 |
Two frequently used ICD-9 codes in each of the top 20 primary diagnosis groups for physician office visits in the United States in 2012.
As an example of semantic discernibility, a SHRINE query was run to determine the number of patients between 0 and 17 years old. A second query was then run to determine the number of patients between 0 and 17 years old from 2005 through 2009. This was an actual query that initially caused confusion as we were developing Harvard SHRINE. Despite each site mapping its local codes for age to the same common ontology (i.e., ontological equivalence), the query unexpectedly returned wildly different results across sites. This was later discovered to be due to subtle differences in how sites interpreted this query, rather than true differences in patient populations.
The Harvard SHRINE network has an automated monitoring tool that sends a test query to each site every two hours and generates an email alert if a site does not respond. All email alerts from 1/1/2013 through 12/31/2013 were collected to determine the availability of each site's system.
3. RESULTS
3.1. Functional Equivalence
The four Harvard SHRINE sites use an open source clinical data repository platform called Informatics for Integrating Biology & the Bedside (i2b2). Since Harvard SHRINE's launch in 2008, i2b2 has had five major software updates (versions 1.3 through 1.7), or approximately one per year. Each site has its own timeframe for updating the software, and nationally there are many sites still using version 1.3. In just a four-site network, if each version is equally likely, the probability that all sites are using the same version is only 0.2³ = 0.008. With 4000 sites, the probability of functional equivalence is negligible. Also, i2b2 is just one of many similar software programs used across the country, which makes it even less likely that all sites in a large network can support the exact same types of queries.
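This back-of-the-envelope calculation generalizes easily. The following sketch computes the probability of full functional equivalence under the simplifying assumption stated above, namely that each site independently runs one of several equally likely versions (the function name is illustrative, not from any SHRINE codebase):

```python
def prob_functional_equivalence(n_sites: int, n_versions: int) -> float:
    """Probability that all sites run the same software version, assuming
    each site independently picks one of n_versions versions with equal
    probability: P = n_versions * (1/n_versions)^n_sites."""
    return n_versions * (1.0 / n_versions) ** n_sites

# Four sites, five i2b2 versions (1.3 through 1.7): 0.2^3 = 0.008
print(prob_functional_equivalence(4, 5))

# 4000 sites: effectively zero
print(prob_functional_equivalence(4000, 5))
```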
3.2. Temporal Equivalence
PCORnet requires sites to identify patients with “complete data” over a longitudinal timespan, and the Harvard SHRINE website states that it has a “complete set” of diagnosis data from each of its participating hospitals, starting from January 1, 2001. In both cases, the meaning of completeness is confusing, and this can cause scientists using these networks to misinterpret the results of queries. For PCORnet, because patients may receive care from multiple hospitals, an individual hospital cannot be certain that it has the full medical history of a patient. In Harvard SHRINE, completeness refers to the fact that hospitals include all the data they have, not that patients’ records are complete. For a network as a whole, it is important to understand how data completeness actually varies both across sites and over time.
Figure 1 shows the number of patients with any of 40 common diagnoses in each year for each Harvard SHRINE site, as of January 1, 2014. Between 2001 and 2012, the counts from the four sites increased 49.8%, 92.3%, 105.1%, and 313.3%. This growth far exceeds the population growth over the same period, and the four sites are grossly inconsistent with each other. The increases are more likely due to adoption of electronic health record (EHR) systems at different rates and better overall data coverage in recent years. The Harvard SHRINE sites might be unusual in having at least some data that go back more than a decade. They will eventually be joined by hospitals that have only recently begun building their repositories, which will make temporal equivalence even more difficult to achieve, at least initially. In the future, this might become less of a problem as EHRs become more commonplace.
Figure 1. Relative incidence of 40 common diagnoses in four Harvard SHRINE sites over time.
Data are normalized with respect to counts in 2012.
3.3 Data Release Cycle Synchronicity
In Figure 1, there are also large drops in the patient counts in 2013, which mainly reflect the update frequency of each site's data repository: the most recent data are several months old. The age of the most recent data in Harvard SHRINE has varied since the network was launched. At one point, the data in some hospitals’ databases had not been updated in two or three years. More recently, data updates have been occurring every six months, with individual hospitals refreshing their repositories within a couple of weeks of each other. Even with just four sites, synchronizing these updates has been challenging for several reasons: additional storage must sometimes be procured, staff time must be allocated for manual steps in the data load process, the actual data loads can take several days, and errors such as network interruptions can cause unpredictable delays. In a large network, the only feasible options might be either to keep the data static (never any updates) or to assume that on any given day an average number of sites will have newly updated data. For a 4000-site network whose hospitals update their data annually on average, roughly 11 sites per day (or one every 2.2 hours) would have new data. A scientist running multiple queries would see the results changing almost continually, even though the data within any given site change only once a year.
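The arithmetic behind the 11-sites-per-day estimate is simple; a quick sketch, using the assumptions from the text (4000 sites, annual refreshes):

```python
# Expected rate of repository refreshes in a large network, assuming each
# of 4000 hospitals updates its data once a year on average.
n_sites = 4000
updates_per_day = n_sites / 365               # roughly 11 sites per day
hours_between_updates = 24 / updates_per_day  # roughly one update every 2.2 hours
print(round(updates_per_day, 1), round(hours_between_updates, 1))
```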
3.4. Ontological Equivalence
This network attribute scales similarly to functional equivalence. The Harvard SHRINE sites map local codes in a common ontology, which includes demographics, diagnoses (ICD-9), laboratory tests using Logical Observation Identifiers Names and Codes (LOINC), and medications using RxNorm. However, the ontology includes only a partial list of LOINC and RxNorm codes—the ones the sites have been able to map to so far. Each Harvard SHRINE site has additional types of local data, such as vital signs or genetic markers. However, the variability across sites in whether these data are available and can be mapped to common codes has prevented their inclusion in the federated network. As networks become larger, there will inevitably be fewer codes that can be mapped to all sites in the network.
3.5. Semantic Discernibility
The ratio between the number of patients “between 0 and 17 years old from 2005 to 2009” and the number of patients “between 0 and 17 years old” at the four Harvard SHRINE sites was found to be 0.000, 0.304, 1.000, and 1.000. A user seeing only these four values would be unable to determine whether they represented true variation in the sites’ patient populations. However, the actual differences were due to how the sites configured their software to run this type of query. One site interpreted the date range to mean when the data were loaded into the i2b2 database; since the data were loaded in 2013, no patients matched. Another site calculated the number of patients between 0 and 17 years old who were born between 2005 and 2009 (30.4%). The remaining two sites applied date ranges only to visit data, such as diagnoses, and ignored dates in the context of patient age.
Obviously, the Harvard SHRINE sites could now agree to a common interpretation of dates to solve this particular problem. However, the problem was unknown to us until several years after the network went live, and it is unknown how many other semantic differences still exist. With just four sites, there is little information to identify systematic biases. In contrast, in a network of 4000 sites it might be possible to identify distinct clusters of hospitals returning different results to the same query. If the underlying patient populations are expected to be similar, then the differences could be due to how the hospitals interpret the queries.
3.6. System Availability
During 2013, the four Harvard SHRINE hospitals were unavailable 0.41%, 1.14%, 3.84%, and 4.46% of the time (mean = 0.0246, standard deviation = 0.0198). Combined, there was one hospital unavailable 9.38% of the time and two hospitals unavailable 0.23% of the time (Figure 2). There were no periods when three or four hospitals were unavailable. Much of the downtime was unavoidable and due to planned maintenance activities, such as monthly operating system updates, software upgrades, and data loads. If the availability of an individual site is modeled as a Bernoulli distribution with a downtime probability of 0.0246 (the Harvard SHRINE mean), then the number of unavailable sites at any given time in a network of 4000 hospitals follows a binomial distribution, with a mean of 4000 * 0.0246 = 98.4 and variance 4000 * 0.0246 * (1 - 0.0246) = 96.0 (standard deviation = 9.8). In other words, 95% of the time, between 79 and 118 sites will be unavailable, and there will essentially never be a time when all sites are simultaneously available. Thus, site outages, which are rare in small networks, become guaranteed in large ones.
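These figures follow directly from the binomial model; a small sketch using only the Python standard library (the 2.46% downtime probability is the observed Harvard SHRINE mean, applied here as an assumption about a hypothetical 4000-site network):

```python
import math

n, p = 4000, 0.0246              # sites, per-site downtime probability
mean = n * p                     # expected number of unavailable sites: 98.4
sd = math.sqrt(n * p * (1 - p))  # standard deviation: ~9.8

# Normal approximation to the binomial: a 95% interval on downed sites.
low, high = mean - 1.96 * sd, mean + 1.96 * sd  # ~79 to ~118 sites

# Probability that every site is up at the same moment.
p_all_up = (1 - p) ** n          # on the order of 1e-44, i.e., essentially never
print(mean, round(sd, 1), round(low), round(high))
```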
Figure 2.
The fraction of time Harvard SHRINE sites were available during 2013.
3.7. Population Overlap
Because patients receive care at multiple hospitals, the aggregate counts from each site in a federated network cannot simply be added to determine the total number of distinct patients who match a query. In a previous study, I presented a detailed analysis of this problem and possible solutions, but the main conclusion was that larger federated networks have increasing uncertainty in the actual number of distinct patients [18]. In the extreme case, each of the N sites in a network returns a count of M patients who match a query. If there is no overlap, then there are N*M distinct patients. However, if each hospital is counting the same patients, then there are only M patients in the network. In a four-hospital network, the ratio between the upper and lower bounds is 4:1. In a four-thousand-site network, it is 4000:1. Privacy-preserving patient linkage algorithms can be used either to estimate or to calculate the exact amount of population overlap, and this becomes essential in large networks [18].
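The worst-case bounds can be computed directly from the per-site counts; a minimal sketch (the function is illustrative and is not part of the linkage algorithms in [18]):

```python
def distinct_patient_bounds(counts):
    """Bounds on distinct patients given per-site aggregate counts with
    unknown overlap: at least the largest single-site count (every site
    counted the same patients), at most the sum (no patient counted twice)."""
    return max(counts), sum(counts)

# Four sites each reporting 1000 matches: a 4:1 ratio between the bounds.
print(distinct_patient_bounds([1000] * 4))      # (1000, 4000)

# Four thousand sites: the ratio widens to 4000:1.
print(distinct_patient_bounds([1000] * 4000))   # (1000, 4000000)
```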
3.8. Data Access Rules
The four Harvard SHRINE sites have different local data access requirements. For example, one hospital allows research fellows to initiate queries, while another requires that a faculty member at the rank of instructor or higher grant permission to the fellows. The Harvard SHRINE network must be at least as restrictive as all the individual sites so that researchers cannot bypass local policies through the federated network. With a network of 4000 sites, there could be conflicting local policies that prevent anyone from using the network.
4. DISCUSSION
The conceptual framework and examples in this study may seem trivial or self-evident. For example, it should be obvious that the more sites in a network, the more likely some are unavailable. However, they represent real technical challenges that are currently being deferred or overlooked at the national level as discussions focus on policy, funding, and the clinical use cases for these networks. Explicitly thinking about the eight attributes presented here helps illustrate important points that developers must not forget if we are to get the most out of these networks. It is therefore the author's hope that this study can serve as a checklist of technical considerations as federated data networks continue to grow in size.
In this light, Table 2 presents a summary of how network properties are expected to differ between the four-site Harvard SHRINE network and a large 4000-hospital network. In a network with only a few sites, the goal is to maximize the number of sites that can respond to queries in order to reach the largest patient population possible. This is necessary to gain benefit from joining a federated network. However, the costs of doing this include losing precision when mapping to common ontologies, not taking advantage of query capabilities at particular sites, and limiting the types of researchers who can access the network. For example, in Harvard SHRINE, the common ontology includes only data types that can be mapped to all sites, some sites have newer versions of the i2b2 software than what the network supports, and fellows require a faculty sponsor despite some hospitals not having this restriction for running local queries. The hospitals also attempt to achieve temporal equivalence by including data from the same date range; however, differences in data completeness limit their ability to do this fully.
Table 2.
Summary of network properties at different scales.
| Attribute | 4 Harvard SHRINE Sites | 4000 Hospital Network |
|---|---|---|
| Functional Equivalence: Can all hospitals process the same types of queries? | Medium. Each hospital uses i2b2 but upgrades to new versions at different times. | Low. Different software platforms and versions are used throughout the country. |
| Temporal Equivalence: Do all hospitals have complete data from the same date range? | Medium. Hospitals agreed to use the same date range, but they vary in completeness over those years. | Low. Many more years of migration to EHRs are needed before there will be an extended period of complete coverage. |
| Data Release Cycle Synchronicity: Do hospitals update their data at the same time? | Medium. Updates occur twice a year, but it is difficult to synchronize updates to less than a two-week window. | High. Every day multiple hospitals will likely be making updates. |
| Ontological Equivalence: Can all hospitals map local codes to a common ontology? | Medium. Demographics and diagnoses are mapped, but only a subset of laboratory tests and medications. | Low. Even differences in diagnosis codes (e.g., ICD-9 vs ICD-10) can be problematic. |
| Semantic Discernibility: Can differences in how hospitals use the same code be detected? | Low. With only four data points, semantic differences are hard to distinguish from true patient population variability. | Medium. Semantic differences may appear as distinct clusters of hospitals with similar results. |
| System Availability: Are all hospitals responding to queries? | High. All four nodes are available more than 90% of the time. | Low. A predictable number of nodes will be down at any given time. |
| Population Overlap: How many patients are treated by more than one hospital in the network? | High. The close proximity of the Harvard SHRINE sites results in large patient overlap, though geographically disparate hospitals should have low overlap. | Medium. The amount of overlap will vary across hospitals, but the large number of hospitals leads to high uncertainty in the number of distinct patients. |
| Data Access Restrictions: How many requirements must a researcher meet in order to use the network? | Medium. Users must be faculty employed by participating hospitals and have an approved query topic. | High. It might be impossible to satisfy all local policies and requirements. |
The four site Harvard SHRINE network is compared to a theoretical 4000 hospital network.
This general approach has been used for small federated networks across the country. However, the conceptual framework demonstrates that it cannot scale to a national network with 4000 sites. As the number of sites increases, networks have to deal with an increasing number of software platforms and local access policies; there are fewer data types shared among all sites; it becomes impossible to ensure that all sites are available; there are daily changes in the underlying data; and variability in data completeness prevents temporal equivalence across sites. Compared to a four-site network, a 4000-site network also has much greater uncertainty due to population overlap. Not all is bad, though. The large number of sites might increase semantic discernibility by making it easier to detect unexpected biases due to differences in how hospitals encode health data. It is hard to identify patterns in just four data points, but with 4000 sites, distinct clusters might emerge as a result of these biases.
These significant differences between small and large federated networks mean that the informatics community should consider a fundamental change in how it approaches a national federated network compared to what has been successful at small scales (Table 3). The homogeneity across sites that small networks strive for cannot be achieved for thousands of sites. On the other hand, not every site in a national network is needed to do good science. Not every study requires 100 million patients at 4000 sites. Having 10 million patients at 100 carefully selected sites is probably sufficient to answer most research questions. Once this is recognized, it becomes possible to think about different ways of using a national network, and the heterogeneity across hospitals could become an asset rather than a problem.
Table 3.
Recommendations for large federated data networks.
| Attribute | Recommendations |
|---|---|
| Functional Equivalence | Each site should report its query capabilities. As users build queries, show which sites can run the queries. Send queries only to sites that can run the queries. |
| Temporal Equivalence | Graphically display the amount of data at each site over time. Explicitly define what is meant by “complete data”. |
| Data Release Cycle Synchronicity | Each site should report its last data update date. Enable users to select sites based on last update date. |
| Ontological Equivalence | Allow sites to use different ontologies. As users build queries, show which sites use the selected ontologies. Map codes to different ontologies only if a user requests it. |
| Semantic Discernibility | Search for distinct clusters of results from the same query. Look for queries that return 0 or 100% of patients at only some sites. |
| System Availability | Show which sites are available. Show the average availability of sites over a period of time. Enable multiple queries to be run in batch. Maintain mirrors, or copies, of individual sites to provide redundancy. |
| Population Overlap | Report lower and upper bounds on the number of distinct patients. Use privacy-preserving patient linkage for more accurate counts. |
| Data Access Restrictions | Post site-specific data access policies publicly. Enable users to send queries only to sites where they have access. |
In a national network, instead of attempting to run a query at every site in the network, it might be better to run the query at only the subset of sites that are most appropriate for that query. For example, a query based on ICD-9 diagnoses could ignore sites that use ICD-10 and still have access to thousands of hospitals without the need to map any codes. If a researcher is concerned that population overlap might introduce too much uncertainty, then he or she can select sites that are geographically distant from each other. If having data that are static is important, then a researcher can run the query at only sites that have not yet updated their data. If a query requires the new temporal query features introduced in i2b2 version 1.7, then the query can be run at just those sites that have updated their software. If a research fellow wants to run a query, then he or she can select the sites that do not restrict access to faculty of higher academic rank.
In other words, with so many sites in the network, one does not have to sacrifice functionality or lose semantic specificity by forcing sites to use the same software and a common ontology. Instead, the network should guide users to the best subset of sites for their research question. One way to do this is to enable users to perform a special “site selection” query that focuses on the properties of the sites in the network rather than the patients within those sites. For example, a researcher might first ask which sites use RxNorm and have not updated their data in the past month. Then, the researcher can ask just those sites how many of their patients use a particular medication and compare the results to a previous query. Another advantage of this two-stage query process—a site selection query followed by a patient data query—is that the latter is much more computationally intensive than the former. Hospitals can quickly respond to the numerous queries asking about properties of their sites and perform the slower patient data queries only for the subset of users who select those sites. Since the site selection queries do not access patient data, they could even be made public. For example, anonymous researchers could first query a public national network to determine what the data access policies are at each site, and then run the patient data queries at only the sites where they can login and demonstrate that they meet the requirements.
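The two-stage process described above might look something like the following sketch. Everything here is hypothetical, including the site metadata fields and the selection criteria; no actual SHRINE or i2b2 interface is implied:

```python
from datetime import date

# Hypothetical metadata that each site might publish about itself.
sites = [
    {"name": "Site A", "ontologies": {"RxNorm", "ICD-9"}, "last_update": date(2014, 1, 2)},
    {"name": "Site B", "ontologies": {"ICD-10"},          "last_update": date(2013, 6, 1)},
    {"name": "Site C", "ontologies": {"RxNorm", "LOINC"}, "last_update": date(2013, 11, 20)},
]

# Stage 1: a cheap "site selection" query over site properties, not patients.
# Which sites use RxNorm and have not updated their data since December 1, 2013?
selected = [s for s in sites
            if "RxNorm" in s["ontologies"] and s["last_update"] < date(2013, 12, 1)]

# Stage 2: the expensive patient data query would be sent only to `selected`.
print([s["name"] for s in selected])  # ['Site C']
```

Because Stage 1 touches only site metadata, it is fast and could even be public, while the slower Stage 2 runs only at the handful of sites the researcher actually selects.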
Temporal equivalence and population overlap affect researchers using both small and large networks. Researchers often do not think about how these properties affect the results of their queries and consequently they might incorrectly interpret their findings. For example, differences in the completeness of data over time could incorrectly lead researchers to conclude that the prevalence of a disease is increasing or that a new medication is having previously unrecognized side effects. Population overlap could result in researchers overestimating the number of patients that can be recruited for a clinical trial. Therefore, two things managers of federated networks should always avoid are (1) simply reporting data start and end dates, and (2) adding the aggregate counts returned by the hospitals in a network. Instead, they should display graphs such as Figure 1 to illustrate both the date range and completeness, and they should give upper and lower bounds on the total number of distinct patients in the network who match their queries.
The issue of system availability and data release cycle synchronicity in a large network might be addressed with the use of batch queries. For example, a researcher could perform a series of real-time queries to fine tune query parameters and obtain some preliminary results. During this stage, different sites might be returning results and the underlying data might be changing. However, once the researcher knows what final set of queries need to be run, she could send these as a single batch request to each site. These batch queries can likely be completed faster than if the researcher had to wait for each individual query to finish before entering the next query. This could increase the number of sites that are able to return results.
As noted above, a limitation of this study is that large extrapolations are being made from a single, relatively small network. The properties of a true national network will not be known until one is actually built. The i2b2 and SHRINE software used at Harvard have different features and limitations than other programs such as PopMedNet, and this could also affect how networks scale. However, while the specific predictions of this study might not be accurate, many of the general trends should hold. In the near future, PCORnet will provide another test bed in which to apply the conceptual framework and refine once again how we think about national federated data networks.
5. CONCLUSION
Federated data networks have been transforming how multi-site clinical research studies are conducted since their emergence a few years ago. We are now on the brink of moving to another level as we connect our regional systems into increasingly large national networks. To design these networks properly, we need to consider both their properties and their use cases. As this study demonstrates, there are fundamental differences between small and large networks that cannot be ignored. The variability among sites in a small network is problematic and must be addressed by conforming to a common set of features, date ranges, ontologies, and data access restrictions. However, the same variability is an asset in a large network, since it increases the likelihood that researchers can find some subset of hospitals that are most appropriate for their queries.
Assumptions about system availability and data release cycle synchronicity in a small network are not valid when there are thousands of sites, though there are technical solutions to these challenges. The dream of having access to data on millions of patients for clinical research is quickly approaching, but it will become a reality only if we understand the properties of federated networks at large scales.
Highlights.
Small regional federated data networks are used today for clinical research.
National data networks with 100+ hospitals are being built.
A conceptual framework was developed for evaluating networks of different sizes.
A real four-site network was compared to an imagined 4000-site network.
Large networks have limitations but can take advantage of their heterogeneity.
ACKNOWLEDGEMENTS
This work was conducted with support from: (1) Harvard Catalyst | The Harvard Clinical and Translational Science Center (National Center for Research Resources and the National Center for Advancing Translational Sciences, National Institutes of Health Award UL1 TR001102) and financial contributions from Harvard University and its affiliated academic healthcare centers. The content is solely the responsibility of the authors and does not necessarily represent the official views of Harvard Catalyst, Harvard University and its affiliated academic healthcare centers, or the National Institutes of Health. (2) Informatics for Integrating Biology & the Bedside, which is sponsored by the National Institutes of Health Office of the Director, National Library of Medicine and the National Institute of General Medical Sciences (2U54LM008748). (3) Scalable Collaborative Infrastructure for a Learning Healthcare System (SCILHS), which is funded by the Patient Centered Outcomes Research Institute (CDRN130604608). The author would like to thank Denis Agniel for his assistance with statistical analysis in this study.
REFERENCES
- 1. McMurry AJ, Murphy SN, MacFadden D, et al. SHRINE: enabling nationally scalable multi-site disease studies. PLoS One. 2013;8(3):e55811. doi:10.1371/journal.pone.0055811.
- 2. Klann JG, Buck MD, Brown J, et al. Query Health: standards-based, cross-platform population health surveillance. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):650-6. doi:10.1136/amiajnl-2014-002707.
- 3. Wyatt MC, Hendrickson RC, Ames M, et al. Federated Aggregate Cohort Estimator (FACE): An easy to deploy, vendor neutral, multi-institutional cohort query architecture. J Biomed Inform. 2013 Dec;4:S1532-0464(13)00190-1. doi:10.1016/j.jbi.2013.11.009.
- 4. Amin W, Tsui F, Borromeo C, et al. PaTH: towards a learning health system in the Mid-Atlantic region. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):615-20. doi:10.1136/amiajnl-2014-002759.
- 5. Califf RM. The Patient-Centered Outcomes Research Network: a national infrastructure for comparative effectiveness research. N C Med J. 2014 May-Jun;75(3):204-10. doi:10.18043/ncm.75.3.204.
- 6. Collins FS, Hudson KL, Briggs JP, et al. PCORnet: turning a dream into reality. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):576-7. doi:10.1136/amiajnl-2014-002864.
- 7. Curtis LH, Brown J, Platt R. Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Aff (Millwood). 2014 Jul;33(7):1178-86. doi:10.1377/hlthaff.2014.0121.
- 8. Fleurence RL, Beal AC, Sheridan SE, et al. Patient-powered research networks aim to improve patient care and health research. Health Aff (Millwood). 2014 Jul;33(7):1212-9. doi:10.1377/hlthaff.2014.0113.
- 9. Fleurence RL, Curtis LH, Califf RM, et al. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):578-82. doi:10.1136/amiajnl-2014-002747.
- 10. Kaushal R, Hripcsak G, Ascheim DD, et al. Changing the research landscape: the New York City Clinical Data Research Network. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):587-90. doi:10.1136/amiajnl-2014-002764.
- 11. Mandl KD, Kohane IS, McFadden D, et al. Scalable Collaborative Infrastructure for a Learning Healthcare System (SCILHS): architecture. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):615-20. doi:10.1136/amiajnl-2014-002727.
- 12. McGlynn EA, Lieu TA, Durham ML, et al. Developing a data infrastructure for a learning health system: the PORTAL network. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):596-601. doi:10.1136/amiajnl-2014-002746.
- 13. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered Scalable National Network for Effectiveness Research. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):621-6. doi:10.1136/amiajnl-2014-002751.
- 14. PCORnet PPRN Consortium, Daugherty SE, Wahba S, et al. Patient-powered research networks: building capacity for conducting patient-centered clinical outcomes research. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):583-6. doi:10.1136/amiajnl-2014-002758.
- 15. Waitman LR, Aaronson LS, Nadkarni PM, et al. The Greater Plains Collaborative: a PCORnet Clinical Research Data Network. J Am Med Inform Assoc. 2014 Jul-Aug;21(4):637-41. doi:10.1136/amiajnl-2014-002756.
- 16. Kohane IS, McMurry A, Weber G, et al. The co-morbidity burden of children and young adults with autism spectrum disorders. PLoS One. 2012;7(4):e33224. doi:10.1371/journal.pone.0033224.
- 17. Patten IS, Rana S, Shahul S, et al. Cardiac angiogenic imbalance leads to peripartum cardiomyopathy. Nature. 2012 May 9;485(7398):333-8. doi:10.1038/nature11040.
- 18. Weber GM. Federated queries of clinical data repositories: the sum of the parts does not equal the whole. J Am Med Inform Assoc. 2013 Jun;20(e1):e155-61. doi:10.1136/amiajnl-2012-001299.
- 19. Wikipedia. Boston [accessed Dec 21, 2014]. http://en.wikipedia.org/wiki/Boston.