Abstract
Objective
To address challenges in large-scale electronic health record (EHR) data exchange, we sought to develop, deploy, and test an open source, cloud-hosted app “listener” that accesses standardized data across the SMART/HL7 Bulk FHIR Access application programming interface (API).
Methods
We advance a model for scalable, federated, data sharing and learning. Cumulus software is designed to address key technology and policy desiderata including local utility, control, and administrative simplicity as well as privacy preservation during robust data sharing, and artificial intelligence (AI) for processing unstructured text.
Results
Cumulus relies on containerized, cloud-hosted software, installed within a healthcare organization’s security envelope. Cumulus accesses EHR data via the Bulk FHIR interface and streamlines automated processing and sharing. The modular design enables use of the latest AI and natural language processing tools and supports provider autonomy and administrative simplicity. In an initial test, Cumulus was deployed across 5 healthcare systems each partnered with public health. Cumulus outputs patient counts, aggregated into a table stratified by variables of interest, to enable population health studies. All code is available open source. A policy stipulating that only aggregate data leave the institution greatly facilitated data sharing agreements.
Discussion and Conclusion
Cumulus addresses barriers to data sharing based on (1) federally required support for standard APIs, (2) increasing use of cloud computing, and (3) advances in AI. There is potential for scalability to support learning across myriad network configurations and use cases.
Keywords: electronic health record, interoperability, public health, federated networks
Introduction
The HITECH Act’s $48 billion federal investment led to the widespread adoption of electronic health records (EHRs), with over 95% uptake.1,2 Though EHRs were not initially designed to support population level analytics or the exporting and sharing of data,3 they hold invaluable information that can be cornerstone assets for tasks requiring population data—from early warnings of public health threats to training of artificial intelligence (AI) algorithms. Lack of standardization and technical complexities have made data extraction challenging, restricting capabilities to the most technologically advanced healthcare systems.
To facilitate EHR population data sharing at scale, we designed, developed, and tested Cumulus, a lightweight, open source and free, cloud hosted application that a healthcare organization (HCO) can install behind its firewall.
We sought to share population data on defined cohorts and enable ready participation in federated data sharing networks. To leverage free-text clinical notes, Cumulus includes an AI natural language processing (NLP) pipeline to glean information from unstructured data in a privacy-preserving fashion.
For nationwide scalability, Cumulus relies on data standardized in Fast Healthcare Interoperability Resources (FHIR) format and on a public application programming interface (API), SMART/HL7 Bulk FHIR Access.4 The API, which must be supported in all certified health information technology under the 21st Century Cures Act Rule,5 exposes as FHIR the more than 100 data elements defined in the US Core Data for Interoperability (USCDI),6 which includes many categories of clinical notes.
We describe the goals for Cumulus as well as its architecture and present early findings from its first deployment across 5 health systems, each in partnership with a state or local public health authority.
Methods
The Office of the National Coordinator for Health Information Technology Leading Edge Acceleration Projects program7 supports Cumulus development. Cumulus instantiates 12 core technology and policy features (Table 1).8 Though Cumulus is intended to multi-solve across many population health uses,9 the driving use case informing its design was population health monitoring as a collaboration between HCOs and public health agencies.
Table 1.
Core technology and policy features instantiated in Cumulus software.
| Push button EHR data access. Ready export from the EHR in a uniform, standardized format, eliminating complex extraction or mapping processes. |
| Data processing pipeline. Orchestration platform automating end-to-end data extraction, de-identification, application of AI, aggregation, and transmission. |
| Data de-identification. Configurable data de-identification to minimize risk of unauthorized and unintended disclosures. |
| AI/NLP text processing. Extraction of insights from clinical notes, with transform into FHIR data elements for privacy-preserving sharing of insights from text. |
| HCO administrative simplification. Integrated platform for end-to-end processing from EHR to public health and other use cases. |
| HCO autonomy. Local control of EHR data with permissioned external sharing. |
| Scalable, equitable, federated networking. Standardized data export and sharing under control of any HCO (eg, federally qualified health center, large academic health system) constitutes a de facto federated network. |
| Advanced computable phenotypes and case definitions. Pre-configured AI/NLP algorithms for defining cohorts, endpoints, and outcomes from both structured and unstructured data. Enables dynamic cohort creation and data monitoring. |
| Aggregate data sharing. Privacy-preserving sharing of aggregate data counts (tallies) beyond the provider's firewall. |
| Browser-based third-party access. Access by authorized external parties to aggregate data through dashboards and enclaves without additional software. |
| Turnkey deployment. Simplified configuration of a cloud-hosted, containerized solution for rapid and secure installation behind a provider’s firewall. |
| Open source and free. All Cumulus components are open source, liberally licensed, and available free of charge. |
Abbreviations: AI/NLP = artificial intelligence/natural language processing, EHR = electronic health record, FHIR = Fast Healthcare Interoperability Resources, HCO = healthcare organization.
Design sprint and development process
A 5-day, principled, user-centered design sprint in collaboration with public health and HCOs followed the Google Ventures method.10,11 For a dashboard for public health practitioners, we defined objectives, validated assumptions, and created wireframe prototypes, testing solutions with end users. Feedback from HCOs and public health partners was sought throughout the development process. Public health partners were engaged in virtual meetings and evaluation check-in calls.
Cloud datastore
A common challenge in building and managing federated systems is deploying, maintaining, and updating the infrastructure that runs at each node. In particular, tuning and scaling individual datastores as the amount of data increases accounts for a substantial component of the cost and complexity of these systems. We leveraged a managed cloud-based data lake environment as the primary data store, taking advantage of the scaling expertise of cloud providers to reduce the burden on individual institutions. We evaluated solutions from Amazon Web Services (AWS), Google Cloud, Microsoft Azure, and Databricks from a cost, functionality, and industry adoption perspective, and decided to use AWS Athena as a query engine over data stored in AWS S3 in DeltaTable format. This choice was heavily weighted by widespread existing use of AWS for clinical data in healthcare institutions at the time of selection. Longer term, we intend to develop software abstraction layers supporting use of cloud data stores from multiple providers.
Components for de-identification and extract, transform, load (ETL) orchestration were similarly evaluated, considering only open source technologies. Key criteria included broad industry adoption and ability to easily deploy them in the cloud and on-premises in Docker containers. We wanted to enable HCOs to configure the pipeline based on their preferences and policies. Some institutions de-identified data before transmitting them to a cloud storage bucket, while others ran the de-identification process in their cloud environment.
Modular AI/NLP
We performed a landscape analysis of clinical NLP tools for converting unstructured text in clinical notes to structured FHIR data serialized in JSON format. The NLP pipeline was designed modularly to leverage rapidly evolving language models and always use the latest, validated models. Because we do not anticipate workflows commonly sharing the full text of clinical notes outside of originating HCOs, our assumption was that NLP would occur behind the hospital firewall. Our initial work developing the SMART Text2FHIR pipeline demonstrates the feasibility of extracting privacy-preserving, standardized, structured FHIR data from notes before sharing.12
Federated access
Recognizing that well-structured, queryable versions of clinical data are often unavailable or costly to obtain for care, research, or public health purposes, Cumulus was designed to function as a local environment in addition to a node in a federated network. Additionally, this approach enables local users to leverage the NLP being applied to clinical notes to access data that would otherwise require substantial effort (eg, chart reviews) to obtain. Using a local version of the Cumulus dashboard app, users can monitor key clinical metrics with minimal technical effort.
Deployment and testing
The first Cumulus deployment was across 5 sites, 1 using a Cerner and 3 using an Epic EHR, with the fifth implementing a façade FHIR Access API on top of the local data repository of a health information exchange (HIE). The work occurred under a Centers for Disease Control and Prevention contract and was not characterized as research but rather allowed for disclosure of data for public health purposes. Therefore, no institutional review board protocol approvals were required.
Example implementation metrics include time needed to configure Cumulus nodes and the scale of cohorts, encounters, and clinical notes able to be exported and processed during the pilot. Sites provided input on their experiences when implementing the open-source software, accessing bulk data through their FHIR APIs, discussions with EHR vendors, including any patches and updates to the APIs, and any local configurations they needed to make based on IT security or data sharing policies. We tested computable case definition distribution across 2 sites. We tested end-to-end dashboard access to aggregate data subscriptions by public health partners of 2 HCO participants.
Results
Architecture
The open source Cumulus system enables the dataflow shown in Figure 1. It comprises the modular elements described below.
Figure 1.
Cumulus architecture overview. (A) We use Bulk FHIR to retrieve all data elements contained in the US Core Data for Interoperability (USCDI) from any EHR. (B) An extract, transform, load (ETL) pipeline processes FHIR resources of potential interest to population health use cases and prepares them for analytic query. (C) Exported clinical notes are run through natural language processing (NLP). (D) NLP-derived concepts and structured diagnoses, medications, and laboratory test results are then de-identified and (E) uploaded to a locally controlled cloud-hosted data lake. (F) Computable phenotypes for the conditions of interest are built from definitions in the open source Cumulus Library and used to screen the de-identified data for cases meeting the validated definitions. (G) In the current configuration, aggregate matrices containing stratified variables of interest, such as new cases of COVID-19, are sent in a highly secure and privacy-preserving fashion to the graphically rich Cumulus dashboard. Dashboard users can review and drill into the aggregate data and (H) can perform their own analyses with the analytics enclave using only a web browser.
FHIR Bulk Data API
The ETL pipeline is a container run inside the health institution under local control. Within this container, a SMART/FHIR Bulk Data client authenticates with the local EHR and executes API queries to download USCDI data elements for cohorts of patients that are pre-defined in the EHR. Document references are parsed from these data and retrieved to obtain the raw text of clinical notes.
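The export flow described above can be sketched as follows. This is a minimal illustration of the asynchronous kick-off and polling pattern defined by the Bulk Data Access specification, not the Cumulus ETL client itself. The function names, the `http_get` transport callable, and the example base URL are hypothetical, and the bearer token is assumed to have been obtained beforehand through SMART Backend Services authorization.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlencode

# Hypothetical transport: (url, headers) -> (status, response_headers, json_body).
# Injecting it keeps the sketch free of any particular HTTP library.
HttpGet = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], Optional[dict]]]


def kickoff_url(base_url: str, group_id: str, resource_types: List[str],
                since: Optional[str] = None) -> str:
    """Build a group-level $export kick-off URL for a cohort defined in the EHR."""
    params = {"_type": ",".join(resource_types)}
    if since:
        # Optional date filter; not every server supports it.
        params["_since"] = since
    return f"{base_url.rstrip('/')}/Group/{group_id}/$export?{urlencode(params)}"


def run_export(http_get: HttpGet, url: str, token: str,
               poll_seconds: float = 0.0, max_polls: int = 100) -> List[dict]:
    """Kick off an export, poll the status URL, and return the manifest's output list."""
    auth = {"Authorization": f"Bearer {token}"}
    status, headers, _ = http_get(
        url, {**auth, "Accept": "application/fhir+json", "Prefer": "respond-async"})
    if status != 202:
        raise RuntimeError(f"kick-off failed with HTTP {status}")
    status_url = headers["Content-Location"]

    for _ in range(max_polls):
        status, _, body = http_get(status_url, {**auth, "Accept": "application/json"})
        if status == 200:
            # Each entry names a resource type and the URL of an NDJSON file.
            return body["output"]
        if status != 202:
            raise RuntimeError(f"export failed with HTTP {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("export did not complete within the polling budget")
```

In this sketch, a site would request `DocumentReference` among the resource types and then follow the attachment URLs in those resources to fetch note text, as the paragraph above describes.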
NLP pipeline
High-throughput NLP methods, including large language models (LLMs), are used to extract knowledge from clinical notes; the extracted knowledge is represented as JSON FHIR data models (resources). Each component of the NLP pipeline is instantiated as a container module (using Docker) with a REST interface for accessing separate functions. This modular design allows for multiple NLP pipelines to be run in parallel in the future. The default configuration runs the open-source Apache cTAKES,13–15 a library built for clinical text analysis that normalizes mentions of clinical entities to UMLS terminologies (SNOMED CT and RxNorm), and a Bidirectional Encoder Representations from Transformers (BERT)-based model for negation detection.16 The BERT-based model is trained on the same SHARP dataset used to train the cTAKES negation module17 but updated with the latest pre-trained transformer-based methods.18 Because these modules are independent containers, each can be updated and re-deployed without changes to the client. NLP processing can occur under local control, behind a hospital firewall. No personally identifiable text from the notes is either retained or shared.
Output from the NLP pipeline is saved to the cloud datastore as study variables for the aggregate matrix. To measure NLP accuracy, Cumulus library enables cohort selection to compare NLP results to human expert chart review. Cumulus enables chart review with an integration of LabelStudio, an open-source web-based tool for performing and managing manual data annotation. Clinical note storage, NLP processing, and chart review all occur within the control of the home institution. Currently, Cumulus supports notes presented in text or referenced as html documents. Additional note formats, such as PDF and XML, may be supported in the future.
De-identification (DEID) pipeline
Protected health information (PHI) is removed from structured FHIR data and from FHIR data generated by the NLP pipeline before upload to the cloud datastore. The open source Microsoft FHIR anonymization tool19 is used to comprehensively remove unnecessary identifiers and only preserve de-identified fields of potential interest for population analysis. Cumulus provides templates for configuration of this tool so that new health institutions can adopt standard practices that have been reviewed by multiple institutions. Additionally, a secure codebook method is used to prevent certain potentially identifying information, such as patients’ unique FHIR identifiers, from entering the local cloud data lake, which is under local control.
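One way to picture a codebook of this kind is a keyed, one-way mapping from real FHIR identifiers to stable pseudonyms, with the key and the mapping held only at the site. The sketch below is an assumption-laden illustration, not the Cumulus codebook or the Microsoft anonymization tool: the `Codebook` class, the `scrub_patient` helper, the use of HMAC-SHA256, and the list of dropped fields are all hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional


class Codebook:
    """Maps real FHIR ids to stable pseudonyms using a site-held secret key (hypothetical)."""

    def __init__(self, key: Optional[bytes] = None) -> None:
        # The key never leaves the institution; a random one is generated if none is given.
        self._key = key or secrets.token_bytes(32)
        # Local-only lookup from real reference to pseudonym, kept for authorized re-linking.
        self.mapping: Dict[str, str] = {}

    def pseudonym(self, resource_type: str, real_id: str) -> str:
        reference = f"{resource_type}/{real_id}"
        digest = hmac.new(self._key, reference.encode(), hashlib.sha256).hexdigest()
        self.mapping[reference] = digest
        return digest


def scrub_patient(resource: dict, codebook: Codebook) -> dict:
    """Return a copy with the id pseudonymized and direct identifiers dropped."""
    drop = {"name", "telecom", "address", "identifier"}  # assumed field list
    clean = {k: v for k, v in resource.items() if k not in drop}
    clean["id"] = codebook.pseudonym("Patient", resource["id"])
    return clean
```

Because the same key always yields the same pseudonym, resources that reference one patient remain linkable inside the data lake without the original identifier ever being stored there.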
Cloud data lake
The data lake and related cloud components can be configured on an AWS instance using CloudFormation templates. De-identified data are loaded into an AWS S3 bucket in DeltaTable format and corresponding AWS Athena schemas are updated based on the FHIR data elements present in the dataset.
Cumulus library
Recent improvements in storing and querying nested data with heterogeneous elements make modern cloud data stores ideal for processing FHIR data. However, native FHIR resources contain complex data structures that can make querying difficult, requiring the use of esoteric SQL features and resulting in very large queries. For this reason, Cumulus supports a multi-stage approach that provides users with the ability to first generate use case specific, simplified, tabular representations of standardized FHIR data models (a “library”), then query within these libraries to identify desired patient cohorts (a “study”), and finally dynamically calculate and return counts of relevant study variables for the cohort (“aggregate matrix”). Commonly used data elements are included in a base library (“core”) including patient demographics, encounters, documents, vital signs, laboratory results, conditions, and medications. Other data elements present in USCDI, such as allergies, immunizations, and implantable devices, can also be added to a library. The Cumulus library tooling also supports manual processes for importing existing value sets, such as the code lists publicly available in the National Library of Medicine Value Set Authority Center (VSAC). Study criteria contain 3 parts: case definition, study variables, and the study period. Each study defines 1 or more patient cohorts. By default, only patients matching the case definition in the study period are selected. Optionally, propensity score matching is used to select cohorts for comparison. Cumulus uses the study criteria to calculate an aggregate matrix containing counts of all study variables for each defined patient cohort.
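The first stage, producing a simplified tabular representation from nested FHIR resources, can be illustrated with a small sketch. Cumulus performs this step in SQL over the data lake; the Python below only shows the shape of the transformation. The function names and output column names are hypothetical, and taking only the first coding of a CodeableConcept is a simplifying assumption.

```python
from typing import Dict, Optional


def first_coding(concept: Optional[dict]) -> Dict[str, Optional[str]]:
    """Return system and code of the first coding in a CodeableConcept, if any."""
    coding = (concept or {}).get("coding") or [{}]
    return {"system": coding[0].get("system"), "code": coding[0].get("code")}


def flatten_condition(resource: dict) -> Dict[str, Optional[str]]:
    """Flatten a nested FHIR Condition resource into one flat row."""
    code = first_coding(resource.get("code"))
    return {
        "condition_id": resource.get("id"),
        "subject_ref": (resource.get("subject") or {}).get("reference"),
        "encounter_ref": (resource.get("encounter") or {}).get("reference"),
        "code_system": code["system"],
        "code": code["code"],
        # Truncate the date to YYYY-MM so rows can be grouped by month.
        "recorded_month": (resource.get("recordedDate") or "")[:7] or None,
    }
```

Rows of this form are what a study query would filter against a value set, and what the aggregate step would then count.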
Computable phenotypes and case definitions
A central feature is the ability to collaboratively define and distribute computable phenotypes or computable case definitions to identify diverse, representative patient groups using USCDI structured, coded data, optionally enhanced by NLP of text. Several domains were explored. Symptoms of COVID-1920 were extracted from notes and output as FHIR. Cumulus provides a powerful library that combines the simplicity of SQL with the quality of FHIR resources extracted from the longitudinal patient history. The broad range of needs for computable case definitions resulted in Cumulus support for AI/NLP, standard VSAC value sets, and custom user-defined criteria. For example, the hypertension definition is based on public health partner needs and the CMS electronic quality measure that provided inclusion and exclusion criteria for hypertension treatment.21 Computable case definitions were derived from FHIR observation vital signs at critical values of 140/90 mm Hg. Custom value sets were user defined for self-harm in the domains of mental health study22 and opioid overdose.
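As an illustration of a vital-sign criterion such as the 140/90 threshold, the sketch below reads systolic and diastolic values from the components of a FHIR blood pressure Observation. The LOINC component codes (8480-6 and 8462-4) are the standard ones; the function names, the use of "greater than or equal to", and the choice to flag a reading when either component crosses its cutoff are assumptions, not the published case definition.

```python
from typing import Dict

LOINC = "http://loinc.org"
SYSTOLIC, DIASTOLIC = "8480-6", "8462-4"  # LOINC codes for the BP panel components


def bp_components(observation: dict) -> Dict[str, float]:
    """Extract systolic and diastolic values from an Observation's components."""
    values: Dict[str, float] = {}
    for component in observation.get("component", []):
        quantity = component.get("valueQuantity", {})
        for coding in component.get("code", {}).get("coding", []):
            is_bp = coding.get("system") == LOINC and coding.get("code") in (SYSTOLIC, DIASTOLIC)
            if is_bp and "value" in quantity:
                values[coding["code"]] = quantity["value"]
    return values


def meets_bp_threshold(observation: dict,
                       systolic_cutoff: float = 140,
                       diastolic_cutoff: float = 90) -> bool:
    """Assumed rule: flag when either component is at or above its cutoff."""
    values = bp_components(observation)
    return (values.get(SYSTOLIC, 0) >= systolic_cutoff
            or values.get(DIASTOLIC, 0) >= diastolic_cutoff)
```

A full case definition would add the inclusion and exclusion criteria from the quality measure; this fragment covers only the threshold test on a single reading.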
Aggregate matrix
Counts of every study variable combination are calculated first by each participating healthcare site and then aggregated to produce a sum total of counts across the Cumulus network for the Cumulus Dashboard and Analytics Enclave. The aggregate matrix can be refreshed on demand or as a scheduled task depending on the clinical study needs. The subscription metadata includes which healthcare sites provided data, the study period of the data collection, descriptions of study variables, and when the aggregate matrix was compiled. Formally, the aggregate matrix is a power set23 containing counts of all combinations of study variables, including the null set, representing the total size of the selected cohort, eqn (1). The Cumulus library produces the aggregate matrix by generating a SQL select count query with the cube function.24,25 The user specifies the study variables to count, typically the number of patients or encounters. The cardinality of the power set is 2^n, where n is the number of discrete elements among the study variables. In practice, the aggregate matrix is much sparser than a pure power set: Cumulus removes cells (subsets) that contain fewer than 10 patients. Examples of discrete elements include disease status (Boolean), patient age at encounter (integer), encounter month (date), and antihypertensive medications.26,27
$$P(S) = \{\, T \mid T \subseteq S \,\} \qquad (1)$$
In eqn (1), Cumulus aggregate matrix is a power set. Let S be a set. T represents a subset of S. P(S) denotes the power set of S as all subsets of S, including the empty set and S itself. The discrete elements of each subset T include 1 or more study variables.
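The counting over all subsets of study variables, with small-cell removal, can be sketched in a few lines. Cumulus generates a SQL query with the cube function for this step; the Python below mimics that behavior on in-memory rows for illustration only. The function name `cube_counts`, the use of `None` to mark a rolled-up variable (mirroring the NULL a SQL cube emits), and the assumption that each input row represents one patient are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Optional, Tuple

Cell = Tuple[Optional[str], ...]


def cube_counts(rows: Iterable[Dict[str, str]], variables: List[str],
                min_cell: int = 10) -> Dict[Cell, int]:
    """Count rows for every subset of `variables`, dropping cells below `min_cell`."""
    # All 2**n subsets of the variable list, from the empty set to the full set.
    subsets = [set(c) for r in range(len(variables) + 1)
               for c in combinations(variables, r)]
    counts: Counter = Counter()
    for row in rows:
        for keep in subsets:
            # Variables outside the subset are rolled up and shown as None.
            cell = tuple(row[v] if v in keep else None for v in variables)
            counts[cell] += 1
    # Small-cell suppression: cells with fewer than `min_cell` rows are removed.
    return {cell: n for cell, n in counts.items() if n >= min_cell}
```

The all-`None` cell corresponds to the empty subset and holds the total cohort size. Because suppressed cells are simply absent, the resulting matrix is sparser than the full set of combinations.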
Cumulus dashboard
The design sprint prototype instantiated key features including facilitated iterative data exploration and refinement of definitions, user review of recency and provenance of data in the dashboard for context, power user analyses on the data without downloading or managing additional software, and a workflow for requesting targeted exports of line-level data for vetted use cases. Development of the dashboard is ongoing with input from users.
The dashboard enables public health users to graph, stratify, and compare patient populations using any combination of requested study variables from the aggregate matrix and to apply filters to include or exclude patient populations, for example, filtering by age at encounter, encounter week, diagnosis, test result, or any other study variable. The user selects which study variables to graph and whether to present counts or percentages of the population. Figure 2 is an example of COVID-19 symptoms graphed by encounter month. It shows the prevalence of patient symptoms at emergency department encounters during the COVID-19 pandemic. In this example, the dashboard is used to compare 2 methods for measuring symptoms, NLP computable phenotypes and ICD-10 diagnosis codes.20
Figure 2.
Cough and fever symptom trends for patients with COVID-19 visiting the ED during the pandemic, as identified using 2 methods. Lines represent the percent of COVID-19 patients with each symptom using NLP-driven computable case definitions. Bars represent the percent identified using ICD-10 codes. Red denotes fever or chills, and blue denotes cough. For patients with COVID-19, symptoms were detected in a greater proportion of ED visits using NLP compared to ICD-10.
Analytics enclave
The enclave, for power users to programmatically analyze Cumulus aggregate data, is a Python notebook accessed via a web browser. The Python notebook is preloaded regularly with aggregate data and commonly used data science software for graphing and numerical analysis (Matplotlib, Pandas, SciPy, scikit-learn, others). Aggregate data include all combinations of study variables as a power set and crosstab tables. These counts are directly applicable for population health measures including disease prevalence, odds ratio, relative risk, conditional probability, chi-square tests for significance, Bayesian classifiers, and decision tree classifiers. The analytics environment also enables propensity score matching.
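To show how aggregate counts alone support such measures, the sketch below computes an odds ratio, a relative risk, and a Pearson chi-square statistic (without continuity correction) from a 2x2 table of counts. The function name and the example numbers in the usage note are hypothetical and are not results from the pilot.

```python
from typing import Dict


def two_by_two_measures(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Measures from a 2x2 table of counts.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    n = a + b + c + d
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "odds_ratio": (a * d) / (b * c),
        "relative_risk": risk_exposed / risk_unexposed,
        # Pearson chi-square for a 2x2 table, 1 degree of freedom, no correction.
        "chi_square": n * (a * d - b * c) ** 2
                      / ((a + b) * (c + d) * (a + c) * (b + d)),
    }
```

For made-up counts `a=30, b=70, c=10, d=90`, the relative risk is 3.0 and the chi-square statistic is 12.5. Each of the four counts would be read from a single cell of the aggregate matrix, so no line-level data are needed.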
Federation
The Cumulus network implements the federalist principles of local control for healthcare sites,28 privacy protection for patients, and sharing of aggregated counts for authorized users. Cumulus federation is a push model and not a query model—no central authority has access to directly query the line-level data of any participating institution. The aggregate matrix includes every pre-computed combination of study variables, allowing for near-instantaneous responses to user actions. A network deployed with 5 sites—4 hospitals and 1 HIE—is shown in Figure 3. Each site remains in control of the patient data for which they are legally responsible as a HIPAA covered entity. The ETL pipeline is a container within the site intranet behind the firewall that extracts data from the EHR, runs NLP and de-identification pipelines, and loads the prepared FHIR data into the private cloud environment. Cumulus library is run within the private cloud and outputs an aggregate matrix of counts for each study. Each site uploads the aggregate data to a coordinating site. The aggregate matrix is then merged across all 5 sites resulting in a sum total of counts across the network. The aggregated dataset only contains counts. Credentialed users are then able to graph and analyze the aggregate data using the Cumulus Dashboard and Analytics Enclave.
Figure 3.
Cumulus configured within a federated network of 5 healthcare organizations (4 academic medical centers and 1 health information exchange [HIE]). The left (red) column denotes data behind each hospital firewall where FHIR coded data are extracted, and natural language processing performed to extract computable case criteria from patient notes. The next (blue) column denotes de-identified data in private clouds with elements of personally identifiable information removed. The next (green) column denotes aggregated patient counts shared with authorized public health users (final, right-side column). Empty boxes denote the same configurations as Hospital #1, except that in the left (red) column, only the 4 hospitals used FHIR ETL to connect to their local EHR.
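The merge step at the coordinating site reduces to summing counts cell by cell. The sketch below assumes each site uploads a dictionary keyed by the same ordered tuple of study-variable values; the function name, the site labels, and this data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Cell = Tuple[Optional[str], ...]


def merge_site_matrices(site_matrices: Dict[str, Dict[Cell, int]]) -> Dict[Cell, int]:
    """Sum per-site aggregate counts into one network-wide matrix."""
    total: Counter = Counter()
    for counts in site_matrices.values():
        # Counter.update adds counts for matching cells and inserts new ones.
        total.update(counts)
    return dict(total)
```

Only counts cross institutional boundaries in this model. If sites remove small cells before upload, a cell missing at one site contributes nothing to the network total, so merged counts for sparse combinations should be read as lower bounds.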
Deployment
The first Cumulus testbed was deployed for 5 dyads consisting of an HCO partnered with public health: Boston Children’s Hospital (BCH) and Massachusetts Department of Public Health, Regenstrief Institute and Marion County Public Health Department, Rush University Medical Center and Chicago Department of Public Health, Washington University in St Louis and City of St Louis Department of Health, and UC Davis Health and both Yolo County Health and Human Services and Sacramento County Public Health.
Cumulus installations had some local variation. Regenstrief Institute developed a bulk FHIR API following the Bulk Data Implementation Guide to surface USCDI elements from an existing data warehouse containing HL7 V2 messages stored by the Indiana HIE. University of California Davis Health installed the Cumulus ETL and NLP pipelines on premises to connect through a locally required API manager service rather than in the cloud. Washington University in St Louis ran the ETL and NLP pipelines in a locally approved Microsoft Azure environment. All sites configured and ran Cumulus Library queries in AWS, sending aggregate results to BCH for the dashboard. Features are captured in Table 2.
Table 2.
Implementation statistics for 5 sites
| | University of California, Davis Health | Regenstrief Institute | Boston Children’s Hospital | Rush University Health | Washington University in St Louis |
|---|---|---|---|---|---|
| Unique FHIR patient records loaded | 222 (+12K soon) | 334 573 | 179 176 | 2001 | 1268 |
| FHIR encounters loaded | 6K (+4.45M soon) | 9.2M | 3M | 385K | 0 (250K soon) |
| Clinical notes processed | 23K (+6.25M soon, metadata only) | 23.2M | 1.8M | 157K (metadata only) | 0 (328K soon) |
| Symptom mentions extracted via computable case definition | 609 833 | 85 678 | |||
| FHIR API vendor | Epic | Custom HIE API | Cerner | Epic | Epic |
| PHI processing environment | on-prem | on-prem | AWS | AWS | Azure |
| Configuration time after approvals | 19 calendar days | 53 calendar days | 14 calendar days | 26 calendar days | 39 calendar days |
| API supported date filtering? | N | Y | Y | N | N |
| Health Departments with live dashboard in test bed | Yolo County Health and Human Services & Sacramento County Public Health | Marion County Public Health Department | | | |
Abbreviations: API = application programming interface, AWS = Amazon Web Services, FHIR = Fast Healthcare Interoperability Resources, HIE = health information exchange, PHI = protected health information.
An early measurement made in the testbed was the performance of the first nascent implementations of the Bulk FHIR API across multiple dimensions.29 Throughput of the first wave of APIs varied across vendors, products, and configurations, ranging from approximately 2000 to 11 000 FHIR data models (FHIR resources) per minute. Sites with access to APIs that supported optional date filtering parameters were able to make more targeted requests and could export relevant data for larger cohorts of interest. Sites whose APIs had low throughput relative to their patient volume had to restrict inclusion criteria to complete exports of USCDI elements from their EHR. Additionally, APIs that did not support date filtering could not efficiently refresh cohort data when FHIR resources had already been retrieved in a previous export; the entire patient history had to be exported for each request.
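A back-of-envelope calculation shows why throughput and date filtering matter at these volumes. The helper below is illustrative arithmetic only: pairing the 9.2M encounter count from Table 2 with the two ends of the reported throughput range is an assumption made for the example, not a measured export time at any site.

```python
def export_hours(resource_count: int, resources_per_minute: float) -> float:
    """Hours needed to export `resource_count` resources at a constant rate."""
    return resource_count / resources_per_minute / 60


# 9.2 million resources at the low and high ends of the reported range.
slow = export_hours(9_200_000, 2_000)    # about 76.7 hours
fast = export_hours(9_200_000, 11_000)   # about 13.9 hours
```

Without date filtering, a cost of this order would recur on every refresh, because the full history must be re-exported rather than only the resources created since the last run.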
HCO and public health feedback
HCOs and public health partners recognized the potential of Cumulus as a public health tool. Public health partners also expressed the need for more time to assess its utility comprehensively. The pilot phase encouraged HCO participants to augment their infrastructure, be pioneers in use of the Bulk FHIR API, and to strengthen relationships with public health partners. HCOs shared that having a strong multidisciplinary team, organization leadership buy-in, and support from the SMART Cumulus Coordinating Center were key factors accelerating implementation. The flexibility of Cumulus to support both cloud and on-premises server options was identified by all HCOs as an adoption facilitator. The privacy-preserving nature of aggregate data accelerated data sharing approval from HCOs. However, the need for standardized documentation and guidelines on minimum requirements for the installation of Cumulus was evident, particularly for future deployments across a wider diversity of HCOs. During the pilot phase, standardized materials for future deployments were created based on what was learned. Example statements from HCOs are in Table 3. Public health partners helped validate and add to the understanding of the benefits across varying levels of adoption (Table 4).
Table 3.
Selected examples of HCO participant feedback.
| On Joining the Cumulus Pilot | Testing Bulk FHIR for the first time: “[W]e’ve been interested in Bulk FHIR capabilities for some time. We see a lot of opportunities in building out better FHIR capabilities, especially for population health and for public health reporting. And, also, for being able to build out…patient registries in a much more performant and efficient way at the electronic medical record level…[T]hat’s been an area of strong interest for us.” Using the new FHIR standards for the benefit of public health: “FHIR is the latest standard. We would like to see more providers, as well as public health, embrace FHIR and make use of it. So, I think this pilot is a good opportunity to work with others on refining FHIR, for public health, kind of, defining the best practices for implementing FHIR in a way that it can be used to support the population health work that is done by public health agencies.” “Bulk FHIR holds promise for getting population level data out of EHRs in a way that should meet the needs of public health in terms of data and information that they’re seeking.” |
| Understanding the Importance That Cumulus Holds Value for Public Health | Potential to become a tool for public health preparedness, emergency response and surveillance but requires additional testing from public health partners: “I think [public health agencies] have to absolutely see the value in the first [use case] once they see the dashboard.” “After that buy-in happens, I think the discussion should be about how [Cumulus] is actually used in their workflows…we want to create this for them, we want it to be helpful, but we want them to actually find ways to use it that it's useful, and not just something that they refer to occasionally.” |
| Collaborating Closely with Public Health | Have not reached a definitive conclusion regarding Cumulus as the ideal tool; pilots should continue to identify ways to improve Cumulus and continue to collaborate with public health partners to exchange data: “(W)e also have to think about long term for sustainability, having this community–such that if the tool does go into production that there’s a feedback mechanism…where the users then start to come up with ideas, share that with whoever's managing it, and then that is fed in.… it's responsive to what the users want.” “An approach that works well is to formulate a community practice of volunteers because some jurisdictions want to have that kind of input. So, let's allow them the opportunity to do that.…We have to meet health departments where they are.” |
Abbreviation: FHIR = Fast Healthcare Interoperability Resources.
Table 4. Public health partner feedback.
| Features of Most Potential | Theoretical value for data modernization efforts: “My approach to this is—where can we get novel surveillance data? The piece about it being real-time or near real-time is definitely a value-add.” Accessing historically unavailable data: “I think it is especially exciting because of the natural language processing and the opportunity to get some context.…It will still just be numbers, but I think there are some creative queries we can come up with.” “I’m thrilled with the options for the free-text parsing as well…it can be hard to have the time and expertise in-house to do that. So this really kind of takes away that burden, which is nice when you are resource strapped.” |
| Value in Accessing New Data | Tackling analytical challenges: “I think Cumulus will be a good test for how we can use Bulk FHIR, and I think it will add another data source for us. There are a lot of things that we don’t have a good data source for that we have been discussing with the Cumulus group. So, I think that will be one of the main benefits is having this other set of data that we can use for things we don’t have data for.” “It’s about continuing to show added value…and I think there’s really something here, which is what can we do through this that really can’t be done, where, if it’s very important, but small volume issues where it wouldn’t be worthwhile implementing some large-scale solution, I think this can be really targeted and focused. It has that ability.” |
| Initial Use Cases and Future Adaptability | Importance of testing diverse use cases: “I like that those are different sector[s], like communicable diseases, chronic, behavioral health, and mental health [for] example. I think it has a breadth to apply and sort of make the case across whoever is concerned about or interested in the project.…Sounds like the ultimate tool, and the future, will be very adaptable. And it can really run the gamut of diseases or conditions.” |
| Utility Across Adoption | Potential benefits at different levels of adoption: “I think it’s got huge potential to be interesting. And even if it remains a single site…dashboard for at least the foreseeable future, I still think that’s useful, because we can look and see, set up other pieces of our other surveillance systems to mimic this and will have comparable information.” “The utility of the dashboard gets stronger and stronger as more [healthcare] sites are added.” |
| Value of Co-design With Public Health | The need for trust and buy-in from partners: “We’re appreciative that we got to build something from the ground up…we got to design it, and now we get to play with it. And so that’s really cool. I also think that it’s great that we’re going to get access to it eventually to some of the other use cases that we didn’t build. It’s kind of a nice balance of getting more out of [Cumulus] than we expected.” |
| Interest in and Need to Continue Testing Cumulus | Addressing concerns about data completeness and validation; and the need to engage additional health systems to provide a more accurate representation of the population: “I think we have to play with it more…I think we’ll play with it and say, ‘This meets our needs. This doesn’t.…This is a first draft, not a final product.” “There are going to be a lot of things…what’s meaningful out of it. Not everything is in any dataset, right? You can draw a lot of erroneous conclusions accidentally so really just making sure that we’re clear about what it actually represents when we present it.” “We welcome the opportunity to continue to pilot and participate in these pilots—the technology is advanced, and the CHIP team is asking great questions about how to most impactfully develop this tool to support public health. We feel we can be a valuable voice contributing to this work.” |
Abbreviation: FHIR = Fast Healthcare Interoperability Resources.
Discussion
The Cumulus cloud-hosted “listener” is a viable technology for instrumenting provider sites to access EHR data for public health purposes or internal use. It functions as an “app” running against the universally available SMART/HL7 Bulk FHIR Access API, which is now required by the 21st Century Cures Act and facilitates point-to-point exchange of both structured and free-text data.
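As a rough illustration of how an app addresses this API, the sketch below builds a group-level Bulk Data kick-off request and reads the file list from a completed export manifest. It follows the asynchronous request pattern in the published Bulk Data Access specification (`$export`, `Prefer: respond-async`, an `output` array of typed file URLs). The server URL, group identifier, and function names are hypothetical, authorization is omitted, and this is not drawn from the Cumulus code base.

```python
from urllib.parse import urlencode


def build_export_request(base_url, group_id, resource_types, since=None):
    """Return (url, headers) for a group-level Bulk Data kick-off request.

    base_url and group_id are placeholders supplied by the caller; a real
    request would also carry an Authorization header, omitted here.
    """
    params = {"_type": ",".join(resource_types)}
    if since:
        params["_since"] = since  # only resources updated after this instant
    query = urlencode(params, safe=",")  # keep commas readable in _type
    url = f"{base_url.rstrip('/')}/Group/{group_id}/$export?{query}"
    headers = {
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",  # the server answers 202 and is polled later
    }
    return url, headers


def ndjson_urls(manifest):
    """Group the file URLs of a completed export manifest by resource type."""
    grouped = {}
    for item in manifest.get("output", []):
        grouped.setdefault(item["type"], []).append(item["url"])
    return grouped
```

The client would send the kick-off request, poll the status location returned by the server until the export completes, and then download each NDJSON file listed in the manifest.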
We demonstrated Cumulus operating as a federated network. In this architecture, Cumulus shares insights but exchanges only deidentified aggregate data outside the healthcare institution. In future work, line-level extracts of the data underlying aggregate counts could be provided as an optional second step in public health or other investigations, while adhering to all data access policies and legal requirements. We introduced this approach for a national information infrastructure demonstration,30 used it extensively in the SHRINE network,31,32 disease registries,33 and the current NCATS-funded Genomic Information Commons (Mandl PI),34 and it has become the model for our collaborators establishing PCORnet capabilities.35
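The aggregate-only exchange described above can be sketched in a few lines. This is an illustrative outline under stated assumptions rather than the Cumulus implementation: each site counts patients per stratum locally, drops small cells before anything leaves the institution, and a central aggregator sums the per-site tables. The threshold `MIN_CELL_SIZE`, the function names, and the one-row-per-patient input shape are all hypothetical.

```python
from collections import Counter

MIN_CELL_SIZE = 10  # hypothetical suppression threshold, set by local policy


def site_counts(patients, strata):
    """Count patients per stratum inside one institution.

    `patients` is assumed to hold one dict per distinct patient. Cells
    smaller than MIN_CELL_SIZE are dropped before the table is shared.
    """
    counts = Counter(tuple(p[key] for key in strata) for p in patients)
    return {cell: n for cell, n in counts.items() if n >= MIN_CELL_SIZE}


def aggregate(site_tables):
    """Sum the count tables received from participating sites."""
    total = Counter()
    for table in site_tables:
        total.update(table)  # adds counts cell by cell
    return dict(total)
```

Only the output of `site_counts` crosses the institutional boundary in this sketch; the patient-level rows never leave the site.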
Existing common data models at well-resourced institutions, with dedicated data teams and appropriate infrastructure, demonstrate the value of EHR data for public health surveillance, research, and other use cases. Examples include the Observational Medical Outcomes Partnership (OMOP)36 and PCORnet37 common data models. Cumulus widens opportunities for lower-resourced settings to make greater use of their EHR data through FHIR, where data mappings are maintained by EHR vendors. The USCDI is extensive, tightly defined, and well documented. It also has a strong governance process involving extensive community input, shepherded by the federal government and the Argonaut FHIR Accelerator, which drives the addition of new data elements needed by the community.38 Working with data in FHIR supports rapid reuse of resulting models and alerts at the point of care through FHIR-based technologies built into EHRs such as SMART on FHIR39,40 and CDS Hooks.41
By ensuring that institutions serving patients across a full spectrum of demographic characteristics—from major medical centers to federally qualified health centers—can participate in public health surveillance, we aim to achieve more equitable, inclusive, and representative data. This approach is crucial for public health response, developing treatments, and implementing healthcare strategies that are effective across populations, irrespective of their ethnic, socio-economic, or geographical backgrounds.
We identified several noteworthy technical challenges. The AI/NLP/LLM field is progressing at an unprecedented pace, making it challenging to keep the Cumulus project aligned with the latest advancements. However, we view use of this technology as critical to correctly identifying and analyzing cohorts. Modular interfaces to the AI model are used to minimize tight coupling. These routines are designed to be adaptable to emerging models, so that Cumulus can adopt new models as they become available. Because the output and quality of LLMs can vary significantly over time, affecting the consistency and reliability of phenotyping, we continue to investigate open source tooling to monitor LLM quality and validity.
Currently, there is no widely adopted, standardized set of metrics to assess the data quality of FHIR elements extracted from EHRs. Future work will address this by defining a broad set of FHIR data quality metrics focused on the USCDI dataset. We will collaborate with a range of interested parties to develop open source tooling to execute these metrics at care delivery sites using the Cumulus infrastructure and share benchmarks from results at pilot sites.7 This work will support the access of standardized data within EHRs.
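To make the idea of a FHIR data quality metric concrete, the following sketch computes one simple candidate, element completeness: the fraction of resources in which a given element path holds a non-empty value. It is a hypothetical example of the kind of metric that could be defined, not one of the metrics the planned work will specify, and the function name and dotted-path convention are assumptions.

```python
def element_completeness(resources, path):
    """Fraction of resources where the dotted `path` has a non-empty value.

    Lists along the path (such as `coding`) count as present when any
    item resolves to a non-empty value.
    """

    def present(node, keys):
        if not keys:
            return node not in (None, "", [], {})
        if isinstance(node, list):
            return any(present(item, keys) for item in node)
        if isinstance(node, dict):
            return present(node.get(keys[0]), keys[1:])
        return False

    if not resources:
        return 0.0
    keys = path.split(".")
    hits = sum(present(resource, keys) for resource in resources)
    return hits / len(resources)
```

For example, `element_completeness(conditions, "code.coding.code")` would report the share of Condition resources carrying at least one coded value, which could then be compared across sites as a benchmark.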
Even if security were ever compromised, the shared patient counts would not constitute a HIPAA disclosure and would not require patient contact. This simplifies administrative and IT security reviews, lowering barriers to participation.
By providing standardized access to population-level clinical data, Bulk FHIR interfaces in EHRs and HIEs open the door to myriad use cases that are currently costly to implement or unwieldy to deploy broadly. At the same time, these APIs represent an early-stage technology that is being refined through real-world use in projects like Cumulus and standards development efforts like the Argonaut FHIR Accelerator’s Bulk Export Optimization project.42 As these efforts proceed and best practices are identified, we expect use to grow, driving a positive cycle of learning, improvement, and deployment that will benefit future Cumulus implementations and operation, as well as other healthcare and research projects.
Conclusion
Cumulus tackles obstacles to data sharing through mandated support for standard APIs, the growing adoption of cloud computing, and advancements in AI. This approach offers the scalability needed to facilitate learning across various self-organized federated network configurations and use cases.
Acknowledgements
We acknowledge the participation of Regenstrief Institute, Inc. in this project. The authors would also like to thank the SMART Cumulus Network which, in addition to the named authors, includes: Momeena S. Ali, Elizabeth A. Bowman, Ranjit Dhaliwal, Rosa J. Ergas, Swati Goyal, Anna M. Hammelrath, Matthew D. Haslam, Robert L. Herrick, Eugene Kang, Olivia Kasirye, Ian Lackey, Andrew M. Martin, Keisuke Nakagawa, Tanha Patel, Dylan T. Phelan, Aimee Sisson, Sita C. Smith, Darnesha G. Tabor, David E. Taylor, Nicole Venteris, and Jennifer L. Zuker.
Contributor Information
Andrew J McMurry, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, United States.
Daniel I Gottlieb, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.
Timothy A Miller, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, United States.
James R Jones, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States.
Ashish Atreja, Innovation Technology, UC Davis Health, Rancho Cordova, CA 95670, United States.
Jennifer Crago, Center for Biomedical Informatics, Regenstrief Institute, Indianapolis, IN 46202, United States.
Pankaja M Desai, Department of Internal Medicine, Rush University Medical Center, Chicago, IL 60612, United States.
Brian E Dixon, Center for Biomedical Informatics, Regenstrief Institute, Indianapolis, IN 46202, United States; Department of Health Policy and Management, Fairbanks School of Public Health, Indiana University, Indianapolis, IN 46202, United States.
Matthew Garber, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States.
Vladimir Ignatov, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States.
Lyndsey A Kirchner, CDC Foundation, Atlanta, GA 30308, United States.
Philip R O Payne, Institute for Informatics, Data Science, and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States; Department of Medicine, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.
Anil J Saldanha, Department of Health Innovation, Rush University Medical Center, Chicago, IL 60612, United States.
Prabhu R V Shankar, Innovation Technology, UC Davis Health, Rancho Cordova, CA 95670, United States; Department of Public Health Sciences, UC Davis Health, Davis, CA 95817, United States.
Yauheni V Solad, Innovation Technology, UC Davis Health, Rancho Cordova, CA 95670, United States.
Elizabeth A Sprouse, Double Lantern Informatics, Atlanta, GA 30305, United States.
Michael Terry, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States.
Adam B Wilcox, Institute for Informatics, Data Science, and Biostatistics, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States; Department of Medicine, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States.
Kenneth D Mandl, Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02215, United States; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.
Author contributions
Kenneth D. Mandl obtained funding. Andrew J. McMurry, Daniel I. Gottlieb, Timothy A. Miller, James R. Jones, Elizabeth A. Sprouse, and Kenneth D. Mandl conceptualized the study and wrote the first draft. Andrew J. McMurry, James R. Jones, Ashish Atreja, Jennifer Crago, Pankaja M. Desai, Brian E. Dixon, Matthew Garber, Vladimir Ignatov, Lyndsey A. Kirchner, Philip R. O. Payne, Anil J. Saldanha, Prabhu R. V. Shankar, Yauheni V. Solad, Elizabeth A. Sprouse, Michael Terry, and Adam B. Wilcox were involved in data curation and project administration. Matthew Garber, Vladimir Ignatov, and Michael Terry developed the software. Daniel I. Gottlieb, James R. Jones, Brian E. Dixon, Lyndsey A. Kirchner, Prabhu R. V. Shankar, Elizabeth A. Sprouse, Philip R. O. Payne, and Kenneth D. Mandl conducted the formal analysis. All authors were involved in review and editing.
Funding
This work was supported by the Office of the National Coordinator for Health Information Technology contract numbers 90AX0031/01-00, 90AX0022/01-00, and 90AX0040/01-00; Centers for Disease Control and Prevention of the United States Department of Health and Human Services (HHS) as part of a financial assistance award, Strengthened Community Partnerships for More Holistic Approaches to Interoperability totaling $1,985,178 (The contents are those of the author(s) and do not necessarily represent the official views of, nor an endorsement, by the CDC Foundation, CDC/HHS, or the U.S. Government); The National Center for Advancing Translational Sciences/National Institutes of Health Cooperative Agreements U01TR002623 and U01TR002997; National Association of Chronic Disease Directors/Centers for Disease Control and Prevention Grant No. NU38OT000286; Centers for Disease Control and Prevention Grant No. U18DP006500; and Centers for Disease Control and Prevention Cooperative Agreement Nos. NU58IP000004 and 1U01TR002997-01A1.
Conflicts of interest
Boston Children’s Hospital receives philanthropic contributions on behalf of the laboratory of K.D.M. from the SMART Advisory Committee with members including Microsoft, Cambia, Humana, and HCA Healthcare.
Data availability
Data and code are available at https://docs.smarthealthit.org/cumulus/. ETL Pipeline is available at https://github.com/smart-on-fhir/cumulus-etl. Library is available at https://github.com/smart-on-fhir/cumulus-library. Computable case definition libraries for COVID-19 symptoms, hypertension, suicidality, and opioid overdose and use disorder are available at: https://github.com/smart-on-fhir/cumulus-library-covid; https://github.com/smart-on-fhir/cumulus-library-hypertension/; https://github.com/smart-on-fhir/cumulus-library-suicidality-los; https://github.com/smart-on-fhir/cumulus-library-opioid. Aggregator is available at https://github.com/smart-on-fhir/cumulus-aggregator.
References
- 1. Office of the National Coordinator for Health Information Technology. Adoption of electronic health records by hospital service type 2019-2021, Health IT Quick Stat #60. 2022. Accessed April 30, 2024. https://www.healthit.gov/data/quickstats/adoption-electronic-health-records-hospital-service-type-2019-2021
- 2. Office of the National Coordinator for Health Information Technology. National trends in hospital and physician adoption of electronic health records, Health IT Quick-Stat #61. 2022. Accessed April 30, 2024. https://www.healthit.gov/data/quickstats/national-trends-hospital-and-physician-adoption-electronic-health-records
- 3. Mandl KD, Kohane IS. Escaping the EHR trap—the future of health IT. N Engl J Med. 2012;366(24):2240-2242.
- 4. Mandl KD, Gottlieb D, Mandel JC, et al. Push button population health: the SMART/HL7 FHIR Bulk Data access application programming interface. NPJ Digit Med. 2020;3(1):151-159.
- 5. Health and Human Services Department. 21st Century Cures Act: interoperability, information blocking, and the ONC Health IT Certification Program. 2020. Accessed May 29, 2024. https://www.federalregister.gov/d/2020-07419
- 6. Office of the National Coordinator of Health Information Technology. United States Core Data for Interoperability (USCDI). Accessed April 30, 2024. https://www.healthit.gov/isa/united-states-core-data-interoperability-uscdi
- 7. Leading Edge Acceleration Projects (LEAP) in Health Information Technology (Health IT). Accessed April 30, 2024. https://www.healthit.gov/topic/leading-edge-acceleration-projects-leap-health-information-technology-health-it
- 8. Institute of Medicine (US) Roundtable on Evidence-Based Medicine. The Learning Healthcare System: Workshop Summary. National Academies Press; 2007.
- 9. SMARTHealthIT. Multi-solving population data use with SMART® Bulk FHIR Access. 2022. Accessed April 30, 2024. https://smarthealthit.org/multi-solving-population-data-use-with-smart-bulk-fhir-access/
- 10. Garrity S. Silverorange: Running a design sprint in a healthcare organization. Sprint Stories. 2016. Accessed April 30, 2024. https://sprintstories.com/running-a-design-sprint-in-a-healthcare-organization-56001ac9d1bf
- 11. Knapp J, Zeratsky J, Kowitz B. Sprint: How to Solve Big Problems and Test New Ideas in Just 5 Days. Simon & Schuster; 2016.
- 12. Miller TA, McMurry AJ, Jones J, et al. The SMART Text2FHIR pipeline. AMIA Annu Symp Proc. 2023;2023:514-520.
- 13. Apache cTAKES™. Accessed April 30, 2024. https://ctakes.apache.org/
- 14. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507-513.
- 15. Miller T, Dligach D, Bethard S, et al. Towards generalizable entity-centric clinical coreference resolution. J Biomed Inform. 2017;69:251-258.
- 16. BERT 101—state of the art NLP model explained. Accessed April 30, 2024. https://huggingface.co/blog/bert-101
- 17. Wu S, Miller T, Masanz J, et al. Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PLoS One. 2014;9(11):e112774.
- 18. cnlp_transformers: Transformers for Clinical NLP. Github. Accessed April 30, 2024. https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers
- 19. FHIR Data Anonymization. Github. Accessed April 30, 2024. https://github.com/microsoft/Tools-for-Health-Data-Anonymization/blob/master/docs/FHIR-anonymization.md#fhir-data-anonymization
- 20. McMurry AJ, Zipursky AR, Geva A, et al. Moving biosurveillance beyond coded data using AI for symptom detection from physician notes: retrospective cohort study. J Med Internet Res. 2024;26:e53367.
- 21. Controlling high blood pressure. Accessed April 30, 2024. https://ecqi.healthit.gov/ecqm/ec/2023/cms165v11
- 22. Zipursky AR, Olson KL, Bode L, et al. Emergency department visits and boarding for pediatric patients with suicidality before and during the COVID-19 pandemic. PLoS One. 2023;18(11):e0286035.
- 23. Wikipedia contributors. Power set. Accessed April 30, 2024. https://en.wikipedia.org/wiki/Power_set
- 24. Presto 0.285 Documentation. Accessed April 30, 2024. https://prestodb.io/docs/0.285/
- 25. Complex grouping operations. Accessed April 30, 2024. https://prestodb.io/docs/0.285/sql/select.html#complex-grouping-operations
- 26. Data type descriptions (Coding.code). Accessed April 30, 2024. https://hl7.org/fhir/R4/datatypes-definitions.html#coding
- 27. Value set details. Accessed April 30, 2024. https://vsac.nlm.nih.gov/valueset/2.16.840.1.113762.1.4.1010.4/expansion
- 28. Mandl KD, Kohane IS. Federalist principles for healthcare data networks. Nat Biotechnol. 2015;33(4):360-363.
- 29. Jones JR, Gottlieb D, McMurry AJ, et al. Real world performance of the 21st Century Cures Act population-level application programming interface. J Am Med Inform Assoc. 2024;31(5):1144-1150.
- 30. McMurry AJ, Gilbert CA, Reis BY, et al. A self-scaling, distributed information architecture for public health, research, and clinical care. J Am Med Inform Assoc. 2007;14(4):527-533.
- 31. McMurry AJ, Murphy SN, MacFadden D, et al. SHRINE: enabling nationally scalable multi-site disease studies. PLoS One. 2013;8(3):e55811.
- 32. Weber GM, Murphy SN, McMurry AJ, et al. The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories. J Am Med Inform Assoc. 2009;16(5):624-630.
- 33. Abman SH, Mullen MP, Sleeper LA, Pediatric Pulmonary Hypertension Network, et al. Characterisation of paediatric pulmonary hypertensive vascular disease from the PPHNet Registry. Eur Respir J. 2022;59(1):2003337.
- 34. Mandl KD, Glauser T, Krantz ID, et al. The Genomics Research and Innovation Network: creating an interoperable, federated, genomics learning system. Genet Med. 2020;22(2):371-380.
- 35. Forrest CB, McTigue KM, Hernandez AF, et al. PCORnet® 2020: current state, accomplishments, and future directions. J Clin Epidemiol. 2021;129:60-67.
- 36. OMOP common data model. Accessed April 30, 2024. https://ohdsi.github.io/CommonDataModel/
- 37. PCORnet common data model. Accessed April 30, 2024. https://pcornet.org/wp-content/uploads/2023/04/PCORnet-Common-Data-Model-v61-2023_04_031.pdf
- 38. Argonaut project home. Accessed April 30, 2024. https://confluence.hl7.org/display/AP
- 39. Mandel JC, Kreda DA, Mandl KD, et al. SMART on FHIR: a standards-based, interoperable apps platform for electronic health records. J Am Med Inform Assoc. 2016;23(5):899-908.
- 40. Mandl KD, Mandel JC, Murphy SN, et al. The SMART Platform: early experience enabling substitutable applications for electronic health records. J Am Med Inform Assoc. 2012;19(4):597-603.
- 41. CDS hooks. Accessed April 30, 2024. https://cds-hooks.org/
- 42. Bulk optimize. Accessed April 30, 2024. https://confluence.hl7.org/display/AP/Bulk+Optimize