JAMIA Open. 2022 Aug 8;5(3):ooac067. doi: 10.1093/jamiaopen/ooac067

Case report: evaluation of an open-source synthetic data platform for simulation studies

Daniella Meeker 1, Crystal Kallem 2, Yan Heras 3, Stephanie Garcia 4, Casey Thompson 5
PMCID: PMC9360775  PMID: 35958672

Abstract

Simulation is a mainstay of comparative- and cost-effectiveness research when empirical data are not available. The Synthea platform, originally designed for generating realistically coded longitudinal health records for software testing, implements data generation models specified in publicly contributed modules representing patients’ life cycle and disease and treatment progression. We test the hypothesis that Synthea can be used for simulation studies that draw parameters from observational studies and randomized trials. We benchmarked the results and assessed the effort required to create a Synthea module that replicates a recently published cost-effectiveness simulation comparing levofloxacin prophylaxis to usual care for leukemia. A module was iteratively developed using published parameters from the original study; we replicated the initial conditions and simulation endpoints of demographics, health events, costs, and mortality. We compare Synthea’s Generic Module Framework to platforms designed for simulation and show that Synthea can be used, with modifications, for some types of simulation studies.

Keywords: Electronic Health Records, computer simulation, privacy, software validation, leukemia

INTRODUCTION

Simulations are used in health care when desired empirical data are not available, such as forecasting scenarios when changes in policies or practices may impact longitudinal population-level outcomes or cohort-level disease progression. This demonstration study focused on evaluating the use of Synthea V2.6.1 (Synthea) as a potential platform for simulation studies of patient healthcare utilization and outcomes. Such studies produce a simulated data set for statistical analysis comparing alternative treatment or policy scenarios or forecasting alternative future circumstances, including outcomes like disease progression, quality of life, and mortality.

Synthea1 is an open-source, freely available software package designed to generate realistic, standards-based synthetic healthcare records for developing and testing technology operating against clinical data. Unlike other software that generates coded synthetic health data (eg, EMRBots),2 Synthea enables user-specified models of healthcare and disease progression3 to generate longitudinal patient record sets with meaningful and technically realistic sequences of coded events. For this reason, researchers interested in health services research, economics, and policy have speculated that Synthea could be applied for simulation studies.4 Most existing Synthea modules use models derived from guideline pathways, rather than models that also include alternative or hypothetical pathways, including guideline-discordant pathways.5 Third-party evaluations of the validity of data generated with Synthea’s existing modules revealed limitations to this guideline-based approach.3 However, Synthea’s ability to support simulation-based research with appropriate models has not yet been evaluated.

Despite the differences in intended application, Synthea has features in common with microsimulation software designed to simulate healthcare outcomes and utilization, including the Future Elderly Model and other purpose-specific simulation software.6,7 Like many microsimulation models, Synthea’s model parameters are typically sourced from domain experts and published studies. Synthea initiates with a set of patients having predefined transition parameters and initial conditions, each evolving through discrete states over time according to probabilistic models or rules to produce an analytic population. By default, Synthea simulates the initial conditions in the Massachusetts population but can be modified to reflect custom demographics. However, Synthea is intended for “innovation, development, education, and other nonclinical secondary uses.”8 It does not support the ingestion of record-level data for estimating model parameters for high-fidelity deidentification of existing data sets9,10 or include user-friendly ways to specify complex, multivariate models from such data.

We hypothesized that Synthea can be repurposed for some types of simulation studies without changes to source code. To test this, our demonstration evaluated Synthea’s capability to replicate a published simulation study that employed commercial software. Successful replication would indicate that Synthea can be applied in this way. This evaluation complements Synthea’s originally intended use as a means to generate technically useful, realistic records from existing public data in a way that does not impact patient privacy.11

MATERIALS AND METHODS

Reproducing an existing study with published parameters enabled us to benchmark results from a new Synthea module against study results and to report module development methods and effort. Parameter development is often the most time-intensive part of simulation study design and Synthea module building, but it is not informative for evaluating the Synthea platform for simulation studies and was therefore not included in this evaluation.

A literature scan identified simulation studies with published endpoints and parameters for initial conditions, health states, transition probabilities, and service utilization. We selected a 2020 cost-effectiveness analysis comparing levofloxacin prophylaxis to usual care for pediatric patients undergoing an episode of chemotherapy for acute myeloid leukemia (AML) (McCormick).12 To summarize the parameter development process used in this study (referenced in detail in McCormick,12 S1), published parameters were drawn from 31 studies and an original retrospective cohort analysis of AML patients in the Pediatric Health Information System database. Probabilities of intensive care unit (ICU) admission and mortality were drawn from several published studies, and cost parameters were drawn from the Federal Supply Schedule.13 As described above, this approach to data-driven modeling from research cohorts and process models differs from the prevailing guideline-based approach in existing Synthea modules: guidelines do not incorporate pathways that include adverse outcomes and guideline-discordant treatment patterns.

To evaluate Synthea usability for new adopters, we selected module builders who were moderately experienced with Synthea’s Generic Module Framework (GMF), having gained that experience by developing four new modules for Opioids, Pediatrics, and Complex Care.14–17 The primary module builder is a nurse informaticist specializing in data standards and healthcare interoperability. The secondary builder, consulted for troubleshooting and quality assurance after each iteration, is a bioinformatics researcher specializing in data standards and clinical model design. A researcher with experience in simulation studies resolved complex parameter alignment issues when McCormick results required conversion to Synthea’s specification format.

To help inform potential users of Synthea’s challenges and capabilities, we adopted an iterative approach to module building: we built an AML module in Synthea conforming to McCormick results and model specifications, generated a new patient data set for analysis, and tested the differences between Synthea’s output and the output reported by McCormick et al. for each version of the module, as shown in Table 1. By reporting the alignment and technical solutions required in each iteration, we provide insight into the level of effort and technical details of steps required to repurpose Synthea for simulation studies. After each iteration, Synthea-generated patients’ demographic attributes, treatment, and health outcomes were compared to McCormick results. Point estimates in McCormick tables and Supplementary Material were compared to Synthea data using chi-squared and t tests for binary and continuous variables, respectively. Each comparison was scored as pass or fail, and the module was updated until all comparisons “passed” with P > .001. Issues were tracked, and module builders identified and implemented solutions. We measured effort required for each development cycle, issues encountered, and resolutions. Module builders relied on Synthea’s public documentation18 and GitHub issue tracker19 throughout the process. Data analysis was conducted using Stata 15.20
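The pass-fail comparison for a binary attribute can be sketched in a few lines. This is a minimal illustration, not the Stata code used in the study: it assumes a one-sample chi-squared goodness-of-fit test with 1 degree of freedom, and the function names and example counts are hypothetical.

```python
import math

P_THRESHOLD = 0.001  # pass criterion stated in the text: P > .001


def chi2_binary(observed_yes, n, published_proportion):
    """Chi-squared goodness-of-fit (1 df) of a simulated count vs a published proportion."""
    expected_yes = n * published_proportion
    expected_no = n - expected_yes
    observed_no = n - observed_yes
    stat = ((observed_yes - expected_yes) ** 2 / expected_yes
            + (observed_no - expected_no) ** 2 / expected_no)
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def passes(p_value, threshold=P_THRESHOLD):
    """An attribute passes when no difference is detected at the threshold."""
    return p_value > threshold


# Hypothetical example: 60 of 100 simulated patients vs a published proportion of 0.5.
stat, p = chi2_binary(60, 100, 0.5)
print(f"chi2={stat:.2f}, P={p:.4f}, pass={passes(p)}")
```

Continuous attributes such as age would need a t test instead, which is omitted here to keep the sketch free of third-party libraries.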

Table 1.

Synthea AML module iterations

AML Module Iteration Challenges Lessons Learned
Iteration 1
  • Initial version of module

  • Developed expertise in states and transitions based on parameters in a simulation study

Iteration 2
  • Editing and enhancing cost table to match published costs for encounters, procedures, and medications

  • Cost data are specified in different places within Synthea: some cost data are contained within lookup tables, and the remainder are default costs in the Synthea demographics file

Iteration 3
  • Learning how to edit/enhance distribution table for table transition

  • Table transitions are new to Synthea and not used by many modules. There is limited information related to this type of transition on the Synthea wiki

Iteration 4
  • Removed distribution table and used complex transitions

  • Complex transitions for race/ethnicity caused a large proportion of patients to be eliminated from the module

  • Costs were updated in Synthea cost lookup tables. When corresponding codes were not present in lookup tables, additional lines were added to accommodate new codes and affiliated costs

Iteration 5
  • Moved complex transitions to below the age guard/year guard in the module

  • Run Synthea for a default city in a diverse area

  • Create new city in Synthea with race and ethnicity which match McCormick

  • Procedures and observations older than 10 years ago were not displaying in the output. A setting must be updated in the synthea.properties file to allow older observations and procedures to display in the output

  • Updating the Synthea demographics file to create a new city requires updating the appropriate zip code and provider files and adding a latitude/longitude

Iteration 6 (V0.6a–0.6e)
  • Modified existing city in the Synthea default demographics file with parameters to match McCormick for base population along with table transition within the module to differentiate levofloxacin/nonlevofloxacin populations

  • Added 22 additional age delays to the top of the module to accommodate for age

  • Modifying an existing city with specific parameters, adding a distribution table within the module, and adding age delays in the module generated findings which matched McCormick

RESULTS

Developers designed initial Synthea module states and transitions based on the McCormick model diagram (Figure 1).

Figure 1.

Comparison of McCormick model10 to Synthea AML module. Displays a comparison of a portion of the McCormick model to the Synthea AML module. AML: acute myeloid leukemia.

The AML module initialized with 25 states and 24 transitions and finalized with 47 states and 46 transitions. During development, 22 additional states were incorporated. The AML module reproduced McCormick model pathways, including assuming only febrile neutropenia patients develop bacteremia during the episode of care and routing only nonfebrile neutropenia patients to a terminal state. Transition probabilities and attribute distributions were based on McCormick Supplementary Table 1 parameters. Workarounds were developed to address Synthea’s modeling interface limitations, resulting in the increased number of states and transitions. The final AML module is displayed in Figure 2.
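The routing rule described above (only febrile neutropenia patients can develop bacteremia) can be illustrated with a small state machine. This is a sketch under stated assumptions: the state names and all probabilities are hypothetical placeholders rather than McCormick parameters, and the code mimics the idea of probabilistic transitions without using Synthea itself.

```python
import random

# Hypothetical probabilities for illustration only; NOT the McCormick parameters.
MODULE = {
    "Initial": [(1.0, "Chemotherapy_Episode")],
    "Chemotherapy_Episode": [(0.8, "Febrile_Neutropenia"),
                             (0.2, "No_Febrile_Neutropenia")],
    "Febrile_Neutropenia": [(0.3, "Bacteremia"), (0.7, "Terminal")],
    "No_Febrile_Neutropenia": [(1.0, "Terminal")],  # can never reach Bacteremia
    "Bacteremia": [(1.0, "Terminal")],
    "Terminal": [],
}


def run_patient(module, rng, start="Initial"):
    """Walk one simulated patient through the states and return the visited path."""
    path = [start]
    state = start
    while module[state]:  # a state with no outgoing transitions ends the walk
        probs = [p for p, _ in module[state]]
        targets = [t for _, t in module[state]]
        state = rng.choices(targets, weights=probs, k=1)[0]
        path.append(state)
    return path


rng = random.Random(0)
paths = [run_patient(MODULE, rng) for _ in range(1000)]
bacteremia = sum("Bacteremia" in p for p in paths)
print(f"{bacteremia} of {len(paths)} simulated patients developed bacteremia")
```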

Figure 2.

Final Synthea AML module (Iteration 0.6e).21 Displays the final version of the Synthea AML module as displayed in the module builder tool. AML: acute myeloid leukemia.

Initial conditions

Synthea’s default demographic data are based on city and state-level census data rather than specific patient cohorts.21 Replicating McCormick initial conditions for AML patients (eg, race, age) into Synthea’s framework required configuring a city within the Synthea demographics file with parameters matching the McCormick population.

Transition probability specification

The AML module included state transitions for medication with levofloxacin, development of bacteremia, admission to an ICU, and mortality. The McCormick defining state transition for hypothesis testing was assignment to levofloxacin treatment or standard of care, and these groups were not demographically identical. Thus, conditional transition probabilities were required. Synthea’s GMF offers multiple mechanisms for conditional transitions, including table-based transitions, in which transition probabilities depend on patient demographics and health attributes. Table transitions rely on a separate CSV file that is called upon during the simulation. Synthea documentation contained limited information on table transitions, so development required significant trial and error.
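The idea behind a table-based conditional transition can be sketched as follows. This is an illustration only: the column names, attribute values, and probabilities are hypothetical, and the CSV layout is a simplification that should not be read as Synthea's actual lookup table format.

```python
import csv
import io
import random

# Hypothetical lookup table; column names and probabilities are placeholders.
TABLE_CSV = """race,Levofloxacin,Standard_Of_Care
white,0.50,0.50
black,0.40,0.60
other,0.45,0.55
"""


def load_table(text):
    """Map each attribute value to a list of (target state, probability) pairs."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = row.pop("race")
        table[key] = [(state, float(p)) for state, p in row.items()]
    return table


def next_state(patient, table, rng):
    """Pick the next state with probabilities conditional on the patient's race."""
    options = table[patient["race"]]
    states = [s for s, _ in options]
    weights = [w for _, w in options]
    return rng.choices(states, weights=weights, k=1)[0]


table = load_table(TABLE_CSV)
rng = random.Random(0)
print(next_state({"race": "black"}, table, rng))
```

Keeping the probabilities in a table rather than in the module logic means the treatment allocation can be retuned between iterations without touching the state graph.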

Continuous distributions of values

At the time of the study, Synthea supported only uniform probability distributions for transitions. Support for Gaussian distributions could have been useful when attempting to match certain McCormick distributions, including patients’ age. Patient age at diagnosis (time elapsed between birthdate and date of diagnosis) was implemented as a Synthea delay state following birthdate. We approximated McCormick’s Gaussian age distribution using 22 uniform distributions for delays between 0 and 21 years from birthdate to initial diagnosis. In response to user requests, Synthea developers recently added support for Gaussian distributions in the GMF, including distributions for laboratory values and delays. However, delays were not constrained to nonnegative values, resulting in simulated visits predating patient birth. Thus, the final version of the AML module relies on a series of uniform distributions rather than the new Gaussian features.
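The workaround of approximating a Gaussian with a series of uniform delays can be sketched as below. The assumptions are ours: 22 one-year-wide bins starting at ages 0 through 21, a truncated normal renormalized over those bins, and a hypothetical mean and SD that are not the McCormick values. Each bin's weight plays the role of a transition probability into one uniform delay state.

```python
import math
import random


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def uniform_bin_weights(mean, sd, n_bins=22):
    """Probability mass of each one-year bin [k, k+1), renormalized to sum to 1."""
    masses = [normal_cdf(k + 1, mean, sd) - normal_cdf(k, mean, sd)
              for k in range(n_bins)]
    total = sum(masses)
    return [m / total for m in masses]


def sample_age(weights, rng):
    """Choose a bin by weight, then draw uniformly within it (never negative)."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return k + rng.random()


# Hypothetical mean and SD, chosen only to exercise the code.
weights = uniform_bin_weights(mean=9.0, sd=5.0)
rng = random.Random(0)
ages = [sample_age(weights, rng) for _ in range(10000)]
print(f"mean simulated age at diagnosis: {sum(ages) / len(ages):.2f} years")
```

Because every draw starts from a nonnegative bin edge, this construction cannot produce a delay that predates birth, which is the problem the unconstrained Gaussian delay introduced.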

Cost specifications

Synthea’s GMF supports customization of lookup tables for costs and default values based on Medicare cost files with adjustments for each state. Synthea cost data are based on a simplified version of real-world costs—publicly available Medicare records are multiplied by state adjustment factors. Costs are provided for encounters, procedures, medications, and immunizations. Costs not provided in lookup tables are assigned a default cost in the Synthea properties file.22

Levofloxacin was added to the medications cost lookup table with a McCormick-based specified cost. Bacteremia cost data were updated for the inpatient encounter in the encounters cost lookup table. ICU admission and mortality costs were updated in the procedures cost lookup table, and a code for death event was added. Other parameters were adjusted to avoid impacting McCormick costs during generation. Tables were customized in the developer’s local version of Synthea.
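The lookup-with-default behavior described in this section can be sketched as a simple dictionary lookup. All codes and dollar amounts below are hypothetical placeholders; none are taken from Synthea's cost files or from McCormick.

```python
# Hypothetical default costs per category, used when a code has no table entry.
DEFAULT_COSTS = {"encounter": 125.0, "procedure": 500.0, "medication": 250.0}

# Hypothetical lookup entries keyed by (category, code).
COST_LOOKUP = {
    ("medication", "LEVOFLOXACIN_CODE"): 10.0,
    ("encounter", "INPATIENT_BACTEREMIA_CODE"): 20000.0,
    ("procedure", "ICU_ADMISSION_CODE"): 30000.0,
}


def cost_of(category, code):
    """Return the table cost for a code, falling back to the category default."""
    return COST_LOOKUP.get((category, code), DEFAULT_COSTS[category])


def patient_total(events):
    """Sum the costs of a patient's (category, code) events."""
    return sum(cost_of(category, code) for category, code in events)


events = [
    ("medication", "LEVOFLOXACIN_CODE"),
    ("encounter", "INPATIENT_BACTEREMIA_CODE"),
    ("procedure", "UNLISTED_CODE"),  # not in the table, so the default applies
]
print(patient_total(events))
```

The fallback is why other parameters had to be adjusted in the study: any event left out of the tables still contributes its default cost to the total.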

Execution of simulation

To execute the AML module,23 builders implemented the model and custom lookup tables for costs and transition probabilities. Details are in the AML module companion guide.24

Statistical analysis

To fully reproduce the McCormick results, six AML module versions (with multiple sub-iterations of V0.6) were iteratively developed and tested. Module refinement continued until Synthea distributions for all patient attributes in both the treatment and standard of care populations were comparable to the point estimates reported in the McCormick results. Table 2 shows the results of comparisons between Synthea and McCormick for each sub-iteration of AML module V0.6. As described above, matching the Gaussian distribution for age was the most challenging aspect of module development: multiple attempts over the course of development failed to replicate the reference values.

Table 2.

Summary of results for AML module V0.6

AML module iteration    V0.6a         V0.6b         V0.6c         V0.6d         V0.6e (w/age restriction)    V0.6e (w/delays)
Tx versus SoC           Tx     SoC    Tx     SoC    Tx     SoC    Tx     SoC    Tx     SoC                   Tx     SoC
Population size         Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass                  Pass   Pass
Age (SD)                Fail   Fail   Fail   Fail   Fail   Fail   Fail   Fail   Fail   Fail                  Pass   Pass
Race: white             Fail   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass                  Pass   Pass
Race: black             Fail   Fail   Pass   Pass   Fail   Pass   Pass   Pass   Pass   Pass                  Pass   Pass
Race: other/missing     Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass                  Pass   Pass
Hispanic ethnicity      Fail   Fail   Fail   Fail   Fail   Pass   Pass   Pass   Pass   Pass                  Pass   Pass
Bacteremia              Fail   Pass   Fail   Pass   Fail   Pass   Fail   Pass   Fail   Pass                  Pass   Pass
ICU                     Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass                  Pass   Pass
Mortality               Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass                  Pass   Pass

AML: acute myeloid leukemia; ICU: intensive care unit; SoC: standard of care; Tx: treatment.

The level of effort to develop the AML module was approximately 20% higher than that for other modules developed by the same team, largely due to the distinct objective of matching results of prior research rather than the more straightforward objective of matching guideline pathway diagrams without quantitative data validation. Matching McCormick parameters and a Gaussian distribution required creating numerous delay states and complex table transitions. Achieving a precise level of accuracy when comparing Synthea output to McCormick results required developing and testing numerous iterations of the AML module.

DISCUSSION

Microsimulation models have been used for real-world applications in forecasting policies6,7 and assessing disease progression trajectories and intervention scenarios.25 This demonstration shows that Synthea can be repurposed for simulation studies comparable to the reference study without advanced programming. Although workarounds were required, replication was feasible. We found that Synthea could be readily modified to initialize simulations with a cohort matching the characteristics of the AML population. Synthea also has features allowing for more complex and conditional state transitions required to reproduce differential treatment allocation across our initial AML cohort. While reproducing these capabilities of simulation software required some nonintuitive uses of the Synthea GMF, the same approach could be used in future simulation studies using Synthea in other contexts.

The McCormick simulation did not involve multivariate models for conditional state transitions or interactions between individuals. However, it is representative of a broad class of microsimulation studies. Future releases of Synthea may include user-friendly support for these features, extending the complexity of possible studies and models. Synthea currently lacks the capability to support complex multivariate distributions, but newly introduced support for nonuniform distributions (including univariate Gaussian distribution related to this study) may serve as a foundation for this capability. Enhancements to the documentation of Synthea’s existing features may facilitate future simulation research.

For health economics research, Synthea can represent costs as they might appear in billing data for encounters, procedures, medications, and immunizations using billing codes provided in lookup tables. McCormick assigned a generic cost to each outcome and medication; cost comparisons can therefore be calculated from the frequencies of these events. Future analyses and module iteration might add this granularity and complexity and assess or extend Synthea’s default values for costs. Although Synthea allows for more extensive modeling, the current strategy is sufficient for study designs like McCormick, where simple frequencies of generic simulated events are used to generate cost comparisons.
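A frequency-based cost comparison of this kind reduces to multiplying event counts by unit costs and averaging over patients. The following sketch uses hypothetical event names, counts, and costs purely to show the arithmetic.

```python
def expected_cost_per_patient(event_counts, unit_costs, n_patients):
    """Average cost per patient: sum of (event frequency x unit cost) / cohort size."""
    total = sum(count * unit_costs[event] for event, count in event_counts.items())
    return total / n_patients


# Hypothetical unit costs and simulated event counts for two arms of 100 patients.
UNIT_COSTS = {"levofloxacin": 10.0, "bacteremia": 20000.0}
tx_counts = {"levofloxacin": 100, "bacteremia": 10}
soc_counts = {"levofloxacin": 0, "bacteremia": 20}

tx_cost = expected_cost_per_patient(tx_counts, UNIT_COSTS, 100)
soc_cost = expected_cost_per_patient(soc_counts, UNIT_COSTS, 100)
print(f"incremental cost per patient (Tx minus SoC): {tx_cost - soc_cost:.2f}")
```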

Other research applications

Our results also indicate Synthea can be used for study planning, software testing, and development of analysis routines for research studies. For complex experimental designs, like stratified or cluster randomized trials, simulations are often used in power analysis to prepare for clinical trials. Synthea can generate data for power analysis and sample size requirements. For example, should a follow-up to the McCormick study include a randomized trial of levofloxacin prophylaxis, investigators might apply our Synthea AML module to help plan sample size requirements and prepare data infrastructure. Because an arbitrary number of simulated patient records can be produced, Synthea has become the “gold standard” for performance benchmarking in health data management strategies.26,27 Synthea can also produce realistic patterns in healthcare records, comparable to data generated in Electronic Health Records (EHRs) during pragmatic trials. Informaticians can generate and use data to apply and test software that might, in a forthcoming trial or registry project, use EHR data. Once a database is designed with realistic data, analysis routines for data safety and other monitoring reports can be developed and tested before beginning data collection, reducing the time between implementing research infrastructure and initiating enrollment.
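Simulation-based power analysis of the kind mentioned above can be sketched without Synthea to show the principle. The assumptions are ours: a two-arm trial with a binary outcome, a Pearson chi-squared test with 1 degree of freedom and no continuity correction, and hypothetical event rates. In practice the simulated outcomes would come from module-generated patient records rather than from direct Bernoulli draws.

```python
import math
import random


def two_proportion_p(events_a, n_a, events_b, n_b):
    """Pearson chi-squared test (1 df, no continuity correction) on a 2x2 table."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # no variation in the outcome, so no evidence of a difference
    stat = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(stat / 2.0))


def estimate_power(rate_tx, rate_soc, n_per_arm, alpha=0.05, n_trials=500, seed=0):
    """Fraction of simulated trials in which the arms differ significantly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        tx = sum(rng.random() < rate_tx for _ in range(n_per_arm))
        soc = sum(rng.random() < rate_soc for _ in range(n_per_arm))
        if two_proportion_p(tx, n_per_arm, soc, n_per_arm) < alpha:
            hits += 1
    return hits / n_trials


# Hypothetical bacteremia rates of 20% (treatment) vs 40% (standard of care).
for n in (20, 50, 200):
    print(n, estimate_power(0.2, 0.4, n))
```

Scanning over candidate sample sizes and picking the smallest one that reaches the target power gives a simulation-based sample size estimate.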

CONCLUSION

Synthea was not designed for simulation studies but includes many features of commercial microsimulation software. Synthea offers flexibility for estimating coded cost data in a way that allows even greater variation than the McCormick reference study. The user interface is not optimized for complex specifications of initial conditions and transition probabilities, but Synthea modules can be configured, with reasonable effort, to support some simulation research. Although Synthea is an open-source tool, it is actively supported by a team of MITRE developers who address new feature requests, consider the clinical validity of contributed modules, and manage other issues identified by the user community.

ACKNOWLEDGMENTS

The authors would like to thank Maureen Tan for copy editing support and reference formatting assistance. We would also like to thank Carmen Smiley from the Office of the National Coordinator for Health Information Technology and Anita Samarth from Clinovations Government+Health for leadership and contributions on the overarching project that examined the scope and utility of synthetic data for patient-centered outcomes research. The work described here is part of the Synthetic Health Data Generation to Accelerate PCOR project that was supported by the Office of the Secretary Patient-Centered Outcomes Research Trust Fund. Finally, the authors acknowledge the MITRE Synthea team for their assistance in responding to issues encountered over the course of this work.

FUNDING

This work was funded through U.S. Department of Health and Human Services Office of the Secretary Patient-Centered Outcomes Research Trust Fund under interagency agreement number 750119PENC0002, contract number HHSP233201500099I, task order number: HHSP75P00119F37004 with Clinovations Government+Health.

AUTHOR CONTRIBUTIONS

All authors acknowledge that they: (1) made substantial contributions to the conception or design of the work, or the acquisition, analysis, or interpretation of the data for the work; (2) contributed substantially to the drafting and final approval of the version to be published; and (3) agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

SUPPLEMENTARY MATERIAL

Supplementary material is available at JAMIA Open online.

CONFLICT OF INTEREST STATEMENT

None declared.

Contributor Information

Daniella Meeker, Keck School of Medicine, University of Southern California, Los Angeles, California, USA.

Crystal Kallem, Clinovations Government+Health, Washington, District of Columbia, USA.

Yan Heras, Optimum eHealth, LLC, Irvine, California, USA.

Stephanie Garcia, Office of the National Coordinator for Health Information Technology, Washington, District of Columbia, USA.

Casey Thompson, Clinovations Government+Health, Washington, District of Columbia, USA.

Data Availability

Data are available from a GitHub Repository: https://github.com/casey7083/Demonstration-Study-Output.

Articles from JAMIA Open are provided here courtesy of Oxford University Press
