Abstract
Objective
Synthea is a synthetic patient generator that creates synthetic medical records, including medication profiles. Prior to our work, Synthea produced unrealistic medication data that did not accurately reflect prescribing patterns. This project aimed to create an open-source synthetic medication database that could integrate with Synthea to create realistic patient medication profiles.
Materials and Methods
The Medication Diversification Tool (MDT) created from this study combines publicly available prescription data from the Medical Expenditure Panel Survey (MEPS) and standard medication terminology/classifications from RxNorm/RxClass to produce machine-readable information about medication use in the United States.
Results
The MDT was validated using a chi-square goodness-of-fit test by comparing medication distributions from Synthea, Synthea+MDT, and the MEPS. Using a pediatric asthma population, results show that Synthea+MDT distributions did not differ significantly from the real-world MEPS distributions (P = .84).
Discussion
The MDT is designed to generate realistic medication distributions for drugs and populations. This tool can be used to enhance medication records generated by Synthea by calculating medication-use data at a national level or specific to patient subpopulations. MDT’s contributions to synthetic data may enable the acceleration of application development, access to more realistic healthcare datasets for education, and patient-centered outcomes research.
Conclusions
The MDT, when used with Synthea, provides a free and open-source method for making synthetic patient medication profiles that mimic the real world.
Keywords: Synthea, dataset, RxNorm, MEPS, database
OBJECTIVE
Synthea is an open-source application that can generate synthetic patient medication profiles. However, Synthea’s synthetically created medication profiles did not mimic the real world. This study’s objective was to create a method that can generate a realistic synthetic medication database using open-source databases.
BACKGROUND AND SIGNIFICANCE
The sharing and use of real healthcare data are limited due to patient privacy concerns and laws.1 Synthetic data present researchers with an intriguing solution to this issue of limited access.2 Without robust and representative datasets, researchers and developers cannot advance healthcare research and technology. Synthetic data can allow users to access large volumes of data with minimal privacy risks. However, this introduces a new challenge: the usefulness of synthetic data and its applications are limited by how accurately the data mimics real-world data.
The Office of the National Coordinator for Health Information Technology (ONC) adopted a synthetic health data generation tool called Synthea to accelerate patient-centered outcomes research.3 However, the synthetic database did not mimic the diversity of medications and prescribing patterns. This introduces bias into the generated dataset and research results. For example, one study found that Synthea diabetes patients had a 4000% increase in amputations, and 100% of Synthea type-2 diabetics had at least one amputation.4 This type of inaccurate result was also present in the medication distributions produced by Synthea.
The process for researching and inputting realistic statistical distributions into Synthea was manual and difficult for nonclinical users. Given a specific disease state, the Synthea tool created inaccurate records with an unrealistically narrow selection of medications.5,6 For example, 100% of asthma patients were getting the same asthma inhaler regardless of age—which is not consistent with real-world clinical practice.7
Medication Diversification Tool (MDT) is a series of decision trees that utilizes statistical medication distributions to create lifelike patient medication records. By using databases which are open-source and maintained regularly, such as the Medical Expenditure Panel Survey (MEPS), RxNorm (visualized via RxNav), and RxClass, MDT—in combination with Synthea—can create realistic synthetic medication data for patients and requires very little maintenance.8–10 Since the release of MDT, the Synthea development team has incorporated several MDT-generated medication modules into Synthea, including asthma maintenance inhalers and asthma rescue inhalers.
MATERIALS AND METHODS
Our team developed MDT to increase variation and realism of medication orders in Synthea. MDT leverages the open-source medication classification hierarchies in RxClass, the open-source medication ontology of RxNorm (visualized using RxNav as a graphical user interface), and descriptive statistics from MEPS data.8,9 RxNorm and RxClass are drug information databases that contain drug names, codes, and classes. MEPS is a national survey of households and their medical providers to collect information on the use and cost of health care.9 Linking RxClass and RxNorm allows users to translate between human-readable drug classes and specific drug product codes, and link to other databases like MEPS prescription records. MEPS data were used to calculate drug utilization distributions, stratified by US state, patient age, and patient gender.
Figure 1 shows the framework of MDT and how it leverages the 3 publicly available databases to create an output that can be used in Synthea.
Figure 1.
Medication Diversification Tool (MDT) framework.
MDT is designed to be usable by both clinical (nontechnical) and technical (nonclinical) users. The workflow begins by exploring the RxClass and RxNav graphical user interfaces (GUIs) hosted by the National Library of Medicine (NLM). Users search for ingredients (in RxNav) or medication classes (in RxClass) and can combine ingredients and classes using include and/or exclude logic in MDT (ie, include this medication class, but exclude these medication ingredients).
With pediatric asthma, for example, a clinician might tell a developer that inhaled corticosteroids are typically used as maintenance medications. The developer would search RxClass for “corticosteroids” and find the ATC1-4 class R01AD, which represents a list of ingredients considered “Corticosteroids.” The clinician may also specify that only single-ingredient products (as opposed to multi-ingredient, combination products) are used in this population, and that the dose forms for these products should be limited to metered dose inhalers, dry powder inhalers, and inhalation suspensions. The developer can easily input these settings into the MDT settings file.
MDT will first combine the ingredient and class settings based on the optional include/exclude logic and produce a final list of ingredient RxNorm concept unique identifiers (RXCUIs). In this case, it would query the NLM-hosted RxClass application programming interface (API) for the class of R01AD, which would result in a list of ingredient RXCUIs. Then, using a custom local SQLite database created by MDT upon initialization, that list of medication ingredients is expanded into a list of medication product RXCUIs (representing prescribable products by ingredient as well as dose form and strength)—excluding products that do not match the other settings (for the asthma example, excluding multi-ingredient products and non-inhalation dose form products). Once the user finalizes a list of filtered product RXCUIs, that list may be expanded into a list of National Drug Codes (NDCs) available for those products.
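The selection steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MDT's implementation: the `Product` record, the function names `resolve_ingredients` and `filter_products`, and all identifiers are hypothetical placeholders rather than real RXCUIs. In MDT itself, the class members come from the RxClass API and the product data from the local SQLite database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    rxcui: str              # placeholder product identifier, not a real RXCUI
    ingredients: frozenset  # ingredient identifiers contained in the product
    dose_form: str


def resolve_ingredients(class_members, include=(), exclude=()):
    """Combine class members with explicit includes, then drop excludes."""
    return (set(class_members) | set(include)) - set(exclude)


def filter_products(products, ingredients, dose_forms, single_ingredient_only=True):
    """Keep products that contain a selected ingredient and match the settings."""
    kept = []
    for product in products:
        if not (product.ingredients & ingredients):
            continue  # no selected ingredient in this product
        if single_ingredient_only and len(product.ingredients) != 1:
            continue  # combination product
        if product.dose_form not in dose_forms:
            continue  # dose form not allowed for this population
        kept.append(product)
    return kept


# Hypothetical data standing in for the RxClass response and the local database.
class_members = {"ING_A", "ING_B", "ING_C"}
products = [
    Product("P1", frozenset({"ING_A"}), "Metered Dose Inhaler"),
    Product("P2", frozenset({"ING_A", "ING_X"}), "Dry Powder Inhaler"),
    Product("P3", frozenset({"ING_B"}), "Nasal Spray"),
    Product("P4", frozenset({"ING_C"}), "Inhalation Suspension"),
]

ingredients = resolve_ingredients(class_members, exclude={"ING_C"})
selected = filter_products(
    products,
    ingredients,
    dose_forms={"Metered Dose Inhaler", "Dry Powder Inhaler", "Inhalation Suspension"},
)
selected_rxcuis = [product.rxcui for product in selected]
```

In this toy data, only the single-ingredient inhaler survives: the combination product, the nasal spray, and the product whose ingredient was excluded are all filtered out.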
Since MEPS stores patient survey results at the NDC level, MDT can calculate the distribution of patients that report taking a given NDC compared to the total number of patients matching their demographic population (optional settings for age range, gender, and state of residence). All NDC-level distributions are rolled back up to the ingredient and product RXCUI levels. Ultimately, this results in a 2-step distribution process where—using population data—MDT can determine the likelihood that a patient would be prescribed a specific medication product. For example, MDT can calculate that a patient aged 0–5 years would be prescribed a certain ingredient. If they are prescribed that ingredient, MDT can determine the likelihood that they would be prescribed a specific product versus a different one (eg, a different brand, strength, or dose form). See Figure 2 for an example of this 2-step distribution process.
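The 2-step roll-up can be illustrated with a small Python sketch. The rows, the NDC lookup, and the function name `two_step_distributions` are hypothetical, and the sketch makes simplifying assumptions that may not match MDT: it counts distinct patients without applying MEPS survey weights, and it normalizes the product shares within each ingredient so that they sum to 1.

```python
from collections import defaultdict

# Hypothetical survey rows: (patient_id, ndc). Real MEPS files carry many more fields.
reports = [
    (1, "NDC_1"), (2, "NDC_1"), (3, "NDC_2"),
    (4, "NDC_3"), (5, "NDC_3"), (5, "NDC_1"),
]

# Hypothetical lookup from NDC to (ingredient, product).
ndc_map = {
    "NDC_1": ("ingredient_a", "product_a_low"),
    "NDC_2": ("ingredient_a", "product_a_high"),
    "NDC_3": ("ingredient_b", "product_b"),
}

population_size = 10  # all patients matching the demographic filters


def two_step_distributions(reports, ndc_map, population_size):
    """Roll NDC-level reports up to ingredient and product distributions."""
    ingredient_patients = defaultdict(set)
    product_patients = defaultdict(lambda: defaultdict(set))
    for patient, ndc in reports:
        ingredient, product = ndc_map[ndc]
        ingredient_patients[ingredient].add(patient)
        product_patients[ingredient][product].add(patient)

    # Step 1: share of the matching population reporting each ingredient.
    ingredient_dist = {
        ingredient: len(patients) / population_size
        for ingredient, patients in ingredient_patients.items()
    }

    # Step 2: within each ingredient, the share going to each product.
    product_dist = {}
    for ingredient, by_product in product_patients.items():
        total = sum(len(patients) for patients in by_product.values())
        product_dist[ingredient] = {
            product: len(patients) / total
            for product, patients in by_product.items()
        }
    return ingredient_dist, product_dist


ingredient_dist, product_dist = two_step_distributions(reports, ndc_map, population_size)
```

Using sets of patient identifiers means a patient who reports the same product more than once is counted only once at each level.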
Figure 2.
MDT-generated ingredient and product-level distributions.
The calculated distributions are output to comma-separated values (CSV) files: one for the ingredient distributions (eg, fluticasone) from the first step, and one per ingredient-containing product (eg, Flovent 44 µg metered dose inhaler) in the second step. MDT also generates a JavaScript Object Notation (JSON) file, which is in the exact format Synthea uses for its modules and submodules. The JSON file represents a flowchart of this 2-step process and references the CSV files to determine the probability a patient would go down one path versus any other (ie, ingredient vs ingredient, or product vs product). The last step in the JSON flowchart determines a product RXCUI, which represents a prescribed medication product for a synthetic patient.
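The flowchart logic amounts to two weighted random choices in sequence. The Python sketch below imitates that idea only; it is not Synthea's module engine, and the distributions, names, and the helper functions `pick` and `sample_product` are hypothetical. It assumes that each distribution is already normalized to sum to 1.

```python
import random

# Hypothetical, already-normalized distributions (not taken from MEPS).
ingredient_dist = {"ingredient_a": 0.7, "ingredient_b": 0.3}
product_dist = {
    "ingredient_a": {"product_a_low": 0.9, "product_a_high": 0.1},
    "ingredient_b": {"product_b": 1.0},
}


def pick(dist, rng):
    """Draw one key from a {name: probability} mapping."""
    names = list(dist)
    weights = [dist[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def sample_product(ingredient_dist, product_dist, rng):
    """Step 1 picks an ingredient; step 2 picks a product for that ingredient."""
    ingredient = pick(ingredient_dist, rng)
    return pick(product_dist[ingredient], rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_product(ingredient_dist, product_dist, rng) for _ in range(10_000)]
share_b = draws.count("product_b") / len(draws)
```

With enough draws, the observed share of each product should approach the product of the two step probabilities, for example roughly 0.3 for `product_b` here.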
Finally, a user can take the JSON file and the CSV files output from MDT and integrate them with Synthea. The instructions for doing this are listed in the MDT GitHub repository, which is linked at the end of this article. After integrating the output of MDT with Synthea, a user can re-run Synthea and note the increased diversity of medications in the resulting synthetic patient population, which aligns approximately with the distributions calculated by MDT. Because Synthea adds an element of randomness, a large synthetic patient population is sometimes required to observe prescriptions for low-probability products.
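The population size needed to see a rare product can be estimated with a short calculation. The sketch below assumes that each synthetic patient who reaches the prescribing step receives the product independently with the same probability p, which is a simplification of how Synthea generates patients; the function names are illustrative only.

```python
import math


def prob_observed(p, n):
    """Probability that a product with per-patient probability p appears
    at least once among n independent patients."""
    return 1.0 - (1.0 - p) ** n


def patients_needed(p, confidence=0.95):
    """Smallest n for which prob_observed(p, n) reaches the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Example: a product prescribed to 0.1% of eligible patients.
n_required = patients_needed(0.001)
```

Under these assumptions, a product with a 0.1% probability needs on the order of a few thousand eligible patients before it is likely to appear even once.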
For validation, a use case of pediatric asthma for children ages 0–5 years was chosen. Synthetic medication record distributions for asthma patients ages 0–5 years generated by Synthea and Synthea+MDT were each compared with real distributions from MEPS using a chi-square goodness-of-fit test.11 Outputs were reviewed to ensure alignment with clinical practice standards for this population.
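As an illustration of the test statistic, the Python sketch below computes a chi-square goodness-of-fit statistic from observed synthetic counts and expected reference proportions. The category names and numbers are invented for the example and are not the study's data; the function name `chi_square_gof` is also hypothetical.

```python
def chi_square_gof(observed_counts, expected_proportions):
    """Return the chi-square statistic and degrees of freedom.

    observed_counts: {category: count} from the synthetic population.
    expected_proportions: {category: proportion} from the reference data;
    proportions must be positive and sum to 1.
    """
    total = sum(observed_counts.values())
    statistic = 0.0
    for category, proportion in expected_proportions.items():
        expected = total * proportion
        observed = observed_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(expected_proportions) - 1
    return statistic, degrees_of_freedom


# Invented example: three medication categories.
expected = {"med_a": 0.5, "med_b": 0.3, "med_c": 0.2}
observed = {"med_a": 480, "med_b": 310, "med_c": 210}
statistic, df = chi_square_gof(observed, expected)
```

A small statistic relative to the degrees of freedom indicates that the synthetic counts are close to what the reference proportions predict; a very large statistic indicates a poor fit.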
RESULTS
Distributions of asthma medications in patients ages 0–5 years were compared to MEPS to determine how well the model fit. The Synthea population did not fit the MEPS distribution; however, the Synthea+MDT model fit well. Figure 3 illustrates the chi-square goodness-of-fit test, in which MEPS medication distributions differed significantly from Synthea distributions (χ² = 7168.52, df = 5, N = 14 410, P < .01).12 A separate Synthea cohort of similar population size and characteristics was run using Synthea+MDT, and its distributions did not differ significantly from MEPS (χ² = 2.73, df = 6, N = 13 906, P = .84).
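The P value reported for Synthea+MDT can be checked from the statistic and degrees of freedom alone. For an even number of degrees of freedom, the upper-tail probability of the chi-square distribution has a closed form, which the Python sketch below uses; the function name is illustrative. For odd degrees of freedom, such as the Synthea-only comparison, a library routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form holds only for positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Statistic and degrees of freedom reported for Synthea+MDT versus MEPS.
p_mdt = chi2_sf_even_df(2.73, 6)
```

Rounded to two decimals, `p_mdt` agrees with the reported P = .84.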
Figure 3.
MEPS, Current Synthea, and Synthea+MDT distributions.
MDT outputs were also reviewed for clinical validation. A sample of the output is illustrated in Figure 3. It was expected that 0% of patients should be prescribed products that are clinically inappropriate for children ages 0–5. For example, higher strengths of Qvar (beclomethasone) and Flovent (fluticasone) inhalers can reach levels that exceed the maximum dosage for pediatric patients—making them clinically inappropriate for children. The 0% values for these high strengths in MDT output (eg, beclomethasone 40 µg or fluticasone 220 µg) align with real-world prescribing patterns.
DISCUSSION
With MDT, a user can input a list of medication classes or ingredients and obtain distributions for clinically relevant medications based on real-world prescription data from MEPS. These distributions can be used as statistical inputs into Synthea. Increasing the accuracy, diversity, and complexity of these synthetic datasets leads to more realistic statistical properties and relationships in the data.1
Most importantly, this simple method for applying realistic statistical distributions of medications can be applied to other synthetic databases to generate realistic medication records. This makes MDT and its process especially useful to users outside of Synthea and to researchers without access to a large realistic database of medication distributions.
There were some limitations from using MEPS data. The most realistic data to use for medication distributions would most likely come from prescription claims databases such as pharmacy claim switch exchanges. However, these are proprietary databases and not open data for public use. While such data would most likely give more accurate medication distributions with Synthea, access would be extraordinarily limited, if possible at all.
While MEPS is a national survey, not all US states are represented in MEPS data. Leveraging these data forces users to extrapolate distributions from one state to represent another in Synthea, which could bias the created synthetic dataset to some degree. However, region-to-state mappings are available in MEPS, which allows users to account for region-level medication patterns based on the synthetic patient’s state of residence. Additionally, the reference population of MEPS represents noninstitutionalized patients. A person applying this method to a hospital or nursing home population would be creating a biased dataset.
While there is not a consensus on the best methods to validate the representativeness of synthetic data generation, the distributions of Synthea and Synthea+MDT were compared against MEPS data using a chi-square goodness-of-fit test.8,13 Comparing distributions of each group to MEPS showed that MDT improves Synthea’s validity to real-world distributions.
The accuracy of MEPS can affect the overall outcome of Synthea+MDT because distributions are extracted from this dataset. Because of the importance of MEPS to MDT’s function, it is important to examine how well MEPS represents real-world healthcare utilization trends. MEPS has been used in multiple publications and by various state and federal governments.14 A study from 2011 compared MEPS medication survey results with claims data for Medicare Part D patients. The authors found their validation sample to be reasonably accurate. There is one study that stated that medication expenditures were 10% lower than those found in national health expenditure accounts.15 Based on the validation of MEPS data’s population-level estimates and trends, the authors conclude that using MEPS as a data source is reasonable for use with MDT.
MEPS data also have a time bias that should be taken into account. MEPS data are released 2 years after data collection by the Agency for Healthcare Research and Quality (AHRQ).9 In this study, MEPS data from 2018 were used, while the research was conducted in 2021. This degree of data lag is standard for such a large scale of data collection, considering the time to compile and process. However, researchers applying this dataset should be aware that medication distributions will represent patterns from 2 years prior—not the current year. Additionally, new medications that enter the market will not appear in the generated synthetic data until 2 years after they are released.
Maintainability can also be an issue within healthcare datasets as healthcare trends change yearly. Any time a new medication comes to market, or a new drug study is published, prescribers may change their prescribing habits based on new evidence. Fortunately, MDT leverages regularly maintained datasets like MEPS, which releases new data annually.9 RxNorm (RxNav) and RxClass data are both updated monthly with each RxNorm release.16 These updated data sources allow researchers to update MDT medication distribution yearly. However, users must keep in mind the MEPS data lag prevents any new medication from presenting in MDT distributions as previously stated. For example, if MEPS updated their prescribed medication file in July and a new medication was released in November, the newly released medication would not show up in the distributions until 2 years later.17 There are multiple ways around this issue. One could use a pharmacy claim switch exchange with constantly updated distribution lists as previously mentioned. Another method would be to use data from a large electronic health record. A third option is to use a prescriber’s database for electronic prescriptions. However, these methods involve proprietary data sources that contain protected health information and are much more difficult to access.
Traditional analytic methods have been difficult and time-consuming to use with big healthcare datasets, creating a need among healthcare researchers for machine learning and artificial intelligence (AI).18 However, machine learning and AI in medicine have been somewhat limited and slow compared to other fields due to barriers that prevent sharing of patient data.19 The operational validity of MDT could allow healthcare researchers to conduct more research on patient medication data without requiring access to protected information. This is extremely important, as getting operationally valid data is probably one of the most important barriers in machine learning or AI with synthetic databases.4 If synthetic data are not realistic, then researchers start their projects with biased, inaccurate, and unreliable data. This barrier to realistic synthetic data is important as it has been estimated that up to 60% of data used in AI could be synthetic data.20 Operational validity of synthetic data also improves the scalability of machine learning and AI models. As deep learning can sometimes require substantial amounts of data to train models, synthetic datasets allow researchers to scale up that data rather than wait for additional data to become available.21
Research using MDT is not limited to machine learning or AI. MDT could also be used in health economics and outcomes research (HEOR) for certain populations.18 For example, since these synthetic data are keyed at the NDC level, it is possible to join other NDC-based datasets such as the National Average Drug Acquisition Cost dataset (NADAC).22 With a joined dataset such as NADAC, researchers can use MDT synthetic data to model medication costs and make inferences on how to improve costs for patients or payers. Currently, machine learning is less commonly used in HEOR studies.23 Perhaps using synthetic data, such as data produced with Synthea+MDT, would allow this area to expand.
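A join of this kind might look like the Python sketch below. It assumes that each synthetic prescription record carries an NDC and a quantity, and that a per-unit cost is available for each NDC; the record layout, the prices, and the function name `cost_by_patient` are all invented for illustration and do not reflect the actual NADAC file format.

```python
# Hypothetical synthetic prescription records keyed by NDC.
synthetic_fills = [
    {"patient_id": 1, "ndc": "NDC_1", "quantity": 2},
    {"patient_id": 1, "ndc": "NDC_2", "quantity": 1},
    {"patient_id": 2, "ndc": "NDC_3", "quantity": 1},
]

# Hypothetical per-unit acquisition costs keyed by NDC (made-up values).
unit_cost = {"NDC_1": 2.50, "NDC_2": 10.00}


def cost_by_patient(fills, unit_cost):
    """Sum estimated drug cost per patient; collect NDCs with no price match."""
    totals = {}
    unmatched = []
    for fill in fills:
        price = unit_cost.get(fill["ndc"])
        if price is None:
            unmatched.append(fill["ndc"])  # keep track instead of guessing a price
            continue
        patient = fill["patient_id"]
        totals[patient] = totals.get(patient, 0.0) + price * fill["quantity"]
    return totals, unmatched


totals, unmatched = cost_by_patient(synthetic_fills, unit_cost)
```

Tracking unmatched NDCs separately makes it visible how much of the synthetic population falls outside the pricing dataset, which matters when interpreting any cost estimate.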
Further research can also be done with MDT, as it connects with MEPS, which contains other patient survey data including insurance type, health status, language, and medical conditions.9 This could allow researchers to investigate questions about medication use trends among different subpopulations. For example, are there differences in the medication distribution of English speakers and non-English speakers? Do patients with different insurance types have different medication distributions? Researchers may also be able to model prescription fill and medication adherence over time from MDT with additional MEPS data. A more advanced feature could link medication trends to lab values. For example, one could potentially research a population of patients on levothyroxine where the synthetic patient’s dosage corresponds to T3/T4 lab values. However, further work will need to be done to make sure that these variables link validly using one of the many different statistical weighting methods to their respective data points.24 A dataset with incorrect statistical weighting would show very little or no relationships between levothyroxine dose and T3/T4 lab values or could show relationships where none exist with other lab tests and medications.
A companion database of common disease states and linked medication names could enhance the usability of Synthea+MDT. In the absence of this, a Synthea developer would still need additional clinical input to determine the appropriate products for medications that overlap multiple indications.
Other future modifications could also be made which would make MDT more robust. One could use a database, such as the drug–drug interaction database provided by the NLM, to include drug–drug interactions. This would allow Synthea to mimic real-world patients who have drug interactions that need to be addressed.25
CONCLUSION
Prior to integration of MDT, Synthea alone was not able to create realistic medication datasets and required manual data inputs to achieve better realism. MEPS is a usable open data source which can be used to calculate medication distributions and improve realism of synthetic data generators like Synthea. MDT links MEPS with RxNorm (RxNav) and RxClass. By combining these datasets, MDT allows anyone to use Synthea to create realistic patient medication datasets with minimal maintenance.
The results of this study validate that MDT improved the degree to which the Synthea asthma module represents prescribed medications in the United States. The diversity of medication products in Synthea+MDT produced a more complex and realistic population that mimics real-world medication distributions.
Further enhancements and work should be done to improve upon our tool to increase the realism of Synthea’s medication data and unlock further areas of research.
ACKNOWLEDGMENTS
This work was completed as a submission for a data challenge. The data challenge was organized by Clinovations on behalf of the Office of the National Coordinator for Health Information Technology. Funding for the data challenge was provided through the Synthetic Health Data Generation to Accelerate PCOR project, which was supported by the Office of the Secretary Patient-Centered Outcomes Research Trust Fund.
Team CodeRx was awarded a $40 000 prize for their work on MDT.26 Team CodeRx for this project includes Kent Bridgeman, Yevgeny Bulochnik, Dalton Fabian, Robert Hodges, Kristen Tokunaga, and Joseph LeGrand. The authors would like to thank Kent Bridgeman, Yevgeny Bulochnik, and Dalton Fabian from CodeRx for their work to create MDT.
CodeRx is a community of hundreds of (mostly) pharmacists with an interest or skillset in tech. One of the goals of CodeRx is to apply our skills toward building useful healthcare-related projects (like MDT) using open-source tools and our domain expertise in pharmacy and the medication use process. More information about CodeRx can be found at https://coderx.io/.
The authors would also like to thank reviewers Jason Walonoski, Emily Mitchell, and John Poikonen.
Finally, the authors acknowledge and thank the MITRE Synthea team for their open-source tool, which has enabled us to generate synthetic data for research and applications of MDT.
Contributor Information
Robert Hodges, Utica University, Charlotte, North Carolina, USA.
Kristen Tokunaga, Data Product Management, Komodo Health, San Francisco, California, USA.
Joseph LeGrand, HealthIT, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
FUNDING
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
AUTHOR CONTRIBUTIONS
All authors have contributed substantially to this research and article. RH, KT, and JL conceived the work. KT and JL designed the work and acquired the data. RH, KT, and JL analyzed and interpreted the data. RH and KT drafted the work, and RH and JL reviewed the work critically for important intellectual content. RH, KT, and JL reviewed the version to be published, with RH leading review for final approval and corresponding author requirements. RH, KT, and JL agree to be accountable for all aspects of the work.
SUPPLEMENTARY MATERIAL
Supplementary material is available at https://github.com/coderxio/medication-diversification.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
The data underlying this article are available in the article and in its online supplementary material.
REFERENCES
- 1. Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales A. Generation and evaluation of synthetic patient data. BMC Med Res Methodol 2020; 20 (1): 108.
- 2. Dilmegani C. Synthetic Data for Healthcare: Benefits & Case Studies in 2021. AIMultiple; 2021. https://research.aimultiple.com/synthetic-data-healthcare/. Accessed November 23, 2021.
- 3. Synthetic Health Data Generation to Accelerate Patient-Centered Outcomes Research. Synthetichealth.github.io.; 2021. https://www.healthit.gov/topic/scientific-initiatives/pcor/synthetic-health-data-generation-accelerate-patient-centered-outcomes. Accessed November 23, 2021.
- 4. MacLachlan S. Realism in Synthetic Data Generation [Thesis]. 2017; 143–6. https://mro.massey.ac.nz/bitstream/handle/10179/11569/02_whole.pdf. Accessed November 23, 2021.
- 5. Tokunaga K. Creating Realistic Synthetic Rx Data with Open-Source Tools. AMIA Clinical Informatics Conference 2022 #CIC22; 2022.
- 6. Walonoski J, Kramer M, Nichols J, et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc 2018; 25 (3): 230–8.
- 7. Hogan AD, Mahr TA. Update on pediatric asthma treatment options, doses, label changes. AAP News; 2021. https://www.aappublications.org/news/2020/07/01/focusasthma070120. Accessed November 23, 2021.
- 8. RxNav. RxNav; 2021. https://lhncbc.nlm.nih.gov/RxNav/. Accessed November 23, 2021.
- 9. Medical Expenditure Panel Survey Home. Meps.ahrq.gov.; 2021. https://www.meps.ahrq.gov/mepsweb/. Accessed November 23, 2021.
- 10. RxClass. 2015. Mor.nlm.nih.gov. https://mor.nlm.nih.gov/RxClass/#. Accessed February 25, 2021.
- 11. 1.3.5.15. Chi-Square Goodness-of-Fit Test. Itl.nist.gov.; 2021. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm. Accessed November 23, 2021.
- 12. Kleijnen JPC. Statistical validation of simulation models. Eur J Oper Res 1995; 87 (1): 21–34.
- 13. Chen J, Chun D, Patel M, Chiang E, James J. The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures. BMC Med Inform Decis Mak 2019; 19 (1): 44. doi: 10.1186/s12911-019-0793-0
- 14. Cohen JW, Cohen SB, Banthin JS. The Medical Expenditure Panel Survey. Med Care 2009; 47 (7 Suppl 1): S44–50.
- 15. Hill SC, Zuvekas SH, Zodet MW. Implications of the accuracy of MEPS prescription drug data for health services research. Inquiry 2011; 48 (3): 242–59.
- 16. RxNav Applications. 2015. Lhncbc.nlm.nih.gov. https://lhncbc.nlm.nih.gov/RxNav/applications/RxNavFAQ.html. Accessed October 16, 2022.
- 17. Medical Expenditure Panel Survey Data Release Schedule. n.d. https://meps.ahrq.gov/mepsweb/about_meps/faq_results.jsp?ChooseTopic=All+Categories&keyword=&Submit2=Search. https://meps.ahrq.gov/mepsweb/about_meps/releaseschedule.jsp. Accessed October 27, 2022.
- 18. Crown WH. Potential application of machine learning in health outcomes research and some statistical cautions. Value Health 2015; 18 (2): 137–40.
- 19. Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol 2020; 20: 1. doi: 10.1186/s12874-020-00977-1
- 20. White A. By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. 2021. https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/. Accessed October 27, 2022.
- 21. Nava V. Machine Learning Requires Big Data. 2021. https://www.qubole.com/blog/machine-learning-requires-big-data. Accessed October 27, 2022.
- 22. Pharmacy Pricing. Medicaid.gov; United States Government; 2022. https://www.medicaid.gov/medicaid/prescription-drugs/pharmacy-pricing/index.html. Accessed October 27, 2022.
- 23. Lee W, Schwartz N, Bansal A, et al. A scoping review of the use of machine learning in health economics and outcomes research: part 2—data from nonwearables. Value Health 2022; 25 (12): 2053–61.
- 24. How Different Weighting Methods Work. Pew Research Center Methods; 2018. https://www.pewresearch.org/methods/2018/01/26/how-different-weighting-methods-work/. Accessed November 20, 2022.
- 25. findDrugInteractions—Drug Interaction API. n.d. https://lhncbc.nlm.nih.gov/RxNav/APIs/api-Interaction.findDrugInteractions.html. Accessed November 20, 2022.
- 26. HHS Announces Synthetic Health Data Challenge Winners. HHS.gov; 2021. https://www.healthit.gov/sites/default/files/page/2021-10/20211019_SHD%20Winning%20Solutions%20Webinar%20Materials_compressed.pdf. Accessed November 23, 2021.