Abstract
The use of electronic health records (EHRs) has garnered interest as an approach for conducting innovative outcome research and producing real-world evidence at a reduced cost compared to traditional clinical trials. This study aimed to evaluate the utility of deidentified EHR data from a multicenter research network to identify characteristics associated with treatment escalation (TE) in newly diagnosed pediatric ulcerative colitis patients. EHR data (01/2010-12/2021) from 13 Midwest healthcare systems (Greater Plains Collaborative) were collected for pediatric ulcerative colitis patients. We identified standard treatments, excluded patients with missing initial therapy data, and analyzed the TE and time-to-TE outcomes. The clinical and laboratory characteristics at baseline were extracted. Logistic and Cox models were used, and the missing risk factors were imputed. Machine-learning Bayesian additive regression trees were also utilized to create partial dependence plots for assessing the associations between risk factors and clinical outcomes. A total of 502 eligible pediatric patients (aged 4–17 years) who initiated standard treatment were identified. Among them, 205 out of 502 (41%) experienced TE, with a median (P25, P75) duration of 63 (9, 237) days after the initial treatment. Additionally, 20 out of 502 (4%) patients underwent colectomy (COL) with a median (P25, P75) duration of 80 (3, 205) days. Both multivariable logistic regression and Cox proportional hazards regression demonstrated moderate discriminative power in predicting TE and time-to-TE, respectively. Common positive predictors for both TE and time-to-TE included a high monocyte proportion and elevated platelet counts. Conversely, BMI z-score, albumin, hemoglobin levels, and lymphocyte proportion were negatively associated with both TE and time-to-TE.
This study demonstrates that multicenter EHR data can be used to identify a trial-comparable study sample of potentially larger size and to identify clinically meaningful endpoints for conducting outcome analysis and generating real-world evidence.
Keywords: Bayesian additive regression trees, clinical outcome, Cox proportional hazards regression, logistic regression, pediatric patients, ulcerative colitis
1. Introduction
Pediatric ulcerative colitis (UC) is a chronic inflammatory bowel disease affecting children and adolescents. It causes inflammation and ulcers in the colon and rectum, with symptoms that include diarrhea, abdominal pain, bloody stools, weight loss, fatigue, and poor appetite. Treatment focuses on inducing and maintaining remission with medications such as aminosalicylates, corticosteroids, immunosuppressants, and biologics; surgery may be necessary in some cases. Pediatric UC patients are generally considered difficult to treat because of the high variability of responses to initial therapies.[1–3] PROTECT is the largest study in pediatric UC with standardized treatments. Data from this recent clinical trial offer recommendations for evaluating how children who have recently been diagnosed with UC respond to standardized initial therapy and for identifying factors that can predict their treatment response.[4,5]
Electronic health records (EHRs) are digital versions of patients' medical histories. They capture a broad spectrum of information, including the patient's medical background, diagnoses, treatment regimens, prescribed medications, and laboratory findings. These records are created, updated, and maintained by healthcare providers and institutions, including hospitals, clinics, and healthcare practices. EHR data are used to streamline healthcare processes, enhance patient care, reduce errors, and give healthcare professionals efficient and secure access to critical patient information. They also play a crucial role in medical research and public health efforts. Widespread EHR adoption has increased interest in using real-world data to conduct novel outcome research and generate real-world evidence at a lower cost than traditional clinical trials.[6] The National Patient-Centered Clinical Research Network (PCORnet) began with an initial $100 million investment from the Patient-Centered Outcomes Research Institute, aiming to conduct clinical and translational research faster and at lower cost by harnessing the power of "real-world" EHR data. The establishment of PCORnet was guided by a set of aims and requirements for each clinical research network, and one of the highest priorities was for EHR data from the different healthcare organizations and systems within each network to be harmonized and interoperable using the PCORnet common data model (CDM).
However, there are many concerns about the fitness-for-use of EHR data to conduct clinical research and support clinical decision support systems.[7] The complexity of aggregation, acquisition, and processing of EHR data to be ready for secondary use and harmonizing data from different clinical sites create many quality issues that may impact the validity of research results and affect stakeholders’ decisions. In this article, we evaluated the utility of routinely collected EHR data from a PCORnet network to identify clinical, demographic, and lab characteristics that are associated with increased risk of treatment escalation (TE) in children newly diagnosed with UC.
2. Methods
2.1. Data source
The Greater Plains Collaborative (GPC) is one of the PCORnet Clinical Data Research Networks and includes 13 medical centers ("sites") across 8 states. The total catchment population in the GPC network is over 34 million patients.[8,9] The GPC Reusable Observable Unified Study Environment (GROUSE) is the GPC centralized data enclave that unifies multi-site EHR data in PCORnet CDM format and creates a unique data resource and environment to facilitate large-scale observational studies.[10] GROUSE also links EHR data with claims data from the Centers for Medicare & Medicaid Services. Annually, EHRs and billing information from all GPC sites are integrated with these insurance claims in the GROUSE analytic environment, which generates interoperable deidentified databases in the PCORnet CDM format. This makes GPC a comprehensive data source for conducting clinical research using healthcare and insurance claims data. To study GPC-based pediatric UC, we requested data from the clinical sites according to the Institutional Review Board protocol and received approval from 10 sites. The established GROUSE Institutional Review Board protocol includes a data sharing agreement and provides deidentified data from all sites. A single dataset containing patients from all sites was obtained from GROUSE. The use of these deidentified data for this project was determined to be nonhuman subject research.
2.2. Study cohort and covariate selection
The eligible pediatric population with UC was selected using International Classification of Diseases version 9 (ICD9: 556.XX) and version 10 (ICD10: K51.XX) codes for UC at the participating sites from 2010 to 2020, excluding chronic proctitis (ICD9: 556.2X, 556.4X; ICD10: K51.2X, K51.4X). Children aged 4–17 years with the above diagnoses were eligible for the study. Standard treatment regimens of mesalamine or oral/IV corticosteroids, use of biologics and/or immunomodulators (IM), as well as colectomy (COL), were identified using a curated list of RXNORM and CPT codes (Supplemental Table 1, http://links.lww.com/MD/L893). We excluded patients without an explicit indication of initial therapy (i.e., initiation of standard treatment regimens), because patients rarely skip the standard treatments and the lack of that indication most likely reflects measurement bias in the data. Demographic characteristics and relevant clinical factors (height, weight, BMI, and hospitalization), as well as laboratory characteristics (hemoglobin, serum albumin, erythrocyte sedimentation rate, C-reactive protein, platelet count, total white blood cell count, and differential) at baseline, were extracted based on domain knowledge using ICD and LOINC codes (see Supplemental Table 2, http://links.lww.com/MD/L886). The BMI z-score was calculated to express an individual's BMI relative to a reference population of the same age and sex, as the number of standard deviations by which the BMI deviates from the reference mean.
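The diagnosis-code and age criteria above can be sketched in a few lines. This is only an illustration, not the study's extraction code: the function names and the record layout are hypothetical, and it assumes a patient qualifies when at least one diagnosis code falls in the UC groups and outside the excluded subgroups.

```python
import re

# Code groups taken from the text: UC is ICD9 556.XX or ICD10 K51.XX,
# excluding 556.2X, 556.4X, K51.2X and K51.4X.
UC_CODE = re.compile(r"^(556|K51)(\.\w+)?$")
EXCLUDED = re.compile(r"^(556|K51)\.[24]\w*$")


def is_eligible_uc_code(code: str) -> bool:
    """Return True for a UC diagnosis code that is not in an excluded subgroup."""
    code = code.strip().upper()
    return bool(UC_CODE.match(code)) and not EXCLUDED.match(code)


def is_eligible_patient(codes, age_at_first_uc_dx) -> bool:
    """Apply the code filter and the 4-17 year age window (hypothetical layout)."""
    has_uc = any(is_eligible_uc_code(c) for c in codes)
    return has_uc and 4 <= age_at_first_uc_dx <= 17


# Hypothetical patients: (diagnosis codes, age at first UC diagnosis).
patients = [(["K51.90"], 12), (["K51.211"], 10), (["556.9", "K50.90"], 3)]
print([is_eligible_patient(codes, age) for codes, age in patients])
```

How to treat a patient who carries both an eligible and an excluded code is a design choice that the text does not specify; the sketch keeps such a patient.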
2.3. Endpoint selection and ascertainment
The TE outcome was defined as a composite endpoint of receiving biologics and/or IM and/or COL after the initial standard treatment. We considered 2 types of outcomes for analysis: If-to-TE (ITE)—a dichotomous outcome indicating whether the patients progressed to TE (y = 1 if they progressed to TE; y = 0 otherwise); and Time-to-TE (TTE)—a time-to-event outcome measured from the initial UC diagnosis to the TE endpoint or censoring. Figure 1 illustrates the endpoints in the study design.
Figure 1.
Retrospective Cohort study design.
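As a rough illustration of how the two outcomes defined in Section 2.3 could be derived for one patient, the sketch below computes ITE and TTE from event dates. The function name, arguments, and censoring at last follow-up are assumptions made for the example; the time origin is passed in explicitly because it is a study design choice.

```python
from datetime import date


def derive_endpoints(start_date, escalation_dates, last_followup):
    """Return (ite, tte_days) for one patient.

    start_date       : time origin of follow-up
    escalation_dates : dates of biologic, IM, or colectomy events (may be empty)
    last_followup    : censoring date used when no escalation is observed
    """
    # Only escalation events on or after the time origin count toward TE.
    after_start = [d for d in escalation_dates if d >= start_date]
    if after_start:
        first_te = min(after_start)
        return 1, (first_te - start_date).days
    # No escalation observed: the patient is censored at last follow-up.
    return 0, (last_followup - start_date).days


# Hypothetical patient who escalates once, and one who never escalates.
print(derive_endpoints(date(2015, 1, 1), [date(2015, 3, 5)], date(2016, 1, 1)))
print(derive_endpoints(date(2015, 1, 1), [], date(2016, 1, 1)))
```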
2.4. Statistical analysis
We initiated the fitness-for-use analysis by conducting univariate comparisons of all baseline demographic and clinical characteristics between the retrospective EHR cohort and the prospective pediatric UC cohort from the PROTECT Study. Categorical assessments were analyzed using a χ2 test, and continuous assessments were evaluated with a t test.
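For readers who want to see what a categorical comparison of this kind involves, the following minimal sketch applies a Pearson χ2 test to a 2 × 2 table using the hospitalization counts from Table 1. It assumes no continuity correction, which may differ from the variant used in the study, so the printed P value is illustrative only.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hospitalized at baseline: GPC 135 of 502, PROTECT 166 of 428 (Table 1).
stat, p_value = chi_square_2x2(135, 502 - 135, 166, 428 - 166)
print(f"chi-square = {stat:.2f}, P = {p_value:.2g}")
```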
To assess associations, we adopted an imputation-based approach to address missing data. We generated 100 multiple imputations and combined the results using Rubin rules[11] to evaluate univariate factor associations with TE. Risk factors with P values <.1 in univariable logistic regression (LR) were selected for multivariable LR. We reported odds ratios for both univariable and multivariable models. For model performance assessment, we used the area under the receiver operating characteristic curve (AUC) for ITE. Sensitivity, specificity, positive predictive value, and negative predictive value were computed at a probability cutoff of 0.5. We combined point estimates and their corresponding 95% confidence intervals from multiple imputations using Rubin rules. In the case of TTE for each imputed dataset, hazard ratios were computed from both univariable and multivariable Cox proportional hazards models. We evaluated the C-index for model performance and again used Rubin rules to combine results.
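The pooling step can be made concrete with a short sketch of Rubin's rules. The function name and input values are hypothetical, and the interval uses a normal approximation rather than the t reference distribution with Rubin's degrees of freedom, which is a simplification that matters less when the number of imputations is large.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances, z=1.96):
    """Pool m estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # within-imputation variance
    b = variance(estimates)          # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b  # total variance
    se = math.sqrt(total)
    return q_bar, se, (q_bar - z * se, q_bar + z * se)


# Hypothetical log odds ratios and squared standard errors from 3 imputed datasets.
log_or = [0.060, 0.070, 0.065]
se2 = [0.0008, 0.0009, 0.0008]
est, se, (lower, upper) = pool_rubin(log_or, se2)
print(f"pooled OR = {math.exp(est):.3f} ({math.exp(lower):.3f}, {math.exp(upper):.3f})")
```

Pooling is done on the log scale and exponentiated afterward, because odds ratios and hazard ratios are closer to normally distributed on that scale.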
To develop more accurate and robust predictive models, we employed nonparametric machine learning through Bayesian additive regression trees (BART).[12] BART models the response variable as the sum of predictions from individual regression trees, allowing for nonlinear relationships and interactions between predictors. It incorporates a Bayesian approach, integrating prior knowledge with observed data, quantifying uncertainty, estimating parameters, and making predictions. BART employs Markov Chain Monte Carlo (MCMC) methods for posterior distribution sampling. In each imputed sample, we used BART to model ITE with a probit BART model consisting of 50 trees and default priors as recommended.[12] We used a burn-in of 100 draws and saved every 10th sample from the MCMC chain, resulting in 1000 draws from the posterior distributions for the target function given risk factors. We compared the mean AUC values from all posterior distributions with LR using the DeLong test.[13] The average P value from all imputation samples was reported. Summary statistics of AUC were pooled from all posterior distributions across all imputed samples. To summarize the marginal effect of risk factors, we used partial dependence plots, averaging over the others.[14] For each imputed sample, the BART survival model was used to model TTE within the framework of discrete-time survival analysis without assuming proportional hazards.[15] The time scale was coarsened to months to reduce computational complexity. The model included 50 trees and default priors, with a burn-in of 250 draws, saving every 10th sample from the MCMC chain, resulting in 1000 draws from the posterior distributions for the survival function given risk factors. Summary statistics of the C-index were pooled from all posterior distributions and discrete times across imputed samples. Partial dependence survival functions were computed to summarize the marginal effect of risk factors.
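The idea behind a partial dependence curve can be shown independently of BART. In the sketch below, `predict` stands in for any fitted model (here a made-up toy function, not a BART fit): one risk factor is fixed at each grid value in every row, and the predictions are averaged over the observed values of the remaining factors.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        # Overwrite the feature of interest; keep all other features as observed.
        modified = [{**row, feature: value} for row in rows]
        preds = [predict(r) for r in modified]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted model's predicted probability."""
    return min(1.0, max(0.0, 0.6 - 0.08 * row["albumin"] + 0.0005 * row["platelets"]))


# Hypothetical baseline records.
rows = [
    {"albumin": 3.2, "platelets": 420},
    {"albumin": 4.1, "platelets": 310},
    {"albumin": 3.8, "platelets": 505},
]
print(partial_dependence(toy_model, rows, "albumin", [3.0, 3.5, 4.0, 4.5]))
```

With a Bayesian model, the same averaging would be repeated for each posterior draw, and the posterior mean and credible interval of the curve would be summarized across draws.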
Statistical significance was determined at a P value of <.05. The analyses were performed using R version 4.3.0 (April 14, 2023) and the BART package version 2.9.4 with default configurations.
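The discrete-time formulation used for the BART survival model can be illustrated by the usual person-period expansion, in which each patient contributes one binary record per distinct time point at risk. The sketch below is a generic version of that construction; the month conversion (30.44-day months, rounded up) is an assumed convention for the example, not necessarily the one used in the analysis.

```python
import math


def days_to_months(days):
    """Coarsen days to whole months (assumed 30.44-day months, rounded up, minimum 1)."""
    return max(1, math.ceil(days / 30.44))


def to_person_period(times, events):
    """Expand (time, event) pairs into one binary row per subject per time at risk."""
    grid = sorted(set(times))  # distinct observed discrete times
    rows = []
    for subject, (t_i, d_i) in enumerate(zip(times, events)):
        for t in grid:
            if t > t_i:
                break
            # y is 1 only at the time point where the subject's event occurs.
            y = 1 if (d_i == 1 and t == t_i) else 0
            rows.append({"subject": subject, "time": t, "y": y})
    return rows


# Hypothetical follow-up in days with event indicators (1 = TE, 0 = censored).
months = [days_to_months(d) for d in (20, 80, 45)]
for row in to_person_period(months, [1, 0, 1]):
    print(row)
```

A binary model fitted to the expanded rows, with time included as a predictor, estimates the discrete hazard, and the survival function follows as the running product of one minus the hazard. Coarsening days to months shrinks the time grid and therefore the number of expanded rows.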
3. Results
We began with a cohort of 2102 pediatric patients who had at least one eligible ICD code for UC (Fig. 2). Of these, 1614 were aged between 4 and 17 years at their initial UC diagnosis. After applying the inclusion and exclusion criteria, we narrowed the cohort to 502 eligible pediatric UC patients who started with standard treatment and for whom, based on domain knowledge, a more complete treatment trajectory could be observed. The largest reduction in sample size was due to the exclusion of patients who did not have an observable initial treatment (844 out of 1613). The number of patients per site ranged from 14 to 133, with a mean of 50 and a standard deviation of 22.6.
Figure 2.
CONSORT diagram with patient inclusion and exclusion.
Demographic and clinical characteristics are presented in Table 1. The mean age was 13.2 years (SD 3.4); 273 of 502 patients (54%) were female, and 406 of 492 patients with race data (83%) identified as white. At baseline, 135 patients (27%) had been hospitalized. Race and hospitalization status differed significantly across sites (P < .001), whereas the other characteristics in Table 1 did not.
Table 1.
Demographic and clinical characteristics comparison between GPC retrospective cohort and PROTECT prospective cohort.
| Characteristic | GPC retrospective | PROTECT | P value |
|---|---|---|---|
| N | 502 | 428 | |
| Age at initial UC, years (mean ± SD) | 13.2 ± 3.4 | 12.7 ± 3.3 | .02 |
| Age ≥ 12 years, n (%) | 364 (73%) | 289 (68%) | .09 |
| Female, n (%) | 273 (54%) | 212 (50%) | .22 |
| Nonwhite, n/N (%) | 86/492 (17%) | 69/420 (16%) | .69 |
| Hispanic/Latino, n/N (%) | 48/493 (10%) | 38/424 (9%) | .61 |
| Weight z-score (mean ± SD) | 0 ± 1.4 | −0.1 ± 1.2 | .24 |
| Height z-score (mean ± SD) | −0.2 ± 1.3 | 0.1 ± 1.0 | <.001 |
| BMI z-score (mean ± SD) | 0 ± 1.4 | −0.2 ± 1.3 | .02 |
| Hospitalized at baseline, n (%) | 135 (27%) | 166 (39%) | <.001 |
Summary statistics for the PROTECT study were reported previously.[4] The comparison between the EHR cohort and the PROTECT cohort is new. Compared with the PROTECT cohort (Table 1), our data were largely comparable, except for a lower rate of hospitalization at baseline. Missing baseline laboratory data varied from 139 (28%) of 502 patients for leukocyte counts to 476 (95%) for calprotectin. Additionally, 180 (36%) patients were missing BMI z-scores. TE occurred in 205 out of 502 (41%) patients at a median of 63 days (P25: 9, P75: 237) after initial treatment, while 20 out of 502 (4%) underwent COL at a median of 80 days (P25: 3, P75: 205). TTE ranged from 0 to 4135 days, with a median of 350 days.
Table 2 displays the LR models that illustrate the associations between baseline characteristics and TE. Specifically, higher values of monocyte and platelet count were linked to worse outcomes, while higher values of lymphocytes were associated with better outcomes. Additionally, higher values of BMI z-score and albumin were linked to better outcomes in the univariable LR models. Further analysis revealed that age, sex, and ethnicity were not associated with the outcome. The multivariable LR model demonstrated moderate predictive power for ITE, with an AUC of 0.667 (95% CI: 0.606–0.723). The probit BART model exhibited slightly better predictive performance, with an AUC of 0.690 (95% CI: 0.653–0.736) and a P value of <.0001 (Fig. 3). In Figure 4, associations between the probability of ITE and clinical and laboratory values are shown. Notably, there was a weakly decreasing probability of ITE with increasing BMI z-score or hemoglobin levels. The decreasing associations were more pronounced for albumin and lymphocyte levels, while the opposite trend was observed for monocyte levels and platelet count.
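The evaluation metrics reported for ITE can be computed from predicted probabilities and observed outcomes as sketched below. The data are made up, and treating a probability exactly equal to the 0.5 cutoff as a positive prediction is an assumption. The AUC is computed from its rank interpretation: the probability that a randomly chosen patient with TE receives a higher predicted probability than a randomly chosen patient without TE.

```python
def auc(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(scores, labels, cutoff=0.5):
    """Sensitivity, specificity, PPV and NPV when predicting 1 for scores >= cutoff."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    tn = sum(1 for s, y in pairs if s < cutoff and y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical predicted probabilities of TE and observed outcomes.
scores = [0.81, 0.62, 0.47, 0.35, 0.58, 0.22]
labels = [1, 0, 1, 0, 1, 0]
print(auc(scores, labels))
print(classification_metrics(scores, labels))
```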
Table 2.
Logistic regression models of ITE.
| | Univariable LR | Multivariable LR | BART |
|---|---|---|---|
| Baseline predictors, OR (95% CI) | | | |
| BMI z-score | 0.872 (0.743, 1.026), P = .098 | 0.914 (0.770, 1.087), P = .308 | |
| Albumin (g/dL) | 0.731 (0.531, 1.005), P = .054 | 0.938 (0.653, 1.347), P = .727 | |
| Hemoglobin (g/dL) | 0.879 (0.794, 0.973), P = .013 | 0.924 (0.827, 1.032), P = .161 | |
| Monocyte (%) | 1.048 (0.996, 1.103), P = .069 | 1.069 (1.011, 1.129), P = .019 | |
| Lymphocyte (%) | 0.979 (0.963, 0.995), P = .010 | 0.975 (0.958, 0.993), P = .008 | |
| Platelet count (10^9/L) | 1.002 (1.001, 1.004), P = .008 | 1.002 (1.000, 1.003), P = .029 | |
| Model evaluation | | | |
| AUC | | 0.667 (0.606, 0.723) | 0.696 (0.653, 0.736) |
| Comparison of AUC (multivariable LR vs BART) | | P < .0001 | |
| Sensitivity | | 0.382 (0.276, 0.489) | 0.484 (0.351, 0.607) |
| Specificity | | 0.802 (0.747, 0.858) | 0.769 (0.680, 0.852) |
| Positive predictive value | | 0.583 (0.486, 0.680) | 0.605 (0.548, 0.667) |
| Negative predictive value | | 0.586 (0.642, 0.699) | 0.674 (0.636, 0.714) |
Figure 3.
ROC for TE from multivariable LR and BART.
Figure 4.
Partial dependence probabilities with posterior mean and 95% credibility intervals.
Table 3 presents the PR models of baseline characteristics associated with TTE. The results were consistent with the LR models in terms of associations and model predictions. The survival BART model achieved a moderate C-index of 0.703 (95% CI: 0.677–0.730), whereas the PR model yielded a C-index of 0.630 (95% CI: 0.582–0.676). However, this comparison should be interpreted with caution because the 2 models use different time representations: discrete time for BART and continuous time for PR. Figure 5 illustrates the relationships between clinical and laboratory values and the probability of remaining free of TE. The probability of TE-free survival increased with rising albumin, hemoglobin, and lymphocyte levels, while the reverse pattern was observed for monocyte levels and platelet count. The relationship between BMI z-score and the probability of TE-free survival was more complex, with both decreasing and increasing trends.
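For reference, the C-index generalizes the AUC to censored time-to-event data. The sketch below implements Harrell's version in its simplest form, under the assumptions that a higher risk score means earlier expected TE and that pairs with tied times are skipped; the inputs are hypothetical, and this is not the code used to produce the reported values.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs where the higher-risk subject fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's event or censoring time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up times (days), event indicators, and model risk scores.
times = [63, 237, 350, 900, 1200]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.5, 0.3, 0.1]
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported C-index values of 0.630 and 0.703 are read.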
Table 3.
Proportional hazards regression models of TTE.
| | Univariable PR | Multivariable PR | BART |
|---|---|---|---|
| Baseline predictors, HR (95% CI) | | | |
| BMI z-score | 0.917 (0.825, 1.019), P = .105 | 0.932 (0.827, 1.049), P = .239 | |
| Albumin (g/dL) | 0.740 (0.581, 0.942), P = .015 | 0.879 (0.675, 1.145), P = .338 | |
| Hemoglobin (g/dL) | 0.906 (0.838, 0.980), P = .014 | 0.943 (0.867, 1.026), P = .170 | |
| Monocyte (%) | 1.031 (0.997, 1.067), P = .073 | 1.041 (1.006, 1.078), P = .023 | |
| Lymphocyte (%) | 0.984 (0.972, 0.997), P = .014 | 0.984 (0.971, 0.997), P = .017 | |
| Platelet count (10^9/L) | 1.0015 (1.0004, 1.0026), P = .007 | 1.001 (1.000, 1.002), P = .026 | |
| Model evaluation | | | |
| C-index | | 0.630 (0.582, 0.676) | 0.703 (0.677, 0.730) |
Figure 5.
Partial dependence survival functions with posterior mean and 95% credibility intervals.
4. Discussion
The results using EHRs align with the prospective PROTECT study, where higher albumin and hemoglobin values were associated with better outcomes, while higher platelet counts at baseline were associated with worse outcomes.[4,5] The partial dependence plots with the BART model for the binary outcome TE demonstrate consistent associations, similar to LR. Similarly, the partial dependence plots with the BART model for the TTE outcome display mostly consistent but more complex associations compared to PR.
This study has several strengths. Firstly, GPC data is derived from real-world clinical settings, offering valuable insights into the management and treatment of pediatric UC in practice. Secondly, the GPC includes a relatively large sample size of 502, exceeding the PROTECT study’s sample size of 428. A larger sample size supports robust statistical analyses and enhances the generalizability of research findings. Thirdly, EHRs provide a cost-effective analysis opportunity compared to the more expensive and time-consuming data collection process in the PROTECT study. Finally, implementing machine learning BART models can capture complex and nonlinear relationships in the data. BART provides a posterior distribution over model parameters, allowing for a more comprehensive understanding of model uncertainty.
There are several limitations to our study. First, we observed heterogeneous missing-data patterns across different EHR systems and an absence of trial-specific endpoints. In the prospective PROTECT study, the primary outcome was defined as achieving corticosteroid-free remission at week 52, which was characterized by a Pediatric Ulcerative Colitis Activity Index (PUCAI) score <10 without corticosteroid use for at least 4 weeks immediately before week 52.[5] The PUCAI score is a validated tool for evaluating disease activity in pediatric patients with UC. It considers various clinical parameters, including stool frequency, rectal bleeding, abdominal pain, and general well-being, and combines them to calculate a numerical score. However, PUCAI scores were not consistently used in practice (and thus were missing from the GPC dataset), which necessitated the use of a different real-world endpoint (i.e., TE). Additionally, it is challenging to evaluate medication adherence using EHR data alone, and nonadherence has been found to be highly associated with TE.[16] Although Centers for Medicare & Medicaid Services claims linked to EHR data were also available on GROUSE, their acquisition would require additional effort and funding from the participating hospitals and could be pursued as a future research project.
5. Conclusions
This study demonstrates that multicenter EHR data can be used to identify a trial-comparable study sample of potentially larger size and to define clinically meaningful endpoints for conducting outcome analysis and generating real-world evidence. Utilizing EHR data for certain health conditions in pediatric studies can be valuable, particularly in addressing recruitment challenges in large clinical research studies involving children.
Author contributions
Conceptualization: Zhu Wang, Xing Song, Jeffrey S. Hyams, Lee A. Denson.
Data curation: Zhu Wang, Xing Song.
Formal analysis: Zhu Wang, Xing Song, Jeffrey S. Hyams.
Funding acquisition: Zhu Wang, Xing Song, Jeffrey S. Hyams.
Investigation: Zhu Wang, Xing Song, Jeffrey S. Hyams, Lee A. Denson.
Methodology: Zhu Wang, Xing Song, Jeffrey S. Hyams.
Project administration: Zhu Wang, Xing Song.
Resources: Zhu Wang, Xing Song.
Software: Zhu Wang, Xing Song.
Supervision: Zhu Wang, Xing Song.
Validation: Zhu Wang, Xing Song.
Visualization: Zhu Wang, Xing Song.
Writing—original draft: Zhu Wang, Xing Song.
Writing—review & editing: Zhu Wang, Xing Song, Lemuel R. Waitman, Jeffrey S. Hyams, Lee A. Denson.
Supplementary Material
Abbreviations:
- AUC: area under the (receiver operating characteristic) curve
- BART: Bayesian additive regression trees
- BMI: body-mass index
- CDM: common data model
- EHR: electronic health records
- ITE: if-to-treatment escalation
- LR: logistic regression
- P25: 25th percentile
- P75: 75th percentile
- PCORnet: National Patient-Centered Clinical Research Network
- PR: proportional hazards regression
- TE: treatment escalation
- TTE: time-to-treatment escalation
- UC: ulcerative colitis
Human and animal experiments are not involved in this paper. This research has been approved by the Institutional Review Board of University of Tennessee Health Science Center. All methods were carried out in accordance with relevant guidelines and regulations.
This research was supported by grant R21DK130006 (ZW, XS and JSH), from the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health. The datasets used for the analyses described were obtained from the Greater Plains Collaborative, which is supported by the Patient-Centered Outcomes Research Institute (RI-MISSOURI-01-PS1) and institutional funding from its member organizations. The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
The authors have no conflicts of interest to disclose.
The data that support the findings of this study are available from a third party, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are available from the authors upon reasonable request and with permission of the third party.
Supplemental Digital Content is available for this article.
How to cite this article: Wang Z, Song X, Waitman LR, Hyams JS, Denson LA. Fitness-for-use of Retrospective Multicenter Electronic Health Records to Conduct Outcome Analysis for Pediatric Ulcerative Colitis. Medicine 2024;103:11(e37395).
Contributor Information
Xing Song, Email: xsm7f@health.missouri.edu.
Lemuel R. Waitman, Email: russ.waitman@health.missouri.edu.
Jeffrey S. Hyams, Email: Jhyams@connecticutchildrens.org.
Lee A. Denson, Email: Lee.Denson@cchmc.org.
References
- [1] Hyams J, Davis P, Lerer T, et al. Clinical outcome of ulcerative proctitis in children. J Pediatr Gastroenterol Nutr. 1997;25:149–52.
- [2] Gower-Rousseau C, Dauchet L, Vernier-Massouille G, et al. The natural history of pediatric ulcerative colitis: a population-based cohort study. Am J Gastroenterol. 2009;104:2080–8.
- [3] Turner D, Mack D, Leleiko N, et al. Severe pediatric ulcerative colitis: a prospective multicenter study of outcomes and predictors of response. Gastroenterology. 2010;138:2282–91.
- [4] Hyams JS, Davis S, Mack DR, et al. Factors associated with early outcomes following standardised therapy in children with ulcerative colitis (PROTECT): a multicentre inception cohort study. Lancet Gastroenterol Hepatol. 2017;2:855–68.
- [5] Hyams JS, Thomas SD, Gotman N, et al. Clinical and biological predictors of response to standardised paediatric colitis therapy (PROTECT): a multicentre inception cohort study. Lancet. 2019;393:1708–20.
- [6] Cheng AC, Banasiewicz MK, Johnson JD, et al. Evaluating automated electronic case report form data entry from electronic health records. J Clin Transl Sci. 2023;7:e29.
- [7] Brown J, Kahn M, Toh S. Data quality assessment for comparative effectiveness research in distributed data networks. Med Care. 2013;51:S22.
- [8] Waitman LR, Aaronson LS, Nadkarni PM, et al. The greater plains collaborative: a PCORnet clinical research data network. J Am Med Inform Assoc. 2014;21:637–41.
- [9] Forrest CB, McTigue KM, Hernandez AF, et al. PCORnet 2020: current state, accomplishments, and future directions. J Clin Epidemiol. 2021;129:60–7.
- [10] Waitman LR, Song X, Walpitage D, et al. Enhancing PCORnet clinical research network data completeness by integrating multistate insurance claims with electronic health records in a cloud environment aligned with CMS security and privacy requirements. J Am Med Inform Assoc. 2022;29:660–70.
- [11] Rubin D. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York. 1987.
- [12] Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat. 2010;4:266–98.
- [13] DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–45.
- [14] Friedman J. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.
- [15] Sparapani RA, Logan BR, McCulloch RE, et al. Nonparametric survival analysis using Bayesian additive regression trees (BART). Stat Med. 2016;35:2741–53.
- [16] Carmody JK, Plevinsky J, Peugh JL, et al. Longitudinal non-adherence predicts treatment escalation in paediatric ulcerative colitis. Aliment Pharmacol Ther. 2019;50:911–8.