Abstract
Natural language processing (NLP) tools turn free-text notes (FTN) from electronic health records (EHR) into data features that can supplement confounding adjustment in pharmacoepidemiologic studies. However, current applications are difficult to scale. We used unsupervised NLP to generate high-dimensional feature spaces from FTN to improve prediction of drug exposure and outcomes compared to claims-based analyses. We linked Medicare claims with EHR data to generate 3 cohort studies comparing different classes of medications on the risk of various clinical outcomes. We used ‘bag-of-words’ to generate features for the 20,000 most prevalent terms from FTN. We compared machine learning (ML) prediction algorithms using different sets of candidate predictors: Set1 (39 researcher-specified variables), Set2 (Set1 + ML-selected claims codes), Set3 (Set1 + ML-selected NLP-generated features), and Set4 (Set1 + 2 + 3). When modeling treatment choice, we observed a consistent pattern across the examples: ML models using Set4 performed best, followed by Set2, Set3, then Set1. When modeling outcome risk, there was little to no improvement beyond models based on Set1. Supplementing claims data with NLP-generated features from free-text notes improved prediction of prescribing choices but provided little to no improvement in clinical risk prediction. These findings have implications for strategies to improve confounding adjustment using EHR data in pharmacoepidemiologic studies.
INTRODUCTION
Healthcare data generated from routine care delivery, including electronic health records (EHR) and administrative claims, can supplement randomized controlled trials to provide real-world evidence (RWE) on the effects of medical products for clinical decision support. While administrative claims databases have been the primary source for RWE studies, EHR data have tremendous potential to supplement administrative claims by providing clinical detail not typically available in claims data alone. However, leveraging the full information content of EHR data can be challenging because much of this information is recorded in unstructured free-text documents that are not readily analyzable at a large scale.
Natural language processing (NLP) technology can be used to process unstructured clinical documents to identify and extract relevant information for further quantitative analyses. However, traditional applications of NLP for purposes of generating features from EHR data have primarily focused on supervised techniques that require manual annotation to establish the “ground truth.” This step can be costly and time-consuming due to knowledge acquisition, manual chart review, and training data creation and harmonization.1 2 This makes them unsuitable for scaling up for many tasks in RWE studies that require rapid-cycle analytics.3–5 Examples include generating features for high-dimensional confounding control for rapid monitoring of drug effectiveness and safety, and real-time development of large-scale risk prediction models to provide clinical decision support at the point of care.6–8
In this paper, we use 3 empirical studies to evaluate whether unsupervised applications of NLP can scale to generate large numbers of structured features from unstructured EHR documents. We then evaluate whether these generated features can supplement claims data to improve prediction of the choice of pharmacotherapies (as in building propensity score models for confounding adjustment)9–11 and the risk of adverse clinical outcomes (as in clinical risk prediction),12–17 compared to prediction models that use information from claims data alone. The objective is to assess whether unsupervised NLP tools that require little to no human input can scale to leverage the full information content in EHR documents to improve prediction models for confounding adjustment and clinical risk profiling.
METHODS
Data Source
We linked longitudinal claims data from the US Medicare system to the Research Patient Data Registry (RPDR) from 2007/01/01 to 2014/12/31. The RPDR data repository is based on all inpatient and outpatient activities of Mass General Brigham (MGB), the largest healthcare delivery network in the greater Boston area. RPDR captures all medical records electronically, including diagnoses, procedures, test results (laboratory tests, imaging, biopsies, etc.), prescribing, and free-text notes for all inpatient and outpatient services. Linking Medicare claims with the RPDR data repository provides a record of the continuum of care (i.e., even information on care provided outside of the MGB system is captured).
Study population
Based on Medicare fee-for-service beneficiaries aged 65 years or older, we generated 3 cohorts: 1) the Statin cohort: comparing high- versus low-intensity statins in patients with a history of myocardial infarction (MI) in terms of risk of major adverse cardiovascular events (MI and stroke); 2) the Analgesics cohort: comparing opioids vs. NSAIDs in patients with a history of osteoarthritis (OA) in terms of risk of renal failure; and 3) the PPI cohort: comparing high- vs. low-dose proton pump inhibitors (PPI) in patients with a history of peptic ulcer in terms of risk of gastrointestinal (GI) bleeding. For each empirical study, we identified individuals who initiated the treatment (or comparator) after no use of either the treatment or comparator medication in the previous year (new-user design).18 19 The cohort entry date was the date of the first recorded use of the study medication. To ensure that the study population had adequate information recorded in our data source, we required at least 364 days of continuous Medicare enrollment in Parts A (inpatient coverage), B (outpatient coverage), and D (prescription coverage).
Generating Structured Features from Unstructured EHR Free Text Notes.
To generate structured features from unstructured free-text notes, we applied the unsupervised NLP approach ‘bag-of-words’ (or bag-of-n-grams).20 An n-gram is a sequence of consecutive items (in this case, words), where a “unigram” refers to a single word, a “bigram” to 2 consecutive words, and so on. Each document was tokenized and processed into unigrams and bigrams. We excluded stop words, i.e., words that occur frequently but convey little semantic meaning, such as articles (e.g., “a”, “the”) and prepositions (e.g., “in”, “on”). We considered the presence of each n-gram in the 365 days prior to cohort entry as a candidate predictor and assessed its association with the outcome of interest. Examples of unigram features that were highly correlated with treatment assignment for each study are provided in Supplemental Tables 1 through 3.
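For illustration, the snippet below is a minimal sketch of this kind of unsupervised bag-of-n-grams feature generation using Python and scikit-learn's CountVectorizer. The study's own pipeline is not reproduced here; the toy notes, variable names, and the single 20,000-term cap (rather than the 10,000 unigrams plus 10,000 bigrams described below) are illustrative assumptions.

```python
# Minimal sketch of bag-of-n-grams feature generation (illustrative names/data).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# One row per patient; 'baseline_text' holds the concatenated free-text notes
# from the 365-day baseline window (toy examples, not real notes).
notes = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "baseline_text": [
        "history of myocardial infarction, high intensity statin started",
        "osteoarthritis pain managed with nsaid, renal function stable",
        "peptic ulcer seen on endoscopy, started proton pump inhibitor",
    ],
})

vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words="english",  # drop articles, prepositions, and other stop words
    max_features=20000,    # keep only the most prevalent terms
    binary=True,           # record presence/absence rather than counts
)
X_nlp = vectorizer.fit_transform(notes["baseline_text"])

# Sparse patient-by-term matrix of binary NLP-generated features
nlp_features = pd.DataFrame.sparse.from_spmatrix(
    X_nlp,
    index=notes["patient_id"],
    columns=vectorizer.get_feature_names_out(),
)
print(nlp_features.shape)
```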
It is important to emphasize that the key factor for the scalability of using NLP tools to generate large numbers of structured features for real-time clinical decision support is that researchers remain agnostic to the format and content of the processed information (e.g., text or coded information). The selected NLP methods are used to automatically identify concepts or patterns from EHR free-text notes which are then fed into machine learning prediction algorithms to model the treatment and outcome generating mechanisms (discussed below). Therefore, investigators are not concerned about the specific clinical meanings of the extracted features but only with the ability of the generated features to improve prediction. The application of NLP tools in this setting does not require time-intensive tasks that are common with supervised learning, such as manual chart review or training data creation, that are traditionally used to improve capture of specific phenotypes.
Prediction Model Development and Evaluation
For each empirical study, baseline covariates (features) included 39 researcher-specified variables (including demographic variables such as age and sex), thousands of additional claims codes, and thousands of NLP-generated features from the EHR. The researcher-specified variables were constructed from claims data alone, based on diagnostic and procedural ICD-9 codes and NDC drug codes. Claims codes used for baseline covariate adjustment included only information from the year prior to cohort entry, and NLP-generated features likewise included only EHR information from the year prior to cohort entry. Only the 10,000 most prevalent unigrams and the 10,000 most prevalent bigrams were considered. We then removed features with a prevalence less than 0.0001 within any treatment or outcome level for the given study. The remaining claims codes and NLP-generated features were used as binary predictors (i.e., presence or absence of the feature/condition) in the models. We created four different covariate sets (a sketch of this feature-set assembly follows the list below):
Set 1: researcher-specified variables only;
Set 2: researcher-specified variables + claims codes;
Set 3: researcher-specified variables + NLP-generated features from EHR free-text notes;
Set 4: researcher-specified variables + claims codes + NLP-generated features from EHR free-text notes.
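The following is a minimal sketch, using toy data and assumed variable names, of how the four covariate sets and the prevalence screen could be assembled. For brevity it screens within treatment levels only, whereas the study screened within both treatment and outcome levels (the same function could be applied with an outcome indicator).

```python
# Illustrative assembly of the four covariate sets with a prevalence screen.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Hypothetical patient-level binary feature blocks (stand-ins for the real data)
prespecified = pd.DataFrame(rng.integers(0, 2, (n, 5)), columns=[f"rx_{i}" for i in range(5)])
claims = pd.DataFrame(rng.integers(0, 2, (n, 50)), columns=[f"icd9_{i}" for i in range(50)])
nlp_features = pd.DataFrame(rng.integers(0, 2, (n, 80)), columns=[f"ngram_{i}" for i in range(80)])
treatment = pd.Series(rng.integers(0, 2, n), name="treatment")

def prevalence_screen(features: pd.DataFrame, group: pd.Series, threshold: float = 1e-4) -> pd.DataFrame:
    """Keep binary features whose prevalence is at least `threshold` within every level of `group`."""
    prevalence = features.groupby(group).mean()  # feature prevalence per group level
    return features.loc[:, (prevalence >= threshold).all(axis=0)]

claims_kept = prevalence_screen(claims, treatment)
nlp_kept = prevalence_screen(nlp_features, treatment)

covariate_sets = {
    "set1": prespecified,                                             # researcher-specified only
    "set2": pd.concat([prespecified, claims_kept], axis=1),           # + claims codes
    "set3": pd.concat([prespecified, nlp_kept], axis=1),              # + NLP features
    "set4": pd.concat([prespecified, claims_kept, nlp_kept], axis=1), # + both
}
```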
We applied 3 machine learning (ML) prediction algorithms to model the treatment and the outcome (risk of the outcome over a 6-month follow-up) in each empirical cohort: 1) least absolute shrinkage and selection operator (LASSO) regression; 2) random forest; 3) extreme gradient boosting (XGboost).21–23 For each prediction algorithm, we fit a separate model for each of the 4 covariate sets described above. A grid search for hyperparameter optimization was used for XGboost and random forest. We evaluated the calibration and discrimination of each model using the 10-fold cross-validated negative log-likelihood and concordance statistic (C-statistic). The data were split into 10 folds; each model was trained on 9 of the folds, and prediction performance was evaluated on the held-out test fold. This process was repeated 10 times, with a different fold serving as the held-out test set each time, and results were averaged across the folds.
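As a rough illustration of this evaluation step, the sketch below fits the three algorithms with scikit-learn and xgboost and computes 10-fold cross-validated AUC and negative log-likelihood. The hyperparameter values are placeholders rather than the grids tuned in the study, and the commented usage refers to the hypothetical objects from the previous sketch.

```python
# Sketch of 10-fold cross-validated evaluation of LASSO, random forest, and XGboost.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

def cv_auc_nll(model, X, y, folds=10, seed=42):
    """Return mean cross-validated AUC and negative log-likelihood over `folds` folds."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "neg_log_loss"])
    return scores["test_roc_auc"].mean(), -scores["test_neg_log_loss"].mean()

models = {
    # L1-penalized (LASSO) logistic regression with the penalty chosen by internal CV
    "lasso": LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5, max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=500, min_samples_leaf=5, random_state=0),
    "xgboost": XGBClassifier(n_estimators=300, max_depth=3, learning_rate=0.1, eval_metric="logloss"),
}

# Hypothetical usage with the objects from the previous sketch:
# for name, model in models.items():
#     auc, nll = cv_auc_nll(model, covariate_sets["set4"].to_numpy(), treatment.to_numpy())
#     print(f"{name}: AUC={auc:.3f}, NLL={nll:.3f}")
```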
Sensitivity analysis:
To assess if the performance of the prediction models for the clinical outcomes could be improved by using data-processing techniques for unbalanced data (e.g., when modeling a low-prevalence outcome), we additionally processed each of the high-dimensional datasets using 1) ‘partitioning around medoids’ (PAM)24 and 2) the ‘synthetic minority over-sampling technique’ (SMOTE).25 PAM is an unsupervised machine learning technique that reduces the dimension of the data by clustering similar covariates and returning, from each cluster, the single covariate (the medoid) that is most representative of that cluster. SMOTE is a technique for oversampling the minority class to reduce the likelihood of overfitting prediction models in highly unbalanced data (e.g., when predicting a rare outcome).25 We implemented both PAM and SMOTE within the R computing environment using the ‘performanceEstimation’26 package for SMOTE and the screening function for PAM within the ‘SuperLearner’ package.27
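The study implemented these steps in R; the sketch below is a loose Python analogue, assuming the imbalanced-learn and scikit-learn-extra packages, that applies a PAM-style screen by clustering covariates with k-medoids (keeping one medoid covariate per cluster) and then oversamples a rare binary outcome with SMOTE. Data and parameters are illustrative only.

```python
# Loose Python analogue of the PAM + SMOTE sensitivity analysis (toy data).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn_extra.cluster import KMedoids

def pam_screen(X: np.ndarray, n_clusters: int = 50) -> np.ndarray:
    """Cluster covariates (columns) with k-medoids and keep one medoid covariate per cluster."""
    km = KMedoids(n_clusters=n_clusters, random_state=0).fit(X.T)  # treat columns as observations
    return X[:, km.medoid_indices_]

# Toy high-dimensional data with a rare (~5%) binary outcome
rng = np.random.default_rng(0)
X = rng.integers(0, 2, (500, 200)).astype(float)
y = (np.arange(500) < 25).astype(int)

X_reduced = pam_screen(X, n_clusters=50)                          # dimension reduction via PAM-style screen
X_res, y_res = SMOTE(random_state=0).fit_resample(X_reduced, y)   # oversample the minority (outcome) class
```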
RESULTS
Study cohorts:
Table 1 shows the sample size, treatment prevalence, outcome incidence, and the number of features available for prediction for each empirical example. For each cohort, treatment prevalence was much greater than outcome incidence, which is common in healthcare database studies. For example, treatment prevalence for the Statin, Analgesics, and PPI cohorts was 35.3%, 62.6%, and 34.1%, while outcome incidence was only 3.9%, 1.7%, and 1.1%, respectively. Application of bag-of-words to unstructured EHR text resulted in 20,017, 20,051, and 20,025 free-text features for the Statin, Analgesics, and PPI cohorts (Table 1). Features available for prediction modeling also included all claims codes that had a minimum prevalence of 0.0001 within each treatment group and outcome level. This resulted in 18,409, 19,517, and 28,041 additional claims codes available for prediction within the Statin, Analgesics, and PPI cohorts, respectively (Table 1).
Table 1.
Study Cohorts
| No. | Descriptionᵃ | Study Population (Total N) | Treatment, n (%) | Outcome, n (%) | Baseline Covariates: Investigator-Specified | Baseline Covariates: Claims Codesᵇ | Baseline Covariates: EHR Featuresᶜ |
|---|---|---|---|---|---|---|---|
| 1 | High- vs. low-intensity statin with an outcome of major cardiac events | 3,529 | 1,244 (35.3) | 138 (3.9) | 39 | 18,409 | 20,017 |
| 2 | Opioids vs. NSAIDs with an outcome of renal failure | 9,571 | 5,991 (62.6) | 158 (1.7) | 39 | 19,517 | 20,051 |
| 3 | High- vs. low-dose PPI with an outcome of peptic ulcer complications | 20,862 | 7,108 (34.1) | 234 (1.1) | 39 | 28,041 | 20,025 |

NSAIDs = nonsteroidal anti-inflammatory drugs; PPI = proton pump inhibitor.
ᵇ The total number of claims codes with a prevalence >0.0001 within each treatment and outcome level. Each code was transformed into a binary variable indicating whether or not that code appeared during the baseline period for the given individual.
ᶜ The total number of NLP-generated features. Each NLP-generated feature was transformed into a binary variable indicating whether or not that feature/term appeared during the baseline period for the given individual.
Predicting treatment choice:
Given the same covariate set, we observed that LASSO and XGboost models generally performed better than the random forest models when predicting prescribing of the treatment of interest. For example, when using all available variables to model treatment choice in the Statin cohort, the LASSO and XGboost models yielded cross-validated AUCs of 0.931 and 0.932, respectively, while the random forest model had an AUC of 0.851 (Table 2). When modeling the treatment choice using the same modeling method, we observed a consistent pattern in model performance across the three examples: covariate set 4 (researcher-specified + claims codes + NLP features from EHR) resulted in the best predictive performance, followed by covariate set 2 (researcher-specified + claims codes), then covariate set 3 (researcher-specified + NLP features), and finally models using only covariate set 1 (researcher-specified variables) tended to have the worst performance. For example, for the Statin cohort, the LASSO model that included all baseline features (covariate set 4) resulted in an AUC of 0.931, compared to 0.903 for covariate set 2, 0.875 for covariate set 3, and 0.718 for covariate set 1 (Table 2). We found a similar pattern in the Analgesics (Table 3) and PPI cohorts (Table 4).
Table 2.
Prediction results for comparing high vs low intensity statins on cardiovascular events*
| Covariate Set | Model | Treatment AUC | Treatment NLL | Outcome AUC | Outcome NLL |
|---|---|---|---|---|---|
| Prespecified | LASSO | 0.718 | 0.569 | 0.613 | 0.163 |
| | XGboost | 0.700 | 0.596 | 0.614 | 0.164 |
| | Random Forest | 0.663 | 0.613 | 0.612 | 0.165 |
| Prespecified + Claims Codes | LASSO | 0.903 | 0.372 | 0.597 | 0.181 |
| | XGboost | 0.908 | 0.366 | 0.613 | 0.164 |
| | Random Forest | 0.885 | 0.481 | 0.611 | 0.171 |
| Prespecified + NLP Features¶ | LASSO | 0.875 | 0.426 | 0.580 | 0.221 |
| | XGboost | 0.874 | 0.436 | 0.603 | 0.164 |
| | Random Forest | 0.779 | 0.570 | 0.580 | 0.171 |
| Prespecified + Claims + NLP¶ | LASSO | 0.931 | 0.324 | 0.595 | 0.218 |
| | XGboost | 0.932 | 0.324 | 0.587 | 0.165 |
| | Random Forest | 0.851 | 0.536 | 0.609 | 0.168 |
* Comparing high- versus low-intensity statins in patients with a history of myocardial infarction (MI) in terms of risk of MI or stroke.
¶ Features generated by natural language processing (NLP) of the free-text electronic health records.
AUC = area under the receiver operating characteristic curve; NLL = negative log-likelihood. Boldface indicates the model with the highest cross-validated AUC given the set of predictors considered in the model (NLL was used as a secondary criterion for tied AUCs).
Table 3.
Prediction results for the analgesics study*
| Covariate Set | Model | Treatment AUC | Treatment NLL | Outcome AUC | Outcome NLL |
|---|---|---|---|---|---|
| Prespecified | LASSO | 0.878 | 0.409 | 0.799 | 0.074 |
| | XGboost | 0.878 | 0.409 | 0.796 | 0.074 |
| | Random Forest | 0.866 | 0.424 | 0.770 | 0.078 |
| Prespecified + Claims Codes | LASSO | 0.933 | 0.301 | 0.781 | 0.075 |
| | XGboost | 0.934 | 0.307 | 0.787 | 0.074 |
| | Random Forest | 0.910 | 0.416 | 0.779 | 0.076 |
| Prespecified + NLP Features¶ | LASSO | 0.922 | 0.343 | 0.769 | 0.076 |
| | XGboost | 0.930 | 0.335 | 0.769 | 0.075 |
| | Random Forest | 0.826 | 0.539 | 0.700 | 0.081 |
| Prespecified + Claims + NLP¶ | LASSO | 0.945 | 0.281 | 0.784 | 0.075 |
| | XGboost | 0.945 | 0.291 | 0.777 | 0.075 |
| | Random Forest | 0.843 | 0.512 | 0.764 | 0.077 |
* Comparing opioids vs. NSAIDs in patients with a history of osteoarthritis in terms of risk of renal failure.
¶ Features generated by natural language processing (NLP) of the free-text electronic health records.
AUC = area under the receiver operating characteristic curve; NLL = negative log-likelihood. Boldface indicates the model with the highest cross-validated AUC given the set of predictors considered in the model (NLL was used as a secondary criterion for tied AUCs).
Table 4.
Prediction results for the PPI study*
| Covariate Set | Model | Treatment AUC | Treatment NLL | Outcome AUC | Outcome NLL |
|---|---|---|---|---|---|
| Prespecified | LASSO | 0.585 | 0.631 | 0.737 | 0.058 |
| | XGboost | 0.583 | 0.632 | 0.726 | 0.058 |
| | Random Forest | 0.551 | 0.646 | 0.670 | 0.063 |
| Prespecified + Claims Codes | LASSO | 0.870 | 0.392 | 0.761 | 0.056 |
| | XGboost | 0.875 | 0.388 | 0.771 | 0.055 |
| | Random Forest | 0.868 | 0.450 | 0.762 | 0.057 |
| Prespecified + NLP Features¶ | LASSO | 0.751 | 0.549 | 0.731 | 0.058 |
| | XGboost | 0.751 | 0.550 | 0.741 | 0.063 |
| | Random Forest | 0.691 | 0.604 | 0.680 | 0.058 |
| Prespecified + Claims + NLP¶ | LASSO | 0.900 | 0.363 | 0.755 | 0.056 |
| | XGboost | 0.905 | 0.358 | 0.761 | 0.056 |
| | Random Forest | 0.853 | 0.531 | 0.737 | 0.058 |
* Comparing high- vs. low-dose proton pump inhibitors (PPI) in patients with a history of peptic ulcer in terms of risk of gastrointestinal bleeding.
¶ Features generated by natural language processing (NLP) of the free-text electronic health records.
AUC = area under the receiver operating characteristic curve; NLL = negative log-likelihood. Boldface indicates the model with the highest cross-validated AUC given the set of predictors considered in the model (NLL was used as a secondary criterion for tied AUCs).
Predicting clinical outcomes:
When modeling the risk of clinical outcomes over a 6-month follow-up using the same covariate set, we observed comparable model performance across the three modeling methods. Within each modeling method, we observed that adding additional claims codes or NLP-generated EHR features did not improve model performance in the smaller cohorts (the Statin study [138 events, Table 2] and the Analgesics study [158 events, Table 3]) and resulted in only a modest improvement in the larger cohort (the PPI study [234 events, Table 4]). When modeling the outcome risk, there was no additional benefit in predictive performance for any of the models when supplementing claims data with NLP-generated features from the EHR (Tables 2–4).
Sensitivity analysis:
After additional processing of the data by PAM and SMOTE, we did not observe appreciable improvement with any of the three modeling methods (LASSO, XGboost, or random forest). This was consistent across all 3 empirical examples (results not shown).
DISCUSSION
In this study, we applied the unsupervised NLP approach ‘bag-of-words’ to generate large numbers of features from EHR data in 3 empirical studies. We then evaluated if these NLP generated features could supplement administrative claims data to improve large-scale prediction modeling of the treatment and outcome mechanisms. We hypothesized that the addition of the NLP generated features would improve prediction when modeling the treatment and outcome. We found that unsupervised ‘bag-of-words’ can scale for rapid generation of large numbers of features and that supplementing claims data with these NLP generated features can improve predictive performance when modeling prescribing choices compared to models that used claims data alone. However, when modeling the outcome, we found little to no improvement in predictive performance across the various models when supplementing claims data with the NLP generated features from EHR.
Clinical outcome prediction has a variety of applications, ranging from clinical decision support7 8 to patient risk profiling.28 29 Modeling prescribing choices is commonly done when building propensity score models for confounding adjustment in comparative safety and effectiveness research.9 11 Our findings demonstrated that adding empirically identified claims codes consistently improved predictive performance when modeling the treatment exposure. This is consistent with the literature showing that the high-dimensional proxy adjustment algorithm yields better confounding adjustment than methods based only on investigator-specified confounders.6 30–32 In this study, we found that incorporating NLP features can further improve predictive performance compared to models based on prespecified variables and empirically identified claims codes. However, this improved performance was only observed when modeling the treatment choice and not the clinical outcome.
The discrepancy in performance between the treatment and outcome prediction models may, in part, be due to the low outcome incidence in each empirical example. Rare outcomes are common in healthcare database studies and can impact the performance of prediction models, particularly when modeling high-dimensional sets of features. Larger samples can help to compensate for rare outcomes, as illustrated in the largest empirical cohort (the PPI cohort), where outcome prediction improved when machine learning algorithms incorporated additional information from either thousands of claims codes or NLP features, compared to models based on the researcher-specified variables alone. Yet, even in this larger study we did not observe a meaningful improvement when adding NLP features on top of the claims codes.
Several limitations deserve attention. First, in this study we considered only one NLP approach (i.e., ‘bag-of-words’) for generating structured features from unstructured free text. We intentionally chose a simple but highly scalable NLP approach because the main context of application was RWE studies that require rapid-cycle analytics and risk prediction models to provide clinical decision support at the point of care.5 7 8 Future research could assess the trade-off between complexity and scalability when using more sophisticated unsupervised NLP tools, such as named entity recognition (clinical and contextual information extraction and encoding),33–36 distributional semantics models,37–39 and word embeddings.40 41 A thorough comparison of these alternative approaches is beyond our scope. In addition, when modeling claims codes, we did not consider the hierarchical structure of ICD-9 codes. Future research could explore leveraging this hierarchical structure (e.g., considering less granular levels) to avoid redundant information and potentially better predict treatment and outcome. Future research is also needed to explore how the results of this study generalize to ICD-10 coding structures.
Second, we only considered Lasso regression, XGboost, and random forest for prediction modeling. Other flexible machine learning algorithms, including deep learning, could be considered. However, the objective of this study was to focus on the potential added value, in terms of predictive performance, of incorporating large-scale information from unstructured EHR rather than focusing on a thorough comparison of a range of machine learning prediction algorithms. We chose to focus on Lasso, XGboost, and random forest as these approaches are highly scalable and have been shown to perform well in large, high-dimensional healthcare databases. When comparing these approaches, no single method performed best across all examples, and there was little difference between the predictive performance of the Lasso, XGboost, and random forest models in the studies considered here.
Finally, in this study we only considered the benefit of the NLP features in terms of improving prediction when modeling the treatment and outcome generating mechanisms. For many tasks in RWE studies the end goal is to reduce bias in estimated treatment effects (e.g., generating features for high-dimensional confounding control for rapid monitoring of drug effectiveness and safety). An improvement in predictive performance when incorporating EHR data to model treatment choice does not necessarily guarantee improved confounding control.42–45 Similarly, a lack of improved prediction when incorporating EHR data to model the outcome does not necessarily imply that there is no additional confounder information in the EHR-generated features. Studies have also shown that models that optimize prediction can underfit for purposes of confounding adjustment and are not necessarily optimal for reducing bias in estimated treatment effects.43 Recent work independent of ours has explored the use of large-scale feature engineering from unstructured EHR text to improve confounding control in PS analyses.46 This work found benefits for confounding control, but the benefits were incremental and the findings were limited to a single empirical study.46 Determining to what extent the addition of these EHR-generated features can improve confounding control is a more difficult problem that we leave to future research.
In conclusion, we found that the use of unsupervised NLP for harnessing unstructured EHR information could improve prediction when modeling the prescribing choice (i.e., the propensity score) in healthcare database studies. Such improvement was not seen when predicting rare clinical outcomes in small cohorts. Future research is needed to evaluate to what extent this additional information can improve confounding control in comparative effectiveness and safety studies.
Supplementary Material
STUDY HIGHLIGHTS.
What is the current knowledge on the topic? Unstructured electronic health records remain underutilized for large-scale confounding adjustment in pharmacoepidemiologic studies. ‘Bag-of-words’ is an unsupervised natural language processing (NLP) tool that can scale to generate large numbers of structured features from unstructured health notes to potentially improve treatment and outcome prediction models and confounding adjustment.
What question did this study address? This study sought to evaluate whether or not large numbers of NLP generated features could supplement administrative claims data to improve large-scale prediction modeling of the prescribing choice for treatment (i.e., the propensity score) and outcome.
What does this study add to our knowledge? Across 3 empirical studies, we found that large numbers of features generated from bag-of-words could supplement investigator-specified variables to improve prediction of prescribing choices. However, there was little to no improvement when predicting clinical outcomes across the 3 studies. The discrepancy in performance between the treatment and outcome prediction models may, in part, be due to the low outcome incidence across each empirical example. Rare outcomes are common in healthcare database studies and can impact the performance of prediction models, particularly when modeling high-dimensional sets of features.
How might this change clinical pharmacology or translational science? Using NLP to leverage the full information content in electronic health records may improve prediction of prescribing choices (the propensity score). Future research is needed to evaluate to what extent this additional information can improve confounding control in comparative effectiveness and safety studies.
Source of Funding:
This project was funded by NIH R01LM013204; Dr. Wyss received additional funding from the Sentinel Innovation Center.
Conflicts of interest:
Dr. Schneeweiss (ORCID# 0000-0003-2575-467X) is participating in investigator-initiated grants to the Brigham and Women’s Hospital from Boehringer Ingelheim and UCB unrelated to the topic of this study. He is a consultant to Aetion Inc., a software manufacturer of which he owns equity. His interests were declared, reviewed, and approved by the Brigham and Women’s Hospital in accordance with their institutional compliance policies. Dr. Rassen is an employee of and has an ownership stake in Aetion, Inc. All other authors declared no competing interests for this work.
Footnotes
SUPPORTING INFORMATION
Supplementary information accompanies this paper on the Clinical Pharmacology & Therapeutics website (www.cpt-journal.com).
Availability of data and computing code:
Datasets for empirical studies are not available for public use due to data use agreements.
References:
1. Xia Z, Secor E, Chibnik LB, et al. Modeling disease severity in multiple sclerosis using electronic health records. PLoS One 2013;8(11):e78927. doi: 10.1371/journal.pone.0078927
2. Winnenburg R, Wachter T, Plake C, et al. Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform 2008;9(6):466–478. doi: 10.1093/bib/bbn043
3. Schneeweiss S, Eichler HG, Garcia-Altes A, et al. Real world data in adaptive biomedical innovation: a framework for generating evidence fit for decision-making. Clinical Pharmacology and Therapeutics 2016;100(6):633–46. doi: 10.1002/cpt.512
4. Schneeweiss S, Glynn RJ. Real-world data analytics fit for regulatory decision-making. Am J Law Med 2018;44(2–3):197–217. doi: 10.1177/0098858818789429
5. Schneeweiss S, Shrank WH, Ruhl M, et al. Decision-making aligned with rapid-cycle evaluation in health care. Int J Technol Assess Health Care 2015;31(4):214–22. doi: 10.1017/S0266462315000410
6. Schneeweiss S, Rassen JA, Glynn RJ, et al. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 2009;20(4):512–22. doi: 10.1097/EDE.0b013e3181a663cc
7. Gallego B, Walter SR, Day RO, et al. Bringing cohort studies to the bedside: framework for a ‘green button’ to support clinical decision-making. Journal of Comparative Effectiveness Research 2015;4(3):191–97. doi: 10.2217/cer.15.12
8. Longhurst CA, Harrington RA, Shah NH. A ‘green button’ for using aggregate patient data at the point of care. Health Aff (Millwood) 2014;33(7):1229–35. doi: 10.1377/hlthaff.2014.0099
9. Brookhart MA, Wyss R, Layton JB, et al. Propensity score methods for confounding control in nonexperimental research. Circulation: Cardiovascular Quality and Outcomes 2013;6(5):604–11. doi: 10.1161/CIRCOUTCOMES.113.000359
10. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41–55. doi: 10.1093/biomet/70.1.41
11. Sturmer T, Wyss R, Glynn RJ, et al. Propensity scores for confounder adjustment when assessing the effects of medical interventions using nonexperimental study designs. Journal of Internal Medicine 2014;275(6):570–80. doi: 10.1111/joim.12197
12. Freedman AN, Yu B, Gail MH, et al. Benefit/risk assessment for breast cancer chemoprevention with raloxifene or tamoxifen for women age 50 years or older. Journal of Clinical Oncology 2011;29(17):2327–33. doi: 10.1200/JCO.2010.33.0258
13. Gail MH, Costantino JP, Bryant J, et al. Weighing the risks and benefits of tamoxifen treatment for preventing breast cancer. Journal of the National Cancer Institute 1999;91(21):1829–46. doi: 10.1093/jnci/91.21.1829
14. Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991;100(6):1619–36. doi: 10.1378/chest.100.6.1619
15. Lyden P, Lu M, Jackson C, et al. Underlying structure of the National Institutes of Health Stroke Scale: results of a factor analysis. NINDS tPA Stroke Trial Investigators. Stroke 1999;30(11):2347–54. doi: 10.1161/01.STR.30.11.2347
16. Sagara Y, Freedman RA, Vaz-Luis I, et al. Patient prognostic score and associations with survival improvement offered by radiotherapy after breast-conserving surgery for ductal carcinoma in situ: a population-based longitudinal cohort study. Journal of Clinical Oncology 2016;34(11):1190–6. doi: 10.1200/JCO.2015.65.1869
17. Teasdale G, Jennett B. Assessment of coma and impaired consciousness. A practical scale. Lancet 1974;2(7872):81–4. doi: 10.1016/s0140-6736(74)91639-0
18. Lund JL, Richardson DB, Sturmer T. The active comparator, new user study design in pharmacoepidemiology: historical foundations and contemporary application. Curr Epidemiol Rep 2015;2(4):221–28. doi: 10.1007/s40471-015-0053-5
19. Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. American Journal of Epidemiology 2003;158(9):915–20. doi: 10.1093/aje/kwg231
20. Zhang Y, Jin R, Zhou Z-H. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 2010;1:43–52. doi: 10.1007/s13042-010-0001-0
21. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 1996:267–88. doi: 10.1111/j.2517-6161.1996.tb02080.x
22. Breiman L. Random forests. Machine Learning 2001;45:5–32.
23. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016;785–94. doi: 10.1145/2939672.2939785
24. Van der Laan M, Pollard K, Bryan J. A new partitioning around medoids algorithm. Journal of Statistical Computation and Simulation 2003;73(8):575–84. doi: 10.1080/0094965031000136012
25. Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 2002;16:321–57. doi: 10.1613/jair.953
26. Torgo L. performanceEstimation. R package version 1.1.0, 2016. https://CRAN.R-project.org/package=performanceEstimation
27. Polley EC, Rose S, van der Laan MJ. Super learning. In: Targeted Learning. Springer 2011:43–66. doi: 10.1007/978-1-4419-9782-1_3
28. Olesen JB, Lip GY, Hansen ML, et al. Validation of risk stratification schemes for predicting stroke and thromboembolism in patients with atrial fibrillation: nationwide cohort study. BMJ 2011;342:d124. doi: 10.1136/bmj.d124
29. Pisters R, Lane DA, Nieuwlaat R, et al. A novel user-friendly score (HAS-BLED) to assess 1-year risk of major bleeding in patients with atrial fibrillation: the Euro Heart Survey. Chest 2010;138(5):1093–100. doi: 10.1378/chest.10-0134
30. Rassen JA, Choudhry NK, Avorn J, et al. Cardiovascular outcomes and mortality in patients using clopidogrel with proton pump inhibitors after percutaneous coronary intervention or acute coronary syndrome. Circulation 2009;120(23):2322–9. doi: 10.1161/CIRCULATIONAHA.109.873497
31. Schneeweiss S, Patrick AR, Solomon DH, et al. Comparative safety of antidepressant agents for children and adolescents regarding suicidal acts. Pediatrics 2010;125(5):876–88. doi: 10.1542/peds.2009-2317
32. Hallas J, Pottegard A. Performance of the high-dimensional propensity score in a Nordic healthcare model. Basic Clin Pharmacol Toxicol 2017;120(3):312–17. doi: 10.1111/bcpt.12716
33. Zhou L, Dhopeshwarkar N, Blumenthal KG, et al. Drug allergies documented in electronic health records of a large healthcare system. Allergy 2016;71(9):1305–13. doi: 10.1111/all.12881
34. Lai KH, Topaz M, Goss FR, et al. Automated misspelling detection and correction in clinical free-text records. J Biomed Inform 2015;55:188–95. doi: 10.1016/j.jbi.2015.04.008
35. Zhou L, Lu Y, Vitale CJ, et al. Representation of information about family relatives as structured data in electronic health records. Appl Clin Inform 2014;5(2):349–67. doi: 10.4338/ACI-2013-10-RA-0080
36. Zhou L, Plasek JM, Mahoney LM, et al. Using Medical Text Extraction, Reasoning and Mapping System (MTERMS) to process medication information in outpatient clinical notes. AMIA Annu Symp Proc 2011;2011:1639–48.
37. Tang C, Zhou L, Plasek J, et al. Comment topic evolution on a cancer institution’s Facebook page. Appl Clin Inform 2017;8(3):854–865. doi: 10.4338/ACI-2017-04-RA-0055
38. McCallum AK. MALLET: A Machine Learning for Language Toolkit. 2002. Available from: http://mallet.cs.umass.edu/
39. Shao Y, Mohanty AF, Ahmed A, et al. Identification and use of frailty indicators from text to examine associations with clinical outcomes among patients with heart failure. AMIA Annu Symp Proc 2017;10:1110–18.
40. Kumamaru H, Gagne JJ, Glynn RJ, et al. Comparison of high-dimensional confounder summary scores in comparative studies of newly marketed medications. Journal of Clinical Epidemiology 2016;76:200–208. doi: 10.1016/j.jclinepi.2016.02.011
41. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs], 2013. Available from: http://arxiv.org/abs/1301.3781
42. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. American Journal of Epidemiology 2006;163(12):1149–56. doi: 10.1093/aje/kwj149
43. Ju C, Wyss R, Franklin JM, et al. Collaborative-controlled LASSO for constructing propensity score-based estimators in high-dimensional data. Statistical Methods in Medical Research 2019;28(4):1044–1063. doi: 10.1177/0962280217744588
44. Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for causal inference. Biometrics 2017;73(4):1111–22. doi: 10.1111/biom.12679
45. Wyss R, Ellis AR, Brookhart MA, et al. The role of prediction modeling in propensity score estimation: an evaluation of logistic regression, bCART, and the covariate-balancing propensity score. American Journal of Epidemiology 2014;180(6):645–55. doi: 10.1093/aje/kwu181
46. Afzal Z, Masclee GMC, Sturkenboom MCJM, Kors JA, Schuemie MJ. Generating and evaluating a propensity model using textual features from electronic medical records. PLOS ONE 2019;14(3):e0212999. doi: 10.1371/journal.pone.0212999
