Author manuscript; available in PMC: 2025 Feb 1.
Published in final edited form as: Arterioscler Thromb Vasc Biol. 2023 Dec 14;44(2):491–504. doi: 10.1161/ATVBAHA.123.320331

Prediction of Venous Thromboembolism in Diverse Populations Using Machine Learning and Structured Electronic Health Records

Robert Chen 1,2,3, Ben Omega Petrazzini 1,3,4, Waqas Malick 5, Robert Rosenson 5, Ron Do 1,3,4
PMCID: PMC10872966  NIHMSID: NIHMS1950363  PMID: 38095106

Abstract

Background:

Venous thromboembolism (VTE) is a major cause of morbidity and mortality worldwide. Current risk assessment tools, such as the Caprini, Padua, and Wells scores, have limitations in their applicability and accuracy. This study aimed to develop machine learning models using structured electronic health record (EHR) data to predict diagnosis and 1-year risk of VTE.

Methods:

We trained and validated models on data from 159,001 participants in the Mount Sinai Data Warehouse. We then externally tested them on 401,723 participants in the UK Biobank and 123,039 participants in All of Us. All datasets contain populations of diverse ancestries and clinical histories. We used these datasets to develop small, medium, and large models with increasing numbers of features, spanning a range from optimized portability to maximized performance. We make trained models publicly available in click-and-run format at https://doi.org/10.17632/tkwzysr4y6.6.

Results:

In the holdout and external test sets, respectively, models achieved areas under the receiver operating curve (AUROC) of 0.80–0.83 and 0.72–0.82 for VTE diagnosis prediction and 0.76–0.78 and 0.64–0.69 for 1-year risk prediction, significantly outperforming the Padua score. Models also demonstrated robust performance across different VTE types and patient subsets, including ethnicity, age, and surgical and hospitalization status. Models identified both established and novel clinical features contributing to VTE risk, offering valuable insights into its underlying pathophysiology.

Conclusions:

Machine learning models using structured EHR data can significantly improve VTE diagnosis and 1-year risk prediction in diverse populations. Model probability scores exist on a continuum, with higher scores associated with increased mortality risk in both healthy individuals and VTE cases. Integrating these models into EHR systems to generate real-time predictions may enhance VTE risk assessment, early detection, and preventative measures, ultimately reducing the morbidity and mortality associated with VTE.

Graphical Abstract

[Graphical abstract: nihms-1950363-f0004]

Introduction

Venous thromboembolism (VTE), including deep vein thrombosis (DVT) and pulmonary embolism (PE), remains a major contributor to global morbidity and mortality, with an estimated incidence rate of 1 to 2 per 1000 individuals annually.1 VTE presents significant clinical challenges due to its often asymptomatic or nonspecific presentation,2 leading to delays in diagnosis and treatment that can result in fatal outcomes. Mortality is high, with 1-year mortality rates ranging from 20 to 30%.3–5 Early detection and prevention of VTE are thus imperative to reducing its substantial health burden and life-threatening complications,6 including recurrent thrombotic events and post-thrombotic syndrome.7

Electronic health records (EHRs) store a wealth of clinical data, such as demographics, laboratory measurements, medications, and past diagnoses, which can be used to assess risk for VTE. However, existing methods for VTE risk assessment, including the Caprini and Padua scores and Wells’ Criteria,8–10 are integer-based scoring systems that use a small number of available predictors and focus on perioperative patients (Caprini), hospitalized patients (Padua), or patients for whom VTE is a diagnostic possibility (Wells). These scores require physician assessment of patients’ mobility, history of trauma, and/or clinical presentation, limiting their use as population screening tools. Similarly, while previous studies have applied machine learning models to VTE patients, these studies have been limited by narrowly defined cohorts and small sample sizes, and have lacked validation.11 Importantly, development of robust machine learning models for VTE prediction in populations of diverse ancestries and clinical backgrounds has not been extensively explored.

Recently, healthcare systems have integrated machine learning models into EHR systems to generate real-time, automated predictions using existing patient data. For example, the Epic Sepsis Model uses 80 features to predict sepsis and alerts healthcare workers when the predicted score is above a threshold.12,13 Prior studies have also demonstrated that EHR-embedded risk-stratification tools are effective at increasing rates of guideline-appropriate VTE prophylaxis and reducing VTE incidence,14–17 but these tools have similar limitations to the Caprini, Padua, and Wells scores. Building on these advancements, we propose that VTE prediction models could automatically alert clinicians to patients at high-risk for VTE prior to symptom manifestation or progression, enabling targeted measures to reduce VTE-associated morbidity and mortality.

Here, we assess whether machine learning models using structured EHR data can predict diagnosis and 1-year risk of VTE in two large biobanks. We train and evaluate these models using the Mount Sinai Data Warehouse (MSDW), comprised of 159,001 diverse ancestry patients from six New York City hospitals, and externally test them on the UK Biobank, comprised of 401,723 volunteers from across the United Kingdom, as well as All of Us, comprised of 123,039 volunteers from across the United States. These participants, representing a wide range of demographics and health statuses, allow us to develop models aimed at generalizing across different populations.

Methods

Data availability

Code used for the analyses and trained models have been made publicly available at Mendeley and can be accessed at https://doi.org/10.17632/tkwzysr4y6.6. Predictions can be generated online at https://colab.research.google.com/drive/1m0eh7Cj8BFaZpVQd7VZhLwMV8K8sGCBN?usp=sharing; they can also be generated locally by following the tutorial in Supporting Information 1. The UK Biobank and All of Us datasets used in this study are publicly available and can be accessed at https://bbams.ndph.ox.ac.uk/ams/ and https://workbench.researchallofus.org/, respectively. The MSDW dataset used in this study is only available to researchers at the Icahn School of Medicine at Mount Sinai; further information about this dataset is available at https://labs.icahn.mssm.edu/msdw/.

Sample cohorts

We trained and tested machine learning models using the Mount Sinai Data Warehouse (MSDW). MSDW is an Observational Medical Outcomes Partnership compliant database consisting of approximately 11 million patient records and 87 million patient encounters, both inpatient and outpatient, from six facilities across the Mount Sinai Health System (Mount Sinai Hospital, Mount Sinai Queens, Mount Sinai West, Mount Sinai Morningside, Mount Sinai Brooklyn, and Mount Sinai Beth Israel). For external testing (i.e., evaluation on independent datasets), we used the UK Biobank, a cohort comprising EHR and genetic data from 502,411 British volunteers aged 40–69 and enrolled between 2006 and 2010, as well as All of Us, a cohort comprising EHR and genetic data from 413,457 American volunteers over age 18 and enrolled from 2015 to present. The data collection and preprocessing methodologies for UK Biobank and All of Us were consistent with those employed for MSDW.

We defined VTE cases as participants with a deep vein thrombosis (DVT) or pulmonary embolism (PE) diagnosis. In the UK Biobank and All of Us, we identified VTE diagnoses using ICD-10 codes of I80, I81, I82, I26, O22.3, and O87.1 from inpatient visits. In MSDW, we curated a comprehensive list of VTE diagnostic concept names from both inpatient and outpatient visits. In both biobanks, we excluded participants with superficial venous thrombosis (SVT) prior to a DVT or PE diagnosis; although SVT is a risk factor for VTE, there is a high rate of concomitant DVT/PE among SVT patients,18,19 resulting in unclear case/control definitions. We also excluded participants who had a documented history of VTE or post-thrombotic syndrome but did not have a recorded prior VTE diagnosis. All other participants were defined as possible controls.

We assigned EHR cutoff dates to all participants using either the date of the first VTE diagnosis (MSDW, UK Biobank, and All of Us cases), the date of the most recent clinical encounter (MSDW and All of Us controls), or a randomly selected date within 3 years post-enrollment (UK Biobank controls). The distribution of cutoff dates was similar between cases and controls in MSDW and UK Biobank, although in All of Us, cases were biased towards earlier cutoff dates (Figure S1). In MSDW and All of Us, we filtered cases and controls for those whose cutoff date was after January 2000 and who were over the age of 18 years at the cutoff date. In the UK Biobank, we selected only participants of White, Asian, or Black self-reported ethnicity, removing those of “Mixed” or unspecified ethnicities due to ambiguous ethnicity classification. We also excluded participants with VTE occurrences prior to enrollment. Follow-up data was available for these participants until June 2022.

Features and data imputation

We used structured EHR data, including demographics, surgical and hospitalization history, laboratory and vital measurements, and diagnostic history to construct our models (Tables S1–S4). For VTE diagnosis models, we used surgical and hospitalization history up to the cutoff date; however, to prevent data leakage, for all other features we only used data from at least 30 days prior to the VTE date. For 1-year risk prediction models, we did not include surgical and hospitalization history, and used only data from at least 1 year prior to the cutoff date. For laboratory and vital measurements, we additionally only considered data from at most three years prior to the cutoff date due to their variability over longer time scales. Notably, other than selecting additional features for the large model (see “Model selection”), we did not perform any importance-based feature selection in this study to avoid overfitting, loss of information, and decreased generalizability.

For laboratory and vital measurements, we used only common measurements to increase portability of our model (i.e., the ability of our models to be used across different healthcare settings). To do so, we identified the 60 most frequently measured laboratory and vital values in MSDW, and then retained only measurements that were present in at least 20% of all cases and controls (n = 46). Of these 46 measurements, 10 were unavailable in UK Biobank and 2 were unavailable in All of Us; we set these values as null during external testing (Table S4). Subsequently, for each dataset, we removed participants missing more than 20% of 46 (MSDW), 36 (UK Biobank), or 44 (All of Us) measurements. Notably, we observed decreased performance among participants missing more than 20% of measurements (Table S5).

Because we included only commonly assessed measurements, we assume that missing values are Missing at Random (i.e., they are dependent solely on other measurements that are available and observed, and not on the value of the missing data itself), which allows us to impute them using values that are present. Thus, separately for each cohort, we imputed missing values for remaining participants over 20 iterations using the IterativeImputer function of the scikit-learn Python package (version 1.2.2). We observed better performance with imputation than without (Table S5), which we attribute to two reasons: first, the gradient boosting model bypasses nodes for missing features, resulting in performance loss that cannot be fully compensated by other features since they represent independent nodes; second, the gradient boosting model is not designed to capture relationships between features, resulting in loss of information without imputation. In each iteration, the scikit-learn function employs Bayesian ridge regression to estimate the missing values of each feature based on its relationship with other observed features, starting with features that have the fewest missing values and progressing to those with the most in a round-robin fashion. As background, Bayesian ridge regression is a modification of Bayesian regression wherein a Gaussian prior is imposed on the regression coefficients and a gamma distribution is used as the conjugate prior for the precision of this Gaussian.20 The hyperparameters governing these priors are estimated during model fitting by maximizing the log marginal likelihood.21 This provides stable coefficient estimates and reduces overfitting. We recommend that when deploying our models, either the same IterativeImputer function is used, or predictions are made with missing values left unimputed.
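
As a minimal sketch on made-up toy data (not the study's cohort), the imputation step described above can be reproduced with scikit-learn's IterativeImputer, whose default estimator is the Bayesian ridge regression described:

```python
import numpy as np
# IterativeImputer is experimental; this import is required to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix: rows are participants, columns are lab/vital measurements,
# with np.nan marking missing values.
X = np.array([
    [7.0, 2.1, np.nan],
    [6.5, np.nan, 140.0],
    [np.nan, 2.4, 138.0],
    [7.2, 2.0, 141.0],
])

# 20 iterations, as in the study; the default BayesianRidge estimator
# predicts each feature's missing values from the observed features.
imputer = IterativeImputer(max_iter=20, random_state=0)
X_imputed = imputer.fit_transform(X)
assert not np.isnan(X_imputed).any()  # all missing entries are now filled
```

Observed entries are left untouched; only the missing cells are replaced with model-based estimates.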

For diagnostic data, we used Elixhauser comorbidities (n = 31) rather than binary (yes/no) encoding of ICD-10 codes (n = 1,580) to reduce the number of features. Elixhauser comorbidities represent a comprehensive set of comorbidity measures that are associated with clinically relevant metrics (e.g., length of hospital stay and mortality).22 We defined these comorbidities using ICD-10 codes (Table S6). In the UK Biobank, 20,390 participants had ICD-9 diagnoses; we converted these to ICD-10 diagnoses using standard conversion tables.23 To capture time-dependent effects, we encoded Elixhauser comorbidities, ICD-10 codes, and medication classes using an ordinal 0 to 4 scale, with 4 indicating diagnosis or medication prescription within the past year, 3 between 2–5 years, 2 between 5–10 years, 1 beyond 10 years, and 0 indicating no diagnosis or medication prescription. In cases where the same diagnosis or medication prescription occurred more than once, we used the most recent event for six Elixhauser comorbidities (“Solid Tumor without Metastasis,” “Metastatic Cancer,” “Weight Loss,” “Leukemia/Lymphoma,” “Fluid and Electrolyte Disorders,” “Infection”), all ICD-10 codes, and medication prescriptions, and used the first event for all other Elixhauser comorbidities.
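
The 0-to-4 ordinal recency encoding described above can be sketched as a small helper; the function name is ours, and events falling between the text's "past year" and "2–5 years" buckets are assigned to bucket 3 here:

```python
def encode_recency(years_since_event):
    """Ordinal 0-4 recency encoding for a diagnosis or prescription.

    `years_since_event` is the time between the event and the cutoff date;
    `None` means no event on record.
    """
    if years_since_event is None:
        return 0   # no diagnosis or medication prescription
    if years_since_event <= 1:
        return 4   # within the past year
    if years_since_event <= 5:
        return 3   # the text's "2-5 years" bucket (interpreted as 1-5 here)
    if years_since_event <= 10:
        return 2   # between 5 and 10 years
    return 1       # beyond 10 years

assert encode_recency(None) == 0
assert encode_recency(0.5) == 4
assert encode_recency(15) == 1
```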

We assessed the importance of these features for our models using the SHAP Python package (version 0.41.0), which calculates Shapley Additive Explanations (SHAP) scores for each feature and participant. For a given participant, the SHAP score of a feature quantifies the contribution of that feature to that participant’s overall prediction.24 We used the participant-wise mean of the absolute values of SHAP scores for each feature as a proxy for feature importance.
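
The feature-importance proxy described above (participant-wise mean of absolute SHAP values) reduces to a simple aggregation; the matrix below is a made-up stand-in for the (participants × features) SHAP value array produced by the SHAP package:

```python
import numpy as np

# Toy SHAP values: one row per participant, one column per feature.
shap_values = np.array([
    [ 0.30, -0.10, 0.02],   # participant 1
    [-0.25,  0.05, 0.01],   # participant 2
    [ 0.40, -0.20, 0.03],   # participant 3
])

# Mean of absolute SHAP values per feature, used as the importance proxy.
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]  # most important feature first
assert ranking[0] == 0  # feature 0 dominates in this toy example
```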

Model selection

All our models use the LightGBM Python package (version 3.3.2), a computationally efficient gradient boosting framework. Gradient boosting builds on the concept of decision trees by combining multiple trees in a sequential manner to improve predictive accuracy and robustness; each new tree corrects the errors made by the sum of the previously built trees. LightGBM is state-of-the-art for tabular data and outperformed logistic regression, random forest, and XGBoost models (Table S7).25

To balance portability with performance, we constructed three types of models in MSDW with increasing numbers of features: the small model (60 features) includes demographics, counts of hospital visits and surgeries over 7, 30, 90, and 365-day timeframes, laboratory values and vitals, and binary smoking and alcohol use (Table S4); the medium model (91 features) adds 31 Elixhauser comorbidities (Table S6); and the large model (131 features) adds 20 ICD-10 codes and 20 medication classes (Table S8). To select additional features for the large model, we performed SHAP analysis on a machine learning model using all available structured features from the EHR (all 1,580 ICD-10 codes and 737 medication classes) and screened the most important features for biological plausibility and concordance with the literature. We then selected the 20 ICD-10 codes and medication classes with the highest mean absolute SHAP values.

Model training and testing

To train and evaluate models in MSDW, we first partitioned the MSDW cohort into 10 equally sized subsets, or “outer folds.” We used a nested cross-validation approach to ensure robust model training and evaluation. In this process, each fold served as a holdout set once, while the other nine folds were further divided into 10 smaller “inner folds” for training and validation. During the training procedure, LightGBM generated gradient-boosted trees using a training set and used a separate validation set to determine optimal parameters and when to stop training; of the 10 inner folds, one fold was used for validation and the remaining nine for training. The objective of training was to minimize binary log loss, which quantifies the difference between predicted probabilities and true binary labels. After training using its designated validation set, each model was tested on the holdout set. This process was iterated 10 times for each outer fold, resulting in a total of 100 iterations (10 outer folds x 10 inner folds). For each performance metric, we took the average over the 10 inner folds for every outer fold, and then used these 10 averaged values to generate 95% confidence intervals.
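
The fold structure described above can be sketched as follows; a counter stands in for the actual LightGBM training at each step:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # placeholder dataset

# 10 outer folds: each serves once as the holdout set.
outer = KFold(n_splits=10, shuffle=True, random_state=0)
n_fits = 0
for train_idx, holdout_idx in outer.split(X):
    # 10 inner folds over the remaining data: each serves once as the
    # validation set for early stopping, the rest for training.
    inner = KFold(n_splits=10, shuffle=True, random_state=0)
    for fit_idx, val_idx in inner.split(train_idx):
        # here a model would be trained on train_idx[fit_idx],
        # early-stopped on train_idx[val_idx], and tested on holdout_idx
        n_fits += 1

assert n_fits == 100  # 10 outer folds x 10 inner folds
```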

To externally test models in UK Biobank, we generated 10 subsetted cohorts. For each cohort, we randomly selected cases and controls from the entire UK Biobank dataset such that the proportions of (1) overall cases and controls and (2) cases and controls with a 7-day history of surgery were the same as the MSDW cohort. We refer to this procedure as “matching” in the rest of the manuscript. This accounts for two issues: (1) the UK Biobank consists of volunteers that are healthier than the general population,26 whereas our models are designed to be used in healthcare settings; and (2) 76% of UK Biobank cases had recent surgery compared to only 0.4% of controls, which could inflate metrics by allowing models to identify VTE based on surgical history alone. Matching to the MSDW cohort reduces both discrepancies. We generated predictions for UK Biobank participants using the 100 pre-trained model iterations from MSDW, using model iterations from each of the 10 outer folds to predict one of the 10 subsetted cohorts. We used null placeholder values for features present in MSDW and absent in UK Biobank. We averaged predictions for each of the 10 cohorts and used these averages to generate 95% confidence intervals.
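
The matching procedure can be illustrated as stratified subsampling. The per-cohort counts (338 cases, 4,249 controls; 8% of cases and 1% of controls with 7-day surgical history) follow the Results section, while the stratum pool sizes below are made-up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_stratum(pool_size, n_needed):
    """Randomly pick n_needed participants (by index) from a stratum."""
    return rng.choice(pool_size, size=n_needed, replace=False)

n_cases, n_controls = 338, 4249
n_case_surg = round(n_cases * 0.08)     # cases with 7-day surgical history
n_ctrl_surg = round(n_controls * 0.01)  # controls with 7-day surgical history

cases = np.concatenate([
    sample_stratum(1_000, n_case_surg),                # pool: cases with surgery
    sample_stratum(5_000, n_cases - n_case_surg),      # pool: cases without
])
controls = np.concatenate([
    sample_stratum(2_000, n_ctrl_surg),                # pool: controls with surgery
    sample_stratum(400_000, n_controls - n_ctrl_surg), # pool: controls without
])
assert len(cases) == n_cases and len(controls) == n_controls
```

Repeating this sampling ten times yields the ten subsetted cohorts, each matched to MSDW's case/control and surgical-history proportions.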

To externally test models in All of Us, we randomly split the entire dataset into 10 subsetted cohorts. Unlike UK Biobank, we did not perform matching as proportions of cases and controls were already similar to the MSDW cohort. We generated predictions using the 100 pre-trained model iterations from MSDW, using model iterations from each of the 10 outer folds to predict one of the 10 cohorts. We used null placeholder values for features present in MSDW and absent in All of Us. We averaged predictions for each of the 10 cohorts and used these averages to generate 95% confidence intervals.

We compared machine learning models to the Padua score because 10 of its 11 variables can be extracted from EHR data, unlike the Caprini score and Wells’ Criteria. To construct a modified Padua score, we included all components of the original Padua score other than “reduced mobility” and “recent trauma”; additionally, “previous VTE” was set to 0 for all patients as we only considered the first VTE diagnosis for all patients. Due to limited data on cancer metastasis and treatment status that were used in the original Padua score,9 we also redefined active cancer as any cancer diagnosis (ICD-10 codes C00-D49) within the past year. We defined “ongoing hormonal treatment” as any use of selective estrogen receptor modulators, menopausal hormonal therapy, or estrogenic agents within the past year.
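
A hypothetical sketch of this modified score is shown below. Point weights are taken from the original Padua publication; per the text, "reduced mobility" and "recent trauma" are dropped and "previous VTE" is fixed at 0. The assumption that recent surgery retains the 2 points of the original combined "trauma and/or surgery" item is ours, as are the flag names:

```python
def modified_padua(patient):
    """Modified Padua score from a dict of boolean EHR-derived flags.

    Weights follow the original Padua score; "reduced mobility" and
    "recent trauma" are omitted and "previous VTE" is fixed at 0.
    """
    score = 0
    score += 3 * patient.get("active_cancer", False)       # cancer dx in past year
    score += 3 * patient.get("thrombophilia", False)
    score += 2 * patient.get("recent_surgery", False)      # assumption: surgery kept
    score += 1 * patient.get("age_70_or_older", False)
    score += 1 * patient.get("heart_or_respiratory_failure", False)
    score += 1 * patient.get("acute_mi_or_stroke", False)
    score += 1 * patient.get("acute_infection_or_rheum", False)
    score += 1 * patient.get("obesity_bmi_30", False)
    score += 1 * patient.get("hormonal_treatment", False)  # per-text definition
    return score

assert modified_padua({}) == 0
assert modified_padua({"active_cancer": True, "age_70_or_older": True}) == 4
```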

Model evaluation and statistical analyses

We performed all evaluation and statistical analyses in Python 3.10.9 and R 3.5.0. We set the significance level at 0.05 for all tests. We generated plots using the seaborn Python package (version 0.12.2).

Using the scikit-learn Python package (version 1.2.2), we evaluated the ability of machine learning models to accurately classify VTE cases and controls using both threshold-independent metrics [area under the receiver operating curve (AUROC) and area under the precision recall curve (AUPRC)] and threshold-dependent metrics [sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV)]. We assessed model calibration using Brier scores. Using the statsmodels Python package (version 0.13.5), we generated 95% confidence intervals (x̄ ± 1.96 × SE) for these metrics across 100 iterations of model training and testing (see “Model training and testing”) and conducted paired-sample t-tests to test for significant differences in metrics between different models.
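
On toy predictions, these metrics and the confidence-interval formula look like this (all numbers made-up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

# Toy labels (1 = VTE case) and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.15, 0.8, 0.3, 0.7, 0.05, 0.4, 0.6, 0.2])

auroc = roc_auc_score(y_true, y_prob)            # threshold-independent
auprc = average_precision_score(y_true, y_prob)  # threshold-independent
brier = brier_score_loss(y_true, y_prob)         # calibration; lower is better
assert auroc == 1.0  # these toy scores separate cases and controls perfectly

# 95% confidence interval across repeated iterations: mean ± 1.96 × SE
aurocs = np.array([0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.80, 0.79, 0.81, 0.80])
se = aurocs.std(ddof=1) / np.sqrt(len(aurocs))
ci = (aurocs.mean() - 1.96 * se, aurocs.mean() + 1.96 * se)
```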

Using the lifelines Python package (version 0.27.4), we generated Kaplan-Meier curves demonstrating the relationship between survival probability and model prediction quintile. Using the survival R package (version 3.5–7), we tested the association between model prediction scores and both post-VTE mortality and VTE recurrence using Cox proportional hazards regressions adjusted for age, sex, and self-reported ethnicity. We checked for the proportional hazards assumption (Schoenfeld residuals), influential observations (deviance residuals), and nonlinearity (Martingale residuals); results of these checks for all regressions are available in Supporting Information 2–3.
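
The Kaplan-Meier estimate underlying these curves can be sketched without the lifelines dependency; with lifelines installed, KaplanMeierFitter().fit(durations, events) yields the same product-limit estimates:

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event_times, survival_probabilities).

    durations: follow-up time per participant; events: 1 = death, 0 = censored.
    """
    durations, events = np.asarray(durations), np.asarray(events)
    times = np.sort(np.unique(durations[events == 1]))
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)             # still under observation at t
        deaths = np.sum((durations == t) & (events == 1))
        s *= 1.0 - deaths / at_risk                  # product-limit update
        surv.append(s)
    return times, np.array(surv)

# 4 participants, no censoring: survival drops by one quarter at each event.
times, surv = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])
assert np.allclose(surv, [0.75, 0.5, 0.25, 0.0])
```

In practice one curve would be fit per prediction quintile to compare survival across risk strata.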

Results

Study population characteristics

Our study encompasses 683,763 individuals from three biobanks (MSDW, UK Biobank, and All of Us), selected from more than 11 million individuals for those meeting criteria for data completeness (Figure 1). We trained and validated models on 143,101 MSDW participants and evaluated them on a holdout set of 15,900 MSDW participants, with approximately 92% controls and 8% cases in both datasets (Table S1). We then externally tested models on UK Biobank and All of Us: UK Biobank participants are, on average, healthier than the general population,26 whereas All of Us participants are primarily recruited from healthcare settings and have significant comorbidities (Tables S2–S3).

Figure 1.

Study design and flowchart. Machine learning models were trained and validated in Mount Sinai Data Warehouse (MSDW), assessed in a holdout set in MSDW, and externally tested in the UK Biobank and All of Us.

*: Exclusionary diagnoses are diagnoses indicating a history of VTE or post-thrombotic syndrome without a corresponding VTE diagnosis (controls) or before the first known VTE diagnosis (cases). VTE: venous thromboembolism; AUROC: area under the receiver operating curve; AUPRC: area under the precision recall curve.

Of 159,001 MSDW participants of diverse ancestries (84,110 White, 32,376 Hispanic/Latino, 29,434 Black, 11,837 Asian, 1,244 other), the median age was 57 years [IQR 31]; 67,526 [42%] were male and 91,475 [57%] were female; and 12,092 [8%] were diagnosed with VTE (7,995 DVT, 4,097 PE) in inpatient and outpatient settings. 2,732 [23%] cases had D-dimer testing within 10 days of their diagnosis; 2,499 [85%] of these tests were positive (≥ 0.5). 99% of controls and 92% of cases had no surgeries within the past 7 days, while 98% of controls and 89% of cases had no history of hospitalization within the past 7 days.

Among 401,723 participants from the UK Biobank (386,172 White, 9,279 Asian, 6,272 Black), the median age was 59 years [IQR 13]; 181,407 [45%] were male and 220,316 [55%] were female; and 1,314 [0.3%] were diagnosed with VTE (766 PE, 548 DVT) in inpatient settings. For each iteration of testing, we randomly selected 4,249 [92%] controls and 338 [8%] cases from the 401,723 participants such that 99% of controls and 92% of cases had no 7-day history of surgery, matching the MSDW cohort. We performed this matching to prevent the marked difference in 7-day surgical history between VTE cases (76.2%) and controls (0.4%) from otherwise inflating performance (Table S9).

Among 123,039 participants from All of Us (73,781 White, 24,789 Hispanic/Latino, 21,238 Black, 3,069 Asian, 162 other), the median age was 58.0 years [IQR 26.2]; 46,901 [38.1%] were male and 76,138 [61.9%] were female; and 8,350 [6.8%] were diagnosed with VTE (4,533 DVT, 3,817 PE) in inpatient and outpatient settings. Similar to MSDW, 98% of controls and 91% of cases had no surgeries within the past 7 days, while 98% of controls and 90% of cases had no history of hospitalization within the past 7 days. Notably, both cases and controls had significantly higher prevalence of most Elixhauser comorbidities compared to both MSDW and UK Biobank.

We note the prevalence of cancer diagnoses was significantly higher in all three biobanks than the general population. For example, a diagnosis of non-metastatic solid tumor was present in 34.7% of cases and 28.9% of controls in MSDW; 26.3% of cases and 5.2% of controls in UK Biobank; and 36.8% of cases and 20.6% of controls in All of Us. The proportions of participants with metastatic solid tumor and leukemia/lymphoma diagnoses were similarly elevated. The high prevalence of cancer may have contributed to the high prevalence of VTE in MSDW (8%) and All of Us (7%), as well as affected prevalence-dependent metrics like AUPRC, PPV and NPV.

Machine learning models improve VTE diagnosis

We developed three models (small, medium, and large) with 60, 91, and 131 features, respectively, spanning a spectrum from optimized portability to maximized performance. In the MSDW holdout, UK Biobank, and All of Us datasets, respectively, we achieved AUROCs of 0.80 (95% CI 0.80–0.81), 0.78 (0.77–0.79), and 0.72 (0.72–0.72) for the small model; 0.81 (0.81–0.81), 0.80 (0.79–0.80), and 0.83 (0.83–0.83) for the medium model; and 0.83 (0.83–0.83), 0.79 (0.79–0.81), and 0.82 (0.82–0.82) for the large model (Table 1). We achieved maximum AUPRCs of 0.36 (0.35–0.37) and 0.33 (0.31–0.34) in the holdout and external test sets, respectively, significantly higher than the proportion of VTE cases in the MSDW and UK Biobank datasets (0.08). All models were well calibrated, with Brier scores of 0.06 (0.06–0.06) in both holdout and external test sets. Notably, the 131-feature large model had similar performance to a model using all 2,417 available features (including 1,580 ICD-10 codes and 737 medication classes), which attained an AUROC of 0.84 (0.83–0.84) in the holdout set.

Table 1.

Mean performance metrics for gradient boosting models in predicting venous thromboembolism diagnoses among Mount Sinai Data Warehouse (validation and holdout), UK Biobank (external test), and All of Us (external test) participants.

Dataset Model AUROC (95% CI) AUPRC (95% CI) Brier score (95% CI)

Validation Padua score 0.63 (0.63–0.63) 0.11 (0.11–0.11) 0.11 (0.11–0.11)
Small 0.80 (0.80–0.80) 0.31 (0.31–0.31) 0.06 (0.06–0.06)
Medium 0.81 (0.81–0.81) 0.32 (0.32–0.32) 0.06 (0.06–0.06)
Large 0.83 (0.83–0.83) 0.36 (0.36–0.36) 0.06 (0.06–0.06)
All available features 0.84 (0.84–0.84) 0.38 (0.37–0.38) 0.06 (0.06–0.06)
Holdout Padua score 0.63 (0.63–0.64) 0.11 (0.11–0.12) 0.11 (0.11–0.11)
Small 0.80 (0.80–0.81) 0.31 (0.30–0.31) 0.06 (0.06–0.06)
Medium 0.81 (0.81–0.81) 0.32 (0.31–0.33) 0.06 (0.06–0.06)
Large 0.83 (0.83–0.84) 0.36 (0.35–0.37) 0.06 (0.06–0.06)
All available features 0.84 (0.83–0.84) 0.37 (0.36–0.39) 0.06 (0.06–0.06)
External test (UK Biobank) Padua score 0.77 (0.77–0.77) 0.33 (0.32–0.33) 0.06 (0.06–0.06)
Small 0.78 (0.77–0.79) 0.29 (0.28–0.30) 0.07 (0.07–0.07)
Medium 0.80 (0.79–0.80) 0.32 (0.31–0.34) 0.06 (0.06–0.06)
Large 0.80 (0.79–0.81) 0.33 (0.31–0.34) 0.06 (0.06–0.06)
External test (All of Us) Padua score 0.61 (0.56–0.66) 0.10 (0.09–0.11) 0.12 (0.12–0.13)
Small 0.72 (0.72–0.72) 0.18 (0.18–0.19) 0.06 (0.06–0.06)
Medium 0.83 (0.83–0.83) 0.29 (0.28–0.29) 0.06 (0.06–0.06)
Large 0.82 (0.82–0.82) 0.27 (0.27–0.28) 0.06 (0.06–0.06)

Interpretation: AUROC evaluates the trade-off between sensitivity (ability to identify true positives) and specificity (ability to avoid false positives) across multiple thresholds, with 1.0 indicating perfect discrimination and 0.5 indicating random performance. AUPRC evaluates a model’s effectiveness at predicting positives over a range of thresholds by analyzing the balance between precision (correctly predicted positives out of all predicted positives) and recall (correctly predicted positives out of all actual positives). AUPRC values should be interpreted relative to the proportion of VTE cases, which is 0.08 in the validation, holdout, and UK Biobank datasets, and 0.07 in the All of Us dataset. Brier scores measure the accuracy of probabilistic predictions; the lower the score, the better the calibration.

Abbreviations: AUROC: area under the receiver operating curve; AUPRC: area under the precision recall curve; CI: confidence interval.

Because AUROC and AUPRC summarize performance across a range of threshold values, we subsequently analyzed sensitivity, specificity, PPV, and NPV at specific thresholds in MSDW (Table S10). Among all participants (7.6% VTE prevalence), a probability threshold of 0.025 may be appropriate for rule-out testing, where the large model achieves 95% sensitivity, 39% specificity, a PPV of 0.11, and an NPV of 0.99. Conversely, a threshold of 0.20 may be appropriate for a clinical alert system, where the large model achieves 40% sensitivity, 94% specificity, a PPV of 0.36, and an NPV of 0.95. However, optimal thresholds are highly dependent on the target population: among participants with recent hospitalization (37.6% VTE prevalence) or surgery (47.1% VTE prevalence), optimal thresholds for rule-out testing are 0.25 and 0.30, respectively, whereas optimal thresholds for clinical alerts are 0.50 and 0.60, respectively (Tables S11–S12).
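
The threshold-specific metrics above derive from a confusion matrix at a fixed probability cutoff; a minimal sketch on made-up toy predictions:

```python
import numpy as np

def threshold_metrics(y_true, y_prob, threshold):
    """Sensitivity, specificity, PPV, and NPV at a fixed probability cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),  # recall among true cases
        "specificity": tn / (tn + fp),  # correct rejections among controls
        "ppv": tp / (tp + fp),          # precision of positive calls
        "npv": tn / (tn + fn),          # reliability of negative calls
    }

m = threshold_metrics([1, 1, 0, 0, 0], [0.9, 0.3, 0.1, 0.6, 0.2], threshold=0.5)
assert m["sensitivity"] == 0.5 and m["specificity"] == 2 / 3
```

Lowering the threshold trades specificity for sensitivity, which is why rule-out testing and clinical alerts call for different cutoffs.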

We compared models to a modified Padua score, which achieved AUROCs of 0.63 (0.63–0.64), 0.77 (0.77–0.77), and 0.61 (0.56–0.66) in the MSDW holdout, UK Biobank, and All of Us datasets, respectively (Table 1). In all datasets, all three models outperformed the Padua score in AUROC and AUPRC (p < 0.01); further, at fixed sensitivity thresholds in the holdout set, the Padua score had lower specificity than all three machine learning models, and vice versa (Tables S10–S12). The relatively strong performance of the Padua score in the UK Biobank may be explained by all VTE diagnoses in the UK Biobank being inpatient diagnoses, consistent with the Padua score being designed for hospitalized patients.

Although some surgical procedures, including orthopedic, neurosurgical, and major vascular procedures, carry higher VTE risk than others,27,28 a medium model that separated surgical procedures by organ system attained an AUROC of 0.81 (0.81–0.81) in the holdout set, which was not significantly different from that of a medium model incorporating only the total number of surgeries.

Model performance is similar among different patient cohorts

To assess the generalizability and robustness of our models, we evaluated how models trained on the entire training dataset performed when predicting VTE in specific subsets of patients within the MSDW holdout dataset, defined by hospitalization status, surgical history, emergency department visits, ethnicity, cancer diagnosis, anticoagulant use, and age (Table 2). Medium and large models significantly outperformed the Padua score in all tested subsets, including in recently hospitalized patients, where the medium model, large model, and Padua score achieved AUROCs of 0.72 (0.71–0.74), 0.75 (0.73–0.76), and 0.59 (0.58–0.61), respectively.

Table 2.

Mean performance metrics for medium and large models and the Padua score in predicting venous thromboembolism diagnoses among selected patient subsets.

Subset Medium model Large model Padua score

All patients 0.81 (0.81–0.81) 0.83 (0.83–0.84) 0.63 (0.63–0.64)
Recent medical history
  No 7-day hospitalization 0.80 (0.79–0.80) 0.82 (0.82–0.83) 0.63 (0.63–0.63)
  7-day hospitalization 0.72 (0.71–0.74) 0.75 (0.73–0.76) 0.59 (0.58–0.61)
  No 7-day surgery 0.80 (0.79–0.80) 0.82 (0.82–0.83) 0.63 (0.62–0.63)
  7-day surgery 0.78 (0.76–0.79) 0.80 (0.78–0.81) 0.60 (0.58–0.62)
  No 7-day ED visit 0.81 (0.81–0.81) 0.83 (0.83–0.84) 0.63 (0.63–0.64)
  7-day ED visit 0.78 (0.75–0.81) 0.80 (0.76–0.83) 0.68 (0.64–0.72)
Ethnicity
  Asian 0.85 (0.83–0.87) 0.87 (0.85–0.89) 0.70 (0.67–0.72)
  Black 0.78 (0.78–0.79) 0.81 (0.80–0.82) 0.61 (0.60–0.62)
  Hispanic/Latino 0.81 (0.79–0.82) 0.83 (0.83–0.84) 0.64 (0.63–0.64)
  White 0.81 (0.80–0.81) 0.83 (0.82–0.83) 0.64 (0.63–0.64)
Anticoagulation status
  Not anticoagulated 0.81 (0.81–0.82) 0.83 (0.82–0.84) 0.62 (0.62–0.62)
  Anticoagulated 0.75 (0.74–0.75) 0.79 (0.78–0.80) 0.60 (0.59–0.60)
Cancer diagnosis
  No 0.81 (0.81–0.82) 0.83 (0.83–0.84) 0.63 (0.62–0.64)
  Yes 0.80 (0.80–0.81) 0.83 (0.82–0.83) 0.64 (0.63–0.64)
VTE type
  DVT 0.81 (0.81–0.81) 0.83 (0.83–0.84) 0.64 (0.63–0.64)
  PE 0.81 (0.81–0.82) 0.83 (0.82–0.83) 0.63 (0.62–0.64)
Age group
  Age < 40 0.83 (0.82–0.84) 0.85 (0.84–0.86) 0.63 (0.61–0.64)
  40 ≤ Age ≤ 80 0.79 (0.79–0.79) 0.82 (0.81–0.82) 0.60 (0.60–0.61)
  Age ≥ 80 0.70 (0.69–0.71) 0.73 (0.72–0.74) 0.53 (0.52–0.55)

Methodology: In this analysis, models were trained and validated using all patients from the respective datasets and evaluated on subsets of patients in the holdout set meeting the above criteria. Values are presented as mean area under the receiver operating curve (95% confidence interval).

Definitions: Medium model: model with 91 features, including laboratory and vital measurements and Elixhauser comorbidities. Large model: model with 131 features, including 20 ICD-10 codes and 20 medication classes beyond the medium model. 7-day hospitalization: at least one day hospitalized within the 7 days prior to the cutoff date. 7-day surgery: at least one surgical procedure within the 7 days prior to the cutoff date. 7-day ED visit: at least one emergency department (ED) visit within the 7 days prior to the cutoff date.

Abbreviations: ED: emergency department; DVT: deep vein thrombosis; PE: pulmonary embolism.

We also observed consistently better performance across subsets for the large model than the medium model, with AUROCs of 0.81–0.87 in all ethnic groups and significantly higher AUROCs in recently hospitalized patients (0.75 versus 0.72) and perioperative patients (0.80 versus 0.78). To broaden our models' applicability, we did not exclude patients on anticoagulants, given the diverse indications for anticoagulation beyond prophylaxis in patients with a previous VTE diagnosis. However, adding 20 medication classes as features improved the AUROC among anticoagulated patients from 0.75 in the medium model to 0.79 in the large model.

Both previously known and novel clinical features contribute to VTE predictive power

To ascertain the significance of various features, we conducted a Shapley Additive Explanations (SHAP) analysis, which determines the contribution of each feature to the model prediction for each individual patient. Among the 20 most important features for the medium model, 12 were laboratory or vital measurements and three were Elixhauser comorbidities (Figure 2a). By examining the correlation between SHAP values and laboratory values, we could discern whether specific features increased or decreased the model's probability score for VTE. Among the six most important features (red cell distribution width, albumin, age, hospitalizations and surgeries in the past 7 days, and BMI), all but albumin were positively associated with SHAP values (Figure 2b–g).

Figure 2.

Figure 2.

Analysis of important features for venous thromboembolism diagnosis models.

(A) Top 20 features for the medium model among MSDW patients. Feature importances were determined using Shapley Additive Explanations (SHAP) analysis. Bars are colored according to the Spearman’s correlation coefficient (ρ) between feature values and SHAP values. (B-G) Scatterplots of feature values against SHAP values for the top six features. SD: standard deviation; RDW: red blood cell distribution width; BMI: body mass index; MPV: mean platelet volume; MCHC: mean corpuscular hemoglobin concentration; eGFR: estimated glomerular filtration rate; A/G ratio: albumin/globulin ratio.
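
SHAP values approximate Shapley values, which average each feature's marginal contribution to the prediction over all possible feature orderings. As a minimal sketch under stated assumptions (a hypothetical two-feature toy risk model, not the study's actual LightGBM model, for which the SHAP library24 computes these efficiently), the exact computation is:

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, features):
    """Exact Shapley values: for each feature, average its marginal
    contribution over all subsets of the remaining features, weighted by
    the number of orderings each subset represents. Tractable only for a
    handful of features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (predict(s | {f}) - predict(s))
        phi[f] = total
    return phi

# Hypothetical toy risk model: high RDW adds 0.3, recent hospitalization
# adds 0.2, and the two together add an extra 0.1 interaction term.
def toy_model(present):
    score = 0.1  # baseline risk
    if "high_rdw" in present:
        score += 0.3
    if "recent_hospitalization" in present:
        score += 0.2
    if {"high_rdw", "recent_hospitalization"} <= present:
        score += 0.1
    return score

phi = exact_shapley(toy_model, ["high_rdw", "recent_hospitalization"])
# The per-feature contributions sum to prediction minus baseline: 0.7 - 0.1 = 0.6
```

The interaction term is split evenly between the two features, illustrating how SHAP distributes credit for joint effects across individual features.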

The medium model incorporated 43 laboratory values and three vitals, with 24 previously linked to VTE in existing research and 19 having no known or only non-significant associations. Out of the 24 established associations, 19 displayed a correlation direction between SHAP values and laboratory values consistent with the current literature, whereas five (HDL cholesterol, aspartate aminotransferase, red blood cell count, neutrophil count, and monocyte count) demonstrated inconsistent correlations (Table S13). The medium model also identified complex relationships between laboratory values and VTE risk, including U-shaped responses in mean corpuscular hemoglobin (Figure S2a–b) that confirmed findings from a previous study.29

Demographics and Elixhauser comorbidities were also important features. SHAP values increased with increasing age between ages 40 and 70, which concurs with previous reports that patients > 40 years of age are at greater risk of VTE than younger patients.30,31 However, there was a decrease in SHAP values with increasing age beyond age 70, which has not been previously reported. SHAP values also increased among those with recent cancer diagnoses (separated into solid tumor without metastasis, metastatic cancer, and leukemia/lymphoma), peripheral vascular disorders (e.g., atherosclerosis, aneurysms, and arterial thrombosis), and thrombotic coagulopathies (e.g., diffuse intravascular coagulation and thrombophilia) (Figure S2c–g). For these Elixhauser comorbidities, the time since diagnosis influenced the model outputs, with those diagnosed within the past year exhibiting significantly higher SHAP values than those first diagnosed more than 10 years ago.

Model probability scores are predictive of increased post-VTE mortality

As VTE is associated with increased mortality,5 we evaluated the association between model probability scores and all-cause mortality. We found that model probability scores were associated with a graded increase in mortality risk in both UK Biobank and All of Us (Figure 3a–d; Figure S3a–h), with adjusted hazard ratios (HR) per quintile increase in score of 1.28 (95% CI 1.26–1.29) and 1.21 (1.17–1.25) among both cases and controls, 1.25 (1.24–1.27) and 1.21 (1.17–1.26) among only controls, and 1.40 (1.09–1.81) and 1.22 (1.02–1.46) among only cases, respectively. In comparison, the HRs for positive VTE status were 7.90 (7.25–8.61) and 1.32 (1.12–1.55) in UK Biobank and All of Us, respectively (Figure 3e–h). Model scores were also associated with VTE recurrence in All of Us, with an adjusted HR per quintile increase in score of 2.18 (2.14–2.23) (Figure S4a–b); however, these associations were not significant in MSDW or UK Biobank (Figure S4c–f).

Figure 3.

Figure 3.

All-cause mortality among UK Biobank and All of Us participants.

All-cause mortality among UK Biobank participants (A-B, E-F) and All of Us participants (C-D, G-H) was stratified by either model quintiles (A-D) or venous thromboembolism (VTE) status (E-H) among both cases and controls. Adjusted HR for mortality increased (A, C, E, G) and survival probability over time decreased (B, D, F, H) with positive VTE status as well as increasing model probability score quintiles. Below each timepoint of the Kaplan-Meier curves, “At risk” refers to the number of participants who are still alive; “Censored” refers to the number of participants who remained alive up to the latest data updates (2022-12-19 for UK Biobank and 2022-08-15 for All of Us); and “Events” refers to the cumulative number of deaths. Assumptions for Cox proportional hazards regressions are shown in Supporting Information 2. HR: hazard ratio; CI: confidence interval.
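
The survival curves in Figure 3 are Kaplan-Meier product-limit estimates. A minimal sketch with a made-up five-participant toy cohort (not study data) shows how the "At risk" and "Events" counts beneath each timepoint feed the estimate:

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate. `times` are follow-up times;
    `events` is 1 for a death and 0 for censoring."""
    # Distinct times at which at least one death occurred, ascending
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)  # "At risk" at time t
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)  # "Events"
        surv *= 1.0 - deaths / at_risk               # multiply the survival fractions
        curve.append((t, surv))
    return curve

# Toy cohort: deaths at t=2 and t=4; censoring at t=3 and t=5.
times = [2, 3, 4, 5, 5]
events = [1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
# At t=2: 5 at risk, 1 death -> S = 4/5; at t=4: 3 at risk, 1 death -> S = 4/5 * 2/3
```

Censored participants contribute to the at-risk counts until their censoring time but never to the death counts, which is why the curves report "Censored" separately from "Events".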

Machine learning models allow 1-year VTE risk prediction

Because early detection of VTE is crucial, we developed machine learning models to predict the 1-year risk of VTE. These models are similar to the VTE diagnosis models but exclude hospitalization and surgical history. In the MSDW holdout, UK Biobank, and All of Us datasets, respectively, we achieved AUROCs of 0.76 (95% CI 0.75–0.77), 0.64 (0.63–0.65), and 0.64 (0.63–0.64) for the small model; 0.77 (0.76–0.78), 0.65 (0.64–0.65), and 0.69 (0.68–0.69) for the medium model; and 0.78 (0.78–0.79), 0.65 (0.64–0.65), and 0.69 (0.68–0.69) for the large model (Table 3). While 1-year risk prediction is outside its designed purpose, the modified Padua score achieved significantly lower AUROCs of 0.58 (0.58–0.58) and 0.54 (0.50–0.58) in the holdout and All of Us datasets (p < 0.001). All models showed significantly lower precision and recall than the corresponding VTE diagnosis models.

Table 3.

Mean performance metrics for gradient boosting models in predicting 1-year risk of venous thromboembolism among Mount Sinai Data Warehouse (validation and holdout), UK Biobank (external test), and All of Us (external test) participants.

Dataset Model AUROC (95% CI) AUPRC (95% CI) Brier score (95% CI)

Validation Padua score 0.57 (0.57–0.57) 0.10 (0.10–0.10) 0.13 (0.12–0.13)
Small 0.76 (0.76–0.76) 0.19 (0.19–0.19) 0.08 (0.08–0.09)
Medium 0.77 (0.77–0.77) 0.20 (0.20–0.20) 0.08 (0.08–0.09)
Large 0.78 (0.78–0.79) 0.22 (0.22–0.22) 0.08 (0.07–0.08)
Holdout Padua score 0.57 (0.56–0.58) 0.10 (0.08–0.13) 0.13 (0.13–0.13)
Small 0.76 (0.75–0.77) 0.19 (0.16–0.21) 0.08 (0.08–0.09)
Medium 0.77 (0.76–0.78) 0.20 (0.17–0.22) 0.08 (0.08–0.09)
Large 0.78 (0.78–0.79) 0.22 (0.19–0.25) 0.08 (0.07–0.08)
External test (UK Biobank) Padua score 0.65 (0.65–0.65) 0.14 (0.14–0.14) 0.07 (0.07–0.07)
Small 0.64 (0.63–0.65) 0.13 (0.13–0.14) 0.11 (0.11–0.12)
Medium 0.65 (0.64–0.65) 0.14 (0.13–0.15) 0.12 (0.11–0.12)
Large 0.65 (0.64–0.66) 0.15 (0.14–0.15) 0.12 (0.11–0.12)
External test (All of Us) Padua score 0.54 (0.50–0.58) 0.06 (0.05–0.07) 0.12 (0.11–0.13)
Small 0.64 (0.63–0.64) 0.09 (0.08–0.09) 0.09 (0.08–0.09)
Medium 0.69 (0.68–0.69) 0.12 (0.12–0.12) 0.08 (0.08–0.08)
Large 0.69 (0.68–0.69) 0.12 (0.12–0.12) 0.08 (0.08–0.08)

Interpretation: AUROC evaluates the trade-off between sensitivity (ability to identify true positives) and specificity (ability to avoid false positives) across multiple thresholds, with 1.0 indicating perfect discrimination and 0.5 indicating random performance. AUPRC evaluates a model’s effectiveness at predicting positives over a range of thresholds by analyzing the balance between precision (correctly predicted positives out of all predicted positives) and recall (correctly predicted positives out of all actual positives). AUPRC values should be interpreted relative to the proportion of VTE cases, which is 0.08 in the validation, holdout, and UK Biobank datasets, and 0.07 in the All of Us dataset. Brier scores measure the accuracy of probabilistic predictions; the lower the score, the better the predictions are calibrated.

Abbreviations: AUROC: area under the receiver operating curve; AUPRC: area under the precision recall curve; CI: confidence interval.
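
The three metrics interpreted above can be computed directly with scikit-learn. The toy example below (synthetic labels and probabilities, not study data) illustrates how AUPRC should be read against prevalence and how the Brier score penalizes miscalibrated probabilities:

```python
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

# Synthetic predictions for 8 participants (1 = VTE within 1 year)
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.60, 0.40, 0.80]

auroc = roc_auc_score(y_true, y_prob)            # threshold-free discrimination
auprc = average_precision_score(y_true, y_prob)  # compare to prevalence (2/8 = 0.25)
brier = brier_score_loss(y_true, y_prob)         # mean squared error of probabilities
```

Here the AUPRC of 5/6 far exceeds the 0.25 prevalence baseline, the reference point against which the Table 3 AUPRCs (prevalence 0.07–0.08) should likewise be judged.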

Discussion

In this study, we established that machine learning models leveraging pre-existing structured EHR data accurately predict VTE diagnosis and 1-year risk across diverse populations: diagnosis models outperformed the Padua score in three large-scale biobanks, including among the Padua score's target population of hospitalized patients, while 1-year risk prediction models have not previously been reported. Although we trained models using data from only one health system, models performed similarly between the holdout (MSDW) and external test (UK Biobank and All of Us) datasets, despite significant demographic and health status differences between these cohorts. Models also demonstrated robust performance across various patient subsets within MSDW, including ethnicity, hospitalization status, surgical history, and cancer diagnosis. This suggests our models can be applied in a wide range of clinical settings (e.g., inpatient, outpatient, perioperative, and emergency department), in contrast to existing clinical scores. However, performance was reduced in recently hospitalized patients and those using anticoagulants, possibly due to insufficient data for these specific subgroups and dynamic feature changes not captured in our models.

Consistent with the significant risk of mortality following VTE, probability scores from our models were associated with graded increases in mortality risk among all participants in the UK Biobank and All of Us cohorts. These graded associations were present separately among VTE cases, supporting the validity of our models in identifying patients most in need of intervention, and among controls, suggesting possible underdiagnosis of VTE in the general population.

Our models were able to identify both established and novel clinical features that contribute to VTE risk, providing insights into the underlying pathophysiology and potential avenues for further investigation. For instance, the two most important features, high red blood cell distribution width and hypoalbuminemia, have both been previously identified as risk factors for VTE;32–34 although the pathophysiology is unknown, this may be due to high RDW and hypoalbuminemia being markers of inflammation,35,36 high RDW affecting intravascular hemodynamics,37 and hypoalbuminemia being a proxy for prothrombotic states in nephrotic syndrome.34 We also identified several possible risk factors for VTE that have not been previously reported, including low basophil counts and percentages, high anion gap, and high eosinophil counts and percentages (Table S8); these features may have been missed in prior analyses due to inadequate sample and/or effect sizes. Further, unlike clinical scores or logistic regression models, our gradient boosting models capture non-linear and temporal relationships between features and VTE risk, as we observed for nearly all features (Figure 2b–g, Figure S2a–b). Surprisingly, despite known associations between increased cancer stage and grade and increased VTE risk,38–40 we observed higher SHAP values for cancer diagnoses within the past year compared to older diagnoses for all three types of cancer (non-metastatic solid tumor, metastatic cancer, and leukemia/lymphoma). This may be due to survivorship bias, imperfect correlations between diagnosis timing and cancer progression, and/or recent initiation of anticancer therapies, but ultimately suggests further investigation is needed.

In addition to improving VTE diagnosis, our models could also predict 1-year risk for VTE with AUROCs of 0.76–0.78 in the holdout set and 0.64–0.69 in external test sets. This reduced accuracy compared to diagnosis models likely reflects the importance of recent stasis and endothelial injury (e.g., through hospitalization and surgery) for VTE development. Although anticoagulation is indicated for short-term prophylaxis among hospitalized patients and for preventing VTE recurrence,41,42 long-term prophylaxis is uncommon due to bleeding risks. However, the possibility of 1-year VTE risk prediction warrants further research on long-term prophylaxis for high-risk patients, especially with the recent development of anticoagulants with minimal bleeding risk.43

Based on these results, we propose two clinical use-cases for our models. First, similar to existing EHR-embedded VTE risk-stratification tools,14–17 our models could serve as an alert system, notifying clinicians when patients' scores exceed a carefully selected threshold. In the general patient population, a threshold of 0.20 yields 94% specificity and a PPV of 0.36 for the large model, while among hospitalized patients, a threshold of 0.60 yields 88% specificity and a PPV of 0.78 (Tables S10–S12). Deciding on the exact threshold and subsequent actions requires evaluating several factors, including VTE prevalence, the resources needed to perform workup, and the number of cases that would otherwise be missed. In populations with low VTE prevalence, subsequent non-invasive testing (e.g., D-dimer or compression ultrasound) may be necessary, whereas in populations with high VTE prevalence, physicians may decide to begin anticoagulation without further testing. Our model retains utility even when subsequent D-dimer or ultrasound evaluation is necessary, as a substantial portion of patients (39% in MSDW and 40% in All of Us) have sufficient existing data for our model to be run without any additional measurements.

Second, our model could be used as a rule-out test: among all participants, the large model has a sensitivity of 95% and a specificity of 39% at a threshold of 0.025; among hospitalized patients, the large model has a sensitivity of 90% and a specificity of 41% at a threshold of 0.25; and among postoperative patients, the large model has a sensitivity of 92% and a specificity of 43% at a threshold of 0.30. These values are comparable to D-dimer assays, which have sensitivities of 92–97% and specificities of 41–65% for DVT and PE;44–46 unlike D-dimer, however, our model generates instantaneous predictions and uses already available data.

Ultimately, any clinical application of our machine learning models requires integration into EHR systems for automated prediction using real-time patient data, as it is infeasible for clinicians to manually input over 130 features at the bedside. This is similar to the Epic Sepsis Score,12,13 which has already been implemented at hundreds of hospitals. To lower barriers to adoption, we demonstrate that portable models utilizing a reduced set of features (i.e., demographics, surgical and hospitalization history, and laboratory values) still achieved high predictive accuracy. We selected only laboratory values that are commonly assessed, with 33 of the 43 values included in either a complete blood count or a comprehensive metabolic panel; Elixhauser comorbidities may also be easier to assess than a full or limited set of ICD-10 diagnostic codes.

Beyond EHR integration, developing these technologies for clinical implementation requires addressing important regulatory and ethical concerns. Stringent regulatory frameworks are needed to evaluate the safety and efficacy of AI-based medical devices and ensure they meet healthcare quality standards. Protecting patient privacy and confidentiality is also essential, requiring properly de-identified data and secure systems. The practical aspects of intuitive user interface design, clinician training, and seamless integration with clinical workflows are critical to ensure AI tools are effectively used in patient care.47,48 Further research is needed to incorporate these models into the clinical environment and validate their performance, enabling healthcare providers to easily access and effectively use the predictive information.

Despite the promising results of our study, several limitations should be acknowledged. The first limitation is that our VTE diagnosis models were trained to predict the diagnosis of VTE rather than the occurrence of VTE, which is a general limitation of retrospective studies. However, because we use laboratory and vital measurements from up to three years prior to the diagnosis date, our model may still capture physiological changes associated with VTE. The second limitation is that the performance of our models in other healthcare systems and populations remains to be assessed. Our study evaluated the model in one hospital-based cohort and two national biobanks, but all three datasets had higher prevalence of non-metastatic cancer and several other comorbidities than the general population.49 Nevertheless, our subset analyses suggest our models have robust performance across clinical settings (Table 2). The third limitation is that 59% of MSDW participants and 61% of All of Us participants lacked more than 20% of laboratory and vital measurements and were excluded from our study as our model performed poorly on individuals with large amounts of missing data (Table S5). This limits the applicability of our models to patients with comprehensive data and may have introduced ascertainment bias for participants with frequent clinical encounters. 
Our cohorts nevertheless consist of participants with a wide array of demographics and comorbidities (Tables S1–S3), and our models can still be used for the ~40% of patients with limited amounts of missing data: just as we imputed data separately for MSDW, UK Biobank, and All of Us participants, health systems applying our models can perform imputation using their own patient data on a per-patient basis or in batches.50,51 If only a small number of features are missing, predictions can be generated without imputation as the LightGBM architecture of our model bypasses missing features; we did so in this study when performing external testing on UK Biobank and All of Us participants. For optimal performance, we recommend that participants have at least 80% of the 46 laboratory and vital measurements available. The fourth limitation is that while our models were able to predict VTE occurrence, they were not predictive of VTE recurrence, necessitating the development of separate models for this purpose.52 This may reflect differences in risk factors for initial VTE compared to VTE recurrence as well as a lack of information regarding prophylaxis after initial VTE. The fifth limitation is that ICD-10 diagnosis codes, as used in the large model, can be biased and are sensitive to misclassification.53 However, the small and medium models, which did not use diagnosis codes, demonstrated similar performance to the large model. The sixth limitation is that we were unable to compare our model to existing VTE models. Prior studies generally have not made their models publicly available, so we cannot assess them on our datasets.11 Additionally, many such studies have strict inclusion criteria whereas our study targets general patient populations, so we cannot directly compare machine learning metrics.54,55

In conclusion, our study demonstrates that machine learning models using EHR data can significantly improve VTE diagnosis and 1-year risk prediction in diverse patient populations compared to the Padua score. The adoption of these models in clinical practice may enhance VTE risk assessment and early detection, ultimately reducing the morbidity and mortality associated with this condition. Future research should focus on validating and refining these models in different healthcare settings, as well as exploring their potential utility in predicting VTE recurrence and informing personalized treatment strategies.

Supplementary Material

Supplemental Publication Material

Highlights.

  • Machine learning models using structured EHR data accurately predicted VTE diagnosis and 1-year risk in diverse populations, outperforming the Padua risk score.

  • Models demonstrated robust performance across different clinical settings and patient subgroups, including hospitalized patients, surgical patients, different ethnicities, and cancer patients.

  • Model probability scores were associated with increased post-VTE mortality risk on a continuum.

  • Proposed clinical applications include using models as alert systems or rule-out tests to improve early VTE detection and prevention.

Acknowledgements:

This work was supported in part through the Mount Sinai Data Warehouse (MSDW) resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai. This work also required the use of the All of Us Research Program, which is supported by the Office of the Director of the National Institutes of Health (NIH) and would not be possible without the partnership of its participants.

Sources of Funding:

RC is supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) (T32-GM007280). RD is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836).

Disclosures:

Dr. Do reported receiving grants from AstraZeneca, grants and nonfinancial support from Goldfinch Bio, being a scientific co-founder, consultant and equity holder for Pensieve Health (pending), and being a consultant for Variant Bio, all not related to this work. All other authors have reported that they have no relationships relevant to the contents of this paper to disclose.

Non-standard abbreviations and acronyms

VTE

venous thromboembolism

DVT

deep vein thrombosis

PE

pulmonary embolism

EHR

electronic health records

MSDW

Mount Sinai Data Warehouse

SHAP

Shapley Additive Explanations

AUROC

area under the receiver-operating curve

AUPRC

area under the precision-recall curve

PPV

positive predictive value

NPV

negative predictive value

References

  • 1. Lutsey PL, Zakai NA. Epidemiology and prevention of venous thromboembolism. Nat. Rev. Cardiol 2023;20:248–262.
  • 2. Waheed SM, Kudaravalli P, Hotwagner DT. Deep Vein Thrombosis [Internet]. In: StatPearls. Treasure Island (FL): StatPearls Publishing; 2023 [cited 2023 Apr 17]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK507708/
  • 3. Huang W, Goldberg RJ, Cohen AT, Anderson FA, Kiefe CI, Gore JM, Spencer FA. Declining Long-term Risk of Adverse Events after First-time Community-presenting Venous Thromboembolism: The Population-based Worcester VTE Study (1999 to 2009). Thromb. Res 2015;135:1100–1106.
  • 4. Tagalakis V, Patenaude V, Kahn SR, Suissa S. Incidence of and mortality from venous thromboembolism in a real-world population: the Q-VTE Study Cohort. Am. J. Med 2013;126:832.e13–21.
  • 5. Søgaard KK, Schmidt M, Pedersen L, Horváth–Puhó E, Sørensen HT. 30-Year Mortality After Venous Thromboembolism. Circulation. 2014;130:829–836.
  • 6. Nicholson M, Chan N, Bhagirath V, Ginsberg J. Prevention of Venous Thromboembolism in 2020 and Beyond. J. Clin. Med 2020;9:2467.
  • 7. Fanikos J, Piazza G, Zayaruzny M, Goldhaber SZ. Long-term complications of medical patients with hospital-acquired venous thromboembolism. Thromb. Haemost 2009;102:688–693.
  • 8. Cronin M, Dengler N, Krauss ES, Segal A, Wei N, Daly M, Mota F, Caprini JA. Completion of the Updated Caprini Risk Assessment Model (2013 Version). Clin. Appl. Thromb 2019;25:1076029619838052.
  • 9. Barbar S, Noventa F, Rossetto V, Ferrari A, Brandolin B, Perlati M, De Bon E, Tormene D, Pagnan A, Prandoni P. A risk assessment model for the identification of hospitalized medical patients at risk for venous thromboembolism: the Padua Prediction Score. J. Thromb. Haemost. JTH 2010;8:2450–2457.
  • 10. Wells PS, Anderson DR, Rodger M, Stiell I, Dreyer JF, Barnes D, Forgie M, Kovacs G, Ward J, Kovacs MJ. Excluding pulmonary embolism at the bedside without diagnostic imaging: management of patients with suspected pulmonary embolism presenting to the emergency department by using a simple clinical model and d-dimer. Ann. Intern. Med 2001;135:98–107.
  • 11. Ryan L, Mataraso S, Siefkas A, Pellegrini E, Barnes G, Green-Saxena A, Hoffman J, Calvert J, Das R. A Machine Learning Approach to Predict Deep Venous Thrombosis Among Hospitalized Patients. Clin. Appl. Thromb. Off. J. Int. Acad. Clin. Appl. Thromb 2021;27:1076029621991185.
  • 12. Cull J, Brevetta R, Gerac J, Kothari S, Blackhurst D. Epic Sepsis Model Inpatient Predictive Analytic Tool: A Validation Study. Crit. Care Explor 2023;5:e0941.
  • 13. Wong A, Otles E, Donnelly JP, et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Intern. Med 2021;181:1–6.
  • 14. Amland RC, Dean BB, Yu H, Ryan H, Orsund T, Hackman JL, Roberts SR. Computerized Clinical Decision Support to Prevent Venous Thromboembolism Among Hospitalized Patients: Proximal Outcomes from a Multiyear Quality Improvement Project. J. Healthc. Qual. JHQ 2015;37:221.
  • 15. Novis SJ, Havelka GE, Ostrowski D, Levin B, Blum-Eisa L, Prystowsky JB, Kibbe MR. Prevention of thromboembolic events in surgical patients through the creation and implementation of a computerized risk assessment program. J. Vasc. Surg 2010;51:648–654.
  • 16. Bhalla R, Berger MA, Reissman SH, Yongue BG, Adelman JS, Jacobs LG, Billett H, Sinnett MJ, Kalkut G. Improving hospital venous thromboembolism prophylaxis with electronic decision support. J. Hosp. Med 2013;8:115–120.
  • 17. Rastogi R, Lattimore CM, Mehaffey JH, Turrentine FE, Maitland HS, Zaydfudim VM. Electronic health record risk-stratification tool reduces venous thromboembolism events in surgical patients. Surg. Open Sci. 2022;9:34–40.
  • 18. Di Minno MND, Ambrosino P, Ambrosini F, Tremoli E, Di Minno G, Dentali F. Prevalence of deep vein thrombosis and pulmonary embolism in patients with superficial vein thrombosis: a systematic review and meta-analysis. J. Thromb. Haemost 2016;14:964–972.
  • 19. Beyer-Westendorf J. Controversies in venous thromboembolism: to treat or not to treat superficial vein thrombosis. Hematol. Am. Soc. Hematol. Educ. Program 2017;2017:223–230.
  • 20. Tipping ME. Sparse Bayesian Learning and the Relevance Vector Machine. J. Mach. Learn. Res 2001;1:211–244.
  • 21. MacKay DJC. A Practical Bayesian Framework for Backpropagation Networks. Neural Comput. 1992;4:448–472.
  • 22. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity Measures for Use with Administrative Data. Med. Care 1998;36:8–27.
  • 23. 2018 ICD-10 CM and GEMs [Internet]. Cent. Medicare Medicaid Serv. [cited 2023 Oct 24]. Available from: https://www.cms.gov/medicare/coding-billing/icd-10-codes/2018-icd-10-cm-gem
  • 24. Lundberg S, Lee S-I. A Unified Approach to Interpreting Model Predictions [Internet]. 2017 [cited 2023 May 6]. Available from: http://arxiv.org/abs/1705.07874
  • 25. Shwartz-Ziv R, Armon A. Tabular Data: Deep Learning is Not All You Need [Internet]. 2021 [cited 2023 Sep 4]. Available from: http://arxiv.org/abs/2106.03253
  • 26. Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, Collins R, Allen NE. Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants With Those of the General Population. Am. J. Epidemiol 2017;186:1026–1034.
  • 27. Tadesse TA, Kedir HM, Fentie AM, Abiye AA. Venous Thromboembolism Risk and Thromboprophylaxis Assessment in Surgical Patients Based on Caprini Risk Assessment Model. Risk Manag. Healthc. Policy 2020;13:2545.
  • 28. O’Donnell M, Weitz JI. Thromboprophylaxis in surgical patients. Can. J. Surg 2003;46:129–135.
  • 29. Rezende SM, Lijfering WM, Rosendaal FR, Cannegieter SC. Hematologic variables and venous thrombosis: red cell distribution width and blood monocyte count are associated with an increased risk. Haematologica. 2014;99:194–200.
  • 30. Anderson FA, Spencer FA. Risk factors for venous thromboembolism. Circulation. 2003;107:I9–16.
  • 31. Stein PD, Hull RD, Kayali F, Ghali WA, Alshab AK, Olson RE. Venous thromboembolism according to age: the impact of an aging population. Arch. Intern. Med 2004;164:2260–2265.
  • 32. Riedl J, Posch F, Königsbrügge O, et al. Red Cell Distribution Width and Other Red Blood Cell Parameters in Patients with Cancer: Association with Risk of Venous Thromboembolism and Mortality. PLOS ONE. 2014;9:e111440.
  • 33. Bucciarelli P, Maino A, Felicetta I, Abbattista M, Passamonti SM, Artoni A, Martinelli I. Association between red cell distribution width and risk of venous thromboembolism. Thromb. Res 2015;136:590–594.
  • 34. Folsom AR, Lutsey PL, Heckbert SR, Cushman M. Serum albumin and risk of venous thromboembolism. Thromb. Haemost 2010;104:100–104.
  • 35. Agarwal S. Red cell distribution width, inflammatory markers and cardiorespiratory fitness: Results from the National Health and Nutrition Examination Survey. Indian Heart J. 2012;64:380–387.
  • 36. Soeters PB, Wolfe RR, Shenkin A. Hypoalbuminemia: Pathogenesis and Clinical Significance. JPEN J. Parenter. Enteral Nutr. 2019;43:181–193.
  • 37. Ananthaseshan S, Bojakowski K, Sacharczuk M, et al. Red blood cell distribution width is associated with increased interactions of blood cells with vascular wall. Sci. Rep 2022;12:13676.
  • 38.Ahlbrecht J, Dickmann B, Ay C, Dunkler D, Thaler J, Schmidinger M, Quehenberger P, Haitel A, Zielinski C, Pabinger I. Tumor grade is associated with venous thromboembolism in patients with cancer: results from the Vienna Cancer and Thrombosis Study. J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol 2012;30:3870–3875. [DOI] [PubMed] [Google Scholar]
  • 39.Dickmann B, Ahlbrecht J, Ay C, Dunkler D, Thaler J, Scheithauer W, Quehenberger P, Zielinski C, Pabinger I. Regional lymph node metastases are a strong risk factor for venous thromboembolism: results from the Vienna Cancer and Thrombosis Study. Haematologica. 2013;98:1309–1314. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Falanga A, Russo L, Milesi V, Vignoli A. Mechanisms and risk factors of thrombosis in cancer. Crit. Rev. Oncol. Hematol 2017;118:79–83. [DOI] [PubMed] [Google Scholar]
  • 41.Rodger MA, Le Gal G. Who should get long-term anticoagulant therapy for venous thromboembolism and with what? Blood Adv. 2018;2:3081–3087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Badireddy M, Mudipalli VR. Deep Venous Thrombosis Prophylaxis [Internet]. In: StatPearls. Treasure Island (FL): StatPearls Publishing; 2023. [cited 2023 May 6]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK534865/ [PubMed] [Google Scholar]
  • 43.Verhamme P, Yi BA, Segers A, Salter J, Bloomfield D, Büller HR, Raskob GE, Weitz JI. Abelacimab for Prevention of Venous Thromboembolism. N. Engl. J. Med 2021;385:609–617. [DOI] [PubMed] [Google Scholar]
  • 44.Di nisio M, Squizzato A, Rutjes AWS, Büller HR, Zwinderman AH, Bossuyt PMM. Diagnostic accuracy of D‐dimer test for exclusion of venous thromboembolism: a systematic review. J. Thromb. Haemost 2007;5:296–304. [DOI] [PubMed] [Google Scholar]
  • 45.Patel P, Patel P, Bhatt M, et al. Systematic review and meta-analysis of test accuracy for the diagnosis of suspected pulmonary embolism. Blood Adv. 2020;4:4296–4311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Chunilal SD, Brill-Edwards PA, Stevens PB, Joval JP, McGinnis JA, Rupwate M, Ginsberg JS. The Sensitivity and Specificity of a Red Blood Cell Agglutination D-Dimer Assay for Venous Thromboembolism When Performed on Venous Blood. Arch. Intern. Med 2002;162:217–220. [DOI] [PubMed] [Google Scholar]
  • 47.Privacy Murdoch B. and artificial intelligence: challenges for protecting health information in a new era. BMC Med. Ethics 2021;22:122. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Schwartz JM, Moy AJ, Rossetti SC, Elhadad N, Cato KD. Clinician involvement in research on machine learning–based predictive clinical decision support for the hospital setting: A scoping review. J. Am. Med. Inform. Assoc. JAMIA 2021;28:653–663. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Sharma N, Schwendimann R, Endrich O, Ausserhofer D, Simon M. Comparing Charlson and Elixhauser comorbidity indices with different weightings to predict in-hospital mortality: an analysis of national inpatient data. BMC Health Serv. Res 2021;21:13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Li J, Yan XS, Chaudhary D, et al. Imputation of missing values for electronic health record laboratory data. Npj Digit. Med 2021;4:1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Luo Y Evaluating the state of the art in missing data imputation for clinical data. Brief. Bioinform 2022;23:bbab489. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Muñoz Martín AJ, Huerga Domínguez S, Souto JC, et al. Predicting recurrence of venous thromboembolism in anticoagulated cancer patients using real-world data and machine learning. J. Clin. Oncol 2022;40:e18742–e18742. [Google Scholar]
  • 53.Walraven C van. A comparison of methods to correct for misclassification bias from administrative database diagnostic codes. Int. J. Epidemiol 2018;47:605–616. [DOI] [PubMed] [Google Scholar]
  • 54.Shohat N, Ludwick L, Sherman MB, Fillingham Y, Parvizi J. Using machine learning to predict venous thromboembolism and major bleeding events following total joint arthroplasty. Sci. Rep 2023;13:2197. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Qiao N, Zhang Q, Chen L, et al. Machine learning prediction of venous thromboembolism after surgeries of major sellar region tumors. Thromb. Res 2023;226:1–8. [DOI] [PubMed] [Google Scholar]

Associated Data
Supplementary Materials

Supplemental Publication Material

Data Availability Statement

The code used for the analyses and the trained models have been made publicly available on Mendeley at https://doi.org/10.17632/tkwzysr4y6.6. Predictions can be generated online at https://colab.research.google.com/drive/1m0eh7Cj8BFaZpVQd7VZhLwMV8K8sGCBN?usp=sharing, or locally by following the tutorial in Supporting Information 1. The UK Biobank and All of Us datasets used in this study are publicly available and can be accessed at https://bbams.ndph.ox.ac.uk/ams/ and https://workbench.researchallofus.org/, respectively. The MSDW dataset used in this study is available only to researchers at the Icahn School of Medicine at Mount Sinai; further information about this dataset is available at https://labs.icahn.mssm.edu/msdw/.