Author manuscript; available in PMC: 2026 Feb 20.
Published in final edited form as: Crit Care Med. 2025 Apr 8;53(6):e1224–e1234. doi: 10.1097/CCM.0000000000006662

Development and External Validation of a Detection Model to Retrospectively Identify Patients With Acute Respiratory Distress Syndrome

Elizabeth Levy 1,2,3, Dru Claar 4, Ivan Co 4,5, Barry D Fuchs 1, Jennifer Ginestra 2,6, Rachel Kohn 1,2,3, Jakob I McSparron 4, Bhavik Patel 1,2,3, Gary E Weissman 1,2,3, Meeta Prasad Kerlin 1,2,3, Michael W Sjoding 4
PMCID: PMC12919718  NIHMSID: NIHMS2137507  PMID: 40197621

Abstract

OBJECTIVE:

The aim of this study was to develop and externally validate a machine-learning model that retrospectively identifies patients with acute respiratory distress syndrome (ARDS) using electronic health record (EHR) data.

DESIGN:

In this retrospective cohort study, ARDS was identified via physician adjudication in three cohorts of patients with hypoxemic respiratory failure (training, internal validation, and external validation). Machine-learning models were trained to classify ARDS using vital signs, respiratory support, laboratory data, medications, chest radiology reports, and clinical notes. The best-performing models were internally and externally validated using the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve, integrated calibration index (ICI), sensitivity, specificity, positive predictive value (PPV), and ARDS detection timing.

PATIENTS:

Patients with hypoxemic respiratory failure undergoing mechanical ventilation within two distinct health systems.

INTERVENTIONS:

None.

MEASUREMENTS AND MAIN RESULTS:

There were 1,845 patients in the training cohort, 556 in the internal validation cohort, and 199 in the external validation cohort. ARDS prevalence was 19%, 17%, and 31%, respectively. Regularized logistic regression models analyzing structured data (EHR model) and structured data plus radiology reports (EHR-radiology model) had the best performance. During internal and external validation, the EHR-radiology model had an AUROC of 0.91 (95% CI, 0.88–0.93) and 0.88 (95% CI, 0.87–0.93), respectively. Externally, the ICI was 0.13 (95% CI, 0.08–0.18). At a specified model threshold, sensitivity and specificity were both 80% (95% CI, 75%–86%), PPV was 64% (95% CI, 58%–71%), and the model identified ARDS a median of 2.2 hours (interquartile range, 0.2–18.6 hr) after patients met Berlin ARDS criteria.

CONCLUSIONS:

Machine-learning models analyzing EHR data can retrospectively identify patients with ARDS across different institutions.

Keywords: acute lung injury, ARDS, hypoxemic respiratory failure, machine learning, mechanical ventilation


Acute respiratory distress syndrome (ARDS) is a heterogeneous disease, making identification challenging (1–3). For the last decade, ARDS has been defined by the Berlin criteria (4), which require identification of the timing of respiratory failure, assessment of hypoxemia severity, and interpretation of chest imaging. While these criteria were designed to aid physicians in making the diagnosis of ARDS, they have poor interobserver reliability (2). Difficulty in diagnosing ARDS has many implications, including impeding both quality improvement efforts to ensure patients receive life-saving evidence-based therapies and large-scale clinical research endeavors (3, 5, 6).

To overcome diagnostic uncertainty, systems to identify ARDS have been developed (7, 8). However, these systems are largely designed as prospective screening tools, favoring sensitivity over specificity to capture true positive patients. Using these tools retrospectively yields high rates of ARDS misidentification (9) and cannot accurately identify the time of ARDS onset (10). A scalable approach to accurately identify patients with ARDS in retrospect is an unmet need with the potential to improve both clinical research and quality improvement efforts (11). We therefore sought to develop and validate a model that uses routinely collected electronic health record (EHR) data to retrospectively identify ARDS and its time of onset at scale.

MATERIALS AND METHODS

Study Populations

Our study included three cohorts: 1) training, 2) internal validation, and 3) external validation. The training cohort included adult patients admitted at a single academic institution between January 2016 and June 2017 who developed acute hypoxemic respiratory failure (AHRF) within the first seven days of hospitalization. AHRF was defined as a Pao2/FIo2 ≤300 while receiving invasive mechanical ventilation (IMV), non-invasive mechanical ventilation, or high-flow nasal cannula. We excluded patients who were transferred from another hospital, patients admitted to a specialized intermediate unit that cares for patients with chronic home ventilation, and patients receiving post-operative ventilation after surgical procedures (due to lower ARDS incidence in this group) (8, 12). The internal validation cohort was a temporally distinct (July–December 2017) cohort of patients with identical inclusion and exclusion criteria.

The external validation cohort included patients with hypoxemia from an existing cohort of patients receiving IMV from October 2018 to September 2019 in five hospitals of a regionally and organizationally distinct academic health system. Hypoxemia was defined as Pao2/FIo2 ≤300 or Spo2/FIo2 ≤315 and a second Pao2/FIo2 ≤300 or Spo2/FIo2 ≤315 in the subsequent 1–6 hours (to ensure the first value was not spurious), consistent with criteria for major ARDS clinical trials (13, 14). Patients were excluded if they underwent IMV for ≤24 hours since the likelihood of ARDS in these patients is low. Other exclusion criteria were intubated patients transferred from another hospital and those with a chronic tracheostomy. The final external validation cohort included 199 randomly selected patients, stratified by ICU type.

Establishing the Gold Standard Definition: ARDS Adjudication

In the training and internal validation cohorts, patients were retrospectively and independently reviewed by two physicians with critical care training to identify ARDS per the Berlin definition (4). Physicians used a structured review tool to evaluate whether patients met ARDS criteria and then rated whether ARDS developed within the first seven hospital days on an 8-point confidence scale (2) (Appendix 4, http://links.lww.com/CCM/H710). Whether a patient developed ARDS was determined from the average score across all reviewing physicians, and ARDS onset was defined as the time the patient first met all Berlin criteria.

In the external validation cohort, three critical care clinicians retrospectively reviewed patient charts. Clinicians used the same structured review tool as the training and test site (2). Each patient was reviewed independently by 2 clinicians. For patients determined to have ARDS, clinicians identified the time of ARDS onset as the first time all Berlin criteria were met. Disagreements about ARDS classification or time of onset were discussed among all reviewers. If consensus could not be reached, charts were independently reviewed by two additional critical care experts and final ARDS diagnosis was determined by consensus.

Clinical Data Pre-Processing

Clinical data were extracted from the training cohort, including structured EHR data, radiology reports, clinical notes, and medication administration records. Laboratory and vital sign-based predictors were selected based on their potential to capture various aspects of the ARDS definition. Data during the first 7 days of hospitalization were included because most ARDS develops within 48 hours of IMV initiation (5). The training data were windowed at 6-hour intervals to reduce inter-correlation (15), and the model was then applied to data windowed every 2 hours during testing to increase the precision of the predicted ARDS onset time. For EHR variables including vital signs, respiratory support data, and laboratory results, we followed a pre-determined set of rules specifying whether a minimum, maximum, or both values were included for each variable in each time interval (eTable 6, http://links.lww.com/CCM/H710). Chest radiology reports (x-ray and CT scans) and clinical notes were processed using the Clinical Text Analysis and Knowledge Extraction System (cTAKES) (16), which identifies words and maps them to the Unified Medical Language System. Features were coded as “present” or “absent” in each time interval (details in Appendix 2: Methods supplement, http://links.lww.com/CCM/H710). Handling of missing data is also described in the supplement (http://links.lww.com/CCM/H710).
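
The fixed-interval windowing with per-variable minimum/maximum aggregation can be sketched as follows. This is a minimal illustration with pandas; the table layout, column names, and example values are hypothetical stand-ins, not the study's actual data schema.

```python
import pandas as pd

# Hypothetical long-format measurement table: one row per recorded value.
events = pd.DataFrame({
    "encounter_id": [1, 1, 1, 1],
    "charttime": pd.to_datetime([
        "2016-01-01 00:30", "2016-01-01 03:00",
        "2016-01-01 07:15", "2016-01-01 09:45",
    ]),
    "variable": ["spo2", "spo2", "spo2", "spo2"],
    "value": [95.0, 88.0, 91.0, 97.0],
})

def window_features(df: pd.DataFrame, hours: int = 6) -> pd.DataFrame:
    """Aggregate each variable's min and max within fixed-width windows
    measured from each encounter's first observation."""
    df = df.copy()
    t0 = df.groupby("encounter_id")["charttime"].transform("min")
    elapsed = (df["charttime"] - t0).dt.total_seconds()
    df["window"] = (elapsed // (hours * 3600)).astype(int)
    agg = (df.groupby(["encounter_id", "window", "variable"])["value"]
             .agg(["min", "max"])
             .unstack("variable"))
    agg.columns = [f"{var}_{stat}" for stat, var in agg.columns]
    return agg.reset_index()

windowed = window_features(events, hours=6)
```

At test time, the same function could be called with `hours=2` to produce the finer-grained windows described above.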

Model Development

We compared the incremental value of EHR data types (structured data [vitals, respiratory support, laboratories], radiology reports, medications, and clinical notes) and of machine-learning modeling approaches (regularized logistic regression, random forest, recurrent neural networks) by performing nested five-fold cross-validation within the training data (17). The models were trained to estimate the probability of ARDS at each time interval during the 7-day timeframe. Additional details of model training are described in Appendix 2 (Methods supplement, http://links.lww.com/CCM/H710). Models were compared based on their area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and positive predictive value (PPV) at a threshold probability that achieved 85% sensitivity. The first time at which the model predicted ARDS at or above this threshold probability was deemed the predicted ARDS onset time. For example, if 2 AM–4 AM was the first window in which the predicted probability was at or above the threshold, ARDS was deemed present at 4 AM. The model’s ARDS detection times were compared with the physician-adjudicated ARDS onset time. Performance metrics were calculated at the patient encounter level.
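
The threshold-selection step, finding the probability cutoff that achieves 85% sensitivity, can be sketched with scikit-learn. The data, features, and hyperparameters here are synthetic stand-ins; "regularized logistic regression" is rendered as an L1-penalized model, consistent with the L1 models reported in Table 2 but not otherwise specified in this text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic stand-in for windowed EHR features and adjudicated ARDS labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 1).astype(int)

# L1-penalized logistic regression (hyperparameters are illustrative).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
probs = clf.predict_proba(X)[:, 1]

def threshold_for_sensitivity(y_true, scores, target=0.85):
    """Highest cutoff whose sensitivity (TPR) reaches the target;
    roc_curve returns thresholds in decreasing order, TPR increasing."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[tpr >= target][0]

thr = threshold_for_sensitivity(y, probs, target=0.85)
sens = ((probs >= thr) & (y == 1)).sum() / (y == 1).sum()
```

In the study this cutoff was fixed on the training folds and then carried forward unchanged to the validation cohorts.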

Internal and External Validation

Based on the model development results, the best-performing models were regularized logistic regression models using structured data alone (EHR model) and structured data plus features from radiology reports (EHR-radiology model). These models were therefore trained on the entire training dataset and tested on the internal and external validation cohorts. Model discrimination was assessed, using a code base written after model development, by measuring AUROC, AUPRC, sensitivity, PPV, and ARDS detection time. Model calibration was assessed by calculating the integrated calibration index (ICI) and generating calibration plots (18). The ICI measures the average difference between the model’s predicted probabilities and observed outcome rates; a perfectly calibrated model has an ICI of 0. Empirical bootstrap 95% CIs were determined by resampling patients in the test set 1,000 times.
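
The two validation ingredients can be sketched as follows: the ICI (computed here, following the loess-based formulation in reference 18, as the mean absolute difference between predicted probabilities and a smoothed observed event rate) and an empirical bootstrap CI from resampling patients with replacement. The data are synthetic, and the smoother bandwidth `frac` is an assumption, since the authors' exact settings are not stated.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def integrated_calibration_index(y_true, probs, frac=0.75):
    """ICI: mean absolute gap between predicted probabilities and a
    loess-smoothed estimate of the observed event rate."""
    smoothed = lowess(y_true, probs, frac=frac, return_sorted=False)
    return float(np.mean(np.abs(smoothed - probs)))

def bootstrap_ci(metric, y_true, probs, n_boot=1000, alpha=0.05, seed=0):
    """Empirical bootstrap CI by resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        stats.append(metric(y_true[idx], probs[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Demo on synthetic, roughly calibrated predictions (not study data),
# so the ICI should come out close to 0.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 1000)
y = (rng.uniform(size=1000) < p).astype(float)
ici = integrated_calibration_index(y, p)
lo, hi = bootstrap_ci(integrated_calibration_index, y, p, n_boot=200)
```

The study used 1,000 bootstrap replicates; fewer are used in the demo only for speed.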

Model Comparison to Individual Physician Reviewers

In the internal validation cohort, we compared model performance with the diagnostic performance of individual physicians. Physicians who contributed more than 50 reviews to the internal test set were compared with the model using a reference standard derived from all other physicians reviewing the same patient. We calculated the AUROC for each physician using their 8-point ARDS confidence scale. This comparison may overestimate the diagnostic accuracy achievable by a physician required to make the diagnosis early in a patient’s presentation, because reviewers had the benefit of the entire medical record.
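
The article does not state which statistical test produced the p values for the model-versus-physician AUROC differences reported in the Results; one common nonparametric option, consistent with the paper's bootstrap-based CIs, is a paired bootstrap of the AUROC difference on the same patients, sketched here on synthetic data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_pvalue(y, scores_a, scores_b, n_boot=2000, seed=0):
    """Two-sided bootstrap p value for a difference in AUROC between two
    scoring systems evaluated on the same patients."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:  # AUROC needs both classes present
            continue
        diffs.append(roc_auc_score(y[idx], scores_a[idx])
                     - roc_auc_score(y[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    observed = roc_auc_score(y, scores_a) - roc_auc_score(y, scores_b)
    # Two-sided p: how often the resampled difference crosses zero.
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)
```

A DeLong test would be the usual parametric alternative for the same comparison.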

In the external validation cohort, we conducted a sensitivity analysis of model performance in patients without disagreements during initial physician review. On the assumption that patients with reviewer disagreement represent cases of genuine diagnostic uncertainty, excluding them helps assess whether the model would perform as well as a bedside physician on less ambiguous cases.

Model Comparison to an Existing ARDS Detection System

We compared the model to an existing ARDS detection algorithm described by Afshar et al (19), which extracts n-gram features from radiology reports and has undergone external validation (20). We chose this model for comparison because of its similarities to ours: a natural language processing approach, retrospective identification of ARDS rather than prospective screening, and public availability (21).

Role of Funding Sources, Ethics Approval, and Reporting Guidelines

This study was funded through grants from the NIH. No funding source had any role in the design, conduct, or analysis of the study. Modeling was performed in Python 3.7; text features were extracted using cTAKES (16); additional analyses were conducted in Stata version 14. The study was approved with a waiver of informed consent by the Institutional Review Board at each institution; see the supplement (http://links.lww.com/CCM/H710) for more information. We followed the TRIPOD+AI reporting guidelines for clinical prediction models (22). A protocol for this study was not prepared or registered. Study data are not available due to the data sharing policies of the respective institutions. A sample data file and code to generate ARDS predictions are provided (additional material).

RESULTS

Patient Characteristics of the Internal and External Cohorts

The training dataset included 1,845 patients who developed AHRF between January 1, 2016 and June 30, 2017 (eTable 1, http://links.lww.com/CCM/H710). Of these, 1,359 patients (74%) received IMV and 356 (19%) developed physician-adjudicated ARDS. The internal validation cohort included 556 patients who developed AHRF, of whom 413 (74%) received IMV and 95 (17%) developed physician-adjudicated ARDS.

The external validation dataset included 199 patients with AHRF requiring IMV from October 2018 through September 2019; 61 patients (31%) developed physician-adjudicated ARDS. Table 1 describes demographic data, ARDS risk factors, and patient outcomes. Compared with the development cohort, the external validation cohort included more Black patients and patients with more severe hypoxemia, and had higher mortality and ARDS prevalence.

TABLE 1.

Study Patient Populations

Characteristic                 Internal Training   Internal Validation   External Validation   p(a)
n                              1,359               413                   199
Age, mean (SD)                 58 (16)             60 (16)               61 (15)               0.46
Male, n (%)                    828 (61)            252 (61)              114 (57)              0.38
Race, n (%)                                                                                   0.00
 Caucasian                     1,153 (85)          343 (83)              117 (59)
 Black                         124 (9)             45 (11)               65 (33)
 Other                         82 (6)              25 (6)                17 (8)
ARDS risk factors(b), n (%)                                                                   0.00
 Pneumonia                     434 (32)            118 (29)              65 (33)
 Aspiration                    199 (15)            39 (9)                103 (52)
 Sepsis                        347 (26)            108 (26)              41 (21)
 High-risk surgery             218 (16)            59 (14)               65 (33)
 Trauma                        118 (9)             26 (6)                20 (17)
Minimum P/F, n (%)                                                                            0.001
 <100                          450 (34)            163 (39)              100 (50)
 100–200                       591 (44)            155 (38)              77 (39)
 >200                          300 (22)            95 (23)               22 (11)
ARDS, n (%)                    312 (23)            83 (20)               61 (31)               0.003
Hospital mortality, n (%)      341 (25)            107 (26)              67 (34)               0.056

ARDS = acute respiratory distress syndrome.

a

Comparison between the development and external validation cohorts using the t test for continuous data and the Fisher exact or chi-square test for categorical data.

b

Physicians could identify more than one risk factor per patient in the external cohort.

Model Development

During model development using the training dataset, a regularized logistic regression model analyzing structured data alone had an AUROC of 0.85 (95% CI, 0.81–0.89) for detecting ARDS, calculated via nested cross-validation. In contrast, the model analyzing structured data and radiology reports had an AUROC of 0.91 (95% CI, 0.89–0.92). Models including clinical notes and/or administered medications did not improve performance (Table 2). There was also no meaningful difference between a regularized logistic regression model and more complex machine-learning models (eTable 2, http://links.lww.com/CCM/H710). Based on these results, regularized logistic regression models trained using structured EHR data (EHR model) and structured EHR plus radiology report data (EHR-Radiology model) were re-trained using the entire training data. The EHR model had 60 of 65 model features with non-zero weights (Appendix 3, http://links.lww.com/CCM/H710). The top five features increasing the likelihood of ARDS were blood urea nitrogen, serum sodium, total bilirubin, minimum plateau pressure, and procalcitonin (eTable 3, http://links.lww.com/CCM/H710). The EHR-radiology model had 68 of 115 features with non-zero weights. The radiology report features with the highest weights included descriptions of ARDS, pulmonary edema, or opacities (eTable 3, http://links.lww.com/CCM/H710).

TABLE 2.

Incremental Value of Various Electronic Health Record Data in Acute Respiratory Distress Syndrome Detection Model

Data Type                                                        Features   AUROC (95% CI)     Sensitivity (95% CI)   Specificity (95% CI)   PPV (95% CI)    Detection Time, hr (95% CI)
Structured EHR data                                              65         0.85 (0.81–0.89)   86% (86%–87%)          65% (55%–75%)          43% (37%–50%)   3.0 (−1.2 to 7.2)
Structured EHR, radiology reports                                115        0.91 (0.89–0.92)   86% (86%–87%)          78% (72%–84%)          55% (49%–61%)   4.0 (1.4–6.6)
Structured EHR, medications                                      165        0.86 (0.84–0.89)   86% (86%–87%)          69% (59%–80%)          47% (39%–55%)   4.7 (1.3–8.1)
Structured EHR, radiology reports, clinical notes                333        0.92 (0.90–0.94)   86% (86%–87%)          82% (74%–90%)          60% (50%–70%)   7.1 (3.2–10.9)
Structured EHR, radiology reports, clinical notes, medications   433        0.91 (0.89–0.93)   86% (86%–87%)          80% (74%–85%)          57% (50%–63%)   5.4 (2.0–8.8)

EHR = electronic health record, PPV = positive predictive value.

Results were calculated using L1 logistic regression and represent the average from nested five-fold cross-validation on the internal model derivation dataset. 95% CIs were estimated from the results across the five folds using a t-distribution. Performance characteristics are determined at the patient encounter level. Specificity and PPV were calculated after determining the threshold that achieved a sensitivity of 85% on the training data.

Structured data include a prespecified group of vital signs, laboratory results, and respiratory support parameters. Medications: Medications mapped to Veterans Affairs medication class codes and the top 100 were selected using chi-square filtering. The top 50 Clinical Text Analysis and Knowledge Extraction System features from radiology reports and the top 250 from clinical notes were included based on chi-square filtering. Detection time is the median duration in hours after a patient met Berlin acute respiratory distress syndrome (ARDS) criteria when the model detected a patient as having ARDS.

Internal and External Validation

In the internal validation dataset, using the same threshold probability that achieved 85% sensitivity in the training dataset (threshold = 0.33), the EHR-radiology model had very good performance (Fig. 1), with an AUROC of 0.91 (95% CI, 0.88–0.93), ICI of 0.11 (95% CI, 0.06–0.14), sensitivity of 90% (95% CI, 88%–93%), specificity of 76% (95% CI, 73%–80%), and PPV of 49% (95% CI, 42%–55%). It identified ARDS at a median of 3.9 hours (interquartile range [IQR], 3–5.5) after the time of onset based on expert clinician review (Table 3). At a threshold probability of 0.34 (which achieved 85% sensitivity in the training dataset), the EHR model had good performance (AUROC 0.86; 95% CI, 0.83–0.90), although lower than the EHR-radiology model (Fig. 1). This model had a sensitivity of 94% (95% CI, 90%–98%), specificity of 56% (95% CI, 52%–60%), and PPV of 35% (95% CI, 30%–39%), and identified ARDS at a median of 3.6 hours (IQR, 2–5.5) after the patient met Berlin ARDS criteria (Table 3).

Figure 1.


Performance characteristics of internal validation and external validation cohorts. Model performance: A, Receiver operator curve and precision-recall curve for internal validation cohort. B, Receiver operator curve and precision-recall curve for external validation cohort.

TABLE 3.

Internal and External Acute Respiratory Distress Syndrome Model Validation

Model                 AUROC (95% CI)     ICI (95% CI)       Sensitivity (95% CI)   Specificity (95% CI)   PPV (95% CI)    Detection Time, median hr (IQR)
Internal validation
 EHR                  0.86 (0.83–0.89)   0.22 (0.19–0.26)   94% (90%–98%)          56% (52%–60%)          35% (29%–39%)   3.6 (2.0–5.5)
 EHR-radiology        0.91 (0.88–0.93)   0.11 (0.06–0.14)   90% (86%–96%)          76% (73%–80%)          49% (42%–55%)   3.9 (3.0–5.5)
External validation
 EHR                  0.82 (0.76–0.89)   0.28 (0.23–0.34)   75% (69%–81%)          75% (69%–81%)          58% (50%–64%)   8.0 (0.6–34.5)
 EHR-radiology        0.88 (0.87–0.93)   0.13 (0.08–0.18)   80% (75%–86%)          80% (75%–86%)          64% (58%–71%)   2.2 (0.2–18.6)

EHR = L1 logistic regression model including structured electronic health record (EHR) data, EHR-radiology = L1 logistic regression model including structured EHR data and radiology reports, PPV = positive predictive value.

Performance measures are reported with 95% CIs. Detection time is the duration in hours after the patient met Berlin acute respiratory distress syndrome (ARDS) criteria before the model identified the patient as having ARDS. In the internal validation cohort, sensitivity, specificity, PPV, and detection time were determined based on an ARDS probability threshold that achieved 85% sensitivity in the training cohort. In the external validation cohort, the ARDS probability threshold was chosen to maximize sensitivity and specificity.

In the external validation cohort, there was a small decrease in discrimination and calibration for both the EHR and EHR-radiology models. The EHR-radiology model had an AUROC of 0.88 (95% CI, 0.87–0.93) and an ICI of 0.13 (95% CI, 0.08–0.18) (Fig. 1 and Table 3). The simpler EHR model had an AUROC of 0.82 (95% CI, 0.76–0.89) and an ICI of 0.28 (95% CI, 0.23–0.34). To compare the diagnostic testing characteristics of these models, ARDS threshold probabilities were selected to maximize both sensitivity and specificity. At a threshold probability of 0.55, the EHR-radiology model had a sensitivity of 80% (95% CI, 75%–86%), specificity of 80% (95% CI, 75%–86%), and PPV of 64% (95% CI, 58%–71%), and identified ARDS a median of 2.2 hours (IQR, 0.2–18.6) after patients met Berlin ARDS criteria. At a threshold probability of 0.73, the EHR model had a sensitivity and specificity of 75% (95% CI for both, 69%–81%) and PPV of 58% (95% CI, 50%–64%), and identified ARDS at a median of 8 hours (IQR, 0.6–34.5) (Table 3). Additional calibration metrics are included in eFigure 1 (http://links.lww.com/CCM/H710).

Comparison to Individual Physicians

In the internal validation dataset, the EHR-radiology model had performance equivalent to or better than each of the nine physician reviewers, all of whom had performed at least 50 reviews. The best-performing physician had an AUROC of 0.94 (95% CI, 0.91–0.98) for the 254 patients they reviewed, while the model had an AUROC of 0.90 (95% CI, 0.86–0.95) for the same patients (p for difference = 0.134) (Table 4). The lowest-performing physician had an AUROC of 0.70 (95% CI, 0.60–0.81) for the 71 patients they reviewed, while the model had an AUROC of 0.86 (95% CI, 0.77–0.95) for the same patients (p for difference < 0.01).

TABLE 4.

Head-to-Head Performance of the Electronic Health Record-Radiology Model and Individual Physician’s Ability to Identify Patients With Acute Respiratory Distress Syndrome, Internal Validation Cohort

Physician   n     EHR-Radiology Model AUROC (95% CI)   Physician AUROC (95% CI)   p for Difference
1           343   0.92 (0.88–0.95)                     0.85 (0.80–0.90)           0.032
2           254   0.90 (0.86–0.95)                     0.94 (0.91–0.98)           0.134
3           202   0.83 (0.76–0.89)                     0.86 (0.80–0.91)           0.391
4           180   0.86 (0.79–0.92)                     0.80 (0.72–0.88)           0.176
5           172   0.91 (0.85–0.96)                     0.92 (0.86–0.98)           0.693
6           71    0.86 (0.77–0.95)                     0.70 (0.60–0.81)           0.004
7           67    0.91 (0.83–0.99)                     0.86 (0.72–0.99)           0.512
8           60    0.92 (0.83–1.00)                     0.86 (0.72–0.99)           0.103
9           58    0.90 (0.77–1.00)                     0.83 (0.68–0.99)           0.480

AUROC = area under the receiver operator characteristic curve, EHR = electronic health record.

The discrimination of individual physicians and of the EHR-radiology model (analyzing structured data and radiology reports) was compared based on the AUROC. The other physicians who reviewed the same patient were combined to derive the reference standard used in these comparisons.

In a sensitivity analysis in the external validation dataset, patients with initial physician disagreement about ARDS were excluded, leaving 162 patients. In these patients, the AUROC of the EHR-radiology model increased to 0.91 (95% CI, 0.86–0.97). At a threshold of 0.64 (the threshold at which sensitivity and specificity were maximized), sensitivity was 81% (95% CI, 76%–88%), specificity 88% (95% CI, 83%–93%), and PPV 72% (95% CI, 65%–79%), and the model identified ARDS a median of 3.6 hours (IQR, 0.3–15.2) after patients met Berlin ARDS criteria (eTable 4, http://links.lww.com/CCM/H710).

Comparison to an Existing ARDS Detection Model

In both the internal and external cohorts, the Afshar model had lower performance and was less accurate in determining ARDS onset time. In the internal cohort, it had a sensitivity of 51% (95% CI, 39%–62%), specificity of 88% (95% CI, 84%–91%), and PPV of 51% (95% CI, 39%–62%), and detected ARDS a median of 16.9 hours (IQR, 10.9–22.9) after patients met Berlin ARDS criteria. In the external cohort, it had a sensitivity of 68% (95% CI, 61%–74%), specificity of 80% (95% CI, 75%–86%), and PPV of 61% (95% CI, 54%–68%), and detected ARDS a median of 31.2 hours (IQR, 4–98.6) after patients met Berlin ARDS criteria (eTable 5, http://links.lww.com/CCM/H710).

DISCUSSION

Given the challenges in identifying patients with ARDS, we developed a model designed to retrospectively identify patients with ARDS using structured and unstructured EHR data and tested its internal and external validity. We found that a regularized logistic regression model could accurately identify the onset of ARDS, with performance equivalent to or better than that of critical care physicians. As such, we offer a reliable way of identifying ARDS patients retrospectively, with high fidelity and at scale.

Based on these results, this model may be well-suited to identify large cohorts of patients with ARDS across multiple sites, allowing for robust exploration of important ARDS research questions. Evidence-based practices for ARDS, mainly lung protective ventilation and prone positioning, continue to be poorly used (5, 23). Building a large retrospective cohort of ARDS patients or identifying ARDS patients within an existing prospective cohort, including time of onset of ARDS, would allow for better evaluation of use of these practices over time with the potential to identify system and organizational factors that contribute to implementation or de-implementation. In addition, retrospective identification within individual health systems could enhance clinical operations and quality improvement, by facilitating the assessment of initiatives to improve evidence-based care delivery, informing auditing and benchmarking, and contributing to clinical decision support programs.

When using these models, the threshold probability chosen to determine the presence or absence of ARDS is an important consideration: it sets the minimum predicted probability required for a patient to be identified as having ARDS. While a typical choice might be 0.5, a more nuanced approach should consider the relative trade-offs between sensitivity and PPV. A key decision is whether to prioritize PPV (precision), which may reduce sensitivity, or to accept more false positives in favor of completeness and prioritize sensitivity (recall). This decision depends primarily on the specific use case for the model. In the internal validation cohort, we chose a threshold probability that maximized sensitivity, which increased the number of false positives. In the external validation cohort, we chose a threshold probability that maximized both sensitivity and specificity. For retrospective identification of ARDS, we suggest using the threshold probability that maximizes both sensitivity and specificity to balance the tradeoff between false positives and false negatives.
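
The "maximize both sensitivity and specificity" rule corresponds to choosing the cutoff with the largest Youden's J statistic (sensitivity + specificity − 1). A minimal sketch with scikit-learn, on illustrative scores rather than study data:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, scores):
    """Cutoff maximizing sensitivity + specificity, i.e. Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]

# Illustrative predicted probabilities: positives centered near 0.7,
# negatives near 0.3, so the J-maximizing cutoff should sit near 0.5.
rng = np.random.default_rng(3)
scores = np.concatenate([rng.normal(0.7, 0.1, 200), rng.normal(0.3, 0.1, 200)])
labels = np.concatenate([np.ones(200), np.zeros(200)])
thr = youden_threshold(labels, scores)
```

Prioritizing recall instead would mean sliding this cutoff down until the desired sensitivity is reached, at the cost of more false positives.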

Our study is novel in several ways. It is one of the first to compare model performance with that of critical-care-trained physicians for ARDS identification. Given the poor interobserver reliability of the Berlin criteria, we sought to get as close as possible to the truth with respect to ARDS diagnosis by requiring consensus among at least two physicians for each patient. While other clinical models perform as well as physicians (24), ARDS diagnosis relies on recognition of a complex clinical syndrome, including identification of a risk factor, imaging interpretation, and assessment of alternative etiologies of respiratory failure. As such, the ability of our model to identify ARDS at the level of an experienced physician is noteworthy. In contrast to prior ARDS detection models, which primarily analyze radiology information (8, 19, 25), the current model incorporates laboratory and vital sign information. This may explain its higher performance compared with a model analyzing radiology information alone. The EHR-radiology model was also able to detect the onset of ARDS within 2–4 hours of the gold standard diagnosis, whereas other ARDS detection models have not reported onset detection this precise (10, 26). This level of precision would be beneficial from an operations and quality perspective to assess barriers to the timely initiation of life-saving interventions. Lastly, our models performed well in internal and external cohorts that were geographically and temporally distinct and had subtly different inclusion criteria, speaking to their generalizability.

Our study has limitations. First, while the internal and external cohorts used for model validation come from different institutions, both are academic, tertiary care referral centers with similar levels of acuity. However, the external cohort did include patients from multiple sites within the health system, including smaller, community-based hospitals, thus increasing the generalizability of our findings. Second, model performance improved substantially with the addition of radiology reports. Chest imaging reports may vary across institutions, so model performance could differ in other settings. Directly analyzing chest images is an alternative approach, but it carries greater implementation complexity and technical expertise requirements and may not scale as easily across health systems. Additionally, our work was based primarily on the Berlin definition, increasing the weight of ventilator measurements within the model. We do not know how well our model would perform at identifying nonventilated ARDS included in the new global definition of ARDS (27); this will require further testing and validation. We lacked the sample size necessary to measure model performance across demographic groups relevant to evaluating fairness (e.g., race, sex), although when we tested the model in the external cohort, which included more patients who self-reported as Black, there was no large change in performance. Finally, these results depend on an imperfect reference standard for ARDS, a limitation of any system designed to identify an outcome lacking an established gold standard.

CONCLUSIONS

We derived and validated models analyzing readily available EHR data and accurately identified patients with ARDS with performance equivalent to physicians across different institutions. These models can retrospectively identify the presence and timing of ARDS in large cohorts of mechanically ventilated patients and help answer important questions related to clinical interventions and outcomes of patients with ARDS.

Supplementary Material

Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal’s website (http://journals.lww.com/ccmjournal).

KEY POINTS.

Question:

Can a machine-learning model analyzing electronic health record data retrospectively identify patients with acute respiratory distress syndrome, and does its performance generalize to an external health system?

Findings:

In this retrospective cohort study, acute respiratory distress syndrome (ARDS) was identified via physician-adjudication in three cohorts of patients with hypoxemic respiratory failure. Machine-learning models were trained to classify ARDS using vital signs, respiratory support, laboratory data, medications, chest radiology reports, and clinical notes. The value of different electronic health record data types and modeling approaches was compared, and the best-performing models were internally and externally validated, revealing an area under receiver-operating curve (AUROC) of 0.91 in the internal cohort and an AUROC of 0.88 in the external cohort.

Meaning:

Our machine-learning models can retrospectively identify patients with ARDS and the timing of ARDS onset in large cohorts of mechanically ventilated patients, helping answer important questions related to clinical interventions and outcomes of patients with ARDS.

Acknowledgments

This project was supported by grants K01HL136687 and T32HL098054.

Michael Sjoding previously received royalties for a software technology that processes chest radiograph images to detect acute respiratory distress syndrome. This software was previously licensed to AirStrip Technologies, Inc. Meeta Kerlin is a member of a data safety monitoring board unrelated to this article. Dr. Levy received funding from the National Institutes of Health (NIH). Drs. Levy, Ginestra, Kohn, Patel, and Sjoding received support for article research from the NIH. Dr. Ginestra’s institution received funding from the National Heart, Lung, and Blood Institute. Dr. McSparron received funding from UpToDate and Springer. Drs. Kerlin’s and Sjoding’s institutions received funding from the NIH. Dr. Sjoding received funding from Airstrip. The remaining authors have disclosed that they do not have any potential conflicts of interest.

REFERENCES

1. Sjoding MW, Cooke CR, Iwashyna TJ, et al.: Acute respiratory distress syndrome measurement error. Potential effect on clinical study results. Ann Am Thorac Soc 2016; 13:1123–1128
2. Sjoding MW, Hofer TP, Co I, et al.: Interobserver reliability of the Berlin ARDS definition and strategies to improve the reliability of ARDS diagnosis. Chest 2018; 153:361–367
3. Sjoding MW, Hyzy RC: Recognition and appropriate treatment of the acute respiratory distress syndrome remains unacceptably low. Crit Care Med 2016; 44:1611–1612
4. Ranieri VM, Rubenfeld GD, Thompson BT, et al.; ARDS Definition Task Force: Acute respiratory distress syndrome: The Berlin definition. JAMA 2012; 307:2526–2533
5. Bellani G, Laffey JG, Pham T, et al.; LUNG SAFE Investigators: Epidemiology, patterns of care, and mortality for patients with acute respiratory distress syndrome in intensive care units in 50 countries. JAMA 2016; 315:788–800
6. Weiss CH, Baker DW, Weiner S, et al.: Low tidal volume ventilation use in acute respiratory distress syndrome. Crit Care Med 2016; 44:1515–1522
7. Koenig HC, Finkel BB, Khalsa SS, et al.: Performance of an automated electronic acute lung injury screening system in intensive care unit patients. Crit Care Med 2011; 39:98–104
8. Herasevich V, Yilmaz M, Khan H, et al.: Validation of an electronic surveillance system for acute lung injury. Intensive Care Med 2009; 35:1018–1023
9. McKown AC, Brown RM, Ware LB, et al.: External validity of electronic sniffers for automated recognition of acute respiratory distress syndrome. J Intensive Care Med 2019; 34:946–954
10. Rubulotta F, Bahrami S, Marshall DC, et al.: Machine learning tools for acute respiratory distress syndrome detection and prediction. Crit Care Med 2024; 52:1768–1780
11. Artis KA, Dweik RA, Patel B, et al.: Performance measure development, use, and measurement of effectiveness using the guideline on mechanical ventilation in acute respiratory distress syndrome. An official American Thoracic Society Workshop Report. Ann Am Thorac Soc 2019; 16:1463–1472
12. Milot J, Perron J, Lacasse Y, et al.: Incidence and predictors of ARDS after cardiac surgery. Chest 2001; 119:884–888
13. National Heart, Lung, and Blood Institute PETAL Clinical Trials Network; Moss M, et al.: Early neuromuscular blockade in the acute respiratory distress syndrome. N Engl J Med 2019; 380:1997–2008
14. Rice TW, Wheeler AP, Bernard GR, et al.; National Institutes of Health, National Heart, Lung, and Blood Institute ARDS Network: Comparison of the SpO2/FIO2 ratio and the PaO2/FIO2 ratio in patients with acute lung injury or ARDS. Chest 2007; 132:410–417
15. Reamaroon N, Sjoding MW, Lin K, et al.: Accounting for label uncertainty in machine learning for detection of acute respiratory distress syndrome. IEEE J Biomed Health Inform 2019; 23:407–415
16. Savova GK, Masanz JJ, Ogren PV, et al.: Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, component evaluation and applications. J Am Med Inform Assoc 2010; 17:507–513
17. Cawley G, Talbot NL: On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 2010; 11:2079–2107
18. Austin PC, Steyerberg EW: The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med 2019; 38:4051–4065
19. Afshar M, Joyce C, Oakey A, et al.: A computable phenotype for acute respiratory distress syndrome using natural language processing and machine learning. AMIA Annu Symp Proc 2018; 2018:157–165
20. Mayampurath A, Churpek MM, Su X, et al.: External validation of an acute respiratory distress syndrome prediction model using radiology reports. Crit Care Med 2020; 48:e791–e798
21. Afshar Lab GitHub. Available at: https://github.com/AfsharJoyceInfoLab/ARDS_Classifier/tree/master. Accessed December 10, 2024
22. Collins GS, Moons KGM, Dhiman P, et al.: TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024; 385:e078378
23. Qadir N, Bartz RR, Cooter ML, et al.; Severe ARDS: Generating Evidence (SAGE) Study Investigators: Variation in early management practices in moderate-to-severe ARDS in the United States: The severe ARDS: Generating evidence study. Chest 2021; 160:1304–1315
24. Tschandl P, Rinner C, Apalla Z, et al.: Human-computer collaboration for skin cancer recognition. Nat Med 2020; 26:1229–1234
25. Azzam HC, Khalsa SS, Urbani R, et al.: Validation study of an automated electronic acute lung injury screening tool. J Am Med Inform Assoc 2009; 16:503–508
26. Tran TK, Tran MC, Joseph A, et al.: A systematic review of machine learning models for management, prediction and classification of ARDS. Respir Res 2024; 25:232
27. Matthay MA, Arabi Y, Arroliga AC, et al.: A new global definition of acute respiratory distress syndrome. Am J Respir Crit Care Med 2024; 209:37–47
