Respiratory Research
2026 Jan 27;27:107. doi: 10.1186/s12931-025-03489-y

Diagnostic accuracy and cost-efficiency of ChatGPT in EIT-based pendelluft detection for mechanically ventilated patients: a multicenter study

Rui Zhang 1,✉,#, Huaiwu He 2,#, Zhisheng Bi 3,4,#, Fang Long 5, Jing Xu 6, Jiayi Guan 1, Ruoming Tan 1, Knut Moeller 7, Donghuang Hong 8, Zhanqi Zhao 4,7,, Yun Long 2,, Hongping Qu 1,
PMCID: PMC12958570  PMID: 41593620

Abstract

Introduction

Electrical impedance tomography (EIT) enables real-time bedside monitoring of regional lung ventilation, but its clinical adoption is limited by complex data interpretation requiring substantial expertise and time. Pendelluft—the asynchronous movement of air between lung regions—is an important indicator of ventilation heterogeneity and poor outcomes in mechanically ventilated patients. There remains an unmet need for accurate, efficient, and clinically interpretable automated pendelluft detection tools.

Methods

In this retrospective multicenter study, we developed and validated an automated system for pendelluft detection using a ChatGPT-generated deep learning architecture. Consecutive mechanically ventilated adults from three tertiary hospitals in China (January 2020–December 2024) were screened; 278 patients met inclusion criteria. High-resolution EIT signals were processed using a hybrid model—the initial code generated by ChatGPT (GPT-4), further optimized by clinicians—which included a modified ResNet-34 encoder, bidirectional LSTM with self-attention, and Grad-CAM interpretability with integrated safety workflow. Primary outcomes were diagnostic accuracy (sensitivity, specificity, AUC-ROC) and analytic efficiency (processing time, cost per case) compared to expert review and conventional machine learning models.

Results

The ChatGPT-generated system demonstrated superior diagnostic accuracy for automated pendelluft detection, yielding an AUC-ROC of 0.91 (95% CI: 0.87–0.95), sensitivity of 89.6%, and specificity of 92.1%. These metrics significantly exceeded those of standard CNN (AUC 0.85), random forest (AUC 0.82), and SVM (AUC 0.79) models. Median analysis time per case was reduced from 12.5 min (manual expert review) to 2.8 min (AI system), with an average cost saving of 200 RMB per patient. Higher pendelluft grades correlated strongly with worse clinical outcomes, including increased 28-day mortality (25.4% vs. 9.0%, adjusted HR 2.8, 95% CI: 1.6–4.9, P = 0.001), fewer ventilator-free days, and greater risk of complications. The workflow demonstrated high operational reliability, robust safety, and strong agreement with clinician assessments (Cohen’s κ = 0.86).

Conclusions

The ChatGPT-generated AI system offers an accurate, rapid, and cost-effective solution for automated EIT analysis and pendelluft detection in mechanically ventilated patients. This novel workflow facilitates real-time, explainable clinical decision support, and may help optimize ventilator management and improve patient outcomes.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12931-025-03489-y.

Keywords: Electrical impedance tomography, Pendelluft phenomena, ChatGPT, Machine learning, Mechanical ventilation, Artificial intelligence, Diagnostic accuracy

Introduction

Mechanical ventilation remains a cornerstone intervention for patients with acute respiratory failure. However, prolonged mechanical ventilation can lead to various complications and adverse outcomes [1]. The timing and strategy of ventilator weaning are crucial factors that influence patient outcomes, with approximately 20–30% of patients experiencing difficult weaning, leading to prolonged mechanical ventilation and increased mortality [2]. This challenge is particularly evident in patients with acute respiratory distress syndrome (ARDS), where mortality rates remain high (35–46%) despite advances in mechanical ventilation strategies [3].

Among various respiratory parameters, the pendelluft phenomenon, the movement of air between different lung regions without changes in total lung volume, has emerged as a critical indicator of ventilation heterogeneity [4]. The detection and quantification of pendelluft can provide valuable insights into regional lung mechanics and guide ventilation strategies [5]. As an important manifestation of ventilation heterogeneity, pendelluft has been linked to occult injurious inflation, elevated inflammatory biomarkers, prolonged mechanical ventilation, and weaning failure [5].

More recently, Cornejo et al. systematically evaluated how ventilatory settings modulate pendelluft and expiratory muscle activity in hypoxemic ARDS patients resuming spontaneous breathing [6]. In a randomized crossover physiological study, they showed that increasing pressure support consistently reduced pendelluft magnitude and expiratory muscle activation, whereas higher EIT‑guided PEEP had a more complex, context‑dependent effect: although PEEP could decrease pendelluft via improved dynamic lung compliance, this benefit could be counteracted by concomitant recruitment of expiratory muscles. These findings highlight that pendelluft reflects not only diaphragmatic effort and “solid-like” lung behaviour, but also the interplay between ventilatory settings, inspiratory effort, expiratory muscle activity, and regional mechanics.

Electrical impedance tomography (EIT) is a non-invasive, radiation-free monitoring technique that provides real-time visualization of regional lung ventilation. While this technology has emerged as a promising tool for optimizing mechanical ventilation strategies [7], its clinical implementation faces significant challenges. The interpretation of EIT data requires substantial expertise and time [8], leading to increased healthcare costs and potential delays in clinical decision-making. Current automated analysis approaches for EIT interpretation face several challenges including data complexity, lack of standardization in analysis methods, and implementation difficulties, as highlighted in the international consensus statement [9]. These limitations underscore the need for more advanced analytical approaches. Together with emerging AI‑EIT applications in MASLD [10], these advances underscore the potential of combining EIT with deep learning to deliver organ‑specific, bedside decision support. However, there is still no validated AI‑enabled EIT system for detecting and grading pendelluft in mechanically ventilated patients.

Traditional machine learning and image reconstruction methods for EIT face challenges in spatial resolution and real-time processing [11, 12]. Recent advances in deep learning, particularly hybrid-fusion approaches, have demonstrated significant improvements in both reconstruction accuracy and spatial resolution of conductivity images [12]. Beyond pulmonary applications, AI-enabled EIT systems have recently been extended to other organ systems. For example, Bahukhandi et al. developed a portable multi-spectral EIT platform combined with deep learning (ResNet‑18) to predict the Controlled Attenuation Parameter for metabolic dysfunction-associated steatotic liver disease (MASLD) screening, achieving an AUC of 0.95, sensitivity of 91.7%, and specificity of 88.1% compared with vibration‑controlled transient elastography [10]. This work illustrates that AI‑driven EIT can provide accurate, non-invasive, and cost‑effective assessment of tissue pathology in a high‑throughput setting. Following prior work [5, 12, 13], pendelluft can be quantified from EIT signals as the relative discrepancy between regional and global tidal impedance changes and graded into severity categories with established prognostic relevance. The detailed definition and computation are provided in the Methods.

The emergence of large language models, particularly ChatGPT, offers new possibilities for medical image interpretation and clinical decision support [14, 15]. Unlike traditional machine learning approaches that require extensive computational resources and specialized expertise, ChatGPT provides a potentially cost-effective solution for automated medical image analysis [16]. It enables sophisticated pattern recognition while maintaining interpretability - a crucial factor for clinical implementation [17].

This study aimed to develop and validate a ChatGPT-generated system for automated pendelluft detection in mechanically ventilated patients. The primary objectives were twofold: (1) to evaluate and compare the performance of our ChatGPT-generated system against conventional machine learning approaches for EIT-based pendelluft detection, and (2) to assess the clinical utility of automated pendelluft detection in predicting patient outcomes.

The primary objective of this study was not simply to demonstrate that ChatGPT or similar LLMs can auto-generate code for analytic tools. Rather, we aimed to:

  1. Evaluate the feasibility and limitations of LLM-assisted rapid AI development in complex clinical settings.

  2. Quantify and clinically validate the performance of such auto-generated tools after expert-driven refinement and cross-institutional real-world validation.

  3. Establish a practical, transparent, and reproducible framework for safe LLM-powered medical AI research and deployment, especially in scenarios requiring real-time, explainable, and workflow-compatible solutions.

Methods

Study design and setting

This retrospective multicenter study was conducted across three tertiary hospitals in China (Ruijin Hospital, Peking Union Medical College Hospital, Fujian Provincial Hospital) between January 2020 and December 2024. Institutional review boards approved the protocol with waived consent (RJ-IRB2024-505) due to retrospective anonymized data analysis [18].

Patient population and EIT data acquisition

After screening 458 mechanically ventilated adults, 278 patients met inclusion criteria: age ≥ 18 years undergoing pressure support ventilation with concurrent electrical impedance tomography (EIT) monitoring, complete clinical records, and adequate signal quality. Exclusion criteria encompassed poor EIT signal quality (signal-to-noise ratio < 10 dB), chest wall deformities, hemodynamic instability (norepinephrine > 0.5 µg/kg/min), or incomplete follow-up data.
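The signal-quality gate above excludes recordings below 10 dB SNR. A minimal sketch of such a check is shown below; the power-ratio definition of SNR is an assumption, since the paper does not state how SNR was computed:

```python
# Hypothetical SNR gate for EIT recordings; the dB power-ratio
# definition is an assumption, not the authors' stated method.
import numpy as np

def snr_db(signal, noise):
    """SNR in decibels from mean signal and noise power."""
    return 10.0 * np.log10(np.mean(np.square(signal)) / np.mean(np.square(noise)))

t = np.arange(0, 10, 0.02)                     # 10 s at 50 Hz sampling
clean = np.sin(2 * np.pi * 0.3 * t)            # ~18 breaths/min component
noise = 0.1 * np.random.default_rng(1).normal(size=t.size)
ok = snr_db(clean, noise) >= 10.0              # passes the >= 10 dB inclusion gate
print(ok)  # True
```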

EIT segmentation and dataset composition

Electrical impedance tomography (EIT) data were acquired using PulmoVista 500 systems (Dräger Medical) with standardized 16-electrode belt placement at the fourth intercostal space, following international consensus recommendations [18, 19]. Each patient’s recording had a median duration of 38 min [IQR 32–46]. Raw signals were sampled at 50 Hz and underwent preprocessing including cardiac artifact removal using ECG-gated subtraction, motion correction via principal component analysis, band-pass filtering (0.1–1.0 Hz), and normalization to 32 × 32-pixel matrices. Data quality criteria required electrode contact impedance variation < 10% and global tidal impedance variation > 5% [20].
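The band-pass step above can be sketched as follows. Filter order and design are assumptions; the paper specifies only the 0.1–1.0 Hz passband and 50 Hz sampling:

```python
# Sketch of the described band-pass filtering (0.1-1.0 Hz) of 50 Hz
# impedance signals. Butterworth order 2 is an illustrative choice.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 50.0  # sampling rate, Hz

def bandpass(signal, low=0.1, high=1.0, fs=FS, order=2):
    sos = butter(order, [low, high], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)  # zero-phase filtering

t = np.arange(0, 30, 1 / FS)
resp = np.sin(2 * np.pi * 0.3 * t)            # ~18 breaths/min, in band
cardiac = 0.3 * np.sin(2 * np.pi * 1.5 * t)   # ~90 bpm, out of band
filtered = bandpass(resp + cardiac)            # cardiac component attenuated
```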

For analysis, recordings were segmented into non-overlapping 10-second windows, each representing approximately 3–5 breaths under pressure support ventilation. This yielded 15,984 windows in total, with 11,188, 2,398, and 2,398 windows assigned to training, validation, and testing respectively, using a 5-fold stratified cross-validation framework. Pendelluft amplitude was quantified as:

$$\text{Pendelluft amplitude (\%)} = \frac{\sum_{i}\lvert \mathrm{TIV}_{i}\rvert - \lvert \mathrm{TIV}_{\mathrm{global}}\rvert}{\lvert \mathrm{TIV}_{\mathrm{global}}\rvert} \times 100$$

where TIV_i denotes the regional tidal impedance variation of each pixel and TIV_global denotes the global tidal impedance variation [5, 12]. Severity grading adhered to validated thresholds [13]: Grade 0 (< 2.5%), Grade 1 (2.5–5%), Grade 2 (5–10%), Grade 3 (> 10%). The dataset distribution was Grade 0: 54.3%, Grade 1: 27.6%, Grade 2: 12.1%, Grade 3: 6.0%, indicating moderate class imbalance with fewer high-grade cases. To address this in model training, class-weighted cross-entropy loss was applied with higher weights for Grades 2–3, supplemented by mild oversampling of high-grade windows in training folds only; validation and test folds were left unaltered to ensure unbiased performance estimates. Grading was conducted independently by two intensivists blinded to clinical details, achieving excellent inter-rater agreement (κ = 0.91).
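The quantification and grading rule above can be sketched as follows. This is a minimal illustration assuming the relative-discrepancy formulation; the helper names and exact normalization are illustrative, not the authors' released implementation:

```python
# Illustrative pendelluft quantification and Grade 0-3 mapping using the
# validated thresholds (< 2.5%, 2.5-5%, 5-10%, > 10%).
import numpy as np

def pendelluft_amplitude(tiv_pixels, tiv_global):
    """Percent discrepancy between summed regional and global tidal variation."""
    return 100.0 * (np.sum(np.abs(tiv_pixels)) - abs(tiv_global)) / abs(tiv_global)

def pendelluft_grade(amplitude_pct):
    """Map amplitude (%) to severity Grade 0-3."""
    if amplitude_pct < 2.5:
        return 0
    if amplitude_pct < 5.0:
        return 1
    if amplitude_pct < 10.0:
        return 2
    return 3

# Example: regional variations summing to 120% of the global variation.
amp = pendelluft_amplitude(np.array([0.6, 0.6]), 1.0)
print(amp, pendelluft_grade(amp))  # 20.0 3
```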

Annotation protocol and inter-rater agreement

Two intensivists independently graded each EIT window while blinded to all clinical and model data. Disagreement occurred in 7.4% of windows (1,183 of 15,984). These cases were reviewed by a third senior intensivist, also blinded, who adjudicated the final grade. Consequently, all labels used for model development and evaluation were derived from either initial rater agreement or three-expert consensus. Inter-rater agreement between the two primary raters was excellent (Cohen’s κ = 0.91).
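The agreement statistic reported above (Cohen's κ between the two primary raters) can be computed as in this small sketch, using scikit-learn on toy labels rather than the study annotations:

```python
# Toy illustration of inter-rater agreement via Cohen's kappa; the
# label sequences here are placeholders, not study data.
from sklearn.metrics import cohen_kappa_score

rater_a = [0, 0, 1, 2, 3, 1, 0, 2]
rater_b = [0, 0, 1, 2, 3, 0, 0, 2]   # one disagreement out of eight
kappa = cohen_kappa_score(rater_a, rater_b)
print(round(kappa, 3))
```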

AI model development (ChatGPT-assisted architecture)

For AI model development, ChatGPT (GPT-4-0613 version) received structured clinical specifications through engineered prompts: “Design a hybrid deep learning model for real-time detection of pendelluft in 32 × 32 EIT frames at 50 Hz. Please include these components: (1) a CNN backbone tailored for medical imaging, (2) a bidirectional temporal analysis module, (3) an attention-based interpretability mechanism, and (4) confidence-based safety triggers for clinical deployment. Please output a single Python class with explicit layer specifications and inline comments.” The full GPT-generated Python code and the corresponding conversational log are provided in Supplementary eAppendices 3 and 5 (Appendix 3 contains the original GPT-4 prompt and prototype, while Appendix 5 contains the finalized GPT-3.5-based implementation and deployment details). A comprehensive summary of architecture modifications and their clinical rationale is provided in Supplementary Table S16.

The initial ChatGPT-generated architecture consisted of a ResNet-50 backbone, a unidirectional LSTM (128 units), and a binary classifier. We subsequently implemented targeted clinical optimizations constituting approximately 30% code modification: the spatial encoder was reconfigured to a lightweight ResNet-34 architecture with the Stage 4 layers removed and dilated convolutions added to reduce overfitting; the temporal module was upgraded to a bidirectional LSTM (256 units) with 8-head self-attention for enhanced phase detection; the output layer was expanded to 4-class grading with integrated Grad-CAM visualization; and clinical safety mechanisms incorporated confidence thresholds triggering mandatory clinician review below 90% certainty. These modifications specifically addressed clinical deployment requirements while preserving the core AI-generated architecture [19, 21, 22]. Exploratory comparisons with deeper CNN architectures are provided in Supplementary Tables S15 and S17.
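A minimal PyTorch sketch of the described topology is given below. This is an assumption-laden stand-in (a small CNN encoder in place of the modified ResNet-34; layer sizes for the encoder are illustrative), not the authors' released code, which is linked in the Data availability section:

```python
# Hypothetical sketch of the described hybrid architecture: CNN spatial
# encoder -> bidirectional LSTM (256 units) -> 8-head self-attention ->
# 4-class pendelluft grading head.
import torch
import torch.nn as nn

class PendelluftNet(nn.Module):
    def __init__(self, n_grades: int = 4):
        super().__init__()
        # Stand-in encoder; the paper uses ResNet-34 with Stage 4
        # removed and dilated convolutions.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=2, dilation=2), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=2, dilation=2), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # (B*T, 32, 1, 1)
        )
        self.lstm = nn.LSTM(32, 256, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
        self.head = nn.Linear(512, n_grades)  # 4-class grading output

    def forward(self, x):                     # x: (B, T, 1, 32, 32) frame sequence
        b, t = x.shape[:2]
        f = self.encoder(x.flatten(0, 1)).flatten(1)   # per-frame features (B*T, 32)
        h, _ = self.lstm(f.view(b, t, -1))             # temporal context (B, T, 512)
        a, _ = self.attn(h, h, h)                      # self-attention over time
        return self.head(a.mean(dim=1))                # pooled logits (B, 4)

model = PendelluftNet()
logits = model(torch.randn(2, 20, 1, 32, 32))   # 2 windows of 20 frames each
print(logits.shape)  # torch.Size([2, 4])
```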

Model training, hyperparameter selection, and validation

Building on the GPT-4-generated baseline (ResNet-50 backbone, 128-unit unidirectional LSTM, binary output), we implemented targeted clinical optimizations accounting for approximately 30% of the total codebase: (1) the ResNet-50 feature extractor was replaced with a lightweight ResNet-34 with dilated convolutions, pretrained on a large medical imaging dataset (MED-Image); (2) the temporal module was upgraded to a 256-unit bidirectional LSTM with an 8-head self-attention layer; (3) the output layer was expanded from binary classification to a 4-class system aligned with clinically validated pendelluft grades (0–3), with integrated Grad-CAM visualization; and (4) confidence-based safety triggers were introduced so that predictions with < 90% certainty mandate clinician review. Example code differences and their functional impacts are shown in Supplementary Appendix 2; representative ChatGPT prompts, generated code, and complete model implementation details are provided in Supplementary Appendix 5 to enable independent validation.

Technical validation was performed using 5-fold stratified cross-validation on 222 training cases, ensuring robust model assessment and minimizing the risk of overfitting. For independent clinical validation, two complementary approaches were adopted: First, five intensive care physicians, blinded to AI predictions, reviewed and graded 128 cases using standardized criteria [11]. Second, real-time alerts generated by the AI system for 42 detected Grade 2–3 pendelluft cases guided adjustments in ventilator parameters, including PEEP titration and modification of inspiratory time, enabling evaluation of physiological response to intervention [23].

All algorithms, including the ChatGPT-AI model and comparators, were trained and validated using identical 5-fold stratified cross-validation splits across the entire dataset. The same cross-validation folds were applied to every model to ensure precise paired comparisons and eliminate sampling-related confounders. A fixed random seed (seed = 2023) was used for reproducibility. Reported performance metrics (e.g., sensitivity, specificity, AUC) represent averages across the same test folds, and statistical analyses were performed in a paired design accordingly.

For the ChatGPT-optimized model, we tuned the learning rate (1e-3, 5e-4, 1e-4), batch size (32, 48, 64), and number of epochs (40, 60, 80) using the training folds only, with early stopping based on validation loss. A similar small grid was used for the CNN comparator. Random Forest (n_estimators, max_depth) and SVM (C, γ) hyperparameters were tuned by stratified cross-validation on the training data. The final hyperparameters reported in Table 2 correspond to the configurations achieving the highest average validation AUC across folds.
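The comparator tuning described above can be sketched as a stratified grid search; the data below are synthetic placeholders, not the study features:

```python
# Sketch of stratified-CV hyperparameter tuning for the SVM comparator
# (C, gamma) on training data only. Synthetic data stand in for the
# hand-crafted EIT features.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(2023)            # fixed seed, as in the study
X = rng.normal(size=(120, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

grid = GridSearchCV(
    SVC(kernel="rbf", probability=True, class_weight="balanced"),
    param_grid={"C": [0.5, 1.0, 2.0], "gamma": ["scale", 0.1]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=2023),
)
grid.fit(X, y)                                # tune on training folds only
print(grid.best_params_)
```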

Table 1.

Baseline characteristics of study population

Characteristic Low Grade (n = 156) High Grade (n = 122) p Value
Demographics
 Age, mean (SD), years 57.8 (12.4) 61.3 (14.6) 0.48
 Male, No. (%) 98 (62.8) 80 (65.6) 0.654
 BMI, mean (SD), kg/m² 26.5 (4.3) 27.8 (4.9) 0.87
Severity Scores
 APACHE II score, mean (SD) 17.5 (4.2) 21.3 (5.6) < 0.001
 SOFA score, mean (SD) 6.8 (2.1) 8.4 (3.0) < 0.001
Respiratory Parameters
 PaO2/FiO2 ratio, mean (SD) 220 (48) 180 (42) < 0.001
 PEEP, median (IQR), cm H2O 8 (6–10) 10 (8–12) < 0.001
 Respiratory rate, mean (SD), /min 22 (5) 28 (6) < 0.001
 ΔP/Ti, mean (SD), cm H2O/s 35.2 (8.4) 48.7 (10.2) < 0.001
Primary Diagnosis, No. (%)
 ARDS 45 (28.8) 53 (43.4) 0.14
 Pneumonia 62 (39.7) 45 (36.9) 0.674
 Post-operative 28 (17.9) 12 (9.8) 0.49
 Others 21 (13.5) 12 (9.8) 0.382

As shown in Fig. 1, the study workflow comprised five major stages: (1) patient screening and EIT data acquisition at three centers; (2) quantification and grading of pendelluft; (3) development of a ChatGPT-generated AI model using a hybrid ResNet-BiLSTM-Attention architecture with Grad-CAM interpretability; (4) model validation with five-fold cross-validation and blinded review; and (5) assessment of diagnostic performance and clinical impact.

Fig. 1.

Fig. 1

Study workflow for the development and validation of the ChatGPT-generated AI system for pendelluft detection

Comparator models

For benchmarking, we implemented three comparator models trained on the same train/validation/test splits: (1) CNN: a 5-layer convolutional neural network with 3 × 3 kernels, stride 1, batch normalization, ReLU activations, a fully connected layer with 128 units, dropout of 0.3, and the Adam optimizer (learning rate 1e-3). (2) Random Forest: a classifier with 350 trees (n_estimators = 350), maximum depth 15, and minimum samples per leaf of 2 (scikit-learn defaults for other parameters). (3) SVM: a radial basis function (RBF) kernel SVM with C = 2.0, gamma = "scale", probability estimates enabled, and class_weight = "balanced". Because Random Forest and SVM cannot directly process raw spatiotemporal EIT tensors, we derived hand-crafted feature vectors from each 10-second window, including global and regional tidal impedance variation statistics (mean, standard deviation, skewness), pendelluft amplitude, ventral-to-dorsal ventilation ratios, and temporal summary descriptors (e.g., peak-to-trough amplitude and variability across breaths). The full feature set is described in Supplementary Table S18 (Fig. 2).
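A subset of the hand-crafted features described above can be sketched as follows; the function name and exact feature set are illustrative (the full set is in Supplementary Table S18):

```python
# Illustrative summary-statistic features over one 10-second EIT window
# for the Random Forest / SVM comparators.
import numpy as np
from scipy.stats import skew

def window_features(tiv_window):
    """tiv_window: (n_frames, 32, 32) impedance frames for one window."""
    global_curve = tiv_window.sum(axis=(1, 2))     # global impedance per frame
    return np.array([
        tiv_window.mean(),                          # mean regional variation
        tiv_window.std(),                           # regional variability
        skew(tiv_window.ravel()),                   # distribution asymmetry
        global_curve.max() - global_curve.min(),    # peak-to-trough amplitude
        global_curve.std(),                         # across-breath variability proxy
    ])

feats = window_features(np.random.default_rng(0).normal(size=(500, 32, 32)))
print(feats.shape)  # (5,)
```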

Fig. 2.

Fig. 2

Receiver-operating-characteristic curves for model performance comparison. ROC curves demonstrating the diagnostic performance of ChatGPT (AUC = 0.91) compared to CNN (AUC = 0.85), Random Forest (AUC = 0.82), and SVM (AUC = 0.79). Shaded areas represent 95% confidence intervals. P values for pairwise comparisons were adjusted using Bonferroni correction

Safety workflow and human-in-the-loop governance

The model outputs calibrated probabilities for each pendelluft grade. If the maximum predicted probability is < 90%, the case is automatically flagged as “low confidence”, and the corresponding EIT segment and Grad-CAM visualization are queued for mandatory clinician review. In these cases, AI outputs are treated as advisory only and not acted upon without explicit clinician confirmation. All low-confidence predictions, Grade 2–3 alerts, and clinician overrides are logged for periodic failure analysis and potential model recalibration, thereby providing a safeguard against model drift and maintaining patient safety. Operational safety and reliability metrics from the clinical validation period are reported in Supplementary Table S10.
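The confidence-gated review rule above reduces to a small amount of triage logic, sketched here with illustrative probabilities:

```python
# Minimal sketch of the human-in-the-loop gate: predictions below the
# 90% certainty threshold are flagged for mandatory clinician review,
# and Grade 2-3 alerts are logged for failure analysis.
CONF_THRESHOLD = 0.90

def triage(probs, log):
    """probs: calibrated per-grade probabilities for one case."""
    grade = max(range(len(probs)), key=probs.__getitem__)
    confident = probs[grade] >= CONF_THRESHOLD
    if not confident or grade >= 2:                 # low confidence or Grade 2-3 alert
        log.append({"grade": grade, "p": probs[grade], "review": not confident})
    return grade, confident

log = []
g1, c1 = triage([0.05, 0.03, 0.02, 0.90], log)      # confident Grade 3 alert
g2, c2 = triage([0.40, 0.35, 0.15, 0.10], log)      # low confidence -> review
print(g1, c1, g2, c2, len(log))  # 3 True 0 False 2
```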

Computational environment

All models were implemented in Python (version 3.12) using PyTorch 2.0. Training and inference were performed on a workstation equipped with two NVIDIA RTX 3090 GPUs (24 GB VRAM each), an Intel Xeon CPU, and 128 GB RAM, running Ubuntu 20.04 and CUDA 11.X. The per-case analysis times reported in Table 2 and Fig. 3 were measured on a single RTX 3090 GPU in this environment.

Table 2.

Diagnostic performance comparison among different models

Model Sensitivity % (95% CI) Specificity % (95% CI) AUC-ROC (95% CI) F1 Score Analysis Time per Case (min)
ChatGPT (clinician-optimized) 89.6 (85.3–93.2) * 92.1 (88.7–95.0) * 0.91 (0.87–0.95) * 0.89 2.8 (2.3–3.4)
ChatGPT (original, baseline) 83.2 (78.1–87.7) # 84.5 (79.9–89.1) # 0.87 (0.83–0.91) # 0.82 4.1 (3.5–4.9)
CNN 82.1 (77.4–87.0) 84.5 (80.1–88.7) 0.85 (0.81–0.89) 0.86 5.6 (4.8–6.4)
Random Forest 79.2 (74.5–84.3) 81.7 (77.1–86.2) 0.82 (0.78–0.86) 0.83 4.9 (4.2–5.5)
SVM 76.8 (71.2–82.1) 78.3 (73.6–83.2) 0.79 (0.75–0.83) 0.80 5.2 (4.6–5.9)

#p < 0.01 for improvement after clinical optimization over ChatGPT-original (paired Wilcoxon signed-rank test and DeLong's test). All performances are based on identical 5-fold CV splits for all models. Model parameters and training pipelines were as follows. ChatGPT-optimized: ResNet-34 (MED-Image pretrained, Stage 4 removed, dilated convolutions), Bi-LSTM (256 units), self-attention (8-head), Grad-CAM, Adam (LR 1e-4), batch size 48, 60 epochs. ChatGPT-original: ResNet-50 (ImageNet pretrained), unidirectional LSTM (128 units), binary output, Adam (LR 1e-4). CNN: 5 conv layers, 3 × 3 kernels, stride 1, batch norm, ReLU, FC 128, dropout 0.3, Adam (LR 1e-3). Random Forest: n_estimators = 350, max_depth = 15, min_samples_leaf = 2, scikit-learn defaults otherwise. SVM: RBF kernel, C = 2.0, gamma = 'scale', probability = True, class_weight = 'balanced' (scikit-learn). Full details of model training and code are provided in Supplementary Table S12 and Appendix 2

*p < 0.01 vs. CNN/Random Forest/SVM and ChatGPT-original (paired Wilcoxon Signed-Rank Test and DeLong’s test; Holm correction)

Fig. 3.

Fig. 3

Workflow impact visualization. Visualizes the operational and workflow advantages realized after clinical implementation of the ChatGPT-optimized detection system. The figure includes metrics such as average interpretation time per case, detection lead time compared to standard monitoring, cost savings, and healthcare staff-reported efficiency improvements. Graphical representation underscores the system’s potential to enhance ICU workflow while maintaining diagnostic accuracy

Data availability

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

The complete source code and trained model used in this study are openly available at: GitHub: https://github.com/ccmzhangrui/EITmodel.

Cost and efficiency analysis

We estimated the cost per case by considering ICU physician time as the primary variable cost. The average hourly ICU physician salary was 150 RMB, based on institutional human resources data. The median interpretation time was 12.5 min for manual EIT review and 2.8 min for AI-assisted review. AI system fixed costs (infrastructure and maintenance) were amortized over the annual number of EIT cases.

Formulas

$$\text{Cost per case} = \frac{\text{hourly physician rate} \times \text{interpretation time (min)}}{60} + \frac{\text{annual AI fixed cost}}{\text{annual EIT case volume}}$$

Example (institutional figures).

Manual review: 150 RMB/h × 12.5 min / 60 ≈ 31.3 RMB of physician time per case; AI-assisted review: 150 RMB/h × 2.8 min / 60 ≈ 7.0 RMB per case.

Estimated saving: ≈ 150 RMB per case (including reduced physician time and operational overhead).
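The physician-time component of this estimate is straightforward arithmetic (150 RMB/h; 12.5 min manual vs. 2.8 min AI-assisted). The sketch below checks it; fixed-cost amortization is omitted, so this is only the physician-time portion of the per-case saving:

```python
# Arithmetic check of the physician-time cost component per case.
RATE = 150.0                      # RMB per physician hour (institutional figure)
manual = RATE * 12.5 / 60         # manual expert review
ai = RATE * 2.8 / 60              # AI-assisted review
print(round(manual, 2), round(ai, 2), round(manual - ai, 2))
# 31.25 7.0 24.25
```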

Limitations

This simplified, institution-specific calculation omits other potential cost or benefit categories (e.g., training, opportunity costs, downstream clinical effects) and should be interpreted as hypothesis‑generating, not as a full cost‑effectiveness analysis. Detailed comparative cost-effectiveness results are summarized in Supplementary Table S7.

Endpoints

The co-primary endpoints were diagnostic accuracy—quantified by AUC-ROC, sensitivity, and specificity—and analytic efficiency, defined as median analysis time per case and estimated cost per case. Secondary endpoints included associations between pendelluft grades and patient outcomes (28-day mortality, ventilator-free days, reintubation rates, and physiological changes after AI-guided ventilatory adjustments).

Statistical analysis

Statistical analysis was designed following DeLong's method for AUC comparison with α = 0.05 and 80% power [24, 25], and following methodological standards for machine learning evaluation. For model performance comparisons based on 5-fold cross-validation with paired samples, overall significance across all algorithms was assessed using the Friedman test [26] (a nonparametric repeated-measures ANOVA). Pairwise differences (e.g., ChatGPT vs. each baseline model) were further evaluated using the Wilcoxon signed-rank test with Holm-Bonferroni correction for multiple testing [25, 27]. For AUC comparisons between paired ROC curves, DeLong's test was applied. All model comparisons used identical patient splits (the same cross-validation folds across algorithms). Continuous variables were assessed for normality, and appropriate parametric (paired t-test, ANOVA) or non-parametric (Wilcoxon, Friedman) methods were applied. Statistical significance was set at α = 0.05. Additional statistical methods and supplementary analyses are described in Appendix 4 of the Supplementary Materials. All statistical analyses used R v4.3.1 or Python (scipy.stats), with results reported as mean ± SD or median (IQR), and 95% confidence intervals where appropriate [28].
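The paired testing scheme can be sketched as follows: Wilcoxon signed-rank tests on per-fold metrics with a manual Holm step-down correction. The per-fold AUC values below are synthetic stand-ins, not study results:

```python
# Sketch of paired Wilcoxon signed-rank testing across CV folds with a
# Holm step-down correction (synthetic per-fold AUCs for illustration).
import numpy as np
from scipy.stats import wilcoxon

def holm(pvals):
    """Holm step-down adjusted p-values."""
    pvals = np.asarray(pvals, dtype=float)
    order = np.argsort(pvals)
    m = len(pvals)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * pvals[idx])  # enforce monotonicity
        adj[idx] = min(1.0, running)
    return adj

auc_model = np.array([0.91, 0.90, 0.92, 0.89, 0.93])     # per-fold AUCs (illustrative)
baselines = {"cnn": [0.85, 0.84, 0.86, 0.83, 0.87],
             "svm": [0.79, 0.78, 0.80, 0.77, 0.81]}
pvals = [wilcoxon(auc_model, np.array(b)).pvalue for b in baselines.values()]
adj = holm(pvals)
print(adj)
```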

Legend for Fig. 1. The schematic illustrates the five key stages: (1) Patient screening and EIT data acquisition across three centers (n = 458 screened, n = 278 included); (2) Quantification of pendelluft amplitude and 4-grade severity classification; (3) AI model development using GPT-4, featuring a hybrid ResNet-BiLSTM-Attention architecture with Grad-CAM interpretability; (4) Model validation through 5-fold cross-validation and blinded clinical review; (5) Assessment of diagnostic performance (AUC, sensitivity, specificity) and clinical impact, including increased mortality risk with severe pendelluft and reduced analysis time. Abbreviations: EIT, electrical impedance tomography; AUC, area under the receiver operating characteristic curve

Key components of the optimized architecture were as follows:

  1. ResNet-34 Modification:

    Removed final convolutional stage (Stage 4) to reduce computational complexity.

    Implemented dilated convolutions to maintain spatial resolution while decreasing parameters.

    Pretrained on publicly available MED-Image dataset (2.8M medical images)

  2. Attention Mechanism:

    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

    where d_k denotes the dimension of the key vectors (d_k = 32)

  3. Uncertainty Quantification:

    Bayesian estimation via Monte Carlo dropout (T=100 iterations):

    $$p(y \mid x) \approx \frac{1}{T}\sum_{t=1}^{T} \mathrm{softmax}\!\left(f^{\hat{W}_t}(x)\right)$$
  4. Grad-CAM Implementation:

    Generated visual explanations highlighting regional ventilation abnormalities to support clinical interpretation.
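The attention and Monte Carlo dropout formulas above can be illustrated in NumPy as follows; the forward pass in the dropout example is a single stand-in linear layer, not the study network:

```python
# NumPy sketch of scaled dot-product attention (d_k = 32) and an MC
# dropout predictive mean over T stochastic forward passes.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, d_k=32):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 32))
K = rng.normal(size=(10, 32))
V = rng.normal(size=(10, 64))
out = attention(Q, K, V)                              # (10, 64)

def mc_dropout_predict(x, weights, T=100, p_drop=0.3):
    """Average class probabilities over T dropout-perturbed passes."""
    probs = []
    for _ in range(T):
        mask = rng.random(weights.shape) > p_drop     # random dropout mask
        probs.append(softmax(x @ (weights * mask) / (1 - p_drop)))
    return np.mean(probs, axis=0)                     # predictive mean p(y|x)

p_mean = mc_dropout_predict(rng.normal(size=(8,)), rng.normal(size=(8, 4)))
print(out.shape, p_mean.shape)  # (10, 64) (4,)
```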

Results

A total of 278 mechanically ventilated patients were enrolled across three tertiary hospitals in China (Supplemental Fig. 1). Of the 458 adults screened, 180 were excluded according to predefined criteria, ensuring a representative cohort for analysis. Baseline clinical and demographic characteristics, stratified by pendelluft severity, are presented in Table 1. Patients with high-grade pendelluft exhibited significantly higher APACHE II and SOFA scores, decreased PaO₂/FiO₂ ratios, elevated PEEP, and increased respiratory drive compared to those with low-grade pendelluft (all p < 0.001). The cohort displayed adequate heterogeneity and sampling consistency by participating center (see Supplementary Table S1). Among the 42 Grade 2–3 cases in which the AI system detected high-grade pendelluft and prompted bedside ventilator adjustments, PaO₂/FiO₂ increased from 181 ± 42 to 223 ± 46 mmHg (mean change + 42 mmHg, P < 0.001), plateau pressure decreased from 22.8 ± 2.9 to 20.5 ± 2.7 cmH₂O (− 2.3 cmH₂O, P < 0.001), and driving pressure decreased from 13.6 ± 2.5 to 11.7 ± 2.3 cmH₂O (− 1.9 cmH₂O, P < 0.001) (Fig. 4, Supplementary Table S6).

Fig. 4.

Fig. 4

Intervention physiology outcomes. Summarizes the key physiological improvements observed in patients who received real-time ventilator adjustments based on AI-guided pendelluft detection. The panel displays pre- and post-intervention changes in PaO₂/FiO₂ ratio, plateau pressure, and driving pressure among Grade 2–3 cases (n = 42), with associated statistical significance; physiological outcomes before and after AI-guided intervention are also shown in Supplementary Fig. 6. These results highlight the clinical relevance and potential impact of AI-based system recommendations on patient respiratory metrics

The distribution of pendelluft severity, as depicted in Supplementary Fig. 2, revealed that clinically significant cases (Grade 2–3) comprised a notable proportion of the study population. Grade-wise classification achieved high inter-rater reliability (κ = 0.91), confirming the robustness of severity assessment.

The ChatGPT-generated, clinician-optimized AI system demonstrated superior diagnostic performance for automated pendelluft detection compared with conventional machine learning models. As summarized in Table 2, the optimized ChatGPT model achieved an AUC-ROC of 0.91 (95% CI: 0.87–0.95), with sensitivity of 89.6% and specificity of 92.1%. These values were significantly higher than the original ChatGPT baseline, CNN (AUC 0.85), random forest (AUC 0.82), and SVM (AUC 0.79) models (all P < 0.01; Table 2; Fig. 3). F1 scores, analysis time per case, and additional predictive metrics are detailed in Supplementary Table S2 and S3.

Subgroup analyses (Table 3) confirmed that the performance advantages of the ChatGPT-AI system persisted across clinically relevant strata, including APACHE II score, patient age, and primary diagnosis (ARDS, pneumonia, postoperative, and others), with AUCs consistently > 0.89 in all subgroups. Expanded subgroup performance metrics are provided in Supplementary Table S4. Expanded comparative metrics with additional large language models (LLaMA-Medical v2, BioBERT-Clinic) are provided in Supplementary Table S5 and visualized in Supplementary Fig. 4, further underscoring the model’s state-of-the-art performance.

Table 3.

Integrated analysis of clinical outcomes and model performance

| Category & Subgroup | N | ChatGPT | CNN | Random Forest | SVM | P Value* |
|---|---|---|---|---|---|---|
| Overall Performance | 278 | | | | | |
|  AUC (95% CI) | | 0.91 (0.87–0.95) | 0.85 (0.81–0.89) | 0.82 (0.78–0.86) | 0.79 (0.75–0.83) | Reference |
|  Sensitivity, % | | 89.6 (85.3–93.2) | 82.1 (77.4–87.0) | 79.2 (74.5–84.3) | 76.8 (71.2–82.1) | < 0.001 |
|  Specificity, % | | 92.1 (88.7–95.0) | 84.5 (80.1–88.7) | 81.7 (77.1–86.2) | 78.3 (73.6–83.2) | < 0.001 |
| By APACHE II Score | | | | | | |
|  < 20 points | 130 | 0.92 (0.88–0.96) | 0.85 (0.81–0.89) | 0.83 (0.79–0.87) | 0.80 (0.76–0.84) | < 0.001 |
|  ≥ 20 points | 148 | 0.90 (0.86–0.94) | 0.83 (0.79–0.87) | 0.81 (0.77–0.85) | 0.78 (0.74–0.82) | < 0.001 |
| By Age | | | | | | |
|  < 65 years | 145 | 0.92 (0.88–0.96) | 0.86 (0.82–0.90) | 0.83 (0.79–0.87) | 0.80 (0.76–0.84) | < 0.001 |
|  ≥ 65 years | 133 | 0.89 (0.85–0.93) | 0.84 (0.80–0.88) | 0.81 (0.77–0.85) | 0.78 (0.74–0.82) | < 0.001 |
| By Primary Diagnosis | | | | | | |
|  ARDS | 99 | 0.90 (0.86–0.94) | 0.84 (0.80–0.88) | 0.81 (0.77–0.85) | 0.78 (0.74–0.82) | < 0.001 |
|  Pneumonia | 107 | 0.91 (0.87–0.95) | 0.85 (0.81–0.89) | 0.82 (0.78–0.86) | 0.79 (0.75–0.83) | < 0.001 |
|  Post-operative | 40 | 0.92 (0.88–0.96) | 0.86 (0.82–0.90) | 0.83 (0.79–0.87) | 0.80 (0.76–0.84) | < 0.001 |
|  Others | 32 | 0.90 (0.86–0.94) | 0.84 (0.80–0.88) | 0.81 (0.77–0.85) | 0.78 (0.74–0.82) | < 0.001 |

Comprehensive evaluation of model performance across different clinical subgroups and outcome measures. Analysis performed on test set with stratification by clinical characteristics

Abbreviations: AUC Area under the curve, CI Confidence interval, CNN Convolutional neural network, SVM Support vector machine, APACHE Acute Physiology and Chronic Health Evaluation, ARDS Acute respiratory distress syndrome

*P values were calculated using DeLong's test for AUC comparison, with Bonferroni correction for multiple comparisons. Model performance was consistently maintained across subgroups, with ChatGPT showing superior performance in all categories (P < 0.001 for all comparisons). Full subgroup analysis across demographic and clinical variables is shown in Supplementary Table S8
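The Bonferroni adjustment applied to the pairwise DeLong comparisons is simple to state: each raw p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch, using illustrative placeholder p-values rather than the study's actual DeLong results:

```python
# Hedged sketch of the Bonferroni correction for multiple comparisons.
# Raw p-values below are illustrative placeholders only.

def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, rejection decisions) under Bonferroni."""
    k = len(p_values)
    adjusted = [min(p * k, 1.0) for p in p_values]
    return adjusted, [p_adj < alpha for p_adj in adjusted]

# e.g. ChatGPT vs CNN / random forest / SVM (illustrative values)
raw = [0.0002, 0.004, 0.0008]
adjusted, reject = bonferroni(raw)
print(adjusted, reject)
```

With three pairwise comparisons, a raw p-value must fall below alpha/3 ≈ 0.0167 to remain significant after correction.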

Workflow efficiency analysis (Fig. 3, Supplementary Fig. 7) demonstrated that implementation of the AI system reduced median analysis time from 12.5 min (manual expert interpretation) to 2.8 min per case, enabling a significant increase in cases processed per hour and an average interpretation cost reduction of 200 RMB per patient. The system maintained a sustained operational uptime of 99.2%, with clinician review triggered in 11.3% of cases based on confidence thresholds.
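The throughput implication of the quoted per-case times can be checked with simple arithmetic; the sketch below assumes analysis time is the only bottleneck and uses the per-patient saving quoted above:

```python
# Back-of-the-envelope check of the workflow-efficiency figures quoted
# above (12.5 -> 2.8 min per case; 200 RMB saved per patient). Assumes
# analysis time is the sole throughput constraint.

manual_min, ai_min = 12.5, 2.8
cases_per_hour_manual = 60 / manual_min
cases_per_hour_ai = 60 / ai_min
saving_per_patient_rmb = 200

print(round(cases_per_hour_manual, 1))        # 4.8 cases/hour
print(round(cases_per_hour_ai, 1))            # 21.4 cases/hour
print(278 * saving_per_patient_rmb)           # cohort-wide saving: 55600 RMB
```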

Clinical validation highlighted strong agreement between AI and clinician grading, with Cohen’s κ of 0.86 (95% CI: 0.82–0.91) across all sites (Supplementary Fig. 8). In 42 cases where the AI detected high-grade pendelluft and prompted real-time ventilatory adjustments, there were significant improvements in PaO₂/FiO₂, and reductions in plateau and driving pressures (Fig. 4, Supplementary Table S6), demonstrating the practical clinical utility of the system’s recommendations.
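Cohen's κ, the agreement statistic used above, corrects observed agreement for the agreement expected by chance from each rater's marginal grade distribution. A minimal pure-Python sketch (the grade lists are toy data, chosen only to illustrate the computation):

```python
# Hedged sketch of Cohen's kappa between AI and clinician pendelluft
# grades (0-3). The labels below are illustrative toy data.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

ai_grades        = [0, 0, 1, 2, 3, 1, 0, 2, 3, 0]
clinician_grades = [0, 0, 1, 2, 3, 1, 0, 1, 3, 0]
print(round(cohens_kappa(ai_grades, clinician_grades), 2))  # 0.86
```

In practice, a weighted κ (penalizing larger grade discrepancies more heavily) may be preferable for ordinal grades; the study does not specify which variant was used.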

Importantly, higher pendelluft severity remained a strong independent predictor of poor outcomes. Patients with high-grade pendelluft experienced increased 28-day mortality (25.4% vs. 9.0%; adjusted HR 2.8, 95% CI: 1.6–4.9, P = 0.001), fewer ventilator-free days, and higher incidences of reintubation and complications (Supplementary Table S2, Supplementary Fig. 2, and Supplementary Fig. 6). The system’s false negative rate for clinically significant pendelluft was low (2.1%), supporting its safety for clinical decision support.

Model calibration was assessed to determine the agreement between predicted and observed probabilities of pendelluft presence. As shown in Fig. 5A, the calibration plot demonstrated a close correspondence between predicted risks generated by the ChatGPT model and actual observed outcomes across probability strata. The 95% confidence interval overlapped substantially with the line of perfect calibration, and the Hosmer-Lemeshow test indicated good model fit, confirming the reliability of predicted probabilities for clinical interpretation.

To evaluate clinical utility, decision curve analysis was performed (Fig. 5B). Across a wide range of threshold probabilities, the ChatGPT-based system consistently showed a higher net benefit compared to both conventional CNN and expert physician review. The decision curve for the ChatGPT model remained above those for alternative strategies and reference lines ("treat all" and "treat none"), highlighting its potential to enhance clinical decision-making and optimize patient management.
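Net benefit, the quantity plotted in a decision curve, weighs true positives against false positives at each threshold probability pt: NB = TP/n − (FP/n) × pt/(1 − pt). A minimal sketch with illustrative counts (not the study data):

```python
# Hedged sketch of the net-benefit calculation underlying a decision
# curve. Counts, prevalence, and threshold are illustrative only.

def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Net benefit = TP/n - (FP/n) * pt / (1 - pt) at threshold pt."""
    return tp / n - (fp / n) * (threshold / (1 - threshold))

def net_benefit_treat_all(prevalence: float, threshold: float) -> float:
    """Reference strategy: treat every patient (the 'treat all' line)."""
    return prevalence - (1 - prevalence) * (threshold / (1 - threshold))

print(round(net_benefit(tp=86, fp=9, n=210, threshold=0.2), 3))          # 0.399
print(round(net_benefit_treat_all(prevalence=96 / 210, threshold=0.2), 3))  # 0.321
```

A model adds clinical value at a given threshold whenever its net benefit exceeds both the "treat all" curve and the "treat none" line (net benefit 0).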

Fig. 5.

Fig. 5

A Model Calibration Analysis. The calibration plot illustrates the agreement between predicted and observed probabilities for the ChatGPT model in pendelluft detection. The solid diagonal line represents perfect calibration, while the plotted points depict the actual model outputs across probability deciles. The shaded gray area corresponds to the 95% confidence interval. The model demonstrates close alignment with the ideal line, and the goodness of fit assessed by the Hosmer-Lemeshow test confirms satisfactory calibration. B Decision Curve Analysis (DCA). The decision curve analysis compares the net clinical benefit of the ChatGPT-based system, conventional CNN, and expert review across a range of threshold probabilities for intervention. The net benefit curves show that the ChatGPT model provides consistently greater net benefit than both alternatives. The "treat all" and "treat none" lines serve as reference strategies, and the superior curve of the ChatGPT model highlights its added value in guiding clinical decision-making. C Electrical Impedance Tomography (EIT) combined with an AI Grad-CAM heatmap illustrating the spatial distribution and grading of regional ventilation delay. On the left, the EIT cross-sectional image shows the thoracic background and red-highlighted high-attention areas as detected by Grad-CAM. The lung is divided into four equally spaced gravitational zones (ZONE 1–4), with the nondependent region indicated above and the dependent region below the image. The bottom color bar represents the percentage range of the delayed area for corresponding grades: green (Grade 0, < 2.5%), yellow (Grade 1, 2.5–5%), orange (Grade 2, 5–10%), and red (Grade 3, > 10%). Waveform panels on the right display ventilation curves for each zone, visualizing phase delay and phase inversion; both delay magnitude and color grade are positively correlated. This grading method provides an intuitive depiction of the distribution characteristics of different levels of ventilation delay, which can support precise adjustment of mechanical ventilation at the bedside.
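The caption's delayed-area thresholds map directly to severity grades. A minimal sketch of that mapping (the handling of exact boundary values, e.g. exactly 10%, is an assumption since the caption leaves it unspecified):

```python
# Minimal sketch of the delayed-area grading thresholds from the figure
# caption (Grade 0 < 2.5%, Grade 1 2.5-5%, Grade 2 5-10%, Grade 3 > 10%).
# Boundary handling at exactly 5% and 10% is an assumption.

def pendelluft_grade(delayed_area_pct: float) -> int:
    """Map the delayed lung-area percentage to a severity grade 0-3."""
    if delayed_area_pct < 2.5:
        return 0
    if delayed_area_pct < 5:
        return 1
    if delayed_area_pct <= 10:
        return 2
    return 3

print([pendelluft_grade(x) for x in (1.0, 3.0, 7.5, 12.0)])  # [0, 1, 2, 3]
```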

After targeted clinical optimizations—accounting for approximately 30% code modification—the AI system further improved in AUC (+ 0.04), Cohen’s κ (+ 0.07), and reduced mean inference time by 1.3 min compared to the original ChatGPT pipeline (see Supplementary Fig. 8 and Table S11 for details). Representative EIT images and Grad-CAM visualizations of regional ventilation abnormalities across pendelluft grades are provided in Fig. 5C and Supplementary Fig. 5, reinforcing the interpretability and clinical utility of the approach.

Baseline characteristics of study participants stratified by pendelluft severity (low-grade vs. high-grade): Continuous variables are presented as mean (SD) or median (IQR) based on their distribution. Categorical variables are presented as number (percentage). P values were calculated using one-way ANOVA or the Kruskal-Wallis test for continuous variables and the chi-square test or Fisher exact test for categorical variables, as appropriate. Abbreviations: BMI, body mass index; APACHE, Acute Physiology and Chronic Health Evaluation; SOFA, Sequential Organ Failure Assessment; ARDS, acute respiratory distress syndrome.


Discussion

This multicenter study provides compelling evidence that a ChatGPT-generated, hybrid deep learning system for pendelluft detection in mechanically ventilated patients represents a significant advancement in both diagnostic accuracy and practical applicability compared to existing paradigms. Mechanical ventilation remains central to ICU practice, but inappropriate strategies can lead to complex complications, including difficult weaning, ventilator-induced lung injury, and ultimately increased morbidity and mortality [1–3]. Pendelluft—the asynchronous movement of air between lung regions—has been increasingly recognized as a clinically meaningful manifestation of ventilation heterogeneity and a harbinger of adverse outcomes [4, 5]. EIT offers dynamic, radiation-free imaging that ideally positions it for continual bedside assessment of regional ventilation [7]. However, the clinical impact of EIT has been limited by the interpretive demands on clinicians and the lack of robust, standardized, and real-time analytic tools capable of integrating seamlessly into demanding critical care environments [7–9].

The significance of this work lies not simply in the technical question of whether LLMs can generate medical AI code, but in systematically demonstrating how such models can, with rigorous expert curation and clinical testing, deliver diagnostic accuracy, efficiency, and explainability surpassing both conventional ML and even contemporary domain-specialized LLMs. All analyses included benchmarks not only against conventional CNN and classical machine learning methods, but also against state-of-the-art medical-specialized large language models (LLaMA-Medical v2 and BioBERT-Clinic), validating the distinct advantages of the GPT-4-generated framework both within and across AI paradigms. In addition to its diagnostic accuracy, our system's design incorporates several key strengths that align with current priorities in ICU care: use of a rigorously curated multicenter dataset for enhanced generalizability; comparative benchmarking with conventional and state-of-the-art AI architectures; integration of explainable Grad-CAM visualizations to support clinician trust; embedding into the existing EIT workflow for seamless adoption; and demonstrable improvements in analytic efficiency and cost-effectiveness. These features together ensure both scientific robustness and practical applicability in diverse critical care environments.

Our findings clearly show that leveraging a large language model-generated architecture, further optimized by domain experts, overcomes many of these challenges. The system developed in the present study (combining a lightweight, pre-trained convolutional feature extractor and a bidirectional attention-augmented recurrent network) demonstrates superior performance compared to classic machine learning approaches, achieving an AUC of 0.91, with both sensitivity and specificity above 89% [10, 11, 14, 16]. These results align with and extend previous studies suggesting that deep hybrid models can raise the bar for EIT interpretation and surpass the practical limitations of classical image reconstruction and machine learning frameworks [11, 12]. Of particular note, by integrating clinical feedback into the loop and building in explainability features alongside strict safety guardrails, this platform addresses historical criticisms regarding the ‘black box’ nature of advanced AI in healthcare and prepares it for bedside adoption [17, 19]. Automated quantification of pendelluft, based on validated metrics, is not merely an academic exercise but offers tangible clinical value. Consistent with prior work, we found that higher pendelluft severity was independently associated with increased 28‑day mortality, fewer ventilator‑free days, and higher rates of clinically meaningful complications such as reintubation and ventilator‑associated pneumonia [5, 29]. These associations reinforce the concept that pendelluft reflects injurious regional ventilation heterogeneity and can serve as a real‑time signal of patients at increased risk during assisted mechanical ventilation.

Recent physiological studies have begun to clarify how ventilatory settings and respiratory muscle activity shape pendelluft during the vulnerable transition from fully controlled to partially assisted ventilation. In hypoxemic ARDS patients resuming spontaneous breathing, Cornejo et al. showed in a randomized crossover trial that increasing pressure support systematically reduced pendelluft magnitude and expiratory muscle activity, whereas higher EIT‑guided PEEP had a more complex and context‑dependent effect [31]. Specifically, higher PEEP tended to decrease pendelluft by improving dynamic lung compliance, but this benefit could be attenuated or even counteracted by concomitant recruitment of expiratory muscles. Expiratory gastric pressure swings and inspiratory muscular pressure were both independent determinants of pendelluft, and mediation analyses suggested that changes in pendelluft induced by PS and PEEP were transmitted, at least in part, through their impact on respiratory muscle effort and compliance. Together, these findings indicate that pendelluft is not a fixed patient characteristic, but a dynamic phenomenon arising from the interaction among ventilatory support level, inspiratory and expiratory muscle activation, and lung mechanics.

Our results extend these physiological insights by demonstrating that an automated EIT‑based system can continuously detect and grade pendelluft at the bedside across a broad ICU population. Although our study was not designed as an interventional ventilatory‑setting trial and cannot establish causality, the strong association between high‑grade pendelluft and adverse outcomes, combined with prior mechanistic work [5, 12, 13, 31], supports the notion that pendelluft‑aware ventilator management may help identify opportunities to adjust pressure support and PEEP in a way that minimizes injurious regional ventilation while avoiding overassistance or excessive expiratory muscle recruitment. In this framework, the AI system we propose is best viewed as a real‑time monitoring and decision‑support tool that surfaces clinically relevant pendelluft patterns and their temporal evolution, rather than as an autonomous prescriptive controller of ventilator settings.

Beyond the respiratory system, recent work has shown that AI‑enabled EIT can be generalized to other organ domains. Bahukhandi et al. reported a portable multi‑spectral EIT system combined with a ResNet‑18‑based deep learning model to detect and quantify metabolic associated steatotic liver disease by predicting the Controlled Attenuation Parameter, achieving excellent discrimination (AUC 0.95) with high sensitivity and specificity compared with vibration‑controlled transient elastography [10]. Taken together with our pendelluft‑focused findings, this suggests that EIT, when coupled with modern AI architectures, may serve as a versatile, non‑invasive platform for organ‑specific functional imaging, spanning hepatology and critical care respiratory monitoring. Our work adds to this emerging body of evidence by showing that a clinically optimized, LLM‑assisted pipeline can deliver robust, explainable performance for a complex, dynamic respiratory phenomenon such as pendelluft in real‑world ICU settings. Additional benchmarking versus other AI/ML and LLM models is presented in Supplementary Table S14.

Implementation of the ChatGPT-based model substantially reduced the time and expertise required for EIT interpretation, providing rapid (median 2.8 min) and cost-effective analysis without sacrificing robustness. The system identified at-risk patients earlier than conventional review, resulting in more timely clinical interventions and, by extension, improved physiological and clinical trajectories. These workflow and efficiency gains, coupled with high clinician agreement (Cohen's κ = 0.86), are particularly relevant as ICU staffing pressures and economic constraints continue to rise [24, 29]. Importantly, the high level of staff acceptance (approaching 90%) underscores that thoughtfully designed, clinician-aligned AI tools can be successfully deployed in real-world ICU settings, addressing a key barrier to digital health adoption.

The multicenter nature of our study—spanning three tertiary care hospitals—ensured rigorous cross-institutional validation, and comprehensive statistical protocols minimized concerns regarding model overfitting or site-specific bias [23]. The deployment of strict confidence thresholds, which automatically triggered human review if certainty fell below 90%, ensured that no high-grade pendelluft events were missed and maintained patient safety at the forefront [21, 22]. Consistency with recent international consensus statements and technical guidelines further strengthens the clinical translation of our approach [9, 18].
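The confidence-threshold guardrail described above amounts to a simple routing rule: any prediction whose confidence falls below 90% is escalated to human review. A minimal sketch of that logic (the function and field names are assumptions for illustration, not the study's implementation):

```python
# Hedged sketch of the safety guardrail described above: predictions
# below the 90% confidence threshold are routed to clinician review.
# Function and field names are illustrative assumptions.

def route_prediction(grade: int, confidence: float, threshold: float = 0.90) -> str:
    """Return the disposition for one AI pendelluft prediction."""
    if confidence < threshold:
        return "clinician-review"
    return "auto-report"

# (grade, model confidence) pairs -- toy examples
cases = [(3, 0.97), (2, 0.85), (0, 0.99), (1, 0.88)]
print([route_prediction(g, c) for g, c in cases])
```

In the reported cohort this rule triggered clinician review in 11.3% of cases; a production system might additionally force review for all high-grade predictions regardless of confidence.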

While large language models such as GPT-4 can provide rapid code prototyping, their outputs for clinical applications must not be presumed clinically reliable. Direct application of GPT-native code alone, without structured expert review and cross-site validation, was insufficient for clinical reliability. In our workflow, all system outputs—including early code, model predictions, and AI-based pendelluft gradings—were validated using a combination of: (1) five expert ICU physicians blinded to AI predictions; (2) multicenter retrospective datasets with external cohorts (see Supplementary Table S9); (3) real-time physiological response to AI-guided interventions; and (4) built-in system safety triggers for uncertain cases. We stress that no clinical reliability was assumed based on prompt engineering alone. Specific safety and failure-prevention cases are detailed in Supplementary Table S13.

Despite these strengths, several limitations deserve mention. The retrospective data collection—while practical for rapid prototyping and validation—may introduce selection bias, and our inclusion criteria (good EIT signal quality, exclusion of severe hemodynamic instability, etc.) may restrict generalizability to broader ICU populations [20]. Additionally, while the system’s performance remained robust across participating centers and baseline severities, prospective validation in more diverse clinical environments and with newer EIT hardware is warranted to further confirm real-world utility. Another potential limitation is the necessary human-in-the-loop approach during system development: expert feedback was essential to ensure clinical appropriateness, underscoring that even advanced LLM-derived tools require multidisciplinary iterative refinement prior to fully autonomous deployment. Furthermore, while our system excelled at pendelluft analysis, its ability to detect subtler forms of asynchronous ventilation or to provide comprehensive, multi-modal physiological decision support remains to be explored. Although clear efficiency and cost-savings were demonstrated, future health economic analyses should fully account for the long-term impact of training, maintenance, and ongoing integration with evolving ICU workflows. Any improvements in PaO₂/FiO₂ or ventilatory pressures observed should be interpreted as exploratory and may reflect clinician-driven changes influenced by awareness of pendelluft status, rather than effects uniquely attributable to the AI system.

Overall, this work supports the notion that clinically optimized, LLM-powered diagnostic systems can enable next-generation real-time monitoring, thereby moving EIT beyond a research modality to a practical, critical care decision-support tool [30]. By demonstrating rigorous accuracy, efficiency, explainability, and workflow compatibility, our system provides a benchmark for future AI-medical device integration and paves the way for larger prospective trials and broader physiological monitoring applications [28]. As AI-driven analytics mature, their transformative potential for patient-specific and dynamic ICU care will likely depend on interdisciplinary collaboration, transparent validation, and ongoing attention to safe, human-centric deployment.

This research serves as a blueprint for combining LLM-based code generation with domain expertise and comprehensive clinical validation, setting a new benchmark for AI translation from theory to high-stakes bedside practice.

Conclusion

In conclusion, our study demonstrates that a clinically optimized, ChatGPT-generated deep learning system enables rapid, accurate, and cost-effective detection of pendelluft in mechanically ventilated patients using EIT data. By substantially improving diagnostic accuracy and workflow efficiency compared to traditional approaches, this system facilitates real-time clinical decision-making and is closely aligned with current needs for precision critical care. These findings support the integration of AI-driven EIT analysis into routine practice and provide a foundation for future prospective studies to validate its impact on patient outcomes across diverse ICU populations.

Supplementary Information

12931_2025_3489_MOESM1_ESM.pdf (1.8MB, pdf)

Supplementary Material 1.
Supplementary Table S1: Expanded Baseline Characteristics by Center
Supplementary Table S2: Clinical Outcomes Stratified by Pendelluft Severity
Supplementary Table S3: Detailed Model Performance Metrics
Supplementary Table S4: Subgroup Analyses of Model Performance
Supplementary Table S5: Time-Based Performance Analysis
Supplementary Table S6: Center-Specific Characteristics and Model Performance
Supplementary Table S7: Comprehensive Cost-Effectiveness Analysis
Supplementary Table S8: Comprehensive Subgroup Analysis of Model Performance
Supplementary Table S9: External Validation Cohort Characteristics and Model Validation
Supplementary Table S10: System Safety and Reliability Metrics
Supplementary Table S11: Performance Metrics Before and After Clinical Optimization
Supplementary Table S12: Model Training Parameters
Supplementary Table S13: Safety/Failure Examples Prevented by Expert Modification
Supplementary Table S14: Benchmarking Versus Other AI/ML and LLM Models
Supplementary Table S15: Benchmarking Deeper CNN Architectures
Supplementary Table S16: Architecture Modifications and Clinical Rationale
Supplementary Table S17: Additional Model Comparisons
Supplementary Table S18: Feature Set Description
Supplementary Figure 1: Model Architecture and Workflow
Supplementary Figure 2: Kaplan-Meier Survival Analysis Based on Pendelluft Severity
Supplementary Figure 3: ROC Curves Comparing Performance of Different Models
Supplementary Figure 4: Model Calibration and Performance Visualization
Supplementary Figure 5: ROC Curves for LLM-Based Pendelluft Detection
Supplementary Figure 6: Physiological Outcomes Before and After AI-Guided Intervention
Supplementary Figure 7: Workflow Efficiency and Cost Impact
Supplementary Figure 8: Comparative Performance Metrics of ChatGPT-Generated vs Clinician-Optimized Models
Supplementary Appendix 1: ChatGPT Prompts and Model Generation Transcript
Supplementary Appendix 2: Code Comparison – ChatGPT Original vs Clinically Refined
Supplementary Appendix 3: ChatGPT Prompts, Code Outputs, and All Model Implementation Details
Supplementary Appendix 4: Quantitative Performance Comparison
Supplementary Appendix 5: ChatGPT Prompt, Generated Model Code, and Clinical Examples
Supplementary Note: Explanation Why "Prompt Alone" is Insufficient for Clinical Deployment

Acknowledgements

The authors would like to thank Yixin Ma, Songjin Shi, Fang Wang and Yi Chi for their valuable support and assistance in this work.

Abbreviations

AUC

Area under the curve

CI

Confidence interval

CNN

Convolutional neural network

SVM

Support vector machine

APACHE

Acute Physiology and Chronic Health Evaluation

ARDS

Acute respiratory distress syndrome

Authors’ contributions

RZ, HH and ZZ designed the work. RZ, DH, RT, JX and JG collected and analyzed the data. RZ, FL, ZZ and ZB wrote the manuscript. YL, HQ and KM critically reviewed the manuscript. All authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. All authors read and approved the final manuscript. All authors thank Fang Wang and Yi Chi for their valuable support and assistance in this work.

Funding

National High-Level Hospital Clinical Research Funding (Funding Numbers: 2022-PUMCH-D-005, 2022-PUMCH-B-115). National Natural Science Foundation of China (82470088).

Data availability

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation. The complete source code and trained model used in this study are openly available at GitHub [https://github.com/ccmzhangrui/EITmodel].

Declarations

Ethics approval and consent to participate

This study was approved by the Institutional Review Boards of all participating centers, including Ruijin Hospital (RJ-IRB2024-505), Peking Union Medical College Hospital (S-K1859), and Fujian Provincial Hospital (K2025-02-002). The requirement for informed consent was waived due to the retrospective design and anonymized data analysis in accordance with the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rui Zhang, Huaiwu He and Zhisheng Bi contributed equally to this work.

Contributor Information

Rui Zhang, Email: ccmzhangrui@foxmail.com.

Zhanqi Zhao, Email: zhanqizhao@gzhmu.edu.cn.

Yun Long, Email: ly_icu@aliyun.com.

Hongping Qu, Email: qhp10516@rjh.com.cn.

References

  • 1.Briel M, Meade M, Mercat A, Brower RG. Higher vs lower positive end-expiratory pressure in patients with acute lung injury and acute respiratory distress syndrome: systematic review and meta-analysis. JAMA. 2010;303(9):865–73. [DOI] [PubMed] [Google Scholar]
  • 2.Thille AW, Richard JC, Brochard L. The decision to extubate in the intensive care unit. Crit Care Med. 2019;47(12):1695–702. [DOI] [PubMed] [Google Scholar]
  • 3.Beduneau G, Pham T, Schortgen F, et al. Epidemiology of weaning outcome according to a new definition. Am J Respir Crit Care Med. 2018;196(6):772–83. [DOI] [PubMed] [Google Scholar]
  • 4.Burns KEA, Meade MO, Premji A, Adhikari NKJ. Noninvasive ventilation as a weaning strategy for mechanical ventilation in adults with respiratory failure: a Cochrane systematic review. CMAJ. 2014;186(3):E112–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Chi Y, Zhao Z, Frerichs I, et al. Prevalence and prognosis of respiratory Pendelluft phenomenon in mechanically ventilated ICU patients with acute respiratory failure: a retrospective cohort study. Ann Intensive Care. 2022;12(1):22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Cornejo, R.A., Brito, R., Arellano, D.H. et al. Non-synchronized unassisted spontaneous ventilation may minimize the risk of high global tidal volume and transpulmonary pressure, but it is not free from pendelluft. Intensive Care Med. 2025;51:194–6. 10.1007/s00134-024-07707-x. [DOI] [PubMed]
  • 7.He H, Zhao Z, Becher T et al. Recommendations for lung ventilation and perfusion assessment with chest electrical impedance tomography in critically ill adult patients: an international evidence‑based and expert Delphi consensus study. eClinicalMedicine. 2025. 10.1016/j.eclinm.2025.103575. [DOI] [PMC free article] [PubMed]
  • 8.Scaramuzzo G, Pavlovsky B, Adler A, Baccinelli W, Bodor DL, Damiani LF, Franchineau G, Francovich J, Frerichs I, Giralt JAS, Grychtol B, He H, Katira BH, Koopman AA, Leonhardt S, Menga LS, Mousa A, Pellegrini M, Piraino T, Priani P, Somhorst P, Spinelli E, Händel C, Suárez-Sipmann F, Wisse JJ, Becher T, Jonkman AH. Electrical impedance tomography monitoring in adult ICU patients: state-of-the-art, recommendations for standardized acquisition, processing, and clinical use, and future directions. Crit Care. 2024;28(1):377. 10.1186/s13054-024-05173-x. PMID: 39563476; PMCID: PMC11577873. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Frerichs I, Amato MB, van Kaam AH, Tingay DG, Zhao Z, Grychtol B, Bodenstein M, Gagnon H, Böhm SH, Teschner E, Stenqvist O, Mauri T, Torsani V, Camporota L, Schibler A, Wolf GK, Gommers D, Leonhardt S, Adler A. TREND study group. Chest electrical impedance tomography examination, data analysis, terminology, clinical use and recommendations: consensus statement of the translational EIT development study group. Thorax. 2017;72(1):83–93. 10.1136/thoraxjnl-2016-208357. Epub 2016 Sep 5. PMID: 27596161; PMCID: PMC5329047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Bahukhandi R, Li JHW, Lawson M, Wong EC, Zhou IY, Yuen MF, Mak LY, Seto WK, Chan RW. AI-enabled electrical impedance tomography (EIT) system for detecting and quantifying metabolic associated steatotic liver disease. Annu Int Conf IEEE Eng Med Biol Soc. 2025; 2025:1–4. 10.1109/EMBC58623.2025.11251537. PMID: 41336834. [DOI] [PubMed]
  • 11.Zhang T, Tian X, Liu X, Ye J, Fu F, Shi X, Liu R, Xu C. Advances of deep learning in electrical impedance tomography image reconstruction. Front Bioeng Biotechnol. 2022;10:1019531. 10.3389/fbioe.2022.1019531. PMID: 36588934; PMCID: PMC9794741. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Yu H, Liu H, Liu Z, Wang Z, Jia J. High-resolution conductivity reconstruction by electrical impedance tomography using structure-aware hybrid-fusion learning. Comput Methods Programs Biomed. 2024;243:107861. 10.1016/j.cmpb.2023.107861. Epub 2023 Oct 19. PMID: 37931580. [DOI] [PubMed] [Google Scholar]
  • 13.Yoshida T, Torsani V, Gomes S, De Santis RR, Beraldo MA, Costa EL, Tucci MR, Zin WA, Kavanagh BP, Amato MB. Spontaneous effort causes occult pendelluft during mechanical ventilation. Am J Respir Crit Care Med. 2013;188(12):1420-7. 10.1164/rccm.201303-0539OC. PMID: 24199628. [DOI] [PubMed]
  • 14.Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS Digit Health. 2023;2(2):e0000198. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6. [DOI] [PubMed] [Google Scholar]
  • 16.Hager P, Jungmann F, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;30(9):2613–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Tejani AS, Cook TS, et al. Integrating and adopting AI in the radiology workflow: A primer for standards and integrating the healthcare enterprise (IHE) profiles. Radiology. 2024;311(3):e232653. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Costa EL, Chaves CN, Gomes S, et al. Real-time detection of pneumothorax using electrical impedance tomography. Crit Care Med. 2009;15(1):18–24. [DOI] [PubMed] [Google Scholar]
  • 19.Frerichs I, Becher T, Weiler N. Electrical impedance tomography imaging of the cardiopulmonary system. Curr Opin Crit Care. 2017;72(1):83–93. [DOI] [PubMed] [Google Scholar]
  • 20.Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36. [DOI] [PubMed] [Google Scholar]
  • 21.Bird A, McMaster C, et al. Clinical evaluation is critical for the implementation of artificial intelligence in health care: comment on the Article by Mickley. Arthritis Care Res (Hoboken). 2024;76(6):904. 10.1002/acr.25310. Epub 2024 Feb 14. PMID: 38317297. [DOI] [PubMed] [Google Scholar]
  • 22.Tibshirani R. The Lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–95. [DOI] [PubMed] [Google Scholar]
  • 23.Thompson BT, Chambers RC, Liu KD. Acute respiratory distress syndrome. N Engl J Med. 2017;377(6):562–72. [DOI] [PubMed] [Google Scholar]
  • 24.DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45. [PubMed] [Google Scholar]
  • 25.WILCOXON F. Individual comparisons of grouped data by ranking methods. J Econ Entomol. 1946; 39:269. 10.1093/jee/39.2.269. PMID: 20983181. [DOI] [PubMed]
  • 26.Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32(200):675–701. 10.2307/2279372. [Google Scholar]
  • 27.WILCOXIN F. Probability tables for individual comparisons by ranking methods. Biometrics. 1947;3(3):119–22. PMID: 18903631. [PubMed] [Google Scholar]
  • 28.Austin PC, Steyerberg EW. The integrated calibration index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med. 2019;38(21):4051–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Liu W, Chi Y, Zhao Y, He H, Long Y, Zhao Z. Occurrence of Pendelluft during ventilator weaning with T piece correlated with increased mortality in difficult-to-wean patients. J Intensive Care. 2024;12(1):23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. [DOI] [PubMed] [Google Scholar]
  • 31.Cornejo RA, Brito R, Rivera C, et al. Influence of ventilatory settings on Pendelluft and expiratory muscle activity in hypoxemic patients resuming spontaneous breathing. Crit Care. 2025. 10.1186/s13054-025-05784-y. [DOI] [PMC free article] [PubMed] [Google Scholar]


Supplementary Materials

12931_2025_3489_MOESM1_ESM.pdf (1.8MB, pdf)

Supplementary Material 1:
  • Supplementary Table S1: Expanded Baseline Characteristics by Center
  • Supplementary Table S2: Clinical Outcomes Stratified by Pendelluft Severity
  • Supplementary Table S3: Detailed Model Performance Metrics
  • Supplementary Table S4: Subgroup Analyses of Model Performance
  • Supplementary Table S5: Time-Based Performance Analysis
  • Supplementary Table S6: Center-Specific Characteristics and Model Performance
  • Supplementary Table S7: Comprehensive Cost-Effectiveness Analysis
  • Supplementary Table S8: Comprehensive Subgroup Analysis of Model Performance
  • Supplementary Table S9: External Validation Cohort Characteristics and Model Validation
  • Supplementary Table S10: System Safety and Reliability Metrics
  • Supplementary Table S11: Performance Metrics Before and After Clinical Optimization
  • Supplementary Table S12: Model Training Parameters
  • Supplementary Table S13: Safety/Failure Examples Prevented by Expert Modification
  • Supplementary Table S14: Benchmarking Versus Other AI/ML and LLM Models
  • Supplementary Table S15: Benchmarking Deeper CNN Architectures
  • Supplementary Table S16: Architecture Modifications and Clinical Rationale
  • Supplementary Table S17: Additional Model Comparisons
  • Supplementary Table S18: Feature Set Description
  • Supplementary Figure 1: Model Architecture and Workflow
  • Supplementary Figure 2: Kaplan-Meier Survival Analysis Based on Pendelluft Severity
  • Supplementary Figure 3: ROC Curves Comparing Performance of Different Models
  • Supplementary Figure 4: Model Calibration and Performance Visualization
  • Supplementary Figure 5: ROC Curves for LLM-Based Pendelluft Detection
  • Supplementary Figure 6: Physiological Outcomes Before and After AI-Guided Intervention
  • Supplementary Figure 7: Workflow Efficiency and Cost Impact
  • Supplementary Figure 8: Comparative Performance Metrics of ChatGPT-Generated vs Clinician-Optimized Models
  • Supplementary Appendix 1: ChatGPT Prompts and Model Generation Transcript
  • Supplementary Appendix 2: Code Comparison – ChatGPT Original vs Clinically Refined
  • Supplementary Appendix 3: ChatGPT Prompts, Code Outputs, and All Model Implementation Details
  • Supplementary Appendix 4: Quantitative Performance Comparison
  • Supplementary Appendix 5: ChatGPT Prompt, Generated Model Code, and Clinical Examples
  • Supplementary Note: Explanation of Why “Prompt Alone” Is Insufficient for Clinical Deployment

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

The complete source code and trained model used in this study are openly available at: GitHub: https://github.com/ccmzhangrui/EITmodel.



Articles from Respiratory Research are provided here courtesy of BMC
