Abstract
Aim
Acute decompensated heart failure (ADHF) is the leading cause of cardiovascular hospitalizations in the United States. Detecting B-lines through lung ultrasound (LUS) can enhance clinicians’ prognostic and diagnostic capabilities. Artificial intelligence/machine learning (AI/ML)-based automated guidance systems may allow novice users to apply LUS to clinical care. We investigated whether an AI/ML automated LUS congestion score correlates with experts’ interpretations of B-line quantification in an external patient dataset.
Methods and results
This was a secondary analysis of the BLUSHED-AHF study, which investigated the effect of LUS-guided therapy on patients with ADHF. In BLUSHED-AHF, LUS was performed and B-lines were quantified by ultrasound operators. Two experts then separately quantified the number of B-lines per recorded ultrasound video clip. Here, an AI/ML-based lung congestion score (LCS) was calculated for all LUS clips from BLUSHED-AHF. Spearman correlation was computed between LCS and counts from each of the original three raters. A total of 3858 LUS clips from 130 patients were analysed. The LCS demonstrated good agreement with the two experts’ B-line quantification scores (r = 0.894, 0.882). Both experts’ B-line quantification scores had significantly better agreement with the LCS than they did with the ultrasound operator’s score (p < 0.005, p < 0.001).
Conclusion
Artificial intelligence/machine learning-based LCS correlated with expert-level B-line quantification. Future studies are needed to determine whether automated tools may assist novice users in LUS interpretation.
Keywords: Point-of-care ultrasound, Acute heart failure, Lung ultrasound, Quantification, Machine learning
Introduction
Acute decompensated heart failure (ADHF) is the leading cause of cardiovascular hospitalizations in the United States.1 Traditionally, assessment and re-assessment of pulmonary congestion has relied on integrating clinical history, physical exam and chest X-ray, with variable success.2 Point-of-care ultrasound (POCUS) accurately detects pulmonary congestion by visualizing bilateral B-lines on lung ultrasound (LUS).3,4 B-lines change dynamically with severity and treatment of ADHF, providing both diagnostic and prognostic information.5,6 However, LUS relies heavily on examiner competence in obtaining and interpreting high-quality images. Although LUS acquisition and quantification do not require extensive training,7 there is large variability between operators.8 The use of automated B-line software packages to identify and quantify B-lines has the potential not only to aid novice users, but also to standardize quantification of B-lines across operators, allowing for more objective serial measurements.9
We aimed to determine whether an automated artificial intelligence/machine learning (AI/ML) assessment of LUS images from patients with ADHF correlates with expert interpretation of the same images for the quantification of B-lines.
Methods
This was a secondary analysis of the BLUSHED-AHF study, an NHLBI-sponsored, multicentre, randomized controlled trial designed to test whether LUS-guided treatment of emergency department patients with ADHF resulted in less pulmonary congestion compared to usual care. Methods from this study have been previously published.10 The study was approved by the institutional review board at all sites and registered at ClinicalTrials.gov (NCT03136198).
The BLUSHED-AHF LUS imaging protocol used the standard eight lung zones. LUS examinations were performed with Zonare ZS3, Z One Pro (Mindray, Mountain View, CA, USA) or Sonosite MTurbo (FUJIFilm Sonosite, Bothell, WA, USA) with the curvilinear C1–5s transducer. Serial LUS images were obtained on patients at multiple time points: enrolment (T0), 2–4 h after T0 (T1), 2–4 h after T1 (T2), and daily throughout hospitalization until hospital day 7 or day of discharge, whichever occurred first.
All POCUS operators were required to undergo a structured training programme. LUS examinations were performed by research associates, emergency medicine residents, and emergency medicine faculty with and without ultrasound fellowship training (collectively, R0). Two faculty with expertise in LUS and LUS research (E1 and E2) provided blinded interpretations off-line. B-lines were quantified for each lung zone by multiplying the percentage of the intercostal space filled with B-lines (expressed as a fraction) by 20, so that the maximum count per individual zone was 20 B-lines.11 The score for each zone therefore ranged from 0 to 20.
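The per-zone quantification rule can be illustrated with a minimal Python sketch. The function names and the example fractions are hypothetical, and the sketch assumes the filled percentage is supplied as a fraction between 0 and 1; it is not the trial's scoring software.

```python
MAX_PER_ZONE = 20  # maximum B-line count per zone stated in the protocol


def zone_b_line_count(fraction_filled: float) -> float:
    """Convert the fraction (0.0 to 1.0) of the intercostal space filled
    with B-lines into a per-zone B-line count between 0 and 20."""
    if not 0.0 <= fraction_filled <= 1.0:
        raise ValueError("fraction_filled must be between 0 and 1")
    return fraction_filled * MAX_PER_ZONE


def total_b_line_count(fractions: list[float]) -> float:
    """Sum the per-zone counts over the zones that were scored."""
    return sum(zone_b_line_count(f) for f in fractions)


# Hypothetical example: half of the intercostal space filled in one zone.
half_filled = zone_b_line_count(0.5)      # 0.5 * 20
all_zones = total_b_line_count([0.5] * 8)  # eight zones, each half filled
```

With eight zones scored, the summed count under this rule can range from 0 to 160.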
Artificial intelligence/machine learning modelling (lung congestion score)
An AI/ML model-based severity score (the LUS congestion score) was computed for 3858 LUS clips. Correlations between these scores and the B-line scores (i.e. counts) of the two experts and of the bedside operator were computed to assess concordance.
A detailed description of the dataset and methods used in the development of the AI/ML model for congestion scoring was previously published.12 In brief, the model was developed using an independent dataset comprising 1419 LUS clips from 113 patients, acquired using curvilinear C1–5s transducers with Mindray ME8 or M9 (Mindray, Mountain View, CA, USA). Single-point annotations of B-line origins located at the pleura were provided by LUS experts for model training. The network architecture consisted of EfficientNet-B0 as the encoder coupled with the decoder structure of U-Net. The AI model combined the predictions of five network instances, each trained on a different overlapping partition of the training dataset. The predicted lung congestion score was defined as the sum of the predicted B-line origins in all frames of a LUS video, divided by the total number of frames in the clip. To compute it, the AI/ML model counted, for each frame of a clip, the number of points on the pleural line from which a B-line originated.
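The score definition above amounts to the mean number of predicted B-line origins per frame. A minimal Python sketch follows; the function name and the per-frame counts are hypothetical, and the detection step (the network itself) is not reproduced here.

```python
def lung_congestion_score(origins_per_frame: list[int]) -> float:
    """Sum of predicted B-line origins over all frames of a clip,
    divided by the number of frames in the clip."""
    if not origins_per_frame:
        raise ValueError("clip has no frames")
    return sum(origins_per_frame) / len(origins_per_frame)


# Hypothetical clip with four frames and 2, 3, 3 and 4 predicted origins.
clip = [2, 3, 3, 4]
score = lung_congestion_score(clip)  # 12 origins / 4 frames
```

Because the score is an average over frames, it is not restricted to integer values, unlike a manual B-line count.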
Statistical analysis
Spearman correlation was computed between sums of per zone counts from each of the raters: R0 (bedside operator), E1 and E2 (expert), and AI (AI lung congestion score). If a rater did not provide a score for at least two left zone and two right zone clips for a patient at a particular time point, that patient was excluded and was not considered in calculating correlations involving that rater at that time point.
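The exclusion rule and the correlation step can be sketched in plain Python as follows. The zone labels, function names and data layout are assumptions made for illustration, and Spearman correlation is implemented here as the Pearson correlation of average ranks; the study's own analysis code is not described in the text.

```python
from statistics import mean

LEFT_ZONES = ("L1", "L2", "L3", "L4")   # hypothetical zone labels
RIGHT_ZONES = ("R1", "R2", "R3", "R4")


def rater_total(zone_scores):
    """Sum of a rater's available zone scores for one patient and time point,
    or None if fewer than two left or two right zones were scored."""
    available = {z: s for z, s in zone_scores.items() if s is not None}
    n_left = sum(z in available for z in LEFT_ZONES)
    n_right = sum(z in available for z in RIGHT_ZONES)
    if n_left < 2 or n_right < 2:
        return None
    return sum(available.values())


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks.
    Raises ZeroDivisionError if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def paired_spearman(totals_a, totals_b):
    """Correlate two raters over the patients for whom both have a total."""
    pairs = [(a, b) for a, b in zip(totals_a, totals_b)
             if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return spearman(xs, ys)
```

In this sketch a patient dropped for one rater is simply skipped in every pairwise correlation involving that rater, which mirrors the rule stated above.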
Statistical significance for difference in Spearman correlations between expert and AI/ML versus expert and operator was calculated using a two-tailed z-test with Fisher-transformed correlation coefficients. These correlation coefficients were calculated based on taking the sums of per zone counts and averaging them over all time points. Further, the same scores were compared across all three groups when scores were normalized from 0 to 1 to determine if the lung congestion score over- or underestimated the BLUSHED-AHF congestion score.
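A two-tailed z-test on Fisher-transformed coefficients can be sketched as below. This version assumes the two correlations come from independent samples and uses the standard error sqrt(1/(n1 − 3) + 1/(n2 − 3)); the text does not state whether an adjustment for the shared expert rater or for rank correlations was applied, so the function and its inputs should be read as illustrative only.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher transformation of a correlation coefficient (inverse tanh)."""
    return math.atanh(r)


def compare_correlations(r1: float, n1: int, r2: float, n2: int):
    """Two-tailed z-test for the difference between two correlations.

    Assumption: the two correlations are treated as independent.
    Returns the z statistic and the two-tailed p value.
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    # Two-tailed p value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical inputs: two correlations, each estimated from 103 patients.
z_stat, p_value = compare_correlations(0.9, 103, 0.8, 103)
```

Because the comparisons in this study share a rater (for example, E1 appears in both E1 vs. AI/ML and E1 vs. R0), a test for dependent correlations would be an alternative design choice.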
Results
A total of 7928 LUS clips were obtained on 130 patients in the original BLUSHED-AHF study. Analysis in this paper was performed on 3858 video clips obtained at four time points: emergency department arrival/baseline (t0), 2 h post-baseline, 6 h post-baseline, and 24 h post-baseline. Not all patients had imaging at all time points. Of these 3858 clips, bedside operators (R0) annotated B-line counts for 3842 clips (99.6%). Expert 1 (E1) annotated B-line counts for 3406 clips (88.3%) and expert 2 (E2) annotated B-line counts for 3301 clips (85.6%). The AI model annotated B-line counts for 3858 clips (100%) and calculated a score range from 0 to 16.2 with a mean of 4.2 (± 3.5) for all clips reviewed. A total of 3212 clips (82.0%) had a B-line count from the operator, both experts, and the AI model (Figure 1).
Figure 1. Flow diagram of BLUSHED-AHF clips through artificial intelligence (AI) model. E1, expert 1; E2, expert 2; R0, ultrasound operator.
Spearman correlations between rater scores were calculated for each time point using the sum of each rater’s available scores across all eight lung zones for each patient. Correlations at each time point for each of the six pairwise combinations of the four raters (R0, E1, E2, and AI/ML) are listed in Table 1. Specifically, the two experts had excellent agreement with each other across all time points (r = 0.938). Both experts’ B-line quantification scores were significantly more correlated with the AI/ML lung congestion score than they were with the operator (r = 0.917 vs. 0.860 for E1, p < 0.005; r = 0.925 vs. 0.836 for E2, p < 0.001). When the congestion scores were normalized on a scale from 0 to 1, the AI/ML congestion score showed a non-significant underestimation of scores compared to the operator and a non-significant overestimation of congestion scores compared to the experts.
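The normalized comparison can be illustrated with a short Python sketch. The normalization method is not specified in the text, so min-max scaling is an assumption here, as are the function names and the example values; a positive mean difference would indicate that the AI/ML score sits higher than the rater's score on the common 0 to 1 scale.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1.
    Assumption: min-max scaling (the method is not stated in the text)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("cannot normalize constant scores")
    return [(s - lo) / (hi - lo) for s in scores]


def mean_signed_difference(ai_scores, rater_scores):
    """Mean of (normalized AI score - normalized rater score) over patients.
    Positive values suggest relative overestimation by the AI score."""
    ai = min_max_normalize(ai_scores)
    rater = min_max_normalize(rater_scores)
    return sum(a - r for a, r in zip(ai, rater)) / len(ai)


# Hypothetical scores for three patients.
difference = mean_signed_difference([0, 8, 10], [0, 10, 20])
```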
Table 1.
Spearman correlation
| Time point | E1 vs. E2 (expert internal agreement) | E1 vs. AI/ML (experts vs. AI) | E2 vs. AI/ML (experts vs. AI) | R0 vs. E1 (operator vs. expert) | R0 vs. E2 (operator vs. expert) | R0 vs. AI/ML (operator vs. AI) |
|---|---|---|---|---|---|---|
| Day 1, T0 | 0.911 | 0.861 | 0.87 | 0.723 | 0.72 | 0.622 |
| Day 1, T2 | 0.94 | 0.888 | 0.862 | 0.836 | 0.825 | 0.791 |
| Day 1, T6 | 0.944 | 0.911 | 0.858 | 0.826 | 0.818 | 0.766 |
| Day 2 | 0.953 | 0.901 | 0.925 | 0.806 | 0.816 | 0.763 |
| All time points combined | 0.938 | 0.894 | 0.882 | 0.809 | 0.808 | 0.751 |
Color corresponds to strength of correlation with blue (very strong), yellow and orange (strong), and red (moderate).
AI/ML, artificial intelligence/machine learning; E1, expert 1; E2, expert 2; R0, ultrasound operator.
Limitations
The AI/ML model used in the study does not account for confluent B-lines. For example, the model would predict the same LUS congestion score for a clip containing confluent B-lines covering half the intercostal space as for a clip containing a single discrete B-line. Thus, the LUS congestion score may underestimate B-line counts in the most pathological LUS clips.
Conclusion
Lung ultrasound offers real-time data to guide clinical decision-making and patient care for patients with ADHF. Although LUS is widely considered an easy technique to master and interpret, we tested this assumption. Operator experience may affect the accuracy of interpretation, thereby restricting the use and scalability of LUS. AI/ML technologies can provide automated interpretation of LUS images, making POCUS more accessible to a broad range of novice users across a care team and standardizing interpretation so that it is both scalable and reproducible. Our data suggest that our AI/ML model performs at the same level as two LUS experts and outperforms a group of mixed-experience operators.
The B-line scoring protocol used in the BLUSHED-AHF trial has been shown to predict 30-day acute heart failure rehospitalization or death in studies of longitudinal LUS monitoring.13 As our study shows that an AI/ML LUS congestion score correlated well with experts on external data, it is possible that an automated tool such as this may provide information to help providers determine who may be at risk for rehospitalization and/or death, as well as guide therapy. Although there is no agreed-upon gold standard for B-line quantification, as we further train our LUS congestion score on more LUS data with expert annotations, it is reasonable to expect that this AI/ML model will eventually resemble a large consensus of LUS experts in congestion scoring.
Footnotes
Conflict of interest: none declared.
References
1. Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, et al. Heart disease and stroke statistics – 2021 update. Circulation. 2021;143:e254–e743. doi:10.1161/CIR.0000000000000950
2. Pivetta E, Goffi A, Nazerian P, Castagno D, Tozzetti C, Tizzani P, et al.; Study Group on Lung Ultrasound from the Molinette and Careggi Hospitals. Lung ultrasound integrated with clinical assessment for the diagnosis of acute decompensated heart failure in the emergency department: A randomized controlled trial. Eur J Heart Fail. 2019;21:754–766. doi:10.1002/ejhf.1379
3. McDonagh TA, Metra M, Adamo M, Gardner RS, Baumbach A, Böhm M, et al. 2021 ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure: Developed by the Task Force for the diagnosis and treatment of acute and chronic heart failure of the European Society of Cardiology (ESC). With the special contribution of the Heart Failure Association (HFA) of the ESC. Eur J Heart Fail. 2022;24:4–131. doi:10.1002/ejhf.2333
4. Martindale JL, Wakai A, Collins SP, Levy PD, Diercks D, Hiestand BC, et al. Diagnosing acute heart failure in the emergency department: A systematic review and meta-analysis. Acad Emerg Med. 2016;23:223–242. doi:10.1111/acem.12878
5. Platz E, Lewis EF, Uno H, Peck J, Pivetta E, Merz AA, et al. Detection and prognostic value of pulmonary congestion by lung ultrasound in ambulatory heart failure patients. Eur Heart J. 2016;37:1244–1251. doi:10.1093/eurheartj/ehv745
6. Gargani L, Pang PS, Frassi F, Miglioranza MH, Dini FL, Landi P, et al. Persistent pulmonary congestion before discharge predicts rehospitalization in heart failure: A lung ultrasound study. Cardiovasc Ultrasound. 2015;13:40. doi:10.1186/s12947-015-0033-4
7. Russell FM, Ferre R, Ehrman RR, Noble V, Gargani L, Collins SP, et al. What are the minimum requirements to establish proficiency in lung ultrasound training for quantifying B-lines? ESC Heart Fail. 2020;7:2941–2947. doi:10.1002/ehf2.12907
8. Herraiz JL, Freijo C, Camacho J, Muñoz M, González R, Alonso-Roca R, et al. Inter-rater variability in the evaluation of lung ultrasound in videos acquired from COVID-19 patients. Appl Sci. 2023;13:1321. doi:10.3390/app13031321
9. Baloescu C, Toporek G, Kim S, McNamara K, Liu R, Shaw MM, et al. Automated lung ultrasound B-line assessment using a deep learning algorithm. IEEE Trans Ultrason Ferroelectr Freq Control. 2020;67:2312–2320. doi:10.1109/TUFFC.2020.3002249
10. Pang PS, Russell FM, Ehrman R, Ferre R, Gargani L, Levy PD, et al. Lung ultrasound-guided emergency department management of acute heart failure (BLUSHED-AHF): A randomized controlled pilot trial. JACC Heart Fail. 2021;9:638–648. doi:10.1016/j.jchf.2021.05.008
11. Russell FM, Ehrman RR, Ferre R, Gargani L, Noble V, Rupp J, et al. Design and rationale of the B-lines lung ultrasound guided emergency department management of acute heart failure (BLUSHED-AHF) pilot trial. Heart Lung. 2019;48:186–192. doi:10.1016/j.hrtlng.2018.10.027
12. Lucassen RT, Jafari MH, Duggan NM, Jowkar N, Mehrtash A, Fischetti C, et al. Deep learning for detection and localization of B-lines in lung ultrasound. IEEE J Biomed Health Inform. 2023;1–10. doi:10.1109/JBHI.2023.3282596
13. Harrison NE, Favot MJ, Gowland L, Lenning J, Henry S, Gupta S, et al. Point-of-care echocardiography of the right heart improves acute heart failure risk stratification for low-risk patients: The REED-AHF prospective study. Acad Emerg Med. 2022;29:1306–1319. doi:10.1111/acem.14589
