Abstract
Objective
To develop an automated method for the joint and consistent evaluation of emphysema and mortality risk that provides quantification of data and model uncertainty.
Methods & Materials
Participants from the prospective COPDGene study who underwent both full radiation dose (FD) and reduced radiation dose (RD) chest CT scans at five-year follow-up were included and divided into training (80%), validation (10%) and testing (10%) datasets. We trained a multi-task Bayesian neural network (BNN) to estimate the FD volume-adjusted lung density (ALD) regardless of acquisition protocol, in addition to the five-year mortality risk. The data and model uncertainty were quantified in the testing dataset. Our deep learning ALD (DL-ALD) was compared to the conventional ALD.
Results
In total, 1,350 participants (mean age 64.4 years ± 8.7; 659 female) were included. Compared to conventional ALD, DL-ALD was more consistent between FD and RD CT images (mean difference: 1 g/L ± 3.1 versus 14.8 g/L ± 5.3, p < 0.001). The predicted 5-year mortality was similar between image protocols (mean difference: 0.0007 ± 0.02, p = 0.76). The uncertainty associated with image variability when quantifying DL-ALD was lower in participants with severe emphysema (Pearson’s rho = 0.79, p < 0.001), and the model uncertainty for mortality risk was lower both for severe and early-stage participants compared to other participants (p < 0.001).
Conclusion
The presented multi-task BNN provides an increased robustness to imaging protocol compared to conventional methods for CT evaluation of emphysema. Additionally, it provides direct measurements of uncertainty for its generalization to diverse imaging protocols and patient populations.
Keywords: Deep Learning, Uncertainty, Computed tomography, X-ray, Lung, COPD
Introduction
Chronic obstructive pulmonary disease (COPD) is one of the leading causes of death in the United States [1]. It is an inflammatory disease characterized by the progressive obstruction of airflow with emphysema being a major morphologic component [2]. Hence, the quantification of the severity of emphysema from CT images plays an important role in the radiological evaluation of COPD. Unfortunately, two main factors have limited the clinical translation of existing metrics. First, they have shown high sensitivity to variations between scanners, imaging protocols, and patient characteristics [3, 4], which introduces a great degree of uncertainty that affects their interpretation and increases the difficulty in evaluating progression on serial scans. Additionally, although these metrics are correlated with mortality risk [5], they do not provide a direct estimation of the probability of mortality within a specific time frame.
During recent years, deep learning (DL) methods have been used to evaluate emphysema from CT images [6-8] with similar accuracy to more traditional metrics. Unlike previous approaches, DL methods can exploit relationships between multiple tasks via multi-task learning to improve single task performance [9, 10]. Thus, they provide an adequate framework to exploit the correlations between the severity of emphysema and mortality risk. Unfortunately, existing DL models for evaluating COPD severity cannot quantify the uncertainty associated with imaging protocol variability or the specific model design used. In the face of potentially highly variable data and complex models, metrics of uncertainty can provide important information about the reliability of model predictions [11].
Bayesian neural networks (BNNs) have emerged as useful tools to separately evaluate the predictive uncertainty inherent to the data (aleatoric) and the uncertainty associated with the model design (epistemic) [12]. In comparison to standard neural networks (SNNs) whose parameters are represented by scalar values, BNNs estimate probability distributions for each parameter, and model predictions are marginalized over all possible network parameter values [13]. As an added benefit, BNNs have naturally built-in regularization that can also improve model generalizability [14, 15].
The purpose of this study was to develop a multi-task BNN to quantify a reliable and consistent metric of emphysema across different imaging protocols simultaneously with a 5-year mortality risk from lung CT images of patients with COPD, in addition to providing a quantification of the aleatoric and epistemic uncertainty.
Materials and Methods
Data
This study uses data from COPDGene, a prospective study on the genetic epidemiology of COPD [16]. Between 2008 and 2011, 10,198 cigarette smokers with and without COPD, and 454 never smokers were enrolled at 21 United States centers. Health Insurance Portability and Accountability Act–compliant institutional review board approval was obtained at all centers, and consent was obtained from all participants. Participants underwent CT evaluation at baseline (Phase 1), 5-year follow-up (Phase 2), and 10-year follow-up (Phase 3).
Inspiratory, non-contrast CT acquisitions included contiguous axial sections with 0.625 to 0.75 mm thickness. The mean effective radiation dose at Phase 1 and Phase 2 was 6.5 mSv ± 1 (SD), which is referred to as the fixed (full) radiation dose (FD) protocol. For Phase 3, a reduced radiation dose (RD) protocol was implemented using dose modulation techniques [17, 18]. Scans with both RD and FD protocols were acquired on Phase 2 for 1,504 participants. The mean effective radiation dose for RD scans at Phase 2 and Phase 3 was 1.5 mSv ± 0.7 (SD). Data for 1,205 of the 1,504 participants have been previously reported [18]. This prior article sought to understand the reproducibility of conventional quantitative CT metrics of emphysema between variable dose protocols whereas in this manuscript we aimed to develop deep learning methods to increase consistency between variable dose protocols and provide measures of prediction reliability. Participants were split into training (80%), validation (10%), and testing (10%) datasets.
Traditional quantification of emphysema
The primary CT-derived metric of emphysema severity is the volume-adjusted lung density (ALD) [19] calculated as:
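$$\mathrm{ALD} = \rho_{15} \times \frac{V_{\mathrm{obs}}}{V_{\mathrm{exp}}} \qquad (1)$$
where $\rho_{15}$ is the lung density in g/L measured using the 15th percentile of the lung intensity histogram, $V_{\mathrm{obs}}$ is the observed lung volume from CT, and $V_{\mathrm{exp}}$ is the expected lung volume [18, 20], which was calculated using the Multi-Ethnic Study of Atherosclerosis equation [21].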
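As a rough illustration of Eq. 1, the sketch below computes a volume-adjusted lung density from the Hounsfield units (HU) inside a lung mask. The function and argument names are hypothetical, and the conversion of HU to g/L as HU + 1000 is an assumed convention that the text does not state.

```python
import numpy as np

def volume_adjusted_lung_density(lung_hu, observed_volume_l, expected_volume_l):
    """Return a volume-adjusted lung density (g/L) for one scan.

    lung_hu: 1-D array of Hounsfield units for voxels inside the lung mask.
    observed_volume_l / expected_volume_l: lung volumes in liters.
    """
    # 15th percentile of the lung intensity histogram, in HU.
    perc15_hu = np.percentile(lung_hu, 15)
    # Assumption: density in g/L is approximated as HU + 1000.
    perc15_density = perc15_hu + 1000.0
    # Scale by the ratio of observed to expected lung volume.
    return perc15_density * (observed_volume_l / expected_volume_l)
```

Under this form, a scan acquired at a lung volume larger than expected has its density scaled up, which compensates for the dilution of lung mass over a larger volume.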
Deep learning methodology
FD and RD inspiratory chest CT images were masked and cropped to the lung region with the available segmentations obtained using LungQ (v1.0.0, Thirona). Images with significant artifacts or severe deviation from the scanning protocol did not have available lung segmentations and were excluded from this study. Training images underwent three-fold augmentation using random rotations of less than 10 degrees about the X, Y, and Z axes. Given that images were cropped to the lung region, translation was not utilized for augmentation. All volumes were down-sampled to 256 × 256 voxels in the axial plane using nearest-neighbor interpolation, and 128 equally spaced slices were sampled in the Z direction, maintaining original aspect ratios. Voxel intensities inside the lungs were normalized using global z-score normalization. Regions outside the lungs were set to zero.
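The slice sampling and intensity normalization steps can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names are hypothetical, and "global" is assumed here to mean a single mean and standard deviation supplied by the caller (for example, computed over lung voxels of the training set).

```python
import numpy as np

def sample_slices(volume, n_slices=128):
    """Pick n_slices equally spaced axial slices from a (Z, Y, X) volume."""
    # Indices repeat if the volume has fewer than n_slices slices.
    idx = np.linspace(0, volume.shape[0] - 1, n_slices).round().astype(int)
    return volume[idx]

def normalize_lung(volume, mask, mean, std):
    """Z-score voxels inside the lung mask and set everything else to zero.

    mask: boolean array with the same shape as volume (True inside lungs).
    mean, std: global statistics (assumed to be precomputed by the caller).
    """
    out = np.zeros_like(volume, dtype=np.float32)
    out[mask] = (volume[mask] - mean) / std
    return out
```

Note that zero is also the post-normalization mean, so background voxels and average-density lung voxels share the same value in this sketch.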
The input to our model was the pre-processed CT images. Its architecture consisted of five Bayesian convolutional blocks followed by two parallel branches of Bayesian fully connected layers (Figure 1) that estimated the deep learning ALD (DL-ALD) and the 5-year mortality risk. Hyperbolic tangent (Tanh) activation and max-pooling were used after each Bayesian convolutional layer. Participant-specific data (i.e., age, race, sex, body mass index, height) previously associated with ALD and mortality risk [19], together with image voxel size, were concatenated with the latent features after the first fully connected Bayesian layer. Training was performed through variational inference using a Standard Normal prior and Normal approximate posterior parameter distributions due to weight regularization properties similar to $L_2$ regularization. The outputs of the DL-ALD branch were its mean and standard deviation following a Normal distribution. The output of the mortality branch was the 5-year probability of death following a Bernoulli distribution. Additional details are provided in Appendix S1.
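To make the idea of a Bayesian layer concrete, the sketch below shows one stochastic forward pass through a mean-field Gaussian dense layer, together with the closed-form Kullback-Leibler divergence to a Standard Normal prior. It is written in plain NumPy for readability; the function names, the softplus parameterization of the standard deviation, and the use of the reparameterization trick are assumptions for illustration and are not taken from Appendix S1.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)), used to keep sigma positive."""
    return np.logaddexp(0.0, x)

def bayesian_dense(x, w_mu, w_rho, b_mu, b_rho, rng):
    """One stochastic forward pass through a mean-field Gaussian dense layer."""
    # Reparameterization: w = mu + sigma * eps, with eps ~ N(0, 1).
    w = w_mu + softplus(w_rho) * rng.standard_normal(w_mu.shape)
    b = b_mu + softplus(b_rho) * rng.standard_normal(b_mu.shape)
    return np.tanh(x @ w + b)

def kl_to_standard_normal(mu, rho):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over parameters."""
    sigma = softplus(rho)
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```

The `mu**2` term in the divergence penalizes large posterior means in the same way a squared-weight penalty would, which is one way to see the regularization property mentioned above.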
Figure 1.

Multi-task Bayesian deep learning architecture. The primary input is the masked and cropped lung CT image. The blue boxes represent Bayesian convolutional layers followed by hyperbolic tangent activation while the yellow boxes represent max-pooling operations. Within each box, the output of the layer is given as (channels, Z, Y, X). Red boxes represent Bayesian fully connected layers where the first dense layer is followed by the hyperbolic tangent activation function. Demographic data and image voxel size are concatenated with the latent image features following the first Bayesian dense layer. The model outputs from the ALD branch are the mean and standard deviation of its predicted distribution. The model output for the mortality risk branch is the probability of death for the predicted distribution of 5-year mortality.
We chose the negative evidence lower-bound (ELBO) as loss function [22] given by the equation:
$$\mathcal{L}(\theta) = -\mathbb{E}_{q_\theta(w)}\left[\log p(\mathcal{D} \mid w)\right] + \mathrm{KL}\left(q_\theta(w) \,\|\, p(w)\right) \qquad (2)$$
where $p(\mathcal{D} \mid w)$ is the joint likelihood between tasks, $q_\theta(w)$ is the approximate posterior, and $p(w)$ is the prior distribution for network parameters $w$. The first term on the right-hand side represents the joint negative log-likelihood, and the second term represents the Kullback-Leibler divergence ($\mathrm{KL}$) between the approximate posterior and prior distribution for network parameters. Thus, the loss function was
$$\mathcal{L}(\theta) = -\mathbb{E}_{q_\theta(w)}\left[\alpha_1 \log p_{\mathrm{mort}}(\mathcal{D} \mid w) + \alpha_2 \log p_{\mathrm{ALD}}(\mathcal{D} \mid w)\right] + \beta\, \mathrm{KL}\left(q_\theta(w) \,\|\, p(w)\right) \qquad (3)$$
where $p_{\mathrm{mort}}(\mathcal{D} \mid w)$ is the likelihood for 5-year mortality, $p_{\mathrm{ALD}}(\mathcal{D} \mid w)$ is the likelihood for DL-ALD, $\alpha_1$ and $\alpha_2$ are normalized mixture weights such that $\alpha_1, \alpha_2 \geq 0$, $\alpha_1 + \alpha_2 = 1$, and $\beta$ is a scaling factor on $\mathrm{KL}$. The negative log-likelihood mixture weights were chosen to promote equal contribution of each branch towards the overall reconstruction loss, and the scaling factor for the $\mathrm{KL}$ term was chosen via grid search. Derivation details are provided in Appendix S2. Given the imbalance in the number of deaths for our study sample, the 5-year mortality negative log-likelihood for subjects who died was inverse-weighted proportional to the number of deaths in the study sample. Stochastic gradient-based optimization with the adaptive moment estimation (Adam) algorithm, a batch size of 8, and a learning rate of 0.001 were used for training [23] using TensorFlow 2.0 and Python 3.10. Training was stopped when no performance improvement in the validation dataset was observed. All experiments were performed on an NVIDIA A600 GPU, an Intel Xeon Gold 5218R CPU, and 128 GB RAM.
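The structure of the loss in Eq. 3 can be sketched in NumPy as a weighted sum of a Bernoulli and a Gaussian negative log-likelihood plus a scaled divergence term. This is a simplified illustration under stated assumptions: the expectation over the approximate posterior is replaced by a single stochastic forward pass, the divergence `kl` is passed in as a precomputed number, and all names and default values (`alpha_mort`, `beta`, `death_weight`) are hypothetical rather than the values used in the study.

```python
import numpy as np

def gaussian_nll(y, mean, std):
    """Negative log-likelihood of y under N(mean, std^2), per sample."""
    return 0.5 * np.log(2.0 * np.pi * std**2) + (y - mean) ** 2 / (2.0 * std**2)

def bernoulli_nll(died, p, death_weight=1.0, eps=1e-7):
    """Negative log-likelihood of a binary outcome; deaths are up-weighted."""
    p = np.clip(p, eps, 1.0 - eps)
    nll = -(died * np.log(p) + (1.0 - died) * np.log(1.0 - p))
    # Only samples with died == 1 receive the extra class weight.
    return np.where(died == 1, death_weight, 1.0) * nll

def negative_elbo(ald_true, ald_mean, ald_std, died, p_death,
                  kl, alpha_mort=0.5, beta=1.0, death_weight=1.0):
    """Weighted two-task negative log-likelihood plus a scaled KL term."""
    alpha_ald = 1.0 - alpha_mort  # mixture weights sum to one
    nll = (alpha_mort * bernoulli_nll(died, p_death, death_weight)
           + alpha_ald * gaussian_nll(ald_true, ald_mean, ald_std))
    # Batch-averaged likelihood term from one forward pass, plus beta * KL.
    return np.mean(nll) + beta * kl
```

A `death_weight` greater than one increases the penalty for misclassifying participants who died, which is one way to realize the inverse weighting described above.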
Quantitative evaluation
The DL-ALD was evaluated in the testing dataset against the quantified conventional ALD (Eq. 1) from the FD CT images using the root mean squared error (RMSE). Note that the model was trained to predict the ALD quantified from the FD imaging protocol regardless of whether the input image was acquired at a FD or RD protocol to achieve predictions independent from radiation dose. For the predicted 5-year mortality, performance was evaluated in terms of area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity. The true 5-year mortality status was determined from the time of image acquisition. Performance for our multi-task BNN was compared to single-task SNNs, single-task BNNs, a multi-task SNN, and a logistic regression model for 5-year mortality. Additional details are provided in Appendix S3. The data (aleatoric) and model (epistemic) uncertainty for both metrics were quantified on the testing dataset through Monte Carlo integration with 50 stochastic forward passes [24, 25]. Additional details are provided in Appendix S4.
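One common way to separate the two sources of uncertainty from repeated stochastic forward passes is sketched below. The decomposition shown (average predicted variance as aleatoric, spread of predictions across passes as epistemic) is a widely used formulation and is assumed here for illustration; the exact estimator used in the study is described in Appendix S4, which is not reproduced in this text. Function names are hypothetical.

```python
import numpy as np

def regression_uncertainty(means, stds):
    """Split predictive variance for one scan; inputs have shape (T,).

    means, stds: the predicted mean and standard deviation of DL-ALD
    from each of T stochastic forward passes.
    """
    aleatoric = np.mean(stds**2)   # average predicted (data) variance
    epistemic = np.var(means)      # spread of predicted means across passes
    return aleatoric, epistemic

def classification_uncertainty(probs):
    """Split the variance of a Bernoulli prediction; probs has shape (T,)."""
    aleatoric = np.mean(probs * (1.0 - probs))
    epistemic = np.mean((probs - probs.mean()) ** 2)
    return aleatoric, epistemic
```

In both cases the two components sum to the total predictive variance, and the Bernoulli aleatoric term `p * (1 - p)` is largest near a predicted probability of 0.5.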
Statistical Analysis
To assess model consistency across imaging protocols, the DL-ALD and 5-year mortality risk were compared between the FD and RD images in the testing dataset using paired Wilcoxon signed-rank tests. To study the potential bias between FD and RD ALD, linear calibration models were implemented to obtain predicted FD ALD from RD ALD for both conventional ALD and DL-ALD (Appendix S5). The relationship between the DL Fleischner emphysema scores [7], DL-ALD, and the predicted probability of 5-year mortality was examined using Kendall’s rank correlation. Furthermore, the association between the predictive uncertainty for DL-ALD and the imaging and participant characteristics was determined using univariable linear mixed effects models (LMMs) that account for within-participant correlation. These characteristics included BMI, smoking status, field of view (FOV), radiation protocol, scanner make, and scanner model. Additionally, the relationship between the CT noise and the predictive uncertainty was examined for FD and RD images using linear regression, where CT noise was quantified for each image using a method based on subtracting adjacent axial slices [26]. Statistical analysis was performed in R (version 4.3.2), and statistical significance was assessed at the α = 0.05 level.
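The protocol comparison and the linear calibration described above can be illustrated with a short sketch. The study's analysis was performed in R; the Python below is only an equivalent illustration with hypothetical function names, and it assumes a simple least-squares line for the calibration (the actual models are specified in Appendix S5).

```python
import numpy as np

def bland_altman(fd, rd):
    """Mean FD-RD difference and the mean +/- 2 SD limits of agreement."""
    diff = np.asarray(fd, dtype=float) - np.asarray(rd, dtype=float)
    bias, sd = diff.mean(), diff.std(ddof=1)
    return bias, (bias - 2.0 * sd, bias + 2.0 * sd)

def calibrate_rd_to_fd(rd_train, fd_train, rd_new):
    """Fit FD = a + b * RD by least squares and apply it to new RD values."""
    slope, intercept = np.polyfit(rd_train, fd_train, deg=1)
    return intercept + slope * np.asarray(rd_new, dtype=float)
```

Calibration of this kind can remove the average offset between protocols, but it does not by itself narrow the spread of the paired differences.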
Results
Participant Characteristics
From the original set of 1,504 Phase 2 participants who consented to both a FD and RD protocol CT exam, 116 were excluded for inadequate CT quality. The remaining 1,388 participants were randomly split into training, validation and testing datasets using an 80%/10%/10% split. After discarding 38 participants who were missing either the FD or the RD image, our study included 1,350 participants (mean age, 64.4 years ± 8.7; 659 female; 860 White): 1,084, 130 and 136 participants for training, validation and testing, respectively. Table 1 shows demographic characteristics for the final study sample stratified by 5-year mortality status, with 154 (11%) participants who died within five years of follow-up. A total of 31 (2%) participants were lost-to-follow-up before the 5-year mark. The proportion of current smokers was higher among those who died within five years (45%) compared to participants who did not die within five years (41%).
Table 1.
Demographic characteristics by 5-year mortality status.
| | 5-year mortality | | |
|---|---|---|---|
| Variable | Alive | Dead | Total |
| N | 1196 | 154 | 1350 |
| Age (years) | 64 ± 8.5 | 67.4 ± 9.4 | 64.4 ± 8.7 |
| Height (cm) | 169.6 ± 9.6 | 171.1 ± 10.5 | 169.8 ± 9.7 |
| BMI (kg/m²) | 29.1 ± 6.3 | 29.1 ± 7.7 | 29.1 ± 6.4 |
| Sex | |||
| Male | 603 (50) | 88 (57) | 691 (51) |
| Female | 593 (50) | 66 (43) | 659 (49) |
| Race | |||
| Black | 427 (36) | 63 (41) | 490 (36) |
| White | 769 (64) | 91 (59) | 860 (64) |
| Smoking status | |||
| Current | 491 (41) | 70 (45) | 561 (42) |
| Former | 569 (48) | 79 (51) | 648 (48) |
| Never | 136 (11) | 5 (3) | 141 (10) |
| TLC (L) | 5.2 ± 1.4 | 5.4 ± 1.6 | 5.3 ± 1.4 |
| FD ALD (g/L) | 88.2 ± 21.2 | 79.9 ± 29.3 | 87.2 ± 22.4 |
| RD ALD (g/L) | 74 ± 20.9 | 66.8 ± 27.7 | 73.1 ± 21.9 |
Notes.-Data are mean ± SD. Data in parentheses are percentages. BMI = Body mass index, TLC = Total lung capacity, FD = Full radiation dose, RD = Reduced radiation dose, ALD = volume-adjusted lung density.
Performance evaluation
Performance results are shown in Table 2. For our multi-task BNN, the RMSE for the DL-ALD was 9.5 g/L, which represents an improvement over the compared alternatives ranging from 11.2 to 13.7 g/L. Additionally, there were no significant differences between the conventional FD ALD (ground truth) and DL-ALD derived from either the FD or RD scans, with mean differences of 0.8 ± 9.3 g/L (p = 0.55) and 1.7 ± 9.5 g/L (p = 0.06), respectively (Figure S1). The AUC for predicting the 5-year mortality risk in the testing dataset was 0.64 for our multi-task BNN, and ranged from 0.61 to 0.72 across comparison models (Table 2 and Table S3). The overall classification accuracy for predicting the 5-year mortality was 0.64, the sensitivity was 0.63, and the specificity was 0.64.
Table 2.
Model performance for DL-ALD and predicted probability of 5-year mortality in testing set.
| | DL-ALD | 5-year mortality | | | |
|---|---|---|---|---|---|
| Model | RMSE (g/L) | AUC | Accuracy | Sensitivity | Specificity |
| Single-task SNN | 11.2 | 0.72 | 0.84 | 0.16 | 0.93 |
| Single-task BNN | 11.4 | 0.61 | 0.57 | 0.63 | 0.57 |
| Multi-task SNN | 13.7 | 0.65 | 0.86 | 0.00 | 0.98 |
| Multi-task BNN | 9.5 | 0.64 | 0.64 | 0.63 | 0.64 |
Notes.-SNN = Standard neural network, BNN = Bayesian neural network, DL-ALD = Deep learning volume-adjusted lung density, RMSE = Root mean square error, AUC = Area under receiver operator characteristic curve.
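The classification metrics in Table 2 can be read more easily with a small worked example. The sketch below computes accuracy, sensitivity, and specificity from binary labels and predictions. The function name is hypothetical, and the 100-participant example in the usage note is purely illustrative (it is not data from the study); it only mirrors the roughly 11% death rate reported for the cohort.

```python
import numpy as np

def binary_metrics(died, predicted_death):
    """Accuracy, sensitivity, and specificity for binary mortality labels.

    Assumes both classes are present, so neither denominator is zero.
    """
    died = np.asarray(died, dtype=bool)
    pred = np.asarray(predicted_death, dtype=bool)
    tp = np.sum(died & pred)     # deaths correctly predicted
    tn = np.sum(~died & ~pred)   # survivors correctly predicted
    fp = np.sum(~died & pred)    # survivors predicted to die
    fn = np.sum(died & ~pred)    # deaths that were missed
    return {
        "accuracy": float((tp + tn) / died.size),
        "sensitivity": float(tp / (tp + fn)),
        "specificity": float(tn / (tn + fp)),
    }
```

For a hypothetical set of 100 participants with 11 deaths, a classifier that predicts survival for everyone reaches an accuracy of 0.89 with a sensitivity of 0 and a specificity of 1. High accuracy paired with very low sensitivity therefore indicates a model that favors the majority class, which is why balanced sensitivity and specificity are informative under class imbalance.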
The consistency in emphysema quantification between FD and RD images using the conventional (traditional) ALD and our multi-task BNN DL-ALD was also evaluated. Figure 3a and 3b show the Bland-Altman plots comparing the conventional ALD and the DL-ALD quantified from the FD images versus the RD images, respectively, in the testing dataset. The mean difference between the conventional FD ALD and the conventional RD ALD was 14.9 g/L ± 5.3 (p < 0.001). The mean difference between the FD DL-ALD and RD DL-ALD was 1.0 g/L ± 3.1 (p < 0.001). Figure 3b shows that the bias between imaging protocols tended to occur at higher DL-ALD values (participants without severe emphysema). For the average DL-ALD at or below the 75th percentile (DL-ALD = 97 g/L), there was no significant difference in DL-ALD quantified from FD versus RD images (mean difference = 0.4 g/L ± 2.9, p = 0.21). However, for the average conventional ALD at or below 97 g/L, there was still a significant difference between emphysema quantified from FD versus RD images (mean difference = 14.6 g/L ± 5.4, p < 0.001). Figure 3c shows the Bland-Altman plot comparing the conventional FD ALD to the calibrated RD ALD where the bias between FD and RD ALD was reduced to 0.4 g/L ± 5 (p = 0.24). In comparison, Figure 3d shows the Bland-Altman plot comparing FD DL-ALD and calibrated RD DL-ALD where the bias between FD and RD DL-ALD was reduced to −0.1 g/L ± 2.9 (p = 0.40).
Figure 3.




Bland-Altman plots comparing (a) conventional volume-adjusted lung density (ALD) measured on full radiation dose (FD) versus reduced radiation dose (RD) protocol images, (b) deep learning (DL)-ALD measured on FD versus RD protocol images, (c) conventional FD ALD versus calibrated RD ALD, and (d) FD DL-ALD versus calibrated RD DL-ALD in the testing set. Blue dashed lines represent mean differences while red dashed lines represent 2 standard deviations above and below the mean difference.
The consistency in the 5-year mortality prediction was also examined between imaging protocols. Within the testing dataset, the predicted probability of the 5-year mortality from our multi-task BNN was not significantly different between FD and RD images (mean difference = 0.0007 ± 0.02, p = 0.76).
With respect to the DL Fleischner emphysema scores, lower DL-ALD was significantly associated with more severe DL emphysema scores (Kendall’s Tau-b = −0.13, p < 0.01). For the 5-year mortality risk, more severe DL emphysema scores were significantly associated with higher probabilities of death (Kendall’s Tau-b = 0.27, p < 0.001).
Quantification of uncertainty
Figure 4 shows the model uncertainty for the DL-ALD (panel 4a), the data uncertainty for the DL-ALD (panel 4b), the model uncertainty for the probability of 5-year mortality (panel 4c), and the data uncertainty for the probability of 5-year mortality (panel 4d) as functions of the DL-ALD and the probability of 5-year mortality in the testing data. The model uncertainty was higher at low DL-ALD predictions (more severe participants, Pearson’s product-moment rho = −0.55, p < 0.001) and the data uncertainty was significantly higher with higher predicted DL-ALD (less severe participants, Pearson’s product-moment rho = 0.79, p < 0.001). For the predicted probability of 5-year mortality, the model uncertainty was higher around a predicted probability of 0.5 than extreme probabilities close to 0 or 1 (p < 0.001). The data uncertainty for the probability of 5-year mortality followed a quadratic function with respect to the probability of 5-year mortality.
Figure 4.





Epistemic (model) and aleatoric (data) uncertainty as functions of deep learning volume-adjusted lung density (DL-ALD) and predicted probability of 5-year mortality in the testing set. Lines represent LOESS curves. Red points and curves represent observations derived from full radiation dose (FD) protocol images. Blue points and curves represent observations derived from reduced radiation dose (RD) protocol images. The pattern in panel (d) is observed since average variance (aleatoric uncertainty) of predicted probability is a direct function of the predicted probability of 5-year mortality.
Table 3 shows the univariable associations between the quantified DL-ALD uncertainty, imaging and participant characteristics for the testing dataset. Higher BMI was significantly associated with increased data-related uncertainty but not model uncertainty (p < 0.001 for data uncertainty and p = 0.51 for model uncertainty). Compared to former smokers, current smokers were associated with higher data-related uncertainty, but lower model uncertainty (p = 0.001 for data uncertainty and p = 0.046 for model uncertainty). In terms of CT image characteristics, higher FOV was associated with increased data uncertainty and model uncertainty (p = 0.028 and p = 0.048, respectively). Moreover, the RD imaging protocol had on average a slightly smaller amount of data uncertainty but not model uncertainty compared to the FD imaging protocol (p < 0.001 and p = 0.16, respectively). For CT noise, higher CT noise was associated with higher data uncertainty in both FD and RD images (p < 0.001 and p = 0.015, respectively). Neither scanner make nor model were significantly associated with either data uncertainty or model uncertainty.
Table 3.
Univariable associations between image/patient characteristics and predictive uncertainty for DL-ALD derived from multi-task Bayesian neural network in the testing set.
| | Aleatoric (data) uncertainty | | Epistemic (model) uncertainty | |
|---|---|---|---|---|
| Variable | Coefficient (95% CI) | P-value | Coefficient (95% CI) | P-value |
| BMI | 9.9 (6.7, 13.1) | <0.001 | −0.06 (−0.3, 0.1) | 0.51 |
| FOV | 1.6 (0.2, 3.0) | 0.028 | 0.07 (0.001, 0.1) | 0.048 |
| Smoking Status | ||||
| Former | Reference | | Reference | |
| Current | 17.9 (7.4, 28.4) | 0.001 | −0.6 (−1.2, −0.02) | 0.046 |
| Never | −7.3 (−21.8, 7.3) | 0.33 | −0.5 (−1.3, 0.3) | 0.26 |
| Dose | ||||
| Full Dose | Reference | | Reference | |
| Reduced Dose | −1.7 (−2.5, −0.9) | <0.001 | 0.2 (−0.06, 0.4) | 0.16 |
| CT Noise | ||||
| Full Dose | 2.2 (1.3, 3.0) | <0.001 | 0.001 (−0.05, 0.05) | 0.96 |
| Reduced Dose | 0.9 (0.2, 1.7) | 0.015 | 0.01 (−0.04, 0.05) | 0.83 |
| Scanner Model | − (-, -) | 0.13 | − (-, -) | 0.13 |
| Scanner Make | ||||
| GE | Reference | | Reference | |
| Philips | −12.9 (−38.7, 12.8) | 0.33 | 0.2 (−1.2, 1.5) | 0.81 |
| Siemens | −10.3 (−21.4, 0.8) | 0.07 | 0.6 (0.04, 1.2) | 0.036 |
Notes.-BMI = Body mass index, FOV = Field of view, GE = General Electric. Coefficient estimates, 95% confidence intervals, and p-values derived from univariable linear mixed effects models with random intercept for participant. For CT noise, linear models were fit separately for full dose and reduced dose scans. Estimates for scanner model not included due to large number of categories; p-value for scanner model represents Type III test. Number of scans from GE were n=82, Philips n=12, and Siemens n=178.
Discussion
We presented a method that, unlike traditional metrics [3, 4], provides a consistent quantification of emphysema severity from CT images acquired under different imaging protocols and participant characteristics. Moreover, our method simultaneously predicts the probability of 5-year mortality, which ensures our estimate of emphysema severity is informed by and correlated with this important clinical endpoint. This consistency in emphysema measurement is essential to minimize the subjectivity in metric interpretation, ensure clinical utility, and improve assessment of disease progression over serial CT exams.
Our multi-task BNN showed more consistent estimates of emphysema severity between imaging protocols compared to conventional quantification methods. Although a significant mean DL-ALD difference of 1 g/L was quantified between FD and RD protocols, this difference was driven by the predictions associated with participants with mild to no emphysema (above the 75th percentile or 97 g/L). We hypothesize that the lack of sufficient emphysema information in the CT images and partial volume effects [27] may impact the model’s ability to provide optimal estimates of emphysema severity. In contrast, no significant differences between imaging protocols were found for participants at or below the 75th percentile, which shows the improved consistency of our method over conventional methods.
Several studies have provided methods for correcting bias between FD and RD protocols including image correction via median filtering, transformations to the lung intensity histogram via volume-noise-bias corrections, and metric calibration using regression equations [3, 18, 28]. In this study, we compared our DL-ALD with conventional ALD after applying bias correction through linear regression models. Similar to a previous COPDGene study [3], calibrated RD ALD on average removed the bias between FD and RD ALD. The primary advantage of calibrated RD ALD compared to DL-ALD is the simplistic methodology of using a linear model versus a deep learning approach. However, the DL-ALD still showed considerably smaller variability between the FD and RD imaging protocols, which demonstrates the improvements of our approach compared to bias correction methods that, unlike ours, are unable to quantify the underlying epistemic and aleatoric uncertainty. Consistent emphysema severity metrics across scanning protocols such as our DL-ALD can improve the longitudinal assessment of disease progression specifically when a shift in radiation dose is present as it occurs in the COPDGene protocol [3]. In terms of mortality, our multi-task BNN also provided a consistent 5-year mortality prediction between imaging protocols.
Using a multi-task approach, our model provided a more comprehensive evaluation of COPD severity through the simultaneous quantification of emphysema severity and a 5-year mortality risk. Our results demonstrate the improvements of our multi-task approach compared to single-task architectures, as indicated by the lower RMSE for quantifying emphysema and the more balanced sensitivity and specificity when predicting 5-year mortality. This is due not only to the capacity of our model to exploit correlations between the quantification of emphysema severity and participant prognosis but also to the self-regularization built into the Bayesian framework, which reduces model overfitting to the majority class (i.e., participants who did not die within 5 years). The simultaneous quantification of emphysema severity and mortality risk also ensures the correlation between our metric of emphysema severity and the risk of mortality, which is a clinically important endpoint for COPD. Additionally, both emphysema and 5-year mortality predictions were significantly correlated with a deep learning-based visual classification of emphysema, where lower DL-ALD and higher 5-year mortality risk were associated with more severe Fleischner emphysema grades. This correlation with a deep learning-based visual scoring indicates the alignment between our predicted metrics and expert radiological evaluations.
By utilizing a Bayesian framework, our model was able to quantify the uncertainty stemming from the data (aleatoric) and the uncertainty associated with the model (epistemic). In the presence of little or no emphysema (high ALD), data variability had a larger impact on the predicted distribution of DL-ALD, while lower predictive uncertainty was found for images showing moderate to severe emphysema. In contrast, images with severe emphysema (low ALD) had an increase in the uncertainty associated with the model, which is explained by fewer training images from participants with severe emphysema [12]. For the 5-year mortality risk predictions, model uncertainty was the lowest for patients at the extremes of the probability of death (values close to 0 or 1, representing non-severe or very severe patients, respectively), as expected. We also observed significant relationships between the FOV, BMI, smoking status, and predictive uncertainty. Moreover, data uncertainty quantified from our model was positively associated with CT noise in both FD and RD images, while model uncertainty was not associated with CT noise. This provides further quantitative evidence that these imaging and patient characteristics impact the accuracy of existing metrics and patient prognosis evaluation. In the face of imaging variability introduced by different protocols and patient characteristics, the ability to quantify uncertainty provides a measurement of reliability. In contrast, standard deep learning or conventional methods are only able to provide aleatoric uncertainty estimates when modeling quantitative emphysema metrics in a cohort using confidence or prediction intervals, ignoring an important component of overall predictive uncertainty. Furthermore, our Bayesian deep learning framework provides metrics of uncertainty at the time of emphysema quantification, which conventional methods are unable to do.
Thus, using our Bayesian deep learning framework can aid the clinical utility of quantitative CT metrics by providing needed information about how much an automated AI-driven patient evaluation can be trusted. For example, a CT evaluation of emphysema using our Bayesian deep learning model that results in high predictive uncertainty relative to a comparable cohort can guide a clinical decision to favor other, possibly less variable, metrics. The increased robustness of a Bayesian-based analysis approach could also improve statistical power in cross-sectional epidemiologic studies of demographic, genetic and environmental associations of emphysema [29, 30]. More importantly, this type of analysis could reduce the inherent variability in longitudinal evaluation of emphysema, potentially facilitating clinical trials of novel treatments [3, 31]. Within the specific scope of the COPDGene project, the quantification of uncertainty during emphysema evaluation also presents a unique opportunity to examine more closely which factors within and across visits such as scanning protocol, FOV, and BMI contribute the most to imaging variability.
One limitation of our study was the use of the conventional ALD (Eq. 1) quantified from FD images as ground truth to train our model, which prevents the model from incorporating or providing new metrics of emphysema severity. The primary goals of our proposed method were to both quantify uncertainty and provide more consistent emphysema severity quantification between imaging protocols in the context of a validated metric such as ALD, not to design a new or better metric. Future work involves a more comprehensive evaluation of emphysema severity through the joint prediction of other relevant metrics, such as the Fleischner emphysema grade, by leveraging the relationships between quantitative and qualitative measurements [2]. An additional limitation is the use of a mortality risk prediction within a 5-year window, which ignores important time-to-death information [32]. Moreover, the imbalance between the number of participants who died within five years and those who survived adds the challenge of overfitting to the majority class. We chose a binary mortality outcome due to the simplicity of incorporating a Binomial likelihood into our negative ELBO loss function and to demonstrate the value of exploiting the correlations between metrics of the severity of emphysema and patient prognosis. However, future work will use time-to-death predictions via survival models instead of binary mortality predictions, which will also help address the aforementioned challenge associated with data imbalance.
Another limitation is the generalizability of our model to CT scans outside of COPDGene. Although COPDGene CT images at Phase 2 came from a variety of scanner makes, models, radiation doses, and study sites, CT acquisition adhered to a strict protocol and may not represent other CT scans acquired in a clinical setting. Moreover, our method relies on the availability of lung segmentation masks from LungQ, which may limit reproducibility in the presence of segmentation errors. However, lung masks were visually inspected for accuracy, limiting the potential for segmentation errors [18]. Future work aims to evaluate our Bayesian framework on subjects outside of COPDGene such as those in the Evaluation of Chronic Obstructive Pulmonary Disease Longitudinally to Identify Predictive Surrogate End-points (ECLIPSE) and Multi-Ethnic Study of Atherosclerosis (MESA) cohorts [5, 21]. We chose a Bayesian framework in part to benefit from its built-in self-regularization and its flexibility from using distributions instead of scalars for network parameters, which increases the generalizability of our deep learning model. However, future clinical studies with variable imaging protocols will be needed to clinically validate the generalizability of our method outside COPDGene.
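To make the role of the binary mortality likelihood and of distribution-valued network parameters more concrete, the sketch below assembles a generic multi-task negative ELBO from three terms: a Gaussian negative log-likelihood for a continuous target such as ALD, a Bernoulli negative log-likelihood (a Binomial with a single trial) for the binary mortality outcome, and a KL divergence between a diagonal Gaussian weight posterior and a Gaussian prior. This is not the implementation used in this study; the standard normal prior, the equal weighting of the two tasks, the `kl_weight` parameter, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of y under N(mu, exp(log_var)), summed over samples."""
    return float(np.sum(0.5 * (np.log(2 * np.pi) + log_var
                               + (y - mu) ** 2 / np.exp(log_var))))

def bernoulli_nll(d, p, eps=1e-7):
    """Negative log-likelihood of binary outcomes d given probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.sum(d * np.log(p) + (1 - d) * np.log(1 - p)))

def kl_gaussian(q_mu, q_sigma, p_mu=0.0, p_sigma=1.0):
    """KL(q || p) between diagonal Gaussians, summed over all weights."""
    return float(np.sum(np.log(p_sigma / q_sigma)
                        + (q_sigma ** 2 + (q_mu - p_mu) ** 2) / (2 * p_sigma ** 2)
                        - 0.5))

def negative_elbo(y, mu, log_var, d, p, q_mu, q_sigma, kl_weight=1.0):
    """Sum of both task likelihood terms plus the weighted KL regularizer."""
    return (gaussian_nll(y, mu, log_var)
            + bernoulli_nll(d, p)
            + kl_weight * kl_gaussian(q_mu, q_sigma))

# Toy values: two participants, three network weights (all hypothetical).
loss = negative_elbo(
    y=np.array([80.0, 95.0]), mu=np.array([82.0, 93.0]), log_var=np.array([1.0, 1.0]),
    d=np.array([0.0, 1.0]), p=np.array([0.1, 0.7]),
    q_mu=np.array([0.2, -0.1, 0.0]), q_sigma=np.array([0.9, 1.1, 1.0]),
)
print(loss)
```

The KL term is what gives the self-regularization mentioned above: it penalizes weight posteriors that drift far from the prior, independently of how well either task is fit.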
The evaluation of COPD and emphysema severity through chest CT images is heavily affected by imaging protocol and patient variability. This study presents a method to reduce the impact of such variability. Additionally, our method enabled the quantification of the uncertainty associated with both the data variability and the model, and provided the necessary context to interpret evaluations performed under diverse imaging conditions and patient characteristics. Finally, our multi-task approach leveraged the known correlation between emphysema severity and disease prognosis to provide their simultaneous quantitative evaluation from CT images. Hence, the presented study not only improves the robustness of traditional metrics of emphysema severity to CT imaging protocol variations but also constitutes a substantial advancement towards consistent evaluations of COPD progression accompanied by a measurement of their reliability. Supplemental material references [33, 34].
Figure 2. CONSORT diagram showing participant selection. ILD = Interstitial lung disease, FD = Full radiation dose, RD = Reduced radiation dose.
Key Points:
Question:
Quantitative CT evaluation of emphysema is highly sensitive to CT protocol, which increases uncertainty in disease evaluation and impacts the clinical utility of traditional metrics.
Findings:
Uncertainty-aware deep learning improved consistency in emphysema quantification between full and reduced dose CT scans compared to traditional histogram analysis.
Clinical Relevance:
CT evaluation of emphysema severity and mortality risk using uncertainty-aware deep learning methods is more consistent across variable radiation dose protocols compared to conventional methods while also providing measurement reliability metrics, improving the evaluation of COPD using CT.
Funding:
This work was supported by NHLBI grants U01 HL089897 and U01 HL089856 and by NIH contract 75N92023D00011. The COPDGene study (NCT00608764) has also been supported by the COPD Foundation through contributions made to an Industry Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion.
Abbreviations:
- ALD: Volume-adjusted lung density
- BNN: Bayesian neural network
- SNN: Standard neural network
- COPD: Chronic obstructive pulmonary disease
Data Sharing Statement:
Data generated or analyzed during the study are available from the corresponding author by request.
References:
- 1. Heron MP (2021) Deaths: leading causes for 2018. National Vital Statistics Reports Vol. 70, No. 4
- 2. Lynch DA, Austin JHM, Hogg JC, et al. (2015) CT-Definable Subtypes of Chronic Obstructive Pulmonary Disease: A Statement of the Fleischner Society. Radiology 277:192–205
- 3. Baraghoshi D, Strand M, Humphries SM, et al. (2023) Quantitative CT Evaluation of Emphysema Progression over 10 Years in the COPDGene Study. Radiology 307:e222786
- 4. Coxson HO (2013) Sources of variation in quantitative computed tomography of the lung. J Thorac Imaging 28:272–9
- 5. Ash SY, San José Estépar R, Fain SB, et al. (2021) Relationship between Emphysema Progression at CT and Mortality in Ever-Smokers: Results from the COPDGene and ECLIPSE Cohorts. Radiology 299:222–231
- 6. Hatt C, Galban C, Labaki W, Kazerooni E, Lynch D, Han M (2018) Convolutional Neural Network Based COPD and Emphysema Classifications Are Predictive of Lung Cancer Diagnosis. in Image Analysis for Moving Organ, Breast, and Thoracic Images. Springer International Publishing. 302–309
- 7. Humphries SM, Notary AM, Centeno JP, et al. (2019) Deep Learning Enables Automatic Classification of Emphysema Pattern at CT. Radiology 294:434–444
- 8. Xie W, Jacobs C, Charbonnier JP, Slebos DJ, van Ginneken B (2023) Emphysema subtyping on thoracic computed tomography scans using deep neural networks. Sci Rep 13:14147
- 9. Kendall A, Gal Y, Cipolla R (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. in Proceedings of the IEEE conference on computer vision and pattern recognition. 7482–7491
- 10. Ruder S (2017) An overview of multi-task learning in deep neural networks. arXiv:1706.05098
- 11. Ghoshal B, Tucker A, Sanghera B, Lup Wong W (2021) Estimating uncertainty in deep learning for reporting confidence to clinicians in medical image segmentation and diseases detection. Computational Intelligence 37:701–734
- 12. Kendall A, Gal Y (2017) What uncertainties do we need in Bayesian deep learning for computer vision? in NIPS'17. 5580–5590
- 13. Shridhar K, Laumann F, Liwicki M (2019) A Comprehensive guide to Bayesian Convolutional Neural Network with Variational Inference. arXiv:1901.02731
- 14. Chen X, Liu C, Zhao Y, Jia Z, Jin G (2022) Improving adversarial robustness of Bayesian neural networks via multi-task adversarial training. Information Sciences 592:156–173
- 15. Cifci MA (2023) A Deep Learning-Based Framework for Uncertainty Quantification in Medical Imaging Using the DropWeak Technique: An Empirical Study with Baresnet. Diagnostics 13
- 16. Regan EA, Hokanson JE, Murphy JR, et al. (2010) Genetic epidemiology of COPD (COPDGene) study design. COPD 7:32–43
- 17. QIBA. Lung Density Biomarker Committee. (2019) Available from: https://qibawiki.rsna.org/images/e/e4/QIBA_CT_Lung_Density_Profile_090319-clean.pdf. Accessed 2022 July 19
- 18. Hatt CR, Oh AS, Obuchowski NA, Charbonnier JP, Lynch DA, Humphries SM (2021) Comparison of CT Lung Density Measurements between Standard Full-Dose and Reduced-Dose Protocols. Radiol Cardiothorac Imaging 3:e200503
- 19. Shaker SB, Dirksen A, Laursen LC, Skovgaard LT, Holstein-Rathlou NH (2004) Volume adjustment of lung density by computed tomography scans in patients with emphysema. Acta Radiol 45:417–23
- 20. Stoel BC, Putter H, Bakker ME, et al. (2008) Volume correction in computed tomography densitometry for follow-up studies on pulmonary emphysema. Proc Am Thorac Soc 5:919–24
- 21. Hoffman EA, Ahmed FS, Baumhauer H, et al. (2014) Variation in the percent of emphysema-like lung in a healthy, nonsmoking multiethnic sample. The MESA lung study. Ann Am Thorac Soc 11:898–907
- 22. Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational Inference: A Review for Statisticians. Journal of the American Statistical Association 112:859–877
- 23. Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980
- 24. Kwon Y, Won J-H, Kim BJ, Paik MC (2020) Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis 142:106816
- 25. Kendall A, Badrinarayanan V, Cipolla R (2015) Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv:1511.02680
- 26. Tian X, Samei E (2016) Accurate assessment and prediction of noise in clinical CT images. Medical Physics 43:475–482
- 27. Soret M, Bacharach SL, Buvat I (2007) Partial-Volume Effect in PET Tumor Imaging. Journal of Nuclear Medicine 48:932
- 28. Sánchez-Ferrero GV, Díaz AA, Ash SY, et al. (2024) Quantification of Emphysema Progression at CT Using Simultaneous Volume, Noise, and Bias Lung Density Correction. Radiology 310:e231632
- 29. Manichaikul A, Hoffman EA, Smolonska J, et al. (2014) Genome-wide study of percent emphysema on computed tomography in the general population. The Multi-Ethnic Study of Atherosclerosis Lung/SNP Health Association Resource Study. Am J Respir Crit Care Med 189:408–18
- 30. Hardin M, Foreman M, Dransfield MT, et al. (2016) Sex-specific features of emphysema among current and former smokers with COPD. Eur Respir J 47:104–12
- 31. Stockley RA, Parr DG, Piitulainen E, Stolk J, Stoel BC, Dirksen A (2010) Therapeutic efficacy of α-1 antitrypsin augmentation therapy on the loss of lung tissue: an integrated analysis of 2 randomised clinical trials using computed tomography densitometry. Respir Res 11:136
- 32. Liu X (2012) Survival analysis: models and applications. John Wiley & Sons
- 33. Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural network. in International conference on machine learning. PMLR. 1613–1622
- 34. Kingma DP, Salimans T, Welling M (2015) Variational dropout and the local reparameterization trick. Advances in neural information processing systems 28
