Published in final edited form as: Med Phys. 2024 Feb 8;51(4):2834–2845. doi: 10.1002/mp.16915

Estimating lung function from computed tomography at the patient and lobe level using machine learning

Luuk H Boulogne 1, Jean-Paul Charbonnier 2, Colin Jacobs 1, Erik H F M van der Heijden 1, Bram van Ginneken 1
PMCID: PMC11132300  NIHMSID: NIHMS1959655  PMID: 38329315

Abstract

Background:

Automated estimation of Pulmonary function test (PFT) results from Computed Tomography (CT) could advance the use of CT in screening, diagnosis, and staging of restrictive pulmonary diseases. Estimating lung function per lobe, which cannot be done with PFTs, would be helpful for risk assessment for pulmonary resection surgery and bronchoscopic lung volume reduction.

Purpose:

To automatically estimate PFT results from CT and furthermore disentangle the individual contribution of pulmonary lobes to a patient’s lung function.

Methods:

We propose I3Dr, a deep learning architecture for estimating global measures from an image that can also estimate the contributions of individual parts of the image to this global measure. We apply it to estimate the separate contributions of each pulmonary lobe to a patient’s total lung function from CT, while requiring only CT scans and patient level lung function measurements for training. I3Dr consists of a lobe-level and a patient-level model. The lobe-level model extracts all anatomical pulmonary lobes from a CT scan and processes them in parallel to produce lobe level lung function estimates that sum up to a patient level estimate. The patient-level model directly estimates patient level lung function from a CT scan and is used to re-scale the output of the lobe-level model to increase performance. After demonstrating the viability of the proposed approach, the I3Dr model is trained and evaluated for PFT result estimation using a large data set of 8 433 CT volumes for training, 1 775 CT volumes for validation, and 1 837 CT volumes for testing.

Results:

First, we demonstrate the viability of our approach by showing that a model trained with a collection of digit images to estimate their sum implicitly learns to assign correct values to individual digits. Next, we show that our models can estimate lobe-level quantities, such as COVID-19 severity scores, pulmonary volume (PV), and functional pulmonary volume (FPV) from CT while only provided with patient-level quantities during training. Lastly, we train and evaluate models for producing spirometry and diffusion capacity of carbon mono-oxide (DLCO) estimates at the patient and lobe level. For producing Forced Expiratory Volume in one second (FEV1), Forced Vital Capacity (FVC), and DLCO estimates, I3Dr obtains mean absolute errors (MAE) of 0.297 L, 0.377 L, and 2.800 mL/min/mm Hg, respectively. We release the resulting algorithms for lung function estimation to the research community at https://grand-challenge.org/algorithms/lobe-wise-lung-function-estimation/

Conclusions:

I3Dr can estimate global measures from an image, as well as the contributions of individual parts of the image to this global measure. It offers a promising approach for estimating PFT results from CT scans and disentangling the individual contribution of pulmonary lobes to a patient’s lung function. The findings presented in this work may advance the use of CT in screening, diagnosis, and staging of restrictive pulmonary diseases as well as in risk assessment for pulmonary resection surgery and bronchoscopic lung volume reduction.

Keywords: computed tomography, convolutional neural network, pulmonary function test, weakly supervised learning

1 |. INTRODUCTION

Pulmonary function tests (PFTs) are widely used to assess the respiratory health of a patient, with spirometry and Diffusion Capacity of Carbon mono-Oxide (DLCO) measurements being particularly useful for this purpose.1–3 Spirometry assesses how a person inhales or exhales a given volume of air over time,2 while DLCO measures the rate of oxygen uptake in the lungs.3

In this pilot study, we present a machine learning method that automatically estimates spirometry and DLCO test results from Computed Tomography (CT). Our method also estimates the contribution of each pulmonary lobe to a patient’s lung function, providing clinically relevant information that is impossible to obtain from PFTs.

Methods for accurate automatic lung function prediction from CT have the potential to replace PFTs when CT scans are already available. This would make them useful for the many purposes that PFTs are used for today, such as diagnosing1,2 and determining the efficacy of treatment1 of asthma, and the diagnosis2 and staging1,4 of Chronic Obstructive Pulmonary Disease (COPD).

In order to determine the operability of patients with early stage lung cancer, clinicians largely depend on the prediction of post-operative pulmonary function (ppo), which is calculated from the number of segments that will be removed during surgery.5 These estimates are relatively crude: they assume homogeneous perfusion of each lobe and do not compensate for changes in the volume of the remaining pulmonary tissue in the operated hemithorax.6 More precise lobe level lung function estimates could improve the risk assessment of pulmonary resection surgery.

Both spirometry and DLCO measurements are currently used for the risk assessment of pulmonary resection surgery7 and are advocated for use in the inclusion criteria for bronchoscopic lung volume reduction.8 Access to lobe level lung function estimates could improve the accuracy of these assessments.

In this work, we utilized an Inflated 3D ConvNet (I3D)9 backbone to construct the regional I3D (I3Dr) model for estimating lung function at the patient level and at the lobe level. I3Dr was trained and evaluated for estimating spirometry, namely Forced Expiratory Volume in one second (FEV1) and Forced Vital Capacity (FVC), as well as DLCO measurement outcomes using a large dataset of 12 045 CT volumes. It was designed to estimate the separate contributions of each pulmonary lobe to a patient’s total lung function, while requiring only CT scans and patient level lung function measurements for training.

We validated I3Dr through several experiments. We first explored the viability of the I3Dr model in a toy experiment by showing that, when only presented with a collection of digit images and their sum, it learned to assign correct values to individual digits. Next, we validated its ability to estimate lobe level quantities from CT while only training with patient level labels through estimating lobe level COVID-19 severity scores, Pulmonary Volume (PV), and Functional PV (FPV). Finally, we used the I3Dr model to estimate lobe level lung function quantities that sum up to patient level PFT estimates. The presented methods outperformed using FPV for PFT result estimation.

The main contributions of this work are as follows:

  1. We show that FEV1, FVC, and, to the best of our knowledge for the first time, DLCO measurement outcomes can be estimated from CT using machine learning.

  2. We propose and validate the I3Dr model for disentangling regional contributions from a patient level label and apply this model to estimate lobe level lung function.

  3. We make our models for PFT result estimation at the patient and lobe level publicly available to the research community.

2 |. RELATED WORK

2.1 |. Estimating regional contributions using machine learning

The presented method for lobe level lung function estimation was inspired by class activation maps (CAMs)10 and in particular by the Bag-of-local-Features (BagNet) model.11 Like most deep learning classification models, BagNet consists of a series of feature-extracting convolution layers followed by global average pooling and a linear layer that outputs the logits used for classification. By swapping the final pooling layer and the final linear layer of the model at test time, which does not change the model output,10,11 BagNet can produce interpretable heat maps. After swapping, each activation value produced by the linear layer can be interpreted as a regional output for the receptive field patch of the corresponding neuron.

BagNet has been used for a variety of medical image processing tasks in various modalities. It was deployed for sex and age prediction to generate heat maps for brain Magnetic Resonance Imaging (MRI) volumes12 and retinal images.13,14 In other work, BagNet was extended with a multiple instance learning (MIL) branch and trained to generate interpretable heat maps describing malignant and benign15 regions in histology images.

In our work, we exploited the principle of swapping the final pooling layer and linear layer of a model to produce lung function estimations at the pulmonary lobe level. Instead of producing heat maps, we designed a model in which the receptive fields of the output neurons of the final linear layer each contain one pulmonary lobe. Our method assigns lung function estimates to each pulmonary lobe, such that they sum up to an image level estimate.

2.2 |. PFT prediction

Earlier work has shown promising results for producing PFT results at a patient level using convolutional neural networks. For example, total lung volume has been estimated from chest radiographs16 and CT scans have been used for estimating spirometry test results.17 These methods did not produce lobe level estimates.

Various methods for lobe level lung function estimation that do not make use of the complex features in a CT scan have been introduced in previous works. The risk assessment guidelines for pulmonary resection surgery advise estimating the residual lung function of a patient with a simple calculation based on pulmonary segment counting.5 Lobe level lung function has furthermore been estimated using PV as obtained from a lung segmentation,18 as well as through FPV.19–23 Here, after obtaining the lobe level PV, dual thresholding operations on the CT scan are applied to acquire an estimation of the volume of functional parenchyma. These methods have been shown to outperform the segment counting method for the task of predicting PFT results after lung resection surgery.18,21 Post-operative PFT results have also been predicted using an estimation of lobar collapsibility,20 which is computed as the fraction of change in FPV between an inspiration and an expiration CT scan. Lastly, Single-Photon Emission Computed Tomography (SPECT)/CT images have been used to determine the lobe level FEV1 of a patient24 by computing the proportion of radioactivity in the lobe of interest with respect to the total radioactivity.

To the best of our knowledge, we are the first to propose a method for lobe level lung function estimation that makes use of the detailed information available in a CT scan and that does not require the patient to undergo additional imaging.

3 |. METHODS

3.1 |. I3Dr

Figure 1 shows the training and inference pipelines of the regional I3D (I3Dr) model proposed in this work, which consists of a patient level model A and a lobe level model B.

FIGURE 1.


Schematic representations of the training and evaluation pipelines for the I3Dr model, which combines two distinct machine learning models. (a) Training procedure for the model that learned to produce patient level lung function estimates from a CT scan (model A). First, a lung mask was applied to the CT scan. Next, features were extracted by the feature extractor fA(). The linear layer rA() was applied to obtain a patient level estimate. Loss was computed by comparing this estimate with the patient level reference to update fA() and rA(). (b) Training procedure for the model that learned to produce lobe level estimates (model B). First, a lobe mask was applied to the input CT scan. The segmented lobes were processed by the feature extractor fB() individually. During training, the feature vectors corresponding to each lobe were summed together. The resulting feature vector was presented to the linear layer rB() to produce a patient level estimate. Only a patient level reference was used to compute the loss and update fB() and rB(). (c) Inference pipeline for model B. During inference, the feature vectors corresponding to each lobe were not aggregated. Instead, they were presented to linear layer rB() individually to produce lobe level estimates that sum up to a patient level estimate. (d) To increase performance, the output of model A, that learned to produce patient level lung function estimates from a CT scan, was used to rescale the lobe level estimates produced by model B. CT, computed tomography.

3.1.1 |. Patient level model

First, we trained model A to map a CT scan x to a corresponding PFT result vector y. Model A consisted of a feature extractor fA() followed by a linear fully connected layer rA():

\hat{y}^A = W^A f^A(x) + b^A .

Here, the linear fully connected layer rA() is parameterized by weight matrix WA and bias bA. Figure 1a schematically presents this conventional approach.
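For concreteness, a minimal PyTorch sketch of this patient level model is shown below. The small 3D convolutional stack standing in for fA(), the feature dimension, and the number of outputs are illustrative assumptions only; the actual feature extractor used in this work is the I3D backbone described in Section 4.5.2.

```python
import torch
import torch.nn as nn

class PatientLevelModel(nn.Module):
    """Model A sketch: feature extractor f_A followed by a linear layer r_A."""

    def __init__(self, num_outputs: int, feat_dim: int = 64):
        super().__init__()
        # Placeholder 3D feature extractor; the paper uses an I3D backbone instead.
        self.f_A = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, feat_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        # r_A: y_hat = W_A f_A(x) + b_A
        self.r_A = nn.Linear(feat_dim, num_outputs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.r_A(self.f_A(x))

# Example: one lung-masked CT volume with shape (batch, channel, depth, height, width).
model_a = PatientLevelModel(num_outputs=7)        # e.g. PV, FPV, FEV1pre/post, FVCpre/post, DLCO
y_hat_a = model_a(torch.randn(1, 1, 64, 64, 64))  # shape (1, 7)
```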

3.1.2 |. Lobe level model

Next, we trained a model B that processed images of all anatomical pulmonary lobes of a patient in parallel to produce a patient level PFT result estimate. The model was designed so that it could produce meaningful lobe level lung function estimates during inference. More specifically, the training and evaluation pipelines were altered to force the model to output lobe level lung function estimates that sum up to a patient level estimate. Figures 1b and 1c show these modifications.

Training

To extract feature vectors per lobe, the lobes were segmented from the input CT scan and presented individually to feature extractor fB(). The resulting feature vectors were aggregated and presented to the final linear layer rB(). This process can be described as:

\hat{y}^B = W^B \left( \sum_{l=1}^{L} f^B(x_l) \right) + b^B .

Here, a linear fully connected layer rB() is parameterized by a weight matrix WB and bias bB. L is the number of regions for which the model produces estimates at test time. For the task of PFT result estimation, L = 5, the number of anatomical pulmonary lobes. Each x_l is a separate input region for which the model produces estimates. For PFT result estimation, x_l is the image of pulmonary lobe l of CT scan x.

Only a single feature extractor fB() was used for processing all five lobes. This sharing of weights allows fB() to take advantage of the common features of interest across different anatomical lobes.
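A minimal PyTorch sketch of the training-time forward pass of model B is given below, continuing the placeholder backbone from the previous sketch: a single shared fB() is applied to every lobe image, the resulting feature vectors are summed, and rB() is applied once to produce a patient level estimate. Layer sizes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LobeLevelModel(nn.Module):
    """Model B sketch: a shared feature extractor f_B and a linear layer r_B."""

    def __init__(self, num_outputs: int, feat_dim: int = 64):
        super().__init__()
        # Placeholder 3D feature extractor; the paper uses an I3D backbone instead.
        self.f_B = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, feat_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.r_B = nn.Linear(feat_dim, num_outputs)

    def forward_train(self, lobes: torch.Tensor) -> torch.Tensor:
        """Training: sum the per-lobe features, then apply r_B once.

        lobes has shape (L, 1, D, H, W), one masked image per anatomical lobe.
        """
        feats = self.f_B(lobes)                          # (L, feat_dim), weights shared across lobes
        return self.r_B(feats.sum(dim=0, keepdim=True))  # (1, num_outputs), patient level estimate

model_b = LobeLevelModel(num_outputs=7)
y_patient = model_b.forward_train(torch.randn(5, 1, 64, 64, 64))  # five lobes -> one patient estimate
```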

Inference

At test time, the summation operation and final linear layer rB() of model B are swapped, which does not change the model’s output.10,11 After swapping, each of the activation values produced by the final linear layer had a receptive field containing one pulmonary lobe. These values could therefore be interpreted as lobe level lung function estimates that sum up to a patient level PFT result estimate. Figure 1c shows this inference pipeline.

For an image of pulmonary lobe l of CT scan x, a lobe level estimate was obtained as follows:

\hat{y}_l^B = W^B f^B(x_l) + \frac{b^B}{L} .

These lobe level estimates could be aggregated again by the summation operation to produce the patient level estimate that the model produced before swapping the summation operation and final linear layer rB():

\hat{y}^B = \sum_{l=1}^{L} \hat{y}_l^B .
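Continuing the hypothetical LobeLevelModel sketch above, the swapped inference pass can be written as a function that applies rB() to each lobe feature vector and divides the bias over the L lobes, so that the lobe level estimates sum exactly to the training-time patient level output.

```python
import torch

def lobe_level_estimates(model_b, lobes: torch.Tensor) -> torch.Tensor:
    """Apply r_B per lobe, splitting the bias over the L lobes (sketch; uses the LobeLevelModel above)."""
    feats = model_b.f_B(lobes)                                   # (L, feat_dim)
    L = lobes.shape[0]
    return feats @ model_b.r_B.weight.T + model_b.r_B.bias / L   # (L, num_outputs)

# Equivalence check (illustrative): summing the lobe estimates reproduces the patient level estimate.
# lobes = torch.randn(5, 1, 64, 64, 64)
# assert torch.allclose(lobe_level_estimates(model_b, lobes).sum(dim=0),
#                       model_b.forward_train(lobes).squeeze(0), atol=1e-4)
```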

3.1.3 |. Combined model

We found in our experiments that model A (see Section 3.1.1) regularly outperformed model B (see Section 3.1.2) at the patient level. We therefore combined models A and B into one model that we refer to throughout this work as the I3Dr model. Its output was obtained by rescaling the lobe level estimates of model B with the ratio between the patient level outputs of models A and B as follows:

\hat{y}_l = \hat{y}_l^B \, \frac{\hat{y}^A}{\hat{y}^B} .

Figure 1d shows the complete inference pipeline for the I3Dr model.
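A sketch of this rescaling step, assuming the patient level estimate of model A and the lobe level estimates of model B are already available as tensors:

```python
import torch

def rescale_lobe_estimates(y_lobes_b: torch.Tensor, y_patient_a: torch.Tensor) -> torch.Tensor:
    """Rescale model B's lobe level estimates so they sum to model A's patient level output.

    y_lobes_b: (L, num_outputs) lobe level estimates from model B.
    y_patient_a: (num_outputs,) patient level estimate from model A.
    """
    y_patient_b = y_lobes_b.sum(dim=0)               # patient level estimate of model B
    return y_lobes_b * (y_patient_a / y_patient_b)   # y_l = y_l^B * y^A / y^B
```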

4 |. EXPERIMENTS

Since PFTs do not describe lobe level lung function, the approach described in Section 3.1 for lobe level lung function estimation cannot be validated directly. We therefore designed and performed several experiments where we do have access to local and global measurements to validate whether our models can actually produce meaningful regional (lobe level) output when only receiving feedback based on global (patient level) measurements during training. In our final and most comprehensive experiment, we trained a model with patient level PFT results to produce lobe level lung function estimates.

4.1 |. Summing digits

In this proof-of-concept experiment, we used 2D images of digits instead of images of pulmonary lobes, and sums of digits instead of patient level PFT results.

4.1.1 |. Dataset

We conducted this experiment with the MNIST dataset,25 which contains a training set of 60 000 and a test set of 10 000 images of digits of 28 × 28 pixels each.

4.1.2 |. Experiment design

For each of the models, the training set of digit images was randomly divided into collections or ‘bags’ of images prior to training. We used bag sizes of 2^i with i ∈ {0, …, 8}, that is, bag sizes from 1 to 256. Each bag was labeled with the sum of all the digits it contained. Figure 2a shows some input-label example pairs for a bag size of eight.
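As an illustration, the bag construction could be implemented as follows. The use of torchvision to load MNIST and the function and variable names are assumptions, since the paper only specifies the dataset and the bag sizes.

```python
import numpy as np
from torchvision import datasets  # assumption: torchvision is used to load MNIST

def make_bags(images: np.ndarray, labels: np.ndarray, bag_size: int, seed: int = 0):
    """Randomly partition digit images into bags and label each bag with the sum of its digits."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_bags = len(images) // bag_size
    idx = order[: n_bags * bag_size].reshape(n_bags, bag_size)
    return images[idx], labels[idx].sum(axis=1)   # (n_bags, bag_size, 28, 28) and (n_bags,)

mnist = datasets.MNIST(root="data", train=True, download=True)
bags, bag_sums = make_bags(mnist.data.numpy(), mnist.targets.numpy(), bag_size=8)
```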

FIGURE 2.


MNIST experiment to validate whether a model can implicitly learn the contribution of individual images to a collection level label. (a) Three examples of input-label pairs for this experiment. During training, the machine learning model is presented with a collection (bag) of images and a bag level label. Each bag is labeled with the sum of all the labels of the digits in that bag. The model only receives feedback based on these bag labels during training, and not based on the individual digit labels. (b) Performance on the test set of models trained to predict the sum of a bag of digits, evaluated on the task of predicting the value of a single digit for different bag sizes. (c) Individual digits from the test set were presented to the model trained for estimating the sum of a bag of eight MNIST digits. The model was evaluated for the task of predicting the individual label of each digit. Each row shows a histogram of the continuous model output for all images depicting a single target digit.

The models were trained with the methodology described in Section 3.1.2 to produce these bag level sums in a regression setting using 85% of the 60 000 MNIST training images. A validation set to monitor the performance on this task was constructed with the remaining 15%. No individual digit labels were presented to provide feedback to the model during training, except when training the model with a bag size of one. After training, the models were evaluated on their performance for estimating individual digit labels using the 10 000 test images. For this experiment, the rescaling described in Section 3.1.3 was not performed.

4.2 |. COVID-19 severity scoring

Subsequently, to validate whether our approach could be used for estimating meaningful lobe level quantities using only patient level labels, we trained two models to estimate COVID-19 CT Severity Scores (CTSSs).26 The CTSS indicates the severity of a COVID-19 infection for individual lobes. In clinical practice, these lobe-wise scores are summed up to a patient-level CTSS. This summing is analogous to how many patient-level PFT results are the sum of the contributions of lobe level quantities.

4.2.1 |. Dataset

To develop and test models for CTSS estimation, we adopted the internal dataset from N. Lessmann et al.27 This dataset contains 482 CT scans for which each of the pulmonary lobes was scored with a CTSS by a radiologist. The dataset was split at the patient level into a training set (293 scans), a validation set (84 scans), and a test set (105 scans). The test set used in this work was identical to the test set used by N. Lessmann et al.27

4.2.2 |. Experiment design

First, a baseline model was trained for the task of producing lobe level CTSS estimates from images of individual pulmonary lobes to obtain a performance limit for lobe level CTSS estimation. This model was trained in a fully supervised fashion using the conventional methodology described in Section 3.1.1.

Furthermore, we trained an I3Dr model using the methodology described in Section 3.1. Only patient level CTSS labels were used for training this model.

Both models were evaluated on the task of producing lobe level CTSS labels.

4.3 |. Lung function estimation

Lastly, we trained an I3Dr model to jointly produce patient level PV, FPV, and PFT results from CT scans.

The PV was estimated by taking the volume of a lung mask. The FPV is the PV from which emphysematous regions, large blood vessels and airways, and dense lesions such as fibrosis are excluded, so that only the functional parenchyma remains. It was computed as the volume of the voxels within the lung mask with attenuation values in the range [−950, −500] HU. The resulting subject level PV and FPV measures were used as reference targets for training the I3Dr model.
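A NumPy sketch of these two reference measures, assuming a CT volume in Hounsfield units, a binary lung (or lobe) mask, and the voxel spacing in mm are available; the helper name and the exact handling of the mask are illustrative.

```python
import numpy as np

def pv_and_fpv(ct_hu: np.ndarray, mask: np.ndarray, spacing_mm: tuple) -> tuple:
    """Pulmonary volume (PV) and functional pulmonary volume (FPV) in litres.

    ct_hu: CT volume in Hounsfield units; mask: binary lung or lobe mask;
    spacing_mm: voxel spacing in mm along each axis.
    """
    voxel_volume_l = float(np.prod(spacing_mm)) / 1e6         # mm^3 per voxel -> litres
    lung = mask.astype(bool)
    pv = lung.sum() * voxel_volume_l
    functional = lung & (ct_hu >= -950) & (ct_hu <= -500)     # functional parenchyma thresholds
    fpv = functional.sum() * voxel_volume_l
    return pv, fpv
```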

Lobe level PV and FPV values were computed in a similar fashion, but using segmentations of individual lobes instead of lung masks. These lobe level PV and FPV measures were not used during training. They were only used as ground truth for evaluating the model.

The PFT results that the model was trained to estimate were DLCO as well as spirometry measurements, namely FEV1 and FVC. In this work, spirometry measurements that were performed pre- and post-bronchodilator are indicated with superscripts ‘pre’ and ‘post’ respectively.

After training, the performance of the I3Dr model for predicting PFT results was compared to using FPV for PFT result estimation.

4.3.1 |. Dataset

For this experiment, we used inspiration CT scans from the COPDGene study.28 Data for this study were collected from 21 imaging centers in the United States.

The COPDGene dataset was divided randomly into separate sets for training, validation, and testing. This split was performed at a subject level to ensure that the model performance on the test set is not tainted due to overfitting on subject-specific information. An overview of the numbers of CT scans, patients, and measurements used in this work can be found in Table 1.

TABLE 1.

Number of CT Images (and Subjects) in the COPDGene Dataset used for the lung function estimation.

Split   Total            FEV1pre          FEV1post         FVCpre           FVCpost          DLCO

Train   8 433 (6 023)    8 415 (6 020)    8 356 (5 987)    8 414 (6 020)    8 355 (5 987)    2 350 (2 350)
Val     1 775 (1 271)    1 772 (1 271)    1 760 (1 262)    1 772 (1 271)    1 760 (1 262)    496 (496)
Test    1 837 (1 304)    1 831 (1 304)    1 819 (1 298)    1 831 (1 304)    1 819 (1 298)    500 (500)
Total   12 045 (8 598)   12 018 (8 595)   11 935 (8 547)   12 017 (8 595)   11 934 (8 547)   3 346 (3 346)

Note: In this dataset, DLCO measurements were not available as often as spirometry measurements.

Abbreviations: COPD, chronic obstructive pulmonary disease; CT, computed tomography; DLCO, diffusion capacity of carbon mono-oxide; FEV1, forced expiratory volume in one second; FVC, forced vital capacity.

The PV and FPV reference were computed using the lung masks available in this dataset.

4.3.2 |. Experiment design

An I3Dr model was trained using all CT scans in the COPDGene training set (see Table 1) to produce patient level PV, FPV, FEV1pre, FEV1post, FVCpre, FVCpost, and DLCO estimates. The performance of the model for producing these patient level labels was evaluated, as well as its performance for producing lobe level PV and FPV.

4.4 |. CT scan preprocessing

Prior to presenting the CT scans to a model, they were clipped between −1100 and 300 HU and the voxel values were scaled to the range [0,1]. After this, the CT scans were isotropically resampled to 1.6 mm3 with linear interpolation. A lobe segmentation was used to mask out all voxels outside of the region of interest. This region was either a lung mask or a mask of a single lobe, depending on the model being trained. The resulting volumes were centered around the region of interest and cropped to 240 × 240 × 240 voxels.
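A sketch of this preprocessing chain using SciPy is given below. The interpretation of 1.6 mm as the target isotropic voxel spacing, the crop-with-padding helper, and the function signature are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(ct_hu: np.ndarray, spacing_mm: np.ndarray, roi_mask: np.ndarray,
                  target_spacing: float = 1.6, size: int = 240) -> np.ndarray:
    """Clip and scale intensities, resample isotropically, mask the ROI, and crop a cube around it."""
    x = np.clip(ct_hu, -1100, 300).astype(np.float32)
    x = (x + 1100.0) / 1400.0                                      # scale [-1100, 300] HU to [0, 1]
    factors = np.asarray(spacing_mm, dtype=float) / target_spacing
    x = zoom(x, factors, order=1)                                  # linear interpolation for the image
    m = zoom(roi_mask.astype(np.float32), factors, order=0) > 0.5  # nearest neighbour for the mask
    x = x * m                                                      # mask out voxels outside the ROI
    centre = np.round(np.array(np.nonzero(m)).mean(axis=1)).astype(int)
    pad = size // 2
    xp = np.pad(x, pad)                                            # zero-pad so the crop stays in bounds
    c = centre + pad
    return xp[c[0] - pad:c[0] + pad, c[1] - pad:c[1] + pad, c[2] - pad:c[2] + pad]
```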

For the COPDGene dataset, the lobe segmentations provided with this dataset were used for masking and cropping the CT scans. These lobe segmentations had been automatically generated by commercial software (LungQ, Thirona, Nijmegen, The Netherlands) and were manually corrected by trained analysts with at least 1 year of experience in annotating pulmonary structures on CT.

RTSU-Net29 was used to obtain lobe segmentations for masking and cropping the CT scans of the CTSS dataset from N. Lessmann et al.27

4.5 |. Training details

Training models for this work was done on a single GPU, using NVIDIA GeForce GTX TITAN X, GeForce GTX 1080, GeForce GTX 1080 Ti, GeForce RTX 2080 Ti, and TITAN Xp graphics cards.

4.5.1 |. Summing digits

For the experiments conducted with the MNIST dataset, the images were resized to 33 × 33 pixels, and the BagNet-33 architecture11 was used as a feature extractor. The models were trained to minimize the L1 loss between their bag level predictions and the bag level labels using the Adam optimizer (β1=0.9, β2=0.999) with a learning rate of 0.001 and a batch size of 512 divided by the bag size. The models were trained for 1 000 epochs and performance on the validation set was monitored after each epoch. The model weights with the best performance on the validation set were used for evaluation.

4.5.2 |. CTSS and PFT result estimation

For CTSS and PFT result estimation, we modeled the feature extractors fA() and fB() with the I3D,9 without its final layer. Because pre-training has been shown to be advantageous for processing medical images with convolutional neural networks,30 we initialized our model with publicly available weights trained for Red Green Blue (RGB) video classification.9

All models were trained using the Adam optimizer (β1=0.9, β2=0.999) to minimize the L1 loss with a learning rate of 1 × 10−4, weight decay of 1 × 10−2, and a batch size of two. Early stopping was used with a patience of 15 epochs.

ALGORITHM 1.

Gradient computation with checkpointing

Input: set of lobe images in the training batch X := {x_i}_{i=1}^{N}; set of virtual mini-batches M that is a partition of the indices of X; set of CT scans S that is a partition of the indices of X; loss function 𝓛(·); feature extractor f^B(·) with learnable parameters φ; linear function r^B(·) with learnable parameters θ; mapping Q(i) that maps the index i of a lobe image to the CT scan s ∈ S that contains it.
Output: ∇r^B, ∇f^B
1:  for all m ∈ M do
2:    h_i ← f^B(x_i) for all i ∈ m   ▷ without gradient computation
3:  end for
4:  z_s ← Σ_{i ∈ s} h_i for all s ∈ S
5:  o_s ← r^B(z_s) for all s ∈ S
6:  ∇r^B ← ∂(Σ_{s ∈ S} 𝓛(o_s)) / ∂θ
7:  ∇f^B ← 0
8:  for all m ∈ M do
9:    h̃_i ← f^B(x_i) for all i ∈ m
10:   z̃_i ← h̃_i + Σ_{j ∈ Q(i), j ≠ i} h_j for all i ∈ m
11:   ∇f^B ← ∇f^B + Σ_{i ∈ m} (∂𝓛(o_{Q(i)}) / ∂z_{Q(i)}) (∂z̃_i / ∂φ)
12: end for
13: return ∇r^B, ∇f^B

Data augmentation was used to increase training data diversity without altering relevant image features. The augmentations consisted of rotation, translation, and shearing in the axial plane, elastic deformation, and adding Gaussian noise. To accommodate the RGB input format of the I3D model, the CT scans were copied along the channel dimension after applying data augmentation.
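A sketch of comparable augmentations with SciPy is shown below; the parameter ranges are assumptions, and the shearing and elastic deformation used in this work are omitted for brevity.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(volume: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random in-plane rotation, translation and Gaussian noise; shearing and elastic deformation omitted."""
    angle = rng.uniform(-10.0, 10.0)                                # assumed range, in degrees
    v = rotate(volume, angle, axes=(1, 2), reshape=False, order=1)  # rotation in the axial plane
    v = shift(v, rng.uniform(-5.0, 5.0, size=3), order=1)           # random translation, in voxels
    v = v + rng.normal(0.0, 0.01, size=v.shape)                     # additive Gaussian noise
    return np.repeat(v[np.newaxis], 3, axis=0)                      # copy to three channels for the RGB I3D input

# Example: augmented = augment(preprocessed_volume, np.random.default_rng(0))
```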

All CT scan labels were divided by their mean standard deviation in the training set before computing loss. This was done to avoid unbalanced penalties due to the differences in magnitude distributions of different measurement types. For input CT scans for which not all labels were available, the loss was computed using only the available labels.
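A sketch of this loss computation, interpreting the scaling as a division of each label type (and the corresponding prediction) by its standard deviation over the training set, and marking unavailable measurements as NaN:

```python
import torch

def masked_scaled_l1(pred: torch.Tensor, target: torch.Tensor, label_std: torch.Tensor) -> torch.Tensor:
    """L1 loss on labels scaled by their training-set standard deviation, ignoring missing (NaN) labels."""
    pred_s, target_s = pred / label_std, target / label_std
    valid = ~torch.isnan(target_s)                 # only measurements that are actually available
    return (pred_s[valid] - target_s[valid]).abs().mean()
```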

To compute the loss for a single CT scan when training a model with the methodology presented in Section 3.1.2, the feature extractor fB() would conventionally process a minimal batch size equal to the number of lobes in a CT scan. This minimum batch size of five was too large to fit on a 12GB GPU. To circumvent this problem, we applied the gradient checkpointing method described in Algorithm 1. We did not pass a full batch of lobe images through fB() when training these models. Instead, we further divided the training batch X comprised of lobe images xi into a set of virtual mini-batches M, such that each virtual mini-batch mM contained two lobe images. These virtual mini-batches were passed through the feature extractor separately without gradient computation (lines 1–3). Subsequently, for each CT scan s from the set S of CT scans with which the training batch X was constructed, the features extracted from the lobes of that CT scan were aggregated and cached (line 4). The aggregated features were processed by final linear layer rB() (line 5) and the gradient for rB() was computed (line 6). Lastly, the gradient fB() was aggregated by computing it for each virtual mini-batch separately (lines 7–12). More specifically, for each virtual mini-batch, the features extracted by fB() were re-computed, this time storing the computational graph (line 9). The features for this virtual mini-batch were aggregated with the features cached in the original forward pass (line 10) in order to continue back-propagation through fB() (line 11).
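A PyTorch sketch of this two-pass scheme is given below, reusing the hypothetical fB()/rB() modules from the earlier sketches. It exploits the fact that, because each z_s is a plain sum of lobe features, the gradient of the loss with respect to a lobe feature equals its gradient with respect to the corresponding z_s.

```python
import torch

def checkpointed_backward(f_B, r_B, lobes, scan_of, targets, mini_batches, loss_fn):
    """Accumulate gradients for f_B and r_B without keeping all lobe activations in memory.

    lobes: list of lobe tensors of shape (1, D, H, W); scan_of[i]: index of the CT scan
    containing lobe i; targets: dict mapping scan index to reference vector;
    mini_batches: list of lists of lobe indices (e.g. two lobes per virtual mini-batch).
    """
    # Pass 1: per-lobe features without gradients, aggregated per CT scan.
    with torch.no_grad():
        feats = [f_B(x.unsqueeze(0)).squeeze(0) for x in lobes]
    z = {}
    for i, h in enumerate(feats):
        z[scan_of[i]] = z.get(scan_of[i], 0) + h
    z = {s: v.detach().requires_grad_(True) for s, v in z.items()}

    # Loss over all scans: populates the gradients of r_B and of each aggregated feature z_s.
    loss = sum(loss_fn(r_B(z_s.unsqueeze(0)).squeeze(0), targets[s]) for s, z_s in z.items())
    loss.backward()

    # Pass 2: recompute the features of each virtual mini-batch with the graph and chain the
    # cached gradients through f_B; dL/dh_i equals dL/dz_{scan_of[i]} because z_s is a plain sum.
    for m in mini_batches:
        h = f_B(torch.stack([lobes[i] for i in m]))
        h.backward(torch.stack([z[scan_of[i]].grad for i in m]))
```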

4.6 |. Evaluation

The correlation between FPV and PFT measurements and between model output and PFT measurements was computed on the test set using the Pearson correlation coefficient.

Following N. Lessmann et al.,27 the agreement between the CTSS labels and the output of the trained models was evaluated in terms of linearly weighted κ.

The performance of the models trained in this work was additionally evaluated in terms of MAE. 95% confidence intervals for performance measures were computed as the interval between the 2.5% and 97.5% percentiles of a bootstrap distribution generated with 1 000 iterations.31

All p-values were obtained using standard permutation tests for matched pairs31 with 10 000 iterations.
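For illustration, the two procedures could be implemented as follows for the MAE; the array names and helper signatures are assumptions.

```python
import numpy as np

def bootstrap_ci_mae(abs_errors: np.ndarray, n_iter: int = 1000, seed: int = 0) -> np.ndarray:
    """95% bootstrap confidence interval for the mean absolute error."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(abs_errors, size=len(abs_errors), replace=True).mean() for _ in range(n_iter)]
    return np.percentile(means, [2.5, 97.5])

def paired_permutation_p(err_a: np.ndarray, err_b: np.ndarray, n_iter: int = 10000, seed: int = 0) -> float:
    """Permutation test for matched pairs: randomly swap which model each case's error is assigned to."""
    rng = np.random.default_rng(seed)
    diffs = err_a - err_b                                    # per-case paired differences
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_iter, len(diffs)))
    permuted = np.abs((signs * diffs).mean(axis=1))
    return float((permuted >= observed).mean())
```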

5 |. RESULTS

5.1 |. Summing digits

The models trained to predict the sum of a bag of MNIST digits using the methodology described in Section 3.1.2 were evaluated for the task of predicting the value of a single digit. Figure 2b shows that when increasing the bag size, performance decreased slowly in terms of MAE and accuracy for bag sizes up to 16. For larger bag sizes, performance decreased more rapidly. For small bag sizes, the models generally produced output within a narrow band around the target quantity. Figure 2c shows this for a bag size of eight.

5.2 |. CTSS scoring

The I3D model that was trained to produce lobe level CTSS labels from individual lobe images was evaluated for this task. It obtained a linearly weighted κ of 0.565, 95% CI: (0.520, 0.609), indicating moderate agreement, which was similar to the linearly weighted κ of 0.54 reported by N. Lessmann et al.27

The I3Dr model had a more challenging task. It was evaluated for predicting these same lobe-level CTSS labels, but was trained using only patient level CTSS scores. It obtained a linearly weighted κ of 0.491, 95% CI: (0.448, 0.534), which indicates moderate agreement with the CTSS labels from the test set.

5.3 |. Lung function estimation

Table 2 shows the performance of the I3Dr model with and without the rescaling step (see, Section 3.1.3) at the patient level and for the different anatomical lobes, namely the left upper lobe (LUL), left lower lobe (LLL), right upper lobe (RUL), right lower lobe (RLL) and the right middle lobe (RML).

TABLE 2.

Model performance in MAE (in mL/min/mm Hg for DLCO and in L for other measures) at the patient level and for each of the anatomical pulmonary lobes: LUL, LLL, RUL, RLL, RML.

Model                 Level     PV      FPV

I3Dr (no rescaling)   LUL       0.055   0.063
                      LLL       0.059   0.055
                      RUL       0.057   0.065
                      RLL       0.060   0.058
                      RML       0.037   0.041
                      Patient   0.124   0.133
I3Dr                  LUL       0.053   0.053
                      LLL       0.045   0.051
                      RUL       0.065   0.078
                      RLL       0.046   0.054
                      RML       0.032   0.037
                      Patient   0.053   0.118

Abbreviations: DLCO, diffusion capacity of carbon mono-oxide; FPV, functional pulmonary volume; LLL, left lower lobe; LUL, left upper lobe; MAE, mean absolute error; PV, pulmonary volume; RLL, right lower lobe; RML, right middle lobe; RUL, right upper lobe.

For both PV and FPV, the I3Dr model was able to produce meaningful lobe level estimates, even though it was only trained with patient level labels. The rescaling step in the I3Dr model increased performance.

Table 3 shows the MAE of the I3Dr model for the task of PFT result estimation. Figure 3 shows the correlation between the I3Dr output and PFT results for the COPDGene test set. In terms of the Pearson correlation coefficient, the correlation between the I3Dr output and PFT test results was substantially better for all PFT types than the correlation between functional lung volume and PFT test results. The PFT measurements were better correlated with the rescaled I3Dr output than with the I3Dr output before the rescaling step (p<0.001 for all PFT types). Figure 4 shows representative qualitative PFT estimation results of the I3Dr model.

TABLE 3.

Model performance in MAE (in mL/min/mm Hg for DLCO and in L for other measures) at the patient level.

Model FVCpre FVCpost FEV1pre FEV1post DLCO

I3Dr (no rescaling) 0.423 0.411 0.357 0.347 3.336
I3Dr 0.388 0.377 0.307 0.297 2.800

Abbreviations: DLCO, diffusion capacity of carbon mono-oxide; FEV1, forced expiratory volume in one second; FVC, forced vital capacity; MAE, mean absolute errors.

FIGURE 3.


Correlation between PFT results from the test set and FPV (top row), the corresponding output of the I3D model (middle row), and the corresponding output of the I3Dr model (bottom row). Pearson correlation coefficients with 95% confidence intervals are shown above each plot. DLCO is shown in mL/min/mm Hg. All other measures are shown in L. DLCO, diffusion capacity of carbon mono-Oxide; FPV, functional pulmonary volume; PFT, pulmonary function test.

FIGURE 4.


Qualitative results of the I3Dr model for four randomly selected CT scans from the test set. The left images show central, evenly spaced coronal slices of the input CT scan cropped to the lobes, as well as a voxel-wise annotation of the pulmonary lobes. The bar plots on the right show the corresponding PFT measurement values in gray and the lobar lung function estimations in the colors corresponding to the lobe segmentation. DLCO is shown in mL/min/mm Hg. All other PFT measurement values are shown in L. CT, computed tomography; DLCO, diffusion capacity of carbon mono-Oxide; PFT, pulmonary function test.

6 |. DISCUSSION

In this paper, we introduced a method that can estimate global measures (lung function) and can also estimate the contributions of individual parts to this global measure (in this case lobar contributions to lung function). Especially for DLCO, the results are promising. Recommendations state that when measuring DLCO, there should be at least two acceptable tests that meet the repeatability requirement of either being within 3 mL CO (STPD: standard temperature 0 °C, standard pressure 760 mm Hg, dry, i.e., PH2O = 0 mm Hg)/min/mm Hg (or 1 mmol·min−1·kPa−1) of each other or within 10% of the highest value.3 When regarding the ground truth PFT measurement and the corresponding output of the I3Dr model as the two acceptable tests, the I3Dr model meets this repeatability requirement in 64% of the cases in the test set.
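Our reading of this criterion, as used to compute the 64% figure, can be written as a small check in which the measured DLCO and the model estimate play the role of the two acceptable tests:

```python
def dlco_repeatable(measured: float, estimated: float) -> bool:
    """Repeatability criterion: within 3 mL/min/mm Hg of each other, or within 10% of the highest value."""
    diff = abs(measured - estimated)
    return diff <= 3.0 or diff <= 0.1 * max(measured, estimated)
```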

In our evaluation of the I3Dr model, we found that the PFT results correlated substantially better with the output of the I3Dr model than with FPV. The latter has been used as a proxy for lobe level PFT results in previous research.19–23 The I3Dr model may be viable for directly producing lobe level lung function estimations from CT.

The I3Dr model presented in this work has some limitations. First, the I3Dr model was both trained and evaluated with CT scans and PFT results from the COPDGene study.28 Despite the scope of the COPDGene study, encompassing data from 21 imaging centers in the United States, this training data may not be representative for patients with different pathologies and/or demographics. Consequently, this could affect the efficacy of the I3Dr model when applied to populations outside the study’s demographic or those with different lung pathologies. The generalization ability of our model may be quantified in future work through external validation of the publicly available algorithm.

Furthermore, a direct lobe level evaluation for spirometry and DLCO results was not possible, since these measurements can only be performed at a subject level. The I3Dr’s ability to produce lobe-level CTSS scores, PV, and FPV, coupled with its performance in predicting subject level spirometry and DLCO results, suggests that it is a promising approach for estimating lobe-level lung function measures.

Lastly, we only trained the I3Dr model with inspiratory CT scans, since expiratory CT scans are not always available in clinical practice. Presenting both inspiratory and expiratory CT scans as input may increase performance for spirometry test result estimation. This might also allow the I3Dr model to indirectly capture interplay between lobes, such as the decreased inflation of one lobe due to the hyperinflation of another.

Incorporating a patient level model in the I3Dr inference pipeline as described in Section 3.1.3 increased patient level performance. A reason for this increase could be that the patient level model can model interactions between lobes, which is not possible when processing the images of each anatomical lobe separately. In our work, the I3Dr model was applied for lobe level lung function estimation. It could trivially be extended to estimate lung function per pulmonary segment when a segment segmentation is available. The I3Dr model might also be applied for determining functional measures of other organs that can be divided into separate regions with similar functionality such as the liver.

7 |. CONCLUSION

In this work, we conducted several experiments to validate the ability of the presented I3Dr model to produce meaningful regional labels, while being trained with only patient level labels.

First, we performed a proof of concept to test whether a model could accurately predict the label of individual 2D digit images when trained with sets of images and their sums. We showed that up to sets of 32 digits, the results were nearly flawless.

Our next experiments showed that our methodology for implicitly learning regional quantities can also be applied to regression from CT scans. We trained I3Dr models to produce meaningful lobe level quantities from CT, while only using patient level labels during training. In this setting, the I3Dr model was able to estimate a lobe level CTSS, PV, and FPV from CT.

After validating that the I3Dr model could estimate meaningful lobar quantities from CT, we found that it is able to also estimate patient level PFT results. Overall, we found that I3Dr can estimate global measures from an image, as well as the contributions of individual parts of the image to this global measure. I3Dr offers a promising approach for estimating PFT results from CT scans and disentangling the individual contribution of pulmonary lobes to a patient’s lung function.

We hope that the findings presented in this work may advance the use of CT in screening, diagnosis, and staging of restrictive pulmonary diseases as well as in risk assessment for pulmonary resection surgery and bronchoscopic lung volume reduction.

ACKNOWLEDGMENTS

This work was supported by the European Regional Development Fund, as well as by NHLBI U01 HL089897 and U01 HL089856. The COPDGene study (NCT00608764) is also supported by the COPD Foundation through contributions made to an Industry Advisory Committee comprised of AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion.

Funding information

European Regional Development Fund; National Heart, Lung, and Blood Institute, Grant/Award Numbers: U01 HL089897, U01 HL089856; COPD Foundation (COPDGene study, NCT00608764)

Footnotes

CONFLICT OF INTEREST STATEMENT

The authors have no relevant conflicts of interest to disclose.

REFERENCES

1. Derom E, Van Weel C, Liistro G, et al. Primary care spirometry. Eur Respir J. 2008;31:197–203.
2. Miller MR, Hankinson J, Brusasco V, et al. Standardisation of spirometry. Eur Respir J. 2005;26:319–338.
3. Macintyre N, Crapo R, Viegi G, et al. Standardisation of the single-breath determination of carbon monoxide uptake in the lung. Eur Respir J. 2005;26:720–735.
4. Gross NJ. The gold standard for chronic obstructive pulmonary disease. Am J Respir Crit Care Med. 2001;163:1047–1048.
5. Brunelli A, Charloux A, Bolliger CT, et al. ERS/ESTS clinical guidelines on fitness for radical therapy in lung cancer patients (surgery and chemo-radiotherapy). Eur Respir J. 2009;34:17–41.
6. Sengul AT, Sahin B, Celenk C, Basoglu A. Postoperative lung volume change depending on the resected lobe. Thorac Cardiovasc Surg. 2013;61:131–137.
7. Brunelli A, Kim AW, Berger KI, Addrizzo-Harris DJ. Physiologic evaluation of the patient with lung cancer being considered for resectional surgery: diagnosis and management of lung cancer: American College of Chest Physicians evidence-based clinical practice guidelines. Chest. 2013;143:e166S–e190S.
8. Flandes J, Soto FJ, Cordovilla R, Cases E, Alfayate J. Bronchoscopic lung volume reduction. Clin Chest Med. 2018;39:169–180.
9. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:6299–6308.
10. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:2921–2929.
11. Brendel W, Bethge M. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In: International Conference on Learning Representations (ICLR); 2019.
12. Pawlowski N, Glocker B. Is texture predictive for age and sex in brain MRI? arXiv preprint arXiv:1907.10961, 2019.
13. Ilanchezian I, Kobak D, Faber H, Ziemssen F, Berens P, Ayhan MS. Interpretable gender classification from retinal fundus images using BagNets. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021:477–487.
14. Hassan ON, Menten MJ, Bogunovic H, Schmidt-Erfurth U, Lotery A, Rueckert D. Deep learning prediction of age and sex from optical coherence tomography. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE; 2021:238–242.
15. Paschali M, Naeem MF, Simson W, Steiger K, Mollenhauer M, Navab N. Deep learning under the microscope: improving the interpretability of medical imaging neural networks. arXiv preprint arXiv:1904.03127, 2019.
16. Sogancioglu E, Murphy K, Scholten ET, Boulogne LH, Prokop M, van Ginneken B. Automated estimation of total lung volume using chest radiographs and deep learning. arXiv preprint arXiv:2105.01181, 2021.
17. Argus M, Schaefer-Prokop C, Lynch DA, van Ginneken B. Function follows form: regression from complete thoracic computed tomography scans. arXiv preprint arXiv:1909.12047, 2019.
18. Fernández-Rodríguez L, Torres I, Romera D, et al. Prediction of postoperative lung function after major lung resection for lung cancer using volumetric computed tomography. J Thorac Cardiovasc Surg. 2018;156:2297–2308.
19. Liu F, Han P, Feng G-s, et al. Using quantitative CT to predict post-operative pulmonary function in patients with lung cancer. Chin Med J. 2005;118:742–746.
20. Yabuuchi H, Kawanami S, Kamitani T, et al. Prediction of post-operative pulmonary function after lobectomy for primary lung cancer: a comparison among counting method, effective lobar volume, and lobar collapsibility using inspiratory/expiratory CT. Eur J Radiol. 2016;85:1956–1962.
21. Bolliger CT, Gückel C, Engel H, et al. Prediction of functional reserves after lung resection: comparison between quantitative computed tomography, scintigraphy, and anatomy. Respiration. 2002;69:482–489.
22. Sverzellati N, Chetta A, Calabrò E, et al. Reliability of quantitative computed tomography to predict postoperative lung function in patients with chronic obstructive pulmonary disease having a lobectomy. J Comput Assist Tomogr. 2005;29:819–824.
23. Wu M-T, Pan H-B, Chiang AA, et al. Prediction of postoperative lung function in patients with lung cancer: comparison of quantitative CT with perfusion scintigraphy. Am J Roentgenol. 2002;178:667–672.
24. Nomori H, Cong Y, Sugimura H. Systemic and regional pulmonary function after segmentectomy. J Thorac Cardiovasc Surg. 2016;152:747–753.
25. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86:2278–2324.
26. Chang Y-C, Yu C-J, Chang S-C, et al. Pulmonary sequelae in convalescent patients after severe acute respiratory syndrome: evaluation with thin-section CT. Radiology. 2005;236:1067–1075.
27. Lessmann N, Sánchez CI, Beenen L, et al. Automated assessment of COVID-19 reporting and data system and chest CT severity scores in patients suspected of having COVID-19 using artificial intelligence. Radiology. 2021;298:E18–E28.
28. Regan EA, Hokanson JE, Murphy JR, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD. 2010;7:32–43.
29. Xie W, Jacobs C, Charbonnier J-P, van Ginneken B. Relational modeling for robust and efficient pulmonary lobe segmentation in CT scans. IEEE Trans Med Imaging. 2020;39:2664–2675.
30. Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. arXiv preprint arXiv:1902.07208, 2019.
31. Moore DS, McCabe GP, Craig BA. Introduction to the Practice of Statistics. 6th ed. W. H. Freeman and Company; 2009.
