Design decisions
The whole Madrid dataset was randomly divided into four subsets to conduct a fourfold cross-validation training strategy and select the best model (by F1-score) for each of the three model types. We ensured a similar number of mortality cases in each split, and the same four-way split was used for all experiments. Once we had identified the best hyperparameters from the fourfold cross-validation tuning experiments, we trained each of the three models on all of the Madrid data. The final models were then validated on the two external datasets (Hoboken and Seoul). See supplementary materials for the details of model training and selection for the EHR, CXR, and fusion models. All models were trained on Google Cloud TPUs via Colab notebooks. Code for training with both paid and free TPUs is available. Software packages used were tensorflow==2.4.1, sklearn-pandas==1.8.0, and xgboost==0.90. To ensure repeatability, a random seed of 2020 was used for all experiments.
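As a concrete illustration of this split, the sketch below uses scikit-learn's StratifiedKFold to build a fourfold split that keeps the number of mortality cases similar across folds; the DataFrame madrid_df and its mortality_30d column are hypothetical stand-ins, and the seed of 2020 mirrors the repeatability choice above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for the Madrid cohort: one feature plus a binary 30-day mortality label.
rng = np.random.default_rng(2020)
madrid_df = pd.DataFrame({
    "age": rng.integers(20, 95, size=400),
    "mortality_30d": rng.binomial(1, 0.15, size=400),  # ~15% positive rate, illustrative only
})

# StratifiedKFold keeps the number of positive (expired) cases similar in each fold;
# the fixed seed mirrors the paper's use of 2020 for repeatability.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2020)
for fold_idx, (train_idx, valid_idx) in enumerate(
        skf.split(madrid_df, madrid_df["mortality_30d"])):
    n_pos = madrid_df["mortality_30d"].iloc[valid_idx].sum()
    print(f"fold {fold_idx}: {len(train_idx)} train / {len(valid_idx)} valid, "
          f"{n_pos} positive cases in validation")
```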
Reasons |
In this setup, for each experiment, three folds are combined and used for training and the fourth fold is used for validation, so that each data subset is used once to validate models. Model parameters that perform best on one validation subset might just be “lucky.” Rotating the validation subset and picking the model parameters that perform best on average across all subsets helps select a model that has hopefully learned more reliable features and may generalize better to external validation sets. A similar number of positive mortality cases (expired patients) in each split makes the validation folds more comparable in difficulty. We had to use the whole Madrid dataset for model development (training and validation); otherwise, the number of positive cases (mortality) would have been too small for tuning. We used only open-source Python packages so that others can easily reuse and build on our work without cost barriers.
EHR-based model |
Design decisions |
Four different types of machine learning algorithms (logistic regression, random forest, gradient boosting, and XGBoost) implemented via scikit-learn 0.21 were tuned to select the best EHR-based model. A randomized grid-search method was used to sample different hyperparameter settings from prespecified ranges for each optimization experiment, as shown in Supplementary Table 3.
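A minimal sketch of how one of these tabular models could be tuned with randomized search is shown below; the parameter ranges and iteration count are placeholders, not the values in Supplementary Table 3, and X_ehr / y_mortality are hypothetical feature and label arrays.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Illustrative search ranges only; the actual ranges are listed in Supplementary Table 3.
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
    "class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=2020)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2020),
    param_distributions=param_distributions,
    n_iter=20,        # number of sampled hyperparameter settings (placeholder)
    scoring="f1",     # model selection by F1-score, as described above
    cv=cv,
    random_state=2020,
)
# search.fit(X_ehr, y_mortality)            # hypothetical EHR features and mortality labels
# print(search.best_params_, search.best_score_)
```

The same pattern would be repeated for logistic regression, gradient boosting, and XGBoost, keeping whichever search yields the best cross-validated F1-score.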
|
Reasons |
The goal of this modeling is not causal analysis but simply to select the model that performs best for the given dataset and prediction task. We picked the four most common machine learning algorithms suitable for modeling tabular data and tuned their hyperparameters.
CXR-based model |
Step 1: Online (real-time) image augmentation during training |
Design decisions |
CXR images were randomly flipped about the vertical axis (left–right) and brightness-adjusted (0–0.05). A random set of CXRs used for training the model, with these online augmentations applied together with the preprocessed anatomical Bbox augmentation, is illustrated in Fig. 2. Both the online and offline augmentations were used only during training, not during internal or external validation of models.
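A hedged sketch of these online augmentations using tf.image is shown below; it assumes images are float tensors scaled to [0, 1], and the brightness delta of 0.05 is taken from the stated range.

```python
import tensorflow as tf

def augment_cxr(image):
    """Online augmentation applied to training images only (a sketch).

    Assumes `image` is a float tensor scaled to [0, 1].
    """
    image = tf.image.random_flip_left_right(image)               # left-right flip
    image = tf.image.random_brightness(image, max_delta=0.05)    # small brightness shift (0-0.05)
    return tf.clip_by_value(image, 0.0, 1.0)

# Applied only to the training pipeline, never to internal or external validation data:
# train_ds = train_ds.map(lambda img, label: (augment_cxr(img), label))
```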
Reasons |
The goal of image augmentation is to automatically increase training sample variety so that the model can learn to discern features that generalize better to the downstream prediction task. This step is particularly important when the training dataset is small. The online augmentations (flip and brightness) simulate how real-world variation in how CXRs are taken can alter image appearance. Only small augmentation ranges were chosen so that the CXR images remain radiologically interpretable. Augmentation was not used during internal or external validation because (1) model weights are not updated during evaluation and (2) models need to be compared against a consistent benchmark, whereas augmentation introduces randomness.
Step 2: Online CXR feature extraction |
Design decisions |
Two previously published pre-trained DenseNet-121 CXR models [28] were tried as feature extractors for our downstream mortality prediction task. The Madrid CXR images were resized during training to the input size of each pre-trained model (320 × 320 vs. 224 × 224) before imaging features were extracted for classification. The last fully connected layer of both models, containing 14 outputs corresponding to the 14 radiologic CheXpert finding labels [35], was removed. Instead, linearized convolutional features from either the second-to-last (−2) or fourth-to-last (−4) layer were used for the mortality prediction classification task. The pre-trained models were partially frozen, with model weights updating only after layer 355, 400, or 420 during training. The choice of “teacher” pre-trained model, feature layer, and number of layers to update for the new mortality prediction task were set up as hyperparameters to be tuned in our experiments.
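A sketch of this feature extraction and partial freezing in Keras is given below; the model file name, the chosen feature layer, and the freeze boundary are placeholders for the hyperparameters described above.

```python
import tensorflow as tf

# Hypothetical path to one of the published pre-trained DenseNet-121 CXR models.
teacher = tf.keras.models.load_model("pretrained_cxr_densenet121.h5")

# Take features from the second-to-last (-2) or fourth-to-last (-4) layer instead of
# the removed 14-label CheXpert output; the choice is a tuned hyperparameter.
feature_layer = -2
feature_extractor = tf.keras.Model(
    inputs=teacher.input,
    outputs=teacher.layers[feature_layer].output,
)

# Partial freezing: only weights after a chosen layer index (355, 400, or 420) are updated.
unfreeze_after = 400
for layer in feature_extractor.layers[:unfreeze_after]:
    layer.trainable = False
for layer in feature_extractor.layers[unfreeze_after:]:
    layer.trainable = True
```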
Reasons |
The Madrid dataset is too small to train deep learning networks from scratch. The pre-trained CXR models chosen have already been trained on a much larger CXR dataset (MIMIC-CXR) [29] (> 200,000 images) to discern features useful for diagnosing 14 different lung and heart radiologic findings, which are also clinically relevant for COVID patients. The final few layers of pre-trained convolutional neural networks tend to best summarize the features useful for downstream (related) classification tasks. Since we only have a small training dataset (Madrid), we decided to update only part of the weights in the later layers of the pre-trained models and to leave the choice of how many layers to update as a tunable parameter, knowing that there is a balance to be “learned” between updating weights for the new task on the small Madrid training dataset and losing the benefit of pre-learned weights from the pre-trained “teacher” CXR models.
Step 3: Mortality classification layers |
Design decisions |
After CXR features are extracted from a pre-trained model, we added a classification block consisting of a tunable number of hidden linear layers, followed by a final activation function (a choice between ReLU and LeakyReLU), a dropout layer, and a single binary output layer. The output layer represents whether a patient is alive or expired at 30 days. An initial bias for the final output layer was optionally added and tuned along with the choice of activation function (ReLU or LeakyReLU) and the number and sizes of the hidden layers.
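The sketch below shows one way this classification block could look in Keras; the hidden-layer sizes and dropout rate are placeholders, and since the exact placement of the activation is not fully specified above, this sketch applies it after each hidden linear layer.

```python
import tensorflow as tf

def build_classifier_head(feature_dim=1024, hidden_sizes=(256, 64),
                          use_leaky_relu=True, dropout_rate=0.3,
                          initial_output_bias=None):
    """Sketch of the mortality classification block; sizes and rates are illustrative."""
    inputs = tf.keras.Input(shape=(feature_dim,))        # 1024-dim CXR features
    x = inputs
    for units in hidden_sizes:                           # tunable number and sizes of hidden layers
        x = tf.keras.layers.Dense(units)(x)
        x = (tf.keras.layers.LeakyReLU() if use_leaky_relu
             else tf.keras.layers.ReLU())(x)             # ReLU vs. LeakyReLU is tuned
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    bias_init = (tf.keras.initializers.Constant(initial_output_bias)
                 if initial_output_bias is not None else "zeros")
    outputs = tf.keras.layers.Dense(1, activation="sigmoid",
                                    bias_initializer=bias_init)(x)  # alive vs. expired at 30 days
    return tf.keras.Model(inputs, outputs)
```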
Reasons |
The features extracted from both pre-trained models are 1024-dimensional. Additional classification layers were added to learn the new mortality classification task. Since the layer numbers and sizes are arbitrary, we picked a few common sizes to tune. We tried LeakyReLU as an activation function in the classification block because the CXR features extracted from the (−2) and (−4) layers can contain many zeros due to the DenseNet-121 architecture. Adding an initial bias to the output layer can help performance on a highly imbalanced dataset.
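The paper does not state how the optional initial bias was chosen; a common convention for imbalanced binary tasks, shown below purely as an assumption, is to set it to the log-odds of the positive class so that the untrained model's mean prediction matches the base rate.

```python
import numpy as np

# Hypothetical class counts: expired vs. surviving patients in the training data.
pos, neg = 150, 850
initial_output_bias = np.log(pos / neg)   # log-odds of the positive class
# This value could then be passed to build_classifier_head(initial_output_bias=...).
```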
Step 4: Optimization settings |
Design decisions |
Binary cross entropy was used as the loss function and the Adam optimizer was used for parameter optimization. We did not tune these settings.
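Continuing the classification-head sketch above, the compile step could look as follows; the learning rate is one of the tuned hyperparameters, so the value here is illustrative.

```python
import tensorflow as tf

model = build_classifier_head(initial_output_bias=initial_output_bias)  # from the sketches above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # learning rate is tuned; value illustrative
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(name="auc"),
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
```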
Reasons |
Binary cross entropy is an appropriate loss for the binary mortality classification task. Adam is a fast optimizer that helps avoid overfitting and has shown good performance across a range of tasks.
Step 5: Hyperparameter tuning and model selection |
Design decisions |
Supplementary Table 4 provides a summary of all the hyperparameters we experimented with on the Madrid dataset to select the final, best-performing CXR-based COVID-19 30-day mortality prediction model. An experiment is defined by one unique combination of hyperparameters. Due to limited training resources and a large hyperparameter search space (345,600 unique combinations), we first performed a rough search and manually narrowed the search space; for example, early observations suggested that most experiments performed better with smaller batch sizes, LeakyReLU activation, and Bbox augmentation. We then fine-tuned the model on the remaining, more influential parameters such as the learning rate. Early stopping was used to end experiments that did not show loss reduction after 2 or 5 epochs. Overall, we performed over 300 experiments. For each experiment, we plotted the training and validation curves for multiple metrics (recall, precision, accuracy, AUC, and F1-score) against the number of epochs. We combined automatic and manual model selection by (1) evaluating experiments with F1-scores above 0.25 for all four folds and (2) manually examining the training-vs.-validation learning curves to pick the hyperparameter setting that improved the model’s precision and recall from baseline on both the training and validation data, while ensuring that the chosen model did not show evidence of overfitting.
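A minimal sketch of the early-stopping rule described above, using a standard Keras callback; monitoring validation loss with a patience of 2 (or 5) epochs is an assumption consistent with the text, and train_ds / valid_ds are hypothetical datasets.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop experiments whose validation loss stops improving
    patience=2,                  # 2 or 5 epochs, per the tuning setup
    restore_best_weights=True,
)
# history = model.fit(train_ds, validation_data=valid_ds, epochs=50, callbacks=[early_stop])
# The per-epoch metrics in history.history are then plotted as the training and validation
# learning curves that are inspected manually before a hyperparameter setting is selected.
```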
Reasons |
The standard practice for hyperparameter tuning is to update model weights on the training dataset and evaluate the updated model on the validation dataset at the end of each epoch, i.e., once the model has “seen” all examples in the training set. Despite using all of the Madrid dataset for training and validation, the number of positive cases in the validation set is still small. Simply picking the best F1-score automatically without inspecting all the learning curves could end up selecting a “lucky” epoch.
EHR-CXR fusion model |
Design decisions |
We took a late fusion approach that uses the output probability from the CXR model as a feature alongside the EHR features for 30-day mortality classification. With the Madrid training data, we again tuned four machine learning models (logistic regression, random forest, gradient boosting, and XGBoost) in a fourfold cross-validation setting, and the best model and hyperparameters were selected by randomized grid search using the same methodology as for the EHR-based model.
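A sketch of this late fusion step is shown below, assuming X_ehr is the tabular EHR feature matrix, cxr_prob holds the CXR model's predicted 30-day mortality probabilities for the same patients, and y is the mortality label; XGBoost stands in for whichever of the four models is ultimately selected.

```python
import numpy as np
from xgboost import XGBClassifier

def make_fusion_features(X_ehr, cxr_prob):
    """Late fusion: append the CXR output probability as one extra tabular feature."""
    return np.column_stack([X_ehr, np.asarray(cxr_prob).ravel()])

# Hypothetical usage; the classifier and its hyperparameters would be chosen by the same
# randomized grid search and fourfold cross-validation used for the EHR-based model.
# X_fused = make_fusion_features(X_ehr, cxr_prob)
# fusion_model = XGBClassifier(random_state=2020)
# fusion_model.fit(X_fused, y)
```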
Reasons |
A late fusion approach was used because it can be implemented with traditional machine learning methods, which are less prone to overfitting on smaller datasets. In contrast, intermediate (joint) fusion implemented with neural networks requires more data for training (the implementation of the intermediate fusion model can be found in Supplementary Table 6). In addition, the much larger feature size of the imaging modality can easily swamp important clinical signals in the tabular EHR data. From an interpretability standpoint, late fusion also allows the overall feature importance of the CXR model’s prediction to be examined alongside the EHR features.