Table 1.

| | CXR | EXR | HCT | EEG |
|---|---|---|---|---|
| Data type | single 2D radiograph | multiple 2D radiograph views | 3D CT reconstruction | 19-channel EEG time series |
| Classification task | normal vs. abnormal | normal vs. abnormal | hemorrhage vs. no hemorrhage | seizure onset vs. no seizure onset |
| Anatomy | chest | knee | head | head |
| Train set size (Large/Medium) | 50,000 / 5,000 | 30,000 / 3,000 | 4,000 / 400 | 30,000 / 3,000 |
| Train set size (Literature) | 20,000⁷ | 40,561³² | 904⁶ | 23,218³³ |
| Network architecture | 2D ResNet-18⁹ | patient-averaged 2D ResNet-50⁹ | 3D MIL + ResNet-18 + Attention³⁵ | 1D Inception DenseNet³⁶ |
We apply cross-modal data programming to four different data types: 2D single chest radiographs (CXR), 2D extremity radiograph series (EXR), 3D reconstructions of computed tomography of the head (HCT), and 19-channel electroencephalography (EEG) time series. We use two different dataset sizes in this work: the full labeled dataset (large) of a size that might be available for an institutional study (i.e., physician-years of hand labeling) and a 10% subsample of the entire dataset (medium) of a size that might be reasonably achievable by a single research group (i.e., physician-months of hand labeling). For context, we present the size of comparable datasets used to train high-performance models in the literature. Finally, we list the different standard model architectures used. While each image model uses a residual network encoder,⁹ architectures vary from a simple single-image network (CXR) to a mean across multiple image views (EXR) to a dynamically weighted attention mechanism that combines image encodings for each axial slice of a volumetric image (HCT). For EEG time series, an architecture combining the best attributes of the Residual and Densely Connected³⁷ networks for 1D applications is used, in which each channel is encoded separately and a fully connected layer is used to combine features extracted from each (see Experimental Procedures).
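The two multi-input aggregation strategies above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the encoders are replaced by precomputed encoding arrays, and the scoring vector `w` of the attention pool is a hypothetical stand-in for the learned attention parameters.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_pool(view_encodings):
    """EXR-style aggregation: average the per-view encodings for a patient."""
    return np.mean(view_encodings, axis=0)

def attention_pool(slice_encodings, w):
    """HCT-style aggregation: score each axial-slice encoding against a
    (here hypothetical) learned vector w, softmax the scores into dynamic
    weights, and return the weighted combination of slice encodings."""
    scores = slice_encodings @ w        # one scalar score per slice
    weights = softmax(scores)           # input-dependent weights, sum to 1
    return weights @ slice_encodings    # weighted sum of encodings

# Toy example: 5 slices/views, 8-dimensional encodings.
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))
w = rng.normal(size=8)
pooled_mean = mean_pool(enc)        # shape (8,)
pooled_attn = attention_pool(enc, w)  # shape (8,)
```

Mean pooling weights every view equally, while the attention pool lets the model emphasize the slices most relevant to the prediction (e.g., those containing a hemorrhage).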