Title: 2024 Conference on Machine Intelligence in Medical Imaging (CMIMI)
Date: October 21-22, 2024
Venue: Boston University – George Sherman Union, Boston, MA
Sponsorship: Publication of this supplement was sponsored by the Society for Imaging Informatics in Medicine (SIIM). All content was reviewed and selected by the CMIMI Program & Review Committee, which held full responsibility for the abstract selections.
Program Listing
Monday, October 21
Tuesday, October 22
Oral Presentations
Image Classification | Scientific Abstract Presentations
001 - Advancing Endometriosis Detection: A Deep Learning-Enhanced Multi-sequence MRI Analytical Model
Presenter: Mana Moassefi, Mayo Clinic – Rochester
Mana Moassefi1, Wendaline VanBuren1, Bradley J. Erickson1, Shahriar Faghani1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
Endometriosis, a condition characterized by the growth of endometrial-like tissue outside the uterus, affects 5-10% of women of reproductive age. Despite its prevalence, the diagnosis of endometriosis through imaging remains challenging due to the complex anatomy of the pelvis and heterogeneity of the disease itself on imaging, which requires expertise. Advances in deep learning (DL) are revolutionizing the diagnosis and management of complex medical conditions, promoting patient-centered treatment.
Methods/Intervention
We gathered a patient cohort from our institutional database, composed of patients with pathologically confirmed endometriosis from 2015 to 2024. We selected gynecologic MRIs performed within three months prior to diagnostic surgery. We also created an age-matched control group that underwent a similar MR protocol but without a diagnosis of endometriosis. We used sagittal T1-weighted (T1) pre- and post-contrast, as well as T2-weighted (T2) MRIs. We split our dataset at the patient-level and allocated one-eighth of the dataset for testing and conducted seven-fold cross-validation on the remainder. MR images were analyzed using various convolutional neural network (CNN) architectures. Simultaneously, two abdominal radiologists with experience in endometriosis MRI and complex surgical planning and one women’s imaging fellow with specific training in endometriosis MRI reviewed a random selection of images and documented their endometriosis detection.
Results/Outcome
751 patients were included in the case and control groups. The final 3D-DenseNet-121 classifier model demonstrated robust performance. Our findings indicated the most accurate predictions were obtained using T2, T1 pre- and post-contrast. Testing on our test set using an ensemble technique resulted in an F1 score of 0.911, AUROC of 0.881, sensitivity of 0.976, and specificity of 0.720. Our radiologist readers achieved 72.2% and 78.5% sensitivity without and with AI assistance, respectively, in detecting endometriosis.
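The test metrics reported above all derive from the confusion matrix of the classifier's thresholded outputs. As a minimal sketch, the relationships can be written out as follows; the function name and the counts are hypothetical and are not the study's data.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive sensitivity, specificity, precision, and F1 from confusion counts."""
    sensitivity = tp / (tp + fn)   # fraction of positive cases detected
    specificity = tn / (tn + fp)   # fraction of controls correctly cleared
    precision = tp / (tp + fp)     # fraction of positive calls that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not taken from the abstract).
metrics = binary_metrics(tp=80, fp=10, tn=40, fn=20)
print(metrics)
```

Because F1 depends only on true positives, false positives, and false negatives, it can stay high even when specificity is modest, which is why both are worth reporting together.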
Conclusion
The study introduced the first DL model to use multi-sequence MRI on a large cohort, showing results equivalent to human detection in trained readers in identifying endometriosis. Further external validation of the model is in progress.
Statement of Impact
We aim to evaluate DL tools in enhancing the accuracy of multi-sequence MRI-based detection of endometriosis in daily practice.
Keywords
Endometriosis; Magnetic Resonance Imaging; Deep Learning
002 - Automated Detection of Tricuspid Regurgitation Using Multiple Instance Learning on Doppler Echocardiograms
Presenter: Anthony T. Wu, University of California, Irvine
Anthony T. Wu1, Nathan H. Choi1, Amanda Warren1, Gavin Shu1, Tobin Matthew1, Kyle Digrande1, Antonio Frangieh1,
Jin Kyung Kim1, Xiaohui Xie1, Peter Chang1, Jennifer Xu1
1University of California Irvine, Irvine, CA, USA
Introduction/Background
Tricuspid regurgitation (TR) affects >70 million people worldwide. Of patients with severe-TR, 36% die within 1 year; however, >90% of these patients do not undergo intervention because of prohibitive surgical mortality. Early diagnosis of severe-TR is critical for treatment effectiveness, but such diagnoses are time-consuming and require expert image acquisition and interpretation of echocardiograms (echo). Thus, there is a critical need for an accurate and rapid method for binary severe-TR (i.e., intervention needed) detection. While 3D Convolutional Neural Networks are frequently used for medical image classification, the variable spatial and temporal dimension sizes of echoes make them less effective and interpretable. Herein we describe an interpretable multiple-instance model to detect severe-TR from echo.
Methods/Intervention
Echocardiograms were interpreted and labeled at the video level by expert cardiologists. Sample gates were extracted and split with a train-validation ratio of 4:1. The multiple-instance model comprises a feature extractor utilizing spatial grouped-convolution blocks and two temporal heads, an instance predictor employing a perceptron model, and an instance aggregator using mean-pooling. A grid-search over hyperparameters was performed for optimization.
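The instance predictor and mean-pooling aggregator described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: each instance is taken to be an already-extracted feature vector, the perceptron weights are hypothetical, and the feature extractor with its spatial and temporal heads is omitted entirely.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def instance_predictor(features, weights, bias):
    """Single perceptron: one probability per instance feature vector."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def bag_prediction(instances, weights, bias):
    """Mean-pool the instance probabilities into one video-level score."""
    probs = [instance_predictor(x, weights, bias) for x in instances]
    return sum(probs) / len(probs), probs


# Hypothetical 3-instance video with 2-dimensional instance features.
video = [[0.2, 1.5], [0.9, 0.1], [1.2, 2.0]]
score, per_instance = bag_prediction(video, weights=[0.8, 0.5], bias=-1.0)
print(round(score, 3), [round(p, 3) for p in per_instance])
```

Keeping the per-instance probabilities alongside the pooled score is what allows instance-level inspection of which parts of a video drive the prediction.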
Results/Outcome
In total, 3604 echoes were collected. Our model achieved a maximum validation accuracy of 81.3% and ROC AUC of 0.865. Activations of spatial convolutions were used to visualize spatial focus, and instance predictions were used to identify temporal model focus.
Conclusion
Our model accurately detects severe-TR from echocardiograms with similar sensitivity and specificity, indicating balanced training. The separation of temporal and spatial features enhances its explainability. Activations of spatial convolutions demonstrate the model's ability to discern cardiac structure and Doppler flow, while single-instance predictions reveal its capability to learn fine-grain features from coarse video-level labels.
Statement of Impact
Our algorithm provides real-time interpretation of echo for severe-TR diagnoses with high sensitivity and allows cardiologists to focus on more severe patient cases, where quick diagnostic turnaround time is critical for intervention efficacy.
Keywords
Multiple Instance Learning; Tricuspid Regurgitation; Video Classification
003 - Iterations on a Classic: A Novel Machine Learning Algorithm for the Establishment of Pediatric Bone Age Using Knee Radiographs
Presenter: Kelly Horst, Mayo Clinic - Rochester
Kelly Horst1, Bradley Erickson1, Adam Tagliero1, Aaron Krych1, Shahriar Faghani1
1Mayo Clinic – Rochester, Rochester, MN, USA
Introduction/Background
It is common practice for orthopedic surgeons to obtain a radiograph of the left hand to estimate bone age for planning
surgical interventions of the knee in pediatric patients, as those interventions are based on skeletal age rather than
chronological age. We created a novel deep learning (DL) algorithm that determines skeletal age based on radiographs of
the knee.
Methods/Intervention
We identified a total of 7,336 radiographs of the knee from 5,701 pediatric patients under 18 years of age,
acquired between January 2018 and January 2024. The images included a range of normal images and images with
pathology. The following views were used to train (80% of total), tune (10% of total) and test (10% of total) the model:
1,167 right 2 views, 1,252 left 2 views, 768 right 3 views, 831 left 3 views, 1,282 right 4+ views, 1,280 left 4+ views.
Patients with more than one study were placed in the same cohort in order to prevent data leakage. We developed a
view-agnostic multimodal deep learning model using an intermediate fusion approach. Our model employed a 2D
DenseNet121 as the imaging feature extractor and two shallow neural networks. The first shallow neural network
transformed patient sex into the imaging feature space, while the second neural network merged and processed the
imaging features along with the transformed sex features to predict bone age in months. We used mean squared error as
the loss function and co-trained all the neural networks with the AdamW optimizer. The model’s performance was
evaluated on the test set using the mean absolute error (MAE) metric.
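Placing all studies from one patient in the same cohort amounts to splitting on patient identifiers rather than on images. The following is a minimal sketch of that idea; the function name, identifiers, and seed are hypothetical, and the resulting image-level proportions only approximate 80/10/10 because patients contribute different numbers of studies.

```python
import random
from collections import defaultdict


def split_by_patient(study_to_patient, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each patient (not each study) to train, tune, or test."""
    patients = sorted(set(study_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = int(fractions[0] * len(patients))
    n_tune = int(fractions[1] * len(patients))

    cohort_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            cohort_of[patient] = "train"
        elif i < n_train + n_tune:
            cohort_of[patient] = "tune"
        else:
            cohort_of[patient] = "test"

    # Every study inherits the cohort of its patient, so no patient straddles cohorts.
    cohorts = defaultdict(list)
    for study, patient in study_to_patient.items():
        cohorts[cohort_of[patient]].append(study)
    return dict(cohorts)


# Hypothetical mapping: 30 studies from 20 patients, so some patients have two studies.
example = {f"study{i}": f"patient{i % 20}" for i in range(30)}
print({name: len(studies) for name, studies in split_by_patient(example).items()})
```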
Results/Outcome
The mean patient age was 13.4 years, with an STD of 3.5 years. The model showed a MAE of 9.7 months on the test
cohort.
Conclusion
A view-agnostic multimodal DL algorithm can estimate bone age from radiographs of the knee, with higher accuracy than
published references.
Statement of Impact
This ML algorithm may enable clinicians to forgo the routinely obtained left hand radiograph for skeletal bone age.
Keywords
Bone age; Regression algorithms; CNN
004 - Iterations on a Classic: Utilizing DenseNet121 for Gender Differentiation in Pediatric Bone Age Assessment Algorithms
Presenter: Kelly Horst, Mayo Clinic - Rochester
Kelly Horst1, Shahriar Faghani1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
The algorithm aims to enhance the precision of bone age estimation by accurately identifying patient sex, since patient
sex is a crucial input for bone age prediction and relying solely on DICOM (digital imaging and communications in
medicine) metadata can result in errors. In addition, we aim to determine whether there are gender-based morphological
differences in hand radiographs in the pediatric population.
Methods/Intervention
The DenseNet121 model was adapted for gender differentiation in pediatric bone age assessments. The 2017 RSNA
bone age dataset was utilized, with 12,611 training images and 1,425 reserved for the validation set. There are 5,778
female and 6,833 male subjects, ranging from 1 to 228 months in age, with a mean bone age of 127.2 months and a
standard deviation of 41.7 months. Channel normalization, resizing to 512 x 512, foreground cropping, normalization of
image intensities, random histogram shift, flipping, random affine transformations, and Gaussian noise addition were used
to enhance model generalizability. Weighted cross-entropy with inverse class ratios was used as the loss function to
address the slight imbalance in the dataset. To gain insight into the model's decision-making process, occlusion maps were
created and reviewed by a pediatric radiologist. Area under the receiver operating characteristic curve (AUROC) was reported as the
performance metric.
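Weighted cross-entropy with inverse class ratios can be illustrated with the class counts given above. This is a sketch under assumptions: weights are taken as total count divided by class count, and the loss is normalized by the sum of the weights of the samples in the batch, which is one common convention and may not match the authors' exact implementation.

```python
import math


def inverse_ratio_weights(counts):
    """Weight each class by total / count, so the rarer class weighs more."""
    total = sum(counts.values())
    return {cls: total / n for cls, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs[i] maps class -> probability."""
    numerator = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Class counts as stated in the abstract.
weights = inverse_ratio_weights({"female": 5778, "male": 6833})

# Hypothetical model outputs for two images.
probs = [{"female": 0.9, "male": 0.1}, {"female": 0.4, "male": 0.6}]
loss = weighted_cross_entropy(probs, ["female", "male"], weights)
print(weights, round(loss, 4))
```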
Results/Outcome
An AUROC of 0.985 was achieved with a batch size of 16, a learning rate of 0.001, and 200 epochs. Due to the excellent
performance, no hyperparameter tuning was performed.
Conclusion
The model is effective in predicting patient gender with a high degree of accuracy. There are some imaging features that
contain gender information in the pediatric population.
Statement of Impact
This study provides proof of concept that some imaging features contain gender information in the pediatric population.
Leveraging interpretation methods might shed light on the biology of these differences and could be used as a scientific
discovery tool. The model is highly reliable for gender classification in the context of bone age assessment, leveraging
detailed image features that may be imperceptible to human observers.
Keywords
Bone age; Sex differentiation; CNN
005 - On-demand Generation of Probabilistic Models for Radiology Differential Diagnosis from Real-world Data
Presenter: Charles E. Kahn, Jr., University of Pennsylvania
Charles E. Kahn, Jr.1
1University of Pennsylvania, Philadelphia, PA, USA
Introduction/Background
Bayesian networks apply probability theory to perform diagnostic reasoning. They offer several attractive features,
including the ability to explain their reasoning and to account for missing or conflicting data. However, their construction
often is limited by the lack of real-world data to derive the conditional-probability tables (CPTs) that relate two conditions.
This work sought to establish an approach to extract probability data from radiology reports and apply that data for on-demand generation of Bayesian network models.
Methods/Intervention
The Radiology Gamuts Ontology (RGO), a reference source of more than 2000 radiology differential-diagnosis listings,
was accessed through its application programming interface (https://gamuts.net/api/specialty). Two years of radiology
reports from a large U.S. academic health system were analyzed using named-entity recognition and negation-detection
techniques to identify positive mentions of RGO entities. An occurrence was defined as a positive mention of an RGO entity
in a patient. Data were aggregated by patient; the software tallied the number of occurrences of each RGO entity and the
number of co-occurrences of each pair of RGO entities. The age and sex distributions of each condition were computed.
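The patient-level tallying described above can be sketched as follows. The entity names are hypothetical, and the conditional-probability estimate shown (co-occurrences divided by occurrences of the conditioning entity) is an assumed, simple way such counts could feed a conditional-probability table; the abstract does not specify the exact estimator.

```python
from collections import Counter
from itertools import combinations


def tally(patient_entities):
    """Count per-patient occurrences and pairwise co-occurrences of entities."""
    occurrences = Counter()
    co_occurrences = Counter()
    for entities in patient_entities.values():
        unique = sorted(set(entities))          # count each entity once per patient
        occurrences.update(unique)
        co_occurrences.update(combinations(unique, 2))
    return occurrences, co_occurrences


def conditional_probability(entity, given, occurrences, co_occurrences):
    """Estimate P(entity | given) from the tallies (assumed estimator)."""
    pair = tuple(sorted((entity, given)))
    return co_occurrences[pair] / occurrences[given]


# Hypothetical positive mentions aggregated by patient.
patients = {
    "p1": ["pneumonia", "pleural effusion"],
    "p2": ["pneumonia"],
    "p3": ["pleural effusion", "pneumonia", "pneumonia"],
}
occ, co = tally(patients)
print(conditional_probability("pleural effusion", "pneumonia", occ, co))
```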
Results/Outcome
From approximately 1.8 million reports on 1.3 million distinct patients, the software generated probabilistic data for the
2,742 RGO entities (of 16,839 total) that occurred in the dataset. Project software aggregated probability data around the
specified entity and entities that could cause or be caused by it; the generated Bayesian network model was encoded in
Structural Modeling, Inference, and Learning Engine (SMILE) format. Diagnostic inference was performed using the
GeNIe platform (BayesFusion LLC, Pittsburgh, PA).
Conclusion
This methodology generates Bayesian-network models for radiology diagnosis from real-world data extracted from
analysis of radiology reports. It demonstrates the ability to extract probabilistic data from the unstructured text of radiology
reports to generate diagnostic models tailored to the prevalence of diseases and imaging findings of a specific
organization's patient population.
Statement of Impact
This report describes a novel approach that generates conditional-probability data from unstructured radiology reports. It
overcomes a key limitation of Bayesian networks and allows one to create diagnostic models that apply the frequencies of
diseases and imaging findings of a specific set of patients.
Keywords
Diagnosis; Bayesian networks; Probabilistic reasoning; Real-world data
006 - Predicting Brain Age in Autism Spectrum Disorders Using Graph Neural Networks
Presenter: Anureet Tiwana, University of Alberta
Anureet Tiwana1, James R. Mitchell1
1University of Alberta, Edmonton, Alberta, Canada
Introduction/Background
Autism Spectrum Disorder (ASD) diagnosis is complicated by symptom variability and traditional labor-intensive methods.
This study explores using "brain age," derived from neuroimaging data, to quantify developmental delays in ASD.
Leveraging advanced GNN models, we aim to enhance early, accurate diagnoses and intervention strategies.
Methods/Intervention
The study used the ABIDE dataset, an open-source resource containing preprocessed neuroimaging data from 1112
individuals across 20 international sites, including 539 with ASD and 573 controls. The data were preprocessed using the
Data Processing Assistant for Resting-State fMRI (DPARSF) and analyzed with the "Dosenbach160" ROI set. Graph
construction involved defining node and edge connections, with nodes representing regions of interest (ROIs) and edges
representing functional connections. Three Graph Neural Network (GNN) architectures were employed: Graph Attention
Networks (GAT), Chebyshev Graph Convolutional Networks (ChebNets), and Graph Isomorphism Networks (GIN).
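Graph construction from a functional connectivity matrix can be sketched as below. The thresholding rule and its value are assumptions made for illustration; the abstract does not state how edges were selected or weighted.

```python
def build_graph(connectivity, threshold=0.5):
    """Turn a symmetric ROI-by-ROI connectivity matrix into nodes and weighted edges.

    Nodes are ROI indices; an undirected edge (i, j, weight) is kept when the
    absolute connectivity meets the (hypothetical) threshold.
    """
    n = len(connectivity)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only: no self-loops, no duplicates
            weight = connectivity[i][j]
            if abs(weight) >= threshold:
                edges.append((i, j, weight))
    return list(range(n)), edges


# Hypothetical 3-ROI correlation matrix.
matrix = [
    [1.0, 0.7, 0.1],
    [0.7, 1.0, -0.6],
    [0.1, -0.6, 1.0],
]
nodes, edges = build_graph(matrix)
print(nodes, edges)
```

With the Dosenbach160 ROI set the same function would receive a 160 x 160 matrix per subject, and the resulting edge list is the kind of input a GNN library typically expects.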
Results/Outcome
The GNNBrainAgePredictor models using GAT, ChebNet, and GIN architectures were evaluated for predicting brain age.
The GAT model, with two GAT layers, global mean pooling, and a linear regression layer, achieved a MAE of 4.8915 for
the autism group and 6.3125 for the control group. The ChebNet model, using Chebyshev polynomials for graph
convolutions, achieved a MAE of 5.2876 for the autism group and 6.6340 for the control group. The GIN model, with two
GINConv layers, achieved a MAE of 4.9252 for the autism group and 6.3012 for the control group. Unexpectedly, the
MAE was lower for the autism group across models, suggesting developmental delays rather than pathological processes.
Conclusion
The GIN model emerged as the most effective, outperforming GAT and ChebNet models in predicting brain age with the
lowest MAE observed. The lower MAE for the autism group challenges conventional understanding and indicates
potential developmental delays. Further investigation is required to understand the neurobiological underpinnings of these
findings.
Statement of Impact
Our study highlights the potential of advanced GNN architectures, particularly GIN, for accurately predicting brain age and
enhancing our understanding of brain development in ASD. These findings suggest that leveraging machine learning
techniques can significantly improve early diagnosis and intervention strategies, ultimately leading to more personalized
and effective treatments for individuals with ASD.
Keywords
Graph Neural Network; Autism; ABIDE; Brain age
Oral Presentations
Bias & Uncertainty | Scientific Abstract Presentations
007 - A Technical Exploration of Bone Age Prediction with Machine Learning Regression Algorithms: Conformal Prediction
Presenter: Shahriar Faghani, Mayo Clinic - Rochester
Shahriar Faghani1, Kelly Horst1, Bradley J. Erickson1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
Regression models usually provide a point estimate rather than an interval that reflects the uncertainty of the prediction.
While some uncertainty quantification methods can be applied to regression algorithms to obtain prediction intervals, most
lack statistical guarantees for these intervals. Conformal prediction, a post-hoc method, stands out by offering statistical
guarantees. We aimed to apply conformal prediction to bone age prediction in the pediatric population.
Methods/Intervention
A multimodal deep learning model was developed to estimate the skeletal age based on left-sided hand radiographs. This
model consists of the DenseNet121 architecture as an imaging feature extractor and two shallow neural networks: one for
transforming patient sex into the imaging feature space and the other for combining all features. The 2017 RSNA bone
age dataset was utilized, with 12,611 training images, and 700 and 725 images reserved for the calibration and validation sets,
respectively. There are 5,778 female and 6,833 male subjects, ranging from 1 to 228 months in age, with a mean bone
age of 127.2 months and a standard deviation of 41.7 months. Resizing to 512 x 512, foreground cropping, normalization
of image intensities based on training dataset statistics to zero mean and unit standard deviation, histogram shift, flipping,
affine transformations, and Gaussian noise addition were used to enhance model generalizability. Quantile regression
loss with the 5th and 95th percentile provided at least 90% coverage for interval predictions. A calibration set was used to
calibrate the predicted percentiles for each image based on the conformal procedure. Mean absolute error (MAE) and
mean prediction interval (MPI) were calculated.
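The calibration step can be illustrated with a conformalized quantile regression style procedure. This is a sketch under the assumption that the conformal procedure resembles that standard recipe: conformity scores measure how far each calibration label falls outside its predicted interval, and a finite-sample quantile of those scores widens (or narrows) every test interval. The function names and numbers are hypothetical.

```python
import math


def conformal_margin(lower, upper, truth, alpha=0.10):
    """Margin from calibration data for a target coverage of 1 - alpha."""
    # Positive score: the label lies outside the interval by that amount.
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, truth))
    n = len(scores)
    # Finite-sample rank; clamped to n when the calibration set is very small.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]


def calibrate(lo, hi, margin):
    """Apply the margin symmetrically to a predicted interval."""
    return lo - margin, hi + margin


# Hypothetical calibration predictions (5th and 95th percentiles) and true ages, in months.
cal_lower = [90, 100, 110, 120]
cal_upper = [110, 120, 130, 140]
cal_truth = [95, 125, 120, 150]
margin = conformal_margin(cal_lower, cal_upper, cal_truth)
print(margin, calibrate(100, 120, margin))
```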
Results/Outcome
An MAE of 5.6, an MPI of 3.7, and a prediction interval coverage of 97% were achieved, indicating that 97% of the
observations fall within the predicted intervals.
Conclusion
These intervals allow for the individualization of the result uncertainty quantification and may be helpful to include in
individual patient reports. This measurement may also assist in identifying individual outliers, which may need further
follow-up.
Statement of Impact
To the best of our knowledge, this is the first application of conformal prediction in radiologic image-level regression tasks
and could serve as a template for other clinical challenges.
Keywords
Deep learning; Uncertainty quantification; Regression; Pediatric radiology
008 - Automated Identification of Challenging Samples in Medical Imaging for Unbiased AI Model Training
Presenter: Frank Li, Emory University
Frank Li1, Theo Dapamede1, Bardia Khosravi2, Mohammadreza Chavoshi3, Saptarshi Purkayastha4, Hari Trivedi1, Judy Gichoya1
1Emory University, Atlanta, GA, USA
2Yale University, New Haven, CT, USA
3Shariati Hospital, Tehran, Iran
4Indiana University, Indianapolis, IN, USA
Introduction/Background
In medical imaging datasets, "shortcuts" or spurious correlations can cause AI models to unintentionally depend on
irrelevant features when making decisions. For instance, the presence of support devices like chest tubes acts as a shortcut
when predicting pneumothorax (easy cases), and pneumothorax cases without chest tubes are harder for the model to
learn (hard-to-learn cases). However, hard-to-learn cases are not always known a priori. In this study, we aim to establish
a pipeline to automatically differentiate easy- and hard-to-learn cases during model development.
Methods/Intervention
We used a bias amplification (BAM) technique during model training to identify hard-to-learn samples within the SIIM-ACR
pneumothorax dataset. BAM incorporates a trainable auxiliary variable (b) to track errors made by the model during
training to identify hard-to-learn samples and amplify them during the training process. For our experiments, predicted
probabilities below 0.25 for positive samples (false negatives, FN) and above 0.75 for negative samples (false positives,
FP) were designated as hard-to-learn samples. Conversely, predicted probabilities above 0.75 for positive samples (true
positives, TP) and below 0.25 for negative samples (true negatives, TN) were regarded as easy-to-learn samples. Grad-CAM++ was used to generate saliency maps, and the images were reviewed by two radiologists.
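The probability thresholds above translate directly into a small rule. In this sketch the function name is hypothetical, and the "unassigned" category for probabilities between the two thresholds is an assumption, since the abstract does not say how intermediate samples were handled.

```python
def difficulty(prob: float, label: int, low: float = 0.25, high: float = 0.75) -> str:
    """Classify a sample as easy or hard to learn from its predicted probability."""
    if label == 1:
        if prob < low:
            return "hard (FN)"   # confident miss on a positive sample
        if prob > high:
            return "easy (TP)"
    else:
        if prob > high:
            return "hard (FP)"   # confident false alarm on a negative sample
        if prob < low:
            return "easy (TN)"
    return "unassigned"          # assumed label for the intermediate band


# Hypothetical (probability, label) pairs.
for prob, label in [(0.10, 1), (0.90, 1), (0.90, 0), (0.10, 0), (0.50, 1)]:
    print(prob, label, difficulty(prob, label))
```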
Results/Outcome
The magnitude of the auxiliary variable (b) increased with the level of learning difficulty, implying that the model leaned
more heavily on b when facing challenging samples. As expected, FP and TP examples had a higher presence of support
devices and FN were often missing support devices. Saliency maps and radiologist review revealed that the model
focused more on support devices, further supporting this observation.
Conclusion
Our findings validate the hypothesis that images containing both pneumothorax and support devices are easier for the
model to learn from. The proposed pipeline may serve as an automated tool to identify hard-to-learn samples in medical
imaging datasets, facilitating the training of unbiased AI models and reducing the reliance on labor-intensive human
labeling.
Statement of Impact
This study offers an automated method of narrowing down datasets for AI training and validation to alleviate the need for
extensive human labeling of granular labels in medical imaging datasets.
Keywords
AI Bias; Shortcut Learning; Medical Imaging; pneumothorax
009 - Development of a Calculator for External Validation Study Sample Size in Radiology AI
Presenter: Shahriar Faghani, Mayo Clinic - Rochester
Shahriar Faghani1, Mana Moassefi1, Bradley J. Erickson1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
External validation of clinical prediction models in radiology AI ensures their performance and generalizability in
independent datasets. Accurate estimation of sample sizes for these validation studies is crucial for obtaining precise and
unbiased performance metrics, such as the C-statistic and its standard error (SE). This paper introduces a Python-based
calculator designed to estimate the required sample size for external validation studies in radiology AI. The tool uses
mathematical and statistical methods to provide precise sample size estimations, aiding researchers in designing robust
validation studies.
Methods/Intervention
The Python-based tool requires input values for the C-statistic (C) and the outcome event proportion (ϕ) from the validation
dataset. It calculates the standard error (SE) of the C-statistic from the sample size (N), the C-statistic value, and the
outcome event proportion, using the formula:
$$\text{SE}(C) \approx \frac{C(1 - C)}{N \phi (1 - \phi)} \left( 1 + \frac{N / 2 - 1}{2 - C} + \frac{N / 2 - 1}{1 + C} \right)$$
The tool provides two methods for
estimating the required sample size for a targeted SE. The first method calculates the sample size directly from the SE
formula, while the second method uses a quadratic equation approach. The tool is implemented using the Gradio
interface, allowing users to input the C-statistic value, outcome event proportion, and targeted SE, and receive the
required sample size.
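A sample-size search for a targeted SE can be sketched as below. Two assumptions are made loudly here: the sketch uses the Newcombe-style approximation commonly applied in validation sample-size work, which takes a square root and has N² in the denominator, so its form is not identical to the expression given above; and it finds N by a numeric search rather than by either of the tool's two methods, which are not specified in enough detail to reproduce. Function names and the example inputs are hypothetical.

```python
import math


def se_c_statistic(c: float, phi: float, n: int) -> float:
    """Approximate SE of the C-statistic (assumed Newcombe-style formula)."""
    half = n / 2 - 1
    variance = (
        c * (1 - c) * (1 + half * (1 - c) / (2 - c) + half * c / (1 + c))
    ) / (n ** 2 * phi * (1 - phi))
    return math.sqrt(variance)


def required_sample_size(c: float, phi: float, target_se: float,
                         n_max: int = 10_000_000) -> int:
    """Smallest N whose approximate SE does not exceed target_se."""
    if se_c_statistic(c, phi, n_max) > target_se:
        raise ValueError("target SE not reachable within n_max")
    lo, hi = 2, n_max
    while lo < hi:                     # binary search; SE decreases as N grows
        mid = (lo + hi) // 2
        if se_c_statistic(c, phi, mid) <= target_se:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical inputs: anticipated C = 0.80, event proportion 0.10, target SE 0.025.
print(required_sample_size(0.80, 0.10, 0.025))
```

A lower outcome event proportion or a tighter target SE both push the required N upward, which is the behavior a calculator of this kind is meant to expose.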
Results/Outcome
The tool was tested with various C-statistic values and outcome event proportions, demonstrating its accuracy in
estimating the required sample size for different scenarios.
Conclusion
The calculator is a valuable tool for researchers conducting external validation studies in radiology AI. It provides accurate
sample size estimations, ensuring robust and reliable evaluations of clinical prediction models.
Statement of Impact
The user-friendly interface enables researchers to quickly determine the necessary sample size for their validation
studies. The tool ensures that external validation studies are adequately powered, enhancing the reliability and
generalizability of AI models in radiology.
Keywords
Classification; External validation; Sample size calculation; Statistics
010 - Dissecting the Impact of Data Augmentation on Whole Slide Image Classification
Presenter: Dagoberto Pulido Arias, Massachusetts General Hospital
Dagoberto Pulido Arias1, Tiago Gonçalves2, Jaime Cardoso3, Jayashree Kalpathy-Cramer4, Elizabeth Gerstner1, Albert Kim1, Christopher Bridge5
1Massachusetts General Hospital, Boston, MA, USA
2FEUP, INESC TE, Porto, Portugal
3INESC Porto, Universidade do Porto, Porto, Portugal
4University of Colorado Anschutz Medical Campus, Aurora, CO, USA
5Harvard Medical School, Boston, MA, USA
Introduction/Background
Machine learning in pathology shows potential for advancing precision oncology across multiple tumor types. However,
limited annotated samples and high intra-class variability in histological images constrain these models' clinical potential.
Data augmentation enhances model performance and generalizability, especially with limited training data, but whole-slide image (WSI) classification models require pre-computed tile-level features, limiting the ability to perform on-the-fly
augmentations. This study compares the impact of image-level and feature-level data augmentation on WSI classification
in pathology at both tile and slide levels, specifically using foundational models for feature extraction. Our goal is to
assess the necessity of data augmentation when using foundational models.
Methods/Intervention
We conducted experiments using WSIs from the breast cancer subset of The Cancer Genome Atlas (TCGA), utilizing two
pre-trained feature extractor models: ResNet-50, trained on ImageNet, and CONCH (CONtrastive learning from Captions
for Histopathology), a vision-language foundational model designed for microscopic pathology and trained on millions of
WSIs. We used a Clustering-constrained attention multiple instance learning model for classification. Our study compared
different image-level augmentations, such as Hematoxylin-Eosin-DAB (HED) color transformation and tile shifting, and feature-level augmentation using Pseudo-Bag Mixup (PseMix), applied to features extracted by both models. Feature-level augmentation creates synthetic variations of feature representations to help the model generalize better.
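Feature-level augmentation on pre-computed tile features can be illustrated with a deliberately simplified pseudo-bag mixing routine. This is not PseMix as published: pseudo-bags are formed here by random partition and the mixing count is drawn uniformly, both of which are assumptions chosen to keep the sketch short. All names are hypothetical.

```python
import random


def mix_bags(bag_a, bag_b, label_a, label_b, n_pseudo=4, seed=0):
    """Build a synthetic bag from pseudo-bags of two slides and mix their labels."""
    rng = random.Random(seed)

    def pseudo_bags(bag):
        # Random partition of tile features into n_pseudo groups (simplification).
        idx = list(range(len(bag)))
        rng.shuffle(idx)
        return [[bag[i] for i in idx[j::n_pseudo]] for j in range(n_pseudo)]

    k = rng.randint(1, n_pseudo - 1)           # number of pseudo-bags kept from bag A
    parts = pseudo_bags(bag_a)[:k] + pseudo_bags(bag_b)[k:]
    lam = k / n_pseudo                         # label mixing ratio
    mixed = [feature for part in parts for feature in part]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, mixed_label


# Hypothetical bags of 1-dimensional tile features with one-hot slide labels.
bag_a = [[1.0]] * 8
bag_b = [[0.0]] * 8
mixed, mixed_label = mix_bags(bag_a, bag_b, [1.0, 0.0], [0.0, 1.0])
print(len(mixed), mixed_label)
```

Because the mixing happens on stored features, no tile needs to be re-encoded, which is the practical appeal of feature-level augmentation when the extractor is a large foundation model.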
Results/Outcome
The CONCH model, augmented with individual techniques, significantly improved classification accuracy, achieving a test
AUC of 0.868 with Pseudo-Bag Mixup. Without augmentation, the baseline model's test AUC was 0.758 with ResNet-50
and 0.846 with CONCH. Image-level augmentations like HED color transformation and tile shifting also improved
performance. For example, tile shifting led to a test AUC of 0.835 with ResNet-50, and HED alone with CONCH achieved
a test AUC of 0.856.
Conclusion
Data augmentation techniques are essential for enhancing model performance, addressing the challenges of limited
annotated data and high intra-class variability. Both image-level and feature-level augmentations improve predictive
performance, providing a robust solution for increasing the accuracy and reliability of computational pathology models.
Statement of Impact
This study underscores the importance and benefits of efficient data augmentation in computational pathology,
contributing to the development of robust, high-performing algorithms without needing additional data and resources.
Keywords
Artificial Intelligence; Attention mechanisms; Breast cancer; Computational pathology
011 - Iterations on a Classic: A Robust Hand Bone Age Algorithm Resistant to Computational Stress
Presenter: Kelly Horst, Mayo Clinic - Rochester
Kelly Horst1, Bradley J. Erickson1, Shahriar Faghani1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
Anterior-posterior images of the left hand have been traditionally used to estimate skeletal age for decades. More
recently, deep learning (DL) algorithms have been used to estimate skeletal age. The winning algorithm from the 2017
RSNA challenge was recently republished with relatively poor performance when a validation set with varied clinical
image appearances was used to test the model under computational stress. We sought to train a model with improved
results, with more robust performance to extensive variations in clinical image appearance.
Methods/Intervention
A multimodal DL model was developed and adapted for pediatric bone age assessment. This model includes the
DenseNet121 architecture as an imaging feature extractor and two shallow neural networks: one for transforming patient
sex into the imaging feature space, and the other for combining all features. The 2017 RSNA bone age dataset was
utilized, with 12,611 training images and 1,425 reserved for the validation set. There are 5,778 female and 6,833 male
subjects, ranging from 1 to 228 months in age, with a mean bone age of 127.2 months and a standard deviation of 41.7
months. Resizing to 512 x 512, foreground cropping, normalization of image intensities using train set statistics to zero
mean and unit standard deviation, random histogram shift, flipping, random affine transformations including rotation,
translation and scaling, and Gaussian noise addition were used to enhance model generalizability during training. To
perform the computational stress test, the same data augmentation pipeline was applied during inference on the validation
set. Mean absolute error (MAE) was reported as the performance metric.
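The stress-test protocol, in which the training-time augmentations are also applied at inference, can be sketched generically. Everything in this example is a stand-in: the toy model, the noise augmentation, and the data are hypothetical and serve only to show how clean and stressed MAE are computed side by side.

```python
import random


def mean_absolute_error(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def stress_test(model, images, truth, augment, seed=0):
    """Return (clean MAE, stressed MAE) for the same model and labels."""
    rng = random.Random(seed)
    clean = mean_absolute_error([model(x) for x in images], truth)
    stressed = mean_absolute_error([model(augment(x, rng)) for x in images], truth)
    return clean, stressed


def add_noise(image, rng, sigma=0.1):
    """Hypothetical augmentation: additive Gaussian noise on pixel values."""
    return [value + rng.gauss(0.0, sigma) for value in image]


def toy_model(image):
    """Hypothetical stand-in for a bone age regressor: mean pixel value."""
    return sum(image) / len(image)


images = [[1.0, 3.0], [2.0, 4.0]]
truth = [2.0, 5.0]
print(stress_test(toy_model, images, truth, add_noise))
```

A small gap between the two numbers indicates robustness to the perturbations used; a large gap flags sensitivity to image appearance.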
Results/Outcome
A MAE of 5.6 months was achieved with a batch size of 16, learning rate of 0.001, and 500 epochs, which is an
improvement on the previously published winning model performance of 6.8 months. Importantly, the model achieved this
result with extensive variations in clinical image appearance.
Conclusion
A deep neural network can accurately estimate bone age from radiographs of the left hand among pediatric patients up to 21
years of age, with robust performance under computational stressors.
Statement of Impact
Training and testing algorithms with computational stress will enhance real world performance. This should be confirmed
prospectively with clinical application.
Keywords
Bone age; Computational stress; CNNs
012 - Unveiling Bias in AI Model Training Data: Exploring the Impact of Intrinsic Data Variability on Lung Ultrasound Video Classification Models
Presenter: Saunak Bhattacharjee, Boston University
Saunak Bhattacharjee1, Umair Khan2, Russell Thompson3, Lauren P. Etter4, Ingrid Camelo5, Rachel C. Pieciak6, Ilse Castro-Aragon7, Bindu Setty7, Christopher C. Gill6, Margrit Betke1
1Boston University, Boston, MA, USA
2University of Trento, Trento, Italy
3University of Massachusetts, Dartmouth, MA, USA
4University of Wisconsin-Madison, Madison, WI, USA
5Augusta University, Augusta, GA, USA
6Boston University School of Public Health, Boston, MA, USA
7Boston Medical Center, Boston, MA, USA
Introduction/Background
Lung ultrasound (LUS) is an emerging tool for providing clinical support for patients with respiratory diseases.
However, its operator-dependent data acquisition and interpretation introduce potential variability in data collection and
analysis. Factors such as variations in scanning techniques, duration of the recorded scan, and interpretation of visual
patterns may introduce discrepancies. These discrepancies can affect data consistency and potentially bias the model
training, impacting the development of generalizable artificial intelligence (AI)-based models.
Methods/Intervention
To investigate the inherent bias in LUS video data and its impact on AI model training, we employed a transformer-based
video classification model aimed at identifying lung consolidations among pediatric patients. This model was
complemented by a frame-level transformer-based classification model that aggregates frame-level predictions to produce
a video-level score. Both models were trained and validated on 2,400 videos collected in a sweep-acquisition fashion from
200 pediatric patients with pneumonia and were subsequently tested on an external dataset comprising another 2,400
LUS videos from 200 healthy individuals.
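The abstract does not state how frame-level predictions are aggregated into a video-level score. The sketch below assumes two common choices (mean and maximum of the per-frame probabilities); the function name and the `method` parameter are hypothetical.

```python
def video_score(frame_probs, method="mean"):
    """Aggregate per-frame consolidation probabilities into one video-level score.

    The aggregation rule is an assumption; the source does not specify it.
    """
    if not frame_probs:
        raise ValueError("video has no frames")
    if method == "mean":
        return sum(frame_probs) / len(frame_probs)
    if method == "max":
        return max(frame_probs)
    raise ValueError(f"unknown method: {method}")

# Toy per-frame probabilities (hypothetical)
mean_score = video_score([0.2, 0.4, 0.9])
max_score = video_score([0.2, 0.4, 0.9], method="max")
```

Mean aggregation dilutes a few positive frames in a long sweep, while maximum aggregation is sensitive to a single confident frame, so video length can interact with either rule.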
Results/Outcome
The analysis of the dataset revealed a correlation coefficient of 0.4039 between larger lung consolidations and longer
video lengths, suggesting moderate operator bias in data collection. The video classification model achieved
100% accuracy on the external dataset. The frame-level model consistently predicted all frames from healthy individuals
as lacking consolidations, with a confidence level above 0.73, demonstrating its ability to generalize to an external
dataset.
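The type of correlation coefficient is not stated. Assuming a Pearson coefficient between consolidation size and video length, it could be computed as in this sketch (the variable names and sample values are hypothetical):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy data (hypothetical): consolidation size vs. video length in frames
sizes = [1.0, 2.0, 3.0, 4.0]
lengths = [40, 55, 50, 70]
r = pearson_r(sizes, lengths)
```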
Conclusion
This study highlights the need for careful consideration of data biases during AI model training to ensure accurate AI-aided diagnosis. Despite its high accuracy, caution is advised when generalizing these results, as the identified biases
could affect the future performance of the models.
Statement of Impact
The findings underline the critical importance of addressing biases in LUS video datasets to develop reliable and
generalizable AI-based diagnostic tools. Ensuring consistent data collection and interpretation practices is essential for
the advancement of AI in medical diagnostics.
Keywords
Lung Ultrasound (LUS); Data Bias; Transformer-based Model; Lung Consolidation
Oral Presentations
Large Language Models – Session 1 | Scientific Abstract Presentations
013 - Application of a Multi-agent Open-source Large Language Model for Data Abstraction from Radiology Report
Presenter: Sanaz Vahdati, Mayo Clinic - Rochester
Sanaz Vahdati1, Pouria Rouzrokh1, Elham Mahmoudi1, Bradley J. Erickson1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
Recent advancements in large language models (LLMs) have opened new frontiers in artificial intelligence. Multi-agent
systems have shown remarkable effectiveness in specialized tasks like data extraction. By emulating collaborative
cognitive processes, these systems transform problem-solving approaches, enabling more sophisticated and holistic data
analysis. Their ability to distribute complex, multifaceted tasks among specialized agents leads to enhanced decision-making capabilities and more comprehensive solutions. In this regard, we aimed to build a multi-agent language model for
radiology report data extraction to investigate their capabilities and assess their performance in deriving specific
conclusions from complex medical data.
Methods/Intervention
In this work, we collected 212 radiology reports from two different pathologies. We aimed to extract the presence or
absence of acute cervical spine fracture and liver metastasis from radiology reports of the cervical spine and
abdominopelvic CT scans collected between January and February 2022. We employed the open-source Llama3-70B-Instruct
model for inference and applied few-shot prompting for each agent. We propose a three-tier multi-agent architecture.
This system comprises two verification agents and a reconciliation expert agent, operating in a sequential manner. The initial two agents independently extract data, which is subsequently fed, along with the original report, to the reconciliation agent. This final agent synthesizes the information to produce a comprehensive conclusion. We evaluated the efficacy of this pipeline through performance metrics.
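The sequential flow described above (two independent extraction agents, then a reconciliation agent that also sees the original report) can be sketched as follows. This is an illustration under stated assumptions: the `LLM` callable type, the prompt wording, and `stub_llm` are hypothetical and stand in for a real model call.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]

def extract(llm: LLM, report: str, question: str, few_shot: str = "") -> str:
    """One extraction agent: few-shot examples, then the report and the question."""
    prompt = f"{few_shot}\nReport:\n{report}\nQuestion: {question}\nAnswer:"
    return llm(prompt).strip()

def reconcile(llm: LLM, report: str, question: str, answers: Sequence[str]) -> str:
    """Reconciliation agent: sees the original report plus both earlier answers."""
    listed = "\n".join(f"Agent {i + 1}: {a}" for i, a in enumerate(answers))
    prompt = (
        f"Report:\n{report}\nQuestion: {question}\n"
        f"Earlier answers:\n{listed}\nFinal answer with a brief justification:"
    )
    return llm(prompt).strip()

def run_pipeline(agent_a: LLM, agent_b: LLM, reconciler: LLM,
                 report: str, question: str, few_shot: str = "") -> str:
    """Two independent extractions, then one sequential reconciliation step."""
    answers = [extract(agent_a, report, question, few_shot),
               extract(agent_b, report, question, few_shot)]
    return reconcile(reconciler, report, question, answers)

def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return " present \n"

result = run_pipeline(stub_llm, stub_llm, stub_llm,
                      "Acute fracture of the C5 vertebral body.",
                      "Is an acute cervical spine fracture present?")
```

Passing the agents in as plain callables keeps the orchestration independent of any particular inference backend.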
Results/Outcome
The multi-agent model for liver metastases assessment demonstrated high performance, achieving accuracy of 0.95, F1
score of 0.95, Positive Predictive Value (PPV) of 0.96, and Negative Predictive Value (NPV) of 0.95. Our model could
exclude patients with a prior history of metastasis from new diagnosis classification. For the extraction of acute cervical
spine fracture presence, the pipeline exhibited robust performance metrics: accuracy of 0.96, F1 score of 0.92, PPV of
0.92, and NPV of 0.97. In both tasks, the reconciliation agent provided salient cues pertaining to the final determination,
facilitating further elucidation of results.
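For reference, the reported metrics all derive from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts, not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": ppv,                       # positive predictive value (precision)
        "npv": tn / (tn + fn),            # negative predictive value
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

# Hypothetical counts for illustration only
metrics = binary_metrics(tp=8, fp=2, tn=9, fn=1)
```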
Conclusion
In conclusion, we built a multi-agent orchestration using debating scenarios to boost collective reasoning of data
abstraction from the radiology report.
Statement of Impact
Multi-agent models with the potential for collective intelligence bear considerable promise for data extraction from radiology
reports.
Keywords
Radiology report; Large language models; Multi-agent
014 - Evaluation of Llama2 and Llama3 for Automated Extraction of Ground Truth from Radiology Reports for Post-Deployment Monitoring of Pulmonary Embolism and Intracranial Hemorrhage Detection AI Models
Presenter: Theo Dapamede, Emory University
Theo Dapamede1, Bardia Khosravi2, Chad Robichaux1, Aawez Mansuri1, Mohammadreza Chavoshi3, Alex Belov1, Angela Udongwo4, Chinonyelum Igwe5, Frank Li1, Beatrice Brown-Mulry1, Hanssen Li1, John Moon1, Judy Gichoya1, Hari Trivedi1
1Emory University, Atlanta, GA, USA
2Yale University, New Haven, CT, USA
3Tehran University of Medical Sciences, Tehran, Iran
4Temple University, Philadelphia, PA, USA
5University of Ibadan, Ibadan, Nigeria
Introduction/Background
Clinical use of AI models requires post-deployment monitoring for performance and potential drift. However, this requires comparison of model outputs to ground-truth radiologist interpretations, which can be laborious. We evaluate the performance of two generations of open-source large language models (LLMs) for label extraction tasks for pulmonary embolism (PE) and intracranial hemorrhage (ICH) against human-annotated ground truths.
Methods/Intervention
We identified 4,668 CT PE exams and 74,394 non-contrast CT head exams from 2020-2022 and randomly sampled 250 reports for each exam type for manual annotation. PE labels were: PE, acuity, laterality, largest depth, right heart strain, and pulmonary artery hypertension. ICH labels were: ICH, acuity, laterality, subtype, midline shift, and mass effect. Reports were annotated by 6 human annotators using a browser-based interface and difficult cases were flagged for review by a senior radiologist. Multiple prompt styles were tested in preliminary analysis using Llama 2 7B. The top performing prompting style was selected and used to evaluate Llama2 (7B, 13B, and 70B) and Llama3 (8B and 70B) models.
Results/Outcome
Llama3 8B had the highest overall performance for both PE (sensitivity: 1.0; specificity: 1.0) and ICH (sensitivity: 0.93; specificity: 1.0). Across all models, performance was lowest for PE depth (accuracy range: 0.25-0.61) and ICH acuity (accuracy range: 0.63-0.74). Llama2 performance improved with increasing parameters for most classes. However, Llama3 8B and 70B performance was similar across all categories. Llama3 8B significantly outperformed Llama2 7B for all labels, despite similar parameter sizes.
Conclusion
This study evaluated Llama2 and Llama3 models to extract labels for PE and ICH against human-annotated ground truths. Llama3 8B had the highest performance with significant improvements over Llama2. Model performance for extracting binary PE and ICH labels was robust; however, no model was able to extract subgroup labels for PE or ICH with acceptable accuracy.
Statement of Impact
LLMs are a promising tool for post-deployment monitoring of AI models and can successfully extract binary ground truth from ICH and PE radiology reports for comparison to AI model predictions. If properly tuned, these models may also allow for robust subgroup evaluation to deliver further insights into model performance.
Keywords
Llama; Pulmonary Embolism; Intracranial Hemorrhage
015 - Examining Patient-Large Language Model Interactions Using the PromptWise Paradigm for Medical Education
Presenter: Satvik Tripathi, University of Pennsylvania
Satvik Tripathi1, Rithvik Sukumaran1, Suhani Dheer1, Tessa S. Cook1
1University of Pennsylvania
Introduction/Background
With the rise of large language models (LLMs) for general-purpose use, researchers have begun studying how they might improve patient care. In earlier work, we proposed the PromptWISE (Prompt engineering for Well-structured, Interactive, and Supportive Education) paradigm to educate patients on using LLMs to understand their medical issues. PromptWISE helps patients engineer higher-quality prompts that can enhance their medical experience and reduce the burden on medical professionals.
Methods/Intervention
We applied the six-point PromptWISE guidelines to answer a set of 25 questions patients might ask LLMs about their health or medical care. Using Amazon Mechanical Turk, we conducted an IRB-approved survey (n=1074) to compare a pair of LLM-generated responses, one from a simple prompt and the other from a PromptWISE-designed prompt. GPT-4 provided all text generations. Volunteers picked the better response based on three criteria: clarity, information, and relevance. We also collected demographic information, including gender, race, age bracket, income bracket, education level, and healthcare employment status. Statistical analyses were performed to determine the generalizability and reproducibility of our results.
Results/Outcome
The demographic reporting indicated a diverse cohort of volunteers, reducing any reporting biases. In our analysis, volunteers overwhelmingly (n=837) chose responses generated from PromptWISE prompts over those generated from non-PromptWISE prompts (p < 0.0001). We also found that non-PromptWISE prompts lacked essential details, leading to inaccurate or irrelevant responses.
Conclusion
The results demonstrate the tangible impact of prompt engineering on patient-LLM interactions. Querying LLMs with prompts crafted following our guidelines yielded more comprehensive and precise responses while also refraining from giving any medical advice. Finally, volunteers demonstrated a strong preference for responses to PromptWISE prompts, further indicating the impact and need for prompt engineering when interacting with LLMs.
Statement of Impact
PromptWISE significantly improves patient-LLM interactions by generating clearer, more informative responses. This study underscores the importance of prompt optimization in enhancing patient engagement and accuracy when utilizing LLMs for medical information retrieval.
Keywords
Large Language Models; Prompt Engineering; Patient Education; Patient-Centered Care
016 - GPT-Based Automated Classification and Labeling of Surgical Renal Pathology Reports
Presenter: Satvik Tripathi, University of Pennsylvania
Satvik Tripathi1, Rithvik Sukumaran1, Dana Alkhulaifat1, Charles M. Chambers1, Darco Lalevic1, Hanna Zafar1, Tessa S. Cook1
1University of Pennsylvania
Introduction/Background
Human annotation of reports to acquire high-quality data for model training can be costly and time-consuming. Leveraging automated labeling with large language models can be a valuable and cost-effective tool to streamline annotation processes. Our aim was to assess GPT-4’s performance in labeling renal surgical pathology reports using various prompting-based techniques.
Methods/Intervention
Renal surgical pathology reports from three health systems (n=40) within the same state were labeled by two radiologists with 10 and 14 years of experience as “malignant,” “indeterminate,” “benign,” or “ignore.” “Ignore” was used for reports of any pathology not specifically from a renal mass. The reports were distributed equally among the four classes. Prompt engineering for GPT-4 was utilized with zero-, one-, and few-shot learning techniques to classify the reports. The main performance evaluation metric was accuracy.
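Zero-, one-, and few-shot prompting differ only in how many labeled examples precede the report to classify. The sketch below shows one way to assemble such prompts; the instruction wording and the example reports are hypothetical, and only the four class names come from the abstract.

```python
LABELS = ["malignant", "indeterminate", "benign", "ignore"]

def build_prompt(report, examples=()):
    """Build a classification prompt with zero or more (text, label) examples."""
    lines = ["Classify the pathology text below as one of: " + ", ".join(LABELS) + ".", ""]
    for text, label in examples:
        lines += [f"Report: {text}", f"Label: {label}", ""]
    lines += [f"Report: {report}", "Label:"]
    return "\n".join(lines)

# Hypothetical example reports, for illustration only
shots = [
    ("Clear cell renal cell carcinoma, grade 2.", "malignant"),
    ("Oncocytoma.", "benign"),
    ("Gallbladder with chronic cholecystitis.", "ignore"),
]
zero_shot = build_prompt("Renal mass, biopsy: oncocytic neoplasm.")
one_shot = build_prompt("Renal mass, biopsy: oncocytic neoplasm.", shots[:1])
few_shot = build_prompt("Renal mass, biopsy: oncocytic neoplasm.", shots)
```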
Results/Outcome
GPT-4 achieved 70%, 77.5%, and 92.5% accuracy with zero-, one- and few-shot learning, respectively. The incorrect classification was the highest in the “Indeterminate” (n = 4) class for zero-shot prompting and the “Ignore” class for one- and few-shot prompting techniques (n = 5 and 2, respectively). GPT-4 outperformed our existing Deep Learning-based methods.
Conclusion
GPT-4 holds the potential to classify renal surgical pathology reports with significant accuracy, even without extensive training data. The few-shot prompting technique achieved the highest accuracy, demonstrating the model's ability to adapt and learn from minimal examples. This capability could streamline the annotation process, reduce the burden on radiologists, and enable faster data processing. Furthermore, the model's performance in handling varied classes of pathology reports underscores its versatility and potential for broader applications in medical report classification.
Statement of Impact
Automatic labeling of reports can enable prompt identification of important clinical findings, leading to timely intervention and improved treatment outcomes.
Keywords
Large Language Models; Automated Labeling; Pathology; Prompt Engineering
017 - Revolutionizing Radiological Research: LLMs for Rapid, Accurate Data Extraction from Clinical Reports
Presenter: Ali Ganjizadeh, Mayo Clinic - Rochester
Ali Ganjizadeh1, Bardia Khosravi1, David A. Woodrum1, Bradley J. Erickson1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
To evaluate the efficacy of Large Language Models (LLMs) in extracting clinically relevant information from MR-guided intervention reports, enabling efficient database creation and comprehensive retrospective analysis, and to introduce RadPrompter, a novel tool enhancing this extraction process.
Methods/Intervention
We employed the Meta-Llama-3-70B-Instruct model to process 2,016 MR-guided intervention reports. The model was tasked with extracting key clinical features, including organ, anatomical location, ablation type, assisted modality, needle specifications, lesion type, and treatment cycles. To optimize extraction, we developed a new tool, RadPrompter, which interfaces with the LLM engine and enhances its information retrieval capabilities. The system utilized 2 Nvidia A100 GPUs and 160 GB of RAM, processing two reports simultaneously with the model's temperature set to 0.0 to minimize hallucination.
Results/Outcome
Leveraging our custom tool, the LLM successfully processed all 2,016 reports in 6 hours and 27 minutes, averaging 15 seconds per report. We manually inspected 200 reports, and the model achieved 100% accuracy in extracting the specified clinical data points, demonstrating high reliability in information retrieval from complex medical narratives.
Conclusion
This study showcases the powerful potential of LLMs, augmented by specialized tools like RadPrompter, in revolutionizing radiological research. By rapidly and accurately extracting structured data from unstructured reports, this approach can significantly enhance the efficiency and scope of retrospective analyses. It enables researchers to process large volumes of historical data with unprecedented speed and accuracy.
Statement of Impact
The application of LLMs, coupled with custom extraction tools, in radiology report analysis, represents a paradigm shift in medical research methodology. It offers a scalable solution to the challenge of mining valuable insights from vast repositories of unstructured clinical data. This technology has the potential to accelerate research timelines, uncover novel patterns in patient care, and ultimately contribute to the advancement of personalized medicine in interventional radiology.
Keywords
Large Language Models; Artificial Intelligence; Interventional Radiology
018 - The Effect of Prompt Elements on Labelling Incidental Breast Findings by Llama3-8B in Radiology Reports
Presenter: Benjamin E. Rush, University of Wisconsin-Madison
John Garrett1, Thanh Nguyen1, Benjamin E. Rush1, Ryan W. Woods1
1University of Wisconsin-Madison, Madison, WI, USA
Introduction/Background
Breast cancer is a leading cause of mortality among women, and early detection can improve survival probability. Incidental breast abnormalities are identified in approximately 7% of chest CT scans, of which about 28% are malignant. However, the radiology reports from CT scans are often lengthy, unstandardized text where incidental findings might be overlooked by physicians. We propose using large language models (LLMs) to label CT radiology reports for incidental breast findings, which could flag for additional diagnostic imaging.
Methods/Intervention
We selected 17,752 routine chest CTs from female patients ages 40-72 obtained at UW-Health between 2015 and 2017. We subselected 3,226 exams with “breast” in the radiology report and randomly sampled 500 exams for evaluation. We compared the performance of Llama3-8B in labelling incidental breast findings with varying prompts to that of a human reader. The LLM was tasked with labeling “Yes” or “No” for incidental breast findings with the role of a radiologist or annotator. Prompt elements included the task and radiology report, and varying combinations of background, keywords, and examples. Each prompt was run 30 times to evaluate consistency. We conducted sensitivity, specificity, and Fleiss’ Kappa consistency analyses to compare the human reader and LLM.
Results/Outcome
The human reader identified 125 (25.0%) of reports as having incidental breast findings. The LLM’s average number of positively labelled cases ranged from 236.1 (47.2%) to 412 (82.4%) of reports. Sensitivity ranged from 0.76 to 0.99, though the highest average positive predictive value was 0.50. Specificity ranged from 0.23 to 0.71, with the lowest negative predictive value at 0.86. While sensitivity generally decreased with more prompt elements, specificity increased with more detailed prompts. Fleiss’ Kappa indicated high agreement among prompt iterations with the at κ=0.94.
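Fleiss' Kappa measures agreement among a fixed number of ratings per item, here the repeated runs of one prompt on each report. The sketch below is a plain implementation of the standard formula; the count matrices are hypothetical and much smaller than the study's 500 reports.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of runs labeling item i as category j.

    Every row must sum to the same number of runs n (with n > 1).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(counts)
    n_runs = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of ratings in each category
    p_cat = [sum(row[j] for row in counts) / (n_items * n_runs) for j in range(n_cats)]
    # Observed agreement within each item
    p_item = [(sum(c * c for c in row) - n_runs) / (n_runs * (n_runs - 1)) for row in counts]
    p_obs = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical example: 3 reports, 30 runs each, categories ("Yes", "No")
kappa_perfect = fleiss_kappa([[30, 0], [0, 30]])
kappa_mixed = fleiss_kappa([[30, 0], [0, 30], [15, 15]])
```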
Conclusion
The LLM and prompts labelled many false positives but had high negative predictive values with high consistency across all prompts. Future work will evaluate the parameter size of models on metric performance.
Statement of Impact
LLMs remain a possible flagging and prevention system for missed details; however, larger models or fine-tuning might be required to match human performance.
Keywords
Large Language Models; Incidental Findings; Breast Health; Radiology Reports
Oral Presentations
Image Segmentation | Scientific Abstract Presentations
019 - Automated Pancreatic Perivascular Adipose Tissue Detection on Abdominal CT as a Biomarker for Type 2 Diabetes
Presenter: Anisa V. Prasad, National Institutes of Health
Anisa V. Prasad1, Tejas S. Mathai1, Pritam Mukherjee1, Abhinav Suri1, Jianfei Liu1, Ronald M. Summers1
1National Institutes of Health, Bethesda, MD, USA
Introduction/Background
Early diagnosis of diabetes mellitus is critical for preventing disease and improving health outcomes. Intrapancreatic fat deposition has previously been established as a biomarker for diabetes, but little is known about the role of pancreatic perivascular adipose tissue (PVAT), adipose tissue surrounding blood vessels supplying the pancreas. We developed a deep learning framework to quantify pancreatic PVAT on abdominal CT scans and applied it to identify CT biomarkers for type 2 diabetes.
Methods/Intervention
1350 contrast-enhanced CT (CECT) scans with ground truth labels from the public PANORAMA dataset were used to train a 3D nnUNet model to segment pancreatic anatomy (parenchyma, vasculature, ducts, pancreatic ductal adenocarcinoma lesions). It was then applied to an internal dataset containing 606 CECT scans with corresponding diabetes outcomes. Pancreatic adipose tissue (AT) and PVAT were derived from the predicted segmentations and used to measure several biomarkers, such as volume and attenuation. These biomarkers were then correlated to diabetes status using univariate and multivariate logistic regression. Metrics such as AUC were assessed to determine the best set of predictors for diabetes outcomes.
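The AUC used to compare predictor sets equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that rank-based definition follows; the scores and labels are hypothetical, and a real analysis would take the scores from the fitted logistic regression.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Labels are 1 for diabetic, 0 for non-diabetic.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities, for illustration only
auc = roc_auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of cases, which is acceptable for a cohort of a few hundred scans.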
Results/Outcome
Four pancreatic PVAT biomarkers were measured: volume, mean attenuation, standard deviation (SD) attenuation, and fat fraction. Significant differences (p < 0.001) were found across diabetic and non-diabetic patients for all four biomarkers, with mean attenuation demonstrating a decrease in diabetic patients while the other three metrics were increased. Similar findings were observed for the corresponding pancreatic AT measurements. Among all combinations of the eight biomarkers measured, the best set of predictors for diabetes was (1) pancreatic AT mean attenuation, (2) pancreatic PVAT mean attenuation, and (3) pancreatic PVAT fat fraction, achieving a maximum AUC of 0.88 with sensitivity 0.90 and specificity 0.71.
Conclusion
We present a framework to automatically identify pancreatic PVAT. Our analysis suggests that metrics derived from these segmentations, such as pancreatic PVAT mean attenuation and fat fraction, are viable biomarkers for type 2 diabetes.
Statement of Impact
We provide an automated method for quantifying pancreatic PVAT that can be implemented to elucidate its role in disease progression. The biomarkers identified using this tool underscore the potential for opportunistic screening of diabetes mellitus using abdominal CT scans.
Keywords
Diabetes Mellitus; CT; Pancreas; Intrapancreatic Fat Deposition
020 - Deep Learning for Incidental Parotid Tumors on CT: Optimal Methods for Screening and Segmentation
Presenter: Wei Shao, University of California Irvine
Wei Shao1, Shirin Salehi1, Chanon Chantaduly1, Hayden Troutt1, Peter Chang1
1University of California Irvine, Irvine, CA, USA
Introduction/Background
Parotid gland tumors (PGTs) are the most common salivary gland tumors. With increasing imaging utilization, most PGTs are detected incidentally on CT; however, many are overlooked by radiologists prioritizing acute pathology. This study presents a deep learning (DL) solution for opportunistic PGT detection on CT with a focus on optimizing complementary objectives for tumor screening and segmentation.
Methods/Intervention
A retrospective cohort of 11,449 consecutive non-contrast head CT exams was aggregated from two academic centers. PGTs, defined as a parotid mass >10 mm, were identified from radiology or histopathology reports and annotated with a mask by an expert neuroradiologist. In total, 219 PGTs were identified (N=112 hospital A, N=107 hospital B). A multistage DL pipeline was developed for PGT detection. First, an initial model localizes each parotid gland. Subsequently, a single 3D U-Net simultaneously implements the segmentation (per-voxel spatial overlap) and screening (per-exam tumor detection) tasks. To convert segmentation outputs into binary screening results, thresholds for positive voxel predictions were calibrated for optimal accuracy. Given the complementary objectives of the segmentation and screening tasks, various loss functions (binary cross-entropy, focal loss, soft Dice) and training cohorts (full cohort, positive only) were evaluated. Performance was assessed using five-fold cross-validation.
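The two evaluation views described above can be sketched on flattened binary masks: a Dice score for per-voxel overlap, and a voxel-count threshold that turns a predicted mask into a per-exam screening decision. The function names and the `min_voxels` parameter are assumptions for illustration; the calibrated thresholds are not given in the abstract.

```python
def dice(pred, truth):
    """Dice overlap between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks are treated as perfect agreement
    return 1.0 if total == 0 else 2 * intersection / total

def screen_positive(pred, min_voxels):
    """Flag the exam as tumor-positive if enough voxels are predicted positive."""
    return sum(pred) >= min_voxels

# Toy masks (hypothetical)
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
overlap = dice(pred_mask, true_mask)
flagged = screen_positive(pred_mask, min_voxels=2)
```

Raising `min_voxels` trades sensitivity for specificity, which is why such a threshold needs calibration separately from the segmentation objective.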
Results/Outcome
Overall, the best screening model achieved a per-exam specificity, sensitivity, PPV, NPV, and accuracy of 0.947, 0.719, 0.858, 0.878, and 0.872, respectively, while the best segmentation model achieved a Dice score of 0.71. Of the positive predictions, six tumors were missed by the original interpreting physician. In general, cross-entropy (CE) outperformed focal loss (FL) for segmentation, while FL outperformed CE for screening due to improved specificity and lower false positives. Soft Dice (SD) tended to improve both tasks. The use of negative training examples significantly decreased tumor Dice score while reducing false positives for the screening task.
Conclusion
By combining a first-pass screening model with a subsequent focused segmentation model, a unified DL framework can identify and delineate PGTs on routine CT with high accuracy.
Statement of Impact
A DL model can identify incidental PGTs on routine CT imaging with high accuracy including tumors missed in a realistic clinical workflow.
Keywords
Deep Learning; Parotid Tumor; Screening; Segmentation
021 - Global Local Attention for Prostate Zonal Segmentation
Presenter: Chetana Krishnan, University of Alabama at Birmingham
Chetana Krishnan1, Ezinwanne Onuoha1, Alex Hung2, Kyung H. Sung2, Harrison Kim1
1University of Alabama at Birmingham, Birmingham, AL, USA
2University of California Los Angeles, Los Angeles, CA, USA
Introduction/Background
We focus on representation learning for large-scale image segmentation. Besides backbones, training pipelines, and loss functions, key methods have explored various spatial pooling and attention mechanisms essential for creating robust global image representations. Attention mechanisms differ based on feature tensor interactions (local vs. global) and the dimensions they target (spatial vs. channel). However, most studies examine only one or two forms of attention. Focusing on global and local descriptors, we can provide empirical evidence of the interaction of all forms of attention and improve the state of the art on standard benchmarks.
Methods/Intervention
The proposed GLCSA network uses multi-stream processing to capture comprehensive contextual information from images. The local attention stream (LAS) focuses on detailed information at individual spatial locations and specific feature channels, highlighting fine-grained patterns and textures. The global attention stream (GAS) models interactions across the entire spatial dimension and among feature channels, ensuring broader relationships are captured. The LAS uses fine-scale convolutions to discern intricate details, while the GAS leverages self-attention to integrate long-range dependencies. The information from these streams is embedded into feature maps, which are then fused into a unified feature map. Finally, a pooling operation distills the combined information into a compact representation for robust image analysis. We trained GLCSA with 34 prostate scans (over 20 slices per scan), and the model was tested with ten unseen scans. Performance was evaluated using the Dice similarity coefficient. Several networks were compared with and without GLCSA.
Results/Outcome
GLCSA achieved a higher DSC and minimum DSC, indicating superior performance.
Conclusion
GLCSA dynamically captures global–local spatial and channel information to address the challenge of prostate segmentations and the limitations of 2D networks.
Statement of Impact
GLCSA can improve the diagnosis, treatment planning, and monitoring of prostate pathology and volume for clinical diseases.
Keywords
Attention; Segmentation; Spatial; Channel
022 - Liver Surface Nodularity for Staging Hepatic Fibrosis on CT: A Comparative Study of Liver Segmenters
Presenter: Tejas S. Mathai, National Institutes of Health
Tejas S. Mathai1, Meghan G. Lubner2, Perry J. Pickhardt2, Ronald M. Summers1
1National Institutes of Health, Bethesda, MD, USA
2University of Wisconsin-Madison, Madison, WI, USA
Introduction/Background
Cirrhosis is the 12th leading cause of death in the US, and liver fibrosis can be caused by metabolic disorders, alcoholism, and Hepatitis B/C virus. Earlier stages (METAVIR F0 – F2) are reversible with therapy, but later stages (F3 - advanced fibrosis and F4 - cirrhosis) are irreversible. Liver Surface Nodularity (LSN) score, a non-invasive CT-based biomarker that measures the left hepatic lobe surface smoothness, can distinguish later fibrosis stages. However, it depends on a precise liver segmentation.
Methods/Intervention
480 patients underwent CT imaging at Institution-A with fibrosis (METAVIR) staged using biopsy. An internal deep learning tool (INT) segmented the full liver and 8 Couinaud segments. The public TotalSegmentator (TS) tool also segmented the liver. The extents of Couinaud segments 2 and 3 were found. A fully automated image analysis technique detected the liver surface in each 2D slice, and a smooth spline (4th order) was fit to it. The LSN score was the mean distance between the detected surface and fit spline, and higher scores indicated worsening fibrosis. Youden indices were used to find the optimal LSN cutoffs for each fibrosis stage. ROC curves by INT for advanced fibrosis (F3-4 vs. F0-F2) and cirrhosis (F4 vs. F0-3) were compared against TS. An AUC below 0.6 was considered clinically ineffective.
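The Youden index J = sensitivity + specificity - 1 selects the cutoff that maximizes J across candidate thresholds. A minimal sketch follows, assuming higher LSN scores indicate disease (as stated above); the scores and labels are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A case is called positive when score >= cutoff. Labels are 1 (disease) or 0.
    On ties in J, the lowest cutoff is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= cut)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# Hypothetical LSN scores: 0 = F0-F2, 1 = F3-F4
cutoff, j_stat = youden_cutoff([2.0, 2.4, 2.9, 3.1, 3.5, 3.8], [0, 0, 0, 1, 1, 1])
```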
Results/Outcome
AUCs were similar between INT and TS for prediction of cirrhosis (F4, 87.8% vs. 88.7%, p = .143) and advanced fibrosis (≥ F3, 82.5% vs. 83.9%, p = .381). A statistical bootstrap test revealed no differences between the two tools for all three clinically important stages. But the specificity was higher with INT for advanced fibrosis (79.5% vs. 65.1%) and significant fibrosis (73.3% vs. 49.5%), while being comparable for cirrhosis (79.5% vs. 78.4%). Both INT and TS had good agreement (R^2 of 0.8) of computed LSN scores.
Conclusion
Both the internal tool and TotalSegmentator attained comparable performance for fibrosis staging. However, TotalSegmentator did not achieve high specificity.
Statement of Impact
Both INT and TS tools can predict the fibrosis stage in ~45 seconds compared to the ~2 minutes needed to manually measure the LSN in a CT volume. They show promise for population-based studies.
Keywords
CT; Liver; Liver Fibrosis; Cirrhosis
023 - Opportunistic Detection of Splenomegaly Using Automated AI-Based Measurements and Reporting of Organ Volumes in the Clinical Workflow
Presenter: David Y. Zhang, University of Pennsylvania - Penn Medicine
David Y. Zhang1, Ari Borthakur1, Jeffrey Duda1, Neil Chatterjee2, Rohan Valia1, James C. Gee1, Charles E. Kahn Jr.1, Daniel J. Rader1, Hersh Sagreiya1, Walter Witschey1
1University of Pennsylvania - Penn Medicine, Philadelphia, PA, USA
2Northwestern University, Evanston, IL, USA
Introduction/Background
Splenomegaly, an enlargement of splenic size and weight with a prevalence of 2% in the US, reflects a disruption of the organ’s complex role in immunological defense and hematopoiesis (1,2). Due to a broad array of underlying conditions such as hyperplasia, passive congestion, and infiltrative disease, it is often differentially diagnosed by CT imaging (3). Due to its widespread availability, opportunistic screening using CT captures information about clinical conditions when people are imaged for other reasons (4) and can augment the radiology workflow with detailed quantitative imaging traits (radiomics) that are cumbersome to obtain in traditional workflows that do not support computational imaging (5,6). Opportunistic screening for splenomegaly was performed in a disease-agnostic medical population (7); however, associations with other diseases were missing. To address these issues, we built and deployed an end-to-end opportunistic screening workflow using AI-based automated image analysis embedded in the radiology clinical workflow (8). We validate the system by measuring spleen volumes and demonstrating associations with systemic multi-organ diseases using a phenome-wide association study in patients who underwent CT at our institute in the last year.
Methods/Intervention
In an IRB-approved study, spleen volumes were estimated for 13,636 individuals from CTs using TotalSegmentator (9). Splenomegaly assignments were determined if a patient had an ICD diagnosis for the condition (ICD-10 = R16.1 or R16.2, ICD-9 = 789.2) prior to when the CT scan was performed. A phenome-wide association study was performed against phecodes adjusting for sex, age, age^2, principal components 1-10, and BMI.
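An organ volume follows from a segmentation mask by multiplying the number of labeled voxels by the volume of one voxel. The sketch below assumes the voxel spacing is known in millimeters; the function name and values are hypothetical, and reading the mask from a segmentation tool's output is left out.

```python
def volume_ml(n_voxels, spacing_mm):
    """Volume in milliliters from a voxel count and (x, y, z) spacing in mm."""
    sx, sy, sz = spacing_mm
    return n_voxels * sx * sy * sz / 1000.0  # 1 mL = 1000 cubic millimeters

# Hypothetical flattened spleen mask and spacing
mask = [1] * 100_000 + [0] * 50_000
spleen_ml = volume_ml(sum(mask), (1.0, 1.0, 2.0))
```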
Results/Outcome
AI-based spleen volume measurements were 484.3 ± 206.2 mL (mean ± sd) and 202.5 ± 114.9 mL for patients with splenomegaly and other patients, respectively. Increased spleen volumes were strongly associated with over 50 clinical entities, among them digestive system diseases and multi-organ diseases such as infections, endocrine disease, and kidney disease.
Conclusion
Opportunistic screening for splenomegaly using an AI-based workflow was validated against physician-determined enlarged spleen in a clinical population with systemic multi-organ diseases. PheWAS results serve as a nexus for future discovery and support the use of genetic analysis (GWAS).
Statement of Impact
AI-based measurements of spleen volume from CT images can be used to opportunistically screen patients for splenomegaly.
Keywords
Splenomegaly; Segmentation; Phenome-wide association study
024 - Validation of UniverSeg for Interventional Abdominal Angiographic Segmentation
Presenter: Michael J. Kovalchick, Wayne State University
Michael J. Kovalchick1, Chad Klochko2, Kundan Thind2
1Wayne State University, Detroit, MI, USA
2Henry Ford Health, Detroit, MI, USA
Introduction/Background
Automatic segmentation of angiographic structures can aid in assessing the presence and extent of vascular disease. Recent deep learning segmentation models promise automated processing but lack validation on interventional angiographic data. This study performs a validation test on the UniverSeg model to examine its suitability for future use.
Methods/Intervention
After IRB approval, a retrospective review identified 234 patients who underwent interventional fluoroscopy of the celiac axis with iodinated contrast via intravenous catheter injection between January 1st, 2019, and December 31st, 2022. From 261 fluoroscopic acquisitions, 303 images with maximum contrast from the contrast agent were selected. From each image, a partition of 128x128 pixels was selected to encompass arterial details, and a corresponding binary mask was subsequently generated by convex hull calculation. The resulting image-mask pairs were distributed into three classes of 101 pairs each. Classes were defined by decreasing arterial diameter and the number of bifurcations of the vessel. UniverSeg was applied to each class independently in a comprehensive 5-fold nested cross-validation test. An analysis of model performance for in-context learning was performed for each class to determine the minimum size for average model convergence. For each class size, ranging from 1 to 81 pairs, five sample images were tested against the class with twenty repetitions iterated across the class.
Results/Outcome
Dice-Similarity-Coefficients comparing UniverSeg output to generated masks across the three classes with decreasing arterial diameters were 78.7%, 72.5%, and 59.9% (σ=5.96, 7.99, 14.29). Balanced-Average-Hausdorff-Distances, representing the maximum separation distance between prediction surface and ground truth, were 0.86, 0.71, and 1.16 pixels (σ=0.37, 0.52, 0.68), respectively. Inverted mask testing revealed degradation of performance in line with published UniverSeg expectations. In all cases, class size testing showed non-linear improvement, with performance plateauing as more image sets were used for in-context learning. All test images converged in performance to within ±1.34 Dice-Score of the full class size value by N=51.
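For readers unfamiliar with the overlap metric reported above, the following is a minimal sketch of a Dice similarity coefficient on binary masks. It is not the authors' evaluation code; the function name, the toy 128x128 masks, and the convention for two empty masks are assumptions made for illustration.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention, not a rule).
        return 1.0
    overlap = np.logical_and(pred, truth).sum()
    return 2.0 * overlap / total

# Toy example on a 128x128 patch (hypothetical data, not from the study).
truth = np.zeros((128, 128), dtype=bool)
truth[40:80, 40:80] = True          # 40 x 40 reference region
pred = np.zeros((128, 128), dtype=bool)
pred[50:90, 40:80] = True           # same size, shifted down by 10 rows
dice_example = dice_coefficient(pred, truth)
```

In this toy case the two regions share 30 of their 40 rows, so the coefficient is 0.75; the abstract reports the same quantity as a percentage.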
Conclusion
UniverSeg performed well for angiographic segmentation, with improved performance with greater class size, increased vessel diameter, and reduced bifurcations.
Statement of Impact
This study validates a potential method for arterial segmentation in interventional fluoroscopic procedures and facilitates development of vascular disease models and imaging research applications.
Keywords
Segmentation; Interventional; Angiography; Deep-Learning
Oral Presentations
Large Language Models – Session 2 | Scientific Abstract Presentations
025 - Benchmarking Quantization: A Comprehensive Comparison of Open-Source Large Language Models
Presenter: Blake T. Passe, Mayo Clinic - Rochester
Blake T. Passe1, Sanaz Vahdati1, Bradley J. Erickson1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
Large Language Models (LLMs) have seen a boom within the Artificial Intelligence community recently. As the parameter size of models has grown from millions to trillions, computational requirements present substantial deployment challenges. A potential approach to resolve these pressing difficulties is quantizing open-source LLMs - reducing the precision of model parameters - while aiming to preserve performance. In the current work, our aim is to compare how quantization of open-source LLMs impacts information extraction from radiology reports, latency, and computational demands.
Methods/Intervention
A total of 622 radiology reports were obtained in five categories: cervical spine fractures, glioma progression, liver metastases, pneumonia, and pulmonary embolism. Each glioma progression report was labeled “Improved”, “Progression”, “Stable”, “Pseudoprogression”, or “Pseudoresponse”, while reports in the four remaining categories were given a binary “Yes” or “No” label. Different ‘instruct’ versions of Llama3 and Phi-3 models were applied using Ollama and allotted one NVIDIA A100 80 GB GPU for inference. Prompting was conducted by a radiology artificial intelligence expert to describe the model’s task and criteria succinctly. The prompting structure consisted of four sequential steps: identity establishment, cognitive framework setup, report presentation, and contextual clarification. Finally, the JSON output, RAM usage, and latency for the extraction process were recorded for each model at each quantization level.
Results/Outcome
Findings indicate model size displays a positive correlation with both RAM and latency during inference. Comparable accuracy (>92%) was observed between the 8, 5, and 4-bit quantized versions of Llama3:70b, Llama3:8b, and Phi3:14b. This is intriguing due to the large gap between model sizes. The extreme 2-bit quantization demonstrated a prominent confabulation of answers. Response divergence (i.e. responses not within the defined structure, such as “Maybe” rather than the required “Yes” or “No”) is displayed before the performance degradation of the models. This suggests that LLMs may lose output obedience before specific performance metrics decline.
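Response divergence as described above can be measured separately from accuracy by checking whether each output stays inside the required structure. The sketch below is one possible way to do this, not the authors' code; the JSON key `answer` and the helper names are hypothetical.

```python
import json

ALLOWED_BINARY = {"Yes", "No"}  # labels required for the four binary categories

def classify_response(raw, allowed=ALLOWED_BINARY, key="answer"):
    """Return 'valid' if raw is a JSON object whose `key` holds an allowed label."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "divergent"          # not parseable as JSON at all
    if not isinstance(parsed, dict):
        return "divergent"          # valid JSON, but not an object
    value = parsed.get(key)
    if isinstance(value, str) and value in allowed:
        return "valid"
    return "divergent"              # e.g. "Maybe", a missing key, or a non-string

def divergence_rate(responses, allowed=ALLOWED_BINARY, key="answer"):
    """Fraction of responses that fall outside the defined structure."""
    if not responses:
        return 0.0
    divergent = sum(classify_response(r, allowed, key) == "divergent" for r in responses)
    return divergent / len(responses)
```

Tracking this rate per quantization level alongside accuracy would show whether structural obedience degrades first, as the abstract suggests.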
Conclusion
Our findings indicate that models can perform quite well even with substantial quantization for question-answering tasks applied to radiology reports.
Statement of Impact
The tradeoff between quantization and quality is largely unstudied in medicine, but our results indicate significant potential to reduce computational load.
Keywords
Artificial Intelligence; Large Language Model; Quantization; Radiology Report Data Extraction
026 - ConTEXTual Net 3D: Visual Grounding in PET/CT for Enhanced Interactive Reporting
Presenter: Zachary Huemann, University of Wisconsin-Madison
Zachary Huemann1, Samuel Church1, Joshua D. Warner1, Daniel Tran1, Xin Tie1, Junjie Hu1, Steve Y. Cho1, Meghan G. Lubner1, Tyler J. Bradshaw1
1University of Wisconsin-Madison, Madison, WI, USA
Introduction/Background
Visual grounding algorithms, which link text descriptions to specific image regions, have many potential applications in radiology. However, these algorithms require large training datasets of annotated image-text pairs, which currently do not exist for most imaging modalities. We developed a pipeline to extract reported descriptions of salient PET/CT findings and to automatically segment the corresponding image findings. We then applied this pipeline to generate a large, annotated dataset for training a 3D vision-language visual grounding model, enabling interactive PET/CT reports.
Methods/Intervention
Our multi-step pipeline operates on PET/CT images and corresponding radiology reports, uses a series of large language models (LLMs) to extract text descriptions of PET findings, and then automatically segments the findings in the image based on the reported slice number and maximum standardized uptake value (SUVmax). Starting with 25,000 PET/CT exams retrospectively collected from 2010 to 2023, the final training/validation/test set consisted of 11,356 sentence-label pairs from 5,094 PET/CT exams. This dataset was used to train a novel 3D vision-language model adapted from ConTEXTual Net, which uses the sentence description, encodes it through an LLM, and fuses the text encodings with a 3D segmentation nnU-Net via cross-attention. The model was then evaluated on a holdout test set of 256 cases reviewed by a board-certified radiologist.
Results/Outcome
The automatic labeling pipeline’s accuracy was 98% (251/256). ConTEXTual Net 3D achieved an F1-score of 0.78 on the holdout test set, with a sensitivity of 0.75 and a recall of 0.81. The model performed similarly on 18F-fluorodeoxyglucose (FDG) PET/CT exams (F1=0.78) and on non-FDG PET/CT exams (F1=0.76).
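The F1-score is the harmonic mean of precision and recall. As a quick arithmetic check, the sketch below treats the two values reported above (0.75 and 0.81) as a precision/recall pair; that pairing is an assumption, since the abstract labels them sensitivity and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values taken from the abstract; which one is precision is assumed here.
# F1 is symmetric in its two arguments, so the order does not change the result.
f1_reported_pair = f1_score(0.75, 0.81)
```

The result rounds to 0.78, which is consistent with the reported F1-score.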
Conclusion
The proposed labeling pipeline demonstrated high accuracy in creating large, annotated datasets of image-text pairs for PET/CT, allowing for the development of 3D visual grounding models.
Statement of Impact
Our method can be used to generate the necessary image-text training data to train a visual grounding model to segment key lesions in PET/CT. It opens the door to interactive reports that improve patient and provider comprehension and may allow for retrospective quantitative PET studies.
Keywords
Multimodal; Vision-Language Models; PET/CT; Large Language Models
027 - Does Size Really Matter? Comparing llama 3 vs 3.1
Presenter: Suyash Khubchandani, CARPL.ai, Inc.
Vasanth Venugopal1, Amit Kumar1
1CARPL.ai, Inc., Cupertino, CA, USA
Introduction/Background
Prompt engineering for radiological reports involves crafting inputs to guide AI models in generating accurate and useful outputs. This technique helps automate downstream tasks on radiological reports. However, its effectiveness is limited by context length constraints, which restrict the amount of information the model can process and integrate at once. Commercial LLMs like ChatGPT are bounded by a maximum context window of 16k tokens, with approximately 1.5 tokens used per word. For long radiological reports, such as MRI reports, few-shot learning becomes difficult when the context window is limited. In this study, we compared the open-source Llama 3.1 model, which has a context size of 128k tokens, against Llama 3.
Methods/Intervention
We collected 1000 radiology reports annotated by expert radiologists, classifying each report into findings: acute infarct, intra-axial tumor, intra-axial hemorrhage, extra-axial tumor, and extra-axial hemorrhage. These reports were evaluated using the Llama 3 and Llama 3.1 models under zero-shot, one-shot, and few-shot learning settings. For zero-shot learning, the models received no prior examples. In one-shot learning, each model received one example per finding, and for few-shot learning, multiple examples were provided. The models' performance was compared to determine the impact of different learning strategies on their accuracy in identifying radiological findings.
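The zero-shot, one-shot, and few-shot settings differ only in how many worked examples precede the report in the prompt. The sketch below illustrates this together with a rough context-budget check; the prompt wording and helper names are hypothetical, and the 1.5 tokens-per-word ratio is the approximation quoted in the abstract, not a tokenizer measurement.

```python
FINDINGS = [
    "acute infarct",
    "intra-axial tumor",
    "intra-axial hemorrhage",
    "extra-axial tumor",
    "extra-axial hemorrhage",
]
TOKENS_PER_WORD = 1.5  # rough ratio from the abstract, not an exact tokenizer count

def build_prompt(report, examples=()):
    """Build a prompt; `examples` is a sequence of (report_text, finding) pairs."""
    parts = ["Classify the radiology report into one of these findings: "
             + "; ".join(FINDINGS) + "."]
    for example_report, example_label in examples:
        parts.append(f"Report: {example_report}\nFinding: {example_label}")
    parts.append(f"Report: {report}\nFinding:")   # the report to classify comes last
    return "\n\n".join(parts)

def estimated_tokens(prompt):
    """Approximate token count from the whitespace-separated word count."""
    return int(len(prompt.split()) * TOKENS_PER_WORD)

def fits_context(prompt, context_tokens):
    """True if the estimated prompt length fits within the context window."""
    return estimated_tokens(prompt) <= context_tokens
```

Calling `build_prompt(report)` gives the zero-shot prompt; passing one example per finding gives the one-shot setting, and passing several per finding gives the few-shot setting, whose estimated length can then be compared against a 16k or 128k window.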
Results/Outcome
In our study, we observed a significant improvement in accuracy for the Llama 3.1 model. For Llama 3, accuracy improved from 66-97% with zero-shot learning to 95-98% with few-shot learning, with further prompt engineering constrained by the limited context size. In contrast, by incorporating additional examples into the prompt, Llama 3.1 achieved an accuracy of 99.5%.
Conclusion
This enhancement underscores the model's capability to learn effectively from expanded datasets, highlighting the importance of large context windows in achieving superior predictive accuracy for larger reports.
Statement of Impact
Larger context LLMs are better suited for analyzing radiological MRI reports as they can handle more detailed and comprehensive patient data, ensuring accurate diagnosis and interpretation. Their ability to process extensive contextual information enhances downstream tasks, providing more reliable and insightful outcomes compared to smaller context-size LLMs.
Keywords
Large Language Models; Natural Language Processing; Radiological Reports; Classification
028 - Large Language Models Create Useful, Accurate, Clear Summaries of Virtual Radiology Workgroup Meetings
Presenter: Benjamin Mervak, Michigan Medicine
Benjamin Mervak1, Muhammad Bhalli1, Tricia Niedbala1, Kenneth Buckwalter1
1Michigan Medicine, Ann Arbor, MI, USA
Introduction/Background
Remote meetings have dramatically increased in recent years. While convenient, these can disrupt team-based processes unless there is effective documentation of the discussion and follow-up tasks. Software solutions can generate a meeting transcript, although reviewing the entire transcript can be inefficient and tedious. A summary can be synthesized by a scribe, but this process is time-consuming, costly, susceptible to error, and may be delayed. Large Language Models (LLMs) are well suited for natural language processing and summarization tasks. This study investigates the performance of an LLM in extracting data from collaborative Radiology information technology (IT) workgroup meeting transcripts and generating a summary with a concise list of topics discussed, key points, and action items for each participant.
Methods/Intervention
After obtaining IRB exemption, this study was conducted at an academic medical institution. Over two months in 2024, virtual Radiology IT group meetings were recorded and transcribed using teleconferencing software. Transcripts were ingested by an instance of an LLM (GPT-4 Turbo, OpenAI, San Francisco, CA) privately hosted on an institutional server. The LLM was prompted to generate a meeting summary including minutes and action items. LLM-generated summaries were provided to participants on the same day as the meeting. The accuracy, clarity, and usefulness of the LLM-generated summaries were evaluated by collecting anonymous feedback from meeting participants using 5-point Likert scales.
Results/Outcome
Transcripts from 16 Radiology IT group meetings were summarized by an LLM. Feedback was received from 44 meeting participants. Most respondents (68-81%) rated the summaries as either “extremely” or “very” accurate, clear, and useful. A minority of respondents (9-12%) rated the summaries as either “not at all” or only “somewhat” accurate, clear, and useful. “Usefulness” was rated highly despite any perceived inaccuracies or unclear statements generated by the LLM.
Conclusion
LLMs can rapidly provide Radiology IT team members with meeting summaries that are accurate, clear, and useful.
Statement of Impact
We show that LLMs can help address a key problem with virtual meetings – even those of a technical nature – by quickly and inexpensively providing participants with an accurate, clear, and useful meeting summary.
Keywords
Large language models; Virtual meetings; Meeting summary; Transcription
029 - Synthesizing Diagnostic Insights from Radiology Reports: A RAG-Based LLM Method for Reducing Hallucinations and Preventing Catastrophic Forgetting
Presenter: Briana Malik, University of Pittsburgh
Briana Malik1
1University of Pittsburgh
Introduction/Background
Disparities in data quality and context availability can introduce biases in Large Language Models (LLMs), affecting their accuracy. These issues are particularly pronounced in the field of radiology, where precise interpretation and understanding of reports are critical. Incorporating accurate and contextually relevant information is essential to LLM performance, reducing hallucinations and catastrophic forgetting.
Methods/Intervention
We hypothesize that integrating Retrieval-Augmented Generation (RAG) with contextual search will significantly reduce hallucinations by grounding LLMs with accurate and contextually relevant information. We used 500 radiology reports from a chest X-ray collection, tokenized the text, and generated embeddings using an LLM tokenizer. These embeddings were stored in a LevelDB database for efficient storage and retrieval. A similarity search index was built to facilitate efficient contextual retrieval. Queries related to specific radiological conditions were processed through the RAG system, which retrieved the most relevant context. This context was then combined with the query and input into an LLM (GPT-2) to generate contextually rich responses.
Results/Outcome
The RAG-based method significantly improved the LLM’s understanding of radiology reports and rare conditions by grounding the model and reducing hallucinations. The number of words in responses decreased from 388 to 223, showing more concise outputs. Unique words increased from 40 to 109 with RAG, indicating less repetition. The repetition rate fell from 0.897 to 0.511. ROUGE-1 F1 score improved from 0.015 to 0.024, and ROUGE-1 precision increased from 0.007 to 0.013. ROUGE-1 recall remained constant at 0.500. Perplexity increased from 1.605 to 10.369, reflecting more contextually rich responses.
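The abstract does not define the repetition rate, but the reported figures are consistent with one minus the ratio of unique words to total words. The sketch below assumes that definition and applies it to the reported word counts (388 total and 40 unique without RAG; 223 total and 109 unique with RAG).

```python
def repetition_rate(total_words, unique_words):
    """Share of words that repeat an earlier word: 1 - unique / total (assumed definition)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1.0 - unique_words / total_words

# Word counts as reported in the abstract.
rate_without_rag = repetition_rate(388, 40)
rate_with_rag = repetition_rate(223, 109)
```

Under this assumed definition the two values round to 0.897 and 0.511, matching the reported repetition rates.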
Conclusion
The RAG-based approach enhances the accuracy and relevance of responses and improves understanding of rare conditions. Reduced repetition rates and improved ROUGE scores demonstrate more accurate responses. Higher perplexity with RAG indicates richer, more contextual responses compared to lower perplexity and incoherence without RAG.
Statement of Impact
This study demonstrates the potential of RAG-based LLMs to advance radiology report interpretation and rare disease diagnosis. By providing more accurate, contextually relevant answers, this approach enhances diagnostic quality and patient care. It addresses critical gaps in traditional LLM training, which is not domain specific and leaves very rare words under-trained or untrained when they do not appear in the vocabulary generated during frontier model training.
Keywords
Retrieval Augmented Generation (RAG); Large Language Models (LLM); Deep Learning; Artificial Intelligence
030 - Transforming Plain Text Radiology Reports into Structured Data Using Common Data Elements and FHIR Standards
Presenter: Michael Hood, Massachusetts General Hospital
Michael Hood1, Roshan Fahimi1, Heather Chase2, Tarik Alkasab1
1Massachusetts General Hospital, Boston, MA, USA
2Microsoft-Nuance, Redmond, WA, USA
Introduction/Background
This project aims to enable transforming plain text radiology reports into structured data using Common Data Elements (CDEs) and Fast Healthcare Interoperability Resources (FHIR) standards. This effort enhances downstream clinical applications and improves the compatibility and exchange of radiology data across healthcare systems.
Methods/Intervention
A large language model (LLM) was employed to generate preliminary CDE definitions from anonymized chest CT reports. These reports were segmented into overlapping chunks, with semantic vectors generated. For given chest CT findings, relevant chunks were retrieved using cosine-similarity search and reranked. A GPT model was then prompted to create structured data models using the report chunks as context. The model was prompted to generate models with attributes such as identification, characteristics, and associated findings. Models then underwent iterative refinement and expert radiologist review.
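The chunk-and-retrieve step described above can be sketched as follows. This is an illustrative outline only: the chunk size, overlap, and function names are hypothetical, the embedding model is not specified in the abstract (vectors are simply passed in), and the reranking step is omitted.

```python
import numpy as np

def chunk_words(words, size, overlap):
    """Split a list of words into chunks of `size` words sharing `overlap` words."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    if not words:
        return []
    last_start = max(len(words) - overlap, 1)
    return [words[i:i + size] for i in range(0, last_start, step)]

def top_k_cosine(query_vec, chunk_vecs, k):
    """Return indices and scores of the k chunks most similar to the query."""
    query = np.asarray(query_vec, dtype=float)
    chunks = np.asarray(chunk_vecs, dtype=float)
    query = query / np.linalg.norm(query)
    chunks = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    scores = chunks @ query                 # cosine similarity per chunk
    order = np.argsort(-scores)[:k]         # highest similarity first
    return order.tolist(), scores[order].tolist()
```

The retrieved chunks would then be supplied as context in the prompt that asks the model to draft a structured data model for the finding.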
Results/Outcome
This initial pilot generated refined CDE definitions for over 200 chest CT findings, demonstrating LLM capability in rapidly producing preliminary CDEs. Our toolkit successfully transformed these into fully annotated JSON files for generating CDE-labeled FHIR Observation objects. This new process for appending ontological tags is robust — analysis of 82 chest CT reports revealed a total of 1190 findings, 83.2% of which could be successfully encoded as CDE-labeled FHIR Observations.
Conclusion
The developed methodologies and tools significantly expedite the generation and application of CDE definitions, which will enable structured and standardized representation of radiology findings using a standard FHIR structure.
Statement of Impact
This project demonstrates the feasibility of transforming unstructured radiology reports into structured data by integrating tailored LLMs, a streamlined toolchain, and standardized semantics. This improves compatibility and data exchange across healthcare systems, empowering downstream clinical workflows and ultimately improving patient care.
Keywords
Common Data Elements; Structured Data; Fast Healthcare Interoperability Resources (FHIR); Large Language Models
Oral Presentations
Radiomics & New Techniques | Scientific Abstract Presentations
031 - Adversarial Domain Adaptation for Robust Glaucoma Classification
Presenter: Homa Rashidisabet, University of Illinois Chicago
Homa Rashidisabet1, R. V. Paul Chan1, Thasarat S. Vajaranant1, Darvin Yi1
1University of Illinois Chicago, Chicago, IL, USA
Introduction/Background
While deep learning models have shown promising results in automated glaucoma prediction, they lack robustness when faced with a data domain shift. In real-world medical settings, discrepancies among images from different hospitals and patient populations create a domain shift, adversely impacting model performance and generalization across diverse datasets.
Methods/Intervention
We propose a Domain Adaptation (DA) method to address the performance degradation of deep learning models under data shift in glaucoma classification. Implementing Ganin et al., 2015, our deep learning architecture incorporates an adversarial learning component to eliminate domain-specific information, enabling the model to learn invariant features across data domains. The DA model is jointly trained on glaucoma classification and domain classification tasks, minimizing differences between domains while optimizing for the main task. We validated the DA method against the state-of-the-art ResNet-50 model using fundus images from LAG (n=4854), REFUGE (n=249), and University of Illinois Chicago (UIC) dataset (n=711) as both source and target data.
Results/Outcome
Using the UIC dataset as the source, the source-only model achieves 71.7% accuracy on LAG and 82.0% accuracy on REFUGE. The DA model improves these by 12.3 and 6.0 percentage points, achieving 84.0% and 88.0% accuracy, respectively. The train-on-target models, which serve as a reference, represent the upper bound on DA performance, while the source-only model without adaptation indicates the lower bound. When swapping source and target, the source-only model achieves 72.0% and 70.5% accuracy on UIC. The DA model improves these by 12.5 and 7.0 percentage points, achieving 84.5% and 77.5% accuracy on UIC.
Conclusion
Our proposed DA method effectively addresses the performance degradation of deep learning models when there is a shift in data. By learning domain-invariant features and unlearning domain-specific information, the DA method significantly improves the performance of the state-of-the-art ResNet-50 model in glaucoma classification.
Statement of Impact
Glaucoma, affecting eighty million people worldwide, is a leading cause of blindness. Deep learning has improved glaucoma prediction using fundus images, but performance degrades with data shifts from varied sources. Addressing this degradation is crucial to ensure clinical reliability and avoid patient risks. We propose a method to maintain performance across diverse datasets by adapting to data variations, validated on three datasets.
Keywords
Glaucoma classification; Domain adaptation; Computer vision; Fundus analysis
032 - Cross-Vendor Reproducibility of Radiomics-based Machine Learning Models for Computer-aided Diagnosis
Presenter: Jatin K. Chaudhary, University of Turku
Jatin K. Chaudhary1, Ivan K. Jambor1, Hannu Aronen1, Otto Ettala1, Jani Saunavaara1, Peter Bostrom1, Jukka Heikkonen1, Rajeev Kanth1, Harri Merisaari1
1University of Turku, Turku, Finland
Introduction/Background
Prostate cancer (PCa) remains the most common cancer among men in the western world and the second leading cause of cancer-related deaths. The integration of machine learning (ML) into Magnetic Resonance Imaging (MRI) holds promise for enhancing the accuracy and efficiency of PCa diagnostics. This study aims to evaluate the reproducibility and performance of ML models using radiomic features derived from Pyradiomics and MRCradiomics across different MRI scanners.
Methods/Intervention
We utilized imaging data from 637 men with clinical suspicion of PCa, enrolled in various clinical trials. The data included axial T2-weighted images scanned using Siemens MAGNETOM Verio 3T and Philips Ingenia 3T MRI devices. Radiomic features were extracted using Pyradiomics and MRCradiomics packages, yielding a total of 2693 features. Feature selection was conducted using the Maximum Relevance Minimum Redundancy (MRMR) method, reducing the set to 14 highly predictive features. We trained and evaluated Support Vector Machine (SVM) and Random Forest models on training, validation, and test datasets, assessing their performance using Area Under the Curve (AUC) metrics.
Results/Outcome
The SVM model achieved an AUC of 0.74 on the Multi-Improd dataset using combined Pyradiomics and MRCradiomics features, but the AUC dropped to 0.35 on the Philips test set. The Random Forest model showed similar trends with AUCs of 0.73 on the Multi-Improd set and 0.60 on the Philips set. Models trained exclusively on Pyradiomics features demonstrated higher robustness, with the Random Forest achieving an AUC of 0.78 on the Philips set. In contrast, models using only MRCradiomics features had varied outcomes, highlighting the challenge of reproducibility across different scanners.
Conclusion
This study underscores the significant impact of scanner-induced variability on the performance of ML models in PCa diagnostics. While combining Pyradiomics and MRCradiomics features enhances predictive performance, rigorous cross-platform validation is crucial to ensure model reliability.
Statement of Impact
Our findings emphasize the need for standardized imaging protocols and comprehensive validation frameworks to bridge the gap between innovative AI applications and practical, patient-centric healthcare solutions. This research advances the field of PCa imaging and promotes the broader adoption of AI in medical diagnostics, ultimately aiming to improve diagnostic accuracy and patient outcomes.
Keywords
Inter-Vendor Reproducibility; Radiomics; Diagnostic tools; Model Reproducibility
033 - From AI to Eye: Training the Radiologist with Deep Learning Interpretations in Sex Differentiation
Presenter: Shahriar Faghani, Mayo Clinic - Rochester
Shahriar Faghani1, Christin A. Tiegs-Heiden1, Mana Moassefi1, Garret Powell1, Micheal Ringler1, Bradley J. Erickson1, Nicholas Rhodes1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
To use deep learning (DL) as an educational and scientific discovery tool to improve the radiologist’s ability to directly make subtle imaging findings without additional DL assistance.
Methods/Intervention
We present a DL model that can identify sex differences from frontal knee radiographs with high accuracy, then use the resultant occlusion interpretation maps (OIMs) to train human readers to improve their ability to perform this same task. Two groups, each of three human readers, were tasked with correctly classifying radiographs as male or female. Both groups were informed of the patient’s sex, while the first group was also given the OIMs for these radiographs. After two weeks, the first group was retested with a new set of 50 radiographs. This group was compared to the second group, which was trained without the OIMs.
Results/Outcome
The DL model separated sex with 0.96 accuracy. The average accuracy of the six human readers initially was 0.62 (range: 0.56-0.74). After the study, the average accuracy of the six human readers was 0.77 (range: 0.7-0.84). The improvement in accuracy of the six human readers was statistically significant (p=0.0364). The accuracy of the “heat map” group was 0.8, and that of the control group was 0.74. When pooled as a collective group, Group 1 again showed significant improvement from baseline (p=0.0058), whereas Group 2 did not (p=0.1245), though there was no statistical difference between the two groups at the end of the experiment (p=0.2380).
Conclusion
OIMs could not be shown to definitively account for the improved accuracy in our test, though the group provided those maps demonstrated a statistically significant improvement from baseline, while the group without these maps did not. Moreover, simply the high accuracy of the DL model in performing this task proved it was possible and motivated our human readers to learn to perform this task.
Statement of Impact
This initiative seeks to identify new imaging biomarkers, thereby improving the functionality of existing DL systems and enabling human-led scientific advancements beyond the reach of DL alone. Additionally, this strategy incorporates a layer of explainability to facilitate the monitoring and troubleshooting of DL models when errors occur, contributing to the development of DL systems with increased resilience to such errors.
Keywords
Deep learning; Interpretability; Classification; MSK radiology
034 - Impact of Random Prostate Volume Modifications on Automated Segmentation Model Performance
Presenter: Dominic LaBella, Duke University Medical Center
Dominic LaBella1, Michaela Kop2, Xuan Qi3, Thomas Sanford4
1Duke University Medical Center, Durham, NC, USA
2John A Burns School of Medicine, University of Hawaii, Honolulu, HI, USA
3National Institutes of Health, Bethesda, MD, USA
4University of Hawaii Cancer Center, Honolulu, HI, USA
Introduction/Background
Accurate and consistent prostate segmentation on magnetic resonance imaging (MRI) plays an important role in surgical planning, biopsy targeting, and radiotherapy planning. Interobserver variability is an inherent challenge for defining a consistent ground truth (GT) segmentation. This study investigates the effect of modifying training set GT segmentations on the performance of automated prostate segmentation models.
Methods/Intervention
Ground truth segmentations of the whole prostate from T2-weighted MRI prostate sequences were manually delineated by a board-certified urologist who is also fellowship trained in urologic oncology. GT segmentation modifications were made by either adding or subtracting a specified distance in millimeters radially from the surface of the whole prostate volume on each axial slice. Each axial slice had a one-third chance of either A) subtracting a uniform inner margin from the prostate surface, B) not modifying the slice’s segmentation, or C) adding a uniform outer margin to the prostate surface. Ten different modified prostate segmentation-image pair training datasets were created. Each training dataset had a specified amplitude of potential margin modification. Identical SegResNet models from the Auto3DSeg framework were trained over 300 epochs for each of the 10 modified training datasets and an additional unmodified GT training dataset. Validation and testing sets included unmodified GT segmentations. Dice similarity coefficients (DSC) were used to compare model performance.
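The per-slice modification scheme can be illustrated with a small sketch. It is a simplified stand-in for the procedure described above, not the authors' implementation: margins are expressed in pixels rather than millimeters, shrinking and growing use repeated one-pixel 4-connected erosion and dilation, and all function names are hypothetical.

```python
import numpy as np

def dilate(mask):
    """Grow a 2D boolean mask by one pixel (4-connected neighbourhood)."""
    p = np.pad(mask, 1, constant_values=False)
    return p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]

def erode(mask):
    """Shrink a 2D boolean mask by one pixel (4-connected neighbourhood)."""
    p = np.pad(mask, 1, constant_values=False)
    return p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]

def perturb_volume(volume, margin_px, rng):
    """Per axial slice, with equal probability: shrink, keep, or grow the mask."""
    out = volume.copy()
    for z in range(volume.shape[0]):
        action = rng.integers(3)            # 0 = shrink, 1 = keep, 2 = grow
        if action == 1:
            continue
        op = erode if action == 0 else dilate
        slice_mask = out[z]
        for _ in range(margin_px):          # apply the one-pixel step margin_px times
            slice_mask = op(slice_mask)
        out[z] = slice_mask
    return out

# Toy volume (hypothetical): six slices, each with a 10 x 10 square "prostate".
volume = np.zeros((6, 20, 20), dtype=bool)
volume[:, 5:15, 5:15] = True
perturbed = perturb_volume(volume, margin_px=2, rng=np.random.default_rng(0))
```

Training one model per margin amplitude on masks perturbed this way, while keeping validation and test masks unmodified, mirrors the experimental design at a conceptual level.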
Results/Outcome
A total of 119 T2-weighted images with whole prostate segmentations were included in the study. Mean test set DSC decreased linearly from 0.917 to 0.856 as GT variability increased from 0 to 10 millimeters.
Conclusion
The decrease in testing set DSC as training set segmentation modification amplitude increases shows the importance of consistent GT segmentations for automated segmentation model development. Future studies should assess the impact of interobserver variability across additional radiographical structures on larger datasets.
Statement of Impact
This study elucidates the critical role of consistent GT segmentations when training automated segmentation models. By demonstrating that even minor modifications to GT segmentations can degrade the performance of automated segmentation models, we highlight the importance of standardized segmentation protocols.
Keywords
Artificial Intelligence; Automated Segmentation; Interobserver Variability; Prostate
035 - Replicating and Validating Radiomics-Based Prediction of PD-L1 Expression Status in NSCLC Patients
Presenter: Anna Theresa Stüber, LMU University Hospital, LMU Munich
Anna Theresa Stüber1, Maurice Heimer1, Clemens C. Cyran1, Michael Ingrisch1
1LMU University Hospital, LMU Munich, Munich, Germany
Introduction/Background
The purpose of this study is to investigate the predictive value of radiomics in determining PD-L1 expression (positive vs. negative) status among NSCLC patients using an external [18F]FDG PET/CT dataset. Specifically, we aim to replicate and validate the radiomics-based machine learning model proposed by Zhao et al.* to address concerns related to reproducibility and replicability in radiomics research. * Zhao et al. Predicting PD-L1 expression status in patients with non-small cell lung cancer using [18F]FDG PET/CT radiomics. EJNMMI Res. 2023 Jan 22;13(1):4. doi: 10.1186/s13550-023-00956-9. PMID: 36682020; PMCID: PMC9868196.
Methods/Intervention
We analyzed a cohort of 254 NSCLC patients (86 [33.9%] with negative and 168 [66.1%] with positive PD-L1 status) who underwent [18F]FDG PET/CT imaging, utilizing two distinct image segmentation methods: solid component-based segmentation (LUT) with lung tissue window (W1500/L-600) and attenuation-corrected PET volume, and a conservative, smaller segmentation (CON) with soft-tissue window (W400/40) and corresponding PET volume. We replicated two radiomics-based models (“Rad-score” and “complex model”) provided by Zhao et al. for both segmentation sets, along with their clinical stage model. Performance evaluation was based on 10-fold cross-validation and the Area Under the Curve (AUC).
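The evaluation scheme named above (AUC averaged over cross-validation folds) can be illustrated with a short sketch. It assumes a fixed scoring model whose outputs are already available, uses a pairwise (Mann-Whitney) definition of AUC, and splits samples round-robin into folds; the function names and the toy labels are hypothetical and do not come from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_cv_auc(labels, scores, k):
    """Mean AUC over k round-robin folds; every fold must contain both classes."""
    folds = [range(i, len(labels), k) for i in range(k)]
    aucs = [auc([labels[j] for j in f], [scores[j] for j in f]) for f in folds]
    return sum(aucs) / k

# Made-up example: 20 samples with alternating PD-L1 status.
labels = [0, 1] * 10
uninformative = [0.5] * 20  # a model that cannot separate the classes
print(mean_cv_auc(labels, uninformative, k=5))
```

An uninformative score yields an AUC of 0.5, which is the reference point against which the replicated AUCs in the results can be read.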
Results/Outcome
Performance analysis of the Rad-score model revealed a mean AUC of 0.593 (95% CI: 0.573 - 0.613) for CON segmentation and 0.573 (0.544 - 0.586) for LUT segmentation, both falling below the reported mean AUC of 0.761 (0.664 - 0.860) by Zhao et al. Similarly, for the complex model, we achieved mean AUCs of 0.505 (0.485 - 0.524) and 0.519 (0.501 - 0.541), respectively, whereas Zhao et al. reported a mean AUC of 0.769 (0.675 - 0.863).
Conclusion
Our study failed to replicate the findings of the previous study. In particular, the original model achieved very poor prediction performance on our dataset, which is compatible with the original dataset. These findings underscore the challenges in replicating radiomics-based predictive models across different datasets and highlight the importance of rigorous validation in ensuring clinical utility.
Statement of Impact
These findings underscore the challenges in replicating radiomics-based predictive models across different datasets and highlight the importance of rigorous validation in ensuring clinical utility.
Keywords
Replication; Radiomics; Machine learning evaluation; PET/CT
036 - Transparent Radiomics ML Model: Combining Human and Artificial Intelligence for Prediction of Therapy Outcomes
Presenter: Shrey S. Sukhadia, Dartmouth Health
Shrey S. Sukhadia1, Crisi Patel1, Adrienne A. Workman1, Roberta M. diFlorio-Alexander1, Marthony L. Robins1
1Dartmouth Health, Lebanon, NH, USA
Introduction/Background
Predicting the response to neoadjuvant chemotherapy (NAC) in invasive breast carcinoma (IBC) is a key oncological challenge. Accurate prediction can help tailor individualized treatment, minimize chemotoxicity, and enhance therapeutic effectiveness. However, the efficacy of NAC varies significantly among patients, underscoring the importance of identifying reliable predictors of response. Advances in molecular biology, imaging, and artificial intelligence offer promising avenues for developing robust predictive models, which could revolutionize personalized treatment in breast cancer management. We identified the top nine latent radiomic features in the active tumor regions of pre-NAC MRI scans to predict the tumor’s response to NAC using a transparent Decision Tree Classifier (DTC) model.
Methods/Intervention
We collected pre-NAC MRI scans for 75 IBC patients, for which the active tumor regions of interest (ROIs) underwent voxel-based segmentation using ITK-SNAP 4.0. The ROIs were fed to a custom-built radiomic feature extraction pipeline that extracted 108 IBSI-approved radiomic features from 7 feature classes; the features were normalized using standard scaling. Tumor response to NAC was extracted from the EHR and confirmed with post-NAC imaging and pathology reports. An aggregated score was developed indicating tumor response to NAC. A DTC model was trained and tested using a 90:10 split of the sample set in IMAGENE v3.2. Three-fold cross-validation was performed on the training set to control overfitting. The model was then evaluated on the testing set.
Results/Outcome
Our DTC model predicted the response to NAC at a remarkable AUC and R-square of 1.0 (each), at a p-value < 0.002. The radiomic features contributing the most to this prediction were shape, gray-level co-occurrence matrix, gray-level dependence matrix and gray-level size zone matrix. The Decision Tree showcasing the algorithm performed by DTC aided transparency of the model.
Conclusion
We built a transparent Machine Learning model to predict NAC outcomes represented in EHR using the latent radiomic features extracted from MRIs pre-NAC. Our approach combines both human and artificial intelligence to identify novel radiomic biomarkers that predict therapy outcomes in breast cancer.
Statement of Impact
Our work offers a transparent Machine Learning model that predicts NAC outcomes using pre-NAC images in breast cancer.
Keywords
Radiomics; Neoadjuvant Chemotherapy; Therapy Response; Machine Learning
Oral Presentations
Clinical Implementation & Toolkits | Scientific Abstract Presentations
037 - An Ontology for Discoverable and Interoperable Radiology AI Models and Datasets
Presenter: Charles E. Kahn, Jr., University of Pennsylvania
Charles E. Kahn, Jr.1, Abhinav Suri2, Safwan Halabi3, Hari Trivedi4
1University of Pennsylvania, Philadelphia, PA, USA
2University of California Los Angeles, Los Angeles, CA, USA
3Lurie Children’s Hospital, Chicago, IL, USA
4Emory University, Atlanta, GA, USA
Introduction/Background
"Model cards" and “datasheets for datasets” provide valuable metadata to detail the performance, intended use, and potential limitations of AI resources. However, their format as unstructured text limits the ability to search for relevant resources and to automate their analysis. We sought to create a formal description to increase transparency and interoperability, reduce bias, and promote reproducibility of radiology AI models and datasets.
Methods/Intervention
The Radiology Model and Dataset Ontology (RMDO) was created to define attributes for AI models and datasets in radiology. RMDO references external ontologies and vocabularies, including RadLex, the LOINC/RSNA Radiology Playbook, and radiology common data elements (CDEs). RMDO incorporates RSNA content codes, PapersWithCode.com classifications of machine-learning methods and tasks, and the Metrics Reloaded listing of model-performance metrics. A JavaScript Object Notation (JSON) Schema was defined to allow serialization of RMDO-based descriptions of radiology AI models and datasets.
Results/Outcome
RMDO comprises 3,323 classes related by 4,403 logical axioms. The primary RMDO entity is a Project; its metadata describe authors, versioning, availability, licensing, and other features. Each Project consists of zero or more Models and zero or more Datasets. Model descriptions include architecture, intended uses, metrics, and ethical considerations. Dataset descriptions include imaging procedure, number of patients and images, image file format, output information, availability and licensing, partitions, annotation methods, and study cohort characteristics, such as demographics and disease prevalence. RMDO has been applied to datasets created for RSNA’s AI competitions and for several published AI models.
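The Project, Model, and Dataset structure described above lends itself to a JSON serialization. The following sketch shows what such a description might look like; every field name and value here is an illustrative placeholder and is not taken from the actual RMDO JSON Schema, and `is_valid_project` is a hypothetical helper that only checks the "zero or more Models and Datasets" shape.

```python
import json

# Hypothetical project description; keys and values are placeholders only.
project = {
    "name": "example-project",
    "version": "1.0",
    "license": "CC-BY-4.0",
    "models": [
        {
            "architecture": "CNN",
            "intended_use": "research only",
            "metrics": [{"name": "AUC", "value": 0.9}],
        }
    ],
    "datasets": [],  # a Project may have zero Datasets
}

def is_valid_project(p):
    """Check the minimal shape: a name plus lists of models and datasets."""
    return (
        isinstance(p.get("name"), str)
        and isinstance(p.get("models"), list)
        and isinstance(p.get("datasets"), list)
    )

serialized = json.dumps(project, indent=2)
restored = json.loads(serialized)
print(is_valid_project(restored))
```

A structured form of this kind is what makes search and automated analysis possible, in contrast to free-text model cards.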
Conclusion
RMDO provides a standardized vocabulary that allows more effective classification and indexing of radiology AI models and datasets to make these resources more easily findable and accessible and to allow automated analysis of their underlying content.
Statement of Impact
An ontology to describe radiology AI models and datasets can make AI resources more findable and accessible. By allowing structured descriptions, the ontology helps promote reproducibility of AI models, and can aid in identifying and mitigating potential biases.
Keywords
Ontology; Model cards; Datasheets; Metadata
038 - Beyond FDA Clearance: Automated Post Deployment Monitoring and Validation of Commercial AI Models using Local Large Language Models (LLMs)
Presenter: Theo Dapamede, Emory University
Theo Dapamede1, Bardia Khosravi2, Chad Robichaux1, Aawez Mansuri1, Mohammadreza Chavoshi3, Alex Belov1, Angela Udongwo4, Chinonyelum Igwe5, Frank Li1, Beatrice Brown-Mulry1, Hanssen Li1, John Moon1, Judy Gichoya1, Hari Trivedi1
1Emory University, Atlanta, GA, USA
2Yale University, New Haven, CT, USA
3Tehran University of Medical Sciences, Tehran, Iran
4Temple University, Philadelphia, PA, USA
5University of Ibadan, Ibadan, Nigeria
Introduction/Background
As AI models are deployed in diverse clinical settings, continuous monitoring and assessment of subgroup performance is critical. Automated techniques to compare radiologist interpretations to model performance must be developed. We used a large language model (LLM) to evaluate the performance of two clinically deployed commercial AI models for pulmonary embolism and intracranial hemorrhage detection.
Methods/Intervention
We identified 8,966 CT pulmonary embolism (PE) exams and 14,637 non-contrast CT head exams conducted between April and October 2023 that were evaluated by the AI models and extracted the corresponding radiology reports. A locally deployed instance of Llama3 8B was used to extract the PE and intracranial hemorrhage (ICH) ground truth labels from the radiology reports, using methods that were previously validated on 500 manually annotated reports (PE: Sn 1.0, Sp 1.0; ICH: Sn 0.93, Sp 1.0). AI model performance was compared to the extracted ground truth for multiple subgroups (race, age, sex, and patient location). Overall performance was also compared to the FDA-submitted and published performance.
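The comparison of AI outputs against report-derived labels reduces to a confusion-matrix calculation per subgroup. The sketch below is one way to do this, assuming each exam is a pair of booleans (AI flag, report label). The Wilson score interval is an assumption made for illustration, since the abstract does not state how its confidence intervals were computed, and all names and counts are hypothetical.

```python
from math import sqrt

def wilson(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed CI method)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)

def sens_spec(records):
    """records: list of (ai_positive, report_positive) boolean pairs."""
    tp = sum(1 for ai, ref in records if ai and ref)
    fn = sum(1 for ai, ref in records if not ai and ref)
    tn = sum(1 for ai, ref in records if not ai and not ref)
    fp = sum(1 for ai, ref in records if ai and not ref)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "sensitivity_ci": wilson(tp, tp + fn),
        "specificity": tn / (tn + fp) if tn + fp else None,
        "specificity_ci": wilson(tn, tn + fp),
    }

# Made-up subgroup data: 8 TP, 2 FN, 18 TN, 2 FP.
by_location = {
    "outpatient": [(True, True)] * 8 + [(False, True)] * 2
                  + [(False, False)] * 18 + [(True, False)] * 2,
}
results = {group: sens_spec(recs) for group, recs in by_location.items()}
```

Small subgroups produce wide intervals, which is worth keeping in mind when reading subgroup sensitivities.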
Results/Outcome
For the PE model, sensitivity was 80.3% (95% CI: 77.8%-83.0%) and specificity was 98.0% (95% CI: 97.7%-98.3%), compared to the published FDA clearance sensitivity of 93.0% (90.2%-95.1%) and specificity of 93.7% (92.7%-94.6%). For the ICH model, sensitivity was 92.2% (91.2%-93.2%) and specificity was 90.3% (89.8%-90.8%), compared to FDA clearance sensitivity of 93.6% (86.6%-97.6%) and specificity of 92.3% (85.4%-96.6%). Both models demonstrated the lowest performance for outpatients as compared to emergency and inpatients, with sensitivities of 77.5% (58.8%-85.0%) and 87.4% (76.8%-95.5%) for PE and ICH models, respectively. Both models demonstrated equitable performance across race, ethnicity, age, and sex subgroups.
Conclusion
We have shown the potential use of LLMs as an automated method for post deployment monitoring and evaluation of clinical AI models. It is notable that the lowest-performing group for both models was outpatients, where advanced detection models can potentially provide the most benefit. Further work and reader studies are required to understand model failure modes and confounders.
Statement of Impact
This study demonstrates a potential automated solution for post deployment monitoring of clinical AI models, which is necessary for ensuring safe and stable model performance after deployment.
Keywords
Post-Deployment Monitoring; AI Validation; LLM
039 - Enhancing Equitable Study Distribution Using Reinforcement Learning
Presenter: Yiting Xie, Merative
Sun Young Park1, Linda Bagley1, Christy Weatherbee1, Ferenc Kis1, Marwan Sati1, Yiting Xie1
1Merative, Cambridge, MA, USA
Introduction/Background
Imaging organizations encounter heavy workloads requiring distribution among radiologists. Uneven study type distribution, cherry-picking and other factors can cause imbalanced workload distribution, which may lead to internal tension and burnout. We introduce an artificial intelligence model to assist in achieving workload balance. Our study compares PACS workload distribution between manual and AI-automated methods.
Methods/Intervention
We present a reinforcement learning model that distributes studies with the goals of maintaining fairness, respecting preferences, meeting priority deadlines, and balancing study value. The model takes requests from a PACS system including an exam and a list of active radiologists and returns an assignment recommendation. The model state is encoded as a 2D array, comprising information such as Relative Value Unit (RVU), due time, and the radiologists' workloads. The algorithm is rewarded if its recommendations meet the above goals and learns by maximizing cumulative rewards over time. Our model has two learning phases: offline learning using realistic simulations of small, medium, and large clinical settings, and online learning, where the model adapts to study distribution and radiologists’ preferences in real-time. We performed a comparative study between AI-automated and manual assignment phases.
Results/Outcome
Five radiologists reviewed 481 studies in the manual and AI-automated phases. While the modality distribution was similar in both phases, the radiologists favored CR and CT over MR in the manual phase. Modality distribution was more balanced for all radiologists in the AI-enabled phase, and a 34% more equitable RVU distribution across modalities was observed for all radiologists. MR RVUs read increased by 40% in the AI-automated phase, correcting the bias in favor of CR and CT from the manual phase.
Conclusion
We report a two-phase reinforcement learning-based study distribution framework that provides a balanced and efficient allocation of studies. We compared manual and AI-automated methods and showed a 34% reduction in the standard deviation of the RVUs read between radiologists when using the AI model.
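The balance metric reported above, a relative reduction in the standard deviation of RVUs read per radiologist, can be computed as in the following sketch. The per-radiologist RVU totals are made up for illustration, the population standard deviation is an assumption, and `sd_reduction` is a hypothetical name.

```python
from statistics import pstdev

def sd_reduction(rvus_manual, rvus_ai):
    """Relative reduction in the spread of RVUs read per radiologist."""
    return 1 - pstdev(rvus_ai) / pstdev(rvus_manual)

# Made-up RVU totals for five radiologists in each phase.
manual_phase = [10, 20, 30, 40, 50]
ai_phase = [20, 25, 30, 35, 40]
print(f"{sd_reduction(manual_phase, ai_phase):.0%}")
```

A value of 0 would mean no change in balance, and values closer to 1 indicate a more even distribution in the AI-automated phase.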
Statement of Impact
We have demonstrated the impact of an AI-automated worklist on study distribution. We found a notable reduction in the RVU standard deviation and improved balance among modalities along with a reduction in cherry-picking.
Keywords
Reinforcement Learning; Study Distribution; Radiologist Efficiency; Artificial Intelligence
040 - Ensuring Real-Time Reliability: An Autonomous Monitoring System for Radiology AI Performance
Presenter: Suyash Khubchandani, CARPL.ai, Inc.
Vasantha K. Venugopal1, Abhishek Gupta1, Rohit Takhar1
1CARPL.ai, Inc., Cupertino, CA, USA
Introduction/Background
Integrating artificial intelligence (AI) in healthcare, especially radiology, has revolutionized diagnostics. However, maintaining AI model accuracy in real-time clinical settings is challenging, primarily due to the lack of real-time ground truth data. This study introduces an autonomous monitoring system using two novel metrics: predictive divergence and temporal stability, providing real-time insights to ensure AI model reliability.
Methods/Intervention
To overcome real-time monitoring challenges without ground truth data, we developed two key metrics: Predictive Divergence: This metric employs Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences to compare predictions between the primary AI model and two supplementary models. Lower divergence indicates higher accuracy and agreement among models. Temporal Stability: This metric assesses AI model consistency by comparing current predictions with historical moving averages. Variations in temporal stability can indicate model decay or data drift. The system was validated using chest X-ray data from a single-center clinic. Three commercial AI models for chest X-ray classification were analyzed in a longitudinal retrospective study design, using Jensen-Shannon Divergence (JSD) to compute predictive divergence and temporal stability metrics.
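The two metrics described above can be sketched with the Jensen-Shannon divergence as the common building block. This is an illustration under assumptions rather than the authors' system: each model's predictions over a time window are summarized as a class-frequency distribution, the historical baseline is a plain mean of past windows, and the function names and numbers are hypothetical.

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def predictive_divergence(primary, supports):
    """JSD between the primary model and each supplementary model."""
    return [jsd(primary, s) for s in supports]

def temporal_stability(history, current):
    """JSD between the current window and the mean of past windows."""
    n = len(history)
    baseline = [sum(col) / n for col in zip(*history)]
    return jsd(current, baseline)

# Made-up fractions of (normal, abnormal) predictions per window.
ai1_history = [[0.9, 0.1], [0.9, 0.1]]
ai1_now = [0.6, 0.4]
ai2_now = [0.85, 0.15]
print(temporal_stability(ai1_history, ai1_now))
print(predictive_divergence(ai1_now, [ai2_now]))
```

A rise in either value does not by itself say which model is wrong, only that outputs have shifted relative to a peer model or to the model's own history.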
Results/Outcome
The study analyzed 3,993 chest X-rays over several months, including the onset of the COVID-19 pandemic. Key findings include: Predictive Divergence: JSD values initially showed alignment between the main AI model (AI1) and its support models (AI2, AI3). Post-COVID, a significant increase in divergence between AI1 and AI2 indicated a need for model intervention. Temporal Stability: JSD values for AI1 indicated initial consistency, but increased significantly post-COVID, reflecting a deviation from historical performance. AI2 and AI3 also showed increased divergence post-COVID, highlighting the pandemic's impact on AI model predictions.
Conclusion
The proposed system, using predictive divergence and temporal stability, offers a robust framework for real-time AI performance evaluation in clinical settings. This ensures the safe integration of AI in healthcare, as demonstrated during the COVID-19 pandemic. The system's continuous insights can enhance AI model reliability, ultimately improving patient care.
Statement of Impact
Continuous AI model monitoring in healthcare is crucial. The proposed metrics enable real-time detection of performance issues without ground truth data, significantly enhancing AI model reliability in clinical practice. Future research will optimize this system for various clinical contexts and AI models.
Keywords
Deep learning; Post market surveillance; Mlops; AI Monitoring
041 - Imaging Management of Lumbar Spine MRI Annotation in Machine Learning Model Validation with an Emphasis on Subcohort Variation
Presenter: Veysel Kocaman, Gesund.ai
Sumir S. Patel1, Veysel Kocaman2, Enes Hosgor2
1Emory University, Atlanta, GA, USA
2Gesund.ai, Cambridge, MA, USA
Introduction/Background
Lumbar spine disorders, prevalent across various demographics, often require precise imaging analysis for accurate diagnosis and treatment planning. Magnetic Resonance Imaging (MRI) stands as the gold standard for visualizing spinal pathologies owing to its detailed soft tissue contrast. Although radiologists are predominantly trained in the clinical interpretation of lumbar spine MRIs, there exists a need for radiologists to be adept in the annotation of these exams for machine learning training, validation, and monitoring. We aim to assess radiologist annotation to determine if automatic subcohort analysis of a dataset can improve performance.
Methods/Intervention
For this study, three subspecialty trained attending radiologists each annotated 100 lumbar spine magnetic resonance imaging (MRI) examinations of patients across various clinical settings. The annotations included vertebral body height, vertebral body area, intervertebral disc area, and neuroforaminal area. Automatic analysis on a software platform (Gesund.ai) was used to identify relative underperformance of the radiologists with respect to clinical subcohorts. We analyzed multiple subcohorts and data annotation metrics to include Cohort Time Consumption, Gender and Institution Influence, and Most Time-Consuming Measurement.
Results/Outcome
Cohorts 1 and 3 have significantly high average times, both from a Missouri institution with 'M' and 'M/F' genders, suggesting case nature or protocols affect processing times. Cohorts from two clinical sites, especially male and Missouri cohorts, indicate challenging diagnostic criteria. Patterns show certain geographical cohorts, particularly males, require more time, potentially reflecting complex anatomy or pathology. L1 and L4 vertebral bodies and area measurements are consistently time-consuming, suggesting a need for better training or tools. Older age groups (60-74) have higher times than younger ones due to age-related anatomical changes.
Conclusion
This study highlights the effectiveness of a sophisticated software platform designed to accustom radiologists to the type of annotation tasks required in modern machine learning. Our approach utilizes this software platform to automate the process of identifying deficiencies in the analysis of patient subcohorts. Once these deficiencies are identified, the system can tailor the training regimen by assigning additional cases from the same subcohort to the respective radiologists.
Statement of Impact
Automated performance analysis with respect to clinical subcohorts has the potential to improve radiologist
Keywords
Machine Learning; Validation; Radiologist; Subcohort; Training; MRI
Poster Presentations
042 - Abdominal Organ Labeling for Abnormality in CT Reports Using a Large Language Model
Presenter: Ricardo B. Lanfredi, National Institutes of Health
Ricardo B. Lanfredi1, Yan Zhuang1, Luke Krembs2, Brandon Khoury2, Pritam Mukherjee1, Ronald M. Summers1
1National Institutes of Health, Bethesda, MD, USA
2Walter Reed National Military Medical Center, Bethesda, MD, USA
Introduction/Background
Medical report labelers capable of handling various abnormalities, such as CheXpert and CheXbert, have mainly targeted chest X-ray (CXR) reports. However, compared with chest X-ray reports, computed tomography (CT) reports, particularly in the abdominal region, are more complex and cover a broader range of organs and abnormalities. These challenges make abnormality labeling for abdominal organs underexplored.
Methods/Intervention
To address this challenge, we propose a large language model (LLM) labeler, MAPLEZ-CT, to annotate whether major abdominal organs are abnormal. MAPLEZ-CT is an adaptation to CT reports of the previously published MAPLEZ (Medical report Annotations with Privacy-preserving large language model using Expeditious Zero shot answers) LLM prompt system. The employed zero-shot prompt, which uses the publicly available Meta-Llama-3-70B-Instruct model, run locally to preserve privacy, is displayed in Figure 1. A key feature of the prompt was the inclusion of an extensive definition of abnormalities: any unusual findings the radiologist deems worth mentioning for a specific organ, including atypical anatomical variations, postsurgical changes, and findings in subparts of organs. This definition excludes findings indicating limited evaluation, normal organs, adjacent structures, or broad anatomical areas. Additional modifications included LLM pre-extraction of relevant sentences and chain-of-thought reasoning, whose computational cost was partially offset by the vLLM library, which reduced the processing time by around 92%.
Results/Outcome
One research fellow and two radiology residents annotated the test set for five major abdominal organs (the spleen, liver, kidneys, gallbladder, and intestines) using 100 private reports randomly sampled from the publicly available Deep Lesion dataset. The final labels were decided through majority voting. The proposed method was compared to MAPLEZ and the rule-based SARLE (Sentence Analysis for Radiology Label Extraction) labeler. It achieved a median F1 score of 0.954 [0.927, 0.975], with an improvement ranging from 0.135 to 0.403 over the median scores of the baseline models. Table 1 shows the results for each organ. We calculated 95% confidence intervals through bootstrapping.
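The evaluation steps named above (majority voting across three annotators, F1 scoring, and bootstrapped confidence intervals) can be sketched as follows. The percentile bootstrap is an assumption, since the abstract does not specify the variant used, and the votes, predictions, and function names are made up for illustration.

```python
import random

def majority(votes):
    """Majority label from an odd number of binary annotator votes."""
    return int(sum(votes) * 2 > len(votes))

def f1(truth, pred):
    """F1 score for binary labels; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(truth, pred, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for F1 (assumed variant)."""
    rng = random.Random(seed)
    n = len(truth)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1([truth[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Made-up votes for one organ across six reports (1 = abnormal).
votes = [(1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0), (1, 0, 1), (0, 1, 0)]
truth = [majority(v) for v in votes]
pred = [1, 0, 1, 1, 1, 0]  # hypothetical labeler output
score = f1(truth, pred)
low, high = bootstrap_ci(truth, pred)
```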
Conclusion
MAPLEZ-CT can reliably label abnormalities for major organs in abdominal CT, outperforming alternatives. It has the potential to create large-scale annotated CT datasets for abnormalities detection.
Statement of Impact
Zero-shot privacy-preserving LLMs can successfully label abnormal organs for CT reports.
Keywords
Large-language models; Abdominal CT; Medical reports; Abnormality labels
043 - Analysis of Out-of-Distribution Factors to Detect Iris and Pupil Using Cataract Surgical Images
Presenter: Mahtab Faraji, University of Illinois Chicago
Mahtab Faraji1, Rogerio G. Nespolo1, Homa Rashidisabet1, Daniel Wang1, Alexis Warren1, Hesham Gabr1, Yannek Leiderman1, Darvin Yi1
1University of Illinois Chicago, Chicago, IL, USA
Introduction/Background
Deep Learning (DL) has substantial potential in ophthalmology, particularly for disease classification and tasks like iris and pupil detection in cataract surgery. DL algorithms typically assume that train and test samples share the same distribution. However, in practice, test samples often differ in distribution, which is called Out-of-Distribution (OOD), affecting generalizability. The OOD can result from factors like differences in data acquisition and preprocessing methods between train and test datasets. This study investigates the impact of two possible OOD-causing factors on a YOLOv5 model's performance in detecting the iris and pupil in cataract surgery images.
Methods/Intervention
Surgical images were divided into training, validation, and test sets. The test set underwent two transformations to simulate OOD: 1) adding Gaussian noise at various levels and 2) converting images to grayscale. We trained a YOLOv5 model for iris and pupil detection using the train set and evaluated it with the validation set. The model was then tested on both the original and transformed test data. Performance was assessed using the Mean Average Precision (mAP) metric.
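The two test-set transformations described above can be sketched in a few lines. This illustration makes assumptions the abstract does not confirm: the noise level is interpreted as the Gaussian standard deviation expressed as a fraction of the 8-bit range, grayscale conversion uses the common ITU-R BT.601 luma weights, and images are represented as flat lists of RGB tuples. The function names are hypothetical.

```python
import random

def add_gaussian_noise(pixels, level, rng):
    """Add zero-mean Gaussian noise with sigma = level * 255, clipped to 0..255."""
    sigma = level * 255

    def clip(value):
        return max(0, min(255, round(value)))

    return [tuple(clip(c + rng.gauss(0, sigma)) for c in px) for px in pixels]

def to_grayscale(pixels):
    """Replace each RGB pixel with its luma, kept as three equal channels."""
    out = []
    for r, g, b in pixels:
        y = round(0.299 * r + 0.587 * g + 0.114 * b)
        out.append((y, y, y))
    return out

# Made-up 2x2 image flattened to four RGB pixels.
image = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]
rng = random.Random(0)
noisy = add_gaussian_noise(image, 0.15, rng)  # 15% noise level
gray = to_grayscale(image)
```

Keeping three equal channels after grayscale conversion lets a detector trained on RGB input accept the transformed images without changes to its input layer.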
Results/Outcome
The results have shown a progressive decline in mAP as noise levels increase from 2% to 15%. Notably, dataset_3 remains stable up to 4% noise, while dataset_4's performance drops to nearly zero at 15% noise. Similarly, we show a 3%-6% reduction in mAP across all datasets when images are transformed to grayscale.
Conclusion
Our findings demonstrate that noise and grayscale conversion significantly impact DL model performance. This underscores the necessity of considering these factors when deploying DL models in real-world scenarios.
Statement of Impact
This study examines how OOD factors, such as noise and grayscale conversion, affect DL models for iris and pupil detection in cataract surgery images.
Keywords
Deep learning; Out-of-Distribution; Cataract surgery images; YOLOv5
044 - Answer Positioning Biases in Large Language Model Responses to Medical Multiple-Choice Questions
Presenter: Kartik Gupta, Schulich School of Medicine and Dentistry
Kartik Gupta1, Jaron Chong2
1Schulich School of Medicine and Dentistry, London, ON, Canada
2London Health Sciences Centre, London, ON, Canada
Introduction/Background
Large language models (LLMs) have demonstrated high performance on standardized medical examinations across various domains, typically employing a multiple-choice question (MCQ) format. Technical literature has reported that the precise order of answer choices may affect and bias LLM performance, leading to unreliable estimates of LLM performance. The objective of this study is to evaluate the accuracy of LLMs with forced re-positioning of multiple-choice answer options, utilizing MedQA, a widely recognized medical benchmarking dataset for LLMs.
Methods/Intervention
The comparative efficacy of GPT-3.5 and GPT-4 was assessed using three randomized subsets of the MedQA dataset, each comprising 1273 questions, representing 10% of the total dataset. For each subset, four permutations were generated by forced re-positioning of the correct answer into each of four possible answer positions. The models were evaluated utilizing two prompt templates: question-only format (QO) and chain-of-thought format (COT; "Think step-by-step."). Statistical analysis involved repeated measures ANOVA followed by post-hoc comparisons using Bonferroni’s multiple comparisons test. The variability in performance across positions, termed delta, was calculated by subtracting the accuracy of the least effective position from that of the most effective position.
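The forced re-positioning and the delta statistic described above can be sketched as follows. The sketch assumes that the distractors keep their original relative order when the correct answer is moved, which the abstract does not specify; the function names, the sample options, and the accuracy values are hypothetical.

```python
def reposition(options, correct_idx, target_idx):
    """Move the correct answer to target_idx; distractors keep their order."""
    distractors = [o for i, o in enumerate(options) if i != correct_idx]
    return distractors[:target_idx] + [options[correct_idx]] + distractors[target_idx:]

def all_positions(options, correct_idx):
    """One permutation per possible position of the correct answer."""
    return [reposition(options, correct_idx, t) for t in range(len(options))]

def delta(accuracy_by_position):
    """Best-position accuracy minus worst-position accuracy."""
    values = list(accuracy_by_position.values())
    return max(values) - min(values)

# Made-up question with the correct answer originally in position C.
options = ["aspirin", "heparin", "warfarin", "insulin"]
variants = all_positions(options, correct_idx=2)

# Made-up accuracies per answer position.
print(delta({"A": 0.70, "B": 0.60, "C": 0.58, "D": 0.55}))
```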
Results/Outcome
Using basic QO prompting, GPT-4 outperformed GPT-3.5 in accuracy (69.18% vs 57.53%; p< 0.001). COT outperformed QO prompting, with GPT-4 COT achieving a maximal performance of 80.36%, versus GPT-3.5 COT of 67.08% (p< 0.001). Across model-prompt interventions without COT, position A's performance was significantly greater than other positions. This positional bias is reduced by COT. Utilizing COT reduced Delta with GPT-3.5 (16.3% to 5.5%) and GPT-4 (15.6% to 2.9%).
Conclusion
COT outperforms basic QO prompting; without COT, there is a strong LLM performance bias towards earlier answer positions. The distribution of answer choice positions in an MCQ evaluation may affect the apparent performance of an LLM.
Statement of Impact
Clinical LLM evaluation should carefully consider the effect of multiple-choice answer position, given systemic biases in performance based upon answer position. LLM evaluation should ideally incorporate randomization of answer position for evaluation.
Keywords
Large Language Models (LLMs); Medical Question Answering; Answering Bias; LLM Safety
045 - Validating GPT-4 for Automated Protocoling in Diagnostic Imaging
Presenter: Kartik Gupta, Schulich School of Medicine and Dentistry
Kartik Gupta1, Jaron Chong2
1Schulich School of Medicine and Dentistry, London, ON, Canada
2London Health Sciences Centre, London, ON, Canada
Introduction/Background
The growing volume of radiology exams, especially CT scans, necessitates more efficient workflows. Protocoling, taking up to 6% of a radiologist's time, is an opportunity for automation. Traditional machine learning methods need large datasets and are hard to adapt across institutions. Large language models (LLMs) demonstrate performance in medical question-answering and protocoling. This study evaluates the zero-shot prediction of OpenAI's GPT-4 in automating Chest CT scan protocoling, using prompts with institution-specific rules.
Methods/Intervention
A dataset of 796 labelled Chest CT Thorax imaging requests and protocols from Victoria Hospital, London, Ontario, was analyzed. Each data sample contains a requisition with a clinical indication provided by the ordering physician, and the assigned imaging protocol. There were 4 different classes of protocols: Chest CT “with contrast”, “without contrast”, “interstitial”, and “low-dose contrast”. Four prompts were tested with GPT-4: a baseline 'Control' prompt, a 'Classification Rules' (CR) prompt with specific guidelines, an 'Ablated' version with fewer guidelines, and a 'Refined' version (CR-V2) with improved rules. Performance was measured using accuracy, precision, recall, and F1 score. Statistical significance was assessed using McNemar's test, with p-values less than 0.05 considered significant.
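McNemar's test, used above to compare prompts on the same set of requisitions, depends only on the cases where the two prompts disagree in correctness. The sketch below uses the exact binomial form of the test, which is an assumption since the abstract does not state which variant was applied; the function names and the toy correctness flags are hypothetical.

```python
from math import comb

def discordant(correct_a, correct_b):
    """Count cases where only prompt A, or only prompt B, was correct."""
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Made-up per-request correctness for two prompts on the same 12 requests.
control = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
rules = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
b, c = discordant(control, rules)
p_value = mcnemar_exact(b, c)
```

Because the test is paired, both prompts must be scored on identical requests in the same order.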
Results/Outcome
The CR prompt significantly outperformed the 'Control' (accuracy: 0.88 vs 0.79, precision: 0.77 vs 0.61, recall: 0.89 vs 0.79, F1 score: 0.82 vs 0.66; P < 0.001). The 'Ablated' model showed reduced performance relative to CR yet superior performance to the 'Control' (accuracy: 0.85, P = 0.002 vs Control). The CR-V2 model achieved the highest metrics (accuracy: 0.9, precision: 0.8, recall: 0.89, F1 score: 0.84), significantly outperforming both the 'Control' (P < 0.001) and 'Classification Rules' (P = 0.014).
Conclusion
Providing specific instructions for GPT-4 can markedly improve the accuracy of protocol predictions in radiology. The use of specific prompting also improves performance of protocoling compared to no prompt (“Control”). The study demonstrates the potential of large language models in zero-shot protocol prediction for enhancing radiological workflow across institutions by adapting a set of protocoling rules.
Statement of Impact
By introducing custom prompts, institutions can tailor their automated pipelines with LLMs according to their own rules and improve protocoling accuracy. The zero-shot performance ensures that large training datasets are not required.
Keywords
Large Language Models; Protocols; Zero-shot prediction
046 - Artificial Intelligence System for Estimation of Liver Size on Pediatric Abdominal Ultrasounds
Presenter: Dana Alkhulaifat, Children's Hospital of Philadelphia
Dana Alkhulaifat1, Mario Sinti-Ycochea1, Vahid Khalkhali1, Michael Welsh1, Laith Sultan1, Susan Sotardi1
1Children's Hospital of Philadelphia, Philadelphia, PA, USA
Introduction/Background
Ultrasound (US) is a safe and efficient imaging tool for estimating liver size and detecting parenchymal abnormalities, and is essential for diagnosing and treating liver diseases in children. Unlike adults, children’s liver sizes vary with age, making accurate measurement according to age crucial for disease detection. Organ segmentation methods vary between manual and automated approaches. Artificial intelligence (AI) exhibits great potential in improving the accuracy of liver segmentation and size estimation to achieve reliable liver measurements on US images. Thus, our aim was to develop an AI model to accurately predict liver size on US images in pediatric patients.
Methods/Intervention
In this retrospective, IRB-approved study, a dataset of 55 abdominal US images containing a sagittal view of the liver was utilized. The estimated liver size was extracted from each image. The liver was then manually segmented by a radiologist with 3 years of experience using 3D Slicer. Thirty-three images were used to retrain a pretrained fully convolutional neural network (FCN50), 11 images were used for validation, and 11 images for testing. Shape features were extracted from the segmented liver area of each image using the physical pixel spacing read from the original DICOM tags. Image post-processing was deployed to remove artifacts and clean the segmented liver area. A Random Forest regressor predicted the liver spans after standard scaling of the liver shape features and truncation of the liver lengths to whole centimeters. The R2 score, with 5-fold cross-validation in a Monte Carlo iteration, was used as the performance metric.
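For readers unfamiliar with the metric, the following minimal sketch shows how an R2 score can be computed after truncating liver spans to whole centimeters. The span values are hypothetical, and the helper names are illustrative rather than taken from the study's pipeline.

```python
def truncate_to_cm(spans_cm):
    """Drop the decimal part of each span (truncation, not rounding)."""
    return [int(s) for s in spans_cm]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical measured spans (cm) and regressor outputs, for illustration only.
measured = truncate_to_cm([8.7, 10.2, 12.9, 14.1])
predicted = [9, 10, 11, 14]
print(measured, r2_score(measured, predicted))
```

An R2 of 1 means the predictions explain all variance in the measured spans, while 0 means they do no better than always predicting the mean.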
Results/Outcome
55 patients (25 male) were included in the analysis, with a mean age of 8.17 years (SD 6.83 years). The 3-phase model was able to predict the liver span with an average R2 of 0.59 and a maximum R2 of 0.74. Attached histograms show the absolute error and the percentage errors of the estimations.
Conclusion
A three-phase system consisting of a transfer deep learning model (FCN50), image post-processor, and a machine learning regressor (Random Forest) can estimate liver size from pediatric ultrasound images, reaching an average R2 of 0.59 (maximum 0.74).
Statement of Impact
This proposed system holds significant potential in detecting anomalies on liver ultrasound images.
Keywords
Deep learning; Ultrasound; Pediatric radiology
047 - Attention Variant Mechanism for Airways Segmentation
Presenter: Chetana Krishnan, University of Alabama at Birmingham
Chetana Krishnan1, Shah Hussain1, Denise Stanford1, Venkata Sthanam1, Sandeep Bodduluri1, Steve Rowe1, Harrison Kim1
1University of Alabama at Birmingham, Birmingham, AL, USA
Introduction/Background
Attention mechanisms enhance neural networks by focusing on relevant input features but often have limitations, such as addressing only single or dual aspects and struggling with diverse inputs. Our architecture overcomes these by integrating multiple attention strategies and adaptive embedding, ensuring dynamic, robust feature extraction and improved performance in small region segmentations (SRS). Integrating positional (POS), semantic (SEM), image (IM), cross-spatial (CS), and self-channel attentions (SC) with adaptive embedding will significantly enhance feature extraction and representation, improving accuracy and efficiency in SRS, such as airways in comparison to single/dual attentions.
Methods/Intervention
Non-contrast enhanced in vivo CT scans of the lungs were conducted on 25 ferrets, achieving a spatial resolution of 80 μm. The ferrets were anesthetized with inhaled isoflurane and gated to capture a single inspiratory phase using a μCT scanner (MiLabs, Utrecht, Netherlands). The ground truth airway was determined using a region-growing method. The proposed attention variant network (AVN) incorporates information from other pixels within the image by performing SEM, POS, SC, CS, and IM. AVN inputs feature maps from all locations at different scales and outputs refined feature maps. This approach captures and utilizes correlations between neighboring pixels, leading to more accurate segmentation. Multi-scale feature maps refine the attention mechanism, enabling precise adaptation to image data variations. SEM focuses on important semantic information, POS identifies where vital information is located, IM determines task-relevant regions, CS examines spatial relationships, and SC emphasizes relevant channel features. We trained AVN with 18 scans (over 500 slices per scan), and the model was tested with seven unseen scans. Performance was evaluated using the Dice similarity coefficient (DSC) and Intersection over Union score (IoU). AVN was compared with other popular deep-learning networks.
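The two evaluation metrics named above can be sketched as follows for binary masks stored as flat lists of 0/1 values. This is a generic illustration, not the authors' evaluation code, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
def dice_and_iou(pred, truth):
    """Return (Dice similarity coefficient, Intersection over Union)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_n = sum(1 for p in pred if p)
    truth_n = sum(1 for t in truth if t)
    union = pred_n + truth_n - inter
    if union == 0:
        # Both masks empty: treated here as perfect agreement (assumption).
        return 1.0, 1.0
    dice = 2 * inter / (pred_n + truth_n)
    iou = inter / union
    return dice, iou


# Toy example: two 6-voxel masks that share 2 foreground voxels.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [0, 1, 1, 1, 0, 0]
print(dice_and_iou(pred_mask, true_mask))
```

For a single pair of masks the two scores are linked by IoU = DSC / (2 - DSC), so DSC is always at least as large as IoU.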
Results/Outcome
AVN achieved a higher DSC and exhibited the highest minimum DSC, indicating superior performance.
Conclusion
AVN dynamically captures spatial and channel information to address the challenge of SRS and the limitations of 2D networks.
Statement of Impact
AVN can improve the diagnosis, treatment planning, and airway branch and volume monitoring for clinical lung diseases.
Keywords
Attention; Segmentation; Spatial; Channel
048 - Classification of Vitreomacular Adhesion Types Using Deep Learning Models on Optical Coherence Tomography Images
Presenter: A. Q. M. Sala Uddin Pathan, University of New South Wales
A. Q. M. Sala Uddin Pathan1, Brughanya Subramanian2, Salil S. Kanhere1, Matthew P. Simunovic3, Rajiv Raman2, Maitreyee Roy1
1University of New South Wales, Kensington, Australia
2Sankara Nethralaya, Chennai, India
3The University of Sydney Save Sight Institute, Sydney, Australia
Introduction/Background
In recent years, Deep Learning (DL) approaches have received considerable interest in ophthalmology due to their ability to promptly diagnose diseases and aid clinicians in decision-making. Using DL models on Optical Coherence Tomography (OCT) images for detecting and classifying Vitreomacular Adhesion (VMA) is still in the early stages. This research aims to design an automated system to classify two types of VMA, Focal VMA and Broad VMA, from Diabetic Macular Oedema (DME) patients using OCT images.
Methods/Intervention
This retrospective study analyzed 302 OCT images from 202 DME patients collected at Chennai Eye Hospital (January 2015 to June 2022), approved by the Vision Research Foundation Institutional Review Board. Two optometrists graded the images, categorizing 107 as Focal VMA and 195 as Broad VMA. Data augmentation and resampling addressed data imbalance, and the data was normalized and resized. VGG16, InceptionV3, and XceptionNet models, using transfer learning with pre-trained ImageNet weights, classified the VMA types with 80% of the data for training and 20% for validation. Grad-CAM was used to visualize the regions of interest that influenced the model's decisions. Model performance was assessed by accuracy, sensitivity, specificity, AUC, and F1-score.
Results/Outcome
All three models performed well. VGG16 achieved 84.19% accuracy with 84% sensitivity, 83% specificity, 84% AUC score, and 84% F1-score. InceptionV3 showed slightly better accuracy of 84.40% with 84% sensitivity and specificity, 84% AUC score, and 84% F1-score. The XceptionNet model outperformed all with 85% accuracy. The sensitivity, specificity, AUC score, and F1-score were 85%, 84%, 85%, and 85%, respectively.
Conclusion
DL models correctly classified Focal VMA and Broad VMA from OCT images. Transfer learning reduced the program execution time. Among all the models, XceptionNet performed slightly better. The DL models utilized in this research show the potential to automate the diagnosis of various vitreomacular interface disorders with higher accuracy and a streamlined diagnostic process.
Statement of Impact
The study demonstrates that deep learning models, particularly the XceptionNet model, can accurately and efficiently classify vitreomacular adhesion types in diabetic macular edema patients using OCT images, achieving up to 85% accuracy. This automation significantly improves diagnostic speed and accuracy, facilitating better treatment planning and clinical workflow efficiency.
Keywords
Vitreomacular adhesion; Deep learning; Focal VMA; Broad VMA
049 - Classifying Common Breast Pain Symptoms for Patients Using a Large Language Model, ChatGPT
Presenter: Hana Haver, Mass General Brigham
Hana Haver1, Manisha Bahl1, Maggie Chung2
1Mass General Brigham, Boston, MA, USA
2University of California, San Francisco, CA, USA
Introduction/Background
Breast pain is a common symptom for which diagnostic imaging evaluation is recommended based on clinical significance according to the American College of Radiology’s (ACR) Appropriateness Criteria. Imaging is not recommended for clinically insignificant breast pain, which is defined as nonfocal, diffuse, or cyclical pain. This study aims to use ChatGPT GPT-4 (March 2023 release, OpenAI) to automate the classification of common breast pain symptoms based on clinical significance.
Methods/Intervention
The authors created a library of 150 breast pain symptoms representing breast pain variants described in the ACR Appropriateness Criteria, including clinically insignificant and significant pain, and non-pain-related clinically significant symptoms (e.g., palpable lump, pathologic nipple discharge). A zero-shot prompt for the LLM was developed to characterize breast concerns as clinically insignificant or clinically significant: “Use the ACR appropriateness criteria for breast pain. Respond with only is this ‘clinically significant breast symptom’ or ‘not clinically significant symptom.’” Each breast symptom was submitted with the prompt in three independent tests in June 2024. Clinical significance was determined by the mode of the three tests and compared to the ground truth, established by radiologist consensus based on the ACR Appropriateness Criteria for breast pain.
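The aggregation step described above (taking the mode of three independent runs, then comparing with the ground truth) could be sketched as below. The labels and runs are hypothetical placeholders, and `summarize` is an illustrative helper, not part of the study.

```python
from collections import Counter


def majority_label(responses):
    """Return the most frequent label among repeated runs (the mode)."""
    return Counter(responses).most_common(1)[0][0]


def summarize(runs_per_symptom, ground_truth):
    """Return (final labels, agreement with truth, run-to-run consistency)."""
    final = [majority_label(r) for r in runs_per_symptom]
    agreement = sum(f == t for f, t in zip(final, ground_truth)) / len(ground_truth)
    # A symptom is "consistent" when all repeated runs gave the same label.
    consistency = sum(len(set(r)) == 1 for r in runs_per_symptom) / len(runs_per_symptom)
    return final, agreement, consistency


SIG, NOT = "significant", "not significant"
# Hypothetical outputs for four symptoms, three runs each.
runs = [[SIG, SIG, SIG], [SIG, NOT, SIG], [NOT, NOT, NOT], [NOT, SIG, SIG]]
truth = [SIG, SIG, NOT, NOT]
print(summarize(runs, truth))
```

With two possible labels and an odd number of runs, the mode is always unique, so no tie-breaking rule is needed.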
Results/Outcome
ChatGPT GPT-4 assigned the appropriate clinical significance, in agreement with the breast imaging radiologists, in 74.7% (112/150) of breast pain symptoms. ChatGPT GPT-4 correctly identified 89.1% (57/64) of clinically significant breast symptoms. Among instances where the model did not agree with the ground truth, the majority (81.6%; 31/38) were clinically insignificant cases that ChatGPT GPT-4 considered to be clinically significant. All 30 pain symptoms with non-pain-related clinically significant symptoms (e.g., palpable lump, pathologic nipple discharge) were correctly assessed by ChatGPT GPT-4 as clinically significant. Eighty-nine point three percent (134/150) of LLM-generated results were identical across three independent tests.
Conclusion
We demonstrate the first known potential application of an LLM to classify breast pain symptoms as clinically significant or clinically insignificant.
Statement of Impact
Automating the assessment of breast pain clinical significance before patient scheduling could influence decision-making about imaging evaluation, as only clinically significant symptoms would be indicated for imaging evaluation.
Keywords
Large language model; Breast Imaging; Clinical decision support
050 - Classifying, Fast and Slow: Adversarial Training for Bias Mitigation in Medical Imaging
Presenter: Felipe Matsuoka, Faculdade de Ciências Médicas da Santa Casa de São Paulo
Felipe Matsuoka1, Eduardo Farina2, Felipe Kitamura2
1Faculdade de Ciências Médicas da Santa Casa de São Paulo, São Paulo, Brazil
2UNIFESP, São Paulo, Brazil
Introduction/Background
Ethnicity bias in deep learning models poses significant ethical concerns. Leveraging Daniel Kahneman's dual-process theory in "Thinking, Fast and Slow," which distinguishes between rapid, intuitive System 1 and deliberate, analytical System 2 thinking, we propose an approach to mitigate bias in chest X-ray classification. Previous studies have demonstrated the effectiveness of adversarial methods in reducing bias, such as COVID-19 classification from electronic medical records (Zhang et al., 2021), making this approach both relevant and innovative in medical imaging.
Methods/Intervention
Our methodology employs two complementary models: a predictor model and an adversarial model. The predictor model, akin to System 1, efficiently classifies chest X-rays, identifying normal and abnormal cases. Simultaneously, the adversarial model, similar to System 2, challenges the predictions to reduce bias. We used the CheXpert dataset (Irvin et al., 2019), ensuring a balanced representation of ethnic groups through binary label adjustment and sampling techniques. During training, the adversarial model increases its error in predicting patient ethnicity, forcing the predictor model to focus on unbiased features.
Results/Outcome
Using One-way ANOVA, we assessed the ROCAUC performance across different ethnicities for both models. The baseline model showed no significant differences across ethnicities (p-value = 0.258). Similarly, the adversarial model also exhibited no significant differences (p-value = 0.405). These findings suggest that the adversarial model maintained consistent performance across ethnic groups without introducing additional bias, highlighting the complexity of addressing ethnicity bias in medical imaging.
Conclusion
The adversarial training framework in chest X-ray classification demonstrates an innovative approach to mitigating ethnicity bias. Despite not showing significant performance differences, this study emphasizes the importance of developing methods to address bias in medical imaging. The results suggest that achieving equitable performance across all ethnic groups is challenging, and a potential alternative could involve optimizing models for specific ethnicities.
Statement of Impact
This study introduces a novel adversarial training framework to mitigate ethnicity bias in deep learning models for medical imaging. The ANOVA results indicate that adversarial methods can maintain equitable performance across different ethnic groups. By applying principles from psychology, this research connects theoretical concepts with practical applications, advancing the development of more reliable AI systems in medical diagnostics.
Keywords
Ethnicity Bias; Deep Learning; Medical Imaging; Adversarial Training
051 - Comparative Evaluation of Computationally Efficient and Explainable 1D Brightness Profiles from Axial Projections for Lung Ultrasound Frame Classification
Presenter: Srishti Jain, Boston University
Srishti Jain1, Umair Khan2, Russell Thompson3, Lauren P. Etter4, Ingrid Camelo5, Rachel C. Pieciak6, Ilse Castro-Aragon7, Bindu Setty7, Christopher C. Gill6, Margrit Betke1
1Boston University, Boston, MA, USA
2University of Trento, Trento, Italy
3University of Massachusetts-Dartmouth, Dartmouth, MA, USA
4University of Wisconsin-Madison, Madison, WI, USA
5Augusta University, Augusta, GA, USA
6Boston University School of Public Health, Boston, MA, USA
7Boston Medical Center, Boston, MA, USA
Introduction/Background
Analyzing lung ultrasound (LUS) data to identify pneumonia in pediatric patients is crucial for providing timely and accurate patient care. Automated solutions developed for this purpose are mostly CNN-based methods that, while effective, have high computational complexity, limiting their wide application in resource-constrained environments. Their lack of transparency also leads to clinician distrust. To address these challenges, our research focuses on a computationally efficient yet explainable method to evaluate LUS data.
Methods/Intervention
Lung consolidations in ultrasound images appear as dark, wedge-shaped areas with mixed textures, characteristic patterns to be observed in pneumonia patients. We hypothesize that the compressed 1D data representation of the 2D frames can retain the characteristic features of lung consolidations. In the presence of lung consolidations, the intensity values across the axes plummet, as darker regions have lower pixel values. This drop in intensity values adds a valuable characteristic feature to the 1D BP vector that allows the MLP to discriminate between frames with/without consolidations. Classification of such representations can lead to the development of a computationally efficient and explainable automated solution. As a proof of concept, this study explores a novel LUS frame classification method using 1D Brightness Profiles (BP). These are obtained by summing pixel values along the y-axis and x-axis. Three types of BP were extracted from LUS frames: fy (sum of pixel values along the y-axis), fx (sum of pixel values along the x-axis), and fy+x (concatenation of fy and fx). These are then fed to separate Multilayer Perceptron Models (MLP) which perform the binary classification task.
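To make the three profile types concrete, here is a minimal sketch that derives fy, fx, and fy+x from a frame stored as a list of pixel rows. It assumes that "summing along the y-axis" collapses each column into one value and that "summing along the x-axis" collapses each row; the abstract does not spell out this convention, and the function name is hypothetical.

```python
def brightness_profiles(frame):
    """Return (f_y, f_x, f_yx) for a 2D frame given as a list of rows."""
    # Sum down each column (along y): one value per x position.
    f_y = [sum(column) for column in zip(*frame)]
    # Sum across each row (along x): one value per y position.
    f_x = [sum(row) for row in frame]
    # Concatenate both profiles into a single 1D feature vector.
    return f_y, f_x, f_y + f_x


# Toy 3x4 frame with a dark region (low values) in the lower middle.
frame = [
    [10, 10, 10, 10],
    [10, 0, 0, 10],
    [10, 0, 0, 10],
]
f_y, f_x, f_yx = brightness_profiles(frame)
print(f_y, f_x, f_yx)
```

In this toy frame the dark block shows up as a dip in both profiles, which is the kind of intensity drop the classifier is expected to pick up. A frame of height H and width W is reduced from H×W values to H+W values.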
Results/Outcome
Our findings reveal that fy+x projections give better classification metrics than fy and fx projections. We now have a robust perspective on how different projection axes capture information.
Conclusion
The Brightness Profiles, derived from pediatric pneumonia patients, are tested for their reliability in capturing frame patterns. The study demonstrates that the 1D arrays offer a compressed representation of LUS frames, outperforming existing CNN-based methods in terms of explainability yet offering reliable classification metrics.
Statement of Impact
Brightness Profiles are a simple yet powerful data representation that is computationally efficient and contributes to AI explainability.
Keywords
Brightness Profiles; 1D Projections; Multilayer Perceptron; Computational Efficiency
052 - Comparing Classic to State-of-the-Art Image Features: A Clustering Approach Using Local Binary Patterns and ResNet-18 Features for Lung Ultrasound Video Classification
Presenter: Saunak Bhattacharjee, Boston University
Saunak Bhattacharjee1, Umair Khan2, Russell Thompson3, Lauren P. Etter4, Ingrid Camelo5, Rachel C. Pieciak6, Ilse Castro-Aragon7, Bindu Setty7, Christopher C. Gill6, Margrit Betke1
1Boston University, Boston, MA, USA
2University of Trento, Trento, Italy
3University of Massachusetts-Dartmouth, Dartmouth, MA, USA
4University of Wisconsin-Madison, Madison, WI, USA
5Augusta University, Augusta, GA, USA
6Boston University School of Public Health, Boston, MA, USA
7Boston Medical Center, Boston, MA, USA
Introduction/Background
Lung ultrasound (LUS) is a valuable non-invasive tool for diagnosing respiratory diseases, and the use of AI to support LUS interpretation has been proposed. Automatically interpreting LUS data is complex and requires advanced techniques to detect abnormalities like lung consolidations, especially with limited labeled datasets for training AI models. This study explores using Local Binary Pattern (LBP) features and features computed by a ResNet-18 model in an unsupervised learning context to classify LUS video frames in an efficient way.
Methods/Intervention
The study used 178 LUS videos from 200 patients. LBP and ResNet-18 features were extracted from each video frame to capture texture information for distinguishing abnormal from normal lung patterns. Both feature sets underwent unsupervised clustering using a k-means clustering approach, with k=2, to identify natural data groupings. The effectiveness of the resulting clusters was assessed by calculating the precision in isolating frames that contained lung consolidations, which was determined by comparing the clusters against clinical data on a frame-by-frame basis.
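As background on the texture descriptor, the sketch below computes the basic 3x3 Local Binary Pattern code for each interior pixel and a normalized 256-bin histogram that could serve as a per-frame feature vector for clustering. The abstract does not specify which LBP variant, radius, or neighbor ordering was used, so these choices are assumptions.

```python
# Clockwise neighbor offsets (row, col), starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(image):
    """Return the 8-bit LBP code of every interior pixel of a 2D image."""
    height, width = len(image), len(image[0])
    codes = []
    for r in range(1, height - 1):
        row_codes = []
        for c in range(1, width - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(OFFSETS):
                # Set the bit when the neighbor is at least as bright as the center.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes (a texture feature vector)."""
    hist = [0] * 256
    for row in lbp_codes(image):
        for code in row:
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy 3x3 patch: a single interior pixel, so one LBP code.
patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(lbp_codes(patch))
```

Each code needs only eight comparisons per pixel, which is why LBP features are inexpensive next to a forward pass through a convolutional network.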
Results/Outcome
The analysis showed that the clustering approach based on LBP features achieved a mean overall precision of 83.72%, and based on ResNet-18 features, 88.73% precision. Visual analysis of the resultant clusters revealed that consolidation frames sometimes appeared to form separate distinct clusters of their own, while in other cases, they were interspersed within either one of the two primary clusters. ResNet-18 outperformed LBP features, but the simplicity and efficiency of computing LBP features make them a practical alternative to ResNet-18 features, particularly in resource-limited settings.
Conclusion
Both ResNet-18 and LBP features showed promise as inputs to an unsupervised clustering method for identifying lung consolidations in LUS video frames. Future work will refine these methods to better handle the variability and complexity of medical imaging data by using representative samples from clusters instead of the entire dataset, reducing computational demands and potentially improving generalization by focusing on key data points.
Statement of Impact
This research highlights the potential of traditional and modern techniques in enhancing LUS diagnostics. By addressing current limitations, the study contributes valuable insights into the development of efficient, generalizable AI-based diagnostic tools for respiratory diseases.
Keywords
Lung Ultrasound; Local Binary Pattern (LBP); ResNet-18; Unsupervised Learning
053 - Enhanced Sperm Image Segmentation Using MCFA Unet: Integrating Multi-Channel Feature Extraction and Attention Mechanisms
Presenter: Qiufeng Yi, University of Birmingham
Qiufeng Yi1, Chenyang Wang1, Xiazhen Xu1, Jiaqi Ye1, Amir Hajiyavand1
1University of Birmingham, Birmingham, UK
Introduction/Background
The segmentation of sperm images is critical in the field of assisted reproductive technology (ART). Infertility affects millions globally, and male reproductive health plays a significant role in many of these cases. Sperm quality analysis has therefore become essential for evaluating male fertility. Accurate sperm segmentation is crucial as it enables automated sperm counting, morphological analysis, and motion tracking, significantly enhancing diagnostic accuracy and efficiency.
Methods/Intervention
In this study, we introduced an improved U-Net model, the Multi-Channel Feature Extraction U-Net (MCFA Unet), aimed at enhancing the precision and reliability of sperm segmentation. The U-Net architecture, well-known for its effectiveness in biomedical image segmentation, was adapted with several key enhancements: (1) multi-channel feature extraction, which allows the model to capture a wider range of sperm characteristics, improving segmentation accuracy; (2) advanced data augmentation, which increases the variety of training images so that the model becomes more robust to different sperm image variations; and (3) an improved loss function, which combines Dice loss and cross-entropy loss to ensure more precise segmentation. We trained and tested the MCFA Unet on subset B of the Sperm Video Image Analysis (SVIA) dataset, which provided a comprehensive set of annotated sperm images for robust evaluation.
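A combined Dice and cross-entropy loss of the kind mentioned above can be sketched for binary segmentation as follows. The equal weighting (`dice_weight=0.5`), the soft Dice formulation, and the smoothing constant are assumptions, since the abstract does not report how the two terms were combined.

```python
from math import log


def dice_ce_loss(probs, targets, dice_weight=0.5, eps=1e-7):
    """Weighted sum of soft Dice loss and binary cross-entropy.

    probs: predicted foreground probabilities in [0, 1], flattened.
    targets: ground-truth labels (0 or 1), flattened.
    """
    n = len(probs)
    # Binary cross-entropy, with probabilities clamped to avoid log(0).
    ce = -sum(
        t * log(max(p, eps)) + (1 - t) * log(max(1 - p, eps))
        for p, t in zip(probs, targets)
    ) / n
    # Soft Dice uses probabilities directly instead of thresholded masks.
    inter = sum(p * t for p, t in zip(probs, targets))
    dice = (2 * inter + eps) / (sum(probs) + sum(targets) + eps)
    return dice_weight * (1 - dice) + (1 - dice_weight) * ce


labels = [1, 0, 1]
print(dice_ce_loss([0.9, 0.1, 0.9], labels))
print(dice_ce_loss([0.6, 0.4, 0.6], labels))
```

The cross-entropy term penalizes every pixel independently, while the Dice term scores overlap on the foreground as a whole, which helps when the objects occupy a small fraction of the image.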
Results/Outcome
Our experiments on subset B of the SVIA dataset showed that the MCFA Unet significantly outperformed traditional models in sperm image segmentation. The key performance metrics were a Dice coefficient of 91.27, indicating a high overlap between the predicted segmentation and the ground truth, and a Jaccard coefficient of 84.14, which measures the similarity between the segmented results and the actual sperm cells. These results demonstrate the high precision and reliability of the MCFA Unet model, attributed to its enhanced feature extraction capabilities.
Conclusion
The MCFA Unet model offers a significant improvement in the precision and reliability of sperm image segmentation compared to traditional methods. This enhancement has substantial implications for the automation and accuracy of sperm quality analysis in ART, reducing the dependency on manual analysis and making the process faster and less error-prone.
Statement of Impact
By improving sperm segmentation accuracy, the MCFA Unet model contributes to better diagnostic tools in assessing male reproductive health.
Keywords
Segmentation; Sperm Analysis; Attention Mechanisms; Deformable Convolutions
054 - Enhancing Radiology Report Comprehension: A Study on GPT-4's Identification of Key Radiological Terms
Presenter: Jad Alsheikh, Creighton University School of Medicine
Jad Alsheikh1, Ali Memon1, Daniel Spalinski1, Kimberly Mendez2, Sherif Zineldine1, Dorina Pinkhasova1, Michael Fei1
1Creighton University School of Medicine, Omaha, NE, USA
2Baylor College of Medicine, Houston, TX, USA
Introduction/Background
This study evaluates GPT-4's accuracy in identifying common radiological terms from chest X-ray (CXR) reports, comparing it to terms derived from actual radiology reports. The objective is to assess GPT-4's potential in creating a database for highlighting and defining medical terms to aid patient comprehension.
Methods/Intervention
This was a retrospective analysis of CXR reports. Two lists of the top 40 most common radiological CXR findings and phrases were generated. The first list was derived from 3,999 reports from the Open-i service of the NLM, covering a wide array of pathologies. The second list was generated by GPT-4, identifying what it believed to be the 40 most common findings and phrases. We compared GPT-4’s performance against terms derived from the sample reports by analyzing the overlap between the two lists and assessing the frequency of GPT-4 terms in actual reports, considering exact and similar matches. Additionally, we evaluated how well GPT-4 can account for variations in radiologist terminology by examining the coefficient of variation (CV) for the term frequencies.
Results/Outcome
GPT-4 demonstrated the ability to identify frequently used terms such as "effusion" and "pneumothorax" with high accuracy. Terms with high exact match proportions included "pleural effusion" (100%), "pulmonary edema" (100%), and "pneumothorax" (73.7%). The precision, recall, and F1 score were all 0.30, indicating moderate overlap between the terms identified by GPT-4 and those derived from the CXR reports. The Spearman's rank correlation was 0.32, suggesting a weak correlation between the ranks of term frequencies in GPT-4’s list and the actual reports. The Chi-Square Test (chi2=840.00, p=7.53e-04, dof=714) indicated that the differences between the observed frequencies of terms in actual reports and those identified by GPT-4 were statistically significant.
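The overlap metrics can be illustrated with a short sketch that treats each list as a set of terms. It assumes exact string matching and two lists of 40 distinct terms; under those assumptions, precision, recall, and F1 all equal 0.30 when 12 terms are shared. The term names below are placeholders, not the study's lists.

```python
def overlap_metrics(predicted_terms, reference_terms):
    """Return (precision, recall, F1) for the overlap of two term lists."""
    predicted, reference = set(predicted_terms), set(reference_terms)
    shared = predicted & reference
    precision = len(shared) / len(predicted)
    recall = len(shared) / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder lists: 40 reference terms, 40 model terms, 12 in common.
reference = [f"ref_term_{i}" for i in range(40)]
predicted = reference[:12] + [f"model_term_{i}" for i in range(28)]
print(overlap_metrics(predicted, reference))
```

When both lists have the same length, precision and recall share the same numerator and denominator, which is why all three metrics coincide.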
Conclusion
GPT-4 demonstrated reasonable accuracy in identifying common radiological terms in CXR reports and can effectively account for variations in terminology. While it successfully identified frequently used terms, its performance varied for less common terms.
Statement of Impact
This study underscores the potential of GPT-4 in enhancing patient understanding of radiological reports by providing a reliable database of terms. Incorporating AI tools like GPT-4 could improve patient communication and engagement in radiology, ultimately contributing to better healthcare outcomes.
Keywords
GPT-4; Chest X-ray Reports; Natural Language Processing; Term Identification
055 - Evaluating Performance and Environmental Impact: A Comparative Study of Large Language Models
Presenter: Sanaz Vahdati, Mayo Clinic - Rochester
Sanaz Vahdati1, Bardia Khosravi1, Bradley J. Erickson1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
Leveraging artificial intelligence (AI) for diagnostic purposes has shown promise in enhancing patient care while also raising concerns about environmental sustainability. Large language models (LLMs) are increasingly impacting the medical field by automating complex tasks and facilitating a deeper understanding of vast datasets, thus revolutionizing the approach to patient care and medical research. This study focuses on the extraction of acute cervical spine fractures from radiology reports using open-source LLMs juxtaposed with an analysis of their associated carbon emissions.
Methods/Intervention
We randomly acquired radiology reports from 1000 non-contrast cervical spine CT scans conducted between January and February 2022. After prompt optimization on 110 reports, the remaining 890 served as a test dataset to assess the models' performance. Each model aimed to indicate the presence or absence of an acute cervical vertebral fracture. We calculated the carbon emissions generated by running two models, Zephyr Alpha 7B and Llama 3 70B, using a carbon tracker package to assess the environmental impact of the LLMs' operations.
Results/Outcome
The Zephyr 7B model achieved an accuracy of 94% for extracting acute cervical spine fractures, and Llama 3 70B obtained an accuracy of 92% for this task. The sensitivity and specificity were 0.97 and 0.94 for Zephyr 7B and 0.97 and 0.91 for Llama 3 70B, respectively. The carbon emission analysis estimated that inference with the Zephyr model used 0.42 kWh of electricity, contributing 0.145 kg of CO2eq, while inference with the Llama 3 model used 1.17 kWh of electricity, contributing 0.42 kg of CO2eq.
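The reported energy and emission figures can be related through simple arithmetic, sketched below. This does not reproduce the carbon tracker package used in the study; it only derives the carbon intensity (kg CO2eq per kWh) implied by the reported numbers, and the function name is hypothetical.

```python
def implied_carbon_intensity(energy_kwh, emissions_kg):
    """Carbon intensity (kg CO2eq per kWh) implied by energy and emissions."""
    return emissions_kg / energy_kwh


# Figures reported in the abstract for the smaller and larger model.
small_energy, small_emissions = 0.42, 0.145
large_energy, large_emissions = 1.17, 0.42

print(round(implied_carbon_intensity(small_energy, small_emissions), 3))
print(round(implied_carbon_intensity(large_energy, large_emissions), 3))
print(round(large_energy / small_energy, 2))  # energy ratio, larger vs smaller
```

Both runs imply a similar grid intensity of roughly 0.35 kg CO2eq per kWh, so the difference in emissions follows mainly from the larger model's energy use, about 2.8 times that of the smaller model.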
Conclusion
In this study, we compared the LLMs’ performance and their environmental impact. We demonstrate the potential of achieving high performance using a smaller model size, which can lead to a more environmentally sustainable application of LLMs. We highlight the trade-offs between efficiency and environmental impact on the deployment of AI in medical settings.
Statement of Impact
Our findings advocate for a balanced approach to adopting AI technologies, considering their medical benefits and ecological footprints. Future work should explore optimization techniques to reduce the energy consumption of AI systems without compromising their performance, thereby aligning AI advancements with sustainable healthcare practices.
Keywords
Artificial Intelligence; Large Language models; Sustainability
056 - Evaluating TotalSegmentator for Muscle and Fat Segmentation in Patients with Ascites
Presenter: Tiffany Wei, National Institutes of Health
Tiffany Wei1, Benjamin Hou1, Tejas S. Mathai1, Jianfei Liu1, Ronald M. Summers1, Zhiyong Lu1
1National Institutes of Health, Bethesda, MD, USA
Introduction/Background
TotalSegmentator (TS) is a public CT segmentation tool that can segment 117 anatomical structures. The dataset it was trained on was randomly sampled and contained patients with certain abnormality types (e.g., inflammation, trauma, bleeding). However, it is not known if patients with fluid retention, such as ascites, were excluded. Excess fluid in the peritoneal cavity (often seen in patients with liver fibrosis) can have visual similarities to visceral fat and can pose a challenge for TS, thereby impacting its segmentation performance. This study determines if TS over-segments muscle and fat (subcutaneous and visceral) into regions of ascites.
Methods/Intervention
285 CT scans of 140 female patients from the public TCGA-OV-AS dataset were used. This dataset contained only ascites labels that were manually annotated on a voxel-level (all slices, all volumes) and no other organ labels. TS segmented the muscle and fat (subcutaneous and visceral) regions in these volumes, and its outputs were compared against the manual ascites labels to determine any over-segmentations. In this context, a lower Dice score is desirable as it signifies less overlap between ascites and the structure segmented by TS.
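The over-segmentation check can be sketched as the volume of voxels labelled both as a TS structure and as ascites, converted to millilitres from the voxel spacing. Masks are flat lists of 0/1 values here, the spacing values are hypothetical, and the 50 mL default is an illustrative parameter rather than a value taken from the methods.

```python
def overlap_volume_ml(structure_mask, ascites_mask, spacing_mm):
    """Volume (mL) of voxels labelled as both the structure and ascites."""
    if len(structure_mask) != len(ascites_mask):
        raise ValueError("masks must have the same number of voxels")
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    overlap_voxels = sum(1 for s, a in zip(structure_mask, ascites_mask) if s and a)
    # 1 mL equals 1000 cubic millimetres.
    return overlap_voxels * voxel_mm3 / 1000.0


def is_significant(volume_ml, threshold_ml=50.0):
    """Flag a scan whose overlap volume exceeds the chosen threshold."""
    return volume_ml > threshold_ml


# Toy masks: 4 of 6 voxels overlap; hypothetical spacing of 1 x 1 x 5 mm.
muscle = [1, 1, 1, 1, 0, 0]
ascites = [1, 1, 1, 1, 1, 0]
volume = overlap_volume_ml(muscle, ascites, (1.0, 1.0, 5.0))
print(volume, is_significant(volume))
```

Reporting the overlap as a volume complements the Dice score, because a very small Dice value can still correspond to a clinically relevant number of millilitres when the ascites region is large.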
Results/Outcome
TS most often over-segmented muscle, which had the highest mean Dice score (0.00965 ± 0.0157), followed by visceral fat (0.00691 ± 0.0152), then subcutaneous fat. Significant over-segmentation of muscle into ascites (44.45 ± 97.11 mL) was seen in 20 out of 285 scans (scans which exceeded 50 mL).
Conclusion
TS is generally capable of accurately segmenting visceral fat, subcutaneous fat, and muscle in patients with ascites. However, it must be used with caution as significant over-segmentations can affect body composition measurements.
Statement of Impact
For population-based studies and opportunistic screening, body composition measurements (e.g., muscle and fat volume/attenuation) can be correlated with underlying disease conditions (e.g., Diabetes). They play a critical role in early interventions and patient management.
Keywords
Ascites; CT; Segmentation; Muscle
057 - Utility of Fully Automated Liver and Spleen Biomarkers for Staging Hepatic Fibrosis in CT
Presenter: Tejas S. Mathai, National Institutes of Health
Sydney V. Lewis1, Tejas S. Mathai1, Meghan G. Lubner2, Perry J. Pickhardt2, Ronald M. Summers1
1National Institutes of Health, Bethesda, MD, USA
2University of Wisconsin-Madison, Madison, WI, USA
Introduction/Background
Liver fibrosis can be caused by metabolic disorders (e.g., obesity, Diabetes), alcoholism, or Hepatitis B/C virus. While earlier fibrosis stages are reversible, later stages (advanced fibrosis and cirrhosis) are irreversible. Notably, cirrhosis is the 12th leading cause of death in the US. Biopsies are the gold standard for staging, but they are invasive and prone to sampling error. Consequently, there is a need for non-invasive CT-based biomarkers to distinguish early fibrosis (F0 – F2) from later stages (F3 – F4).
Methods/Intervention
372 patients underwent CT imaging at Institution-A with fibrosis stage (METAVIR) confirmed through biopsy. An automated deep learning-based model segmented the full liver, 8 Couinaud liver segments, and spleen. Using Couinaud segments 2 and 3, another fully automated technique computed the liver surface nodularity (LSN) score (defined as the smoothness of liver surface). Additionally, CT-based biomarkers, such as volume and attenuation, were calculated for the full liver, 8 segments, and spleen. Liver Segmental Volume Ratio (LSVR) was calculated as the sum of the volumes of segments 1 – 3 divided by that of segments 4 – 8. The dataset was divided into 80% training (n = 297) and 20% testing (n = 75) sets. Univariate and multivariate logistic regression models were trained to stage fibrosis using the biomarkers and LSN. An AUC below 0.6 was considered clinically ineffective.
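The LSVR definition above reduces to a simple ratio of summed segment volumes. A minimal sketch, assuming the segment volumes are already available as a dictionary keyed by Couinaud segment number; the function name and the example volumes are hypothetical, not study data.

```python
def liver_segmental_volume_ratio(segment_volumes):
    """LSVR = (volume of segments 1-3) / (volume of segments 4-8).

    segment_volumes: dict mapping Couinaud segment number (1..8) to volume,
    in any consistent unit (the ratio is unitless).
    """
    numerator = sum(segment_volumes[s] for s in (1, 2, 3))
    denominator = sum(segment_volumes[s] for s in (4, 5, 6, 7, 8))
    if denominator == 0:
        raise ValueError("segments 4-8 have zero total volume")
    return numerator / denominator


# Hypothetical volumes in mL for illustration only.
example_volumes = {1: 30, 2: 120, 3: 150, 4: 200, 5: 250, 6: 250, 7: 300, 8: 200}
print(liver_segmental_volume_ratio(example_volumes))
```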
Results/Outcome
The best univariate models used spleen volume (Cirrhosis AUC = 0.829, Advanced Fibrosis AUC = 0.805) and LSN (Cirrhosis AUC = 0.766, Advanced Fibrosis AUC = 0.695). The best multivariate model for predicting cirrhosis included LSVR, spleen volume, and segmental volume proportions (AUC = 0.927). For advanced fibrosis, the best multivariate model included LSVR, spleen volume, and automated LSN (AUC = 0.839).
Conclusion
The best multivariate models for staging liver fibrosis included LSVR, spleen volume, segmental volume proportions, and automated LSN. The addition of automated LSN score had the greatest impact for prediction of advanced fibrosis.
Statement of Impact
For population-based studies and opportunistic screening, non-invasive CT-based biomarkers may be clinically useful in differentiating advanced fibrosis and cirrhosis from earlier stages. They play a critical role in early interventions to reverse fibrosis and improve patient care.
Keywords
CT; Liver Fibrosis; Cirrhosis; Liver Segmental Volume Ratio
058 - Generating Structured Radiology Reports of Chest Radiographs using Retrieval Augmented Generation
Presenter: Yash S. Saboo, University of Texas at Austin
Yash S. Saboo1, Aaron Fanous2, Kal L. Clark3
1University of Texas at Austin, Austin, TX, USA
2Stanford School of Medicine, Palo Alto, CA, USA
3University of Texas Health San Antonio, San Antonio, TX, USA
Introduction/Background
Radiologist workload has increased over the past decade, increasing burnout and the risk of diagnostic inaccuracies. Artificial intelligence (AI) algorithms have been developed for tasks such as disease detection, image segmentation, and impression-generation. However, much work remains in using AI to generate comprehensive radiology reports. The purpose of this study is to develop a retrieval-augmented generative AI model that accurately generates radiology reports of chest radiographs (CXRs).
Methods/Intervention
We trained a DenseNet-121 model on 13,964 CXRs from the VinDr-CXR dataset to classify the CXRs into seven classes: aortic enlargement, cardiomegaly, interstitial lung disease, lung opacity, pleural effusion, pneumothorax, and no finding. We then used the trained DenseNet-121 model as an encoder to generate embeddings for 159,970 CXRs from the Medical Information Mart for Intensive Care Chest X-ray JPG (MIMIC-CXR-JPG) dataset. The embeddings were stored in a vector database. We generated reports for a separate test set of 59 CXRs from MIMIC-CXR-JPG using similarity search, where we compared the vector embedding of each of the 59 test CXRs with the embeddings in the vector database. The most similar embedding in the vector database for each of the 59 CXRs was identified using cosine similarity, and the most-similar embedding’s associated report was retrieved and restructured into six distinct sections (cardiomediastinum, pleural space, lungs, bones, hardware, other) using Generative Pretrained Transformer (GPT) 4. These restructured reports were recommended as the reports for the 59 CXRs, respectively.
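The retrieval step can be illustrated with a small cosine-similarity search. This is a sketch under stated assumptions: embeddings are plain lists of floats, the "vector database" is an in-memory list, and the report identifiers and three-dimensional vectors are hypothetical stand-ins for the encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


def retrieve_most_similar(query_embedding, database):
    """Return the report id whose stored embedding is closest to the query.

    database: list of (report_id, embedding) pairs.
    """
    best_id, _ = max(
        database, key=lambda item: cosine_similarity(query_embedding, item[1])
    )
    return best_id


# Hypothetical stored embeddings and one test-image embedding.
vector_db = [
    ("report_a", [1.0, 0.0, 0.0]),
    ("report_b", [0.6, 0.8, 0.0]),
    ("report_c", [0.0, 0.0, 1.0]),
]
test_embedding = [0.5, 0.9, 0.1]
print(retrieve_most_similar(test_embedding, vector_db))
```

The retrieved identifier would then be used to look up the associated report text before it is restructured by the language model.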
Results/Outcome
On the test set of 59 CXRs, the model achieved a median BLEU score of 0.0701, median BERT score of 0.216, median CheXbert score of 0.227, and median RadCliQ score of 1.713. Additionally, a board-certified radiologist assigned a RADPEER Score, ranging from 1 to 3, to each section (cardiomediastinum, pleural space, lungs, bones, hardware, other) of the 59 AI-generated reports. Averaging across all sections, the model achieved a RADPEER Score of 1.548.
Conclusion
This retrieval-augmented generative AI model has the potential to assist radiologists with generating structured radiology reports.
Statement of Impact
This approach of generating structured radiology reports on CXRs may increase workflow efficiency and reduce radiologist burnout.
Keywords
Generative AI; Retrieval Augmented Generation; Natural Language Processing; Large Language Models
059 - Implementation of U-Net Deep Learning Model in SPECT Myocardial Perfusion Image Segmentation
Presenter: Ahmad Alenezi, Kuwait University
Ahmad Alenezi1, Ali Mayya2, Mahdi Alajmi3, Hamad Alhamad1
1Kuwait University, Kuwait City, Kuwait
2Tishreen University, Latakia, Syria
3Ministry of Health Kuwait, Kuwait City, Kuwait
Introduction/Background
Myocardial perfusion imaging (MPI) is a type of single photon emission computed tomography (SPECT) imaging that is performed to evaluate patients with suspected or documented coronary artery disease (CAD), whose detection and diagnosis require accurate and precise image processing (2). Processing and segmentation should be done accurately to provide an accurate diagnosis. Many problems may arise from segmentation issues, leading to difficulties in diagnosis (5). Machine learning (ML) algorithms have been developed with superior performance to overcome segmentation problems (7). To solve segmentation problems and provide accurate segmentation, this study used a deep learning (DL) algorithm called U-Net for image segmentation in MPI.
Methods/Intervention
One thousand one hundred patients who had an MPI study were collected from the PACS system at Al Jahra Hospital between 2015 and 2024. To train the U-Net model, 100 studies were segmented by different nuclear medicine (NM) experts to provide ground truth (i.e., gold-standard coordinates). To assess the performance of the model, multiple performance metrics (i.e., accuracy, precision, intersection over union (IOU), recall, and F1 score) were computed after breaking down the main dataset into a training set (n = 100 images) and validation subsets (n = 900 images).
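The metrics listed above can all be derived from pixel-wise confusion counts. The sketch below assumes binary masks flattened to 0/1 sequences; the function name and toy masks are hypothetical. For binary masks the F1 score is numerically identical to the Dice coefficient.

```python
def segmentation_metrics(predicted, truth):
    """Pixel-wise accuracy, precision, recall, IoU, and F1 for binary masks."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("masks must be non-empty and the same length")
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    tn = sum(1 for p, t in zip(predicted, truth) if not p and not t)

    def safe_div(num, den):
        # Assumed convention: undefined ratios are reported as 0.0.
        return num / den if den else 0.0

    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Toy masks: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
predicted_mask = [1, 1, 1, 0, 0, 0, 0, 0]
expert_mask = [1, 1, 0, 1, 0, 0, 0, 0]
print(segmentation_metrics(predicted_mask, expert_mask))
```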
Results/Outcome
A dataset of 4560 images and 4560 masks was obtained, and both a holdout split and k-fold cross-validation (k = 5) were utilized. Both cross-entropy and Dice score were also utilized. The best case corresponded to the holdout split with a cross-entropy loss function, with a test accuracy of 98.9%, a test IOU of 89.5%, and a test Dice coefficient of 94%. The k-fold scenario was more balanced between true positive rate and false positive rate. The results of U-Net segmentation were not significantly different from those produced by an expert nuclear medicine technologist (p = 0.1).
Conclusion
The results show that the U-Net model provides a solution for segmentation problems, allowing better diagnosis and subsequent accurate reporting.
Statement of Impact
This research demonstrates that the U-Net deep learning algorithm significantly enhances MPI segmentation accuracy, aligning closely with expert evaluations and promising improved diagnostic precision for CAD.
Keywords
Artificial intelligence; Deep Learning; SPECT; Myocardial Perfusion
060 - Improved Osteoporosis Prediction in Breast Cancer Patients Using a Novel Semi-Foundational Deep Learning Model
Presenter: Katherine Q. Tibbets, USF Morsani College of Medicine
John D. Mayfield1, Katherine Q. Tibbets2, Aziz Rehman2, Millena Levin2, Dayna Goltz2, Neelesh Prakash2
1Massachusetts General Hospital, Boston, MA, USA
2USF Morsani College of Medicine, Tampa, FL, USA
Introduction/Background
Small cohorts of certain disease states are common especially in medical imaging. Despite the growing culture of data sharing, information safety often precludes open sharing of these datasets for creating generalizable machine learning models. To overcome this barrier and maintain proper health information protection, foundational models are rapidly evolving to provide deep learning solutions that have been pretrained on the native feature spaces of the data. Although this has been optimized in Large Language Models (LLMs), there is still a sparsity of foundational models for computer vision tasks.
Methods/Intervention
It is in this space that we provide an investigation into pretraining a Visual Geometry Group (VGG)-16 on an unrelated dataset of 8,500 chest CTs which was subsequently fine-tuned to classify bone mineral density (BMD) in 200 breast cancer patients using the L1 vertebra on CT.
Results/Outcome
This semi-foundational model showed significantly improved ternary classification into mild, moderate, and severe demineralization in comparison to ground truth Hounsfield Unit (HU) measurements in trabecular bone. For the 20% holdout testing set, the AUC was 0.92 (p-value < 0.05, ANOVA versus no pretraining versus ImageNet transfer learning) and the F1-score was 0.84 (p-value < 0.05).
Conclusion
In this study, the use of a semi-foundational model trained on the native feature space of CT provided improved classification in a completely disparate disease state with different window levels.
Statement of Impact
Future implementation with these models may provide better generalization despite smaller numbers of a disease state to be classified.
Keywords
Foundational Models; Machine Learning; Artificial Intelligence; Osteoporosis
061 - Machine Learning Clustering of Qualitatively Assessed Lung Computed Tomography Scans to Distinguish Nonhuman Primate Models of Respiratory Virus Infection: A Pilot Study
Presenter: Edmond Adib, National Institutes of Health
Edmond Adib1, Shiva Singh1, Marcelo Castro1, Mark Rustad1, Winston Chu1, Maryam Homayounieh1, Gabriella Worwa1, Daniel Chertow1, Reed Johnson1, Michael Holbrook1, Yu Cong1, Ian Crozier2, Ashkan Malayeri1, Jeffrey Solomon2
1National Institutes of Health, Bethesda, MD, USA
2Leidos Biomedical Research, Frederick, MD, USA
Introduction/Background
We explore unsupervised machine learning (ML) to analyze expert-generated qualitative assessments of lung computed tomography (CT) scans to differentiate nonhuman primate (NHP) models of experimental respiratory virus infections.
Methods/Intervention
We utilized CT scans from four distinct experiments evaluating NHP infection models after cowpox virus (CPV), influenza A virus (IAV; with and without superimposed methicillin-resistant staphylococcal [MRSA] exposure), Nipah virus (NiV), and SARS-CoV-2 exposures. N=19 subjects with imaging abnormality across multiple time points were selected. While CT protocols were controlled, NHP species, age, weight, and dose/route of inoculation varied across experimental groups. Using a standardized evaluation questionnaire, a radiology specialist qualitatively graded each CT lung-lobe. Features were one-hot encoded, and Uniform Manifold Approximation and Projection (UMAP) was applied for dimensionality reduction followed by k-means clustering.
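Two pieces of this pipeline, one-hot encoding of categorical grades and the silhouette score used to compare cluster counts, can be sketched in plain Python. The UMAP projection and k-means steps are deliberately omitted here (they would normally come from dedicated libraries), and the questionnaire items, grade values, and cluster labels below are hypothetical.

```python
import math


def one_hot(rows, categories):
    """Encode each categorical row as a 0/1 vector.

    rows: list of lists of category values, one value per feature.
    categories: list of allowed values for each feature, in a fixed order.
    """
    encoded = []
    for row in rows:
        vector = []
        for value, allowed in zip(row, categories):
            vector.extend(1.0 if value == c else 0.0 for c in allowed)
        encoded.append(vector)
    return encoded


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != own
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical lobe-level grades: (opacity grade, consolidation).
grade_options = [["none", "mild", "severe"], ["absent", "present"]]
lobe_grades = [["none", "absent"], ["mild", "absent"],
               ["severe", "present"], ["severe", "present"]]
cluster_labels = [0, 0, 1, 1]  # stand-in for k-means output
features = one_hot(lobe_grades, grade_options)
print(silhouette_score(features, cluster_labels))
```

Repeating the silhouette calculation for several candidate cluster counts and keeping the highest score mirrors how a cluster count is typically selected.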
Results/Outcome
A UMAP plot shows CT scan qualitative features from the different NHP models grouping into clusters, revealing key insights:
• Number of clusters: A six-cluster analysis generated the highest silhouette score (SS = 0.575). Two clusters express peak vs. non-peak disease, and another contains a single individual (subject 12) that had a pre-existing lung abnormality.
• Exposure clustering: CT abnormalities from the IAV +/- MRSA and SARS-CoV-2 virus models cluster together, suggesting similar qualitative lung features across these models. CT abnormalities after CPV exposure consistently clustered separately, suggesting distinct qualitative features in this model.
• Longitudinal variability: After a specific viral exposure (e.g., IAV + MRSA), peak CT abnormality (cluster 0 = blue) clustered distinctly from non-peak abnormality (cluster 3 = red), consistent with the expected time-series analysis.
Conclusion
Lung lesion phenotypes likely vary across viral infections, routes of inoculation, dose and other factors. Using a radiologist's lobe-based qualitative assessment, ML methods can effectively distinguish differences. Future efforts with more subjects will explore fully automated methods (e.g. lung segmentation, radiomic feature extraction) as input to machine learning-based classification.
Statement of Impact
Differentiating qualitative CT lung abnormality across NHP models of viral infections provides initial proof-of-principle that motivates future ML approaches based on user-independent radiomic feature analysis.
Keywords
SARS-CoV-2; Nipah virus; Cowpox virus; Influenza virus
062 - Natural Language Processing for Automated Correlation of Radiology and Pathology Reports in Prostate Cancer Detection
Presenter: Anthony T. Wu, University of California, Irvine
Anthony T. Wu1, Gavin Shu1, Ryan O'Connell1, Peter Chang1, Robert Edwards1, Sungmee Park1, Roozbeh Houshyar1
1University of California, Irvine, Irvine, CA, USA
Introduction/Background
Prostate cancer (PCa) is the second most common cancer among men and has the second highest mortality rate. With an aging U.S. population, the demand for accurate PCa detection is outpacing the diagnostic radiology workforce, particularly in remote areas. Though data-driven self-improvement in radiologist PCa detection can help alleviate workforce constraints, automated pathology correlation to prostate MRI reports (RRs) is currently lacking, limiting this approach’s feasibility. Herein we propose a novel natural language processing (NLP) algorithm for automated radiology-pathology report correlation using a 12-core biopsy template (12cBT) in real time.
Methods/Intervention
Radiology reports (RRs) and their corresponding pathology reports (PRs) from UCI Health (October 2013-October 2023) were retrieved from a HIPAA-compliant data warehouse, totaling 1162 pairs across 1093 patients. A random 10% subset was labeled by medical students and verified by physicians. RRs and PRs were analyzed separately. Regular expressions extracted lesion locations and PI-RADS scores from RRs, while core biopsy regions and Gleason scores were extracted from PRs. A custom spell-check addressed domain-specific errors. Lesions were mapped to 12cBTs, and NLP mapping performance was evaluated on the test set.
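The kind of pattern matching described above can be sketched with Python's `re` module. The patterns, function names, and sample report sentences below are hypothetical and far simpler than what real report text would require; they are not the authors' expressions. The thresholds follow the abstract's definitions (PI-RADS ≥3, Gleason ≥3+3), with Gleason ≥3+3 interpreted here as a sum of at least 6.

```python
import re

# Hypothetical patterns for illustration; real reports vary far more in wording.
PIRADS_PATTERN = re.compile(
    r"PI-?RADS(?:\s+(?:score|category))?\s*[:=]?\s*([1-5])", re.IGNORECASE
)
GLEASON_PATTERN = re.compile(
    r"Gleason(?:\s+(?:score|grade))?\s*[:=]?\s*([1-5])\s*\+\s*([1-5])",
    re.IGNORECASE,
)


def extract_pirads(report_text):
    """Return every PI-RADS score found in a radiology report, in order."""
    return [int(score) for score in PIRADS_PATTERN.findall(report_text)]


def extract_gleason(report_text):
    """Return every (primary, secondary) Gleason pattern pair found."""
    return [(int(a), int(b)) for a, b in GLEASON_PATTERN.findall(report_text)]


def is_significant_pirads(score):
    return score >= 3


def is_significant_gleason(primary, secondary):
    # Assumed reading of ">= 3+3": Gleason sum of at least 6.
    return primary + secondary >= 6


radiology_text = "Lesion 1: left apex, PI-RADS 4. Lesion 2: right base, PI-RADS score: 2."
pathology_text = "A. Left apex: adenocarcinoma, Gleason score 3+4=7. B. Right base: benign."
print(extract_pirads(radiology_text))
print(extract_gleason(pathology_text))
```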
Results/Outcome
The NLP achieved 97.4% accuracy in detecting significant PI-RADS (≥3) in RRs and 100% accuracy in detecting significant Gleason scores (≥3+3) in PRs. Mapping of 12cBT regions for RRs and PRs yielded 89.6% and 89.4% overall accuracy, respectively. PI-RADS v2 demonstrated 60.8% accuracy in detecting PCa, with 12cBT regional sensitivities shown.
Conclusion
Our NLP system effectively mapped radiology reports (RRs) to pathology reports (PRs) in near-real time. To our knowledge, this is the first instance of an automated radiology-pathology correlation for PCa. In addition, we found that while radiologists faced challenges in pinpointing exact PCa locations, they were able to detect the general region of PCa with relatively high sensitivity.
Statement of Impact
Our tool provides radiologists with a means for self-improvement by delivering feedback as soon as pathology reports are available if integrated with hospital electronic health record systems. Additionally, we assessed the performance of PI-RADS v2 at UCI Health in detecting PCa across over 1000 reports and patients.
Keywords
Natural Language Processing; Prostate Cancer Detection; Radiology Reports; 12-core Pathology Reports
063 - Preliminary evaluation of the state-of-the-art large language models in processing reports from the American Association of Physicists in Medicine
Presenter: Hossein Jafarzadeh, McGill University
Hossein Jafarzadeh1, Jonathan Kalinowski1, Farhood Farahnak123, Shirin A. Enger123
1McGill University, Montreal, Quebec, Canada
2Lady Davis Research Institute, Montreal, Quebec, Canada
3Jewish General Hospital, Montreal, Quebec, Canada
Introduction/Background
Reports from the American Association of Physicists in Medicine (AAPM) contain consensus guidelines and tabulated reference data essential for daily clinical tasks in radiology and radiotherapy. A chatbot capable of accurately answering questions regarding the reports would facilitate compliance with the AAPM guidelines. Retrieval-Augmented Generation (RAG) allows large language models (LLMs) to find the answer to questions from large amounts of text due to their context comprehension, attention mechanisms, and reasoning abilities. We evaluated Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o on answering technical questions from two AAPM reports using human evaluation.
Methods/Intervention
Out of 259 AAPM reports, reports number 233 (TG233) and 084S were chosen for system evaluation, totaling 90 PDF pages. The PDFs were converted to text, and tables and images were extracted using the available APIs for each system. For each report, 5 technical questions were designed, and the models were asked to find the correct answers within the text. Finally, two graduate students with accredited training in medical radiation physics evaluated the models' responses and scored them from 1 to 5 based on accuracy and conciseness.
Results/Outcome
Gemini and ChatGPT scored 3.7 ± 1.4 and 2.7 ± 1.3 out of 5, respectively, in the human evaluation, showing Gemini's superiority. Gemini's responses averaged 144 ± 72 words, shorter and more concise than ChatGPT's 233 ± 87 words, though both were much longer than the ground truth (40 ± 14 words). Human evaluators noted that both models' answers were often verbose and inaccurate when questions required an understanding of relevant physics.
Conclusion
In conclusion, this experiment demonstrates LLMs' capability to understand AAPM reports and answer related questions. Future work will include integrating a search module to retrieve relevant reports for queries. Additionally, training task-specific LLMs, such as those fine-tuned on medical physics textbooks, is essential. A robust evaluation framework is also necessary to accurately assess these systems.
Statement of Impact
While assessing the capability of commercial LLMs in processing reference documents specific to the medical physics domain, this work signifies the need for a more standardized method of evaluating model performance on technical reference documents.
Keywords
Artificial intelligence; Large Language Models; Medical Physics; Computed Tomography
064 - Prompt-Induced Bias in Vision Language Models: Implications for Pneumonia Detection in Pediatric Chest Radiographs
Presenter: David Li, London Health Sciences Center
David Li1, Jaron Chong1
1London Health Sciences Center, London, ON, Canada
Introduction/Background
Vision language models (VLMs) have the potential to revolutionize medical imaging. However, the effects of text prompts on visual tasks are not well understood. This study investigates how variations in text prompts influence the diagnostic accuracy of GPT-4 Turbo in detecting pneumonia in pediatric chest radiographs. We hypothesize that subtle differences in prompt context can lead to biased predictions in visual diagnostic tasks.
Methods/Intervention
This retrospective study utilized publicly available data and was exempt from institutional review board approval. 5856 pediatric chest radiographs were obtained from the Guangzhou Women and Children’s Medical Center. A test set of 200 radiographs, including 100 pneumonia cases and 100 normal cases, was randomly selected from patients aged 1 to 5 years. The latest version of GPT-4 Turbo with Vision was used to classify each radiograph as either pneumonia or normal, employing four prompt variations: neutral, query positive, clinically symptomatic, and leading answer. VLM performance was evaluated using sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Statistical analysis was performed using McNemar’s tests, with a significance threshold of p < 0.05.
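The evaluation described above pairs per-prompt sensitivity and specificity with McNemar's test on paired classifications. The sketch below uses the exact (binomial) form of McNemar's test, which is an assumption since the abstract does not state which variant was used; it also assumes the test is applied to paired correctness. Labels and predictions are toy values, with 1 = pneumonia and 0 = normal.

```python
from math import comb


def sensitivity_specificity(predictions, labels):
    """Return (sensitivity, specificity) for binary predictions."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def mcnemar_exact(pred_a, pred_b, labels):
    """Two-sided exact McNemar p-value on paired correctness of two prompts."""
    a_only = sum(1 for a, b, y in zip(pred_a, pred_b, labels) if a == y and b != y)
    b_only = sum(1 for a, b, y in zip(pred_a, pred_b, labels) if a != y and b == y)
    n = a_only + b_only  # discordant pairs only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    p_value = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p_value)


# Toy data: 5 pneumonia cases followed by 5 normal cases.
truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
neutral_prompt = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
leading_prompt = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(sensitivity_specificity(neutral_prompt, truth))
print(sensitivity_specificity(leading_prompt, truth))
print(mcnemar_exact(neutral_prompt, leading_prompt, truth))
```

In the toy data the leading prompt raises sensitivity at the cost of specificity, which is the pattern a biased prompt would be expected to produce.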
Results/Outcome
The AUROC for the four prompts ranged from 0.35 to 0.53. Subgroup analysis showed that sensitivity increased progressively with greater prompt bias, ranging from 0.18 for neutral prompts to 0.99 for leading answer prompts. Significant differences were observed in pairwise comparisons between the neutral and clinically symptomatic prompts (p = 0.026) and between the neutral and leading answer prompts (p < 0.001).
Conclusion
This study highlights that prompt-induced bias significantly impacts GPT-4 Turbo’s performance in detecting pneumonia in pediatric chest radiographs. Moreover, VLM performance was lower compared to previously published benchmarks for convolutional neural networks in chest radiograph interpretation. Further research is needed to identify and address prompt-induced bias to ensure reliable clinical deployment.
Statement of Impact
Currently, VLMs without specialized medical fine-tuning demonstrate limited accuracy in interpreting chest radiographs. Prompt-induced bias significantly affects diagnostic performance in visual tasks. To enhance the clinical effectiveness of VLMs, it is crucial to conduct rigorous validation studies using neutral prompts to minimize bias and avoid overestimating results.
Keywords
Vision language models; Large language models; Multimodality; Generative pre-trained transformers
065 - Radiology AI Leaderboard: An Evaluation Platform for Large Language and Vision Language Models
Presenter: David Li, London Health Sciences Center
David Li1, Jaron Chong1
1London Health Sciences Center, London, ON, Canada
Introduction/Background
Rapid advancements in large language models (LLMs) and vision language models (VLMs) hold great promise for transforming radiology. However, assessing and comparing the performance of these models in radiology remains challenging due to the lack of standardized, transparent benchmarks. To address this gap, we created a comprehensive platform designed to evaluate and compare LLM and VLM performance in radiology tasks.
Methods/Intervention
The platform features an evaluation and voting framework with domain-specific criteria to ensure accurate performance assessment. It supports both public and proprietary datasets, including multimodal datasets. Visualization tools enable radiologists to easily compare model performance across various metrics, datasets, and tasks over time. Researchers and vendors are encouraged to submit their models for evaluation.
Results/Outcome
The platform has demonstrated both feasibility and effectiveness in evaluating LLMs and VLMs for radiology tasks. Models are assessed across a range of radiology-specific tasks and datasets, with performance transparently reported and ranked. While initial results were based on academic research, we have also evaluated 19 models with board-certified radiologists. The latest proprietary models, such as GPT-4o and Claude 3.5 Sonnet, as well as open-source models like LLaMA 3.1 405B, have been benchmarked. Preliminary results indicate that model performance on radiology-specific tasks differs substantially from general-purpose benchmarks, highlighting the need for radiology-specific benchmarks.
Conclusion
The Radiology AI Leaderboard represents a major advancement in standardizing the evaluation of LLMs and VLMs within radiology. It addresses a critical gap by introducing specialized benchmarks tailored to radiology, setting new standards for transparency and collaboration. The platform not only improves the accuracy of performance evaluations but also establishes a robust foundation for the safe and effective integration of AI into clinical practice.
Statement of Impact
This platform advances the evaluation of LLMs and VLMs in radiology by providing standardized and transparent benchmarking. By ensuring rigorous and equitable assessments, it facilitates the integration of generative AI into clinical practice.
Keywords
Large language models; Vision language models; Model validation; Benchmark
066 - Regression in GPT-4 Turbo’s Diagnostic Accuracy for Generating Radiology Differential Diagnoses
Presenter: David Li, London Health Sciences Center
David Li1, Kartik Gupta1, Mousumi Bhaduri1, Paul Sathiadoss1, Sahir Bhatnagar2, Jaron Chong1
1London Health Sciences Center, London, ON, Canada
2McGill University, Montreal, Quebec, Canada
Introduction/Background
Large language models (LLMs) have demonstrated impressive capabilities across a variety of domains; however, their effectiveness in clinical tasks, such as generating differential diagnoses, remains underexplored. This study evaluates the diagnostic accuracy of GPT-4 Turbo, an advanced generative pre-trained transformer (GPT), in analyzing Radiology Diagnosis Please cases. These cases encompass a broad range of pathologies, reflecting the complexities of diagnostic radiology. We hypothesize that GPT-4 Turbo will outperform its predecessors in generating accurate differential diagnoses.
Methods/Intervention
This study was exempt from institutional review board review due to the use of publicly available data. We retrospectively compiled a test set of 287 Radiology Diagnosis Please cases from August 1998 to July 2023, excluding cases with information leaks. Patient histories, imaging findings, and ground truth diagnoses were extracted. The latest version of GPT-4 Turbo (April 2024 release) was evaluated. Diagnostic accuracy was assessed by generating the top five differential diagnoses based on text inputs of history, imaging findings, and their combination. A panel of three radiologists, averaging 13 years of experience, evaluated blinded differentials and resolved discrepancies through mediated discussion.
Results/Outcome
GPT-4 Turbo’s diagnostic accuracy based on the history, imaging findings, and both combined was 43/287 (15%), 119/287 (41%), and 132/287 (46%), respectively. Accuracy varied across subspecialties, ranging from 0/26 (0%) in genitourinary cases to 4/6 (67%) in obstetrics cases. Qualitative observations of diagnostic regression included lower rankings of correct diagnoses and the omission of eponyms and previously accurate diagnoses.
Conclusion
This clinical validation study identifies an unexpected regression in the diagnostic accuracy of GPT-4 Turbo compared to previously published benchmarks for GPT-4 and GPT-3.5. These results highlight the need for additional fine-tuning to enhance GPT-4 Turbo’s performance and ensure its effectiveness before clinical deployment.
Statement of Impact
This clinical validation study underscores the importance of exercising caution when integrating LLMs into diagnostic workflows. The regression in GPT-4 Turbo’s performance suggests that foundational models require additional fine-tuning with medical datasets. Rigorous validation of LLMs is crucial to establish their effectiveness and reliability before widespread clinical adoption. With continuous improvements, LLMs have the potential to become valuable decision support tools for radiologists.
Keywords
Large language model; Generative pre-trained transformer; GPT-4 Turbo; Clinical validation
067 - Reduction of Molecular Breast Imaging Scan Time by Half with a Denoising Diffusion Probabilistic Model-based Algorithm
Presenter: Fred Nugen, Mayo Clinic - Rochester
Fred Nugen1, Bardia Khosravi1, Lacey Gray1, Katie N. Hunt1, Bradley J. Erickson1, Carrie Hruska1
1Mayo Clinic - Rochester, Rochester, MN, USA
Introduction/Background
Molecular Breast Imaging (MBI) uses a dedicated gamma camera to image functional uptake of a radiopharmaceutical (Tc-99m sestamibi) in breast cancer. Following an MBI patient satisfaction study (Hruska, JNMT 2024), reducing scan time and/or radiation dose while maintaining image quality is desirable. We designed an automated denoising tool and evaluated its application to “reduced count” MBI in maintaining lesion contrast and diagnostic accuracy.
Methods/Intervention
MBI scans comprise four ten-minute images: two projections for each breast. We generated “reduced-count” MBI scans by using only data from the first five minutes of each image. We trained a denoising diffusion probabilistic machine learning model (DDPM) (Khosravi, CMPB 2023) to “denoise” reduced-count images without reducing clarity of features such as suspicious lesions. We used a random sample of patients undergoing MBI at Mayo Clinic from January to April, 2021. The model was trained repeatedly on the training set (343 patients, 4962 images) and validation set (114 patients, 1614 images) until it had seen 40 million images. For model evaluation, we selected 81 additional MBI exams (15 negative, 66 positive). Quantitative evaluation was performed through region of interest analysis on both reduced-count denoised and ground truth images. We calculated contrast-to-noise ratios as CNR = (mean of lesion - mean of breast tissue) / (standard deviation of breast tissue). A retrospective reader study of this dataset is underway; breast radiologists are presented the reduced-count denoised and ground truth exams in a random order while blinded to image status and provide an assessment of cancer likelihood (ACR BI-RADS 1-5 scale) and image quality (1-5 scale).
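The CNR formula above translates directly into code. This sketch assumes each region of interest is supplied as a flat list of pixel intensities and uses the population standard deviation, which is an assumption since the abstract does not say whether the sample or population form was used. All intensity values are hypothetical.

```python
from statistics import mean, pstdev


def contrast_to_noise_ratio(lesion_pixels, breast_pixels):
    """CNR = (mean lesion - mean breast tissue) / std of breast tissue."""
    noise = pstdev(breast_pixels)  # assumption: population standard deviation
    if noise == 0:
        raise ValueError("background region has zero variance")
    return (mean(lesion_pixels) - mean(breast_pixels)) / noise


# Hypothetical ROI intensities: same means, but a smoother background
# in the second case, as denoising would be expected to produce.
lesion_roi = [30, 32, 34]
noisy_background = [10, 12, 8, 10]
denoised_background = [10, 11, 9, 10]
print(contrast_to_noise_ratio(lesion_roi, noisy_background))
print(contrast_to_noise_ratio(lesion_roi, denoised_background))
```

With the lesion and background means held fixed, a lower background standard deviation raises the CNR.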
Results/Outcome
In a random sample of 18 images containing a lesion, CNR of the DDPM-denoised image was equal to or higher than CNR of the ground truth, due to reducing standard deviation of intensities in breast tissue. Pending results from the reader study will be presented.
Conclusion
DDPM-denoised images acquired in half the acquisition time of ground truth data provided similar image quality without producing artifacts. DDPM-denoised images maintained or improved lesion contrast and had less noise (i.e., reduced standard deviation of breast tissue intensities).
Statement of Impact
DDPM-denoised MBI exams could improve patient satisfaction and/or reduce radiation dose.
Keywords
Molecular Breast Imaging; Denoising; Generative AI; Diffusion models
068 - Survival Prediction in Colorectal Liver Metastases using Radiomics
Presenter: Akhil Ambekar, Brown University
Akhil Ambekar1, Jon Steingrimsson1
1Brown University, Providence, RI, USA
Introduction/Background
Colorectal Liver Metastases (CRLM) signal an advanced stage of colorectal cancer. Accurate survival estimation can guide treatment decisions. We evaluate how accurate radiomics (a non-invasive method that generates numerical data from medical images) is for survival estimation in patients with CRLM and compare it with using traditional clinical information. We also examine the predictive value of radiomics features extracted from images at different quantization levels. Quantization is a process of limiting the number of distinct pixel values in an image to streamline the analysis by reducing noise and computational demands.
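Quantization as described above can be illustrated with a fixed-bin-count scheme. This is a generic sketch under the assumption of equal-width bins spanning the observed intensity range; it does not reproduce the discretization performed inside any particular radiomics library, and the function name and pixel values are hypothetical.

```python
def quantize(pixels, levels):
    """Map intensities to integer bins 0..levels-1 using equal-width bins."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    lowest, highest = min(pixels), max(pixels)
    if highest == lowest:
        return [0] * len(pixels)  # a flat image collapses to a single bin
    bin_width = (highest - lowest) / levels
    # The maximum intensity lands exactly on the upper edge, so clamp it
    # into the last bin.
    return [min(int((p - lowest) / bin_width), levels - 1) for p in pixels]


# Eight hypothetical intensities reduced to four distinct levels.
intensities = [0, 10, 20, 30, 40, 50, 60, 70]
print(quantize(intensities, levels=4))
```

Fewer levels merge more neighboring intensities into the same bin, which suppresses noise but can also discard texture detail.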
Methods/Intervention
This study uses the Colorectal-Liver-Metastases dataset, which includes preoperative CT DICOM scans with liver segmentations and overall survival data for 197 patients post-CRLM resection. Using 'pyradiomics', 474 radiomic features were extracted at four different quantization levels (8, 32, 128, and 255). Feature extraction was performed on individual CT slices, with summary statistics computed for each patient based on slice-level data. A Random Survival Forest (RSF) was trained using the training feature set (80%) and evaluated using the censored data concordance index on the test dataset (20%). Feature selection was done using permutation importance analysis, followed by a comparison of the outcomes using concordance indexes. We also compared the predictive accuracy of using only clinical data acquired at the time of acquisition with that of radiomic analysis.
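The concordance index for censored data can be sketched as follows. This is a plain implementation of Harrell's C written for illustration, not the authors' evaluation code; the follow-up times, event flags, and risk scores are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored survival data.

    A pair (i, j) is comparable when subject i has the shorter follow-up
    time and an observed event. The pair is concordant when subject i
    also has the higher predicted risk; tied risks count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical patients: follow-up time, event observed (False = censored), risk.
times = [5, 10, 15, 20]
events = [True, False, True, True]
risk = [0.9, 0.4, 0.6, 0.2]

c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of comparable pairs.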
Results/Outcome
The most effective RSF model used the top 45 features and had a concordance index of 0.75 at the quantization level of 255. For features extracted at quantization levels 8, 32, and 128, the concordance indexes were 0.67, 0.74, and 0.73, respectively. The optimal results for these levels were obtained using the top 25, 55, and 25 features, respectively. By contrast, models employing traditional clinical data yielded a lower concordance index of 0.64.
Conclusion
Comparing radiomic features to clinical data shows a notable improvement in survival prediction accuracy (concordance index of up to 0.75 versus 0.64). Radiomic data consistently outperforms traditional clinical data across quantization levels of 32 and above, which shows the value of quantization to streamline the analysis.
Statement of Impact
Radiomic features provide independent predictive value on top of using only clinical information when predicting survival in patients with CRLM.
Keywords
Colorectal cancer; Radiomic features; Quantization; Random Survival Forest
069 - The Phenotypic Basis of CT-derived Kidney Traits and Their Utility in Predicting Estimated Glomerular Filtration Rate
Presenter: David Y. Zhang, University of Pennsylvania
David Y. Zhang1, Rachit Kumar1, Ali H. Dhanaliwala1, Jeffrey T. Duda1, Hersh Sagreiya1, James C. Gee1, Charles E. Kahn, Jr.1, Marylyn D. Ritchie1, Daniel J. Rader1, Walter R. Witschey1
1University of Pennsylvania, Philadelphia, PA, USA
Introduction/Background
The volume of imaging data, necessity of accurate and consistent granular imaging traits defined across the lifespan, and need to support underserved communities require novel end-to-end automation strategies, especially for kidney evaluation. To address this challenge, we developed and applied AI to CT scans and analyzed the clinical relevance of imaging traits with disease. We hypothesized that kidney imaging-derived phenotypes (IDPs) could be used to predict estimated glomerular filtration rate (eGFR).
Methods/Intervention
We extracted thorax, abdomen, and/or pelvis CT scans for 20,289 individuals in the Penn Medicine Biobank, segmented the kidneys using TotalSegmentator, and derived quantitative imaging traits to perform association studies, controlled for sex, age, age^2, BMI, and population stratification. A simple feed-forward neural network was also trained to predict eGFR using the kidney traits as well as age and sex. The dataset was split into 70%/15%/15% training/validation/testing.
Results/Outcome
We performed phenome-wide association studies against multiple quantitative kidney IDPs. For kidney volume, we observed strong significant negative associations with end-stage renal disease as well as related circulatory conditions such as hypertension and congestive heart failure, with similar trends also identified for kidney surface area and mean attenuation. Our neural network model for predicting eGFR from kidney traits, age, and sex was trained on eGFR values documented within 7 days of a CT scan. The model exhibited robust predictive ability and had a mean squared error of 413.93 on the testing dataset. Using a cutoff of 60 mL/min/1.73 m² for chronic kidney disease, our model had a sensitivity of 61.7% and specificity of 87.3%.
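The reported metrics can be related to one another with a short sketch. The measured and predicted eGFR values below are hypothetical, and treating an eGFR strictly below 60 as the positive (chronic kidney disease) class is an assumption; the abstract gives only the cutoff value.

```python
import math

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def ckd_sensitivity_specificity(y_true, y_pred, cutoff=60.0):
    """Positive class: eGFR below `cutoff` (assumed to be a strict inequality)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        actual, predicted = t < cutoff, p < cutoff
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical measured and predicted eGFR values in mL/min/1.73 m².
measured = [45.0, 55.0, 30.0, 90.0, 75.0, 65.0]
predicted = [50.0, 62.0, 40.0, 85.0, 70.0, 58.0]

mse = mean_squared_error(measured, predicted)
sensitivity, specificity = ckd_sensitivity_specificity(measured, predicted)

# The reported test-set MSE of 413.93, expressed as a root mean squared error.
reported_rmse = math.sqrt(413.93)
print(mse, sensitivity, specificity, round(reported_rmse, 1))
```

Taking the square root puts the error in the same units as eGFR: the reported mean squared error corresponds to a root mean squared error of about 20.3 mL/min/1.73 m².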
Conclusion
Our association studies demonstrate not only strong correlations between CT imaging-derived kidney traits and health conditions, but also granularity in how different kidney diseases affect certain kidney traits and not others. We will also perform genetic association studies to study the genetic architecture of our kidney IDPs. Furthermore, the quantitative IDPs showed strong predictive potential for estimating eGFR.
Statement of Impact
Our results not only validate the biological relevance of our IDPs, but also demonstrate the clinical utility of predicting eGFR from imaging traits that could be integrated into a clinical workflow and used to indicate further testing in relevant patients.
Keywords
Imaging; Deep learning; PheWAS; Genomics
070 - Two-Step Fully Automated Classification of Choroidal Metastases on MRI: Orbit Localization via Bounding Boxes Followed by Binary Classification via Evolutionary Strategies
Presenter: Joseph N. Stember, Memorial Sloan Kettering Cancer Center
Jeffrey S. Shi1, Bala McRae-Posani1, Andrei Holodny1, Hrithwik Shalu2, Joseph N. Stember1
1Memorial Sloan Kettering Cancer Center, New York, NY, USA
2Indian Institute of Technology Madras, Chennai, India
Introduction/Background
The choroid of the eye is a rare site for metastatic spread of a tumor, and choroidal metastases (CMs) may be visualized on magnetic resonance imaging (MRI). However, as small lesions on the periphery of the image, they are often missed on brain MRI.
Methods/Intervention
Here, we describe sequential cropping and classification on brain MRI images to detect CMs using artificial intelligence (AI). We first trained an orbit localization model with a YOLOv5 architecture using 386 normal T2-weighted brain MRI images. The model predicted and cropped the positions of the orbits on MRI brain scans from 33 patients without and 33 patients with CMs. After zooming in around the orbits, the cropped images served as inputs to a binary classifier convolutional neural network (CNN) to classify images as normal or CM-containing. We used 36 images for training and the other 30 for testing. Given the small training set, we trained the network weights via the data-efficient deep neuroevolution (DNE) strategy.
Results/Outcome
Our orbit localization model achieved a mean average precision of 0.590 at an intersection-over-union threshold of 0.5. At a confidence threshold of 0.3, the model achieved a recall of 1.00 and a precision of 0.50, as it identified all orbits but was unable to distinguish “left” from “right”. Laterality was assigned afterwards using relative position. The model generalizes to scans with CMs; on our dataset of 33 slices demonstrating CMs, the model determined the bounding boxes without errors. The predicted bounding boxes were used to crop the images for training our CNN classification model. After training via DNE for over 80,000 episodes, the model converged on a training set accuracy of 100% and testing set accuracy of 100%.
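Two of the steps above, intersection over union and laterality assignment by relative position, can be sketched as follows. The box format, the function names, and the display convention (patient right shown on the image left) are assumptions for illustration and are not taken from the authors' implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_laterality(orbit_boxes):
    """Label two orbit boxes by horizontal position.

    Assumption: radiological display convention, so the patient's right
    orbit appears on the left side of the image (smaller x).
    """
    if len(orbit_boxes) != 2:
        raise ValueError("expected exactly two orbit boxes")
    ordered = sorted(orbit_boxes, key=lambda b: (b[0] + b[2]) / 2)
    return {"right": ordered[0], "left": ordered[1]}

# Hypothetical boxes in pixel coordinates.
overlap = iou((10, 10, 50, 50), (30, 10, 70, 50))
orbits = assign_laterality([(150, 60, 200, 100), (40, 60, 90, 100)])
print(round(overlap, 3), orbits)
```

In this example the overlap is one third, so the prediction would not count as a match at an intersection-over-union threshold of 0.5.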
Conclusion
We trained a YOLOv5 model to accurately localize and crop the orbits on brain MRI. The cropped images were subsequently used to train a CNN with excellent performance in detecting CMs.
Statement of Impact
Our method provides an end-to-end model to accurately detect small, peripheral, easy-to-miss lesions to potentially improve sensitivity for detection of CMs. It could thereby help reduce “corner of the image” false negatives.
Keywords
Object detection; Classification; Tumor; Cancer
071 - Utilizing Natural Language Processing and Deep Learning Classification of Radiology Reports to Evaluate the Sensitivity of Chest CT for Detecting Signs of Congestive Heart Failure
Presenter: Ali Memon, Creighton University School of Medicine
Daniel Spalinski1, Sherif Zineldine1, Ali Memon1, Kimberly Mendez2, Dorina Pinkhasova1, Michael Fei1, Daniel Nguyen1, Jad Alsheikh1, Randy Richardson1
1Creighton University School of Medicine, Omaha, NE, USA
2Baylor College of Medicine, Houston, TX, USA
Introduction/Background
Data on the use of language models (LMs) in the interpretation of radiology reports, notably chest CT reports, are limited. Chest CTs in the U.S. have a reported sensitivity and specificity of 86% and 68%, respectively, in detecting signs of congestive heart failure (CHF).1 LMs can be used to find key characteristics of pathology, which can allow for enhanced interpretation of reports. This study assesses the capabilities of natural language processing (NLP) in the evaluation of CT reports in patients with a diagnosis of CHF.
Methods/Intervention
This study is a retrospective review of data from MIMIC-IV, an open-access database derived from the electronic health records of Beth Israel Deaconess Medical Center from 2008 to 2019.2 The multi-label radiology report classification model SARLE was used to generate lists of significant findings for chest CTs performed in the same admission as a diagnosis of CHF according to appropriate ICD codes.3 Radiology reads were classified as positive according to the presence of one to six key radiographic findings. An ROC curve was generated based on these varying numbers of positive findings.
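One way to build an ROC curve from a varying number of positive findings is sketched below: each threshold "at least k of six findings" yields one operating point, and the area is taken by the trapezoidal rule. The finding counts and labels are hypothetical, and the abstract does not state how its curve or AUC was computed.

```python
def roc_points(finding_counts, has_chf, max_findings=6):
    """Return (1 - specificity, sensitivity) points for thresholds 'at least k findings'."""
    positives = sum(has_chf)
    negatives = len(has_chf) - positives
    points = [(0.0, 0.0)]
    # Start with the strictest threshold so the points run from left to right.
    for k in range(max_findings, 0, -1):
        tp = sum(1 for c, y in zip(finding_counts, has_chf) if y and c >= k)
        fp = sum(1 for c, y in zip(finding_counts, has_chf) if not y and c >= k)
        points.append((fp / negatives, tp / positives))
    points.append((1.0, 1.0))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve through the given points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical reports: number of key findings present, and CHF diagnosis.
counts = [0, 1, 0, 2, 3, 1, 4, 6]
labels = [False, False, False, False, True, True, True, True]

points = roc_points(counts, labels)
auc = trapezoid_auc(points)
print(points)
print(auc)
```

The threshold "at least one finding" gives the most sensitive and least specific operating point on the curve.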
Results/Outcome
A total of 3,670 hospital admissions in which a chest CT was performed with a concurrent diagnosis of CHF were included. Odds ratios (OR, 95% CI) were calculated for each finding using the model interpretation of 91,281 total chest CT reports. These findings included cardiomegaly (OR 6.4, 5.9-6.8), vascular congestion (OR 5.8, 4.7-7.2), pleural effusion (OR 8.1, 7.4-8.8), septal thickening (OR 5.2, 4.7-5.7), pulmonary edema (OR 8.2, 7.6-8.9), and dilated pulmonary vessels (OR 2.3, 2.1-2.5). The presence of at least one key radiographic finding had a sensitivity of 92.34% and a specificity of 52.92% for the presence of CHF. The resulting ROC curve had an AUC of 0.81.
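For readers unfamiliar with the statistic, an odds ratio and its 95% confidence interval can be computed from a 2x2 table as sketched below. The log-based (Woolf) interval is an assumption, since the abstract does not state which method was used, and the counts are hypothetical rather than taken from the study.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-based (Woolf) confidence interval.

    a: finding present, CHF present    b: finding present, CHF absent
    c: finding absent, CHF present     d: finding absent, CHF absent
    """
    ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * standard_error)
    upper = math.exp(math.log(ratio) + z * standard_error)
    return ratio, lower, upper

# Hypothetical counts for a single radiographic finding.
odds_ratio, lower, upper = odds_ratio_ci(a=40, b=20, c=60, d=180)
print(f"OR {odds_ratio:.1f}, 95% CI {lower:.1f}-{upper:.1f}")
```

An interval that excludes 1 indicates that the finding is associated with the diagnosis at the chosen confidence level.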
Conclusion
NLP allowed for the comprehensive interpretation of a large number of radiographic studies with CHF. Chest CT sensitivity for CHF may be greater than previously reported, and expanding this methodology to other modalities has the potential to better evaluate the accuracy of imaging modalities for specific pathologies.
Statement of Impact
The study of NLP has implications for the future of interpretation of radiology reports.
Keywords
Natural Language Processing; Deep Learning; Radiology report classification
072 - CT to MRI Style Transfer Deep Learning for Enhanced Detection of Brain Metastases
Presenter: Adhithya Narayanan, Geisel School of Medicine at Dartmouth
Adhithya Narayanan1, Nooriel Banayan2, Andrei I. Holodny3, Hrithwik Shalu4, Dylan G. Hsu3, Joseph N. Stember3
1Geisel School of Medicine at Dartmouth, Hanover, NH, USA
2SUNY Downstate Health Sciences University College of Medicine, New York, NY, USA
3Memorial Sloan Kettering Cancer Center, New York, NY, USA
4Indian Institute of Technology Madras, Chennai, India
Introduction/Background
Style transfer is a technique in AI computer vision that generates synthetic images by combining the content of one image with the visual attributes of another [1]. This approach has been employed in various contexts, such as enhancing the resolution of images from portable low-field MRI scanners to resemble those produced by high-field MRI scanners [2]. In this study, we evaluate cross-modality style transfer to assist radiologists in detecting brain metastases on CT. Vasogenic edema surrounding brain metastases can be subtle on non-contrast CT, often appearing as vague hypoattenuation. In contrast, T2 FLAIR imaging provides better visualization, as the edema appears bright with high contrast resolution [3, 4]. However, CT is much more commonly acquired, less expensive, and quicker than MRI. Therefore, we aimed to enhance the conspicuity of brain metastases on CT by style transferring to a virtual T2 FLAIR MRI. We assert that producing this synthetic MRI image may enable more confident detection of brain metastases.
Methods/Intervention
We used a two-dimensional Basic UNet++ model to generate style-transferred synthetic MRI from non-contrast CT head studies. The model was trained on 300 pairs of non-contrast CT and T2 FLAIR MRI images from 280 patients at our institution.
Results/Outcome
Qualitative assessment of the synthetic MRI images was performed by a board-certified neuroradiologist who determined that the synthetic images could improve confidence in detecting brain metastases over non-contrast CT alone.
Conclusion
In future and ongoing work, improved sensitivity for small metastases is being validated by surveying a larger group of board-certified neuroradiologists.
Statement of Impact
By increasing the conspicuity of features of metastases such as edema, synthetic MRI generation through style transfer from non-contrast CT can improve radiologists' confidence in detecting brain metastases and help inform clinical decisions.
Keywords
Artificial Intelligence; Style Transfer; Neuroradiology; Brain Metastases
073 - Prescreening Radiology Reports for Prostate Cancer Recurrences Using a Large Language Model (LLM)
Presenter: Ali Ganjizadeh, Mayo Clinic - Rochester
Ali Ganjizadeh1, Lance Mynderse1, Shahriar Faghani1, David A. Woodrum1, Bradley J. Erickson1
1Mayo Clinic – Rochester, Rochester, MN, USA
Introduction/Background
This study evaluated the efficiency and accuracy of using the Mixtral 8x7b v0.1 Instruct LLM to prescreen radiology reports for prostate cancer recurrences post-MR-guided seminal vesicle cryoablation.
Methods/Intervention
This retrospective study included 164 patients who underwent seminal vesicle cryoablation and were followed up with either PET or MRI scans every three months for 2 years. A total of 582 radiology reports were assessed using the Mixtral 8x7b v0.1 LLM, which was not fine-tuned but was provided with specific details about prostate cancer and radiological report analysis. The LLM analyzed the reports for recurrence indications at the ablation site without anatomical guidance. The performance of the model was evaluated by comparing its predictions with manual assessments of the radiology reports.
Results/Outcome
The model identified 21 true positive and 498 true negative reports, alongside 63 false positive and no false negative results. The performance metrics were: PPV = 25.00%, NPV = 100.00%, sensitivity = 100.00%, and specificity = 88.77%. The use of a language model for the prescreening of radiology reports substantially enhances efficiency, processing 582 reports in approximately 4 hours.
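The reported metrics follow directly from the four counts above, as this short check shows. The function name is illustrative; the counts are those stated in the abstract.

```python
def screening_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, returned as fractions."""
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts reported in the abstract: 21 TP, 498 TN, 63 FP, 0 FN.
metrics = screening_metrics(tp=21, tn=498, fp=63, fn=0)
total_reports = 21 + 498 + 63 + 0

for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
print(f"total reports: {total_reports}")
```

The four counts sum to the 582 reports assessed, and the low PPV reflects the 63 false positives among 84 flagged reports.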
Conclusion
This study demonstrates the potential of using an LLM for prescreening radiology reports to facilitate the detection of prostate cancer recurrences. Despite a high rate of false positives, which were largely attributed to anatomical proximity errors in token embedding, the model effectively ensured that no recurrence was missed. This emphasizes its utility as a clinical tool for enhancing radiologist awareness. Further refinement of the model's performance is necessary to reduce false positives.
Statement of Impact
The application of LLMs in prescreening radiological reports for cancer recurrence offers a significant time-saving advantage and ensures high sensitivity in clinical settings. This technology can serve as a supplementary tool to assist radiologists in managing large volumes of data, focusing on high-priority cases, improving patient care, and improving the early detection and management of cancer recurrences. This leads to timely interventions and improved patient outcomes.
Keywords
Urology; Interventional Radiology; Large Language Model; Cancer Screening
Supplement Details
Journal name: Journal of Imaging Informatics in Medicine
Supplement title: 2024 Conference on Machine Intelligence in Medical Imaging (CMIMI) – Selected Abstracts
Conference data (venue, location, date): Boston University-George Sherman Union | Boston, MA | October 21-22, 2024
Chair, Guest editor(s), or Organizing-committee name:
Katherine P. Andriole, PhD, FSIIM
Associate Professor of Radiology, Director of Imaging Informatics, Brigham and Women's Hospital, Harvard Medical School
Director of Research Strategy and Operations
MGH & BWH Center for Clinical Data Science
Peter D. Chang, MD
Associate Professor, Departments of Radiological Sciences and Computer Science, University of California, Irvine
Director, UCI Center for AI in Diagnostic Medicine
Ingrid Reiser, PhD, FAAPM
Associate Professor of Radiology, University of Chicago
Eliot L. Siegel, MD, FSIIM
Professor of Radiology, University of Maryland School of Medicine
Chief, Imaging Services, VA Maryland Health Care System
Jeffrey H. Siewerdsen, PhD, FAAPM, FAIMBE
Professor, Department of Imaging Physics
Director, Surgical Data Science Program, Institute for Data Science in Oncology
The University of Texas MD Anderson Cancer Center
Sponsor (or Society name): Society for Imaging Informatics in Medicine (SIIM)
Sponsorship statement (required): Publication of this supplement was sponsored by the Society for Imaging Informatics in Medicine (SIIM). All content was reviewed and selected by the CMIMI Program & Review Committee, which held full responsibility for the abstract selections.
Abstracts (quantity): 73
Publication format (Print and/or Online): Online
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


