Scientific Abstracts from the 2024 Conference on Machine Intelligence in Medical Imaging (CMIMI) of the Society for Imaging Informatics in Medicine (SIIM)


2024 Oct 28;37(Suppl 1):1–35. doi: 10.1007/s10278-024-01278-5


PMCID: PMC11685334

Title: 2024 Conference on Machine Intelligence in Medical Imaging (CMIMI)

Date: October 21-22, 2024

Venue: Boston University – George Sherman Union, Boston, MA

Sponsorship: Publication of this supplement was sponsored by the Society for Imaging Informatics in Medicine (SIIM). All content was reviewed and selected by the CMIMI Program & Review Committee, which held full responsibility for the abstract selections.

Program Listing

Monday, October 21

Tuesday, October 22

Oral Presentations

Image Classification | Scientific Abstract Presentations

001 - Advancing Endometriosis Detection: A Deep Learning-Enhanced Multi-sequence MRI Analytical Model

Presenter: Mana Moassefi, Mayo Clinic – Rochester

Mana Moassefi¹, Wendaline VanBuren¹, Bradley J. Erickson¹, Shahriar Faghani¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

Endometriosis, a condition characterized by the growth of endometrial-like tissue outside the uterus, affects 5-10% of women of reproductive age. Despite its prevalence, the diagnosis of endometriosis through imaging remains challenging due to the complex anatomy of the pelvis and heterogeneity of the disease itself on imaging, which requires expertise. Advances in deep learning (DL) are revolutionizing the diagnosis and management of complex medical conditions, promoting patient-centered treatment.

Methods/Intervention

We gathered a patient cohort from our institutional database, composed of patients with pathologically confirmed endometriosis from 2015 to 2024. We selected gynecologic MRIs performed within three months prior to diagnostic surgery. We also created an age-matched control group that underwent a similar MR protocol but had no diagnosis of endometriosis. We used sagittal T1-weighted (T1) pre- and post-contrast, as well as T2-weighted (T2) MRIs. We split our dataset at the patient level, allocating one-eighth of the dataset for testing and conducting seven-fold cross-validation on the remainder. MR images were analyzed using various convolutional neural network (CNN) architectures. Simultaneously, two abdominal radiologists with experience in endometriosis MRI and complex surgical planning and one women’s imaging fellow with specific training in endometriosis MRI reviewed a random selection of images and documented their endometriosis detection.
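The patient-level split described above can be sketched as follows. This is an illustration only, not the authors' code: the function name `patient_level_split`, the `(patient_id, study_id)` record layout, the round-robin fold assignment, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def patient_level_split(records, test_fraction=1 / 8, n_folds=7, seed=0):
    """Split (patient_id, study_id) records so no patient spans two partitions.

    Returns (test_records, folds), where folds is a list of n_folds lists.
    """
    by_patient = defaultdict(list)
    for patient_id, study_id in records:
        by_patient[patient_id].append((patient_id, study_id))

    # Shuffle patients, not studies, so every study of a patient stays together.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = patients[:n_test]
    cv_patients = patients[n_test:]

    test_records = [r for p in test_patients for r in by_patient[p]]
    folds = [[] for _ in range(n_folds)]
    for i, p in enumerate(cv_patients):
        folds[i % n_folds].extend(by_patient[p])  # round-robin over patients
    return test_records, folds


# Hypothetical toy cohort: 80 patients, every fifth one with a second study.
records = [(f"P{i:03d}", f"S{i:03d}a") for i in range(80)]
records += [(f"P{i:03d}", f"S{i:03d}b") for i in range(0, 80, 5)]
test_records, folds = patient_level_split(records)
```

With one-eighth held out and seven folds on the remainder, each cross-validation fold holds roughly the same number of patients as the test set.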

Results/Outcome

A total of 751 patients were included in the case and control groups. The final 3D-DenseNet-121 classifier model demonstrated robust performance. Our findings indicated the most accurate predictions were obtained using T2 and T1 pre- and post-contrast images. Testing on our test set using an ensemble technique resulted in an F1 score of 0.911, AUROC of 0.881, sensitivity of 0.976, and specificity of 0.720. Our radiologist readers achieved 72.2% and 78.5% sensitivity without and with AI assistance, respectively, in detecting endometriosis.

Conclusion

The study introduced the first DL model to use multi-sequence MRI on a large cohort, with endometriosis detection results equivalent to those of trained human readers. Further external validation of the model is in progress.

Statement of Impact

We aim to evaluate DL tools in enhancing the accuracy of multi-sequence MRI-based detection of endometriosis in daily practice.

Keywords

Endometriosis; Magnetic Resonance Imaging; Deep Learning

002 - Automated Detection of Tricuspid Regurgitation Using Multiple Instance Learning on Doppler Echocardiograms

Presenter: Anthony T. Wu, University of California, Irvine

Anthony T. Wu¹, Nathan H. Choi¹, Amanda Warren¹, Gavin Shu¹, Tobin Matthew¹, Kyle Digrande¹, Antonio Frangieh¹, Jin Kyung Kim¹, Xiaohui Xie¹, Peter Chang¹, Jennifer Xu¹

¹University of California Irvine, Irvine, CA, USA

Introduction/Background

Tricuspid regurgitation (TR) affects >70 million people worldwide. Of patients with severe-TR, 36% die within 1 year; however, >90% of these patients do not undergo intervention because of prohibitive surgical mortality. Early diagnosis of severe-TR is critical for treatment effectiveness, but such diagnoses are time-consuming and require expert image acquisition and interpretation of echocardiograms (echo). Thus, there is a critical need for an accurate and rapid method for binary severe-TR (i.e., intervention needed) detection. While 3D Convolutional Neural Networks are frequently used for medical image classification, the variable spatial and temporal dimension sizes of echoes make them less effective and interpretable. Herein we describe an interpretable multiple-instance model to detect severe-TR from echo.

Methods/Intervention

Echocardiograms were interpreted and labeled at the video level by expert cardiologists. Sample gates were extracted and split with a train-validation ratio of 4:1. The multiple-instance model comprises a feature extractor utilizing spatial grouped-convolution blocks and two temporal heads, an instance predictor employing a perceptron model, and an instance aggregator using mean-pooling. A grid-search over hyperparameters was performed for optimization.
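The instance predictor and mean-pooling aggregator can be illustrated with a small sketch. It is not the authors' model: the feature extractor is omitted, and the feature vectors, weights, and function names below are hypothetical values chosen only to show how per-instance scores are averaged into a video-level score.

```python
import math


def instance_scores(instances, weights, bias):
    """Perceptron-style instance predictor: sigmoid(w . x + b) per instance."""
    scores = []
    for x in instances:
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def bag_score(instances, weights, bias):
    """Mean-pooling aggregator: the bag score is the average instance score."""
    scores = instance_scores(instances, weights, bias)
    return sum(scores) / len(scores), scores


# Hypothetical 3-dimensional instance features for two videos of different lengths.
weights, bias = [1.5, -0.5, 2.0], -1.0
video_a = [[0.9, 0.1, 0.8], [0.7, 0.2, 0.9], [0.1, 0.9, 0.0]]
video_b = [[0.1, 0.8, 0.1], [0.0, 0.9, 0.2]]

score_a, per_instance_a = bag_score(video_a, weights, bias)
score_b, per_instance_b = bag_score(video_b, weights, bias)
```

Because the aggregator is a plain mean, bags of any length are handled without padding, and the per-instance scores remain available for inspecting which instances drive the video-level prediction.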

Results/Outcome

In total, 3604 echoes were collected. Our model achieved a maximum validation accuracy of 81.3% and ROC AUC of 0.865. Activations of spatial convolutions were used to visualize spatial focus, and instance predictions were used to identify temporal model focus.

Conclusion

Our model accurately detects severe-TR from echocardiograms with similar sensitivity and specificity, indicating balanced training. The separation of temporal and spatial features enhances its explainability. Activations of spatial convolutions demonstrate the model's ability to discern cardiac structure and Doppler flow, while single-instance predictions reveal its capability to learn fine-grained features from coarse video-level labels.

Statement of Impact

Our algorithm provides real-time interpretation of echo for severe-TR diagnoses with high sensitivity and allows cardiologists to focus on more severe patient cases, where quick diagnostic turnaround time is critical for intervention efficacy.

Keywords

Multiple Instance Learning; Tricuspid Regurgitation; Video Classification

003 - Iterations on a Classic: A Novel Machine Learning Algorithm for the Establishment of Pediatric Bone Age Using Knee Radiographs

Presenter: Kelly Horst, Mayo Clinic - Rochester

Kelly Horst¹, Bradley Erickson¹, Adam Tagliero¹, Aaron Krych¹, Shahriar Faghani¹

¹Mayo Clinic – Rochester, Rochester, MN, USA

Introduction/Background

It is common practice for orthopedic surgeons to obtain a radiograph of the left hand to estimate bone age for planning surgical interventions of the knee in pediatric patients, as those interventions are based on skeletal age rather than chronological age. We created a novel deep learning (DL) algorithm that determines skeletal age based on radiographs of the knee.

Methods/Intervention

We identified a total of 7,336 knee radiographs from 5,701 pediatric patients under 18 years of age, acquired between January 2018 and January 2024. The images included a range of normal images and images with pathology. The following views were used to train (80% of total), tune (10% of total) and test (10% of total) the model: 1,167 right 2 views, 1,252 left 2 views, 768 right 3 views, 831 left 3 views, 1,282 right 4+ views, 1,280 left 4+ views. Patients with more than one study were placed in the same cohort in order to prevent data leakage. We developed a view-agnostic multimodal deep learning model using an intermediate fusion approach. Our model employed a 2D DenseNet121 as the imaging feature extractor and two shallow neural networks. The first shallow neural network transformed patient sex into the imaging feature space, while the second neural network merged and processed the imaging features along with the transformed sex features to predict bone age in months. We used mean squared error as the loss function and co-trained all the neural networks with the AdamW optimizer. The model’s performance was evaluated on the test set using the mean absolute error (MAE) metric.
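The intermediate fusion design can be sketched as an untrained forward pass. This is a shape-level illustration only, not the authors' implementation: the DenseNet121 extractor is replaced by a random stand-in vector, merging by concatenation is an assumption (the abstract does not say how features are merged), and all layer sizes and function names are hypothetical.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for shape only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_bone_age(image_features, sex, params):
    """Intermediate fusion: embed sex, merge with image features, regress age."""
    sex_embedding = relu(linear([float(sex)], *params["sex_net"]))
    fused = image_features + sex_embedding  # list concatenation (an assumption)
    hidden = relu(linear(fused, *params["fusion_hidden"]))
    return linear(hidden, *params["fusion_out"])[0]  # bone age in months


def mean_absolute_error(predicted, actual):
    """Evaluation metric named in the abstract."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


rng = random.Random(0)
feature_dim = 8  # hypothetical; a real DenseNet121 yields far more features
params = {
    "sex_net": init_layer(1, feature_dim, rng),
    "fusion_hidden": init_layer(2 * feature_dim, 16, rng),
    "fusion_out": init_layer(16, 1, rng),
}
image_features = [rng.random() for _ in range(feature_dim)]  # stand-in for CNN output
untrained_prediction = predict_bone_age(image_features, sex=1, params=params)
```

Mapping sex into a vector of the same width as the imaging features lets the second network weigh both inputs on a comparable scale, rather than appending a single binary value to a long feature vector.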

Results/Outcome

The mean patient age was 13.4 years, with a standard deviation of 3.5 years. The model showed an MAE of 9.7 months on the test cohort.

Conclusion

A view-agnostic multimodal DL algorithm can estimate bone age from radiographs of the knee, with higher accuracy than published references.

Statement of Impact

This ML algorithm may enable clinicians to forgo the routinely obtained left hand radiograph for skeletal bone age.

Keywords

Bone age; Regression algorithms; CNN

004 - Iterations on a Classic: Utilizing DenseNet121 for Gender Differentiation in Pediatric Bone Age Assessment Algorithms

Presenter: Kelly Horst, Mayo Clinic - Rochester

Kelly Horst¹, Shahriar Faghani¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

The algorithm aims to enhance the precision of bone age estimation by accurately identifying patient sex, since patient sex is a crucial input for bone age prediction and relying solely on DICOM (Digital Imaging and Communications in Medicine) metadata can result in errors. In addition, we aim to determine whether there are gender-based morphological differences in hand radiographs in the pediatric population.

Methods/Intervention

The DenseNet121 model was adapted for gender differentiation in pediatric bone age assessments. The 2017 RSNA bone age dataset was utilized, with 12,611 training images and 1,425 reserved for the validation set. There are 5,778 female and 6,833 male subjects, ranging from 1 to 228 months in age, with a mean bone age of 127.2 months and a standard deviation of 41.7 months. Channel normalization, resizing to 512 x 512, foreground cropping, normalization of image intensities, random histogram shift, flipping, random affine transformations, and Gaussian noise addition were used to enhance model generalizability. Weighted cross-entropy with inverse class ratios was used as the loss function to address the slight imbalance in the dataset. To gain insight into the model's decision-making process, occlusion maps were created and reviewed by a pediatric radiologist. Area under the receiver operating characteristic curve (AUROC) was reported as the performance metric.
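As a worked illustration of weighting by inverse class ratios, the sketch below derives class weights from the reported 5,778 female and 6,833 male subjects. The normalization (total divided by the number of classes times the class count) and the division of the loss by the summed weights are assumptions, since the abstract does not state them; the function names and the example probabilities are hypothetical.

```python
import math


def inverse_ratio_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: total / (n_classes * n) for c, n in class_counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the summed weights."""
    loss, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        loss += -weights[y] * math.log(p[y])
        norm += weights[y]
    return loss / norm


# Class counts reported in the abstract.
weights = inverse_ratio_weights({"female": 5778, "male": 6833})

# Hypothetical predicted probabilities for three images.
probs = [
    {"female": 0.9, "male": 0.1},
    {"female": 0.3, "male": 0.7},
    {"female": 0.4, "male": 0.6},
]
labels = ["female", "male", "female"]
loss = weighted_cross_entropy(probs, labels, weights)
```

Under this normalization the female class receives a weight slightly above 1 and the male class slightly below 1, so each class contributes equal total weight across the dataset.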

Results/Outcome

An AUROC of 0.985 was achieved with a batch size of 16, a learning rate of 0.001, and 200 epochs. Due to the excellent performance, no hyperparameter tuning was performed.

Conclusion

The model is effective in predicting patient gender with a high degree of accuracy. There are some imaging features that contain gender information in the pediatric population.

Statement of Impact

This study provides proof of concept that some imaging features contain gender information in the pediatric population. Leveraging interpretation methods might shed light on the biology of these differences and could be used as a scientific discovery tool. The model is highly reliable for gender classification in the context of bone age assessment, leveraging detailed image features that may be imperceptible to human observers.

Keywords

Bone age; Sex differentiation; CNN

005 - On-demand Generation of Probabilistic Models for Radiology Differential Diagnosis from Real-world Data

Presenter: Charles E. Kahn, Jr., University of Pennsylvania

Charles E. Kahn, Jr.¹

¹University of Pennsylvania, Philadelphia, PA, USA

Introduction/Background

Bayesian networks apply probability theory to perform diagnostic reasoning. They offer several attractive features, including the ability to explain their reasoning and to account for missing or conflicting data. However, their construction often is limited by the lack of real-world data to derive the conditional-probability tables (CPTs) that relate two conditions. This work sought to establish an approach to extract probability data from radiology reports and apply that data for on-demand generation of Bayesian network models.

Methods/Intervention

The Radiology Gamuts Ontology (RGO), a reference source of more than 2000 radiology differential-diagnosis listings, was accessed through its application programming interface (https://gamuts.net/api/specialty). Two years of radiology reports from a large U.S. academic health system were analyzed using named-entity recognition and negation-detection techniques to identify positive mentions of RGO entities. An occurrence was defined as a positive mention of an RGO entity in a patient. Data were aggregated by patient; the software tallied the number of occurrences of each RGO entity and the number of co-occurrences of each pair of RGO entities. The age and sex distribution of each condition was computed.
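The per-patient tallying step can be sketched as below. This is a minimal illustration, not the project software: the entity names are hypothetical examples rather than actual RGO identifiers, and estimating a conditional probability as a simple ratio of co-occurrence to occurrence counts is an assumption about how such tallies might feed a conditional-probability table.

```python
from collections import Counter
from itertools import combinations


def tally(patient_entities):
    """Count occurrences and pairwise co-occurrences, aggregated by patient.

    patient_entities maps patient_id -> iterable of positively mentioned entities.
    Repeated mentions within one patient count once.
    """
    occurrences = Counter()
    co_occurrences = Counter()
    for entities in patient_entities.values():
        unique = sorted(set(entities))
        occurrences.update(unique)
        co_occurrences.update(combinations(unique, 2))
    return occurrences, co_occurrences


def conditional_probability(a, given, occurrences, co_occurrences):
    """Estimate P(a | given) as co-occurrence count / occurrence count of `given`."""
    if occurrences[given] == 0:
        return None
    pair = tuple(sorted((a, given)))
    return co_occurrences[pair] / occurrences[given]


# Hypothetical positive mentions extracted from four patients' reports.
patient_entities = {
    "p1": ["pneumonia", "consolidation", "pneumonia"],
    "p2": ["consolidation"],
    "p3": ["pneumonia", "consolidation", "pleural effusion"],
    "p4": ["pleural effusion"],
}
occurrences, co_occurrences = tally(patient_entities)
p_cons_given_pna = conditional_probability(
    "consolidation", "pneumonia", occurrences, co_occurrences
)
p_pna_given_cons = conditional_probability(
    "pneumonia", "consolidation", occurrences, co_occurrences
)
```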

Results/Outcome

From approximately 1.8 million reports on 1.3 million distinct patients, the software generated probabilistic data for the 2,742 RGO entities (of 16,839 total) that occurred in the dataset. Project software aggregated probability data around the specified entity and entities that could cause or be caused by it; the generated Bayesian network model was encoded in Structural Modeling, Inference, and Learning Engine (SMILE) format. Diagnostic inference was performed using the GeNIe platform (BayesFusion LLC, Pittsburgh, PA).

Conclusion

This methodology generates Bayesian-network models for radiology diagnosis from real-world data extracted from analysis of radiology reports. It demonstrates the ability to extract probabilistic data from the unstructured text of radiology reports to generate diagnostic models tailored to the prevalence of diseases and imaging findings of a specific organization's patient population.

Statement of Impact

This report describes a novel approach that generates conditional-probability data from unstructured radiology reports. It overcomes a key limitation of Bayesian networks and allows one to create diagnostic models that apply the frequencies of diseases and imaging findings of a specific set of patients.

Keywords

Diagnosis; Bayesian networks; Probabilistic reasoning; Real-world data

006 - Predicting Brain Age in Autism Spectrum Disorders Using Graph Neural Networks

Presenter: Anureet Tiwana, University of Alberta

Anureet Tiwana¹, James R. Mitchell¹

¹University of Alberta, Edmonton, Alberta, Canada

Introduction/Background

Autism Spectrum Disorder (ASD) diagnosis is complicated by symptom variability and traditional labor-intensive methods. This study explores using "brain age," derived from neuroimaging data, to quantify developmental delays in ASD. Leveraging advanced GNN models, we aim to enhance early, accurate diagnoses and intervention strategies.

Methods/Intervention

The study used the ABIDE dataset, an open-source resource containing preprocessed neuroimaging data from 1112 individuals across 20 international sites, including 539 with ASD and 573 controls. The data were preprocessed using the Data Processing Assistant for Resting-State fMRI (DPARSF) and analyzed with the "Dosenbach160" ROI set. Graph construction involved defining node and edge connections, with nodes representing regions of interest (ROIs) and edges representing functional connections. Three Graph Neural Network (GNN) architectures were employed: Graph Attention Networks (GAT), Chebyshev Graph Convolutional Networks (ChebNets), and Graph Isomorphism Networks (GIN).
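One common way to build such a graph is to connect ROIs whose time series are strongly correlated. The sketch below illustrates that idea only; the abstract does not state how functional connections were defined, so the use of Pearson correlation, the 0.5 threshold, the function names, and the toy time series are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_graph(roi_series, threshold=0.5):
    """Nodes are ROI indices; an undirected edge joins ROIs with |r| >= threshold."""
    n = len(roi_series)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(roi_series[i], roi_series[j])
            if abs(r) >= threshold:
                edges.append((i, j, r))  # keep r as an edge weight
    return list(range(n)), edges


# Hypothetical time series for four ROIs (a real ROI set has many more).
roi_series = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, -1.0, 1.0, -1.0, 1.0],
]
nodes, edges = build_graph(roi_series)
```

The resulting node list and weighted edge list are the inputs a GNN library would typically expect, after conversion to its own tensor format.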

Results/Outcome

The GNNBrainAgePredictor models using GAT, ChebNet, and GIN architectures were evaluated for predicting brain age. The GAT model, with two GAT layers, global mean pooling, and a linear regression layer, achieved an MAE of 4.8915 for the autism group and 6.3125 for the control group. The ChebNet model, using Chebyshev polynomials for graph convolutions, achieved an MAE of 5.2876 for the autism group and 6.6340 for the control group. The GIN model, with two GINConv layers, achieved an MAE of 4.9252 for the autism group and 6.3012 for the control group. Unexpectedly, the MAE was lower for the autism group across models, suggesting developmental delays rather than pathological processes.

Conclusion

The GIN model emerged as the most effective, outperforming GAT and ChebNet models in predicting brain age with the lowest MAE observed. The lower MAE for the autism group challenges conventional understanding and indicates potential developmental delays. Further investigation is required to understand the neurobiological underpinnings of these findings.

Statement of Impact

Our study highlights the potential of advanced GNN architectures, particularly GIN, for accurately predicting brain age and enhancing our understanding of brain development in ASD. These findings suggest that leveraging machine learning techniques can significantly improve early diagnosis and intervention strategies, ultimately leading to more personalized and effective treatments for individuals with ASD.

Keywords

Graph Neural Network; Autism; ABIDE; Brain age

Oral Presentations

Bias & Uncertainty | Scientific Abstract Presentations

007 - A Technical Exploration of Bone Age Prediction with Machine Learning Regression Algorithms: Conformal Prediction

Presenter: Shahriar Faghani, Mayo Clinic - Rochester

Shahriar Faghani¹, Kelly Horst¹, Bradley J. Erickson¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

Regression models usually provide a point estimate rather than an interval that reflects the uncertainty of the prediction. While some uncertainty quantification methods can be applied to regression algorithms to obtain prediction intervals, most lack statistical guarantees for these intervals. Conformal prediction, a post-hoc method, stands out by offering statistical guarantees. We aimed to apply conformal prediction to bone age prediction in the pediatric population.

Methods/Intervention

A multimodal deep learning model was developed to estimate the skeletal age based on left-sided hand radiographs. This model consists of the DenseNet121 architecture as an imaging feature extractor and two shallow neural networks: one for transforming patient sex into the imaging feature space and the other for combining all features. The 2017 RSNA bone age dataset was utilized, with 12,611 training images and 700 and 725 images reserved for the calibration and validation sets, respectively. There are 5,778 female and 6,833 male subjects, ranging from 1 to 228 months in age, with a mean bone age of 127.2 months and a standard deviation of 41.7 months. Resizing to 512 x 512, foreground cropping, normalization of image intensities based on training dataset statistics to zero mean and unit standard deviation, histogram shift, flipping, affine transformations, and Gaussian noise addition were used to enhance model generalizability. Quantile regression loss with the 5th and 95th percentiles provided at least 90% coverage for interval predictions. A calibration set was used to calibrate the predicted percentiles for each image based on the conformal procedure. Mean absolute error (MAE) and mean prediction interval (MPI) were calculated.
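The calibration step can be illustrated with conformalized quantile regression, one standard way to combine quantile predictions with a conformal procedure. Whether the authors used exactly this variant is an assumption, and the calibration values and function names below are hypothetical. Each calibration case receives a score measuring how far the true value falls outside the predicted interval, and a high-ranking score becomes a fixed offset that widens (or narrows) every new interval.

```python
import math


def cqr_offset(lower_preds, upper_preds, targets, alpha=0.10):
    """Conformal offset from a calibration set.

    The score is positive when the target lies outside [lower, upper]
    and negative when it lies inside.
    """
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(lower_preds, upper_preds, targets)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration cases for the requested coverage
    return scores[rank - 1]


def conformal_interval(lower_pred, upper_pred, offset):
    """Apply the calibrated offset to a new predicted interval."""
    return lower_pred - offset, upper_pred + offset


# Hypothetical calibration set: predicted 5th/95th percentiles and true ages (months).
cal_lower = [95, 100, 118, 125, 130, 152, 150, 160, 170]
cal_upper = [105, 120, 126, 135, 150, 160, 170, 168, 190]
cal_truth = [100, 110, 120, 130, 140, 150, 160, 170, 180]

offset = cqr_offset(cal_lower, cal_upper, cal_truth, alpha=0.10)
new_interval = conformal_interval(120, 140, offset)
```

The interval width still varies per image because it comes from the quantile heads; the conformal step only adds the same offset to both ends.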

Results/Outcome

An MAE of 5.6, an MPI of 3.7, and a prediction interval coverage of 97% were achieved, indicating that 97% of the observations fall within the predicted intervals.

Conclusion

These intervals allow for the individualization of the result uncertainty quantification and may be helpful to include in individual patient reports. This measurement may also assist in identifying individual outliers, which may need further follow-up.

Statement of Impact

To the best of our knowledge, this is the first application of conformal prediction in radiologic image-level regression tasks, which could serve as a template for other clinical challenges.

Keywords

Deep learning; Uncertainty quantification; Regression; Pediatric radiology

008 - Automated Identification of Challenging Samples in Medical Imaging for Unbiased AI Model Training

Presenter: Frank Li, Emory University

Frank Li¹, Theo Dapamede¹, Bardia Khosravi², Mohammadreza Chavoshi³, Saptarshi Purkayastha⁴, Hari Trivedi¹, Judy Gichoya¹

¹Emory University, Atlanta, GA, USA

²Yale University, New Haven, CT, USA

³Shariati Hospital, Tehran, Iran

⁴Indiana University, Indianapolis, IN, USA

Introduction/Background

In medical imaging datasets, "shortcuts" or spurious correlations can cause AI models to unintentionally depend on irrelevant features when making decisions. For instance, the presence of support devices such as chest tubes acts as a shortcut when predicting pneumothorax (easy cases), and pneumothorax cases without chest tubes are harder for the model to learn (hard-to-learn cases). However, hard-to-learn cases are not always known a priori. In this study, we aim to establish a pipeline to automatically differentiate easy- and hard-to-learn cases during model development.

Methods/Intervention

We used a bias amplification (BAM) technique during model training to identify hard-to-learn samples within the SIIM-ACR pneumothorax dataset. BAM incorporates a trainable auxiliary variable (b) to track errors made by the model during training to identify hard-to-learn samples and amplify them during the training process. For our experiments, predicted probabilities below 0.25 for positive samples (false negatives, FN) and above 0.75 for negative samples (false positives, FP) were designated as hard-to-learn samples. Conversely, predicted probabilities above 0.75 for positive samples (true positives, TP) and below 0.25 for negative samples (true negatives, TN) were regarded as easy-to-learn samples. GradCAM++ was used to generate saliency maps, and the images were reviewed by two radiologists.
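The thresholding rule above maps directly to a small function. This sketch covers only the categorization step, not BAM training itself; the function name, the category strings, and the "intermediate" label for probabilities between the two thresholds are assumptions, since the abstract does not say how such cases were treated.

```python
def categorize(prob, label, low=0.25, high=0.75):
    """Classify one sample by predicted probability and ground-truth label.

    prob is the predicted probability of the positive class; label is 1 or 0.
    """
    if label == 1:
        if prob < low:
            return "hard (FN)"
        if prob > high:
            return "easy (TP)"
    else:
        if prob > high:
            return "hard (FP)"
        if prob < low:
            return "easy (TN)"
    return "intermediate"  # hypothetical label for the unassigned middle band


# Hypothetical (probability, label) pairs.
samples = [(0.10, 1), (0.90, 1), (0.85, 0), (0.05, 0), (0.50, 1)]
categories = [categorize(p, y) for p, y in samples]
```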

Results/Outcome

The magnitude of the auxiliary variable (b) increased with the level of learning difficulty, implying that the model leaned more heavily on b when facing challenging samples. As expected, FP and TP examples had a higher presence of support devices and FN were often missing support devices. Saliency maps and radiologist review revealed that the model focused more on support devices, further supporting this observation.

Conclusion

Our findings validate the hypothesis that images containing both pneumothorax and support devices are easier for the model to learn from. The proposed pipeline may serve as an automated tool to identify hard-to-learn samples in medical imaging datasets, facilitating the training of unbiased AI models and reducing the reliance on labor-intensive human labeling.

Statement of Impact

This study offers an automated method of narrowing down datasets for AI training and validation to alleviate the need for extensive human labeling of granular labels in medical imaging datasets.

Keywords

AI Bias; Shortcut Learning; Medical Imaging; Pneumothorax

009 - Development of a Calculator for External Validation Study Sample Size in Radiology AI

Presenter: Shahriar Faghani, Mayo Clinic - Rochester

Shahriar Faghani¹, Mana Moassefi¹, Bradley J. Erickson¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

External validation of clinical prediction models in radiology AI ensures their performance and generalizability in independent datasets. Accurate estimation of sample sizes for these validation studies is crucial for obtaining precise and unbiased performance metrics, such as the C-statistic and its standard error (SE). This paper introduces a Python-based calculator designed to estimate the required sample size for external validation studies in radiology AI. The tool uses mathematical and statistical methods to provide precise sample size estimations, aiding researchers in designing robust validation studies.

Methods/Intervention

The Python-based tool requires input values for the C-statistic and the outcome event proportion (phi) from the validation dataset. It calculates the standard error (SE) of the C-statistic using a specific formula that considers the sample size (N), the C-statistic value, and the outcome event proportion. The SE of the C-statistic is calculated using the formula:

$$\text{SE}(C) \approx \frac{C(1 - C)}{N \phi (1 - \phi)} \left( 1 + \frac{N / 2 - 1}{2 - C} + \frac{N / 2 - 1}{1 + C} \right)$$

The tool provides two methods for estimating the required sample size for a targeted SE. The first method calculates the sample size directly from the SE formula, while the second method uses a quadratic equation approach. The tool is implemented using the Gradio interface, allowing users to input the C-statistic value, outcome event proportion, and targeted SE, and receive the required sample size.
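A sample-size search for a targeted SE can be sketched as follows. This is not the tool's code and is not necessarily either of its two methods. As a clearly labeled assumption, the sketch uses the closed-form approximation commonly given in the external-validation sample-size literature, in which the two fractions carry the numerator factors (1 − C) and C, and the whole expression sits under a square root with N² in the denominator. That form differs from the expression as printed in the abstract, which does not shrink as N grows. The function names and the bisection search are illustrative.

```python
import math


def se_c_statistic(n, c, phi):
    """Approximate SE of the C-statistic (assumed square-root form, see lead-in)."""
    inner = 1 + (n / 2 - 1) * (1 - c) / (2 - c) + (n / 2 - 1) * c / (1 + c)
    return math.sqrt(c * (1 - c) * inner / (n ** 2 * phi * (1 - phi)))


def required_sample_size(c, phi, target_se):
    """Smallest n whose approximate SE(C) is at or below target_se."""
    if target_se <= 0:
        raise ValueError("target_se must be positive")
    # The approximation decreases as n grows, so double until the target is met...
    hi = 2
    while se_c_statistic(hi, c, phi) > target_se:
        hi *= 2
    lo = max(2, hi // 2)
    # ...then bisect for the smallest n that meets it.
    while lo < hi:
        mid = (lo + hi) // 2
        if se_c_statistic(mid, c, phi) <= target_se:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical inputs: anticipated C-statistic, outcome proportion, and target SE.
n_balanced = required_sample_size(c=0.80, phi=0.50, target_se=0.0255)
n_rare = required_sample_size(c=0.80, phi=0.10, target_se=0.0255)
```

Under this approximation, a rarer outcome at the same anticipated C-statistic and target SE calls for a larger validation sample.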

Results/Outcome

The tool was tested with various C-statistic values and outcome event proportions, demonstrating its accuracy in estimating the required sample size for different scenarios.

Conclusion

The calculator is a valuable tool for researchers conducting external validation studies in radiology AI. It provides accurate sample size estimations, ensuring robust and reliable evaluations of clinical prediction models.

Statement of Impact

The user-friendly interface enables researchers to quickly determine the necessary sample size for their validation studies. The tool ensures that external validation studies are adequately powered, enhancing the reliability and generalizability of AI models in radiology.

Keywords

Classification; External validation; Sample size calculation; Statistics

010 - Dissecting the Impact of Data Augmentation on Whole Slide Image Classification

Presenter: Dagoberto Pulido Arias, Massachusetts General Hospital

Dagoberto Pulido Arias¹, Tiago Gonçalves², Jaime Cardoso³, Jayashree Kalpathy-Cramer⁴, Elizabeth Gerstner¹, Albert Kim¹, Christopher Bridge⁵

¹Massachusetts General Hospital, Boston, MA, USA

²FEUP, INESC TEC, Porto, Portugal

³INESC Porto, Universidade do Porto, Porto, Portugal

⁴University of Colorado Anschutz Medical Campus, Aurora, CO, USA

⁵Harvard Medical School, Boston, MA, USA

Introduction/Background

Machine learning in pathology shows potential for advancing precision oncology across multiple tumor types. However, limited annotated samples and high intra-class variability in histological images constrain these models' clinical potential. Data augmentation enhances model performance and generalizability, especially with limited training data, but whole-slide image (WSI) classification models require pre-computed tile-level features, limiting the ability to perform on-the-fly augmentations. This study compares the impact of image-level and feature-level data augmentation on WSI classification in pathology at both tile and slide levels, specifically using foundational models for feature extraction. Our goal is to assess the necessity of data augmentation when using foundational models.

Methods/Intervention

We conducted experiments using WSIs from the breast cancer subset of The Cancer Genome Atlas (TCGA), utilizing two pre-trained feature extractor models: ResNet-50, trained on ImageNet, and CONCH (CONtrastive learning from Captions for Histopathology), a vision-language foundational model designed for microscopic pathology and trained on millions of WSIs. We used a Clustering-constrained attention multiple instance learning model for classification. Our study compared different image-level augmentations, such as Hematoxylin-Eosin-DAB (HED) color transformation and tile shifting, and feature-level augmentation using Pseudo-Bag Mixup (PseMix), applied to features extracted by both models. Feature-level augmentation creates synthetic variations of feature representations to help the model generalize better.
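To make the idea of feature-level augmentation concrete, the sketch below mixes pre-computed tile features from two slides. It is a simplified illustration in the spirit of pseudo-bag mixing, not the published PseMix procedure: the random partition into pseudo-bags, the fixed mixing count, the function names, and the toy features are all assumptions.

```python
import random


def split_into_pseudo_bags(bag, n_pseudo, rng):
    """Randomly partition a bag (list of tile feature vectors) into n_pseudo groups."""
    indices = list(range(len(bag)))
    rng.shuffle(indices)
    return [[bag[i] for i in indices[k::n_pseudo]] for k in range(n_pseudo)]


def mix_bags(bag_a, label_a, bag_b, label_b, n_pseudo=4, n_from_a=3, seed=0):
    """Build a synthetic bag from n_from_a pseudo-bags of A and the rest from B.

    The label is mixed in the same proportion as the pseudo-bags.
    """
    rng = random.Random(seed)
    pseudo_a = split_into_pseudo_bags(bag_a, n_pseudo, rng)
    pseudo_b = split_into_pseudo_bags(bag_b, n_pseudo, rng)
    mixed = [v for group in pseudo_a[:n_from_a] for v in group]
    mixed += [v for group in pseudo_b[n_from_a:] for v in group]
    lam = n_from_a / n_pseudo
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed, mixed_label


# Hypothetical 2-dimensional tile features; the first entry marks the source slide.
bag_a = [[1.0, float(i)] for i in range(8)]  # slide A, label 1
bag_b = [[0.0, float(i)] for i in range(8)]  # slide B, label 0
mixed_bag, mixed_label = mix_bags(bag_a, 1.0, bag_b, 0.0)
```

Because the mixing operates on stored feature vectors, no tile needs to be re-encoded, which is the practical appeal of feature-level over image-level augmentation when features are pre-computed.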

Results/Outcome

The CONCH model, augmented with individual techniques, significantly improved classification accuracy, achieving a test AUC of 0.868 with Pseudo-Bag Mixup. Without augmentation, the baseline model's test AUC was 0.758 with ResNet-50 and 0.846 with CONCH. Image-level augmentations like HED color transformation and tile shifting also improved performance. For example, tile shifting led to a test AUC of 0.835 with ResNet-50, and HED alone with CONCH achieved a test AUC of 0.856.

Conclusion

Data augmentation techniques are essential for enhancing model performance, addressing the challenges of limited annotated data and high intra-class variability. Both image-level and feature-level augmentations improve predictive performance, providing a robust solution for increasing the accuracy and reliability of computational pathology models.

Statement of Impact

This study underscores the importance and benefits of efficient data augmentation in computational pathology, contributing to the development of robust, high-performing algorithms without needing additional data and resources.

Keywords

Artificial Intelligence; Attention mechanisms; Breast cancer; Computational pathology

011 - Iterations on a Classic: A Robust Hand Bone Age Algorithm Resistant to Computational Stress

Presenter: Kelly Horst, Mayo Clinic - Rochester

Kelly Horst¹, Bradley J. Erickson¹, Shahriar Faghani¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

Anterior-posterior images of the left hand have been used for decades to estimate skeletal age. More recently, deep learning (DL) algorithms have been used to estimate skeletal age. The winning algorithm from the 2017 RSNA challenge was recently republished with relatively poor performance when a validation set with varied clinical image appearances was used to test the model under computational stress. We sought to train a model with improved results and more robust performance under extensive variations in clinical image appearance.

Methods/Intervention

A multimodal DL model was developed and adapted for pediatric bone age assessment. This model includes the DenseNet121 architecture as an imaging feature extractor and two shallow neural networks: one for transforming patient sex into the imaging feature space, and the other for combining all features. The 2017 RSNA bone age dataset was utilized, with 12,611 training images and 1,425 reserved for the validation set. There are 5,778 female and 6,833 male subjects, ranging from 1 to 228 months in age, with a mean bone age of 127.2 months and a standard deviation of 41.7 months. Resizing to 512 x 512, foreground cropping, normalization of image intensities using train set statistics to zero mean and unit standard deviation, random histogram shift, flipping, random affine transformations including rotation, translation and scaling, and Gaussian noise addition were used to enhance model generalizability during training. To perform the computational stress test, the same data augmentation pipeline was applied during inference on the validation set. Mean absolute error (MAE) was reported as the performance metric.

Results/Outcome

An MAE of 5.6 months was achieved with a batch size of 16, learning rate of 0.001, and 500 epochs, which is an improvement on the previously published winning model performance of 6.8 months. Importantly, the model achieved this result with extensive variations in clinical image appearance.

Conclusion

A deep neural network can accurately estimate bone age from radiographs of the left hand among pediatric patients up to 21 years of age, with robust performance under computational stressors.

Statement of Impact

Training and testing algorithms with computational stress will enhance real-world performance. This should be confirmed prospectively with clinical application.

Keywords

Bone age; Computational stress; CNNs

012 - Unveiling Bias in AI Model Training Data: Exploring the Impact of Intrinsic Data Variability on Lung Ultrasound Video Classification Models

Presenter: Saunak Bhattacharjee, Boston University

Saunak Bhattacharjee¹, Umair Khan², Russell Thompson³, Lauren P. Etter⁴, Ingrid Camelo⁵, Rachel C. Pieciak⁶, Ilse Castro-Aragon⁷, Bindu Setty⁷, Christopher C. Gill⁶, Margrit Betke¹

¹Boston University, Boston, MA, USA

²University of Trento, Trento, Italy

³University of Massachusetts, Dartmouth, MA, USA

⁴University of Wisconsin-Madison, Madison, WI, USA

⁵Augusta University, Augusta, GA, USA

⁶Boston University School of Public Health, Boston, MA, USA

⁷Boston Medical Center, Boston, MA, USA

Introduction/Background

Lung ultrasound (LUS) is an emerging tool for providing clinical support for patients with respiratory diseases.

However, its operator-dependent data acquisition and interpretation introduce potential variability in data collection and

analysis. Factors such as variations in scanning techniques, duration of the recorded scan, and interpretation of visual

patterns may introduce discrepancies. These discrepancies can affect data consistency and potentially bias the model

training, impacting the development of generalizable artificial intelligence (AI)-based models.

Methods/Intervention

To investigate the inherent bias in LUS video data and its impact on AI model training, we employed a transformer-based

video classification model aimed at identifying lung consolidations among pediatric patients. This model was

complemented by a frame-level transformer-based classification model that aggregates frame-level predictions to produce

a video-level score. Both models were trained and validated on 2,400 videos collected in a sweep-acquisition fashion from

200 pediatric patients with pneumonia and were subsequently tested on an external dataset comprising another 2,400

LUS videos from 200 healthy individuals.
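
The abstract does not say how frame-level predictions are aggregated into a video-level score. One simple possibility, shown here purely as an assumption, is to average the per-frame consolidation probabilities and threshold the result. The function names and the 0.5 threshold are hypothetical.

```python
def video_score(frame_probs):
    """Average per-frame consolidation probabilities (assumed aggregation rule)."""
    if not frame_probs:
        raise ValueError("video has no frames")
    return sum(frame_probs) / len(frame_probs)


def classify_video(frame_probs, threshold=0.5):
    """Flag the video as containing a consolidation if the mean score reaches the threshold."""
    return video_score(frame_probs) >= threshold
```

Averaging weights every frame equally, so a longer sweep contributes more frames to the score; this is one route by which video length could influence the prediction.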

Results/Outcome

The analysis of the dataset revealed a correlation coefficient of 0.4039 between larger lung consolidations and longer

video lengths, suggesting moderate operator bias in data collection. The video classification model achieved a

100% accuracy on the external dataset. The frame-level model consistently predicted all frames from healthy individuals

as lacking consolidations, with a confidence level above 0.73, demonstrating its ability to generalize to an external

dataset.
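
A correlation such as the one reported between consolidation size and video length can be computed as below. The abstract does not name the coefficient used; this sketch assumes Pearson's r, and the input lists are placeholders for per-video measurements.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)
```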

Conclusion

This study highlights the need for careful consideration of data biases during AI model training to ensure accurate AI-aided diagnosis. Despite its high accuracy, caution is advised when generalizing these results, as the identified biases

could affect the future performance of the models.

Statement of Impact

The findings underline the critical importance of addressing biases in LUS video datasets to develop reliable and

generalizable AI-based diagnostic tools. Ensuring consistent data collection and interpretation practices is essential for

the advancement of AI in medical diagnostics.

Keywords

Lung Ultrasound (LUS); Data Bias; Transformer-based Model; Lung Consolidation

Oral Presentations

Large Language Models – Session 1 | Scientific Abstract Presentations

013 - Application of a Multi-agent Open-source Large Language Model for Data Abstraction from Radiology Report

Presenter: Sanaz Vahdati, Mayo Clinic - Rochester

Sanaz Vahdati¹, Pouria Rouzrokh¹, Elham Mahmoudi¹, Bradley J. Erickson¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

Recent advancements in large language models (LLMs) have opened new frontiers in artificial intelligence. Multi-agent

systems have shown remarkable effectiveness in specialized tasks like data extraction. By emulating collaborative

cognitive processes, these systems transform problem-solving approaches, enabling more sophisticated and holistic data

analysis. Their ability to distribute complex, multifaceted tasks among specialized agents leads to enhanced decision-making capabilities and more comprehensive solutions. In this regard, we aimed to build a multi-agent language model for

radiology report data extraction to investigate their capabilities and assess their performance in deriving specific

conclusions from complex medical data.

Methods/Intervention

In this work, we collected 212 radiology reports from two different pathologies. We aimed to extract the presence or

absence of acute cervical spine fracture and liver metastasis from radiology reports of the cervical spine and

abdominopelvic CT scans collected between January and February 2022. We employed the open-source Llama3-70B-Instruct

model for inference and applied few-shot prompting for each agent. We propose a three-tier multi-agent architecture.

This system comprises two verification agents and a reconciliation expert agent, operating in a sequential manner. The initial two agents independently extract data, which is subsequently fed, along with the original report, to the reconciliation agent. This final agent synthesizes the information to produce a comprehensive conclusion. We evaluated the efficacy of this pipeline through performance metrics.
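
The sequential flow described above (two independent extraction agents followed by a reconciliation agent that also sees the original report) can be sketched as follows. This is an illustrative outline only: the prompts are invented, few-shot examples are omitted, and `fake_llm` is a stand-in so the sketch runs without a model. A real deployment would replace it with a call to the chosen LLM.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical interface: prompt text in, answer text out


def extract(llm: LLM, role: str, finding: str, report: str) -> str:
    """One extraction agent answers independently from the report alone."""
    prompt = (
        f"{role}\nDoes the report describe {finding}? Answer Yes or No.\n\n"
        f"Report:\n{report}"
    )
    return llm(prompt).strip()


def reconcile(llm: LLM, finding: str, report: str, first: str, second: str) -> str:
    """The reconciliation agent sees both answers plus the original report."""
    prompt = (
        f"Two reviewers assessed whether the report describes {finding}.\n"
        f"Reviewer 1: {first}\nReviewer 2: {second}\n"
        f"Give the final answer, Yes or No.\n\nReport:\n{report}"
    )
    return llm(prompt).strip()


def run_pipeline(llm: LLM, finding: str, report: str) -> Dict[str, str]:
    """Run the two extraction agents in turn, then the reconciliation agent."""
    first = extract(llm, "You are extraction agent 1.", finding, report)
    second = extract(llm, "You are extraction agent 2.", finding, report)
    final = reconcile(llm, finding, report, first, second)
    return {"first": first, "second": second, "final": final}


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a model call; keys on one word in the report."""
    return "Yes" if "displaced" in prompt.lower() else "No"
```

Returning the intermediate answers alongside the final one makes it possible to audit cases where the two extraction agents disagreed.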

Results/Outcome

The multi-agent model for liver metastases assessment demonstrated high performance, achieving accuracy of 0.95, F1

score of 0.95, Positive Predictive Value (PPV) of 0.96, and Negative Predictive Value (NPV) of 0.95. Our model could

exclude patients with a prior history of metastasis from new diagnosis classification. For the extraction of acute cervical

spine fracture presence, the pipeline exhibited robust performance metrics: accuracy of 0.96, F1 score of 0.92, PPV of

0.92, and NPV of 0.97. In both tasks, the reconciliation agent provided salient cues pertaining to the final determination,

facilitating further elucidation of results.
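
For reference, the four reported metrics follow directly from the confusion-matrix counts of extracted labels against the reference labels. The sketch below uses the standard definitions; the counts in the usage note are made up for illustration and are not taken from the study.

```python
def report_metrics(tp, fp, tn, fn):
    """Accuracy, F1, PPV, and NPV from true/false positive and negative counts."""
    ppv = tp / (tp + fp)                 # positive predictive value (precision)
    npv = tn / (tn + fn)                 # negative predictive value
    sensitivity = tp / (tp + fn)         # recall
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"accuracy": accuracy, "f1": f1, "ppv": ppv, "npv": npv}
```

With hypothetical counts of 40 true positives, 10 false positives, 45 true negatives, and 5 false negatives, accuracy is 0.85, PPV is 0.80, and NPV is 0.90.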

Conclusion

In conclusion, we built a multi-agent orchestration using debating scenarios to boost collective reasoning of data

abstraction from the radiology report.

Statement of Impact

Multi-agent models with the potential for collective intelligence bear considerable promise for data extraction from radiology

reports.

Keywords

Radiology report; Large language models; Multi-agent

014 - Evaluation of Llama2 and Llama3 for Automated Extraction of Ground Truth from Radiology Reports for Post-Deployment Monitoring of Pulmonary Embolism and Intracranial Hemorrhage Detection AI Models

Presenter: Theo Dapamede, Emory University

Theo Dapamede¹, Bardia Khosravi², Chad Robichaux¹, Aawez Mansuri¹, Mohammadreza Chavoshi³, Alex Belov¹, Angela Udongwo⁴, Chinonyelum Igwe⁵, Frank Li¹, Beatrice Brown-Mulry¹, Hanssen Li¹, John Moon¹, Judy Gichoya¹, Hari Trivedi¹

¹Emory University, Atlanta, GA, USA

²Yale University, New Haven, CT, USA

³Tehran University of Medical Sciences, Tehran, Iran

⁴Temple University, Philadelphia, PA, USA

⁵University of Ibadan, Ibadan, Nigeria

Introduction/Background

Clinical use of AI models requires post-deployment monitoring for performance and potential drift. However, this requires comparison of model outputs to ground-truth radiologist interpretations, which can be laborious. We evaluate the performance of 2 generations of open-source large language models (LLM) for label extraction tasks for pulmonary embolism (PE) and intracranial hemorrhage (ICH) against human annotated ground truths.

Methods/Intervention

We identified 4,668 CT PE exams and 74,394 non-contrast CT head exams from 2020-2022 and randomly sampled 250 reports for each exam type for manual annotation. PE labels were: PE, acuity, laterality, largest depth, right heart strain, and pulmonary artery hypertension. ICH labels were: ICH, acuity, laterality, subtype, midline shift, and mass effect. Reports were annotated by 6 human annotators using a browser-based interface and difficult cases were flagged for review by a senior radiologist. Multiple prompt styles were tested in preliminary analysis using Llama 2 7B. The top performing prompting style was selected and used to evaluate Llama2 (7B, 13B, and 70B) and Llama3 (8B and 70B) models.

Results/Outcome

Llama3 8B had the highest overall performance for both PE (sensitivity: 1.0; specificity: 1.0) and ICH (sensitivity: 0.93; specificity: 1.0). Across all models, performance for PE depth (accuracy range: 0.25-0.61) and ICH acuity (accuracy range: 0.63-0.74) were lowest. Llama2 performance improved with increasing parameters for most classes. However, Llama3 8B and 70B performance was similar across all categories. Llama3 8B significantly outperformed Llama2 7B for all labels, despite similar parameter sizes.

Conclusion

This study evaluated Llama2 and Llama3 models to extract labels for PE and ICH against human annotated ground truths. Llama3 8B had the highest performance with significant improvements over Llama2. Model performance for extracting binary PE and ICH labels was robust; however, no model was able to extract subgroup labels for PE or ICH to acceptable accuracy.

Statement of Impact

LLMs are a promising tool for post-deployment monitoring of AI models and can successfully extract binary ground truth from ICH and PE radiology reports for comparison to AI model predictions. If properly tuned, these models may also allow for robust subgroup evaluation to deliver further insights into model performance.

Keywords

Llama; Pulmonary Embolism; Intracranial Hemorrhage

015 - Examining Patient-Large Language Model Interactions Using the PromptWise Paradigm for Medical Education

Presenter: Satvik Tripathi, University of Pennsylvania

Satvik Tripathi¹, Rithvik Sukumaran¹, Suhani Dheer¹, Tessa S. Cook¹

¹University of Pennsylvania

Introduction/Background

With the rise of large language models (LLMs) for general-purpose use, researchers have begun studying how they might improve patient care. In earlier work, we proposed the PromptWISE (Prompt engineering for Well-structured, Interactive, and Supportive Education) paradigm to educate patients on using LLMs to understand their medical issues. PromptWISE helps patients engineer higher-quality prompts that can enhance their medical experience and reduce the burden on medical professionals.

Methods/Intervention

We applied the six-point PromptWISE guidelines to answer a set of 25 questions patients might ask LLMs about their health or medical care. Using Amazon Mechanical Turk, we conducted an IRB-approved survey (n=1074) to compare a pair of LLM-generated responses, one from a simple prompt and the other from a PromptWISE-designed prompt. GPT-4 provided all text generations. Volunteers picked the better response based on three criteria: clarity, information, and relevance. We also collected demographic information, including gender, race, age bracket, income bracket, education level, and healthcare employment status. Statistical analyses were performed to determine the generalizability and reproducibility of our results.

Results/Outcome

The demographic reporting indicated a diverse cohort of volunteers, reducing any reporting biases. In our analysis, volunteers overwhelmingly (n=837) chose responses generated from PromptWISE prompts over those generated from non-PromptWISE prompts (p < 0.0001). We also found that non-PromptWISE prompts lacked essential details, leading to inaccurate or irrelevant responses.
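
The abstract does not state which statistical test produced the reported p-value. As a plausibility check under stated assumptions (each of the 1,074 respondents makes one independent choice, and the null hypothesis is no preference between the two responses), an exact two-sided binomial test can be sketched as:

```python
from math import comb


def binomial_two_sided_p(successes, trials):
    """Exact two-sided p-value against a 50/50 null hypothesis.

    The null distribution is symmetric, so the two-sided value is twice the
    upper tail, starting from the larger of the two observed counts.
    """
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1))
    return min(1.0, 2 * tail / 2 ** trials)
```

Under these assumptions, 837 preferences out of 1,074 give a p-value far below 0.0001, which is consistent with the reported threshold.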

Conclusion

The results demonstrate the tangible impact of prompt engineering on patient-LLM interactions. Querying LLMs with prompts crafted following our guidelines yielded more comprehensive and precise responses while also refraining from giving any medical advice. Finally, volunteers demonstrated a strong preference for responses to PromptWISE prompts, further indicating the impact and need for prompt engineering when interacting with LLMs.

Statement of Impact

PromptWISE significantly improves patient-LLM interactions by generating clearer, more informative responses. This study underscores the importance of prompt optimization in enhancing patient engagement and accuracy when utilizing LLMs for medical information retrieval.

Keywords

Large Language Models; Prompt Engineering; Patient Education; Patient-Centered Care

016 - GPT-Based Automated Classification and Labeling of Surgical Renal Pathology Reports

Presenter: Satvik Tripathi, University of Pennsylvania

Satvik Tripathi¹, Rithvik Sukumaran¹, Dana Alkhulaifat¹, Charles M. Chambers¹, Darco Lalevic¹, Hanna Zafar¹, Tessa S. Cook¹

¹University of Pennsylvania

Introduction/Background

Human annotation of reports to acquire high-quality data for model training can be costly and time-consuming. Leveraging automated labeling with large language models can be a valuable and cost-effective tool to streamline annotation processes. Our aim was to assess GPT-4’s performance in labeling renal surgical pathology reports using various prompting-based techniques.

Methods/Intervention

Renal surgical pathology reports from three health systems (n=40) within the same state were labeled by two radiologists with 10 and 14 years of experience as “malignant,” “indeterminate,” “benign,” or “ignore.” “Ignore” was used for reports of any pathology not specifically from a renal mass. The reports were distributed equally among the four classes. Prompt engineering for GPT-4 was utilized with zero-, one-, and few-shot learning techniques to classify the reports. The main performance evaluation metric was accuracy.
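
Zero-, one-, and few-shot prompting differ only in how many labeled examples precede the report to be classified. The sketch below shows one way such prompts might be assembled; the wording is hypothetical and is not the prompt used in the study.

```python
LABELS = ("malignant", "indeterminate", "benign", "ignore")


def build_prompt(report, examples=()):
    """Assemble a classification prompt with zero or more labeled examples.

    examples is a sequence of (report_text, label) pairs: empty for zero-shot,
    one pair for one-shot, several pairs for few-shot.
    """
    lines = [
        "Classify the surgical pathology report as one of: " + ", ".join(LABELS) + ".",
        "Use 'ignore' when the report does not concern a renal mass.",
        "",
    ]
    for text, label in examples:
        lines += [f"Report: {text}", f"Label: {label}", ""]
    lines += [f"Report: {report}", "Label:"]
    return "\n".join(lines)
```

Ending the prompt with an open "Label:" field encourages the model to answer with a single class name, which simplifies parsing.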

Results/Outcome

GPT-4 achieved 70%, 77.5%, and 92.5% accuracy with zero-, one- and few-shot learning, respectively. Misclassifications were most frequent in the “Indeterminate” class (n = 4) for zero-shot prompting and in the “Ignore” class for one- and few-shot prompting (n = 5 and 2, respectively). GPT-4 outperformed our existing Deep Learning-based methods.

Conclusion

GPT-4 holds the potential to classify renal surgical pathology reports with significant accuracy, even without extensive training data. The few-shot prompting technique achieved the highest accuracy, demonstrating the model's ability to adapt and learn from minimal examples. This capability could streamline the annotation process, reduce the burden on radiologists, and enable faster data processing. Furthermore, the model's performance in handling varied classes of pathology reports underscores its versatility and potential for broader applications in medical report classification.

Statement of Impact

Automatic labeling of reports can enable prompt identification of important clinical findings, leading to timely intervention and improved treatment outcomes.

Keywords

Large Language Models; Automated Labeling; Pathology; Prompt Engineering

017 - Revolutionizing Radiological Research: LLMs for Rapid, Accurate Data Extraction from Clinical Reports

Presenter: Ali Ganjizadeh, Mayo Clinic - Rochester

Ali Ganjizadeh¹, Bardia Khosravi¹, David A. Woodrum¹, Bradley J. Erickson¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

To evaluate the efficacy of Large Language Models (LLMs) in extracting clinically relevant information from MR-guided intervention reports, enabling efficient database creation and comprehensive retrospective analysis, and to introduce RadPrompter, a novel tool enhancing this extraction process.

Methods/Intervention

We employed the Meta-Llama-3-70B-Instruct model to process 2,016 MR-guided intervention reports. The model was tasked with extracting key clinical features, including organ, anatomical location, ablation type, assisted modality, needle specifications, lesion type, and treatment cycles. To optimize extraction, we developed a new tool, RadPrompter, which interfaces with the LLM engine and enhances its information retrieval capabilities. The system utilized 2 Nvidia A100 GPUs and 160 GB of RAM, processing two reports simultaneously with the model's temperature set to 0.0 to minimize hallucination.

Results/Outcome

Leveraging our custom tool, the LLM successfully processed all 2,016 reports in 6 hours and 27 minutes, averaging 15 seconds per report. We manually inspected 200 reports, in which the model achieved 100% accuracy in extracting the specified clinical data points, demonstrating high reliability in information retrieval from complex medical narratives.

Conclusion

This study showcases the powerful potential of LLMs, augmented by specialized tools like RadPrompter, in revolutionizing radiological research. By rapidly and accurately extracting structured data from unstructured reports, this approach can significantly enhance the efficiency and scope of retrospective analyses. It enables researchers to process large volumes of historical data with unprecedented speed and accuracy.

Statement of Impact

The application of LLMs, coupled with custom extraction tools, in radiology report analysis, represents a paradigm shift in medical research methodology. It offers a scalable solution to the challenge of mining valuable insights from vast repositories of unstructured clinical data. This technology has the potential to accelerate research timelines, uncover novel patterns in patient care, and ultimately contribute to the advancement of personalized medicine in interventional radiology.

Keywords

Large Language Models; Artificial Intelligence; Interventional Radiology

018 - The Effect of Prompt Elements on Labelling Incidental Breast Findings by Llama3-8B in Radiology Reports

Presenter: Benjamin E. Rush, University of Wisconsin-Madison

John Garrett¹, Thanh Nguyen¹, Benjamin E. Rush¹, Ryan W. Woods¹

¹University of Wisconsin-Madison, Madison, WI, USA

Introduction/Background

Breast cancer is a leading cause of mortality among women, and early detection can improve survival probability. Incidental breast abnormalities are identified in approximately 7% of chest CT scans, of which about 28% are malignant. However, the radiology reports from CT scans are often lengthy, unstandardized text where incidental findings might be overlooked by physicians. We propose using large language models (LLMs) to label CT radiology reports for incidental breast findings, which could flag for additional diagnostic imaging.

Methods/Intervention

We selected 17752 routine chest CTs from female patients ages 40-72 obtained at UW-Health between 2015 and 2017. We subselected 3226 exams with “breast” in the radiology report and randomly sampled 500 exams for evaluation. We compared the performance of Llama3-8B incidental breast findings labelling with varying prompts to a human reader. The LLM was tasked with labeling “Yes” or “No” for incidental breast findings with the role of a radiologist or annotator. Prompt elements included the task and radiology report, and varying combinations of background, keywords, and examples. Each prompt was run 30 times to evaluate consistency. We conducted sensitivity, specificity, and Fleiss’ Kappa consistency analyses to compare the human reader and LLM.
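
Fleiss' kappa measures agreement among a fixed number of raters beyond what chance would produce. In this setting each of the 30 runs of a prompt can be treated as a rater. The sketch below implements the standard formula; the helper `tally` and the input layout are assumptions made for illustration.

```python
def tally(labels_per_run, categories=("Yes", "No")):
    """Convert labels_per_run[run][report] into per-report counts per category."""
    n_reports = len(labels_per_run[0])
    return [
        [sum(run[i] == c for run in labels_per_run) for c in categories]
        for i in range(n_reports)
    ]


def fleiss_kappa(counts):
    """Fleiss' kappa, where counts[i][j] is the number of runs labeling report i as category j.

    Every report must be labeled by the same number of runs. The value is
    undefined (division by zero) if all runs use one category for all reports.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each report needs the same number of labels")
    # Observed agreement: share of agreeing rater pairs, averaged over reports.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall proportion of each category.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means repeated runs of the same prompt almost always return the same label for a given report, independent of whether that label is correct.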

Results/Outcome

The human reader identified 125 (25.0%) reports as having incidental breast findings. The LLM’s average positively labelled cases ranged from 236.1 (47.2%) to 412 (82.4%) of reports. Sensitivity ranged from 0.76 to 0.99, though the highest average positive predictive value was 0.50. Specificity ranged from 0.23 to 0.71, with the lowest negative predictive value at 0.86. While sensitivity generally decreased with more prompt elements, specificity increased with more detailed prompts. Fleiss’ Kappa indicated high agreement among prompt iterations, with κ=0.94.

Conclusion

The LLM and prompts labelled many false positives but had high negative predictive values with high consistency across all prompts. Future work will evaluate the effect of model parameter size on metric performance.

Statement of Impact

LLMs remain a possible flagging and prevention system for missed details; however, larger models or fine-tuning might be required to match human performance.

Keywords

Large Language Models; Incidental Findings; Breast Health; Radiology Reports

Oral Presentations

Image Segmentation | Scientific Abstract Presentations

019 - Automated Pancreatic Perivascular Adipose Tissue Detection on Abdominal CT as a Biomarker for Type 2 Diabetes

Presenter: Anisa V. Prasad, National Institutes of Health

Anisa V. Prasad¹, Tejas S. Mathai¹, Pritam Mukherjee¹, Abhinav Suri¹, Jianfei Liu¹, Ronald M. Summers¹

¹National Institutes of Health, Bethesda, MD, USA

Introduction/Background

Early diagnosis of diabetes mellitus is critical for preventing disease and improving health outcomes. Intrapancreatic fat deposition has previously been established as a biomarker for diabetes, but little is known about the role of pancreatic perivascular adipose tissue (PVAT), adipose tissue surrounding blood vessels supplying the pancreas. We developed a deep learning framework to quantify pancreatic PVAT on abdominal CT scans and applied it to identify CT biomarkers for type 2 diabetes.

Methods/Intervention

1350 contrast-enhanced CT (CECT) scans with ground truth labels from the public PANORAMA dataset were used to train a 3D nnUNet model to segment pancreatic anatomy (parenchyma, vasculature, ducts, pancreatic ductal adenocarcinoma lesions). It was then applied to an internal dataset containing 606 CECT scans with corresponding diabetes outcomes. Pancreatic adipose tissue (AT) and PVAT were derived from the predicted segmentations and used to measure several biomarkers, such as volume and attenuation. These biomarkers were then correlated to diabetes status using univariate and multivariate logistic regression. Metrics such as AUC were assessed to determine the best set of predictors for diabetes outcomes.
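
The AUC used to compare predictor sets can be computed without any curve fitting: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below uses that pairwise definition; `scores` would be, for example, predicted probabilities from a fitted logistic regression, which is not reproduced here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half).

    labels uses 1 for cases with the outcome (e.g. diabetes) and 0 otherwise.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

The pairwise form costs time proportional to the product of the group sizes, which is acceptable for a cohort of a few hundred scans.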

Results/Outcome

Four pancreatic PVAT biomarkers were measured: volume, mean attenuation, standard deviation (SD) attenuation, and fat fraction. Significant differences (p < 0.001) were found across diabetic and non-diabetic patients for all four biomarkers, with mean attenuation demonstrating a decrease in diabetic patients while the other three metrics were increased. Similar findings were observed for the corresponding pancreatic AT measurements. Among all combinations of the eight biomarkers measured, the best set of predictors for diabetes was (1) pancreatic AT mean attenuation, (2) pancreatic PVAT mean attenuation, and (3) pancreatic PVAT fat fraction, achieving a maximum AUC of 0.88 with sensitivity 0.90 and specificity 0.71.

Conclusion

We present a framework to automatically identify pancreatic PVAT. Our analysis suggests that metrics derived from these segmentations, such as pancreatic PVAT mean attenuation and fat fraction, are viable biomarkers for type 2 diabetes.

Statement of Impact

We provide an automated method for quantifying pancreatic PVAT that can be implemented to elucidate its role in disease progression. The biomarkers identified using this tool underscore the potential for opportunistic screening of diabetes mellitus using abdominal CT scans.

Keywords

Diabetes Mellitus; CT; Pancreas; Intrapancreatic Fat Deposition

020 - Deep Learning for Incidental Parotid Tumors on CT: Optimal Methods for Screening and Segmentation

Presenter: Wei Shao, University of California Irvine

Wei Shao¹, Shirin Salehi¹, Chanon Chantaduly¹, Hayden Troutt¹, Peter Chang¹

¹University of California Irvine, Irvine, CA, USA

Introduction/Background

Parotid gland tumors (PGT) are the most common salivary gland tumors. With increasing imaging utilization, most PGTs are detected incidentally on CT; however, many are overlooked by radiologists prioritizing acute pathology. This study presents a deep learning (DL) solution for opportunistic PGT detection on CT with a focus on optimizing complementary objectives for tumor screening and segmentation.

Methods/Intervention

A retrospective cohort of 11,449 consecutive non-contrast head CT exams was aggregated from two academic centers. PGTs, defined as a parotid mass >10 mm, were identified from radiology or histopathology reports and annotated with a mask by an expert neuroradiologist. In total, 219 PGTs were identified (N=112 hospital A, N=107 hospital B). A multistage DL pipeline was developed for PGT detection. First, an initial model localizes each parotid gland. Subsequently, a single 3D U-Net simultaneously implements the segmentation (per-voxel spatial overlap) and screening (per-exam tumor detection) tasks. To convert segmentation outputs into binary screening results, thresholds for positive voxel predictions were calibrated for optimal accuracy. Given complementary objectives of the segmentation and screening tasks, various loss functions (binary cross-entropy, focal loss, soft Dice) and training cohorts (full cohort, positive only) were evaluated. Performance was assessed using five-fold cross-validation.
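
The two evaluation targets described above can be illustrated with a short sketch: the Dice score compares predicted and reference masks voxel by voxel, while the screening decision collapses a voxel-level output into one per-exam result. Masks are flattened lists of 0/1 values here, and both thresholds are hypothetical defaults rather than the calibrated values from the study.

```python
def dice_score(predicted, reference):
    """Dice overlap between two binary masks given as flat lists of 0/1."""
    intersection = sum(p and r for p, r in zip(predicted, reference))
    total = sum(predicted) + sum(reference)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total


def screen_exam(voxel_probs, prob_threshold=0.5, min_positive_voxels=50):
    """Call the exam positive if enough voxels exceed the probability threshold."""
    positive_voxels = sum(p >= prob_threshold for p in voxel_probs)
    return positive_voxels >= min_positive_voxels
```

Raising `min_positive_voxels` trades sensitivity for specificity, which is why the voxel-count threshold is calibrated separately from the segmentation objective.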

Results/Outcome

Overall, the best screening model achieved a per-exam specificity, sensitivity, PPV, NPV and accuracy of 0.947, 0.719, 0.858, 0.878, 0.872, while the best segmentation model achieved a Dice score of 0.71. Of the positive predictions, six tumors were missed by the original interpreting physician. In general, cross-entropy (CE) outperformed focal loss (FL) for segmentation, while FL outperformed CE for screening due to improved specificity and lower false positives. Soft Dice (SD) tended to improve both tasks. The use of negative training examples significantly decreased tumor Dice score while reducing false positives for the screening task.

Conclusion

By combining a first-pass screening model with a subsequent focused segmentation model, a unified DL framework can identify and delineate PGTs on routine CT with high accuracy.

Statement of Impact

A DL model can identify incidental PGTs on routine CT imaging with high accuracy including tumors missed in a realistic clinical workflow.

Keywords

Deep Learning; Parotid Tumor; Screening; Segmentation

021 - Global Local Attention for Prostate Zonal Segmentation

Presenter: Chetana Krishnan, University of Alabama at Birmingham

Chetana Krishnan¹, Ezinwanne Onuoha¹, Alex Hung², Kyung H. Sung², Harrison Kim¹

¹University of Alabama at Birmingham, Birmingham, AL, USA

²University of California Los Angeles, Los Angeles, CA, USA

Introduction/Background

We focus on representation learning for large-scale image segmentation. Besides backbones, training pipelines, and loss functions, key methods have explored various spatial pooling and attention mechanisms essential for creating robust global image representations. Attention mechanisms differ based on feature tensor interactions (local vs. global) and the dimensions they target (spatial vs. channel). However, most studies examine only one or two forms of attention. Focusing on global and local descriptors, we can provide empirical evidence of the interaction of all forms of attention and improve the state of the art on standard benchmarks.

Methods/Intervention

The proposed GLCSA network uses multi-stream processing to capture comprehensive contextual information from images. The local attention stream (LAS) focuses on detailed information at individual spatial locations and specific feature channels, highlighting fine-grained patterns and textures. The global attention stream (GAS) models interactions across the entire spatial dimension and among feature channels, ensuring broader relationships are captured. The LAS uses fine-scale convolutions to discern intricate details, while the GAS leverages self-attention to integrate long-range dependencies. The information from these streams is embedded into feature maps, which are then fused into a unified feature map. Finally, a pooling operation distills the combined information into a compact representation for robust image analysis. We trained GLCSA with 34 prostate scans (over 20 slices per scan), and the model was tested with ten unseen scans. Performance was evaluated using the Dice similarity coefficient. Several networks were compared with and without GLCSA.

Results/Outcome

GLCSA achieved a higher DSC and minimum DSC, indicating superior performance.

Conclusion

GLCSA dynamically captures global–local spatial and channel information to address the challenge of prostate segmentations and the limitations of 2D networks.

Statement of Impact

GLCSA can improve the diagnosis, treatment planning, and monitoring of prostate pathology and volume for clinical diseases.

Keywords

Attention; Segmentation; Spatial; Channel

022 - Liver Surface Nodularity for Staging Hepatic Fibrosis on CT: A Comparative Study of Liver Segmenters

Presenter: Tejas S. Mathai, National Institutes of Health

Tejas S. Mathai¹, Meghan G. Lubner², Perry J. Pickhardt², Ronald M. Summers¹

¹National Institutes of Health, Bethesda, MD, USA

²University of Wisconsin-Madison, Madison, WI, USA

Introduction/Background

Cirrhosis is the 12th leading cause of death in the US, and liver fibrosis can be caused by metabolic disorders, alcoholism, and Hepatitis B/C virus. Earlier stages (METAVIR F0 – F2) are reversible with therapy, but later stages (F3 - advanced fibrosis and F4 - cirrhosis) are irreversible. Liver Surface Nodularity (LSN) score, a non-invasive CT-based biomarker that measures the left hepatic lobe surface smoothness, can distinguish later fibrosis stages. However, it depends on a precise liver segmentation.

Methods/Intervention

480 patients underwent CT imaging at Institution-A with fibrosis (METAVIR) staged using biopsy. An internal deep learning tool (INT) segmented the full liver and 8 Couinaud segments. The public TotalSegmentator (TS) tool also segmented the liver. The extents of Couinaud segments 2 and 3 were found. A fully automated image analysis technique detected the liver surface in each 2D slice, and a smooth spline (4th order) was fit to it. The LSN score was the mean distance between the detected surface and fit spline, and higher scores indicated worsening fibrosis. Youden indices were used to find the optimal LSN cutoffs for each fibrosis stage. ROC curves by INT for advanced fibrosis (F3-4 vs. F0-F2) and cirrhosis (F4 vs. F0-3) were compared against TS. An AUC below 0.6 was considered clinically ineffective.
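
Choosing a cutoff by the Youden index means picking the score threshold that maximizes J = sensitivity + specificity - 1. A minimal sketch follows, assuming that higher LSN scores indicate disease and that a score at or above the cutoff counts as positive; the function name is hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J over the observed score values.

    labels uses 1 for the target stage (e.g. F3-F4) and 0 otherwise.
    """
    n_positive = sum(labels)
    n_negative = len(labels) - n_positive
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        true_pos = sum(s >= cutoff and l == 1 for s, l in zip(scores, labels))
        true_neg = sum(s < cutoff and l == 0 for s, l in zip(scores, labels))
        j = true_pos / n_positive + true_neg / n_negative - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j
```

Because the cutoff is tuned separately for each fibrosis stage and each segmentation tool, the specificities of the two tools can differ even when their AUCs are similar.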

Results/Outcome

AUCs were similar between INT and TS for prediction of cirrhosis (F4, 87.8% vs. 88.7%, p = .143) and advanced fibrosis (≥ F3, 82.5% vs. 83.9%, p = .381). A statistical bootstrap test revealed no differences between the two tools for all three clinically important stages. But the specificity was higher with INT for advanced fibrosis (79.5% vs. 65.1%) and significant fibrosis (73.3% vs. 49.5%), while being comparable for cirrhosis (79.5% vs. 78.4%). Both INT and TS had good agreement (R^2 of 0.8) of computed LSN scores.

Conclusion

Both the internal tool and TotalSegmentator attained comparable performance for fibrosis staging. However, TotalSegmentator did not achieve high specificity.

Statement of Impact

Both INT and TS tools can predict the fibrosis stage in ~45 seconds compared to the ~2 minutes needed to manually measure the LSN in a CT volume. They show promise for population-based studies.

Keywords

CT; Liver; Liver Fibrosis; Cirrhosis

023 - Opportunistic Detection of Splenomegaly Using Automated AI-Based Measurements and Reporting of Organ Volumes in the Clinical Workflow

Presenter: David Y. Zhang, University of Pennsylvania - Penn Medicine

David Y. Zhang¹, Ari Borthakur¹, Jeffrey Duda¹, Neil Chatterjee², Rohan Valia¹, James C. Gee¹, Charles E. Kahn Jr.¹, Daniel J. Rader¹, Hersh Sagreiya¹, Walter Witschey¹

¹University of Pennsylvania - Penn Medicine, Philadelphia, PA, USA

²Northwestern University, Evanston, IL, USA

Introduction/Background

Splenomegaly, an enlargement of splenic size and weight with a prevalence of 2% in the US, reflects a disruption of the organ’s complex role in immunological defense and hematopoiesis (1,2). Due to a broad array of underlying conditions such as hyperplasia, passive congestion, and infiltrative disease, it is often differentially diagnosed by CT imaging (3). Due to its widespread availability, opportunistic screening using CT captures information about clinical conditions when people are imaged for other reasons (4) and can augment the radiology workflow with detailed quantitative imaging traits (radiomics) that are cumbersome to obtain in traditional workflows that do not support computational imaging (5,6). Opportunistic screening for splenomegaly was performed in a disease-agnostic medical population (7); however, associations with other diseases were missing. To address these issues, we built and deployed an end-to-end opportunistic screening workflow using AI-based automated image analysis embedded in the radiology clinical workflow (8). We validate the system by measuring spleen volumes and demonstrating associations with systemic multi-organ diseases using a phenome-wide association study in patients that underwent CT at our institute in the last year.

Methods/Intervention

In an IRB-approved study, spleen volumes were estimated for 13,636 individuals from CTs using TotalSegmentator (9). Splenomegaly assignments were determined if a patient had an ICD diagnosis for the condition (ICD-10 = R16.1 or R16.2, ICD-9 = 789.2) prior to when the CT scan was performed. A phenome-wide association study was performed against phecodes adjusting for sex, age, age^2, principal components 1-10, and BMI.
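As a minimal sketch of how an organ volume can be derived from a segmentation mask, the snippet below counts labelled voxels and multiplies by the voxel volume. The nested-list mask layout, the function name `mask_volume_ml`, and the toy spacing are illustrative assumptions, not details of the authors' pipeline or of TotalSegmentator's output format.

```python
def mask_volume_ml(mask, spacing_mm):
    """Volume of a binary 3D mask in millilitres.

    mask: nested lists indexed [slice][row][col] holding 0/1 labels
          (hypothetical layout chosen for this sketch).
    spacing_mm: (dz, dy, dx) voxel spacing in millimetres.
    """
    voxel_count = sum(v for sl in mask for row in sl for v in row)
    dz, dy, dx = spacing_mm
    voxel_volume_mm3 = dz * dy * dx
    return voxel_count * voxel_volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Toy example: a 2 x 2 x 2 block of labelled voxels at 5 x 1 x 1 mm spacing.
toy_mask = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
print(mask_volume_ml(toy_mask, (5.0, 1.0, 1.0)))  # 8 voxels * 5 mm^3 = 0.04 mL
```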

Results/Outcome

AI-based spleen volume measurements were 484.3 ± 206.2 mL (mean ± sd) and 202.5 ± 114.9 mL for splenomegaly vs. other patients, respectively. Increased spleen volumes were strongly associated with over 50 clinical entities including digestive system diseases and multi-organ diseases including infections, endocrine disease, and kidney disease.

Conclusion

Opportunistic screening for splenomegaly using an AI-based workflow was validated with physician-determined enlarged spleen in a clinical population with systemic multi-organ diseases. PheWAS results serve as a nexus for future discovery and support the use of genetic analysis (GWAS).

Statement of Impact

AI-based measurements of spleen volume from CT images can be used to opportunistically screen patients for splenomegaly.

Keywords

Splenomegaly; Segmentation; Phenome-wide association study

024 - Validation of UniverSeg for Interventional Abdominal Angiographic Segmentation

Presenter: Michael J. Kovalchick, Wayne State University

Michael J. Kovalchick¹, Chad Klochko², Kundan Thind²

¹Wayne State University, Detroit, MI, USA

²Henry Ford Health, Detroit, MI, USA

Introduction/Background

Automatic segmentation of angiographic structures can aid in assessment of presence and extent of vascular disease. Recent deep learning segmentation models promise automated processing but lack validation on interventional angiographic data. This study performs a validation test on the UniverSeg model to examine suitability for future use.

Methods/Intervention

After IRB approval, a retrospective review identified 234 patients who underwent interventional fluoroscopy of the celiac axis with iodinated contrast via intravenous catheter injection between January 1st, 2019, and December 31st, 2022. From 261 fluoroscopic acquisitions, 303 images were selected with maximum contrast from the contrast agent. From each image a partition of 128x128 pixels was selected to encompass arterial details, and a corresponding binary mask was subsequently generated by convex hull calculation. The resulting image-mask pairs were distributed into three classes of 101 pairs each. Classes were defined by decreasing arterial diameter and the number of bifurcations of the vessel. UniverSeg was applied to each class independently in a 5-fold nested cross comprehensive validation test. An analysis of model performance for in-context learning was performed for each class to determine the minimum size for average model convergence. For each class size, ranging from 1 to 81 pairs, five sample images were tested against the class with twenty repetitions iterated across the class.

Results/Outcome

Dice-Similarity-Coefficients comparing UniverSeg output to generated masks across the three classes with decreasing arterial diameters were 78.7%, 72.5%, and 59.9% (σ=5.96, 7.99, 14.29). Balanced-Average-Hausdorff-Distances, representing the maximum separation distance between prediction surface and ground truth, were 0.86, 0.71, 1.16 (σ=0.37, 0.52, 0.68) pixels respectively. Inverted mask testing revealed degradation of performance in line with published UniverSeg expectations. Class size testing in all cases showed non-linear improvement with plateauing performance with increased image sets used for in-context learning. All test images converged in performance to ±1.34 Dice-Score of the full class size value by N=51.
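For readers unfamiliar with the metric, the Dice similarity coefficient is twice the overlap between two masks divided by the sum of their sizes. The sketch below uses flattened 0/1 lists; the function name and the toy masks are illustrative only and do not reproduce the study's evaluation code.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks (flat 0/1 lists)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
print(dice_coefficient(pred, truth))  # 2 * 2 / (3 + 3)
```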

Conclusion

UniverSeg performed well for angiographic segmentation, with improved performance with greater class size, increased vessel diameter, and reduced bifurcations.

Statement of Impact

This study validates a potential method for arterial segmentation in interventional fluoroscopic procedures and facilitates development of vascular disease models and imaging research applications.

Keywords

Segmentation; Interventional; Angiography; Deep-Learning

Oral Presentations

Large Language Models – Session 2 | Scientific Abstract Presentations

025 - Benchmarking Quantization: A Comprehensive Comparison of Open-Source Large Language Models

Presenter: Blake T. Passe, Mayo Clinic - Rochester

Blake T. Passe¹, Sanaz Vahdati¹, Bradley J. Erickson¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

Large Language Models (LLMs) have seen a boom within the Artificial Intelligence community recently. As the parameter size of models has grown from millions to trillions, computational requirements present substantial deployment challenges. A potential approach to resolve these pressing difficulties is quantizing open-source LLMs - reducing the precision of model parameters - while aiming to preserve performance. In the current work, our aim is to compare how quantization of open-source LLMs impacts information extraction from radiology reports, latency, and computational demands.

Methods/Intervention

622 radiology reports were obtained in five categories: cervical spine fractures, glioma progression, liver metastases, pneumonia, and pulmonary embolism. Each glioma progression report was labeled “Improved”, “Progression”, “Stable”, “Pseudoprogression”, or “Pseudoresponse”, while the four remaining categories were labeled a binary “Yes” or “No”. Different ‘instruct’ versions of Llama3 and Phi-3 models were applied using Ollama and allotted one NVIDIA A100 80 GB GPU for inference. Prompting was conducted by a radiology artificial intelligence expert to describe the model’s task and criteria succinctly. The prompting structure consisted of four sequential steps: identity establishment, cognitive framework setup, report presentation, and contextual clarification. Finally, the JSON output, RAM usage, and latency for the extraction process were recorded for each model at each quantization level.

Results/Outcome

Findings indicate model size displays a positive correlation with both RAM and latency during inference. Comparable accuracy (>92%) was observed between the 8, 5, and 4-bit quantized versions of Llama3:70b, Llama3:8b, and Phi3:14b. This is intriguing due to the large gap between model sizes. The extreme 2-bit quantization demonstrated a prominent confabulation of answers. Response divergence (i.e. responses not within the defined structure, such as “Maybe” rather than the required “Yes” or “No”) is displayed before the performance degradation of the models. This suggests that LLMs may lose output obedience before specific performance metrics decline.
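Response divergence of the kind described above can be tracked separately from accuracy by checking each output against the allowed label set. The sketch below is one hedged way to do so; the JSON field name `answer`, the helper names, and the three status strings are assumptions made for illustration and are not taken from the study.

```python
import json

ALLOWED_BINARY = {"Yes", "No"}


def classify_response(raw_output, allowed=ALLOWED_BINARY, key="answer"):
    """Label one model output as 'valid', 'divergent', or 'unparseable'.

    `key` is a hypothetical JSON field name used only in this sketch.
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return "unparseable"
    if not isinstance(payload, dict):
        return "divergent"
    value = payload.get(key)
    return "valid" if isinstance(value, str) and value in allowed else "divergent"


def divergence_rate(raw_outputs, allowed=ALLOWED_BINARY, key="answer"):
    """Fraction of outputs that do not follow the required structure."""
    if not raw_outputs:
        return 0.0
    bad = sum(1 for r in raw_outputs if classify_response(r, allowed, key) != "valid")
    return bad / len(raw_outputs)


outputs = ['{"answer": "Yes"}', '{"answer": "Maybe"}', "not json", '{"answer": "No"}']
print(divergence_rate(outputs))  # 2 of 4 outputs break the required format
```

Reporting this rate alongside accuracy at each quantization level would make the "obedience before performance" pattern visible as two separate curves.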

Conclusion

Our findings indicate that models can perform quite well even with substantial quantization for question-answering tasks applied to radiology reports.

Statement of Impact

Navigating the tradeoff between quantization and quality is largely unstudied in medicine, but it indicates significant potential to reduce computation load.

Keywords

Artificial Intelligence; Large Language Model; Quantization; Radiology Report Data Extraction

026 - ConTEXTual Net 3D: Visual Grounding in PET/CT for Enhanced Interactive Reporting

Presenter: Zachary Huemann, University of Wisconsin-Madison

Zachary Huemann¹, Samuel Church¹, Joshua D. Warner¹, Daniel Tran¹, Xin Tie¹, Junjie Hu¹, Steve Y. Cho¹, Meghan G. Lubner¹, Tyler J. Bradshaw¹

¹University of Wisconsin-Madison, Madison, WI, USA

Introduction/Background

Visual grounding algorithms, which link text descriptions to specific image regions, have many potential applications in radiology. However, these algorithms require large training datasets of annotated image-text pairs, which currently do not exist for most imaging modalities. We developed a pipeline to extract reported descriptions of salient PET/CT findings and to automatically segment the corresponding image findings. We then applied this pipeline to generate a large, annotated dataset for training a 3D vision-language visual grounding model, enabling interactive PET/CT reports.

Methods/Intervention

Our multi-step pipeline operates on PET/CT images and corresponding radiology reports, uses a series of large language models (LLMs) to extract text descriptions of PET findings, and then automatically segments the findings in the image based on the reported slice number and maximum standardized uptake value (SUVmax). Starting with 25,000 PET/CT exams retrospectively collected from 2010 to 2023, the final training/validation/test set consisted of 11,356 sentence-label pairs from 5,094 PET/CT exams. This dataset was used to train a novel 3D vision-language model adapted from ConTEXTual Net, which uses the sentence description, encodes it through an LLM, and fuses the text encodings with a 3D segmentation nn-UNet via cross-attention. The model was then evaluated on a holdout test set of 256 cases reviewed by a board-certified radiologist.

Results/Outcome

The automatic labeling pipeline’s accuracy was 98% (251/256). ConTEXTual Net 3D achieved an F1-score of 0.78 on the holdout test set, with a precision of 0.75 and a recall of 0.81. The model performed similarly on 18F-fluorodeoxyglucose (FDG) PET/CT exams (F1=0.78) and on non-FDG PET/CT exams (F1=0.76).
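The F1-score is the harmonic mean of precision and recall. Assuming the two reported component values, 0.75 and 0.81, play the roles of precision and recall, the short check below reproduces the reported F1 of 0.78; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.75, 0.81), 2))  # 0.78
```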

Conclusion

The proposed labeling pipeline demonstrated high accuracy in creating large, annotated datasets of image-text pairs for PET/CT, allowing for the development of 3D visual grounding models.

Statement of Impact

Our method can be used to generate the necessary image-text training data to train a visual grounding model to segment key lesions in PET/CT. It opens the door to interactive reports that improve patient and provider comprehension and may allow for retrospective quantitative PET studies.

Keywords

Multimodal; Vision-Language Models; PET/CT; Large Language Models

027 - Does Size Really Matter? Comparing llama 3 vs 3.1

Presenter: Suyash Khubchandani, CARPL.ai, Inc.

Vasanth Venugopal¹, Amit Kumar¹

¹CARPL.ai, Inc., Cupertino, CA, USA

Introduction/Background

Prompt engineering in radiological reports involves crafting inputs to guide AI models in generating accurate and useful outputs. This technique helps automate downstream tasks on radiological reports. However, its effectiveness is limited by context length constraints, which can restrict the amount of information the model can process and integrate simultaneously. Commercial LLMs like ChatGPT are bounded by a maximum context window of 16k tokens, where approximately 1.5 tokens are used per word. For long radiological reports such as MRI reports, few-shot learning becomes difficult when the context window is limited. In this study, we compared the open-source Llama 3.1 model, with a context size of 128k tokens, against Llama 3.

Methods/Intervention

We collected 1000 radiology reports annotated by expert radiologists, classifying each report into findings: acute infarct, intra-axial tumor, intra-axial hemorrhage, extra-axial tumor, and extra-axial hemorrhage. These reports were evaluated using the Llama 3 and Llama 3.1 models under zero-shot, one-shot, and few-shot learning settings. For zero-shot learning, the models received no prior examples. In one-shot learning, each model received one example per finding, and for few-shot learning, multiple examples were provided. The models' performance was compared to determine the impact of different learning strategies on their accuracy in identifying radiological findings.
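To illustrate how context length limits few-shot prompting, the sketch below adds labelled examples to a prompt until an estimated token budget is exhausted. The ratio of 1.5 tokens per word comes from the abstract; the prompt template, function names, and example reports are hypothetical and not the authors' prompts.

```python
import math


def estimate_tokens(text, tokens_per_word=1.5):
    """Rough token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) * tokens_per_word)


def build_few_shot_prompt(instruction, examples, report, context_limit):
    """Append (report, label) examples while the estimated budget allows.

    Returns the prompt text and the number of examples that fit.
    """
    query = f"Report: {report}\nFindings:"
    used = estimate_tokens(instruction) + estimate_tokens(query)
    parts = [instruction]
    n_used = 0
    for ex_report, ex_label in examples:
        block = f"Report: {ex_report}\nFindings: {ex_label}"
        cost = estimate_tokens(block)
        if used + cost > context_limit:
            break  # the next example would overflow the context window
        parts.append(block)
        used += cost
        n_used += 1
    parts.append(query)
    return "\n\n".join(parts), n_used


examples = [
    ("Acute infarct in left MCA territory.", "acute infarct"),
    ("Extra-axial hemorrhage along the right convexity.", "extra-axial hemorrhage"),
]
_, n_small = build_few_shot_prompt("Classify the report.", examples, "No acute infarct.", 30)
_, n_large = build_few_shot_prompt("Classify the report.", examples, "No acute infarct.", 128_000)
print(n_small, n_large)  # fewer examples fit in the smaller window
```

With zero examples this reduces to the zero-shot setting, and with one example per finding to the one-shot setting described above.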

Results/Outcome

In our study, we observed a significant improvement in accuracy for the Llama 3.1 model. For Llama 3, accuracy improved from 66-97% with zero-shot learning to 95-98% with few-shot learning, as further prompt engineering was constrained by the limited context size. However, by incorporating additional examples into the prompt, Llama 3.1 demonstrated an accuracy of 99.5%.

Conclusion

This enhancement underscores the model's capability to learn effectively from expanded datasets, highlighting the importance of large context windows in achieving superior predictive accuracy for larger reports.

Statement of Impact

Larger context LLMs are better suited for analyzing radiological MRI reports as they can handle more detailed and comprehensive patient data, ensuring accurate diagnosis and interpretation. Their ability to process extensive contextual information enhances downstream tasks, providing more reliable and insightful outcomes compared to smaller context-size LLMs.

Keywords

Large Language Models; Natural Language Processing; Radiological Reports; Classification

028 - Large Language Models Create Useful, Accurate, Clear Summaries of Virtual Radiology Workgroup Meetings

Presenter: Benjamin Mervak, Michigan Medicine

Benjamin Mervak¹, Muhammad Bhalli¹, Tricia Niedbala¹, Kenneth Buckwalter¹

¹Michigan Medicine, Ann Arbor, MI, USA

Introduction/Background

Remote meetings have dramatically increased in recent years. While convenient, these can disrupt team-based processes unless there is effective documentation of the discussion and follow-up tasks. Software solutions can generate a meeting transcript, although reviewing the entire transcript can be inefficient and tedious. A summary can be synthesized by a scribe, but this process is time-consuming, costly, susceptible to error, and may be delayed. Large Language Models (LLMs) are well suited for natural language processing and summarization tasks. This study investigates the performance of an LLM in extracting data from collaborative Radiology information technology (IT) workgroup meeting transcripts and generating a summary with a concise list of topics discussed, key points, and action items for each participant.

Methods/Intervention

After obtaining IRB exemption, this study was conducted at an academic medical institution. Over two months in 2024, virtual Radiology IT group meetings were recorded and transcribed using teleconferencing software. Transcripts were ingested by an instance of an LLM (GPT-4 Turbo, OpenAI, San Francisco, CA) privately hosted on an institutional server. The LLM was prompted to generate a meeting summary including minutes and action items. LLM-generated summaries were provided to participants on the same day as the meeting. The accuracy, clarity, and usefulness of the LLM-generated summaries were evaluated by collecting anonymous feedback from meeting participants using 5-point Likert scales.

Results/Outcome

Transcripts from 16 Radiology IT group meetings were summarized by an LLM. Feedback was received from 44 meeting participants. Most respondents (68-81%) rated the summaries as either “extremely” or “very” accurate, clear, and useful. A minority of respondents (9-12%) rated the summaries as either “not at all” or only “somewhat” accurate, clear, and useful. “Usefulness” was rated highly despite any perceived inaccuracies or unclear statements generated by the LLM.

Conclusion

LLMs can rapidly provide Radiology IT team members with meeting summaries that are accurate, clear, and useful.

Statement of Impact

We show that LLMs can help address a key problem with virtual meetings – even those of a technical nature – by quickly and inexpensively providing participants with an accurate, clear, and useful meeting summary.

Keywords

Large language models; Virtual meetings; Meeting summary; Transcription

029 - Synthesizing Diagnostic Insights from Radiology Reports: A RAG-Based LLM Method for Reducing Hallucinations and Preventing Catastrophic Forgetting

Presenter: Briana Malik, University of Pittsburgh

Briana Malik¹

¹University of Pittsburgh

Introduction/Background

Disparities in data quality and context availability can introduce biases in Large Language Models (LLMs), affecting their accuracy. These issues are particularly pronounced in the field of radiology, where precise interpretation and understanding of reports is critical. Incorporating accurate and contextually relevant information is essential to LLM performance, reducing hallucinations and catastrophic forgetting.

Methods/Intervention

We hypothesize that integrating Retrieval-Augmented Generation (RAG) with contextual search will significantly reduce hallucinations by grounding LLMs with accurate and contextually relevant information. We used 500 radiology reports from a chest X-ray collection, tokenized the text, and generated embeddings using a Large Language Model (LLM) tokenizer. These embeddings were stored in a LevelDB database for efficient storage and retrieval. A similarity search index was built to facilitate efficient contextual retrieval. Queries related to specific radiological conditions were processed through the RAG system, which retrieved the most relevant context; this context was then combined with the query and input into the LLM (GPT-2) to generate contextually rich responses.
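The retrieve-then-prompt step can be sketched as follows. This is a simplified stand-in, not the study's system: a toy bag-of-words vector replaces model-derived embeddings, an in-memory list replaces LevelDB and the similarity index, and the prompt template and function names are assumptions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a placeholder for real embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, reports, k=1):
    """Return the k reports most similar to the query."""
    q = embed(query)
    return sorted(reports, key=lambda r: cosine(q, embed(r)), reverse=True)[:k]


def build_prompt(query, reports, k=1):
    """Combine retrieved context with the query before generation."""
    context = "\n".join(retrieve(query, reports, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


reports = [
    "Right lower lobe consolidation consistent with pneumonia.",
    "No pneumothorax. Heart size is normal.",
    "Small left pleural effusion.",
]
print(build_prompt("Is there a pleural effusion?", reports))
```

The resulting prompt string is what would be passed to the generator model in place of the bare query.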

Results/Outcome

The RAG-based method significantly improved LLM’s understanding of radiology reports and rare conditions by grounding the model and reducing hallucinations. The number of words in responses decreased from 388 to 223, showing more concise outputs. Unique words increased from 40 to 109 with RAG, indicating less repetition. The repetition rate fell from 0.897 to 0.511. ROUGE-1 F1 score improved from 0.015 to 0.024, and ROUGE-1 precision increased from 0.007 to 0.013. ROUGE-1 recall remained constant at 0.500. Perplexity increased from 1.605 to 10.369, reflecting more contextually rich responses.
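The abstract does not define the repetition rate. One definition that reproduces both reported values is one minus the ratio of unique words to total words, as the check below shows; treat this formula as an inference from the numbers rather than the authors' stated method.

```python
def repetition_rate(total_words, unique_words):
    """Share of words that repeat an earlier word (assumed definition)."""
    return 1.0 - unique_words / total_words


print(round(repetition_rate(388, 40), 3))   # 0.897 without RAG
print(round(repetition_rate(223, 109), 3))  # 0.511 with RAG
```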

Conclusion

The RAG-based approach enhances the accuracy and relevance of responses and improves understanding of rare conditions. Reduced repetition rates and improved ROUGE scores demonstrate more accurate responses. Higher perplexity with RAG indicates richer, more contextual responses compared to lower perplexity and incoherence without RAG.

Statement of Impact

This study demonstrates the potential of RAG-based LLMs to advance radiology report interpretation and rare disease diagnosis. By providing more accurate, contextually relevant answers, this approach enhances diagnostic quality and patient care, addressing critical gaps in traditional LLM training, which is not domain-specific and under-trains, or fails to train, on very rare words that do not appear in the vocabulary generated during frontier model training.

Keywords

Retrieval Augmented Generation (RAG); Large Language Models (LLM); Deep Learning; Artificial Intelligence

030 - Transforming Plain Text Radiology Reports into Structured Data Using Common Data Elements and FHIR Standards

Presenter: Michael Hood, Massachusetts General Hospital

Michael Hood¹, Roshan Fahimi¹, Heather Chase², Tarik Alkasab¹

¹Massachusetts General Hospital, Boston, MA, USA

²Microsoft-Nuance, Redmond, WA, USA

Introduction/Background

This project aims to enable transforming plain text radiology reports into structured data using Common Data Elements (CDEs) and Fast Healthcare Interoperability Resources (FHIR) standards. This effort enhances downstream clinical applications and improves the compatibility and exchange of radiology data across healthcare systems.

Methods/Intervention

A large language model (LLM) was employed to generate preliminary CDE definitions from anonymized chest CT reports. These reports were segmented into overlapping chunks, with semantic vectors generated. For given chest CT findings, relevant chunks were retrieved using cosine-similarity search and reranked. A GPT model was then prompted to create structured data models using the report chunks as context. The model was prompted to generate models with attributes such as identification, characteristics, and associated findings. Models then underwent iterative refinement and expert radiologist review.
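Splitting reports into overlapping chunks, the first step described above, might look like the sketch below. Word-based chunking, the default sizes, and the function name are assumptions; the abstract does not state the chunk unit or overlap used.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word chunks where neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final words are already covered
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=2):
    print(chunk)
```

The overlap keeps a finding that straddles a chunk boundary intact in at least one chunk, which matters when chunks are later retrieved by similarity.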

Results/Outcome

This initial pilot generated refined CDE definitions for over 200 chest CT findings, demonstrating LLM capability in rapidly producing preliminary CDEs. Our toolkit successfully transformed these into fully annotated JSON files for generating CDE-labeled FHIR Observation objects. This new process for appending ontological tags is robust — analysis of 82 chest CT reports revealed a total of 1190 findings, 83.2% of which could be successfully encoded as CDE-labeled FHIR Observations.

Conclusion

The developed methodologies and tools significantly expedite the generation and application of CDE definitions, which will enable structured and standardized representation of radiology findings using a standard FHIR structure.

Statement of Impact

This project demonstrates the feasibility of transforming unstructured radiology reports into structured data by integrating tailored LLMs, a streamlined toolchain, and standardized semantics. This improves compatibility and data exchange across healthcare systems, empowering downstream clinical workflows and ultimately improving patient care.

Keywords

Common Data Elements; Structured Data; Fast Healthcare Interoperability Resources (FHIR); Large Language Models

Oral Presentations

Radiomics & New Techniques | Scientific Abstract Presentations

031 – Adversarial Domain Adaptation for Robust Glaucoma Classification

Presenter: Homa Rashidisabet, University of Illinois Chicago

Homa Rashidisabet¹, R. V. Paul Chan¹, Thasarat S. Vajaranant¹, Darvin Yi¹

¹University of Illinois Chicago, Chicago, IL, USA

Introduction/Background

While deep learning models have shown promising results in automated glaucoma prediction, they lack robustness when faced with a data domain shift. In real-world medical settings, discrepancies among images from different hospitals and patient populations create a domain shift, adversely impacting model performance and generalization across diverse datasets.

Methods/Intervention

We propose a Domain Adaptation (DA) method to address the performance degradation of deep learning models under data shift in glaucoma classification. Implementing Ganin et al., 2015, our deep learning architecture incorporates an adversarial learning component to eliminate domain-specific information, enabling the model to learn invariant features across data domains. The DA model is jointly trained on glaucoma classification and domain classification tasks, minimizing differences between domains while optimizing for the main task. We validated the DA method against the state-of-the-art ResNet-50 model using fundus images from LAG (n=4854), REFUGE (n=249), and University of Illinois Chicago (UIC) dataset (n=711) as both source and target data.
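For orientation, the domain-adversarial objective in the cited work of Ganin et al. is commonly written as below, with a feature extractor G_f, a label predictor G_y, and a domain classifier G_d. The notation follows that general formulation; the exact losses and weighting used in this study are not given in the abstract, so this is a sketch of the family of methods rather than the authors' implementation.

```latex
% S: labelled source samples, T: target samples, d_i: domain label, \lambda: trade-off weight
E(\theta_f, \theta_y, \theta_d)
  = \sum_{i \in S} L_y\big(G_y(G_f(x_i; \theta_f); \theta_y),\, y_i\big)
  - \lambda \sum_{i \in S \cup T} L_d\big(G_d(G_f(x_i; \theta_f); \theta_d),\, d_i\big)

% Saddle point: features and label predictor minimise E, the domain classifier maximises it
(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```

The minus sign is what makes the features domain-invariant: the feature extractor is rewarded when the domain classifier performs poorly.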

Results/Outcome

Using the UIC dataset as the source, the source-only model achieves 71.7% accuracy on LAG and 82.0% accuracy on REFUGE. The DA model improves these by 12.3 and 6.0 percentage points, achieving 84.0% and 88.0% accuracy, respectively. The train-on-target models, which serve as a reference, represent the upper bound on DA performance, while the source-only model without adaptation indicates the lower bound. When swapping source and target, the source-only model achieves 72.0% and 70.5% accuracy on UIC. The DA model improves these by 12.5 and 7.0 percentage points, achieving 84.5% and 77.5% accuracy on UIC.

Conclusion

Our proposed DA method effectively addresses the performance degradation of deep learning models when there is a shift in data. By learning domain-invariant features and unlearning domain-specific information, the DA method significantly improves the performance of the state-of-the-art ResNet-50 model in glaucoma classification.

Statement of Impact

Glaucoma, affecting eighty million people worldwide, is a leading cause of blindness. Deep learning has improved glaucoma prediction using fundus images, but performance degrades with data shifts from varied sources. Addressing this degradation is crucial to ensure clinical reliability and avoid patient risks. We propose a method to maintain performance across diverse datasets by adapting to data variations, validated on three datasets.

Keywords

Glaucoma classification; Domain adaptation; Computer vision; Fundus analysis

032 - Cross-Vendor Reproducibility of Radiomics-based Machine Learning Models for Computer-aided Diagnosis

Presenter: Jatin K. Chaudhary, University of Turku

Jatin K. Chaudhary¹, Ivan K. Jambor¹, Hannu Aronen¹, Otto Ettala¹, Jani Saunavaara¹, Peter Bostrom¹, Jukka Heikkonen¹, Rajeev Kanth¹, Harri Merisaari¹

¹University of Turku, Turku, Finland

Introduction/Background

Prostate cancer (PCa) remains the most common cancer among men in the western world and the second leading cause of cancer-related deaths. The integration of machine learning (ML) into Magnetic Resonance Imaging (MRI) holds promise for enhancing the accuracy and efficiency of PCa diagnostics. This study aims to evaluate the reproducibility and performance of ML models using radiomic features derived from Pyradiomics and MRCradiomics across different MRI scanners.

Methods/Intervention

We utilized imaging data from 637 men with clinical suspicion of PCa, enrolled in various clinical trials. The data included axial T2-weighted images scanned using Siemens MAGNETOM Verio 3T and Philips Ingenia 3T MRI devices. Radiomic features were extracted using Pyradiomics and MRCradiomics packages, yielding a total of 2693 features. Feature selection was conducted using the Maximum Relevance Minimum Redundancy (MRMR) method, reducing the set to 14 highly predictive features. We trained and evaluated Support Vector Machine (SVM) and Random Forest models on training, validation, and test datasets, assessing their performance using Area Under the Curve (AUC) metrics.

Results/Outcome

The SVM model achieved an AUC of 0.74 on the Multi-Improd dataset using combined Pyradiomics and MRCradiomics features, but the AUC dropped to 0.35 on the Philips test set. The Random Forest model showed similar trends with AUCs of 0.73 on the Multi-Improd set and 0.60 on the Philips set. Models trained exclusively on Pyradiomics features demonstrated higher robustness, with the Random Forest achieving an AUC of 0.78 on the Philips set. In contrast, models using only MRCradiomics features had varied outcomes, highlighting the challenge of reproducibility across different scanners.
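AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The pairwise sketch below, with illustrative names and toy data, makes that interpretation concrete: a value below 0.5, such as the 0.35 reported on the Philips test set, means the model ranks cases worse than chance on that data.

```python
def auc_score(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc_score(labels, scores))                   # 3 of 4 pairs ranked correctly
print(auc_score(labels, [1 - s for s in scores]))  # inverted ranking falls below 0.5
```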

Conclusion

This study underscores the significant impact of scanner-induced variability on the performance of ML models in PCa diagnostics. While combining Pyradiomics and MRCradiomics features enhances predictive performance, rigorous cross-platform validation is crucial to ensure model reliability.

Statement of Impact

Our findings emphasize the need for standardized imaging protocols and comprehensive validation frameworks to bridge the gap between innovative AI applications and practical, patient-centric healthcare solutions. This research advances the field of PCa imaging and promotes the broader adoption of AI in medical diagnostics, ultimately aiming to improve diagnostic accuracy and patient outcomes.

Keywords

Inter-Vendor Reproducibility; Radiomics; Diagnostic tools; Model Reproducibility

033 – From AI to Eye: Training the Radiologist with Deep Learning Interpretations in Sex Differentiation

Presenter: Shahriar Faghani, Mayo Clinic - Rochester

Shahriar Faghani¹, Christin A. Tiegs-Heiden¹, Mana Moassefi¹, Garret Powell¹, Micheal Ringler¹, Bradley J. Erickson¹, Nicholas Rhodes¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

To use deep learning (DL) as an educational and scientific discovery tool to improve the radiologist’s ability to directly make subtle imaging findings without additional DL assistance.

Methods/Intervention

We present a DL model that can identify sex differences from frontal knee radiographs with high accuracy, then use the resultant occlusion interpretation maps (OIMs) to train human readers to improve their ability to perform this same task. Two groups, each of three human readers, were tasked to separate radiographs into male and female sex correctly. Both groups were informed of the patient’s sex, while the first group was also given these radiographs’ OIMs. After two weeks, the group was retested with a new set of 50 radiographs. This group was compared to a second group trained without the OIMs.

Results/Outcome

The DL model separated sex with 0.96 accuracy. The average accuracy of the six human readers initially was 0.62 (range: 0.56-0.74). After the study, the average accuracy of the six human readers was 0.77 (range: 0.7-0.84). The improvement in accuracy of the six human readers was statistically significant (p=0.0364). The accuracy of the “heat map” group was 0.8, and the control group was 0.74. When pooled as a collective group, Group 1 again showed significant improvement from baseline (p=0.0058), whereas Group 2 did not (p=0.1245), though there was no statistical difference between the two groups at the end of the experiment (p=0.2380).

Conclusion

OIMs could not be shown to definitively account for the improved accuracy in our test, though the group provided those maps demonstrated a statistically significant improvement from baseline, while the group without these maps did not. Moreover, simply the high accuracy of the DL model in performing this task proved it was possible and motivated our human readers to learn to perform this task.

Statement of Impact

This initiative seeks to identify new imaging biomarkers, thereby improving the functionality of existing DL systems and enabling human-led scientific advancements beyond the reach of DL alone. Additionally, this strategy incorporates a layer of explainability to facilitate the monitoring and troubleshooting of DL models when errors occur, contributing to the development of DL systems with increased resilience to such errors.

Keywords

Deep learning; Interpretability; Classification; MSK radiology

034 - Impact of Random Prostate Volume Modifications on Automated Segmentation Model Performance

Presenter: Dominic LaBella, Duke University Medical Center

Dominic LaBella¹, Michaela Kop², Xuan Qi³, Thomas Sanford⁴

¹Duke University Medical Center, Durham, NC, USA

²John A Burns School of Medicine, University of Hawaii, Honolulu, HI, USA

³National Institutes of Health, Bethesda, MD, USA

⁴University of Hawaii Cancer Center, Honolulu, HI, USA

Introduction/Background

Accurate and consistent prostate segmentation on magnetic resonance imaging (MRI) plays an important role in surgical planning, biopsy targeting, and radiotherapy planning. Interobserver variability is an inherent challenge for defining a consistent ground truth (GT) segmentation. This study investigates the effect of modifying training set GT segmentations on the performance of automated prostate segmentation models.

Methods/Intervention

Ground truth segmentations of the whole prostate from T2-weighted MRI prostate sequences were manually delineated by a board-certified urologist who is also fellowship trained in urologic oncology. GT segmentation modifications were made by either adding or subtracting a specified distance in millimeters radially from the surface of the whole prostate volume on each axial slice. Each axial slice had a one-third chance of either A) subtracting a uniform inner margin from the prostate surface, B) not modifying the slice’s segmentation, or C) adding a uniform outer margin from the prostate surface. Ten different modified prostate segmentation-image pair training datasets were created. Each training dataset had a specified amplitude of potential margin modification. Identical segresnet models from the Auto3DSeg framework were trained over 300 epochs for each of the 10 modified training datasets and an additional unmodified GT training dataset. Validation and testing sets included unmodified GT segmentations. Dice similarity coefficients (DSC) were used to compare model performance.
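The per-slice randomization described above can be sketched as drawing, for each axial slice, one of three equally likely signed margins. The function name and seed handling are illustrative, and the morphological step that would actually shrink or grow the contour by the drawn margin is left out.

```python
import random


def sample_slice_margins(n_slices, amplitude_mm, seed=None):
    """Draw a signed margin per axial slice with one-third probability each.

    Negative: subtract an inner margin; zero: leave the slice unchanged;
    positive: add an outer margin.
    """
    rng = random.Random(seed)
    options = (-amplitude_mm, 0.0, amplitude_mm)
    return [rng.choice(options) for _ in range(n_slices)]


margins = sample_slice_margins(n_slices=24, amplitude_mm=3.0, seed=0)
print(margins)
```

Repeating the draw with amplitudes of 1 to 10 mm would give ten modified training sets of the kind described, under the assumption that each dataset uses a single fixed amplitude.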

Results/Outcome

A total of 119 T2-weighted images with whole prostate segmentations were included in the study. Mean test set DSC decreased linearly from 0.917 to 0.856 as GT variability increased from 0 to 10 millimeters.

Conclusion

The decrease in testing set DSC as training set segmentation modification amplitude increases shows the importance of consistent GT segmentations for automated segmentation model development. Future studies should assess the impact of interobserver variability across additional radiographical structures on larger datasets.

Statement of Impact

This study elucidates the critical role of consistent GT segmentations when training automated segmentation models. By demonstrating that even minor modifications to GT segmentations can degrade the performance of automated segmentation models, we highlight the importance of standardized segmentation protocols.

Keywords

Artificial Intelligence; Automated Segmentation; Interobserver Variability; Prostate

035 - Replicating and Validating Radiomics-Based Prediction of PD-L1 Expression Status in NSCLC Patients

Presenter: Anna Theresa Stüber, LMU University Hospital, LMU Munich

Anna Theresa Stüber¹, Maurice Heimer¹, Clemens C. Cyran¹, Michael Ingrisch¹

¹LMU University Hospital, LMU Munich, Munich, Germany

Introduction/Background

The purpose of this study is to investigate the predictive value of radiomics in determining PD-L1 expression (positive vs. negative) status among NSCLC patients using an external [18F]FDG PET/CT dataset. Specifically, we aim to replicate and validate the radiomics-based machine learning model proposed by Zhao et al.* to address concerns related to reproducibility and replicability in radiomics research. * Zhao et al. Predicting PD-L1 expression status in patients with non-small cell lung cancer using [18F]FDG PET/CT radiomics. EJNMMI Res. 2023 Jan 22;13(1):4. doi: 10.1186/s13550-023-00956-9. PMID: 36682020; PMCID: PMC9868196.

Methods/Intervention

We analyzed a cohort of 254 NSCLC patients (86 [33.9%] with negative and 168 [66.1%] with positive PD-L1 status) who underwent [18F]FDG PET/CT imaging, using two distinct image segmentation methods: solid component-based segmentation (LUT) with a lung tissue window (W1500/L-600) and the attenuation-corrected PET volume, and a conservative, smaller segmentation (CON) with a soft-tissue window (W400/L40) and the corresponding PET volume. We replicated two radiomics-based models (“Rad-score” and “complex model”) provided by Zhao et al. for both segmentation sets, along with their clinical stage model. Performance was evaluated using 10-fold cross-validation and the area under the curve (AUC).
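
As a reminder of what the reported metric measures, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is illustrative only; the function names are hypothetical and the averaging over folds is an assumption about how a mean AUC is typically obtained, not a description of the authors' code.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(folds):
    """Average the AUC over cross-validation folds.

    `folds` is a list of (scores, labels) tuples, one per held-out fold.
    """
    values = [auc(scores, labels) for scores, labels in folds]
    return sum(values) / len(values)
```

An AUC near 0.5, as reported for the replicated complex model, corresponds to ranking positive and negative cases no better than chance.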

Results/Outcome

Performance analysis of the Rad-score model revealed a mean AUC of 0.593 (95% CI: 0.573 - 0.613) for CON segmentation and 0.573 (0.544 - 0.586) for LUT segmentation, both falling below the reported mean AUC of 0.761 (0.664 - 0.860) by Zhao et al. Similarly, for the complex model, we achieved mean AUCs of 0.505 (0.485 - 0.524) and 0.519 (0.501 - 0.541), respectively, whereas Zhao et al. reported a mean AUC of 0.769 (0.675 - 0.863).

Conclusion

Our study failed to replicate the findings of the previous study. In particular, the original model achieved very poor prediction performance on our dataset, even though our dataset is compatible with the original dataset. These findings underscore the challenges in replicating radiomics-based predictive models across different datasets and highlight the importance of rigorous validation in ensuring clinical utility.

Statement of Impact

These findings underscore the challenges in replicating radiomics-based predictive models across different datasets and highlight the importance of rigorous validation in ensuring clinical utility.

Keywords

Replication; Radiomics; Machine learning evaluation; PET/CT

036 - Transparent Radiomics ML Model: Combining Human and Artificial Intelligence for Prediction of Therapy Outcomes

Presenter: Shrey S. Sukhadia, Dartmouth Health

Shrey S. Sukhadia¹, Crisi Patel¹, Adrienne A. Workman¹, Roberta M. diFlorio-Alexander¹, Marthony L. Robins¹

¹Dartmouth Health, Lebanon, NH, USA

Introduction/Background

Predicting the response to neoadjuvant chemotherapy (NAC) in invasive breast carcinoma (IBC) is a key oncological challenge. Accurate prediction can help tailor individualized treatment, minimize chemotoxicity, and enhance therapeutic effectiveness. However, the efficacy of NAC varies significantly among patients, underscoring the importance of identifying reliable predictors of response. Advances in molecular biology, imaging, and artificial intelligence offer promising avenues for developing robust predictive models, which could revolutionize personalized treatment in breast cancer management. We identified the top nine latent radiomic features in the active tumor regions of pre-NAC MRI scans to predict the tumor’s response to NAC using a transparent Decision Tree Classifier (DTC) model.

Methods/Intervention

We collected pre-NAC MRI scans for 75 IBC patients, in which the active tumor regions of interest (ROIs) underwent voxel-based segmentation using ITK-SNAP 4.0. The ROIs were fed to a custom-built radiomic feature extraction pipeline that extracted 108 IBSI-approved radiomic features from 7 feature classes and normalized them using a standard scaler. Tumor response to NAC was extracted from the EHR and confirmed by post-NAC imaging and the pathology report. An aggregated score indicating tumor response to NAC was developed. A DTC model was trained and tested on a 90:10 split of the sample set using IMAGENE v3.2. Three-fold cross-validation was performed on the training set to control overfitting, and the model was then evaluated on the testing set.

Results/Outcome

Our DTC model predicted the response to NAC with an AUC and R-squared of 1.0 each (p < 0.002). The radiomic features contributing the most to this prediction were shape, gray-level co-occurrence matrix, gray-level dependence matrix, and gray-level size zone matrix. The decision tree showing the rules applied by the DTC aided the transparency of the model.

Conclusion

We built a transparent Machine Learning model to predict NAC outcomes represented in EHR using the latent radiomic features extracted from MRIs pre-NAC. Our approach combines both human and artificial intelligence to identify novel radiomic biomarkers that predict therapy outcomes in breast cancer.

Statement of Impact

Our work offers a transparent Machine Learning model that predicts NAC outcomes using pre-NAC images in breast cancer.

Keywords

Radiomics; Neoadjuvant Chemotherapy; Therapy Response; Machine Learning

Oral Presentations

Clinical Implementation & Toolkits | Scientific Abstract Presentations

037 - An Ontology for Discoverable and Interoperable Radiology AI Models and Datasets

Presenter: Charles E. Kahn, Jr., University of Pennsylvania

Charles E. Kahn, Jr.¹, Abhinav Suri², Safwan Halabi³, Hari Trivedi⁴

¹University of Pennsylvania, Philadelphia, PA, USA

²University of California Los Angeles, Los Angeles, CA, USA

³Lurie Children’s Hospital, Chicago, IL, USA

⁴Emory University, Atlanta, GA, USA

Introduction/Background

"Model cards" and “datasheets for datasets” provide valuable metadata to detail the performance, intended use, and potential limitations of AI resources. However, their format as unstructured text limits the ability to search for relevant resources and to automate their analysis. We sought to create a formal description to increase transparency and interoperability, reduce bias, and promote reproducibility of radiology AI models and datasets.

Methods/Intervention

The Radiology Model and Dataset Ontology (RMDO) was created to define attributes for AI models and datasets in radiology. RMDO references external ontologies and vocabularies, including RadLex, the LOINC/RSNA Radiology Playbook, and radiology common data elements (CDEs). RMDO incorporates RSNA content codes, PapersWithCode.com classifications of machine-learning methods and tasks, and the Metrics Reloaded listing of model-performance metrics. A JavaScript Object Notation (JSON) Schema was defined to allow serialization of RMDO-based descriptions of radiology AI models and datasets.

Results/Outcome

RMDO comprises 3,323 classes related by 4,403 logical axioms. The primary RMDO entity is a Project; its metadata describe authors, versioning, availability, licensing, and other features. Each Project consists of zero or more Models and zero or more Datasets. Model descriptions include architecture, intended uses, metrics, and ethical considerations. Dataset descriptions include imaging procedure, number of patients and images, image file format, output information, availability and licensing, partitions, annotation methods, and study cohort characteristics, such as demographics and disease prevalence. RMDO has been applied to datasets created for RSNA’s AI competitions and for several published AI models.
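
The Project, Model, and Dataset structure described above lends itself to a structured, serializable record. The sketch below shows one way such a record might look when serialized to JSON. It does not reproduce the RMDO JSON Schema, which is not given in the abstract: every class name, field name, and value is a hypothetical stand-in chosen to mirror the attributes listed in the text.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Model:
    # Hypothetical fields mirroring "architecture, intended uses, metrics".
    name: str
    architecture: str
    intended_uses: List[str] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class Dataset:
    # Hypothetical fields mirroring "imaging procedure, number of patients
    # and images, image file format".
    name: str
    imaging_procedure: str
    num_patients: int
    num_images: int
    image_file_format: str


@dataclass
class Project:
    # A Project holds zero or more Models and zero or more Datasets.
    name: str
    authors: List[str]
    version: str
    license: str
    models: List[Model] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)


def to_json(project: Project) -> str:
    """Serialize a project description to a JSON string."""
    return json.dumps(asdict(project), indent=2, sort_keys=True)
```

Because the record is structured rather than free text, a search for, say, all projects with a dataset in a given file format reduces to filtering on a field.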

Conclusion

RMDO provides a standardized vocabulary that allows more effective classification and indexing of radiology AI models and datasets to make these resources more easily findable and accessible and to allow automated analysis of their underlying content.

Statement of Impact

An ontology to describe radiology AI models and datasets can make AI resources more findable and accessible. By allowing structured descriptions, the ontology helps promote reproducibility of AI models, and can aid in identifying and mitigating potential biases.

Keywords

Ontology; Model cards; Datasheets; Metadata

038 - Beyond FDA Clearance: Automated Post Deployment Monitoring and Validation of Commercial AI Models using Local Large Language Models (LLMs)

Presenter: Theo Dapamede, Emory University

Theo Dapamede¹, Bardia Khosravi², Chad Robichaux¹, Aawez Mansuri¹, Mohammadreza Chavoshi³, Alex Belov¹, Angela Udongwo⁴, Chinonyelum Igwe⁵, Frank Li¹, Beatrice Brown-Mulry¹, Hanssen Li¹, John Moon¹, Judy Gichoya¹, Hari Trivedi¹

¹Emory University, Atlanta, GA, USA

²Yale University, New Haven, CT, USA

³Tehran University of Medical Sciences, Tehran, Iran

⁴Temple University, Philadelphia, PA, USA

⁵University of Ibadan, Ibadan, Nigeria

Introduction/Background

As AI models are deployed in diverse clinical settings, continuous monitoring and assessment of subgroup performance is critical. Automated techniques to compare radiologist interpretations to model performance must be developed. We used a large language model (LLM) to evaluate the performance of two clinically deployed commercial AI models for pulmonary embolism and intracranial hemorrhage detection.

Methods/Intervention

We identified 8,966 CT pulmonary embolism (PE) exams and 14,637 non-contrast CT head exams conducted between April and October 2023 that were evaluated by the AI models and extracted the corresponding radiology reports. A locally deployed instance of Llama3 8B was used to extract the PE and intracranial hemorrhage (ICH) ground truth labels from the radiology reports, using methods that were previously validated on 500 manually annotated reports (PE: Sn 1.0, Sp 1.0; ICH: Sn 0.93, Sp 1.0). AI model performance was compared to the extracted ground truth for multiple subgroups (race, age, sex, and patient location). Overall performance was also compared to the submitted FDA and published performances.
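
Once report-derived labels are available, comparing them with the AI outputs reduces to a confusion matrix per subgroup. The following sketch computes sensitivity and specificity with a 95% confidence interval. The abstract does not state which interval method was used, so the Wilson score interval here is an assumption, and the function names are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total == 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def sens_spec(ai_flags, report_labels):
    """Compare binary AI outputs with report-derived labels (1 = positive)."""
    tp = sum(1 for a, r in zip(ai_flags, report_labels) if a == 1 and r == 1)
    fn = sum(1 for a, r in zip(ai_flags, report_labels) if a == 0 and r == 1)
    tn = sum(1 for a, r in zip(ai_flags, report_labels) if a == 0 and r == 0)
    fp = sum(1 for a, r in zip(ai_flags, report_labels) if a == 1 and r == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }
```

Subgroup results follow by filtering both lists to the exams of one subgroup, for example outpatients, before calling `sens_spec`.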

Results/Outcome

For the PE model, sensitivity was 80.3% (95% CI: 77.8%-83.0%) and specificity was 98.0% (95% CI: 97.7%-98.3%), compared to the published FDA clearance sensitivity of 93.0% (90.2%-95.1%) and specificity of 93.7% (92.7%-94.6%). For the ICH model, sensitivity was 92.2% (91.2%-93.2%) and specificity was 90.3% (89.8%-90.8%), compared to the FDA clearance sensitivity of 93.6% (86.6%-97.6%) and specificity of 92.3% (85.4%-96.6%). Both models demonstrated the lowest performance for outpatients as compared to emergency and inpatients, with sensitivities of 77.5% (58.8%-85.0%) and 87.4% (76.8%-95.5%) for the PE and ICH models, respectively. Both models demonstrated equitable performance across race, ethnicity, age, and sex subgroups.

Conclusion

We have shown the potential use of LLMs as an automated method for post deployment monitoring and evaluation of clinical AI models. It is notable that the lowest-performing group for both models was outpatients, where advanced detection models can potentially provide the most benefit. Further work and reader studies are required to understand model failure modes and confounders.

Statement of Impact

This study demonstrates a potential automated solution for post deployment monitoring of clinical AI models, which is necessary for ensuring safe and stable model performance after deployment.

Keywords

Post-Deployment Monitoring; AI Validation; LLM

039 - Enhancing Equitable Study Distribution Using Reinforcement Learning

Presenter: Yiting Xie, Merative

Sun Young Park¹, Linda Bagley¹, Christy Weatherbee¹, Ferenc Kis¹, Marwan Sati¹, Yiting Xie¹

¹Merative, Cambridge, MA, USA

Introduction/Background

Imaging organizations encounter heavy workloads requiring distribution among radiologists. Uneven study type distribution, cherry-picking and other factors can cause imbalanced workload distribution, which may lead to internal tension and burnout. We introduce an artificial intelligence model to assist in achieving workload balance. Our study compares PACS workload distribution between manual and AI-automated methods.

Methods/Intervention

We present a reinforcement learning model that distributes studies with the goals of maintaining fairness, respecting preferences, meeting priority deadlines, and balancing study value. The model takes requests from a PACS system including an exam and a list of active radiologists and returns an assignment recommendation. The model state is encoded as a 2D array, comprising information such as Relative Value Unit (RVU), due time, and the radiologists' workloads. The algorithm is rewarded if its recommendations meet the above goals and learns by maximizing cumulative rewards over time. Our model has two learning phases: offline learning using realistic simulations of small, medium, and large clinical settings, and online learning, where the model adapts to study distribution and radiologists’ preferences in real-time. We performed a comparative study between AI-automated and manual assignment phases.

Results/Outcome

Five radiologists reviewed 481 studies in the manual and AI-automated phases. While the modality distribution was similar in both phases, the radiologists favored CR and CT over MR in the manual phase. Modality distribution was more balanced for all radiologists in the AI-enabled phase, and a 34% more equitable RVU distribution across modalities was observed for all radiologists. MR RVUs read increased by 40% in the AI-automated phase, correcting the bias in favor of CR and CT from the manual phase.

Conclusion

We report a two-phase reinforcement learning-based study distribution framework that provides a balanced and efficient allocation of studies. We compared manual and AI-automated methods and showed a 34% reduction in the standard deviation of the RVUs read between radiologists when using the AI model.
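
The balance measure cited here, the standard deviation of RVUs read across radiologists, is straightforward to compute. The sketch below shows the calculation and the percent reduction between two phases. The abstract does not specify whether a population or sample standard deviation was used, so the population form is an assumption, and all names are hypothetical.

```python
from statistics import pstdev


def rvu_spread(rvus_by_radiologist):
    """Population standard deviation of total RVUs read per radiologist."""
    return pstdev(rvus_by_radiologist.values())


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline spread is zero; reduction is undefined")
    return 100.0 * (before - after) / before
```

A lower spread in the AI-automated phase than in the manual phase indicates that RVUs were shared more evenly among the radiologists.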

Statement of Impact

We have demonstrated the impact of an AI-automated worklist on study distribution. We found a notable reduction in the RVU standard deviation and improved balance among modalities along with a reduction in cherry-picking.

Keywords

Reinforcement Learning; Study Distribution; Radiologist Efficiency; Artificial Intelligence

040 - Ensuring Real-Time Reliability: An Autonomous Monitoring System for Radiology AI Performance

Presenter: Suyash Khubchandani, CARPL.ai, Inc.

Vasantha K. Venugopal¹, Abhishek Gupta¹, Rohit Takhar¹

¹CARPL.ai, Inc., Cupertino, CA, USA

Introduction/Background

Integrating artificial intelligence (AI) in healthcare, especially radiology, has revolutionized diagnostics. However, maintaining AI model accuracy in real-time clinical settings is challenging, primarily due to the lack of real-time ground truth data. This study introduces an autonomous monitoring system using two novel metrics: predictive divergence and temporal stability, providing real-time insights to ensure AI model reliability.

Methods/Intervention

To overcome real-time monitoring challenges without ground truth data, we developed two key metrics:

Predictive Divergence: This metric employs Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences to compare predictions between the primary AI model and two supplementary models. Lower divergence indicates higher accuracy and agreement among models.

Temporal Stability: This metric assesses AI model consistency by comparing current predictions with historical moving averages. Variations in temporal stability can indicate model decay or data drift.

The system was validated using chest X-ray data from a single-center clinic. Three commercial AI models for chest X-ray classification were analyzed in a longitudinal retrospective study design, using Jensen-Shannon Divergence (JSD) to compute predictive divergence and temporal stability metrics.
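
Both metrics can be sketched with a few lines of code. The example assumes that each model's predictions over a time window are summarized as a discrete probability distribution (for instance, the share of studies assigned to each finding class); how the authors form these distributions is not stated, and the function names and the window length are hypothetical.

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def predictive_divergence(primary, supporting):
    """JSD between the primary model and each supporting model."""
    return [jsd(primary, other) for other in supporting]


def temporal_stability(history, current, window=3):
    """JSD between the current distribution and a moving average of the past."""
    recent = history[-window:]
    if not recent:
        raise ValueError("history must contain at least one distribution")
    average = [sum(col) / len(recent) for col in zip(*recent)]
    return jsd(current, average)
```

A rising value from either function flags disagreement or drift without requiring ground truth labels, which is the property the monitoring system relies on.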

Results/Outcome

The study analyzed 3,993 chest X-rays over several months, including the onset of the COVID-19 pandemic. Key findings include:

Predictive Divergence: JSD values initially showed alignment between the main AI model (AI1) and its support models (AI2, AI3). Post-COVID, a significant increase in divergence between AI1 and AI2 indicated a need for model intervention.

Temporal Stability: JSD values for AI1 indicated initial consistency, but increased significantly post-COVID, reflecting a deviation from historical performance. AI2 and AI3 also showed increased divergence post-COVID, highlighting the pandemic's impact on AI model predictions.

Conclusion

The proposed system, using predictive divergence and temporal stability, offers a robust framework for real-time AI performance evaluation in clinical settings. This ensures the safe integration of AI in healthcare, as demonstrated during the COVID-19 pandemic. The system's continuous insights can enhance AI model reliability, ultimately improving patient care.

Statement of Impact

Continuous AI model monitoring in healthcare is crucial. The proposed metrics enable real-time detection of performance issues without ground truth data, significantly enhancing AI model reliability in clinical practice. Future research will optimize this system for various clinical contexts and AI models.

Keywords

Deep learning; Post market surveillance; Mlops; AI Monitoring

041 - Imaging Management of Lumbar Spine MRI Annotation in Machine Learning Model Validation with an Emphasis on Subcohort Variation

Presenter: Veysel Kocaman, Gesund.ai

Sumir S. Patel¹, Veysel Kocaman², Enes Hosgor²

¹Emory University, Atlanta, GA, USA

²Gesund.ai, Cambridge, MA, USA

Introduction/Background

Lumbar spine disorders, prevalent across various demographics, often require precise imaging analysis for accurate diagnosis and treatment planning. Magnetic Resonance Imaging (MRI) stands as the gold standard for visualizing spinal pathologies owing to its detailed soft tissue contrast. Although radiologists are predominantly trained in the clinical interpretation of lumbar spine MRIs, there exists a need for radiologists to be adept in the annotation of these exams for machine learning training, validation, and monitoring. We aim to assess radiologist annotation to determine if automatic subcohort analysis of a dataset can improve performance.

Methods/Intervention

For this study, three subspecialty-trained attending radiologists each annotated 100 lumbar spine magnetic resonance imaging (MRI) examinations of patients across various clinical settings. The annotations included vertebral body height, vertebral body area, intervertebral disc area, and neuroforaminal area. Automatic analysis on a software platform (Gesund.ai) was used to identify relative underperformance of the radiologists with respect to clinical subcohorts. We analyzed multiple subcohorts and data annotation metrics, including Cohort Time Consumption, Gender and Institution Influence, and Most Time-Consuming Measurement.

Results/Outcome

Cohorts 1 and 3, both from a Missouri institution with 'M' and 'M/F' genders, have significantly high average annotation times, suggesting that case nature or protocols affect processing times. Cohorts from two clinical sites, especially male and Missouri cohorts, indicate challenging diagnostic criteria. Patterns show that certain geographical cohorts, particularly males, require more time, potentially reflecting complex anatomy or pathology. L1 and L4 vertebral bodies and area measurements are consistently time-consuming, suggesting a need for better training or tools. Older age groups (60-74) have higher times than younger ones due to age-related anatomical changes.

Conclusion

This study highlights the effectiveness of a sophisticated software platform designed to accustom radiologists to the type of annotation tasks required in modern machine learning. Our approach utilizes this software platform to automate the process of identifying deficiencies in the analysis of patient subcohorts. Once these deficiencies are identified, the system can tailor the training regimen by assigning additional cases from the same subcohort to the respective radiologists.

Statement of Impact

Automated performance analysis with respect to clinical subcohorts has the potential to improve radiologist

Keywords

Machine Learning; Validation; Radiologist; Subcohort; Training; MRI

Poster Presentations

042 - Abdominal Organ Labeling for Abnormality in CT Reports Using a Large Language Model

Presenter: Ricardo B. Lanfredi, National Institutes of Health

Ricardo B. Lanfredi¹, Yan Zhuang¹, Luke Krembs², Brandon Khoury², Pritam Mukherjee¹, Ronald M. Summers¹

¹National Institutes of Health, Bethesda, MD, USA

²Walter Reed National Military Medical Center, Bethesda, MD, USA

Introduction/Background

Medical report labelers capable of handling various abnormalities, such as CheXpert and CheXbert, have mainly targeted chest X-ray (CXR) reports. However, compared with chest X-ray reports, computed tomography (CT) reports, particularly in the abdominal region, are more complex and cover a broader range of organs and abnormalities. These challenges make abnormality labeling for abdominal organs underexplored.

Methods/Intervention

To address this challenge, we propose a large language model (LLM) labeler, MAPLEZ-CT, to annotate whether major abdominal organs are abnormal. MAPLEZ-CT is an adaptation to CT reports of the previously published MAPLEZ (Medical report Annotations with Privacy-preserving large language model using Expeditious Zero shot answers) LLM prompt system. The employed zero-shot prompt, which uses the publicly available Meta-Llama-3-70B-Instruct model run locally to preserve privacy, is displayed in Figure 1. A key feature of the prompt was the inclusion of an extensive definition of abnormalities: any unusual findings the radiologist deems worth mentioning for a specific organ, including atypical anatomical variations, postsurgical changes, and findings in subparts of organs. This definition excludes findings indicating limited evaluation, normal organs, adjacent structures, or broad anatomical areas. Additional modifications included LLM pre-extraction of relevant sentences and chain-of-thought reasoning, whose computational cost was partially offset by the vLLM library, which reduced the processing time by around 92%.

Results/Outcome

One research fellow and two radiology residents annotated the test set for five major abdominal organs (the spleen, liver, kidneys, gallbladder, and intestines) using 100 private reports randomly sampled from the publicly available DeepLesion dataset. The final labels were decided through majority voting. The proposed method was compared to MAPLEZ and the rule-based SARLE (Sentence Analysis for Radiology Label Extraction) labeler. It achieved a median F1 score of 0.954 [0.927, 0.975], an improvement of 0.135 to 0.403 over the median scores of the baseline models. Table 1 shows the results for each organ. We calculated 95% confidence intervals through bootstrapping.
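
The two evaluation steps named above, majority voting among three annotators and bootstrapped confidence intervals, can be sketched as follows. The percentile bootstrap, the number of resamples, and all function names are assumptions for illustration; the abstract does not describe the exact procedure.

```python
import random
from collections import Counter


def majority_vote(votes):
    """Return the most common label among an odd number of annotators."""
    return Counter(votes).most_common(1)[0][0]


def f1(y_true, y_pred):
    """F1 score for binary labels (1 = abnormal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling reports with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

With three annotators and binary labels a majority always exists, so no tie-breaking rule is needed in this setting.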

Conclusion

MAPLEZ-CT can reliably label abnormalities for major organs in abdominal CT, outperforming alternatives. It has the potential to create large-scale annotated CT datasets for abnormality detection.

Statement of Impact

Zero-shot privacy-preserving LLMs can successfully label abnormal organs for CT reports.

Keywords

Large-language models; Abdominal CT; Medical reports; Abnormality labels

043 - Analysis of Out-of-Distribution Factors to Detect Iris and Pupil Using Cataract Surgical Images

Presenter: Mahtab Faraji, University of Illinois Chicago

Mahtab Faraji¹, Rogerio G. Nespolo¹, Homa Rashidisabet¹, Daniel Wang¹, Alexis Warren¹, Hesham Gabr¹, Yannek Leiderman¹, Darvin Yi¹

¹University of Illinois Chicago, Chicago, IL, USA

Introduction/Background

Deep Learning (DL) has substantial potential in ophthalmology, particularly for disease classification and tasks like iris and pupil detection in cataract surgery. DL algorithms typically assume that train and test samples share the same distribution. However, in practice, test samples often differ in distribution, which is called Out-of-Distribution (OOD), affecting generalizability. The OOD can result from factors like differences in data acquisition and preprocessing methods between train and test datasets. This study investigates the impact of two possible OOD-causing factors on a YOLOv5 model's performance in detecting the iris and pupil in cataract surgery images.

Methods/Intervention

Surgical images were divided into training, validation, and test sets. The test set underwent two transformations to simulate OOD: 1) adding Gaussian noise at various levels and 2) converting images to grayscale. We trained a YOLOv5 model for iris and pupil detection using the train set and evaluated it with the validation set. The model was then tested on both the original and transformed test data. Performance was assessed using the Mean Average Precision (mAP) metric.
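
The two test-set transformations can be sketched as below. The abstract does not define how a noise "level" maps to pixel values, so treating it as the noise standard deviation expressed as a fraction of the 0-255 intensity range is an assumption, as are the function names and the luminance weights used for grayscale conversion.

```python
import numpy as np


def add_gaussian_noise(image: np.ndarray, level: float, rng) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 image.

    `level` is the noise standard deviation as a fraction of the 0-255
    range (an assumed definition), e.g. 0.02 for "2% noise".
    """
    noise = rng.normal(0.0, level * 255.0, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)


def to_grayscale(image: np.ndarray) -> np.ndarray:
    """Convert an RGB uint8 image to grayscale, keeping three channels.

    The output keeps shape (H, W, 3) so a detector trained on RGB input
    can consume it unchanged.
    """
    weights = np.array([0.299, 0.587, 0.114])  # common luminance weights
    gray = image[..., :3].astype(np.float64) @ weights
    gray = np.clip(np.rint(gray), 0, 255).astype(np.uint8)
    return np.repeat(gray[..., None], 3, axis=-1)
```

Applying these functions only to the test set, as in the study design, leaves the training distribution untouched and isolates the effect of the shift.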

Results/Outcome

The results have shown a progressive decline in mAP as noise levels increase from 2% to 15%. Notably, dataset_3 remains stable up to 4% noise, while dataset_4's performance drops to nearly zero at 15% noise. Similarly, we show a 3%-6% reduction in mAP across all datasets when images are transformed to grayscale.

Conclusion

Our findings demonstrate that noise and grayscale conversion significantly impact DL model performance. This underscores the necessity of considering these factors when deploying DL models in real-world scenarios.

Statement of Impact

This study examines how OOD factors, such as noise and grayscale conversion, affect DL models for iris and pupil detection in cataract surgery images.

Keywords

Deep learning; Out-of-Distribution; Cataract surgery images; YOLOv5

044 - Answer Positioning Biases in Large Language Model Responses to Medical Multiple-Choice Questions

Presenter: Kartik Gupta, Schulich School of Medicine and Dentistry

Kartik Gupta¹, Jaron Chong²

¹Schulich School of Medicine and Dentistry, London, ON, Canada

²London Health Sciences Centre, London, ON, Canada

Introduction/Background

Large language models (LLMs) have demonstrated high performance on standardized medical examinations across various domains, typically employing a multiple-choice question (MCQ) format. Technical literature has reported that the precise order of answer choices may affect and bias LLM performance, leading to unreliable estimates of LLM performance. The objective of this study is to evaluate the accuracy of LLMs with forced re-positioning of multiple-choice answer options, utilizing MedQA, a widely recognized medical benchmarking dataset for LLMs.

Methods/Intervention

The comparative efficacy of GPT-3.5 and GPT-4 was assessed using three randomized subsets of the MedQA dataset, each comprising 1273 questions, representing 10% of the total dataset. For each subset, four permutations were generated by forced re-positioning of the correct answer into each of the four possible answer positions. The models were evaluated using two prompt templates: a question-only format (QO) and a chain-of-thought format (COT; "Think step-by-step."). Statistical analysis involved repeated measures ANOVA followed by post-hoc comparisons using Bonferroni’s multiple comparisons test. The variability of performance, termed delta, was calculated by subtracting the accuracy of the least effective position from that of the most effective position.
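
The forced re-positioning and the delta statistic can be sketched as follows. The abstract does not say how the incorrect options are arranged once the correct answer is moved, so keeping the distractors in their original relative order is an assumption, and the function names are hypothetical.

```python
def reposition(options, correct_index, target_index):
    """Move the correct option to `target_index`, preserving distractor order."""
    distractors = [o for i, o in enumerate(options) if i != correct_index]
    return (
        distractors[:target_index]
        + [options[correct_index]]
        + distractors[target_index:]
    )


def all_positions(options, correct_index):
    """One permutation per answer position (four for a four-option question)."""
    return [reposition(options, correct_index, t) for t in range(len(options))]


def delta(accuracy_by_position):
    """Accuracy of the best position minus accuracy of the worst position."""
    values = list(accuracy_by_position.values())
    return max(values) - min(values)
```

A delta near zero means accuracy does not depend on where the correct answer sits; a large delta signals positional bias.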

Results/Outcome

Using basic QO prompting, GPT-4 outperformed GPT-3.5 in accuracy (69.18% vs 57.53%; p< 0.001). COT outperformed QO prompting, with GPT-4 COT achieving a maximal performance of 80.36%, versus GPT-3.5 COT of 67.08% (p< 0.001). Across model-prompt interventions without COT, position A's performance was significantly greater than other positions. This positional bias is reduced by COT. Utilizing COT reduced Delta with GPT-3.5 (16.3% to 5.5%) and GPT-4 (15.6% to 2.9%).

Conclusion

COT outperforms basic QO prompting. Without COT, LLM performance is strongly biased towards earlier answer positions. The distribution of answer choice positions in an MCQ evaluation may affect the apparent performance of an LLM.

Statement of Impact

Clinical LLM evaluation should carefully consider the effect of multiple-choice answer position, given systemic biases in performance based upon answer position. LLM evaluation should ideally incorporate randomization of answer position for evaluation.

Keywords

Large Language Models (LLMs); Medical Question Answering; Answering Bias; LLM Safety

045 - Validating GPT-4 for Automated Protocoling in Diagnostic Imaging

Presenter: Kartik Gupta, Schulich School of Medicine and Dentistry

Kartik Gupta¹, Jaron Chong²

¹Schulich School of Medicine and Dentistry, London, ON, Canada

²London Health Sciences Centre, London, ON, Canada

Introduction/Background

The growing volume of radiology exams, especially CT scans, necessitates more efficient workflows. Protocoling, which takes up to 6% of a radiologist's time, is an opportunity for automation. Traditional machine learning methods need large datasets and are hard to adapt across institutions. Large language models (LLMs) have shown promising performance in medical question-answering and protocoling. This study evaluates the zero-shot prediction of OpenAI's GPT-4 in automating Chest CT scan protocoling, using prompts with institution-specific rules.

Methods/Intervention

A dataset of 796 labelled Chest CT Thorax imaging requests and protocols from Victoria Hospital, London, Ontario, was analyzed. Each data sample contains a requisition with the clinical indication provided by the ordering physician and the assigned imaging protocol. There were four classes of protocols: Chest CT “with contrast”, “without contrast”, “interstitial”, and “low-dose contrast”. Four prompts were tested with GPT-4: a baseline 'Control' prompt, a 'Classification Rules' (CR) prompt with specific guidelines, an 'Ablated' version with fewer guidelines, and a 'Refined' version (CR-V2) with improved rules. Performance was measured using accuracy, precision, recall, and F1 score. Statistical significance was assessed using McNemar's test, with p-values less than 0.05 considered significant.
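
McNemar's test compares two prompts on the same requests by looking only at the discordant cases, where one prompt is correct and the other is not. The sketch below uses the exact binomial form of the test; the abstract does not state whether the exact or the chi-square version was applied, so that choice is an assumption, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes.

    `correct_a[i]` and `correct_b[i]` say whether prompt A and prompt B
    predicted the right protocol for request i.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50 between prompts.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

Requests that both prompts classify correctly, or both misclassify, do not influence the result, which is why accuracy differences of a few points can still reach significance on 796 paired samples.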

Results/Outcome

The CR prompt significantly outperformed the 'Control' (accuracy: 0.88 vs 0.79, precision: 0.77 vs 0.61, recall: 0.89 vs 0.79, F1 score: 0.82 vs 0.66; P < 0.001). The 'Ablated' model showed reduced performance relative to CR yet superior performance to the 'Control' (accuracy: 0.85, P = 0.002 vs Control). The CR-V2 model achieved the highest metrics (accuracy: 0.9, precision: 0.8, recall: 0.89, F1 score: 0.84), significantly outperforming both the 'Control' (P < 0.001) and 'Classification Rules' (P = 0.014).

Conclusion

Providing specific instructions for GPT-4 can markedly improve the accuracy of protocol predictions in radiology. The use of specific prompting also improves performance of protocoling compared to no prompt (“Control”). The study demonstrates the potential of large language models in zero-shot protocol prediction for enhancing radiological workflow across institutions by adapting a set of protocoling rules.

Statement of Impact

By introducing custom prompts, institutions can tailor their automated pipelines with LLMs according to their own rules and improve protocoling accuracy. The zero-shot performance ensures that large training datasets are not required.

Keywords

Large Language Models; Protocols; Zero-shot prediction

046 - Artificial Intelligence System for Estimation of Liver Size on Pediatric Abdominal Ultrasounds

Presenter: Dana Alkhulaifat, Children's Hospital of Philadelphia

Dana Alkhulaifat¹, Mario Sinti-Ycochea¹, Vahid Khalkhali¹, Michael Welsh¹, Laith Sultan¹, Susan Sotardi¹

¹Children's Hospital of Philadelphia, Philadelphia, PA, USA

Introduction/Background

Ultrasound (US) is a safe and efficient imaging tool for estimating liver size and detecting parenchymal abnormalities, and is essential for diagnosing and treating liver diseases in children. Unlike adults, children’s liver sizes vary with age, making accurate measurement according to age crucial for disease detection. Organ segmentation methods vary between manual and automated approaches. Artificial intelligence (AI) exhibits great potential in improving the accuracy of liver segmentation and size estimation to achieve reliable liver measurements on US images. Thus, our aim was to develop an AI model to accurately predict liver size on US images in pediatric patients.

Methods/Intervention

In this retrospective, IRB-approved study, a dataset of 55 abdominal US images containing a sagittal view of the liver was utilized. The estimated liver size was extracted from each image. The liver was then manually segmented by a radiologist with 3 years of experience using 3D Slicer. Of these images, 33 were used to retrain a pretrained fully convolutional neural network (FCN50), 11 were used for validation, and 11 for testing. Shape features were extracted from the segmented liver area of each image using the physical pixel spacing taken from the original DICOM tags. Image post-processing was applied to remove artifacts and clean the segmented liver area. A Random Forest regressor predicted the liver spans after standard scaling of the liver shape features and truncation of the liver lengths to whole centimeters. The R2 score and 5-fold cross-validation in a Monte Carlo iteration were used as performance metrics.
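
The step of turning a segmented mask into physical shape features can be sketched as below. This is an illustrative sketch only: the abstract does not list the shape features that were used, so the area and bounding-box extents here are assumptions, as are the function names. The pixel spacing is assumed to have already been read from the DICOM header.

```python
import math

def liver_shape_features(mask, pixel_spacing_mm):
    """Compute simple physical shape features from a binary mask.

    mask: 2D list of 0/1 values (rows of the segmented liver image).
    pixel_spacing_mm: (row_spacing, col_spacing) in millimeters.
    The feature set is hypothetical and chosen for illustration.
    """
    row_mm, col_mm = pixel_spacing_mm
    rows = [r for r, line in enumerate(mask) if any(line)]
    cols = [c for line in mask for c, v in enumerate(line) if v]
    if not rows:
        return {"area_cm2": 0.0, "height_cm": 0.0, "width_cm": 0.0}
    n_pixels = sum(sum(line) for line in mask)
    return {
        # 1 cm^2 = 100 mm^2
        "area_cm2": n_pixels * row_mm * col_mm / 100.0,
        # bounding-box extents converted from mm to cm
        "height_cm": (max(rows) - min(rows) + 1) * row_mm / 10.0,
        "width_cm": (max(cols) - min(cols) + 1) * col_mm / 10.0,
    }

def truncate_to_whole_cm(length_cm):
    """Drop the decimal part of a length, as described for the regression target."""
    return math.trunc(length_cm)

example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
print(liver_shape_features(example_mask, (5.0, 10.0)))
print(truncate_to_whole_cm(12.7))
```

Features of this kind would then be standardized and passed to the regressor; that step is omitted here to keep the sketch free of third-party dependencies.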

Results/Outcome

55 patients (25 male) were included in the analysis, with a mean age of 8.17 years (SD 6.83 years). The 3-phase model was able to predict the liver span with an average R2 of 0.59 and a maximum R2 of 0.74. Attached histograms show the absolute error and the percentage errors of the estimations.

Conclusion

A three-phase system consisting of a transfer deep learning model (FCN50), image post-processor, and a machine learning regressor (Random Forest) can estimate liver sizes from pediatric ultrasound images, reaching an average R2 of 0.59 and a maximum R2 of 0.74.

Statement of Impact

This proposed system holds significant potential in detecting anomalies on liver ultrasound images.

Keywords

Deep learning; Ultrasound; Pediatric radiology

047 - Attention Variant Mechanism for Airways Segmentation

Presenter: Chetana Krishnan, University of Alabama at Birmingham

Chetana Krishnan¹, Shah Hussain¹, Denise Stanford¹, Venkata Sthanam¹, Sandeep Bodduluri¹, Steve Rowe¹, Harrison Kim¹

¹University of Alabama at Birmingham, Birmingham, AL, USA

Introduction/Background

Attention mechanisms enhance neural networks by focusing on relevant input features but often have limitations, such as addressing only single or dual aspects and struggling with diverse inputs. Our architecture overcomes these by integrating multiple attention strategies and adaptive embedding, ensuring dynamic, robust feature extraction and improved performance in small region segmentations (SRS). Integrating positional (POS), semantic (SEM), image (IM), cross-spatial (CS), and self-channel attentions (SC) with adaptive embedding will significantly enhance feature extraction and representation, improving accuracy and efficiency in SRS, such as airways in comparison to single/dual attentions.

Methods/Intervention

Non-contrast enhanced in vivo CT scans of the lungs were conducted on 25 ferrets, achieving a spatial resolution of 80 μm. The ferrets were anesthetized with inhaled isoflurane and gated to capture a single inspiratory phase using a μCT scanner (MiLabs, Utrecht, Netherlands). The ground truth airway was determined using a region-growing method. The proposed attention variant network (AVN) incorporates information from other pixels within the image by performing SEM, POS, SC, CS, and IM. AVN inputs feature maps from all locations at different scales and outputs refined feature maps. This approach captures and utilizes correlations between neighboring pixels, leading to more accurate segmentation. Multi-scale feature maps refine the attention mechanism, enabling precise adaptation to image data variations. SEM focuses on important semantic information, POS identifies where vital information is located, IM determines task-relevant regions, CS examines spatial relationships, and SC emphasizes relevant channel features. We trained AVN with 18 scans (over 500 slices per scan), and the model was tested with seven unseen scans. Performance was evaluated using the Dice similarity coefficient (DSC) and Intersection over Union score (IoU). AVN was compared with other popular deep-learning networks.
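
The two evaluation metrics named above can be computed from binary masks as in the following sketch. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equally sized flat lists of 0/1 labels."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_sum = sum(1 for p in pred if p)
    truth_sum = sum(1 for t in truth if t)
    union = pred_sum + truth_sum - inter
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + truth_sum)
    iou = inter / union
    return dice, iou

# Hypothetical five-voxel example.
dice, iou = dice_and_iou([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(dice, iou)
```

For a single pair of masks the two scores are monotonically related (IoU = DSC / (2 - DSC)), so they rank individual segmentations identically; they can differ once averaged over scans.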

Results/Outcome

AVN achieved a higher DSC and exhibited the highest minimum DSC, indicating superior performance.

Conclusion

AVN dynamically captures spatial and channel information to address the challenge of SRS and the limitations of 2D networks.

Statement of Impact

AVN can improve the diagnosis, treatment planning, and airway branch and volume monitoring for clinical lung diseases.

Keywords

Attention; Segmentation; Spatial; Channel

048 - Classification of Vitreomacular Adhesion Types Using Deep Learning Models on Optical Coherence Tomography Images

Presenter: A. Q. M. Sala Uddin Pathan, University of New South Wales

A. Q. M. Sala Uddin Pathan¹, Brughanya Subramanian², Salil S. Kanhere¹, Matthew P. Simunovic³, Rajiv Raman², Maitreyee Roy¹

¹University of New South Wales, Kensington, Australia

²Sankara Nethralaya, Chennai, India

³The University of Sydney Save Sight Institute, Sydney, Australia

Introduction/Background

In recent years, Deep Learning (DL) approaches have received considerable interest in ophthalmology due to their ability to promptly diagnose diseases and aid clinicians in decision-making. Using DL models on Optical Coherence Tomography (OCT) images for detecting and classifying Vitreomacular Adhesion (VMA) is still in the early stages. This research aims to design an automated system to classify two types of VMA, Focal VMA and Broad VMA, from Diabetic Macular Oedema (DME) patients using OCT images.

Methods/Intervention

This retrospective study analyzed 302 OCT images from 202 DME patients collected at Chennai Eye Hospital (January 2015 to June 2022), approved by the Vision Research Foundation Institutional Review Board. Two optometrists graded the images, categorizing 107 as Focal VMA and 195 as Broad VMA. Data augmentation and resampling addressed data imbalance, and the data were normalized and resized. VGG16, InceptionV3, and XceptionNet models, using transfer learning with pre-trained ImageNet weights, classified the VMA types, with 80% of the data used for training and 20% for validation. Grad-CAM was used to visualize the regions of interest that influenced the models' decisions. Model performance was assessed by accuracy, sensitivity, specificity, AUC, and F1-score.

Results/Outcome

All three models performed well. VGG16 achieved 84.19% accuracy with 84% sensitivity, 83% specificity, 84% AUC, and 84% F1-score. InceptionV3 showed slightly better accuracy of 84.40% with 84% sensitivity and specificity, 84% AUC, and 84% F1-score. The XceptionNet model outperformed both with 85% accuracy; its sensitivity, specificity, AUC, and F1-score were 85%, 84%, 85%, and 85%, respectively.

Conclusion

DL models correctly classified Focal VMA and Broad VMA from OCT images. Transfer learning reduced the program execution time. Among all the models, XceptionNet performed slightly better. The DL models utilized in this research show the potential to automate the diagnosis of various vitreomacular interface disorders with higher accuracy and a streamlined diagnostic process.

Statement of Impact

The study demonstrates that deep learning models, particularly the XceptionNet model, can accurately and efficiently classify vitreomacular adhesion types in diabetic macular edema patients using OCT images, achieving up to 85% accuracy. This automation significantly improves diagnostic speed and accuracy, facilitating better treatment planning and clinical workflow efficiency.

Keywords

Vitreomacular adhesion; Deep learning; Focal VMA; Broad VMA

049 - Classifying Common Breast Pain Symptoms for Patients Using a Large Language Model, ChatGPT

Presenter: Hana Haver, Mass General Brigham

Hana Haver¹, Manisha Bahl¹, Maggie Chung²

¹Mass General Brigham, Boston, MA, USA

²University of California, San Francisco, CA, USA

Introduction/Background

Breast pain is a common symptom for which diagnostic imaging evaluation is recommended based on clinical significance according to the American College of Radiology’s (ACR) Appropriateness Criteria. Imaging is not recommended for clinically insignificant breast pain, which is defined as nonfocal, diffuse, or cyclical pain. This study aims to use ChatGPT GPT-4 (March 2023 release, OpenAI) to automate the classification of common breast pain symptoms based on clinical significance.

Methods/Intervention

The authors created a library of 150 breast pain symptoms representing breast pain variants described in the ACR Appropriateness Criteria, including clinically insignificant and significant pain, and non-pain-related clinically significant symptoms (e.g., palpable lump, pathologic nipple discharge). A zero-shot prompt for the LLM was developed to characterize breast concerns as clinically insignificant or clinically significant: “Use the ACR appropriateness criteria for breast pain. Respond with only is this ‘clinically significant breast symptom’ or ‘not clinically significant symptom.’” Each breast symptom was submitted with the prompt in three independent tests in June 2024. Clinical significance was determined by the mode of the three tests and compared to the ground truth, established by radiologist consensus based on the ACR Appropriateness Criteria for breast pain.
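
Taking the mode of three runs and checking run-to-run consistency can be sketched as follows. The symptom strings, shortened labels, and function names are hypothetical and serve only to illustrate the aggregation step.

```python
from collections import Counter

def majority_label(runs):
    """Return the most frequent label among the repeated runs (the mode)."""
    return Counter(runs).most_common(1)[0][0]

def summarize(responses, ground_truth):
    """responses maps each symptom to the labels returned by the repeated runs."""
    final = {s: majority_label(r) for s, r in responses.items()}
    n = len(responses)
    accuracy = sum(final[s] == ground_truth[s] for s in responses) / n
    # A symptom counts as consistent when every run returned the same label.
    consistency = sum(len(set(r)) == 1 for r in responses.values()) / n
    return final, accuracy, consistency

# Hypothetical examples, not items from the study's symptom library.
responses = {
    "diffuse bilateral pain": ["not significant"] * 3,
    "focal pain in one breast": ["significant"] * 3,
    "cyclical pain before menses": ["significant", "not significant", "significant"],
    "pain with palpable lump": ["significant"] * 3,
}
truth = {
    "diffuse bilateral pain": "not significant",
    "focal pain in one breast": "significant",
    "cyclical pain before menses": "not significant",
    "pain with palpable lump": "significant",
}
final, acc, cons = summarize(responses, truth)
print(final["cyclical pain before menses"], acc, cons)
```

With three runs and two possible labels, a strict majority always exists, so the mode is never tied.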

Results/Outcome

ChatGPT GPT-4 assigned the appropriate clinical significance, in agreement with the breast imaging radiologists, in 74.7% (112/150) of breast pain symptoms. ChatGPT GPT-4 correctly identified 89.1% (57/64) of clinically significant breast symptoms. Among instances where the model did not agree with the ground truth, the majority (81.6%; 31/38) were clinically insignificant cases that ChatGPT GPT-4 considered to be clinically significant. All 30 pain symptoms with non-pain-related clinically significant symptoms (e.g., palpable lump, pathologic nipple discharge) were correctly assessed by ChatGPT GPT-4 as clinically significant. Overall, 89.3% (134/150) of LLM-generated results were identical across three independent tests.

Conclusion

We demonstrate the first known potential application of an LLM to classify breast pain symptoms as clinically significant or clinically insignificant.

Statement of Impact

Automating the assessment of breast pain clinical significance prior to patient scheduling could influence decision-making about imaging evaluation, as only clinically significant symptoms would be indicated for imaging evaluation.

Keywords

Large language model; Breast Imaging; Clinical decision support

050 - Classifying, Fast and Slow: Adversarial Training for Bias Mitigation in Medical Imaging

Presenter: Felipe Matsuoka, Faculdade de Ciências Médicas da Santa Casa de São Paulo

Felipe Matsuoka¹, Eduardo Farina², Felipe Kitamura²

¹Faculdade de Ciências Médicas da Santa Casa de São Paulo, São Paulo, Brazil

²UNIFESP, São Paulo, Brazil

Introduction/Background

Ethnicity bias in deep learning models poses significant ethical concerns. Leveraging Daniel Kahneman's dual-process theory in "Thinking, Fast and Slow," which distinguishes between rapid, intuitive System 1 and deliberate, analytical System 2 thinking, we propose an approach to mitigate bias in chest X-ray classification. Previous studies have demonstrated the effectiveness of adversarial methods in reducing bias, such as COVID-19 classification from electronic medical records (Zhang et al., 2021), making this approach both relevant and innovative in medical imaging.

Methods/Intervention

Our methodology employs two complementary models: a predictor model and an adversarial model. The predictor model, akin to System 1, efficiently classifies chest X-rays, identifying normal and abnormal cases. Simultaneously, the adversarial model, similar to System 2, challenges the predictions to reduce bias. We used the CheXpert dataset (Irvin et al., 2019), ensuring a balanced representation of ethnic groups through binary label adjustment and sampling techniques. During training, the adversarial model increases its error in predicting patient ethnicity, forcing the predictor model to focus on unbiased features.

Results/Outcome

Using one-way ANOVA, we assessed the ROC AUC performance across different ethnicities for both models. The baseline model showed no significant differences across ethnicities (p-value = 0.258). Similarly, the adversarial model also exhibited no significant differences (p-value = 0.405). These findings suggest that the adversarial model maintained consistent performance across ethnic groups without introducing additional bias, highlighting the complexity of addressing ethnicity bias in medical imaging.

Conclusion

The adversarial training framework in chest X-ray classification demonstrates an innovative approach to mitigating ethnicity bias. Despite not showing significant performance differences, this study emphasizes the importance of developing methods to address bias in medical imaging. The results suggest that achieving equitable performance across all ethnic groups is challenging, and a potential alternative could involve optimizing models for specific ethnicities.

Statement of Impact

This study introduces a novel adversarial training framework to mitigate ethnicity bias in deep learning models for medical imaging. The ANOVA results indicate that adversarial methods can maintain equitable performance across different ethnic groups. By applying principles from psychology, this research connects theoretical concepts with practical applications, advancing the development of more reliable AI systems in medical diagnostics.

Keywords

Ethnicity Bias; Deep Learning; Medical Imaging; Adversarial Training

051 - Comparative Evaluation of Computationally Efficient and Explainable 1D Brightness Profiles from Axial Projections for Lung Ultrasound Frame Classification

Presenter: Srishti Jain, Boston University

Srishti Jain¹, Umair Khan², Russell Thompson³, Lauren P. Etter⁴, Ingrid Camelo⁵, Rachel C. Pieciak⁶, Ilse Castro-Aragon⁷, Bindu Setty⁷, Christopher C. Gill⁶, Margrit Betke¹

¹Boston University, Boston, MA, USA

²University of Trento, Trento, Italy

³University of Massachusetts-Dartmouth, Dartmouth, MA, USA

⁴University of Wisconsin-Madison, Madison, WI, USA

⁵Augusta University, Augusta, GA, USA

⁶Boston University School of Public Health, Boston, MA, USA

⁷Boston Medical Center, Boston, MA, USA

Introduction/Background

Analyzing lung ultrasound (LUS) data to identify pneumonia in pediatric patients is crucial for providing timely and accurate patient care. Automated solutions developed for this purpose are mostly CNN-based methods, which, while effective, have high computational complexity that limits their wide application in resource-constrained environments. Their lack of transparency also leads to clinician distrust. To address these challenges, our research focuses on a computationally efficient yet explainable method for evaluating LUS data.

Methods/Intervention

Lung consolidations in ultrasound images appear as dark, wedge-shaped areas with mixed textures, a characteristic pattern observed in pneumonia patients. We hypothesize that a compressed 1D representation of the 2D frames can retain the characteristic features of lung consolidations. In the presence of lung consolidations, the intensity values across the axes plummet, as darker regions have lower pixel values. This drop in intensity values adds a valuable characteristic feature to the 1D Brightness Profile (BP) vector that allows a multilayer perceptron (MLP) to discriminate between frames with and without consolidations. Classification of such representations can lead to the development of a computationally efficient and explainable automated solution. As a proof of concept, this study explores a novel LUS frame classification method using 1D BPs. These are obtained by summing pixel values along the y-axis and x-axis. Three types of BP were extracted from LUS frames: fy (sum of pixel values along the y-axis), fx (sum of pixel values along the x-axis), and fy+x (concatenation of fy and fx). These are then fed to separate MLP models, which perform the binary classification task.
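
The three profile types can be sketched as below. One assumption is made explicit: "summing along the y-axis" is read as collapsing the vertical direction, giving one value per column, and "summing along the x-axis" as giving one value per row. The abstract does not state the orientation, and the function name is hypothetical.

```python
def brightness_profiles(frame):
    """Compute 1D brightness profiles from a 2D grayscale frame.

    frame: list of rows, each a list of pixel intensities.
    Returns (f_y, f_x, f_yx), where f_yx is the concatenation of f_y and f_x.
    """
    n_cols = len(frame[0])
    # Assumed convention: f_y sums down each column (along the vertical axis).
    f_y = [sum(row[c] for row in frame) for c in range(n_cols)]
    # Assumed convention: f_x sums across each row (along the horizontal axis).
    f_x = [sum(row) for row in frame]
    return f_y, f_x, f_y + f_x

# Hypothetical 2 x 3 frame.
f_y, f_x, f_yx = brightness_profiles([[1, 2, 3], [4, 5, 6]])
print(f_y, f_x, f_yx)
```

A frame of height H and width W is reduced from H x W values to W, H, or W + H values, which is where the computational saving comes from; a dark consolidation lowers the sums for the rows and columns it covers.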

Results/Outcome

Our findings reveal that fy+x projections give better classification metrics than fy and fx projections. We now have a robust perspective on how different projection axes capture information.

Conclusion

The Brightness Profiles, derived from pediatric pneumonia patients, are tested for their reliability in capturing frame patterns. The study demonstrates that the 1D arrays offer a compressed representation of LUS frames, outperforming existing CNN-based methods in terms of explainability while still offering reliable classification metrics.

Statement of Impact

Brightness Profiles are a simple yet powerful data representation that is computationally efficient and contributes to AI explainability.

Keywords

Brightness Profiles; 1D Projections; Multilayer Perceptron; Computational Efficiency

052 - Comparing Classic to State-of-the-Art Image Features: A Clustering Approach Using Local Binary Patterns and ResNet-18 Features for Lung Ultrasound Video Classification

Presenter: Saunak Bhattacharjee, Boston University

Saunak Bhattacharjee¹, Umair Khan², Russell Thompson³, Lauren P. Etter⁴, Ingrid Camelo⁵, Rachel C. Pieciak⁶, Ilse Castro-Aragon⁷, Bindu Setty⁷, Christopher C. Gill⁶, Margrit Betke¹

¹Boston University, Boston, MA, USA

²University of Trento, Trento, Italy

³University of Massachusetts-Dartmouth, Dartmouth, MA, USA

⁴University of Wisconsin-Madison, Madison, WI, USA

⁵Augusta University, Augusta, GA, USA

⁶Boston University School of Public Health, Boston, MA, USA

⁷Boston Medical Center, Boston, MA, USA

Introduction/Background

Lung ultrasound (LUS) is a valuable non-invasive tool for diagnosing respiratory diseases, and the use of AI to support LUS interpretation has been proposed. Automatically interpreting LUS data is complex and requires advanced techniques to detect abnormalities like lung consolidations, especially with limited labeled datasets for training AI models. This study explores using Local Binary Pattern (LBP) features and features computed by a ResNet-18 model in an unsupervised learning context to classify LUS video frames in an efficient way.

Methods/Intervention

The study used 178 LUS videos from 200 patients. LBP and ResNet-18 features were extracted from each video frame to capture texture information for distinguishing abnormal from normal lung patterns. Both feature sets underwent unsupervised clustering using a k-means clustering approach, with k=2, to identify natural data groupings. The effectiveness of the resulting clusters was assessed by calculating the precision in isolating frames that contained lung consolidations, which was determined by comparing the clusters against clinical data on a frame-by-frame basis.
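
A basic Local Binary Pattern descriptor can be sketched as follows. The abstract does not state which LBP variant was used, so this assumes the simplest form: 8 neighbors at radius 1, no rotation invariance, and a normalized 256-bin histogram as the per-frame feature vector. The function names are hypothetical.

```python
# Neighbor offsets (row, col) in clockwise order starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, r, c):
    """8-bit LBP code of the interior pixel (r, c): one bit per neighbor >= center."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Hypothetical 3 x 3 patch: the four brighter neighbors set the four high bits.
patch = [[1, 2, 3],
         [8, 5, 4],
         [7, 6, 9]]
print(lbp_code(patch, 1, 1))
```

Each frame's histogram would then be one input vector for k-means with k=2. The descriptor needs only integer comparisons, which is the efficiency argument made for LBP over ResNet-18 features.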

Results/Outcome

The analysis showed that the clustering approach based on LBP features achieved a mean overall precision of 83.72%, and based on ResNet-18 features, 88.73% precision. Visual analysis of the resultant clusters revealed that consolidation frames sometimes appeared to form separate distinct clusters of their own, while in other cases, they were interspersed within either one of the two primary clusters. ResNet-18 outperformed LBP features, but the simplicity and efficiency of computing LBP features make them a practical alternative to ResNet-18 features, particularly in resource-limited settings.

Conclusion

Both ResNet-18 and LBP features showed promise as inputs to an unsupervised clustering method for identifying lung consolidations in LUS video frames. Future work will refine these methods to better handle the variability and complexity of medical imaging data by using representative samples from clusters instead of the entire dataset, reducing computational demands and potentially improving generalization by focusing on key data points.

Statement of Impact

This research highlights the potential of traditional and modern techniques in enhancing LUS diagnostics. By addressing current limitations, the study contributes valuable insights into the development of efficient, generalizable AI-based diagnostic tools for respiratory diseases.

Keywords

Lung Ultrasound; Local Binary Pattern (LBP); ResNet-18; Unsupervised Learning

053 - Enhanced Sperm Image Segmentation Using MCFA Unet: Integrating Multi-Channel Feature Extraction and Attention Mechanisms

Presenter: Qiufeng Yi, University of Birmingham

Qiufeng Yi¹, Chenyang Wang¹, Xiazhen Xu¹, Jiaqi Ye¹, Amir Hajiyavand¹

¹University of Birmingham, Birmingham, AL, USA

Introduction/Background

The segmentation of sperm images is critical in the field of assisted reproductive technology (ART). Infertility affects millions globally, and male reproductive health plays a significant role in many of these cases. Sperm quality analysis has therefore become essential for evaluating male fertility. Accurate sperm segmentation is crucial as it enables automated sperm counting, morphological analysis, and motion tracking, significantly enhancing diagnostic accuracy and efficiency.

Methods/Intervention

In this study, we introduced an improved U-Net model, the Multi-Channel Feature Extraction U-Net (MCFA Unet), aimed at enhancing the precision and reliability of sperm segmentation. The U-Net architecture, well-known for its effectiveness in biomedical image segmentation, was adapted with three key enhancements: (1) multi-channel feature extraction, which allows the model to capture a wider range of sperm characteristics and improves segmentation accuracy; (2) advanced data augmentation, which increases the variety of training images and makes the model more robust to variations in sperm images; and (3) an improved loss function that combines Dice loss and cross-entropy loss for more precise segmentation. We trained and tested the MCFA Unet on subset B of the Sperm Video Image Analysis (SVIA) dataset, which provided a comprehensive set of annotated sperm images for robust evaluation.
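
A loss that combines Dice loss and cross-entropy loss can be sketched as below for the binary case. The equal weighting, the smoothing constant, and the function name are assumptions; the abstract does not say how the two terms were combined.

```python
import math

def dice_ce_loss(probs, targets, ce_weight=0.5, eps=1e-7):
    """Weighted sum of binary cross-entropy and soft Dice loss.

    probs: predicted foreground probabilities in [0, 1], flattened.
    targets: ground-truth labels (0 or 1), flattened.
    ce_weight: assumed mixing weight; 0.5 weights both terms equally.
    """
    inter = sum(p * t for p, t in zip(probs, targets))
    soft_dice = (2 * inter + eps) / (sum(probs) + sum(targets) + eps)
    dice_loss = 1.0 - soft_dice
    # Clamp probabilities away from 0 so the logarithm stays finite.
    ce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(probs, targets)
    ) / len(probs)
    return ce_weight * ce + (1 - ce_weight) * dice_loss

targets = [1, 0, 1]
print(dice_ce_loss([1.0, 0.0, 1.0], targets))  # near zero for a perfect prediction
print(dice_ce_loss([0.5, 0.5, 0.5], targets))  # larger for an uninformative prediction
```

The cross-entropy term scores every pixel independently, while the Dice term scores overlap of the foreground as a whole, which is commonly the motivation for combining them when the foreground is small.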

Results/Outcome

Our experiments on subset B of the SVIA dataset showed that the MCFA Unet significantly outperformed traditional models in sperm image segmentation. The key performance metrics were a Dice coefficient of 91.27, indicating a high overlap between the predicted segmentation and the ground truth, and a Jaccard coefficient of 84.14, which measures the similarity between the segmented results and the actual sperm cells. These results demonstrate the high precision and reliability of the MCFA Unet model, attributed to its enhanced feature extraction capabilities.

Conclusion

The MCFA Unet model offers a significant improvement in the precision and reliability of sperm image segmentation compared to traditional methods. This enhancement has substantial implications for the automation and accuracy of sperm quality analysis in ART, reducing the dependency on manual analysis and making the process faster and less error-prone.

Statement of Impact

By improving sperm segmentation accuracy, the MCFA Unet model contributes to better diagnostic tools in assessing male reproductive health.

Keywords

Segmentation; Sperm Analysis; Attention Mechanisms; Deformable Convolutions

054 - Enhancing Radiology Report Comprehension: A Study on GPT-4's Identification of Key Radiological Terms

Presenter: Jad Alsheikh, Creighton University School of Medicine

Jad Alsheikh¹, Ali Memon¹, Daniel Spalinski¹, Kimberly Mendez², Sherif Zineldine¹, Dorina Pinkhasova¹, Michael Fei¹

¹Creighton University School of Medicine, Omaha, NE, USA

²Baylor College of Medicine, Houston, TX, USA

Introduction/Background

This study evaluates GPT-4's accuracy in identifying common radiological terms from chest X-ray (CXR) reports, comparing it to terms derived from actual radiology reports. The objective is to assess GPT-4's potential in creating a database for highlighting and defining medical terms to aid patient comprehension.

Methods/Intervention

This was a retrospective analysis of CXR reports. Two lists of the top 40 most common radiological CXR findings and phrases were generated. The first list was derived from 3,999 reports from the Open-i service of the NLM, covering a wide array of pathologies. The second list was generated by GPT-4, identifying what it believed to be the 40 most common findings and phrases. We compared GPT-4’s performance against terms derived from the sample reports by analyzing the overlap between the two lists and assessing the frequency of GPT-4 terms in actual reports, considering exact and similar matches. Additionally, we evaluated how well GPT-4 can account for variations in radiologist terminology by examining the coefficient of variation (CV) for the term frequencies.

Results/Outcome

GPT-4 demonstrated the ability to identify frequently used terms such as "effusion" and "pneumothorax" with high accuracy. Terms with high exact match proportions included "pleural effusion" (100%), "pulmonary edema" (100%), and "pneumothorax" (73.7%). The precision, recall, and F1 score were all 0.30, indicating moderate overlap between the terms identified by GPT-4 and those derived from the CXR reports. The Spearman's rank correlation was 0.32, suggesting a weak correlation between the ranks of term frequencies in GPT-4’s list and the actual reports. The Chi-Square Test (chi2=840.00, p=7.53e-04, dof=714) indicated that the differences between the observed frequencies of terms in actual reports and those identified by GPT-4 were statistically significant.
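
The overlap between two term lists can be scored as in the sketch below, which treats the report-derived list as the reference. The exact-match, case-insensitive comparison and the function name are assumptions; the example terms are hypothetical.

```python
def overlap_metrics(reference_terms, candidate_terms):
    """Precision, recall, and F1 of a candidate term list against a reference list."""
    ref = {t.lower() for t in reference_terms}
    cand = {t.lower() for t in candidate_terms}
    hits = len(ref & cand)
    precision = hits / len(cand) if cand else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical four-term lists.
reference = ["Effusion", "pneumothorax", "cardiomegaly", "atelectasis"]
candidate = ["effusion", "pneumothorax", "nodule", "consolidation"]
print(overlap_metrics(reference, candidate))
```

When both lists contain the same number of unique terms, precision, recall, and F1 are necessarily equal, which is consistent with a single value of 0.30 being reported for all three with two lists of 40 terms.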

Conclusion

GPT-4 demonstrated reasonable accuracy in identifying common radiological terms in CXR reports and can effectively account for variations in terminology. While it successfully identified frequently used terms, its performance varied for less common terms.

Statement of Impact

This study underscores the potential of GPT-4 in enhancing patient understanding of radiological reports by providing a reliable database of terms. Incorporating AI tools like GPT-4 could improve patient communication and engagement in radiology, ultimately contributing to better healthcare outcomes.

Keywords

GPT-4; Chest X-ray Reports; Natural Language Processing; Term Identification

055 - Evaluating Performance and Environmental Impact: A Comparative Study of Large Language Models

Presenter: Sanaz Vahdati, Mayo Clinic - Rochester

Sanaz Vahdati¹, Bardia Khosravi¹, Bradley J. Erickson¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

Leveraging artificial intelligence (AI) for diagnostic purposes has shown promise in enhancing patient care while also raising concerns about environmental sustainability. Large language models (LLMs) are increasingly impacting the medical field by automating complex tasks and facilitating a deeper understanding of vast datasets, thus revolutionizing the approach to patient care and medical research. This study focuses on the extraction of acute cervical spine fractures from radiology reports using open-source LLMs juxtaposed with an analysis of their associated carbon emissions.

Methods/Intervention

We randomly acquired radiology reports from 1000 non-contrast cervical spine CT scans conducted between January and February 2022. After prompt optimization on 110 reports, the remaining 890 served as a test dataset to assess model performance. The models were tasked with indicating the presence or absence of an acute cervical vertebral fracture. We calculated the carbon emissions generated by running two models, Zephyr Alpha 7 Billion and Llama 3 70 Billion, using a carbon tracker package to assess the environmental impact of LLM operations.

Results/Outcome

The Zephyr 7B model achieved an accuracy of 94% for extracting acute cervical spine fractures, and Llama 3 70B obtained an accuracy of 92% for this task. The sensitivity and specificity were 0.97 and 0.94 for Zephyr 7B and 0.97 and 0.91 for Llama 3 70B, respectively. The carbon emission analysis estimated that inference with the Zephyr model used 0.42 kWh of electricity, contributing 0.145 kg of CO2eq, and that inference with the Llama 3 model used 1.17 kWh of electricity, contributing 0.42 kg of CO2eq.
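
The reported energy and emission figures can be related with simple arithmetic, as in this sketch. It only divides the reported numbers; it does not reproduce how the carbon tracker package arrives at its estimates, and the function name is hypothetical.

```python
def implied_carbon_intensity(kg_co2eq, energy_kwh):
    """Emissions per unit of electricity (kg CO2eq per kWh) implied by a reported pair."""
    return kg_co2eq / energy_kwh

# Figures as reported in the Results above.
reported = {
    "Zephyr 7B": {"energy_kwh": 0.42, "kg_co2eq": 0.145},
    "Llama 3 70B": {"energy_kwh": 1.17, "kg_co2eq": 0.42},
}

for name, r in reported.items():
    intensity = implied_carbon_intensity(r["kg_co2eq"], r["energy_kwh"])
    print(name, round(intensity, 3), "kg CO2eq per kWh")

energy_ratio = reported["Llama 3 70B"]["energy_kwh"] / reported["Zephyr 7B"]["energy_kwh"]
print("energy ratio:", round(energy_ratio, 2))
```

Both runs imply a similar carbon intensity of roughly 0.35 kg CO2eq per kWh, so the difference in emissions is driven by the larger model's energy use, which is about 2.8 times that of the smaller model on the same 890 reports.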

Conclusion

In this study, we compared the LLMs' performance and their environmental impact. We demonstrate the potential of achieving high performance using a smaller model size, which can lead to a more environmentally sustainable application of LLMs. We highlight the trade-offs between efficiency and environmental impact in the deployment of AI in medical settings.

Statement of Impact

Our findings advocate for a balanced approach to adopting AI technologies, considering their medical benefits and ecological footprints. Future work should explore optimization techniques to reduce the energy consumption of AI systems without compromising their performance, thereby aligning AI advancements with sustainable healthcare practices.

Keywords

Artificial Intelligence; Large Language models; Sustainability

056 - Evaluating TotalSegmentator for Muscle and Fat Segmentation in Patients with Ascites

Presenter: Tiffany Wei, National Institutes of Health

Tiffany Wei¹, Benjamin Hou¹, Tejas S. Mathai¹, Jianfei Liu¹, Ronald M. Summers¹, Zhiyong Lu¹

¹National Institutes of Health, Bethesda, MD, USA

Introduction/Background

TotalSegmentator (TS) is a public CT segmentation tool that can segment 117 anatomical structures. The dataset it was trained on was randomly sampled and contained patients with certain abnormality types (e.g., inflammation, trauma, bleeding). However, it is not known if patients with fluid retention, such as ascites, were excluded. Excess fluid in the peritoneal cavity (often seen in patients with liver fibrosis) can have visual similarities to visceral fat and can pose a challenge for TS, thereby impacting its segmentation performance. This study determines if TS over-segments muscle and fat (subcutaneous and visceral) into regions of ascites.

Methods/Intervention

285 CT scans of 140 female patients from the public TCGA-OV-AS dataset were used. This dataset contained only ascites labels that were manually annotated at the voxel level (all slices, all volumes) and no other organ labels. TS segmented the muscle and fat (subcutaneous and visceral) regions in these volumes, and its outputs were compared against the manual ascites labels to determine any over-segmentations. In this context, a lower Dice score is desirable as it signifies less overlap between ascites and the structure segmented by TS.
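
Measuring how much of a TS structure spills into the ascites label can be sketched as an overlap volume in milliliters. The flattened-mask representation, voxel spacing, function names, and default threshold argument are assumptions for illustration.

```python
def overlap_volume_ml(structure_mask, ascites_mask, spacing_mm):
    """Volume (mL) of voxels labeled both as the structure and as ascites.

    structure_mask, ascites_mask: equally sized flat lists of 0/1 voxel labels.
    spacing_mm: (x, y, z) voxel spacing in millimeters.
    """
    # 1 mL = 1000 mm^3
    voxel_ml = spacing_mm[0] * spacing_mm[1] * spacing_mm[2] / 1000.0
    n_overlap = sum(1 for s, a in zip(structure_mask, ascites_mask) if s and a)
    return n_overlap * voxel_ml

def is_significant_oversegmentation(volume_ml, threshold_ml=50.0):
    """Flag a scan whose overlap volume exceeds the threshold."""
    return volume_ml > threshold_ml

# Hypothetical six-voxel example with 1 x 1 x 5 mm voxels.
muscle = [1, 1, 1, 0, 1, 0]
ascites = [1, 0, 1, 1, 1, 0]
vol = overlap_volume_ml(muscle, ascites, (1.0, 1.0, 5.0))
print(vol, is_significant_oversegmentation(vol))
```

Because the dataset carries only ascites labels, any voxel shared by a TS output and the ascites label is by construction an over-segmentation, so both the Dice score and the overlap volume should ideally be zero.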

Results/Outcome

TS over-segmented muscle most often: muscle had the highest mean Dice score against the ascites labels (0.00965 ± 0.0157), followed by visceral fat (0.00691 ± 0.0152), then subcutaneous fat. Significant over-segmentation of muscle into ascites (44.45 ± 97.11 mL), defined as exceeding 50 mL, was seen in 20 out of 285 scans.

Conclusion

TS is generally capable of accurately segmenting visceral fat, subcutaneous fat, and muscle in patients with ascites. However, it must be used with caution as significant over-segmentations can affect body composition measurements.

Statement of Impact

For population-based studies and opportunistic screening, body composition measurements (e.g., muscle and fat volume/attenuation) can be correlated with underlying disease conditions (e.g., Diabetes). They play a critical role in early interventions and patient management.

Keywords

Ascites; CT; Segmentation; Muscle

057 - Utility of Fully Automated Liver and Spleen Biomarkers for Staging Hepatic Fibrosis in CT

Presenter: Tejas S. Mathai, National Institutes of Health

Sydney V. Lewis¹, Tejas S. Mathai¹, Meghan G. Lubner², Perry J. Pickhardt², Ronald M. Summers¹

¹National Institutes of Health, Bethesda, MD, USA

²University of Wisconsin-Madison, Madison, WI, USA

Introduction/Background

Liver fibrosis can be caused by metabolic disorders (e.g., obesity, Diabetes), alcoholism, or Hepatitis B/C virus. While earlier fibrosis stages are reversible, later stages (advanced fibrosis and cirrhosis) are irreversible. Notably, cirrhosis is the 12th leading cause of death in the US. Biopsies are the gold standard for staging, but they are invasive and prone to sampling error. Consequently, there is a need for non-invasive CT-based biomarkers to distinguish early fibrosis (F0 – F2) from later stages (F3 – F4).

Methods/Intervention

372 patients underwent CT imaging at Institution-A with fibrosis (METAVIR) confirmed through biopsy. An automated deep learning-based model segmented the full liver, 8 liver Couinaud segments, and spleen. Using Couinaud segments 2 and 3, another fully automated technique computed the liver surface nodularity (LSN) score (defined as the smoothness of liver surface). Additionally, CT-based biomarkers, such as volume and attenuation, were also calculated for the full liver, 8 segments, and spleen. Liver Segmental Volume Ratio (LSVR) was also calculated as the sum of the volumes of segments 1 – 3 divided by that of segments 4 – 8. The dataset was divided into 80% training (n = 297) and 20% testing (n = 75) set. Univariate and multivariate logistic regression models were trained to stage fibrosis using the biomarkers and LSN. An AUC below 0.6 was considered clinically ineffective.
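The LSVR definition above reduces to a simple ratio. The sketch below assumes segment volumes are already available as a mapping from Couinaud segment number to volume; the function name and the example volumes are hypothetical.

```python
def liver_segmental_volume_ratio(segment_volumes):
    """LSVR = (volume of segments 1-3) / (volume of segments 4-8).

    segment_volumes: dict mapping Couinaud segment number (1..8) to volume,
    all in the same unit (e.g., mL).
    """
    numerator = sum(segment_volumes[s] for s in (1, 2, 3))
    denominator = sum(segment_volumes[s] for s in (4, 5, 6, 7, 8))
    if denominator <= 0:
        raise ValueError("segments 4-8 must have a positive total volume")
    return numerator / denominator


# Hypothetical volumes in mL, for illustration only.
volumes = {1: 30, 2: 120, 3: 150, 4: 200, 5: 250, 6: 200, 7: 250, 8: 300}
print(liver_segmental_volume_ratio(volumes))
```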

Results/Outcome

The best univariate models used spleen volume (Cirrhosis AUC = 0.829, Advanced Fibrosis AUC = 0.805) and LSN (Cirrhosis AUC = 0.766, Advanced Fibrosis = 0.695). The best multivariate model for predicting cirrhosis included LSVR, spleen volume, and segmental volume proportions (AUC = 0.927). For advanced fibrosis, the best multivariate model included LSVR, spleen volume, and automated LSN (AUC = 0.839).

Conclusion

The best multivariate models for staging liver fibrosis included LSVR, spleen volume, segmental volume proportions, and automated LSN. The addition of automated LSN score had the greatest impact for prediction of advanced fibrosis.

Statement of Impact

For population-based studies and opportunistic screening, non-invasive CT-based biomarkers may be clinically useful in differentiating advanced fibrosis and cirrhosis from earlier stages. They play a critical role in early interventions to reverse fibrosis and improve patient care.

Keywords

CT; Liver Fibrosis; Cirrhosis; Liver Segmental Volume Ratio

058 - Generating Structured Radiology Reports of Chest Radiographs using Retrieval Augmented Generation

Presenter: Yash S. Saboo, University of Texas at Austin

Yash S. Saboo¹, Aaron Fanous², Kal L. Clark³

¹University of Texas at Austin, Austin, TX, USA

²Stanford School of Medicine, Palo Alto, CA, USA

³University of Texas Health San Antonio, San Antonio, TX, USA

Introduction/Background

Radiologist workload has increased over the past decade, increasing burnout and the risk of diagnostic inaccuracies. Artificial intelligence (AI) algorithms have been developed for tasks such as disease detection, image segmentation, and impression-generation. However, much work remains in using AI to generate comprehensive radiology reports. The purpose of this study is to develop a retrieval-augmented generative AI model that accurately generates radiology reports of chest radiographs (CXRs).

Methods/Intervention

We trained the DenseNet-121 model on 13964 CXRs from the VinDr-CXR dataset to classify the CXRs into seven classes: aortic enlargement, cardiomegaly, interstitial lung disease, lung opacity, pleural effusion, pneumothorax, no finding. We then used the trained DenseNet-121 model as an encoder to generate embeddings for 159,970 CXRs from the Medical Information Mart for Intensive Care Chest X-ray JPG (MIMIC-CXR-JPG) dataset. The embeddings were stored in a vector database. We generated reports for a separate test set of 59 CXRs from MIMIC-CXR-JPG using similarity search, where we compared the vector embedding of each of the 59 test CXRs with the embeddings in the vector database. The most similar embedding in the vector database for each of the 59 CXRs was identified using cosine similarity, and the most-similar embedding’s associated report was retrieved and restructured into six distinct sections (cardiomediastinum, pleural space, lungs, bones, hardware, other) using Generative Pretrained Transformer (GPT) 4. These restructured reports were recommended as the reports for the 59 CXRs, respectively.
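The retrieval step can be illustrated with a small nearest-neighbor search by cosine similarity. This is a sketch under stated assumptions: the in-memory list stands in for the vector database, the three-dimensional embeddings and report identifiers are invented, and the GPT-4 restructuring step is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def retrieve_most_similar(query_embedding, database):
    """Return the report id whose stored embedding is most similar to the query.

    database: list of (report_id, embedding) pairs.
    """
    best = max(database, key=lambda item: cosine_similarity(query_embedding, item[1]))
    return best[0]


# Hypothetical embeddings; real encoder outputs would be much longer vectors.
database = [
    ("report_a", [1.0, 0.0, 0.0]),
    ("report_b", [0.0, 1.0, 0.0]),
    ("report_c", [0.7, 0.7, 0.0]),
]
print(retrieve_most_similar([0.9, 0.8, 0.0], database))
```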

Results/Outcome

On the test set of 59 CXRs, the model achieved a median BLEU score of 0.0701, median BERT score of 0.216, median CheXbert score of 0.227, and median RadCliQ score of 1.713. Additionally, a board-certified radiologist assigned a RADPEER Score, ranging from 1 to 3, to each section (cardiomediastinum, pleural space, lungs, bones, hardware, other) of the 59 AI-generated reports. Averaging across all sections, the model achieved a RADPEER Score of 1.548.

Conclusion

This retrieval-augmented generative AI model has the potential to assist radiologists with generating structured radiology reports.

Statement of Impact

This approach of generating structured radiology reports on CXRs may increase workflow efficiency and reduce radiologist burnout.

Keywords

Generative AI; Retrieval Augmented Generation; Natural Language Processing; Large Language Models

059 - Implementation of U-Net Deep Learning Model in SPECT Myocardial Perfusion Image Segmentation

Presenter: Ahmad Alenezi, Kuwait University

Ahmad Alenezi¹, Ali Mayya², Mahdi Alajmi³, Hamad Alhamad¹

¹Kuwait University, Kuwait City, Kuwait

²Tishreen University, Latakia, Syria

³Ministry of Health Kuwait, Kuwait City, Kuwait

Introduction/Background

Myocardial perfusion imaging (MPI) is a type of single photon emission computed tomography (SPECT) imaging that is performed to evaluate patients with suspected or documented coronary artery disease (CAD), whose detection and diagnosis are complex and require accurate and precise image processing (2). Processing and segmentation should be done accurately to provide an accurate diagnosis. Many problems may arise from segmentation issues, leading to difficulties in diagnosis (5). Machine learning (ML) algorithms with superior performance have been developed to overcome segmentation problems (7). To provide accurate segmentation, this study used a deep learning (DL) algorithm called U-Net for image segmentation in MPI.

Methods/Intervention

One thousand one hundred patients who had an MPI study were collected from the PACS system at Al Jahra Hospital between 2015 and 2024. To train the U-Net model, 100 studies were segmented by different nuclear medicine (NM) experts to provide ground truth (i.e., gold-standard coordinates). To assess the performance of the model, multiple validation measures (i.e., accuracy, precision, intersection over union (IOU), recall, and F1 score) were utilized after breaking down the main dataset into a training set (n = 100 images) and validation subsets (n = 900 images).

Results/Outcome

A dataset of 4560 images and 4560 masks was obtained, and both a holdout split and k-fold cross-validation (k = 5) were utilized. Both cross-entropy and Dice score were also utilized. The findings indicate that the best case corresponded to the holdout split scenario with a cross-entropy loss function, with a test accuracy of 98.9%, a test IOU of 89.5%, and a test Dice coefficient of 94%. The k-fold scenario was more balanced between true positive rate and false positive rate. The results of U-Net segmentation were not significantly different from those produced by an expert nuclear medicine technologist (p = 0.1).
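For a single pair of binary masks, the Dice coefficient and IOU are linked by Dice = 2 x IOU / (1 + IOU). The sketch below (function names are illustrative) uses that identity as a sanity check on the reported figures. The identity is exact per mask pair and only approximate for metrics averaged over a test set, so agreement to within rounding is all that should be expected.

```python
def dice_from_iou(iou):
    """Dice coefficient implied by an IOU value for one pair of binary masks."""
    return 2.0 * iou / (1.0 + iou)


def iou_from_dice(dice):
    """IOU implied by a Dice coefficient for one pair of binary masks."""
    return dice / (2.0 - dice)


# A test IOU of 89.5% implies a Dice of roughly 0.945, in line with the reported 94%.
print(round(dice_from_iou(0.895), 3))
```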

Conclusion

The results show that the U-Net model provides a solution for segmentation problems, allowing better diagnosis and subsequent accurate reporting.

Statement of Impact

This research demonstrates that the U-Net deep learning algorithm significantly enhances MPI segmentation accuracy, aligning closely with expert evaluations and promising improved diagnostic precision for CAD.

Keywords

Artificial intelligence; Deep Learning; SPECT; Myocardial Perfusion

060 - Improved Osteoporosis Prediction in Breast Cancer Patients Using a Novel Semi-Foundational Deep Learning Model

Presenter: Katherine Q. Tibbets, USF Morsani College of Medicine

John D. Mayfield¹, Katherine Q. Tibbets², Aziz Rehman², Millena Levin², Dayna Goltz², Neelesh Prakash²

¹Massachusetts General Hospital, Boston, MA, USA

²USF Morsani College of Medicine, Tampa, FL, USA

Introduction/Background

Small cohorts of certain disease states are common, especially in medical imaging. Despite the growing culture of data sharing, information safety often precludes open sharing of these datasets for creating generalizable machine learning models. To overcome this barrier and maintain proper health information protection, foundational models are rapidly evolving to provide deep learning solutions that have been pretrained on the native feature spaces of the data. Although this has been optimized in Large Language Models (LLMs), there is still a scarcity of foundational models for computer vision tasks.

Methods/Intervention

In this space, we investigated pretraining a Visual Geometry Group (VGG)-16 network on an unrelated dataset of 8,500 chest CTs; the network was subsequently fine-tuned to classify bone mineral density (BMD) in 200 breast cancer patients using the L1 vertebra on CT.

Results/Outcome

This semi-foundational model showed significantly improved ternary classification into mild, moderate, and severe demineralization in comparison to ground truth Hounsfield Unit (HU) measurements in trabecular bone. For the 20% holdout testing set, the AUC was 0.92 (p-value < 0.05, ANOVA versus no pretraining versus ImageNet transfer learning) and F1-score 0.84 (p-value < 0.05).

Conclusion

In this study, the use of a semi-foundational model trained on the native feature space of CT provided improved classification in a completely disparate disease state with different window levels.

Statement of Impact

Future implementation with these models may provide better generalization despite smaller numbers of a disease state to be classified.

Keywords

Foundational Models; Machine Learning; Artificial Intelligence; Osteoporosis

061 - Machine Learning Clustering of Qualitatively Assessed Lung Computed Tomography Scans to Distinguish Nonhuman Primate Models of Respiratory Virus Infection: A Pilot Study

Presenter: Edmond Adib, National Institutes of Health

Edmond Adib¹, Shiva Singh¹, Marcelo Castro¹, Mark Rustad¹, Winston Chu¹, Maryam Homayounieh¹, Gabriella Worwa¹, Daniel Chertow¹, Reed Johnson¹, Michael Holbrook¹, Yu Cong¹, Ian Crozier², Ashkan Malayeri¹, Jeffrey Solomon²

¹National Institutes of Health, Bethesda, MD, USA

²Leidos Biomedical Research, Frederick, MD, USA

Introduction/Background

We explore unsupervised machine learning (ML) to analyze expert-generated qualitative assessments of lung computed tomography (CT) scans to differentiate nonhuman primate (NHP) models of experimental respiratory virus infections.

Methods/Intervention

We utilized CT scans from four distinct experiments evaluating NHP infection models after cowpox virus (CPV), influenza A virus (IAV; with and without superimposed methicillin-resistant staphylococcal [MRSA] exposure), Nipah virus (NiV), and SARS-CoV-2 exposures. N=19 subjects with imaging abnormality across multiple time points were selected. While CT protocols were controlled, NHP species, age, weight, and dose/route of inoculation varied across experimental groups. Using a standardized evaluation questionnaire, a radiology specialist qualitatively graded each CT lung-lobe. Features were one-hot encoded, and Uniform Manifold Approximation and Projection (UMAP) was applied for dimensionality reduction followed by k-means clustering.
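The one-hot encoding step can be sketched as below. The questionnaire items and grade levels are hypothetical placeholders, since the abstract does not list them; the UMAP and k-means steps that follow in the pipeline are not shown.

```python
def one_hot_encode(records, categories):
    """Turn categorical grades into 0/1 feature vectors.

    records: list of dicts mapping feature name -> graded level.
    categories: dict mapping feature name -> ordered list of allowed levels.
    Each feature contributes one column per allowed level.
    """
    encoded = []
    for record in records:
        row = []
        for feature, levels in categories.items():
            if record[feature] not in levels:
                raise ValueError(f"unexpected level for {feature}: {record[feature]}")
            row.extend(1 if record[feature] == level else 0 for level in levels)
        encoded.append(row)
    return encoded


# Hypothetical questionnaire items for one lung lobe.
categories = {
    "ground_glass": ["none", "mild", "severe"],
    "consolidation": ["absent", "present"],
}
lobes = [{"ground_glass": "mild", "consolidation": "present"}]
print(one_hot_encode(lobes, categories))
```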

Results/Outcome

A UMAP plot demonstrates the grouping of CT scan qualitative features from different NHP models into clusters, revealing key insights:

• Number of clusters: Six-cluster analysis generated the highest silhouette score (SS = 0.575). Two clusters express peak vs. non-peak disease and another contains an individual (subject 12) which had pre-existing lung abnormality.

• Exposure clustering: CT abnormalities from IAV +/- MRSA and SARS-CoV-2 virus models cluster together, suggesting similar qualitative lung features across these models. CT abnormalities after CPV exposure consistently clustered separately, suggesting distinct qualitative features in this model.

• Longitudinal variability: After a specific viral exposure (e.g., IAV + MRSA), peak CT abnormality (cluster 0 = blue) clustered distinctly versus non-peak (cluster 3 = red) abnormality, consistent with the expected time-series analysis.
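The silhouette score used to choose the number of clusters can be computed as in the following sketch. It assumes Euclidean distance on the embedded points and assigns a score of 0 to singleton clusters, a common convention; the toy points and labels are invented and unrelated to the study data.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette requires at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster
        # (the point's zero distance to itself is included in the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two well-separated toy groups: a sensible labeling scores higher than a mixed one.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))
print(silhouette_score(points, [0, 1, 0, 1]))
```

In practice the score would be evaluated for several candidate cluster counts and the count with the highest value retained.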

Conclusion

Lung lesion phenotypes likely vary across viral infections, routes of inoculation, dose and other factors. Using a radiologist's lobe-based qualitative assessment, ML methods can effectively distinguish differences. Future efforts with more subjects will explore fully automated methods (e.g. lung segmentation, radiomic feature extraction) as input to machine learning-based classification.

Statement of Impact

Differentiating qualitative CT lung abnormality across NHP models of viral infections provides initial proof-of-principle that motivates future ML approaches based on user-independent radiomic feature analysis.

Keywords

SARS-CoV-2; Nipah virus; Cowpox virus; Influenza virus

062 - Natural Language Processing for Automated Correlation of Radiology and Pathology Reports in Prostate Cancer Detection

Presenter: Anthony T. Wu, University of California, Irvine

Anthony T. Wu¹, Gavin Shu¹, Ryan O'Connell¹, Peter Chang¹, Robert Edwards¹, Sungmee Park¹, Roozbeh Houshyar¹

¹University of California, Irvine, Irvine, CA, USA

Introduction/Background

Prostate cancer (PCa) is the second most common cancer among men and has the second highest mortality rate. With an aging U.S. population, the demand for accurate PCa detection is outpacing the diagnostic radiology workforce, particularly in remote areas. Though data-driven self-improvement in radiologist PCa detection can help alleviate workforce constraints, automated pathology correlation to prostate MRI reports (RRs) is currently lacking, limiting this approach’s feasibility. Herein we propose a novel natural language processing algorithm (NLP) for automated radiology-pathology report correlation using a 12-core biopsy template (12cBT) in real time.

Methods/Intervention

Radiology reports (RRs) and their corresponding pathology reports (PRs) from UCI Health (October 2013-October 2023) were retrieved from a HIPAA-compliant data warehouse, totaling 1162 pairs across 1093 patients. A random 10% subset was labeled by medical students and verified by physicians. RRs and PRs were analyzed separately. Regular expressions extracted lesion locations and PI-RADS scores from RRs, while core biopsy regions and Gleason scores were extracted from PRs. A custom spell-check addressed domain-specific errors. Lesions were mapped to 12cBTs, and NLP mapping performance was evaluated on the test set.
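A regular-expression extraction of this kind might look like the sketch below. The patterns, function names, and sample sentences are assumptions made for illustration and are not the study's actual expressions; real reports would need broader patterns plus the spell-check step described above. The significance thresholds follow those stated in the results (PI-RADS of at least 3, Gleason of at least 3+3).

```python
import re

# Hypothetical patterns; they cover only simple phrasings such as
# "PI-RADS 4", "PIRADS score: 2", and "Gleason score 3+4".
PIRADS_PATTERN = re.compile(
    r"PI-?RADS\s*(?:v2(?:\.1)?\s*)?(?:score|category)?\s*[:=]?\s*([1-5])\b",
    re.IGNORECASE,
)
GLEASON_PATTERN = re.compile(
    r"Gleason\s*(?:score)?\s*[:=]?\s*([1-5])\s*\+\s*([1-5])",
    re.IGNORECASE,
)


def extract_pirads(report_text):
    """Return all PI-RADS scores found in a radiology report, in order."""
    return [int(score) for score in PIRADS_PATTERN.findall(report_text)]


def extract_gleason(report_text):
    """Return all (primary, secondary) Gleason patterns found in a pathology report."""
    return [(int(p), int(s)) for p, s in GLEASON_PATTERN.findall(report_text)]


def has_significant_pirads(report_text, threshold=3):
    """True if any extracted PI-RADS score meets the threshold."""
    return any(score >= threshold for score in extract_pirads(report_text))


radiology = "Lesion 1: left peripheral zone mid, PI-RADS 4. Lesion 2: right apex, PIRADS score: 2."
pathology = "Left mid: adenocarcinoma, Gleason score 3+4=7."
print(extract_pirads(radiology), has_significant_pirads(radiology))
print(extract_gleason(pathology))
```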

Results/Outcome

The NLP achieved 97.4% accuracy in detecting significant PI-RADS (≥3) in RRs and 100% accuracy in detecting significant Gleason scores (≥3+3) in PRs. Mapping of 12cBT regions for RRs and PRs yielded 89.6% and 89.4% overall accuracy, respectively. PI-RADS v2 demonstrated 60.8% accuracy in detecting PCa, with 12cBT regional sensitivities shown.

Conclusion

Our NLP system effectively mapped radiology reports (RRs) to pathology reports (PRs) in near-real time. To our knowledge, this is the first instance of an automated radiology-pathology correlation for PCa. In addition, we found that while radiologists faced challenges in pinpointing exact PCa locations, they were able to detect the general region of PCa with relatively high sensitivity.

Statement of Impact

Our tool provides radiologists with a means for self-improvement by delivering feedback as soon as pathology reports are available if integrated with hospital electronic health record systems. Additionally, we assessed the performance of PI-RADS v2 at UCI Health in detecting PCa across over 1000 reports and patients.

Keywords

Natural Language Processing; Prostate Cancer Detection; Radiology Reports; 12-core Pathology Reports

063 - Preliminary evaluation of the state-of-the-art large language models in processing reports from the American Association of Physicists in Medicine

Presenter: Hossein Jafarzadeh, McGill University

Hossein Jafarzadeh¹, Jonathan Kalinowski¹, Farhood Farahnak¹²³, Shirin A. Enger¹²³

¹McGill University, Montreal, Quebec, Canada

²Lady Davis Research Institute, Montreal, Quebec, Canada

³Jewish General Hospital, Montreal, Quebec, Canada

Introduction/Background

Reports from the American Association of Physicists in Medicine (AAPM) contain consensus guidelines and tabulated reference data essential for daily clinical tasks in radiology and radiotherapy. A chatbot capable of accurately answering questions regarding the reports would facilitate compliance with the AAPM guidelines. Retrieval-Augmented Generation (RAG) allows large language models (LLMs) to find the answer to questions from large amounts of text due to their context comprehension, attention mechanisms, and reasoning abilities. We evaluated Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o on answering technical questions from two AAPM reports using human evaluation.

Methods/Intervention

Out of 259 AAPM reports, reports number 233 (TG233) and 084S were chosen for system evaluation, totaling 90 PDF pages. The PDFs were converted to text, and tables and images were extracted using the available APIs for each system. For each report, 5 technical questions were designed, and the models were asked to find the correct answers within the text. Finally, two graduate students with accredited training in medical radiation physics evaluated the models' responses and scored them from 1 to 5 based on accuracy and conciseness.

Results/Outcome

Gemini and ChatGPT scored 3.7 ± 1.4 and 2.7 ± 1.3 out of 5, respectively, in the human evaluation, showing Gemini's superiority. Gemini's responses averaged 144 ± 72 words, shorter and more concise than ChatGPT's 233 ± 87 words, though both were much longer than the ground truth (40 ± 14 words). Human evaluators noted that both models' answers were often verbose and inaccurate when questions required an understanding of relevant physics.

Conclusion

In conclusion, this experiment demonstrates LLMs' capability to understand AAPM reports and answer related questions. Future work will include integrating a search module to retrieve relevant reports for queries. Additionally, training task-specific LLMs, such as those fine-tuned on medical physics textbooks, is essential. A robust evaluation framework is also necessary to accurately assess these systems.

Statement of Impact

While assessing the capability of commercial LLMs in processing reference documents specific to the medical physics domain, this work signifies the need for a more standardized method of evaluating model performance on technical reference documents.

Keywords

Artificial intelligence; Large Language Models; Medical Physics; Computed Tomography

064 - Prompt-Induced Bias in Vision Language Models: Implications for Pneumonia Detection in Pediatric Chest Radiographs

Presenter: David Li, London Health Sciences Center

David Li¹, Jaron Chong¹

¹London Health Sciences Center, London, ON, Canada

Introduction/Background

Vision language models (VLMs) have the potential to revolutionize medical imaging. However, the effects of text prompts on visual tasks are not well understood. This study investigates how variations in text prompts influence the diagnostic accuracy of GPT-4 Turbo in detecting pneumonia in pediatric chest radiographs. We hypothesize that subtle differences in prompt context can lead to biased predictions in visual diagnostic tasks.

Methods/Intervention

This retrospective study utilized publicly available data and was exempt from institutional review board approval. 5856 pediatric chest radiographs were obtained from the Guangzhou Women and Children’s Medical Center. A test set of 200 radiographs, including 100 pneumonia cases and 100 normal cases, was randomly selected from patients aged 1 to 5 years. The latest version of GPT-4 Turbo with Vision was used to classify each radiograph as either pneumonia or normal, employing four prompt variations: neutral, query positive, clinically symptomatic, and leading answer. VLM performance was evaluated using sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Statistical analysis was performed using McNemar’s tests, with a significance threshold of p < 0.05.
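The evaluation described above can be sketched in a few lines. The abstract does not say which form of McNemar's test was used, so the exact binomial version shown here is an assumption, as are the function names and toy labels. The test compares two prompts on the same radiographs using only the discordant pairs, i.e., cases where one prompt was correct and the other was not.

```python
from math import comb


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = pneumonia, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correctness indicators."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: 8 radiographs scored under two hypothetical prompts.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
neutral = [1, 0, 0, 0, 0, 0, 0, 1]
print(sensitivity_specificity(y_true, neutral))
print(mcnemar_exact([1] * 8, [0] * 8))
```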

Results/Outcome

The AUROC for the four prompts ranged from 0.35 to 0.53. Subgroup analysis showed that sensitivity increased progressively with greater prompt bias, ranging from 0.18 for neutral prompts to 0.99 for leading answer prompts. Significant differences were observed in pairwise comparisons between the neutral and clinically symptomatic prompts (p = 0.026) and between the neutral and leading answer prompts (p < 0.001).

Conclusion

This study highlights that prompt-induced bias significantly impacts GPT-4 Turbo’s performance in detecting pneumonia in pediatric chest radiographs. Moreover, VLM performance was lower compared to previously published benchmarks for convolutional neural networks in chest radiograph interpretation. Further research is needed to identify and address prompt-induced bias to ensure reliable clinical deployment.

Statement of Impact

Currently, VLMs without specialized medical fine-tuning demonstrate limited accuracy in interpreting chest radiographs. Prompt-induced bias significantly affects diagnostic performance in visual tasks. To enhance the clinical effectiveness of VLMs, it is crucial to conduct rigorous validation studies using neutral prompts to minimize bias and avoid overestimating results.

Keywords

Vision language models; Large language models; Multimodality; Generative pre-trained transformers

065 - Radiology AI Leaderboard: An Evaluation Platform for Large Language and Vision Language Models

Presenter: David Li, London Health Sciences Center

David Li¹, Jaron Chong¹

¹London Health Sciences Center, London, ON, Canada

Introduction/Background

Rapid advancements in large language models (LLMs) and vision language models (VLMs) hold great promise for transforming radiology. However, assessing and comparing the performance of these models in radiology remains challenging due to the lack of standardized, transparent benchmarks. To address this gap, we created a comprehensive platform designed to evaluate and compare LLM and VLM performance in radiology tasks.

Methods/Intervention

The platform features an evaluation and voting framework with domain-specific criteria to ensure accurate performance assessment. It supports both public and proprietary datasets, including multimodal datasets. Visualization tools enable radiologists to easily compare model performance across various metrics, datasets, and tasks over time. Researchers and vendors are encouraged to submit their models for evaluation.

Results/Outcome

The platform has demonstrated both feasibility and effectiveness in evaluating LLMs and VLMs for radiology tasks. Models are assessed across a range of radiology-specific tasks and datasets, with performance transparently reported and ranked. While initial results were based on academic research, we have also evaluated 19 models with board-certified radiologists. The latest proprietary models, such as GPT-4o and Claude 3.5 Sonnet, as well as open-source models like LLaMA 3.1 405B, have been benchmarked. Preliminary results indicate that model performance on radiology-specific tasks differs substantially from general-purpose benchmarks, highlighting the need for radiology-specific benchmarks.

Conclusion

The Radiology AI Leaderboard represents a major advancement in standardizing the evaluation of LLMs and VLMs within radiology. It addresses a critical gap by introducing specialized benchmarks tailored to radiology, setting new standards for transparency and collaboration. The platform not only improves the accuracy of performance evaluations but also establishes a robust foundation for the safe and effective integration of AI into clinical practice.

Statement of Impact

This platform advances the evaluation of LLMs and VLMs in radiology by providing standardized and transparent benchmarking. By ensuring rigorous and equitable assessments, it facilitates the integration of generative AI into clinical practice.

Keywords

Large language models; Vision language models; Model validation; Benchmark

066 - Regression in GPT-4 Turbo’s Diagnostic Accuracy for Generating Radiology Differential Diagnoses

Presenter: David Li, London Health Sciences Center

David Li¹, Kartik Gupta¹, Mousumi Bhaduri¹, Paul Sathiadoss¹, Sahir Bhatnagar², Jaron Chong¹

¹London Health Sciences Center, London, ON, Canada

²McGill University, Montreal, Quebec, Canada

Introduction/Background

Large language models (LLMs) have demonstrated impressive capabilities across a variety of domains; however, their effectiveness in clinical tasks, such as generating differential diagnoses, remains underexplored. This study evaluates the diagnostic accuracy of GPT-4 Turbo, an advanced generative pre-trained transformer (GPT), in analyzing Radiology Diagnosis Please cases. These cases encompass a broad range of pathologies, reflecting the complexities of diagnostic radiology. We hypothesize that GPT-4 Turbo will outperform its predecessors in generating accurate differential diagnoses.

Methods/Intervention

This study was exempt from institutional review board review due to the use of publicly available data. We retrospectively compiled a test set of 287 Radiology Diagnosis Please cases from August 1998 to July 2023, excluding cases with information leaks. Patient histories, imaging findings, and ground truth diagnoses were extracted. The latest version of GPT-4 Turbo (April 2024 release) was evaluated. Diagnostic accuracy was assessed by generating the top five differential diagnoses based on text inputs of history, imaging findings, and their combination. A panel of three radiologists, averaging 13 years of experience, evaluated blinded differentials and resolved discrepancies through mediated discussion.

Results/Outcome

GPT-4 Turbo’s diagnostic accuracy based on the history, imaging findings, and both combined were 43/287 (15%), 119/287 (41%), and 132/287 (46%), respectively. Accuracy varied across subspecialties, ranging from 0/26 (0%) in genitourinary cases to 4/6 (67%) in obstetrics cases. Qualitative observations of diagnostic regression included lower rankings of correct diagnoses and the omission of eponyms and previously accurate diagnoses.

Conclusion

This clinical validation study identifies an unexpected regression in the diagnostic accuracy of GPT-4 Turbo compared to previously published benchmarks for GPT-4 and GPT-3.5. These results highlight the need for additional fine-tuning to enhance GPT-4 Turbo’s performance and ensure its effectiveness before clinical deployment.

Statement of Impact

This clinical validation study underscores the importance of exercising caution when integrating LLMs into diagnostic workflows. The regression in GPT-4 Turbo’s performance suggests that foundational models require additional fine-tuning with medical datasets. Rigorous validation of LLMs is crucial to establish their effectiveness and reliability before widespread clinical adoption. With continuous improvements, LLMs have the potential to become valuable decision support tools for radiologists.

Keywords

Large language model; Generative pre-trained transformer; GPT-4 Turbo; Clinical validation

067 - Reduction of Molecular Breast Imaging Scan Time by Half with a Denoising Diffusion Probabilistic Model-based Algorithm

Presenter: Fred Nugen, Mayo Clinic - Rochester

Fred Nugen¹, Bardia Khosravi¹, Lacey Gray¹, Katie N. Hunt¹, Bradley J. Erickson¹, Carrie Hruska¹

¹Mayo Clinic - Rochester, Rochester, MN, USA

Introduction/Background

Molecular Breast Imaging (MBI) uses a dedicated gamma camera to image functional uptake of a radiopharmaceutical (Tc-99m sestamibi) in breast cancer. Following an MBI patient satisfaction study (Hruska, JNMT 2024), reducing scan time and/or radiation dose while maintaining image quality is desirable. We designed an automated denoising tool and evaluated its application to “reduced count” MBI in maintaining lesion contrast and diagnostic accuracy.

Methods/Intervention

MBI scans comprise four ten-minute images: two projections for each breast. We generated “reduced-count” MBI scans by using only data from the first five minutes of each image. We trained a denoising diffusion probabilistic machine learning model (DDPM) (Khosravi, CMPB 2023) to “denoise” reduced-count images without reducing clarity of features such as suspicious lesions. We used a random sample of patients undergoing MBI at Mayo Clinic from January to April, 2021. The model was trained repeatedly on the training set (343 patients, 4962 images) and validation set (114 patients, 1614 images) until it had seen 40 million images. For model evaluation, we selected 81 additional MBI exams (15 negative, 66 positive). Quantitative evaluation was performed through region of interest analysis on both reduced-count denoised and ground truth images. We calculated contrast-to-noise ratios as CNR = (mean of lesion - mean of breast tissue) / (standard deviation of breast tissue). A retrospective reader study of this dataset is underway; breast radiologists are presented the reduced-count denoised and ground truth exams in a random order while blinded to image status and provide an assessment of cancer likelihood (ACR BI-RADS 1-5 scale) and image quality (1-5 scale).
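The CNR formula above translates directly into code. In this sketch the pixel values are invented, and the use of the sample standard deviation is an assumption, since the abstract does not say whether sample or population standard deviation was used.

```python
from statistics import mean, stdev


def contrast_to_noise_ratio(lesion_pixels, tissue_pixels):
    """CNR = (mean of lesion - mean of breast tissue) / (std of breast tissue)."""
    noise = stdev(tissue_pixels)  # sample standard deviation (assumption)
    if noise == 0:
        raise ValueError("breast tissue region has zero variance")
    return (mean(lesion_pixels) - mean(tissue_pixels)) / noise


# Hypothetical region-of-interest counts.
lesion = [30, 32, 34]
tissue_noisy = [10, 12, 8, 10]
tissue_denoised = [10, 11, 9, 10]  # same mean, smaller spread
print(contrast_to_noise_ratio(lesion, tissue_noisy))
print(contrast_to_noise_ratio(lesion, tissue_denoised))
```

With the lesion and tissue means unchanged, a smaller standard deviation in the tissue region raises the CNR, which is the mechanism the results attribute to denoising.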

Results/Outcome

In a random sample of 18 images containing a lesion, CNR of the DDPM-denoised image was equal to or higher than CNR of the ground truth, due to reducing standard deviation of intensities in breast tissue. Pending results from the reader study will be presented.

Conclusion

DDPM-denoised images acquired for half the acquisition time of ground truth data provided similar image quality without producing artifacts. DDPM-denoised images maintained or improved lesion contrast, and had less noise (ie, reduced standard deviation of breast tissue intensities).

Statement of Impact

DDPM-denoised MBI exams could improve patient satisfaction and/or reduce radiation dose.

Keywords

Molecular Breast Imaging; Denoising; Generative AI; Diffusion models

068 - Survival Prediction in Colorectal Liver Metastases using Radiomics

Presenter: Akhil Ambekar, Brown University

Akhil Ambekar¹, Jon Steingrimsson¹

¹Brown University, Providence, RI, USA

Introduction/Background

Colorectal Liver Metastases (CRLM) signal an advanced stage of colorectal cancer. Accurate survival estimation can guide treatment decisions. We evaluate how accurate radiomics (a non-invasive method that generates numerical data from medical images) is for survival estimation in patients with CRLM and compare it with using traditional clinical information. We also examine the predictive value of radiomics features extracted from images at different quantization levels. Quantization is a process of limiting the number of distinct pixel values in an image to streamline the analysis by reducing noise and computational demands.
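As an illustration of the quantization idea described above, the following Python sketch maps pixel values onto a fixed number of gray levels. It is a generic fixed-bin-count scheme written for this example, not the discretization code of any radiomics package; the function name and the min-max scaling are assumptions.

```python
import numpy as np


def quantize(image, levels):
    """Map pixel values onto `levels` equally wide bins, numbered 0 .. levels - 1."""
    image = np.asarray(image, dtype=float)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A constant image has a single gray level.
        return np.zeros(image.shape, dtype=np.int64)
    scaled = (image - lo) / (hi - lo)  # values in [0, 1]
    bins = np.floor(scaled * levels).astype(np.int64)
    # The maximum value would land in bin `levels`; fold it into the top bin.
    return np.minimum(bins, levels - 1)


# Synthetic ramp image with 256 distinct values.
ramp = np.arange(256, dtype=float).reshape(16, 16)
for levels in (8, 32, 128, 255):
    quantized = quantize(ramp, levels)
    print(levels, len(np.unique(quantized)))
```

Fewer levels merge neighboring intensities, which suppresses noise but can also discard texture detail; this is the trade-off the study examines across levels 8, 32, 128, and 255.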

Methods/Intervention

This study uses the Colorectal-Liver-Metastases dataset, including preoperative CT DICOM scans with liver segmentations, and overall survival data for 197 patients post-CRLM resection. Using 'pyradiomics', 474 radiomic features were extracted at four different quantization levels (8, 32, 128, and 255). Feature extraction was performed on individual CT slices, with summary statistics computed for each patient based on slice-level data. A Random Survival Forest (RSF) was trained using the training feature set (80%) and evaluated using the concordance index for censored data on the test dataset (20%). Feature selection was done using permutation importance analysis, followed by a comparison of the outcomes using concordance indexes. We also compared the predictive accuracy of clinical data alone, acquired at the time of acquisition, with that of the radiomic analysis.
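The concordance index used for evaluation can be sketched in plain Python. This is a minimal version of Harrell's concordance index for right-censored data, written for illustration; the abstract does not state which implementation or tie-handling convention the authors used, and the function and argument names are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event and
    a shorter follow-up time than subject j. The pair is concordant when
    subject i was also assigned the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: survival time, event flag (1 = death observed), risk score.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.5, 0.1]
print(concordance_index(toy_times, toy_events, toy_risk))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported values of 0.64 to 0.75.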

Results/Outcome

The most effective RSF model used the top 45 features and had a concordance index of 0.75 at the quantization level of 255. For features extracted at quantization levels 8, 32, and 128, the concordance indexes were 0.67, 0.74, and 0.73, respectively. The optimal results for these levels were obtained using the top 25, 55, and 25 features, respectively. By contrast, models employing traditional clinical data yielded a lower concordance index of 0.64.

Conclusion

Compared with clinical data, radiomic features yield a notable improvement in survival prediction accuracy. Radiomic data consistently outperform traditional clinical data across quantization levels of 32 and above; this shows the value of quantization to streamline the analysis.

Statement of Impact

Radiomic features provide independent predictive value on top of using only clinical information when predicting survival in patients with CRLM.

Keywords

Colorectal cancer; Radiomic features; Quantization; Random Survival Forest

069 - The Phenotypic Basis of CT-derived Kidney Traits and Their Utility in Predicting Estimated Glomerular Filtration Rate

Presenter: David Y. Zhang, University of Pennsylvania

David Y. Zhang¹, Rachit Kumar¹, Ali H. Dhanaliwala¹, Jeffrey T. Duda¹, Hersh Sagreiya¹, James C. Gee¹, Charles E. Kahn, Jr.¹, Marylyn D. Ritchie¹, Daniel J. Rader¹, Walter R. Witschey¹

¹University of Pennsylvania, Philadelphia, PA, USA

Introduction/Background

The volume of imaging data, necessity of accurate and consistent granular imaging traits defined across the lifespan, and need to support underserved communities require novel end-to-end automation strategies, especially for kidney evaluation. To address this challenge, we developed and applied AI to CT scans and analyzed the clinical relevance of imaging traits with disease. We hypothesized that kidney imaging-derived phenotypes (IDPs) could be used to predict estimated glomerular filtration rate (eGFR).

Methods/Intervention

We extracted thorax, abdomen, and/or pelvis CT scans for 20,289 individuals in the Penn Medicine Biobank, segmented the kidneys using TotalSegmentator, and derived quantitative imaging traits to perform association studies, controlled for sex, age, age^2, BMI, and population stratification. A simple feed-forward neural network was also trained to predict eGFR using the kidney traits as well as age and sex. The dataset was split into 70%/15%/15% training/validation/testing.

Results/Outcome

We performed phenome-wide association studies against multiple quantitative kidney IDPs. For kidney volume, we observed strong significant negative associations with end-stage renal disease as well as related circulatory conditions such as hypertension and congestive heart failure, with similar trends also identified for kidney surface area and mean attenuation. Our neural network model for predicting eGFR from kidney traits, age, and sex was trained on eGFR values documented within 7 days of a CT scan. The model exhibited robust predictive ability and had a mean squared error of 413.93 on the testing dataset. Using a cutoff of 60 mL/min/1.73 m² for chronic kidney disease, our model had a sensitivity of 61.7% and specificity of 87.3%.
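The way a regression output yields both an error metric and classification metrics can be sketched as follows. This is an illustrative Python example on made-up numbers; treating an eGFR strictly below the cutoff as chronic kidney disease is an assumption, as are the function and variable names.

```python
import numpy as np

CKD_CUTOFF = 60.0  # mL/min/1.73 m^2, the cutoff named in the abstract


def egfr_metrics(true_egfr, pred_egfr, cutoff=CKD_CUTOFF):
    """Mean squared error plus sensitivity/specificity for eGFR below the cutoff."""
    true_egfr = np.asarray(true_egfr, dtype=float)
    pred_egfr = np.asarray(pred_egfr, dtype=float)
    mse = float(np.mean((pred_egfr - true_egfr) ** 2))
    # Assumption: "positive" means eGFR strictly below the cutoff.
    true_ckd = true_egfr < cutoff
    pred_ckd = pred_egfr < cutoff
    tp = int(np.sum(true_ckd & pred_ckd))
    fn = int(np.sum(true_ckd & ~pred_ckd))
    tn = int(np.sum(~true_ckd & ~pred_ckd))
    fp = int(np.sum(~true_ckd & pred_ckd))
    return {
        "mse": mse,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical measured and predicted eGFR values for four patients.
toy_true = [30.0, 50.0, 70.0, 90.0]
toy_pred = [40.0, 65.0, 75.0, 55.0]
toy_metrics = egfr_metrics(toy_true, toy_pred)
print(toy_metrics)
```

For scale, a mean squared error of 413.93 corresponds to a root mean squared error of about 20.3 mL/min/1.73 m².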

Conclusion

Our association studies demonstrate not only strong correlations between CT imaging-derived kidney traits and health conditions, but also granularity in how different kidney diseases affect certain kidney traits and not others. We will also perform genetic association studies to study the genetic architecture of our kidney IDPs. Furthermore, the quantitative IDPs showed strong predictive potential for estimating eGFR.

Statement of Impact

Our results not only validate the biological relevance of our IDPs, but also demonstrate the clinical utility of predicting eGFR from imaging traits that could be integrated into a clinical workflow and used to indicate further testing in relevant patients.

Keywords

Imaging; Deep learning; PheWAS; Genomics

070 - Two-Step Fully Automated Classification of Choroidal Metastases on MRI: Orbit Localization via Bounding Boxes Followed by Binary Classification via Evolutionary Strategies

Presenter: Joseph N. Stember, Memorial Sloan Kettering Cancer Center

Jeffrey S. Shi¹, Bala McRae-Posani¹, Andrei Holodny¹, Hrithwik Shalu², Joseph N. Stember¹

¹Memorial Sloan Kettering Cancer Center, New York, NY, USA

²Indian Institute of Technology Madras, Chennai, India

Introduction/Background

The choroid of the eye is a rare site for metastatic spread of a tumor, and choroidal metastases (CMs) may be visualized on magnetic resonance imaging (MRI). However, as small lesions on the periphery of the image, they are often missed on brain MRI.

Methods/Intervention

Here, we describe sequential cropping and classification on brain MRI images to detect CMs using artificial intelligence (AI). We first trained an orbit localization model with a YOLOv5 architecture using 386 normal T2-weighted brain MRI images. The model predicted and cropped the positions of the orbits on MRI brain scans from 33 patients without and 33 patients with CMs. After zooming in around the orbits, the cropped images served as inputs to a binary classifier convolutional neural network (CNN) to classify images as normal or CM-containing. We used 36 images for training and the other 30 for testing. Given the small training set, we trained the network weights via the data-efficient deep neuroevolution (DNE) strategy.

Results/Outcome

Our orbit localization model achieved a mean average precision of 0.590 at an intersection-over-union threshold of 0.5. At a confidence threshold of 0.3, the model achieved a recall of 1.00 and a precision of 0.50, as the model accurately identified all orbits but was unable to distinguish “left” from “right”. Laterality was assigned afterwards using relative position. The model generalizes to scans with CMs; on our dataset of 33 slices demonstrating CMs, the model accurately determined the bounding boxes without errors. The predicted bounding boxes were used to crop the images for training our CNN classification model. After training via DNE for over 80,000 episodes, the model converged on a training set accuracy of 100% and testing set accuracy of 100%.
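Two of the quantities mentioned above, intersection over union and laterality by relative position, can be sketched in Python. The box format (x_min, y_min, x_max, y_max), the function names, and the rule of ordering boxes by horizontal center are assumptions for illustration; the abstract does not give these details.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def order_orbits_by_position(boxes):
    """Sort detected orbit boxes from image-left to image-right by x-center.

    Mapping image side to patient side depends on the display convention,
    so only the image-space order is returned here.
    """
    return sorted(boxes, key=lambda box: (box[0] + box[2]) / 2.0)


# Hypothetical predicted and reference boxes in pixel coordinates.
predicted = (0.0, 0.0, 2.0, 2.0)
reference = (1.0, 1.0, 3.0, 3.0)
print(intersection_over_union(predicted, reference))  # 1 / 7

detections = [(150.0, 40.0, 190.0, 80.0), (60.0, 42.0, 100.0, 82.0)]
print(order_orbits_by_position(detections))
```

Under the 0.5 threshold used for the reported mean average precision, a prediction counts as a match only when its overlap with the reference box reaches an IoU of at least 0.5.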

Conclusion

We trained a YOLOv5 model to accurately localize and crop the orbits on brain MRI. The cropped images were subsequently used to train a CNN with excellent performance in detecting CMs.

Statement of Impact

Our method provides an end-to-end model to accurately detect small, peripheral, easy-to-miss lesions to potentially improve sensitivity for detection of CMs. It could thereby help reduce “corner of the image” false negatives.

Keywords

Object detection; Classification; Tumor; Cancer

071 - Utilizing Natural Language Processing and Deep Learning Classification of Radiology Reports to Evaluate the Sensitivity of Chest CT for Detecting Signs of Congestive Heart Failure

Presenter: Ali Memon, Creighton University School of Medicine

Daniel Spalinski¹, Sherif Zineldine¹, Ali Memon¹, Kimberly Mendez², Dorina Pinkhasova¹, Michael Fei¹, Daniel Nguyen¹, Jad Alsheikh¹, Randy Richardson¹

¹Creighton University School of Medicine, Omaha, NE, USA

²Baylor College of Medicine, Houston, TX, USA

Introduction/Background

Data on the use of language models (LMs) in the interpretation of radiology reports, notably for chest CT, are limited. Chest CTs in the U.S. have a reported sensitivity and specificity of 86% and 68%, respectively, in detecting signs of congestive heart failure (CHF) [1]. LMs can be used to find key characteristics of pathology. These models can allow for enhanced interpretation of reports. This study assesses the capabilities of natural language processing (NLP) in the evaluation of CT reports in patients with a diagnosis of CHF.

Methods/Intervention

This study is a retrospective review of data from MIMIC-IV, an open-access database derived from the electronic health records of Beth Israel Deaconess Medical Center from 2008 to 2019 [2]. The multi-label radiology report classification model SARLE was implemented to generate lists of significant findings for chest CTs performed in the same admission with a diagnosis of CHF according to appropriate ICD codes [3]. Radiology reads were classified as positive according to the presence of one to six key radiographic findings. An ROC curve was generated based on these varying numbers of positive findings.

Results/Outcome

A total of 3,670 hospital admissions in which a chest CT was performed with a concurrent diagnosis of CHF were included. Odds ratios (OR, 95% CI) were calculated for each finding using the model interpretation of 91,281 total chest CT reports. These features included cardiomegaly (OR 6.4, 5.9-6.8), vascular congestion (OR 5.8, 4.7-7.2), pleural effusion (OR 8.1, 7.4-8.8), septal thickening (OR 5.2, 4.7-5.7), pulmonary edema (OR 8.2, 7.6-8.9), and dilated pulmonary vessels (OR 2.3, 2.1-2.5). The presence of at least one key radiographic finding had a sensitivity of 92.34% and a specificity of 52.92% for the presence of CHF. The resulting ROC curve had an AUC of 0.81.
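An ROC curve built from a count of positive findings can be sketched in Python as below. The data are made up, and the function name, the "at least k findings" rule for each operating point, and the trapezoidal area calculation are assumptions for illustration rather than the authors' code.

```python
import numpy as np


def roc_from_finding_counts(finding_counts, has_chf, max_findings=6):
    """ROC points for the rules 'positive if at least k findings', k = max .. 1."""
    counts = np.asarray(finding_counts)
    labels = np.asarray(has_chf, dtype=bool)
    if labels.all() or not labels.any():
        raise ValueError("both CHF and non-CHF cases are required")
    points = [(0.0, 0.0)]  # a threshold above the maximum calls nothing positive
    for k in range(max_findings, 0, -1):
        predicted = counts >= k
        tpr = np.sum(predicted & labels) / np.sum(labels)
        fpr = np.sum(predicted & ~labels) / np.sum(~labels)
        points.append((float(fpr), float(tpr)))
    points.append((1.0, 1.0))  # a threshold of zero calls everything positive
    # Trapezoidal area under the piecewise-linear curve.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return points, auc


# Hypothetical reports: number of key findings and whether CHF was diagnosed.
toy_counts = [0, 0, 1, 2, 3, 6]
toy_chf = [0, 0, 0, 1, 1, 1]
toy_points, toy_auc = roc_from_finding_counts(toy_counts, toy_chf)
print(toy_auc)  # 1.0 for this perfectly separated toy example
```

The operating point for "at least one finding" is the last computed point before (1, 1); in the study it corresponds to the reported sensitivity of 92.34% and specificity of 52.92%.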

Conclusion

NLP allowed for the comprehensive interpretation of a large number of radiographic studies with CHF. Chest CT sensitivity for CHF may be greater than previously reported, and expanding this methodology to other modalities has the potential to better evaluate the accuracy of imaging modalities for specific pathologies.

Statement of Impact

The study of NLP has implications for the future of interpretation of radiology reports.

Keywords

Natural Language Processing; Deep Learning; Radiology report classification

072 - CT to MRI Style Transfer Deep Learning for Enhanced Detection of Brain Metastases

Presenter: Adhithya Narayanan, Geisel School of Medicine at Dartmouth

Adhithya Narayanan¹, Nooriel Banayan², Andrei I. Holodny³, Hrithwik Shalu⁴, Dylan G. Hsu³, Joseph N. Stember³

¹Geisel School of Medicine at Dartmouth, Hanover, NH, USA

²SUNY Downstate Health Sciences University College of Medicine, New York, NY, USA

³Memorial Sloan Kettering Cancer Center, New York, NY, USA

⁴Indian Institute of Technology Madras, Chennai, India

Introduction/Background

Style transfer is a technique in AI computer vision that generates synthetic images by combining the content of one image with the visual attributes of another [1]. This approach has been employed in various contexts, such as enhancing the resolution of images from portable low-field MRI scanners to resemble those produced by high-field MRI scanners [2]. In this study, we evaluate cross-modality style transfer to assist radiologists in detecting brain metastases on CT. Vasogenic edema surrounding brain metastases can be subtle on non-contrast CT, often appearing as vague hypoattenuation. In contrast, T2 FLAIR imaging provides better visualization, as the edema appears bright with high contrast resolution [3, 4]. However, CT is much more commonly acquired, less expensive, and quicker than MRI. Therefore, we aimed to enhance the conspicuity of brain metastases on CT by style transferring to a virtual T2 FLAIR MRI. We assert that producing this synthetic MRI image may enable more confident detection of brain metastases.

Methods/Intervention

We used a two-dimensional Basic UNet++ model to generate style-transferred synthetic MRI from non-contrast CT head studies. The model was trained on 300 pairs of non-contrast CT and T2 FLAIR MRI images from 280 patients at our institution.

Results/Outcome

Qualitative assessment of the synthetic MRI images was performed by a board-certified neuroradiologist who determined that the synthetic images could improve confidence in detecting brain metastases over non-contrast CT alone.

Conclusion

In future and ongoing work, improved sensitivity for small metastases is being validated by surveying a larger group of board-certified neuroradiologists.

Statement of Impact

By increasing the conspicuity of features of metastases such as edema, synthetic MRI generation through style transfer from non-contrast CT can improve radiologists' confidence in detecting brain metastases and help inform clinical decisions.

Keywords

Artificial Intelligence; Style Transfer; Neuroradiology; Brain Metastases

073 - Prescreening Radiology Reports for Prostate Cancer Recurrences Using a Large Language Model (LLM)

Presenter: Ali Ganjizadeh, Mayo Clinic - Rochester

Ali Ganjizadeh¹, Lance Mynderse¹, Shahriar Faghani¹, David A. Woodrum¹, Bradley J. Erickson¹

¹Mayo Clinic – Rochester, Rochester, MN, USA

Introduction/Background

This study evaluated the efficiency and accuracy of using the Mixtral 8x7b v0.1 Instruct LLM to prescreen radiology reports for prostate cancer recurrences post-MR-guided seminal vesicle cryoablation.

Methods/Intervention

This retrospective study included 164 patients who underwent seminal vesicle cryoablation and were followed up with either PET or MRI scans every three months for 2 years. A total of 582 radiology reports were assessed using the Mixtral 8x7b v0.1 LLM, which was not fine-tuned but was provided with specific details about prostate cancer and radiological report analysis. The LLM analyzed the reports for recurrence indications at the ablation site without anatomical guidance. The performance of the model was evaluated by comparing its predictions with manual assessments of the radiology reports.

Results/Outcome

The model identified 21 true positive and 498 true negative reports, alongside 63 false positive and no false negative results. The performance metrics were: PPV = 25.00%, NPV = 100.00%, sensitivity = 100.00%, and specificity = 88.77%. The use of a language model for the prescreening of radiology reports substantially enhances efficiency, processing 582 reports in approximately 4 hours.
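The reported metrics follow directly from the four confusion-matrix counts, as this short Python sketch shows. The function name is hypothetical; the counts are the ones stated above.

```python
def screening_metrics(tp, tn, fp, fn):
    """Standard screening metrics from confusion-matrix counts (as fractions)."""
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts reported in the abstract: 21 TP, 498 TN, 63 FP, 0 FN (582 reports in total).
reported = screening_metrics(tp=21, tn=498, fp=63, fn=0)
for name, value in reported.items():
    print(f"{name}: {100 * value:.2f}%")
# ppv: 25.00%
# npv: 100.00%
# sensitivity: 100.00%
# specificity: 88.77%
```

With no false negatives, a radiologist would only need to review the 84 flagged reports (21 + 63) rather than all 582.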

Conclusion

This study demonstrates the potential of using an LLM for prescreening radiology reports to facilitate the detection of prostate cancer recurrences. Despite a high rate of false positives, which were largely attributed to anatomical proximity errors in token embedding, the model effectively ensured that no recurrence was missed. This emphasizes its utility as a clinical tool for enhancing radiologist awareness. Further refinement of the model's performance is necessary to reduce false positives.

Statement of Impact

The application of LLMs in prescreening radiological reports for cancer recurrence offers a significant time-saving advantage and ensures high sensitivity in clinical settings. This technology can serve as a supplementary tool to assist radiologists in managing large volumes of data and focusing on high-priority cases, improving patient care and the early detection and management of cancer recurrences. This can lead to timely interventions and improved patient outcomes.

Keywords

Urology; Interventional Radiology; Large Language Model; Cancer Screening

Supplement Details

Journal name: Journal of Imaging Informatics in Medicine

Supplement title: 2024 Conference on Machine Intelligence in Medical Imaging (CMIMI) – Selected Abstracts

Conference data (venue, location, date): Boston University-George Sherman Union | Boston, MA | October 21-22, 2024

Chair, Guest editor(s), or Organizing-committee name:

Katherine P. Andriole, PhD, FSIIM

Associate Professor of Radiology, Director of Imaging Informatics, Brigham and Women's Hospital, Harvard Medical School

Director of Research Strategy and Operations

MGH & BWH Center for Clinical Data Science

Peter D. Chang, MD

Associate Professor, Departments of Radiological Sciences and Computer Science, University of California, Irvine

Director, UCI Center for AI in Diagnostic Medicine

Ingrid Reiser, PhD, FAAPM

Associate Professor of Radiology, University of Chicago

Eliot L. Siegel, MD, FSIIM

Professor of Radiology, University of Maryland School of Medicine

Chief, Imaging Services, VA Maryland Health Care System

Jeffrey H. Siewerdsen, PhD, FAAPM, FAIMBE

Professor, Department of Imaging Physics

Director, Surgical Data Science Program, Institute for Data Science in Oncology

The University of Texas MD Anderson Cancer Center

Sponsor (or Society name): Society for Imaging Informatics in Medicine (SIIM)

Sponsorship statement (required): Publication of this supplement was sponsored by the Society for Imaging Informatics in Medicine (SIIM). All content was reviewed and selected by the CMIMI Program & Review Committee, which held full responsibility for the abstract selections.

Abstracts (quantity): 73

Tables (quantity): n/a

Figures (quantity): n/a

Collated page proofs (contact name and e-mail): Anna Zawacki, azawacki@siim.org

Target publication date (if scheduled): as close to October 21, 2024 as possible

Publication format (Print and/or Online): Online

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
