Author manuscript; available in PMC: 2026 May 1.
Published in final edited form as: Comput Biol Med. 2025 Mar 23;190:110052. doi: 10.1016/j.compbiomed.2025.110052

Development and validation of pan-cancer lesion segmentation AI-model for whole-body 18F-FDG PET/CT in diverse clinical cohorts

Fahmida Haque 1, Alex Chen 1, Nathan Lay 1, Jorge Carrasquillo 2, Esther Mena 2, Liza Lindenberg 2, Julia E Segal 3, Philip C Eclarinal 2, Sara Talvacchio 3, Alberta Derkyi 3, Peter L Choyke 2, Karel Pacak 3,4, Rosandra N Kaplan 5, Frank I Lin 2, Baris Turkbey 1,2, Stephanie A Harmon 1
PMCID: PMC12107610  NIHMSID: NIHMS2069178  PMID: 40127518

Abstract

Background:

This study develops a deep learning-based automated lesion segmentation model for whole-body 3D 18F-fluorodeoxyglucose (FDG) positron emission tomography (PET) with computed tomography (CT) images that is agnostic to disease location and site.

Method:

A publicly available lesion-annotated dataset of 1014 whole-body FDG-PET/CT images was used to train, validate, and test (70:10:20) eight configurations with 3D U-Net as the backbone architecture. The best-performing model on the test set was further evaluated on 3 different unseen cohorts consisting of osteosarcoma or neuroblastoma (OS cohort) (n=13), pediatric solid tumors (ST cohort) (n=14), and adult Pheochromocytoma/Paraganglioma (PHEO cohort) (n=40). Both lesion-level and patient-level statistical analyses were conducted to validate the performance of the model on different cohorts.

Results:

The best-performing 3D full-resolution nnUNet model achieved a lesion-level sensitivity and Dice similarity coefficient (DISC) of 71.70% and 0.40 for the test set, 97.83% and 0.73 for the ST cohort, 40.15% and 0.36 for the OS cohort, and 78.37% and 0.50 for the PHEO cohort. For the test set and PHEO cohort, the model missed small-volume and lower-uptake lesions (p<0.01), whereas no statistically significant differences (p>0.05) were found in the volume and uptake of false positive (FP) and false negative lesions for the OS and ST cohorts. The predicted total lesion glycolysis is slightly higher than the ground truth because of FP calls, which experts can easily check and reject.

Conclusion:

The developed deep learning-based automated lesion segmentation AI model which utilizes 3D_FullRes configuration of the nnUNet framework showed promising and reliable performance for the whole-body FDG-PET/CT images.

Keywords: FDG PET/CT, artificial intelligence, deep learning, oncology

INTRODUCTION

Medical imaging plays an important role in cancer detection. Integrated positron emission tomography (PET) with computed tomography (CT) is one of the most widely used imaging techniques in oncology. PET scans provide biological information about body tissues, such as metabolic processes, whereas CT scans provide anatomical information (Gatidis et al., 2022). 18F-fluoro-deoxy-glucose (FDG) is one of the most widely used PET radiotracers (Weber, 2010) and is recommended for detection, stratification, assessment of therapy response, and recurrence of tumor lesions in various malignancies (Sadaghiani et al., 2021). As an established diagnostic imaging technique in oncology, the use of FDG-PET/CT is increasing significantly, causing a rapid rise in demand for quantitative analysis and reporting of PET/CT scans. As most PET/CT acquisitions cover the whole body, the amount of information and the data processing time are high, which makes the process time-consuming and sometimes complicated for the healthcare experts tasked with analyzing these scans. Automated tools for lesion detection from whole-body FDG PET/CT images are a potential solution to make the process faster and easier for healthcare professionals.

Recent progress in computational power and the introduction of deep-learning (DL) models for medical imaging have led to substantial progress in lesion detection and oncological diagnosis. According to a recent review by Fallahpoor et al. (2024), DL-assisted approaches have been successfully applied to lesion detection, segmentation, and classification in PET/CT images; however, they still face challenges due to a lack of annotated data, which makes it difficult to develop generalized and interpretable DL models. The same review reports that 26 studies used whole-body PET/CT images for different applications using DL. Of these, 14 studies used a U-Net-based architecture and 24 studies used FDG PET/CT whole-body images, which indicates the interest of researchers in this domain. DL-based lesion segmentation models for FDG-PET/CT images have been studied for different cancer types such as non-Hodgkin’s lymphoma and non-small cell lung cancer (Jemaa et al., 2020), diffuse large B cell lymphoma (DLBCL) (Blanc-Durand et al., 2021), pancreatic ductal adenocarcinoma (Zhang et al., 2023), and lung cancer and lymphoma (Sibille et al., 2020). Most studies in the literature that apply DL-based approaches to FDG-PET/CT images address specific cancer types (Blanc-Durand et al., 2021; Ferrandez et al., 2023; Zhang et al., 2023) or organ segmentation tasks (Wang et al., 2019; Wang et al., 2022; Yang et al., 2024). Few studies focus on developing DL-based automated lesion segmentation tools for 3D FDG-PET/CT images across different cancer types, locations, and sites.

In that context, the goal of our research is to create an AI model for automated lesion segmentation using deep learning techniques on whole-body FDG-PET/CT images agnostic to disease type, location, and site. This study focuses on benchmarking two popular segmentation frameworks, nnU-Net (Isensee et al., 2021) and Medical Open Network for AI (MONAI) (Cardoso et al., 2022), with 3D U-Net as the backbone architecture to develop the segmentation model. Model performance is validated on diverse clinical cohorts with multiple types of rare cancers. The novelty of this project is to benchmark the best-performing model from the two frameworks on various unseen cancer types to observe the generalizability of the model. This study presents a comparative performance analysis of these two frameworks when trained and tested using the same cases. This work shows where these models (nnUNet and MONAI) stand in terms of generalization and how algorithms trained on various common cancer types translate to rare cancer types. This research aims to develop automated AI segmentation tools that can assist radiologists by making the lesion annotation process from 3D whole-body FDG-PET/CT images easier, faster, and more efficient.

MATERIALS AND METHODS

Study Population

In this research, 4 different clinical cohorts of patients were used to develop and validate the AI models, including one publicly available dataset and three independent clinical trial populations at our institution. Detailed information about the FDG PET/CT scans of all four cohorts is listed in Table 1.

Table 1:

Information on PET/CT images on different cohorts

| | AUTOPET Cohort | ST Cohort | OS Cohort | PHEO Cohort |
|---|---|---|---|---|
| Number of patients | 900 | 14 | 13 | 40 |
| Number of scans | 1014 | 15 | 13 | 204 |
| Median total dose [MBq] | 315 [150, 432] | 276.76 [82.14, 390.35] | 254.93 [163.54, 258.26] | 374.27 [366.82, 400.65] |
| Median pixel spacing [mm] | 2.04 | 3.18 [2.65, 3.18] | 3.18 | 2.74 [2.74, 4.11] |
| Median slice thickness [mm] | 3 | 3 [2.027, 3] | 3 | 3.27 [2.78, 5] |
| Gender [M/F] | 500/394* (6 unknown) | 9/5 | 10/3 | 25/19 |
| Median baseline height [m] | 1.72 [0.7, 2.04] | 1.66 [1.28, 1.816] | 1.72 [1.233, 1.89] | 1.73 [1.52, 1.95] |
| Median baseline age [years] | 62 [11, 90] | 17 [10, 26] | 19 [10, 28] | 55 [18, 76] |
| Median baseline weight [kg] | 77 [33, 164] | 57.3 [22, 97] | 63.3 [21.1, 134.37] | 74.8 [49, 184.9] |
| Manufacturer | SIEMENS | SIEMENS | SIEMENS | GE MEDICAL SYSTEMS, SIEMENS |
| Model name (no. of scans) | Biograph128_mCT (900), SOMATOM Definition AS_mCT (114) | Biograph128_mCT (15) | Biograph128_mCT (13) | Discovery MI DR (164), Biograph128_mCT (40) |

The first cohort is an annotated, publicly available dataset from the automated lesion segmentation in whole-body FDG-PET/CT (AUTOPET) challenge (Gatidis et al., 2023; Gatidis & Küstner, 2022). The AUTOPET cohort was used to train the AI models. The dataset consists of 1014 whole-body FDG-PET/CT images of 900 patients from the University Hospital Tübingen (UKT) and partly from the University Hospital of the LMU Munich (LMU), Germany. The whole dataset consisted of 513 scans without any FDG-avid tumor lesions, 168 scans of patients with non-small cell lung cancer, 145 patients with lymphoma, and 188 patients with melanoma. The details of the image acquisition process have been stated by Gatidis et al. (2022). Data was downloaded from The Cancer Imaging Archive (TCIA) (Clark et al., 2013) under the collection name “FDG-PET-CT-Lesions” (Gatidis & Küstner, 2022) upon approval of the TCIA Restricted License Agreement. The anonymized dataset was received as Digital Imaging and Communications in Medicine (DICOM) files consisting of a 3D volume of a whole-body FDG-PET scan, a corresponding 3D volume of a CT scan, and a binary segmentation mask. This dataset will be identified as the AUTOPET cohort for the rest of the study.

The second clinical cohort consists of 13 children/young adult patients undergoing treatment for relapsed or refractory osteosarcoma and neuroblastoma at our institution (NCT04539366), referred to herein as the OS cohort. All available baseline FDG PET/CT scans acquired during this clinical trial were used for analysis. Ground truth expert annotations were generated using semi-automated commercial software (MIM, Cleveland, OH, USA), which utilized a region-growing technique within expert-identified lesions, as described previously (Kaczanowska et al., 2024). The expert on the team then verified the radiology reports, confirmed the tumor masks generated by the semi-automated technique, and manually contoured any lesions that were missing.

The third cohort consists of 14 children/young adult patients undergoing treatment for refractory pediatric malignant solid tumors at our institution (NCT02107963), referred to for the remainder of the study as the ST cohort. All available baseline FDG PET/CT scans acquired during this clinical trial were used for analysis. Ground truth expert annotations of individual lesion volumes were not available; however, radiology reports established by expert nuclear medicine physicians at the time of study initiation (including location and SUV uptake of target lesions in the trial) were available for evaluation. Therefore, an experienced nuclear medicine specialist generated the tumor annotations using the radiology reports.

The fourth cohort consists of 40 adult patients undergoing imaging prior to and throughout Lu-177-DOTATATE treatment for inoperable and histologically proven pheochromocytoma/paraganglioma (NCT03206060, NCT00004847), referred to as the PHEO cohort for the remainder of the study. Patients underwent baseline imaging, followed by additional imaging after two and four treatment administrations, respectively. All scans were utilized for analysis in this cohort. Ground truth expert annotations were generated using the threshold-based Universal Lesion ID function in semi-automated commercial software (MIM, Cleveland, OH, USA), which applies a global threshold to the image and then separates spatially distinct regions into unique lesions for the accept/reject procedure. Due to the high burden of disease within this cohort, nuclear medicine physicians utilized the software to generate an automated global threshold-based segmentation of any uptake areas with SUVmax ≥ 4. An expert with over 20 years of radiology experience (J.C.) then reviewed all suggested regions, excluded any normal tissue uptake or indeterminate/non-disease-related uptake areas, and finalized the ground truth expert annotation. The final contours were considered true positive tumors per the clinical protocol.

Image processing

Before preparing the dataset for the AI model, the following steps were taken to process the scans. Initially, all the DICOM data was converted into the Neuroimaging Informatics Technology Initiative (NIfTI) format using Python scripts provided by the authors of the AUTOPET dataset (Clark et al., 2013; Gatidis & Küstner, 2022). All PET data was converted to standardized uptake values (SUV). CT image volumes and binary masks reflecting expert annotations were resampled to the PET volume size and resolution. For the PHEO and OS cohorts, the segmentation masks were stored as DICOM RT-Struct files. The Python package dcmrtstruct2nii (Phil et al., 2023) was used to convert the DICOM RT-Struct files to the NIfTI format.
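The SUV conversion mentioned above can be illustrated with a minimal sketch. It assumes the common body-weight SUV definition, voxel activity already decay-corrected to scan start, and 1 g of tissue treated as 1 mL. The function and parameter names are hypothetical and this is not the conversion script distributed with the AUTOPET dataset.

```python
import math

# Physical half-life of fluorine-18 in seconds (about 109.77 minutes).
F18_HALF_LIFE_S = 109.77 * 60


def suv_body_weight(activity_bq_per_ml, injected_dose_bq, weight_kg,
                    seconds_since_injection):
    """Body-weight SUV for a single voxel (illustrative assumption).

    The injected dose is decayed from injection time to scan start;
    the voxel activity is assumed to be referenced to scan start already.
    """
    decayed_dose_bq = injected_dose_bq * math.exp(
        -math.log(2) * seconds_since_injection / F18_HALF_LIFE_S
    )
    weight_g = weight_kg * 1000.0  # 1 g of tissue treated as 1 mL
    return activity_bq_per_ml * weight_g / decayed_dose_bq


# Example: 300 MBq injected, scan starts exactly one half-life later.
example_suv = suv_body_weight(5000.0, 300e6, 75.0, F18_HALF_LIFE_S)
```

In this example the dose has decayed to 150 MBq at scan start, so a voxel with 5000 Bq/mL in a 75 kg patient has an SUV of 2.5.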

For all AI development, the AUTOPET dataset was the only cohort used for training, and all other cohorts were reserved for external validation. The details of the AUTOPET dataset splits for model development are illustrated in Figure 1. Initially, the AUTOPET dataset was split into train (70%), validation (10%), and test (20%) sets using the “MakeRandomSplit” function from the RandomSplit (Lay, 2023) Python package. The split was stratified by cancer type in each of the sets. All scans of a patient with multiple scans were placed in the same set (train, validation, or test) so that no patient was present in two sets. This was done to avoid overfitting the model and to remove biases. In total, this resulted in 704 scans for training, 109 for validation, and 201 for testing, as shown in Figure 1. For developing models using frameworks requiring 5-fold cross-validation, the train set and the validation set were merged, and stratified 5-fold splits were created (Figure 1) using the “MakeBalancedCrossValidation” function from the RandomSplit (Lay, 2023) Python package. To avoid patient-level leakage bias, all scans from an individual patient were assigned to the same cross-validation split using the same procedure as described previously. The test set was reserved for model evaluation.
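The idea of a stratified split that keeps every patient in exactly one set can be sketched as follows. This is a plain-Python illustration and not the RandomSplit package used in the study; the function name, the tuple layout, and the rule of taking a patient's stratum from their first scan are assumptions made for the example.

```python
import random
from collections import defaultdict


def patient_level_split(scans, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split scans into train/val/test without splitting any patient.

    scans: list of (scan_id, patient_id, stratum) tuples.
    Patients are shuffled within each stratum, then cut by the fractions.
    """
    rng = random.Random(seed)
    scans_of = defaultdict(list)
    stratum_of = {}
    for scan_id, patient_id, stratum in scans:
        scans_of[patient_id].append(scan_id)
        stratum_of.setdefault(patient_id, stratum)  # stratum of first scan

    patients_in = defaultdict(list)
    for patient_id, stratum in stratum_of.items():
        patients_in[stratum].append(patient_id)

    split = {"train": [], "val": [], "test": []}
    for stratum in sorted(patients_in):
        patients = sorted(patients_in[stratum])
        rng.shuffle(patients)
        n_train = round(fractions[0] * len(patients))
        n_val = round(fractions[1] * len(patients))
        parts = {
            "train": patients[:n_train],
            "val": patients[n_train:n_train + n_val],
            "test": patients[n_train + n_val:],
        }
        for name, members in parts.items():
            for patient_id in members:
                split[name].extend(scans_of[patient_id])
    return split


# Toy data: 20 patients with two scans each, two strata of 10 patients.
demo_scans = [
    (f"scan{i}", f"pt{i // 2}", "lymphoma" if i % 4 < 2 else "negative")
    for i in range(40)
]
demo_split = patient_level_split(demo_scans)
```

Because the cut is made on patients rather than scans, the resulting scan counts only approximate the 70:10:20 target when patients contribute different numbers of scans.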

Figure 1:

Flow chart representing the split of the AUTOPET dataset for model development.

AI model development

This study focuses on different variations of U-Net from the nnU-Net (Isensee et al., 2021) and Medical Open Network for AI (MONAI) (Cardoso et al., 2022) frameworks to identify the best-performing configuration for segmenting lesions from whole-body 3D FDG-PET/CT images. Figure 2 shows the overall workflow to develop an automated lesion segmentation model.

Figure 2:

Workflow of developing the automated lesion segmentation model from whole-body FDG-PET/CT images. (a) Models received 3D whole-body PET images, CT images, and lesion masks as inputs from the AUTOPET train set. (b) Different approaches and configurations from two different frameworks (nnUNet and MONAI) were trained with U-Net as the backbone architecture. (c) The best model was used to run inference on clinical cohorts. (d) An example of AI-generated lesion segmentation masks. (e) AI-predicted lesions on PET images.

nnU-Net framework

The nnUNet framework follows a structured model development process, thoroughly detailed in the original paper by Isensee et al. (2021) and available on their GitHub repository (https://github.com/MIC-DKFZ/nnUNet). The dataset preparation begins by organizing the data into the Medical Segmentation Decathlon format (http://medicaldecathlon.com/), where each training case is given a unique identifier to link the images to their corresponding segmentation. nnUNet requires that each input channel be stored as a separate image. For example, in this case, PET images are used as the first channel with filenames ending in “_0000” and CT images as the second channel with “_0001.” A JSON file, named “dataset.json,” contains details about the channels, sample numbers, and image formats (such as NIFTI “.nii.gz”).
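The naming convention and the “dataset.json” file described above can be sketched as follows. The key names reflect our understanding of the nnU-Net v2 dataset format and should be checked against the nnU-Net documentation; the case identifier, the label name "lesion", and the helper function names are hypothetical.

```python
import json


def nnunet_image_names(case_id, file_ending=".nii.gz"):
    """File names for the two input channels of one case (PET first, CT second)."""
    return {
        "PET": f"{case_id}_0000{file_ending}",
        "CT": f"{case_id}_0001{file_ending}",
    }


def make_dataset_json(num_training, file_ending=".nii.gz"):
    """Dictionary to be written as dataset.json (keys assumed from nnU-Net v2)."""
    return {
        "channel_names": {"0": "PET", "1": "CT"},
        "labels": {"background": 0, "lesion": 1},
        "numTraining": num_training,
        "file_ending": file_ending,
    }


# 704 training + 109 validation scans are merged for cross-validation.
dataset_description = make_dataset_json(num_training=704 + 109)
dataset_json_text = json.dumps(dataset_description, indent=2)
names = nnunet_image_names("case_0001")  # hypothetical case identifier
```

The segmentation file for the same case would carry the bare identifier without a channel suffix, which is how images and labels are linked.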

After formatting the dataset, the “nnUNetv2_extract_fingerprint” function is used to extract dataset-specific properties such as image sizes, voxel spacings, and intensity information. These properties are then employed to configure the U-Net model hyper-parameters and preprocess images using the “nnUNetv2_plan_experiment” and “nnUNetv2_preprocess” functions. From the nnU-Net framework, multiple default configurations of 3D full-resolution (3D-FullRes) and 3D low-resolution (3D-LowRes) UNet were studied in this work. The 3D-FullRes configurations use full image resolutions, whereas the 3D-LowRes uses downsampled images. The 3D-LowRes architecture is configured by adjusting down sampling based on patch size and voxel spacing. Downsampling continues until the feature map size drops below four voxels or spacings become anisotropic. Axes with higher resolution are downsampled separately until their resolution is close to the lower resolution axis, after which all axes are downsampled together. This stops for each axis when feature map constraints are met. The default convolutional kernel size is 3 × 3 × 3 for 3D U-Net. However, if there’s a large initial resolution difference, the kernel for the out-of-plane axis is set to one until resolutions align. We have also studied two different variations for combining 3D-FullRes and 3D-LowRes models, including 3D U-Net cascade of full-resolution and low-resolution (3D-cascade) and an ensemble of 3D full-resolution and low-resolution configurations (3D-FullRes-LowRes). For 3D-cascade, first a 3D-LowRes was trained and then the generated segmentation maps from the first stage were refined using the 3D-FullRes model. For the 3D-FullRes-LowRes, both the models were trained independently and the inference on the test set was run using an ensemble technique using average SoftMax probability from both models.
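The ensemble of the 3D-FullRes and 3D-LowRes models averages the per-voxel softmax probabilities before taking the most likely class. A minimal sketch of that averaging step, written with plain lists rather than the “nnUNetv2_ensemble” implementation, is shown below; the function name and the toy probabilities are illustrative.

```python
def ensemble_average(probs_a, probs_b):
    """Average two softmax outputs voxel by voxel and return class labels.

    probs_a, probs_b: lists of per-voxel [p_background, p_lesion] values.
    """
    labels = []
    for voxel_a, voxel_b in zip(probs_a, probs_b):
        mean = [(pa + pb) / 2.0 for pa, pb in zip(voxel_a, voxel_b)]
        # argmax over classes; ties resolve to the lower class index
        labels.append(max(range(len(mean)), key=mean.__getitem__))
    return labels


# Toy example with three voxels.
probs_fullres = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
probs_lowres = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
ensemble_labels = ensemble_average(probs_fullres, probs_lowres)
```

The second voxel shows the effect of averaging: the full-resolution model alone would call it lesion, but the mean probability (0.45) falls below the background probability (0.55).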

nnU-Net uses the 3D U-Net architecture (Ronneberger et al., 2015) as the backbone, with an automated pipeline consisting of pre-processing, data augmentation, and post-processing. This method uses deep supervision to finetune the hyper-parameters automatically based on a given dataset. In this research, PET images and CT images were simultaneously used as inputs for training the model. A variety of data augmentation techniques were applied during training: rotations, scaling, Gaussian noise, Gaussian blur, brightness, contrast, simulation of low resolution, gamma correction, and mirroring. For image normalization, nnUNet uses z-scoring for PET and a pre-defined scheme for CT images (Isensee et al., 2021). The learning rate was initially set to 0.01 and gradually decreased following the “poly” learning rate rule, in which the initial rate is multiplied by $(1 - \mathrm{epoch}/\mathrm{epoch}_{\max})^{0.9}$ (Chen et al., 2018). The final checkpoint for each model was based on the epoch with the best average foreground DISC score during training (Dice, 1945; Sørensen, 1948).
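The “poly” learning rate rule multiplies the initial rate by (1 − epoch/epoch_max) raised to the power 0.9. A minimal sketch with the values stated in this study (initial rate 0.01, 1000 epochs) follows; the function name is illustrative.

```python
def poly_lr(epoch, max_epochs=1000, initial_lr=0.01, exponent=0.9):
    """Polynomial learning rate decay: lr0 * (1 - epoch / max_epochs) ** exponent."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent


# Learning rate at the start, midpoint, and end of training.
lr_schedule = [poly_lr(e) for e in (0, 500, 1000)]
```

The schedule starts at 0.01, is a little above half of that at the midpoint (because the exponent is below 1), and reaches zero at the final epoch.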

nnU-Net uses 5-fold cross-validation to train and validate models. Cross-validation splits were generated (Figure 1) to have a balanced distribution of the disease types and positive/negative scans. Afterward, 4 configurations of nnU-Net, 3D-FullRes, 3D-LowRes, 3D-cascade, and an ensemble of 3D-FullRes-LowRes were trained for 1000 epochs where each epoch consists of 250 training iterations. The Patch size was set to 128×128×128. Separate training for both 3D-FullRes and 3D-LowRes configurations was conducted using the “nnUNetv2_train” function across 5 folds. After 3D-LowRes training, the 3D-cascade configuration was trained, utilizing the 3D-LowRes output as input for the second stage 3D-FullRes model. The model was trained on 5 A100 GPUs, with each fold trained separately. After training was completed, the inference was run on the test dataset from AUTOPET for each of the configurations using a sliding window approach with the same patch size of 128×128×128. Predictions were made using the “nnUNetv2_predict” function for all folds, and the 3D-FullRes and 3D-LowRes predictions were averaged using the “nnUNetv2_ensemble” function. For inference on the AUTOPET test set, all 5-fold cross-validation models were used as an ensemble.
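Sliding-window inference tiles the whole-body volume with overlapping patches and predicts each patch separately. The sketch below computes patch origins for a 128×128×128 patch; the 50% overlap and the rule of placing the last patch flush with the border are assumptions for illustration and may differ from the step computation inside nnU-Net, and the volume shape is hypothetical.

```python
from itertools import product


def window_starts(image_size, patch_size, overlap=0.5):
    """Start indices along one axis so that patches cover the whole axis."""
    if image_size <= patch_size:
        return [0]  # a single (padded) patch covers the axis
    step = max(1, int(patch_size * (1 - overlap)))
    starts = list(range(0, image_size - patch_size, step))
    starts.append(image_size - patch_size)  # last patch flush with the border
    return starts


def patch_origins(image_shape, patch_shape, overlap=0.5):
    """All 3D patch origins as (x, y, z) tuples."""
    per_axis = [
        window_starts(size, patch, overlap)
        for size, patch in zip(image_shape, patch_shape)
    ]
    return list(product(*per_axis))


# Hypothetical whole-body volume of 400 x 400 x 300 voxels.
origins = patch_origins((400, 400, 300), (128, 128, 128))
```

Predictions from overlapping patches are then combined per voxel, typically by averaging, before the final label is assigned.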

MONAI Framework

From the MONAI framework, 3D UNet with four different pre-processing techniques has been studied in this research (Cardoso et al., 2022). The process of using the MONAI framework for 3D image segmentation has been discussed in detail in their GitHub repository (https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation). The model development process started with creating data splits for train, validation, and test sets as described in the section “Image processing”. Different pre-processing was done using the following functions LoadImaged, EnsureChannelFirst, ScaleIntensityRanged, Spacingd, ConcatItemsd, RandSpatialCropd, and CropForegroundd from the MONAI transform class. The details about different transformations applied to different variants are discussed below.

In the first variation (Variant-1), the whole-body PET/CT images were used to train the model. The first transformation applied is intensity normalization. Normalization was applied to the images in a channel-wise manner for each scan. For PET images, the intensity range [0, 15] has been scaled to [0, 1] and for CT images, the intensity range [−100, 200] has been scaled to [0, 1]. The spacing of the PET, CT, and annotated mask images was set to [2.036, 2.036, 3.0]. As the third transform, 6 random crops of size 128*128*32, were selected based on the 2:1 background-to-foreground ratio, from each whole-body image for both PET and CT scans.
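The channel-wise intensity normalization is a linear mapping of a fixed input range to [0, 1]. The sketch below reproduces that mapping for single values; clipping values outside the input range is an assumption (it corresponds to an optional setting of MONAI's ScaleIntensityRanged transform, and the source does not state whether it was enabled).

```python
def scale_intensity(value, a_min, a_max, b_min=0.0, b_max=1.0, clip=True):
    """Linearly map [a_min, a_max] to [b_min, b_max], optionally clipping."""
    scaled = (value - a_min) / (a_max - a_min) * (b_max - b_min) + b_min
    if clip:
        scaled = min(max(scaled, b_min), b_max)
    return scaled


# PET: SUV range [0, 15] -> [0, 1]; CT: range [-100, 200] -> [0, 1].
pet_scaled = [scale_intensity(v, 0.0, 15.0) for v in (0.0, 7.5, 30.0)]
ct_scaled = [scale_intensity(v, -100.0, 200.0) for v in (-1000.0, 50.0, 200.0)]
```

With clipping, an SUV of 30 maps to 1.0 and air on CT maps to 0.0, so very hot lesions and very low-density regions saturate at the ends of the range.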

For the second variation (Variant-2), whole-body PET/CT images from the train set were cropped into 3D patches of size 400*400*32. Crops from the chest to the lower abdomen were oversampled at a ratio of 3:1 and other body regions at a ratio of 2:1. This resulted in a total of 28,159 crops, of which 6,057 contained at least one foreground voxel (i.e., at least one positive lesion) and 22,102 consisted only of background. Intensity normalization and spacing adjustment were applied to the crops as pre-processing, the same as for Variant-1. The model was trained on all crops at the size of 400*400*32.

For the third variation (Variant-3), the same image size, number of crops, intensity normalization, and spacing adjustment were used as for Variant-2. In Variant-3, the CropForegroundd transform was used, which removes all zero borders to focus on the valid body areas of the images and masks, followed by the RandSpatialCropd transform, which randomly crops patch samples of size 128*128*32 from the cropped images of size 400*400*32. Spatial padding of size 128*128*32 was applied to make the crops symmetric on all sides in each direction.

For the fourth variation (Variant-4), foreground crops of size 400 × 400 × 32 were oversampled at a 3:1 ratio to balance foreground and background crops, resulting in a final training dataset of 40,273 crops, with 18,171 foreground crops (each including at least one lesion) and 22,102 background crops (including no lesions). All the transformations for this variant were kept the same as those used for Variant-3.

The UNet model from MONAI was configured with 3 spatial dimensions, two input channels, two output channels, and a stride of 2. The combined Dice and cross-entropy loss (DICECE) was selected, with the softmax function used for the prediction.

A sliding window approach was applied over the entire image with a patch size of 128*128*32 for validation of all the variants. The mean dice metric was used in the validation stage to update the loss and weights. The learning rate was set to 0.01, and models were trained for varying numbers of epochs with different cache rates. The final checkpoint for each variant was based on the best validation DISC. Inference was run on the AUTOPET test set using the final model. The details of the models are summarized in Table 2. Our scripts for developing the MONAI models can be found in the following GitHub repository: https://github.com/HFahmida/FDG-PET-CT_AI/blob/main/Train_monai_Unet.py

Table 2:

Different variations in training UNet architecture.

| | Variant-1 | Variant-2 | Variant-3 | Variant-4 |
|---|---|---|---|---|
| Image size | Whole-body PET/CT | Cropped PET/CT 400*400*32 | Cropped PET/CT 400*400*32 | Cropped PET/CT 400*400*32 |
| Number of images | 1014 scans | Positive crops: 6,057, total crops: 28,159 | Positive crops: 6,057, total crops: 28,159 | Positive crops: 18,171, total crops: 40,273 |
| Intensity normalization | PET range [0, 15] and CT range [−100, 200], each scaled to [0, 1] | PET range [0, 15] and CT range [−100, 200], each scaled to [0, 1] | PET range [0, 15] and CT range [−100, 200], each scaled to [0, 1] | PET range [0, 15] and CT range [−100, 200], each scaled to [0, 1] |
| Number of random crops | 6 random crops of size 128*128*32, based on the 2:1 background-to-foreground ratio | No random crops; the model was trained on all the crops of size 400*400*32 | Random patch samples of size 128*128*32 cropped from the cropped images of size 400*400*32 | Random patch samples of size 128*128*32 cropped from the cropped images of size 400*400*32 |
| Best validation DISC epoch / total epochs trained | 271/292 | 26/83 | 23/272 | 70/137 |

Tumor burden calculation

Scan level tumor burden was calculated for the artificial intelligence (AI) generated lesion masks and expert annotated ground truth (GT) lesion masks from all the clinical cohorts (ST, OS, PHEO). For the ST cohort, as expert annotations were not provided, researchers identified the TP, FP, and FN lesions from AI in comparison to available radiology reports and clinical annotations in the PACS viewer for this cohort. Afterward, the corrected GT masks were generated by removing FP lesions from the AI-generated masks and manually drawing the contours of the FN lesions by an expert. Then these corrected GT masks were used for scan-level tumor burden analysis.

We have calculated the total number of lesions, maximum values of SUV of the lesion (SUVmax), total metabolic tumor volume (TMTV) (Im et al., 2018), and total lesion glycolysis (TLG), using the following equations (Satoh et al., 2014), for both the AI-generated mask and the GT mask:

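$$\mathrm{Overall\ SUV}_{\max} = \max\left(\mathrm{SUV}[\mathrm{mask} > 0]\right) \tag{1}$$

$$\mathrm{Total\ Metabolic\ Tumor\ Volume} = \sum_{i=0}^{n} \left|\mathrm{mask}_i > 0\right| \cdot S_x \cdot S_y \cdot S_z \tag{3}$$

$$\mathrm{Total\ Lesion\ Glycolysis} = \sum_{i=0}^{n} \mathrm{SUV}_{\mathrm{mean}}\left(\mathrm{mask}_i > 0\right) \cdot \left|\mathrm{mask}_i > 0\right| \cdot S_x \cdot S_y \cdot S_z \tag{4}$$

where mask is a binary volume with 1 as lesion and 0 as background, $\mathrm{mask}_i$ is the mask of the $i$-th individual lesion, $\left|\mathrm{mask}_i > 0\right|$ is its number of lesion voxels, $S_x$, $S_y$, and $S_z$ are the voxel spacings of the mask in the x, y, and z directions, and $n$ is the number of individual lesions in a mask.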
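The tumor burden measures defined above can be computed from a labeled lesion mask as in the following sketch. It assumes flattened SUV and label arrays of equal length, with label 0 for background and 1..n for individual lesions; volumes are returned in cubic millimetres because the source does not state the reporting unit. The function name and the toy values are illustrative.

```python
def tumor_burden(suv, labels, spacing):
    """Lesion count, overall SUVmax, TMTV, and TLG for one scan.

    suv, labels: flat sequences of equal length (one entry per voxel).
    spacing: (Sx, Sy, Sz) voxel spacing in mm.
    """
    voxel_volume = spacing[0] * spacing[1] * spacing[2]
    values_of = {}
    for value, label in zip(suv, labels):
        if label > 0:
            values_of.setdefault(label, []).append(value)

    suv_max = max((max(v) for v in values_of.values()), default=0.0)
    # TMTV: lesion voxel count times voxel volume, summed over lesions.
    tmtv = sum(len(v) * voxel_volume for v in values_of.values())
    # TLG: per-lesion SUVmean times lesion volume, summed over lesions.
    tlg = sum(
        (sum(v) / len(v)) * len(v) * voxel_volume for v in values_of.values()
    )
    return {"n_lesions": len(values_of), "suv_max": suv_max,
            "tmtv": tmtv, "tlg": tlg}


# Toy scan: six voxels, two lesions, spacing 2 x 2 x 3 mm.
burden = tumor_burden(
    suv=[1.0, 4.0, 6.0, 0.5, 10.0, 2.0],
    labels=[0, 1, 1, 0, 2, 0],
    spacing=(2.0, 2.0, 3.0),
)
```

Here lesion 1 has two voxels with a mean SUV of 5 and lesion 2 has one voxel with an SUV of 10, so each contributes 120 to the TLG and the TMTV is 36 mm³.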

Statistical Analysis

Two approaches were adopted to evaluate the performance of the DL models. In the first approach, the performance of the AI models was evaluated on a scan level. A TP scan is defined as any AI detection in a scan with positive lesions based on GT. A scan is considered a TN scan when the AI correctly detects no lesions in a negative GT scan. A scan is considered an FP scan when the AI mask has a positive lesion and the GT mask has no lesion. For FN scans, the AI model detects no lesions, but the GT mask has at least one positive lesion for that scan. After the scans were evaluated, performance metrics such as sensitivity, specificity, overall accuracy, and the Dice Similarity Coefficient (DISC) were calculated. The DISC score was calculated by considering the overlap between expert-annotated GT lesion masks and AI-generated lesion masks (Dice, 1945; Sørensen, 1948). For comparative analysis with the existing models from the literature by Hadlich et al. (2023) and Yousefirizi et al. (2024), the model weights were downloaded and tested on the clinical cohorts.
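The scan-level categories and the DISC can be expressed compactly. The sketch below works on flattened binary masks; returning 1.0 when both masks are empty is a convention chosen for this example and is not stated in the source.

```python
def scan_outcome(ai_mask, gt_mask):
    """Classify one scan as TP, TN, FP, or FN from binary voxel masks."""
    ai_positive, gt_positive = any(ai_mask), any(gt_mask)
    if gt_positive:
        return "TP" if ai_positive else "FN"
    return "FP" if ai_positive else "TN"


def dice(ai_mask, gt_mask):
    """Dice similarity coefficient: 2 * |A and G| / (|A| + |G|)."""
    intersection = sum(1 for a, g in zip(ai_mask, gt_mask) if a and g)
    total = sum(1 for a in ai_mask if a) + sum(1 for g in gt_mask if g)
    return 2.0 * intersection / total if total else 1.0  # assumed convention


# Toy masks with four voxels each.
example_dice = dice([1, 1, 0, 0], [0, 1, 1, 0])
example_outcomes = [
    scan_outcome([1, 0], [0, 1]),  # AI and GT both positive
    scan_outcome([0, 0], [0, 0]),  # both negative
    scan_outcome([1, 0], [0, 0]),  # AI positive, GT negative
    scan_outcome([0, 0], [0, 1]),  # AI negative, GT positive
]
```

Note that the scan-level TP definition only requires that the AI detects something in a positive scan, so the first example counts as a TP scan even though its voxel-wise overlap is zero.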

In the second approach, the performance was evaluated on the lesion level, considering all the lesions from a cohort. The AI generates predictions on a voxel level, with labels “1” for lesion and “0” for background. To identify unique lesions (i.e., groups of neighboring voxels that correspond to one individual lesion), we calculated 3D connected components (cc3d) with 26-connected neighborhoods (Silversmith, 2021). Starting from one positive voxel, cc3d identifies the 26-connected neighborhood of voxels in 3 dimensions and assigns all of them a unique value, considering them as one cluster or lesion. A technical explanation of cc3d has been provided by W. Silversmith (2021). We calculated true positive (TP), false positive (FP), false negative (FN), sensitivity, positive predictive value (PPV), median sensitivity per scan, and median FP per scan. For the ST cohort, as expert annotations were not available, experts in the team identified the TP, FP, and FN lesions by going through the AI annotations and the available radiology reports. Afterward, the contours of the FN lesions were created by the experts, and these corrected masks were used for lesion-level analysis. This study aims to develop a model with high sensitivity that identifies most of the lesions accurately. With expert intervention, identifying and removing false positive lesions is a better trade-off than having a higher number of false negative lesions. A lesion detected by the AI is considered a TP if it has any overlap with a ground truth segmentation.
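The lesion-level evaluation can be illustrated with a small pure-Python stand-in for cc3d: 26-connected components are found by breadth-first search over sets of voxel coordinates, and lesions are matched using the any-overlap rule. Counting TP and FN from the ground-truth side and FP from the AI side is an assumption of this sketch, and the function names are illustrative.

```python
from collections import deque
from itertools import product

# The 26 neighbors of a voxel: all offsets in {-1, 0, 1}^3 except the voxel itself.
NEIGHBOR_OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]


def label_lesions(voxels):
    """Group positive voxels (x, y, z) into 26-connected components."""
    remaining = set(voxels)
    lesions = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOR_OFFSETS:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        lesions.append(component)
    return lesions


def lesion_counts(ai_voxels, gt_voxels):
    """TP/FN counted over GT lesions, FP over AI lesions (any-overlap rule)."""
    ai_voxels, gt_voxels = set(ai_voxels), set(gt_voxels)
    gt_lesions = label_lesions(gt_voxels)
    ai_lesions = label_lesions(ai_voxels)
    tp = sum(1 for lesion in gt_lesions if lesion & ai_voxels)
    fp = sum(1 for lesion in ai_lesions if not (lesion & gt_voxels))
    return {"TP": tp, "FN": len(gt_lesions) - tp, "FP": fp}


# GT: one diagonal two-voxel lesion and one isolated lesion.
gt = {(0, 0, 0), (1, 1, 1), (10, 10, 10)}
# AI: overlaps the first GT lesion and adds one spurious detection.
ai = {(1, 1, 1), (20, 20, 20)}
counts = lesion_counts(ai, gt)
```

The diagonal pair (0, 0, 0) and (1, 1, 1) forms a single lesion under 26-connectivity, whereas it would be split into two under 6-connectivity.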

Further statistical analysis was performed to compare the tumor burden calculated from the AI-generated lesion masks (AI masks) with that from the expert-annotated lesion masks, which were considered GT masks. This analysis was again performed at both the scan level and the lesion level. In the scan-level tumor burden analysis, Bland-Altman (B&A) plots (Giavarina, 2015) of the difference with mean difference (Karimdjee et al., 2023) and 95% confidence intervals (CI), the median of relative mean difference (RMD), and intraclass correlation coefficients (ICC) (Koo & Li, 2016) between AI and GT masks were calculated for SUVmax, TMTV, and TLG. ICC is interpreted as superior if ICC ≥ 0.8, great: 0.61 ≤ ICC < 0.8, moderate: 0.41 ≤ ICC < 0.6, low: 0.21 ≤ ICC < 0.4, and poor: ICC ≤ 0.2 (Landis & Koch, 1977). For lesion-level analysis, SUVmax and TMTV differences between TP, FN, and FP lesions were evaluated using the clustered Wilcoxon rank test (Jiang et al., 2020). The statistical significance level was set at p < 0.05. Python and R software were used for statistical analysis.
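The quantities shown in a Bland-Altman plot can be computed as in the sketch below. It uses the conventional 95% limits of agreement (mean difference ± 1.96 standard deviations of the differences); whether the intervals reported in this study were computed in exactly this way is an assumption, and the toy values are invented for illustration.

```python
import statistics


def bland_altman(ai_values, gt_values):
    """Mean difference (AI minus GT) and conventional 95% limits of agreement."""
    diffs = [a - g for a, g in zip(ai_values, gt_values)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return {
        "mean_diff": mean_diff,
        "lower": mean_diff - 1.96 * sd,
        "upper": mean_diff + 1.96 * sd,
    }


# Toy per-scan values (e.g. TMTV from AI masks and from GT masks).
agreement = bland_altman([10.0, 12.0, 15.0, 20.0], [9.0, 12.0, 13.0, 18.0])
```

A positive mean difference indicates that the AI masks give higher values than the GT masks on average, which is the direction expected when false positive regions add volume.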

To understand the nature of the failure cases, two measures were taken for failure analysis. First, the anatomical locations of TP, FP, and FN lesions were evaluated. A publicly available model pre-trained on CT images, “TotalSegmentator” (Wasserthal et al., 2023), was used to generate masks containing segmentations of major anatomical structures such as the brain, lung and mediastinum, liver, spleen, and kidney. Each lesion was characterized by its membership to a major anatomical structure, and the frequencies of TP, FP, and FN lesions were evaluated. Second, the suspicious cases were checked by an expert radiologist to verify the nature of AI misclassifications, including FP and FN lesions. To understand the failed anatomical locations, the AUTOPET train dataset was further investigated to observe the number and volume of lesions in those locations. Anatomical locations were generated in the same way as described above.

RESULTS

Performance evaluation of AI models

To identify the best-performing model, 4 configurations from the nnU-Net framework and 4 variations of U-Net from the MONAI framework were trained and tested using the AUTOPET dataset. The best models for each configuration were selected based on the validation DISC (DISCv). Models with the best validation DISC of 0.36, 0.31, 0.39, and 0.27 were selected for the 3D_FullRes, 3D_LowRes, ensemble of 3D_FullRes and 3D_LowRes, and 3D_Cascade configurations from the nnUNet framework, respectively. From the MONAI framework, models with the best validation DISC of 0.24, 0.12, 0.21, and 0.13 were selected for Variant-1, Variant-2, Variant-3, and Variant-4, respectively.

For the scan-level performance on the AUTOPET test set, the 3D_FullRes, 3D_LowRes, ensemble of 3D_FullRes and 3D_LowRes, and 3D_Cascade models had DISC of 0.40, 0.40, 0.45, and 0.43, respectively, whereas models from the MONAI framework had DISC of 0.22, 0.11, 0.21, and 0.17 for Variant-1, Variant-2, Variant-3, and Variant-4, respectively (Tables S1 and S2). nnUNet configurations had sensitivity in the range of 90%–98% on the AUTOPET test set, indicating that the models were able to identify positive scans and find suspicious lesions (Table S1). However, specificity varied across models. 3D_LowRes had the best specificity of 69.31%, correctly identifying 70 negative scans. The second-best specificity of 58.42% was achieved by the ensemble of the FullRes and LowRes models. The lowest specificity, 33.66%, was obtained by the 3D_FullRes model, which correctly identified 34 negative scans. None of the MONAI variations achieved high specificity (Table S2).

The AI-predicted lesion masks generated by the models were compared with the expert annotated GT lesion masks, to evaluate the performance of the models. The lesion level performance of these models on the AUTOPET test set has been presented in Table 3 and Table 4.

Table 3:

Lesion level performance evaluation of different configurations from nnUNet and MONAI framework on AUTOPET test set.

| Framework | Configuration | TP | FN | FP | Sensitivity (%) | PPV (%) | Median per-scan sensitivity (%) | Median FP per scan |
|---|---|---|---|---|---|---|---|---|
| nnUNet | 3D_FullRes | 1044 | 412 | 650 | 71.7 | 61.63 | 77.83 (0, 100) | 2 (0, 23) |
| nnUNet | 3D_LowRes | 725 | 731 | 223 | 49.79 | 76.48 | 50 (0, 100) | 0 (0, 17) |
| nnUNet | Ensemble of 3D_FullRes and 3D_LowRes | 877 | 579 | 338 | 60.23 | 72.18 | 62.92 (0, 100) | 0 (0, 22) |
| nnUNet | 3D_Cascade | 971 | 485 | 452 | 66.7 | 68.24 | 68.78 (0, 100) | 1 (0, 21) |
| MONAI | Variant-1 | 591 | 866 | 1952 | 40.56 | 23.24 | 50 (0, 100) | 8 (0, 46) |
| MONAI | Variant-2 | 317 | 1140 | 824 | 21.76 | 27.78 | 23.55 (0, 100) | 2 (2, 50) |
| MONAI | Variant-3 | 934 | 523 | 9350 | 64.1 | 9.08 | 69.85 (0, 100) | 45 (11, 104) |
| MONAI | Variant-4 | 960 | 497 | 9807 | 65.89 | 8.92 | 77.35 (0, 100) | 42 (7, 116) |
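As a worked check of the lesion-level columns in Table 3, the sketch below recomputes sensitivity and PPV from the TP, FN, and FP counts using the standard definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The function name `lesion_metrics` is hypothetical; the counts are taken from the 3D_FullRes and 3D_LowRes rows.

```python
def lesion_metrics(tp, fn, fp):
    """Return (sensitivity %, PPV %) rounded to two decimals."""
    sensitivity = 100.0 * tp / (tp + fn)  # share of GT lesions that were found
    ppv = 100.0 * tp / (tp + fp)          # share of predicted lesions that are real
    return round(sensitivity, 2), round(ppv, 2)


# Counts from Table 3 (AUTOPET test set)
full_res = lesion_metrics(tp=1044, fn=412, fp=650)
low_res = lesion_metrics(tp=725, fn=731, fp=223)
```

The two rows show the trade-off discussed in the text: 3D_FullRes finds more lesions at the cost of more false positives, whereas 3D_LowRes is more conservative.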

Table 4:

Lesion level performance of 3D_FullRes model on clinical cohorts.

| Metric | ST Cohort* | OS Cohort | PHEO Cohort |
|---|---|---|---|
| TP | 180 | 157 | 3278 |
| TN | 0 | 0 | 0 |
| FN | 4 | 234 | 905 |
| FP | 37 | 85 | 967 |
| Sensitivity (%) | 97.83 | 40.15 | 78.37 |
| PPV (%) | 82.95 | 64.88 | 77.22 |
| Median per-scan sensitivity (%) | 100 (0, 100) | 44.12 (0, 100) | 76.92 (0, 100) |
| Median FP per scan | 2.5 (0, 6) | 4 (1, 22) | 5 (0, 33) |
| DISC | 0.73 | 0.36 | 0.50 |

Among all eight models, the 3D_FullRes configuration from the nnU-Net framework had the best lesion-level sensitivity (71.7%; Table 3). The second-best sensitivity, 66.7%, was achieved by the 3D_Cascade model. The best PPV, 76.48%, was achieved by the 3D_LowRes configuration. None of the U-Net variations from the MONAI framework performed as well as the nnUNet configurations (Table 3). Among the 4 variations from the MONAI framework, the highest sensitivity, 65.89%, was achieved by Variant-4, which is still lower than that of 3D_FullRes from nnUNet. The standalone use of different image transformations and variations in image size was not enough for the U-Net architecture to achieve good segmentation performance. Because nnUNet uses deep supervision and automatically tunes model hyperparameters to the dataset, it outperformed the MONAI framework. As the goal of this study is to assist radiologists with an AI tool for whole-body lesion annotation, a human always remains in the loop. The priority was therefore to detect as many lesions as possible, so the model with the best sensitivity, i.e. 3D_FullRes, was selected for further analysis.

Performance of AI model in external testing cohorts

Inference was performed on the external clinical testing cohorts with the trained 3D_FullRes model from the nnUNet framework. Table 4 presents the lesion-level performance of the AI model in these cohorts. At the lesion level, the model achieved sensitivities of 97.83%, 40.15%, and 78.37% and DISC of 0.73, 0.36, and 0.50 for the ST, OS, and PHEO cohorts, respectively (Table 4). At the scan level, the model achieved sensitivities of 92.86%, 100%, and 99.44% for the ST, OS, and PHEO cohorts, respectively (Table S3). Table 5 summarizes the lesion-level performance of two models from the literature (Hadlich et al., 2023; Yousefirizi et al., 2024) on the external clinical cohorts. Based on Tables 4 and 5, the model by Yousefirizi et al. (2024) had lower lesion-level sensitivity and DISC than ours in all the clinical cohorts. Compared with the Hadlich et al. (2023) model, our model performed better in the ST cohort, similarly in the PHEO cohort, and slightly worse in the OS cohort. However, they applied the nnUNet model to PET images only. This suggests that even a simple, unmodified model such as the 3D full-resolution configuration can achieve better results when the data splits are stratified by patient and outcome, which affects the generalizability of the models.

Table 5:

Lesion-level performance of two models from the literature on clinical cohorts.

| Metric | ST Cohort*: Yousefirizi et al. 2024 | ST Cohort*: Hadlich et al. 2023 | OS Cohort: Yousefirizi et al. 2024 | OS Cohort: Hadlich et al. 2023 | PHEO Cohort: Yousefirizi et al. 2024 | PHEO Cohort: Hadlich et al. 2023 |
|---|---|---|---|---|---|---|
| TP | 117 | 128 | 107 | 131 | 3088 | 3232 |
| TN | 1 | 1 | 0 | 0 | 0 | 0 |
| FN | 67 | 56 | 237 | 213 | 1095 | 951 |
| FP | 52 | 68 | 64 | 78 | 620 | 936 |
| Sensitivity (%) | 63.59 | 69.57 | 31.1 | 38.08 | 73.82 | 77.27 |
| PPV (%) | 69.23 | 65.31 | 62.57 | 62.68 | 83.28 | 77.54 |
| Median per-scan sensitivity (%) | 73.33 (0, 100) | 75 (0, 75) | 6 (0, 94.74) | 42.86 (0, 100) | 57.14 (0, 100) | 81.25 (0, 100) |
| Median FP per scan | 2 (0, 15) | 4 (0, 14) | 2 (0, 18) | 5 (1, 15) | 1 (0, 30) | 4 (0, 30) |
| DISC | 0.49 | 0.52 | 0.26 | 0.45 | 0.49 | 0.5 |

Tumor Burden Analysis

Figure 3 shows box plots of TMTV and SUVmax, with means and standard deviations, for TP, FP, and FN lesions in the AUTOPET test, ST, OS, and PHEO cohorts. In all four cohorts, the mean SUVmax of FN lesions was lower than that of TP and FP lesions. The mean TMTV of FN lesions was smaller than that of TP and FP lesions in the AUTOPET test, OS, and PHEO cohorts, and higher in the ST cohort. Based on the clustered Wilcoxon test, no statistically significant differences (p>0.05) were found in TMTV and SUVmax among TP, FP, and FN lesions for the OS and ST cohorts. Statistically significant differences (p<0.001) in TMTV and SUVmax were observed between TP and FN lesions and between TP and FP lesions for the AUTOPET test and PHEO cohorts. This analysis indicates that the AI model missed lower-uptake lesions in the AUTOPET test set and missed low-TMTV, low-uptake lesions in the PHEO cohort.

Figure 3:

Comparison of tumor TMTV (top row) and SUVmax (bottom row) for TP, FN, and FP lesions in the AUTOPET test (1st column from left), OS (2nd column), ST (3rd column), and PHEO (4th column) cohorts. Statistical significance was tested using the clustered Wilcoxon test; only statistically significant differences (p<0.001) are marked, with a line joining the paired lesion types. Pairs without annotation are not significantly different (p>0.05).

Supplementary Figure S1 shows the Bland-Altman (B&A) plots of the difference between the predicted and ground truth masks in SUVmax, TMTV, and total lesion glycolysis (TLG) for all the cohorts. The mean difference (MD) with 95% CI is reported in Supplementary Table S4 for all the cohorts. Supplementary Figure S1 shows that, for all tumor burden parameters and all cohorts, the MD was positive, i.e. values were higher for the AI-predicted masks than for the GT masks. For the AUTOPET test set, two scans fell outside the 95% CI in SUVmax and TMTV and were considered outliers, indicating a high TMTV of FPs; one scan with a high number of missed lesions in the predicted mask fell below the lower bound of the 95% CI.
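The Bland-Altman quantities described above can be sketched as follows. This is a minimal illustration that assumes the 95% interval is the conventional MD ± 1.96 SD of the paired differences, which may differ from the interval the authors computed; the function name `bland_altman` and all input values are invented for illustration and are not study data.

```python
from statistics import mean, stdev


def bland_altman(ai_values, gt_values):
    """Return mean difference, (lower, upper) limits, and outlier indices.

    Differences are AI minus GT, so a positive mean difference means the
    AI mask yields higher values than the GT mask.
    """
    diffs = [a - g for a, g in zip(ai_values, gt_values)]
    md = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    lower, upper = md - 1.96 * sd, md + 1.96 * sd  # assumed interval definition
    outliers = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return md, (lower, upper), outliers


# Made-up per-scan TMTV values (ml); scan 3 carries a large false positive volume
ai = [10.0, 12.0, 9.0, 30.0, 11.0, 10.5, 9.5, 12.5]
gt = [10.0, 11.0, 9.0, 12.0, 11.0, 10.0, 9.0, 12.0]
md, (lower, upper), outliers = bland_altman(ai, gt)
```

In this toy example a single scan with a large false positive volume both raises the mean difference and falls above the upper limit, which mirrors the kind of outlier described for the AUTOPET test set.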

Table 6 presents the median RMD and ICC for the tumor burden parameters, calculated between AI and GT lesion masks for each cohort. In the AUTOPET test cohort, the median RMD indicates a negligible difference between the AI-predicted and GT masks for SUVmax. In the OS and PHEO cohorts, the RMD for TLG was similar (0.45 and 0.41, respectively) and reflected an overestimation of TLG by the AI. The RMD for SUVmax was worst in the OS cohort, which is characterized by fewer lesions with lower uptake than the other cohorts. Negligible differences between the AI and GT masks were observed in the ST cohort for all the tumor burden parameters (Table 6); however, the GT masks for the ST cohort were produced by an expert who corrected the AI-generated masks using the radiology reports, which introduces an additional source of bias compared with the other cohorts. Table 6 also reports the ICC between AI-generated masks and expert-annotated GT masks: SUVmax had poorer ICC values (0.02–0.87) than TMTV and TLG (0.6–1) across the cohorts.

Table 6:

Median RMD and ICC of different tumor burden parameters between AI and GT lesion masks for different clinical cohorts.

| Parameter | OS: Median RMD | OS: ICC | PHEO: Median RMD | PHEO: ICC | ST*: Median RMD | ST*: ICC | AUTOPET Test: Median RMD | AUTOPET Test: ICC |
|---|---|---|---|---|---|---|---|---|
| SUVmax | 1.51 (−0.03, 12.4) | 0.17 | 0.00 (−1.00, 30.00) | 0.22 | 0.05 (−1.00, 4.11) | 0.87 | 0.00 (−1.00, 6.7) | 0.02 |
| TMTV | 0.25 (−0.98, 36.70) | 0.93 | 0.85 (−1.00, 413.00) | 0.76 | 0.04 (−1.00, 1.73) | 1 | −0.01 (−1.00, 36.22) | 0.60 |
| TLG | 0.45 (−0.97, 38.94) | 0.71 | 0.41 (−1.00, 915.69) | 0.92 | 0.05 (−1.00, 3.41) | 1 | 0.00 (−1.00, 106.14) | 0.70 |
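To make the RMD column easier to read, the sketch below computes a per-scan relative difference and its median. It assumes the relative difference is (AI − GT) / GT; this definition is not restated in this section and is an assumption here, although it is consistent with the lower bound of −1.00 in Table 6 (an empty AI mask against a non-empty GT mask). The function names and input values are hypothetical.

```python
from statistics import median


def relative_difference(ai_value, gt_value):
    """Assumed definition: (AI - GT) / GT; undefined when GT is zero."""
    if gt_value == 0:
        return None
    return (ai_value - gt_value) / gt_value


def median_rmd(pairs):
    """Return (median, min, max) of the defined per-scan relative differences."""
    values = [relative_difference(a, g) for a, g in pairs]
    values = [v for v in values if v is not None]  # skip scans with GT == 0
    return median(values), min(values), max(values)


# Made-up per-scan (AI, GT) TLG values
pairs = [(0.0, 50.0), (55.0, 50.0), (120.0, 80.0), (40.0, 0.0)]
med, lo, hi = median_rmd(pairs)
```

Under this definition a positive value means the AI overestimates the parameter, and the upper end is unbounded, which is why large maxima can appear when the GT burden is small.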

Failure Analysis

In all cohorts, FP and FN lesions were observed, and these were investigated further (Figures 4 and 5). Figure 6 shows a histogram of the frequency of TP, FN, and FP lesions by anatomical site. For the PHEO cohort, the highest number of FN lesions was in the liver and the highest number of FP lesions was in the lung and mediastinum region (Figure 6). For the AUTOPET test cohort, the highest numbers of both FN and FP lesions were in the lung and mediastinum region (Figure 6). For the OS cohort, a large number of liver and lung lesions were missed (Figure 6). Additional case samples from our failure analysis are presented in Figures S2 and S3 in the supplementary material. To investigate why the model struggled with liver and lung lesion detection, the AUTOPET training dataset was analyzed for those regions, and the results are reported in Supplementary Table S6. In the AUTOPET training set, 265/813 scans had lung lesions (median 2 lesions/scan) and 92/813 scans had liver lesions (median 2 lesions/scan). In total, 1164/7164 and 437/7164 lesions were lung and liver lesions, respectively, suggesting that representation may contribute to, but does not fully explain, model failures in these regions.
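The training-set counts quoted above are easier to compare as percentages. The short sketch below only converts the reported fractions; the helper name `pct` and the dictionary layout are hypothetical.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


train_scans, train_lesions = 813, 7164  # AUTOPET training set totals from the text
representation = {
    "lung": {"scans": pct(265, train_scans), "lesions": pct(1164, train_lesions)},
    "liver": {"scans": pct(92, train_scans), "lesions": pct(437, train_lesions)},
}
```

Roughly a third of training scans contain lung lesions while about one in nine contain liver lesions, so the two regions differ considerably in representation even though the model struggled in both.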

Figure 4:

(a) Whole-body FDG-PET image of a patient from the PHEO cohort, (b) lesion annotations by experts, indicating the two brain lesions missed by the artificial intelligence (AI) model as false negative (FN) lesions, (c) lesion annotations by the AI model with two false positive (FP) lesions.

Figure 5:

(a) Whole-body FDG-PET image of a patient from the PHEO cohort, (b) lesion annotations by experts, (c) lesion annotations by the artificial intelligence (AI) model, with three false positive (FP) lesions marked (dotted circles) and the kidney and ureter labeled as FP lesions. The patient had right-sided hydroureteronephrosis, which caused higher uptake in the kidney collecting system and ureter; the model misidentified these as lesions, which can be easily removed by experts. (d-f) FP lesions marked (dotted circles) on the overlay of FDG-PET and CT images: (d) a port with some residual radioactive tracer, which showed uptake in the PET image and was considered a lesion by the AI model; (e) a lesion in the left hilar region, which was missed by the expert during annotation but detected by the AI model and later verified by the radiologists in the team; (f) an FP lesion with high uptake, misidentified by the AI model.

Figure 6:

Histogram representation of the frequency of different types of lesions in different anatomical structures for all the cohorts.

Supplementary Table S5 reports the median SUVmax of TP, FN, and FP lesions in different anatomical locations. For the PHEO cohort, it is to be noted that 26% of the total 201 FP lesions had SUVmax values less than 4, and of these, 78 lesions were in the lung and mediastinum region. As mentioned in section 2.1, for the PHEO cohort, lesions with SUVmax<4 were ignored by the experts based on the clinical protocol criterion of lesion SUVmax≥4. An example of a suspicious lesion that did not meet the SUVmax≥4 criterion is shown in Figure 5. In all the cohorts, lesions with low SUVmax were missed, as they are quite challenging for the AI to distinguish (Table S5). Similarly, very high SUVmax in healthy regions, specifically the kidney regions, is a possible cause of FPs by the AI model.

DISCUSSION

The use of deep learning on 18F-FDG-PET/CT images for segmenting lesions of specific cancer types is well studied in the literature; however, only a few studies have addressed lesion segmentation across multiple cancer types, owing to the limited availability of annotated datasets (Blanc-Durand et al., 2021; Constantino et al., 2023; Jemaa et al., 2020; Sadaghiani et al., 2021). In this study, we used the publicly available AUTOPET dataset to build an AI model. The best-performing 3D_FullRes model achieved a DISC of 0.4 and a sensitivity of 71% on the test cohort. According to the report on the AUTOPET 2022 challenge, most participants, including the top performers, used variants of the 3D U-Net architecture (Gatidis et al., 2023). The top reported DISC values were 0.93 by Peng et al. (2022) using a false positive reduction network ensembled with nnU-Net, 0.68 by Sibille et al. (2022) using a stacked ensemble of 3D UNet and CNN, and 0.62 by Ye et al. (2022) using a vanilla UNet, on test sets of different sizes. However, these models were tested on an unseen dataset drawn from UKT and partly from LMU, which was not available to us; hence, a direct comparison with our AI model could not be made. Gatidis et al. (2022) reported the AUTOPET baseline, for which they showed validation DISC only on positive cases (0.73 ± 0.23). They did not report how their model performed on negative cases, i.e. cases with no detectable FDG-avid tumor lesion. In addition, they used the whole AUTOPET dataset, whereas we trained our model on part of this dataset and kept 20% of it as an unseen test set. Furthermore, their reported mean false positive/false negative volumes were 8.1 (±81.4) ml / 15.1 (±80.3) ml, whereas our model achieved mean false positive/false negative volumes of 6.27 (±18.62) ml / 1.62 (±9.33) ml on the AUTOPET test set, which is considerably lower, especially for the false negative volume. If we followed the reporting standards of Gatidis et al. (2022), our overall DISC would have been 0.67, with 46% of the scans having a DISC above 0.8 for the AUTOPET test set. To explain our model’s relatively low performance, we conducted an extensive failure analysis. Based on our observations in the related literature, failure analysis is not common practice; however, we chose that route for transparency. Our results also show that DISC might not always reflect the real performance of AI models in oncologic imaging, where lesion detection/localization is the higher-risk and higher-benefit goal, a point also advocated by Gatidis et al. (2022).

Leung et al. (2024) reported median DISC of 0.81, 0.76, 0.83, and 0.73 for patients with lung cancer, melanoma, lymphoma, and prostate cancer, respectively, with their deep semi-supervised transfer learning model for lesion segmentation using both FDG and PSMA PET/CT images. However, many of the datasets they used had fully or partially incomplete annotations, and it is not clear how they dealt with FPs in these cases. They did not share enough detail about the model development, or the model weights, for us to make a direct comparison. Tarai et al. (2024) developed a multistage 3D lesion segmentation model for PET/CT or PET/MR. They reported that the 3D UNet model achieved a DISC of 0.7±0.22 on the AUTOPET dataset. However, it failed to show considerable improvement for lesions with SUVmax<4, and FN lesions were observed near healthy organs with high physiological uptake, such as the brain, bladder, and heart.
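The gap between a DISC averaged over positive scans only and a DISC averaged over all scans can be illustrated with a small sketch. It assumes that a scan with no GT lesion scores 0 as soon as the model predicts any voxel, and that an empty prediction on an empty GT scores 1; how negative scans were actually scored is not stated in this section, so both conventions are assumptions. Masks are represented as sets of voxel coordinates, and all values are toy data.

```python
def dice(pred, truth):
    """Dice coefficient between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # assumed convention: empty vs. empty is perfect agreement
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


# (prediction, ground truth) per scan; toy data for illustration
scans = [
    ({(0, 0, 1), (0, 0, 2)}, {(0, 0, 1), (0, 0, 2)}),  # positive scan, exact match
    ({(1, 1, 1)}, {(1, 1, 1), (1, 1, 2), (1, 1, 3)}),  # positive scan, partial overlap
    ({(2, 2, 2)}, set()),                              # negative scan with one FP voxel
]

scores = [dice(p, t) for p, t in scans]
mean_all = sum(scores) / len(scores)

# Restrict to scans whose ground truth contains at least one lesion voxel
positive_scores = [s for s, (_, t) in zip(scores, scans) if t]
mean_positive = sum(positive_scores) / len(positive_scores)
```

Under these assumptions a single false positive voxel on a negative scan is enough to pull the overall mean well below the positive-only mean, which is one reason the two reporting conventions are not directly comparable.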

The AI model developed with the 3D_FullRes configuration from the nnUNet framework was further tested on 3 different clinical datasets comprising 3 relatively rare cancer types. It performed very well for the ST and PHEO cohorts, with sensitivities of 97.83% and 78.37% and DISC of 0.73 and 0.50, respectively, whereas it achieved a sensitivity of 40.15% and a DISC of 0.36 for the OS cohort. It is worth noting that the model was trained on a different cohort with different types of cancer, which supports the generalizability of our AI model. Additionally, in these three cohorts, expert annotations were either corrected from the AI-generated masks based on radiology reports (ST cohort) or generated using variable threshold-based techniques (OS and PHEO cohorts). The expert’s region of interest varies with the clinical protocol and type of cancer, so expert annotations from such clinical trials might not always include the entire lesion burden; hence, the performance of the AI might appear artificially lower because of the large number of “FPs”. In particular, the OS cohort reflects patients from a Phase 1 clinical trial who were heavily pre-treated, including prior amputation, prosthesis, surgical changes, and systemic therapies. It is therefore possible that the tumor-to-background characteristics observed in the other cohorts are not reflected in the OS cohort, which may lead to poor performance. Considering all this, the OS cohort was quite challenging for the AI, as it had not seen these types of cases during training. Overall, despite this large domain shift in the test population, related to large differences in disease biology, the AI model performed well for the PHEO and ST cohorts and moderately for the OS cohort, and the results were still comparable with the DISC scores of 0.68–0.89 reported in the literature for lymphoma, adenocarcinoma, and head and neck tumor segmentation models (Blanc-Durand et al., 2021; Shiri et al., 2024; Zhang et al., 2023).

To further evaluate the performance of our AI model, detailed tumor burden and failure analyses were conducted. The tumor burden analysis shows an overestimation of SUVmax, TMTV, and TLG for all the cohorts, indicating false inclusion of physiological uptake by the AI model. It was also observed that, in all the cohorts, lesions with very low SUVmax were missed, as they are quite difficult for the AI to distinguish from healthy organs. As 18F-FDG is not a tumor-specific tracer, it has poor uptake in low-grade tumors, and it can sometimes be difficult to differentiate post-treatment inflammatory changes from residual disease (Hassan & Cook, 2012). Moreover, 18F-FDG-PET/CT shows higher uptake in healthy organs with high glucose intake, e.g. the brain, heart, kidney, and bladder, which makes it challenging for the AI model to avoid labeling these healthy organs as lesions. Failure analysis based on the locations of FP and FN lesions in major anatomical structures such as the brain, lung, mediastinum, liver, spleen, and kidney showed the highest number of FN lesions in the liver region for the PHEO and OS cohorts, and in the lung and mediastinum region for the AUTOPET test and ST cohorts. The highest number of FP lesions was in the lung and mediastinum region for all the cohorts. To understand why the model struggled in these two specific areas (i.e. liver and lung), the training dataset was investigated. There was a relatively small number of training cases with liver lesions, suggesting that more data may be needed to improve results. For lung lesions, it is possible that the method of ground truth segmentation influenced the rate of FPs in PHEO (under-segmentation due to the threshold-based technique) and of FNs in OS (low uptake in missed lesions with increased background activity from prior treatments). Therefore, the model did not learn those areas properly.

The ICC of TLG in this study ranged from 0.7 to 1 across the cohorts, whereas some of the literature reported ICC for TLG in the range of 0.9–0.98 for lymphoma detection (Constantino et al., 2023; Fruh et al., 2021). Few deep-learning-based lesion segmentation studies were found that include a failure analysis of FN and FP lesions. Karimdjee et al. (2023) discussed tumor burden analysis for fully automated AI-based diffuse large B cell lymphoma detection models and reported findings similar to this study: smaller-volume lesions were missed, and TMTV and TLG were overestimated by the AI model.

This research has some important limitations. First, as discussed earlier, the ground truth used in the PHEO, OS, and ST cohorts does not reflect 100% of the tumor burden, and some lesions were missed by the expert in some cases; hence, the comparison with these ground truths shows an increased tumor burden for the AI model. Second, this study only focuses on models with UNet as the backbone architecture, based on the literature; however, there are many other potential architectures (Hatamizadeh et al., 2022; Liu et al., 2021) that might be used for comparison and should be considered in future studies. Third, the model failed to detect liver and lung lesions in most of the cohorts because most of the training scans contained few, and small, liver and lung lesions. The model needs to be trained with more data containing a higher number of liver and lung lesions to improve performance in those areas. A cascade model trained specifically on liver and lung lesions with more data is another potential way to overcome this limitation.

In conclusion, the AI model developed in our study for automated segmentation of lesions on whole-body 18F-FDG-PET/CT images delivers reliable performance metrics, with an acceptable range and category of FP lesions. Our deep learning-based AI model may help imaging physicians with the lesion annotation process.

Highlights:

  1. Generalizability of medical imaging AI models is often poorly studied

  2. Multiple frameworks for automated lesion segmentation on FDG-PET/CT were explored

  3. Models were trained on public datasets and evaluated in external test set of rare tumors

  4. Balancing tumor type and scan characteristics results in better generalizability

  5. A universal AI tool may help imaging physicians with the lesion annotation process

Acknowledgement:

The authors declare no conflicts of interest. This project is in part supported by the Intramural Research Program of the Center for Cancer Research, NCI. The investigators used the computational resources of the NIH HPC Biowulf cluster.

This research is funded by the intramural research program of NIH.

Footnotes

Declaration of interests

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:

B. Turkbey receives support as part of a CRADA between NIH and NVIDIA and between NIH and Philips Healthcare, and receives royalties from NIH, none of which are related to this work.

Peter L. Choyke receives payment from royalties paid to the U.S. government for patents on MRI US fusion biopsy licensed to Philips Medical. None are related to this work.

The remaining authors declare that there are no other disclosures relevant to the subject matter of this article.

The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

Code Availability

The best-performing model weights can be downloaded from the following GitHub repository https://github.com/HFahmida/FDG-PET-CT_AI.git

The authors confirm that all research was performed in compliance with relevant laws and institutional guidelines and was approved by our institution’s IRB. Patients gave written informed consent as part of one of the following clinical trials: NCT04539366, NCT02107963, NCT03206060, NCT00004847.

References

  1. Blanc-Durand P, Jegou S, Kanoun S, Berriolo-Riedinger A, Bodet-Milin C, Kraeber-Bodere F, Carlier T, Le Gouill S, Casasnovas RO, Meignan M, & Itti E (2021). Fully automatic segmentation of diffuse large B cell lymphoma lesions on 3D FDG-PET/CT for total metabolic tumour volume prediction using a convolutional neural network. Eur J Nucl Med Mol Imaging, 48(5), 1362–1370. 10.1007/s00259-020-05080-7
  2. Cardoso MJ, Li W, Brown R, Ma N, Kerfoot E, Wang Y, Murrey B, Myronenko A, Zhao C, Yang D, Nath V, He Y, Xu Z, Hatamizadeh A, Myronenko A, Zhu W, Liu Y, Zheng M, Tang Y, … Feng A (2022). MONAI: An open-source framework for deep learning in healthcare. arXiv preprint, arXiv:2211.02701. 10.48550/arXiv.2211.02701
  3. Chen L-C, Papandreou G, Kokkinos I, Murphy K, & Yuille AL (2018). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 15.
  4. Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, & Prior F (2013). The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging, 26(6), 1045–1057. 10.1007/s10278-013-9622-7
  5. Constantino CS, Leocadio S, Oliveira FPM, Silva M, Oliveira C, Castanheira JC, Silva A, Vaz S, Teixeira R, Neves M, Lucio P, Joao C, & Costa DC (2023). Evaluation of Semiautomatic and Deep Learning-Based Fully Automatic Segmentation Methods on [(18)F]FDG PET/CT Images from Patients with Lymphoma: Influence on Tumor Characterization. J Digit Imaging, 36(4), 1864–1876. 10.1007/s10278-023-00823-y
  6. Dice LR (1945). Measures of the Amount of Ecologic Association Between Species. Ecology, 26(3), 297–302. 10.2307/1932409
  7. Fallahpoor M, Chakraborty S, Pradhan B, Faust O, Barua PD, Chegeni H, & Acharya R (2024). Deep learning techniques in PET/CT imaging: A comprehensive review from sinogram to image space. Comput Methods Programs Biomed, 243, 107880. 10.1016/j.cmpb.2023.107880
  8. Ferrandez MC, Golla SSV, Eertink JJ, de Vries BM, Lugtenburg PJ, Wiegers SE, Zwezerijnen GJC, Pieplenbosch S, Kurch L, Huttmann A, Hanoun C, Duhrsen U, de Vet HCW, Petra, Zijlstra JM, & Boellaard R (2023). An artificial intelligence method using FDG PET to predict treatment outcome in diffuse large B cell lymphoma patients. Sci Rep, 13(1), 13111. 10.1038/s41598-023-40218-1
  9. Fruh M, Fischer M, Schilling A, Gatidis S, & Hepp T (2021). Weakly supervised segmentation of tumor lesions in PET-CT hybrid imaging. J Med Imaging (Bellingham), 8(5), 054003. 10.1117/1.JMI.8.5.054003
  10. Gatidis S, Früh M, Fabritius M, Gu S, Nikolaou K, Fougère CL, Ye J, He J, Peng Y, Bi L, Ma J, Wang B, Zhang J, Huang Y, Heiliger L, Marinov Z, Stiefelhagen R, Egger J, Kleesiek J, … Küstner T (2023). The autoPET challenge: Towards fully automated lesion segmentation in oncologic PET/CT imaging.
  11. Gatidis S, Hepp T, Fruh M, La Fougere C, Nikolaou K, Pfannenberg C, Scholkopf B, Kustner T, Cyran C, & Rubin D (2022). A whole-body FDG-PET/CT Dataset with manually annotated Tumor Lesions. Sci Data, 9(1), 601. 10.1038/s41597-022-01718-3
  12. Gatidis S, & Küstner T (2022). A whole-body FDG-PET/CT dataset with manually annotated tumor lesions (FDG-PET-CT-Lesions). 10.7937/gkr0-xv29
  13. Giavarina D (2015). Understanding Bland Altman analysis. Biochemia Medica, 25(2), 10. 10.11613/BM.2015.015
  14. Hadlich M, Marinov Z, & Stiefelhagen R (2023). AutoPET Challenge 2023: Sliding Window-based Optimization of U-Net. arXiv preprint, 2. 10.48550/arXiv.2309.12114
  15. Hassan FU, & Cook GJ (2012). PET/CT in oncology. Clinical Medicine, 12(4), 368–372. 10.7861/clinmedicine.12-4-368
  16. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth H, & Xu D (2022). Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. arXiv preprint. 10.48550/arXiv.2201.01266
  17. Im HJ, Bradshaw T, Solaiyappan M, & Cho SY (2018). Current Methods to Define Metabolic Tumor Volume in Positron Emission Tomography: Which One is Better? Nucl Med Mol Imaging, 52(1), 5–15. 10.1007/s13139-017-0493-6
  18. Isensee F, Jaeger PF, Kohl SAA, Petersen J, & Maier-Hein KH (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods, 18(2), 203–211. 10.1038/s41592-020-01008-z
  19. Jemaa S, Fredrickson J, Carano RAD, Nielsen T, de Crespigny A, & Bengtsson T (2020). Tumor Segmentation and Feature Extraction from Whole-Body FDG-PET/CT Using Cascaded 2D and 3D Convolutional Neural Networks. J Digit Imaging, 33(4), 888–894. 10.1007/s10278-020-00341-1
  20. Jiang Y, He X, Lee M-LT, Rosner B, & Yan J (2020). Wilcoxon Rank-Based Tests for Clustered Data with R Package clusrank. JournalofStatisticalSoftware, 96(6). 10.18637/jss.v096.i06 [DOI] [Google Scholar]
  21. Kaczanowska S, Murty T, Alimadadi A, Contreras CF, Duault C, Subrahmanyam PB, Reynolds W, Gutierrez NA, Baskar R, Wu CJ, Michor F, Altreuter J, Liu Y, Jhaveri A, Duong V, Anbunathan H, Ong C, Zhang H, Moravec R, … Kaplan RN (2024). Immune determinants of CAR-T cell expansion in solid tumor patients receiving GD2 CAR-T cell therapy. Cancer Cell, 42(1), 35–51 e38. 10.1016/j.ccell.2023.11.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Karimdjee M, Delaby G, Huglo D, Baillet C, Willaume A, Dujardin S, & Bailliez A (2023). Evaluation of a convolution neural network for baseline total tumor metabolic volume on [18F]FDG PET in diffuse large B cell lymphoma. European Radiology, 33, 2286–3395. 10.1007/s00330-022-09375-1 [DOI] [PubMed] [Google Scholar]
  23. Koo TK, & Li MY (2016). A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. Journal of Chiropractic Medicine, 15(2), 9. 10.1016/j.jcm.2016.02.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Landis JR, & Koch GG (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 16. 10.2307/2529310 [DOI] [PubMed] [Google Scholar]
  25. Lay N (2023). RandomSplit. https://github.com/nslay/RandomSplit
  26. Leung KH, Rowe SP, Sadaghiani MS, Leal JP, Mena E, Choyke PL, Du Y, & Pomper MG (2024). Deep Semisupervised Transfer Learning for Fully Automated Whole-Body Tumor Quantification and Prognosis of Cancer on PET/CT. J Nucl Med. 10.2967/jnumed.123.267048 [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, & Guo B (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint. 10.48550/arXiv.2103.14030 [DOI] [Google Scholar]
  28. Peng Y, Kim J, Feng D, & Bi L (2022). Automatic Tumor Segmentation via False Positive Reduction Network for Whole-Body Multi-Modality PET/CT Images. arXiv preprint. 10.48550/arXiv.2209.07705 [DOI] [Google Scholar]
  29. Phil T, Albrecht T, Gay S, & Rasmussen ME (2023). Sikerdebaard/dcmrtstruct2nii: dcmrtstruct2nii v5 (Version v5). In Zenodo. [Google Scholar]
  30. Ronneberger O, Fischer P, & Brox T (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention, Pt Iii, 9351, 234–241. 10.1007/978-3-319-24574-4_28 [DOI] [Google Scholar]
  31. Sadaghiani MS, Rowe SP, & Sheikhbahaei S (2021). Applications of artificial intelligence in oncologic 18F-FDG PET/CT imaging: a systematic review. Annals of Translational Medicine, 9(9).
  32. Satoh Y, Nambu A, Ichikawa T, & Onishi H (2014). Whole-body total lesion glycolysis measured on fluorodeoxyglucose positron emission tomography/computed tomography as a prognostic variable in metastatic breast cancer. BMC Cancer, 14(525). 10.1186/1471-2407-14-525
  33. Shiri I, Amini M, Yousefirizi F, Sadr AV, Hajianfar G, Salimi Y, Mansouri Z, Jenabi E, Maghsudi M, Mainta I, Becker M, Rahmim A, & Zaidi H (2024). Information fusion for fully automated segmentation of head and neck tumors from PET and CT images. Medical Physics, 51, 319–333. 10.1002/mp.16615
  34. Sibille L, Seifert R, Avramovic N, Vehren T, Spottiswoode B, Zuehlsdorff S, & Schäfers M (2020). 18F-FDG PET/CT Uptake Classification in Lymphoma and Lung Cancer by Using Deep Convolutional Neural Networks. Radiology, 294(2), 445–452. 10.1148/radiol.2019191114
  35. Sibille L, Zhan X, & Xiang L (2022). Whole-body tumor segmentation of 18F-FDG PET/CT using cascaded and ensembled convolutional neural networks. arXiv preprint. 10.48550/arXiv.2210.08068
  36. Silversmith W (2021). cc3d: Connected components on multilabel 3D & 2D images. (3.2.1).
  37. Sørensen T (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons (Publication Number 4) https://www.royalacademy.dk/Publications/High/295_S%C3%B8rensen,%20Thorvald.pdf
  38. Tarai S, Lundstrom E, Sjoholm T, Jonsson H, Korenyushkin A, Ahmad N, Pedersen MA, Molin D, Enblad G, Strand R, Ahlstrom H, & Kullberg J (2024). Improved automated tumor segmentation in whole-body 3D scans using multi-directional 2D projection-based priors. Heliyon, 10(4), e26414. 10.1016/j.heliyon.2024.e26414
  39. Wang H, Zhang N, Huo L, & Zhang B (2019). Dual-modality multi-atlas segmentation of torso organs from [(18)F]FDG-PET/CT images. Int J Comput Assist Radiol Surg, 14(3), 473–482. 10.1007/s11548-018-1879-3
  40. Wang X, Jemaa S, Fredrickson J, Coimbra AF, Nielsen T, De Crespigny A, Bengtsson T, & Carano RAD (2022). Heart and bladder detection and segmentation on FDG PET/CT by deep learning. BMC Med Imaging, 22(1), 58. 10.1186/s12880-022-00785-7
  41. Wasserthal J, Breit H-C, Meyer MT, Pradella M, Hinck D, Sauter AW, Heye T, Boll DT, Cyriac J, Yang S, Bach M, & Segeroth M (2023). TotalSegmentator: Robust Segmentation of 104 Anatomic Structures in CT Images. Radiology: Artificial Intelligence, 5(5). 10.1148/ryai.230024
  42. Weber WA (2010). Quantitative analysis of PET studies. Radiother Oncol, 96(3), 308–310. 10.1016/j.radonc.2010.07.004
  43. Yang E, Kim JH, Min JH, Jeong WK, Hwang JA, Lee JH, Shin J, Kim H, Lee SE, & Baek SY (2024). nnU-Net-Based Pancreas Segmentation and Volume Measurement on CT Imaging in Patients with Pancreatic Cancer. Acad Radiol. 10.1016/j.acra.2024.01.004
  44. Ye J, Wang H, Huang Z, Deng Z, Su Y, Tu C, Wu Q, Yang Y, Wei M, Niu J, & He J (2022). Exploring Vanilla U-Net for Lesion Segmentation from Whole-body FDG-PET/CT Scans. arXiv preprint. 10.48550/arXiv.2210.07490
  45. Yousefirizi F, Klyuzhin IS, O JH, Harsini S, Tie X, Shiri I, Shin M, Lee C, Cho SY, Bradshaw TJ, Zaidi H, Benard F, Sehn LH, Savage KJ, Steidl C, Uribe CF, & Rahmim A (2024). TMTV-Net: fully automated total metabolic tumor volume segmentation in lymphoma PET/CT images - a multi-center generalizability analysis. Eur J Nucl Med Mol Imaging, 51(7), 1937–1954. 10.1007/s00259-024-06616-x
  46. Zhang G, Bao C, Liu Y, Wang Z, Du L, Zhang Y, Wang F, Xu B, Zhou SK, & Liu R (2023). (18)F-FDG-PET/CT-based deep learning model for fully automated prediction of pathological grading for pancreatic ductal adenocarcinoma before surgery. EJNMMI Res, 13(1), 49. 10.1186/s13550-023-00985-4
