Author manuscript; available in PMC: 2024 Dec 1.
Published in final edited form as: Gastroenterology. 2023 Aug 30;165(6):1533–1546.e4. doi: 10.1053/j.gastro.2023.08.034

Automated Artificial Intelligence Model Trained on a Large Dataset Can Detect Pancreas Cancer on Diagnostic CTs as well as Visually Occult Pre-invasive Cancer on Pre-diagnostic CTs

Panagiotis Korfiatis 1,*, Garima Suman 2,*, Nandakumar G Patnam 3, Kamaxi H Trivedi 4, Aashna Karbhari 5, Sovanlal Mukherjee 6, Cole Cook 7, Jason R Klug 8, Anurima Patra 9, Hala Khasawneh 10, Naveen Rajamohan 11, Joel G Fletcher 12, Mark J Truty 13, Shounak Majumder 14, Candice W Bolan 15, Kumar Sandrasegaran 16, Suresh T Chari 17, Ajit H Goenka 18,**
PMCID: PMC10843414  NIHMSID: NIHMS1929660  PMID: 37657758

Abstract

Background & Aims:

The aims of our case-control study were: 1) to develop an automated 3D-Convolutional Neural Network (CNN) for detection of PDA on diagnostic CTs; 2) to evaluate its generalizability on multi-institutional public datasets; 3) to assess its utility as a potential screening tool using a simulated cohort with high pretest probability; and 4) to test its ability to detect visually occult pre-invasive cancer on pre-diagnostic CTs.

Methods:

A 3D-CNN classification system was trained using algorithmically generated bounding boxes and pancreatic masks on a curated dataset of 696 portal phase diagnostic CTs with PDA and 1080 controls with non-neoplastic pancreas. The model was evaluated on (a) an intramural hold-out test subset (409 CTs with PDA, 829 controls); (b) a simulated cohort with a case-control distribution that matched the risk of PDA in glycemically-defined new-onset diabetes and END-PAC score ≥3; (c) multi-institutional public datasets (194 CTs with PDA, 80 controls); and (d) a cohort of 100 pre-diagnostic CTs (i.e., CTs incidentally acquired 3–36 months before the clinical diagnosis of PDA) without a focal mass, and 134 controls.

Results:

The majority of CTs (n=798; 64%) in the intramural test subset were from outside hospitals. The model correctly classified 360 (88%) CTs with PDA and 783 (94%) controls [accuracy (mean; 95% CI) 0·92 (0·91–0·94); AUROC 0·97 (0·96–0·98), sensitivity 0·88 (0·85–0·91), specificity 0·95 (0·93–0·96)]. Activation areas on heat maps overlapped with the tumor in most CTs (350/360 CTs; 97%). Performance was high across tumor stages (sensitivity 0·80, 0·87, 0·95 and 1·0 for T1 through T4 stages, respectively), comparable for hypodense versus isodense tumors (sensitivity: 0·90 vs. 0·82), consistent across age, sex, CT slice thickness, and vendor (all p >0·05), and generalizable on both the simulated cohort [accuracy 0·95 (0·94–0·95), AUROC 0·97 (0·94–0·99)] and public datasets [accuracy 0·86 (0·82–0·90), AUROC 0·90 (0·86–0·95)]. Despite being exclusively trained on diagnostic CTs with larger tumors, the model could detect occult PDA on pre-diagnostic CTs [accuracy 0·84 (0·79–0·88), AUROC 0·91 (0·86–0·94), sensitivity 0·75 (0·67–0·84), specificity 0·90 (0·85–0·95)] at a median of 475 days (range: 93–1082) prior to clinical diagnosis.

Conclusions:

An automated AI model trained on a large and diverse dataset shows high accuracy and generalizable performance for detection of PDA on diagnostic CTs as well as of visually occult PDA on pre-diagnostic CTs. Prospective validation with blood-based biomarkers is warranted to assess the potential for early detection of sporadic PDA in high-risk subjects.

Keywords: Pancreas, Artificial Intelligence, Biomarkers, Pancreatic Ductal Carcinoma, Computed Tomography

Graphical Abstract


LAY SUMMARY

An automated artificial intelligence model shows promise for early pancreas cancer detection on standard CT scans at a stage when surgical cure may be possible.

INTRODUCTION

Pancreatic ductal adenocarcinoma (PDA) is a recalcitrant cancer and the third leading cause of cancer-related deaths in the United States (US).1 Earlier detection has been shown to improve surgical resectability, post-surgical prognosis, and confer survival benefit even beyond lead time.2, 3 Screening efforts in high-risk populations have shown that screen-detected PDAs tend to be smaller and have a better prognosis than clinically detected cancers.4,5 Thus, there is an urgent need to overcome the barriers to earlier detection to reduce the mortality from PDA.

Recently, individuals with glycemically-defined new-onset diabetes (NOD) and an Enriching New-Onset Diabetes for Pancreatic Cancer (END-PAC) score of ≥3 have been identified as high-risk individuals (HRIs) with a 3%−4% risk of sporadic PDA.6 The success of screening even in such HRIs is critically dependent upon the ability of imaging to detect early PDA.7 Unfortunately, despite rigorous screening, an alarming majority of HRIs are being diagnosed with advanced stages of PDA. Importantly, most of these HRIs show no pancreatic abnormalities on previous CTs preceding a PDA diagnosis.5, 8, 9 Thus, such CTs harbor ‘visually occult’ PDAs. Secondly, a substantial number of incidental PDAs, including later-stage tumors, tend to be missed on CTs due to subtle imaging features, inadequate attention to the pancreas, or technical inadequacies.3, 10–12 Such CTs have truly missed PDAs. Recent studies have highlighted the significant potential of AI for detection of PDA on imaging.13–17 While these studies have contributed valuable knowledge, important limitations remain: small sample sizes, inadequate consideration of secondary anatomic signs or explainability, failure to stratify performance by tumor size and stage, lack of evaluation on pre-diagnostic CTs or public datasets, and dependence on radiologist-provided segmentations, which restricts scalability in clinical practice. These considerations emphasize the need for further exploration and refinement in AI-enabled early PDA detection.

However, the development of clinically translatable AI systems to overcome the aforementioned limitations is a major challenge. First, there is a paucity of training datasets with early PDA because the disease is typically detected at a late stage. Second, manual segmentation to create large training datasets is a very time-intensive task and prone to high inter- and intra-observer variation.18–20 Third, the pancreas tends to be normal-appearing on visual inspection at the pre-diagnostic stage in many patients.21 Therefore, a clinically relevant AI model should also have the ability to detect pre-invasive visually occult cancer at the asymptomatic stage. Finally, transparency in the model’s decision-making process is necessary for quality assurance, to elicit stakeholder trust, promote physician-patient dialogue, and identify inadvertent biases.22, 23

We reasoned that these requirements could be addressed through development and validation of an automated (i.e., not dependent upon manual inputs) and potentially interpretable AI model for detection of PDA at both the diagnostic and the pre-diagnostic stage. Thus, the aims of our case-control study were: 1) to develop a 3D-Convolutional Neural Network (CNN) for automated detection of PDA on CTs; 2) to evaluate its generalizability on multi-institutional public datasets; 3) to assess its utility as a potential screening tool using a simulated cohort with high pretest probability of PDA; and 4) to test its ability to detect visually occult pre-invasive PDA on pre-diagnostic CTs (i.e., incidental CTs acquired for unrelated indications around 3–36 months prior to clinical diagnosis).

MATERIALS AND METHODS

This Health Insurance Portability and Accountability Act-compliant retrospective case-control study was approved by our institutional review board, which waived the requirement for informed consent.

Intramural patient cohort

CTs with PDA:

We searched the electronic medical records (EMR) for patients with treatment-naïve biopsy-proven PDA. The abdominal CTs of these patients at the time of diagnosis, hereafter referred to as ‘diagnostic CTs’, were de-identified by anonymization of Digital Imaging and Communications in Medicine (DICOM) tags using Clinical Trial Processor and converted into Neuroimaging Informatics Technology Initiative (NIfTI) files. These CTs were reviewed by radiologist investigators to exclude CTs with suboptimal image quality or biliary stents (figure 1). The process led to a curated medical imaging data readiness (MIDaR) grade A dataset24 of 1105 unique diagnostic CTs (627 men, 478 women; mean age: 66 years, range: 25–92 years; January 2006–December 2020). The following variables were extracted from all CTs: epicenter of the tumor (pancreatic head, body, tail, or extra-pancreatic); tumor density (isodense versus hypodense); presence of common bile duct (CBD) or main pancreatic duct (MPD) dilatation and/or cut off; pancreatic atrophy; and T-stage as per the American Joint Committee on Cancer (AJCC) (8th edition) staging system.18, 25

Figure 1.

Study datasets and design: Dataset curation process for intramural and multi-institutional public CT datasets.

Control CTs with non-neoplastic pancreas:

A control CT dataset was identified from our Radiology Information System based on a statement in the CT report that the pancreas was normal (figure 1). Only one CT scan was selected per subject. The process led to a curated cohort of 1909 control CTs (872 men, 1037 women; mean age: 56 years, range: 18–94 years; January 2002–September 2020). All control CTs (n=1909) were reviewed by radiologist investigators to document non-malignant pancreatic findings (n=277), which were as follows: fatty infiltration with or without atrophy (n=179), cystic lesions (n=63), calcifications (n=24), and miscellaneous (n=11). Of note, 267 of these 1909 CTs with normal pancreas had been part of a different study.15, 18

Thus, the final curated intramural dataset consisted of 1105 diagnostic CTs with PDA and 1909 control CTs. Of the 1105 diagnostic CTs with PDA, 35 (3%) had tumors ≤2 cm in diameter (stage T1). These CTs were exclusively assigned to the hold-out test subset so that the model’s performance could be tested on the smallest tumors. The remainder of the CTs with PDA and the control CTs were randomly divided into training-validation (total n=1776 CTs; 696 diagnostic CTs with PDA and 1080 control CTs) and test subsets (total n=1238 CTs; 409 diagnostic CTs with PDA and 829 controls) (Table 1).

Table 1.

Comparison of training-validation and intramural test subsets

                                    Training-Validation Subset         Intramural Test Subset
                                    Controls (n=1080)  PDA (n=696)     Controls (n=829)  PDA (n=409)
Mean age (SD) (years)               56.4 (16.2)        65.1 (10)       55.8 (15.8)       66.6 (10)
Males:Females                       0.85:1             1.4:1           0.8:1             1.2:1
CT slice thickness (mm)
  ≤1.25                             364 (34%)          291 (42%)       296 (36%)         174 (43%)
  >1.25 to ≤3                       710 (66%)          291 (42%)       513 (62%)         147 (36%)
  >3                                6 (0.6%)           114 (16%)       20 (2.4%)         88 (22%)
CT vendor
  Siemens                           986 (91%)          391 (56%)       731 (88%)         240 (59%)
  GE                                55 (5%)            176 (25%)       76 (9%)           120 (29%)
  Toshiba                           39 (4%)            91 (13%)        21 (3%)           42 (10%)
  Philips                           0                  38 (5%)         1 (0.1%)          7 (1.7%)
Tumor location
  Head                              ..                 364 (52%)       ..                292 (71%)
  Body                              ..                 152 (22%)       ..                46 (11%)
  Tail                              ..                 156 (22%)       ..                67 (16%)
  Predominantly extra-pancreatic    ..                 24 (3.4%)       ..                4 (1%)
Tumor stage
  1                                 ..                 0               ..                35 (9%)
  2                                 ..                 155 (22%)       ..                291 (71%)
  3                                 ..                 186 (27%)       ..                60 (15%)
  4                                 ..                 355 (51%)       ..                23 (6%)
Isodense tumors                     ..                 85 (12%)        ..                87 (21%)
Non-malignant findings              156 (14.4%)        ..              121 (14.6%)       ..
  Fatty infiltration ± atrophy      104 (9.6%)         ..              75 (9%)           ..
  Cystic lesions                    28 (2.6%)          ..              35 (4.2%)         ..
  Calcifications                    17 (1.6%)          ..              7 (0.8%)          ..
  Miscellaneous                     7 (0.7%)           ..              4 (0.5%)          ..

We also conducted a focused prospective investigation to evaluate whether non-malignant pancreatic findings result in false positive predictions. We used 297 consecutive portal venous phase abdominal CTs prospectively performed in our hospital over one week in May 2023 in patients who had research authorization on file [152 men, 145 women; mean (range) age: 58.2 (19–89) years; 77.4% outpatients; 22.6% hospitalized or emergency care patients]. Each CT was reviewed by radiologist investigators to record non-malignant pancreatic findings (n=27), which were as follows: fatty infiltration with or without atrophy (n=14), cystic lesions (n=9), and pancreatitis with or without calcifications (n=4).

CNN architecture

We developed a three-dimensional (3D) CNN model, which utilized inputs from two channels (figure 2): 1) an algorithmically generated bounding box containing the pancreas (including the tumor in diagnostic CTs with PDA) with the peripancreatic tissue; and 2) a segmentation mask of the pancreas (including the tumor in diagnostic CTs with PDA) without the peripancreatic tissue. In this automated approach, the segmentation mask of the pancreas (including the tumor in the diagnostic CTs with PDA) was generated using our previously published automated CNN for volumetric pancreas segmentation. The CT images were interpolated to a resolution of 0.8×0.8×3 mm. Pancreas bounding boxes were padded to a size of 128×128×64; bounding boxes larger than this pre-defined size were resized to these dimensions. CT window settings of width 250 Hounsfield units (HU) and level 75 HU were applied to the data. Images were normalized to [0, 1].
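As a rough illustration of the preprocessing described above, the sketch below applies the stated CT window (level 75 HU, width 250 HU), normalizes to [0, 1], and fits the pancreas bounding box to 128×128×64. This is a minimal sketch, not the authors' code: function names are illustrative, the 0.8×0.8×3 mm resampling step is omitted, and a simple center-crop stands in for the resizing applied to oversized boxes.

```python
import numpy as np

def window_and_normalize(ct, level=75.0, width=250.0):
    """Apply the stated CT window (level 75 HU, width 250 HU), then scale to [0, 1]."""
    lo, hi = level - width / 2.0, level + width / 2.0   # [-50, 200] HU
    return (np.clip(ct, lo, hi) - lo) / (hi - lo)

def fit_to_box(volume, target=(128, 128, 64)):
    """Zero-pad a pancreas bounding box to the fixed input size; a center
    crop stands in here for the resizing the authors apply to oversized boxes."""
    slices = tuple(
        slice((d - t) // 2, (d - t) // 2 + t) if d > t else slice(0, d)
        for d, t in zip(volume.shape, target)
    )
    volume = volume[slices]
    out = np.zeros(target, dtype=volume.dtype)
    out[:volume.shape[0], :volume.shape[1], :volume.shape[2]] = volume
    return out

rng = np.random.default_rng(0)
box = rng.integers(-1000, 1000, size=(90, 110, 40)).astype(np.float32)  # toy HU volume
x = fit_to_box(window_and_normalize(box))
print(x.shape)
```

The windowed-and-normalized, fixed-size array is what would be fed to the two input channels (with the analogous mask volume as the second channel).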

Figure 2.

Convolutional neural network (CNN) architecture and pipeline for classification of CTs into those with PDA or with non-neoplastic pancreas. Bounding box and segmentation mask help narrow the area-of-analysis for the model to the pancreas and peripancreatic tissue. The entire pancreas together with tumor (in CTs with PDA) was created as a single segmentation mask. The model’s pipeline is fully automated and does not depend upon manual segmentations.

We utilized a modified ResNet architecture26 as part of our algorithm (Figure 2). In the modified CNN versions, Squeeze-and-Excitation modules were utilized together with attention modules.27 The Squeeze-and-Excitation modules identify interdependencies between the channels of the input and adaptively recalibrate the channel-wise feature responses. To prevent overfitting, L2 regularization of 1e−3 was applied to all learnable parameters.28 Leaky rectified linear units (Leaky ReLU, α=0.001) were used as activation functions.29 The final prediction of the model was either class 0, which corresponded to normal pancreas, or class 1, which corresponded to PDA.
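The Squeeze-and-Excitation recalibration can be illustrated with a minimal NumPy sketch: a global-average "squeeze" per channel, a two-layer bottleneck "excitation" with ReLU and sigmoid, and a channel-wise rescaling of the input. The random weights, channel count, and reduction ratio below are stand-in assumptions, not the authors' configuration.

```python
import numpy as np

def squeeze_excite(fmap, w_reduce, w_expand):
    """Squeeze-and-Excitation on a (D, H, W, C) feature map: global-average
    'squeeze' per channel, bottleneck 'excitation', channel-wise rescaling."""
    z = fmap.mean(axis=(0, 1, 2))                   # squeeze -> (C,)
    s = np.maximum(z @ w_reduce, 0.0)               # reduce + ReLU
    gate = 1.0 / (1.0 + np.exp(-(s @ w_expand)))    # expand + sigmoid, in (0, 1)
    return fmap * gate                              # recalibrate channels

rng = np.random.default_rng(1)
C, r = 16, 4                                        # channels, reduction ratio (assumed)
fmap = rng.standard_normal((4, 4, 4, C))
out = squeeze_excite(fmap,
                     rng.standard_normal((C, C // r)) * 0.1,
                     rng.standard_normal((C // r, C)) * 0.1)
print(out.shape)
```

Because the sigmoid gate lies in (0, 1), the module can only attenuate channels relative to the input, which is what "recalibration" means here.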

Offline image augmentation was performed to simulate over- and under-segmentation errors, known inter-reader variability in segmentations, and to compensate for not using the manual segmentations as input channels. The latter approach was chosen in view of our goal of an automated approach. The augmentation strategies included rotations (+/− 10 degrees), zoom (+/− 10%), corruption of the pancreas segmentation region utilizing erosion and dilations, and intensity shifting.
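The mask-corruption part of this augmentation can be sketched as follows. This is a simplified illustration: a one-voxel dilation/erosion over the 6-connected neighborhood built from voxel shifts (np.roll wraps at the volume edges, which is acceptable for a sketch), plus a small intensity shift; the ±10° rotations and ±10% zooms are omitted.

```python
import numpy as np

def dilate(mask):
    """One-voxel binary dilation over the 6-connected neighborhood."""
    out = mask.copy()
    for ax in range(3):
        for shift in (-1, 1):
            out = np.maximum(out, np.roll(mask, shift, axis=ax))
    return out

def erode(mask):
    """Binary erosion as the complement of dilating the complement."""
    return 1 - dilate(1 - mask)

def corrupt_and_shift(image, mask, rng):
    """Randomly over- or under-segment the pancreas mask (simulating
    segmentation errors) and apply a small intensity shift."""
    mask = dilate(mask) if rng.random() < 0.5 else erode(mask)
    return image + rng.uniform(-0.05, 0.05), mask

m = np.zeros((5, 5, 5), dtype=int)
m[2, 2, 2] = 1
print(dilate(m).sum(), erode(dilate(m)).sum())   # → 7 1
```

A single voxel dilates to itself plus its six face neighbors; eroding that cross back leaves only the center, which is why dilation/erosion pairs are a cheap way to perturb mask boundaries.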

The study examined two different objective functions: binary cross entropy and focal loss. The initial layer of the model was constructed using either 16, 32 or 64 filters. During the training, the model’s performance was assessed using precision, recall and F1 score metrics. A batch size of 16 was used and the adaptive optimizer, Adam, was employed with a learning rate of 0.001. Additionally, a step decay learning rate scheduler was implemented. These were implemented using Tensorflow (Version 2.3.1) and were run on Nvidia GPUs (Nvidia, Inc., Santa Clara, CA) with 32 GB of memory.
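The two objective functions examined can be contrasted in a few lines of NumPy. The γ and α values below are the common focal-loss defaults, not parameters reported by the authors; the point is the (1 − p_t)^γ factor, which down-weights confidently correct (easy) examples relative to binary cross entropy.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-7):
    """Per-example binary cross entropy."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss: (1 - p_t)^gamma down-weights easy examples.
    gamma/alpha are common defaults, not values from the paper."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

y = np.array([1, 1, 0, 0])
p = np.array([0.95, 0.60, 0.05, 0.40])   # two easy, two harder predictions
bce, fl = binary_cross_entropy(y, p), focal_loss(y, p)
print(bce.round(3), fl.round(3))
```

Note how the confident prediction (p=0.95) keeps a far smaller share of its cross-entropy loss than the uncertain one (p=0.60).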

Multi-institutional public CT datasets

To evaluate the model’s generalizability, we also tested it on three multi-institutional public CT datasets: the Cancer Imaging Archive Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma (TCIA-CPTAC) dataset, the Medical Segmentation Decathlon (MSD) dataset, and the National Institutes of Health-Pancreas CT (NIH-PCT) dataset.3032 Detailed description of these public datasets has been previously documented.20 These datasets were curated in a manner identical to the intramural dataset. The curation process resulted in a public test subset of 194 diagnostic CTs with PDA (42 from TCIA-CPTAC and 152 from MSD datasets), and 80 control CTs from the NIH-PCT dataset (figure 1). Patient demographics and CT acquisition parameters were obtained from the available metadata.

Simulated population with high pretest probability of PDA

For additional cross-validation of the model, we used a bootstrapping approach to generate multiple random samples of the intramural test subset such that their case-control distribution matched the 3-year risk of sporadic PDA (1–5%) in a cohort with NOD and an END-PAC score ≥3.33 Bootstrapping is a statistical procedure that randomly resamples a single dataset to create many simulated test sets. During each of 1000 iterations, we randomly selected a disease prevalence from the range of 1–5% and then randomly sampled patients, with replacement, to match that prevalence, i.e., each simulated test subset contained diagnostic CTs with PDA and control CTs in a ratio between 1:100 and 5:100.34 For uniformity, each bootstrap iteration used the same number of controls as the test set. Finally, we computed the area under the receiver operating characteristic (AUROC) curve for each bootstrapped sample.34
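A minimal sketch of this prevalence-matched bootstrap, assuming the model outputs a continuous score per CT; the score distributions below are synthetic stand-ins, and the rank-based AUROC ignores ties (fine for continuous scores).

```python
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U statistic / (n_pos * n_neg))."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_low_prevalence_auroc(case_scores, control_scores, n_iter=1000, seed=0):
    """Resample cases with replacement so each simulated test set has a
    PDA prevalence drawn from 1-5% against the full control set."""
    rng = np.random.default_rng(seed)
    aurocs = []
    for _ in range(n_iter):
        prevalence = rng.uniform(0.01, 0.05)
        n_cases = max(1, round(prevalence * len(control_scores)))
        cases = rng.choice(case_scores, size=n_cases, replace=True)
        scores = np.concatenate([cases, control_scores])
        labels = np.concatenate([np.ones(n_cases), np.zeros(len(control_scores))])
        aurocs.append(auroc(labels, scores))
    return np.array(aurocs)

rng = np.random.default_rng(7)
pda_scores = rng.normal(2.0, 1.0, 400)    # synthetic model scores for cases
ctrl_scores = rng.normal(0.0, 1.0, 800)   # synthetic scores for controls
sims = bootstrap_low_prevalence_auroc(pda_scores, ctrl_scores, n_iter=200)
print(round(sims.mean(), 3))
```

Because AUROC is prevalence-invariant in expectation, the mean across simulated low-prevalence sets should track the full-test-set AUROC, while precision and F1 drop, as the reported results show.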

Pre-diagnostic CT cohort

To evaluate whether the model could detect pre-invasive cancer even in the absence of a focal lesion (i.e., visually occult PDA), we used a previously identified cohort of pre-diagnostic CTs.15 The latter were defined as incidental portal venous phase CTs performed for unrelated indications (e.g., trauma, fever or sepsis of unknown origin, abdominal aortic aneurysms, guidance for biopsies and other procedures, bowel obstruction and/or mesenteric ischemia, etc.) between 3 and 36 months prior to the clinical diagnosis of PDA. All these CTs had been previously interpreted as negative for PDA during routine clinical evaluation, which was further confirmed during data curation by radiologist investigators. The pre-diagnostic CTs of patients whose diagnostic CTs were part of the training-validation of the model (vide supra) were excluded to avoid bias or overestimation of the model’s performance. The curation process resulted in a dataset of 100 pre-diagnostic CTs [59 men, 41 women; mean age (SD): 67 (10.8) years] (figure 1). Finally, a cohort of 133 subjects [72 men, 61 women; mean age (SD): 66 (12) years] with non-malignant pancreas, randomly drawn using our Radiology Information System and with at least 3 years of follow-up negative for PDA, was used as controls. Of these control CTs (n=133), 24 (18%) had non-malignant pancreatic findings, which were as follows: fatty infiltration with or without atrophy (n=16), cystic lesions (n=6), and calcifications (n=2).

Evaluation of Interpretability

Heat maps were created using the Gradient-weighted Class Activation Mapping (Grad-CAM) technique, based on the input of the bounding box CNN. Grad-CAM uses the gradients of the outcome-of-interest flowing into the final convolutional layer to generate a coarse localization map. This map highlights the regions that were deemed important by the model for its prediction. The end result is a ‘heatmap’ that highlights the part(s) of the input image on which the convolutional layers focused the most during outcome prediction. All preprocessing transformations were inverted to project the heat map onto the original CT image and facilitate its review by radiologist investigators.
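Given the activations and gradients of the final convolutional layer, the Grad-CAM map reduces to a weighted, rectified sum of channels. A minimal sketch (the arrays below are random stand-ins for real layer outputs):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: channel weights are the global average of the gradients of
    the target logit; the localization map is the ReLU of the weighted sum
    of channels, scaled to [0, 1]. Inputs: (D, H, W, C) arrays."""
    weights = gradients.mean(axis=(0, 1, 2))                 # one weight per channel
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)  # ReLU
    return cam / cam.max() if cam.max() > 0 else cam

rng = np.random.default_rng(3)
act = rng.random((8, 8, 4, 16))            # stand-in last-conv activations
grad = rng.standard_normal((8, 8, 4, 16))  # stand-in gradients of the PDA logit
cam = grad_cam(act, grad)
print(cam.shape)
```

In the study's pipeline, this coarse map is then upsampled and mapped back through the inverted preprocessing transforms onto the original CT.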

In diagnostic CTs with PDA, regions of highest activation on the heatmap were algorithmically compared to the known tumor location to determine whether the model focused on the tumor and the peri-tumoral pancreatic parenchyma for its outcome prediction. Conversely, in the misclassified control CTs, such regions of activation and the underlying pancreatic parenchyma were inspected by radiologist investigators to assess for potential variations in normal morphology that could explain the activation.

Statistical analysis

The model’s performance was evaluated using sensitivity, specificity, accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUROC). A Shapiro-Wilk test (scipy package v 1.5.4) was used to assess normality of the data distribution. Continuous variables are reported as mean [standard deviation (SD)] or mean [95% confidence interval (CI)].
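These threshold-based metrics follow directly from confusion-matrix counts. As a worked check, plugging in the counts reported for the intramural test subset (360/409 CTs with PDA and 783/829 controls correctly classified) closely reproduces the headline point estimates:

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity (= recall), specificity, accuracy, precision, and F1
    from confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return {"sensitivity": sens, "specificity": spec, "accuracy": acc,
            "precision": prec, "f1": f1}

# Counts from the intramural test subset: 360/409 CTs with PDA and
# 783/829 controls correctly classified.
m = classification_metrics(tp=360, fn=49, tn=783, fp=46)
print({k: round(v, 2) for k, v in m.items()})
```

Sensitivity 360/409 ≈ 0.88, accuracy 1143/1238 ≈ 0.92, and precision 360/406 ≈ 0.89 match the reported values (the reported intervals come from bootstrap CI estimation, which this point-estimate check does not reproduce).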

To binarize the model’s output, the optimal cutoff value that yielded maximum sensitivity and specificity for classification was estimated based on Youden’s index, leveraging the validation set. This threshold was then applied to the test cohorts to categorize the model’s output as PDA (or pre-invasive cancer in the case of pre-diagnostic CTs) versus control pancreas.
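A minimal sketch of selecting the Youden-optimal cutoff on a validation set, as an exhaustive scan over observed scores (toy data shown; real pipelines would typically scan ROC operating points):

```python
import numpy as np

def youden_threshold(labels, scores):
    """Scan every observed score as a candidate cutoff and return the one
    maximizing Youden's J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        sens = (pred & (labels == 1)).sum() / (labels == 1).sum()
        spec = (~pred & (labels == 0)).sum() / (labels == 0).sum()
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.35, 0.50, 0.70, 0.90])
t, j = youden_threshold(labels, scores)
print(t, j)
```

The chosen cutoff is then frozen and applied unchanged to the test cohorts, exactly as described above (the study's reported validation-set cutoff was 0·45).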

To assess the impact of parameters such as age, sex, CT slice thickness, CT vendor, and site of CT acquisition (our hospital versus outside hospital) on the model’s performance, a bootstrap test for differences in AUROC was performed utilizing the functions in the pROC package in R.35 Separate ROC curves were generated for each binning of the variable-of-interest. Pairwise tests were performed via stratified bootstrap to compare the AUROC between ROC curves generated from each binning of the variables-of-interest. To assess the potential impact of change in CT technologies, the model’s performance was analyzed through stratification of CTs into time intervals (2002–10, 2010–15, 2015–20). The Benjamini-Hochberg procedure controlled for the false discovery rate in these pairwise tests and accounted for multiple comparisons.36, 37
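The Benjamini-Hochberg step-up procedure used to control the false discovery rate across these pairwise tests can be sketched as follows (toy p-values for illustration):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: sort the m p-values, find the largest rank i with
    p_(i) <= (i/m)*q, and reject all hypotheses up to that rank.
    Returns a boolean rejection mask in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()
        reject[order[:k + 1]] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))
```

With q=0.05 and five tests, the per-rank thresholds are 0.01, 0.02, …, 0.05, so only the first two sorted p-values survive in this toy example.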

As part of failure mode analyses, we compared differences in patient demographics (age, sex), CT parameters (slice thickness, vendor), and tumor-related parameters (tumor epicenter, CBD or MPD dilatation and/or cut off, and pancreatic atrophy) between the correctly classified and misclassified groups to investigate any confounding factors or biases. To evaluate whether the model’s performance was biased toward any tumor-related parameter, permutation tests based on mean log loss were performed. Log loss is a probability-based metric that quantifies the average difference between the predicted and observed probability distributions. Mean performance based on log loss was calculated and compared between subgroups (e.g., different T-stages). If a demographic variable or CT acquisition parameter differed significantly between the PDA and control cohorts, the potential classification improvement attributable to the variable-in-question was investigated via the difference in AUROC between the model’s prediction alone and a logistic regression combining the model’s prediction with the variable-in-question. The statistical and failure mode analyses were done in the same manner for both the diagnostic and pre-diagnostic CTs, except that the above-mentioned tumor-related parameters could not be assessed on pre-diagnostic CTs because those CTs had visually occult pre-invasive PDA.
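A permutation test on mean log loss between two subgroups can be sketched as follows: pool the per-CT losses, repeatedly reshuffle group membership, and count how often the shuffled mean difference is at least as extreme as the observed one (the per-CT losses below are synthetic stand-ins for two T-stage subgroups):

```python
import numpy as np

def permutation_test(losses_a, losses_b, n_perm=5000, seed=0):
    """Permutation test for a difference in mean log loss between two
    subgroups; the permutation distribution serves as the null."""
    rng = np.random.default_rng(seed)
    observed = abs(losses_a.mean() - losses_b.mean())
    pooled = np.concatenate([losses_a, losses_b])
    n_a = len(losses_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled[:n_a].mean() - pooled[n_a:].mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)    # add-one smoothing avoids p = 0

rng = np.random.default_rng(4)
# synthetic per-CT log losses for two subgroups with similar performance
losses_t2 = -np.log(rng.uniform(0.6, 0.99, 60))
losses_t3 = -np.log(rng.uniform(0.6, 0.99, 30))
print(round(permutation_test(losses_t2, losses_t3, n_perm=500), 2))
```

A large p-value here indicates the per-subgroup losses are exchangeable, i.e., no evidence that the model performs systematically worse on one subgroup.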

RESULTS

Intramural patient cohort

Training-validation subset:

A total of 1776 CTs (696 diagnostic CTs and 1080 control CTs) were used for training-validation: 1418 CTs for training and 358 CTs for validation. The mean age of control subjects (56 years) was lower than that of patients with PDA (65 years) (p<0·05). The mean (SD) tumor size was 4.9 (1.7) cm, with the majority of PDA lesions being T4 (n=355, 51%), followed by T3 (n=186, 27%) and T2 (n=155, 22%). As previously mentioned, the T1 PDA lesions were exclusively allocated to the test subset so that the model could be evaluated on the smallest tumors.

In the validation subset, the model correctly classified 186/191 (97%) diagnostic CTs and 161/167 (96%) control CTs. Overall, the model correctly classified 347 CTs and misclassified 11 CTs, yielding an accuracy (CI) of 0·96 (0·95–0·98), AUROC of 0·99 (0·98–0·99), precision of 0·97 (0·93–0·98), recall of 0·97 (0·92–0·98), and F1 score of 0·97 (0·94–0·99). An optimal cutoff of 0·45, obtained on the validation subset using Youden’s J statistic, was subsequently used to differentiate between diagnostic CTs with PDA and controls in the test subset.38

Test subset:

There was no statistical difference in the prevalence of non-malignant pancreatic findings, either overall or individually, between the training-validation (n=156, 14.4%) and test (n=121, 14.6%) subset control CTs [χ²(1)=0.00, p=0.98; and χ²(4)=6.15, p=0.19, respectively] (Table 1).

The mean (SD) tumor size in the test subset was 3·4 (1·3) cm. The majority of lesions were T2 (n=291, 71%), followed by T3 (n=60, 15%), T1 (n=35, 9%), and T4 (n=23, 6%). The model correctly classified 360 of the 409 (88%) diagnostic CTs and 783 of the 829 (94%) control CTs, yielding a sensitivity and specificity of 0·88 (0·85–0·91) and 0·95 (0·93–0·96), respectively. Overall, 1143 of the 1238 CTs were correctly classified, yielding an accuracy (CI) of 0·92 (0·91–0·94), AUROC of 0·97 (0·96–0·98), precision of 0·89 (0·86–0·91), recall of 0·88 (0·85–0·91), and F1 score of 0·88 (0·85–0·91) (supplementary figure 1). None of the non-cancerous pancreatic findings in the prospectively collected cohort (n=297) resulted in a false positive prediction. Nevertheless, these patients were placed on a 3-year follow-up to evaluate for any potential changes in outcome over time.

Stage-based stratification (Table 2) demonstrated a sensitivity of 0·80 (0·66–0·91) for T1 tumors. There was no significant difference in the model’s performance between different T stages or tumor epicenters within the pancreas (Table 3). In addition, there were 87 isodense tumors (21%) in the test subset (15 stage T1, 64 stage T2, 7 stage T3, and 1 stage T4). The majority of these isodense tumors were also correctly classified (71 of 87; 82%). There were more CTs with pancreatic or biliary ductal dilatation and/or cut off, and pancreatic atrophy, in the correctly classified versus misclassified diagnostic CT groups (p<0·05). There was no significant difference in mean pancreatic volume between the correctly classified and misclassified control CTs (97·93 versus 101·70 cc, p=0·56). Finally, there were no differences in age, sex, or CT acquisition parameters (vendor and slice thickness) between correctly classified and misclassified CTs (all p values > 0·05).

Table 2:

T stage-based stratification of the model’s performance on intramural and external test subsets

              Intramural test subset (409 CTs with PDA & 829 control CTs)      Public datasets (194 CTs with PDA & 80 control CTs)
Tumor stage   Total CTs     Misclassified    Sensitivity (95% CI)#             Total CTs     Misclassified    Sensitivity (95% CI)#
              with PDA      CTs with PDA                                       with PDA      CTs with PDA
              (n=409)       (n=49)                                             (n=194)       (n=24)
1             35 (9%)       7 (14%)          0.80 (0.66–0.91)                  25 (13%)      6 (25%)          0.76 (0.60–0.92)
2             291 (71%)     39 (80%)         0.87 (0.82–0.90)                  113 (58%)     15 (63%)         0.87 (0.81–0.93)
3             60 (15%)      3 (6%)           0.95 (0.88–1.00)                  30 (15%)      2 (8%)           0.93 (0.83–1.00)
4             23 (6%)       0                1.00 (1.0–1.0)*                   26 (13%)      1 (4%)           0.96 (0.88–1.00)

# All p values < 0.05

* Insufficient statistical power due to small sample size.

Table 3.

Comparison between correctly classified versus misclassified CTs in the intramural test subset

                                   CTs with PDA (n=409)                                          Control CTs (n=829)
                                   Correctly          Misclassified    P-value                   Correctly          Misclassified    P-value
                                   classified (n=360) (n=49)                                     classified (n=783) (n=46)
Males:Females                      1.2:1              1:1              0.83                      0.8:1              1.2:1            0.31
Mean age (SD) (years)              66.7 (9.9)         65.4 (10.4)      0.38                      55.3 (15.8)        63.9 (14.1)      <0.001
Slice thickness (mm)                                                   0.86                                                          0.47
  0–1.25                           151 (42%)          23 (47%)                                   277 (35%)          19 (41%)
  1.25–3.0                         130 (36%)          17 (35%)                                   488 (62%)          25 (54%)
  3.0–5.0                          79 (22%)           10 (18%)                                   18 (2%)            2 (4%)
CT vendor                                                              0.75                                                          0.06
  Siemens                          210 (58%)          30 (62%)                                   694 (88%)          37 (80%)
  GE                               105 (29%)          15 (30%)                                   67 (9%)            9 (20%)
  Toshiba                          39 (11%)           3 (6%)                                     21 (3%)            0 (0%)
  Philips                          6 (2%)             1 (2%)                                     1 (0.1%)           0 (0%)
Tumor location                                                         0.14
  Head                             259 (72%)          33 (68%)
  Body                             43 (12%)           3 (6%)
  Tail                             54 (15%)           13 (26%)
  Predominantly extrapancreatic    4 (1%)             0 (0%)
Ancillary features
  MPD dilation and/or cut off      281 (78%)          22 (45%)         <0.01
  CBD dilation and/or cut off      178 (49%)          18 (37%)         0.13
  Pancreatic atrophy               231 (64%)          22 (45%)         0.01
Tumor size (cm)                                                        0.001
  Mean (SD)                        3.5 (1.3)          2.8 (0.8)
  Median                           3.3                2.8
Tumor stage                                                            0.04
  1                                28 (8%)            7 (14%)
  2                                252 (70%)          39 (80%)
  3                                57 (16%)           3 (6%)
  4                                23 (6%)            0 (0%)
Isodense tumors                    71 (20%)           16 (33%)         0.06

Although the mean age of the control subjects (56 years) was lower than the mean age of patients with PDA (67 years) (p<0·05), logistic regression analysis showed no difference in the model’s performance with and without age information [difference in AUROC=0·00, p=1·00]. The model’s performance was equivalent between males and females [AUROC (CI) males: 0·97 (0·96–0·98); females: 0·97 (0·95–0·98), p=0·8]. Likewise, the model had consistently high performance across different CT slice thicknesses (range: 0·6–5 mm) and CT vendors. The majority of the CTs in the test subset had been performed at outside hospitals (n=798; 64%). There was no significant difference (p>0.1) in the model’s performance on CTs from our hospital versus those from other hospitals [AUROC (CI): 0.97 (0.94–0.98) and 0.96 (0.95–0.97), respectively]. Finally, there was no statistically significant difference in the model’s performance when the CTs were stratified into time intervals (supplemental table 1) to assess the potential impact of changes in CT technology over time.

Multi-institutional public CT datasets

TCIA CPTAC dataset:

All 42 diagnostic CTs (including three CTs with isodense tumors) were correctly classified (accuracy=1.0) despite diversity in the T-stage distribution [T1 (n=0), T2 (n=20, 48%), T3 (n=13, 31%), and T4 (n=9, 21%)], tumor location [pancreatic head (n=31, 74%), body (n=7, 17%), tail (n=4, 10%)], vendor profile [GE (n=27, 60%), Siemens (n=9, 21%), Toshiba (n=2, 4%), Philips (n=1, 2%), unknown (n=3, 7%)], and CT slice thickness [1.5–3 mm (n=31, 74%), <1.5 mm (n=8, 19%), and >3–5 mm (n=3, 7%)]. The mean (SD) patient age was 65 (10) years, the mean tumor size was 4.2 (1.2) cm, and there were equal numbers of males and females. CBD or MPD dilation and/or cut off was present in 83% (n=35) of CTs and pancreatic atrophy in 60% (n=25) of CTs.

MSD dataset:

Of the 152 diagnostic CTs, the majority (128; 84%) were correctly classified. The T-stage distribution was again diverse [T1 (n=25, 16%), T2 (n=93, 62%), T3 (n=17, 11%), and T4 (n=17, 11%)]. Most of the misclassified CTs belonged to stage T2 (15 out of 24, 63%). The dataset also included 23 isodense tumors, around half of them T2 (n=12, 52%). Nineteen (76%) of the 25 T1 tumors and 20 (87%) of the 23 isodense tumors were correctly classified. Performance could not be stratified by CT slice thickness and vendor due to unavailability of the metadata.

NIH-PCT dataset with normal pancreas:

The mean (SD) patient age was 46.8 (16.7) years, and there were twice as many males as females. Sixty-six of the 80 control CTs were correctly classified, yielding a specificity of 0.83.

Overall, 236 out of 274 CTs (170/194 diagnostic CTs and 66/80 control CTs) from the external datasets were correctly classified, yielding an accuracy of 0·86 (0·82–0·90), AUROC of 0·90 (0·86–0·95), sensitivity of 0·88 (0·83–0·92), specificity of 0·83 (0·74–0·90), precision of 0·92 (0·89–0·96), recall of 0·88 (0·83–0·92), and F1 score of 0·90 (0·85–0·93) (figure 3). The model correctly classified 19 out of 25 stage T1 tumors (76%), 98 out of 113 stage T2 tumors (87%), 28 out of 30 stage T3 tumors (93%), and 25 out of 26 stage T4 tumors (96%) (Table 2). Additionally, 23 of 26 (88%) isodense tumors were also correctly classified. There were no differences in the tumor-related findings (such as tumor size, ductal dilatation, etc.) between the correctly classified versus misclassified CTs (p>0·05) (supplementary table 2).

Figure 3.

Comparison of model’s performance on intramural test subset (red) and multi-institutional public datasets (blue). Dark solid lines (red or blue) are the observed performance with corresponding 95% confidence intervals depicted as shaded region. The solid gray line is the performance for random chance. The mean AUROC is 0·97 (0·96–0·98) on intramural test subset and 0·90 (0·85–0·94) on the multi-institutional public datasets (p <0·05).

Simulated population with high pre-test probability of PDA

On bootstrapped simulated intramural test subsets, the accuracy of the model was 0·95 (0·94–0·95), AUROC 0·97 (0·94–0·99), precision 0·71 (0·60–0·78), recall 0·92 (0·86–0·95), and F1 0·77 (0·65–0·83) (figure 4), which was comparable to the model’s performance on the intramural test subset as well as on the public datasets.
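The prevalence-matched bootstrap described above can be sketched as follows, assuming arrays `scores` (model probabilities) and `labels` (1 = PDA) from a held-out test subset are available; the function name, parameters, and AUROC-via-rank computation are illustrative assumptions, not the study's code:

```python
import numpy as np

def bootstrap_low_prevalence(scores, labels, prevalence=0.03,
                             n_controls=1000, n_boot=500, seed=0):
    """Mean AUROC over bootstrap replicates resampled to a target
    case prevalence (e.g., 1-5%, mimicking a high-risk screening cohort)."""
    rng = np.random.default_rng(seed)
    case_s = scores[labels == 1]
    ctrl_s = scores[labels == 0]
    # number of cases implied by the target prevalence
    n_cases = max(1, int(round(prevalence * n_controls / (1 - prevalence))))
    aucs = []
    for _ in range(n_boot):
        cs = rng.choice(case_s, n_cases, replace=True)
        ns = rng.choice(ctrl_s, n_controls, replace=True)
        # AUROC via the Mann-Whitney U statistic on pooled ranks
        ranks = np.argsort(np.argsort(np.concatenate([cs, ns]))) + 1
        u = ranks[:n_cases].sum() - n_cases * (n_cases + 1) / 2
        aucs.append(u / (n_cases * n_controls))
    return float(np.mean(aucs))
```

Averaging per-replicate ROC curves (as in figure 4A) follows the same resampling scheme; only the summary statistic differs.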

Figure 4:

Model’s performance graph on simulated population with high pre-test probability of PDA and on the pre-diagnostic cohort. The faint grey curves depict the ROCs of multiple bootstrapped test subsets simulating PDA risk of 1–5% (A) and pre-diagnostic cohort (B). The red curve corresponds to mean AUROC (CI) of 0·97 (0·94–0·99) in A, and 0·91 (0·86–0·94) in B. The dashed black line is the performance for random chance.

Pre-diagnostic CT cohort

Overall, 195 out of 233 CTs (75/100 pre-diagnostic CTs and 120/133 control CTs) were correctly classified, yielding an accuracy of 0·84 (0·79–0·88), AUROC of 0·91 (0·86–0·94), sensitivity of 0·75 (0·67–0·84), specificity of 0·90 (0·85–0·95), precision of 0·85 (0·77–0·92), recall of 0·75 (0·67–0·84), and F1 score of 0·80 (0·73–0·86) (figure 4). The median (range) time interval between pre-diagnostic CTs and histopathological diagnosis of PDA was 475 (93–1082) days. The mean (SD) pancreatic volume in control CTs [73.4 (23.3) cc] was lower than that in pre-diagnostic CTs [89.9 (37.9) cc] (p <0.05). However, logistic regression analysis showed no difference in the model’s performance with and without the volume information [difference in AUROC=0·00, p=1·00]. Of the 233 CTs in this cohort, the majority (178 CTs, 76%) were from other hospitals and 55 CTs (24%) were from our hospital. Yet, there was no significant difference (p>0.1) in the model’s performance between the two groups [AUROC (CI): 0.78 (0.67–0.88) versus 0.84 (0.78–0.89), respectively]. The mean (SD) time interval between pre-diagnostic CTs and histopathological diagnosis was comparable between correctly classified and misclassified pre-diagnostic CTs [440 (263) days versus 518 (331) days, respectively; p = 0.23]. There was a marginal difference in the distribution of binned slice thickness (<1.25, 1.25–3.00, and 3.00–5.00 mm) between correctly classified and misclassified control CTs (p=0.046), but logistic regression analysis showed no difference with or without the binned slice thickness information (difference in AUROC=0.00, p=1.00). Finally, there were no differences in age, sex, CT vendor, or pancreatic morphology between misclassified and correctly classified CTs (all p values > 0·05) (supplementary table 3).

Evaluation of Interpretability

In the intramural test subset, algorithmic analyses of the heat maps demonstrated that the area of activation overlapped with the tumor location in most CTs (350/360 CTs; 97%). Likewise, the area of activation overlapped with the tumor location in most CTs of the public datasets (180/194 CTs; 92.8%), which was verified through re-review by radiologist investigators. However, inspection of misclassified control CTs did not reveal systematic variations in normal anatomy or other patterns that could explain heat-map activation in the absence of a tumor.
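The algorithmic overlap check can be illustrated with a minimal sketch: threshold the heat map at a high quantile and test whether the resulting activation region intersects the tumor mask. Here `heatmap` and `tumor_mask` are assumed same-shape arrays, and the quantile threshold is an illustrative choice; the study's exact criteria may differ:

```python
import numpy as np

def activation_overlaps_tumor(heatmap, tumor_mask, q=0.99):
    """True if the hottest (1-q) fraction of heat-map voxels
    intersects the tumor segmentation mask."""
    thresh = np.quantile(heatmap, q)
    active = heatmap >= thresh
    return bool(np.logical_and(active, tumor_mask.astype(bool)).any())
```

Counting CTs for which this returns True across a test set yields overlap fractions of the kind reported above (e.g., 350/360).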

DISCUSSION

An automated 3D-CNN developed on a large and diverse dataset shows high accuracy for detection of PDA on standard-of-care diagnostic CTs (0.97 AUROC) as well as for visually occult pre-invasive PDA on pre-diagnostic CTs (0.9 AUROC), the latter at a substantial median lead time before clinical diagnosis [475 (93–1082) days]. Stage-based analyses showed high sensitivity of the model across all tumor stages. Importantly, we chose to train the model on larger tumors (mean tumor size in training dataset: 4.9 cm) and reserved all T1 tumors exclusively for the test subset so we could evaluate the model on the smallest tumors. Yet, the model had high sensitivity for the challenging stage T1 and isodense tumors. This is reassuring because small and isodense tumors are more commonly missed on CTs.3, 11 The model also had high specificity, as evident from the correct classification of the majority (94%) of control CTs, including those with potentially confounding non-malignant pancreatic findings. High specificity is necessary to minimize false positives in a screening paradigm for a relatively low-incidence tumor such as PDA.

The CT dataset used in this study is the largest reported in the literature and spans a gamut of variations in patient demographics and image acquisition parameters. In fact, the majority of CTs (n=798; 64%) in the test subset were from other hospitals, having been sent to us for second opinions or as part of referral evaluations. This is why we chose the term ‘intramural’ rather than ‘internal’ to describe the dataset. When coupled with the public datasets comprising 274 patients, the total number of external CTs used for model testing exceeded 1,000. Such a large and diverse dataset was likely one of the key reasons for the model’s generalizable performance across patient age, sex, CT vendors, and slice thicknesses. The model’s performance also generalized to multi-institutional public datasets, including for the stage T1 and isodense tumors in those datasets.

Pre-invasive PDA tends to be occult on CT at visual inspection.3, 39 Yet, detection at this stage is the very goal of screening high-risk cohorts. Therefore, we tested the model on a cohort of pre-diagnostic CTs without a focal mass. These pre-diagnostic CTs had been interpreted as negative for PDA during routine clinical interpretation and were subsequently confirmed as such through re-review by radiologist investigators. Even though the model had been trained exclusively on diagnostic CTs with large tumors, it had high performance (0.9 AUROC) for detection of visually occult pre-invasive PDA on these pre-diagnostic CTs. Of note, the majority (77%) of the CTs in this cohort were from other hospitals. Recently, radiomics-based machine learning (ML) studies have shown that heterogeneity in pancreatic texture, which is often beyond the scope of human perception, precedes the development of a focal mass by many months.15, 16 Deep neural networks are inherently suited to detecting and learning such subtle differences in the textural patterns of input images, which likely explains the model’s performance on the pre-diagnostic cohort. Scarcity of pre-diagnostic datasets is one of the critical barriers to the development of AI models for early PDA detection. Our findings suggest that deep neural networks can learn to detect the imaging signature of even visually occult PDA if they are trained on sufficiently large and diverse diagnostic CT datasets.

The intrinsic black-box character of AI is another critical barrier to its adoption in healthcare. We attempted post-hoc explainability using heat maps (or saliency maps), which highlight the proportional contribution of each region of an image to a given decision. In our test subsets, the area of activation on the heat maps overlapped with the tumor in the majority of CTs in both the intramural (97%) and the public datasets (95%). Additionally, CTs with ductal obstruction and pancreatic atrophy upstream of the tumor were more frequently classified correctly by the model. This suggests that indirect signs of PDA, which radiologists often seek for the detection of subtle PDA, were also factored into the model’s decision-making. While this is encouraging, methods such as heat maps provide only partial insight into the model’s inner workings, and exclusive reliance upon them is vigorously debated.22, 23, 40 Therefore, validation across diverse and distinct populations, as was done in the current study, is the recommended alternative for evaluating the reliability of, and minimizing biases in, AI models.22

The glycemically-defined NOD cohort is the only validated high-risk group for sporadic PDA.2, 33 However, early detection of PDA within these high-risk cohorts remains a pressing challenge: despite rigorous screening efforts, most high-risk individuals (HRIs) are still diagnosed with advanced-stage PDA. Thus, the identification of early-stage PDA in these HRIs necessitates a collaborative workflow between clinicians and AI systems. Our bootstrapping experiment showed promising potential for the model when tested on a simulated case-control distribution matching such a high-risk group. Moreover, when its performance on other independent cohorts is also considered, we posit that the AI model could expand the interpretive capabilities of human experts, pending prospective validation. The model could serve as a second reader, providing a peer-review mechanism to alert readers to subtle or small lesions and secondary signs of early PDA. Furthermore, based on its observed ability to predict subsequent PDA in pre-diagnostic pancreases, the model could augment the risk-stratification of HRIs, guiding the need either for more frequent screening evaluations or for further assessment with invasive modalities. Importantly, such a model has the potential for seamless integration into standard imaging protocols without requiring modifications to the imaging workflow. It is designed to be fully automated and can be executed at a central site on previously acquired imaging from multi-center screening evaluations. It is worth considering whether the model can differentiate between other malignant entities such as distal cholangiocarcinoma, ampullary cancer, or duodenal adenocarcinoma. While this question lies beyond the scope of our study, in asymptomatic HRIs, the model’s detection of any of these conditions would signify a diagnostic gain and a clinically actionable event warranting subsequent invasive characterization.

Others have recently shown that AI can distinguish CTs with PDA from controls.13, 14 However, those studies included CTs with biliary stents in their cohorts. Such devices are a known source of bias because the model learns to associate the device with the tumor,20 leading to overestimation of the model’s performance.21 Therefore, we a priori excluded CTs with stents from our study. Moreover, our control cohort included CTs with potentially confounding non-cancerous pancreatic findings to assess the model’s generalizability. Further, some studies have inadvertently used the entire MSD dataset (n=420), which has unfavorably impacted their results.13, 14 This is because only around 152 CTs from the MSD dataset have treatment-naïve PDA without biliary stents.20 Therefore, use of the entire MSD dataset has resulted in unpredictable inconsistencies.41, 42 To avoid such pitfalls, we only used these 152 CTs with PDA for the external validation component of our study. Finally, other described approaches have been contingent on radiologist-provided segmentations.14 Such an approach could be difficult to scale in a high-volume setting because it would burden the clinical workflow. Therefore, we chose an automated approach centered on AI-derived segmentations, which has been previously validated.18, 43, 44 Additionally, we leveraged augmentation techniques to simulate segmentation errors, and our model was purposefully trained with two input channels to compensate for potential errors in the AI-derived pancreas segmentations.
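The two-channel input with segmentation-error augmentation described above can be sketched roughly as follows. The mask-jitter perturbation (a random voxel shift standing in for segmentation error) and all names are illustrative assumptions, not the authors' pipeline:

```python
import numpy as np

def make_two_channel_input(ct, mask, rng=None, train=True):
    """Stack a CT volume and its AI-derived pancreas mask as two channels.
    During training, jitter the mask by a small random shift to mimic
    segmentation errors, so the classifier learns to tolerate them."""
    m = mask.astype(ct.dtype)
    if train:
        rng = rng or np.random.default_rng()
        # up to 2-voxel misregistration along each axis (illustrative)
        shift = tuple(int(s) for s in rng.integers(-2, 3, size=m.ndim))
        m = np.roll(m, shift, axis=tuple(range(m.ndim)))
    return np.stack([ct, m], axis=0)  # shape: (2, D, H, W)
```

At inference (`train=False`) the mask is passed through unchanged; the second channel simply tells the network where the pancreas is expected to be.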

We acknowledge the limitations of our study. First, the retrospective design is naturally prone to selection bias; however, this design allowed us to maximize the sample size to train, validate, and test the model. Second, our model presents results in a dichotomous manner, as either cancerous or control. This intentional categorization addresses the critical need for early-stage PDA detection in asymptomatic high-risk individuals (HRIs) with normal-appearing pancreases. Notably, in both the test subset and the prospective dataset, control CTs with potentially confounding non-cancerous pancreatic findings did not contribute to false positive results, which provides further reassurance regarding the generalizability of the model. Further, considering the enrichment of our dataset with PDA cases relative to a cohort of HRIs (e.g., NOD and high ENDPAC score), we performed bootstrapping to assess the model’s potential performance within these HRIs in the context of prospective screening. It is essential to emphasize that while the bootstrapping experiment provides valuable preliminary insights, the next phase of investigation necessitates prospective clinical trials, alongside the incorporation of epidemiological risk factors and emerging blood-based biomarkers, to further evaluate the impact of pre-test probability on the model’s performance.45 Such prospective evaluation is being considered through the EDI trial to gain insights into the effectiveness of the model in a clinical setting, which will enable us to make informed decisions regarding its further refinement.

In summary, the automated AI model shows high accuracy and generalizable performance for detection of PDA on standard-of-care diagnostic CTs as well as for detection of pre-invasive visually occult PDA on pre-diagnostic CTs at a substantial lead time prior to clinical diagnosis. Despite being trained on larger tumors, the model had a high sensitivity for stage T1 and isodense tumors as well as high specificity for control CTs. The model’s performance was consistent across variations in patient demographics and image acquisition parameters, and generalizable on multi-institutional public datasets. The model also showed promising potential in a bootstrapped population with a case-control distribution that matches high-risk groups such as glycemically-defined NOD. Further optimization and prospective evaluation in combination with emerging blood-based biomarkers is warranted to assess the potential for early detection of sporadic PDA in high-risk cohorts.

Supplementary Material

Supplementary Figure 1. Axial portal venous phase CT images showing correctly classified malignant CTs and their corresponding heat maps. A 3.3 cm hypodense PDA (stage T2) (white arrow) in the head (A). An isodense PDA (stage T1) (white arrow) causing main pancreatic duct obstruction and upstream pancreatic atrophy (C). Corresponding heat activation maps (B and D) show that the model’s determination of the CTs as a PDA-containing CT was based on accurate localization of the tumor.

Supplementary Figure 2. Axial portal venous phase CT images from correctly classified control CT despite potentially confounding pancreatic findings and their corresponding heat maps. A 3 cm cystic lesion (white arrows) in pancreatic head (A). Heat activation map (B) shows that the model paid attention to the cystic lesion and yet accurately classified it as a non-PDA control CT.

Supplementary Figure 3. Axial portal venous phase CT images from correctly classified control CT despite potentially confounding pancreatic findings and their corresponding heat maps. A few hypodense lesions (white arrows) in pancreas due to metastases from lung cancer (A). Heat activation map (B) shows that the model paid attention to the hypodense lesions and yet accurately classified it as a non-PDA control CT.

WHAT YOU NEED TO KNOW.

BACKGROUND AND CONTEXT

Since standard-of-care imaging is inadequate for early detection of pancreas cancer, augmentation using Artificial Intelligence is needed for cancer detection at a stage when surgical cure is possible.

NEW FINDINGS

An automated artificial intelligence model trained on the largest and most diverse dataset to date detected cancer with high accuracy and generalizability on diagnostic CTs, as well as visually occult pre-invasive cancer on pre-diagnostic CTs at a substantial lead time prior to clinical diagnosis.

LIMITATIONS

Prospective multicenter validation in combination with blood-based biomarkers is needed to assess performance for screening of high-risk subjects for sporadic pancreas cancer.

BASIC RESEARCH RELEVANCE

Advancements in imaging artificial intelligence underscore the need for expedited discovery and validation of complementary blood-based biomarkers to enhance identification of high-risk populations suitable for pancreatic cancer screening.

CLINICAL RESEARCH RELEVANCE

An artificial intelligence model could mitigate the inadequacies of imaging and the diagnostic errors in interpretation that often contribute to delayed diagnosis of pancreas cancer. In combination with emerging blood-based biomarkers, such a model could be evaluated for screening for sporadic cancer in ongoing trials of high-risk cohorts such as the Early Detection Initiative (NCT04662879).

Grant support:

Dr. Goenka acknowledges grants from non-profit entities such as the Champions for Hope Pancreatic Cancer Research Program of the Funk Zitiello Foundation, the Centene Charitable Foundation, and the Advance the Practice Award from the Department of Radiology, Mayo Clinic, Rochester, Minnesota.

Unrelated to this work (Dr. Goenka): CA190188, Department of Defense (DoD), Office of the Congressionally Directed Medical Research Programs (CDMRP); R01CA256969, National Cancer Institute (NCI) of the National Institutes of Health (NIH); R01CA272628-01, National Cancer Institute (NCI) of the National Institutes of Health (NIH); Institutional research grant from Sofie Biosciences and Clovis Oncology; Advisory Board (ad hoc), BlueStar Genomics; Consultant, Bayer Healthcare, LLC; Consultant, Candel Therapeutics; Consultant, UWorld

Abbreviations:

AI

artificial intelligence

CNN

convolutional neural network

MSD

Medical Segmentation Decathlon

NIH

National Institutes of Health

PDA

pancreatic ductal adenocarcinoma

TCIA-CPTAC

the Cancer Imaging Archive Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Conflict of interest:

None

Data transparency statement:

Access to datasets from the Mayo Clinic Foundation should be requested directly via their data access request forms, which would include a detailed proposal of scope of work. Subject to the institutional review boards’ ethical approval and execution of inter-institutional data use agreement, deidentified data could be made available. The multi-institutional external datasets used in this study are available in open access repositories, which can be accessed through these links: https://wiki.cancerimagingarchive.net/display/Public/CPTAC-PDA.

https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT

http://medicaldecathlon.com/

REFERENCES

1. Siegel RL, Miller KD, Fuchs HE, et al. Cancer statistics, 2022. CA Cancer J Clin 2022;72:7–33.
2. Chari ST, Kelly K, Hollingsworth MA, et al. Early detection of sporadic pancreatic cancer: summative review. Pancreas 2015;44:693–712.
3. Singh DP, Sheedy S, Goenka AH, et al. Computerized tomography scan in pre-diagnostic pancreatic ductal adenocarcinoma: Stages of progression and potential benefits of early intervention: A retrospective study. Pancreatology 2020;20:1495–1501.
4. Canto MI, Almario JA, Schulick RD, et al. Risk of Neoplastic Progression in Individuals at High Risk for Pancreatic Cancer Undergoing Long-term Surveillance. Gastroenterology 2018;155:740–751.e2.
5. Klatte DCF, Boekestijn B, Onnekink AM, et al. Surveillance for Pancreatic Cancer in High-Risk Individuals Leads to Improved Outcomes: A Propensity Score-Matched Analysis. Gastroenterology 2023;164:1223–1231.e4.
6. Schwartz NRM, Matrisian LM, Shrader EE, et al. Potential Cost-Effectiveness of Risk-Based Pancreatic Cancer Screening in Patients With New-Onset Diabetes. Journal of the National Comprehensive Cancer Network 2022;20:451–459.
7. Kurita Y, Kuwahara T, Hara K, et al. Diagnostic ability of artificial intelligence using deep learning analysis of cyst fluid in differentiating malignant from benign pancreatic cystic lesions. Sci Rep 2019;9:6893.
8. Chhoda A, Vodusek Z, Wattamwar K, et al. Late-Stage Pancreatic Cancer Detected During High-Risk Individual Surveillance: A Systematic Review and Meta-Analysis. Gastroenterology 2022;162:786–798.
9. Overbeek KA, Goggins MG, Dbouk M, et al. Timeline of Development of Pancreatic Cancer and Implications for Successful Early Detection in High-Risk Individuals. Gastroenterology 2022;162:772–785.e4.
10. Kang J, Clarke SE, Abdolell M, et al. The implications of missed or misinterpreted cases of pancreatic ductal adenocarcinoma on imaging: a multi-centered population-based study. Eur Radiol 2021;31:212–221.
11. Kang JD, Clarke SE, Costa AF. Factors associated with missed and misinterpreted cases of pancreatic ductal adenocarcinoma. Eur Radiol 2021;31:2422–2432.
12. Dewitt J, Devereaux BM, Lehman GA, et al. Comparison of endoscopic ultrasound and computed tomography for the preoperative evaluation of pancreatic cancer: a systematic review. Clin Gastroenterol Hepatol 2006;4:717–25; quiz 664.
13. Chen PT, Wu T, Wang P, et al. Pancreatic Cancer Detection on CT Scans with Deep Learning: A Nationwide Population-based Study. Radiology 2022:220152.
14. Liu KL, Wu T, Chen PT, et al. Deep learning to distinguish pancreatic cancer tissue from non-cancerous pancreatic tissue: a retrospective study with cross-racial external validation. Lancet Digit Health 2020;2:e303–e313.
15. Mukherjee S, Patra A, Khasawneh H, et al. Radiomics-based Machine-learning Models Can Detect Pancreatic Cancer on Prediagnostic Computed Tomography Scans at a Substantial Lead Time Before Clinical Diagnosis. Gastroenterology 2022;163:1435–1446.e3.
16. Qureshi TA, Gaddam S, Wachsman AM, et al. Predicting pancreatic ductal adenocarcinoma using artificial intelligence analysis of pre-diagnostic computed tomography images. Cancer Biomarkers 2022;33:211–217.
17. Alves N, Schuurmans M, Litjens G, et al. Fully Automatic Deep Learning Framework for Pancreatic Ductal Adenocarcinoma Detection on Computed Tomography. Cancers (Basel) 2022;14.
18. Panda A, Korfiatis P, Suman G, et al. Two-stage deep learning model for fully automated pancreas segmentation on computed tomography: Comparison with intra-reader and inter-reader reliability at full and reduced radiation dose on an external dataset. Med Phys 2021;48:2468–2481.
19. Suman G, Panda A, Korfiatis P, et al. Development of a volumetric pancreas segmentation CT dataset for AI applications through trained technologists: a study during the COVID 19 containment phase. Abdom Radiol (NY) 2020;45:4302–4310.
20. Suman G, Patra A, Korfiatis P, et al. Quality gaps in public pancreas imaging datasets: Implications & challenges for AI applications. Pancreatology 2021;21:1001–1008.
21. Suman G, Patra A, Mukherjee S, et al. Radiomics for Detection of Pancreas Adenocarcinoma on CT Scans: Impact of Biliary Stents. Radiol Imaging Cancer 2022;4:e210081.
22. Ghassemi M, Oakden-Rayner L, Beam AL. The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health 2021;3:e745–e750.
23. Quinn TP, Jacobs S, Senadeera M, et al. The three ghosts of medical AI: Can the black-box present deliver? Artificial Intelligence in Medicine 2022;124:102158.
24. Willemink MJ, Koszek WA, Hardell C, et al. Preparing Medical Imaging Data for Machine Learning. Radiology 2020;295:4–15.
25. Chun YS, Pawlik TM, Vauthey JN. 8th Edition of the AJCC Cancer Staging Manual: Pancreas and Hepatobiliary Cancers. Ann Surg Oncol 2018;25:845–847.
26. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
27. Hu J, Shen L, Albanie S, et al. Squeeze-and-Excitation Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020;42:2011–2023.
28. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge, MA: MIT Press, 2016:224–270.
29. Xu B, Wang N, Chen T, et al. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
30. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium. Radiology Data from the Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma [CPTAC-PDA] Collection [Data set]. The Cancer Imaging Archive, 2018.
31. Simpson AL, Antonelli M, Bakas S, et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063, 2019.
32. Roth HR, Farag A, Turkbey EB, et al. Data from Pancreas-CT. The Cancer Imaging Archive, 2016.
33. Sharma A, Kandlakunta H, Nagpal SJS, et al. Model to Determine Risk of Pancreatic Cancer in Patients With New-Onset Diabetes. Gastroenterology 2018;155:730–739.e3.
34. Huang SC, Kothari T, Banerjee I, et al. PENet-a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. NPJ Digit Med 2020;3:61.
35. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011;12:77.
36. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 1995;57:289–300.
37. Thissen D, Steinberg L, Kuang D. Quick and Easy Implementation of the Benjamini-Hochberg Procedure for Controlling the False Positive Rate in Multiple Comparisons. Journal of Educational and Behavioral Statistics 2002;27:77–83.
38. Youden WJ. Index for rating diagnostic tests. Cancer 1950;3:32–35.
39. Toshima F, Watanabe R, Inoue D, et al. CT Abnormalities of the Pancreas Associated With the Subsequent Diagnosis of Clinical Stage I Pancreatic Ductal Adenocarcinoma More Than 1 Year Later: A Case-Control Study. American Journal of Roentgenology 2021;217:1353–1364.
40. Reddy S. Explainability and artificial intelligence in medicine. The Lancet Digital Health 2022;4:e214–e215.
41. Suman G, Panda A, Korfiatis P, et al. Convolutional neural network for the detection of pancreatic cancer on CT scans. Lancet Digit Health 2020;2:e453.
42. Liao WC, Simpson AL, Wang W. Convolutional neural network for the detection of pancreatic cancer on CT scans - Authors’ reply. Lancet Digit Health 2020;2:e454.
43. Khasawneh H, Patra A, Rajamohan N, et al. Volumetric Pancreas Segmentation on Computed Tomography: Accuracy and Efficiency of a Convolutional Neural Network Versus Manual Segmentation in 3D Slicer in the Context of Interreader Variability of Expert Radiologists. J Comput Assist Tomogr 2022;46:841–847.
44. Mukherjee S, Korfiatis P, Khasawneh H, et al. Bounding box-based 3D AI model for user-guided volumetric segmentation of pancreatic ductal adenocarcinoma on standard-of-care CTs. Pancreatology 2023.
45. Mazer BL, Lee JW, Roberts NJ, et al. Screening for pancreatic cancer has the potential to save lives, but is it practical? Expert Rev Gastroenterol Hepatol 2023.
