Frontiers in Artificial Intelligence. 2023 Jan 20;5:1059093. doi: 10.3389/frai.2022.1059093

Statistical biopsy: An emerging screening approach for early detection of cancers

Gregory R Hart 1, Vanessa Yan 2, Bradley J Nartowt 3, David A Roffman 4, Gigi Stark 2, Wazir Muhammad 5, Jun Deng 2,*
PMCID: PMC9895959  PMID: 36744110

Abstract

Despite large investment, cancer continues to be a major source of mortality and morbidity throughout the world. Traditional methods of detection and diagnosis, such as biopsy and imaging, tend to be expensive and carry risks of complications. As data becomes more abundant and machine learning continues to advance, it is natural to ask how they can help solve some of these problems. In this paper we show that, using a person's personal health data, it is possible to predict their risk for a wide variety of cancers. We dub this process a “statistical biopsy.” Specifically, we train two neural networks, one predicting risk for 16 different cancer types in females and the other predicting risk for 15 different cancer types in males. The networks were trained as binary classifiers identifying individuals who were diagnosed with the different cancer types within 5 years of joining the PLCO trial. However, rather than use the binary output of the classifiers, we show that the continuous output can instead be used as a cancer risk, allowing a holistic look at an individual's cancer risks. We tested our multi-cancer model on the UK Biobank dataset, showing that for most cancers the predictions generalized well and that looking at multiple cancer risks at once from personal health data is a possibility. While the statistical biopsy will not be able to replace traditional biopsies for diagnosing cancers, we hope there can be a paradigm shift in how statistical models are used in cancer detection, moving to something more powerful and more personalized than general population screening guidelines.

Keywords: cancer screening, machine learning and AI, neural network, biopsy, data mining, cancer detection, individualized medicine

Introduction

Cancer is a global public health burden with an estimated 21.7 million new cases and 13 million cancer deaths annually by 2030 (Ferlay et al., 2019). Despite a huge amount of money and resources spent on cancer screening, diagnosis, and treatment, it is estimated that 609,360 people in the United States will die from cancer in 2022 alone (Siegel et al., 2022). One important factor contributing to the high mortality is the lack of an efficient tool for cancer screening, missing the most effective window of opportunity for detecting cancers at their earliest stages. Another factor is the lack of individualized risk management for tailored cancer prevention. Hence, it is critical to develop safe and cost-effective approaches for cancer screening prior to disease onset with high sensitivity, specificity, and accessibility.

Tissue biopsy has long been used to diagnose cancer and often considered the gold standard, but it is limited by constraints on sampling frequency and incomplete representation of the organ being biopsied (Bravo et al., 2001). In addition, the surgical procedure is invasive, time-intensive, and costly with pain and risk of complications. Liquid biopsy offers a non-invasive alternative to cancer screening, but detection and analysis of circulating tumor DNA in a body fluid specimen present a considerable challenge (Alix-Panabières and Pantel, 2013; Crowley et al., 2013). Another challenge for liquid biopsy is how to identify the tumor site in the body, even after an individual has tested positive (Su, 2019).

Numerous schemas have been developed to improve clinical decision-making in cancer screening, detection, and prevention (Kramer, 2004; Holle, 2017) (see Footnotes 1–3). While cancer screening usually involves a procedure or body fluid test to detect cancer at an early stage, cancer prevention aims to reduce cancer risk and mortality by avoiding carcinogens, modifying lifestyles, and using chemoprevention (Kramer, 2004; Holle, 2017). As of now, routine cancer screening is only recommended for breast, cervical, colorectal, lung, and prostate cancers (see Footnotes 1–3). Cancer prevention strategies are only available for breast cancer, colorectal cancer, human papillomavirus-related cancers (anal, cervical, penile, vaginal, and vulvar cancers), ovarian cancer, and prostate cancer, as recommended by the American Cancer Society (ACS), National Comprehensive Cancer Network (NCCN), and U.S. Preventive Services Task Force (USPSTF) (see Footnotes 1–3). While the benefits of those schemas may include reduced cancer incidence and cancer mortality, their common limitations include the requirement of clinical testing, suboptimal positive/negative predictive values, frequent involvement of invasive procedures, and overdiagnosis and overtreatment (Kramer, 2004; Holle, 2017). Ideally, it would be in the best interest of people to improve estimates of cancer risk prior to any clinical testing so that the cost and potential harms associated with invasive procedures would be limited (Cruz and Wishart, 2006; Ayer et al., 2010; Kourou et al., 2014; Boursi et al., 2017; Rajkomar et al., 2019).

Recently, we have demonstrated that deep neural networks, trained and validated with the National Health Interview Survey (NHIS) and/or the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial datasets, can be used to predict and stratify cancer risks with high discriminatory power based solely on personal health data (Hart et al., 2018, 2019, 2020; Roffman et al., 2018a,b; Muhammad et al., 2019; Nartowt et al., 2019a,b; Stark et al., 2019). Compared to the clinician's judgment, the strong performance of our models presents a novel opportunity to perform a “statistical biopsy” on individuals prior to disease onset (Hart et al., 2020). As shown in Figure 1, statistical biopsy mines personal health data from individuals for early cancer detection, analogous to tissue biopsy evaluating cells from a tissue specimen and liquid biopsy evaluating circulating tumor DNA from a fluid sample. What is different is that statistical biopsy seeks to decipher the invisible correlations and inter-connectivity between multiple medical conditions and health parameters via sophisticated statistical modeling. With statistical biopsy, it is possible to generate a holistic analysis of an individual's risk for a variety of cancers simultaneously. Furthermore, if integrated into a modern electronic medical record (EMR) system, it offers a cost-effective and safe approach to cancer screening in real time, informing preventive interventions and screening decisions.

Figure 1.


Tissue biopsy, liquid biopsy, and statistical biopsy. (A) Tissue biopsy is used to characterize tissues and diagnose cancer by evaluating the cells from a tissue specimen. However, it is an invasive, time-intensive, and costly procedure, which inflicts pain and risk on patients. (B) Liquid biopsy has been recently developed to evaluate circulating tumor DNA from a body fluid sample to screen for cancer. It offers a noninvasive alternative to cancer screening, but detection and analysis of circulating tumor DNA in a body fluid specimen remains a challenging task for medical researchers and practitioners. (C) Statistical biopsy is a new approach proposed by our group that mines personal health data for early cancer detection with sophisticated statistical modeling. The basic idea is that a trove of personal health data can be used to train and validate deep learning models to generate a holistic profile of one's risks for a variety of cancers simultaneously prior to disease onset. Panel (A) adapted from the PreOp website (https://preop.com/wp-content/uploads/2021/08/333_surgery.jpg).

In order to personalize early cancer detection and prevention, an accurate risk assessment of a variety of cancers for each individual is needed. Hence, we begin the development of a novel cancer risk profiler based on deep learning of personal health data for better risk stratification and more precise screening. We hypothesize that the trove of personal health data, including clinical and demographic data, family history, socio-behavioral, dietary and lifestyle data, can be used to train and validate a deep learning model capable of screening cancer prior to disease onset, with high sensitivity and specificity and with minimal toxicity and maximal accessibility.

Materials and methods

Data sets

In this work we use two large medical datasets, one for training a neural network to predict the appearance of cancer within 5 years and the other for testing the neural network. The first is the Prostate, Lung, Colorectal, and Ovarian (PLCO) trial (Tammemagi et al., 2011) which is used for training. The testing set came from the UK Biobank database (UK Biobank, 2022).

PLCO was a randomized controlled trial investigating the effectiveness of screening methods for prostate, lung, colorectal, and ovarian cancers. PLCO enrolled 154,897 participants 55–75 years of age between November 1993 and July 2001 in the United States. Participants were followed for 13 years or until they developed cancer or passed away. We removed those who did not complete the baseline health survey, leaving 149,623 participants. PLCO recorded the appearance of 13 general cancers (biliary, bladder, colorectal, glioma, head and neck, hematopoietic, liver, lung, melanoma, pancreas, renal, thyroid, and upper GI cancers), 3 female-specific cancers (breast, endometrial, and ovarian), and 2 male-specific cancers (male breast and prostate). In addition to these cancers, we use 116 general features, 20 female-specific features, and 12 male-specific features. We split the data into a set for females to predict 16 cancer types and a set for males to predict 15 cancer types. See Table 1 for a list of features and their statistics and Table 2 for the number of cancer cases.

Table 1.

Feature distributions and missingness.

Feature Female Male
Train Test Train Test
Binary % Yes (% missing) % Yes (% missing)
Ever had arthritis 45.82 (0.00) 2.62 (0.00) 29.97 (0.67) 3.80 (0.00)
Ever had chronic bronchitis 5.94 (0.00) 1.10 (76.49) 3.64 (0.68) 1.17 (75.35)
Ever had colon co-morbidity 1.70 (0.00) 1.32 (0.00) 1.17 (1.02) 1.14 (0.00)
Ever had diabetes 6.42 (0.00) 1.03 (0.59) 9.07 (0.63) 3.83 (0.43)
Ever had diverticulitis or diverticulosis 8.32 (0.00) 7.51 (0.00) 5.38 (0.78) 7.01 (0.00)
Ever had emphysema 2.05 (0.00) 0.40 (76.49) 3.05 (0.65) 0.22 (75.35)
Ever had gall bladder stones or inflammation 15.90 (0.00) 2.94 (0.00) 6.99 (0.73) 5.54 (0.00)
Ever had coronary heart disease or a heart attack 4.84 (0.00) 11.62 (0.00) 13.46 (0.64) 4.10 (0.00)
Ever had high blood pressure 33.97 (0.00) 26.16 (0.00) 34.38 (0.59) 19.39 (0.00)
Ever had liver co-morbidity 3.37 (0.00) 0.65 (0.00) 4.09 (0.77) 0.37 (0.00)
Ever had osteoporosis 9.64 (0.00) 0.62 (0.00) 0.82 (0.75) 1.93 (0.00)
Ever had colorectal polyps 5.54 (0.00) 6.08 (0.00) 8.12 (0.75) 4.04 (0.00)
Ever had a stroke 2.14 (0.00) 0.72 (0.00) 2.75 (0.63) 0.37 (0.00)
Ever smoked regularly 44.34 (0.00) 65.30 (0.58) 63.52 (0.03) 55.20 (0.52)
Current smoker 9.71 (0.00) 12.56 (0.60) 11.71 (0.03) 8.96 (0.53)
Family history of biliary cancer 0.34 (0.00) – (100.00) 0.20 (4.50) – (100.00)
Family history of bladder cancer 2.18 (0.00) – (100.00) 1.51 (4.48) – (100.00)
Family history of breast cancer 14.56 (0.00) 12.55 (23.42) – (100.00) 12.97 (16.78)
Family history of colorectal cancer 11.33 (0.00) 14.14 (23.10) 9.29 (4.31) 12.57 (16.87)
Family history of endometrial cancer 2.89 (0.00) – (100.00) – (100.00) – (100.00)
Family history of glioma cancer 2.01 (0.00) – (100.00) 1.74 (4.46) – (100.00)
Family history of head and neck cancer 1.42 (0.00) – (100.00) 1.09 (4.48) – (100.00)
Family history of hematopoietic cancer 6.67 (0.00) – (100.00) 5.35 (4.40) – (100.00)
Family history of liver cancer 2.04 (0.00) – (100.00) 2.19 (4.44) – (100.00)
Family history of lung cancer 11.71 (0.00) 15.14 (22.51) 9.85 (4.28) 14.69 (16.34)
Family history of male breast cancer – (100.00) – (100.00) 21.01 (2.47) – (100.00)
Family history of melanoma cancer 1.40 (0.00) – (100.00) 0.80 (4.49) – (100.00)
Family history of ovarian cancer 3.93 (0.00) – (100.00) – (100.00) – (100.00)
Family history of pancreas cancer 3.06 (0.00) – (100.00) 2.18 (4.47) – (100.00)
Family history of prostate cancer – (100.00) – (100.00) 7.40 (2.53) 9.65 (17.15)
Family history of renal cancer 1.79 (0.00) – (100.00) 1.25 (4.48) – (100.00)
Family history of thyroid cancer 0.70 (0.00) – (100.00) 0.35 (4.50) – (100.00)
Family history of upper GI cancer 4.51 (0.00) – (100.00) 4.63 (4.41) – (100.00)
Ever had enlarged prostate 21.80 (0.18) 0.00 (0.00)
Ever had inflamed prostate 8.45 (16.54) 0.00 (0.00)
Ever had a prostate biopsy 4.98 (2.90) 0.00 (0.00)
Ever had a prostatectomy 0.31 (3.21) – (100.0)
Ever had a prostate resection 2.98 (3.16) 0.00 (0.00)
Ever had a vasectomy 27.28 (0.35) 0.00 (0.00)
Had ovaries removed 16.57 (0.00)
Had tubes tied 21.49 (0.00) 0.00 (0.00)
Ever take birth control pills 54.22 (0.00) – (100.00)
Currently using female hormones 49.33 (0.00) 0.00 (0.00)
Ever take female hormones 66.37 (0.00) – (100.00)
Ever been pregnant 92.49 (0.00) – (100.00)
Ever dealt with infertility 14.51 (0.00) 0.00 (0.00)
Ever had benign or fibrocystic breast disease 28.45 (0.00) 0.01 (0.00)
Ever had benign ovarian tumor/cyst 12.80 (0.00) 0.00 (0.00)
Ever had endometriosis 8.39 (0.00) 0.00 (0.00)
Ever had Uterine fibroid tumors 22.48 (0.00) 0.00 (0.00)
Categorical % in Category % in Category
Race
White 88.55 93.97 88.37 94.18
Black 5.68 1.65 4.56 1.96
Hispanic 1.60 0.00 2.17 0.00
Asian 3.37 2.72 4.07 2.22
Pacific Islander 0.49 0.00 0.62 0.00
American Indian 0.27 0.00 0.25 0.00
Missing 0.04 1.66 0.06 1.64
Education level
< 8 years 0.72 0.00 1.25 0.00
8–11 years 5.82 0.00 7.00 0.00
12 years 27.47 23.85 18.25 28.55
Non-college training 12.85 4.48 12.25 5.76
Some college 23.15 19.28 20.41 16.24
College graduate 15.02 33.56 18.83 31.02
Postgraduate 14.71 0.0 21.73 0.00
Missing 0.26 18.8 0.29 18.44
Marriage status
Married or cohabitating 68.71 76.73 82.51 69.47
Widowed 13.85 0.00 3.60 0.00
Divorced 12.91 0.00 9.05 0.00
Separated 0.92 0.00 1.11 0.00
Never married 3.39 0.00 3.43 0.00
Missing 0.23 23.27 0.29 30.53
Occupation
Homemaker 22.23 0.54 0.08 4.62
Working 35.18 60.23 44.00 54.59
Unemployed 0.96 2.35 1.16 1.06
Retired 36.58 31.22 49.59 35.00
Extended sick leave 0.20 0.00 0.17 0.00
Disabled 2.08 4.09 2.42 2.73
Other 2.24 0.89 2.09 0.94
Missing 0.52 1.08 0.49 1.06
Continuous Mean (SD); % missing Mean (SD); % missing
Age at enrollment 62.5 (5.4); 0.0 56.7 (8.2); 0.0 62.7 (5.3); 0.0 56.3 (8.0); 0.0
BMI at enrollment 27.1 (5.5); 0.0 27.8 (4.2); 0.0 27.5 (4.2); 1.6 27.1 (5.2); 0.5
Weight at age 20 124 (18.1); 0.0 – (–); 100.0 160 (24.3); 1.3 – (–); 100.0
Years since quitting smoking 25.0 (13.3); 0.0 24.6 (14.2); 1.0 16.9 (13.5); 1.0 25.3 (17.8); 46.0
Pack years smoked 13.3 (22.4); 0.0 26.1 (20.9); 1.0 25.2 (31.5); 2.3 20.2 (15.5); 51.4
Monthly aspirin use 9.8 (16.5); 0.0 0.0 (0.0); 0.0 12.2 (16.7); 0.3 0.0 (0.0); 0.0
Monthly ibuprofen use 7.5 (17.4); 0.0 0.0 (0.0); 0.0 4.9 (14.3); 0.5 0.0 (0.0); 0.0
Youngest relative with biliary cancer 68.1 (5.4); 0.0 – (–); 100.0 68.4 (12.2); 0.0 – (–); 100.0
Youngest relative with bladder cancer 67.7 (6.3); 0.0 – (–); 100.0 67.9 (11.9); 1.9 – (–); 100.0
Youngest relative with breast cancer 58.4 (8.0); 0.0 – (–); 100.0 – (–); 0.0 – (–); 100.0
Youngest relative with colorectal cancer 66.2 (6.8); 0.0 – (–); 100.0 65.7 (12.7); 2.1 – (–); 100.0
Youngest relative with endometrial cancer 56.0 (7.2); 0.0 – (–); 100.0 – (–); 0.0 – (–); 100.0
Youngest relative with glioma cancer 54.9 (8.3); 0.0 – (–); 100.0 55.0 (17.8); 1.3 – (–); 100.0
Youngest relative with head and neck cancer 60.7 (5.6); 0.0 – (–); 100.0 61.4 (13.0); 2.6 – (–); 100.0
Youngest relative with hematopoietic cancer 57.0 (10.1); 0.0 – (–); 100.0 56.1 (20.0); 1.9 – (–); 100.0
Youngest relative with liver cancer 64.2 (6.6); 0.0 – (–); 100.0 65.3 (12.5); 1.6 – (–); 100.0
Youngest relative with lung cancer 65.0 (6.1); 0.0 – (–); 100.0 63.9 (11.5); 1.7 – (–); 100.0
Youngest relative with male breast cancer – (–); 0.0 – (–); 100.0 58.8 (15.6); 2.4 – (–); 100.0
Youngest relative with melanoma cancer 55.9 (8.9); 0.0 – (–); 100.0 56.8 (17.3); 1.2 – (–); 100.0
Youngest relative with ovarian cancer 57.9 (8.1); 0.0 – (–); 100.0 – (–); 0.0 – (–); 100.0
Youngest relative with pancreas cancer 68.9 (5.7); 0.0 – (–); 100.0 67.9 (11.9); 1.0 – (–); 100.0
Youngest relative with prostate cancer – (–); 0.0 – (–); 100.0 70.3 (9.8); 2.5 – (–); 100.0
Youngest relative with renal cancer 63.1 (6.7); 0.0 – (–); 100.0 62.5 (14.9); 2.4 – (–); 100.0
Youngest relative with thyroid cancer 43.4 (8.3); 0.0 – (–); 100.0 49.2 (18.6); 2.9 – (–); 100.0
Youngest relative with upper GI cancer 64.3 (6.7); 0.0 – (–); 100.0 63.6 (13.8); 2.0 – (–); 100.0
Number of relatives with biliary cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.1); 0.0 – (–); 100.0
Number of relatives with bladder cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.2); 0.0 – (–); 100.0
Number of relatives with breast cancer 1.1 (0.3); 0.0 1.0 (0.0); 0.0 – (–); 0.0 1.0 (0.0); 0.0
Number of relatives with colorectal cancer 1.1 (0.3); 0.0 1.0 (0.0); 0.0 1.1 (0.3); 0.0 1.0 (0.0); 0.0
Number of relatives with endometrial cancer 1.0 (0.2); 0.0 – (–); 100.0 – (–); 0.0 – (–); 100.0
Number of relatives with glioma cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.2); 0.0 – (–); 100.0
Number of relatives with head and neck cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.2); 0.0 – (–); 100.0
Number of relatives with hematopoietic cancer 1.1 (0.3); 0.0 – (–); 100.0 1.1 (0.2); 0.0 – (–); 100.0
Number of relatives with liver cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.2); 0.0 – (–); 100.0
Number of relatives with lung cancer 1.1 (0.4); 0.0 1.0 (0.0); 0.0 1.1 (0.3); 0.0 1.0 (0.0); 0.0
Number of relatives with male breast cancer – (–); 0.0 – (–); 100.0 1.0 (0.1); 0.0 – (–); 100.0
Number of relatives with melanoma cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.2); 0.0 – (–); 100.0
Number of relatives with ovarian cancer 1.0 (0.2); 0.0 – (–); 100.0 – (–); 0.0 – (–); 100.0
Number of relatives with pancreas cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.2); 0.0 – (–); 100.0
Number of relatives with prostate cancer – (–); 0.0 – (–); 100.0 1.1 (0.3); 0.0 1.0 (0.0); 0.0
Number of relatives with renal cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.1); 0.0 – (–); 100.0
Number of relatives with thyroid cancer 1.0 (0.2); 0.0 – (–); 100.0 1.0 (0.2); 0.0 – (–); 100.0
Number of relatives with upper GI cancer 1.0 (0.2); 0.0 – (–); 100.0 1.1 (0.3); 0.0 – (–); 100.0
Age when prostate became enlarged 52.6 (9.3); 0.5 56.8 (10.2); 0.0
Age when prostate became inflamed 45.0 (13.2); 0.6 – (–); 0.0
How many times you get up at night to urinate 1.3 (0.9); 0.2 – (–); 100.0
Age at which you started urinating at night 50.5 (10.5); 58.5 – (–); 0.0
Age at first prostate surgery 54.9 (7.9); 7.4 54.5 (6.9); 0.0
Age at vasectomy 29.0 (3.5); 0.5 – (–); 0.0
Age at hysterectomy 41.5 (4.6); 0.0 – (–); 100.0
Age started birth control 24.8 (6.4); 0.0 – (–); 100.0
Number of years taking female hormones 6.8 (3.0); 0.0 – (–); 100.0
Age at birth of first child 21.0 (4.5); 0.0 – (–); 100.0
Number of live births 3.1 (1.3); 0.0 – (–); 100.0
Number of miscarriages 0.5 (0.7); 0.0 – (–); 100.0
Number of still births 0.1 (0.3); 0.0 – (–); 100.0
Number of tubal/ectopic pregnancies 0.0 (0.2); 0.0 – (–); 100.0
Age at first menstrual period 12.2 (1.6); 0.0 – (–); 100.0

Table 2.

Count of cancer cases in the data sets.

Cancer Female Male
Train Test Train Test
Biliary 20 77 10 53
Bladder 89 276 387 781
Breast 1,912 4,525 13 31
Colorectal 429 1,034 681 1,352
Endometrial 352 614
Glioma 42 452 60 459
Head and Neck 63 264 171 681
Hematopoietic 351 651 482 849
Liver 8 1,082 60 1,050
Lung 526 949 806 838
Melanoma 195 599 289 575
Ovarian 225 514
Pancreas 89 208 134 202
Prostate 3,749 3,365
Renal 98 229 155 407
Thyroid 47 118 31 46
Upper GI 30 160 164 338

UK Biobank is a large-scale biomedical database that aims to accelerate medical and public health research by gathering and maintaining a large amount of information. It enrolled half a million participants from 2006 to 2010. Follow-ups and additions are made frequently, ranging from repeating the baseline health evaluation to imaging and sequencing. Information is also pulled from death and cancer registries, hospital admissions, and primary care data. From this database we have 229,263 male participants and 273,375 female participants. The UK Biobank data is more detailed than the PLCO data, so we map it onto the PLCO features we used in training.

For both datasets we normalized all the inputs, situating them within the range 0–1. Categorical inputs were handled using one-hot encoding. For the cancer diagnoses we considered diagnoses <5 years after baseline evaluation to be positive and all others to be negative. We handled missing data through k-nearest neighbor imputation with k = 5. Imputation was done separately on PLCO and UK Biobank so that there was no information passed between them, except in the case of a feature completely missing from UK Biobank, in which case we set it to the mean value from the PLCO dataset (Figure 2).
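The imputation step described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the paper does not say which k-nearest neighbor implementation or distance was used, so the distance here (mean squared difference over features observed in both rows) and the function names `knn_impute` and `fill_absent_features` are assumptions.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill each NaN with the mean of that feature over the k nearest rows.

    Assumption: distance is the mean squared difference over the features
    that are observed in both rows being compared.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)
    for i in np.where(~observed.all(axis=1))[0]:
        shared = observed & observed[i]            # features seen in both rows
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(len(X), np.inf)
        ok = n_shared > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / n_shared[ok]
        dist[i] = np.inf                           # a row is not its own neighbor
        order = np.argsort(dist)
        for j in np.where(~observed[i])[0]:
            # nearest rows that actually have a value for feature j
            donors = [r for r in order if observed[r, j] and np.isfinite(dist[r])][:k]
            if donors:
                filled[i, j] = X[donors, j].mean()
    return filled

def fill_absent_features(X_test, train_means):
    """Set features that are missing for every test row to the training mean."""
    X_test = np.asarray(X_test, dtype=float).copy()
    all_missing = np.isnan(X_test).all(axis=0)
    X_test[:, all_missing] = np.asarray(train_means, dtype=float)[all_missing]
    return X_test
```

Calling `knn_impute` once per dataset keeps the two cohorts separate, and `fill_absent_features` covers the one stated exception, where a feature that is entirely absent from the test cohort takes the training-set mean.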

Figure 2.


Flow process for data preparation and model training.

The data was read in and processed in Python with the Pandas library, version 1.5.1. The Pandas data frames were converted to 2d Numpy arrays (version 1.23.4) before being passed to the training software.
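As a rough sketch of the preprocessing on NumPy arrays, the three transformations described earlier (scaling inputs to the range 0–1, one-hot encoding of categorical inputs, and labeling diagnoses within 5 years as positive) might look like the following. The function names and the `years_to_diagnosis` input are hypothetical; the paper does not give the exact scaling formula, so min-max scaling is assumed here.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0."""
    column = np.asarray(column, dtype=float)
    lo, hi = np.nanmin(column), np.nanmax(column)
    if hi == lo:
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

def one_hot(labels, categories):
    """One column per category; a label outside `categories` gives an all-zero row."""
    labels = np.asarray(labels)
    return np.stack([(labels == c).astype(float) for c in categories], axis=1)

def five_year_label(years_to_diagnosis):
    """1 if diagnosed less than 5 years after baseline, else 0.

    Assumption: NaN marks a participant who was never diagnosed.
    """
    years = np.nan_to_num(np.asarray(years_to_diagnosis, dtype=float), nan=np.inf)
    return (years < 5).astype(int)
```

For example, `min_max_scale([55, 65, 75])` yields `[0.0, 0.5, 1.0]`, and a participant diagnosed exactly 5 years after baseline is labeled negative, matching the strict "<5 years" rule.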

Neural network

Using the PLCO dataset we train two different neural networks, one to take in the female data and predict the risk for 16 different cancers and another to take in the male data and predict the risk for 15 different cancers. The networks were trained as binary classifiers, with the positive class being those who developed cancer within 5 years of enrolling in the study. Each network has 2 hidden layers with 120 nodes in the first layer and 80 in the second. This network architecture was chosen because it was previously used with good results in a master's thesis that used the PLCO dataset to predict cancer risk (Yan, 2020). For both the female and male models the biases are initialized to 0 and the weights are initialized with a Glorot normal initializer. We used the ReLU activation function and the Adam optimizer with a learning rate of 0.01. To avoid the exploding gradient problem, we use gradient clipping. For the loss function we use binary cross-entropy. We train with batch sizes of 1,024 for 10 epochs. The prediction for each cancer coming from the output layer was put through a logistic function to scale it to the interval 0–1. We think of these values as the probability of developing cancer and later will multiply them by 100 and use them as the percent risk of developing cancer. The training and predictions were done with TensorFlow 2 via Keras, version 2.11.0.
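To make the architecture concrete, here is a small NumPy sketch of the forward pass and loss. It is not the authors' Keras code. The input width of 136 used in the usage note is a placeholder (the paper does not state the width after one-hot encoding), a plain normal draw stands in for the initializer, and clipping by L2 norm is an assumption because the paper does not say which form of gradient clipping was used.

```python
import numpy as np

def glorot_normal(rng, fan_in, fan_out):
    """Zero-mean normal weights with std sqrt(2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def init_network(n_features, n_cancers, hidden=(120, 80), seed=0):
    """Two hidden layers (120 and 80 nodes); weights Glorot normal, biases zero."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, *hidden, n_cancers]
    return [(glorot_normal(rng, a, b), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def predict_risk(params, X):
    """ReLU hidden layers followed by one logistic output per cancer type."""
    h = np.asarray(X, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all individuals and cancer types."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def clip_by_norm(grad, max_norm):
    """Rescale a gradient so its L2 norm does not exceed max_norm (assumed variant)."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)
```

With `params = init_network(136, 16)`, `predict_risk(params, X)` returns one value in the interval 0–1 per individual and cancer type, which is the quantity later multiplied by 100 and read as a percent risk.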

For each cancer the neural network returns a number in the range of 0–1. Traditionally a threshold value of 0.5 is selected so that values ≥0.5 are considered positive and values below 0.5 are considered negative. However, in the data we are using there are far more people without cancer than with cancer. This data imbalance can lead to bias in the predictions, but this can be addressed by avoiding the default threshold value. We empirically set the threshold (for each cancer) to maximize the Youden index. The Youden index is the difference between the true positive rate and the false positive rate. Maximizing this index picks the threshold value where the ROC curve begins to bend. We maximize the Youden index using the training data and then apply the resulting thresholds to the testing data (Duda et al., 2001; Bishop, 2006; Mitchell, 2006).
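A minimal sketch of this threshold selection, assuming the candidate thresholds are simply the distinct scores seen in the training data (the paper does not specify the search grid, and `youden_threshold` is a hypothetical name):

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR.

    A case is predicted positive when its score is >= threshold.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):                  # every distinct score is a candidate
        pred = scores >= t
        tpr = (pred & y_true).sum() / n_pos
        fpr = (pred & ~y_true).sum() / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = float(t), float(tpr - fpr)
    return best_t, best_j
```

The threshold found on the training scores for one cancer is then reused unchanged on the test cohort, for example `test_pred = test_scores >= t`. The loop is quadratic in the worst case, so a sorted cumulative-count version would be preferable for cohorts of this size.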

Results

Fitting the neural network to predict cancer incidence within 5 years for all 17 cancer types is quite successful. Looking at the ROC for the PLCO data (dotted lines in Figure 3) the classifier is near perfect for every cancer. This is further confirmed by looking at various metrics of effectiveness. On this training data no cancer has an AUC below 0.98, informedness below 0.85, or diagnostic odds ratio below 270 (see Table 3).
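For reference, the threshold-based metrics reported in Table 3 follow from the four cells of a confusion matrix. The sketch below uses the standard definitions; how empty cells are handled (NaN or infinity) is an assumption, since the paper does not describe it, and it may not reproduce every "Nan", "Inf", or 0 entry in the table.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts.

    Assumption: undefined ratios are returned as NaN, and the diagnostic
    odds ratio is infinite when fp * fn == 0 but tp * tn > 0.
    """
    tpr = tp / (tp + fn) if tp + fn else math.nan     # sensitivity
    tnr = tn / (tn + fp) if tn + fp else math.nan     # specificity
    ppv = tp / (tp + fp) if tp + fp else math.nan
    npv = tn / (tn + fn) if tn + fn else math.nan
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else math.nan
    if fp * fn:
        dor = (tp * tn) / (fp * fn)
    else:
        dor = math.inf if tp * tn else math.nan
    return {
        "ppv": ppv,
        "npv": npv,
        "mcc": mcc,
        "informedness": tpr + tnr - 1,               # equals the Youden index
        "dor": dor,
    }
```

With hypothetical counts `confusion_metrics(40, 10, 940, 10)`, the positive predictive value is 0.8 and the diagnostic odds ratio is 376. A classifier with no false positives or no false negatives has an infinite odds ratio, which is one way an "Inf" entry can arise.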

Figure 3.


ROC curves for the neural networks trained on PLCO data and tested on UK Biobank data to predict 17 different cancers.

Table 3.

Metrics of performance. Cutoffs are set on the training data, so Test rows omit the Cutoff column.

Cutoff Positive predictive value Negative predictive value AUC of ROC Matthews correlation coefficient Informedness Diagnostic odds ratio
Biliary
Female
Train 0.263 0.6129 1.0000 0.9933 0.7630 0.9498 120,276
Test 0.0004 0.9998 0.6341 0.0061 0.1804 7
Male
Train 0.028 0.3448 1.0000 0.9999 0.5871 0.9997 Inf
Test Nan 1.0000 0.1339 Nan 0.0000 0.0000
Bladder
Female
Train 0.002 0.2145 1.0000 0.9995 0.4621 0.9957 Inf
Test 0.1047 0.9999 0.9658 0.3113 0.9264 1,727
Male
Train 0.002 0.1414 0.9997 0.9911 0.3609 0.9229 691
Test 0.0139 0.9984 0.7727 0.0750 0.4569 12
Breast
Female
Train 0.025 0.4344 0.9997 0.9883 0.6443 0.9563 2,942
Test 0.0391 0.9996 0.9815 0.1498 0.5788 319
Male
Train 0.001 0.0044 1.000 0.9950 0.0653 0.9605 Inf
Test 0.0000 0.9999 0.3992 −0.0018 −0.0242 0
Colorectal
Female
Train 0.003 0.2396 0.9997 0.9860 0.4734 0.9362 1,211
Test 0.0062 1.0000 0.9979 0.0496 0.3948 Inf
Male
Train 0.006 0.2273 0.9989 0.9640 0.4415 0.8616 295
Test 0.441 0.9999 0.9887 0.1938 0.8546 463
Endometrial
Female
Train 0.003 0.2892 0.9999 0.9954 0.5292 0.9689 4,445
Test 0.0724 0.9999 0.9911 0.2589 0.9266 775
Glioma
Female
Train 0.284 0.4082 1.0000 0.9959 0.6232 0.9516 26,203
Test 0.0686 0.9984 0.9732 0.0416 0.0259 40
Male
Train 0.009 0.3333 1.0000 0.9972 0.5672 0.9651 18,427
Test 0.5401 0.9987 0.8773 0.4275 0.3393 893
Head and Neck
Female
Train 0.001 0.0432 1.0000 0.9858 0.2026 0.9505 1,745
Test 0.0287 0.9999 0.9660 0.1615 0.9112 536
Male
Train 0.003 0.0839 0.9999 0.9948 0.2816 0.9461 1,380
Test 0.3362 0.9996 0.8963 0.5395 0.8668 1,291
Hematopoietic
Female
Train 0.005 0.1424 0.9999 0.9945 0.3683 0.9527 1,845
Test 0.0024 0.9981 0.9339 0.0029 0.0162 187
Male
Train 0.011 0.2616 0.9996 0.9864 0.4897 0.9183 851
Test 0.4614 0.9997 0.9558 0.6461 0.9054 2,555
Liver
Female
Train 0.043 0.3200 1.0000 0.9999 0.5656 0.9998 Inf
Test 0.5537 0.9981 0.9208 0.5321 0.5131 642
Male
Train 0.291 0.4836 1.0000 0.9989 0.6893 0.9825 68,978
Test 0.0000 0.9954 0.4788 −0.0001 −0.0001 218
Lung
Female
Train 0.004 0.2603 0.9998 0.9902 0.4972 0.9504 1,692
Test 0.0625 1.0000 0.9981 0.2434 0.9471 Inf
Male
Train 0.007 0.2978 0.9995 0.9878 0.5255 0.9292 856
Test 0.0644 0.9991 0.8314 0.2125 0.7114 78
Melanoma
Female
Train 0.002 0.2345 0.9999 0.9887 0.4695 0.9407 2,340
Test 0.0023 0.9991 0.9648 0.0061 0.0271 1,231
Male
Train 0.013 0.3305 0.9997 0.9818 0.5522 0.9234 1,824
Test 0.7500 0.9975 0.9543 0.1249 0.0209 4,874
Ovarian
Female
Train 0.001 0.0733 0.9997 0.9681 0.2511 0.8641 270
Test 0.0022 1.0000 0.9989 0.0174 0.1383 Inf
Pancreas
Female
Train 0.003 0.2733 1.0000 0.9992 0.5190 0.9857 28,626
Test 0.0658 0.9998 0.9372 0.2262 0.7797 429
Male
Train 0.002 0.0627 0.9999 0.9953 0.2430 0.2430 1,262
Test 0.3000 0.9991 0.5980 0.0665 0.0148 314
Prostate
Male
Train 0.040 0.4559 0.9992 0.9812 0.6478 0.9223 1,137
Test 0.3226 1.0000 0.9923 0.5589 0.9685 Inf
Renal
Female
Train 0.011 0.3862 1.0000 0.9957 0.6112 0.9674 15,944
Test 0.2194 0.9998 0.9170 0.4027 0.7401 1,302
Male
Train 0.005 0.1014 0.9999 0.9921 0.3059 0.9243 938
Test 0.4878 0.9995 0.8460 0.5978 0.7333 2,001
Thyroid
Female
Train 0.314 0.4423 1.0000 0.9977 0.6577 0.9780 60,262
Test 0.5674 0.9999 0.9877 0.6967 0.8556 20,820
Male
Train 0.001 0.0465 1.0000 0.9982 0.2112 0.9594 3,621
Test 0.0002 0.9999 0.8714 0.0048 0.1508 33
Upper GI
Female
Train 0.120 0.3011 1.0000 0.9845 0.5298 0.9325 16,371
Test 0.6000 0.9996 0.9078 0.3871 0.2499 3,503
Male
Train 0.001 0.1629 0.9998 0.9329 0.3875 0.9222 1,314
Test 0.6238 0.9988 0.6655 0.3405 0.1862 1,345

We tested the model's generalizability on the UK Biobank data. Figure 3 (solid lines) shows that for most cancers the generalization is very good. The cancers that did not generalize well (biliary, male breast, liver, and pancreas) are those with the fewest cases in the training set, and they tend to have few cases in the test set as well (see Table 2). Also, the differences between the training and test ROC curves tend to be larger for the model predicting cancer in males than for the one predicting cancer in females, indicating that the model for females generalized better than the model for males. However, the model still performs very well in terms of AUC and diagnostic odds ratio, with all but 3 cancers having a diagnostic odds ratio above 10 and most of them still in the hundreds or thousands.

In addition to simply training the neural network to predict future cancer incidence, we take the raw output of the model (always in the range of 0–1) as a risk indicator. Multiplying this risk by 100, we can treat it as a risk score and look at an individual's risks across all cancers. In Figure 4A we see an example of such an analysis for a male from the UK Biobank dataset. It shows that he has high risk for colorectal and prostate cancer, but essentially no risk for the other cancers. In Figure 4B we ran the same analysis for a female from the UK Biobank dataset and found that she has moderate risk for most cancers.
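Turning one individual's network outputs into such a profile is a small step, sketched below. The function name `risk_profile` and the numbers in the usage note are made up for illustration and are not taken from Figure 4.

```python
def risk_profile(outputs, cancer_names):
    """Map raw network outputs in [0, 1] to percent risk scores, highest first."""
    if len(outputs) != len(cancer_names):
        raise ValueError("need exactly one output per cancer type")
    scores = {name: 100.0 * float(p) for name, p in zip(cancer_names, outputs)}
    # Sort so the cancers with the highest risk scores come first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

For instance, `risk_profile([0.02, 0.61, 0.55], ["lung", "prostate", "colorectal"])` lists prostate first at about 61, then colorectal at about 55, then lung at about 2. The resulting dictionary holds the values one would draw on the axes of a radar plot.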

Figure 4.


Radar plots of cancer risk for a single patient. (A) A male participant from the UK Biobank data that has high risk for two cancers. (B) A female participant from the UK Biobank data that has moderate risk for many cancers. Information such as this could help individuals and their primary care providers make decisions on screening and preventative measures.

Discussions

In this work we introduce the idea of a statistical biopsy, which mines personal health data from individuals for early cancer detection, analogous to tissue biopsy evaluating cells from a tissue specimen and liquid biopsy evaluating circulating tumor DNA from a fluid sample. Taking advantage of two rich datasets, PLCO and UK Biobank, we were able to train two neural networks (one for men and one for women) to predict cancer risk for 17 different cancers. This model was trained on the cancer focused PLCO dataset and then tested on the much larger UK Biobank dataset.

Testing with the UK Biobank dataset helps to show the model's generalizability and give us confidence that we are not overfitting the PLCO data, especially given the large number of features that we are using. Also given that the UK Biobank data comes from a different population, does not record all data in the same way, and is missing some of the features we used in our model, high performance on this dataset shows that the model has a high degree of robustness. Furthermore, the UK Biobank dataset is representative of the noisy and messy data that a physician would have access to via electronic medical records as opposed to much cleaner data gathered in a clinical trial, giving confidence that this idea can work in practice. While testing on this second dataset that comes from a different population adds a lot of confidence in the generalization of the model, it is important to note that both the training set and test set come from primarily Caucasian populations living in wealthy countries. Validating on additional datasets coming from other countries is important, especially depending on where this model is used.

Despite all this, there were places where the model did not perform well. On cancers such as biliary, liver, and male breast cancer the model did not generalize at all, and for two of these it would do better if its predictions were reversed. Furthermore, on almost every cancer the male model generalized worse than the female model. This is particularly surprising since there are more missing female-only features in the test set than there are missing male-only features. We need to further test the importance of these female-only and male-only features and whether there are other features that should be included. In addition to exploring feature importance, we are also working on quantifying the uncertainty in our predictions arising from these missing features, and on a way for the model not only to give a prediction but also to indicate which feature would most improve the prediction if it were known. Also, while the diagnostic odds ratio is high for almost all the cancers, it needs to be compared against tested screening guidelines (whether recommended or not) to see if our statistical biopsy is actually an improvement over traditional methods.
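The remark about reversed predictions follows from how the AUC is defined: it is the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, so flipping the scores turns an AUC of a into 1 − a. The sketch below computes the AUC directly from that definition; `roc_auc` is a hypothetical helper, and the pairwise comparison is only practical for small samples.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count half."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    wins = (pos[:, None] > neg[None, :]).sum()    # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))
```

By this identity, the test AUC of 0.1339 reported for male biliary cancer in Table 3 would become 0.8661 if the scores were inverted, although an inversion chosen after seeing the test data would itself need independent validation.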

Lastly, while the stochastic nature of cancer development means a statistical biopsy could never completely replace a liquid or tissue biopsy, like the screening guidelines (see Footnote 1–3) it could direct those traditional biopsies to the individuals who would benefit most from them. It also makes it possible to generate a holistic analysis of an individual's risk for a variety of cancers simultaneously, offering the general screening of a liquid biopsy while retaining the specificity of a tissue biopsy (i.e., identifying which cancers one is at high risk for). Moreover, if integrated into a modern electronic medical record (EMR) system, it offers a cost-effective and safe approach to cancer screening in real time, informing preventive interventions and screening decisions.

This model will form the backbone of a user-facing mobile health platform that will not only let individuals evaluate their cancer risk in real time, but also see the effect of certain preventative measures or lifestyle changes on those risks.

In the short term, we hope that this mobile health platform will not only help individuals with early cancer detection, but also continue to improve as it builds up a large and diverse longitudinal dataset shared by consenting individuals.

Ultimately, we envision that a model like this will be integrated into EMR systems, so that it can update its predictions every time an individual visits their doctor, has a test done, and so on. It would assist physicians and patients, prompting conversations about cancer prevention and screenings as needed. In addition, as the model matures with more data, it could also indicate which tests or diagnostics would provide the most information on cancer risk, as well as the timing and spacing of such diagnostics.

While there are still many hurdles to overcome, at the scientific, social, and legal levels, there is already a good start toward this vision of statistical biopsies. Keeping active discussions on all three levels in the community is necessary for stakeholders to make steady progress toward the vision of statistical biopsy.

Conclusion

We trained two neural networks to predict the risk of 16 types of cancer in females and 15 types in males and validated them against a second dataset from a different population. We showed that this model could be used to look holistically at an individual's cancer risks. We introduced the term “statistical biopsy” to help change the paradigm around these types of models. With the large amounts of data available and with powerful computers and algorithms, it is time we move beyond guidelines for general population screening to more powerful and personalized methods akin to the liquid and tissue biopsies currently used in the medical field.

Data availability statement

The existing datasets analyzed in this study can be accessed by application via the following links: https://cdas.cancer.gov/datasets/plco/ and https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.

Author contributions

GH and VY developed models and code. GH, VY, and JD developed the core ideas and did most of the writing. BN, GH, VY, DR, GS, and WM did preliminary work predicting individual cancers with different models and datasets. All authors contributed to the article and approved the submitted version.

Funding Statement

This work was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number R01EB022589, the National Science Foundation under Award Number DMS 1918925, the National Cancer Institute under Award Number 21X130F, and the Department of Energy under Award Number DE-SC0021655 to JD.

Footnotes

2 National Comprehensive Cancer Network Guidelines. https://www.nccn.org/professionals/physician_gls/default.aspx.

3 United States Preventive Services Task Force Published Recommendations. https://www.uspreventiveservicestaskforce.org/BrowseRec/Index.

Conflict of interest

DR was employed by Sun Nuclear Corporation (Mirion Technologies Inc.), Melbourne, FL, United States. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.



Articles from Frontiers in Artificial Intelligence are provided here courtesy of Frontiers Media SA
