An enrichment model using regular health examination data for early detection of colorectal cancer

Qiang Shi; Zhaoya Gao; Pengze Wu; Fanxiu Heng; Fuming Lei; Yanzhao Wang; Qingkun Gao; Qingmin Zeng; Pengfei Niu; Cheng Li; Jin Gu

2019 Aug;31(4):686–698. doi: 10.21147/j.issn.1000-9604.2019.04.12


PMCID: PMC6736654 PMID: 31564811

Abstract

Objective

Challenges remain in current practices of colorectal cancer (CRC) screening, such as low compliance, low specificity and high cost. This study aimed to identify high-risk groups for CRC from the general population using regular health examination data.

Methods

The study population consisted of more than 7,000 CRC cases and more than 140,000 controls. Using regular health examination data, a model for detecting CRC cases was derived with the classification and regression trees (CART) algorithm. Receiver operating characteristic (ROC) curves were used to evaluate model performance. The robustness and generalization of the CART model were validated on independent datasets. In addition, the effectiveness of CART-based screening was compared with that of stool-based screening.

Results

After data quality control, 4,647 CRC cases and 133,898 controls free of colorectal neoplasms were used for downstream analysis. A final CART model based on four biomarkers (age, albumin, hematocrit and percent lymphocytes) was constructed. In the test set, the area under the ROC curve (AUC) of the CART model was 0.88 [95% confidence interval (95% CI), 0.87−0.90] for detecting CRC. At the cutoff yielding 99.0% specificity, this model’s sensitivity was 62.2% (95% CI, 58.1%−66.2%), thereby achieving a 63-fold enrichment of CRC cases. We validated the robustness of the method across subsets of the test set with diverse CRC incidences, aging rates, gender ratios, distributions of tumor stages and locations, and data sources. Importantly, CART-based screening had a higher positive predictive value (1.6%) than the fecal immunochemical test (0.3%).

Conclusions

As an alternative approach for the early detection of CRC, this study provides a low-cost method using regular health examination data to identify high-risk individuals for CRC for further examinations. The approach can promote early detection of CRC especially in developing countries such as China, where annual health examination is popular but regular CRC-specific screening is rare.

Keywords: Classification and regression trees, colorectal cancer, regular health examination data, routine lab test biomarkers

Introduction

Colorectal cancer (CRC) is the third most common cancer in males and the second most common in females worldwide (1). With changes in modern lifestyles, such as high-fat diets and sedentary occupations (2), the number of new CRC cases has increased rapidly, with an estimated 1.8 million new cases and 861,663 deaths worldwide in 2018 (1). In China, the incidence and mortality rates of CRC in 2015 were 27 and 13 per 100,000, respectively, ranking CRC as the fifth most frequent cancer nationwide (3). In Beijing, CRC incidence is higher in urban areas than in rural areas, but the incidence rate is increasing faster in rural areas (4).

Most colorectal carcinomas develop from a preclinical state of adenoma, which takes years to progress to advanced cancers (5,6). During this progression, early-stage CRC can be diagnosed by invasive imaging techniques, such as flexible sigmoidoscopy and colonoscopy (7,8). Early detection and removal of adenomas can reduce CRC incidence and mortality significantly (7-10). In the US, the 5-year survival rate for CRC cases diagnosed at early stages is 90%, in contrast to 65% for all CRC cases (11). From 1989 to 2011, the largest reductions in CRC mortality rates, more than 25%, have been achieved in European countries with better accessibility to CRC screening programs (12). Over the last decade, newer tests based on DNA, RNA and protein biomarkers in stool and blood have also improved the accuracy of CRC screenings (7,13-17).

However, challenges remain for regular CRC screening of large populations, especially in developing countries such as China. First, people are often unwilling to undergo invasive examinations such as colonoscopy because of physical or psychological reasons and high cost (7). Low compliance results in incomplete CRC screening of the population. Second, given the very low prevalence of CRC and the low specificity of current screening tests (18,19), it is not cost-effective for the whole population to receive invasive or molecular biomarker-based screening for CRC, especially in regions with no or limited access to treatment (20). Third, since different countries have different health policies, economies and medical cultures, it is difficult to develop a universal CRC screening program suitable for all countries (7,21-23).

An alternative to invasive CRC screening for the whole population is to first identify high-risk groups for CRC by non-invasive examinations or questionnaires, and then to perform invasive CRC examination only in these high-risk groups. Such two-step schemes (13) have been piloted in certain regions of China and offer better compliance, lower overall cost and feasibility across different regions and cultures (24-26). Provided that sensitivity is good, an extremely high specificity, that is, a low false positive rate (FPR), in the initial screening is crucial to ensure that only a very small proportion of CRC-free individuals are incorrectly assigned to the high-risk group that will receive further invasive examinations. However, existing mathematical model-based methods for CRC screening cannot balance sensitivity and specificity well. Usually, FPRs are more than 5% at around 60% sensitivity, reducing the utility of these methods for general population screening (27-33).

In this study, we aimed to identify the high-risk groups for CRC from the general population by their routine lab test biomarkers from regular physical health examination data. Based on the classification and regression trees (CART) model detecting CRC cases, we achieved a 63-fold enrichment of CRC cases in the identified high-risk group relative to the original population with high sensitivity and specificity.

Materials and methods

Study population

Study data were from two independent hospitals in Beijing, China: Peking University Cancer Hospital (PUCH) and Peking University Shougang Hospital (PUSH). Specifically, routine lab test data of CRC cases were from the Departments of Gastrointestinal or General Surgery, while data of controls free of colorectal neoplasms were from physical health examination centers that provide services to the public. The PUCH data set consists of 7,068 diagnosed CRC cases from 2010 to 2015 and 80,194 controls who received physical health examinations from 2007 to 2014 but were not clinically diagnosed with CRC. The PUSH data set consists of 453 CRC cases from 2011 to 2016 and 66,570 controls from 2009 to 2016 (Supplementary Table S1). All patient and control records were anonymized and de-identified prior to analysis.

Supplementary Table S1. Data quality control

Source	Class	Year	Raw samples	Raw items	Samples after quality control	Common items after quality control	Specific items after quality control
PUCH	CRC	2010−2015	7,068	363	4,211	25 + gender + age	11
PUCH	Controls	2007−2014	80,194	220	77,099	25 + gender + age	11
PUSH	CRC	2011−2016	453	331	436	25 + gender + age	3
PUSH	Controls	2009−2016	66,570	434	56,799	25 + gender + age	3

PUCH, Peking University Cancer Hospital; PUSH, Peking University Shougang Hospital; CRC, colorectal cancer.

Ethics approval and consent to participate

Approval for the study was provided by the Ethics Committee of Peking University Shougang Hospital (IRBK-2017-035-01). The Ethics Committees granted waivers of informed consent since this study involved analysis of retrospective data and all patient and control records were anonymized and de-identified prior to analysis.

Availability of data and materials

The R code and part of the data analyzed during the study are available in the GitHub repository, https://github.com/ChengLiLab/CRC_screening. The complete datasets are available from the corresponding author on reasonable request.

Data quality control

We performed quality control on the two data sets before analysis. First, when CRC cases underwent routine lab examinations multiple times, we retained only the first preoperative record, which best represents the patient’s original condition before interventions or treatments. Correspondingly, if controls underwent examinations more than once, we used only the latest record. Second, we deleted samples without gender or age records. Third, for the PUCH data set, we selected the top 35% and top 40% of lab test biomarkers with the fewest missing values for the CRC and control groups, respectively. The cutoffs for filtering biomarkers were determined from the distributions of the number of non-missing values. We thereby obtained 38 biomarkers that overlapped between the two top lists, and extracted the final PUCH data set containing these 38 biomarkers for both CRC and control groups (Supplementary Table S1). Furthermore, we corrected obvious data entry mistakes, for example, one individual’s age recorded as 410. Similarly, we obtained a quality-checked PUSH data set containing 30 biomarkers, of which 27 were common to the two data sets (Supplementary Table S1).
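The record-selection and biomarker-filtering steps described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R code: the record layout (`person_id`, `exam_date`, `is_case`, `gender`, `age`, `labs`) and both function names are hypothetical, and the input for CRC cases is assumed to contain preoperative records only.

```python
from collections import Counter


def select_records(records):
    """Keep one record per person, then drop records lacking gender or age.

    Each record is a dict with hypothetical keys: person_id, exam_date
    (ISO date string), is_case, gender, age, labs (biomarker -> value or None).
    Cases keep their earliest examination; controls keep their latest one.
    """
    chosen = {}
    for rec in records:
        best = chosen.get(rec["person_id"])
        if best is None:
            chosen[rec["person_id"]] = rec
        elif rec["is_case"] and rec["exam_date"] < best["exam_date"]:
            chosen[rec["person_id"]] = rec  # cases: first examination
        elif not rec["is_case"] and rec["exam_date"] > best["exam_date"]:
            chosen[rec["person_id"]] = rec  # controls: latest examination
    return [r for r in chosen.values()
            if r["gender"] is not None and r["age"] is not None]


def top_complete_biomarkers(records, fraction):
    """Return the `fraction` of biomarkers with the fewest missing values."""
    counts = Counter()
    for rec in records:
        for name, value in rec["labs"].items():
            counts[name] += value is not None  # adds 0 for a missing value
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    keep = max(1, round(len(ranked) * fraction))
    return set(ranked[:keep])
```

Under these assumptions, the biomarkers shared by both groups would be the intersection `top_complete_biomarkers(cases, 0.35) & top_complete_biomarkers(controls, 0.40)`.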

CART

CART (34) is a widely used decision tree algorithm. A CART model can be represented by a binary tree, which splits its branches at feature thresholds according to the Gini splitting rule. To avoid overfitting, we used minimal cost complexity pruning (35) to prune the original tree. Compared with other methods, the CART model is easier to interpret, and the CART algorithm is more robust in handling missing values, with better computational speed and accuracy (34,36,37).

Data allocation

We divided the PUCH data set into two parts, 80% for training models and the remaining 20% for testing the performance of the final model. Then we divided the training set into two parts, 75% for developing models and 25% for validating the models.
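The two-level allocation amounts to a 60%/20%/20% split into development, validation and test sets. A minimal sketch follows, assuming a simple random (unstratified) shuffle, since the text does not state whether the split was stratified by class; `split_indices` and its parameters are illustrative names.

```python
import random


def split_indices(n, train_frac=0.8, develop_frac=0.75, seed=0):
    """Shuffle indices 0..n-1 and return (develop, validate, held_out) lists.

    First `train_frac` of the samples go to training; the training part is
    then divided again, with `develop_frac` of it used for model development
    and the rest for validation.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_frac)
    train, held_out = idx[:n_train], idx[n_train:]
    n_develop = round(len(train) * develop_frac)
    return train[:n_develop], train[n_develop:], held_out


develop, validate, held_out = split_indices(1000)
print(len(develop), len(validate), len(held_out))  # 600 200 200
```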

However, because of differences in the specific instruments used, the values of lab test biomarkers were not always comparable across hospitals. Sometimes, the reference values of the same biomarkers differed between PUCH and PUSH. Therefore, the CART model trained on PUCH data cannot be directly applied to PUSH. As a further validation of the method, we used 70% of the PUSH data set to train another CART model with the same biomarkers as the final CART model from PUCH, and tested its performance on the remaining 30% of the PUSH data set.

Variable selection for CART

For each variable used in the CART model, the variable importance is defined as the sum of the decrease in impurity at all nodes where it is used as a splitter (34). For a classification with $K$ classes, let $p_k$ be the proportion of class $k$ samples in a node; the Gini impurity is defined as:

$$G = 1 - \sum_{k=1}^{K} p_k^{2}$$

After obtaining the first CART model, we selected the important variables according to their importance and combined these variables to train a new CART model. We repeated the process of refining the CART model from the training set and validating its performance using the validation data, until the best CART model was derived. The surrogate split method is used to handle missing values in CART (34).
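To make the importance measure concrete, the sketch below computes the Gini impurity of a node (one minus the sum of squared class proportions) and the impurity decrease produced by one threshold split. Summing such decreases over all nodes where a variable is the splitter gives its importance as described above. The function names and the `n_total` weighting argument are illustrative assumptions, not taken from the authors' code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity, 1 - sum_k p_k^2, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold, n_total=None):
    """Gini decrease from splitting a node into value < threshold and the rest.

    The decrease is weighted by the node's share of all samples (n / n_total),
    so that contributions from different nodes can be summed.
    """
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    n_total = n if n_total is None else n_total
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(labels) - children)
```

For example, a node with two controls aged 30 and 35 and two cases aged 60 and 70 has impurity 0.5, and splitting it at age 50 removes all of that impurity.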

Measuring performance of models

We evaluated models by overall performance and specific performance in different application scenarios. First, as an overall measure of performance, we used the area under the receiver operating characteristic (ROC) curve (AUC). Second, we inspected the sensitivities at cutoffs yielding 99.0% specificity, which represents how many real CRC cases can be detected at the cost of misclassifying 1% of CRC-free individuals as CRC cases.
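Both measures can be computed directly from predicted scores. The sketch below is a plain-Python illustration rather than the authors' implementation: `auc` uses the pairwise-comparison definition (quadratic in sample size, so only suitable for small examples), and `sensitivity_at_specificity` assumes that a sample is called positive when its score is strictly above the cutoff.

```python
import math


def auc(case_scores, control_scores):
    """AUC as the probability that a case outscores a control (ties count 1/2)."""
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            wins += 1.0 if c > h else 0.5 if c == h else 0.0
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.99):
    """Return (sensitivity, cutoff) at the lowest cutoff reaching `specificity`.

    The cutoff is the smallest control score such that at least the requested
    fraction of controls score at or below it.
    """
    ranked = sorted(control_scores)
    # Number of controls that must fall at or below the cutoff; the small
    # epsilon guards against floating-point error in the product.
    needed = math.ceil(specificity * len(ranked) - 1e-9)
    cutoff = ranked[needed - 1] if needed > 0 else float("-inf")
    sensitivity = sum(s > cutoff for s in case_scores) / len(case_scores)
    return sensitivity, cutoff
```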

Results

Sample selection and routine lab test biomarkers

After data quality control, 4,211 CRC cases and 77,099 controls free of colorectal neoplasms from the PUCH data set and 436 CRC cases and 56,799 controls from the PUSH data set were used for downstream analysis (Table 1, Supplementary Table S1). In both data sets, individuals in the CRC group were significantly older than those in the control group (P<2.2e−16) (Table 1, Figure 1A). In addition, the proportion of males was higher than that of females in both CRC and control groups. In particular, 73.2% of the controls in the PUSH data set were male, because PUSH provided health services to a major steel company in China (Table 1). Both data sets covered CRC cases clinically diagnosed at multiple tumor stages and locations (Table 1).

Table 1. Characteristics of study population

Group	PUCH CRC [n (%)]	PUCH Control [n (%)]	PUSH CRC [n (%)]	PUSH Control [n (%)]
Total	4,211	77,099	436	56,799
Age (year)
<45	444 (10.5)	42,531 (55.2)	17 (3.9)	36,928 (65.0)
45−54	810 (19.2)	16,893 (21.9)	65 (14.9)	15,131 (26.6)
55−64	1,297 (30.8)	9,559 (12.4)	129 (29.6)	3,751 (6.6)
65−74	1,031 (24.5)	5,402 (7.0)	96 (22.0)	488 (0.9)
≥75	629 (14.9)	2,714 (3.5)	129 (29.6)	501 (0.9)
Mean±SD	60.6±12.5	43.9±14.6	66.4±12.8	39.0±11.8
Gender
Male	2,470 (58.7)	39,127 (50.7)	252 (57.8)	41,594 (73.2)
Female	1,741 (41.3)	37,972 (49.3)	184 (42.2)	15,205 (26.8)
Tumor stage
I	252 (6.0)	−	37 (8.5)	−
II	595 (14.1)	−	78 (17.9)	−
III	633 (15.0)	−	93 (21.3)	−
IV	147 (3.5)	−	31 (7.1)	−
Unspecified	2,584 (61.4)	−	197 (45.2)	−
Tumor location
Colon	2,279 (54.1)	−	179 (41.1)	−
Rectum	1,932 (45.9)	−	257 (58.9)	−

PUCH, Peking University Cancer Hospital; PUSH, Peking University Shougang Hospital; CRC, colorectal cancer.

Figure 1. Distributions of two biomarkers and flowchart of model generation. Violin plots of age (A) (P<2.2e−16) and albumin (B) (P<2.2e−16) distributions for colorectal cancer (CRC) and control groups; (C) Flowchart of model generation. CART, classification and regression trees.

For the PUCH training set, we obtained 38 common biomarkers from routine lab tests, of which 14 showed significant differences between the CRC and control groups in terms of both statistical significance and effect size (Table 2, P<0.001, |Cohen’s d| ≥0.5) and were used for downstream analysis. CRC patients often have abnormal blood counts, and some of these biomarkers have been used for CRC screening, diagnosis and prognosis (27-31,38-40). For example, blood albumin was significantly lower in the CRC group than in the control group (P<2.2e−16) (Figure 1B). Based on these observations, we hypothesized that a multivariate classification model could distinguish CRC cases from CRC-free individuals.

Table 2. Biomarkers of quality-controlled PUCH population

Biomarker	Reference value	Unit	P	Cohen’s d (95% CI)	Magnitude*
Age**	−	−	<0.001	1.13 (1.09, 1.17)	Large
A/G	1.0−2.5	−	<0.001	−0.32 (−0.39, −0.25)	Small
Alb**	35.0−55.0	g/L	<0.001	−1.46 (−1.53, −1.40)	Large
ALT	0−40	IU/L	<0.001	−0.28 (−0.32, −0.24)	Small
AST	0−45	IU/L	<0.001	−0.16 (−0.21, −0.11)	Negligible
BASO%	0.00−1.00	%	0.364	0.02 (−0.03, 0.07)	Negligible
Ca	2.12−2.75	mmol/L	<0.001	−0.82 (−0.88, −0.77)	Large
Crea	50−130	μmol/L	<0.001	−0.34 (−0.38, −0.30)	Small
EO%	1.00−5.00	%	0.005	0.08 (0.03, 0.13)	Negligible
Gender	−	−	−	−	−
Glu	3.60−6.10	mmol/L	<0.001	0.70 (0.66, 0.74)	Medium
HCT**	37.0−49.0	%	<0.001	−1.09 (−1.13, −1.04)	Large
HDL-C	0.82−1.96	mmol/L	<0.001	−0.54 (−0.59, −0.49)	Medium
HGB	110−150	g/L	<0.001	−1.11 (−1.15, −1.06)	Large
K	3.5−5.3	mmol/L	0.004	0.10 (0.03, 0.16)	Negligible
LDL-C	1.80−3.90	mmol/L	0.001	−0.08 (−0.13, −0.03)	Negligible
LYMPH%**	20−40	%	<0.001	−1.33 (−1.37, −1.28)	Large
MCH	27.00−31.00	pg	<0.001	−0.52 (−0.56, −0.48)	Medium
MCHC	320−360	g/L	<0.001	−0.48 (−0.52, −0.43)	Small
MCV	82.00−92.00	fL	<0.001	−0.39 (−0.43, −0.35)	Small
MONO%	3.00−8.00	%	<0.001	0.27 (0.22, 0.32)	Small
MPV	6.80−13.50	fL	0.954	0.00 (−0.04, 0.04)	Negligible
NEUT%	50.00−70.00	%	<0.001	0.98 (0.94, 1.03)	Large
P	0.69−1.60	mmol/L	<0.001	0.26 (0.20, 0.31)	Small
P-LCR	13.0−43.0	%	<0.001	−0.09 (−0.14, −0.05)	Negligible
PCT	0.108−0.370	%	0.643	−0.01 (−0.06, 0.03)	Negligible
PDW	15.5−18.1	%	<0.001	−0.55 (−0.59, −0.50)	Medium
PLT	100−350	×10⁹/L	<0.001	0.15 (0.11, 0.19)	Negligible
RBC	3.50−5.50	×10¹²/L	<0.001	−0.61 (−0.65, −0.57)	Medium
RDW-CV	11.60−14.80	%	<0.001	0.51 (0.46, 0.55)	Medium
RDW-SD	37−50	fL	<0.001	0.33 (0.29, 0.37)	Small
TBil	1.70−20.0	μmol/L	0.175	0.05 (−0.03, 0.13)	Negligible
TCHO	2.84−5.68	mmol/L	<0.001	−0.20 (−0.25, −0.16)	Small
TG	0.56−1.70	mmol/L	0.007	−0.04 (−0.09, 0.00)	Negligible
TP	60.0−80.0	g/L	<0.001	−1.23 (−1.29, −1.17)	Large
UA	90−340	μmol/L	<0.001	−0.16 (−0.20, −0.12)	Negligible
Urea	1.7−8.3	mmol/L	<0.001	0.14 (0.10, 0.19)	Negligible
WBC	4.0−10.0	×10⁹/L	<0.001	0.26 (0.21, 0.30)	Small

PUCH, Peking University Cancer Hospital; 95% CI, 95% confidence interval; A/G, ratio of albumin to globulin; Alb, albumin; ALT, alanine transaminase; AST, aspartate transaminase; BASO%, percent basophils; Ca, calcium; Crea, creatinine; EO%, percent eosinophils; Glu, glucose; HCT, hematocrit; HDL-C, high density lipoprotein-cholesterol; HGB, hemoglobin; K, potassium; LDL-C, low density lipoprotein-cholesterol; LYMPH%, percent lymphocytes; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; MONO%, percent monocytes; MPV, mean platelet volume; NEUT%, percent neutrophils; P, phosphorus; P-LCR, platelet large cell ratio; PCT, plateletcrit; PDW, platelet distribution width; PLT, platelet; RBC, red blood cell count; RDW-CV, coefficient of variation of red blood cell distribution width; RDW-SD, standard deviation of red blood cell distribution width; TBil, total bilirubin; TCHO, total cholesterol; TG, triglyceride; TP, total protein; UA, uric acid; WBC, white blood cell count; *, the magnitudes were assessed using the Cohen’s d thresholds: |d|<0.2 “Negligible”, |d|<0.5 “Small”, |d|<0.8 “Medium”, otherwise “Large”; **, biomarkers selected in the final classification and regression trees (CART) model and plotted in Supplementary Figure S3.
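The effect sizes and magnitude labels used for biomarker screening can be illustrated as follows. This sketch assumes the common pooled-standard-deviation form of Cohen's d, computed as cases minus controls so that the sign matches the table (positive for age, negative for albumin); the paper does not state which variant was used. The thresholds are those given in the table footnote.

```python
import math
from statistics import mean, variance


def cohens_d(cases, controls):
    """Cohen's d (cases minus controls) using the pooled standard deviation."""
    n1, n2 = len(cases), len(controls)
    pooled_var = ((n1 - 1) * variance(cases)
                  + (n2 - 1) * variance(controls)) / (n1 + n2 - 2)
    return (mean(cases) - mean(controls)) / math.sqrt(pooled_var)


def magnitude(d):
    """Label |d| with the thresholds 0.2, 0.5 and 0.8 from the table footnote."""
    size = abs(d)
    if size < 0.2:
        return "Negligible"
    if size < 0.5:
        return "Small"
    if size < 0.8:
        return "Medium"
    return "Large"


print(magnitude(-1.46), magnitude(0.51), magnitude(-0.48), magnitude(-0.16))
# Large Medium Small Negligible
```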

A CART-based CRC classification model using routine lab test biomarkers

Based on routine lab test biomarkers in the PUCH training set, a model for detecting CRC cases was constructed with the CART algorithm, and its performance was evaluated on the test set using the ROC curve (Figure 1C, Supplementary Figure S1). The final, best CART model consisted of only four biomarkers: age (Age), albumin (Alb), hematocrit (HCT) and percent lymphocytes (LYMPH%). Notably, all four biomarkers had large effect sizes (Table 2, |Cohen’s d| ≥0.8) between the CRC and control groups. We also built a simple baseline model that used age as the only predictor.

Supplementary Figure S1. Final classification and regression trees (CART) model trained from Peking University Cancer Hospital (PUCH) data set. The CART model, represented by a binary tree, consists of four biomarkers (variables): age, albumin (Alb), hematocrit (HCT), and percent lymphocytes (LYMPH%). Green boxes contain more healthy cases and blue boxes contain more patient cases. Each box contains three rows: the top row labels the major class (patient or healthy); the middle row shows the percentages of healthy and patient cases, respectively; the bottom row gives the proportion of all cases that fall within the box.

In the training set, the AUC values of the CART and age models were 0.90 [95% confidence interval (95% CI), 0.88−0.91] and 0.81 (95% CI, 0.80−0.82), respectively (Figure 2A), showing that the CART model was superior to the age model overall. Notably, at cutoffs yielding 99.0% specificity, the sensitivity of the CART model was 67.0% (95% CI, 63.7%−70.2%), much higher than the 4.8% (95% CI, 3.2%−6.4%) of the age model (Figure 2B). Therefore, the CART model can correctly identify 67% of real CRC cases at the cost of misclassifying 1% of CRC-free individuals as CRC cases. The CART-predicted probabilities of being CRC were indeed higher for real CRC cases than for controls (Figure 2C). The reliability of the CART model was validated by its performance on the test set. Specifically, the AUC of the CART model was 0.88 (95% CI, 0.87−0.90) and the sensitivity was 62.2% (95% CI, 58.1%−66.2%) at the 99.0% specificity (Figure 2D,E). The CART-predicted probabilities of being CRC in the test set were consistent with this result (Figure 2F). We concluded that the CART model was able to distinguish CRC cases from CRC-free individuals with high sensitivity and specificity.

Figure 2. Performance of models on training set and test set. (A) Receiver operating characteristic (ROC) curves of the age model and the final classification and regression trees (CART) model on the training set. The values shown are areas under the curve (AUCs) with 95% confidence intervals (95% CIs) based on 1,000 bootstrap iterations. P<0.001 for two-tailed DeLong’s test between the two ROC curves; (B) Enlarged local ROC curves from Figure 2A. Dashed lines show that the sensitivities at the 99.0% specificity of the CART and age models are 67.0% (95% CI, 63.7%−70.2%) and 4.8% (95% CI, 3.2%−6.4%), respectively. 95% CIs were constructed based on 1,000 bootstrap iterations; (C) CART-predicted probability distribution of being colorectal cancer (CRC) in the training set. The vertical dashed line shows the cutoff probability of 0.48, which yields 99.0% specificity; (D−F) Figures similar to Figure 2A−C when applying the same models to the test set. (D) P<0.001 for two-tailed DeLong’s test between the two ROC curves; (E) Dashed lines show that the sensitivities at the 99.0% specificity of the CART and age models are 62.2% (95% CI, 58.1%−66.2%) and 3.73% (95% CI, 2.42%−5.04%), respectively; (F) The vertical dashed line shows the cutoff probability of 0.48, which yields 99.0% specificity.
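The confidence intervals above are reported as based on 1,000 bootstrap iterations. The exact resampling scheme is not described, so the following sketch makes two assumptions: cases and controls are resampled separately with replacement, and the interval is taken from percentiles of the bootstrap distribution. `bootstrap_ci` and the compact `auc` helper are illustrative names.

```python
import random


def auc(case_scores, control_scores):
    """Probability that a case outscores a control (ties count 1/2)."""
    total = sum((c > h) + 0.5 * (c == h)
                for c in case_scores for h in control_scores)
    return total / (len(case_scores) * len(control_scores))


def bootstrap_ci(cases, controls, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(cases, controls)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample each class separately so class sizes stay fixed.
        resampled_cases = rng.choices(cases, k=len(cases))
        resampled_controls = rng.choices(controls, k=len(controls))
        stats.append(statistic(resampled_cases, resampled_controls))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper
```

Calling `bootstrap_ci(case_scores, control_scores, auc)` would then return an approximate 95% interval for the AUC; any other two-sample statistic can be passed in the same way.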

To ensure that the CRC and control groups were comparable in sample collection period, we selected the subset of PUCH data in which the CRC cases and controls were both from 2010 to 2014, and repeated the training and testing of the CART model. The final CART model consisted of the same four biomarkers as before: Age, Alb, HCT and LYMPH%. Both the AUCs and the sensitivities at the 99.0% specificity were almost the same as those obtained using the whole data (Supplementary Table S2). Therefore, our results were little affected by the time periods of sample collection.

Supplementary Table S2. Performance of CART model on data subsets from same time periods

Variables	Data subset (years 2010−2014)	Whole data
AUC (95% CI)
Training	0.89 (0.88−0.91)	0.90 (0.88−0.91)
Test	0.89 (0.88−0.91)	0.88 (0.87−0.90)
Sensitivity (95% CI)*
Training	66.0% (62.4%−69.5%)	67.0% (63.7%−70.2%)
Test	62.8% (58.3%−67.3%)	62.2% (58.1%−66.2%)

CART, classification and regression trees; AUC, the area under receiver operating characteristic curve; 95% CI, 95% confidence interval; *, sensitivity at the 99.0% specificity.

Robustness and generalization of CART model

Next, we examined whether the CART model could be applied to general populations with diverse characteristics. First, we applied the CART model to randomly sampled subsets of the test set with class ratios varying from 1:10 to 1:10,000. Notably, these imbalanced sample-class ratios (41) resulted in highly uniform AUCs and sensitivities at the 99.0% specificity (Figure 3A), which indicates that the CART model would remain effective in regions with different CRC incidence rates.

Figure 3. Robustness and generalization of classification and regression trees (CART) model. (A) Areas under the curve (AUCs) and sensitivities at 99.0% specificity of the CART model on subsets of the test set with different sample-class ratios. Every bar shows the mean of 1,000 random samples. Error bars show the standard error derived from these 1,000 random samples; (B) Figure similar to Figure 3A for subsets of the test set with different proportions of elderly individuals; (C) Gender-specific receiver operating characteristic (ROC) curves. Three solid curves show the CART model’s performance on the whole test set (All), male subset (Male) and female subset (Female), respectively. Values shown are AUCs with 95% confidence intervals (95% CIs). Dashed lines highlight the sensitivities at the 99.0% specificity; (D,E) Figures similar to Figure 3C for tumor stage-specific (D) and tumor location-specific (E) subsets of the test set; (F) Data source-specific ROC curves. The Peking University Cancer Hospital (PUCH) curve shows the final CART model’s performance on the test set of PUCH. The Peking University Shougang Hospital (PUSH) curve shows the performance of the CART model trained on 70% of the PUSH data set and tested on the remaining 30%. All P>0.05 for two-tailed DeLong’s test between any two ROC curves within the same panel. CRC, colorectal cancer.

In addition, previous studies indicated that the incidence rate of CRC increases with age (42). To determine the effect of age on the model, we applied the CART model to randomly sampled subsets with different proportions of elderly individuals (older than 60 years). The results showed that the CART model retained good predictive power, especially for groups with aging rates below 20% (Figure 3B), which indicates that the CART model can be effective in almost all developing countries and some developed countries.

Next, we showed that the CART model’s performance on the male-only and female-only subsets was almost as good as on the whole test set, which indicates that the CART model was little affected by gender (Figure 3C, Supplementary Figure S2A). Similarly, we demonstrated that the CART model can detect both early-stage (stages I/II) and advanced-stage (stages III/IV) CRC, and can detect specified-stage as well as unspecified-stage CRC with similar performance (Figure 3D, Supplementary Figure S2B). We also showed that the CART model had no prediction bias with respect to CRC location (Figure 3E, Supplementary Figure S2C). In addition, we found that the CART model had slightly better sensitivity for proximal colon neoplasia than for distal colon neoplasia (Supplementary Table S3).

Supplementary Figure S2. Robustness and generalization of classification and regression trees (CART) model. (A) Gender-specific receiver operating characteristic (ROC) curves. The three solid curves show the CART model’s performance on the whole test set (All), male subset (Male) and female subset (Female), respectively. Values shown are AUCs with 95% confidence intervals (95% CIs); (B,C) Figures similar to Supplementary Figure S2A for tumor stage-specific (B) and tumor location-specific (C) subsets of the test set; (D) Data source-specific ROC curves. The Peking University Cancer Hospital (PUCH) curve shows the final CART model’s performance on the test set of PUCH. The Peking University Shougang Hospital (PUSH) curve shows the performance of the CART model trained on 70% of the PUSH data set and tested on the remaining 30%; (E) Comparison between the CART model and the fecal immunochemical test (FIT). The CART model from PUCH is shown on the test set, while the FIT curve is plotted from published studies (Supplementary Table S4).

Supplementary Table S3. Comparison of CART model’s performance for proximal and distal colon neoplasia

Variables	Proximal	Distal
AUC (95% CI)	0.91 (0.88−0.94)	0.88 (0.85−0.91)
Sensitivity (95% CI)*	66.8% (58.6%−74.9%)	55.8% (48.2%−63.3%)

CART, classification and regression trees; AUC, the area under receiver operating characteristic curve; 95% CI, 95% confidence interval; *, sensitivity at the 99.0% specificity.

Importantly, because of differences in the specific instruments and reagents used, the values of lab test biomarkers were not always comparable across hospitals (Supplementary Figure S3). Sometimes, the reference values of the same biomarkers differed between PUCH and PUSH. Therefore, the CART model trained on PUCH data could not be directly applied to PUSH. To test the applicability of this approach across different data sources, we trained another CART model using the same four markers (Age, Alb, HCT and LYMPH%) on 70% of the PUSH data set, and tested its performance on the remaining 30%. For PUSH, we obtained a similar AUC (0.87, 95% CI, 0.84−0.91) and sensitivity (60.8%, 95% CI, 53.1%−68.4%) at the 99.0% specificity, indicating that the CART approach performed well on the PUSH data (Figure 3F, Supplementary Figure S2D). Taken together, the CART model was applicable to different populations with diverse CRC incidences, aging rates, gender ratios, distributions of tumor stages and locations, and data sources.

Supplementary Figure S3. Value distributions of four biomarkers. (A) Age distribution in colorectal cancer (CRC) and control groups of Peking University Cancer Hospital (PUCH) and Peking University Shougang Hospital (PUSH) data sets. Significant P-values of two-tailed t-tests are noted in graphs; (B−D) Figures similar to Supplementary Figure S3A for Alb (B), HCT (C) and LYMPH% (D). Alb, albumin; HCT, hematocrit; LYMPH%, percent lymphocytes.

Comparative effectiveness of CART-based screening relative to stool-based screening

The CART model can predict a high-risk group for CRC from the population who received regular health examinations, which likely contains a higher proportion of non-symptomatic or early-stage CRC cases than the general population. This provides a framework for enriching CRC cases from the general population using regular health examination data. Assuming the CRC incidence rate is 25 per 100,000 in one region (Supplementary Figure S4A), the CART model will predict a high-risk group that contains 16 CRC cases (62.2% sensitivity) and 1,000 CRC-free individuals (99.0% specificity) from a population of 100,000 individuals (Figure 2E, Supplementary Figure S4B). The proportion of CRC cases in this high-risk group (16 per 1,016) represents a 63-fold enrichment relative to the region’s population (25 per 100,000).
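The arithmetic behind the 63-fold figure can be reproduced in a few lines. The sketch below follows the text in rounding the expected numbers of detected cases and false positives to whole individuals before taking the ratio; the function name and signature are illustrative.

```python
def enrichment(incidence_per_100k, sensitivity, specificity, population=100_000):
    """Return (true positives, false positives, fold enrichment).

    Counts are rounded to whole individuals, as in the worked example
    of 16 CRC cases among 1,016 flagged individuals.
    """
    cases = population * incidence_per_100k / 100_000
    true_pos = round(cases * sensitivity)
    false_pos = round((population - cases) * (1 - specificity))
    rate_in_high_risk_group = true_pos / (true_pos + false_pos)
    rate_in_population = cases / population
    return true_pos, false_pos, rate_in_high_risk_group / rate_in_population


tp, fp, fold = enrichment(25, 0.622, 0.990)
print(tp, fp, round(fold))  # 16 1000 63
```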

Supplementary Figure S4. Classification and regression trees (CART)-based two-step scheme for colorectal cancer (CRC) screening. (A) There are 25 CRC cases per 100,000 individuals in the general population; (B) The high-risk group for CRC predicted by the CART model contains 16 real CRC cases (62.2% sensitivity) and 1,000 CRC-free individuals (99.0% specificity).

We next considered how the CART model compares with stool-based tests for enriching CRC cases. The guaiac fecal occult blood test (gFOBT) is an important criterion in the clinical diagnosis of CRC (43) but has only 33.3% sensitivity at 95.2% specificity (19). Compared with gFOBT, the fecal immunochemical test (FIT) has higher sensitivity for adenomas and cancers because it specifically detects human hemoglobin, and it does not require dietary restriction before testing, resulting in higher participation (44,45). For quantitative FIT, a lower cut-off increases the detection of advanced neoplasia but lowers the specificity, thus requiring more follow-up colonoscopies (46).

A previous study showed that the specificity of FIT is 94.0% at 79.0% sensitivity (47). Because of the very low prevalence of CRC, the CART model had a much higher positive predictive value than FIT screening (1.6% vs. 0.3%) (Table 3), which greatly enriches CRC cases and reduces the cost of follow-up tests such as colonoscopy. The theoretically calculated enrichment factor of FIT for CRC cases was 13-fold, lower than the 63-fold enrichment factor of the CART model. In addition, we plotted the ROC curve of FIT from published studies (19,48-52) (Supplementary Table S4) and showed that the sensitivity of FIT was only 25.0% at the 99.0% specificity, although FIT had a slightly higher AUC than the CART model (0.90 vs. 0.88) (Figure 4, Supplementary Figure S2E). Taken together, CART-based screening is more effective than stool-based screening for enriching CRC cases.
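The dependence of predictive values on prevalence follows from Bayes' rule. The sketch below computes the positive and negative predictive values from sensitivity, specificity and prevalence, assuming the same illustrative prevalence of 25 per 100,000 used earlier in this section. It works with unrounded expected counts, so the CART value comes out at about 1.5%, slightly below the 1.6% obtained from the rounded 16 per 1,016; the FIT value is about 0.3%, or roughly 13 times the prevalence.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at the given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


prevalence = 25 / 100_000  # assumed regional prevalence, as in the text
for name, sens, spec in [("CART", 0.622, 0.990), ("FIT", 0.790, 0.940)]:
    ppv, npv = predictive_values(sens, spec, prevalence)
    print(f"{name}: PPV {ppv:.2%}, NPV {npv:.3%}, enrichment {ppv / prevalence:.1f}-fold")
```

The comparison shows why specificity dominates at low prevalence: FIT detects more cases, but its 6% false positive rate produces about six times as many false positives as the CART cutoff.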

Table 3. Comparison of CART model and FIT

Method	Sensitivity (%)	Specificity (%)	PPV (%)	NPV (%)
CART	62.2	99.0	1.6	~100
FIT	79.0	94.0	0.3	~100

CART, classification and regression trees; FIT, fecal immunochemical test; PPV, positive predictive value; NPV, negative predictive value.

Supplementary Table S4. Characteristics of FIT in public studies (1)

Study	Cut-off value (μg/g)	Cohort size (n)	CRC (n)	Sensitivity	Specificity
Sohn et al. (2)	20	3,794	12	0.25	0.99
Nakama et al. (3)	20	4,611	18	0.56	0.97
Brenner and Tao (4)	6.1	2,235	15	0.73	0.96
de Wijkerslooth et al. (5)	20	1,256	8	0.75	0.95
Park et al. (6)	20	770	13	0.77	0.94
Chiu et al. (7)	10	8,822	13	0.85	0.92

FIT, fecal immunochemical test; CRC, colorectal cancer.

Comparison between classification and regression trees (CART) model and fecal immunochemical test (FIT). The CART model from Peking University Cancer Hospital (PUCH) is shown on the test set, while the FIT curve was plotted from published studies.
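One plausible way to obtain a FIT ROC curve from the published operating points in Supplementary Table S4 is to connect them with straight lines and integrate with the trapezoidal rule. The study does not state how its curve was constructed, so the sketch below is an assumption; `trapezoid_auc` and the added endpoints (0, 0) and (1, 1) are illustrative choices.

```python
# (specificity, sensitivity) operating points copied from Supplementary Table S4.
FIT_POINTS = [
    (0.99, 0.25),  # Sohn et al.
    (0.97, 0.56),  # Nakama et al.
    (0.96, 0.73),  # Brenner and Tao
    (0.95, 0.75),  # de Wijkerslooth et al.
    (0.94, 0.77),  # Park et al.
    (0.92, 0.85),  # Chiu et al.
]

def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve through the given operating points."""
    # Convert to (false-positive rate, sensitivity) and add the trivial ROC endpoints.
    roc = sorted([(0.0, 0.0), (1.0, 1.0)] + [(1.0 - spec, sens) for spec, sens in points])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(roc, roc[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

fit_auc = trapezoid_auc(FIT_POINTS)
print(f"Approximate FIT AUC: {fit_auc:.2f}")
```

Under this assumption the area comes out at about 0.90, in line with the FIT AUC quoted in the text, and the point at 99.0% specificity is the 25% sensitivity reported by Sohn et al.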

Discussion

CRC screening programs have been established in many western countries and are paid for by health insurance. However, in developing countries such as China, which have huge populations, relatively weak economic foundations and unbalanced regional development, nationwide CRC screening is cost-prohibitive and full compliance is difficult to realize. In this study, we utilized widely adopted regular health examination data in China to develop a statistical classification model that can identify high-risk cases for CRC from the general population. Such routine lab test data are currently available to individuals who participate in physical health examinations but are rarely used for pooled analysis. Therefore, our approach does not incur additional examinations, which would improve screening compliance at little cost (53).

Specifically, the CART model we constructed showed a high AUC (0.88; 95% CI, 0.87−0.90) and a sensitivity of 62.2% (95% CI, 58.1%−66.2%) at 99.0% specificity, and performed equally well in subpopulations stratified by CRC incidence, aging rate, gender, tumor stage and tumor location. In other words, we achieved a 63-fold enrichment of CRC cases in the high-risk group identified by the CART model. Therefore, compared with previous models (27-32), the CART model has stronger discriminatory power and better generalizability.

In current CRC screening practices, stool-based tests such as gFOBT and FIT, or one-time screening with both FOBT and sigmoidoscopy, can identify subjects at risk for colorectal neoplasms from large populations (7,18). However, false positive results are a major challenge. Because the prevalence of CRC is very low, the CART model has a much higher positive predictive value than FIT (1.6% vs. 0.3%) and thus a higher enrichment for CRC than FIT screening (63- vs. 13-fold) (47), which greatly reduces the number of false positive cases who need further examinations. Therefore, CART-based screening significantly reduces the cost of follow-up tests such as colonoscopy, which is crucial in regions with limited colonoscopy resources (19).

Colonoscopy is a common confirmatory CRC screening strategy, but it is also the most invasive method with the highest cost and risk of complications (43). Therefore, we propose a two-step CRC screening procedure (13), in which only individuals predicted to be CRC positive by the CART model receive follow-up invasive examinations. The first step uses regular physical health examination data and incurs no CRC-specific costs, and the follow-up colonoscopy recommended by CRC specialists can be covered by healthcare insurance systems.

In addition, compared with other mathematical models for screening or enriching CRC cases (27-32), the CART model excels at handling missing values and imbalanced classification problems. These properties are particularly relevant in this CRC enrichment study, since not all individuals received a full spectrum of biomarker tests in their physical health examination and the numbers of CRC cases and controls were not balanced. We tested the final CART model using data subsets with different sample-class ratios between the CRC and control groups and obtained similar performance. There may also be unidentified CRC cases among the controls. However, given the very low prevalence of CRC (25 per 100,000), these few cases (an estimated 33 CRC cases in the 133,898 controls) have little effect on the overall distributions of lab test biomarkers and are therefore unlikely to affect the overall performance of the CART model.
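The estimate of unidentified CRC cases among the controls, and a rough bound on how much they could distort the measured specificity, can be sketched as follows. The bound assumes that hidden cases are flagged with the same 62.2% sensitivity as known cases and uses the full control set rather than the test set; both are simplifying assumptions for illustration only.

```python
PREVALENCE = 25 / 100_000     # CRC prevalence quoted in the text
N_CONTROLS = 133_898          # controls after data quality control
SENSITIVITY = 0.622           # CART sensitivity at the chosen cutoff
SPECIFICITY = 0.990           # CART specificity at the chosen cutoff

# Expected number of undiagnosed CRC cases hidden among the controls.
hidden_cases = PREVALENCE * N_CONTROLS
# Hidden cases the model would flag (assumption: same sensitivity as known cases).
flagged_hidden = SENSITIVITY * hidden_cases
# Controls flagged at 99.0% specificity.
flagged_controls = (1.0 - SPECIFICITY) * N_CONTROLS
# Largest possible shift in measured specificity caused by mislabelled controls.
specificity_shift = flagged_hidden / N_CONTROLS

print(f"expected hidden cases: {hidden_cases:.0f}")
print(f"flagged hidden cases vs. flagged controls: "
      f"{flagged_hidden:.0f} vs. {flagged_controls:.0f}")
print(f"maximum specificity shift: {100 * specificity_shift:.3f} percentage points")
```

Under these assumptions the hidden cases account for only a small share of the flagged controls, which is consistent with the argument that they are unlikely to change the overall performance estimates.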

The final CART model trained from the PUCH data set consisted of only four biomarkers: age, albumin (Alb), hematocrit (HCT) and percent lymphocytes (LYMPH%). Compared with previous studies (27-32), the number of variables in the CART model is smaller, reducing the likelihood of overfitting and enhancing the model’s generalizability. Notably, all four biomarkers showed statistically significant differences and large effect sizes between the CRC and control groups, indicating clinical significance. Moreover, these biomarkers cover different aspects of human physiology such as blood counts and specific proteins, which facilitates the interpretation of the CART model and is informative for CRC screening practices. These changes are consistent with increased proinflammatory cytokines and growth factors in response to high physiological stress and hypoxia in cancer tissues (54,55), which modulate the production of albumin (56). During an acute inflammatory response, the ratio between different leucocyte subsets is altered, and neutrophilia is often accompanied by a relative lymphocytopenia (57). To our knowledge, our study is the first to incorporate these changes in a multivariate CRC enrichment model.

There are several limitations in our study. First, our approach is dependent on data from routine physical health examinations and is not applicable to the populations that do not participate in such examinations. Such populations may be at higher risk for CRC due to less access to early screening and health facilities. Second, the CART model was trained using quantitative values in routine lab test biomarkers. However, influenced by specific instrument calibration, the values of lab test biomarkers may not be entirely comparable across medical institutions. In the future, we will implement prospective validation for this approach and study how to normalize data from different facilities.

Conclusions

As an alternative approach for the early detection of CRC, this study utilized regular health examination data to identify high-risk groups for CRC with no additional examination cost, and this approach is applicable to populations with diverse characteristics. Overall, this study provides a novel approach for CRC screening to trigger follow-up invasive examinations for definite diagnosis, which may improve the CRC screening efficiency especially in the developing world.

Acknowledgements

This study was supported by funding from Beijing Municipal Science & Technology Commission, Clinical Application and Development of Capital Characteristic (No. Z161100000516003) and National Natural Science Foundation of China (No. 31871266). We also thank Dr. Xianjin Xie (the University of Iowa) for critical advice, and the high performance computing platform of the Center for Life Sciences, Peking University.

Footnote

Conflicts of Interest: The authors have no conflicts of interest to declare.

References

1. Robertson DJ, Lee JK, Boland CR, et al. Recommendations on fecal immunochemical testing to screen for colorectal neoplasia: A consensus statement by the US multi-society task force on colorectal cancer. Gastroenterology 2017;152:1217-37.e3.

2. Sohn DK, Jeong SY, Choi HS, et al. Single immunochemical fecal occult blood test for detection of colorectal neoplasia. Cancer Res Treat 2005;37:20-3.

3. Nakama H, Yamamoto M, Kamijo N, et al. Colonoscopic evaluation of immunochemical fecal occult blood test for detection of colorectal neoplasia. Hepatogastroenterology 1999;46:228-31.

4. Brenner H, Tao S. Superior diagnostic performance of faecal immunochemical tests for haemoglobin in a head-to-head comparison with guaiac based faecal occult blood test among 2235 participants of screening colonoscopy. Eur J Cancer 2013;49:3049-54.

5. de Wijkerslooth TR, Stoop EM, Bossuyt PM, et al. Immunochemical fecal occult blood testing is equally sensitive for proximal and distal advanced neoplasia. Am J Gastroenterol 2012;107:1570-8.

6. Park DI, Ryu S, Kim YH, et al. Comparison of guaiac-based and quantitative immunochemical fecal occult blood testing in a population at average risk undergoing colorectal cancer screening. Am J Gastroenterol 2010;105:2017-25.

7. Chiu HM, Lee YC, Tu CH, et al. Association between early stage colon neoplasms and false-negative results from the fecal immunochemical test. Clin Gastroenterol Hepatol 2013;11:832-8.e1-2.

Contributor Information

Cheng Li, Email: cheng_li@pku.edu.cn.

Jin Gu, Email: zlgujin@126.com.

References

1. Bray F, Ferlay J, Soerjomataram I, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68:394–424. doi: 10.3322/caac.21492.
2. Shanahan F, O’Toole PW. Host-microbe interactions and spatial variation of cancer in the gut. Nat Rev Cancer. 2014;14:511–2. doi: 10.1038/nrc3765.
3. Chen W, Sun K, Zheng R, et al. Cancer incidence and mortality in China, 2014. Chin J Cancer Res. 2018;30:1–12. doi: 10.21147/j.issn.1000-9604.2018.01.01.
4. Li Z, Yang L, Du C, et al. Characteristics and comparison of colorectal cancer incidence in Beijing with other regions in the world. Oncotarget. 2017;8:24593–603. doi: 10.18632/oncotarget.15598.
5. Brenner H, Hoffmeister M, Stegmaier C, et al. Risk of progression of advanced adenomas to colorectal cancer by age and sex: estimates based on 840,149 screening colonoscopies. Gut. 2007;56:1585–9. doi: 10.1136/gut.2007.122739.
6. Kuntz KM, Lansdorp-Vogelaar I, Rutter CM, et al. A systematic comparison of microsimulation models of colorectal cancer: the role of assumptions about adenoma progression. Med Decis Making. 2011;31:530–9. doi: 10.1177/0272989x11408730.
7. Schreuders EH, Ruco A, Rabeneck L, et al. Colorectal cancer screening: a global overview of existing programmes. Gut. 2015;64:1637–49. doi: 10.1136/gutjnl-2014-309086.
8. Levin B, Lieberman DA, McFarland B, et al. Screening and surveillance for the early detection of colorectal cancer and adenomatous polyps, 2008: a joint guideline from the American Cancer Society, the US Multi-Society Task Force on Colorectal Cancer, and the American College of Radiology. CA Cancer J Clin. 2008;58:130–60. doi: 10.3322/ca.2007.0018.
9. Zauber AG, Winawer SJ, O’Brien MJ, et al. Colonoscopic polypectomy and long-term prevention of colorectal-cancer deaths. N Engl J Med. 2012;366:687–96. doi: 10.1056/NEJMoa1100370.
10. Mandel JS, Church TR, Bond JH, et al. The effect of fecal occult-blood screening on the incidence of colorectal cancer. N Engl J Med. 2000;343:1603–7. doi: 10.1056/nejm200011303432203.
11. Miller KD, Siegel RL, Lin CC, et al. Cancer treatment and survivorship statistics, 2016. CA Cancer J Clin. 2016;66:271–89. doi: 10.3322/caac.21349.
12. Ait Ouakrim D, Pizot C, Boniol M, et al. Trends in colorectal cancer mortality in Europe: retrospective analysis of the WHO mortality database. BMJ. 2015;351:h4970. doi: 10.1136/bmj.h4970.
13. Nguyen MT, Weinberg DS. Biomarkers in colorectal cancer screening. J Natl Compr Canc Netw. 2016;14:1033–40. doi: 10.6004/jnccn.2016.0109.
14. Dalerba P, Sahoo D, Paik S, et al. CDX2 as a prognostic biomarker in stage II and stage III colon cancer. N Engl J Med. 2016;374:211–22. doi: 10.1056/NEJMoa1506597.
15. Warren JD, Xiong W, Bunker AM, et al. Septin 9 methylated DNA is a sensitive and specific blood test for colorectal cancer. BMC Med. 2011;9:133. doi: 10.1186/1741-7015-9-133.
16. Rodia MT, Ugolini G, Mattei G, et al. Systematic large-scale meta-analysis identifies a panel of two mRNAs as blood biomarkers for colorectal cancer detection. Oncotarget. 2016;7:30295–306. doi: 10.18632/oncotarget.8108.
17. Chen H, Zucknick M, Werner S, et al. Head-to-Head comparison and evaluation of 92 plasma protein biomarkers for early detection of colorectal cancer in a true screening setting. Clin Cancer Res. 2015;21:3318–26. doi: 10.1158/1078-0432.ccr-14-3051.
18. Lieberman DA, Weiss DG, Veterans Affairs Cooperative Study Group 380. One-time screening for colorectal cancer with combined fecal occult-blood testing and examination of the distal colon. N Engl J Med. 2001;345:555–60. doi: 10.1056/NEJMoa010328.
19. Brenner H, Tao S. Superior diagnostic performance of faecal immunochemical tests for haemoglobin in a head-to-head comparison with guaiac based faecal occult blood test among 2235 participants of screening colonoscopy. Eur J Cancer. 2013;49:3049–54. doi: 10.1016/j.ejca.2013.04.023.
20. Ginsberg GM, Lim SS, Lauer JA, et al. Prevention, screening and treatment of colorectal cancer: a global and regional generalized cost effectiveness analysis. Cost Eff Resour Alloc. 2010;8:2. doi: 10.1186/1478-7547-8-2.
21. Ahmed F. Barriers to colorectal cancer screening in the developing world: The view from Pakistan. World J Gastrointest Pharmacol Ther. 2013;4:83–5. doi: 10.4292/wjgpt.v4.i4.83.
22. Laiyemo AO, Brawley O, Irabor D, et al. Toward colorectal cancer control in Africa. Int J Cancer. 2016;138:1033–4. doi: 10.1002/ijc.29843.
23. Ng SC, Wong SH. Colorectal cancer screening in Asia. Br Med Bull. 2013;105:29–42. doi: 10.1093/bmb/lds040.
24. Zhao L, Zhang W, Ma D, et al. Analysis of colorectal cancer screening practices in the general population of Tianjin. Zhongguo Zhong Liu Lin Chuang. 2015;42:760–4. doi: 10.3969/j.issn.1000-8179.20150644.
25. Zheng Y, Gong Y. Research and practice of screening for colorectal cancer in population of Shanghai. Zhongguo Zhong Liu. 2013;22:86–9.
26. Tian Z, Chen H, Zhai A, et al. Investigation of colorectal cancer opportunistic screening being combined with the physical examination in Beijing Yungang. Shou Du Yi Ke Da Xue Xue Bao. 2016;37:34–7. doi: 10.3969/j.issn.1006-7795.2016.01.007.
27. Kinar Y, Kalkstein N, Akiva P, et al. Development and validation of a predictive model for detection of colorectal cancer in primary care by analysis of complete blood counts: a binational retrospective study. J Am Med Inform Assoc. 2016;23:879–90. doi: 10.1093/jamia/ocv195.
28. Boursi B, Mamtani R, Hwang WT, et al. A risk prediction model for sporadic CRC based on routine lab results. Dig Dis Sci. 2016;61:2076–86. doi: 10.1007/s10620-016-4081-x.
29. Spell DW, Jones DV Jr., Harper WF, et al. The value of a complete blood count in predicting cancer of the colon. Cancer Detect Prev. 2004;28:37. doi: 10.1016/j.cdp.2003.10.002.
30. Wild N, Andres H, Rollinger W, et al. A combination of serum markers for the early detection of colorectal cancer. Clin Cancer Res. 2010;16:6111–21. doi: 10.1158/1078-0432.ccr-10-0119.
31. Werner S, Krause F, Rolny V, et al. Evaluation of a 5-marker blood test for colorectal cancer early detection in a colorectal cancer screening setting. Clin Cancer Res. 2016;22:1725–33. doi: 10.1158/1078-0432.ccr-15-1268.
32. Ma GK, Ladabaum U. Personalizing colorectal cancer screening: a systematic review of models to predict risk of colorectal neoplasia. Clin Gastroenterol Hepatol. 2014;12:1624–34.e1. doi: 10.1016/j.cgh.2014.01.042.
33. Chen H, Qian J, Werner S, et al. Development and validation of a panel of five proteins as blood biomarkers for early detection of colorectal cancer. Clin Epidemiol. 2017;9:517–26. doi: 10.2147/clep.s144171.
34. Breiman L, Friedman JH, Olshen R, et al. Classification and Regression Trees. New York: Chapman & Hall (Wadsworth, Inc.), 1984.
35. Esposito F, Malerba D, Semeraro G, et al. A comparative analysis of methods for pruning decision trees. IEEE Trans Pattern Anal Mach Intell. 1997;19:476–91.
36. Timofeev R. Classification and regression trees (CART) Theory and Applications. Berlin: Humboldt University, 2004.
37. De’ath G, Fabricius KE. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology. 2000;81:3178–92.
38. Chan JC, Chan DL, Diakos CI, et al. The lymphocyte-to-monocyte ratio is a superior predictor of overall survival in comparison to established biomarkers of resectable colorectal cancer. Ann Surg. 2017;265:539–46. doi: 10.1097/sla.0000000000001743.
39. Jiang H, Li H, Li A, et al. Preoperative combined hemoglobin, albumin, lymphocyte and platelet levels predict survival in patients with locally advanced colorectal cancer. Oncotarget. 2016;7:72076–83. doi: 10.18632/oncotarget.12271.
40. Ishizuka M, Nagata H, Takagi K, et al. Combination of platelet count and neutrophil to lymphocyte ratio is a useful predictor of postoperative survival in patients with colorectal cancer. Br J Cancer. 2013;109:401–7. doi: 10.1038/bjc.2013.350.
41. He H, Ma Y. Imbalanced Learning: Foundations, Algorithms, and Applications. New York: Wiley-IEEE Press, 2013.
42. Brenner H, Kloor M, Pox CP. Colorectal cancer. Lancet. 2014;383:1490–502. doi: 10.1016/s0140-6736(13)61649-9.
43. Qaseem A, Denberg TD, Hopkins RH Jr., et al. Screening for colorectal cancer: a guidance statement from the American College of Physicians. Ann Intern Med. 2012;156:378–86. doi: 10.7326/0003-4819-156-5-201203060-00010.
44. van Rossum LG, van Rijn AF, Laheij RJ, et al. Random comparison of guaiac and immunochemical fecal occult blood tests for colorectal cancer in a screening population. Gastroenterology. 2008;135:82–90. doi: 10.1053/j.gastro.2008.03.040.
45. Hol L, van Leerdam ME, van Ballegooijen M, et al. Screening for colorectal cancer: randomised trial comparing guaiac-based and immunochemical faecal occult blood testing and flexible sigmoidoscopy. Gut. 2010;59:62–8. doi: 10.1136/gut.2009.177089.
46. Hol L, Wilschut JA, van Ballegooijen M, et al. Screening for colorectal cancer: random comparison of guaiac and immunochemical faecal occult blood testing at different cut-off levels. Br J Cancer. 2009;100:1103–10. doi: 10.1038/sj.bjc.6604961.
47. Lee JK, Liles EG, Bent S, et al. Accuracy of fecal immunochemical tests for colorectal cancer: systematic review and meta-analysis. Ann Intern Med. 2014;160:171. doi: 10.7326/m13-1484.
48. de Wijkerslooth TR, Stoop EM, Bossuyt PM, et al. Immunochemical fecal occult blood testing is equally sensitive for proximal and distal advanced neoplasia. Am J Gastroenterol. 2012;107:1570–8. doi: 10.1038/ajg.2012.249.
49. Park DI, Ryu S, Kim YH, et al. Comparison of guaiac-based and quantitative immunochemical fecal occult blood testing in a population at average risk undergoing colorectal cancer screening. Am J Gastroenterol. 2010;105:2017–25. doi: 10.1038/ajg.2010.179.
50. Sohn DK, Jeong SY, Choi HS, et al. Single immunochemical fecal occult blood test for detection of colorectal neoplasia. Cancer Res Treat. 2005;37:20–3. doi: 10.4143/crt.2005.37.1.20.
51. Chiu HM, Lee YC, Tu CH, et al. Association between early stage colon neoplasms and false-negative results from the fecal immunochemical test. Clin Gastroenterol Hepatol. 2013;11:832–8.e1-2. doi: 10.1016/j.cgh.2013.01.013.
52. Nakama H, Yamamoto M, Kamijo N, et al. Colonoscopic evaluation of immunochemical fecal occult blood test for detection of colorectal neoplasia. Hepatogastroenterology. 1999;46:228–31.
53. Schneeweiss S. Learning from big health care data. N Engl J Med. 2014;370:2161–3. doi: 10.1056/NEJMp1401111.
54. Gabay C, Kushner I. Acute-phase proteins and other systemic responses to inflammation. N Engl J Med. 1999;340:448–54. doi: 10.1056/nejm199902113400607.
55. McMillan DC. Systemic inflammation, nutritional status and survival in patients with cancer. Curr Opin Clin Nutr Metab Care. 2009;12:223–6. doi: 10.1097/MCO.0b013e32832a7902.
56. Kowalski-Saunders PW, Winwood PJ, Arthur MJ, et al. Reversible inhibition of albumin production by rat hepatocytes maintained on a laminin-rich gel (Engelbreth-Holm-Swarm) in response to secretory products of Kupffer cells and cytokines. Hepatology. 1992;16:733–41. doi: 10.1002/hep.1840160320.
57. Zahorec R. Ratio of neutrophil to lymphocyte counts - rapid and simple parameter of systemic inflammation and stress in critically ill. Bratisl Lek Listy. 2001;102:5–14.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[b1] 1.Bray F, Ferlay J, Soerjomataram I, et al Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68:394–424. doi: 10.3322/caac.21492. [DOI] [PubMed] [Google Scholar]

[b2] 2.Shanahan F, O’Toole PW Host-microbe interactions and spatial variation of cancer in the gut. Nat Rev Cancer. 2014;14:511–2. doi: 10.1038/nrc3765. [DOI] [PubMed] [Google Scholar]

[b3] 3.Chen W, Sun K, Zheng R, et al Cancer incidence and mortality in China, 2014. Chin J Cancer Res. 2018;30:1–12. doi: 10.21147/j.issn.1000-9604.2018.01.01. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b4] 4.Li Z, Yang L, Du C, et al Characteristics and comparison of colorectal cancer incidence in Beijing with other regions in the world. Oncotarget. 2017;8:24593–603. doi: 10.18632/oncotarget.15598. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b5] 5.Brenner H, Hoffmeister M, Stegmaier C, et al Risk of progression of advanced adenomas to colorectal cancer by age and sex: estimates based on 840,149 screening colonoscopies. Gut. 2007;56:1585–9. doi: 10.1136/gut.2007.122739. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b6] 6.Kuntz KM, Lansdorp-Vogelaar I, Rutter CM, et al A systematic comparison of microsimulation models of colorectal cancer: the role of assumptions about adenoma progression. Med Deci Making. 2011;31:530–9. doi: 10.1177/0272989x11408730. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b7] 7.Schreuders EH, Ruco A, Rabeneck L, et al Colorectal cancer screening: a global overview of existing programmes. Gut. 2015;64:1637–49. doi: 10.1136/gutjnl-2014-309086. [DOI] [PubMed] [Google Scholar]

[b8] 8.Levin B, Lieberman DA, McFarland B, et al Screening and surveillance for the early detection of colorectal cancer and adenomatous polyps, 2008: a joint guideline from the American Cancer Society, the US Multi-Society Task Force on Colorectal Cancer, and the American College of Radiology. CA Cancer J Clin. 2008;58:130–60. doi: 10.3322/ca.2007.0018. [DOI] [PubMed] [Google Scholar]

[b9] 9.Zauber AG, Winawer SJ, O’Brien MJ, et al Colonoscopic polypectomy and long-term prevention of colorectal-cancer deaths. N Engl J Med. 2012;366:687–96. doi: 10.1056/NEJMoa1100370. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b10] 10.Mandel JS, Church TR, Bond JH, et al The effect of fecal occult-blood screening on the incidence of colorectal cancer. New Engl J Med. 2000;343:1603–7. doi: 10.1056/nejm200011303432203. [DOI] [PubMed] [Google Scholar]

[b11] 11.Miller KD, Siegel RL, Lin CC, et al Cancer treatment and survivorship statistics, 2016. CA Cancer J Clin. 2016;66:271–89. doi: 10.3322/caac.21349. [DOI] [PubMed] [Google Scholar]

[b12] 12.Ait Ouakrim D, Pizot C, Boniol M, et al Trends in colorectal cancer mortality in Europe: retrospective analysis of the WHO mortality database. BMJ. 2015;351:h4970. doi: 10.1136/bmj.h4970. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b13] 13.Nguyen MT, Weinberg DS Biomarkers in colorectal cancer screening. J Nati Compr Canc Netw. 2016;14:1033–40. doi: 10.6004/jnccn.2016.0109. [DOI] [PubMed] [Google Scholar]

[b14] 14.Dalerba P, Sahoo D, Paik S, et al CDX2 as a prognostic biomarker in stage II and stage III colon cancer. New Engl J Med. 2016;374:211–22. doi: 10.1056/NEJMoa1506597. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b15] 15.Warren JD, Xiong W, Bunker AM, et al Septin 9 methylated DNA is a sensitive and specific blood test for colorectal cancer. BMC Med. 2011;9:133. doi: 10.1186/1741-7015-9-133. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b16] 16.Rodia MT, Ugolini G, Mattei G, et al Systematic large-scale meta-analysis identifies a panel of two mRNAs as blood biomarkers for colorectal cancer detection. Oncotarget. 2016;7:30295–306. doi: 10.18632/oncotarget.8108. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b17] 17.Chen H, Zucknick M, Werner S, et al Head-to-Head comparison and evaluation of 92 plasma protein biomarkers for early detection of colorectal cancer in a true screening setting. Clin Cancer Res. 2015;21:3318–26. doi: 10.1158/1078-0432.ccr-14-3051. [DOI] [PubMed] [Google Scholar]

[b18] 18.Lieberman DA, Weiss DG, Veterans Affairs Cooperative Study Group 380 One-time screening for colorectal cancer with combined fecal occult-blood testing and examination of the distal colon. New Engl J Med. 2001;345:555–60. doi: 10.1056/NEJMoa010328. [DOI] [PubMed] [Google Scholar]

[b19] 19.Brenner H, Tao S Superior diagnostic performance of faecal immunochemical tests for haemoglobin in a head-to-head comparison with guaiac based faecal occult blood test among 2235 participants of screening colonoscopy. Eur J Cancer. 2013;49:3049–54. doi: 10.1016/j.ejca.2013.04.023. [DOI] [PubMed] [Google Scholar]

[b20] 20.Ginsberg GM, Lim SS, Lauer JA, et al Prevention, screening and treatment of colorectal cancer: a global and regional generalized cost effectiveness analysis. Cost Eff Resour Alloc. 2010;8:2. doi: 10.1186/1478-7547-8-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b21] 21.Ahmed F Barriers to colorectal cancer screening in the developing world: The view from Pakistan. World J Gastrointest Pharmacol Ther. 2013;4:83–5. doi: 10.4292/wjgpt.v4.i4.83. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b22] 22.Laiyemo AO, Brawley O, Irabor D, et al Toward colorectal cancer control in Africa. Int J Cancer. 2016;138:1033–4. doi: 10.1002/ijc.29843. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b23] 23.Ng SC, Wong SH Colorectal cancer screening in Asia. Br Med Bull. 2013;105:29–42. doi: 10.1093/bmb/lds040. [DOI] [PubMed] [Google Scholar]

[b24] 24.Zhao L, Zhang W, Ma D, et al Analysis of colorectal cancer screening practices in the general population of Tianjin. Zhongguo Zhong Liu Lin Chuang. 2015;42:760–4. doi: 10.3969/j.issn.1000-8179.20150644. [DOI] [Google Scholar]

[b25] 25.Zheng Y, Gong Y Research and practice of screening for colorectal cancer in population of Shanghai. Zhongguo Zhong Liu. 2013;22:86–9. [Google Scholar]

[b26] 26.Tian Z, Chen H, Zhai A, et al Investigation of colorectal cancer opportunistic screening being combined with the physical examination in Beijing Yungang. Shou Du Yi Ke Da Xue Xue Bao. 2016;37:34–7. doi: 10.3969/j.issn.1006-7795.2016.01.007. [DOI] [Google Scholar]

[b27] 27.Kinar Y, Kalkstein N, Akiva P, et al Development and validation of a predictive model for detection of colorectal cancer in primary care by analysis of complete blood counts: a binational retrospective study. J Am Med Inform Assoc. 2016;23:879–90. doi: 10.1093/jamia/ocv195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b28] 28.Boursi B, Mamtani R, Hwang WT, et al A risk prediction model for sporadic CRC based on routine lab results. Dig Dis Sci. 2016;61:2076–86. doi: 10.1007/s10620-016-4081-x. [DOI] [PubMed] [Google Scholar]

[b29] 29.Spell DW, Jones DV Jr., Harper WF, et al. The value of a complete blood count in predicting cancer of the colon. Cancer Detect Prev. 2004;28:37. doi: 10.1016/j.cdp.2003.10.002. [DOI] [PubMed] [Google Scholar]

[b30] 30.Wild N, Andres H, Rollinger W, et al A combination of serum markers for the early detection of colorectal cancer. Clin Cancer Res. 2010;16:6111–21. doi: 10.1158/1078-0432.ccr-10-0119. [DOI] [PubMed] [Google Scholar]

[b31] 31.Werner S, Krause F, Rolny V, et al Evaluation of a 5-marker blood test for colorectal cancer early detection in a colorectal cancer screening setting. Clin Cancer Res. 2016;22:1725–33. doi: 10.1158/1078-0432.ccr-15-1268. [DOI] [PubMed] [Google Scholar]

[b32] 32.Ma GK, Ladabaum U Personalizing colorectal cancer screening: a systematic review of models to predict risk of colorectal neoplasia. Clin Gastroenterol Hepatol. 2014;12:1624–34.e1. doi: 10.1016/j.cgh.2014.01.042. [DOI] [PubMed] [Google Scholar]

[b33] 33.Chen H, Qian J, Werner S, et al Development and validation of a panel of five proteins as blood biomarkers for early detection of colorectal cancer. Clin Epidemiol. 2017;9:517–26. doi: 10.2147/clep.s144171. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b34] 34.Breiman L, Friedman JH, Olshen R, et al. Classification and Regression Trees. New York: Chapman & Hall (Wadsworth, Inc.), 1984.

[b35] 35.Esposito F, Malerba D, Semeraro G, et al A comparative analysis of methods for pruning decision trees. IEEE Trans Pattern Anal Mach Intell. 1997;19:476–91. [Google Scholar]

[b36] 36.Timofeev R. Classification and regression trees (CART) Theory and Applications. Berlin: Humbt University, 2004.

[b37] 37.De’ath G, Fabricius KE Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology. 2000;81:3178–92. [Google Scholar]

[b38] 38.Chan JC, Chan DL, Diakos CI, et al The lymphocyte-to-monocyte ratio is a superior predictor of overall survival in comparison to established biomarkers of resectable colorectal cancer. Ann Surg. 2017;265:539–46. doi: 10.1097/sla.0000000000001743. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b39] 39.Jiang H, Li H, Li A, et al Preoperative combined hemoglobin, albumin, lymphocyte and platelet levels predict survival in patients with locally advanced colorectal cancer. Oncotarget. 2016;7:72076–83. doi: 10.18632/oncotarget.12271. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b40] 40.Ishizuka M, Nagata H, Takagi K, et al Combination of platelet count and neutrophil to lymphocyte ratio is a useful predictor of postoperative survival in patients with colorectal cancer. Br J Cancer. 2013;109:401–7. doi: 10.1038/bjc.2013.350. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b41] 41. He H, Ma Y. Imbalanced Learning: Foundations, Algorithms, and Applications. New York: Wiley-IEEE Press, 2013.

[b42] 42. Brenner H, Kloor M, Pox CP. Colorectal cancer. Lancet. 2014;383:1490–502. doi: 10.1016/s0140-6736(13)61649-9.

[b43] 43. Qaseem A, Denberg TD, Hopkins RH Jr., et al. Screening for colorectal cancer: a guidance statement from the American College of Physicians. Ann Intern Med. 2012;156:378–86. doi: 10.7326/0003-4819-156-5-201203060-00010.

[b44] 44. van Rossum LG, van Rijn AF, Laheij RJ, et al. Random comparison of guaiac and immunochemical fecal occult blood tests for colorectal cancer in a screening population. Gastroenterology. 2008;135:82–90. doi: 10.1053/j.gastro.2008.03.040.

[b45] 45. Hol L, van Leerdam ME, van Ballegooijen M, et al. Screening for colorectal cancer: randomised trial comparing guaiac-based and immunochemical faecal occult blood testing and flexible sigmoidoscopy. Gut. 2010;59:62–8. doi: 10.1136/gut.2009.177089.

[b46] 46. Hol L, Wilschut JA, van Ballegooijen M, et al. Screening for colorectal cancer: random comparison of guaiac and immunochemical faecal occult blood testing at different cut-off levels. Br J Cancer. 2009;100:1103–10. doi: 10.1038/sj.bjc.6604961.

[b47] 47. Lee JK, Liles EG, Bent S, et al. Accuracy of fecal immunochemical tests for colorectal cancer: systematic review and meta-analysis. Ann Intern Med. 2014;160:171. doi: 10.7326/m13-1484.

[b48] 48. de Wijkerslooth TR, Stoop EM, Bossuyt PM, et al. Immunochemical fecal occult blood testing is equally sensitive for proximal and distal advanced neoplasia. Am J Gastroenterol. 2012;107:1570–8. doi: 10.1038/ajg.2012.249.

[b49] 49. Park DI, Ryu S, Kim YH, et al. Comparison of guaiac-based and quantitative immunochemical fecal occult blood testing in a population at average risk undergoing colorectal cancer screening. Am J Gastroenterol. 2010;105:2017–25. doi: 10.1038/ajg.2010.179.

[b50] 50. Sohn DK, Jeong SY, Choi HS, et al. Single immunochemical fecal occult blood test for detection of colorectal neoplasia. Cancer Res Treat. 2005;37:20–3. doi: 10.4143/crt.2005.37.1.20.

[b51] 51. Chiu HM, Lee YC, Tu CH, et al. Association between early stage colon neoplasms and false-negative results from the fecal immunochemical test. Clin Gastroenterol Hepatol. 2013;11:832–8.e1-2. doi: 10.1016/j.cgh.2013.01.013.

[b52] 52. Nakama H, Yamamoto M, Kamijo N, et al. Colonoscopic evaluation of immunochemical fecal occult blood test for detection of colorectal neoplasia. Hepatogastroenterology. 1999;46:228–31.

[b53] 53. Schneeweiss S. Learning from big health care data. N Engl J Med. 2014;370:2161–3. doi: 10.1056/NEJMp1401111.

[b54] 54. Gabay C, Kushner I. Acute-phase proteins and other systemic responses to inflammation. N Engl J Med. 1999;340:448–54. doi: 10.1056/nejm199902113400607.

[b55] 55. McMillan DC. Systemic inflammation, nutritional status and survival in patients with cancer. Curr Opin Clin Nutr Metab Care. 2009;12:223–6. doi: 10.1097/MCO.0b013e32832a7902.

[b56] 56. Kowalski-Saunders PW, Winwood PJ, Arthur MJ, et al. Reversible inhibition of albumin production by rat hepatocytes maintained on a laminin-rich gel (Engelbreth-Holm-Swarm) in response to secretory products of Kupffer cells and cytokines. Hepatology. 1992;16:733–41. doi: 10.1002/hep.1840160320.

[b57] 57. Zahorec R. Ratio of neutrophil to lymphocyte counts — rapid and simple parameter of systemic inflammation and stress in critically ill. Bratisl Lek Listy. 2001;102:5–14.
