Circulating tumor cells (CTCs) enumeration and machine-learning based diagnostic biomarkers for breast cancer detection

Chun-Yu Liu; Yu-Hsiang Lin; Yi-Fang Tsai; Po-Yen Lu; Ji-Lin Chen; Yu-Hsuan Li; Chi-Cheng Huang; Yen-Shu Lin; Ta-Chung Chao; Chin-Jung Feng; Chih-Yi Hsu; Jen-Hwey Chiu; Chyong-Mei Chen; Ling-Ming Tseng

doi:10.1186/s12885-026-15741-9

2026 Mar 3;26:448. doi: 10.1186/s12885-026-15741-9

PMCID: PMC13063714 PMID: 41776447

Abstract

Background

Circulating tumor cells (CTCs) are detectable in early-stage cancer and may enable early cancer detection. We evaluated a CTC-based assay as a complementary biomarker for breast cancer detection in an Asian population with a high prevalence of dense breast tissue.

Methods

In this single-center, prospective, blinded study, peripheral blood from Taiwanese women with breast cancer and healthy controls was analyzed using a CTC-enumeration platform (CMx) based on biomarker expression (cytokeratin 18 [CK18], mammaglobin [MGB], CD45), cell morphometry, and nuclear features. A machine-learning model integrating CTC biomarkers with age, white blood cell (WBC) count, and platelet count was developed to assess classification performance, providing proof-of-concept for combining CTC-derived and routine blood parameters in breast cancer risk assessment.

Results

A total of 228 breast cancer patients and 170 healthy controls were included. Age and CK18- and MGB-positive CTC counts differed significantly between groups, whereas WBC and platelet counts did not. An ensemble linear support vector machine model incorporating age and CTC features achieved an area under the curve of 0.85 (95% CI, 0.73–0.96) in the independent test cohort, with high sensitivity (0.93), positive predictive value (0.74), and negative predictive value (0.86), but modest specificity (0.57). In the exploratory BI-RADS 3/4 subgroup, the model identified all cancer cases (sensitivity 1.00), with a specificity of 0.44 and overall accuracy of 0.79.

Conclusions

This study demonstrates the feasibility of combining CTC enumeration with machine learning for breast cancer detection and supports the need for future large-scale, multicenter, multiethnic prospective external validation.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12885-026-15741-9.

Keywords: Circulating tumor cells, Machine-learning algorithms, Breast cancer, Monte Carlo cross-validation, Ensemble classifier

Background

Breast cancer is the most frequently diagnosed cancer globally and is the leading cause of cancer mortality among females in many countries [1]. Early detection of breast cancer significantly increases survival, improves patient outcomes, and reduces disease burden. Currently, breast cancer screening mostly depends on mammography, which has been proven to be associated with reduced breast cancer mortality [2, 3]. However, the risks of overdiagnosis of indolent disease and false-positive results related to mammography screening remain a challenge. Thus, there is an increasing need to develop complementary diagnostic tools for breast cancer that are less invasive but capable of providing additional clinical guidance. While serum markers such as CEA and CA 15-3 are utilized in routine physical exams as potential indicators, none are sufficiently sensitive and they are more commonly used to monitor responses to cancer treatment and disease recurrence. Hence, alternative methods for tumor liquid biopsy to support early cancer detection are gaining momentum.

Liquid biopsy has emerged as a transformative approach in cancer detection, offering a less invasive alternative to conventional tissue biopsy, which remains the diagnostic gold standard but is limited by invasiveness, challenges in repeated sampling, and vulnerability to tumor heterogeneity. This strategy encompasses diverse analytes, including circulating tumor cells (CTCs), circulating tumor DNA (ctDNA), exosomes, microRNAs, circulating RNAs, tumor-educated platelets, and autoantibodies, each with unique strengths and limitations [4]. CTCs, though historically recognized as valuable indicators of tumor progression, are extremely rare in peripheral blood and often missed by EpCAM-dependent detection methods [5]. ctDNA, in contrast, provides real-time insights into tumor genetics through mutation profiling, methylation analysis, and fragmentomics, but its low abundance in early-stage disease remains a major barrier. Exosomes and circulating RNAs offer stable carriers of nucleic acids and proteins reflective of the tumor microenvironment, although standardization of isolation methods is still lacking. Conventional serum proteins such as CEA and CA 15-3 suffer from low sensitivity and specificity, but efforts to combine multiple proteins or employ high-throughput proteomics are ongoing [6]. Despite these advances, most liquid biopsy approaches remain limited by insufficient sensitivity in early-stage cancers, difficulties in pinpointing the tissue of origin, and performance declines when applied to asymptomatic populations. In particular, CTCs offer a distinct advantage by preserving intact cellular architecture, which enables simultaneous morphological, genomic, and transcriptomic analyses from a single source. Unlike ctDNA or other acellular biomarker-based approaches, CTC assays allow the study of viable tumor cells, supporting downstream functional assays and phenotypic profiling that multi-omics strategies cannot capture [7, 8].
Although multi-marker liquid biopsy panels can improve detection sensitivity, their analytical complexity, cost, and standardization challenges may limit broad clinical adoption. A CTC-focused strategy therefore represents a simpler yet complementary approach for breast cancer risk assessment, particularly valuable in populations with dense breast tissue where conventional imaging demonstrates reduced sensitivity [9]. Furthermore, integrating CTC analysis with artificial intelligence (AI) and machine-learning (ML) may enhance diagnostic resolution, mitigate tumor heterogeneity, and improve both sensitivity and specificity for early cancer detection.

CTCs represent tumor cells that detach from the primary tumor site and enter the bloodstream via intravasation, thereby contributing to the metastatic dissemination process within the peripheral blood circulation system [10]. Previous investigations focusing on micrometastasis within the bone marrow of breast cancer patients elucidated early dissemination to distant sites, even among individuals with small, early-stage tumors [11]. Hence, CTCs present in the bloodstream could serve as initial indicators of the early phases of tumor metastasis, suggesting their potential as prognostic indicators or even as early diagnostic markers for breast cancer.

Clinically, outcome prediction may be more practical than estimating the effects of risk or diagnostic factors, as it can positively impact clinical decision-making. From this perspective, ML classifiers have been successfully used to predict diseases [12–18]. Furthermore, ML classifiers are known for their flexibility and robustness in avoiding the strong assumptions typically underlying standard statistical models, particularly concerning the assumed relationships between outcomes and features/covariates [12, 14, 16–18]. In this study, we aim to utilize CTCs in conjunction with ML to create a diagnostic tool for detecting breast cancer.

Methods

Patients and specimens

This study was approved by the Institutional Review Board of Taipei Veterans General Hospital (TPEVGH IRB No. 2017-10-016AC). Written informed consent was obtained from all participants prior to peripheral blood sampling and data collection, which were conducted in compliance with the Helsinki Declaration. For most participants, blood collection was performed on the same day as mammography or ultrasound. In cases where the procedures did not coincide, the interval between imaging and blood draw was recorded, with a median interval of 14 days. Eligible participants were adult women with pathology-confirmed breast cancer (including stage 0 ductal carcinoma in situ) and healthy women with either benign breast lesions (e.g., fibroadenoma or fibrocystic change) or normal breast findings on imaging [Breast Imaging Reporting and Data System (BI-RADS) category 1 by breast sonography and/or mammography]. Patients with lobular carcinoma in situ (LCIS) were excluded, as LCIS is generally regarded as a non-invasive, risk-associated lesion rather than a malignant endpoint, and most clinical trials and risk prediction models do not classify LCIS as breast cancer per se [19]. Furthermore, CTC characteristics in LCIS remain undefined in published literature, making biological interpretation challenging and potentially confounding if such cases were included. Between February 2018 and November 2021, a total of 398 patients were enrolled, comprising 228 individuals with breast cancer and 170 individuals categorized as benign or healthy. All participants were followed for a minimum of two years. The study was designed to evaluate the performance of the CMx platform for CTC detection and enumeration in breast cancer (Fig. 1).

Fig. 1 — Study design for evaluating the performance of the CMx test. Peripheral blood was collected prior to diagnostic confirmation and, when applicable, at the same visit as breast imaging (mammography or ultrasound). Blood samples were processed using the CMx platform to enumerate CTCs. Imaging findings and histopathology (biopsy) were used to establish the final clinical status. CTC outcomes derived from the CMx test were subsequently compared with the final clinical diagnosis to evaluate test performance

CTC detection and enumeration (CMx platform)

The analytical validation of the CMx platform for rare CTC detection has been previously reported (Fig. 2A) [20–22]. Peripheral blood samples from donors were first collected in Vacutainer tubes containing EDTA anticoagulant (BD Biosciences, USA) and a cell preservative (Streck, USA). Subsequently, 2 mL of blood was introduced into a microfluidic EpCAM-coated CMx chip, operated at a controlled flow rate of 1.5 mL/h to capture CTCs. Following infusion, loosely bound cells were removed with phosphate-buffered saline, and the captured cells were released using air foams that gently detach the supported lipid bilayer from the chip surface, thereby liberating intact cells without harsh disruption of antigen–antibody bonds [20]. EpCAM, a homophilic type I transmembrane glycoprotein, has been widely adopted as a cell surface marker of epithelial carcinomas [23]. Mammaglobin (MGB), a 93-amino-acid glycoprotein, has been implicated as a diagnostic marker for breast carcinoma [24]. Cytokeratin 18 (CK18), an intermediate filament protein of the acidic type I cytokeratin family, is consistently expressed in many epithelial cancers, particularly adenocarcinomas [25], and has been applied in CTC detection [26, 27]. Prior studies have also explored the use of MGB as a breast cancer–specific CTC marker, in combination with EpCAM, CD45, and pan-CK for immunophenotyping [28]. In this study, MGB was used as an exploratory feature in combination with CK18 and patient age within an ensemble ML model to improve discrimination between breast cancer patients and healthy controls.

CTC staining and identification

In this study, released cells were stained with antibodies against CK18, MGB, pan-CK (guinea pig anti-Cytokeratin 8/18; OriGene, catalog #BP5007, USA), and CD45, together with the nuclear stain 4′,6-diamidino-2-phenylindole (DAPI). CTCs were defined and enumerated through a standardized three-step workflow. First, each patient sample was divided onto two slides, with cells mounted on a 10-mm membrane and imaged across four fluorescence channels using a Leica autofocus system. The first slide was stained with TRITC-conjugated anti-CK18, FITC-conjugated anti-CD45, and DAPI; the second slide was stained with Cy5-conjugated anti–pan-CK, TRITC-conjugated anti-MGB, FITC-conjugated anti-CD45, and DAPI. For each slide, 100 frames were automatically captured and stitched to generate composite image volumes for each channel. Second, the stitched images were analyzed using custom AI software that identified candidate cell-like regions. Each candidate was assigned a confidence index, trained iteratively against confirmed CTCs and white blood cells (WBCs), and flagged for morphology review. Finally, candidate events were examined in the CellReviewer platform according to strict morphological and immunophenotypic criteria: round-to-oval cell shape, diameter of 8–40 μm, positive cytoplasmic/membrane CK18 or MGB signal, CD45 negativity, and a DAPI-positive nucleus (Fig. 2B). WBCs were excluded based on their multilobed nuclear morphology. Enumeration was performed by a trained technician and verified by an expert pathologist when required.
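The final review criteria listed above amount to a conjunction of simple rules, which can be sketched as follows. This is an illustration only: the `Candidate` record, its field names, and the boolean marker calls are hypothetical and do not describe the CellReviewer software or the AI confidence index, and how shape and signal positivity are quantified in practice is not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate event (all field names are assumptions)."""
    diameter_um: float
    round_to_oval: bool
    ck18_positive: bool
    mgb_positive: bool
    cd45_positive: bool
    dapi_positive: bool


def is_ctc(c: Candidate) -> bool:
    """Apply the stated criteria: round-to-oval shape, 8-40 um diameter,
    CK18 or MGB signal, CD45 negativity, and a DAPI-positive nucleus."""
    return (
        c.round_to_oval
        and 8.0 <= c.diameter_um <= 40.0
        and (c.ck18_positive or c.mgb_positive)
        and not c.cd45_positive
        and c.dapi_positive
    )


def count_ctcs(candidates) -> int:
    """Number of candidate events that satisfy every criterion."""
    return sum(1 for c in candidates if is_ctc(c))
```

In such a scheme a CD45-positive event is rejected regardless of its CK18 or MGB signal, which mirrors the exclusion of leukocytes described above.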

Machine-learning model development and statistics

To predict outcomes, the analysis relied on a set of features: age, CK18, MGB, WBC, and platelet (referred to as Model 1). Our study aimed to develop a predictive model for classifying individuals into two categories: healthy/benign or diagnosed with cancer. Various ML algorithms were considered, including support vector machines (SVM) with different kernels, gradient boosting machine (GBM), random forest (RF), adaptive boosting (Adaboosting), and extreme gradient boosting (XGB) [16, 17]. Due to the limited sample size, a novel Monte Carlo cross-validation (MCCV) procedure was devised on the training set to construct an ensemble classifier [29]. To ensure robustness, 48 subjects were randomly selected for a test dataset using a stratified random sampling method to maintain proportions similar to the original data between healthy and cancer subjects (Fig. 3A). The MCCV procedure involved randomly splitting the remaining 350 subjects into training and validation datasets 1000 times, with 75% (n = 262) allocated to training and 25% (n = 88) to validation in each split. Within each split, ML models were trained on the training data using 10-fold cross-validation. Hyperparameters were tuned within prespecified ranges appropriate for each algorithm (Table S1). For example, SVM models varied the cost parameter (0.01–1000); GBM models tuned interaction depth (1–9), shrinkage (0.01, 0.1, 0.2), n.minobsinnode (5–20), bag.fraction (0.5–1.0), and n.trees (10,000); XGBoost (XGB) models tuned eta (0.01, 0.1, 0.2), max_depth (1–9), min_child_weight (1–9), subsample (0.7–1.0), and colsample_bytree (0.7–1.0). The optimal hyperparameters were selected by minimizing the classification error rate (CER) or maximizing accuracy in the validation folds. This process was repeated across 1000 random splits to generate an ensemble of “small” models for each algorithm, which were subsequently aggregated to produce the final ensemble classifier (Fig. 3B). 
The same procedure was applied to each of the eight ML algorithms considered, yielding one ensemble of 1000 trained “small” ML models per algorithm. The performance of each “small” model was evaluated by its CER on the corresponding validation set, and the ensemble classifier with the lowest average validation CER across the 1000 repeated splits was selected as the proposed model (Fig. 3C). The proposed approach differs from traditional MCCV in that it integrates all “small” models from every splitting step into the ensemble classifier, making it particularly suitable for datasets with limited samples. Within the total cohort, 33 individuals in the benign/healthy group had missing WBC and platelet counts; these values were imputed using the missForest R package within each split before the ML algorithms were fitted (Fig. 3A). The same MCCV procedure was also applied to the feature set excluding WBC and platelet count (referred to as Model 2) for all eight ML algorithms, to assess the impact of imputation and the contribution of these specific features (Fig. 3C). Finally, the best predictive model was used to generate predictions on the test data through majority voting across its 1000 constituent models. After obtaining the predicted probabilities, a patient was classified as having breast cancer if the probability exceeded a pre-defined threshold of 0.5. This threshold was selected a priori for this proof-of-concept evaluation, as it represents the theoretically grounded decision point for a Bayes classifier when class priors are approximately balanced and misclassification costs are assumed to be equal [30].
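The splitting, model-building, and voting steps described above can be sketched in a few lines. The sketch below is not the study's R implementation: it uses a nearest-centroid rule as a stand-in base learner instead of a linear SVM, omits the 10-fold hyperparameter tuning and the missForest imputation, and assumes that the predicted probability is the fraction of constituent models voting for cancer. All function names are hypothetical.

```python
import random


def fit_centroid(X, y):
    """Stand-in base learner (nearest class centroid), NOT the study's linear SVM."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return 1 (cancer) if x is closer to the class-1 centroid, else 0."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if sq_dist(centroids[1]) < sq_dist(centroids[0]) else 0


def mccv_ensemble(X, y, n_splits=1000, train_frac=0.75, seed=0):
    """Repeat random train/validation splits, keep every fitted model,
    and return the models with their average validation CER."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_train = int(train_frac * len(X))  # e.g. 350 subjects -> 262 training, 88 validation
    models, cers = [], []
    for _ in range(n_splits):
        rng.shuffle(idx)
        train, valid = idx[:n_train], idx[n_train:]
        model = fit_centroid([X[i] for i in train], [y[i] for i in train])
        errors = sum(1 for i in valid if predict_centroid(model, X[i]) != y[i])
        models.append(model)
        cers.append(errors / len(valid))
    return models, sum(cers) / len(cers)


def vote(models, x, threshold=0.5):
    """Fraction of models voting 'cancer' (assumed here to play the role of
    the predicted probability) and the resulting class call."""
    p = sum(predict_centroid(m, x) for m in models) / len(models)
    return p, int(p > threshold)
```

Running `mccv_ensemble` once per candidate algorithm and keeping the ensemble with the lowest average CER would correspond to the selection step in Fig. 3C. The sketch assumes every training split contains both classes.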

Fig. 3 — Research framework diagram. A Data preprocessing for the training, validation, and test datasets. Missing data are imputed with the missForest algorithm using the training data. B The proposed ensemble algorithm based on the MCCV procedure, in which the random data-splitting and model-building steps are repeated up to 1,000 times. The notation M1, M2, …, M1000 denotes the sequence of models; the ellipsis (“…”) indicates that intermediate models (e.g., M3, M4, …, M999) are omitted for brevity but follow the same format. After missing data are imputed as described in (A), each model is developed by grid search with 10-fold cross-validation on the training data, and its CER is then calculated on the validation data. C The best of the eight algorithms is selected using the complete training data, and predictions on the test data are made through majority voting across the 1000 models of the best predictive model

The primary analyses, including the construction and validation of Models 1 and 2, were pre-specified as part of the study design. Subsequent subgroup analyses by cancer stage, molecular subtype, breast density, and BI-RADS category were exploratory and not pre-specified. To further characterize the performance trade-offs across different operating points, a sensitivity analysis for the test data was conducted by evaluating model metrics across a range of probability thresholds (0.3–0.8). This analysis was intended to illustrate threshold-dependent performance variations rather than to define a post-hoc optimized cutoff. All analyses were conducted using R software (version 4.2.1), with statistical significance set at a two-sided p-value of < 0.05. The source code for this model is available at https://github.com/CYL-lab/SVM_linear_Model (v1.0.0, commit 9147380). To ensure long-term reproducibility, a copy of the source code including environment specifications (sessionInfo) is also provided as Supplementary Material.
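A threshold sensitivity analysis of this kind can be sketched as a simple sweep over cutoffs. The example below is illustrative: the 0.1 step between 0.3 and 0.8 is an assumption (only the range is stated), the probabilities and labels are made-up values rather than study data, and the function names are hypothetical.

```python
def sens_spec(probs, labels, threshold):
    """Sensitivity and specificity when probability > threshold is called cancer (label 1).
    Assumes both classes are present."""
    tp = sum(1 for p, t in zip(probs, labels) if t == 1 and p > threshold)
    fn = sum(1 for p, t in zip(probs, labels) if t == 1 and p <= threshold)
    tn = sum(1 for p, t in zip(probs, labels) if t == 0 and p <= threshold)
    fp = sum(1 for p, t in zip(probs, labels) if t == 0 and p > threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_sweep(probs, labels, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Map each cutoff to its (sensitivity, specificity) pair."""
    return {t: sens_spec(probs, labels, t) for t in thresholds}


# Made-up predicted probabilities and true labels (1 = cancer), for illustration only.
example_probs = [0.2, 0.45, 0.55, 0.9]
example_labels = [0, 0, 1, 1]
sweep = threshold_sweep(example_probs, example_labels)
```

Lowering the cutoff can only keep or raise sensitivity and keep or lower specificity, which is the trade-off such an analysis is meant to display.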

Results

Baseline clinical characteristics

The baseline clinical and biological characteristics of the entire cohort are summarized in Table 1. To ensure the integrity of model development, these characteristics are further detailed separately for the training dataset (n = 350; Table S2) and the testing dataset (n = 48; Table S3). Comparison between the two cohorts revealed no statistically significant differences in age or any biomarker levels (all p > 0.05; Table S4). All features were continuous and departed from normality according to the Shapiro–Wilk test, so medians with first and third quartiles are reported. The frequency distribution of cases is presented in Table 2, and the distribution of CTC enumeration data for CK18 and MGB expression is shown in Figure S1.

Table 1.

Demographic characteristics of all patients

Characteristics	All patients (n = 398)	Cancer (n = 228)	Benign/Healthy (n = 170)	p-value
Age (years)	51 (43, 62)	56 (46.75, 64)	46.5 (34, 54)	< 0.001*
CK18	3 (1, 6)	3 (1, 7)	2 (1, 4)	< 0.001*
MGB	4 (2, 8)	5 (2, 10)	3 (1.25, 7)	< 0.001*
WBC (/µL)	6300 (5400, 7700)	6400 (5400, 7700)	6300 (5400, 7700)^§	0.688
Platelet (/µL)	255000 (217000, 293000)	254000 (216000, 291000)	255000 (224000, 297000)^§	0.527

CK18 cytokeratin 18, MGB mammaglobin, WBC white blood cell

^§Summarized by deleting the missing values (n = 33 in Benign/Healthy group)

Values are median (Q1, Q3)

p-values are from the Mann–Whitney–Wilcoxon test. *p-values < 0.05 were considered statistically significant for comparisons between cancer and benign/healthy groups

Table 2.

Distribution of cases

Characteristic	Cancer (%) n = 228	Benign/Healthy (%) n = 170
Cancer stage
0	16 (7.0)	–
1	71 (31.1)	–
2	99 (43.4)	–
3	26 (11.4)	–
4	15 (6.6)	–
NA	1 (0.4)	–
Subtypes
ER+HER2−	144 (63.2)	–
ER+HER2+	26 (11.4)	–
ER−HER2+	22 (9.6)	–
ER−HER2−	36 (15.8)	–
Breast density
Category B	11 (4.8)	5 (2.9)
Category C	189 (82.9)	96 (56.5)
Category D	19 (8.3)	12 (7.1)
NA	9 (3.9)	57 (33.5)
BI-RADS category
0	2 (0.9)	0 (0.0)
1	0 (0.0)	15 (8.8)
2	2 (0.9)	63 (37.1)
3	2 (0.9)	16 (9.4)
4	122 (53.5)	75 (44.1)
5	30 (13.2)	0 (0.0)
6	67 (29.4)	0 (0.0)
NA	3 (1.3)	1 (0.6)

Category B: scattered areas of fibroglandular density; Category C: heterogeneously dense; Category D: extremely dense

NA not available, ER estrogen receptor, HER2 human epidermal growth factor receptor 2

Model performance

The performance of each ensemble classifier, evaluated from its 1000 small ML models on the corresponding validation sets, was summarized using the average area under the curve (AUC) and the average CER. AUC and CER for all eight models are reported in Table 3. Among these models, the ensemble classifiers based on SVM with linear and radial kernels, along with the GBM model, demonstrated comparable performance under the same set of features. Notably, these models outperformed the remaining five models in terms of either CER or AUC. Considering computational efficiency, linear and radial SVMs were preferred over GBM, which required substantially greater memory and computation time (≈ 11× slower than SVM-linear; SVM-radial ≈ 1.3× slower). Consequently, the GBM-based ensemble was excluded. Although SVM-linear and SVM-radial showed comparable performance in the primary analysis (Table 3), supplementary experiments (additional comparisons using datasets split with different random seeds; data not shown) revealed superior stability of the linear SVM, whereas the radial SVM produced inconsistent or inferior results. Therefore, SVM-linear was selected as the predictive model. For Model 2, all eight ML algorithms were evaluated using the same MCCV procedure with only age, CK18, and MGB as features. Notably, SVM-linear Model 2 slightly outperformed Model 1 (Table 3). Based on performance and cost considerations, SVM-linear using age, CK18, and MGB was ultimately proposed as the predictive model.

Table 3.

Performances of eight ensemble classifiers

	Model 1 (5 features)		Model 2 (3 features)
	AUC (95% CI)	CER (SD)	AUC (95% CI)	CER (SD)
SVM (linear)	0.70 (0.59, 0.81)	0.35 (0.04)	0.71 (0.59, 0.82)	0.35 (0.04)
GBM	0.70 (0.59, 0.81)	0.34 (0.05)	0.71 (0.60, 0.82)	0.34 (0.05)
SVM (radial)	0.70 (0.58, 0.81)	0.34 (0.04)	0.70 (0.58, 0.81)	0.34 (0.04)
SVM (polynomial)	0.68 (0.56, 0.79)	0.38 (0.06)	0.69 (0.58, 0.80)	0.39 (0.06)
SVM (sigmoid)	0.66 (0.55, 0.79)	0.35 (0.04)	0.60 (0.48, 0.72)	0.43 (0.06)
RF	0.67 (0.55, 0.79)	0.35 (0.05)	0.68 (0.57, 0.80)	0.35 (0.05)
Adaboosting	0.64 (0.52, 0.77)	0.38 (0.04)	0.64 (0.52, 0.76)	0.40 (0.05)
XGB	0.61 (0.49, 0.73)	0.40 (0.05)	0.57 (0.44, 0.69)	0.41 (0.05)

Model 1: age, CK18, MGB, WBC, and platelet. Model 2: age, CK18, and MGB

AUC area under the ROC curve, CI confidence intervals, SD standard deviation, CER classification error rate

Visualizing predicted probabilities

Model performance of the aforementioned SVM-linear classifier was evaluated using an independent test set (N = 48, Table S3). Average predicted probabilities stratified by cancer status are summarized (Table 4). In Model 1, the ensemble classifier yielded higher predicted probabilities for cancer patients compared with healthy individuals (mean 0.68 vs. 0.48), achieving high sensitivity (0.96) and NPV (0.92), with moderate specificity (0.52). Similar results were observed in Model 2 (mean 0.66 vs. 0.47), with sensitivity of 0.93 and specificity of 0.57 (Table 4). Sensitivity analysis excluding cases with missing WBC or platelet data (complete-case cohort, n = 365) demonstrated preserved sensitivity (0.97) and overall predictive behavior, though specificity decreased (0.22), likely reflecting reduced sample size and estimate instability (Table 5). Performance at the 0.5 threshold was consistent across probability cutoffs from 0.3 to 0.8 (Figure S2), indicating minimal impact of imputation on model characteristics. Predicted probabilities were consistently higher in cancer patients, as shown in boxplots and violin plots (Fig. 4), with significant differences confirmed by Wilcoxon–Mann–Whitney and Kolmogorov–Smirnov tests. ROC analysis demonstrated comparable discrimination for Models 1 and 2, with AUCs of 0.86 and 0.85, respectively (Fig. 5). Exploratory subgroup analyses showed robust performance across cancer stages (accuracy > 90%), molecular subtypes (lowest in ER–/HER2– at 0.80), and mammographic density categories, with reduced accuracy in Category B attributable to a single misclassified case (Tables S5–S7). Overall, the model demonstrated consistent predictive performance across clinicopathological subgroups, supporting its potential utility for malignancy risk characterization.

Table 4.

Confusion matrices of the ensemble classifiers using SVM-linear based on models 1 and 2 (validation cohort, n = 48)

Validation data Model 1 (5 features)			Validation data Model 2 (3 features)
	True Benign/Healthy	True Cancer		True Benign/Healthy	True Cancer
Predicted Benign/Healthy	11	1	Predicted Benign/Healthy	12	2
Predicted Cancer	10	26	Predicted Cancer	9	25
Predicted probability, mean (±SD)	0.48 (±0.12)	0.68 (±0.12)	Predicted probability, mean (±SD)	0.47 (±0.12)	0.66 (±0.13)
Specificity	0.52		Specificity	0.57
Sensitivity	0.96		Sensitivity	0.93
Accuracy rate	0.77		Accuracy rate	0.77
PPV	0.72		PPV	0.74
NPV	0.92		NPV	0.86

Results are based on 1000 votes in the SVM-linear machine learning model. The high sensitivity and NPV suggest that the model is effective in ruling out malignancy, thereby potentially reducing unnecessary follow-up among low-risk individuals

SD standard deviation, WBC white blood cell count, PPV positive predictive value, NPV negative predictive value
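The summary metrics in Table 4 follow directly from the four cells of each confusion matrix. As a worked check, the short sketch below recomputes them from the tabulated counts; the function name is hypothetical and the code is not taken from the study's repository.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic metrics from confusion-matrix counts
    (tp: cancer predicted as cancer, fp: benign/healthy predicted as cancer,
    fn: cancer predicted as benign/healthy, tn: benign/healthy predicted as such)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts taken from Table 4 (n = 48).
model1 = diagnostic_metrics(tp=26, fp=10, fn=1, tn=11)
model2 = diagnostic_metrics(tp=25, fp=9, fn=2, tn=12)

for name, m in (("Model 1", model1), ("Model 2", model2)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Rounded to two decimals, the values agree with those reported in the table for both models.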

Table 5.

Sensitivity analysis of the SVM-linear model using complete-case data

Validation data, Model (all covariates), complete cases only
	True Benign/Healthy	True Cancer
Predicted Benign/Healthy	4	1
Predicted Cancer	14	29
Predicted probability, mean (±SD)	0.58 (±0.12)	0.68 (±0.11)
Specificity	0.22
Sensitivity	0.97
Accuracy rate	0.69
PPV	0.67
NPV	0.80

PPV positive predictive value, NPV negative predictive value

Fig. 4 — Predicted probabilities generated by Model 1 and Model 2 in the test dataset. A Boxplots showing the prediction probabilities of Model 1 (five features: age, CK18, MGB, WBC, and platelet) and Model 2 (three features: age, CK18, and MGB). B Violin plots of prediction probabilities from Model 1 and Model 2 (x-axis: groups; y-axis: prediction probability). The significant separation between cancer and benign/healthy groups demonstrates the models’ ability to provide a quantifiable risk score. The width of each violin represents the density of samples at each prediction probability. For each model, the distributions of predicted probabilities for cancer vs. benign/healthy groups were compared using the Wilcoxon–Mann–Whitney (WMW) and Kolmogorov–Smirnov (KS) tests. Model 1: WMW p = 8.4 × 10^−6; KS p = 3.6 × 10^−5. Model 2: WMW p = 4.6 × 10^−5; KS p = 1.3 × 10^−4

Fig. 5 — Receiver operating characteristic (ROC) curves of Models 1 and 2 for the test data (n = 48). ROC analysis was performed to evaluate the classification performance of the ensemble machine-learning models. Model 1 (solid blue line) included age, CK18, MGB, WBC, and platelet as features, while Model 2 (dashed red line) included age, CK18, and MGB only. The ROC curves for both models demonstrate strong overall diagnostic performance (AUC ≥ 0.85). From a clinical perspective, the steep initial rise of the curves indicates that the models can achieve high sensitivity even at relatively low false-positive rates. The shaded areas indicate 95% CIs generated from 2,000 stratified bootstrap replicates, while CIs for the AUCs were estimated using DeLong’s method
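For readers less familiar with the AUC, it equals the proportion of (cancer, benign/healthy) pairs in which the cancer case receives the higher predicted probability, which is also the quantity underlying the Wilcoxon–Mann–Whitney statistic. The sketch below computes it this way and adds a stratified percentile bootstrap interval. This is a simplified stand-in: DeLong's method, which the study used for the AUC confidence intervals, is not reproduced here, and the function names are hypothetical.

```python
import random


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def stratified_bootstrap_ci(scores_pos, scores_neg, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC, resampling each class separately
    so that the class sizes are preserved in every replicate."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(scores_pos, k=len(scores_pos))
        boot_neg = rng.choices(scores_neg, k=len(scores_neg))
        stats.append(auc(boot_pos, boot_neg))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

With a test cohort of 48, either approach yields wide intervals, consistent with the 0.73–0.96 range reported for Model 2.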

Complementary performance of the ML-based CTC algorithm in BI-RADS category 3/4 breast imaging

As an exploratory analysis, we evaluated the model’s performance within the BI-RADS 3 and 4 subgroups, categories that often present diagnostic challenges [31, 32]. These categories comprise the major population of clinical interest and may benefit from additional complementary assessments. A total of 394 patients underwent either breast ultrasound or mammography for screening or diagnostic purposes. For each individual, the highest BI-RADS score from either modality was used for analysis (Table 2). All cases classified as BI-RADS 5 or 6 were malignant. When BI-RADS 3 and 4 patients were combined (n = 215), the overall cancer detection rate was 57.7%. Since the ML models were trained on a dataset of 350 patients, performance evaluation was conducted exclusively in the testing cohort (n = 48). Among these, 24 individuals were categorized as BI-RADS 3 or 4, with 15 pathologically confirmed as breast cancer (62.5%). In this subgroup, both Models 1 and 2 correctly identified all 15 cancer cases (sensitivity = 1.00). Model 2 showed a modest improvement in specificity (0.44 vs. 0.33) and accuracy (0.79 vs. 0.75) compared with Model 1 (Table 6). Given the limited size of this subgroup, these findings are exploratory and intended to demonstrate proof-of-concept for complementary risk assessment rather than clinical triage.

Table 6.

Confusion matrices and diagnostic performance of ML-based CTC models stratified by BI-RADS 3 and 4 categories in the testing cohort (n = 24)

Model	Characteristic	True Benign/Healthy	True Cancer	Sensitivity	Specificity	PPV	NPV	Accuracy
Model 1 (5 features)	Predicted Benign/Healthy	3	0	1.00	0.33	0.71	1.00	0.75
	Predicted Cancer	6	15
Model 2 (3 features)	Predicted Benign/Healthy	4	0	1.00	0.44	0.75	1.00	0.79
	Predicted Cancer	5	15

Model 1: age, CK18, MGB, WBC, and platelet. Model 2: age, CK18, and MGB

BI-RADS Breast Imaging Reporting and Data System, PPV positive predictive value, NPV negative predictive value

Discussion

In this study, we developed an ML-based model that integrates age, peripheral cell counts, and biomarker data from circulating tumor cell detection, aiming to enhance breast cancer risk assessment. This study serves as a proof-of-concept rather than a head-to-head diagnostic comparison. The conventional implementation of ML is to develop a classifier based on randomly sampled training and validation datasets. However, this sampling method may be sensitive to the sampled data, particularly when dealing with moderate or smaller sample sizes, resulting in high variation in model performance [15, 33]. To overcome this problem, we proposed a novel ML strategy using MCCV to integrate multiple classifiers within a random forest structure, thereby reducing heterogeneity due to sampling. Eight ML cores were evaluated accordingly, and the ensemble using SVM with a linear kernel was selected. While this MCCV-based ensemble strategy aims to reduce sampling heterogeneity, it does not eliminate the risk of overfitting, particularly given the modest size of the independent test cohort (n = 48). Consequently, the reported performance metrics should be viewed as preliminary estimates, and further external validation remains indispensable.

Recent studies suggest that platelets may enhance CTC survival in the bloodstream and promote cancer metastasis [34]. In the present study, peripheral platelet and leukocyte counts were examined as potential factors influencing CTC enumeration. However, the influence of these parameters was minimized in the final model. The exclusion of WBC is consistent with previous studies using the same CTC detection platform [35], in which automated image analysis excludes cells with leukocyte characteristics. For CTC-based biomarkers, MGB expression has been shown to be highly variable and substantially lower in TNBC than in luminal-type tumors [36], which may partly explain the reduced model performance observed in the ER–/HER2– subgroup (Table S6). Moreover, our findings reflect the biological heterogeneity of TNBC, underscoring the need to evaluate complementary mesenchymal or immune-related markers [36–38]. To ensure more robust performance across molecular subtypes, future model calibration will prioritize feature expansion, including the incorporation of additional non-EpCAM markers such as vimentin [39, 40], to better detect tumors with low MGB expression. In addition, the variability observed across BI-RADS breast density categories indicates that breast density may influence CTC detectability or marker expression (Table S7). Collectively, these subgroup analyses provide guidance for biomarker refinement. They also suggest that subgroup-aware feature expansion, such as adding molecular subtype and BI-RADS density as model inputs, may improve robustness, stabilize feature contributions, and enhance generalizability across heterogeneous patient populations.

Reported PPVs for BI-RADS 3/4 lesions vary widely across studies due to differences in imaging technology, classification standards, and population characteristics [41, 42]. Previous literature reports malignancy risks for the suspicious subcategories at approximately 13% for 4A and 36% for 4B [41], whereas studies from specialized centers report higher PPVs for BI-RADS 4B and 4C (approximately 75% and 83%, respectively) [42]. This spectrum indicates the diagnostic ambiguity and heterogeneity of malignancy risk within intermediate BI-RADS categories. Within this imaging-defined context, our Model 2 achieved a PPV of 75% in the BI-RADS 3/4 test subgroup. Importantly, this exploratory finding is descriptive and should not be interpreted as a direct head-to-head diagnostic comparison with BI-RADS subclassifications or as evidence of equivalence to imaging-derived malignancy risk estimates. Rather than replacing imaging, the model may provide complementary information to help characterize risk within BI-RADS 3/4 cases.
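One reason PPVs are hard to compare across settings is that PPV depends on the prevalence of cancer in the group tested, not only on the test itself. The sketch below applies Bayes' rule using the Model 2 subgroup sensitivity and specificity. It assumes, purely for illustration, that these two values stay fixed as prevalence changes, which has not been shown. The prevalence values other than the observed 62.5% are hypothetical inputs.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("no positive predictions expected; PPV is undefined")
    return true_pos / (true_pos + false_pos)


# Model 2 in the BI-RADS 3/4 test subgroup: 15/15 cancers and 4/9 non-cancers correct.
sens, spec = 15 / 15, 4 / 9

# 0.625 is the observed subgroup prevalence; the lower values are hypothetical.
for prev in (0.10, 0.30, 0.625):
    print(f"prevalence {prev:.3f} -> PPV {ppv(sens, spec, prev):.2f}")
```

At the observed prevalence the function returns the reported PPV of 0.75; at lower prevalence the same sensitivity and specificity would yield a lower PPV.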

The traditional serum tumor marker CA 15-3 is used to monitor therapy in advanced breast cancer. However, it has not been validated for early detection because it lacks sensitivity and specificity for early-stage disease [43]. Reported sensitivities for combinations of serum tumor markers remain below 60% in recurrent or metastatic disease [44]. In this study, Model 2 achieved an NPV of 0.86 and a PPV of 0.74, suggesting its potential to support diagnostic decision-making and reduce unnecessary follow-up for low-risk individuals. Although sensitivity was high (0.93) and specificity in the primary analysis was 0.57, our findings indicate that such performance estimates also depend on data completeness (Tables 4 and 5). This vulnerability was illustrated by a sensitivity analysis using Model 1, in which specificity declined to 0.22 when the cohort was restricted to complete cases (Table 5). This discrepancy highlights the inherent instability of estimates in small datasets and shows how missing measurements can alter cohort composition. Consequently, these findings must be interpreted strictly as a proof-of-concept. The current false-positive rate remains a significant limitation; beyond the burden of follow-up imaging, it can induce patient anxiety and increase healthcare costs, potentially offsetting the economic benefits of the assay. To mitigate the high false-positive rate, future work will focus on threshold optimization and on combined imaging approaches to enhance specificity. Within this context, decision-curve analysis will be essential to identify the optimal net benefit and minimize the clinical and emotional impact of false positives. While our results provide a proof-of-concept, any impact on resource utilization or clinical workflow remains speculative. Translation into routine practice will require formal health-economic modeling and evaluation of practical factors, such as laboratory turnaround time, to assess clinical potential.
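For readers unfamiliar with decision-curve analysis, the quantity it plots is the net benefit, NB = TP/n − (FP/n) × p_t/(1 − p_t), where p_t is the threshold probability at which a clinician would act. The sketch below evaluates this formula at the single operating point available from the BI-RADS 3/4 subgroup for Model 2 (15 true positives, 5 false positives, n = 24) and compares it with referring everyone. A full decision curve would re-threshold the predicted probabilities, which is not attempted here. The function names and the chosen thresholds are illustrative assumptions.

```python
def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Net benefit at a given threshold probability (0 < threshold < 1)."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    weight = threshold / (1 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(n_pos: int, n: int, threshold: float) -> float:
    """Net benefit of acting on everyone: all positives are TP, all negatives are FP."""
    return net_benefit(n_pos, n - n_pos, n, threshold)


# Model 2, BI-RADS 3/4 test subgroup: 15 TP, 5 FP, 24 individuals, 15 with cancer.
for pt in (0.1, 0.2, 0.3):  # hypothetical threshold probabilities
    model_nb = net_benefit(tp=15, fp=5, n=24, threshold=pt)
    all_nb = net_benefit_treat_all(n_pos=15, n=24, threshold=pt)
    print(f"p_t={pt:.1f}: model {model_nb:.3f}, refer-all {all_nb:.3f}")
```

Because the model missed no cancers in this subgroup, its net benefit exceeds the refer-all strategy only through the four avoided false positives, a margin that rests on very few cases.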

This study has several limitations. Although all benign and healthy participants were cancer-free at blood collection and remained so during at least two years of follow-up, longer surveillance is required to clarify the longitudinal dynamics and lead time of CTC biomarkers. The EpCAM-based enrichment strategy may underrepresent CTCs undergoing epithelial–mesenchymal transition or with low EpCAM expression [36–38]. Patients with LCIS were excluded because these lesions are frequently radiologically occult and typically detected incidentally at biopsy [19, 45], which may introduce selection bias and warrants future dedicated investigation. In addition, the modest size of the independent validation cohort and restriction to Taiwanese women limit generalizability. Accordingly, the performance metrics reported here should not be assumed to directly generalize to other ethnicities or screening populations, where baseline risk, breast density, and screening pathways differ, particularly as Asian women, including Taiwanese populations, tend to have denser breast tissue and an earlier onset of breast cancer compared with Western cohorts [46, 47]. Collectively, larger, multi-center, and multi-ethnic prospective external cohorts will be required to confirm robustness. Furthermore, future studies must focus on evaluating critical translational factors such as physician acceptance and clinical transparency. To address these requirements, developing model interpretability through feature importance or coefficient estimates to clarify the individual contributions of age, CK18, and MGB will be essential after larger external validation. Such analyses will help enhance clinical trust and ensure the model’s successful integration into clinical diagnostic workflows.

Conclusions

Our study demonstrates the feasibility of combining CTC enumeration with ML for breast cancer detection. Further studies are required to determine whether this approach can provide clinically meaningful complementarity to conventional imaging. While the CTC-based model shows promising discriminative performance, its modest specificity and the relatively small test cohort warrant cautious interpretation. Therefore, its potential role as a complementary risk assessment tool must be thoroughly validated in larger, independent cohorts before any consideration for incorporation into clinical diagnostic workflows.

Supplementary Information

Supplementary Material 1 (1.3 MB, docx)

Supplementary Material 2 (32.9 KB, docx)

Acknowledgements

The authors are grateful to the patients at Taipei Veterans General Hospital whose contributions made this research project possible. The authors would like to thank Dr. Feng-Ming Lin from PETcision Co., Ltd. for his valuable technical consultation and expertise in circulating tumor cell enumeration. The laboratory work was completed using facilities at the Medical Science & Technology Building of Taipei Veterans General Hospital. Servier Medical Art, which is licensed under a Creative Commons Attribution 4.0 Unported License, was used to create some parts of the figures.

Authors’ contributions

Conceptualization and Funding acquisition: CYL and LMT; Formal analysis and Investigation: CYL, YFT, YHLin, PYL, JLC, YHLi, CCH, YSL, TCC, CJF, CYH, JHC, CMC, and LMT; Writing - Original Draft: CYL; Supervision and Writing - Review & Editing: CMC and LMT. All authors have read and approved the final manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan (NSTC 112-2314-B-A49-083-MY3), the Taipei Veterans General Hospital (V114C-013; V115C-010), the Taipei Veterans General Hospital–National Taiwan University Hospital Joint Research Program (VN111-06), Dr. Morris Chang (ABMRD002), the Melissa Lee Cancer Foundation, the Teh-Tzer Study Group for Human Medical Research Foundation (B1131020), the Szu-Yuan Research Foundation of Internal Medicine, and the Yong-Lin Healthcare Foundation (SINO-CANCER project). The funding sources were not involved in study design or manuscript writing.

Data availability

All data generated or analyzed during this study are included in this published article and its supplementary information files.

Declarations

Ethics approval and consent to participate

The study protocol was reviewed and approved by the Institutional Review Board of Taipei Veterans General Hospital (2017-10-016AC). All procedures were conducted in compliance with the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Chyong-Mei Chen, Email: cmchen2@nycu.edu.tw.

Ling-Ming Tseng, Email: lmtseng@vghtpe.gov.tw.

References

1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer Statistics, 2021. CA Cancer J Clin. 2021;71(1):7–33.
2. Jacklyn G, Glasziou P, Macaskill P, Barratt A. Meta-analysis of breast cancer mortality benefit and overdiagnosis adjusted for adherence: improving information on the effects of attending screening mammography. Br J Cancer. 2016;114(11):1269–76.
3. Duffy SW, Tabar L, Yen AM, Dean PB, Smith RA, Jonsson H, et al. Mammography screening reduces rates of advanced and fatal breast cancers: results in 549,091 women. Cancer. 2020;126(13):2971–9.
4. Connal S, Cameron JM, Sala A, Brennan PM, Palmer DS, Palmer JD, et al. Liquid biopsies: the future of cancer early detection. J Transl Med. 2023;21(1):118.
5. Ma L, Guo H, Zhao Y, Liu Z, Wang C, Bu J, et al. Liquid biopsy in cancer: current status, challenges and future prospects. Signal Transduct Target Ther. 2024;9(1):336.
6. Xu Y, Zhu S, Xia C, Yu H, Shi S, Chen K, et al. Liquid biopsy-based multi-cancer early detection: an exploration road from evidence to implementation. Sci Bull. 2025. doi:10.1016/j.scib.2025.06.030.
7. Guo L, Liu C, Qi M, Cheng L, Wang L, Li C, et al. Recent progress of nanostructure-based enrichment of circulating tumor cells and downstream analysis. Lab Chip. 2023;23(6):1493–523.
8. Zhong HJ, Zhen Y, Chen S, Shi W, Liang X, Yang GJ. Advances in CTC and ctDNA detection techniques: opportunities for improving breast cancer care. Breast Cancer Res. 2025;27(1):97.
9. Shao X, Jin X, Chen Z, Zhang Z, Chen W, Jiang J, et al. A comprehensive comparison of circulating tumor cells and breast imaging modalities as screening tools for breast cancer in Chinese women. Front Oncol. 2022;12:890248.
10. Deng Z, Wu S, Wang Y, Shi D. Circulating tumor cell isolation for cancer diagnosis and prognosis. EBioMedicine. 2022;83:104237.
11. Braun S, Vogl FD, Naume B, Janni W, Osborne MP, Coombes RC, et al. A pooled analysis of bone marrow micrometastasis in breast cancer. N Engl J Med. 2005;353(8):793–802.
12. Xu H, Caramanis C, Mannor S. Robustness and regularization of support vector machines. J Mach Learn Res. 2009;10:1485–510.
13. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8–17.
14. Kovalev MS, Utkin LV. A robust algorithm for explaining unreliable machine learning survival models using the Kolmogorov–Smirnov bounds. Neural Netw. 2020;132:1–18.
15. Li ZZ, Yoon J, Zhang R, Rajabipour F, Srubar WV, Dabo I, et al. Machine learning in concrete science: applications, challenges, and best practices. NPJ Comput Mater. 2022;8(1).
16. Lebbe A, Saabith S, Sundararajan EA, Bakar AA, editors. Comparative study on different classification techniques for breast cancer dataset. 2014.
17. Mohammed SA, Darrab S, Noaman SA, Saake G. Analysis of breast cancer detection using different machine learning techniques. Data Min Big Data. 2020;1234:108–17.
18. Tavakoli H, Chen W, Sin DD, FitzGerald JM, Sadatsafavi M. Predicting severe chronic obstructive pulmonary disease exacerbations. Developing a population surveillance approach with administrative data. Ann Am Thorac Soc. 2020;17(9):1069–76.
19. Wen HY, Brogi E. Lobular carcinoma in situ. Surg Pathol Clin. 2018;11(1):123–45.
20. Gupta P, Gulzar Z, Hsieh B, Lim A, Watson D, Mei R. Analytical validation of the CellMax platform for early detection of cancer by enumeration of rare circulating tumor cells. J Circ Biomark. 2019;8:1849454419899214.
21. Wu JC, Tseng PY, Tsai WS, Liao MY, Lu SH, Frank CW, et al. Antibody conjugated supported lipid bilayer for capturing and purification of viable tumor cells in blood for subsequent cell culture. Biomaterials. 2013;34(21):5191–9.
22. Lai JM, Shao HJ, Wu JC, Lu SH, Chang YC. Efficient elusion of viable adhesive cells from a microfluidic system by air foam. Biomicrofluidics. 2014;8(5):052001.
23. Keller L, Werner S, Pantel K. Biology and clinical relevance of EpCAM. Cell Stress. 2019;3(6):165–80.
24. Wang Z, Spaulding B, Sienko A, Liang Y, Li H, Nielsen G, et al. Mammaglobin, a valuable diagnostic marker for metastatic breast carcinoma. Int J Clin Exp Pathol. 2009;2(4):384–9.
25. Menz A, Weitbrecht T, Gorbokon N, Buscheck F, Luebke AM, Kluth M, et al. Diagnostic and prognostic impact of cytokeratin 18 expression in human tumors: a tissue microarray study on 11,952 tumors. Mol Med. 2021;27(1):16.
26. Hou JM, Krebs M, Ward T, Morris K, Sloane R, Blackhall F, et al. Circulating tumor cells, enumeration and beyond. Cancers (Basel). 2010;2(2):1236–50.
27. Lu SH, Tsai WS, Chang YH, Chou TY, Pang ST, Lin PH, et al. Identifying cancer origin using circulating tumor cells. Cancer Biol Ther. 2016;17(4):430–8.
28. Meng S, Tripathy D, Frenkel EP, Shete S, Naftalis EZ, Huth JF, et al. Circulating tumor cells in patients with breast cancer dormancy. Clin Cancer Res. 2004;10(24):8152–62.
29. Shan G. Monte Carlo cross-validation for a study with binary outcome and limited sample size. BMC Med Inform Decis Mak. 2022;22(1):270.
30. Bishop CM. Pattern Recognition and Machine Learning. 1 ed. New York, NY: Springer New York; 2006. p. 778.
31. Reghunath A, Mittal MK, Chintamani C, Prasad R. Novel approach in the evaluation of ultrasound BI-RADS 3 & 4 breast masses with a combination method of elastography & doppler. Indian J Med Res. 2021;154(2):355–66.
32. Burnside ES, Sickles EA, Bassett LW, Rubin DL, Lee CH, Ikeda DM, et al. The ACR BI-RADS experience: learning from history. J Am Coll Radiol. 2009;6(12):851–60.
33. Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS One. 2019;14(11):e0224365.
34. Anvari S, Osei E, Maftoon N. Interactions of platelets with circulating tumor cells contribute to cancer metastasis. Sci Rep. 2021;11(1):15477.
35. Tsai WS, You JF, Hung HY, Hsieh PS, Hsieh B, Lenz HJ, et al. Novel circulating tumor cell assay for detection of colorectal adenomas and cancer. Clin Transl Gastroenterol. 2019;10(10):e00088.
36. Sieuwerts AM, Kraan J, Bolt J, van der Spoel P, Elstrodt F, Schutte M, et al. Anti-epithelial cell adhesion molecule antibodies and the detection of circulating normal-like breast tumor cells. J Natl Cancer Inst. 2009;101(1):61–6.
37. Yu M, Bardia A, Wittner BS, Stott SL, Smas ME, Ting DT, et al. Circulating breast tumor cells exhibit dynamic changes in epithelial and mesenchymal composition. Science. 2013;339(6119):580–4.
38. Perelmuter VM, Grigoryeva ES, Alifanov VV, Kalinchuk AY, Andryuhova ES, Savelieva OE, et al. Characterization of EpCAM-positive and EpCAM-negative tumor cells in early-stage breast cancer. Int J Mol Sci. 2024. doi:10.3390/ijms252011109.
39. Satelli A, Mitra A, Brownlee Z, Xia X, Bellister S, Overman MJ, et al. Epithelial-mesenchymal transitioned circulating tumor cells capture for detecting tumor progression. Clin Cancer Res. 2015;21(4):899–906.
40. Satelli A, Brownlee Z, Mitra A, Meng QH, Li S. Circulating tumor cell enumeration with a combination of epithelial cell adhesion molecule- and cell-surface vimentin-based methods for monitoring breast cancer therapeutic response. Clin Chem. 2015;61(1):259–66.
41. Bent CK, Bassett LW, D’Orsi CJ, Sayre JW. The positive predictive value of BI-RADS microcalcification descriptors and final assessment categories. AJR Am J Roentgenol. 2010;194(5):1378–83.
42. Ghaemian N, Haji Ghazi Tehrani N, Nabahati M. Accuracy of mammography and ultrasonography and their BI-RADS in detection of breast malignancy. Casp J Intern Med. 2021;12(4):573–9.
43. Duffy MJ. Serum tumor markers in breast cancer: are they of clinical value? Clin Chem. 2006;52(3):345–51.
44. Pedersen AC, Sorensen PD, Jacobsen EH, Madsen JS, Brandslund I. Sensitivity of CA 15-3, CEA and serum HER2 in the early detection of recurrence of breast cancer. Clin Chem Lab Med. 2013;51(7):1511–9.
45. Sokolova A, Lakhani SR. Lobular carcinoma in situ: diagnostic criteria and molecular correlates. Mod Pathol. 2021;34(Suppl 1):8–14.
46. Hung CC, Moi SH, Huang HI, Hsiao TH, Huang CC. Polygenic risk score-based prediction of breast cancer risk in Taiwanese women with dense breast using a retrospective cohort study. Sci Rep. 2024;14(1):6324.
47. Chen YC, Su SY, Jhuang JR, Chiang CJ, Yang YW, Wu CC, et al. Forecast of a future leveling of the incidence trends of female breast cancer in Taiwan: an age-period-cohort analysis. Sci Rep. 2022;12(1):12481.
