Early detection of urological tumors based on genomic characteristics of cell-free DNA fragments: a multi-center study

Huiyong Zhang; Caihong Huang; Chunmeng Wei; Rongbin Zhou; Chengbang Wang; Wenhao Lu; Zuheng Wang; Xiao Li; Shaohua Chen; Dianyu Wang; Jin Ji; Yuexiang Li; Rirong Yang; Junyi Chen; Yanling Hu; Fubo Wang

doi:10.1038/s41698-025-01130-1

. 2025 Nov 17;9:352. doi: 10.1038/s41698-025-01130-1

Huiyong Zhang ^1,2,#, Caihong Huang ^3,#, Chunmeng Wei ^1,4,#, Rongbin Zhou ^1, Chengbang Wang ^1,4, Wenhao Lu ^1, Zuheng Wang ^4, Xiao Li ^5, Shaohua Chen ^6,7, Dianyu Wang ^8, Jin Ji ^9,10, Yuexiang Li ^1, Rirong Yang ^1, Junyi Chen ^8,✉, Yanling Hu ^1,5,✉, Fubo Wang ^1,4,5,7,✉

PMCID: PMC12624079 PMID: 41249499

Abstract

Cell-free DNA (cfDNA) has shown potential in distinguishing cancer patients from healthy individuals. This study investigates three classes of cfDNA fragmentomic features, namely fragmentation patterns, end motifs (EDMs), and breakpoint motifs (BPMs), to develop an early detection method for bladder urothelial carcinoma (BLCA), prostate adenocarcinoma (PRAD), and clear cell renal cell carcinoma (ccRCC). We performed low-coverage whole genome sequencing (lcWGS) on plasma samples from 758 participants (patients with BLCA, PRAD, ccRCC, or benign prostatic hyperplasia, and healthy controls) and analyzed the resulting cfDNA features. Machine learning models (logistic regression, support vector machine, random forest, XGBoost, Stacking) distinguished urological tumors from non-tumor cases with AUCs of 0.96 (BLCA), 0.99 (ccRCC), 0.92 (PRAD), and 0.89 (pan-cancer). Key discriminators included 6-bp EDMs and BPMs. A proposed two-tier screening strategy combining pan-cancer and cancer-specific features offers a cost-effective, non-invasive approach for early detection with strong clinical potential.

Subject terms: Cancer, Computational biology and bioinformatics, Biomarkers, Medical research, Oncology

Introduction

The three major malignant tumors of the urinary system, namely prostate adenocarcinoma (PRAD), bladder urothelial carcinoma (BLCA), and kidney cancer, ranked 4th, 9th, and 14th, respectively, among cancers by number of new cases in 2022¹. Because typical clinical symptoms are lacking in the early stages, these cancers often go undetected until they are advanced, missing the optimal window for treatment and resulting in poor therapeutic outcomes. For example, the five-year survival rate for localized renal cancer can reach 93%², whereas that for metastatic renal cancer is only 12%³. Early diagnosis is therefore crucial for improving patient outcomes and calls for a minimally invasive detection method with high patient compliance.

However, traditional tumor screening methods, such as serological marker detection, imaging, and tissue biopsy, have inherent limitations. For instance, prostate-specific antigen is a commonly used biomarker for PRAD screening, but its specificity is relatively low and its level can also be raised by benign prostatic hyperplasia (BPH) and prostatitis, leading to overdiagnosis and overtreatment of low-risk PRAD cases^4,5. Although tissue biopsy remains the gold standard for diagnosis, its invasive nature carries risks of complications, including infection and hematuria⁶. For BLCA, urine cytology has low sensitivity (13–75%)⁷, while cystoscopy combined with pathological biopsy, although considered the gold standard, is hampered by its complexity and invasiveness. No screening test is currently widely recommended for BLCA in the general population. Likewise, kidney cancer has no recognized biomarkers and no established screening test, and its diagnosis relies primarily on imaging and pathological biopsy. These gaps motivate the development of a minimally invasive method for early detection of urological tumors.

With the advancement and reduced cost of gene sequencing technology, liquid biopsy based on cell-free DNA (cfDNA) has emerged as a research hotspot. CfDNA, released into the blood during cell apoptosis or necrosis, exhibits tissue specificity influenced by various factors⁸. Fragmentation characteristics, including fragment size, copy number variations, nucleosome footprints, whole-genome cfDNA fragmentation patterns, and end motifs (EDMs), have demonstrated potential diagnostic value in various cancers^9–11. Specifically, cfDNA has demonstrated high diagnostic potential in liver cancer^12–14, lung cancer^15–17, and colorectal cancer^18,19. Cristiano et al.’s DELFI technology achieved high accuracy in distinguishing seven cancer types using cfDNA fragment length ratios, with an area under the curve (AUC) of 0.94²⁰. Subsequent studies have validated its application in liver and lung cancers^13,15. These findings indicate that whole-genome cfDNA fragmentation patterns can serve as biomarkers for liver or lung cancer detection. Additionally, cfDNA EDMs represent novel biomarkers for cancer detection. Researchers have identified tumor-associated cfDNA preferred end coordinates by analyzing cfDNA fragments from hepatocellular carcinoma patients and non-cancer individuals²¹; these preferences may be attributed to chromatin accessibility, tumor-specific nucleosome positioning, and nuclease activity^22,23. Jiang et al. constructed diagnostic models using support vector machine (SVM) and logistic regression (LR) algorithms based on the preferential 4 bp EDMs in cancer patients, achieving an AUC of 0.89¹¹. Chung et al. validated a cfDNA blood test in a colorectal cancer screening cohort of 7861 participants, reporting a sensitivity of 83.1% (95% confidence interval [CI] 72.2–90.3) for colorectal cancer²⁴.
Notably, a recent study evaluating a multicancer early detection test analyzing plasma cell-free DNA using genetic and fragmentomics features from whole-genome sequencing demonstrated an overall sensitivity of 87.4%, specificity of 97.8%, and tissue-of-origin prediction accuracy of 82.4% in an independent validation cohort, further highlighting the significant potential of cfDNA-based methods for early cancer detection²⁵. These results, along with growing evidence, reinforce the notion that cfDNA fragments have the potential to serve as cancer biomarkers.

In this study, we recruited a total of 758 participants, including patients with BLCA, PRAD, clear cell renal cell carcinoma (ccRCC), BPH, and healthy controls (HCs). We performed low-coverage whole-genome sequencing (lcWGS) to analyze cfDNA fragmentation patterns, EDMs, and breakpoint motifs (BPMs). Feature analysis and selection were conducted to identify significantly different cfDNA fragmentomic features. Diagnostic models were constructed using LR, SVM, random forest (RF), XGBoost, and stacking algorithms to distinguish cancer patients from non-cancer controls, providing a feasible method for screening and diagnosing common urological tumors.

Results

Participant characteristics and disposition

A total of 407 newly diagnosed urological tumor patients, 94 BPH patients, and 257 healthy individuals were included in this study (Fig. 1, Step 1). The tumor patients consisted of 91 BLCA patients, 153 PRAD patients, and 163 ccRCC patients. All healthy individuals were classified into the HC group. Four datasets were constructed based on different research objectives: BLCA dataset (BLCA vs. HC, 348 subjects), ccRCC dataset (ccRCC vs. HC, 420 subjects), PRAD dataset (PRAD vs. BPH + HC, 413 male subjects), and pan-cancer dataset (all tumors vs. non-tumors, 758 subjects). The demographic information of the participants is shown in Table 1 and Table 2. Notably, the distribution of these samples across the four participating medical centers, given in the Data Collection rows of Tables 1 and 2, shows that both cancer and control samples were collected from multiple sites, a deliberate design choice to mitigate potential site-specific batch effects. All subjects underwent lcWGS of cfDNA, and four types of fragmentomic features were extracted: FSR, FSD, EDM, and BPM, totaling 10,113 features. To reduce model complexity and improve computational efficiency, a two-step feature selection strategy was employed (Fig. 1, Step 2). First, t-tests were used to identify features with statistically significant differences between case and control groups (P < 0.01), resulting in 6057, 6826, 5821, and 7176 features for the BLCA, ccRCC, PRAD, and pan-cancer datasets, respectively (Supplementary Data S1). Subsequently, the SHAP (SHapley Additive exPlanations)²⁶ method was applied for feature reduction, retaining 25, 34, 11, and 36 features for the respective datasets (Supplementary Data S2).
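The total of 10,113 features can be reproduced from counts given elsewhere in the paper. The sketch below is only a tally: it assumes that each of the 473 usable regions contributes one FSR feature, that FSD covers 39 chromosome arms times 24 length bins, and that every possible 4 bp and 6 bp motif is kept as one feature for both EDM and BPM. The variable names are illustrative.

```python
# Tally of the fragmentomic feature space, using counts stated in the
# Methods and Discussion. Assumption: every possible DNA k-mer over
# A, C, G, T is retained as one motif feature.
N_REGIONS = 473        # usable 5 Mb regions, one FSR feature each
N_ARMS = 39            # autosomal chromosome arms
N_LENGTH_BINS = 24     # 5 bp bins spanning 100-220 bp


def motif_count(k: int) -> int:
    """Number of distinct DNA k-mers over a 4-letter alphabet."""
    return 4 ** k


fsr = N_REGIONS
fsd = N_ARMS * N_LENGTH_BINS
edm = motif_count(4) + motif_count(6)   # 4 bp and 6 bp end motifs
bpm = motif_count(4) + motif_count(6)   # 4 bp and 6 bp breakpoint motifs
total = fsr + fsd + edm + bpm

print(fsr, fsd, edm, bpm, total)  # 473 936 4352 4352 10113
```

Under these assumptions the motif features account for 8,704 of the 10,113 candidates, which is worth keeping in mind when reading the later observation that the selected features are dominated by EDMs and BPMs.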

Table 1.

Demographic characteristics of Train and Test cohorts across multiple cancer types

Cancer	BLCA		PRAD		ccRCC		pan_cancer
Dataset	Train	Test	Train	Test	Train	Test	Train	Test
Age (years)
Mean ± SD	63.9 ± 10.3	61.0 ± 12.1	65.8 ± 10.4	67.1 ± 9.0	60.5 ± 11.3	60.4 ± 10.7	64.0 ± 10.8	65.4 ± 11.2
Median (IQR)	65.0(14.0)	60.0(18.0)	67.0(12.75)	69.0(13.5)	61.0(16.25)	62.0(12.0)	66.0(14.75)	66.5(15.0)
Age Group (no., %)
<50	24(8.6)	14(20.0)	29(8.8)	3(3.6)	59(17.6)	13(15.5)	62(10.2)	15(9.9)
50–59	66(23.7)	19(27.1)	48(14.5)	16(19.3)	92(27.4)	22(26.2)	123(20.3)	25(16.4)
60–69	99(35.6)	19(27.1)	130(39.4)	27(32.5)	109(32.4)	35(41.7)	227(37.5)	56(36.8)
70–79	76(27.3)	14(20.0)	100(30.3)	33(39.8)	68(20.2)	12(14.3)	168(27.7)	42(27.6)
≥80	13(4.7)	4(5.7)	23(7.0)	4(4.8)	8(2.4)	2(2.4)	26(4.3)	14(9.2)
Sex (no., %)
Female	63(22.7)	15(21.4)	–	–	82(24.4)	20(23.8)	98(16.2)	23(15.1)
Male	215(77.3)	55(78.6)	330(100.0)	83(100.0)	254(75.6)	64(76.2)	508(83.8)	129(84.9)
Data Collection (no.)
GHETC, PLA	32	9	20	4	56	11	63	14
SZH-AHMU	17	9	13	3	17	3	26	6
FAH-GXMU	106	27	154	34	77	21	219	63
CHH	123	25	143	42	186	49	298	69

IQR interquartile range; SD standard deviation. Values in parentheses indicate interquartile range or percentage. “–” indicates not applicable.

Table 2.

Demographic characteristics of non-cancer and cancer cohorts across multiple cancer types

Cancer	BLCA		PRAD		ccRCC		pan_cancer
Group	Non-cancer	Cancer	Non-cancer	Cancer	Non-cancer	Cancer	Non-cancer	Cancer
Age (years)
Mean ± SD	62.0 ± 10.8	67.1 ± 9.8	64.9 ± 11.2	68.7 ± 6.6	62.0 ± 10.8	58.1 ± 11.5	64.5 ± 11.0	64.2 ± 10.8
Median (IQR)	63.0(16.0)	68.0(13.0)	67.0(15.0)	69.0(9.0)	63.0(16.0)	60.0(16.5)	66.0(15.0)	66.0(14.0)
Age Group (no., %)
<50	34(13.2)	4(4.4)	31(10.9)	1(0.8)	34(13.2)	38(23.3)	34(9.7)	43(10.6)
50–59	71(27.6)	14(15.4)	55(19.3)	9(7.0)	71(27.6)	43(26.4)	80(22.8)	68(16.7)
60–69	84(32.7)	34(37.4)	94(33.0)	63(49.2)	84(32.7)	60(36.8)	116(33.0)	167(41.0)
70–79	59(23.0)	31(34.1)	83(29.1)	50(39.1)	59(23.0)	21(12.9)	96(27.4)	114(28.0)
≥80	9(3.5)	8(8.8)	22(7.7)	5(3.9)	9(3.5)	1(0.6)	25(7.1)	15(3.7)
Sex (no., %)
Female	59(23.0)	19(20.9)	–	–	59(23.0)	43(26.4)	59(16.8)	62(15.2)
Male	198(77.0)	72(79.1)	285(100.0)	128(100.0)	198(77.0)	120(73.6)	292(83.2)	345(84.8)
Data Collection (no.)
GHETC, PLA	33	8	22	2	33	34	33	44
SZH-AHMU	16	10	14	2	16	4	16	16
FAH-GXMU	79	54	123	65	79	19	144	138
CHH	129	19	126	59	129	106	158	209

Values in parentheses indicate interquartile range or percentage. “–” indicates not applicable.

IQR interquartile range, SD standard deviation.

The diagnostic efficacy of the Machine Learning (ML) model

Predictive models were built using LR, SVM, RF, and XGBoost, and the Stacking method was used to combine their predictions for enhanced performance. Each model was parameter-optimized on its respective training set and evaluated for diagnostic efficacy on the test set. In every dataset, the best-performing models distinguished urological tumor patients from non-tumor individuals with high accuracy (Table 3). Specifically, for the BLCA dataset, the AUCs of the Stacking and XGBoost models were 0.93 (95% CI: 0.87–0.98) and 0.96 (95% CI: 0.91–0.99), respectively, effectively distinguishing BLCA patients from HC (Fig. 2A). For the ccRCC dataset, the AUCs of the Stacking and RF models both reached 0.99 (95% CI: 0.97–1.00 and 0.98–1.00, respectively), indicating extremely high accuracy in distinguishing ccRCC patients from HC (Fig. 2B). For the PRAD dataset, the AUCs of the Stacking and XGBoost models were 0.91 (95% CI: 0.84–0.96) and 0.92 (95% CI: 0.85–0.97), respectively, effectively distinguishing PRAD patients from BPH patients and HC (Fig. 2C). For the pan-cancer dataset, the AUCs of the Stacking and XGBoost models were 0.88 (95% CI: 0.82–0.93) and 0.89 (95% CI: 0.83–0.94), respectively, a slight decline compared to the single-cancer models but still a useful pan-cancer diagnostic capability (Fig. 2D). Additionally, Decision Curve Analysis (DCA)²⁷ demonstrated that the net benefit of using these models for screening was higher than both the “screen all” and “no screening” strategies across most threshold probability ranges, indicating clinical utility for early screening of urological tumors using cfDNA fragmentomic feature-based ML models (Fig. 4).
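For readers less familiar with how an AUC and its confidence interval are obtained from predicted probabilities, the following is a minimal sketch. The AUC is computed as the probability that a randomly chosen case scores higher than a randomly chosen control. The percentile bootstrap used for the interval is an assumption, since the text does not state how the reported CIs were derived, and the scores are hypothetical.

```python
import random


def auc(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random
    control, counting ties as one half (Mann-Whitney formulation)."""
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def bootstrap_ci(case_scores, control_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling cases and controls separately.
    Assumption: the paper's own CI method is not specified."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        cases = rng.choices(case_scores, k=len(case_scores))
        controls = rng.choices(control_scores, k=len(control_scores))
        stats.append(auc(cases, controls))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical predicted probabilities, for illustration only.
cases = [0.91, 0.85, 0.78, 0.66, 0.52, 0.40]
controls = [0.60, 0.45, 0.33, 0.30, 0.21, 0.15, 0.08]
point = auc(cases, controls)
low, high = bootstrap_ci(cases, controls)
print(round(point, 3), round(low, 3), round(high, 3))
```

With test sets of 70 to 152 samples, intervals of the width reported in Table 3 are to be expected; the pairwise formulation above is quadratic in sample size but adequate at that scale.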

Table 3.

Evaluation of the prediction performance of five algorithms on the four datasets

Cancer	algorithms	sample_n	auc[CI]	tpr	spc	ppv	npv	f1	acc
BLCA	Stacking	70	0.93[0.87,0.98]	0.94	0.85	0.68	0.98	0.79	0.87
	RF	70	0.87[0.77,0.94]	0.94	0.71	0.53	0.97	0.68	0.77
	LR	70	0.79[0.63,0.92]	0.94	0.31	0.32	0.94	0.48	0.47
	SVC	70	0.83[0.70,0.93]	0.94	0.4	0.35	0.95	0.52	0.54
	XGBoost	70	0.96[0.91,0.99]	0.94	0.87	0.71	0.98	0.81	0.89
ccRCC	Stacking	84	0.99[0.97,1.00]	0.91	0.96	0.94	0.94	0.92	0.94
	RF	84	0.99[0.98,1.00]	0.91	0.98	0.97	0.94	0.94	0.95
	LR	84	0.97[0.93,1.00]	0.91	0.92	0.88	0.94	0.9	0.92
	SVC	84	0.95[0.89,0.99]	0.91	0.88	0.83	0.94	0.87	0.89
	XGBoost	84	0.98[0.96,1.00]	0.91	0.96	0.94	0.94	0.92	0.94
PRAD	Stacking	83	0.91[0.84,0.96]	0.92	0.77	0.65	0.96	0.76	0.82
	RF	83	0.89[0.81,0.95]	0.92	0.81	0.69	0.96	0.79	0.84
	LR	83	0.86[0.76,0.93]	0.92	0.77	0.65	0.96	0.76	0.82
	SVC	83	0.87[0.78,0.94]	0.92	0.77	0.65	0.96	0.76	0.82
	XGBoost	83	0.92[0.85,0.97]	0.92	0.81	0.69	0.96	0.79	0.84
Pan_Cancer	Stacking	152	0.88[0.82,0.93]	0.9	0.67	0.76	0.85	0.83	0.8
	RF	152	0.86[0.80,0.91]	0.9	0.56	0.7	0.83	0.79	0.74
	LR	152	0.82[0.75,0.88]	0.9	0.54	0.7	0.83	0.79	0.74
	SVC	152	0.83[0.76,0.89]	0.9	0.57	0.71	0.83	0.8	0.75
	XGBoost	152	0.89[0.83,0.94]	0.9	0.71	0.79	0.86	0.84	0.82

AUC area under the curve, TPR True Positive Rate, SPC Specificity, PPV Positive Predictive Value, NPV Negative Predictive Value, ACC Accuracy, F1 F1 Score.
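The columns of Table 3 all derive from a 2×2 confusion matrix. The sketch below shows the relationships. The paper does not report the confusion matrices themselves, so the counts used here are hypothetical: they were chosen only because, for a 70-sample test set, they round to the values shown in the XGBoost row for BLCA.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),                    # sensitivity
        "spc": tn / (tn + fp),                    # specificity
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
        "f1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of PPV and TPR
        "acc": (tp + tn) / (tp + fn + tn + fp),   # overall accuracy
    }


# Hypothetical counts (18 cases, 52 controls), not taken from the paper.
metrics = diagnostic_metrics(tp=17, fn=1, tn=45, fp=7)
rounded = {name: round(value, 2) for name, value in metrics.items()}
print(rounded)
```

Note that PPV and NPV depend on the case-to-control ratio of the test set, so they would change in a screening population with lower prevalence even if sensitivity and specificity stayed the same.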

Fig. 2 — The results of the five algorithms for diagnostic model construction were compared in the four datasets. LR logistic regression, RF random forest, SVC Support Vector Classification (SVM for Classification).

Fig. 4 — Net benefit of using the models for screening. The best model is XGBoost for BLCA and RF for ccRCC; the others are LR. The horizontal gray-green line parallel to the x-axis represents screening no patients (Treat None). The red line represents screening all patients (Treat All). DCA decision curve analysis.
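The net benefit plotted in a decision curve can be written out explicitly. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability, together with the corresponding "Treat All" reference (the "Treat None" reference is zero by definition). The labels and probabilities are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of screening everyone whose predicted probability is at
    or above `threshold`; false positives are weighted by the odds of the
    threshold probability."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in which every individual is screened."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical labels (1 = cancer) and model probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(round(nb_model, 3), round(nb_all, 3))  # 0.2 -0.2
```

A model is considered useful at a given threshold when its net benefit exceeds both reference strategies, which is the comparison the curves in Fig. 4 summarize across thresholds.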

Predictive performance of individual features

To explore the contribution of different cfDNA fragmentomic features to predictive performance, we used SHAP values, which provide a model-agnostic way of attributing each prediction to individual features. Figure 3 shows the SHAP values of the most influential features in each dataset’s models. For the BLCA dataset (Fig. 3A), 6bpB_GATGAA, 6bpM_GCGCAG, 6bpM_GCGCCG, and 6bpB_CCCAAA were the most influential. Higher frequencies of 6bpB_GATGAA were associated with higher BLCA risk (red points on the right side), while higher frequencies of 6bpM_GCGCAG, 6bpM_GCGCCG, and 6bpB_CCCAAA were associated with lower BLCA risk (red points on the left side). Similar patterns were observed in the other datasets. For the ccRCC dataset (Fig. 3B), 6bpB_CCTTGA, 6bpB_CCTTGT, 6bpM_TGACAG, and 4bpM_TGTC had the greatest influence on the model’s predictions, with higher frequencies correlating with lower ccRCC risk. In the PRAD dataset (Fig. 3C), 6bpM_TCCTAA, 6bpB_AGATCA, 6bpM_CGTGAA, and 4bpM_CGCA were the most influential features, with higher frequencies of 6bpM_TCCTAA associated with higher PRAD risk, and higher frequencies of 6bpM_CGTGAA and 4bpM_CGCA associated with lower PRAD risk. For the pan-cancer dataset (Fig. 3D), 6bpM_CCTATC, 6bpM_TCTGAG, 6bpM_TCCTAA, and 6bpM_ATGGGT were the most influential features. Higher frequencies of 6bpM_CCTATC and 6bpM_TCCTAA were associated with higher cancer risk, while higher frequencies of 6bpM_TCTGAG were associated with lower cancer risk. It is important to note that feature importance rankings do not indicate causality; SHAP analysis can only suggest candidate mechanisms in urological tumor development for further study.
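To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values for a toy model by averaging each feature's marginal contribution over all feature orderings. This brute-force approach is not what the SHAP library does internally for tree or kernel explainers, and it is only feasible for a handful of features. The linear risk score, its weights, and the motif frequencies are hypothetical; only the feature names and the signs of their effects follow the BLCA description above.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over all orderings. Features not yet revealed stay at the baseline."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [total / factorial(n) for total in phi]


# Hypothetical linear risk score over three motif frequencies.
FEATURES = ["6bpB_GATGAA", "6bpM_GCGCAG", "6bpM_GCGCCG"]
WEIGHTS = [2.0, -1.5, -0.5]   # illustrative values only
BIAS = 0.1


def risk_score(values):
    return BIAS + sum(w * v for w, v in zip(WEIGHTS, values))


baseline = [0.030, 0.020, 0.010]   # hypothetical cohort-mean frequencies
sample = [0.050, 0.015, 0.010]     # hypothetical patient
phi = shapley_values(risk_score, sample, baseline)
for name, value in zip(FEATURES, phi):
    print(name, round(value, 4))
```

Two properties are visible in this toy case: the values sum to the difference between the sample's score and the baseline score, and a feature equal to its baseline receives a value of zero. The sign of each value corresponds to which side of the axis a point falls on in a plot such as Fig. 3.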

Fig. 3 — The impact of different features on the prediction, illustrated by SHAP values. Each panel plots the nine most relevant features, with the remaining features grouped together, for predicting cancer risk. The y-axis lists the cfDNA fragmentomic features and the x-axis shows SHAP values, with larger absolute values indicating greater influence on the model’s predictions. Color represents the feature value, with blue indicating lower values and red indicating higher values. SHAP Shapley Additive exPlanations.

The clinical and economic benefits of the model

Each model selected its own variables to build the optimal model for its specific cancer. To further optimize screening, we propose a tiered approach that starts with the pan-cancer model for initial screening, followed by differential diagnosis using the most important features for specific cancers. This strategy can reduce costs and improve screening efficiency. The clinical utility of the models was evaluated using DCA, comparing the models against a no-model approach across a range of threshold probabilities. In this evaluation, the optimal model in each dataset demonstrated the highest net benefit, showing substantial clinical utility compared to the absence of a model (Fig. 4). Adjusting the decision thresholds of the models, for example to fix sensitivity at 95%, still yielded acceptable diagnostic performance. Figure 5 presents a waterfall plot illustrating the tiered screening strategy. Setting the decision threshold of the pan-cancer model at 0.93 identified 95% of cancer patients. For these suspected patients, differential diagnosis using the BLCA, ccRCC, and PRAD models with decision thresholds of 0.57, 0.84, and 0.76, respectively, identified 95% of patients with the corresponding cancers. This tiered strategy maintains high sensitivity while reducing screening costs, indicating potential clinical and economic benefits.
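The tiered strategy can be sketched in two small functions: one that picks a cutoff achieving a target sensitivity on a set of case scores, and one that applies the pan-cancer model first and the cancer-specific models only to samples that pass. The cutoffs 0.93, 0.57, 0.84, and 0.76 are the thresholds quoted above; the rule "positive if score is at or above the cutoff", the function names, and all scores are assumptions made for illustration.

```python
import math


def threshold_for_sensitivity(case_scores, target=0.95):
    """Largest cutoff such that at least `target` of the cases score at or
    above it. The small tolerance guards against float rounding in
    target * n."""
    ranked = sorted(case_scores, reverse=True)
    k = math.ceil(target * len(ranked) - 1e-9)   # cases that must be caught
    return ranked[k - 1]


def tiered_screen(pan_score, specific_scores, pan_cutoff, specific_cutoffs):
    """Tier 1: pan-cancer model. Tier 2: cancer-specific models, applied
    only when tier 1 is positive. Returns the cancer types flagged."""
    if pan_score < pan_cutoff:
        return []
    return [cancer for cancer, score in specific_scores.items()
            if score >= specific_cutoffs[cancer]]


# Cutoffs quoted in the text; sample scores below are hypothetical.
PAN_CUTOFF = 0.93
SPECIFIC_CUTOFFS = {"BLCA": 0.57, "ccRCC": 0.84, "PRAD": 0.76}

toy_case_scores = [i / 20 for i in range(1, 21)]   # 0.05, 0.10, ..., 1.00
cutoff = threshold_for_sensitivity(toy_case_scores, target=0.95)

sample_scores = {"BLCA": 0.60, "ccRCC": 0.20, "PRAD": 0.50}
print(cutoff, tiered_screen(0.95, sample_scores, PAN_CUTOFF, SPECIFIC_CUTOFFS))
```

Fixing sensitivity in this way trades specificity for coverage, so the cost saving of the second tier depends on how many non-cancer samples pass the first.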

Fig. 5 — The y-axis represents the predicted probability values. The x-axis represents the individual samples, ordered from the lowest to the highest predicted probability. Yellow bars indicate the treatment group cases; green bars indicate the control group cases. The horizontal lines represent the cutoff points corresponding to a sensitivity of 90% and 95%.

Discussion

In this study, we developed a cfDNA fragmentomics-based detection model specifically for urological tumors; to our knowledge, this is the first such model to be proposed. Our investigation yielded two pivotal findings. Firstly, we demonstrated that machine learning models utilizing plasma cfDNA fragmentomic features can efficiently differentiate early-stage urological tumor patients from non-cancer individuals. As evidenced by our findings and supported by previous studies, whole-genome sequencing (WGS) enables the examination of tumor-driven cfDNA fragment distribution and frequency, allowing for the detection of tumors at their nascent stages. Notably, this method proves effective even with low-depth (1X) sequencing, showcasing its broad application potential in urological tumors.

Secondly, we validated that specific 6 bp EDM and BPM features play a central role in distinguishing urological tumors. Building on previous research, we incorporated a range of candidate features: fragment size ratio (FSR), fragment size distribution (FSD), EDM (4 bp and 6 bp), and BPM (4 bp and 6 bp), for a total of 10,113 features. Our initial feature selection relied on individual t-tests, and we acknowledge that we did not apply a formal multiple testing correction at this stage, which could increase the risk of false positives. However, we subsequently applied Recursive Feature Elimination combined with cross-validation (RFECV) and SHAP value analysis to refine the feature set and identify the features most critical for distinguishing urological tumors. These findings align with similar feature sets identified in early lung cancer detection models¹⁶. The biological relevance of end motifs is supported by Jiang et al.¹¹, who demonstrated their association with cancer through nuclease regulation and their tissue-specific nature, and who built SVM and LR models based on 4 bp EDMs to predict hepatocellular carcinoma with an AUC of 0.89. Furthermore, Guo et al.¹⁶ used 6 bp BPMs to build an LR model for early-stage lung adenocarcinoma detection, highlighting the potential of these features in early cancer diagnosis. As reviewed by Lo et al.²³, the fragmentation patterns of cfDNA, including these motifs, are known to be influenced by fundamental biological processes such as nucleosomal organization and nuclease activity, providing a plausible biological basis for the alterations we observed in urological cancer patients. Interestingly, our feature selection process did not highlight FSR and FSD features as significant contributors to the final models, despite their known importance in other cancers. This observation warrants further investigation in the context of early urological tumor detection and suggests that EDM and BPM features may be more informative than fragment size features in our cohort.

Moreover, we proposed a novel two-tiered screening strategy. Initially, the pan-cancer model served as the preliminary screening tool for urological tumors. Subsequently, based on identified key targets, we conducted differential localization for specific cancers such as bladder, prostate, and kidney. By fine-tuning threshold settings, this approach enhanced screening efficiency and cost-effectiveness, making it a practical option for clinical deployment.

When compared to previously reported cfDNA-based screening models, our study further validates the immense potential of cfDNA fragmentomics in the early detection of urological cancers. For instance, one study²⁴ reported a sensitivity of 83.1% for colorectal cancer detection in a large cohort, while another²⁸ used ML methods to model primary liver cancer, colorectal adenocarcinoma, and lung adenocarcinoma based on cfDNA, achieving an AUC of 0.983 for distinguishing cancer patients from healthy individuals. Notably, their study utilized 1X WGS data derived from downsampling 5X WGS, whereas our study employed direct 1X sequencing. Furthermore, we elucidated the model’s strong predictive performance through SHAP plots, highlighting the significant contributions of 6bp EDM and BPM. Cristiano et al.²⁰ developed the DELFI model, which constructed machine learning models for seven cancers—including breast, bile duct, colorectal, gastric, lung, ovarian, and pancreatic cancers—with AUCs ranging from 0.86 to 0.94, and also highlighted the contribution of fragment size features in distinguishing cancer types. Similarly, Zhang et al.¹² proposed the use of FSD features for modeling primary liver cancer and non-cancer individuals, achieving an AUC of 0.995 (95% CI: 0.984–0.998), though their sample sequencing depth was down-sampled to 4X and the specific contributions of individual features were not thoroughly explained. In contrast, our study not only directly employed 1X sequencing data but also validated the crucial role of 6bp EDM and BPM in the early detection of urological tumors through comprehensive feature selection and model interpretation.

At present, more advanced tumor screening methods are centered around cfDNA methylation detection, which typically relies on immunoprecipitation or targeted enrichment techniques that analyze a limited number of loci. For instance, the ctDNA-based SEPT9 test has been approved by the U.S. Food and Drug Administration (FDA) as an effective early and non-invasive screening tool for colorectal adenocarcinoma and has been integrated into clinical practice²⁹. However, detecting ctDNA in the early stages of tumors remains challenging due to its low concentration. Consequently, cfDNA methylation-based methods have been widely used in both single-cancer and pan-cancer early screening studies. One study³⁰ reported a multi-cancer detection method based on cfDNA methylation with a specificity of 99.3%, although the sensitivity for detecting 12 cancers (including BLCA) increased with tumor stage—39% for Stage I, 69% for Stage II, and 83% for Stage III. Clearly, there is room for improvement in detecting early-stage (Stage I) cancers. In contrast, our study leveraged a cfDNA fragmentomic feature-based detection approach, which demonstrated considerable potential for early screening of urological tumors by assessing changes in the distribution and frequency of cfDNA fragments.

While more advanced ctDNA detection methods often focus on methylation patterns or targeted sequencing of known mutations, our study employed a tumor-agnostic, genome-wide fragmentomics approach. We acknowledge that ctDNA levels may be low, particularly in early-stage PRAD and ccRCC, which could limit the direct detection of tumor-derived DNA. In a preliminary analysis of 10 randomly selected tumor samples using ichorCNA, we observed a Tumor Fraction of 0 in all cases, consistent with this concern. However, our machine learning models still achieved high AUCs in distinguishing cancer patients, suggesting that the detected fragmentomic features, including fragment size distribution and end/breakpoint motifs, may reflect broader cancer-associated changes in cfDNA fragmentation patterns, even in the absence of a tumor-derived fraction detectable by ichorCNA. This is consistent with findings from other studies utilizing similar fragmentomics approaches in early cancer detection, where subtle but consistent alterations in cfDNA fragmentation have shown diagnostic potential.

A primary limitation of the current study is the lack of an independent external validation cohort. We are strongly committed to validating our findings through external validation in future prospective studies to further establish the generalizability of our model. While our work outlines a framework with potential clinical applications, our primary focus remains on validating the diagnostic performance of cfDNA fragmentomics for urological tumors. This study warrants further mechanistic investigations and causal analysis to validate the conclusions. The underlying biological mechanisms—such as the influence of chromatin accessibility and nuclease activity on cfDNA fragmentation and EDM preferences^20,21—are not fully understood, and future studies are needed to elucidate these pathways. Additionally, variations in tumor pathogenesis, such as the genetic predisposition and environmental exposures influencing BLCA compared to the mutation-driven nature of PRAD and ccRCC^31,32, require further investigation to fully interpret the observed cfDNA signatures. While we aimed to have a control group generally comparable in age, we acknowledge that a formal co-morbidity matching analysis was not explicitly performed in this study, which is another potential limitation to consider. Furthermore, we acknowledge that the balanced or near-balanced case-control ratio in our test sets might lead to an overestimation of the model’s performance in real-world scenarios where the prevalence of urological cancers is likely lower.

In summary, our study establishes that cfDNA fragmentomic analysis, based on low-coverage WGS and advanced machine learning techniques, offers a highly accurate and minimally invasive approach for the early detection of urological tumors. By capturing changes in cfDNA fragment distribution and frequency, even in samples with undetectable tumor fraction, our method highlights the diagnostic value of genome-wide fragmentation patterns in early-stage cancers. These findings not only advance current understanding of cfDNA biology in urological tumors but also provide a new methodological framework for future multi-cancer screening models based on cfDNA fragmentomics. With continued optimization, rigorous external validation, and deeper mechanistic exploration, we believe this approach holds significant promise for clinical translation, potentially contributing to more effective and earlier cancer diagnosis across diverse populations.

Methods

Study design and participants

This large-scale, multicenter case-control study aims to identify and validate the diagnostic efficacy of cfDNA fragmentomic features in predicting early urological tumors. A total of 758 participants were recruited from four medical centers: the First Affiliated Hospital of Guangxi Medical University, Changhai Hospital, General Hospital of Eastern Theater Command, PLA, and Suzhou Hospital of Anhui Medical University (Suzhou Municipal Hospital). Inclusion criteria for HCs were as follows: (1) age over 18; (2) no major health conditions based on routine medical examinations, including clinical tumor markers within the normal range and no abnormal masses detected on imaging; (3) no history of cancer; and (4) informed consent. Exclusion criteria included (1) history of blood transfusion within 3 months and (2) inability to understand the study. Inclusion criteria for patients were: (1) age over 18; (2) newly diagnosed patients who had not undergone any tumor treatments, such as immunotherapy, radiotherapy, chemotherapy, or anti-tumor medications; (3) primary tumors pathologically confirmed as BLCA, PRAD, or ccRCC, staged T1 or T2, without lymph node or distant metastasis, and resectable; (4) no significant liver or kidney dysfunction; and (5) informed consent. Exclusion criteria for patients included (1) concurrent or prior malignancies, (2) active infectious diseases, (3) history of blood transfusion within 3 months, and (4) inability to understand the study. Clinical and imaging diagnoses were confirmed using WHO classification, and TNM staging was based on the AJCC 2017 system. Early-stage cancer was defined according to the eighth edition of the American Joint Committee on Cancer (AJCC) TNM staging criteria, covering stages 0 (carcinoma in situ), IA, IB, IIA, and IIB. All participants provided written informed consent for the use of their blood samples and clinical data prior to sample collection. 
This study was conducted in accordance with national guidelines and approved by the Ethics Committee of Guangxi Medical University (Approval No. 2022-0154), in compliance with the Declaration of Helsinki.

The study workflow is illustrated in Fig. 1 and comprises three main steps: 1) Sample Processing and Sequencing: Involves sample collection, cfDNA extraction, library construction, and low-coverage whole-genome sequencing to obtain raw data. 2) Feature Identification and Selection: Focuses on processing sequencing data, extracting, and selecting significant cfDNA fragmentomic features. 3) Model Training and Selection: Diagnostic models were developed using LR, SVM, RF, XGBoost, and stacking algorithms based on selected features, with the optimal model selected based on performance metrics.

Sample processing and sequencing

Sample collection and plasma cfDNA extraction

Blood samples were collected in ethylenediaminetetraacetic acid anticoagulant vacuum tubes. A 5 mL fasting peripheral blood sample was drawn from each study participant. Immediately after collection, the tubes were gently inverted 8–10 times to ensure thorough mixing. The samples were then transported under cold conditions to the laboratory, where plasma separation and storage were completed within 2 h post-collection. Blood samples were centrifuged at 1500 g for 10 min at 4 °C. After centrifugation, the supernatant (plasma) was aliquoted into cryovials in 500 μL portions using a pipette and stored in a −80 °C freezer until further use. Although larger plasma volumes (800–1500 µL) produced significantly higher cfDNA concentrations (P < 0.05), sequencing libraries constructed from 500 µL consistently passed quality control metrics and generated stable, high-quality lcWGS data (Supplementary Materials Section 1), confirming the adequacy of this volume. Cell-free DNA (cfDNA) was extracted from the plasma samples using the MagMAX Cell-free DNA Isolation Kit (Thermo Fisher Scientific, USA), strictly following the manufacturer’s instructions. The concentration of cfDNA in the plasma samples was quantified using the Qubit 4.0 Fluorometer (Thermo Fisher Scientific, USA). To assess the distribution of DNA fragment lengths, the Bioptic QSEP-100 Automated Nucleic Acid Protein Analysis System (Bioptic, Taiwan) was employed.

Plasma cfDNA library construction and low-coverage WGS

Following quality control approval of the cfDNA, each cfDNA sample was processed using the VAHTS Universal DNA Library Prep Kit for Illumina V3 (Vazyme Biotech Co., Ltd., China), adhering strictly to the manufacturer’s protocol. This involved end repair, phosphorylation, “A” tailing, and ligation of sequencing-specific adapters. Subsequently, the cfDNA was enriched via PCR to construct whole-genome sequencing (WGS) libraries. The qualified libraries were sequenced on the BGI MGISEQ-2000RS platform, achieving an average coverage depth of 1× using paired-end sequencing mode, with each end sequenced at 100 base pairs (bp).

Feature identification and selection

Bioinformatic processing of sequencing data

Sequencing data underwent initial quality control to remove reads with over 10% indeterminate bases and adapter sequences. Reads with more than 40% low-quality bases (Q ≤ 15) were also excluded. The resultant clean reads were assessed using FastQC (v.0.11.9) and aligned to the human reference genome hg19 via BWA software (v.0.7.16). Post-alignment, Samtools was employed to calculate the mapping rate, duplicate rate, and genome coverage. Sambamba (v.1.0.0) was used to remove duplicates and reads with a mapping rate under 30%.
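The two read-level filters described above can be illustrated with a minimal sketch. It assumes Phred+33 quality encoding and applies the thresholds exactly as stated (more than 10% indeterminate bases, or more than 40% of bases with Q ≤ 15); the function names are hypothetical, adapter removal is not shown, and this is not the tool used in the study.

```python
def phred33_to_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_qc(seq, quals, max_n_frac=0.10, max_lowq_frac=0.40, lowq_cutoff=15):
    """Return True if a read survives both filters.

    seq   -- base calls as a string, e.g. "ACGTN"
    quals -- per-base Phred scores (list of ints), same length as seq
    """
    if not seq or len(seq) != len(quals):
        return False
    # Fraction of indeterminate (N) bases in the read.
    n_frac = seq.upper().count("N") / len(seq)
    # Fraction of bases at or below the low-quality cutoff (Q <= 15).
    lowq_frac = sum(q <= lowq_cutoff for q in quals) / len(quals)
    # A read is removed only when a fraction exceeds its threshold.
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac
```

For paired-end data, one plausible convention is to discard a pair when either mate fails; the source does not say which convention was applied.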

Identification of fragmentomics features

We divided the hg19 autosomes into 572 adjacent, non-overlapping regions of 5 Mb each and excluded low-mappability regions³³ and Duke blacklisted regions²⁰, leaving 473 usable regions. The insert size of each read pair, calculated from the paired-end reads aligned to the reference genome, was taken as the cfDNA fragment length. Fragments were grouped into 24 length bins spanning 100 to 220 bp in 5 bp increments. We then calculated the proportion of fragments in each length bin on each of the 39 autosomal chromosome arms to obtain Fragment Size Distribution (FSD) features, resulting in 936 (39 × 24) FSD features per sample. To obtain Fragment Size Ratio (FSR) features, fragments of 100–150 bp were defined as short and those of 151–220 bp as long. The FSR for each of the 473 regions was calculated as the ratio of short to long fragments, yielding 473 FSR features per sample.
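As a rough illustration of the two size-based feature types, the sketch below computes an FSD vector for one chromosome arm and an FSR value for one 5 Mb region from a list of fragment lengths. The bin boundaries are an assumption: the source does not say how the 220 bp endpoint is handled, so it is folded into the last bin here. Function names are hypothetical.

```python
from collections import Counter

BIN_START, BIN_END, BIN_WIDTH = 100, 220, 5
N_BINS = (BIN_END - BIN_START) // BIN_WIDTH  # 24 bins


def size_bin(length):
    """Map a fragment length to a bin index 0..23, or None if out of range.

    Assumption: bins are [100,105), [105,110), ... and 220 bp is placed in
    the last bin; the source does not specify the boundary handling.
    """
    if length < BIN_START or length > BIN_END:
        return None
    return min((length - BIN_START) // BIN_WIDTH, N_BINS - 1)


def fsd(lengths):
    """Proportion of fragments in each size bin for one chromosome arm."""
    counts = Counter(b for b in map(size_bin, lengths) if b is not None)
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(N_BINS)]


def fsr(lengths):
    """Short (100-150 bp) to long (151-220 bp) ratio for one 5 Mb region."""
    short = sum(100 <= n <= 150 for n in lengths)
    long_ = sum(151 <= n <= 220 for n in lengths)
    return short / long_ if long_ else float("nan")
```

Applying `fsd` to each of the 39 arms gives 39 × 24 = 936 values per sample, and applying `fsr` to each of the 473 regions gives 473 values, matching the feature counts stated above.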

Besides fragment size-related features, cfDNA end motifs (EDMs) and breakpoint motifs (BPMs) were also extracted. Following the approaches described by Jiang et al. and Guo et al.¹¹,¹⁶, we extracted EDMs and BPMs of two lengths (4 bp and 6 bp) to characterize cfDNA fragmentation profiles. Specifically, a 4-bp EDM was defined as the first 4 nucleotides read from the 5′ end of a cfDNA fragment towards its 3′ end, and a 6-bp EDM as the first 6 nucleotides in the same direction. For BPMs, the 2 nucleotides immediately upstream and the 2 nucleotides immediately downstream of the cleavage site (marked by the first nucleotide at the 5′ end) were defined as a 4-bp BPM, and the corresponding 3 upstream and 3 downstream nucleotides as a 6-bp BPM. This yielded 256 (4^4) possible features each for 4-bp EDMs and 4-bp BPMs, and 4096 (4^6) each for 6-bp EDMs and 6-bp BPMs. The frequency of every motif of each type (4-bp EDM, 6-bp EDM, 4-bp BPM, and 6-bp BPM) was calculated.

To account for differences in sequencing depth and fragment counts between samples, frequencies were normalized by dividing the count of each specific EDM or BPM by the total count of all identified end or breakpoint motifs of that type within the sample, which scales the values to [0,1]. This approach allows detailed analysis of the nucleotide patterns associated with cfDNA fragmentation and processing biases. The overall procedure was adapted from previously validated fragmentomic pipelines¹¹,¹⁶ and optimized for low-coverage whole-genome sequencing (lcWGS) data to ensure feature robustness and computational efficiency.

To ensure that all features were on a similar scale across batches before machine learning models were trained, all extracted features (the normalized EDM and BPM frequencies as well as the FSR and FSD features) were subsequently standardized using z-scores. The z-score for each sample’s feature value x was calculated as

z = (x − u)/s,

where u is the mean of the training samples for that feature and s is the standard deviation of the training samples for the same feature.

Finally, all features were combined, resulting in 10,113 features per sample (936 FSD, 473 FSR, 2 × 256 4-bp motif, and 2 × 4096 6-bp motif features). The following datasets were constructed: PRAD vs. non-PRAD (HC + BPH, male only); BLCA vs. HC; ccRCC vs. HC; and pan-cancer vs. non-cancer, in which the positive cases of the three cancers formed the observation group and all other participants formed the control group.
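The motif definitions, frequency normalization, and z-score standardization described above can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: it handles only forward-strand fragments (reverse-strand ends would need reverse complementing), takes the reference as a plain string with 0-based coordinates, skips motifs containing non-ACGT bases, and uses the population standard deviation for s, which the source does not specify. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def end_motif(ref, start, k):
    """First k bases of a forward-strand fragment whose 5' end is ref[start]."""
    if start < 0 or start + k > len(ref):
        return None
    return ref[start:start + k]


def breakpoint_motif(ref, start, k):
    """k/2 bases upstream plus k/2 bases downstream of the cleavage site."""
    half = k // 2
    if start - half < 0 or start + half > len(ref):
        return None
    return ref[start - half:start + half]


def motif_frequencies(motifs):
    """Divide each motif count by the total count, giving values in [0, 1]."""
    counts = Counter(m for m in motifs if m and set(m) <= set("ACGT"))
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}


def fit_zscore(train_values):
    """Estimate (u, s) from training samples only, to avoid test-set leakage."""
    return mean(train_values), pstdev(train_values)


def zscore(x, u, s):
    """z = (x - u) / s; constant features (s == 0) are mapped to 0."""
    return (x - u) / s if s else 0.0


# Feature count check: FSD + FSR + two 4-bp motif sets + two 6-bp motif sets.
N_FEATURES = 39 * 24 + 473 + 2 * 4**4 + 2 * 4**6  # 10,113
```

The same `(u, s)` pair fitted on the training set would be reused to transform test samples. The total of 10,113 is reproduced only when 256 features are counted for each 4-bp motif type and 4096 for each 6-bp motif type.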

The FSR, FSD, EDM, and BPM features obtained from the lcWGS data served as candidate predictors. t-tests were first used to identify features that differed significantly between case and control groups (P < 0.01). Recursive feature elimination with cross-validation (RFECV), guided by SHAP values, was then used for secondary feature selection, ultimately yielding the optimal predictive features for each dataset.
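A two-stage selection of this kind might look like the sketch below, which uses SciPy and scikit-learn. Note the substitution: scikit-learn's `RFECV` ranks features by the estimator's own coefficients, so logistic-regression coefficients stand in here for the SHAP-based ranking used in the study. The function name, the choice of estimator, and the seed are assumptions, and selection would normally be run on training data only.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def select_features(X, y, alpha=0.01, seed=0):
    """Univariate t-test filter followed by RFECV.

    X -- array of shape (n_samples, n_features), already standardized
    y -- binary labels (1 = case, 0 = control)
    Returns the column indices of X that survive both stages.
    """
    # Stage 1: keep features whose case/control means differ at P < alpha.
    _, pvals = ttest_ind(X[y == 1], X[y == 0], axis=0)
    keep = np.flatnonzero(pvals < alpha)
    if keep.size == 0:
        return keep

    # Stage 2: recursive elimination scored by cross-validated AUC.
    # Assumption: coefficient magnitude replaces SHAP importance here.
    selector = RFECV(
        estimator=LogisticRegression(max_iter=1000),
        step=1,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
        scoring="roc_auc",
    )
    selector.fit(X[:, keep], y)
    return keep[selector.support_]
```

With roughly ten thousand candidate features, a P < 0.01 filter alone would be expected to pass on the order of a hundred features by chance, which is one reason a second, multivariate stage is useful.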

Model training and selection

Each dataset was randomly split into training (80%) and independent test (20%) sets using stratified random sampling. Five machine learning algorithms were used to construct diagnostic models: LR, SVM, RF, XGBoost³⁴, and Stacking³⁵. XGBoost, an optimized gradient boosting decision tree, uses a second-order (Newton-type) approximation to minimize the loss function, includes regularization terms in the loss function, and supports parallel processing, which significantly improves scalability and training speed. Stacking is a layered model ensemble framework that trains a secondary classifier on the predictions of multiple primary classifiers to improve overall prediction accuracy and robustness. Both XGBoost and Stacking are widely used in applied machine learning, including data mining competitions such as those hosted on Kaggle.

We employed five-fold cross-validation for model selection and hyperparameter optimization, using the AUC as the primary performance metric. A random search strategy was used to explore the hyperparameter space, and the optimal configurations were selected based on the mean AUC across folds. The detailed hyperparameter search space, including parameter ranges and candidate values, is provided in Supplementary Table S5. All procedures were run with a fixed random seed to ensure reproducibility. Finally, the predictive performance of the models was evaluated on the test set, and their practical applicability was assessed in the context of clinical practice.
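The split, stacking ensemble, and random search described in this subsection can be sketched with scikit-learn as below. Several things are assumptions made for illustration: the seed value, the small search space (the real one is in Supplementary Table S5), the choice of logistic regression as the meta-learner, and the omission of XGBoost from the base learners to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.svm import SVC

SEED = 42  # hypothetical; the seed used in the study is not reported here


def train_and_evaluate(X, y, n_iter=10):
    # Stratified 80/20 split keeps the case/control ratio in both sets.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=SEED)

    # Stacking: base learners pass out-of-fold predictions to a meta-learner.
    stack = StackingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("svm", SVC()),
            ("rf", RandomForestClassifier(n_estimators=25, random_state=SEED)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )

    # Random search over an illustrative space, scored by mean AUC across
    # five stratified folds of the training set.
    search = RandomizedSearchCV(
        stack,
        param_distributions={
            "lr__C": [0.01, 0.1, 1.0, 10.0],
            "svm__C": [0.1, 1.0, 10.0],
            "rf__max_depth": [3, 5, None],
        },
        n_iter=n_iter,
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED),
        random_state=SEED,
    )
    search.fit(X_tr, y_tr)

    # The held-out test set is touched once, after model selection.
    test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
    return search.best_params_, search.best_score_, test_auc
```

Keeping the test set out of both the search and any preceding feature selection is what makes the final AUC an estimate of performance on unseen samples.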

Statistical analysis

Continuous variables were described using medians and interquartile ranges. The predictive ability of the models for different urological tumors was evaluated using receiver operating characteristic (ROC) curves, with the AUC as the evaluation metric. Decision curve analysis (DCA)²⁷ was used to assess the clinical utility of the models by quantifying the net benefit across a range of threshold probabilities. Additionally, waterfall plots were created to display the predicted probability for each individual patient, further assessing the clinical applicability of the models. To determine the diagnostic threshold for each model, sensitivity was fixed at approximately 90% and 95% to minimize the risk of missing tumor diagnoses; the corresponding cutoff value was then determined, and the resulting sensitivity, specificity, and their 95% confidence intervals (CIs) were calculated. Furthermore, the SHAP method was used to evaluate the contribution of key predictive variables to the models’ predictions, enhancing the interpretability of the models²⁶. Model calculations were performed using Python 3.9, and DCA curves and waterfall plots were generated using R 4.1.0.
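Two of the calculations described above lend themselves to a short worked sketch: choosing the cutoff that achieves a target sensitivity, and computing the net benefit used in decision curve analysis, NB = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The helper names are hypothetical, a sample is called positive when its score is at or above the cutoff, and confidence intervals are not computed here.

```python
import math


def confusion(y_true, scores, cutoff):
    """Return (TP, FP, TN, FN) when score >= cutoff is called positive."""
    tp = sum(y == 1 and s >= cutoff for y, s in zip(y_true, scores))
    fn = sum(y == 1 and s < cutoff for y, s in zip(y_true, scores))
    fp = sum(y == 0 and s >= cutoff for y, s in zip(y_true, scores))
    tn = sum(y == 0 and s < cutoff for y, s in zip(y_true, scores))
    return tp, fp, tn, fn


def threshold_for_sensitivity(y_true, scores, target=0.90):
    """Highest cutoff whose sensitivity is at least `target`."""
    pos = sorted((s for s, y in zip(scores, y_true) if y == 1), reverse=True)
    if not pos:
        raise ValueError("no positive samples")
    # At least k positives must score at or above the cutoff; the small
    # epsilon guards against floating-point overshoot in target * n.
    k = max(1, math.ceil(target * len(pos) - 1e-9))
    return pos[k - 1]


def sens_spec(y_true, scores, cutoff):
    tp, fp, tn, fn = confusion(y_true, scores, cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def net_benefit(y_true, scores, pt):
    """Net benefit at threshold probability pt (scores are probabilities)."""
    tp, fp, _, _ = confusion(y_true, scores, pt)
    n = len(y_true)
    return tp / n - fp / n * pt / (1 - pt)
```

Plotting `net_benefit` over a grid of pt values, alongside the treat-all and treat-none strategies, gives the decision curve. Fixing sensitivity first and reading off specificity reflects the screening priority stated above, where a missed tumor is treated as costlier than a false alarm.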

Supplementary information

Supplementary Information (PDF, 314.5 KB)

Supplementary Data S1 (XLSX, 266.3 KB)

Supplementary Data S2 (XLSX, 12.1 KB)

Supplementary Data S3 (XLSX, 1.2 MB)

Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (NSFC) (82372828, F.W.), the Science Foundation for Distinguished Young Scholars of Guangxi (2023GXNSFFA026003, F.W.), the Science and Technology Major Project of Guangxi (AA22096030 and AA22096032), the Science Foundation for Distinguished Young Scholars of Guangxi Medical University (F.W.), and the Yongjiang Program of Nanning (2021015, F.W.).

Author contributions

Fubo Wang led the project. Fubo Wang, Yanling Hu, and Junyi Chen designed the study. Huiyong Zhang, Caihong Huang, Chunmeng Wei, Wenhao Lu, Zuheng Wang, Xiao Li, Dianyu Wang, Jin Ji, and Rirong Yang performed the research. Huiyong Zhang, Caihong Huang, Rongbin Zhou, Chengbang Wang, and Yuexiang Li analyzed the data. Huiyong Zhang, Caihong Huang, Chunmeng Wei, and Shaohua Chen wrote and edited the manuscript. Fubo Wang, Yanling Hu, and Junyi Chen critically reviewed the manuscript. All authors approved the final submission and publication.

Data availability

The data supporting the findings of this study are not publicly available due to patient privacy considerations and institutional regulations. Access to the data may be granted by the corresponding author upon reasonable request. The code used in this study is available at: https://gitee.com/guangxi-medical-university/cfdnasubmint.git.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Huiyong Zhang, Caihong Huang, Chunmeng Wei.

Contributor Information

Junyi Chen, Email: chenjunyidoctor@163.com.

Yanling Hu, Email: huyanling@gxmu.edu.cn.

Fubo Wang, Email: wangfubo@gxmu.edu.cn.

Supplementary information

The online version contains supplementary material available at 10.1038/s41698-025-01130-1.

References

1. Bray, F. et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 74, 229–263 (2024).
2. American Cancer Society. Survival rates for kidney cancer. https://www.cancer.org/cancer/types/kidney-cancer/detection-diagnosis-staging/survival-rates.html (2024).
3. Padala, S. A. et al. Epidemiology of renal cell carcinoma. World J. Oncol. 11, 79–87 (2020).
4. Kuru, T. H. et al. Histology core-specific evaluation of the European Society of Urogenital Radiology (ESUR) standardised scoring system of multiparametric magnetic resonance imaging (mpMRI) of the prostate. BJU Int. 112, 1080–1087 (2013).
5. Cary, K. C. & Cooperberg, M. R. Biomarkers in prostate cancer surveillance and screening: past, present, and future. Ther. Adv. Urol. 5, 318–329 (2013).
6. Das, C. J., Razik, A., Sharma, S. & Verma, S. Prostate biopsy: when and how to perform. Clin. Radiol. 74, 853–864 (2019).
7. van Rhijn, B. W. G., van der Poel, H. G. & van der Kwast, T. H. Urine markers for bladder cancer surveillance: a systematic review. Eur. Urol. 47, 736–748 (2005).
8. Heitzer, E., Auinger, L. & Speicher, M. R. Cell-free DNA and apoptosis: how dead cells inform about the living. Trends Mol. Med. 26, 519–528 (2020).
9. Mouliere, F. et al. Enhanced detection of circulating tumor DNA by fragment size analysis. Sci. Transl. Med. 10, eaat4921 (2018).
10. Snyder, M. W., Kircher, M., Hill, A. J., Daza, R. M. & Shendure, J. Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell 164, 57 (2016).
11. Jiang, P. et al. Plasma DNA end-motif profiling as a fragmentomic marker in cancer, pregnancy, and transplantation. Cancer Discov. 10, 664–673 (2020).
12. Zhang, X. et al. Ultrasensitive and affordable assay for early detection of primary liver cancer using plasma cell-free DNA fragmentomics. Hepatology 76, 317–329 (2022).
13. Foda, Z. H. et al. Detecting liver cancer using cell-free DNA fragmentomes. Cancer Discov. 13, 616–631 (2023).
14. Tao, K. et al. Machine learning-based genome-wide interrogation of somatic copy number aberrations in circulating tumor DNA for early detection of hepatocellular carcinoma. eBioMedicine 56, 102811 (2020).
15. Mathios, D. et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat. Commun. 12, 5060 (2021).
16. Guo, W. et al. Sensitive detection of stage I lung adenocarcinoma using plasma cell-free DNA breakpoint motif profiling. eBioMedicine 81, 104131 (2022).
17. Wang, S. et al. Multidimensional cell-free DNA fragmentomic assay for detection of early-stage lung cancer. Am. J. Respir. Crit. Care Med. 207, 1203–1213 (2023).
18. Ma, X. et al. Multi-dimensional fragmentomic assay for ultrasensitive early detection of colorectal advanced adenoma and adenocarcinoma. J. Hematol. Oncol. 14, 175 (2021).
19. Nguyen, V. T. C. et al. Multimodal analysis of methylomics and fragmentomics in plasma cell-free DNA for multi-cancer early detection and localization. Elife 12, RP89083 (2023).
20. Cristiano, S. et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385–389 (2019).
21. Jiang, P. et al. Preferred end coordinates and somatic variants as signatures of circulating tumor DNA associated with hepatocellular carcinoma. Proc. Natl Acad. Sci. 115, E10925–E10933 (2018).
22. Han, D. S. C. et al. The biology of cell-free DNA fragmentation and the roles of DNASE1, DNASE1L3, and DFFB. Am. J. Hum. Genet. 106, 202–214 (2020).
23. Lo, Y. M. D., Han, D. S. C., Jiang, P. & Chiu, R. W. K. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science 372, eaaw3616 (2021).
24. Chung, D. C. et al. A cell-free DNA blood-based test for colorectal cancer screening. N. Engl. J. Med. 390, 973–983 (2024).
25. Bao, H. et al. Early detection of multiple cancer types using multidimensional cell-free DNA fragmentomics. Nat. Med. 31, 2737–2745 (2025).
26. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems Vol. 30 (Curran Associates, Inc., 2017).
27. Vickers, A. J. & Elkin, E. B. Decision curve analysis: a novel method for evaluating prediction models. Med. Decis. Mak. 26, 565–574 (2006).
28. Bao, H. et al. Letter to the Editor: an ultra-sensitive assay using cell-free DNA fragmentomics for multi-cancer early detection. Mol. Cancer 21, 129 (2022).
29. Li, Y., Fan, Z., Meng, Y., Liu, S. & Zhan, H. Blood-based DNA methylation signatures in cancer: a systematic review. Biochim. Biophys. Acta Mol. Basis Dis. 1869, 166583 (2023).
30. Liu, M. C. et al. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 31, 745–759 (2020).
31. Zhang, Q. et al. Investigating cellular similarities and differences between upper tract urothelial carcinoma and bladder urothelial carcinoma using single-cell sequencing. Front. Immunol. 15, 1298087 (2024).
32. de la Chapelle, A. Genetic predisposition to colorectal cancer. Nat. Rev. Cancer 4, 769–780 (2004).
33. Fortin, J.-P. & Hansen, K. D. Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data. Genome Biol. 16, 180 (2015).
34. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016). doi:10.1145/2939672.2939785.
35. Pavlyshenko, B. Using stacking approaches for machine learning models. In Proc. IEEE Second International Conference on Data Stream Mining & Processing (DSMP) 255–258 (2018). doi:10.1109/DSMP.2018.8478522.

