2025 Jun 28;111(9):6123–6134. doi: 10.1097/JS9.0000000000002805

Prospective cohort study integrating plasma proteomics and machine learning for early risk prediction of prostate cancer

Yongming Chen a, Tianxin Long b, Miao Wang c, Shengjie Liu c, Zhengtong Lv c, Yuxiao Jiang a, Huimin Hou c,*, Ming Liu a,c,*
PMCID: PMC12430897  PMID: 40557500

Abstract

Background:

Early detection of prostate cancer (PCa) remains a clinical challenge. Plasma proteomics provides a non-invasive tool for identifying individuals at elevated risk prior to symptom onset or PSA elevation.

Methods:

We quantified 1463 plasma proteins in 23 825 PCa-free men from the UK Biobank (UKB). Participants were split into training and validation sets. Cox regression and Light Gradient Boosting Machine (LightGBM) with forward feature selection were used to identify and rank predictive proteins. Model performance was assessed by area under the receiver operating characteristic curve (AUC) in the validation set, and SHAP values were used to interpret feature contributions.

Results:

TSPAN1 and GP2 consistently ranked as top predictors across all analyses. In the training set, both proteins remained significantly associated with PCa risk after Bonferroni correction in multivariable Cox models. LightGBM with forward selection further prioritized TSPAN1 and GP2 as key contributors, and SHAP analysis confirmed their dominant importance. In the validation set, a model combining TSPAN1, GP2, and demographic variables achieved an AUC of 0.728 for overall PCa prediction and 0.760 for 5-year risk. Based on Youden Index–derived thresholds, high-expression groups of TSPAN1 and GP2 were associated with hazard ratios of 1.75 and 1.60, respectively. Longitudinal profiling showed that TSPAN1 levels began rising approximately 9 years before diagnosis, while GP2 increased from 6 years prior.

Conclusions:

TSPAN1 and GP2 are promising long-term predictive biomarkers for PCa. A streamlined proteomics-based model may enable individualized risk stratification and inform earlier, less invasive screening strategies.

Keywords: GP2, machine learning, plasma proteomics, prostate cancer, risk prediction, TSPAN1

Introduction

Prostate cancer (PCa) is the second most common cancer and the fifth leading cause of cancer-related death among men worldwide. In 2022, an estimated 1.46 million new cases and over 396 000 deaths were reported globally[1]. Although localized PCa has an excellent prognosis with a 5-year survival rate of nearly 100%, this rate drops dramatically to 29% in cases of distant metastasis. However, PCa often has an insidious onset with no noticeable symptoms in the early stages. This poses a significant challenge for early detection, particularly in countries without routine PCa screening programs, where many patients are diagnosed at advanced stages.

HIGHLIGHTS

  • Identification of predictive plasma biomarkers: Analysis of 1463 plasma proteins in 23 825 men from the UK Biobank identified TSPAN1 and GP2 as top predictors of PCa, with hazard ratios of 1.73 and 1.65, respectively.

  • Efficient machine learning–based model: A two-protein panel (TSPAN1 + GP2), when combined with demographic variables, achieved an AUC of 0.728 in PCa prediction, comparable to larger multi-protein models but with greater clinical simplicity.

  • Temporal dynamics of biomarkers: TSPAN1 levels began rising approximately 9 years before diagnosis, while GP2 showed a delayed yet sustained elevation starting from 6 years prior, suggesting a sequential biomarker pattern.

  • Minimally invasive clinical utility: The validated protein panel enables blood-based early risk stratification and could help reduce reliance on invasive procedures in prostate cancer screening.

Currently, the most commonly used method for PCa screening is the detection of prostate-specific antigen (PSA) in blood. While PSA screening has improved early diagnosis, its accuracy in predicting PCa risk remains suboptimal, often leading to overdiagnosis and overtreatment, especially in individuals with PSA levels >3.0 ng/mL[2,3]. Although magnetic resonance imaging (MRI) can reduce overdiagnosis to some extent, its detection rate is comparable to PSA[4].

Recent advancements in blood-based biomarkers, particularly plasma proteomics[5], present promising opportunities for more precise risk prediction across a variety of diseases. Plasma proteins, encompassing both classical circulating proteins and tissue “leakage” proteins, serve as valuable biomarkers for many conditions[6]. They have been particularly instrumental in cardio-metabolic diseases (CMDs), where they provide critical insights into disease mechanisms and support risk stratification[7-9]. A recent study has also suggested that plasma protein alterations might contribute to the risk of certain cancers[10]. Moreover, alterations in plasma proteins, as key indicators of systemic metabolic health, have been increasingly linked to PCa and may serve as critical biomarkers for its diagnosis and risk stratification[11].

The current gold standard for PCa diagnosis – prostate biopsy – is an invasive procedure that imposes significant physical and psychological burdens on patients. In contrast, plasma protein profiling offers a non-invasive and readily accessible approach, making it a promising tool for both risk prediction and diagnosis. However, comprehensive, large-scale cohort studies on plasma proteins, particularly those aimed at predicting PCa incidence, are notably lacking.

In this study, we utilized plasma proteomics data from over 20 000 participants in the UK Biobank (UKB) to identify plasma proteins strongly associated with PCa risk. Our objective was to develop a robust PCa prediction system, offering novel insights into the early detection and prevention of PCa.

Methods

Study design

This was a prospective cohort analysis leveraging the UK Biobank (UKB), a large-scale population-based cohort of over 500 000 individuals recruited between 2006 and 2010 from 22 assessment centers across the United Kingdom. For the present study, we included 23 825 adult men who had baseline plasma proteomic profiles and no prior history of PCa at enrollment (Fig. 1). To minimize reverse causality and ensure true prospective prediction, participants who developed PCa within the first year of follow-up were excluded. Participants were followed from the date of enrollment to the first diagnosis of PCa, death, loss to follow-up, or administrative censoring on 15 January 2024. The study aimed to identify and evaluate baseline plasma protein biomarkers predictive of incident PCa using a combination of classical Cox regression and machine learning approaches.

Figure 1.

Flowchart of cohort selection.

The study aimed to evaluate the predictive value of baseline plasma proteins for incident prostate cancer using both conventional Cox regression and machine learning–based approaches. Model performance was assessed using internal cross-validation and stratified testing sets. All analyses were conducted in accordance with the UKB research ethics framework. This work has been reported in line with the STROCSS criteria[12].

Definition of outcome variables

The primary outcome was the diagnosis of PCa, determined by clinical examination and pathological evaluation conducted by clinicians and pathologists within the cohort. Alternatively, diagnoses were identified through linkage to hospital records, cancer registries, and death registries in the UKB using the International Classification of Diseases, 10th Revision (ICD-10: C61). The enrollment date in the UKB was considered the start of the follow-up period. Follow-up continued until the first occurrence of PCa diagnosis, death, loss to follow-up, or the end of available registry data (15 January 2024), whichever came first.

Plasma proteomics

The plasma proteomics workflow in the UKB, including sampling, storage, and sequencing, has been previously described[5]. In brief, baseline blood samples were collected from 2007 to 2010 at 22 UK assessment centers. Most were obtained during participants’ baseline visits, with additional samples sourced from the UKB Pharma Plasma Proteome Consortium and the COVID-19 repeat imaging study. Blood was drawn into EDTA tubes, centrifuged at 2500 g for 10 minutes at 4°C to isolate plasma, and the supernatant was aliquoted and stored at −80°C for further analysis.

Samples were shipped on dry ice to the Olink analysis service center in Sweden, where protein quantification was performed using the antibody-based Olink Explore Proximity Extension Assay[13]. Plasma proteomic analysis covered 52 704 UKB participants between April 2021 and February 2022. Laboratory personnel were blinded to all sample characteristics and clinical data. Detailed protocols for sample handling and storage have been reported previously[14]. After stringent quality control, 1463 unique proteins were quantified, spanning cardiovascular, metabolic, inflammatory, neurological, and oncological pathways. Intra- and inter-plate coefficients of variation for all Olink panels were below 10% and 20%, respectively. Protein levels were expressed as normalized protein expression (NPX) values, derived by dividing sample-specific counts by standardized counts (see Olink normalization documentation: https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/Olink_1536_B0_to_B7_Normalization.pdf).

Statistical methods and analytical framework

No formal statistical power calculation was conducted prior to analysis. However, the study included 23 825 men, among whom 1234 developed incident prostate cancer over a median follow-up of 14.7 years. This large sample size enabled stable estimation of effect sizes and model performance metrics, particularly in internal validation analyses. Descriptive statistics were used to compare differences between the PCa event group and controls, employing chi-square tests for categorical variables and Student’s t-tests for continuous variables.

Cox proportional hazards regression models were applied to estimate the association between each plasma protein’s scaled NPX value and PCa events. Hazard ratios (HRs), 95% confidence intervals (CIs), and P values were reported. Model 1 was adjusted for age, ethnicity, and Townsend Deprivation Index (TDI), while Model 2 further adjusted for body mass index (BMI), family history of cancer, history of diabetes, pre-existing cardiovascular disease (including atrial fibrillation, heart failure, coronary artery disease, stroke, or peripheral artery disease), smoking status, alcohol consumption frequency, and high-density lipoprotein (HDL) cholesterol. Bonferroni correction was applied to account for multiple comparisons, with significance set at a corrected P < 0.05.

Cohort partitioning

Participants were chronologically sorted by enrollment date, with the earliest 80% assigned to the training set and the most recent 20% held out as an independent validation set. All differential expression analyses, feature selection, and SHAP interpretation were conducted using the training set, whereas all ROC curve evaluations and AUC comparisons were conducted exclusively in the validation set to avoid information leakage and overfitting.
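The chronological partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the field name `enrolled`, and the toy participant IDs and dates are all invented for the example.

```python
from datetime import date

def chronological_split(participants, train_fraction=0.8):
    """Sort by enrollment date; the earliest fraction forms the training
    set and the remaining, most recent records form the validation set."""
    ordered = sorted(participants, key=lambda p: p["enrolled"])
    n_train = int(len(ordered) * train_fraction)
    return ordered[:n_train], ordered[n_train:]

# Toy records; IDs and dates are hypothetical.
toy = [
    {"id": "p1", "enrolled": date(2009, 5, 1)},
    {"id": "p2", "enrolled": date(2007, 3, 12)},
    {"id": "p3", "enrolled": date(2010, 1, 20)},
    {"id": "p4", "enrolled": date(2008, 8, 30)},
    {"id": "p5", "enrolled": date(2006, 11, 2)},
]
train, valid = chronological_split(toy)
```

Splitting by time rather than at random means every validation participant enrolled after every training participant, which mimics applying a fitted model to later recruits.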

Cox regression for biomarker discovery

We applied Cox proportional hazards models to identify proteins associated with incident PCa in the training set. Model 1 adjusted for age, ethnicity, and TDI. Model 2 further adjusted for BMI, systolic blood pressure, diabetes status, BPH diagnosis, cardiovascular disease history, smoking status, alcohol intake frequency, HDL levels, and family history of cancer. NPX values were z-transformed. Hazard ratios (HRs), 95% confidence intervals, and P values were calculated for each protein. Bonferroni correction was used to adjust for multiple testing across 1463 proteins.
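As a worked illustration of the two preprocessing and testing steps named above, the sketch below z-transforms a vector of values and computes the Bonferroni threshold for 1463 tests. It assumes the sample standard deviation is used for scaling, which the text does not specify; the helper names are hypothetical.

```python
from statistics import mean, stdev

N_PROTEINS = 1463   # number of proteins tested, from the text
ALPHA = 0.05        # family-wise significance level, from the text

def z_transform(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def bonferroni_threshold(alpha, n_tests):
    """Per-test P value cutoff that keeps the family-wise error at alpha."""
    return alpha / n_tests

def bonferroni_adjust(p_value, n_tests):
    """Equivalent view: inflate the raw P value, capped at 1."""
    return min(1.0, p_value * n_tests)

threshold = bonferroni_threshold(ALPHA, N_PROTEINS)  # roughly 3.4e-5
```

Under this correction, a protein is Bonferroni-significant only if its unadjusted P value falls below about 3.4 × 10⁻⁵.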

Functional enrichment analysis

Proteins nominally associated with PCa risk (P < 0.05) in Model 2 were subjected to pathway enrichment analysis using Enrichr[15], referencing Reactome, KEGG, and Gene Ontology (GO) terms for biological process (BP), molecular function (MF), and cellular component (CC). Enrichment results were visualized based on −log10(FDR).

Stepwise LightGBM modeling and SHAP interpretation

Proteins with P < 0.05 in Model 2 were used to train a LightGBM classifier within the training set. To prioritize features, we ranked these proteins by their importance scores based on information gain. Proteins were then sequentially added to the model using a forward selection strategy, where one additional protein was incorporated at each step based on the ranking.

At each step, 10-fold cross-validation was conducted within the training set to compute the mean AUC and its standard deviation. The forward selection process was terminated when the incremental improvement in mean AUC was less than 0.005 for two consecutive steps. This conservative stopping criterion was adopted to minimize overfitting and to ensure that only proteins providing stable and meaningful contributions to predictive performance were retained.
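The stopping criterion can be expressed as a small function over the sequence of cross-validated mean AUCs. This is a sketch under one stated assumption: the two low-gain additions that trigger the stop are discarded, which the text does not state explicitly. The function name and the example AUC values are hypothetical.

```python
def forward_selection_stop(mean_aucs, min_gain=0.005, patience=2):
    """Return how many top-ranked features are retained.

    mean_aucs[i] is the cross-validated mean AUC of the model using the
    first i + 1 ranked features. Selection stops once `patience`
    consecutive additions each improve the mean AUC by less than
    `min_gain`; those additions are dropped (an assumption).
    """
    small_steps = 0
    for i in range(1, len(mean_aucs)):
        gain = mean_aucs[i] - mean_aucs[i - 1]
        small_steps = small_steps + 1 if gain < min_gain else 0
        if small_steps == patience:
            return i + 1 - patience
    return len(mean_aucs)  # criterion never met: keep all features

# Invented AUC trajectory: gains of 0.04, 0.02, 0.002, 0.001, ...
n_kept = forward_selection_stop([0.60, 0.64, 0.66, 0.662, 0.663, 0.70])
```

In the invented trajectory, the fourth and fifth features each add less than 0.005, so the procedure stops and keeps the first three, even though a later feature would have raised the AUC.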

To optimize model performance, key LightGBM hyperparameters (e.g., learning rate, number of leaves, maximum depth, and feature fraction) were tuned using grid search within the training set. The configuration yielding the highest mean AUC in cross-validation was selected. To address class imbalance (approximate PCa incidence of 5.2%), the scale_pos_weight parameter in LightGBM was set according to the inverse class ratio in the training set.
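To make the class-imbalance setting concrete, the sketch below computes an inverse class ratio from the whole-cohort counts reported in this paper (1234 cases, 22 591 non-cases). The authors computed the ratio within the training set, whose counts are not given here, so the resulting value is illustrative only. `objective` and `scale_pos_weight` are standard LightGBM parameter names; the other tuned hyperparameters are omitted because their values are not reported in this section.

```python
def scale_pos_weight(n_negative, n_positive):
    """Inverse class ratio: weight applied to the positive class."""
    return n_negative / n_positive

# Whole-cohort counts from the paper; training-set counts are an unknown.
n_cases, n_controls = 1234, 22591

incidence = n_cases / (n_cases + n_controls)     # about 0.052
weight = scale_pos_weight(n_controls, n_cases)   # about 18.3

# Parameter dictionary in the form LightGBM accepts.
params = {"objective": "binary", "scale_pos_weight": weight}
```

With roughly 18 non-cases per case, each case contributes about 18 times the weight of a non-case to the training loss.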

To further interpret model behavior, SHAP (SHapley Additive exPlanations) values were calculated for the final model. These values quantified the marginal contribution of each protein to the predicted PCa risk for individual participants and indicated whether higher protein expression was associated with increased or decreased risk according to the model.
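The idea behind Shapley attribution can be shown with a brute-force calculation on a toy two-feature model. This is a simplified sketch: it replaces "absent" features with a fixed baseline value and enumerates every feature ordering, whereas SHAP implementations for tree models use faster algorithms and a background distribution. The model `toy_model` and its coefficients are invented.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values by averaging each feature's marginal
    contribution over all orderings (feasible only for few features)."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i to its real value
            new = predict(current)
            phi[i] += new - prev       # marginal contribution in this order
            prev = new
    return [p / len(orders) for p in phi]

def toy_model(features):
    """Hypothetical risk score with an interaction term."""
    a, b = features
    return 0.5 * a + 0.2 * b + 0.1 * a * b

phi = shapley_values(toy_model, x=[1.0, 2.0], baseline=[0.0, 0.0])
```

The attributions sum to the difference between the prediction for the individual and the prediction at the baseline, and the sign of each value indicates whether that feature pushed the predicted risk up or down.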

Evaluation of predictive accuracy

Model performance was evaluated in the independent validation set using predicted probabilities generated by LightGBM models trained on the training set. Receiver operating characteristic (ROC) curves were plotted, and area under the curve (AUC) values with 95% confidence intervals were calculated using DeLong’s method.
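For readers unfamiliar with DeLong's method, the following sketch computes an AUC and a normal-approximation 95% confidence interval from the predicted scores of cases and non-cases. It is a direct quadratic-time implementation for illustration, not the routine the authors used, and the toy scores are invented.

```python
from math import sqrt
from statistics import variance

def delong_auc_ci(case_scores, control_scores, z=1.96):
    """AUC with a DeLong variance estimate and a Wald-type interval."""
    def psi(x, y):
        # 1 if the case outranks the control, 0.5 for a tie, else 0.
        return 1.0 if x > y else 0.5 if x == y else 0.0

    m, n = len(case_scores), len(control_scores)
    # Structural components: per-case and per-control placement values.
    v10 = [sum(psi(x, y) for y in control_scores) / n for x in case_scores]
    v01 = [sum(psi(x, y) for x in case_scores) / m for y in control_scores]
    auc = sum(v10) / m
    var = variance(v10) / m + variance(v01) / n
    half_width = z * sqrt(var)
    return auc, max(0.0, auc - half_width), min(1.0, auc + half_width)

# Invented predicted probabilities for 4 cases and 5 non-cases.
auc, lower, upper = delong_auc_ci([0.9, 0.8, 0.55, 0.4],
                                  [0.7, 0.5, 0.3, 0.2, 0.1])
```

The AUC here is the proportion of case/non-case pairs in which the case receives the higher score, which is why it can be read as a ranking measure rather than a calibration measure.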

We compared predictive performance across multiple models: (i) individual top-ranked proteins; (ii) small protein panels derived from LightGBM-based stepwise selection; and (iii) extended models integrating protein panels with demographic covariates including age, BMI, ethnicity, TDI, family history of cancer, diabetes, cardiovascular disease, smoking status, alcohol consumption, and HDL cholesterol. Logistic regression was used for visualization only and did not influence model training.

To assess temporal robustness, time-stratified models were constructed for PCa events occurring within 5 years, within 10 years, and beyond 10 years from baseline. For each timeframe, the outcome was redefined accordingly, and controls were limited to participants free of PCa at the corresponding follow-up time. To minimize reverse causality, individuals diagnosed with PCa within 1 year of enrollment were excluded from all prediction analyses.

Survival and temporal trajectory analyses

Kaplan–Meier curves were generated to examine PCa incidence by dichotomized protein levels using optimal Youden index cutoffs. Cox regression estimated hazard ratios for high vs. low expression groups. For longitudinal profiling, we analyzed protein trajectories across 14 years preceding PCa diagnosis in cases. For controls (i.e., participants free of PCa until censoring), mean protein levels across the entire cohort were used as a stable reference. Yearly mean values were calculated and visualized. Group differences were tested annually using t-tests, and overall trends were examined using Mann–Kendall tests.
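The Youden index cutoff used for dichotomization can be sketched as below. The example treats the outcome as a simple binary label and classifies values at or above the cutoff as "high"; whether the authors handled ties or censoring differently is not stated. The protein values and event labels are invented.

```python
def youden_cutoff(values, events):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1,
    where value >= cutoff is classified as high expression."""
    n_pos = sum(events)
    n_neg = len(events) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(values)):
        tp = sum(1 for v, e in zip(values, events) if e == 1 and v >= cut)
        tn = sum(1 for v, e in zip(values, events) if e == 0 and v < cut)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# Invented scaled protein levels and case indicators (1 = developed PCa).
levels = [-1.2, -0.5, 0.1, 0.3, 0.6, 0.9, 1.4, -0.2]
is_case = [0, 0, 0, 1, 0, 1, 1, 0]
cutoff, j_stat = youden_cutoff(levels, is_case)
```

In this toy data the cutoff of 0.3 captures all three cases while misclassifying one of five non-cases, giving J = 0.8.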

Software and statistical significance

All data analyses were conducted using R (v4.3.3) and Jupyter Notebook (v7.2.2). Machine learning models were implemented using the LightGBM R package and SHAP analyses were visualized via Python. A two-sided P value <0.05 was considered statistically significant unless otherwise stated.

Results

Baseline characteristics of participants

A total of 23 825 men without PCa at baseline were included in the analysis. Details of participant inclusion and exclusion are presented in Fig. 1. Participants were chronologically ordered by enrollment date, with the earliest 80% assigned to the training set and the most recent 20% reserved as an independent validation set.

The mean baseline age of the entire cohort was 56.9 ± 8.3 years, comparable to that of the non-PCa group (56.7 ± 8.4 years), whereas the PCa group was significantly older (61.6 ± 5.7 years; P < 0.001). The median follow-up duration was 14.7 years. Most participants (93.1%) were of White ethnicity (Table 1). During follow-up, 1234 incident PCa cases were identified: 372 within 5 years, 974 within 10 years, and 260 beyond 10 years. The prevalence of benign prostatic hyperplasia (BPH) at baseline was significantly higher in the PCa group (P = 0.001), while no significant differences were observed in the prevalence of diabetes or cardiovascular disease (CVD). A family history of cancer was more frequently reported among PCa cases (42.5%) than among non-cases (36.4%; P < 0.001). Differences in smoking and alcohol consumption status were also statistically significant.

Table 1.

Characteristics of the participants

Characteristic | Overall (N = 23 825) | PCa (N = 1234) | Non-PCa (N = 22 591) | P value
Age, years (mean (SD)) | 56.9 (8.3) | 61.6 (5.7) | 56.7 (8.4) | <0.001***
TDI (mean (SD)) | −1.12 (3.26) | −1.48 (3.14) | −1.10 (3.26) | <0.001***
HDL (mean (SD)) | 49.29 (12.20) | 50.27 (12.87) | 49.24 (12.17) | 0.004**
Systolic BP (mean (SD)) | 140.64 (17.44) | 143.35 (17.04) | 140.49 (17.45) | <0.001***
BPH (%) | | | | 0.001**
  No | 23 297 (97.8) | 1189 (96.4) | 22 108 (97.9) |
  Yes | 528 (2.2) | 45 (3.6) | 483 (2.1) |
Diabetes (%) | | | | 0.068
  No | 21 988 (92.3) | 1156 (93.7) | 20 832 (92.2) |
  Yes | 1837 (7.7) | 78 (6.3) | 1759 (7.8) |
Clinical CVD (%) | | | | 0.191
  No | 21 485 (90.2) | 1099 (89.1) | 20 386 (90.2) |
  Yes | 2340 (9.8) | 135 (10.9) | 2205 (9.8) |
Ethnicity (%) | | | | 0.082
  White | 22 183 (93.1) | 1155 (93.6) | 21 028 (93.1) |
  Asian | 590 (2.5) | 21 (1.7) | 569 (2.5) |
  Black | 535 (2.2) | 36 (2.9) | 499 (2.2) |
  Others | 517 (2.2) | 22 (1.8) | 495 (2.2) |
Abnormal BMI (%) | | | | 0.479
  No | 5829 (24.5) | 291 (23.6) | 5538 (24.5) |
  Yes | 17 996 (75.5) | 943 (76.4) | 17 053 (75.5) |
Family history of cancer (%) | | | | <0.001***
  No | 15 077 (63.3) | 709 (57.5) | 14 368 (63.6) |
  Yes | 8748 (36.7) | 525 (42.5) | 8223 (36.4) |
Current smoking status (%) | | | | 0.001**
  No | 20 833 (87.4) | 1118 (90.6) | 19 715 (87.3) |
  Yes | 2992 (12.6) | 116 (9.4) | 2876 (12.7) |
Frequency of alcohol intake (%) | | | | 0.003**
  Never | 1669 (7.0) | 64 (5.2) | 1605 (7.1) |
  <3 times per week | 10 050 (42.2) | 494 (40.0) | 9556 (42.3) |
  ≥3 times per week | 12 106 (50.8) | 676 (54.8) | 11 430 (50.6) |

BMI = body mass index; BPH = benign prostatic hyperplasia; CVD = cardiovascular disease; PCa = prostate cancer; Systolic BP = systolic blood pressure; TDI = Townsend Deprivation Index.

*P < 0.05, **P < 0.01, and ***P < 0.001.

Identification of plasma proteins associated with PCa risk

A total of 1463 plasma proteins were quantified (distribution shown in Supplementary Figure 1, available at: http://links.lww.com/JS9/E484). Two Cox proportional hazards models were applied in the training set: Model 1 adjusted for age, ethnicity, and TDI, while Model 2 further included BMI, systolic blood pressure, diabetes, BPH, and CVD (Supplementary Tables 1 and 2, available at: http://links.lww.com/JS9/E483).

At a nominal significance threshold (P < 0.05), Model 1 identified multiple risk-associated (HR>1) and protective (HR<1) proteins (Fig. 2A), with TSPAN1 and GP2 showing the strongest associations. After Bonferroni correction, six proteins remained significant, including two risk proteins (TSPAN1, GP2) and four protective proteins (CCL20, IL17A, FLT3LG, and CCL18) (Fig. 2B). In Model 2, which incorporated a broader set of clinical covariates, only TSPAN1 and GP2 remained Bonferroni-significant (Fig. 2D), reinforcing their robustness.

Figure 2.

Key protein and pathway associations with prostate cancer.

(A–D) Volcano plots for Model 1 and Model 2, depicting the association between proteins and prostate cancer along with their statistical significance, both before and after Bonferroni correction. Proteins with significant associations are labeled. Model 1 was adjusted for age, ethnicity, and TDI, while Model 2 included additional adjustments for BMI, systolic blood pressure, diabetes status, BPH diagnosis, cardiovascular disease history (atrial fibrillation, heart failure, coronary artery disease, stroke, or peripheral artery disease), smoking status, alcohol intake frequency, HDL levels, and family history of cancer. Bonferroni correction was applied for multiple comparisons, with significance set at P <0.05. (E) Pathway enrichment analysis of significant proteins (not corrected by Bonferroni) in Model 2, grouped by database sources such as Reactome, KEGG, and GO, using the Enrichr platform (https://maayanlab.cloud/Enrichr/). The number of proteins observed in each pathway is displayed above each bar. Statistical significance was defined as FDR-corrected P <0.05 (dotted horizontal line), with P values calculated using two-sided tests. MF, molecular function; CC, cellular component; BP, biological process.

Pathway enrichment analysis based on proteins with P <0.05 in Model 2 (Supplementary Table 3, available at: http://links.lww.com/JS9/E483) revealed enrichment in immune-related pathways including “Cytokine signaling in immune system” and “JAK-STAT signaling pathway” (Fig. 2E). Enrichment in subcellular components such as “lysosomal lumen” and “collagen-containing extracellular matrix” was also observed, highlighting potential microenvironmental influences on PCa risk.

Protein importance ranking and prediction model development

Building on the candidate proteins identified in Model 2, we next prioritized their predictive utility using LightGBM-based feature importance ranking. Proteins were sequentially added to the classifier, and AUC was calculated cumulatively (Fig. 3A; Supplementary Table 4, available at: http://links.lww.com/JS9/E483). Four proteins – TSPAN1, GP2, EDA2R, and QPCT – accounted for the majority of predictive performance.

Figure 3.

Predictive performance of protein panels for prostate cancer risk.

(A) Stepwise feature selection and protein importance based on LightGBM modeling. Each bar represents the importance of a protein in predicting PCa risk (left y-axis), ordered by descending importance. The black line (right y-axis) shows the cumulative AUC as proteins are progressively added into the model. (B) SHAP (SHapley Additive exPlanations) summary plot showing the contribution of top-ranked proteins to the model prediction. The horizontal axis indicates SHAP values (feature impact), while color represents feature values (red = high, blue = low). (C–F) ROC curves evaluating the performance of different predictive models for PCa diagnosis: (C) overall prediction at any time; (D) prediction within 5 years; (E) prediction within 10 years; and (F) prediction beyond 10 years. Models incorporate selected protein combinations, with or without demographic variables. Protein panel 1 includes four selected proteins (TSPAN1, GP2, EDA2R, and QPCT) highlighted in red in Fig. 3A. Protein panels 2–4 consist of the proteins highlighted in red in Supplementary Figures 2A, 3A, and 4A (available at: http://links.lww.com/JS9/E484), respectively. Panels 1–4 were derived using a stepwise forward selection procedure based on the LightGBM model, and correspond to prostate cancer diagnoses occurring during the entire follow-up period, within 5 years, within 10 years, and beyond 10 years, respectively.

SHAP analysis (Fig. 3B) confirmed TSPAN1 as the most influential predictor, followed by EDA2R and GP2. QPCT showed a contrasting pattern, where high expression was associated with lower predicted risk, suggesting a potential protective role.

In validation analyses, the TSPAN1 + GP2 model – comprising top-ranked and Bonferroni-significant proteins – achieved an AUC of 0.671 (95% CI: 0.636–0.705). Adding demographic covariates improved the AUC to 0.728 (95% CI: 0.700–0.756). Inclusion of CCL18 resulted in a reduced AUC of 0.655 (95% CI: 0.618–0.693), while the demographic-adjusted version achieved 0.723 (95% CI: 0.693–0.752). The four-protein model (TSPAN1, GP2, EDA2R, and QPCT) reached an AUC of 0.668 (95% CI: 0.633–0.703), rising to 0.729 (95% CI: 0.701–0.758) after demographic adjustment. This is only a marginal gain over the 0.728 of the TSPAN1 + GP2 + demographic model, suggesting that most of the predictive signal was already captured by the simpler model.

Time-specific prediction of PCa risk

To evaluate the temporal stability of predictive performance, we next assessed model accuracy across different intervals from baseline: within 5 years, within 10 years, and beyond 10 years. LightGBM and SHAP analyses were conducted in the training set, with AUCs computed in the validation set.

For 5-year prediction, model performance plateaued after including NTRK3, with additional proteins diminishing discrimination (Supplementary Figure 2A, available at: http://links.lww.com/JS9/E484 and Supplementary Table 5, available at: http://links.lww.com/JS9/E483). Ten proteins – TSPAN1, TNC, SLC16A1, CDH2, CPM, IL4, GP2, EDA2R, DUOX2, and NTRK3 – were selected as panel 2. SHAP analysis highlighted TSPAN1, GP2, EDA2R, and NTRK3 as key contributors (Supplementary Figure 2B, available at: http://links.lww.com/JS9/E484).

The TSPAN1 + GP2 model achieved an AUC of 0.732 (95% CI: 0.676–0.789), improving to 0.760 (95% CI: 0.711–0.809) with demographic adjustment. Adding CCL18 yielded the highest observed AUC of 0.767 (95% CI: 0.717–0.817). The full 10-protein panel performed similarly (AUC = 0.732; 95% CI: 0.678–0.786), with little gain after adding demographics (AUC = 0.759; 95% CI: 0.709–0.809), indicating that most predictive signal resided in the TSPAN1 + GP2 + CCL18 + demographic model.

For 10-year prediction (Fig. 3E; Supplementary Figure 3A and B, available at: http://links.lww.com/JS9/E484; Supplementary Table 6, available at: http://links.lww.com/JS9/E483) and >10-year prediction (Fig. 3F; Supplementary Figure 4A and B, available at: http://links.lww.com/JS9/E484; Supplementary Table 7, available at: http://links.lww.com/JS9/E483), the TSPAN1 + GP2 + demographic model again outperformed others, achieving AUCs of 0.730 (95% CI: 0.702–0.758) and 0.691 (95% CI: 0.658–0.723), respectively. In contrast, the broader 10-protein (panel 3) and 44-protein (panel 4) panels showed diminished performance, particularly for >10-year prediction (AUC = 0.671; 95% CI: 0.632–0.709), suggesting plasma proteomic signatures are most effective for short- to medium-term risk stratification.

Figure 4.

Kaplan–Meier survival curves for prostate cancer incidence based on protein expression levels.

Kaplan–Meier curves showing the association between protein expression levels of TSPAN1 (A) and GP2 (B) and the cumulative incidence of PCa. Protein concentrations were dichotomized into high and low groups according to the optimal cut point determined by the maximum Youden index. Hazard ratios (HRs) with corresponding 95% confidence intervals (CIs) and P values are shown. The number at risk for each group over the follow-up period is displayed below the curves. A P value <0.05 was considered statistically significant.

Association of differential proteins with clinical progression to PCa

To evaluate the relevance of key proteins in disease onset, baseline levels of TSPAN1 and GP2 were classified into high and low expression groups using optimal cutoffs derived from the Youden Index, and time-to-diagnosis analyses were conducted.

Kaplan–Meier curves revealed significant separation in event-free survival between high- and low-expression groups for both markers. Individuals with high baseline TSPAN1 expression (cut point = 0.4426) exhibited a markedly higher risk of developing PCa, with a hazard ratio (HR) of 1.75 (95% CI: 1.55–1.98; P < 0.001) (Fig. 4A). Similarly, elevated GP2 levels (cut point = 0.1584) were associated with a 1.60-fold increased risk of PCa (95% CI: 1.42–1.81; P < 0.001) (Fig. 4B). The curves showed early and persistent divergence between groups, supporting the potential of TSPAN1 and GP2 not only as predictive biomarkers but also as indicators of PCa progression risk over long-term follow-up.

Temporal trajectories of differential proteins

To further investigate their temporal dynamics preceding PCa onset, we analyzed the longitudinal expression trajectories of TSPAN1 and GP2 over the 14 years preceding PCa diagnosis. For cases, time was indexed to the year of diagnosis; for controls, mean expression levels were treated as a time-invariant reference (Supplementary Table 8, available at: http://links.lww.com/JS9/E483).

TSPAN1 levels in PCa cases began to diverge from controls approximately 9 years before diagnosis, with the first statistically significant difference observed at year −9 (P < 0.05), followed by a steady and progressive elevation reaching peak separation after year −6 (Fig. 5A). GP2 levels exhibited a transient increase at year −9, returned to baseline during years −8 and −7, and then showed a sustained and significant rise from year −6 onward (Fig. 5B), suggesting delayed but persistent dysregulation. These trajectories highlight their potential as early and temporally distinct biomarkers for PCa risk stratification.

Figure 5.

Temporal trajectories of protein concentrations prior to prostate cancer diagnosis.

Longitudinal trends in plasma protein levels of TSPAN1 (A) and GP2 (B) during the 14 years preceding prostate cancer (PCa) diagnosis. For PCa cases, time is aligned to years before diagnosis; for the Normal group, we calculated only the mean protein concentrations in non-PCa individuals, without accounting for temporal variation. Lines represent the mean concentration at each time point, with 95% confidence intervals. Group differences at each year were assessed using two-sided independent t-tests comparing PCa and Normal participants. Asterisks indicate statistical significance: P < 0.05*, P < 0.01**, and P < 0.001***.

Discussion

In this study, we systematically identified plasma proteins associated with incident PCa using a large-scale, high-throughput proteomics platform in a prospective population-based cohort. Among the 1463 proteins examined, TSPAN1 and GP2 consistently emerged as the most robust and informative biomarkers across all analyses – including Cox regression, LightGBM importance ranking, SHAP value attribution, and validation AUCs – for predicting PCa risk within 5 years, 10 years, beyond 10 years, and across all events.

To our knowledge, this is the first study utilizing UK Biobank’s proteomics resource to establish a plasma protein-based predictive model for PCa. Although multiple proteins showed nominal associations with PCa risk in multivariable Cox models, only TSPAN1 and GP2 retained statistical significance after stringent Bonferroni correction. These two proteins alone yielded a validation AUC of 0.671, which improved to 0.728 with the addition of demographic covariates. The best-performing model, incorporating TSPAN1, GP2, CCL18, and demographics, reached an AUC of 0.767 for 5-year prediction, supporting their potential clinical utility.

Kaplan–Meier analysis confirmed the prognostic relevance of TSPAN1 and GP2, with high-expression groups exhibiting significantly increased PCa risk (HRs of 1.75 and 1.60, respectively). Furthermore, temporal trajectory analyses revealed that TSPAN1 levels began diverging from controls up to 9 years before diagnosis, while GP2 exhibited a delayed yet persistent increase from 6 years prior. These trends suggest that both proteins may reflect early pathophysiologic alterations and hold promise as preclinical biomarkers.

TSPAN1 is a member of the tetraspanin family involved in cell-cell adhesion, signal transduction, and membrane protein trafficking[16]. While its role in PCa is still under investigation, emerging evidence suggests it is regulated by androgen signaling and contributes to disease progression. A prior study demonstrated that TSPAN1 expression is upregulated in PCa cells under androgen stimulation and promotes the expression of Slug, a transcription factor implicated in epithelial–mesenchymal transition (EMT), a key process in cancer invasion and metastasis[17]. Furthermore, TSPAN1 has been identified as a marker of cancer-associated fibroblasts (CAFs) and was associated with reduced biochemical recurrence-free survival in PCa patients, suggesting a potential role in stromal–epithelial crosstalk and tumor progression[18,19]. However, the precise mechanisms by which TSPAN1 contributes to PCa pathogenesis remain unclear, particularly in the context of circulating protein dynamics. Our study expands current understanding by demonstrating that plasma levels of TSPAN1 begin to rise approximately 9 years prior to clinical diagnosis, potentially reflecting subclinical tumor activity and shedding into the circulation. This temporal pattern supports its utility as an early, non-invasive biomarker, with potential relevance for both screening and risk stratification.

GP2, a glycoprotein primarily known for its expression in pancreatic acinar cells and involvement in mucosal immunity via its zona pellucida-like domain[20], has only recently been explored in the context of PCa. Although studies remain limited, immunohistochemical data suggest that elevated GP2 expression correlates with adverse histological features such as higher Gleason score and lymph node involvement, and may be associated with tumor heterogeneity[21,22]. In our analysis, plasma GP2 levels were significantly elevated in future PCa cases and showed a sustained increase starting approximately 6 years before diagnosis. This pattern suggests a potential role for GP2 in early carcinogenesis, possibly through modulation of immune surveillance or epithelial barrier integrity. The consistent upregulation of GP2 in both tissue and plasma supports its candidacy as a circulating biomarker, though mechanistic studies are needed to elucidate its biological function in the prostate tumor microenvironment.

Importantly, our data demonstrate that the majority of predictive signal resides in a narrow subset of proteins – namely TSPAN1 and GP2 – rather than in large multiplex panels. Although the combination of TSPAN1, GP2, and CCL18 with demographics yielded the highest AUC (0.767) for 5-year prediction, this gain over the TSPAN1 + GP2 + demographic model (0.760) was marginal. For long-term risk prediction across the entire follow-up, TSPAN1 and GP2 consistently outperformed larger panels. Adding more proteins beyond these core markers contributed little to predictive performance and occasionally introduced overfitting, highlighting the value of a parsimonious and robust biomarker signature.
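The idea that a few markers carry most of the signal can be sketched with a greedy forward-selection loop that stops once an added feature no longer improves AUC. This is a simplified stand-in, not the study's pipeline: it replaces LightGBM with an unweighted sum of the selected features, scores AUC on the same synthetic data used for selection, and uses a hypothetical `min_gain` stopping rule. All feature names and data are simulated.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance a random case outscores a random non-case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def forward_select(features, labels, min_gain=0.005):
    """Add the feature that most improves AUC; stop when gain < min_gain."""
    selected, best_auc = [], 0.5
    remaining = list(features)
    while remaining:
        trials = []
        for name in remaining:
            cols = [features[f] for f in selected + [name]]
            composite = [sum(vals) for vals in zip(*cols)]  # stand-in model
            trials.append((auc(composite, labels), name))
        top_auc, top_name = max(trials)
        if top_auc - best_auc < min_gain:
            break
        selected.append(top_name)
        remaining.remove(top_name)
        best_auc = top_auc
    return selected, best_auc


# Synthetic cohort: two informative features and three pure-noise features.
rng = random.Random(0)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(400)]
features = {
    "signal_a": [rng.gauss(0, 1) + 1.5 * y for y in labels],
    "signal_b": [rng.gauss(0, 1) + 1.2 * y for y in labels],
    "noise_1": [rng.gauss(0, 1) for _ in labels],
    "noise_2": [rng.gauss(0, 1) for _ in labels],
    "noise_3": [rng.gauss(0, 1) for _ in labels],
}
selected, final_auc = forward_select(features, labels)
print(selected, round(final_auc, 3))
```

In a real analysis the AUC at each step would be estimated on held-out or cross-validated data, since in-sample gains from noise features are exactly the overfitting risk described above.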

Analysis of protein concentrations over the 14 years preceding diagnosis showed that TSPAN1 levels were significantly higher than in the normal group from 9 years before PCa diagnosis, whereas GP2 separated significantly from the normal group’s mean concentration from 6 years before diagnosis. Because a non-invasive, cost-effective, and accessible prediction algorithm should also draw on readily available information, we combined the protein panel with demographic factors (age, BMI, and Townsend deprivation index [TDI]), which improved predictive performance: the validation AUC rose from 0.671 for TSPAN1 and GP2 alone to 0.728 with demographic covariates. The combination of plasma proteins and demographic features performed best for predicting PCa risk within 5 years and retained robust accuracy within 10 years, but performance was weaker for risk beyond 10 years.
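The year-by-year comparison behind these trajectories can be sketched as follows: cases are binned by whole years before diagnosis and each bin is compared with the pooled normal group. The sketch makes two assumptions that the text does not confirm: it uses Welch's unequal-variance form of the independent t-test, and it approximates the two-sided p-value with a normal distribution, which is only reasonable for large groups. Function names and toy values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def yearly_tests(case_rows, normal_values):
    """Compare cases in each year-before-diagnosis bin with the normal group.

    case_rows: (years_before_diagnosis, protein_level) pairs for cases.
    Returns {year: (t, approximate two-sided p)}.
    """
    by_year = {}
    for years, level in case_rows:
        by_year.setdefault(int(years), []).append(level)
    results = {}
    for year, levels in sorted(by_year.items()):
        if len(levels) < 2:
            continue  # variance is undefined for a single observation
        t = welch_t(levels, normal_values)
        p = math.erfc(abs(t) / math.sqrt(2))  # normal approximation
        results[year] = (t, p)
    return results


# Toy data: levels are elevated 2 years before diagnosis, not 12 years before.
normal = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0]
case_rows = [(2.3, 1.0), (2.7, 1.2), (2.1, 0.9),
             (12.5, 0.0), (12.2, 0.1), (12.9, -0.1)]
results = yearly_tests(case_rows, normal)
for year, (t, p) in results.items():
    print(year, round(t, 2), p)
```

Because each participant contributes a single baseline measurement, a comparison of this kind describes population-level differences by time to diagnosis rather than within-person change.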

While PSA remains the current screening standard[11,23,24], it was unavailable in the UK Biobank, limiting direct comparison. Nevertheless, our findings suggest that the protein-based model could aid early risk stratification, particularly among older men with low-risk proteomic profiles who might benefit from reduced screening frequency. Future studies should evaluate whether such protein-based stratification can complement or refine PSA-based protocols in clinical practice. Moreover, decision curve analysis and net reclassification improvement could not be conducted due to the absence of PSA and biopsy data, but future validations in independent cohorts with comprehensive clinical outcomes will be critical to defining actionable thresholds and assessing real-world decision utility.

Beyond diagnostic accuracy, our findings contribute to the broader understanding of PCa pathogenesis. The distinct temporal dynamics of TSPAN1 and GP2 point to heterogeneous molecular processes underpinning disease initiation. TSPAN1 may reflect early tumor microenvironment remodeling, while GP2 could be linked to later-stage immune escape or systemic dissemination. Further mechanistic studies – integrating transcriptomics, immune profiling, and spatial omics – are needed to contextualize their roles within gene–environment interactions and tissue-specific pathways.

Translationally, the non-invasive nature and high-throughput compatibility of plasma proteomics position TSPAN1 and GP2 as practical candidates for integration into primary care or risk-adapted PCa screening strategies. While PSA testing remains the current standard, its limited specificity often results in unnecessary imaging and biopsy procedures. A parsimonious plasma protein panel could complement existing pathways by improving risk stratification – either as a pre-PSA screening tool or to refine decision-making among individuals with borderline PSA values. This approach may help optimize downstream use of multiparametric MRI and biopsy, ultimately enhancing diagnostic precision and reducing overtreatment. Moreover, the Olink platform used in this study is compatible with large-scale clinical deployment, and the addition of two plasma biomarkers would entail minimal cost and logistical burden. Future efforts should focus on external validation, harmonization across proteomic platforms, and integration into prospective clinical trials to evaluate feasibility, cost-effectiveness, and clinical utility in real-world settings. Combining TSPAN1 and GP2 with PSA or imaging may ultimately enable multi-modal early detection strategies with improved accuracy and tailored patient management.

This study has several notable strengths, including the use of large-scale, high-throughput proteomic profiling and a large cohort with long-term follow-up. These features enhance the reliability of our findings and facilitate the identification of previously unreported plasma protein biomarkers.

However, several limitations should also be acknowledged. First, PSA levels were not available in the UK Biobank, limiting direct comparison with current clinical standards. Second, detailed clinical information such as tumor stage, Gleason grade, and histological or molecular subtypes was not accessible, preventing us from evaluating whether TSPAN1 or GP2 levels correlate with disease aggressiveness or progression and from examining subtype-specific associations; these remain directions for future mechanistic and stratified research. Third, plasma protein levels were measured only once at baseline, so temporal trajectories were inferred from cross-sectional rather than longitudinal data; our findings therefore reflect population-level rather than within-individual dynamics. Fourth, external validation was not feasible due to the lack of large, independent proteomics cohorts. Although temporally split training-validation and cross-validation strategies were applied to enhance model robustness, platform-specific variability may limit the generalizability of absolute protein thresholds, underscoring the need for future multicenter replication. Fifth, the timing of PCa diagnosis may not coincide with the actual onset of disease, introducing potential misclassification in trajectory analyses. Sixth, the UK Biobank cohort is predominantly composed of White, UK-based individuals and lacks diversity in ancestry and clinical annotations, which may constrain broader applicability and mechanistic interpretation of the findings. Finally, although our proteomics-based approach identified key candidate biomarkers, their roles within gene–environment interaction frameworks remain poorly understood.

Despite these limitations, our findings underscore the promise of plasma proteomics in advancing early detection and risk stratification for PCa. TSPAN1 and GP2 represent robust, biologically plausible, and temporally informative biomarkers that merit further investigation for integration into future screening algorithms.

Conclusions

Leveraging large-scale plasma proteomics and long-term follow-up from a population-based cohort, this study identified TSPAN1 and GP2 as robust predictive biomarkers for future PCa risk. These proteins showed consistent associations with disease onset across multiple analytical approaches and time windows. When combined with basic demographic variables, the resulting model reached a validation AUC of 0.728 for overall PCa prediction and 0.760 for 5-year risk, and remained informative for predictions within 10 years. Incorporating such biomarkers into early risk stratification frameworks may enable more accurate screening and timely intervention for individuals at increased risk of PCa.

Footnotes

Supplemental Digital Content is available for this article. Direct URL citations are provided in the HTML and PDF versions of this article on the journal’s website, www.lww.com/international-journal-of-surgery.

Published online 28 June 2025

Contributor Information

Yongming Chen, Email: chenyongming@bjhmoh.cn.

Tianxin Long, Email: longtianxin@fuwai.com.

Miao Wang, Email: wmwfdsgjzx@163.com.

Shengjie Liu, Email: liushengjie0412@163.com.

Yuxiao Jiang, Email: pandatof@163.com.

Huimin Hou, Email: houhuimin0305@163.com.

Ming Liu, Email: liumingbjh@126.com.

Ethical approval

The UKB study was approved by the North West Multi-Centre Research Ethics Committee (11/NW/0382), with all participants providing written informed consent. Conducted in accordance with the Declaration of Helsinki, the research utilized UK Biobank data under application number 665208.

Consent

Not applicable.

Sources of funding

This study was supported by grants from the Beijing Natural Science Foundation (7232138) and the National Key Research and Development Program of China (2022YFC3602900) to M.L., and by National High Level Hospital Clinical Research Funding (BJ-2022-143) to H.H.

Author contributions

Y.C.: study design, data collection, data analysis, manuscript drafting, and manuscript editing; T.L.: data collection and visualization; M.W.: manuscript review and editing; S.L.: visualization; Z.L.: data analysis; Y.J.: data collection; H.H.: conceptualization and funding acquisition; M.L.: conceptualization, supervision, manuscript review, and funding acquisition.

Conflicts of interest disclosure

The authors declare no competing interests.

Research registration unique identifying number (UIN)

Not applicable.

Guarantor

Ming Liu.

Provenance and peer review

Not commissioned, externally peer-reviewed.

Data availability statement

The data used in this study are available from the UK Biobank (Application Number: 665208). Researchers may apply for access through the UK Biobank website (www.ukbiobank.ac.uk), subject to institutional approval and compliance with the UK Biobank’s terms and conditions.

References

  • [1] Bray F, Laversanne M, Sung H, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024;74:229–63.
  • [2] Sidaway P. MRI-based stratification reduces the risk of overdiagnosis of prostate cancer. Nat Rev Clin Oncol 2024;21:838.
  • [3] Pinsky PF. Prostate biopsy in men with an elevated PSA level – reducing overdiagnosis. N Engl J Med 2024;391:1153–54.
  • [4] Wilt TJ, Dahm P. Prostate cancer screening with MRI does not differ from PSA only for detection but reduces biopsies and overdiagnosis. Ann Intern Med 2024;177:JC94.
  • [5] Sun BB, Chiou J, Traylor M, et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature 2023;622:329–38.
  • [6] Anderson NL, Anderson NG. The human plasma proteome: history, character, and diagnostic prospects. Mol Cell Proteomics 2002;1:845–67.
  • [7] Ramonfaur D, Buckley LF, Arthur V, et al. High throughput plasma proteomics and risk of heart failure and frailty in late life. JAMA Cardiol 2024;9:649–58.
  • [8] van Vugt M, Finan C, Chopade S, et al. Integrating metabolomics and proteomics to identify novel drug targets for heart failure and atrial fibrillation. Genome Med 2024;16:120.
  • [9] Peng X, Li Y, Liu N, et al. Plasma proteomic insights for identification of novel predictors and potential drug targets in atrial fibrillation: a prospective cohort study and Mendelian randomization analysis. Circ Arrhythm Electrophysiol 2024;17:e013037.
  • [10] Sun J, Liu Y, Zhao J, et al. Plasma proteomic and polygenic profiling improve risk stratification and personalized screening for colorectal cancer. Nat Commun 2024;15:8873.
  • [11] Pang B, Wang Q, Chen H, et al. Proteomic identification of small extracellular vesicle proteins LAMB1 and histone H4 for prostate cancer diagnosis and risk stratification. Adv Sci (Weinh) 2024;11:e2402509.
  • [12] Agha RA, Mathew G, Rashid R, et al. Revised strengthening the reporting of cohort, cross-sectional and case-control studies in surgery (STROCSS) guideline: an update for the age of artificial intelligence. Premier J Sci 2025;10:100081.
  • [13] Wik L, Nordberg N, Broberg J, et al. Proximity extension assay in combination with next-generation sequencing for high-throughput proteome-wide analysis. Mol Cell Proteomics 2021;20:100168.
  • [14] Elliott P, Peakman TC. The UK Biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. Int J Epidemiol 2008;37:234–44.
  • [15] Chen EY, Tan CM, Kou Y, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 2013;14:128.
  • [16] Garcia-Mayea Y, Mir C, Carballo L, Sánchez-García A, Bataller M, LLeonart ME. TSPAN1, a novel tetraspanin member highly involved in carcinogenesis and chemoresistance. Biochim Biophys Acta Rev Cancer 2022;1877:188674.
  • [17] Munkley J, McClurg UL, Livermore KE, et al. The cancer-associated cell migration protein TSPAN1 is under control of androgens and its upregulation increases prostate cancer cell migration. Sci Rep 2017;7:5249.
  • [18] Qian Y, Feng D, Wang J, et al. Establishment of cancer-associated fibroblasts-related subtypes and prognostic index for prostate cancer through single-cell and bulk RNA transcriptome. Sci Rep 2023;13:9016.
  • [19] Stinnesbeck M, Kristiansen A, Ellinger J, et al. Prognostic role of TSPAN1, KIAA1324 and ESRP1 in prostate cancer. APMIS 2021;129:204–12.
  • [20] Scheele GA, Fukuoka S, Freedman SD. Role of the GP2/THP family of GPI-anchored proteins in membrane trafficking during regulated exocrine secretion. Pancreas 1994;9:139–49.
  • [21] Uhlig R, Günther K, Bröker N, et al. Diagnostic and prognostic role of pancreatic secretory granule membrane major glycoprotein 2 (GP2) immunohistochemistry: a TMA study on 27,681 tumors. Pathol Res Pract 2022;238:154123.
  • [22] Yun JW, Lee S, Ryu D, et al. Biomarkers associated with tumor heterogeneity in prostate cancer. Transl Oncol 2019;12:43–48.
  • [23] Ola IO, Talala K, Tammela T, et al. Long-term risk of prostate cancer mortality among men with baseline prostate-specific antigen below 3 ng/ml: evidence from the Finnish Randomized Study of Screening for Prostate Cancer. Eur Urol Oncol 2024;8:452–9.
  • [24] Kuanar S, Cai J, Nakai H, et al. Transition-zone PSA-density calculated from MRI deep learning prostate zonal segmentation model for prediction of clinically significant prostate cancer. Abdom Radiol (NY) 2024;49:3722–34.


Articles from International Journal of Surgery (London, England) are provided here courtesy of Wolters Kluwer Health
