Scientific Reports. 2026 Feb 3;16:7106. doi: 10.1038/s41598-026-35903-w

Machine learning framework for mRNA alternative splicing analysis identifies a signature of progression in colorectal adenocarcinoma

Uran Maimekov 1, Mehdi Nosrati 2, Ahmed Mahmoud 2, Mainak Mustafi 2, Michael W Craige 2, Frederick Coffman 2, J Scott Parrott 3, Carol Lutz 4, Antonina Mitrofanova 2,5
PMCID: PMC12920883  PMID: 41634106

Abstract

Despite recent advances in genome-wide profiling and the discovery of novel therapeutic options for colorectal adenocarcinoma (COAD), effective patient classification for the risk of cancer progression remains underdeveloped. Recent research has highlighted the crucial role of mRNA alternative splicing (AS) in the development and progression of COAD, yet a genome-wide comprehensive evaluation of the role of AS in COAD progression has not been implemented. In this study, we present a robust machine-learning framework designed to uncover clinically relevant AS events associated with progression-free survival (PFS) in COAD patients. For this, we analyzed RNA sequencing data from the TCGA-COAD cohort (n = 266). We employed a machine learning approach integrating Cox Proportional Hazards (PH) analysis and Robust Likelihood-Based Survival (RBSURV) modeling that identified a five-event AS-PFS signature (spanning AS events in OR52K1, SPIN3, NDUFV1, BMPR1A, and ARPC4 genes). By leveraging this signature, we defined a risk score for each patient, categorizing them into low and high-risk groups. This signature and its risk score were further validated through Kaplan-Meier survival analysis and time-dependent receiver operating characteristic (ROC) analysis in the TCGA-COAD test set and independent patient cohort AC-ICAM (n = 348). Comparison to other markers and methods further confirmed the independent predictive value of the AS-PFS risk signature. We propose that this signature could be utilized in clinical settings to enhance patient stratification at diagnosis and further inform personalized treatment strategies.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-35903-w.

Subject terms: Biomarkers, Cancer, Computational biology and bioinformatics, Gastroenterology, Oncology

Introduction

Colorectal adenocarcinoma (COAD), which originates from the epithelial cells of the large intestine and rectum, is the predominant form of colorectal cancer, accounting for over 90% of cases [1]. COAD is characterized by a notably high mortality rate, with an estimated 152,000 new cases and 53,000 deaths anticipated in the US in 2024 (https://seer.cancer.gov). Despite recent advances in defining a comprehensive genomic landscape and advanced therapeutic options for COAD [2–4], substantial tumor heterogeneity remains a formidable obstacle to the effectiveness of targeted therapeutic interventions. It often undermines treatment outcomes and highlights the need to examine COAD’s molecular mechanisms, including mRNA alternative splicing, in greater depth.

mRNA alternative splicing (AS) represents a fundamental mechanism of post-transcriptional gene regulation that allows a single gene to produce multiple mRNA isoforms, contributing significantly to transcriptome and proteome diversity [5]. AS produces different isoforms through selective inclusion or exclusion of exons (and sometimes introns) and through alternative splice-site selection. While this process is pervasive, with an estimated 95% of human multi-exon genes undergoing AS [6], dysregulation of AS processes could lead to disease manifestation, including tumor initiation, growth, metastasis, immune escape, and therapeutic failure [7–9]. Many studies have highlighted the crucial role of AS in COAD progression, with emerging evidence suggesting that abnormalities in splicing events and cancer-specific spliced variants could serve as valuable prognostic biomarkers and therapeutic targets for this disease [10–14]. For instance, the alternative splicing of TIA-1 has been associated with angiogenesis and tumor growth [15], and the alternative splicing of spleen tyrosine kinase (SYK) has been shown to differentially regulate proliferation and metastasis [16] in COAD. While these studies nominate AS as a promising avenue for novel biomarker discovery, comprehensive, robust, genome-wide investigation of the role of AS in COAD progression remains limited.

In recent years, several groups have developed computational models to elucidate the connection between AS and survival [14,17–19]. While these methods underscore the importance of investigating AS events in cancer progression, they often do not take full advantage of a robust statistical machine learning study design and the reproducibility it offers, frequently utilize clinically limited endpoints, and often overlook multicollinearity and clinically relevant selection of AS events. To address these limitations, we employed a robust machine learning design in which the AS classifier was trained on the training patient cohort and tested on a non-overlapping testing set and an independent patient cohort (to guard against overfitting), using progression-free survival (PFS) as the clinical endpoint. Our framework combined Cox Proportional Hazards (PH) modeling [20] with Robust Likelihood-Based Survival (RBSURV) modeling [21], enabling non-multicollinear yet clinically relevant variable selection of AS events and a more accurate, reproducible assessment of the model’s performance on unseen data, which is essential for clinically relevant predictive modeling.

Here we utilized RNA-seq data from the TCGA-COAD patient cohort [22] (n = 266 patients, distributed across stages I (n = 42), II (n = 106), III (n = 79), and IV (n = 39)) and RNA-seq data from the AC-ICAM patient cohort [23] (n = 348 patients, distributed across stages I (n = 55), II (n = 122), III (n = 110), and IV (n = 61)) to elucidate and validate the prognostic impact of COAD-specific AS events on PFS. The TCGA-COAD cohort was stratified into training (n = 133) and testing (n = 133) sets. Training on the TCGA training set yielded the AS-PFS signature comprising five independent AS events (spanning the OR52K1, SPIN3, NDUFV1, BMPR1A, and ARPC4 genes), and testing on the TCGA testing set and on the independent AC-ICAM cohort [23] demonstrated its robustness and potential clinical utility.

To enhance the potential clinical applicability of our findings, we mathematically defined a risk score (i.e., the risk of a PFS event) for each patient using the AS-PFS signature, weighted by its effect on PFS, and identified high- and low-risk patient groups. Comparison of patients within these risk-score groups demonstrated significant differences in time to COAD progression using Kaplan-Meier survival analysis in the TCGA testing cohort (HR = 2.1, log-rank p-value = 0.025) and in the AC-ICAM dataset (HR = 1.6, log-rank p-value = 0.013). To assess the independent predictive ability of the risk score when compared to commonly used clinical variables (such as stage, gender, and age), we used time-dependent ROC-AUC analysis [24], which demonstrated that incorporating the AS-PFS signature risk scores improved the ROC-AUC when compared to clinical variables alone at all time points. Furthermore, comparison to results from other computational techniques and known transcriptomic and genomic markers of COAD progression demonstrated the advantage of the comprehensive, robust, genome-wide, machine-learning design and further highlighted the independent non-random ability of the AS-PFS signature to predict patients’ risk of COAD progression.

We propose that the identified AS-PFS signature and its risk score could be utilized in clinical settings to help identify patients at higher (and lower) risk of COAD progression. This information would be of significant value in subsequent personalized disease management and in selecting precision therapeutic strategies. We anticipate that our methodology establishes a framework that could be utilized in addressing diverse clinically informed questions for patients with COAD and other neoplastic conditions.

Results

Comprehensive framework for AS investigation

To investigate and validate the association of AS events with PFS in COAD patients, we utilized the TCGA-COAD patient cohort [22] (n = 266) and the AC-ICAM cohort [23] (n = 348), as they provided a comprehensive collection of patients with a wide range of COAD staging. For our analysis, we utilized progression-free survival (PFS) as the clinical endpoint, defined as the time to tumor recurrence, local or distant metastasis, or new primary tumor event, or time to the most recent follow-up.

We utilized RNA-seq profiles to estimate inclusion levels for each mRNA splicing event defined as Percent Spliced In (PSI), which is the ratio of Transcript Per Million (TPM) for transcripts that included this event (numerator) over the TPM of transcripts that either included or excluded this event (denominator) (Fig. 1a, see Materials and Methods). In general, PSI ranges from zero to one, with values closer to one reflecting higher inclusion of the event in a sample and values close to 0 reflecting more exclusion of the event in a sample. We estimated AS of multiple types, including skipped exon (SE), retained intron (RI), mutually exclusive exons (MX), alternative first/last exons (AF/AL), and alternative 5’/3’ splice sites (A5/A3).
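The PSI ratio described above can be illustrated with a minimal Python sketch. The function name and the toy TPM values are hypothetical and serve only to show the arithmetic; the study itself derives PSI with SUPPA.

```python
def percent_spliced_in(inclusion_tpm, exclusion_tpm):
    """PSI for one event in one sample from transcript-level TPMs.

    inclusion_tpm: TPMs of transcripts that include the event.
    exclusion_tpm: TPMs of transcripts that exclude the event.
    Returns None when no relevant transcript is expressed (PSI undefined).
    """
    included = sum(inclusion_tpm)
    total = included + sum(exclusion_tpm)
    if total == 0:
        return None
    return included / total


# Hypothetical skipped-exon event: two isoforms include the exon, one skips it.
psi = percent_spliced_in(inclusion_tpm=[6.0, 2.0], exclusion_tpm=[2.0])
print(psi)
```

A value near one means the event is mostly included in that sample; a value near zero means it is mostly excluded.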

Fig. 1.

Alternative Splicing and its landscape in COAD. (a) Schematic representation of the alternative splicing (AS) process, with PSI value calculations (bottom; see Materials and Methods); (b) Skipped exon (SE), retained intron (RI), mutually exclusive exons (MX), alternative last and first exons (AL/AF), and alternative 5’ and 3’ splice sites (A5/A3) splicing types are shown. Horizontal gray bars represent the number of events of each type identified in the TCGA-COAD cohort (15,769 AS events in total after data preprocessing, see Materials and Methods).

To ensure a robust statistical machine learning framework with subsequent validation on unseen testing data, the TCGA-COAD cohort was split into non-overlapping training (n = 133) and testing (n = 133) sets, based on powered sample-size split estimates (Fig. 2, see Materials and Methods). To ensure class balance, the split was done through stratified sampling of the TCGA-COAD dataset, based on clinical endpoint (PFS), cancer stage, gender, and age (Table 1, see Materials and Methods). In addition to validation using the TCGA testing set, we further assessed the generalizability of our findings using an independent cohort AC-ICAM (Table 1, see Materials and Methods), which serves as a separate testing set for model validation.
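As an illustration of the stratified 50/50 split described above, the following Python sketch divides each stratum in half at random. The stratum definition (PFS event, early versus late stage, gender, and age at or above 65), the function name, and the toy patients are assumptions made for the example; the exact procedure is the one given in Materials and Methods.

```python
import random
from collections import defaultdict


def stratified_half_split(patients, strata_key, seed=0):
    """Split patients into two halves while balancing every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for patient in patients:
        strata[strata_key(patient)].append(patient)

    train, test = [], []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        n_train = len(members) // 2
        # For odd-sized strata, give the extra patient to the smaller side.
        if len(members) % 2 and len(train) <= len(test):
            n_train += 1
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Hypothetical toy cohort of 40 patients.
patients = [
    {"id": i, "event": i % 4 == 0, "late_stage": i % 3 == 0,
     "male": i % 2 == 0, "age": 50 + i}
    for i in range(40)
]


def stratum(p):
    return (p["event"], p["late_stage"], p["male"], p["age"] >= 65)


train, test = stratified_half_split(patients, stratum)
print(len(train), len(test))
```

Splitting within strata keeps the proportions of events, stages, genders, and age groups similar in both sets, which is the class balance the text refers to.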

Fig. 2.

Schematic representation of a machine learning framework to identify and validate the AS-PFS signature. (left) Model Training: identification of AS-PFS signature and (right) Model Testing: validation of identified AS-PFS signature using Kaplan-Meier survival and time-dependent ROC-AUC analyses.

Table 1.

Clinical characteristics of the TCGA-COAD (training and testing sets) and AC-ICAM patient cohorts utilized in this study.

Characteristics | TCGA-COAD [22] | TCGA-COAD training set (50%) | TCGA-COAD testing set (50%) | AC-ICAM [23] testing cohort
Number of patients | 266 | 133 | 133 | 348
Number of PFS tumor events | 75/266 (28.2%) | 38/133 | 37/133 | 109/348 (31.2%)
Age at diagnosis, years (mean ± SD) | 65.0 ± 13.2 | 64.90 ± 14.13 | 65.08 ± 12.22 | 68.18 ± 11.45
Gender: female | 120 | 57 | 63 | 166
Gender: male | 146 | 76 | 70 | 182
AJCC pathologic tumor stage: stage I | 42 | 20 | 22 | 55
AJCC pathologic tumor stage: stage II | 106 | 54 | 52 | 122
AJCC pathologic tumor stage: stage III | 79 | 37 | 42 | 110
AJCC pathologic tumor stage: stage IV | 39 | 22 | 17 | 61

In the training phase, PSI values of AS events were used as inputs, and PFS was used as the clinical endpoint. To ensure the robustness and independence of the discovery, we designed a rigorous “sequential” strategy that prioritizes selection of the most relevant events (adjusted for clinical covariates). It comprised univariable Cox PH modeling [20], Robust Likelihood-Based Survival (RBSURV) modeling [21], and multivariable Cox PH modeling [20], producing a final set of five independent (non-multicollinear, with independent predictive value) AS events significantly associated with PFS (AS-PFS signature). To ensure the clinical utility of our findings, we utilized the AS-PFS signature to assign a risk score to each patient, reflecting the likelihood that the patient’s cancer will rapidly progress (Fig. 2).

In the testing phase, the AS-PFS signature and risk score were evaluated and validated using Kaplan-Meier survival analysis [25] and time-dependent ROC analysis [24] in comparison to commonly used clinical variables (i.e., stage, gender, age). The robustness and significance of our results and their independent predictive value were evaluated through comparison to other markers and methods.

Model training: AS signature associated with COAD PFS

The training cohort comprised 133 TCGA-COAD patients and included 38 patients who experienced a PFS event. This group consisted of 57 females and 76 males, with 74 patients classified in stages I and II, and 59 in stages III and IV. The average age was 64.90 years with a standard deviation of 14.13 years. In the training phase, our objective was to identify AS events that serve as a prognostic signature for PFS in COAD patients. To ensure the robustness and independent predictive value of predictions, we developed a “sequential” framework, which utilized predictions from univariable Cox Proportional Hazards (Cox PH) modeling [20] as input into Robust Likelihood-Based Survival (RBSURV) modeling [21], and subsequently into multivariable Cox PH modeling [20]. The first step in our analysis was to select individual AS events that are significantly associated with COAD PFS. Given the large number of input variables (15,769 AS events after filtering, see Materials and Methods), and in order to avoid overfitting and numerical instability of the model, as suggested in [26,27], we first subjected each of the 15,769 AS events to univariable Cox PH modeling, using the PSI values of the AS event as the predictor variable and PFS as the response variable, adjusted for tumor stage, gender, and age. This reduced the variable space and identified 75 AS events (adjusted hazard p-value < 0.01, Efron’s local FDR < 0.2) (Fig. 3a, see Materials and Methods, Supplementary Data 1).
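For orientation, the covariate-adjusted univariable screen described above can be written in the standard Cox PH form. The symbols are introduced here for illustration only, and the coding of stage, gender, and age as covariates is an assumption rather than a detail taken from the paper.

```latex
% Hazard for patient i when screening a single AS event e
h_i(t) = h_0(t)\,\exp\!\bigl(\beta_e\,\mathrm{PSI}_{ie}
         + \gamma_1\,\mathrm{stage}_i
         + \gamma_2\,\mathrm{gender}_i
         + \gamma_3\,\mathrm{age}_i\bigr),
\qquad
\mathrm{HR}_e = e^{\beta_e},
\qquad
H_0:\ \beta_e = 0 .
```

Under this notation, an event passes the screen when the test of $H_0$ gives p < 0.01 and the local FDR is below 0.2, the thresholds stated above.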

Fig. 3.

Model Training identified the five-AS event signature predictive of PFS in COAD patients (a) 75 AS events with adjusted hazard p < 0.01 were identified in univariable Cox PH analysis (dashed line is the threshold of p = 0.01). (b) RBSURV modeling found the 20 most PFS-relevant AS events. The asterisk (*) indicates the smallest AIC value. (c) Forest plot of the 20 AS events of hazard ratios (HRs) from multivariable Cox PH analysis. Stars indicate significant AS events with adjusted hazard p < 0.01. (d) Schematic representation of the splicing patterns for five identified AS events in the AS-PFS signature. (e) Distribution of risk scores and classification into low and high-risk groups, along with a heatmap displaying PSI values for the AS-PFS signature across these risk groups. (f) Kaplan-Meier survival analysis of identified risk groups confirmed a significant difference in PFS. Log-rank p-value indicated.

While univariable Cox proportional hazards modeling produces events that are individually associated with PFS, it does not consider their potential multicollinearity and combinatorial effect on COAD PFS. To overcome these limitations, we utilized the 75 AS events identified from the univariable Cox PH analysis as inputs in Robust Likelihood-Based Survival (RBSURV) modeling [21]. This method, based on the partial likelihood of the Cox PH model adjusted for clinical covariates, employs forward selection, generating a series of AS event combinations and selecting the model with the minimum Akaike Information Criterion (AIC) [28] value (see Materials and Methods), which yields an optimal set of AS events most relevant for PFS prediction. This process yielded the 20 non-multicollinear, most PFS-relevant AS events (Fig. 3b). As expected, Variance Inflation Factor (VIF) [29] analysis confirmed that all 20 events had VIF values below the critical threshold of 5 (see Materials and Methods), indicating the absence of multicollinearity.
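The minimum-AIC step can be sketched in a few lines of Python. This is a simplified illustration, not the RBSURV implementation: it assumes the forward-selection path and the log partial likelihood of each nested model are already available, and the event names and likelihood values are invented.

```python
def select_by_min_aic(nested_models):
    """Pick the model with the smallest AIC = -2 * log-likelihood + 2 * k.

    nested_models: list of (event_names, log_partial_likelihood) pairs,
    one per step of a forward-selection path. k counts only the AS events;
    clinical covariates add the same constant to every model and therefore
    do not change which model is selected.
    """
    best_events, best_aic = None, None
    for events, loglik in nested_models:
        aic = -2.0 * loglik + 2.0 * len(events)
        if best_aic is None or aic < best_aic:
            best_events, best_aic = events, aic
    return best_events, best_aic


# Hypothetical forward-selection path with made-up log partial likelihoods.
path = [
    (("event_1",), -120.0),
    (("event_1", "event_2"), -115.0),
    (("event_1", "event_2", "event_3"), -114.5),
]
chosen, aic = select_by_min_aic(path)
print(chosen, aic)
```

In this toy path the third event improves the likelihood too little to offset the penalty for an extra parameter, so the two-event model is retained.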

Finally, to identify events that remained significant when adjusted for the effects of the whole set of 20 AS events and clinical covariates, we utilized these 20 events as inputs into a multivariable Cox PH model, which refined the signature down to five significant AS events (i.e., the AS-PFS signature) with an adjusted hazard p-value < 0.01 (Fig. 3c). The five identified AS events (Fig. 3d) spanned OR52K1 (A3 splice site in exon 2), SPIN3 (A3 splice site in exon 5), NDUFV1 (A5 and retained intron splice site in exon 8), BMPR1A (A3 splice site in exon 14), and ARPC4 (A5 splice site in exon 1). Among them, AS events in OR52K1, SPIN3, and NDUFV1 had positive coefficients, indicating that higher PSI values for these events are linked to an increased hazard (risk) and shorter time to progression. On the other hand, AS events in BMPR1A and ARPC4 had negative coefficients, implying the opposite association.

To enhance the clinical utility and applicability of our findings, we utilized the AS-PFS signature to mathematically define a risk score (reflecting the risk of COAD progression, i.e., of a PFS event) for each patient. This score was defined as the sum of the scaled PSI values of the five AS events from the AS-PFS signature, weighted by the strength of their effect on PFS (i.e., their regression coefficients from the multivariable Cox PH analysis, see Materials and Methods), and reflected the likelihood that a patient will develop COAD progression rapidly (the higher the score, the higher the likelihood). Once risk scores for all patients in the training cohort were estimated, we categorized patients into two risk groups: low and high risk (Fig. 3e, see Materials and Methods). Patients in the low-risk group were predicted to have the slowest cancer progression, while those in the high-risk group faced the highest likelihood of rapid progression. As a confirmation of our categorization, these groups were subjected to Kaplan-Meier survival analysis, which demonstrated a significant difference in PFS between the two groups (Fig. 3f, HR = 4.3, log-rank p-value < 0.0001), suggesting that the risk score can be used as a unified PFS predictor variable for COAD patients.
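The weighted-sum risk score can be sketched as follows. Several details are assumptions made for illustration: PSI values are scaled by z-scoring within the cohort, the high/low cut-off is the median score, and the coefficients and PSI values are invented. The actual scaling and cut-off are those defined in Materials and Methods.

```python
from statistics import mean, median, pstdev


def zscale(values):
    """Center and scale PSI values across patients (assumed scaling)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def risk_scores(psi_by_event, coefficients):
    """Sum of scaled PSI values weighted by Cox regression coefficients."""
    events = list(coefficients)
    scaled = {e: zscale(psi_by_event[e]) for e in events}
    n_patients = len(psi_by_event[events[0]])
    return [
        sum(coefficients[e] * scaled[e][i] for e in events)
        for i in range(n_patients)
    ]


def split_by_median(scores):
    """Label patients above the median score as high risk (assumed cut-off)."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


# Hypothetical two-event signature measured in four patients.
psi = {"event_up": [0.1, 0.2, 0.8, 0.9], "event_down": [0.9, 0.8, 0.2, 0.1]}
coef = {"event_up": 1.0, "event_down": -0.5}  # made-up coefficients
scores = risk_scores(psi, coef)
groups = split_by_median(scores)
print(groups)
```

A positive coefficient raises the score as PSI increases and a negative coefficient lowers it, matching the direction of association reported for the signature events.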

Model testing: validation of the identified AS-PFS signature

The next essential step in our analysis was to evaluate the ability of the AS-PFS signature and the corresponding patient risk scores to predict PFS on previously unseen data. For this, we utilized the non-overlapping TCGA testing set and an independent AC-ICAM cohort. The TCGA-COAD testing set comprised 133 patients (not overlapping with the TCGA-COAD training cohort) and included 37 patients who experienced a PFS event. This group consisted of 63 females and 70 males, with 74 patients classified in stages I and II, and 59 in stages III and IV. The average age was 65.08 years with a standard deviation of 12.22 years, with all variables comparable to the training cohort (Table 1, see Materials and Methods). The AC-ICAM cohort comprised 348 patients and included 109 patients who experienced a PFS event. This group consisted of 166 females and 182 males, with 177 patients classified in stages I and II, and 171 patients in stages III and IV. The average age was 68.18 years with a standard deviation of 11.45 years (Table 1, see Materials and Methods).

First, to assess the clinical utility of the five-event AS-PFS signature in both the TCGA testing and AC-ICAM cohorts, we computed a risk score for each patient, following the same method used in the training set (see Materials and Methods), and categorized patients into low- and high-risk groups (Fig. 4a, d). We then utilized Kaplan-Meier survival analysis to evaluate the difference in PFS between the defined groups, which demonstrated significant predictive ability of the AS-PFS risk score (TCGA test: HR = 2.1, log-rank p-value = 0.023; AC-ICAM: HR = 1.6, log-rank p-value = 0.013; Fig. 4b, e). This analysis highlights the potential of the five-event AS-PFS signature and the corresponding risk score to identify patients at risk of developing COAD progression. In addition, we subjected patient groups, based on the upper and lower quartiles of their risk score distribution, to Kaplan-Meier survival analysis in both the TCGA test and AC-ICAM cohorts, which indicated that the upper quartile was a major contributing factor in predicting patients at higher risk of COAD progression (see Materials and Methods, Supplementary Fig. 2).
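For readers less familiar with the Kaplan-Meier estimator used to compare the risk groups, a minimal pure-Python sketch is given below. The follow-up times and event indicators are invented; in practice the curve would be estimated separately for the low- and high-risk groups and compared with a log-rank test.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up time per patient (time to PFS event or last follow-up).
    events: 1 if the PFS event was observed, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Hypothetical group of five patients (times in months).
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
print(curve)
```

Censored patients contribute to the number at risk until their last follow-up but never lower the curve themselves, which is why the estimator suits PFS data with incomplete follow-up.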

Fig. 4.

The AS-PFS risk score signature predicts PFS progression in the testing TCGA and AC-ICAM cohorts. (a, d) Distribution of risk scores and classification into low and high-risk groups, along with a heatmap displaying PSI values for the AS-PFS signature across two risk groups. (b, e) Kaplan-Meier survival analysis of identified risk groups confirmed a significant difference in PFS. Log-rank p-value indicated. (c, f) Time-dependent ROC analysis.

Furthermore, to evaluate the independent predictive value of the risk score when compared to the commonly used clinical variables (i.e., the advantage of using them in combination in the clinic), we evaluated the ability of the AS-PFS derived risk score to predict the probability of PFS in a time-dependent manner through time-dependent ROC analysis in both validation cohorts (Fig. 4c, f). ROC curves were evaluated using the area under the curve (AUC-ROC), where AUC = 0.5 indicates a random predictor and AUC = 1.0 indicates a perfect predictor. This analysis demonstrated that incorporating the AS-PFS signature’s risk score improved the AUC-ROC at all time points compared with clinical variables alone (i.e., tumor stage, gender, and age) and that the predictive ability of the AS-PFS signature does not change over time, indicating its added independent predictive value and highlighting the benefit of using the AS-PFS risk score in addition to clinical variables in the clinical setting.
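The idea behind an AUC evaluated at a given time point can be illustrated with a deliberately simplified Python sketch. It is not the estimator used in the study: here patients censored before the time point are simply dropped, whereas estimators designed for censored data treat them more carefully. All scores, times, and event indicators are invented.

```python
def auc_at_time(scores, times, events, horizon):
    """Simplified AUC at a time horizon.

    Cases:    patients with an observed event at or before the horizon.
    Controls: patients still event-free after the horizon.
    Patients censored before the horizon are excluded (a simplification).
    Returns the fraction of case/control pairs in which the case has the
    higher score (ties count as half), or None if either group is empty.
    """
    cases = [s for s, t, e in zip(scores, times, events) if e and t <= horizon]
    controls = [s for s, t in zip(scores, times) if t > horizon]
    if not cases or not controls:
        return None
    wins = sum(
        1.0 if case > ctrl else 0.5 if case == ctrl else 0.0
        for case in cases
        for ctrl in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical risk scores and follow-up (months); horizon of 4 months.
auc = auc_at_time(
    scores=[0.9, 0.3, 0.4, 0.2, 0.6],
    times=[1, 2, 6, 7, 3],
    events=[1, 1, 0, 0, 0],
    horizon=4,
)
print(auc)
```

Repeating the calculation over a range of horizons gives a curve of AUC against time, which is the kind of summary compared between models with and without the risk score.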

Significance and robustness analysis: comparative analysis to known markers of COAD progression and commonly utilized methods

To evaluate the significance and robustness of our findings, we compared the predictive ability of the AS-PFS derived risk score to (i) known transcriptomic and genomic markers of COAD progression; and (ii) commonly used methods in both validation patient cohorts. For comparison to known markers of COAD progression, we utilized 19 transcriptomic [30–47] and 18 genomic [48–63] previously identified markers of COAD progression and used them to assess the independent predictive value of the AS-PFS derived risk score using the Cox PH model on the testing cohorts (Fig. 5a-b, Supplementary Fig. 1, Supplementary Data 2–5). The AS-PFS derived risk score showed significant improvement when added to the predictive values of known transcriptomic and genomic markers and Microsatellite Instability (MSI) [64,65] in both validation cohorts, emphasizing the independent ability of the AS-PFS derived risk score to predict PFS over known transcriptomic and genomic markers of COAD progression.

Fig. 5.

The predictive ability of the five AS-PFS signature’s risk score outperforms/enhances the predictive accuracy of known markers of cancer progression and other methods. (a, b) Comparison to known transcriptomic and genomic markers of cancer progression. Wald test p-values from the Cox PH model are indicated. (c, d) Comparison to other methods, including top mRNA expression genes, LASSO regression analysis, and top AS PSI.

To evaluate the advantages of utilizing the proposed robust machine learning sequential framework, we compared the performance of our method to the performance of other methods (Fig. 5c, d) commonly used in such analysis, in particular (1) univariable Cox PH analysis on gene expression (top k = 5); (2) univariable Cox PH analysis on AS PSI levels (top k = 5); and (3) univariable Cox PH analysis on AS PSI followed by Least Absolute Shrinkage and Selection Operator (LASSO) [66] regularization and multivariable Cox PH (i.e., replacing RBSURV with LASSO). To ensure that all methods are comparable to our machine learning approach, we trained these methods on the TCGA-COAD training cohort, with each producing a list of predictions, either gene or splicing event lists, respectively (i.e., the 5 most significant predictors for mRNA expression; the 5 most significant predictors for AS PSI; and 8 predictors for LASSO on AS PSI), which were then validated in both testing/validation cohorts using the Wald test p-value. All comparative analyses were done with and without adjustment for clinical variables (cancer stage, age, and gender) and demonstrated that our method outperformed the other approaches (Fig. 5c, d, Supplementary Data 6).

Taken together, these analyses indicated that the five-gene AS-PFS signature not only holds its own predictive power but also demonstrates independent predictive value, enhancing the predictive accuracy when combined with other established markers of COAD progression and significantly increasing its potential clinical utilization.

Discussion

In this study, we have developed a robust machine learning framework to identify an AS signature comprising AS events in five genes (OR52K1, SPIN3, NDUFV1, BMPR1A, and ARPC4) that are highly associated with PFS in COAD patients. Our framework allows for the identification of the non-multicollinear AS events that are most relevant for COAD PFS prediction and demonstrates their independent predictive value on unseen data in the TCGA-COAD testing set and the AC-ICAM cohort, both as a stand-alone risk score and in comparison to other markers and methods.

Some of the genes harboring the AS-PFS signature events have been previously implicated in oncogenic processes and therapeutic responses across a variety of cancer types, yet their role and their cross-talk in colorectal adenocarcinoma progression remain to be investigated. In particular, OR52K1, a member of the olfactory receptor gene family [67,68] encoding G-protein coupled receptors (GPCR), has been found to be overexpressed in human prostate cancer tissue [69] and correlated with tumor progression in invasive breast carcinoma [70].

SPIN3, a member of the spindlin family [71], is known for its role in various cancers, including esophageal squamous cell carcinoma (ESCC) and seminoma [72,73]. Moreover, SPIN3 has been shown to be a promoter of apoptosis resistance in various cancer cell lines [73] and plays a role in downregulation of Cyclin D1 (CYCD1), a key regulator of cell division and cell cycle and a downstream target of the PI3K/AKT pathway.

NDUFV1, a ubiquinone oxidoreductase gene, encodes the 51-kD subunit of complex I (NADH dehydrogenase) of the mitochondrial respiratory chain. Its mutations are linked to mitochondrial complex I deficiency [74] and can cause aberrant splicing, leading to mitochondrial respiratory chain (MRC) disorder [75]. It has been shown that the NAD+/NADH balance controlled by complex I is crucial for breast cancer progression, and inhibition of complex I via NDUFV1 knockdown increases metastatic activity in aggressive breast cancer cells [76]. In addition, the interaction of NDUFV1 with NDUFS1 is crucial for the PHB2-mediated enhancement of mitochondrial respiration in colorectal cancer progression [77], suggesting potential new avenues for therapeutic vulnerabilities.

BMPR1A, bone morphogenetic protein receptor type 1A, has been shown to be involved in juvenile polyposis (JP), an autosomal dominant syndrome predisposing to gastric and colorectal cancers [78–81], with intronic mutations leading to aberrant splicing in pediatric patients [82]. Furthermore, BMPR1A has been shown to play a critical role in regulating angiogenesis and tumor progression through the BMP/Smad signaling pathway in colon cancer cells [83].

Lastly, ARPC4 encodes a subunit of the actin-related protein complex Arp2/3, crucial for actin polymerization and cellular dynamics. Elevated ARPC4 expression has been observed in bladder and colorectal cancers, significantly correlating with lymphatic metastasis and serving as an independent prognostic biomarker [84,85]. Knockdown studies in glioblastoma multiforme have shown that reducing ARPC4 expression enhances CAR T cell efficacy, suggesting its potential as a therapeutic target [86].

Furthermore, cross-talk between the AS-harboring genes promises to open new horizons for the mechanistic understanding of COAD progression and new therapeutic opportunities. Mechanisms for further investigation could include the regulation of cell cycle progression, the epigenetic control of differentiation and gene transcriptional availability, the regulation of intracellular proliferative signaling pathways, ATP synthesis, and cell motility and invasiveness [70,71,73–76,83,84], providing valuable hypotheses for further experimental and clinical validation.

Though our study focused on mRNA alternative splicing alone, we appreciate that these mechanisms do not work in isolation, and that understanding their interplay with epigenomic and non-coding elements is crucial to uncovering the complexity of cancer progression and therapeutic response. For instance, miRNAs, which have received substantial attention for their role in the initiation and progression of colorectal cancer [87], have been shown to modulate alternative RNA splicing in addition to their well-known function in gene expression regulation [88]. By targeting splicing factors or regulatory elements involved in RNA processing, miRNAs can influence which splice variants are produced, leading to either oncogenic or tumor-suppressive isoforms. Beyond miRNAs, long non-coding RNAs (lncRNAs) are emerging as key regulators of alternative splicing. lncRNAs act as scaffolds, guides, or decoys for splicing regulators [89,90], thereby repositioning these factors at specific transcripts and modulating spliceosome assembly. Thus, investigating the cross-talk between miRNAs, lncRNAs, and AS could provide deeper insights into the molecular mechanisms driving COAD progression and potentially reveal novel candidates for therapeutic investigations.

Even though in this study we focused on bulk RNA-seq data, which was available for the TCGA and AC-ICAM patient cohorts, single-cell RNA (scRNA) sequencing would provide more precise resolution and delve deeper into tumor heterogeneity, microenvironment, and their effect on cancer progression and therapeutic response [91]. Yet widely used single-cell platforms, such as 10x Genomics [92], are often not suited for alternative splicing analysis, as their technology can be biased toward the 3’ end of mRNA, missing the majority of alternative splicing events. SmartSeq2 [93,94] and Nanopore [95] sequencing promise to overcome these limitations, but such datasets remain scarce. As the cost of sequencing becomes more affordable, we foresee more widespread utilization of these technologies and datasets for further alternative splicing investigation.

Finally, the detection of AS events has technical challenges due to the diversity of splicing patterns and the limitations of current experimental techniques like RNA-seq, microarrays, and RT-PCR [96,97]. Short-read RNA-seq may not capture long-range splicing events or distinguish between similar isoforms due to limited read lengths and uneven sequencing coverage [96]. Microarrays and RT-PCR face difficulties in designing probes and primers that can specifically detect all possible splice variants [97]. Additionally, AS is regulated in a context-dependent manner, influenced by factors such as cell type, development stage, and environmental conditions, complicating the standardization of assays across different biological contexts. To overcome these limitations, long-read sequencing technologies provided by PacBio [98] and Nanopore [95] offer effective alternatives by enabling the sequencing of full-length transcripts for a more comprehensive analysis of AS events, yet their utilization is currently limited by cost. As this limitation is overcome, we expect more accurate detection of AS events and corresponding isoforms, with subsequent utilization for prognostic and predictive modeling.

In summary, our robust machine-learning framework has identified a five-event AS-PFS signature, spanning five genes, with independent predictive value for COAD PFS. Some of the AS-harboring genes have been implicated in disease progression and therapeutic response across various cancer types, and their cross-talk and interplay with other cellular mechanisms promise to open previously unexplored avenues for investigating mechanisms of cancer progression and therapeutic vulnerabilities. We propose that the identified signature and the AS-PFS-derived risk score could be utilized in the clinic to refine risk stratification for patients who may benefit from, or may want to avoid, more aggressive therapeutic strategies, and to offer deeper insights into optimized therapeutic opportunities for COAD.

Methods

Patient cohorts utilized for study analysis

For the development and validation of our model, we utilized RNA-seq data from two independent colorectal adenocarcinoma (COAD) patient cohorts: the TCGA-COAD dataset22 and the AC-ICAM dataset23. Both cohorts consisted of fresh-frozen tumor samples and were selected for their comprehensive clinical annotations and availability of progression-free survival (PFS) data, enabling robust training and testing/validation of the model and findings.

RNA-seq samples from The Cancer Genome Atlas Colorectal Adenocarcinoma (TCGA-COAD) patient dataset22 were profiled on Illumina HiSeq 2000 and downloaded from GDC through dbGaP phs000178 access (paired-end reads at 75 bp per read with 60–100 million reads per sample). Raw FASTQ files were pseudoaligned to the hg38 reference with Kallisto99 (version 0.43.1), generating 100 bootstrap re-samplings (repeated quantification of the same sample with resampled counts) for use in downstream differential analysis tools. After resampling, the expectation-maximization (EM) algorithm was run to refine abundance estimates. The output transcript abundances were subsequently used for mRNA alternative splicing analyses with SUPPA100 (version 2.2.1), a preferred method for accurate alternative splicing estimation from 75 bp reads because it incorporates the precise sequence of each transcript and excludes intronic gaps. The clinical data files for the TCGA-COAD cohort were obtained from the cBio Cancer Genomics Portal (cBioPortal, https://www.cbioportal.org). After excluding samples with incomplete clinical (e.g., gender, age, tumor stage) or survival (progression-free survival) information, a total of 266 tumor samples were selected for the study cohort. In this cohort, the average age at initial diagnosis was 65.0 years (SD = 13.2), with 120 females and 146 males. According to the American Joint Committee on Cancer (AJCC) staging system, the cohort included stage I (n = 42), stage II (n = 106), stage III (n = 79), and stage IV (n = 39) tumors. Cancer is present at every stage; the higher the stage, the larger the tumor and the more it has spread into other tissues and organs.
The clinical endpoint in this study was progression-free survival (PFS), defined as the time to any new tumor event (disease progression, local recurrence, distant metastasis, or a new primary tumor at any site) or, for censored patients, the time to the latest follow-up. This cohort consisted of patients with new tumor events (n = 75) and censored patients (n = 191). The average follow-up time was 868 days overall: 581 days (until the new tumor event) for patients who experienced tumor events and 982 days for censored patients.

RNA-seq samples from the AC-ICAM23 patient dataset were profiled on Illumina HiSeq 4000 and downloaded from GDC through dbGaP phs002978 access (paired-end reads at 75 bp per read with ~ 20 million reads per sample). The same computational data analysis pipeline was applied as for the TCGA cohort, namely pseudoalignment with Kallisto99 (version 0.43.1) including 100 bootstraps, followed by EM-based refinement and PSI calculation using SUPPA100 (version 2.2.1). The clinical data files were retrieved from the cBio Cancer Genomics Portal (cBioPortal, https://www.cbioportal.org). After excluding non-tumor samples, 348 tumor samples remained for analysis. The average age at diagnosis was 68.2 years (SD = 11.45), with 166 females and 182 males. AJCC staging was as follows: stage I (n = 55), stage II (n = 122), stage III (n = 110), and stage IV (n = 61). The same PFS definition was used for consistency. This cohort included 109 patients with tumor progression events and 239 censored patients. The average follow-up time was 1,762 days overall, with 482 days for event patients and 2,346 days for censored individuals.

Alternative splicing analysis with SUPPA

To estimate mRNA alternative splicing (AS), we utilized SUPPA100 (version 2.2.1). Seven types of AS events were considered by SUPPA and utilized in this study: skipped exon (SE), mutually exclusive exons (MX), alternative 5’/3’ splice sites (A5/A3), alternative first/last exons (AF/AL), and retained intron (RI) (Fig. 1). To quantify AS events, SUPPA uses the Percent Spliced In (PSI) value, which reflects the inclusion rate of a specific splicing event in the gene of interest. PSI is defined as the ratio of the abundance, in Transcripts Per Million (TPM), of the gene transcripts that include one form of the event, $S_1$ (numerator), to the total abundance (TPM) of all transcripts of that gene that contain either form of the event, $S_1 \cup S_2$ (denominator). Mathematically, for each splicing event in a specific gene, this is expressed as:

$$\mathrm{PSI} = \frac{\sum_{i \in S_1} \mathrm{TPM}_i}{\sum_{j \in S_1 \cup S_2} \mathrm{TPM}_j}$$

where $S_1$ is the set of transcripts that include the event for a specific gene, and $S_2$ is the set of transcripts that skip the event for that gene. PSI values range from zero to one, with a value close to one (1) indicating that nearly all identified transcripts of the gene include that splicing event, and a value close to zero (0) indicating that almost none of the identified transcripts include that splicing event.
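The PSI definition above can be illustrated with a minimal Python sketch. This is not SUPPA code: the function name `psi`, the transcript identifiers, and the TPM values are hypothetical, and returning `None` for a zero denominator is an assumption about how undefined values might be handled.

```python
def psi(tpm, including, skipping):
    """PSI = TPM of transcripts including the event (S1) divided by
    TPM of all transcripts carrying either form of the event (S1 union S2).

    Returns None when the denominator is zero (assumption: PSI is undefined).
    """
    s1 = set(including)
    s_all = s1 | set(skipping)
    total = sum(tpm.get(t, 0.0) for t in s_all)
    if total == 0:
        return None
    return sum(tpm.get(t, 0.0) for t in s1) / total


# Hypothetical transcript abundances (TPM) for one gene.
tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 60.0}
print(psi(tpm, including=["tx1", "tx2"], skipping=["tx3"]))  # 0.4
```

Here transcripts tx1 and tx2 carry the event (40 TPM in total) out of 100 TPM across all transcripts that contain either form, giving a PSI of 0.4.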

Data quality control and standardization

Using the SUPPA tool, we initially detected ~110 K splicing events across the TCGA-COAD cohort. To ensure data quality and statistical suitability, we removed AS events with any missing values across samples and, since our focus in this study was on protein-coding genes, retained only those occurring in protein-coding genes (identified using the biomaRt package102), which left 60,586 AS events. We further filtered out low-variability events with a standard deviation of PSI < 0.1 across the cohort (as suggested in ref. 101), retaining 15,769 events. Subsequently, to ensure PSI comparability across different events and genes, the PSI values of each AS event were z-score standardized (scaled): the event’s mean PSI was subtracted from each PSI value, and the result was divided by the event’s PSI standard deviation.
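The filtering and scaling steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' R code: the function name `filter_and_scale` and the input layout (a dictionary mapping each event to its PSI values across samples, with `None` for missing values) are hypothetical, and the sample standard deviation (n - 1 denominator, as in R's `sd`) is assumed.

```python
from statistics import mean, stdev


def filter_and_scale(psi_by_event, min_sd=0.1):
    """Drop events with missing or low-variability PSI, then z-score the rest."""
    scaled = {}
    for event, values in psi_by_event.items():
        if any(v is None for v in values):
            continue  # event has a missing PSI value in at least one sample
        sd = stdev(values)  # sample SD (assumption: n - 1 denominator)
        if sd < min_sd:
            continue  # low-variability event
        mu = mean(values)
        scaled[event] = [(v - mu) / sd for v in values]
    return scaled


# Hypothetical PSI values for three events across three samples.
example = {
    "event_A": [0.1, 0.5, 0.9],    # kept: SD = 0.4
    "event_B": [0.50, 0.52, 0.51], # dropped: SD < 0.1
    "event_C": [0.2, None, 0.8],   # dropped: missing value
}
print(sorted(filter_and_scale(example)))  # ['event_A']
```

After scaling, each retained event has mean 0 and standard deviation 1 across the cohort, which puts the Cox regression coefficients of different events on a comparable scale.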

Split of TCGA-COAD into training and testing cohorts

To ensure a robust machine learning framework, we divided the TCGA-COAD study cohort (n = 266) into non-overlapping training and testing sets in a 50/50 ratio (n = 133 each). The 50/50 split was required by our sample size estimates for achieving adequate power in this dataset, given the total number of patients with new tumor events (38 and 37 in the training and testing sets, respectively). Assuming α = 0.05 (two-sided) and 80% power, and using the Schoenfeld approximation103, we can detect effects of HR ≳ 2 with the sample size corresponding to the 50/50 split of the TCGA cohort. Schoenfeld calculations were done using the getSampleSizeSurvival function from the rpact R package. To ensure that clinical variables could not affect the results of the training vs. testing analysis, the split was stratified (balanced) by the clinical endpoint (PFS), tumor stage, gender, and age (Table 1) using the createDataPartition function from the caret package104.
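For orientation, the general form of a Schoenfeld-type event calculation can be sketched in a few lines of Python. This is not the rpact computation used in the study, whose settings are not given here, so the numbers it produces need not match those reported. The function names and the `covariate_variance` parameter are assumptions: a value of 0.25 corresponds to Schoenfeld's setting of two equally sized groups, while other values (for example 1.0 for a z-scored continuous predictor) are a commonly used extension.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def _z_sum(alpha, power):
    """Sum of the two-sided alpha quantile and the power quantile."""
    norm = NormalDist()
    return norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)


def required_events(hr, alpha=0.05, power=0.80, covariate_variance=0.25):
    """Number of events needed to detect hazard ratio `hr`."""
    return _z_sum(alpha, power) ** 2 / (covariate_variance * log(hr) ** 2)


def min_detectable_hr(events, alpha=0.05, power=0.80, covariate_variance=0.25):
    """Smallest hazard ratio (> 1) detectable with a given number of events."""
    return exp(_z_sum(alpha, power) / sqrt(covariate_variance * events))


# The two functions are inverses of each other for a fixed design.
d = required_events(2.0)
print(round(min_detectable_hr(d), 6))  # 2.0
```

The key point of the approximation is that power in a Cox analysis is driven by the number of events rather than the number of patients, which is why the event counts in the training and testing sets matter for the split.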

Model training

Cox PH modeling to associate AS events with COAD PFS

To identify AS events associated with PFS in the training TCGA-COAD patient cohort (n = 133), we employed Cox Proportional Hazards (Cox PH) modeling20. The Cox PH model is a semiparametric statistical model that assesses the effect of predictor variables (in our case, AS events) on the time to an event (in our case, PFS as the clinical endpoint) using the hazard function. Mathematically, the Cox PH model is represented as follows:

$$h(t \mid x_1, \ldots, x_k) = h_0(t)\,\exp\left(\sum_{j=1}^{k} \beta_j x_j\right)$$

where $x_j$ represents one of the j = 1…k predictor variables, $\beta_j$ is the regression coefficient for predictor $x_j$, and $h_0(t)$ is the baseline hazard at time t with all predictors set to zero.
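To make the model concrete, the following Python sketch evaluates the relative hazard and the Cox partial log-likelihood for a given coefficient vector. It is an illustration only: the study fitted models with coxph in R, the function names here are hypothetical, and the risk-set handling assumes the Breslow form (every subject with time at or after the event time is at risk).

```python
from math import exp, log


def relative_hazard(x, beta):
    """exp(sum_j beta_j * x_j): hazard relative to the baseline h0(t)."""
    return exp(sum(b * v for b, v in zip(beta, x)))


def cox_partial_loglik(times, events, X, beta):
    """Cox partial log-likelihood (Breslow form).

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    X      : one row of predictor values per patient
    beta   : regression coefficients, one per predictor
    """
    lp = [sum(b * v for b, v in zip(beta, row)) for row in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only through risk sets
        risk_set = sum(exp(lp[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += lp[i] - log(risk_set)
    return loglik


# Hypothetical toy data: three patients, one predictor.
times, events, X = [1, 2, 3], [1, 1, 0], [[0.5], [-0.2], [0.1]]
print(cox_partial_loglik(times, events, X, beta=[0.0]))
```

Because the baseline hazard cancels in each risk-set ratio, the coefficients can be estimated without specifying $h_0(t)$, which is what makes the model semiparametric.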

We performed Cox PH analysis separately for each AS event to estimate its association with PFS using the coxph function from the R survival package105, adjusting for clinical covariates: tumor stage, gender, and age. The association of each AS event with PFS was summarized by the Cox PH model’s hazard ratio (HR), its p-value, and its 95% confidence interval (CI). An adjusted Cox PH p-value < 0.01 for the AS event of interest was used as the threshold to identify AS events significantly associated with PFS, independent of the clinical variables (Supplementary Data 1). In addition, to account for multiple hypothesis testing in large-scale data, we utilized Efron’s method to calculate local false discovery rate (local FDR) values106 using the locfdr R package, designed specifically for large-scale simultaneous hypothesis testing. We selected a threshold of 0.2, as suggested by Efron106 (Supplementary Data 1). The PH assumption was checked using the cox.zph function from the R survival package105.
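The quantities reported for each AS event (HR, confidence interval, and p-value) follow directly from a fitted coefficient and its standard error. The Python sketch below shows the standard Wald-based calculation; the function name `wald_summary` and the example numbers are hypothetical, and a normal approximation for the coefficient estimate is assumed.

```python
from math import exp
from statistics import NormalDist


def wald_summary(beta, se, level=0.95):
    """Hazard ratio, confidence interval, and two-sided Wald p-value."""
    norm = NormalDist()
    z = beta / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    q = norm.inv_cdf(1 - (1 - level) / 2)
    return {
        "hr": exp(beta),
        "ci": (exp(beta - q * se), exp(beta + q * se)),
        "p": p_value,
    }


# Hypothetical coefficient and standard error for one AS event.
result = wald_summary(beta=0.6931, se=0.25)
print(result["hr"] > 1, result["p"] < 0.01)  # True True
```

An HR above 1 means that higher scaled PSI is associated with a higher hazard of progression, and an HR below 1 with a lower hazard.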

Optimal variable selection: RBSURV modeling

To build a robust, informative prognostic AS predictor from the univariable Cox PH output, we utilized the Robust Likelihood-Based Survival Modeling approach (RBSURV)21. RBSURV modeling is based on the partial likelihood of the Cox PH model. This approach discovers sets of events through a stepwise forward selection process, ensuring that the most informative AS events are included in the model and enhancing its predictive power. Additionally, it allows adjustment for clinical variables, such as tumor stage, gender, and age. To implement this analysis, we utilized the rbsurv21 package in R. The detailed procedure was as follows:

  1. The rbsurv model divided the samples randomly into a training set with N(1-p) samples and a validation set with Np samples. Here, we chose p = 1/3. Each AS event was fitted to the training set of samples to estimate the parameters, which were then evaluated using log-likelihood on the validation set of samples. This process was repeated for each AS event.

  2. The above procedure was performed 10 times, thus obtaining 10 log-likelihoods for each event. The AS event with the highest mean log-likelihood was selected.

  3. Next, we searched for the subsequent best event by evaluating every two-event model that combines the previously selected event with one of the remaining events, selecting the one with the highest mean log-likelihood.

  4. We continued this stepwise forward AS selection procedure, generating a series of models. Akaike Information Criterion (AIC)28 values were computed for all the candidate models, and the optimal model was selected based on the minimum AIC value.

$$\mathrm{AIC} = -2 \cdot \mathrm{loglik} + a \cdot k$$

where loglik is the log partial likelihood of the Cox PH model, k is the number of parameters in the model, and a is the pre-specified constant (a = 2).
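The overall shape of this procedure (repeated random splits, forward selection by mean validation log-likelihood, then choosing the model size by AIC) can be sketched in Python. This is a schematic of the steps described above and not the rbsurv implementation. The two callbacks are hypothetical placeholders: `validation_loglik(events, rng)` is assumed to fit a Cox model on a random two-thirds of the samples and return the log-likelihood on the remaining third, and `full_loglik(events)` is assumed to return the log partial likelihood of the model fitted on all training samples.

```python
import random
from statistics import mean


def aic(loglik, k, a=2.0):
    """AIC = -2 * loglik + a * k."""
    return -2.0 * loglik + a * k


def forward_select(candidates, validation_loglik, max_size, n_repeats=10, seed=0):
    """Grow a nested series of models, adding one event per step."""
    rng = random.Random(seed)
    selected, remaining, path = [], list(candidates), []
    while remaining and len(selected) < max_size:
        def mean_ll(event):
            # Mean validation log-likelihood over repeated random splits.
            return mean(validation_loglik(selected + [event], rng)
                        for _ in range(n_repeats))
        best = max(remaining, key=mean_ll)
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path


def choose_by_aic(path, full_loglik):
    """Pick the model in the series with the minimum AIC."""
    return min(path, key=lambda events: aic(full_loglik(events), len(events)))


# Toy, deterministic callbacks standing in for real Cox model fits.
gain = {"event_A": 3.0, "event_B": 2.0, "event_C": 0.1}
path = forward_select(gain, lambda ev, rng: sum(gain[e] for e in ev), max_size=3)
print(choose_by_aic(path, lambda ev: -10.0 + sum(gain[e] for e in ev)))
# ['event_A', 'event_B']
```

In the toy example, adding event_C improves the log-likelihood by only 0.1, which does not offset the AIC penalty of 2 per added parameter, so the two-event model is chosen.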

To construct the composite prognostic AS predictor, univariable Cox PH output, containing AS events significantly associated with PFS, was subjected to the RBSURV modeling. To enhance the reliability and validity of the results, we subjected the final set of AS events to a multivariable Cox PH model on the training set (adjusted as above), which defined the final AS-PFS signature, and then utilized it in the testing phase.

VIF analysis

To assess multicollinearity among the AS events in the final signature, we performed a Variance Inflation Factor (VIF)29 analysis for each AS event using the usdm package in R. VIF values below a threshold of 5 were taken to indicate the absence of multicollinearity.
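As an illustration of what the VIF measures, the Python sketch below computes it from the correlation matrix of the predictors, using the identity that the VIF of predictor j equals the j-th diagonal element of the inverse correlation matrix (equivalently, 1 / (1 - R²) from regressing predictor j on the others). This is not the usdm code; the function names are hypothetical, and the small Gauss-Jordan inversion is included only to keep the example dependency-free.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix of equally long predictor columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        mu, sd = mean(col), pstdev(col)
        z.append([(v - mu) / sd for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """One VIF per predictor column."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Hypothetical scaled PSI values for two events across five patients.
print([round(v, 3) for v in vif([[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])])
```

The two example predictors have a correlation of 0.8, so each has a VIF of 1 / (1 - 0.64), about 2.78, which would fall below the threshold of 5.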

Risk score assignment and division into risk groups

In addition to utilizing the AS-PFS signature per se, and to enhance the clinical utility and applicability of our findings, we further estimated the risk of developing a PFS event for each patient. For this, we defined a risk score for each patient as the sum of the scaled PSI values of the AS events in the AS-PFS signature, each weighted by the corresponding regression coefficient from the multivariable Cox PH model fitted on the final AS-PFS signature. The risk score assumes an additive effect of AS events. The mathematical formulation of the risk score is as follows:

$$\text{Risk score} = \sum_{i=1}^{k} \beta_i \cdot \mathrm{PSI}_i$$

where k is the total number of AS events (predictors) in the AS-PFS signature, $\mathrm{PSI}_i$ is the scaled PSI of AS event (predictor) i = 1…k in the final signature, and $\beta_i$ is the regression coefficient for AS event i.
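A minimal Python sketch of the risk score calculation is given below. The event names, coefficients, and scaled PSI values are hypothetical placeholders chosen for illustration; they are not the fitted values of the AS-PFS signature.

```python
def risk_score(scaled_psi, coefficients):
    """Weighted sum of scaled PSI values over the events in the signature."""
    return sum(beta * scaled_psi[event] for event, beta in coefficients.items())


# Hypothetical coefficients and one patient's scaled PSI values.
coefficients = {"event_A": 0.5, "event_B": -0.25}
patient = {"event_A": 1.0, "event_B": 2.0}
print(risk_score(patient, coefficients))  # 0.0
```

Because the PSI values are z-scored, a patient with average PSI for every event in the signature has a risk score of zero, and the sign of each coefficient determines whether above-average inclusion raises or lowers the score.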

These risk scores (one per patient) were then used to split patients into risk groups. As the cutoff, we selected the mean of the risk scores (as well as quartiles), and patients were split into low- and high-risk groups. Kaplan-Meier survival analysis was then performed to compare PFS between the risk groups (log-rank p-value < 0.05 considered statistically significant). Kaplan-Meier survival analysis was done using the survfit function from the survival105 package in R.
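The grouping and survival-curve steps can be sketched in Python as follows. This is an illustration rather than the survfit-based analysis used in the study: the function names are hypothetical, assigning scores strictly above the mean to the high-risk group is an assumption, and the log-rank test is omitted.

```python
def split_by_mean(scores):
    """Assign each patient to 'high' or 'low' risk using the mean score."""
    cutoff = sum(scores.values()) / len(scores)
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return groups, cutoff


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each event time."""
    data = sorted(zip(times, events))
    at_risk, surv, curve, i = len(data), 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]  # 1 = event, 0 = censored
            removed += 1
            i += 1
        if observed:
            surv *= 1 - observed / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Hypothetical follow-up times (days) and event indicators.
print(kaplan_meier([100, 200, 300, 400], [1, 0, 1, 1]))
# [(100, 0.75), (300, 0.375), (400, 0.0)]
```

In practice the estimator would be applied separately to the low- and high-risk groups, and the two curves compared with the log-rank test.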

Model testing/validation

We evaluated the ability of the identified AS-PFS signature to predict PFS in the non-overlapping TCGA-COAD test set and the independent AC-ICAM patient cohort. We computed the risk score for each patient (as described above) and estimated its predictive accuracy in the corresponding testing sets/cohorts through Kaplan-Meier survival analysis and time-dependent Receiver Operating Characteristic (ROC) analysis107. The performance in the ROC analysis was measured by the area under the curve (AUC)108, with an AUC of 0.5 indicating no predictive ability and an AUC of 1 indicating perfect prediction. ROC-AUC analysis was done utilizing the timeROC109 package in R.

Significance and robustness analysis

Comparison to common markers of COAD progression

To compare the predictive ability of the identified AS-PFS-derived risk score with that of known transcriptomic and genomic markers of COAD progression (plus Microsatellite Instability, MSI), we utilized both testing/validation patient cohorts. In particular, each marker was compared by its direct, independent association with disease progression (PFS) using the Cox Proportional Hazards model20. For transcriptomic markers, we utilized their gene expression values. For genomic markers, we utilized genomic alterations (downloaded from cBioPortal), including deep and shallow deletions, diploid, gain, and amplification. Cox Proportional Hazards model analysis was implemented using the coxph function from the R survival105 package, and p-values were estimated using the Wald test.

Comparison to other methods

To evaluate whether the identified final AS-PFS signature outperforms commonly used methods in predictive ability, we compared it to (1) univariable Cox PH analysis on gene expression (top k = 5); (2) univariable Cox PH analysis on AS PSI levels (top k = 5); and (3) univariable Cox PH analysis on AS PSI followed by Least Absolute Shrinkage and Selection Operator (LASSO)66 regularization. In each case, we utilized the TCGA training cohort for model training and the TCGA and AC-ICAM testing cohorts for model validation (as above). For Cox PH on expression data, we selected the top 5 genes that were significantly associated with disease progression (PFS) in the TCGA training set in univariable Cox PH analysis20 and validated them on the testing cohorts of TCGA and AC-ICAM. For Cox PH on AS PSI levels, we selected the top 5 AS events that were significantly associated with disease progression (PFS) in the TCGA training set in univariable Cox PH analysis and validated them on the testing cohorts of TCGA and AC-ICAM. For LASSO regression analysis, we utilized the AS events significantly associated with PFS in univariable Cox PH analysis in the TCGA training set (k = 75 at p < 0.01, as in our original analysis), ran LASSO analysis66, and then ran multivariable Cox PH analysis to identify the final AS signature. The LASSO regression was adjusted for clinical covariates (tumor stage, gender, and age) and performed using the cv.glmnet function from the glmnet110 R package. In this function, the penalty parameter $\lambda$ was determined using a k-fold (k = 5) cross-validation approach. Since folds in cross-validation are selected at random, we repeated this procedure 100 times to ensure the robustness of our findings. The cross-validation mean deviance was recorded for each value of $\lambda$ across the folds, serving as an indicator of the model’s prediction error. The $\lambda$ value corresponding to the minimum deviance averaged over the 100 repetitions was selected as the optimal value. The ability of the above approaches to predict COAD PFS was evaluated on the testing cohorts using the Wald test p-value of the Cox PH model through the survival105 package in R.
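The selection of the penalty parameter across repeated cross-validation runs can be sketched in Python. This is an illustration, not the glmnet workflow itself: the function name `select_lambda` and the deviance values are hypothetical, and the sketch assumes that every repetition evaluated the same grid of penalty values, so that deviances can be averaged value by value.

```python
from statistics import mean


def select_lambda(deviance_runs):
    """Pick the penalty with the lowest mean cross-validated deviance.

    deviance_runs: one dict per repetition, mapping each penalty value
    to the mean deviance across folds in that repetition.
    """
    grid = list(deviance_runs[0])
    average = {lam: mean(run[lam] for run in deviance_runs) for lam in grid}
    best = min(average, key=average.get)
    return best, average


# Hypothetical deviances from two repetitions over a three-value grid.
runs = [
    {0.01: 10.2, 0.1: 9.8, 1.0: 11.0},
    {0.01: 10.0, 0.1: 10.1, 1.0: 11.2},
]
best, average = select_lambda(runs)
print(best)  # 0.1
```

Averaging over repetitions reduces the dependence of the chosen penalty on any single random assignment of patients to folds; in the example, 0.1 wins on average even though it is not the best value in the second repetition.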

Statistical analysis

All statistical analyses were performed in R version 4.3.1 (run in RStudio). To ensure comparability of the training and testing TCGA-COAD sets and minimize bias, we applied a chi-square test for categorical variables and a t-test for continuous variables to compare distributions of covariates between the two sets (p > 0.05, Supplementary Data 7). The Cox PH model was utilized to associate AS PSI values with the progression-free survival (PFS) of patients. All Cox PH survival and RBSURV analyses were adjusted for common covariates (tumor stage, gender, and age). Patient cohorts were obtained from public repositories, and all the code was assembled using freely available R packages, as described above, with no restrictions.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (18.1KB, docx)
Supplementary Material 2 (59.7KB, xlsx)
Supplementary Material 3 (747.7KB, docx)

Author contributions

U.M. and A.M. conceived and designed the study. U.M. and A.M. performed the computational and statistical analysis, prepared the figures, and wrote the paper. M.N. contributed to study design and conceptualization. Ah. M. performed initial TCGA-COAD RNA-seq mapping and AS PSI estimation. M.M. contributed to figure conceptualization and design. M.W.C. analyzed data for molecular pathway membership. J.S.P. advised on statistical methods utilized in the manuscript. F.C. and C.L. advised on the clinical utilization of the findings. All authors edited and approved the final manuscript.

Funding

A.M. is funded by NIH NLM R01LM013236, ACS RSG-21-023-01-TBG, and DOD Data Science HT94252410346. M.W.C. is supported by 2021CIF-Rutgers-15, by the National Science Foundation under Grant # 2127309 to the Computing Research Association for the CIFellows Post-Doctoral Fellowship Award. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Data availability

Data supporting the findings of this study were obtained from the Database of Genotypes and Phenotypes (dbGaP): (1) The Cancer Genome Atlas (TCGA)22, RNA-sequencing data, phs000178.v11.p8; (2) AC-ICAM23, RNA-sequencing data, phs002978.v1.p1. Clinical data for both datasets were downloaded from cBioPortal. The summarized PSI level data will be available in the GitHub repository: https://github.com/mitrofanova-lab/SpliceML.

Code availability

All code and summarized datasets for this study will be available in the GitHub repository: https://github.com/mitrofanova-lab/SpliceML.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Alzahrani, S. M., Al Doghaither, H. A. & Al-Ghafari, A. B. General insight into cancer: an overview of colorectal cancer. Mol. Clin. Oncol.15, 1–8 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Mauri, G., Arena, S., Siena, S., Bardelli, A. & Sartore-Bianchi, A. The DNA damage response pathway as a land of therapeutic opportunities for colorectal cancer. Ann. Oncol.31, 1135–1147 (2020). [DOI] [PubMed] [Google Scholar]
  • 3.Bando, H., Ohtsu, A. & Yoshino, T. Therapeutic landscape and future direction of metastatic colorectal cancer. Nat. Reviews Gastroenterol. Hepatol.20, 306–322 (2023). [DOI] [PubMed] [Google Scholar]
  • 4.Yaeger, R. et al. Clinical sequencing defines the genomic landscape of metastatic colorectal cancer. Cancer cell.33, 125–136 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Pan, Q., Shai, O., Lee, L. J., Frey, B. J. & Blencowe, B. J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet.40, 1413–1415. 10.1038/ng.259 (2008). [DOI] [PubMed] [Google Scholar]
  • 6.Marasco, L. E. & Kornblihtt, A. R. The physiology of alternative splicing. Nat. Rev. Mol. Cell Biol.24, 242–254 (2023). [DOI] [PubMed] [Google Scholar]
  • 7.Salton, M. et al. Inhibition of vemurafenib-resistant melanoma by interference with pre-mRNA splicing. Nat. Commun.6, 7103 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.David, C. J. & Manley, J. L. Alternative pre-mRNA splicing regulation in cancer: pathways and programs unhinged. Genes Dev.24, 2343–2364 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Oltean, S. & Bates, D. O. Hallmarks of alternative splicing in cancer. Oncogene33, 5311–5318 (2014). [DOI] [PubMed] [Google Scholar]
  • 10.Miura, K., Fujibuchi, W. & Unno, M. Splice isoforms as therapeutic targets for colorectal cancer. Carcinogenesis33, 2311–2319 (2012). [DOI] [PubMed] [Google Scholar]
  • 11.Le, K., Prabhakar, B. S., Hong, W. & Li, L.-C. Alternative splicing as a biomarker and potential target for drug discovery. Acta Pharmacol. Sin.36, 1212–1218 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Yi, Q. & Tang, L. Alternative spliced variants as biomarkers of colorectal cancer. Curr. Drug Metab.12, 966–974 (2011). [DOI] [PubMed] [Google Scholar]
  • 13.Bisognin, A. et al. An integrative framework identifies alternative splicing events in colorectal cancer development. Mol. Oncol.8, 129–141 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Zong, Z. et al. Genome-wide profiling of prognostic alternative splicing signature in colorectal cancer. Front. Oncol.8, 537 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Zadeh, M. A. H. et al. Alternative splicing of TIA-1 in human colon cancer regulates VEGF isoform expression, angiogenesis, tumour growth and bevacizumab resistance. Mol. Oncol.9, 167–178 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Ni, B. et al. Alternative splicing of spleen tyrosine kinase differentially regulates colorectal cancer progression. Oncol. Lett.12, 1737–1744 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Zhao, J. et al. Systematic profiling of alternative splicing signature reveals prognostic predictor for prostate cancer. Cancer Sci.111, 3020–3031 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Zhu, J., Chen, Z. & Yong, L. Systematic profiling of alternative splicing signature reveals prognostic predictor for ovarian cancer. Gynecol. Oncol.148, 368–374 (2018). [DOI] [PubMed] [Google Scholar]
  • 19.Jin, P., Tan, Y., Zhang, W., Li, J. & Wang, K. Prognostic alternative mRNA splicing signatures and associated splicing factors in acute myeloid leukemia. Neoplasia22, 447–457 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Cox, D. R. Regression models and life-tables. J. Roy. Stat. Soc.: Ser. B (Methodol.). 34, 187–202 (1972). [Google Scholar]
  • 21.Cho, H., Yu, A., Kim, S., Kang, J. & Hong, S. M. Robust likelihood-based survival modeling with microarray data. J. Stat. Softw.29, 1–16 (2009). [Google Scholar]
  • 22.Network, C. G. A. Comprehensive molecular characterization of human colon and rectal cancer. Nature487, 330 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Roelands, J. et al. An integrated tumor, immune and Microbiome atlas of colon cancer. Nat. Med.29, 1273–1286. 10.1038/s41591-023-02324-5 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Heagerty, P. J., Lumley, T. & Pepe, M. S. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics56, 337–344 (2000). [DOI] [PubMed] [Google Scholar]
  • 25.Goel, M. K., Khanna, P. & Kishore, J. Understanding survival analysis: Kaplan-Meier estimate. Int. J. Ayurveda Res.1, 274 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Zhao, S. D. & Li, Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J. Multivar. Anal.105, 397–411. 10.1016/j.jmva.2011.08.002 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Fan, J. & Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. Royal Stat. Soc. Ser. B: Stat. Methodol.70, 849–911. 10.1111/j.1467-9868.2008.00674.x (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Bozdogan, H. Model selection and akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika52, 345–370 (1987). [Google Scholar]
  • 29.Daoud, J. I. in Journal of Physics: Conference Series. 012009 (IOP Publishing).
  • 30.Tian, W. et al. RUNX1 regulates MCM2/CDC20 to promote COAD progression modified by deubiquitination of USP31. Sci. Rep.14, 13906. 10.1038/s41598-024-64726-w (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Mook, O. R., Frederiks, W. M. & Van Noorden, C. J. The role of gelatinases in colorectal cancer progression and metastasis. Biochim. Biophys. Acta. 1705, 69–89. 10.1016/j.bbcan.2004.09.006 (2004). [DOI] [PubMed] [Google Scholar]
  • 32.Tomasello, G. et al. Association of CDX2 expression with survival in early colorectal cancer: A systematic review and Meta-analysis. Clin. Colorectal Cancer. 17, 97–103. 10.1016/j.clcc.2018.02.001 (2018). [DOI] [PubMed] [Google Scholar]
  • 33.Spano, J. P. et al. Impact of EGFR expression on colorectal cancer patient prognosis and survival. Ann. Oncol.16, 102–108. 10.1093/annonc/mdi006 (2005). [DOI] [PubMed] [Google Scholar]
  • 34.Lü, B. et al. Analysis of SOX9 expression in colorectal cancer. Am. J. Clin. Pathol.130, 897–904. 10.1309/ajcpw1w8gjbqgcni (2008). [DOI] [PubMed] [Google Scholar]
  • 35.Kavanagh, D. O. et al. Is overexpression of HER-2 a predictor of prognosis in colorectal cancer? BMC Cancer. 9, 1. 10.1186/1471-2407-9-1 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Morikawa, T. et al. Tumor TP53 expression status, body mass index and prognosis in colorectal cancer. Int. J. Cancer. 131, 1169–1178. 10.1002/ijc.26495 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Toon, C. W. et al. Immunohistochemistry for Myc predicts survival in colorectal cancer. PLoS One. 9, e87456. 10.1371/journal.pone.0087456 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Kim, S. A. et al. Loss of CDH1 (E-cadherin) expression is associated with infiltrative tumour growth and lymph node metastasis. Br. J. Cancer. 114, 199–206. 10.1038/bjc.2015.347 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Des Guetz, G. et al. Microvessel density and VEGF expression are prognostic factors in colorectal cancer. Meta-analysis of the literature. Br. J. Cancer. 94, 1823–1832. 10.1038/sj.bjc.6603176 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Pereira, H., Silva, S., Julião, R., Garcia, P. & Perpétua, F. Prognostic markers for colorectal cancer: expression of P53 and BCL2. World J. Surg.21, 210–213. 10.1007/s002689900218 (1997). [DOI] [PubMed] [Google Scholar]
  • 41.Akishima-Fukasawa, Y. et al. Prognostic significance of CXCL12 expression in patients with colorectal carcinoma. Am. J. Clin. Pathol.132, 202–210. 10.1309/ajcpk35vzjewcutl (2009). quiz 307. [DOI] [PubMed] [Google Scholar]
  • 42.Stanisavljević, L. et al. CXCR4, CXCL12 and the relative CXCL12-CXCR4 expression as prognostic factors in colon cancer. Tumour Biol.37, 7441–7452. 10.1007/s13277-015-4591-8 (2016). [DOI] [PubMed] [Google Scholar]
  • 43.Wang, Z. et al. The prognostic and clinical value of CD44 in colorectal cancer: A Meta-Analysis. Front. Oncol.9, 309. 10.3389/fonc.2019.00309 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Fang, Y. J. et al. Elevated expressions of MMP7, TROP2, and survivin are associated with survival, disease recurrence, and liver metastasis of colon cancer. Int. J. Colorectal Dis.24, 875–884. 10.1007/s00384-009-0725-z (2009). [DOI] [PubMed] [Google Scholar]
  • 45.Yan, P. et al. Reduced expression of SMAD4 is associated with poor survival in colon cancer. Clin. Cancer Res.22, 3037–3047. 10.1158/1078-0432.Ccr-15-0939 (2016). [DOI] [PubMed] [Google Scholar]
  • 46.Dunne, P. D. et al. EphA2 expression is a key driver of migration and invasion and a poor prognostic marker in colorectal cancer. Clin. Cancer Res.22, 230–242. 10.1158/1078-0432.Ccr-15-0603 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Chu, X. Y. et al. FOXM1 expression correlates with tumor invasion and a poor prognosis of colorectal cancer. Acta Histochem.114, 755–762. 10.1016/j.acthis.2012.01.002 (2012). [DOI] [PubMed] [Google Scholar]
  • 48.Conlin, A., Smith, G., Carey, F. A., Wolf, C. R. & Steele, R. J. The prognostic significance of K-ras, p53, and APC mutations in colorectal carcinoma. Gut54, 1283–1286. 10.1136/gut.2005.066514 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Arrington, A. K. et al. Prognostic and predictive roles of KRAS mutation in colorectal cancer. Int. J. Mol. Sci.13, 12153–12168. 10.3390/ijms131012153 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Xu, Y. & Pasche, B. TGF-beta signaling alterations and susceptibility to colorectal cancer. Hum. Mol. Genet.16 (Spec 1), R14–20. 10.1093/hmg/ddl486 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Hamelin, R. et al. Association of p53 mutations with short survival in colorectal cancer. Gastroenterology106, 42–48. 10.1016/s0016-5085(94)94217-x (1994). [DOI] [PubMed] [Google Scholar]
  • 52.Kato, S. et al. PIK3CA mutation is predictive of poor survival in patients with colorectal cancer. Int. J. Cancer. 121, 1771–1778. 10.1002/ijc.22890 (2007). [DOI] [PubMed] [Google Scholar]
  • 53.De Roock, W. et al. KRAS, BRAF, PIK3CA, and PTEN mutations: implications for targeted therapies in metastatic colorectal cancer. Lancet Oncol.12, 594–603. 10.1016/s1470-2045(10)70209-6 (2011). [DOI] [PubMed] [Google Scholar]
  • 54.Fleming, N. I. et al. SMAD2, SMAD3 and SMAD4 mutations in colorectal cancer. Cancer Res. 73, 725–735. 10.1158/0008-5472.Can-12-2706 (2013).
  • 55.Schirripa, M. et al. Role of NRAS mutations as prognostic and predictive markers in metastatic colorectal cancer. Int. J. Cancer 136, 83–90. 10.1002/ijc.28955 (2015).
  • 56.Ross, J. S. et al. Targeting HER2 in colorectal cancer: the landscape of amplification and short variant mutations in ERBB2 and ERBB3. Cancer 124, 1358–1373. 10.1002/cncr.31125 (2018).
  • 57.Chang, C. C. et al. FBXW7 mutation analysis and its correlation with clinicopathological features and prognosis in colorectal cancer patients. Int. J. Biol. Markers 30, e88–95. 10.5301/jbm.5000125 (2015).
  • 58.Danielsen, S. A. et al. Portrait of the PI3K/AKT pathway in colorectal cancer. Biochim. Biophys. Acta 1855, 104–121. 10.1016/j.bbcan.2014.09.008 (2015).
  • 59.Javier, B. M. et al. Recurrent, truncating SOX9 mutations are associated with SOX9 overexpression, KRAS mutation, and TP53 wild type status in colorectal carcinoma. Oncotarget 7, 50875–50882. 10.18632/oncotarget.9682 (2016).
  • 60.Huang, J. et al. IDH1 and IDH2 mutations in colorectal cancers. Am. J. Clin. Pathol. 156, 777–786. 10.1093/ajcp/aqab023 (2021).
  • 61.Morikawa, T. et al. Association of CTNNB1 (beta-catenin) alterations, body mass index, and physical activity with survival in patients with colorectal cancer. JAMA 305, 1685–1694. 10.1001/jama.2011.513 (2011).
  • 62.Sopik, V., Phelan, C., Cybulski, C. & Narod, S. A. BRCA1 and BRCA2 mutations and the risk for colorectal cancer. Clin. Genet. 87, 411–418. 10.1111/cge.12497 (2015).
  • 63.Lui, G. Y. L., Grandori, C. & Kemp, C. J. CDK12: an emerging therapeutic target for cancer. J. Clin. Pathol. 71, 957–962. 10.1136/jclinpath-2018-205356 (2018).
  • 64.Boland, C. R. & Goel, A. Microsatellite instability in colorectal cancer. Gastroenterology 138, 2073–2087 (2010). e2073.
  • 65.Vilar, E. & Gruber, S. B. Microsatellite instability in colorectal cancer: the stable evidence. Nat. Rev. Clin. Oncol. 7, 153–162 (2010).
  • 66.Ranstam, J. & Cook, J. A. LASSO regression. J. Br. Surg. 105, 1348 (2018).
  • 67.Olender, T. et al. The human olfactory transcriptome. BMC Genom. 17, 1–18 (2016).
  • 68.Volz, A. et al. Complex transcription and splicing of odorant receptor genes. J. Biol. Chem. 278, 19691–19701 (2003).
  • 69.Neuhaus, E. M. et al. Activation of an olfactory receptor inhibits proliferation of prostate cancer cells. J. Biol. Chem. 284, 16218–16225 (2009).
  • 70.Masjedi, S., Zwiebel, L. J. & Giorgio, T. D. Olfactory receptor gene abundance in invasive breast carcinoma. Sci. Rep. 9, 13736 (2019).
  • 71.Li, D., Guo, J. & Jia, R. Epigenetic control of cancer cell proliferation and cell cycle progression by HNRNPK via promoting exon 4 inclusion of histone code reader SPIN1. J. Mol. Biol. 435, 167993 (2023).
  • 72.Lu, T. et al. Identification of DNA methylation-driven genes in esophageal squamous cell carcinoma: a study based on The Cancer Genome Atlas. Cancer Cell Int. 19, 1–13 (2019).
  • 73.Janecki, D. M. et al. SPIN1 is a proto-oncogene and SPIN3 is a tumor suppressor in human seminoma. Oncotarget 9, 32466 (2018).
  • 74.Srivastava, A. et al. Genetic diversity of NDUFV1-dependent mitochondrial complex I deficiency. Eur. J. Hum. Genet. 26, 1582–1587 (2018).
  • 75.Kiss, S. et al. A cryptic pathogenic NDUFV1 variant identified by RNA-seq in a patient with normal complex I activity in muscle and transient magnetic resonance imaging changes. Am. J. Med. Genet. Part A 191, 1599–1606 (2023).
  • 76.Santidrian, A. F. et al. Mitochondrial complex I activity and NAD+/NADH balance regulate breast cancer progression. J. Clin. Investig. 123, 1068–1081 (2013).
  • 77.Ren, L. et al. PHB2 promotes colorectal cancer cell proliferation and tumorigenesis through NDUFS1-mediated oxidative phosphorylation. Cell Death Dis. 14, 44. 10.1038/s41419-023-05575-9 (2023).
  • 78.Howe, J. et al. The prevalence of MADH4 and BMPR1A mutations in juvenile polyposis and absence of BMPR2, BMPR1B, and ACVR1 mutations. J. Med. Genet. 41, 484–491 (2004).
  • 79.Friedl, W. et al. Juvenile polyposis: massive gastric polyposis is more common in MADH4 mutation carriers than in BMPR1A mutation carriers. Hum. Genet. 111, 108–111 (2002).
  • 80.Zhou, X. P. et al. Germline mutations in BMPR1A/ALK3 cause a subset of cases of juvenile polyposis syndrome and of Cowden and Bannayan-Riley-Ruvalcaba syndromes. Am. J. Hum. Genet. 69, 704–711 (2001).
  • 81.Slattery, M. L. et al. Genetic variation in bone morphogenetic protein and colon and rectal cancer. Int. J. Cancer 130, 653–664 (2012).
  • 82.Imagawa, K., Morita, A., Fukushima, H., Tagawa, M. & Takada, H. A novel BMPR1A mutation affects mRNA splicing in juvenile polyposis syndrome. Pediatr. Int. 64, e15041 (2022).
  • 83.Xiao, F. et al. MicroRNA-885-3p inhibits the growth of HT-29 colon cancer cell xenografts by disrupting angiogenesis via targeting BMPR1A and blocking BMP/Smad/Id1 signaling. Oncogene 34, 1968–1978 (2015).
  • 84.Xu, N. et al. ARPC4 promotes bladder cancer cell invasion and is associated with lymph node metastasis. J. Cell. Biochem. 121, 231–243 (2020).
  • 85.Su, X., Wang, S., Huo, Y. & Yang, C. Short interfering RNA-mediated silencing of actin-related protein 2/3 complex subunit 4 inhibits the migration of SW620 human colorectal cancer cells. Oncol. Lett. 15, 2847–2854 (2018).
  • 86.Li, X. et al. Identification of genetic modifiers enhancing B7-H3-targeting CAR T cell therapy against glioblastoma through large-scale CRISPRi screening. J. Exp. Clin. Cancer Res. 43, 95. 10.1186/s13046-024-03027-6 (2024).
  • 87.Schetter, A. J., Okayama, H. & Harris, C. C. The role of microRNAs in colorectal cancer. Cancer J. 18, 244–252 (2012).
  • 88.Marima, R. et al. MicroRNA and alternative mRNA splicing events in cancer drug response/resistance: potent therapeutic targets. Biomedicines 9. 10.3390/biomedicines9121818 (2021).
  • 89.Liu, Y. et al. Noncoding RNAs regulate alternative splicing in cancer. J. Exp. Clin. Cancer Res. 40, 11. 10.1186/s13046-020-01798-2 (2021).
  • 90.Ouyang, J. et al. Long non-coding RNAs are involved in alternative splicing and promote cancer progression. Br. J. Cancer 126, 1113–1124. 10.1038/s41416-021-01600-w (2022).
  • 91.Wu, J. et al. Integration of single-cell sequencing and bulk RNA-seq to identify and develop a prognostic signature related to colorectal cancer stem cells. Sci. Rep. 14, 12270 (2024).
  • 92.Wang, X., He, Y., Zhang, Q., Ren, X. & Zhang, Z. Direct comparative analyses of 10X Genomics Chromium and Smart-seq2. Genomics Proteom. Bioinf. 19, 253–266 (2021).
  • 93.Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
  • 94.Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).
  • 95.Branton, D. et al. The potential and challenges of nanopore sequencing. Nat. Biotechnol. 26, 1146–1153 (2008).
  • 96.Lee, C. & Roy, M. Analysis of alternative splicing with microarrays: successes and challenges. Genome Biol. 5, 1–4 (2004).
  • 97.Camacho Londoño, J. & Philipp, S. E. A reliable method for quantification of splice variants using RT-qPCR. BMC Mol. Biol. 17, 1–12 (2016).
  • 98.Rhoads, A. & Au, K. F. PacBio sequencing and its applications. Genomics Proteom. Bioinf. 13, 278–289 (2015).
  • 99.Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
  • 100.Alamancos, G. P., Pagès, A., Trincado, J. L., Bellora, N. & Eyras, E. Leveraging transcript quantification for fast computation of alternative splicing profiles. RNA 21, 1521–1531 (2015).
  • 101.Liu, Y., Jia, W., Li, J., Zhu, H. & Yu, J. Identification of survival-associated alternative splicing signatures in lung squamous cell carcinoma. Front. Oncol. 10, 587343 (2020).
  • 102.Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4, 1184–1191 (2009).
  • 103.Schoenfeld, D. A. Sample-size formula for the proportional-hazards regression model. Biometrics 39, 499–503 (1983).
  • 104.Kuhn, M. et al. Package ‘caret’. R J. 223, 48 (2020).
  • 105.Therneau, T. M. & Lumley, T. Package ‘survival’. R Top. Doc. 128, 28–33 (2015).
  • 106.Efron, B. Division of Biostatistics, Stanford University, (2005).
  • 107.Hajian-Tilaki, K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian J. Intern. Med. 4, 627 (2013).
  • 108.Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
  • 109.Blanche, P., Dartigues, J. F. & Jacqmin-Gadda, H. Estimating and comparing time-dependent areas under receiver operating characteristic curves for censored event times with competing risks. Stat. Med. 32, 5381–5397 (2013).
  • 110.Hastie, T., Qian, J. & Tay, K. An introduction to Glmnet. CRAN R Repository 5, 1–35 (2021).

Associated Data

Supplementary Materials

Supplementary Material 1 (18.1KB, docx)
Supplementary Material 2 (59.7KB, xlsx)
Supplementary Material 3 (747.7KB, docx)

Data Availability Statement

Data supporting the findings of this study were obtained from the database of Genotypes and Phenotypes (dbGaP): (1) The Cancer Genome Atlas (TCGA)22, RNA-sequencing data, phs000178.v11.p8; (2) AC-ICAM23, RNA-sequencing data, phs002978.v1.p1. Clinical data for both datasets were downloaded from cBioPortal. The summarized PSI level data will be available in the GitHub repository: https://github.com/mitrofanova-lab/SpliceML.

All code and summarized datasets for this study will be available in the GitHub repository: https://github.com/mitrofanova-lab/SpliceML.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group