RCANE: a deep learning algorithm for whole-genome pan-cancer somatic copy number aberration prediction using RNA-seq data

Changhao Ge; Xiaowen Hu; Lin Zhang; Hongzhe Li

doi:10.1038/s42003-025-08712-6

2025 Sep 24;8:1354.

Changhao Ge ^1,2, Xiaowen Hu ^3, Lin Zhang ^3, Hongzhe Li ^2,✉

PMCID: PMC12460768 PMID: 40993228

Abstract

Transcriptome sequencing (RNA-seq) of cancers is widely employed in cancer research to investigate gene expression patterns and their role in disease progression. Somatic copy-number aberrations (SCNAs)—critical genomic drivers of tumorigenesis—can also be inferred directly from RNA-seq, yielding a “two-for-one” return of quantitative expression measures plus structural-variation calls at a fraction of the cost of separate DNA assays. Here, we present RCANE, a deep-learning framework that predicts genome-wide SCNAs across diverse cancer types using only RNA-seq data. Trained on The Cancer Genome Atlas (TCGA) and DepMap cell-line cohorts, RCANE consistently outperforms existing approaches, delivering a scalable, robust solution for improving somatic copy-number aberration profiling in cancer diagnostics and therapeutic decision-making.

Subject terms: Computational models, Cancer genomics

A deep-learning framework that predicts genome-wide somatic copy number aberrations across diverse cancer types using only RNA-seq data.

Introduction

Somatic copy number aberrations (SCNAs) are a hallmark of cancer, involving large-scale genomic alterations that drive tumorigenesis and cancer progression by affecting gene dosage and altering the expression of oncogenes and tumor suppressor genes¹. Detecting SCNAs is essential for understanding cancer biology and developing personalized therapies^1–4. However, SCNA detection traditionally depends on high-cost assays such as SNP microarrays, whole-genome sequencing (WGS), or whole-exome sequencing (WES). SNP microarrays generally have an advantage over WES for CNA detection. Once the sequencing reads have been processed, WGS/WES methods essentially mimic a traditional SNP microarray analysis: they create pseudo-probes from the reads, average the read counts within a bin or sliding window, and divide by the corresponding read count in a reference sample (or group of reference samples) to obtain a $\log_{2}$ ratio, which can then be used to estimate the actual copy number. Compared to WGS, WES introduces more biases and noise, which make CNA detection very challenging. Comparative studies have shown that existing tools achieve moderate sensitivity (50%–80%), fair specificity (70%–94%), and poor false discovery rates (FDRs; 27%–60%)⁵.
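The binned log-ratio computation described above can be sketched as follows. This is only an illustration under simplifying assumptions that are not taken from the paper: the per-bin depths are already normalized for library size, the reference is diploid, and the tumor sample is pure. The function names and the pseudocount are hypothetical.

```python
import numpy as np

def log2_ratio(tumor_depth, reference_depth, pseudocount=1e-9):
    """Per-bin log2 ratio of tumor depth to reference depth.

    Both inputs are assumed to be depth-normalized averages per bin.
    The pseudocount (an assumption) only guards against division by zero.
    """
    tumor = np.asarray(tumor_depth, dtype=float)
    ref = np.asarray(reference_depth, dtype=float)
    return np.log2((tumor + pseudocount) / (ref + pseudocount))

def copies_from_log2_ratio(log2r, ploidy=2):
    """Idealized copy number for a pure tumor against a diploid reference."""
    return ploidy * 2.0 ** np.asarray(log2r, dtype=float)

# Three bins: a one-copy loss, a neutral bin, and a one-copy gain.
ratios = log2_ratio([50, 100, 150], [100, 100, 100])
copies = copies_from_log2_ratio(ratios)
```

In real samples, tumor purity and ploidy shift these values, so the mapping from $\log_{2}$ ratio to integer copy number is rarely this clean.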

Bulk mRNA sequencing (RNA-seq) offers a more cost-effective and widely used alternative in cancer genomics studies, reflecting cellular activity directly and making it a key component of multi-omic studies. Leveraging RNA-seq for copy-number calling can be more cost-effective than sequencing DNA de novo. In many cancer studies, RNA-seq is already performed to profile gene expression, so no additional library preparation or sequencing run is required to infer CNAs, saving both reagent and instrument time. Even when performed solely for CNA detection, transcriptome sequencing typically demands lower coverage than whole-genome sequencing and avoids the high per-base cost of deep WGS or the capture reagents of WES. Thus, RNA-seq CNA inference often delivers a “two-for-one” return—quantitative expression data plus structural variation calls—at a fraction of the incremental expense of separate DNA assays.

Developing an algorithm that can accurately predict SCNAs from RNA-seq data has attracted recent interest. However, the nonlinear and complex relationship between gene expression and genomic alterations, along with factors such as DNA methylation^6–9 and transcriptional adaptation¹⁰ that influence RNA transcription, presents a challenge for accurate SCNA inference from RNA-seq data. Current SCNA detection tools using bulk RNA-seq data generally fall into two categories: segmentation-based methods that adapt existing CNA detection methods developed for array comparative genomic hybridization (CGH)¹¹ or SNP array data, and machine learning approaches that use RNA-seq data as predictors. While CNVkit¹² is capable of estimating SCNAs from RNA-seq data, it was originally designed for DNA sequencing and does not generalize well to transcriptomic data. Machine learning methods such as CNAPE¹³ improve SCNA prediction from RNA-seq but typically focus on either gene- or chromosome-level alterations and miss finer patterns. Moreover, these approaches often require large training datasets, which limits their applicability in biomedical settings where only a small number of samples are available. Several other methods, including RNAseqCNV¹⁴, CaSpER¹⁵, and SuperFreq¹⁶, require B-allele frequency information, which is typically unavailable in RNA-seq studies.

Although algorithms based on single-cell RNA-seq (scRNA-seq) data, such as CopyKAT¹⁷ and SCEVAN¹⁸, have demonstrated strong performance, they are specifically designed for scRNA-seq data. These methods generally assume that each sample is a mixture of malignant and normal cells, and typically rely on identifying a cluster of normal cells with high confidence to serve as a reference. This assumption breaks down in bulk RNA-seq, which measures the combined transcriptomes of all cells in the sample, both malignant tumor cells and non-tumor cells (e.g., stromal, immune, or normal epithelial cells). Therefore, scRNA-seq-based methods for CNA detection are expected to perform poorly on bulk RNA-seq data.

These limitations call for novel approaches that effectively utilize bulk RNA-seq data for SCNA prediction. The method proposed in this paper fills this gap by introducing the RNA-seq to copy number aberration neural network (RCANE), a deep learning algorithm designed to predict whole-genome SCNAs from cancer RNA-seq data. Deep learning, with an architecture chosen to capture the various dependencies in the data, is particularly suited to this problem, as it excels at modeling complex, high-dimensional data¹⁹. In addition, the model can be fine-tuned across diverse datasets, enhancing its generalizability to new studies.

Results

An overview of RCANE

RCANE is a deep learning-based method designed to predict whole-genome copy number aberrations from RNA-seq data. It is trained using datasets from The Cancer Genome Atlas (TCGA) Program²⁰. A comparison between RCANE and existing approaches is shown in Table 1. Before neural network modeling, we preprocess raw mRNA-seq and SCNA intensity data (e.g., the $\log_{2}$ ratio of target to reference signal from SNP array platforms), as described in Fig. 1a. mRNA-seq data is normalized using transcripts per million (TPM), and lowly expressed genes are removed. The remaining genes are reordered based on their genomic positions. Because SCNAs typically span broad genomic regions and affect many genes, adjacent genes are grouped into segments, with the assumption that all genes within a segment share the same copy number value. The data is then reshaped into a 3D tensor (samples × segments × genes per segment), so that the genes of each segment form the last dimension. Within each segment, SCNA log-intensity values are summarized using the median. To capture cross-chromosomal correlations, we construct positive and negative segment graphs based on correlation matrices of segment intensities: segment pairs with correlations above 0.1 or below −0.1 are defined as positive or negative edges, respectively. These graphs exclude intra-chromosomal edges and are specific to each cancer type. An overview of the RCANE workflow, including model training, prediction, and data analysis, is provided in Supplementary Fig. 1. In addition, our software package includes a visualization tool for whole-genome SCNAs (see Supplementary Fig. 2).

Table 1.

Comparison of RCANE with existing methods for predicting CNA using RNA data

Method	Category	Output	Scale	BAF free^a	Reference free^b
RCANE	Deep Learning	Intensity	Whole-genome	✓	✓
CNAPE	Machine Learning	State only	Gene/Chromosome	✓	✓
CNVkit	Segmentation	Intensity	Whole-genome	✓	✗
CopyKAT	Segmentation	Intensity	Whole-genome	✓	✗
SCEVAN	Variational	Intensity	Whole-genome	✓	✗
RNAseqCNV	Machine Learning	State only	Chromosome	✗	✓
CaSpER	Segmentation	Intensity	Whole-genome	✗	✗
SuperFreq	Segmentation	Intensity	Whole-genome	✗	✓

^aModels that only take RNA expression data as input and don’t require B-allele frequencies.

^bModels that can process the data of a single sample without reference from other samples.

The core architecture of RCANE combines sequence models with graph neural networks (Fig. 1b). To help the model learn both the effects of individual gene expression and the relative importance of different genes, a subset of gene expression values is randomly masked at the beginning of each training epoch. Cancer types are encoded via an embedding layer, and the gene expression values are adjusted using these cancer-type embeddings, then passed through a multi-layer perceptron (MLP). This design enables RCANE to capture cancer type-specific patterns, reflecting the biological and molecular differences across cancer types, while still leveraging shared information across the full dataset. Within each genomic segment, gene outputs—together with cancer type embeddings—are normalized using layer normalization²¹ to ensure zero mean and unit variance. These normalized values are aggregated via a weighted average and further processed by another MLP. The resulting segment-level features are then input into a chromosome-specific Long Short-Term Memory (LSTM) network²² and two Graph Attention (GAttn) layers²³, which incorporate predefined positive and negative correlation graphs. The LSTM captures both short- and long-range dependencies in gene expression within chromosomes, while the GAttn layers capture cross-chromosomal SCNA patterns, such as the 1p/19q co-deletion observed in gliomas²⁴. Finally, the outputs from the LSTM and GAttn components are integrated and passed through a set of univariate layers for fine-tuning. The model is trained by minimizing the mean squared error between the predicted and observed SCNA intensity values.

We trained and evaluated RCANE using data from the TCGA project²⁵, which comprises 33 cancer types with varying sample sizes (Supplementary Fig. 3a, b). The model was further fine-tuned using human cancer cell lines representing 17 cancer types from the DepMap project²⁶, and its performance was compared against the vanilla (pre-fine-tuning) version. For external validation, we employed an independent dataset from the Clinical Proteomic Tumor Analysis Consortium (CPTAC)²⁷, which includes 10 cancer types. As cancer cell line data exhibit higher tumor purity, we applied a correction to the mRNA expression data to account for these distributional shifts prior to model inference (Supplementary Fig. 3c).

To evaluate the contributions of different components of the model, we conducted an ablation study focusing on the LSTM, GAttn, and univariate layers (Fig. 1c). Removing either the LSTM or GAttn resulted in a 7%–9% reduction in MCC on TCGA data and an 11%–14% reduction on cell line data, with accuracy dropping by 2%–3% and 6%–7%, respectively. While removing the univariate layers had little effect on TCGA performance, it significantly impaired the model’s ability to fine-tune and generalize on cell line data. Therefore, each component plays a critical role in the overall model architecture.

Evaluation of RCANE for CNV detection in TCGA testing samples

In TCGA test samples, RCANE outperformed CNAPE, CNVkit, and CopyKAT in whole-genome copy number prediction, achieving the highest F₁ scores for predicting deletion, neutral copy number and amplification for all cancer types (Fig. 2a). While CNAPE performed comparably to RCANE in Kidney Chromophobe (KICH), its performance was unstable across other cancer types. CopyKAT performed the worst in this task, likely due to its design for single-cell data, which is not well-suited for bulk RNA-seq.

All methods exhibited diminished performance in Acute Myeloid Leukemia (LAML), a hematological malignancy (Fig. 2a). This is likely attributable to the typically low and unstable RNA content in blood cells, which makes it a less suitable proxy for SCNA detection.

RCANE also excelled in segment-wise SCNA detection, with an average sensitivity of 0.80, specificity of 0.97, and MCC of 0.79 (Fig. 2b). In comparison, CNAPE tended to under-select, and CNVkit over-selected CNAs, resulting in lower MCCs of 0.37 and 0.35, respectively. Visualizations of two representative samples further illustrated that RCANE accurately recovered both arm-level and focal SCNAs (Fig. 2c). Both RCANE and CNVkit were able to detect broad copy number gains and losses. In contrast, CNAPE captured only a subset of the SCNA patterns, while CopyKAT failed in most cases. Compared with RCANE, CNVkit produced noisier estimates and exhibited a notably higher rate of false positives.

Evaluation of RCANE for CNV detection in DepMap cell lines

We applied RCANE to the DepMap cell line dataset to evaluate its fine-tuning capability. Across all cancer types except Adrenocortical Carcinoma (ACC), the fine-tuned RCANE consistently achieved higher F₁ scores than the vanilla model (Fig. 3a). Fine-tuning also improved performance in terms of Jaccard score (Fig. 3b), sensitivity, specificity, and MCC (Fig. 3d), as well as whole-genome SCNA estimation accuracy (Fig. 3e). This provides an efficient mechanism for RCANE to adapt to new datasets, which is especially valuable in scenarios with limited computational resources or training data.

Fig. 3. a F₁ score for detecting copy number deletion, neutral and amplification for different cancer types. b Comparison of Jaccard score of RCANE, CNAPE, CNVkit and CopyKAT. One-sided Mann-Whitney-Wilcoxon test: top group is greater. c Comparison of ROC curves for detecting loss/deletion and gain/amplification of different methods. d Comparison of sensitivity, specificity and MCC of RCANE, CNAPE, CNVkit and CopyKAT. One-sided Mann-Whitney-Wilcoxon test: left group is greater. e True and estimated CNVs for two samples.

While the vanilla RCANE did not match the performance of its fine-tuned counterpart on the cell line data, it still outperformed other methods considered in this study (Fig. 3a–c). Although CNVkit achieved comparable sensitivity, its higher rate of false positives resulted in lower specificity and MCC (Fig. 3d). Visualizations further indicated that although the vanilla RCANE underperformed slightly relative to the fine-tuned model, it was still able to capture most of the key SCNA features (Fig. 3e).

Prediction of intensities and Identification of SCNA-related genes

Originally trained on SNP array log-intensity ratio data ( $\log_{2} R$ ), RCANE extends its applicability to predicting continuous log-intensity values, which provide higher resolution than categorical SCNA calls and enable more nuanced downstream analyses. Accurate prediction of $\log_{2} R$ values also offers an additional perspective for evaluating the performance of SCNA inference methods.

Among the compared methods, CNVkit also generates $\log_{2} R$ estimates. RCANE outperforms CNVkit with more accurate predictions of log-intensity ratios. Across all cancer types, the Pearson correlations between the observed and RCANE-predicted $\log_{2} R$ values are consistently close to 1 and significantly higher than those of CNVkit (P < 0.001; Fig. 4a). Furthermore, RCANE captures complex relationships between mRNA expression and copy number alterations across different chromosomal regions (Fig. 4b), resulting in a more accurate genome-wide intensity profile (Fig. 4c). A similar trend is observed in the cancer cell line data, where RCANE continues to outperform CNVkit in intensity estimation (Supplementary Fig. 4). Supplementary Fig. 5 presents examples of the observed and model-predicted $\log_{2} R$ intensities of 6 different genomic regions across different cancer types, showing almost perfect predictions.

Fig. 4 — a Boxplot of Pearson correlation between intensity prediction and ground truth for all cancer types. KICH and LAML results from CNVkit are excluded due to model errors. One-sided Mann-Whitney-Wilcoxon test: RCANE is greater than CNVkit. b t-SNE plots of ground truth and predicted intensity with dot locations based on mRNA expression data. Colors indicate copy number intensity in $\log_{2}$ ratio. c Whole-genome intensity prediction for two samples. Colors indicate copy number intensity in $\log_{2}$ ratio.

RCANE’s use of a masking and weighted-average mechanism allows for the identification of SCNA-related genes based on model-assigned weights, with layer normalization ensuring that gene importance is captured exclusively by these weights (Supplementary Fig. 6). This method effectively handles missing RNA-seq data of some genes through masking rather than imputation, which improves model robustness (Supplementary Fig. 7). Genes assigned higher weights are more strongly influenced by SCNAs, while those with lower weights tend to be independent of SCNA effects (Fig. 5a). In our model, we identified SCNA-associated genes such as POLR2H and ABCF3, as well as genes with lower relevance, such as LINC02069 and LINC02054 (Fig. 5b, c).

Fig. 5 — a Scatter plot showing the relationship between model weights and Pearson correlations of copy number intensity and mRNA expression for each gene. Color indicates the CNA ratio. b Model weights of cancer type and 20 genes within one segment. c Scatter plot illustrating mRNA expression and CNA intensity for the genes with the highest and lowest model weights, with Pearson’s correlation coefficient (r).

Independent validation results of CPTAC data

In the independent validation analysis using CPTAC data, RCANE achieves the highest whole-genome SCNA prediction accuracy, as measured by the Jaccard score (Fig. 6). Although trained exclusively on TCGA RNA-seq data, RCANE outperforms existing methods—including CNAPE, CNVkit, and CopyKAT—across most tumor types in the CPTAC cohort. While performance differences are not statistically significant in a few cancer types, RCANE remains among the top-performing methods. We emphasize that the reference SCNAs in CPTAC are derived from the WES data, whereas those in TCGA are obtained from high-resolution SNP6 arrays. These differences in data modality and resolution introduce a substantial domain shift between the training and evaluation settings. Despite this, RCANE maintains strong predictive performance, demonstrating its reliability and robust generalization across cohorts, platforms, and varying levels of noise in SCNA ground truth.

Fig. 6 — One-sided Mann-Whitney-Wilcoxon test: left group is greater.

As a comparison, we also perform the CNA detection analysis using methods developed for scRNA-seq data. As expected, these methods do not perform well on bulk RNA-seq data. Details can be found in Supplementary Fig. 8.

Discussion

RCANE offers a cost-effective and accurate solution for predicting SCNAs from RNA-seq data, providing a viable alternative to traditional sequencing-based or array-CGH approaches. This deep learning framework effectively models the complex relationship between gene expression and SCNAs, and consistently outperforms existing tools such as CNAPE, CNVkit, and CopyKAT. RCANE is capable of fine-tuning on small datasets, making it adaptable to new studies with limited samples. Its strong generalization across external datasets further supports its utility in diverse cancer contexts.

Trained on continuous copy number intensity values rather than discrete classes, RCANE enables more nuanced analysis compared to classification-based methods. Moreover, it identifies SCNA-associated genes, offering valuable insights into the regulatory impact of genomic alterations. Future work will focus on incorporating additional features such as clinical data and tumor purity to enhance biomedical interpretability. We anticipate that RCANE will serve as a widely applicable tool for SCNA analysis in cancer research, with potential extensions—including multi-omics integration^28,29—further expanding its impact in cancer genomics.

Extending RCANE to scRNA-seq from tumor specimens is a natural next step that leverages its attention-based architecture to resolve intra-tumor heterogeneity. By adapting our input pipeline to accept cell-level expression profiles, RCANE can learn to highlight CNA signals from subpopulations of malignant cells, enabling the detection of both clonal and subclonal events. As larger, high-quality scRNA-DNA cohorts become available, we plan to extend RCANE and benchmark against leading variational-inference methods (e.g., CopyKAT, SCEVAN) and refine our model to integrate multi-omic data, thereby enhancing resolution and translational impact in precision oncology studies.

Methods

Algorithm architecture

RCANE is implemented in Python (version 3.8.19) using NumPy (version 1.26.4), PyTorch (version 2.4.1+cu121), and PyG (version 2.5.3). The input to the neural network consists of three components: (i) a tensor of dimensions B × N_s × N_t representing mRNA expression levels, (ii) a vector of size B encoding cancer type identifiers, and (iii) a masking tensor of the same shape as the expression tensor, B × N_s × N_t, where B denotes the batch size, N_s the number of genomic segments, and N_t the number of transcripts per segment. Given that focal SCNAs typically span a median length of 1.8 megabases (Mb)², while human genes have a median length of approximately 24 kilobases (Kb)³⁰, a typical focal SCNA is expected to affect around 70 genes. To capture such events with sufficient resolution while maintaining sequence lengths tractable for LSTM-based models, we set N_t = 20, resulting in N_s = 1514 segments for genome-wide coverage. These segments are divided into 23 sequences corresponding to the 23 chromosomes. Chromosome 1 contains the longest sequence with 150 segments, while chromosome 21 contains the shortest with 18 segments. This sequence length has been empirically shown to work well for LSTM architectures.
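The segment arithmetic in this paragraph can be checked with a few lines of Python. The per-chromosome transcript counts below are hypothetical, chosen only so that the resulting segment numbers are easy to verify by hand; the real counts are not given in the text.

```python
import math

N_T = 20  # transcripts per segment, as stated in the text

def segments_per_chromosome(transcript_counts, n_t=N_T):
    """Number of segments per chromosome; the last segment may be partly empty."""
    return {chrom: math.ceil(n / n_t) for chrom, n in transcript_counts.items()}

def padded_positions(transcript_counts, n_t=N_T):
    """Number of empty slots in each chromosome's final segment."""
    return {chrom: (-n) % n_t for chrom, n in transcript_counts.items()}

# Hypothetical transcript counts per chromosome (illustration only).
toy_counts = {"chr1": 2995, "chr21": 360, "chrX": 1001}
segs = segments_per_chromosome(toy_counts)
pads = padded_positions(toy_counts)

# Naive ratio of the two medians quoted in the text (1.8 Mb / 24 Kb).
genes_per_focal_scna = 1.8e6 / 24e3
```

The naive ratio comes out at 75, the same order as the figure of around 70 quoted above. The empty slots counted by `padded_positions` are the ones that need to be masked at chromosome ends.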

The masking tensor serves three key purposes. First, it handles end-of-chromosome padding: for segments located at the ends of chromosomes that contain fewer than N_t transcripts, the excess positions are masked to prevent invalid inputs from affecting the network. Second, random masking is applied during training to enhance the model’s ability to infer transcript-specific effects of SCNAs. This approach also facilitates the interpretation of gene-level importance, where higher learned weights indicate stronger associations with SCNAs and lower weights suggest reduced relevance. Third, it encourages RCANE to utilize information from all N_t genes within a segment, rather than relying on only a few dominant signals. As a result, in applications where some gene expression values are missing, the model can simply mask out the missing genes and operate on the available expression data, while still achieving better performance than data imputation (Supplementary Fig. 7). This mechanism also contributes to reducing overfitting. Further details on the implementation are provided later in this section.
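A minimal NumPy sketch of how such a masking tensor could be assembled is given below. It is not the authors' implementation: the function name, the additive 0 / −∞ encoding at this stage, and the random drop probability are assumptions made for illustration (the text does not state the masking rate).

```python
import numpy as np

def build_mask(valid, batch_size, missing=None, drop_prob=0.0, rng=None):
    """Additive mask of shape (B, N_s, N_t): 0 keeps an entry, -inf removes it.

    valid:     (N_s, N_t) bool, False at end-of-chromosome padding slots.
    missing:   optional (B, N_s, N_t) bool, True where expression is unavailable.
    drop_prob: probability of randomly masking an entry (training only;
               the value used by RCANE is not given in the text).
    """
    keep = np.broadcast_to(valid, (batch_size,) + valid.shape).copy()
    if missing is not None:
        keep &= ~missing
    if drop_prob > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        keep &= rng.random(keep.shape) >= drop_prob
    return np.where(keep, 0.0, -np.inf)

# Toy layout: 3 segments of 4 slots, the last segment has 2 padding slots.
valid = np.ones((3, 4), dtype=bool)
valid[2, 2:] = False
eval_mask = build_mask(valid, batch_size=2)
train_mask = build_mask(valid, batch_size=2, drop_prob=0.5,
                        rng=np.random.default_rng(0))
```

At evaluation time only padding and genuinely missing entries are masked. During training the random component is added on top, so the training mask always removes at least the same entries.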

Cancer type information is processed through an embedding layer. Since tumor samples from different cancer types are often collected from distinct tissues, they exhibit unique RNA expression patterns. These inherent tissue-specific differences, however, are not relevant for predicting SCNAs. To account for this, each transcript’s expression value is adjusted based on the cancer type to normalize these variations. Each adjusted value is then passed through its own dedicated multi-layer perceptron (MLP), with an input size of 1 and an output size of N_out, resulting in a total of N_s × N_t MLPs. Given that different cancer types have distinct SCNA baselines, their embeddings are also processed through an MLP, which produces an output of the same size as required by the next layer. Within each segment, the N_t output vectors from the transcripts are combined with a cancer type output vector, giving a tensor of size N_s × (N_t + 1) × N_out per sample.

Due to the large variation in expression magnitudes across transcripts, we apply layer normalization along the last dimension during the training to normalize their scales. To aggregate information from each transcript within a segment, a weighted average is computed over the corresponding row vectors, using a learnable weight vector of size N_t + 1, transformed via the softmax function:

$$x_{i j}^{wa} = \frac{\sum_{k = 1}^{N_{t} + 1} \exp (M_{i j k} + w_{j k}) x_{i j k}}{\sum_{k = 1}^{N_{t} + 1} \exp (M_{i j k} + w_{j k})},$$

where $x_{i j k} \in R^{N_{out}}$ denotes the output vector for the ith sample, jth segment, and kth transcript (or the cancer type when k = N_t + 1); M_ijk ∈ {0, − ∞} is the masking indicator, with M_ijk = − ∞ indicating a masked entry; and $x_{i j}^{wa} \in R^{N_{out}}$ is the resulting weighted average representation of the segment.

During training, a random subset of entries is masked, and the remaining non-masked values contribute to the segment-level representation through the weighted average. The exponential of learnable weight $\exp (w_{j k}) \in R_{+}$ reflects the relative importance of each transcript in predicting the SCNA state of its segment. At the evaluation time, all masking is disabled except for the positions corresponding to missing expression values, allowing the model to utilize all available data. The resulting segment-level representations are then passed through segment-specific MLPs, followed by Graph Attention and LSTM layers.
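The masked weighted average defined above is a softmax over $M_{i j k} + w_{j k}$ applied along the transcript axis. A small NumPy sketch follows, with K = N_t + 1. The function name is hypothetical, and the code assumes that at least one entry per segment is unmasked (for example the cancer type entry) and that masked positions of x hold finite placeholder values.

```python
import numpy as np

def masked_weighted_average(x, w, mask):
    """Softmax-weighted average over the K entries of each segment.

    x:    (B, N_s, K, N_out) per-entry output vectors (finite everywhere).
    w:    (N_s, K) learnable weights, shared across samples.
    mask: (B, N_s, K) with values in {0, -inf}; -inf removes an entry.
    Returns an array of shape (B, N_s, N_out).
    """
    logits = mask + w[None, :, :]
    # Subtract the per-segment maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)                      # exp(-inf) = 0 for masked entries
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("bsk,bskd->bsd", weights, x)

# One sample, one segment, K = 3 entries, N_out = 2.
x = np.array([[[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]])
uniform = masked_weighted_average(x, np.zeros((1, 3)),
                                  np.array([[[0.0, 0.0, -np.inf]]]))
weighted = masked_weighted_average(x, np.array([[np.log(3.0), 0.0, 0.0]]),
                                   np.zeros((1, 1, 3)))
```

In the first call the third entry is masked, so the result is the plain mean of the first two vectors. In the second call, $\exp(w_{j k})$ gives the first entry three times the weight of the others, which is how a larger learned weight translates into greater influence on the segment representation.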

The Graph Attention layer comprises two blocks: one captures positively correlated segments of copy number intensities, and the other captures negatively correlated segments. Both blocks take the output of the preceding layer as node features but operate on distinct sets of graph edges. The first block attends to edges defined by positive correlations, while the second focuses on edges defined by negative correlations. These two types of correlations are modeled separately, as their opposing characteristics require distinct parameterizations within the model.

In parallel with the Graph Attention layers, the model incorporates an LSTM layer with 23 units, each corresponding to a chromosome. Each cell of an LSTM sequence represents a chromosome segment. Each cell receives input from the segment-specific MLP and outputs neighborhood-corrected information for the corresponding segment. This design allows the LSTM layer to capture sequential dependencies across segments within each chromosome. To integrate both sequential and relational information, the outputs from the LSTM layer and the two Graph Attention blocks are concatenated. This combined representation is then passed through a universal MLP, shared across all N_s segments, which generates a scalar output for each segment.

Since our main model is trained on bulk RNA-seq data of cancer tissues, when it is applied to new datasets such as cancer cell lines, variations in data distributions can affect the model performance. To address this, we incorporate a univariate block as the final component of RCANE, specifically designed for fine-tuning. The input to this block has a dimension of N_s, each one representing a genomic segment. It is passed through a set of N_s univariate neural networks, where each layer—including hidden layers—has both input and output dimensions of 1. Each segment is processed independently, without using information from other segments. This design allows an efficient, segment-specific adjustment of predictions, making fine-tuning faster while preserving essential predictive information. The output of this block represents the final predicted $\log_{2} R$ intensity for each segment.
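The univariate block can be pictured as N_s independent scalar networks applied in parallel. The NumPy sketch below is an assumption-laden illustration rather than the authors' code: it uses a hidden width of 1 per layer, the Leaky ReLU slope of 0.1 stated in the next subsection, and no activation after the final layer (the last point is an assumption).

```python
import numpy as np

def leaky_relu(z, slope=0.1):
    return np.where(z > 0, z, slope * z)

def univariate_block(y, weights, biases, slope=0.1):
    """Apply one scalar network per segment.

    y:       (B, N_s) segment-level predictions from the main model.
    weights: (depth, N_s) one scalar weight per layer and segment.
    biases:  (depth, N_s) one scalar bias per layer and segment.
    """
    h = np.asarray(y, dtype=float)
    depth = weights.shape[0]
    for layer in range(depth):
        # Elementwise affine map: no information is shared across segments.
        h = h * weights[layer] + biases[layer]
        if layer < depth - 1:
            h = leaky_relu(h, slope)
    return h

y = np.array([[1.0, -2.0, 0.5]])
W = np.ones((4, 3))   # depth 4, as stated for the univariate block
b = np.zeros((4, 3))
out = univariate_block(y, W, b)
```

Because every operation is elementwise, the block has only 2 × depth parameters per segment, which is consistent with the claim that fine-tuning it is cheap.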

Model parameters and computational resource required

All activation functions in RCANE are set to Leaky ReLU with a negative slope of 0.1. The MLPs used for aggregating RNA-seq data and cancer type embeddings have a depth of 2 and a hidden dimension of 32. The Graph Attention module employs 2 attention heads, each with a depth of 2 and hidden dimension of 32, and applies a dropout rate of 0.2 during training. The LSTM component consists of 4 layers with a hidden dimension of 32 and the same dropout rate. The MLP that processes the concatenated outputs from the Graph Attention and LSTM layers also has a depth of 2, with a hidden dimension of 64. The MLP used for fine-tuning is a single-layer network, while the univariate block has a depth of 4. The total number of trainable parameters in the model is approximately 46.8 million.

RCANE is trained for 130 epochs on an NVIDIA A40 GPU with 48 GB of memory, using a batch size of 32. Optimization is performed using the Adam algorithm with a learning rate of 0.001 and the mean squared error loss. In addition, alternative regression losses, such as the Huber loss, are implemented to support future analyses.
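To illustrate the difference between the two regression losses mentioned here, the sketch below implements the mean squared error and the standard Huber loss in NumPy. The threshold `delta=1.0` is an assumed default; the text does not state which value, if any, RCANE uses.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(err ** 2))

def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

# Two small errors and one outlier segment.
pred = np.zeros(3)
target = np.array([0.5, -0.5, 3.0])
mse = mse_loss(pred, target)
huber = huber_loss(pred, target)
```

The outlier contributes 9 to the squared error but only 2.5 to the Huber loss, which is why the latter is less sensitive to a few segments with extreme $\log_{2} R$ values.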

Missing RNA-seq data

In many cases, RNA-seq data of interest may contain missing gene expression values compared to the data used to train the RCANE model, often due to differences in data sources or sequencing platforms³¹. Missing data poses a significant challenge in machine learning, particularly for deep learning models, as it can substantially affect prediction accuracy and robustness.

Nevertheless, RNA expression data is typically highly correlated and often resides on a low-dimensional manifold^32,33. For example, genes involved in the same biological pathway frequently exhibit co-expression patterns^34,35, implying that the absence of some gene values may not lead to significant information loss. RCANE leverages this property by employing a masking and weighted-average strategy to handle missing data. During prediction, missing values are masked, and the available expression values from other genes within each genomic segment are used to infer SCNAs.

To evaluate RCANE’s performance in the presence of missing data, we simulate incomplete RNA-seq data by randomly removing entries from both the TCGA and cell line datasets. For the cell line data, we assess both the standard RCANE model and its fine-tuned variant. Within the RNA expression matrix, each element is assigned the same probability of being missing, with missingness probabilities p ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6}, where p = 0 denotes complete data. We compare RCANE’s performance against two imputation methods: k-nearest neighbor (KNN) imputation with k = 5 and median imputation (Supplementary Fig. 7). For both methods, imputation is performed using TCGA data from model training. Our results show that RCANE’s masking strategy maintains robust performance even with a substantial proportion of missing RNA expression data, while traditional imputation methods suffer significant performance degradation under similar conditions.
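The missing-data simulation described above can be sketched as follows. The expression matrix here is synthetic and the helper names are hypothetical. Only the uniform missingness probability p and the median-imputation baseline come from the text.

```python
import numpy as np

def simulate_missing(shape, p, rng):
    """Boolean array, True where an entry is dropped (independently, with probability p)."""
    return rng.random(shape) < p

def median_impute(expr, missing, train_median):
    """Replace dropped entries with the per-gene median of the training data."""
    return np.where(missing, train_median[None, :], expr)

def additive_mask(missing):
    """Masking alternative: -inf at dropped entries, 0 elsewhere."""
    return np.where(missing, -np.inf, 0.0)

rng = np.random.default_rng(0)
expr = rng.gamma(2.0, 1.0, size=(200, 500))   # synthetic samples x genes matrix
train_median = np.median(expr, axis=0)        # stand-in for training-set medians
missing = simulate_missing(expr.shape, 0.3, rng)
imputed = median_impute(expr, missing, train_median)
mask = additive_mask(missing)
```

With imputation the model sees a plausible but invented value at each dropped position, whereas with masking the position is excluded from the weighted average altogether.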

TCGA dataset

The TCGA project analyzed over 11,000 tumor samples across 33 cancer types, encompassing both common forms of cancer, such as glioblastoma, breast cancer, and lung cancer, and rarer types such as adrenocortical carcinoma and mesothelioma. Sample sizes vary by cancer type, with hundreds of samples typically analyzed for each. The largest groups include Breast Invasive Carcinoma (BRCA) with over 1000 samples, Lung Adenocarcinoma (LUAD) and Glioblastoma Multiforme (GBM) with more than 500 samples each, and Colon Adenocarcinoma (COAD) with over 400 samples.

For our study, TCGA data is downloaded using the TCGAbiolinks³⁶ package (version 2.25.3) through Bioconductor³⁷ (version 3.18), which provides both raw and TPM-normalized mRNA expression matrices, along with SCNA data. The mRNA expression data consists of 60,660 transcripts. Raw mRNA expression data is used as input for CNVkit as recommended¹², while TPM-normalized data is used for all other methods. For copy number variation analysis, we retrieve Affymetrix Genome-Wide Human SNP Array 6.0 data, which has been transformed to $\log_{2}$ ratio ( $\log_{2} R$ ) and processed using CBS³⁸. Copy number intensities greater than 0.25 are classified as gains or amplifications, while those below −0.2 are classified as losses or deletions. A total of 8900 tumor samples are used for training and 2226 for testing, with the data randomly split across cancer types. For methods other than RCANE, separate models are trained for each cancer type to account for the heterogeneity among cancers.
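The three-class labelling rule stated above can be written as a short function. The thresholds are the ones given in the text; the function name and the single-letter labels `L`, `N`, `G` are chosen here for illustration.

```python
import numpy as np

GAIN_THRESHOLD = 0.25   # log2 ratio above this: gain/amplification
LOSS_THRESHOLD = -0.2   # log2 ratio below this: loss/deletion

def classify_scna(log2_ratio):
    """Map log2 ratios to 'L' (loss), 'N' (neutral) or 'G' (gain)."""
    r = np.asarray(log2_ratio, dtype=float)
    return np.where(r > GAIN_THRESHOLD, "G",
                    np.where(r < LOSS_THRESHOLD, "L", "N"))

labels = classify_scna([-0.8, -0.2, 0.0, 0.25, 0.6])
```

Both comparisons are strict, so values exactly at a threshold are labelled neutral.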

To prepare the input for RCANE, transcripts located on the Y chromosome or expressed in fewer than 30% of the samples are excluded, resulting in a set of 30,096 transcripts. These transcripts are reordered by genomic location across 23 chromosomes based on the genome reference GRCh38. Within each chromosome, 20 adjacent transcripts are grouped into segments, producing a tensor of size N × 1514 × 20. The mRNA expression data are then transformed using $\log_{2} (1 + TPM)$ . For copy number data, we generate the RCANE output by calling the copy number for each selected transcript, with each segment's $\log_{2} R$ intensity represented by the median value of the 20 transcripts within that segment.
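A minimal sketch of the segment construction follows. Note that 1514 × 20 = 30,280 slots hold 30,096 transcripts, so some slots must be empty; how incomplete segments are handled is not stated here, and padding the last segment of each chromosome with NaN is an assumption of this sketch, as are the function names and the toy input.

```python
import numpy as np

SEGMENT_SIZE = 20

def build_segments(tpm, chrom, segment_size=SEGMENT_SIZE):
    """Group genomically ordered transcripts into fixed-size segments per chromosome.

    tpm:   (n_samples, n_transcripts) TPM matrix, columns already sorted by position.
    chrom: (n_transcripts,) chromosome label of each column.
    Returns (n_samples, n_segments, segment_size); padded slots are NaN.
    """
    logged = np.log2(1.0 + tpm)
    blocks = []
    # dict.fromkeys keeps chromosomes in order of first appearance
    for c in dict.fromkeys(chrom.tolist()):
        cols = logged[:, chrom == c]
        pad = (-cols.shape[1]) % segment_size   # slots needed to fill the last segment
        cols = np.pad(cols, ((0, 0), (0, pad)), constant_values=np.nan)
        blocks.append(cols.reshape(cols.shape[0], -1, segment_size))
    return np.concatenate(blocks, axis=1)

def segment_medians(values):
    """Median over the transcripts of each segment, ignoring padded slots."""
    return np.nanmedian(values, axis=2)

# Toy input: 2 samples, 25 transcripts on chromosome "1" and 20 on chromosome "2".
tpm = np.arange(90, dtype=float).reshape(2, 45)
chrom = np.array(["1"] * 25 + ["2"] * 20)
segments = build_segments(tpm, chrom)
medians = segment_medians(segments)
```

Segments never span two chromosomes in this sketch, which matches the statement that grouping is done within each chromosome. The same `segment_medians` helper could summarize per-transcript $\log_{2} R$ values into segment intensities.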

To build the correlation graphs used by RCANE, we compute a segment correlation matrix for each cancer type based on copy number intensity data from samples of that cancer type. Positive correlation graphs include edges between segment pairs with correlation values greater than 0.1, while negative correlation graphs include pairs with correlation values below −0.1. Only segment pairs from different chromosomes are considered when constructing the correlation graphs.
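The graph construction can be sketched as thresholding a Pearson correlation matrix. The ±0.1 cut-offs and the different-chromosome restriction come from the text; listing each undirected pair once, the function name, and the toy data are assumptions.

```python
import numpy as np

def correlation_graphs(intensity, segment_chrom, threshold=0.1):
    """Build positive/negative edge lists between segments on different chromosomes.

    intensity:     (n_samples, n_segments) log2 R intensities for one cancer type.
    segment_chrom: (n_segments,) chromosome label of each segment.
    Returns two (n_edges, 2) integer arrays of segment index pairs.
    """
    corr = np.corrcoef(intensity, rowvar=False)
    different_chrom = segment_chrom[:, None] != segment_chrom[None, :]
    upper = np.triu(np.ones_like(corr, dtype=bool), k=1)   # count each pair once
    candidates = different_chrom & upper
    positive = np.argwhere(candidates & (corr > threshold))
    negative = np.argwhere(candidates & (corr < -threshold))
    return positive, negative

# Toy data: segments 0-2 move together, segment 3 moves in the opposite direction.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
intensity = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=200),
    base + 0.1 * rng.normal(size=200),
    -base + 0.1 * rng.normal(size=200),
])
segment_chrom = np.array(["1", "1", "2", "3"])
positive, negative = correlation_graphs(intensity, segment_chrom)
```

In the toy data, segments 0 and 1 are strongly correlated but share a chromosome, so no edge is created between them.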

DepMap cell line dataset

We select cell lines in the DepMap dataset with cancer types overlapping those in the TCGA dataset and randomly split them by cancer type, resulting in 114 training samples and 266 testing samples across 17 cancer types. For RCANE, the training samples along with cancer type information are used for fine-tuning. For other methods, the training samples from different cancer types are combined and used for training from scratch due to the small sample size.

The mRNA expression data initially contain 53,961 transcripts, from which we select 29,088 transcripts that overlap with the TCGA data. The remaining TCGA transcripts, which are absent from the cell line data, are masked during fine-tuning and evaluation. As the cell line data does not provide raw counts, we use $\log_{2} (1 + TPM)$ as input for all four methods examined in this work. Due to the significant distribution shift between the TCGA and cell line mRNA expression data, we apply the ComBat³⁹ function from sva⁴⁰ (version 3.50.0) with default parameters to correct for the batch effects. The TCGA data is used as the reference, and cancer type is included as a covariate.

The copy number intensity data for the cell line is processed by the Broad Institute. Unlike the TCGA data, these values are not $\log_{2}$ transformed, so we apply the transformation manually. We then summarize the intensities into segments and define losses/deletions and gains/amplifications in the same way as the TCGA data.

CPTAC dataset

The CPTAC data contains 10 cancer types, where somatic copy number ratios are derived from WES BAM files preprocessed using the CopywriteR⁴¹ package. The CBS algorithm implemented in CopywriteR is used for copy number segmentation. For independent validation, we download the TPM RNA-seq and $\log_{2}$ copy number variation data, resulting in 1022 matched samples. Three cancer types contain 60,669 transcripts, while the remaining types have 59,361 transcripts. RNA-seq data are matched to TCGA data based on gene symbols, and missing values are masked during evaluation. The copy number intensity ratios are categorized into three SCNA classes using the same criteria as applied to the TCGA data.

For CNVkit, correlation files derived from TCGA data are used as the input. Similarly, CNAPE models trained on TCGA RNA-seq data are applied to CPTAC samples for copy number estimation. As CopyKAT does not require external references, it is applied directly to the CPTAC data.

Statistics and reproducibility

The performance of RCANE and other methods is evaluated using scikit-learn (version 1.3.2). Visualizations are based on pandas (version 2.0.2), matplotlib (version 3.9.1) and seaborn (version 0.13.2). In the box plots, the center line indicates the median, the box limits represent the first and third quartiles, and the whiskers extend to 1.5 times the interquartile range. The Accuracy score is defined by:

$$\text{Accuracy score} = \frac{\sum_{c \in \{L, N, G\}} {TP}_{c}}{\sum_{c \in \{L, N, G\}} \left( {TP}_{c} + {FP}_{c} \right)},$$

and the Jaccard score is calculated as follows:

$$\text{Jaccard score} = \frac{\sum_{c \in \{L, N, G\}} {TP}_{c}}{\sum_{c \in \{L, N, G\}} \left( {TP}_{c} + {FP}_{c} + {FN}_{c} \right)},$$

where $L$, $N$, and $G$ represent copy number loss/deletion, neutral, and gain/amplification categories, respectively. ${TP}_{c}$, ${FP}_{c}$, and ${FN}_{c}$ denote the true positives, false positives, and false negatives for category $c$. The $F_{1}$ score for each category is defined as:

$$F_{1, c} = \frac{2 {TP}_{c}}{2 {TP}_{c} + {FP}_{c} + {FN}_{c}} .$$

For SCNA detection, we define:

$$TP = {TP}_{L} + {TP}_{G}, \quad TN = {TP}_{N}, \quad FP = {FP}_{L} + {FP}_{G}, \quad FN = {FP}_{N},$$

and

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}, \quad \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP) (TP+FN) (TN+FP) (TN+FN)}} .$$
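The definitions above can be checked on a small example. The sketch below computes the per-class counts directly with NumPy rather than through scikit-learn, so it is not the evaluation code used in this work; the function names and the toy labels are assumptions.

```python
import math
import numpy as np

CLASSES = ("L", "N", "G")

def per_class_counts(y_true, y_pred):
    """Return {class: (TP, FP, FN)} for the three SCNA categories."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    counts = {}
    for c in CLASSES:
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        counts[c] = (tp, fp, fn)
    return counts

def scna_metrics(y_true, y_pred):
    k = per_class_counts(y_true, y_pred)
    tp_sum = sum(v[0] for v in k.values())
    fp_sum = sum(v[1] for v in k.values())
    fn_sum = sum(v[2] for v in k.values())
    out = {
        "accuracy": tp_sum / (tp_sum + fp_sum),
        "jaccard": tp_sum / (tp_sum + fp_sum + fn_sum),
    }
    for c, (tp, fp, fn) in k.items():
        denom = 2 * tp + fp + fn
        out[f"f1_{c}"] = 2 * tp / denom if denom else float("nan")
    # Binary SCNA detection: loss and gain are positives, neutral is negative.
    TP = k["L"][0] + k["G"][0]
    TN = k["N"][0]
    FP = k["L"][1] + k["G"][1]
    FN = k["N"][1]
    out["sensitivity"] = TP / (TP + FN) if TP + FN else float("nan")
    out["specificity"] = TN / (TN + FP) if TN + FP else float("nan")
    denom = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    out["mcc"] = (TP * TN - FP * FN) / denom if denom else float("nan")
    return out

y_true = ["L", "L", "N", "N", "N", "G", "G", "G"]
y_pred = ["L", "N", "N", "N", "G", "G", "G", "L"]
metrics = scna_metrics(y_true, y_pred)
```

Because every segment is assigned to exactly one predicted class, the accuracy denominator equals the total number of segments, so the Accuracy score is the overall fraction of correctly classified segments (5 of 8 in this toy example).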

Pearson's correlation coefficients (r) are computed in SciPy (version 1.13.1). The P-values of the one-sided Mann-Whitney-Wilcoxon test are computed with statannotations (version 0.7.2). We use * for P < 0.05, ** for P < 0.01, *** for P < 0.001, and ns for P ≥ 0.05.
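For readers who want to reproduce this style of annotation, a minimal sketch is given below. It calls `scipy.stats.mannwhitneyu` and `scipy.stats.pearsonr` directly instead of going through statannotations, and the simulated scores, the seed, and the helper `significance_label` are assumptions for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

def significance_label(p):
    """Map a P-value to the annotation symbols used in the figures."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"

rng = np.random.default_rng(0)
scores_a = rng.normal(0.8, 0.05, size=40)   # toy per-sample scores, method A
scores_b = rng.normal(0.6, 0.05, size=40)   # toy per-sample scores, method B

# One-sided test of whether method A tends to score higher than method B.
stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="greater")
label = significance_label(p_value)

# Pearson correlation between true and estimated intensities (toy values).
truth = rng.normal(size=100)
estimate = truth + rng.normal(scale=0.1, size=100)
r, _ = pearsonr(truth, estimate)
```

The direction of the one-sided alternative has to match the hypothesis being tested; `alternative="greater"` here tests whether the first sample is stochastically larger than the second.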

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Supplemental Materials (4.5 MB, pdf)

Reporting Summary (6 MB, pdf)

Acknowledgements

This research was supported by NIH grant GM129781.

Author contributions

Changhao Ge, Xiaowen Hu, and Lin Zhang curated and prepared the data. Changhao Ge and Hongzhe Li designed the model architecture. Changhao Ge implemented the code for model training and data analysis. Changhao Ge, Xiaowen Hu, and Hongzhe Li wrote the manuscript. All authors reviewed and approved the final version of the manuscript.

Peer review

Peer review information

Communications Biology thanks Yingnian Wu, Yikai Luo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Aylin Bircan, Laura Rodriguez Perez.

Data availability

The data used for model training and evaluation are publicly available at https://gdac.broadinstitute.org/, https://depmap.org/portal/, https://depmap.sanger.ac.uk/, and https://kb.linkedomics.org/. The preprocessed data, along with model checkpoint files trained on TCGA data and fine-tuned with DepMap cell line data, are available in Zenodo, at 10.5281/zenodo.13975633⁴².

Code availability

The scripts for model training, performance evaluation, data analysis, and figure generation are available in Zenodo at 10.5281/zenodo.16712285⁴³, under the open-source MIT license.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-025-08712-6.

References

1. Steele, C. D. et al. Signatures of copy number alterations in human cancer. Nature 606, 984–991 (2022).
2. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010).
3. Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134–1140 (2013).
4. Taylor, A. M. et al. Genomic and functional approaches to understanding cancer aneuploidy. Cancer Cell 33, 676–689 (2018).
5. Zare, F., Dow, M., Monteleone, N., Hosny, A. & Nabavi, S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinforma. 18, 286 (2017).
6. Bird, A. P. CpG-rich islands and the function of DNA methylation. Nature 321, 209–213 (1986).
7. Jones, P. A. & Baylin, S. B. The fundamental role of epigenetic events in cancer. Nat. Rev. Genet. 3, 415–428 (2002).
8. Bird, A. DNA methylation patterns and epigenetic memory. Genes Dev. 16, 6–21 (2002).
9. Mattei, A. L., Bailly, N. & Meissner, A. DNA methylation: a historical perspective. Trends Genet. 38, 676–707 (2022).
10. Bhattacharya, A. et al. Transcriptional effects of copy number alterations in a large set of human cancers. Nat. Commun. 11, 715 (2020).
11. Pinkel, D. & Albertson, D. G. Array comparative genomic hybridization and its applications in cancer. Nat. Genet. 37, S11–S17 (2005).
12. Talevich, E., Shain, A. H., Botton, T. & Bastian, B. C. CNVkit: Genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Computational Biol. 12, e1004873 (2016).
13. Mu, Q. & Wang, J. CNAPE: A machine learning method for copy number alteration prediction from gene expression. IEEE/ACM Trans. Computational Biol. Bioinforma. 18, 306–311 (2021).
14. Bařinka, J. et al. RNAseqCNV: analysis of large-scale copy number variations from RNA-seq data. Leukemia 36, 1492–1498 (2022).
15. Serin Harmanci, A., Harmanci, A. O. & Zhou, X. CaSpER identifies and visualizes CNV events by integrative analysis of single-cell or bulk RNA-sequencing data. Nat. Commun. 11, 89 (2020).
16. Flensburg, C., Sargeant, T., Oshlack, A. & Majewski, I. J. SuperFreq: Integrated mutation detection and clonal tracking in cancer. PLoS Computational Biol. 16, e1007603 (2020).
17. Gao, R. et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat. Biotechnol. 39, 599–608 (2021).
18. De Falco, A., Caruso, F., Su, X.-D., Iavarone, A. & Ceccarelli, M. A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. Nat. Commun. 14, 1074 (2023).
19. Novakovsky, G., Dexter, N., Libbrecht, M. W., Wasserman, W. W. & Mostafavi, S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. 24, 125–137 (2023).
20. The Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
21. Ba, J. L. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
22. Hochreiter, S. Long short-term memory. Neural Computation MIT-Press (1997).
23. Yun, S., Jeong, M., Kim, R., Kang, J. & Kim, H. J. Graph transformer networks. Adv. Neural Info. Proc. Syst. 32, (2019).
24. Eckel-Passow, J. E. et al. Glioma groups based on 1p/19q, IDH, and TERT promoter mutations in tumors. N. Engl. J. Med. 372, 2499–2508 (2015).
25. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
26. Tsherniak, A. et al. Defining a cancer dependency map. Cell 170, 564–576 (2017).
27. Ellis, M. J. et al. Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov. 3, 1108–1112 (2013).
28. Subramanian, I., Verma, S., Kumar, S., Jere, A. & Anamika, K. Multi-omics data integration, interpretation, and its application. Bioinforma. Biol. Insights 14, 1177932219899051 (2020).
29. Picard, M., Scott-Boyer, M.-P., Bodein, A., Périn, O. & Droit, A. Integration strategies of multi-omics data for machine learning analysis. Computational Struct. Biotechnol. J. 19, 3735–3746 (2021).
30. Fuchs, G. et al. 4sUDRB-seq: Measuring genomewide transcriptional elongation rates and initiation frequencies within cells. Genome Biol. 15, R69 (2014).
31. Chu, Y. & Corey, D. R. RNA sequencing: platform selection, experimental design, and data interpretation. Nucleic Acid Therapeutics 22, 271–274 (2012).
32. Chapman, A. R. et al. Correlated gene modules uncovered by high-precision single-cell transcriptomics. Proc. Natl Acad. Sci. 119, e2206938119 (2022).
33. Verma, A. & Engelhardt, B. E. A robust nonlinear low-dimensional manifold for single cell RNA-seq data. BMC Bioinforma. 21, 1–15 (2020).
34. Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003).
35. Abu-Jamous, B. & Kelly, S. Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biol. 19, 172 (2018).
36. Colaprico, A. et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 44, e71 (2016).
37. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, 1–16 (2004).
38. Olshen, A. B., Venkatraman, E. S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).
39. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2006).
40. Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
41. Kuilman, T. et al. CopywriteR: DNA copy number detection from off-target sequence data. Genome Biol. 16, 49 (2015).
42. Ge, C., Hu, X., Zhang, L. & Li, H. Supporting data for RCANE: A deep learning algorithm for whole-genome pan-cancer somatic copy number aberration prediction using RNA-seq data. Zenodo 10.5281/zenodo.13975633 (2024).
43. Ge, C. Code for RCANE: A deep learning algorithm for whole-genome pan-cancer somatic copy number aberration prediction using RNA-seq data. 10.5281/zenodo.16712285 (2025).
