Communications Biology. 2025 Sep 24;8:1354. doi: 10.1038/s42003-025-08712-6

RCANE: a deep learning algorithm for whole-genome pan-cancer somatic copy number aberration prediction using RNA-seq data

Changhao Ge 1,2, Xiaowen Hu 3, Lin Zhang 3, Hongzhe Li 2
PMCID: PMC12460768  PMID: 40993228

Abstract

Transcriptome sequencing (RNA-seq) of cancers is widely employed in cancer research to investigate gene expression patterns and their role in disease progression. Somatic copy-number aberrations (SCNAs)—critical genomic drivers of tumorigenesis—can also be inferred directly from RNA-seq, yielding a “two-for-one” return of quantitative expression measures plus structural-variation calls at a fraction of the cost of separate DNA assays. Here, we present RCANE, a deep-learning framework that predicts genome-wide SCNAs across diverse cancer types using only RNA-seq data. Trained on The Cancer Genome Atlas (TCGA) and DepMap cell-line cohorts, RCANE consistently outperforms existing approaches, delivering a scalable, robust solution for improving somatic copy-number aberration profiling in cancer diagnostics and therapeutic decision-making.

Subject terms: Computational models, Cancer genomics


A deep-learning framework that predicts genome-wide somatic copy number aberrations across diverse cancer types using only RNA-seq data.

Introduction

Somatic copy number aberrations (SCNAs) are a hallmark of cancer, involving large-scale genomic alterations that drive tumorigenesis and cancer progression by affecting gene dosage and altering the expression of oncogenes and tumor suppressor genes1. Detecting SCNAs is essential for understanding cancer biology and developing personalized therapies14. However, SCNA detection traditionally depends on high-cost platforms, such as SNP microarrays or high-depth whole-genome sequencing (WGS) and whole-exome sequencing (WES). SNP microarrays generally have an advantage over WES for CNA detection. Once the reads have been processed, WGS/WES methods essentially mimic a traditional SNP microarray analysis: they create pseudo-probes from the sequencing reads, average the read counts within a bin or sliding window, and divide by the corresponding read count in a reference sample (or group of reference samples) to obtain a log2 ratio, which can then be used to estimate the actual copy number. Compared with WGS, WES introduces more biases and noise, which make CNA detection very challenging. Comparative studies have shown that existing tools achieve moderate sensitivity (50%–80%), fair specificity (70%–94%), and poor false discovery rates (FDRs) (27%–60%)5.
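To make the binned log2 ratio computation described above concrete, the following minimal Python sketch normalizes per-bin read counts by library size, divides tumor by reference, and converts a log2 ratio back to a naive copy number. The function names, the pseudocount, and the assumptions of a pure tumor sample and a diploid reference are illustrative choices made here; they are not taken from any of the cited tools.

```python
import math


def log2_ratios(tumor_counts, reference_counts, pseudocount=1.0):
    """Per-bin log2 ratio of depth-normalized tumor vs. reference read counts."""
    if len(tumor_counts) != len(reference_counts):
        raise ValueError("tumor and reference must have the same number of bins")
    tumor_total = sum(tumor_counts)
    ref_total = sum(reference_counts)
    ratios = []
    for t, r in zip(tumor_counts, reference_counts):
        # Divide by library size so that sequencing depth cancels out.
        t_frac = (t + pseudocount) / tumor_total
        r_frac = (r + pseudocount) / ref_total
        ratios.append(math.log2(t_frac / r_frac))
    return ratios


def copy_number_from_log2(log2_ratio, ploidy=2):
    """Naive estimate: assumes 100% tumor purity and a diploid reference."""
    return ploidy * 2 ** log2_ratio
```

Under these assumptions a log2 ratio of 0 maps to two copies, +1 to four copies, and −1 to a single copy. Real tumor samples contain normal cells, so observed ratios are compressed toward 0 relative to this idealized mapping.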

Bulk mRNA sequencing (RNA-seq) offers a more cost-effective and widely used alternative in cancer genomics studies, reflecting cellular activity directly and making it a key component of multi-omic studies. Leveraging RNA-seq for copy-number calling can be more cost-effective than sequencing DNA de novo. In many cancer studies, RNA-seq is already performed to profile gene expression, so no additional library preparation or sequencing run is required to infer CNAs, saving both reagent and instrument time. Even when performed solely for CNA detection, transcriptome sequencing typically demands lower coverage than whole-genome sequencing and avoids the high per-base cost of deep WGS or the capture reagents of WES. Thus, RNA-seq CNA inference often delivers a “two-for-one” return—quantitative expression data plus structural variation calls—at a fraction of the incremental expense of separate DNA assays.

Developing an algorithm that can accurately predict SCNAs from RNA-seq data has attracted recent interest. However, the nonlinear and complex relationship between gene expression and genomic alterations, along with factors such as DNA methylation69 and transcriptional adaptation10 that influence RNA transcription, presents a challenge for accurate SCNA inference from RNA-seq data. Current SCNA detection tools using bulk RNA-seq data generally fall into two categories: segmentation-based methods adapted from CNA detection methods developed for array comparative genomic hybridization (CGH)11 or SNP array data, and machine learning approaches that use RNA-seq data as predictors. While CNVkit12 is capable of estimating SCNAs from RNA-seq data, it was originally designed for DNA sequencing and does not generalize well to transcriptomic data. Machine learning methods such as CNAPE13 improve SCNA prediction from RNA-seq but typically focus on either gene- or chromosome-level alterations and miss finer patterns. Moreover, these approaches often require large training datasets, which limits their applicability in biomedical settings where only a small number of samples are available. Several other methods, including RNAseqCNV14, CaSpER15, and SuperFreq16, require B-allele frequency information, which is typically unavailable in RNA-seq studies.

Although algorithms based on single-cell RNA-seq (scRNA-seq) data, such as CopyKAT17 and SCEVAN18, have demonstrated strong performance, they are specifically designed for scRNA-seq data. These methods generally assume that each sample is composed of a mixture of malignant and normal cells, and typically rely on identifying a cluster of normal cells with high confidence to serve as a reference. This assumption breaks down in bulk RNA-seq, which measures the combined transcriptomes of all cells in the sample, both malignant tumor cells and non-tumor cells (e.g., stromal, immune, or normal epithelial cells). Therefore, scRNA-seq-based methods for CNA detection are expected to perform poorly on bulk RNA-seq data.

These limitations necessitate novel approaches that effectively utilize bulk RNA-seq data for SCNA prediction. The method proposed in this paper fills this gap by introducing the RNA-seq to copy number aberration neural network (RCANE), a deep learning algorithm designed to predict whole-genome SCNAs from cancer RNA-seq data. Deep learning with an architecture chosen to capture the various dependencies in the data is particularly suited to this problem, as it excels at modeling complex, high-dimensional data19. In addition, the model can be fine-tuned across diverse datasets, enhancing its generalizability to new studies.

Results

An overview of RCANE

RCANE is a deep learning-based method designed to predict whole-genome copy number aberrations from RNA-seq data. It is trained using datasets from The Cancer Genome Atlas (TCGA) Program20. A comparison between RCANE and existing approaches is shown in Table 1. Before neural network modeling, we preprocess raw mRNA-seq and SCNA intensity data (e.g., the log2 ratio of target to reference signal from SNP array platforms), as illustrated in Fig. 1a. mRNA-seq data is normalized using transcripts per million (TPM), and lowly expressed genes are removed. The remaining genes are reordered based on their genomic positions. Because SCNAs typically span broad genomic regions and affect many genes, adjacent genes are grouped into segments, with the assumption that all genes within a segment share the same copy number value. The data is then reshaped into a 3D tensor, with segments forming the last dimension. Within each segment, SCNA log-intensity values are summarized using the median. To capture cross-chromosomal correlations, we construct positive and negative segment graphs based on correlation matrices of segment intensities: segment pairs with correlations above 0.1 or below −0.1 are defined as positive or negative edges, respectively. These graphs exclude intra-chromosomal edges and are specific to each cancer type. An overview of the RCANE workflow, including model training, prediction, and data analysis, is provided in Supplementary Fig. 1. In addition, our software package includes a visualization tool for whole-genome SCNAs (see Supplementary Fig. 2).
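The graph construction step can be sketched in a few lines of plain Python. The example below thresholds pairwise Pearson correlations of segment intensities at ±0.1 and skips pairs on the same chromosome, as described above. The function names and the data layout (one list of per-sample intensities per segment, for a single cancer type) are assumptions made for illustration and do not reflect the RCANE package's actual interface.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant segment: treated here as uncorrelated
    return cov / math.sqrt(vx * vy)


def build_segment_graphs(intensity, chrom_of_segment, threshold=0.1):
    """Return (positive_edges, negative_edges) as lists of (i, j) with i < j.

    intensity[s] holds the log2 intensities of segment s across the samples
    of one cancer type; chrom_of_segment[s] is that segment's chromosome.
    """
    pos_edges, neg_edges = [], []
    n_seg = len(intensity)
    for i in range(n_seg):
        for j in range(i + 1, n_seg):
            if chrom_of_segment[i] == chrom_of_segment[j]:
                continue  # intra-chromosomal pairs are excluded
            r = pearson(intensity[i], intensity[j])
            if r > threshold:
                pos_edges.append((i, j))
            elif r < -threshold:
                neg_edges.append((i, j))
    return pos_edges, neg_edges
```

The double loop is written for readability; with 1514 segments a vectorized correlation routine would be the more practical choice.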

Table 1.

Comparison of RCANE with existing methods for predicting CNA using RNA data

Method      Category          Output      Scale            BAF free (a)  Reference free (b)
RCANE       Deep Learning     Intensity   Whole-genome
CNAPE       Machine Learning  State only  Gene/Chromosome
CNVkit      Segmentation      Intensity   Whole-genome
CopyKAT     Segmentation      Intensity   Whole-genome
SCEVAN      Variational       Intensity   Whole-genome
RNAseqCNV   Machine Learning  State only  Chromosome
CaSpER      Segmentation      Intensity   Whole-genome
SuperFreq   Segmentation      Intensity   Whole-genome

(a) Models that only take RNA expression data as input and do not require B-allele frequencies.

(b) Models that can process the data of a single sample without reference data from other samples.

Fig. 1. Overview of the RCANE Model.

a Data preprocessing. mRNA expression data are filtered and reordered by genomic positions, and gene copy numbers are grouped into segments. These segments are then used to compute correlation matrices and define segment-based graphs. b The RCANE architecture. Cancer types are represented through embeddings. For each genomic segment, mRNA data is adjusted based on the cancer type and aggregated into a weighted average. This information is then fed into an LSTM layer and two Graph Attention mechanisms. The outputs are combined using a multi-layer perceptron (MLP) and a univariate layer for each segment. The model is trained using a regression loss function to align predictions with the ground truth, with fine-tuning applied to the final two layers to optimize performance. c Results of ablation analysis for TCGA and cell line testing samples. MCC and accuracy are presented for various RCANE architectures that exclude certain components. One-sided Mann-Whitney-Wilcoxon test: left group is greater.

The core architecture of RCANE combines sequence models with graph neural networks (Fig. 1b). To help the model learn both the effects of individual gene expression and the relative importance of different genes, a subset of gene expression values is randomly masked at the beginning of each training epoch. Cancer types are encoded via an embedding layer, and the gene expression values are adjusted using these cancer-type embeddings, then passed through a multi-layer perceptron (MLP). This design enables RCANE to capture cancer type-specific patterns, reflecting the biological and molecular differences across cancer types, while still leveraging shared information across the full dataset. Within each genomic segment, gene outputs—together with cancer type embeddings—are normalized using layer normalization21 to ensure zero mean and unit variance. These normalized values are aggregated via a weighted average and further processed by another MLP. The resulting segment-level features are then input into a chromosome-specific Long Short-Term Memory (LSTM) network22 and two Graph Attention (GAttn) layers23, which incorporate predefined positive and negative correlation graphs. The LSTM captures both short- and long-range dependencies in gene expression within chromosomes, while the GAttn layers capture cross-chromosomal SCNA patterns, such as the 1p/19q co-deletion observed in gliomas24. Finally, the outputs from the LSTM and GAttn components are integrated and passed through a set of univariate layers for fine-tuning. The model is trained by minimizing the mean squared error between the predicted and observed SCNA intensity values.

We trained and evaluated RCANE using data from the TCGA project25, which comprises 33 cancer types with varying sample sizes (Supplementary Fig. 3a, b). The model was further fine-tuned using human cancer cell lines representing 17 cancer types from the DepMap project26, and its performance was compared against the vanilla (pre-fine-tuning) version. For external validation, we employed an independent dataset from the Clinical Proteomic Tumor Analysis Consortium (CPTAC)27, which includes 10 cancer types. As cancer cell line data exhibit higher tumor purity, we applied a correction to the mRNA expression data to account for these distributional shifts prior to model inference (Supplementary Fig. 3c).

To evaluate the contributions of different components of the model, we conducted an ablation study focusing on the LSTM, GAttn, and univariate layers (Fig. 1c). Removing either the LSTM or GAttn resulted in a 7%–9% reduction in MCC on TCGA data and an 11%–14% reduction on cell line data, with accuracy dropping by 2%–3% and 6%–7%, respectively. While removing the univariate layers had little effect on TCGA performance, it significantly impaired the model’s ability to fine-tune and generalize on cell line data. Therefore, each component plays a critical role in the overall model architecture.

Evaluation of RCANE for CNV detection in TCGA testing samples

In TCGA test samples, RCANE outperformed CNAPE, CNVkit, and CopyKAT in whole-genome copy number prediction, achieving the highest F1 scores for predicting deletion, neutral copy number and amplification for all cancer types (Fig. 2a). While CNAPE performed comparably to RCANE in Kidney Chromophobe (KICH), its performance was unstable across other cancer types. CopyKAT performed the worst in this task, likely due to its design for single-cell data, which is not well-suited for bulk RNA-seq.

Fig. 2. Results of CNV detection for the TCGA testing samples.

a F1 score for detecting copy number deletion, neutral and amplification for different cancer types. b Comparison of sensitivity, specificity and MCC of RCANE, CNAPE, CNVkit and CopyKAT. One-sided Mann-Whitney-Wilcoxon test: left group is greater. c True and estimated CNVs for two samples.

All methods exhibited diminished performance in Acute Myeloid Leukemia (LAML), a hematological malignancy (Fig. 2a). This is likely attributable to the typically low and unstable RNA content in blood cells, which makes it a less suitable proxy for SCNA detection.

RCANE also excelled in segment-wise SCNA detection, with an average sensitivity of 0.80, specificity of 0.97, and MCC of 0.79 (Fig. 2b). In comparison, CNAPE tended to under-select, and CNVkit over-selected CNAs, resulting in lower MCCs of 0.37 and 0.35, respectively. Visualizations of two representative samples further illustrated that RCANE accurately recovered both arm-level and focal SCNAs (Fig. 2c). Both RCANE and CNVkit were able to detect broad copy number gains and losses. In contrast, CNAPE captured only a subset of the SCNA patterns, while CopyKAT failed in most cases. Compared with RCANE, CNVkit produced noisier estimates and exhibited a notably higher rate of false positives.
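For readers less familiar with these metrics, the sketch below computes sensitivity, specificity, and the Matthews correlation coefficient (MCC) from per-segment calls. It assumes that calls have been collapsed to a binary label (altered vs. neutral) per segment; that collapsing, and the function name, are illustrative assumptions rather than a description of the evaluation code used in this study.

```python
import math


def detection_metrics(truth, pred):
    """Sensitivity, specificity and MCC for binary per-segment calls.

    truth and pred are equally long sequences of truthy/falsy values,
    where a truthy value marks a segment called as altered.
    """
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc
```

Because MCC uses all four cells of the confusion matrix, a method that over-calls alterations can keep a high sensitivity and still obtain a low MCC, which is the pattern reported for CNVkit above.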

Evaluation of RCANE for CNV detection in DepMap cell lines

We applied RCANE to the DepMap cell line dataset to evaluate its fine-tuning capability. Across all cancer types except Adrenocortical Carcinoma (ACC), the fine-tuned RCANE consistently achieved higher F1 scores than the vanilla model (Fig. 3a). Fine-tuning also improved performance in terms of Jaccard score (Fig. 3b), sensitivity, specificity, and MCC (Fig. 3d), as well as whole-genome SCNA estimation accuracy (Fig. 3e). This provides an efficient mechanism for RCANE to adapt to new datasets, which is especially valuable in scenarios with limited computational resources or training data.

Fig. 3. Results of CNV detection for the cancer cell line testing samples.

a F1 score for detecting copy number deletion, neutral and amplification for different cancer types. b Comparison of Jaccard score of RCANE, CNAPE, CNVkit and CopyKAT. One-sided Mann-Whitney-Wilcoxon test: top group is greater. c Comparison of ROC curves for detecting loss/deletion and gain/amplification of different methods. d Comparison of sensitivity, specificity and MCC of RCANE, CNAPE, CNVkit and CopyKAT. One-sided Mann-Whitney-Wilcoxon test: left group is greater. e True and estimated CNVs for two samples.

While the vanilla RCANE did not match the performance of its fine-tuned counterpart on the cell line data, it still outperformed other methods considered in this study (Fig. 3a–c). Although CNVkit achieved comparable sensitivity, its higher rate of false positives resulted in lower specificity and MCC (Fig. 3d). Visualizations further indicated that although the vanilla RCANE underperformed slightly relative to the fine-tuned model, it was still able to capture most of the key SCNA features (Fig. 3e).

Prediction of intensities and identification of SCNA-related genes

Originally trained on SNP array log-intensity ratio data (log2R), RCANE extends its applicability to predicting continuous log-intensity values, which provide higher resolution than categorical SCNA calls and enable more nuanced downstream analyses. Accurate prediction of log2R values also offers an additional perspective for evaluating the performance of SCNA inference methods.

Among the compared methods, CNVkit also generates log2R estimates. RCANE outperforms CNVkit with more accurate predictions of log-intensity ratios. Across all cancer types, the Pearson correlations between the observed and RCANE-predicted log2R values are consistently close to 1 and significantly higher than those of CNVkit (P < 0.001; Fig. 4a). Furthermore, RCANE captures complex relationships between mRNA expression and copy number alterations across different chromosomal regions (Fig. 4b), resulting in a more accurate genome-wide intensity profile (Fig. 4c). A similar trend is observed in the cancer cell line data, where RCANE continues to outperform CNVkit in intensity estimation (Supplementary Fig. 4). Supplementary Fig. 5 presents examples of the observed and model-predicted log2R intensities of 6 different genomic regions across different cancer types, showing almost perfect predictions.

Fig. 4. Intensity estimation of RCANE and CNVkit on TCGA testing data.

a Boxplot of Pearson correlation between intensity prediction and ground truth for all cancer types. KICH and LAML results from CNVkit are excluded due to model errors. One-sided Mann-Whitney-Wilcoxon test: RCANE is greater than CNVkit. b t-SNE plots of ground truth and predicted intensity with dot locations based on mRNA expression data. Colors indicate copy number intensity in log2 ratio. c Whole-genome intensity prediction for two samples. Colors indicate copy number intensity in log2 ratio.

RCANE’s use of a masking and weighted-average mechanism allows for the identification of SCNA-related genes based on model-assigned weights, with layer normalization ensuring that gene importance is captured exclusively by these weights (Supplementary Fig. 6). This method effectively handles missing RNA-seq data of some genes through masking rather than imputation, which improves model robustness (Supplementary Fig. 7). Genes assigned higher weights are more strongly influenced by SCNAs, while those with lower weights tend to be independent of SCNA effects (Fig. 5a). In our model, we identified SCNA-associated genes such as POLR2H and ABCF3, as well as genes with lower relevance, such as LINC02069 and LINC02054 (Fig. 5b, c).

Fig. 5. Model weights in RCANE capture CNA-RNA correlations.

a Scatter plot showing the relationship between model weights and Pearson correlations of copy number intensity and mRNA expression for each gene. Color indicates the CNA ratio. b Model weights of cancer type and 20 genes within one segment. c Scatter plot illustrating mRNA expression and CNA intensity for the genes with the highest and lowest model weights, with Pearson’s correlation coefficient (r).

Independent validation results of CPTAC data

In the independent validation analysis using CPTAC data, RCANE achieves the highest whole-genome SCNA prediction accuracy, as measured by the Jaccard score (Fig. 6). Although trained exclusively on TCGA RNA-seq data, RCANE outperforms existing methods—including CNAPE, CNVkit, and CopyKAT—across most tumor types in the CPTAC cohort. While performance differences are not statistically significant in a few cancer types, RCANE remains among the top-performing methods. We emphasize that the reference SCNAs in CPTAC are derived from the WES data, whereas those in TCGA are obtained from high-resolution SNP6 arrays. These differences in data modality and resolution introduce a substantial domain shift between the training and evaluation settings. Despite this, RCANE maintains strong predictive performance, demonstrating its reliability and robust generalization across cohorts, platforms, and varying levels of noise in SCNA ground truth.

Fig. 6. Jaccard score of whole-genome CNV detection for CPTAC data.

One-sided Mann-Whitney-Wilcoxon test: left group is greater.

As a comparison, we also perform the CNA detection analysis using methods developed for scRNA-seq data. As expected, these methods do not perform well on bulk RNA-seq data. Details can be found in Supplementary Fig. 8.

Discussion

RCANE offers a cost-effective and accurate solution for predicting SCNAs from RNA-seq data, providing a viable alternative to traditional sequencing-based or array-CGH approaches. This deep learning framework effectively models the complex relationship between gene expression and SCNAs, and consistently outperforms existing tools such as CNAPE, CNVkit, and CopyKAT. RCANE is capable of fine-tuning on small datasets, making it adaptable to new studies with limited samples. Its strong generalization across external datasets further supports its utility in diverse cancer contexts.

Trained on continuous copy number intensity values rather than discrete classes, RCANE enables more nuanced analysis compared to classification-based methods. Moreover, it identifies SCNA-associated genes, offering valuable insights into the regulatory impact of genomic alterations. Future work will focus on incorporating additional features such as clinical data and tumor purity to enhance biomedical interpretability. We anticipate that RCANE will serve as a widely applicable tool for SCNA analysis in cancer research, with potential extensions—including multi-omics integration28,29—further expanding its impact in cancer genomics.

Extending RCANE to scRNA-seq from tumor specimens is a natural next step that leverages its attention-based architecture to resolve intra-tumor heterogeneity. By adapting our input pipeline to accept cell-level expression profiles, RCANE can learn to highlight CNA signals from subpopulations of malignant cells, enabling the detection of both clonal and subclonal events. As larger, high-quality scRNA-DNA cohorts become available, we plan to extend RCANE and benchmark against leading variational-inference methods (e.g., CopyKAT, SCEVAN) and refine our model to integrate multi-omic data, thereby enhancing resolution and translational impact in precision oncology studies.

Methods

Algorithm architecture

RCANE is implemented in Python (version 3.8.19) using NumPy (version 1.26.4), PyTorch (version 2.4.1+cu121), and PyG (version 2.5.3). The input to the neural network consists of three components: (i) a tensor of dimensions B × Ns × Nt representing mRNA expression levels, (ii) a vector of size B encoding cancer type identifiers, and (iii) a masking tensor of the same shape as the expression tensor, B × Ns × Nt, where B denotes the batch size, Ns the number of genomic segments, and Nt the number of transcripts per segment. Given that focal SCNAs typically span a median length of 1.8 megabases (Mb)2, while human genes have a median length of approximately 24 kilobases (kb)30, a typical focal SCNA is expected to affect around 70 genes. To capture such events with sufficient resolution while keeping sequence lengths tractable for LSTM-based models, we set Nt = 20, resulting in Ns = 1514 segments for genome-wide coverage. These segments are divided into 23 sequences corresponding to the 23 chromosomes (chromosomes 1–22 and X). Chromosome 1 contains the longest sequence with 150 segments, while chromosome 21 contains the shortest with 18 segments. This sequence length has been empirically shown to work well for LSTM architectures.
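The arithmetic behind these choices can be checked with a short sketch. Dividing the median focal SCNA length by the median gene length gives 1.8 Mb / 24 kb = 75 genes, in line with the figure of around 70 quoted above. The per-chromosome gene counts used in the usage note are hypothetical values picked only so that they reproduce the stated 150 and 18 segments; the helper names are likewise illustrative.

```python
import math

N_T = 20  # transcripts per segment, as stated in the text


def genes_per_focal_event(scna_bp=1.8e6, gene_bp=24e3):
    """Rough number of genes covered by a median-length focal SCNA."""
    return scna_bp / gene_bp


def segments_per_chromosome(genes_per_chrom, n_t=N_T):
    """Number of segments per chromosome; the last segment may be partial."""
    return {chrom: math.ceil(n / n_t) for chrom, n in genes_per_chrom.items()}


def padding_mask(n_genes, n_t=N_T):
    """One row per segment; True marks a real transcript, False marks padding."""
    n_seg = math.ceil(n_genes / n_t)
    return [[seg * n_t + k < n_genes for k in range(n_t)] for seg in range(n_seg)]
```

For example, a chromosome with a hypothetical 352 retained transcripts yields 18 segments, the last of which holds 12 real transcripts and 8 padded positions; those padded positions are exactly what the masking tensor hides from the network.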

The masking tensor serves three key purposes. First, it handles end-of-chromosome padding: for segments located at the ends of chromosomes that contain fewer than Nt transcripts, the excess positions are masked to prevent invalid inputs from affecting the network. Second, random masking is applied during training to enhance the model’s ability to infer transcript-specific effects of SCNAs. This approach also facilitates the interpretation of gene-level importance, where higher learned weights indicate stronger associations with SCNAs and lower weights suggest reduced relevance. Third, it encourages RCANE to utilize information from all Nt genes within a segment, rather than relying on only a few dominant signals. As a result, in applications where some gene expression values are missing, the model can simply mask out the missing genes and operate on the available expression data while still achieving better performance than data imputation (Supplementary Fig. 7). This mechanism also contributes to reducing overfitting. Further details on the implementation are provided later in this section.

Cancer type information is processed through an embedding layer. Since tumor samples from different cancer types are often collected from distinct tissues, they exhibit unique RNA expression patterns. These inherent tissue-specific differences, however, are not relevant for predicting SCNAs. To account for this, each transcript’s expression value is adjusted based on the cancer type to normalize these variations. The adjusted values are then passed through a dedicated multi-layer perceptron (MLP) per transcript, with an input size of 1 and an output size of Nout, resulting in a total of Ns × Nt MLPs. Given that different cancer types have distinct SCNA baselines, their embeddings are also processed through an MLP, which produces an output of the same size as required by the next layer. The Ns × Nt output vectors from the transcripts are then combined with the cancer type output vector, which is appended to each segment, into a tensor of size Ns × (Nt + 1) × Nout.

Due to the large variation in expression magnitudes across transcripts, we apply layer normalization along the last dimension during training to normalize their scales. To aggregate information from each transcript within a segment, a weighted average is computed over the corresponding row vectors, using a learnable weight vector of size Nt + 1, transformed via the softmax function:

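$$x_{ij}^{wa} = \frac{\sum_{k=1}^{N_t+1} \exp(M_{ijk} + w_{jk})\, x_{ijk}}{\sum_{k=1}^{N_t+1} \exp(M_{ijk} + w_{jk})},$$

where $x_{ijk} \in \mathbb{R}^{N_{out}}$ denotes the output vector for the ith sample, jth segment, and kth transcript (or the cancer type when k = Nt + 1); $w_{jk}$ is the learnable weight for the kth entry of the jth segment; $M_{ijk} \in \{0, -\infty\}$ is the masking indicator, with $M_{ijk} = -\infty$ indicating a masked entry; and $x_{ij}^{wa} \in \mathbb{R}^{N_{out}}$ is the resulting weighted average representation of the segment.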

During training, a random subset of entries is masked, and the remaining non-masked values contribute to the segment-level representation through the weighted average. The exponential of the learnable weight, $\exp(w_{jk}) \in \mathbb{R}^+$, reflects the relative importance of each transcript in predicting the SCNA state of its segment. At evaluation time, all masking is disabled except for the positions corresponding to missing expression values, allowing the model to utilize all available data. The resulting segment-level representations are then passed through segment-specific MLPs, followed by Graph Attention and LSTM layers.
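The masked weighted average can be illustrated for a single sample and a single segment with plain Python. In this sketch, `x` holds the Nt + 1 output vectors, `w` the learnable weights, and `mask` the masking indicators (0 for kept entries, negative infinity for masked ones). The function name is hypothetical, and subtracting the largest logit before exponentiating is a standard numerical-stability step added here; it does not change the result.

```python
import math

NEG_INF = float("-inf")


def masked_weighted_average(x, w, mask):
    """Softmax-weighted average of the vectors in x, skipping masked entries.

    x:    list of N_t + 1 vectors, each of length N_out
    w:    list of N_t + 1 learnable weights
    mask: list of N_t + 1 values, 0.0 (keep) or -inf (masked)
    """
    logits = [m + wk for m, wk in zip(mask, w)]
    top = max(logits)
    if top == NEG_INF:
        raise ValueError("every entry of the segment is masked")
    # exp(-inf) evaluates to 0.0, so masked entries get zero weight.
    expo = [math.exp(l - top) for l in logits]
    total = sum(expo)
    n_out = len(x[0])
    return [sum(e * vec[d] for e, vec in zip(expo, x)) / total
            for d in range(n_out)]
```

With equal weights and one masked entry, the result is the plain mean of the remaining vectors. This is how a missing gene can be dropped at prediction time without being imputed.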

The Graph Attention layer comprises two blocks: one captures positively correlated segments of copy number intensities, and the other captures negatively correlated segments. Both blocks take the output of the preceding layer as node features but operate on distinct sets of graph edges. The first block attends to edges defined by positive correlations, while the second focuses on edges defined by negative correlations. These two types of correlations are modeled separately, as their opposing characteristics require distinct parameterizations within the model.

In parallel with the Graph Attention layers, the model incorporates an LSTM layer with 23 units, each corresponding to a chromosome. Each cell of an LSTM sequence represents a chromosome segment. Each cell receives input from the segment-specific MLP and outputs neighborhood-corrected information for the corresponding segment. This design allows the LSTM layer to capture sequential dependencies across segments within each chromosome. To integrate both sequential and relational information, the outputs from the LSTM layer and the two Graph Attention blocks are concatenated. This combined representation is then passed through a universal MLP, shared across all Ns segments, which generates a scalar output for each segment.

Since our main model is trained on bulk RNA-seq data of cancer tissues, when it is applied to new datasets such as cancer cell lines, variations in data distributions can affect the model performance. To address this, we incorporate a univariate block as the final component of RCANE, specifically designed for fine-tuning. The input to this block has a dimension of Ns, each one representing a genomic segment. It is passed through a set of Ns univariate neural networks, where each layer—including hidden layers—has both input and output dimensions of 1. Each segment is processed independently, without using information from other segments. This design allows an efficient, segment-specific adjustment of predictions, making fine-tuning faster while preserving essential predictive information. The output of this block represents the final predicted log2R intensity for each segment.
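The idea of the univariate block can be sketched as one small chain of scalar (1 to 1) layers per segment. The class below is a framework-free illustration only: the class name, the identity-like initial parameters, and the choice to omit the activation after the last layer are assumptions made here, while the depth of 4 and the Leaky ReLU slope of 0.1 follow the model parameters reported in this section.

```python
def leaky_relu(v, negative_slope=0.1):
    """Leaky ReLU on a scalar."""
    return v if v >= 0 else negative_slope * v


class UnivariateBlock:
    """One independent stack of scalar affine layers per genomic segment."""

    def __init__(self, n_segments, depth=4):
        # Hypothetical starting values; in practice these would be learned.
        self.weights = [[1.0] * depth for _ in range(n_segments)]
        self.biases = [[0.0] * depth for _ in range(n_segments)]

    def forward(self, segment_values):
        """Map one scalar per segment to one adjusted scalar per segment."""
        out = []
        for s, v in enumerate(segment_values):
            last = len(self.weights[s]) - 1
            for layer, (a, b) in enumerate(zip(self.weights[s], self.biases[s])):
                v = a * v + b
                if layer < last:
                    v = leaky_relu(v)  # assumed: no activation on the output
            out.append(v)
        return out
```

Because segment s only ever touches its own weights and biases, updating the parameters of one segment during fine-tuning cannot change the prediction for any other segment.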

Model parameters and computational resource required

All activation functions in RCANE are set to Leaky ReLU with a negative slope of 0.1. The MLPs used for aggregating RNA-seq data and cancer type embeddings have a depth of 2 and a hidden dimension of 32. The Graph Attention module employs 2 attention heads, each with a depth of 2 and hidden dimension of 32, and applies a dropout rate of 0.2 during training. The LSTM component consists of 4 layers with a hidden dimension of 32 and the same dropout rate. The MLP that processes the concatenated outputs from the Graph Attention and LSTM layers also has a depth of 2, with a hidden dimension of 64. The MLP used for fine-tuning is a single-layer network, while the univariate block has a depth of 4. The total number of trainable parameters in the model is approximately 46.8 million.

RCANE is trained for 130 epochs on an NVIDIA A40 GPU with 48 GB of memory, using a batch size of 32. Optimization is performed using the Adam algorithm with a learning rate of 0.001 and the mean squared error loss. In addition, alternative regression losses, such as the Huber loss, are implemented to support future analyses.
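As a reminder of how the two regression losses differ, the following sketch implements the mean squared error and the Huber loss for a list of predicted and observed log2R values. The threshold `delta = 1.0` is an assumed default, since the value used in RCANE is not stated here.

```python
def mse_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)
```

The Huber loss grows only linearly for large residuals, so a few segments with extreme intensities influence the fit less than they would under the squared error.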

Missing RNA-seq data

In many cases, RNA-seq data of interest may contain missing gene expression values compared to the data used to train the RCANE model, often due to differences in data sources or sequencing platforms31. Missing data poses a significant challenge in machine learning, particularly for deep learning models, as it can substantially affect prediction accuracy and robustness.

Nevertheless, RNA expression data is typically highly correlated and often resides on a low-dimensional manifold32,33. For example, genes involved in the same biological pathway frequently exhibit co-expression patterns34,35, implying that the absence of some gene values may not lead to significant information loss. RCANE leverages this property by employing a masking and weighted-average strategy to handle missing data. During prediction, missing values are masked, and the available expression values from other genes within each genomic segment are used to infer SCNAs.

To evaluate RCANE’s performance in the presence of missing data, we simulate incomplete RNA-seq data by randomly removing entries from both the TCGA and cell line datasets. For the cell line data, we assess both the standard RCANE model and its fine-tuned variant. Within the RNA expression matrix, each element is assigned the same probability of being missing, with missingness probabilities p ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6}, where p = 0 denotes complete data. We compare RCANE’s performance against two imputation methods: k-nearest neighbor (KNN) imputation with k = 5 and median imputation (Supplementary Fig. 7). For both methods, imputation is performed using TCGA data from model training. Our results show that RCANE’s masking strategy maintains robust performance even with a substantial proportion of missing RNA expression data, while traditional imputation methods suffer significant performance degradation under similar conditions.
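The simulation setup can be sketched as follows: each entry of an expression matrix is dropped independently with probability p, after which the missing entries are either filled with per-gene medians from a reference (training) matrix or converted into masking indicators. All function names and the use of `None` as the missing-value marker are illustrative assumptions; the KNN baseline is omitted for brevity.

```python
import random
import statistics


def simulate_missing(matrix, p, seed=0):
    """Replace each entry by None independently with probability p."""
    rng = random.Random(seed)
    return [[None if rng.random() < p else v for v in row] for row in matrix]


def median_impute(matrix, reference):
    """Fill None entries with the per-gene median of the reference matrix."""
    medians = [statistics.median(col) for col in zip(*reference)]
    return [[medians[g] if v is None else v for g, v in enumerate(row)]
            for row in matrix]


def missing_mask(matrix):
    """Masking indicators: 0.0 for observed entries, -inf for missing ones."""
    return [[float("-inf") if v is None else 0.0 for v in row]
            for row in matrix]
```

Rows are samples and columns are genes. The mask produced by `missing_mask` plays the role of the masking indicator in the weighted average, so missing genes receive zero weight instead of an imputed value.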

TCGA dataset

The TCGA project analyzed over 11,000 tumor samples across 33 cancer types, encompassing both common forms of cancer, such as glioblastoma, breast cancer, and lung cancer, and rarer types such as adrenocortical carcinoma and mesothelioma. Sample sizes vary by cancer type, with hundreds of samples typically analyzed for each. The largest groups include Breast Invasive Carcinoma (BRCA) with over 1000 samples, Lung Adenocarcinoma (LUAD) and Glioblastoma Multiforme (GBM) with more than 500 samples each, and Colon Adenocarcinoma (COAD) with over 400 samples.

For our study, TCGA data is downloaded using TCGAbiolinks36 (version 2.25.3) through Bioconductor37 (version 3.18), which provides both raw and TPM-normalized mRNA expression matrices, along with SCNA data. The mRNA expression data consists of 60,660 transcripts. Raw mRNA expression data is used as input for CNVkit as recommended12, while TPM-normalized data is used for all other methods. For copy number variation analysis, we retrieve Affymetrix Genome-Wide Human SNP Array 6.0 data, which has been transformed to log2 ratio (log2R) and processed using CBS38. Copy number intensities greater than 0.25 are classified as gains or amplifications, while those below −0.2 are classified as losses or deletions. A total of 8900 tumor samples are used for training and 2226 for testing, with the data randomly split across cancer types. For methods other than RCANE, separate models are trained for each cancer type to account for the heterogeneity among cancers.
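
The thresholding rule above maps each log2R intensity to one of three SCNA classes. A minimal sketch follows; the function and constant names are illustrative, and treating values exactly at a threshold as neutral follows from the strict inequalities in the text.

```python
GAIN_THRESHOLD = 0.25   # log2R above this value: gain/amplification
LOSS_THRESHOLD = -0.2   # log2R below this value: loss/deletion


def classify_log2r(value):
    """Map a log2R copy number intensity to 'G', 'L', or 'N' (neutral)."""
    if value > GAIN_THRESHOLD:
        return "G"
    if value < LOSS_THRESHOLD:
        return "L"
    return "N"
```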

To prepare the input for RCANE, transcripts located on the Y chromosome or expressed in fewer than 30% of the samples are excluded, resulting in a set of 30,096 transcripts. These transcripts are reordered by genomic location across 23 chromosomes based on the genome reference GRCh38. Within each chromosome, 20 adjacent transcripts are grouped into segments, producing a tensor of size N × 1514 × 20. The mRNA expression data are then transformed using log2(1+TPM). For copy number data, we generate the RCANE output by calling the copy number for each selected transcript, with each segment log2R intensity represented by the median value of the 20 transcripts within that segment.
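
The segmentation step can be sketched as below: transcripts are sorted by position within each chromosome, grouped in runs of 20 without crossing chromosome boundaries, and summarized by the median. The tuple layout, the function names, and the choice to keep a shorter final segment on each chromosome are assumptions; the text does not say how incomplete segments are handled.

```python
import math
from statistics import median

SEGMENT_SIZE = 20  # adjacent transcripts per segment, as stated in the text


def log2_tpm(tpm):
    """Expression transform log2(1 + TPM)."""
    return math.log2(1.0 + tpm)


def group_into_segments(transcripts, size=SEGMENT_SIZE):
    """Group transcripts into per-chromosome segments.

    transcripts: iterable of (chromosome, position, value) tuples.
    Returns a list of (chromosome, [values]) pairs. Chromosomes appear in
    order of first occurrence; the last segment of a chromosome may be
    shorter than `size` (an assumption of this sketch).
    """
    by_chrom = {}
    for chrom, pos, value in transcripts:
        by_chrom.setdefault(chrom, []).append((pos, value))
    segments = []
    for chrom, entries in by_chrom.items():
        ordered = [v for _, v in sorted(entries, key=lambda e: e[0])]
        for start in range(0, len(ordered), size):
            segments.append((chrom, ordered[start:start + size]))
    return segments


def segment_medians(segments):
    """Median value per segment, e.g. the segment-level log2R intensity."""
    return [median(values) for _, values in segments]
```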

To build the correlation graphs used by RCANE, we compute a segment correlation matrix for each cancer type based on copy number intensity data from samples of that cancer type. Positive correlation graphs include edges between segment pairs with correlation values greater than 0.1, while negative correlation graphs include pairs with correlation values below −0.1. Only segment pairs from different chromosomes are considered when constructing the correlation graphs.
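
A minimal sketch of this graph construction for one cancer type is given below. It computes Pearson correlations between segment-level copy number profiles and keeps only inter-chromosomal pairs beyond the ±0.1 thresholds. The function names, the edge-list representation, and the treatment of constant profiles as having zero correlation are assumptions of the sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (an assumption here)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def build_correlation_graphs(segment_values, segment_chroms, threshold=0.1):
    """Build positive and negative correlation edge lists.

    segment_values[i]: copy number intensities of segment i across the
                       samples of one cancer type.
    segment_chroms[i]: chromosome label of segment i.
    Pairs on the same chromosome are skipped, as described in the text.
    """
    positive, negative = [], []
    n = len(segment_values)
    for i in range(n):
        for j in range(i + 1, n):
            if segment_chroms[i] == segment_chroms[j]:
                continue
            r = pearson(segment_values[i], segment_values[j])
            if r > threshold:
                positive.append((i, j))
            elif r < -threshold:
                negative.append((i, j))
    return positive, negative
```

With about 1,500 segments the pairwise loop visits roughly a million pairs per cancer type, so a vectorized correlation routine would be preferable in practice.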

DepMap cell line dataset

We select cell lines in the DepMap dataset with cancer types overlapping those in the TCGA dataset and randomly split them by cancer type, resulting in 114 training samples and 266 testing samples across 17 cancer types. For RCANE, the training samples along with cancer type information are used for fine-tuning. For other methods, the training samples from different cancer types are combined and used for training from scratch due to the small sample size.

The mRNA expression data initially contain 53,961 transcripts, from which we select 29,088 transcripts that overlap with the TCGA data. Other missing transcripts are masked during fine-tuning and evaluation. As the cell line data does not provide raw counts, we use log2(1+TPM) as input for all four methods examined in this work. Due to the significant distribution shift between the TCGA and cell line mRNA expression data, we apply the ComBat39 function from sva40 (version 3.50.0) with default parameters to correct for batch effects. The TCGA data is used as the reference, and cancer type is included as a covariate.

The copy number intensity data for the cell line is processed by the Broad Institute. Unlike the TCGA data, these values are not log2 transformed, so we apply the transformation manually. We then summarize the intensities into segments and define losses/deletions and gains/amplifications in the same way as the TCGA data.

CPTAC dataset

The CPTAC data contains 10 cancer types, where somatic copy number ratios are derived from WES bam files preprocessed using the CopywriteR41 package. The CBS algorithm implemented in CopywriteR is used for copy number segmentation. For independent validation, we download the TPM RNA-seq and log2 copy number variation data, resulting in 1022 matched samples. Three cancer types contain 60,669 transcripts, while the remaining types have 59,361 transcripts. RNA-seq data are matched to TCGA data based on gene symbols, and missing values are masked during evaluation. The copy number intensity ratios are categorized into three SCNA classes using the same criteria as applied to the TCGA data.

For CNVkit, correlation files derived from TCGA data are used as the input. Similarly, CNAPE models trained on TCGA RNA-seq data are applied to CPTAC samples for copy number estimation. As CopyKAT does not require external references, it is applied directly to the CPTAC data.

Statistics and reproducibility

The performance of RCANE and other methods is evaluated using scikit-learn (version 1.3.2). Visualizations are based on pandas (version 2.0.2), matplotlib (version 3.9.1) and seaborn (version 0.13.2). In the box plots, the center line indicates the median, the box limits represent the first and third quartiles, and the whiskers extend to 1.5 times the interquartile range. The Accuracy score is defined by:

$$\text{Accuracy score} = \frac{\sum_{c \in \{L,N,G\}} \mathrm{TP}_c}{\sum_{c \in \{L,N,G\}} \left(\mathrm{TP}_c + \mathrm{FP}_c\right)},$$

and the Jaccard score is calculated as follows:

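$$\text{Jaccard score} = \frac{\sum_{c \in \{L,N,G\}} \mathrm{TP}_c}{\sum_{c \in \{L,N,G\}} \left(\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c\right)},$$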

where L, N, and G represent copy number loss/deletion, neutral, and gain/amplification categories, respectively. $\mathrm{TP}_c$, $\mathrm{FP}_c$ and $\mathrm{FN}_c$ denote the true positives, false positives, and false negatives for category c. The F1 score for each category is defined as:

$$\mathrm{F1}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}.$$

For SCNA detection, we define:

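$$\mathrm{TP} = \mathrm{TP}_L + \mathrm{TP}_G,\quad \mathrm{TN} = \mathrm{TP}_N,\quad \mathrm{FP} = \mathrm{FP}_L + \mathrm{FP}_G,\quad \mathrm{FN} = \mathrm{FP}_N,$$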

and

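$$\text{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},\quad \text{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},\quad \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}.$$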
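
As a hedged illustration of how these definitions combine, the sketch below computes all of the metrics from paired lists of true and predicted class labels in plain Python. It is not the scikit-learn-based evaluation code used in the study; the function names and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import math

CLASSES = ("L", "N", "G")  # loss/deletion, neutral, gain/amplification


def per_class_counts(y_true, y_pred):
    """Count TP, FP and FN for each class from paired label lists."""
    counts = {c: {"TP": 0, "FP": 0, "FN": 0} for c in CLASSES}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["TP"] += 1
        else:
            counts[pred]["FP"] += 1   # predicted class gains a false positive
            counts[truth]["FN"] += 1  # true class gains a false negative
    return counts


def _ratio(numerator, denominator):
    """Safe division; returns 0.0 for an empty denominator (assumption)."""
    return numerator / denominator if denominator else 0.0


def scna_metrics(y_true, y_pred):
    """Compute the scores defined in this section."""
    c = per_class_counts(y_true, y_pred)
    sum_tp = sum(c[k]["TP"] for k in CLASSES)
    sum_fp = sum(c[k]["FP"] for k in CLASSES)
    sum_fn = sum(c[k]["FN"] for k in CLASSES)
    f1 = {k: _ratio(2 * c[k]["TP"],
                    2 * c[k]["TP"] + c[k]["FP"] + c[k]["FN"]) for k in CLASSES}
    # Binary SCNA detection: aberrant (L or G) versus neutral.
    tp = c["L"]["TP"] + c["G"]["TP"]
    tn = c["N"]["TP"]
    fp = c["L"]["FP"] + c["G"]["FP"]
    fn = c["N"]["FP"]
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(sum_tp, sum_tp + sum_fp),
        "jaccard": _ratio(sum_tp, sum_tp + sum_fp + sum_fn),
        "f1": f1,
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

Under these definitions, a true gain predicted as a loss (or the reverse) is counted in FP for detection purposes, because it contributes to FP_L or FP_G.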

Pearson’s correlation coefficients (r) are computed in SciPy (version 1.13.1). The P-values of the one-sided Mann-Whitney-Wilcoxon tests are computed in statannotations (version 0.7.2). We use * for P < 0.05, ** for P < 0.01, *** for P < 0.001, and ns for P ≥ 0.05.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Supplemental Materials (4.5MB, pdf)
Reporting Summary (6MB, pdf)

Acknowledgements

This research was supported by NIH grant GM129781.

Author contributions

Changhao Ge, Xiaowen Hu, and Lin Zhang curated and prepared the data. Changhao Ge and Hongzhe Li designed the model architecture. Changhao Ge implemented the code for model training and data analysis. Changhao Ge, Xiaowen Hu, and Hongzhe Li wrote the manuscript. All authors reviewed and approved the final version of the manuscript.

Peer review

Peer review information

Communications Biology thanks Yingnian Wu, Yikai Luo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Aylin Bircan, Laura Rodriguez Perez.

Data availability

The data used for model training and evaluation are publicly available at https://gdac.broadinstitute.org/, https://depmap.org/portal/, https://depmap.sanger.ac.uk/, and https://kb.linkedomics.org/. The preprocessed data, along with model checkpoint files trained on TCGA data and fine-tuned with DepMap cell line data, are available in Zenodo at 10.5281/zenodo.13975633 (ref. 42).

Code availability

The scripts for model training, performance evaluation, data analysis, and figure generation are available in Zenodo at 10.5281/zenodo.16712285 (ref. 43), under the open-source MIT license.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-025-08712-6.

References

  • 1. Steele, C. D. et al. Signatures of copy number alterations in human cancer. Nature 606, 984–991 (2022).
  • 2. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010).
  • 3. Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134–1140 (2013).
  • 4. Taylor, A. M. et al. Genomic and functional approaches to understanding cancer aneuploidy. Cancer Cell 33, 676–689 (2018).
  • 5. Zare, F., Dow, M., Monteleone, N., Hosny, A. & Nabavi, S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinforma. 18, 286 (2017).
  • 6. Bird, A. P. CpG-rich islands and the function of DNA methylation. Nature 321, 209–213 (1986).
  • 7. Jones, P. A. & Baylin, S. B. The fundamental role of epigenetic events in cancer. Nat. Rev. Genet. 3, 415–428 (2002).
  • 8. Bird, A. DNA methylation patterns and epigenetic memory. Genes Dev. 16, 6–21 (2002).
  • 9. Mattei, A. L., Bailly, N. & Meissner, A. DNA methylation: a historical perspective. Trends Genet. 38, 676–707 (2022).
  • 10. Bhattacharya, A. et al. Transcriptional effects of copy number alterations in a large set of human cancers. Nat. Commun. 11, 715 (2020).
  • 11. Pinkel, D. & Albertson, D. G. Array comparative genomic hybridization and its applications in cancer. Nat. Genet. 37, S11–S17 (2005).
  • 12. Talevich, E., Shain, A. H., Botton, T. & Bastian, B. C. CNVkit: Genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Computational Biol. 12, e1004873 (2016).
  • 13. Mu, Q. & Wang, J. CNAPE: A machine learning method for copy number alteration prediction from gene expression. IEEE/ACM Trans. Computational Biol. Bioinforma. 18, 306–311 (2021).
  • 14. Bařinka, J. et al. RNAseqCNV: analysis of large-scale copy number variations from RNA-seq data. Leukemia 36, 1492–1498 (2022).
  • 15. Serin Harmanci, A., Harmanci, A. O. & Zhou, X. CaSpER identifies and visualizes CNV events by integrative analysis of single-cell or bulk RNA-sequencing data. Nat. Commun. 11, 89 (2020).
  • 16. Flensburg, C., Sargeant, T., Oshlack, A. & Majewski, I. J. SuperFreq: Integrated mutation detection and clonal tracking in cancer. PLoS Computational Biol. 16, e1007603 (2020).
  • 17. Gao, R. et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat. Biotechnol. 39, 599–608 (2021).
  • 18. De Falco, A., Caruso, F., Su, X.-D., Iavarone, A. & Ceccarelli, M. A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. Nat. Commun. 14, 1074 (2023).
  • 19. Novakovsky, G., Dexter, N., Libbrecht, M. W., Wasserman, W. W. & Mostafavi, S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. 24, 125–137 (2023).
  • 20. The Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
  • 21. Ba, J. L. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • 22. Hochreiter, S. Long short-term memory. Neural Computation MIT-Press (1997).
  • 23. Yun, S., Jeong, M., Kim, R., Kang, J. & Kim, H. J. Graph transformer networks. Adv. Neural Info. Proc. Syst. 32 (2019).
  • 24. Eckel-Passow, J. E. et al. Glioma groups based on 1p/19q, IDH, and TERT promoter mutations in tumors. N. Engl. J. Med. 372, 2499–2508 (2015).
  • 25. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
  • 26. Tsherniak, A. et al. Defining a cancer dependency map. Cell 170, 564–576 (2017).
  • 27. Ellis, M. J. et al. Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov. 3, 1108–1112 (2013).
  • 28. Subramanian, I., Verma, S., Kumar, S., Jere, A. & Anamika, K. Multi-omics data integration, interpretation, and its application. Bioinforma. Biol. Insights 14, 1177932219899051 (2020).
  • 29. Picard, M., Scott-Boyer, M.-P., Bodein, A., Périn, O. & Droit, A. Integration strategies of multi-omics data for machine learning analysis. Computational Struct. Biotechnol. J. 19, 3735–3746 (2021).
  • 30. Fuchs, G. et al. 4sUDRB-seq: Measuring genomewide transcriptional elongation rates and initiation frequencies within cells. Genome Biol. 15, R69 (2014).
  • 31. Chu, Y. & Corey, D. R. RNA sequencing: platform selection, experimental design, and data interpretation. Nucleic Acid Therapeutics 22, 271–274 (2012).
  • 32. Chapman, A. R. et al. Correlated gene modules uncovered by high-precision single-cell transcriptomics. Proc. Natl Acad. Sci. 119, e2206938119 (2022).
  • 33. Verma, A. & Engelhardt, B. E. A robust nonlinear low-dimensional manifold for single cell RNA-seq data. BMC Bioinforma. 21, 1–15 (2020).
  • 34. Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003).
  • 35. Abu-Jamous, B. & Kelly, S. Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biol. 19, 172 (2018).
  • 36. Colaprico, A. et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 44, e71 (2016).
  • 37. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, 1–16 (2004).
  • 38. Olshen, A. B., Venkatraman, E. S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).
  • 39. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2006).
  • 40. Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
  • 41. Kuilman, T. et al. CopywriteR: DNA copy number detection from off-target sequence data. Genome Biol. 16, 49 (2015).
  • 42. Ge, C., Hu, X., Zhang, L. & Li, H. Supporting data for RCANE: A deep learning algorithm for whole-genome pan-cancer somatic copy number aberration prediction using RNA-seq data. Zenodo 10.5281/zenodo.13975633 (2024).
  • 43. Ge, C. Code for RCANE: A deep learning algorithm for whole-genome pan-cancer somatic copy number aberration prediction using RNA-seq data. 10.5281/zenodo.16712285 (2025).



Articles from Communications Biology are provided here courtesy of Nature Publishing Group
