From genome to phenome: Predicting multiple cancer phenotypes based on somatic genomic alterations via the genomic impact transformer

Yifeng Tao; Chunhui Cai; William W Cohen; Xinghua Lu

Author manuscript; available in PMC: 2021 Jan 1.

Published in final edited form as: Pac Symp Biocomput. 2020;25:79–90.

PMCID: PMC6932864 NIHMSID: NIHMS1061137 PMID: 31797588

Abstract

Cancers are mainly caused by somatic genomic alterations (SGAs) that perturb cellular signaling systems and eventually activate oncogenic processes. Therefore, understanding the functional impact of SGAs is a fundamental task in cancer biology and precision oncology. Here, we present a deep neural network model with an encoder-decoder architecture, referred to as the genomic impact transformer (GIT), to infer the functional impact of SGAs on cellular signaling systems by modeling the statistical relationships between SGA events and differentially expressed genes (DEGs) in tumors. The model utilizes a multi-head self-attention mechanism to identify SGAs that likely cause DEGs or, in other words, to differentiate potential driver SGAs from passenger ones in a tumor. The GIT model learns a vector (gene embedding) as an abstract representation of functional impact for each SGA-affected gene. Given the SGAs of a tumor, the model can instantiate the states of the hidden layer, providing an abstract representation (tumor embedding) that reflects characteristics of perturbed molecular/cellular processes in the tumor, which in turn can be used to predict multiple phenotypes. We apply the GIT model to 4,468 tumors profiled by The Cancer Genome Atlas (TCGA) project. The attention mechanism enables the model to capture the statistical relationship between SGAs and DEGs better than conventional methods and to distinguish cancer drivers from passengers. The learned gene embeddings capture the functional similarity of SGAs perturbing common pathways. The tumor embeddings are shown to be useful for tumor status representation and for phenotype prediction, including patient survival time and drug response of cancer cell lines.^*

Keywords: Neural networks, Knowledge representation, Gene regulatory networks, Cancer

1. Introduction

Cancer is mainly caused by the activation of oncogenes or deactivation of tumor suppressor genes (collectively called “driver genes”) as a result of somatic genomic alterations (SGAs),¹ including somatic mutations (SMs),^2,3 somatic copy number alterations (SCNAs),^4,5 DNA structure variations (SVs),⁶ and epigenetic changes.⁷ Precision oncology relies on the capability of identifying and targeting tumor-specific aberrations resulting from driver SGAs and their effects on molecular and cellular phenotypes. However, our knowledge of driver SGAs and cancer pathways remains incomplete. In particular, it remains a challenge to determine which SGAs (among often hundreds) in a specific tumor are drivers, which cellular signals or biological processes a driver SGA perturbs, and which molecular/cellular phenotypes a driver SGA affects.

Current methods for identifying driver genes mainly concentrate on identifying genes that are mutated at a frequency above expectation, based on the assumption that mutations in these genes may provide oncogenic advantages and thus are positively selected.^8,9 Some works further focus on mutations perturbing conserved (potentially functional) domains of proteins as indications that they may be driver events.^10,11 However, these methods do not provide any information regarding the functional impact of enriched mutations on molecular/cellular phenotypes of cells. Without knowledge of the functional impact, it is difficult to further determine whether an SGA will lead to specific molecular, cellular and clinical phenotypes, such as response to therapies. Moreover, while both SMs and SCNAs may activate/deactivate a driver gene, there is no well-established frequency-based method that combines different types of SGAs to determine their functional impact.

Conventionally, an SGA event perturbing a gene in a tumor is represented as a “one-hot” vector spanning gene space, in which the element corresponding to the perturbed gene is set to “1”. This representation simply indicates which gene is perturbed, but it does not reflect the functional impact of the SGA, nor can it represent the similarity of distinct SGAs that perturb a common signaling pathway. We conjecture that it is possible to represent an SGA as a low-dimensional vector, in the same manner as the “word embedding”^12–14 in the natural language processing (NLP) field, such that the representation reflects the functional impact of a gene on biological systems, and genes sharing similar functions should be closely located in such embedding space. Here the “similar function” is broadly defined, e.g., genes from the same pathway or of the same biological process.¹⁵ Motivated by this, we propose a scheme for learning “gene embeddings” for SGA-affected genes, i.e., a mapping from individual genes to low-dimensional vectors of real numbers that are useful in multiple prediction tasks.
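The contrast between the two representations can be illustrated with a small sketch. The embedding values below are made up for illustration and are not learned by GIT; the only point is that one-hot vectors of distinct genes are always orthogonal, whereas dense embeddings can place two genes close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

genes = ["TP53", "PTEN", "PIK3CA"]

# One-hot: each gene owns one axis, so two distinct genes never overlap.
one_hot = {g: [1.0 if i == j else 0.0 for j in range(len(genes))]
           for i, g in enumerate(genes)}

# Hypothetical 2-d embeddings (illustrative values only, not GIT output).
embedding = {
    "TP53": [0.9, -0.2],
    "PTEN": [-0.1, 0.8],
    "PIK3CA": [-0.2, 0.7],
}

one_hot_sim = cosine(one_hot["PTEN"], one_hot["PIK3CA"])
embed_sim = cosine(embedding["PTEN"], embedding["PIK3CA"])
print(one_hot_sim, embed_sim)
```

In this toy setting the one-hot similarity is exactly zero for any pair of distinct genes, while the embedding similarity depends on the learned coordinates.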

Based on the assumption that SGAs perturbing cellular signaling systems often eventually lead to changes in gene expression,¹⁶ we introduce an encoder-decoder neural network model called the “genomic impact transformer” (GIT) to predict DEGs and to detect potential cancer drivers with the supervision of DEGs. While deep learning models are increasingly used to model different bioinformatics problems,^17,18 to our knowledge there are few studies using neural networks to model the relationships between SGAs and molecular/cellular phenotypes in cancers. The proposed GIT model has the following innovative characteristics: (1) The encoder part of the transformer¹⁹ first uses SGAs observed in a tumor as inputs, maps each SGA into a gene embedding representation, and combines the gene embeddings of SGAs to derive a personalized “tumor embedding”. The decoder part then decodes and translates the tumor embedding to DEGs. (2) A multi-head self-attention mechanism^20,21 is utilized in the encoder; this technique is widely used in NLP to choose the input features that significantly influence the output. It differentiates SGAs by assigning different weights to them, so that it can potentially distinguish SGAs that have an impact on DEGs from those that do not, i.e., distinguish drivers from passengers. (3) Pooling the inferred weighted impact of SGAs in a tumor produces a personalized tumor embedding, which can be used as an effective feature to predict DEGs and other phenotypes. (4) Gene embeddings are pre-trained by a “Gene2Vec” algorithm and further refined by the GIT, which captures the functional impact of SGAs on the cellular signaling system. Our results and analysis indicate that the above approaches enable us to derive powerful gene embedding and tumor embedding representations that are highly informative of molecular, cellular and clinical phenotypes.

2. Materials and methods

2.1. SGAs and DEGs pre-processing

We obtained SGA data, including SMs and SCNAs, and DEGs of 4,468 tumors spanning 16 cancer types directly from the TCGA portal.²² Details are available in the SI (Sec. S1).

2.2. The GIT neural network

2.2.1. GIT network structure: encoder-decoder architecture

Figure 1a shows the general structure of the GIT model with an overall encoder-decoder architecture. GIT mimics a hierarchically organized cellular signaling system,^23,24 in which a neuron may potentially encode the signal of one or more signaling proteins. When a cellular signaling system is perturbed by SGAs, this can often lead to changes in measured molecular phenotypes, such as gene expression changes. Thus, for a tumor t, the set of its SGAs $\{g\}_{g=1}^{m}$ is connected to the GIT neural network as observed input (Fig. 1a bottom part squares). The impact of SGAs is represented as embedding vectors $\{e_g\}_{g=1}^{m}$, which are further linearly combined to produce a tumor embedding vector e_t through an attention mechanism in the encoder (Fig. 1a middle part). We explicitly represent the cancer type s and its influence on the encoding system, e_s, of the tumor because tissue type also influences which genes are expressed in cells of a specific tissue. Finally, the decoder module, which consists of a feed-forward multi-layer perceptron (MLP),²⁵ transforms the functional impact of SGAs and cancer type into DEGs of the tumor (Fig. 1a top part).

Fig. 1. **(a)** Overall architecture of GIT. An example case and its detected drivers are shown. **(b)** A two-dimensional demo that shows how the attention mechanism combines multiple gene embeddings of SGAs $\{e_g\}_{g=1}^{m}$ and the cancer type embedding e_s into a tumor embedding vector e_t using attention weights $\{\alpha_g\}_{g=1}^{m}$. **(c)** Calculation of attention weights $\{\alpha_g\}_{g=1}^{m}$ using gene embeddings $\{e_g\}_{g=1}^{m}$.

2.2.2. Pre-training gene embeddings using Gene2Vec algorithm

In this study, we projected the discrete binary representation of SGAs perturbing a gene into a continuous embedding space, which we call the “gene embeddings” of the corresponding SGAs, using a “Gene2Vec” algorithm. The algorithm is based on the co-occurrence patterns of SGAs in each tumor, including mutually exclusive patterns of mutations affecting a common pathway.²⁶ These gene embeddings were further updated and fine-tuned by the GIT model with the supervision of affected DEGs. Algorithm details are available in the SI (Sec. S2).

2.2.3. Encoder: multi-head self-attention mechanism

To detect differences in the functional impact of SGAs in a tumor, we designed a multi-head self-attention mechanism (Fig. 1a middle part). For all SGA-affected genes $\{g\}_{g=1}^{m}$ and the cancer type s of a tumor t, we first mapped them to the corresponding gene embeddings $\{e_g\}_{g=1}^{m}$ and a cancer type embedding e_s from a look-up table $E = \{e_g\}_{g \in G} \cup \{e_s\}_{s \in S}$, where e_g and e_s are real-valued vectors. From the implementation perspective, we treated cancer types in the same way as SGAs, except that the attention weight of the cancer type is fixed to “1”. The overall idea of producing the tumor embedding e_t is to use the weighted sum of the cancer type embedding e_s and the gene embeddings $\{e_g\}_{g=1}^{m}$ (Fig. 1b):

$$e_{t} = 1 \cdot e_{s} + \sum_{g} \alpha_{g} \cdot e_{g} = 1 \cdot e_{s} + \alpha_{1} \cdot e_{1} + \dots + \alpha_{m} \cdot e_{m}. \tag{1}$$
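As a minimal sketch of Eq. 1, the weighted sum can be written in a few lines of plain Python. The function name `tumor_embedding` and all numeric values are illustrative assumptions; in the model the attention weights come from the self-attention module rather than being supplied by hand.

```python
def tumor_embedding(e_s, gene_embeddings, alphas):
    """Eq. 1: e_t = 1 * e_s + sum_g alpha_g * e_g."""
    if len(gene_embeddings) != len(alphas):
        raise ValueError("need exactly one attention weight per SGA")
    e_t = list(e_s)  # the cancer type embedding enters with fixed weight 1
    for alpha, e_g in zip(alphas, gene_embeddings):
        for k, value in enumerate(e_g):
            e_t[k] += alpha * value
    return e_t

# Toy 2-d example with three SGAs (values chosen for illustration only).
e_s = [0.5, -1.0]
e_genes = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
alphas = [0.6, 0.3, 0.1]

e_t = tumor_embedding(e_s, e_genes, alphas)
print(e_t)  # approximately [1.0, -0.3]
```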

The attention weights $\{\alpha_g\}_{g=1}^{m}$ were calculated by employing the multi-head self-attention mechanism, using the gene embeddings of SGAs $\{e_g\}_{g=1}^{m}$ in the tumor: $\{\alpha_g\}_{g=1}^{m} = \mathrm{Function}_{\mathrm{Attention}}(\{e_g\}_{g=1}^{m}; W_0, \Theta)$ (Fig. 1c). See the SI (Sec. S3) for mathematical details. Overall, we have three sets of parameters $\{W_0, \Theta, E\}$ to train in the multi-head attention module using back-propagation.²⁷ The look-up table $\{e_g\}_{g \in G}$ was initialized with Gene2Vec pre-trained gene embeddings and refined by GIT here.
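The exact attention function is given in the SI (Sec. S3) and is not reproduced in this section. Purely as an illustration, the sketch below assumes one common additive form: each head j scores an SGA as theta_j · tanh(W0 · e_g), the scores are normalized with a softmax over the SGAs of the tumor, and the per-head weights are summed. This form, the helper names, and all numbers are assumptions and may differ from the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attention_weights(gene_embeddings, W0, Theta):
    """Assumed form: per head, softmax over SGAs of theta . tanh(W0 e_g); heads are summed."""
    hidden = [[math.tanh(h) for h in matvec(W0, e_g)] for e_g in gene_embeddings]
    alphas = [0.0] * len(gene_embeddings)
    for theta in Theta:  # one row of Theta per attention head
        scores = [sum(t * h for t, h in zip(theta, hid)) for hid in hidden]
        for g, a in enumerate(softmax(scores)):
            alphas[g] += a
    return alphas

# Toy example: three SGAs, 2-d embeddings, two heads (all values illustrative).
e_genes = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
W0 = [[0.5, -0.5], [1.0, 0.2]]
Theta = [[1.0, 0.0], [0.0, 1.0]]

alphas = attention_weights(e_genes, W0, Theta)
print(alphas)
```

Under this assumed form every weight is positive and the weights of one tumor sum to the number of heads, so SGAs compete with each other for attention within a tumor.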

2.2.4. Decoder: multi-layer perceptron (MLP)

For a specific tumor t, we fed the tumor embedding e_t into an MLP with one hidden layer as the decoder, using non-linear activation functions and fully connected layers, to produce the final predictions $\hat{y}$ for DEGs y (Fig. 1a top part):

$$\hat{y} = \sigma(W_{2} \cdot \mathrm{ReLU}(W_{1} \cdot \mathrm{ReLU}(e_{t}) + b_{1}) + b_{2}). \tag{2}$$

where $\mathrm{ReLU}(x) = \max(0, x)$ is the rectified linear unit and $\sigma(x) = (1 + \exp(-x))^{-1}$ is the sigmoid activation function. The output of the decoder and the actual values of the DEGs were used to calculate the $l_2$-regularized cross entropy, which was minimized during training: $\min_{W, E, \Theta, b} \mathrm{CrossEnt}(y, \hat{y}) + l_2(W, E, \Theta; \lambda_2)$, where $W = \{W_l\}_{l=0}^{2}$, the cross entropy loss is defined as $\mathrm{CrossEnt}(y, \hat{y}) = -\sum_{i} [(1 - y_i) \log(1 - \hat{y}_i) + y_i \log \hat{y}_i]$, and the $l_p$ regularizer is defined as $l_p(W; \lambda) = \lambda \cdot \sum_{l} \|W_l\|_p$, $p \in \{1, 2\}$.
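The decoder of Eq. 2 and the training objective can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' PyTorch code: the helper names are hypothetical, the weights are toy values, and the penalty follows the regularizer exactly as written above (a sum of unsquared norms, here taken as the Frobenius norm of each matrix).

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def affine(W, v, b):
    """W . v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def decode(e_t, W1, b1, W2, b2):
    """Eq. 2: y_hat = sigmoid(W2 . ReLU(W1 . ReLU(e_t) + b1) + b2)."""
    hidden = relu(affine(W1, relu(e_t), b1))
    return sigmoid(affine(W2, hidden, b2))

def cross_entropy(y, y_hat):
    """Binary cross entropy summed over DEGs."""
    return -sum((1 - yi) * math.log(1 - pi) + yi * math.log(pi)
                for yi, pi in zip(y, y_hat))

def l2_penalty(matrices, lam):
    """lam * sum_l ||W_l||_2, using the Frobenius norm of each matrix."""
    return lam * sum(math.sqrt(sum(w * w for row in W for w in row))
                     for W in matrices)

# Toy example: 2-d tumor embedding, 2 hidden units, 2 output DEGs.
e_t = [1.0, -2.0]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[2.0, 5.0], [-2.0, 5.0]], [0.0, 0.0]

y_hat = decode(e_t, W1, b1, W2, b2)  # logits are [2, -2]
loss = cross_entropy([1, 0], y_hat) + l2_penalty([W1, W2], lam=1e-4)
print(y_hat, loss)
```

Note that the negative component of e_t is clipped by the inner ReLU before it reaches the first layer, so the second hidden unit stays at zero in this example.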

2.3. Training and evaluation

We utilized PyTorch (https://pytorch.org/) to train, validate and test the Gene2Vec, GIT (variants) and other conventional models (Lasso and MLPs; Section 3.1). The training, validation and test sets were split in the ratio of 0.33:0.33:0.33 and fixed across different models. The hyperparameters were tuned over the training and validation sets to obtain the best F1 scores; unless otherwise noted below, the models were then trained on the training and validation sets and finally applied to the test set for evaluation. The models were trained by updating parameters using backpropagation,²⁷ specifically using mini-batch Adam²⁸ with default momentum parameters. Gene2Vec used mini-batch stochastic gradient descent (SGD) instead of Adam. Dropout²⁹ and weight decay ($l_p$-regularization) were used to prevent overfitting. We trained all the models for 30 to 42 epochs until they fully converged. The output DEGs were represented as a sparse binary vector. We utilized various performance metrics including accuracy, precision, recall, and F1 score, where F1 is the harmonic mean of precision and recall. Training and testing were repeated for five runs to get the mean and variance of the evaluation metrics. We designed two metrics in the present work for evaluating the functional similarity among genes sharing similar gene embeddings: “nearest neighborhood (NN) accuracy” and “GO enrichment”. See the SI (Sec. S4) for their definitions and meaning.
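For reference, the four metrics can be computed from binary DEG vectors as in the sketch below. It assumes that predictions are pooled over all tumor/DEG entries before counting; how the paper aggregates across DEGs and tumors is not stated here, so this pooling is an assumption, and the function name and data are illustrative.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for two equal-length 0/1 vectors."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy example: 8 DEG entries, 3 truly differentially expressed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because DEG vectors are sparse, accuracy can stay high even when recall is modest, which is why F1 is used for hyperparameter tuning.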

3. Results

3.1. GIT statistically detects real biological signals

The task of GIT is to predict DEGs (dependent variables) using SGAs as input (independent variables). Our results of GIT performance on both real and shuffled data demonstrate that GIT is able to capture real statistical relationships between SGAs and DEGs from the noisy biological data (SI: Sec. S5).

As a comparison, we also trained and tested the Lasso (multivariate regression with $l_1$-regularization)³⁰ and MLPs²⁵ as baseline models to predict DEGs based on SGAs. The Lasso model is appealing in our setting because, when predicting a DEG, it can filter out most of the irrelevant input variables (SGAs) and keep only the most informative ones, which makes it a natural choice in our case, where there are 19.8k possible SGAs. However, in comparison to an MLP, it lacks the capability of portraying complex relationships between SGAs and DEGs. On the other hand, while conventional MLPs have sufficient power to capture complex relationships (in particular, the neurons in hidden layers may mimic signaling proteins²⁴), they cannot utilize any biological knowledge extracted from cancer genomics, nor do they explain the signaling process or distinguish driver SGAs. We employed precision, recall, F1 score, and accuracy to compare GIT and traditional methods (Table 1: 1st to 4th, and last rows). GIT outperforms all of these conventional baseline methods for predicting DEGs in all metrics, indicating that the specifically designed structure of GIT substantially improves performance in the task of predicting DEGs from SGAs.

Table 1.

Performances of GIT (variants) and baseline methods.

Methods	Precision	Recall	F1 score	Accuracy

Lasso	59.6±0.05	52.8±0.03	56.0±0.01	74.0±0.02
1 layer MLP	61.9±0.09	50.4±0.17	55.6±0.07	74.7±0.02
2 layer MLP	64.2±0.39	52.0±0.66	56.5±0.19	75.9±0.09
3 layer MLP	64.2±0.37	50.5±0.30	52.1±0.29	75.7±0.13

GIT - can	60.5±0.34	45.8±0.38	52.1±0.29	73.6±0.14
GIT - attn	67.6±0.32	55.3±0.77	60.8±0.35	77.7±0.05
GIT - init	69.8±0.28	54.1±0.37	60.9±0.16	78.3±0.06

GIT	69.5±0.09	57.1±0.18	62.7±0.08	78.7±0.01


To evaluate the utility of each module (procedure) in GIT, we conducted an ablation study by removing one module at a time: the cancer type input (“can”), the multi-head self-attention module (“attn”), and the initialization with pre-trained gene embeddings (“init”). The impact of each module can be seen by comparing each variant to the full GIT model. All the modules in GIT help to improve the prediction of DEGs from SGAs in terms of overall performance, i.e., F1 score and accuracy (Table 1: 5th to last rows).

3.2. Gene embeddings compactly represent the functional impact of SGAs

We examined whether the gene embeddings capture the functional similarity of SGAs, using mainly two metrics: NN accuracy and GO enrichment (defined in SI Sec. S4). NN accuracy: By capturing the co-occurrence pattern of somatic alterations, the Gene2Vec pre-trained gene embeddings improve NN accuracy by 36% over the random chance that any pair of genes shares a Gene Ontology (GO) annotation¹⁵ (Table 2). The embeddings fine-tuned by GIT further show a one-fold increase in NN accuracy. These results indicate that the learned gene embeddings are consistent with the gene functions, and that they map the discrete binary SGA representation into a meaningful and compact space. GO enrichment: We performed clustering analysis of SGAs in the embedding space using k-means clustering and calculated GO enrichment, varying the number of clusters (k) to derive clusters with different degrees of granularity (Fig. 2a). When the genes are randomly distributed in the embedding space, they get a GO enrichment of 1. However, in the gene embedding space, the GO enrichment increases rapidly until the number of clusters reaches 40, indicating a strong correlation between the clusters in the embedding space and the functions of the genes.

Table 2.

NN accuracy with respect to GO in different gene embedding spaces.

Gene embeddings	NN accuracy	Improvement

Random pairs	5.3±0.36	–
Gene2Vec	7.2	36%
Gene2Vec + GIT	10.7	100%
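The improvement column can be checked with a short calculation, assuming that improvement is measured relative to the random-pair baseline (this reading is an assumption; the metric itself is defined in SI Sec. S4). The function name is illustrative.

```python
def relative_improvement(value, baseline):
    """Percentage gain of `value` over `baseline`."""
    return 100.0 * (value - baseline) / baseline

random_pairs = 5.3  # NN accuracy of random gene pairs (Table 2)
gene2vec = relative_improvement(7.2, random_pairs)
gene2vec_git = relative_improvement(10.7, random_pairs)

print(round(gene2vec, 1), round(gene2vec_git, 1))  # 35.8 101.9
```

Under this reading the Gene2Vec row rounds to 36%, and the fine-tuned embeddings come out at roughly a one-fold (about 100%) increase over random pairs.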


To visualize the manifold of gene embeddings, we grouped the genes into 40 clusters and conducted a t-SNE³¹ projection of the genes (Fig. 2b left panel). Using PANTHER GO enrichment analysis,³² 12 out of 40 clusters are shown to be enriched in at least one biological process (SI Sec. S6). Most of the gene clusters are well-defined and tightly located in the projected t-SNE space. As a case study, we took a close look at one cluster (Fig. 2b right panel), which contains a set of functionally similar genes, such as those that code for the protein family of type I interferons (IFNs), which are responsible for immune and viral response.³³

3.3. Self-attention reveals impactful SGAs on cancer cell transcriptome

While it is widely accepted that cancer is mainly caused by SGAs, not all SGAs observed in a cancer cell are causative.¹ Previous methods mainly concentrate on searching for SGAs with higher than expected frequency to differentiate candidate driver SGAs from passenger SGAs. GIT provides a novel perspective to address the problem: identifying the SGAs that have a functional impact on cellular signaling systems and eventually lead to DEGs as the tumor-specific candidate drivers. Here we compare the overall attention weights (inferred by the GIT model) with the frequencies of somatic alterations (used as the benchmark/control group) in all the cancer types (Pan-Cancer) from our test data (Fig. 2c). In general, the attention weights are correlated with the alteration frequencies of genes, e.g., common cancer drivers such as TP53 and PIK3CA are the top two SGAs selected by both methods.² However, our self-attention mechanism assigns high weights to many genes not previously designated as drivers, indicating that these genes are potential cancer drivers, although their roles in cancer development remain to be further studied. Table 3 lists the top SGAs ranked according to GIT attention weights in pan-cancer and five selected cancer types, where known cancer drivers from TumorPortal³ and IntOGen³⁴ are marked in bold font. Apart from TP53 and PIK3CA as drivers in the pan-cancer analysis,² we also find that the top cancer drivers in specific cancer types are consistent with our knowledge of oncology. For example, CDH1 and GATA3 are drivers of breast invasive carcinoma (BRCA),³⁵ CASP8 is a known driver of head and neck squamous cell carcinoma (HNSC),³⁶ STK11, KRAS, and KEAP1 are known drivers of lung adenocarcinoma (LUAD),³⁷ PTEN and RB1 are drivers of glioblastoma (GBM),³⁸ and FGFR3, RB1, HSP90AA1, and STAG2 are known drivers in urothelial bladder carcinoma (BLCA).³⁹ In contrast, the most frequently mutated genes (control group) are quite different from those selected by the attention mechanism (experiment group), and only a few of them are known drivers (SI Sec. S7).

Table 3.

Top five SGA-affected genes ranked according to attention weight.

Rank	PANCAN	BRCA	HNSC	LUAD	GBM	BLCA

1	*TP53*	*TP53*	*TP53*	*STK11*	*TP53*	*TP53*
2	*PIK3CA*	*PIK3CA*	*CASP8*	*TP53*	*PTEN*	*FGFR3*
3	*RB1*	*CDH1*	*PIK3CA*	*KRAS*	C9orf53	*RB1*
4	*PBRM1*	*GATA3*	*CYLD*	CYLC2	*RB1*	*HSP90AA1*
5	*PTEN*	MED24	RB1	*KEAP1*	CHIC2	*STAG2*


3.4. Personalized tumor embeddings reveal distinct survival profiles

Besides learning the specific functional impact of SGAs on DEGs, we further examined the utility of tumor embeddings e_t from two perspectives: (1) discovering patterns of tumors potentially sharing common disease mechanisms across different cancer types; (2) using tumor embeddings to predict patient survival.

We first used the t-SNE plot of tumor embeddings to illustrate the common disease mechanisms across different cancer types (Fig. 3a). When the cancer type embedding e_s is included in the full tumor embedding e_t, it has a much higher weight than any individual gene embedding (Fig. 1b, Eq. 1) and dominates the full tumor embedding, so tumor samples are clustered according to cancer type. This is not surprising, as it is well appreciated that the expression of many genes is tissue-specific.⁴⁰ To examine the pure effect of SGAs on the tumor embedding, we removed the effect of tissue by subtracting the cancer type embedding e_s, followed by clustering tumors in the stratified tumor embedding space (Fig. 3b). Each dense area (potential tumor cluster) includes tumors from different tissues of origin, indicating that SGAs in these tumors may reflect shared disease mechanisms (pathway perturbations) among tumors, warranting further investigation.

The second set of experiments tested whether differences in tumor embeddings (and thereby differences in disease mechanisms) are predictive of patient clinical outcomes. We conducted unsupervised k-means clustering using only breast cancer tumors from our test set, which reveals three groups (Fig. 3c) with a significant difference in survival profiles as evaluated by the log-rank test⁴¹ (Fig. 3d; p-value=0.017). In addition, using tumor embeddings as input features, we trained $l_1$- and $l_2$-regularized (elastic net)⁴² Cox proportional hazard models⁴³ in a 10-fold cross-validation (CV) experiment. This led to an informative ranked list of tumors according to predicted survivals/hazards, evaluated by the concordance index (CI) value (CI=0.795), indicating that the trained model ranks patients accurately. We further split the test samples into two groups divided by the median of the predicted survivals/hazards, which also yields a significant separation of patients in survival profiles (Fig. 3e; p-value=5.1 × 10⁻⁸), indicating that our algorithm has correctly ranked the patients according to characteristics of the tumor.
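To make the concordance index and the median split concrete, the sketch below computes Harrell's C-index on toy data and divides patients at the median predicted hazard. Which CI estimator the paper used is not specified here, so Harrell's version is an assumption, and the survival times, event indicators, and risk scores are invented for illustration.

```python
import statistics

def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs where the higher-risk patient fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before the follow-up time of patient j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: follow-up time, event flag (1 = death observed, 0 = censored),
# and predicted hazard from a hypothetical Cox model.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]

ci = concordance_index(times, events, risks)

# Median split into high-risk and low-risk groups, as done before a log-rank test.
cutoff = statistics.median(risks)
high_risk = [r > cutoff for r in risks]
print(ci, high_risk)
```

A CI of 0.5 corresponds to random ranking and 1.0 to perfect ranking; in this toy cohort four of the five comparable pairs are ordered correctly.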

As shown above, distinct SGAs may share similar embeddings if they share a similar functional impact. Thus, two tumors may have similar tumor embeddings even though they do not share any SGAs, as long as the functional impact of the distinct SGAs from these tumors is similar. Therefore, tumor embeddings make it easier to discover common disease mechanisms and their impact on patient survival. To further test this, we also performed clustering analysis on breast cancer tumors represented in the original SGA space, followed by a survival analysis similar to that described in the previous paragraph (SI Sec. S8).

3.5. Tumor embeddings are predictive of drug responses of cancer cell lines

Precision oncology concentrates on using patient-specific omics data to determine optimal therapies for a patient. We set out to see if SGA data of cancer cells can be used to predict their sensitivity to anti-cancer drugs. We used the CCLE dataset,⁴⁴ which performed drug sensitivity screening over hundreds of cancer cell lines and 24 anti-cancer drugs. The study collects genomic and transcriptomic data of these cell lines, but in general, the genomic data (except the molecularly targeted genes) from a cell line are not sufficient to predict its sensitivity to different drugs.

We discretized the response to each drug following the procedure in previous research.^44,45 Since CCLE only contains a small subset of the mutations in the TCGA dataset (around 1,600 gene mutations), we retrained the GIT with this limited set of SGAs in TCGA, using the default hyperparameters set before. The cancer type input was removed as well, since it is not explicitly provided in the CCLE dataset. The output tumor embeddings e_t were then extracted as features. We formulated drug response prediction as a binary classification problem with $l_1$-regularized cross entropy loss (Lasso), where the input can be raw sparse SGAs or tanh-transformed tumor embeddings tanh(e_t). Following previous work,⁴⁴ we performed a 10-fold CV experiment, training Lasso using either input to test the drug response prediction task on four drugs with distinct targets. Lasso regression using tumor embeddings consistently outperforms the models trained with the original SGAs as inputs (Fig. 4). Specifically, in the case of Sorafenib, the raw mutations give essentially random predictions, while the tumor embedding yields predictive results. It should be noted that certain cancer cells may host SGAs along the pathways related to FGFR, RAF, EGFR, and RTK, rendering them sensitive to the above drugs. Such information can be implicitly captured and represented by the tumor embeddings, so that the information from raw SGAs is captured and pooled to enhance classification accuracy.
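The classification objective described above can be sketched as a logistic model on tanh-transformed tumor embeddings with an l1 penalty on the weights. This is an illustration only: the function names, the averaging of the cross entropy over cell lines, and all numbers are assumptions, and no fitting procedure or cross-validation loop is shown.

```python
import math

def tanh_features(e_t):
    """Squash a tumor embedding into (-1, 1) before feeding it to Lasso."""
    return [math.tanh(x) for x in e_t]

def lasso_logistic_loss(X, y, w, b, lam):
    """Mean binary cross entropy of a logistic model plus lam * ||w||_1."""
    total = 0.0
    for x, yi in zip(X, y):
        z = sum(wk * xk for wk, xk in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y) + lam * sum(abs(wk) for wk in w)

# Toy data: two cell lines with hypothetical 2-d tumor embeddings and
# discretized drug response labels (1 = sensitive, 0 = resistant).
embeddings = [[3.0, -0.5], [-2.0, 0.1]]
labels = [1, 0]

X = [tanh_features(e) for e in embeddings]
w, b = [1.5, 0.0], 0.0

unpenalized = lasso_logistic_loss(X, labels, w, b, lam=0.0)
penalized = lasso_logistic_loss(X, labels, w, b, lam=0.1)
print(unpenalized, penalized)
```

The l1 term drives uninformative weights to exactly zero during optimization, which is the feature-selection behavior that makes Lasso suitable for both sparse SGA inputs and dense embeddings.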

Fig. 4. — ROC curves and the areas under the curve (AUCs) of Lasso models trained with original SGAs and tumor embeddings representations on predicting responses to four drugs.

4. Conclusion and Future Work

Despite the significant advances in cancer biology, it remains a challenge to reveal the disease mechanisms of each individual tumor, particularly which SGAs in a cancer cell lead to the development of cancer, and how. Here we propose the GIT model to learn the general impact of SGAs, in the form of gene embeddings, and to precisely portray their effects on the downstream DEGs with higher accuracy. With the supervision of DEGs, we can further assess the importance of an SGA in each individual tumor using the multi-head self-attention mechanism. More importantly, while the tumor embeddings are trained with predicting DEGs as the task, they contain information for predicting other phenotypes of cancer cells, such as patient survival and cancer cell drug sensitivity. The key advantage of transforming SGAs into a gene embedding space is that it enables the detection and representation of the functional impact of SGAs on cellular processes, which in turn enables detection of common disease mechanisms of tumors even if they host different SGAs. We anticipate that GIT, or other future models like it, can be applied broadly to gain mechanistic insights into how genomic alterations (or other perturbations) lead to specific phenotypes, thus providing a general tool to connect genome to phenome in different biological fields and genetic diseases. One should also be aware that, despite the correlation between genomic alterations and phenotypes such as survival profiles and drug response, the model may not fully reveal the causalities, and there may exist other confounding factors not considered.

There are a few future directions for further improving the GIT model. First, decades of biomedical research have accumulated a rich body of knowledge, e.g., Gene Ontology and gene regulatory networks, which may be incorporated as a prior of the model to boost performance.⁴⁶ Second, we expect that by obtaining a larger corpus of tumor data with mutations and gene expression, we will be able to train better models and minimize potential overfitting or variance. Lastly, more clinically oriented investigations are warranted to examine whether, when trained with a large volume of tumor omics data, the learned embeddings of SGAs and tumors can be applied to predict sensitivity or resistance to anti-cancer drugs based on SGA data, which are becoming readily available in contemporary oncology practice.

Supplementary Material

NIHMS1061137-supplement-1.pdf (760.2 KB, PDF)

Acknowledgments

We would like to thank Yifan Xue and Michael Q. Ding for providing the processed TCGA data and discretized CCLE data. We also thank the anonymous reviewers for their helpful suggestions.

Funding

This work has been partially supported by the following NIH grants: R01LM012011, R01LM010144, and U54HG008540, and it has also been partially supported by the Grant #4100070287 awarded by the Pennsylvania Department of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the above funding agencies.

Footnotes

Supplemental information (SI), GIT model, pre-processed TCGA data, and gene embeddings are available at https://github.com/yifengtao/genome-transformer.

References

1.Vogelstein B, Papadopoulos N, Velculescu VE et al. , Cancer genome landscapes, Science 339, 1546 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Kandoth C, McLellan MD, Vandin F et al. , Mutational landscape and significance across 12 major cancer types, Nature 502, p. 333 (October 2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Lawrence MS, Stojanov P, Mermel CH et al. , Discovery and saturation analysis of cancer genes across 21 tumour types, Nature 505, p. 495 (January 2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Ciriello G, Miller ML, Aksoy BA et al., Emerging landscape of oncogenic signatures across human cancers, Nat. Genet. 45, p. 1127 (September 2013).
5. Zack TI, Schumacher SE, Carter SL et al., Pan-cancer patterns of somatic copy number alteration, Nat. Genet. 45, p. 1134 (September 2013).
6. Stransky N, Cerami E, Schalm S, Kim JL and Lengauer C, The landscape of kinase fusions in cancer, Nat. Commun. 5, p. 4846 (September 2014).
7. Jones PA and Baylin SB, The fundamental role of epigenetic events in cancer, Nat. Rev. Genet. 3, p. 415 (June 2002).
8. Dees ND, Zhang Q, Kandoth C et al., MuSiC: identifying mutational significance in cancer genomes, Genome Res. 22, 1589 (August 2012).
9. Lawrence MS, Stojanov P, Polak P et al., Mutational heterogeneity in cancer and the search for new cancer-associated genes, Nature 499, p. 214 (June 2013).
10. Reva B, Antipin Y and Sander C, Predicting the functional impact of protein mutations: application to cancer genomics, Nucleic Acids Res. 39, e118 (September 2011).
11. Niu B, Scott AD, Sengupta S et al., Protein-structure-guided discovery of functional mutations across 19 cancer types, Nat. Genet. 48, p. 827 (June 2016).
12. Mikolov T, Sutskever I, Chen K, Corrado G and Dean J, Distributed representations of words and phrases and their compositionality, in Proc. of NIPS, 2013.
13. Pennington J, Socher R and Manning CD, GloVe: global vectors for word representation, in Proc. of EMNLP, 2014.
14. Tao Y, Godefroy B, Genthial G and Potts C, Effective feature representation for clinical text concept extraction, in Proc. of Clinical NLP Workshop, June 2019.
15. Ashburner M, Ball CA, Blake JA et al., Gene Ontology: tool for the unification of biology, Nat. Genet. 25, p. 25 (May 2000).
16. Cai C, Cooper GF, Lu KN et al., Systematic discovery of the functional impact of somatic genome alterations in individual tumors through tumor-specific causal inference, PLOS Comput. Biol. 15, p. e1007088 (July 2019).
17. Lee B, Min S and Yoon S, Deep learning in bioinformatics, Brief. Bioinform. 18, 851 (2016).
18. Lan K, Wang D, Fong S et al., A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42, p. 139 (2018).
19. Vaswani A, Shazeer N, Parmar N et al., Attention is all you need, in Proc. of NIPS, 2017.
20. Bahdanau D, Cho K and Bengio Y, Neural machine translation by jointly learning to align and translate, in Proc. of ICLR, 2015.
21. Xu K, Ba J, Kiros R et al., Show, attend and tell: neural image caption generation with visual attention, in Proc. of ICML, July 2015.
22. The Cancer Genome Atlas Research Network et al., The cancer genome atlas pan-cancer analysis project, Nat. Genet. 45, p. 1113 (September 2013).
23. Chen L, Cai C, Chen V and Lu X, Trans-species learning of cellular signaling systems with bimodal deep belief networks, Bioinformatics 31, 3008 (September 2015).
24. Chen L, Cai C, Chen V and Lu X, Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model, BMC Bioinformatics 17 Suppl 1, p. 9 (January 2016).
25. Rosenblatt F, The perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev. 65 (1958).
26. Vandin F, Upfal E and Raphael BJ, De novo discovery of mutated driver pathways in cancer, Genome Res. 22, 375 (February 2012).
27. Rumelhart DE, Hinton GE and Williams RJ, Learning representations by back-propagating errors, Nature 323, p. 533 (October 1986).
28. Kingma DP and Ba JL, Adam: a method for stochastic optimization, in Proc. of ICLR, 2015.
29. Srivastava N, Hinton G, Krizhevsky A, Sutskever I and Salakhutdinov R, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15, 1929 (2014).
30. Tibshirani R, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. B 58, 267 (1994).
31. Maaten L and Hinton G, Visualizing high-dimensional data using t-SNE, J. Mach. Learn. Res. 9, 2579 (2008).
32. Mi H, Muruganujan A, Casagrande JT and Thomas PD, Large-scale gene function analysis with the PANTHER classification system, Nat. Protoc. 8, p. 1551 (July 2013).
33. de Weerd NA and Nguyen T, The interferons and their receptors–distribution and regulation, Immunol. Cell Biol. 90, 483 (2012).
34. Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J et al., IntOGen-mutations identifies cancer drivers across tumor types, Nat. Methods 10, p. 1081 (September 2013).
35. The Cancer Genome Atlas Network et al., Comprehensive molecular portraits of human breast tumours, Nature 490, p. 61 (September 2012).
36. Stransky N, Egloff AM, Tward AD et al., The mutational landscape of head and neck squamous cell carcinoma, Science 333, 1157 (August 2011).
37. The Cancer Genome Atlas Research Network et al., Comprehensive molecular profiling of lung adenocarcinoma, Nature 511, p. 543 (July 2014).
38. Brennan CW, Verhaak RGW, McKenna A et al., The somatic genomic landscape of glioblastoma, Cell 155, 462 (October 2013).
39. The Cancer Genome Atlas Research Network et al., Comprehensive molecular characterization of urothelial bladder carcinoma, Nature 507, p. 315 (January 2014).
40. Hoadley KA, Yau C, Wolf DM et al., Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin, Cell 158, 929 (August 2014).
41. Mantel N, Evaluation of survival data and two new rank order statistics arising in its consideration, Cancer Chemoth. Rep. 50, 163 (March 1966).
42. Zou H and Hastie T, Regularization and variable selection via the elastic net, J. R. Stat. Soc. B 67, 301 (2005).
43. Cox DR, Regression Models and Life-Tables (Springer New York, New York, NY, 1992), pp. 527–541.
44. Barretina J, Caponigro G, Stransky N et al., The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity, Nature 483, p. 603 (March 2012).
45. Ding MQ, Chen L, Cooper GF, Young JD and Lu X, Precision oncology beyond targeted therapy: combining omics data with machine learning matches the majority of cancer cells to effective therapeutics, Mol. Cancer Res. (2017).
46. Ma J, Yu MK, Fong S et al., Using deep learning to model the hierarchical structure and function of a cell, Nat. Methods 15, p. 290 (March 2018).

Associated Data

Supplementary Materials

NIHMS1061137-supplement-1.pdf (760.2 KB, PDF)
