Bioinformatics. 2021 Feb 22;37(16):2441–2449. doi: 10.1093/bioinformatics/btab113

Inferring perturbation profiles of cancer samples

Martin Pirkl 1,2, Niko Beerenwinkel 3,4
Editor: Russell Schwartz
PMCID: PMC8388028  PMID: 33617647

Abstract

Motivation

Cancer is one of the most prevalent diseases in the world. Tumors arise when important genes change their activity, e.g. by inhibition or over-expression. However, these gene perturbations are difficult to observe directly. Molecular profiles of tumors can provide indirect evidence of gene perturbations. Yet inferring perturbation profiles from molecular alterations is challenging due to error-prone molecular measurements and incomplete coverage of all possible molecular causes of gene perturbations.

Results

We have developed a novel mathematical method to analyze cancer driver genes and their patient-specific perturbation profiles. We combine genetic aberrations with gene expression data in a causal network derived across patients to infer unobserved perturbations. We show that our method can predict perturbations in simulations, CRISPR perturbation screens and breast cancer samples from The Cancer Genome Atlas.

Availability and implementation

The method is available as the R-package nempi at https://github.com/cbg-ethz/nempi and http://bioconductor.org/packages/nempi.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Cancer progression is often linked to alterations in driver genes (Bailey et al., 2018). A mutation in a driver gene increases the probability of cancer development. These genes are often not functioning normally in cancer cells, but are inhibited or over-expressed. We call genes with this abnormal behavior perturbed. Perturbations are hard to observe directly. However, some observable alterations such as mutations provide evidence for a gene perturbation. If the gene has a non-silent mutation, it is probable that its behavior is perturbed. The combination of different molecular profiles is useful to identify perturbed genes. In general, however, not all different types of measurements are available. For example, if only gene expression data is available, the identification of perturbed genes due to mutations may be difficult. Even if the mutation profiles are available, they may not reveal all perturbed genes correctly due to measurement error causing false positive and false negative mutation calls. In the case of a true negative mutation call, the gene could still be perturbed in a different way, e.g. by micro RNA activity (O’Brien et al., 2018; Shivdasani, 2006).

Identification of driver genes is important to characterize cancer types and help establish useful therapies. In particular, knowledge of the genomic landscape can help establish successful treatments (Al-Lazikani et al., 2012). Several methods deal with driver gene identification on a global scale. Some rely mainly on mutation data to derive driver genes for specific cancers (Lawrence et al., 2013). Others include descriptive features of the genes (Tokheim et al., 2016) or combine different data types (Dimitrakopoulos et al., 2018; Hou et al., 2018; Hou and Ma, 2014) available from, for example, The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/, Network et al., 2008). However, not only the identification of driver genes is important, but also which gene is perturbed in which cancer sample, especially when it comes to supporting treatment decisions based on this information. A gene can be a driver for breast cancer yet be mutated in only a few samples; it may be perturbed, but not mutated, in other samples. It is also useful to know which other genes are perturbed and in what combinations. Hence, we want to know the perturbation profile of each tumor.

Inferring perturbation profiles can be viewed as a classification problem for each gene. A sample is either classified as ‘gene X is perturbed’ or ‘gene X is not perturbed’. For example, one can learn a classifier for each possibly perturbed gene based on gene expression profiles. Hence, this problem can be solved with supervised learning methods such as support vector machines (Cortes and Vapnik, 1995; Honghai et al., 2005; Yang et al., 2012), neural networks (Nelwamondo et al., 2007; Smieja et al., 2018) or random forests (Pantanowitz and Marwala, 2009). Alternatively, data imputation methods can be used to infer incomplete perturbation profiles (Azur et al., 2011; Shah, 2018; Stekhoven and Bühlmann, 2012).

We developed a novel method called nested effects model-based perturbation inference (NEMπ), which uses supervised learning to infer unobserved perturbations. We use a network approach based on gene expression data with samples labeled by their perturbations. We use the inferred network to learn the complete perturbation profiles of all samples. We iteratively optimize the perturbation profile and relearn the network until a convergence criterion is reached (Fig. 1).

Fig. 1.

Perturbation inference scheme. The binary perturbation matrix P (A) with known (blue) and unknown (red) perturbations and the continuous log odds matrix R (B) derived from gene expression data D are available for the same set of samples. P and R are initially used to infer a causal network ϕ of the perturbed genes. Iterative EM algorithm (C): Based on ϕ and R the soft perturbation profile Γ is inferred. Γ and ϕ are iteratively updated until convergence. The incomplete part is inferred (green box) and the rest revised

We validate NEMπ on five single-cell RNA-seq (scRNA-seq) perturbation datasets from experiments using Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR). These experiments are tailor-made to show that NEMπ can successfully predict perturbations from gene expression data. Additionally, we perform an exploratory analysis on breast cancer (BRCA) data from TCGA. We use known mutation profiles to learn the perturbation profile of the patient samples. We compare the predicted perturbation profiles to copy number variations and methylated states of the corresponding mutated genes in the same patient samples.

2 Materials and methods

NEMπ is built on the causal network learning approach of Nested Effects Models (NEM, Markowetz et al., 2005; Tresch and Markowetz, 2008). We extend this model to infer perturbation profiles from gene expression profiles.

NEM and its extensions have been applied to various perturbation datasets. Recent versions of the algorithm have been extended to combinatorial perturbations (Pirkl et al., 2016, 2017), probabilistic perturbations (Srivatsa et al., 2018), time-series data (Anchang et al., 2009; Froehlich et al., 2011; Wang et al., 2014), hidden player inference (Sadeh et al., 2013), context-specific signaling (Sverchkov, 2018) and single-cell perturbations (Anchang et al., 2018; Pirkl and Beerenwinkel, 2018). NEMπ is related to NEMiX (Siebourg-Polster et al., 2015), which also infers a perturbation state. However, NEMiX predicts whether the whole pathway has been activated or not. Hence, it clusters samples (cells) into two groups to account for inactive pathways, explaining different expression profiles for the same perturbation in different samples. Unlike NEMπ, NEMiX does not infer gene perturbations, but treats them as a fixed prior parameter.

2.1 Network model

Let n be the number of perturbed genes (P-genes) with unknown perturbation states in a subset of u samples. m is the number of features or effect genes (E-genes) for which gene expression data is available. Let P=(pij) be the perturbation matrix with pij = 1, if P-gene i is perturbed in sample j. We assume that P is only known for a subset of samples.

We parameterize the causal network of the n P-genes by the transitively closed adjacency matrix ϕ of the P-genes. θ describes the relationship between P-genes and E-genes with θij = 1, if P-gene i is the parent of E-gene j. We assume that each E-gene can have at most one parent. The expected data pattern is computed by F = ϕθ. That is, if P-gene i is perturbed, all descendants of i are also perturbed, as well as all E-genes that are children of the perturbed P-genes. Hence, fij = 1, if E-gene j is a child of i or of a descendant of i, and fij = 0 otherwise.
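As an illustration (not the nempi implementation), the expected data pattern F can be sketched with numpy for a hypothetical toy network; the values of phi and theta below are made-up examples:

```python
import numpy as np

# Toy example (hypothetical values): n = 3 P-genes, m = 4 E-genes.
# phi: transitively closed adjacency matrix of the P-genes, with the
# diagonal set to 1 so that each P-gene counts as its own descendant.
phi = np.array([[1, 1, 1],
                [0, 1, 1],
                [0, 0, 1]])

# theta: E-gene attachments; each column (E-gene) has at most one parent.
theta = np.array([[1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])

# Expected data pattern: f_ij = 1 iff E-gene j shows an effect when
# P-gene i is perturbed (directly or via a descendant).
F = (phi @ theta > 0).astype(int)
print(F)
```

Perturbing the root P-gene (first row) is expected to affect all four E-genes, while perturbing the leaf (last row) only affects its own child.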

Let D=(dij) be the gene expression data and R=(rij) the corresponding log odds with

r_{ij} = \log \frac{P(d_{ij} \mid F)}{P(d_{ij} \mid N)}

with the null model N, which does not predict any effects, and the full model F, which predicts effects for all E-genes in all samples. That is, F is the expected profile of ϕ in the case where all P-genes are indistinguishable and a perturbation of any single P-gene causes a perturbation of all other P-genes. rij < 0 means E-gene i shows no effect in sample j and rij > 0 means E-gene i shows an effect in sample j. Hence, large values in R correspond to the 1s in F. As in Tresch and Markowetz (2008), we compute

L = (l_{ij}) = F R

with lij the log odds of the perturbation of P-gene i in sample j. The full log odds for a candidate model (ϕ,θ) given the data can then be computed as

\log \frac{P(D \mid \phi, \theta)}{P(D \mid N)} = \operatorname{trace}(L) \qquad (1)

and is optimized with respect to ϕ and θ.
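For intuition, the score in (1) can be sketched in a few lines of numpy for the complete-data case, where sample j corresponds to a perturbation of P-gene j (toy numbers, not from the paper):

```python
import numpy as np

# Toy complete-data case: n = 2 P-genes, one sample per P-gene, m = 3 E-genes.
F = np.array([[1, 1, 0],      # expected data pattern (n x m)
              [0, 1, 1]])
R = np.array([[3.0, -1.0],    # observed log odds (m x u), here u = n
              [2.0, 2.0],
              [-1.0, 3.0]])

L = F @ R                     # l_ij: log odds that P-gene i explains sample j
score = np.trace(L)           # log odds of (phi, theta) versus the null model
```

The diagonal entries of L reward samples whose expression effects match the expected pattern of their own P-gene; the trace sums these contributions.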

2.2 Perturbation inference

If the perturbation information is complete, we learn the causal network ϕ and E-gene attachments θ by optimizing the log odds in (1). However, in that case we assume R has the same number of columns (samples) as P-genes. In other words, for each P-gene we have a corresponding column in R in which the P-gene is perturbed. In our more general case, we have many more samples than P-genes and allow for combinatorial perturbations. Because the perturbations are only observed in a subset of samples, we introduce the hidden random variable Z = (zij) with zij = 1, if P-gene i has been directly perturbed in sample j, and zij = 0 otherwise. The causal network ϕ propagates the direct perturbation to the descendants of P-gene i. This propagation is computed by Ω = ϕ^T Z. We call positive entries in Z direct perturbations, while Ω describes the actual perturbation profiles of the samples. For example, in Figure 2 only P-gene 2 is directly perturbed in sample 7 (z_{2,7} = 1) and is also an ancestor of P-gene 3 (ϕ_{2,3} = 1). Hence, both P-genes 2 and 3 are perturbed in sample 7. Furthermore, each P-gene i has a prior probability πi = P(zij = 1), for all j, of being perturbed, with

\sum_{i} \pi_i = 1.

Fig. 2.

The network ϕ (left) predicts a perturbation of gene 3, if gene 2 is perturbed. Hence, the direct perturbation Z (top) does not need to include a perturbation of gene 3, since this is propagated via the network and included in the perturbation profile Ω=ϕTZ (bottom)

In our model, the direct perturbations are not only propagated via the causal network ϕ but also via the E-gene attachments θ. Similar to before, this is done by the matrix multiplication F˜ = Ω^T θ. Z can have multiple 1s in each column, so we set all values in F˜ that are greater than 1 to 1. Hence F˜, like F previously, describes the expected data pattern for all E-genes and samples in the log odds matrix R.
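The two propagation steps can be sketched as follows (toy matrices; the clipping reproduces the rule of setting values greater than 1 to 1):

```python
import numpy as np

# Transitively closed network on 3 P-genes; P-gene 1 is an ancestor of
# P-gene 2 (0-based indices), and every P-gene is its own descendant.
phi = np.eye(3, dtype=int)
phi[1, 2] = 1

# Direct perturbations Z (P-genes x samples): sample 0 hits P-gene 1
# directly, sample 1 hits P-gene 0.
Z = np.array([[0, 1],
              [1, 0],
              [0, 0]])

# Propagate along the network: Omega lists all perturbed P-genes per sample.
Omega = np.clip(phi.T @ Z, 0, 1)

# Propagate to the E-genes via the attachments theta (n x m).
theta = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
F_tilde = np.clip(Omega.T @ theta, 0, 1)  # expected pattern per sample (u x m)
```

In sample 0 the direct hit on P-gene 1 is propagated to its descendant P-gene 2, so Omega marks both as perturbed even though Z contains only one 1 in that column.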

For maximum likelihood estimation, we want to know how probable are the gene expression profiles D given perturbation profiles Z. We need to maximize the probability of the full data (D, Z) given the model parameters, i.e. the causal network ϕ and the E-gene attachments θ,

\max_{\phi, \theta} P(D, Z \mid \phi, \theta).

We re-formulate this optimization problem to maximizing the log odds

\log \frac{P(D, Z \mid \phi, \theta)}{P(D, Z \mid N)} = \sum_{i=1}^{n} \sum_{j=1}^{u} z_{ij} \left( \log \pi_i + \sum_{k=1}^{m} \log \frac{P(d_{kj} \mid f_{ik})}{P(d_{kj} \mid N)} \right) = \sum_{i=1}^{n} \sum_{j=1}^{u} z_{ij} \left( \log \pi_i + \sum_{k=1}^{m} f_{ik} r_{kj} \right).

However, parts of the data are hidden (Z). We solve this problem by implementing an expectation maximization algorithm (Dempster et al., 1977). In the E-step, we fix the causal network ϕ, the E-gene attachments θ and the P-gene priors π and compute the expectations of the direct perturbations zij for the jth sample dj by  

\gamma_{ij} = P(z_{ij} = 1 \mid d_j) = \frac{\pi_i \prod_{k=1}^{m} P(d_{kj} \mid f_{ik})}{\sum_{s=1}^{n} \pi_s \prod_{k=1}^{m} P(d_{kj} \mid f_{sk})} = \frac{\pi_i \exp(l_{ij})}{\sum_{s=1}^{n} \pi_s \exp(l_{sj})}

with Γ=(γij). In the M-step, we optimize the expected value of the log odds

\operatorname{trace}\left( \Gamma^{T} \log\left( \operatorname{diag}(\pi) \exp(L) \right) \right) = \sum_{i=1}^{n} \sum_{j=1}^{u} \gamma_{ij} \left( \log \pi_i + \sum_{k=1}^{m} f_{ik} r_{kj} \right) \qquad (2)

with respect to the causal network ϕ and the E-gene attachments θ. The priors π are computed by

\pi_i = \frac{\sum_{j=1}^{u} \gamma_{ij}}{\sum_{s=1}^{n} \sum_{j=1}^{u} \gamma_{sj}}.

We perform the E- and M-step iteratively until the log odds in (2) or the parameters do not change anymore.
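A minimal numpy sketch of one EM round (E-step and prior update only; the network and attachment optimization of the M-step is omitted). The log-sum-exp shift is a standard numerical-stability trick and our own addition, not taken from the paper:

```python
import numpy as np

def e_step(L, pi):
    """Responsibilities gamma_ij proportional to pi_i * exp(l_ij),
    normalized per sample (column)."""
    logw = np.log(pi)[:, None] + L
    logw -= logw.max(axis=0, keepdims=True)   # log-sum-exp shift
    W = np.exp(logw)
    return W / W.sum(axis=0, keepdims=True)

def m_step_pi(Gamma):
    """Prior update: pi_i = sum_j gamma_ij / sum_{s,j} gamma_sj."""
    return Gamma.sum(axis=1) / Gamma.sum()

# Toy log odds L (2 P-genes x 3 samples), uniform initial prior.
L = np.array([[ 2.0, -1.0, 0.0],
              [-2.0,  1.0, 0.0]])
Gamma = e_step(L, np.array([0.5, 0.5]))
pi = m_step_pi(Gamma)
```

Each column of Gamma sums to 1, and samples with clearly one-sided log odds (first column) get nearly binary responsibilities.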

The optimization in the M-step is done by adding or removing edges in the causal network ϕ. In each step of the greedy search, all single-edge modifications of the current network ϕ are evaluated and the best one is applied. This is repeated until no change in ϕ increases the log odds anymore. After each change we estimate the E-gene attachments θ. Each greedy search can start from a specific network ϕ. More starts increase the chance of reaching a global optimum, but also increase the run time. We increase the probability that the log odds increase without using too many restarts by starting the greedy search three times in the ith M-step: from the previous solution ϕ_{i−1}, from the empty network and from the fully connected network, with ϕ_0 as the empty network. We take the highest scoring solution as the new causal network ϕ_i. The run time complexity per iteration is O(n^2) in the number of P-genes n.

During the optimization of the M-step, we would also have to search for an optimal θ. However, we estimate the E-gene attachments θ in the following way. After changing an edge in ϕ and before computing the log odds, we estimate θ by first computing Q = ϕ^T Γ R^T, with q_{ij} the log odds of the observed data pattern of E-gene j given that E-gene j is attached to P-gene i. We estimate the attachments θ by maximum a posteriori over the log odds Q with

\theta_{ij} = 1 \iff q_{ij} = \max\{ q_{sj} \mid s = 1, \ldots, n \}.

The priors π remain fixed during the optimization of the network ϕ and the E-gene attachments θ. Additionally, we include a null E-gene node, which does not predict any effect; badly fitting E-genes are attached to it.
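The MAP attachment step can be sketched as follows (toy matrices; attaching tied E-genes to every maximizing P-gene is one way to read the rule above and is our own choice for the sketch):

```python
import numpy as np

def estimate_theta(phi, Gamma, R):
    """Attach each E-gene to its best-fitting P-gene via Q = phi^T Gamma R^T."""
    Q = phi.T @ Gamma @ R.T   # q_ij: log odds that E-gene j sits below P-gene i
    return (Q == Q.max(axis=0, keepdims=True)).astype(int)

# Toy case: 2 P-genes, 2 samples, 3 E-genes; identity network and
# responsibilities, so Q reduces to R^T.
phi = np.eye(2, dtype=int)
Gamma = np.eye(2)
R = np.array([[ 5.0, -1.0],
              [-1.0,  5.0],
              [ 0.0,  0.0]])
theta = estimate_theta(phi, Gamma, R)
```

E-gene 0 attaches to P-gene 0, E-gene 1 to P-gene 1, and the uninformative E-gene 2 ties between both.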

To avoid over-fitting, we add a null component (null P-gene), which does not predict any effects. Hence, if the null gene dominates the log odds for a sample, it is hardly used in the inference. In other words, the model predicts that no P-gene was perturbed in the sample.

The initial expectation Γ0 of the direct perturbations Z is based on a given incomplete perturbation matrix P=(pij) (e.g. a mutation matrix) with

\gamma^{0}_{ij} = \frac{p_{ij}}{\sum_{s=1}^{n} p_{sj}}.

Hence, if a sample is perturbed in two genes, their responsibility for that sample is 50% each. Alternatively, one can include prior belief about the perturbations in a sample and not treat them equally. E.g. if perturbation i in sample j is twice as certain as perturbation k, we can set γ⁰_{ij} = 2/3 and γ⁰_{kj} = 1/3.
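Initializing Γ⁰ from an incomplete binary perturbation matrix P amounts to a column normalization; the uniform fallback for all-zero columns below is our own choice for the sketch, not prescribed by the paper:

```python
import numpy as np

def init_gamma(P):
    """Gamma^0_ij = p_ij / sum_s p_sj; columns without any observed
    perturbation fall back to a uniform distribution (our assumption)."""
    P = P.astype(float)
    colsum = P.sum(axis=0)
    safe = np.where(colsum == 0, 1.0, colsum)  # avoid division by zero
    G = P / safe
    G[:, colsum == 0] = 1.0 / P.shape[0]
    return G

# Sample 0 is perturbed in both genes (responsibility 50% each),
# sample 1 only in gene 0, sample 2 has no observed perturbation.
P = np.array([[1, 1, 0],
              [1, 0, 0]])
G0 = init_gamma(P)
```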

3 Simulation study

As a proof of principle, we test NEMπ on data simulated from a ground truth. We simulate a dataset based on random parameters: a causal network of the P-genes ϕ, E-gene attachments θ and a discrete perturbation matrix Z, which includes only the direct perturbations. The complete perturbation profile is computed by the network propagation Ω = ϕ^T Z. The simulations are based on n = 5, 10, 15 P-genes, m/n = 10 E-genes per P-gene and around 10 × n × 2 samples to ensure a reasonable number of samples with correct perturbation profiles. In general, a higher number of E-genes decreases noise, especially because NEMπ can exclude badly fitting E-genes. Additionally, we add 10% uninformative samples and E-genes, which consist only of noise and are not related to the ground truth. As described in the Methods section, the null component implemented in the iteration accounts for those samples. We allow roughly 20% double and 10% triple perturbations (columns in Z with more than one entry equal to 1). We add Gaussian noise with standard deviation σ = 1, 3, 5 for 100 runs each. We compare NEMπ to support vector machines (svm, R-package e1071, Meyer et al., 2019), neural networks (nn, Venables and Ripley, 2002), random forest (rf, Liaw and Wiener, 2002) and k-nearest neighbors (knn, Venables and Ripley, 2002) classification methods. We trained the classifiers on the labeled samples and computed the class label probabilities on the test and training set. We classify each sample and P-gene separately, i.e. we learn a classifier for each single P-gene that predicts from the gene expression whether the gene is perturbed in a sample or not. Afterwards, we combine the class probabilities for each sample and P-gene into a matrix corresponding to the estimator of the perturbation profile Ω̂ = ϕ̂^T Γ̂ provided by NEMπ. Additionally, we compare our results to two data imputation methods, namely mice (Azur et al., 2011; Shah, 2018) and missForest (Stekhoven and Bühlmann, 2012). We used the default implementations except for mice, which took too long to converge; hence, we reduced its number of iterations from 5 to 2.

We measured the degree of success as the area under the precision-recall curve (AUC, Supplementary Material) by comparing the ground truth perturbation profile Ω = ϕ^T Z with the predicted perturbation profile Ω̂.
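The AUC used here is the area under the precision-recall curve. A small self-contained version (average precision, i.e. the mean precision at each true positive, which is one common way to compute this area) applied to flattened toy profiles:

```python
import numpy as np

def average_precision(y_true, scores):
    """Area under the precision-recall curve, computed as the mean
    precision at the rank of each true positive."""
    order = np.argsort(-scores)          # rank entries by decreasing score
    y = y_true[order]
    tp = np.cumsum(y)                    # true positives at each rank
    precision = tp / (np.arange(len(y)) + 1)
    return float(precision[y == 1].mean())

# Toy flattened ground truth Omega and prediction Omega-hat.
truth = np.array([1, 0, 1, 0, 1, 0])
pred = np.array([0.9, 0.2, 0.6, 0.1, 0.8, 0.3])
auc = average_precision(truth, pred)     # perfect ranking -> 1.0
```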

Samples with unobserved perturbation profiles. In a first study, we randomly removed the perturbation profiles for 10%, 50% and 90% of the samples and tried to infer them (Fig. 3). The AUC shows that we can recover the perturbation profiles very well, even at high noise levels, and much better than the other methods. For 5 P-genes, random forest is the only competitive method. However, with only 10% unobserved samples, all methods only do marginally better than random guessing of the unknown perturbations. With larger sets of unobserved samples, the classifiers stay mostly robust while random guessing drops in performance. For example, with 50% unobserved samples and medium noise for 10 P-genes, the best classifiers have an AUC of around 0.6, more than 0.1 above random guessing, while NEMπ achieves an AUC of approximately 0.9. In addition to the AUC, we also show the accuracy of the inferred network ϕ (Supplementary Fig. S1) and the E-gene attachments θ (Supplementary Fig. S2). The accuracy of the network and attachments breaks down considerably at high noise levels, e.g. 50% network accuracy for 90% unobserved samples and σ = 5.

Fig. 3.

Unobserved perturbations. Shown is the area under the precision-recall curve between the predicted and ground truth perturbation profile. The columns show different amounts of samples with unobserved perturbation profiles (10%,50%,90%). The rows show different numbers of P-genes (5, 10, 15). Overall our approach (red) performs better than svm, neural nets, random forest, missForest, mice and k-nearest neighbors

NEMπ infers the network ϕ during optimization. However, if the underlying network is known, it can be provided and the network learning step is skipped. NEMπ proves to be robust against false positive edges in the given network (Supplementary Fig. S3), but less robust against false negatives (Supplementary Fig. S4). We suggest letting NEMπ optimize the network unless there is high confidence in the prior network.

Augmenting perturbation profiles. In a second study, we fixed the fraction of samples with unobserved perturbation profiles at 50% and randomly changed 10% and 50% of the remaining perturbation profiles, respectively. For a random sample, we first sampled the number of directly perturbed P-genes x ∈ {1, 2, 3} and then sampled x different P-genes to be perturbed. E.g. for a known sample we forget which genes are actually perturbed and draw x = 2. Hence, we sample two random P-genes and label the sample as perturbed by them instead. This study shows that we can successfully recover unobserved perturbation profiles, even when a fraction of the given perturbation profiles is incorrect (Fig. 4). Furthermore, we can even partially correct the incorrect perturbation labels. For 15 P-genes and medium noise levels, NEMπ has an AUC of over 0.8 when 75% of the profiles are unobserved or incorrect.

Fig. 4.

Unobserved and incorrect perturbations. The fraction of unobserved perturbation profiles is kept constant at 50%. Shown is the area under the precision-recall curve between the predicted and ground truth perturbation profiles. The columns show different amounts of incorrect profiles (10%, 50%). The rows show different numbers of P-genes (5, 10, 15). Overall our approach (red) performs better than svm, neural nets, random forest, missForest, mice and k-nearest neighbors

Inference with respect to unknown P-genes. Lastly, we simulated data for a large number of P-genes x ∈ {50, 100}, but kept the number of informative samples fixed at 1000. However, we only included eight P-genes in our model and did not use any information about the other, unknown P-genes during the inference. We wanted to investigate how well we can infer the perturbation profiles of these eight P-genes. This is the most realistic scenario according to the results of Bailey et al. (2018), who find eight driver genes exclusive to breast cancer and 64 pan-cancer driver genes. Together with non-pan-cancer, non-exclusive genes, this makes a total of 93 perturbed genes.

The accuracy of the perturbation profiles decreases slightly due to the systematic noise of the unknown P-genes (Fig. 5). Interestingly, the accuracy is also more robust against Gaussian noise, due to the large sample size for only eight P-genes. The other methods break down completely. As expected, reduced sample sizes diminish NEMπ’s accuracy (Supplementary Figs S5 and S6).

Fig. 5.

Unknown confounding P-genes. The area under the precision-recall curve between the predicted and ground truth perturbation profiles is shown for 50 and 100 P-genes (rows), and 10% and 90% unobserved samples (columns), respectively. The number of known P-genes is set to eight

4 Validation on CRISPR scRNA-seq data

We validate our approach on several CRISPR scRNA-seq datasets published by Adamson et al. (2016) (GEO: GSE90546, Barrett et al., 2012; Edgar, 2002). Our goal is to predict the perturbations of all cells from a random subset. In this case, the perturbations have been introduced experimentally using Perturb-seq. The datasets are from a pilot study on seven genes, a larger study on 82 genes and an epistasis study on three genes. The epistasis study consists of three datasets with different chemical treatments and includes double and triple perturbations. All genes are involved in the regulation of the endoplasmic reticulum pathway.

We removed genes with a median expression of zero counts and used the R package Linnorm (Yip et al., 2017) to pre-process the single-cell data. For the computation of the log-odds, we refer to the Supplementary Material (p. 1). After pre-processing, the five datasets consist of 1754×3927 (pilot), 2794×53290 (main), 3399×4015 (epistasis 1), 3615×3602 (epistasis 2) and 3614×3363 (epistasis 3) genes times cells.

After learning a network ϕ for each dataset with the original NEM, we use this network as the ground truth for the validation study. We employ an exhaustive search for the datasets with three genes and a greedy search for the others. The ground truth perturbation profile is computed by Ω = ϕ^T Γ with Γ derived from the known cell labels, i.e. which P-gene has been perturbed in which cell.

For the validation, we remove the labels of 50% of the cells and use the different methods to re-learn the perturbations. For the main study, we randomly sample subsets of 10 and 15 P-genes. NEMπ is not provided with the ground truth ϕ but has to learn the network from the partially labeled data (Supplementary Fig. S7). The accuracy is computed as before over 100 independent runs. All methods achieve the highest accuracy on the datasets from the epistasis studies (Fig. 6, bottom). For the other two datasets with more P-genes (7, 10, 15) the accuracy drops (Fig. 6, top). The main study shows the highest variation in accuracy due to the random sampling of P-genes. The accuracy is in general higher for 10% and lower for 90% unlabeled cells (Supplementary Fig. S8). We distinguished runs in which NEMπ inferred a dense network from runs with a sparse network (Supplementary Fig. S9). As expected, NEMπ has more power for dense networks. Overall, NEMπ achieves the highest accuracy and has the most success in predicting perturbations. The imputation method mice took too long to compute and did not converge.

Fig. 6.

Accuracy of the various methods for the CRISPR scRNA-seq datasets. All methods perform very well for the epistasis studies except for svm (bottom). Overall NEMπ outperforms the other methods, but shows a large variance over the randomly sampled P-genes of the main study

5 Exploratory analysis on breast cancer

We apply our method to the breast cancer (BRCA) dataset from TCGA, which has many samples including controls. In this analysis, we want to explore the possibility of predicting other perturbations, like copy number aberrations or methylation. As the initial incomplete perturbation matrix, we use the mutation matrix M = (mij) with mij = 1, if sample j has a mutation in gene i, and mij = 0 otherwise. We aim to (i) infer perturbation profiles for samples for which no mutation data is available (unobserved perturbations) and (ii) augment the known mutations. As P-genes we choose the following driver genes previously identified as exclusive to BRCA (Bailey et al., 2018): CBFB, CDKN1B, GATA3, GPS2, MAP2K4, NCOR1, PTPRD, TBX3.

We used the R-package TCGAbiolinks (Colaprico et al., 2016) to access and download the gene expression counts and mutation information. To avoid false positives, we define a gene in a sample as mutated if it was called by at least three of the four methods available in the TCGA dataset (Cibulskis et al., 2013; Fan et al., 2016; Harris et al., 2012; Koboldt et al., 2012). Furthermore, we set mutations labeled as ‘silent’ by TCGA to 0.
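The consensus-calling rule (at least three of four callers, silent variants set to 0) can be sketched with made-up toy call matrices; the caller outputs below are illustrative, not real TCGA data:

```python
import numpy as np

# Toy binary call matrices (genes x samples) for four hypothetical callers.
calls = np.array([
    [[1, 1, 0], [0, 1, 1]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 1]],
    [[1, 0, 0], [0, 1, 0]],
])

# silent[i, j] = 1 if the variant of gene i in sample j is a silent mutation.
silent = np.array([[0, 0, 0],
                   [0, 1, 0]])

# A gene is mutated if called by at least 3 of the 4 callers and not silent.
M = (calls.sum(axis=0) >= 3).astype(int)
M[silent == 1] = 0
```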

The BRCA dataset consists of 1215 samples including 113 control samples. We summarize duplicate samples and duplicate genes with the median. Roughly 92% of all samples do not carry a mutation in any of the P-genes, because either none was called or no mutation data was available. We filtered out lowly expressed genes (median < 10 counts) to obtain 20,213 E-genes. We used edgeR (Robinson et al., 2010) to normalize the gene expression. For the computation of the log odds we refer to the Supplementary Material. In the last step, we removed uninformative E-genes (median log odds equal to 0) and were left with 19,381 genes. The log odds for the BRCA dataset follow a similar distribution as our simulated data (Supplementary Fig. S10).

NEMπ takes roughly 2.5 min on a MacBook Pro (2017) to converge in 42 iterations (Supplementary Fig. S11). The inferred expectation Γ of the perturbation matrix Z is a sparse matrix with virtually binary predictions (Fig. 7, Supplementary Fig. S11), i.e. many samples have predicted 0s (white) for all except one P-gene (1, dark blue), which corresponds to a single direct perturbation. In only a few samples do the expectations show more uncertainty (light blue). All perturbations in Γ are propagated via the inferred causal network ϕ (Fig. 8). Hence, all samples with a direct perturbation of MAP2K4 also have perturbations in all other P-genes except for GATA3 and TBX3. Vice versa, for all samples with a direct perturbation in CDKN1B our model predicts no perturbation in any other P-gene. Although these eight genes have not been found to be in a joint signaling pathway, gene ontology analysis (Szklarczyk et al., 2019, https://string-db.org/) shows evidence of similar activity in biological processes like the regulation of the B-cell receptor signaling pathway (Supplementary Table S1).

Fig. 7.

The expectations Γ of the direct perturbations Z inferred from the BRCA dataset. Shown are the expectations of the P-genes (rows) for the samples (columns). Dark blue values are close to 1, while light blue values are between 0 and 1

Fig. 8.

Causal network ϕ inferred from the BRCA dataset. Shown is the causal network connecting the P-genes based on their effect on the gene expression. This network propagates perturbations predicted by Γ (Fig. 7) to all descendants. E.g. in all the samples with a perturbation of MAP2K4 predicted by Γ all other genes except for GATA3 and TBX3 are also perturbed

Comparison to other modes of perturbation. A perturbation of a gene can happen in different ways. For example, a gene is mutated in some samples and therefore perturbed on the DNA level. However, in other samples no mutation is observed, but a copy number aberration, which can also lead to a perturbation of the gene. Our predicted perturbation profiles Ω for the BRCA samples are learned based solely on observed mutations. To investigate whether we can capture other modes of perturbation, we compare our prediction to available copy number variation (CNV) and methylation data. CNVs are provided by TCGA as a matrix C with cij = 0, if there is no copy number aberration of gene i in sample j, and cij ∈ {−2, −1, 1, 2}, if there is a loss (−) or a gain (+), respectively. We binarized C by setting all non-zero entries to 1. We called sites methylated (1) with a cutoff of > 0.5 for the beta score ∈ [0, 1] provided by TCGA. These perturbations by methylation are stored in a matrix H with hij = 1, if gene i is methylated in sample j.
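Binarizing the CNV matrix and calling methylated sites follows directly from the definitions above (toy values, not TCGA data):

```python
import numpy as np

# Toy TCGA-style CNV matrix: 0 = neutral, -2/-1 = loss, 1/2 = gain.
C = np.array([[0, -2, 1],
              [2,  0, 0]])
C_bin = (C != 0).astype(int)   # any copy number aberration counts as perturbed

# Methylation beta scores in [0, 1]; a site is called methylated above 0.5.
beta = np.array([[0.1, 0.8, 0.4],
                 [0.6, 0.2, 0.9]])
H = (beta > 0.5).astype(int)
```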

For visualization, we binarized our predicted perturbation profile Ω with a cutoff of 1/8 and combined the mutation matrix M, the CNV matrix C, the methylation matrix H and our predicted perturbations Ω in one matrix (Fig. 9). The dark blue regions show that our predictions based on the mutations capture unobserved perturbations implied by CNV or methylation that are not covered by mutations.

Fig. 9.

Predicted and measured perturbations. This matrix visualizes predicted and measured perturbations (mutation, CNV or methylation) of the P-genes (rows) over all samples (columns). Dark blue are perturbations, which are predicted and confirmed by a measurement (true positives). Light blue perturbations are only predicted (false positives), red are only measured (false negatives) and white are neither (true negatives)

The high number of false positives (light blue) may be explained by the fact that we have not covered all types of perturbation. A gene can be indirectly perturbed without any CNV, mutation or methylation (e.g. by micro RNA activity, O’Brien et al., 2018; Shivdasani, 2006), or the aberration is not detected due to noise.

Even though CDKN1B is hardly mutated in any sample, we predict it as perturbed in all samples. It is downstream of all other genes in the network and hence perturbed, if any other gene is perturbed. PTPRD and CDKN1B have similar profiles with only few samples without a predicted perturbation.

GATA3, MAP2K4 and TBX3, on the other hand, have the fewest predicted perturbations. Additionally, the predicted perturbations for all three genes are mutually exclusive. This is also reflected in the continuous matrix Γ (Fig. 7), where GATA3, MAP2K4 and TBX3 hardly share any samples (light blue).

Next, we used svm, random forest, neural net and k-nearest neighbors classifiers, as well as missForest, to impute perturbations only from known mutations. For a comparison of the methods, we computed the AUC of the precision-recall curve (Table 1, first row). The overall accuracy, while greater than random, is low. However, if we assume that the perturbations caused by CNVs and methylation are propagated by the network ϕ inferred by NEMπ, the accuracy increases, especially for NEMπ and svm (Table 1, second row).

Table 1.

Area under the precision-recall curve for the eight breast cancer-specific driver genes

                          NEMπ   svm    Neural net   Random forest   missForest   knn    Random
Without propagation       0.60   0.62   0.58         0.63            0.60         0.59   0.54
With ϕ propagation        0.88   0.87   0.76         0.78            0.74         0.78   0.74

Note: All methods achieve a marginally higher accuracy than random guessing in predicting CNVs and/or methylated sites (first row). If perturbations by CNVs and methylations are propagated by the network ϕ inferred by NEMπ, the accuracy increases, especially for NEMπ and svm (second row).

Next, we randomly sampled 10 genes from the pan-cancer list of Bailey et al. (2018) and predicted CNVs and methylation from mutations and gene expression profiles to assess the variance of the AUC over the dataset. Overall, the performance stays low for all methods (Supplementary Fig. S12, left), with NEMπ achieving a significantly greater accuracy than all other methods except random guessing, which is still worse than NEMπ but misses the 5% cut for significance (rank sum test of the accuracy values with alternative 'greater', P-value 0.06757). However, if we assume that the network ϕ inferred by NEMπ is correct, perturbations caused by CNVs or methylation are propagated via ϕ; in this case, NEMπ is also significantly better than random (Supplementary Fig. S12, center). Additionally, NEMπ is almost 50 times faster than neural nets and random forest and 5 times faster than svm; only missForest and knn are faster than NEMπ (Supplementary Fig. S12, right).
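The one-sided rank-sum comparison of accuracy values can be sketched as a Wilcoxon/Mann-Whitney test with a normal approximation. This simplified version uses midranks for ties but omits the tie and continuity corrections that statistical packages typically apply, so p-values on small or heavily tied samples will differ slightly from, e.g., R's wilcox.test:

```python
from statistics import NormalDist

def rank_sum_greater(x, y):
    """One-sided Wilcoxon rank-sum test (alternative: x tends to be
    greater than y), using the normal approximation."""
    combined = sorted((v, grp) for grp, vals in ((0, x), (1, y)) for v in vals)
    values = [v for v, _ in combined]
    n = len(values)
    rank_of = [0.0] * n
    i = 0
    while i < n:  # assign midranks to tied values
        j = i
        while j + 1 < n and values[j + 1] == values[i]:
            j += 1
        mid = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            rank_of[k] = mid
        i = j + 1
    rx = sum(r for r, (_, grp) in zip(rank_of, combined) if grp == 0)
    nx, ny = len(x), len(y)
    u = rx - nx * (nx + 1) / 2          # Mann-Whitney U for group x
    mu = nx * ny / 2
    sd = (nx * ny * (nx + ny + 1) / 12) ** 0.5
    z = (u - mu) / sd
    return 1 - NormalDist().cdf(z)      # small p-value: x stochastically larger
```

For example, rank_sum_greater(nempi_accuracies, other_accuracies) would return a small p-value when the NEMπ accuracies consistently rank above the competitor's.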

In an additional analysis, we performed leave-one-out cross-validation exclusively on mutated samples. We removed one sample and trained a model on the mutation and expression profiles of the remaining samples. We then predicted the mutation profile of the removed sample solely from its expression profile. NEMπ achieves the highest AUC (Supplementary Fig. S13). However, overall accuracy is very low across all methods. This may be explained by the fact that all methods try to predict perturbations, including, for example, CNVs and methylated sites, and not only mutations, which can be sparse.
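The leave-one-out scheme itself is method-agnostic: each sample is held out in turn, a predictor is trained on the remaining samples, and the held-out label is predicted from expression alone. A sketch with a hypothetical 1-nearest-neighbour predictor standing in for NEMπ or the classifiers (toy data, not the TCGA profiles):

```python
def loo_predict(X, y, predict_fn):
    """Leave-one-out cross-validation: predict each sample's label
    from a model trained on all other samples."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(predict_fn(X_train, y_train, X[i]))
    return preds

def nn_predict(X_train, y_train, x):
    # 1-nearest neighbour by squared Euclidean distance (illustrative only)
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

# Two well-separated expression profiles are recovered perfectly
X = [[0, 0], [0, 1], [5, 5], [5, 6]]
y = [0, 0, 1, 1]
print(loo_predict(X, y, nn_predict))  # -> [0, 0, 1, 1]
```

The per-sample predictions (or continuous scores) can then be compared against the held-out mutation calls, e.g. with a precision-recall AUC.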

While NEMπ is not designed to predict driver genes, there might be an overlap between drivers and highly perturbed genes identified by NEMπ. To assess a potential overlap, we compared the previous results (Fig. 9) with the driver gene identification method DawnRank (Hou and Ma, 2014). DawnRank does not build a predictor like the other methods; instead, it uses mutation calls, gene expression data and a prior gene network to infer sample-specific driver gene rankings, normalized to a range of 0 to 1. Therefore, we applied DawnRank to the mutations, the gene expression data and a prior network from String-db (Szklarczyk et al., 2019), using the pre-processed network from the R package Prodigy (Dinstag and Shamir, 2019). We compared the normalized ranks of our genes of interest to the mutation calls via the AUC of the precision-recall curve, as in the leave-one-out cross-validation. DawnRank achieves an AUC of 0.15. However, DawnRank predicts CDKN1B as a driver gene with a score close to 100% in all samples (minimum: 99.72%, Supplementary Fig. S14). This agrees with the perturbation pattern of CDKN1B predicted by NEMπ (Fig. 9). Hence, for the eight driver genes, CDKN1B is the most significant gene in both analyses, which supports our assumption that driver genes and highly perturbed genes overlap.

6 Discussion

We have introduced NEMπ, a novel method for inferring perturbation profiles of biological samples based on known but incomplete perturbations and gene expression data. We have shown in a simulation study that our method successfully learns perturbation profiles in several different situations with the help of the underlying causal network of the perturbed genes. Overall, we achieve higher accuracy than comparable methods commonly used for this type of problem (i.e. support vector machines, neural networks, random forest, k-nearest neighbors, and the imputation methods missForest and mice). However, these methods are at a disadvantage, because they do not assume the underlying network that is used to generate the data.

We applied NEMπ to several CRISPR scRNA-seq datasets, which allowed for a validation of NEMπ on real data in a controlled setting. We show that NEMπ achieves high accuracy in learning the removed labels (known perturbations) from a subset of cells. On the dataset with only three genes and combinatorial perturbations in particular, all methods achieve high accuracy. The accuracy decreases for the datasets with more perturbed genes. Naturally, on the datasets with random samples of 10 and 15 P-genes, the variance over the 100 runs increases due to the heterogeneity of the sampled cells and the inferred underlying network.

We applied our method to the breast cancer (BRCA) data from TCGA. We chose this dataset for its large number of samples, including controls. Control samples are necessary to normalize the tumor samples with respect to differential expression; with few or no control samples, normalization becomes more difficult and unreliable. We inferred the perturbation profiles of eight genes that have previously been identified as driver genes unique to breast cancer. These genes should therefore be highly relevant, and our simulations have shown that we can account for unknown P-genes. Since the selected P-genes are known BRCA-specific driver genes, we expect them to be highly perturbed in this cancer type.

We learned the perturbation profiles purely from mutation and gene expression data and compared our predictions to other datasets implying gene perturbations (CNV, methylation). This comparison shows that NEMπ recovers many perturbations not included in the mutation profiles. However, there are also predictions with no measured perturbations and vice versa. The reason for this divergence can simply be noise. False negative predictions can also occur because some genetic aberrations do not cause a perturbation of the gene. False positive predictions can be explained by an indirect perturbation of a gene by means other than genetic aberrations (propagated perturbation). For randomly sampled P-genes (known pan-cancer genes), NEMπ achieves on average the largest area under the precision-recall curve. If we assume that perturbations caused by CNVs and methylation are propagated by the network ϕ inferred by NEMπ, the accuracy increases.

As shown in the applications to simulated and real data, NEMπ can handle a high number of samples and E-genes. Prediction accuracy is robust even for a large underlying network with only few known P-genes. While NEMπ can be applied to more than 15 P-genes, accuracy decreases and run-time increases. Hence, we suggest employing other methods to reduce the number of P-genes to a feasible size. It would be interesting to extend NEMπ with a divide-and-conquer approach to make it applicable to a larger set of genes. For example, NEMπ could be applied to subsets of genes to predict local perturbation profiles; how those local profiles would then be combined remains an open problem.

While we use the causal network to predict indirect perturbations, the interpretation of the network itself is less clear. The network is built based on similar expression profiles among the P-genes. However, it is not clear whether the P-genes actually interact with each other directly or indirectly.

Funding

Part of this work was funded by SystemsX.ch, the Swiss Initiative in Systems Biology [RTD 2013/152] (TargetInfectX—Multi-Pronged Perturbation of Pathogen Infection in Human Cells), evaluated by the Swiss National Science Foundation and by ERC Synergy Grant [609883].

Conflict of Interest: none declared.

Supplementary Material

btab113_Supplementary_Data

Contributor Information

Martin Pirkl, Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland; Swiss Institute of Bioinformatics, Basel 4058, Switzerland.

Niko Beerenwinkel, Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland; Swiss Institute of Bioinformatics, Basel 4058, Switzerland.

References

  1. Adamson B. et al. (2016) A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell, 167, 1867–1882.e21.
  2. Al-Lazikani B. et al. (2012) Combinatorial drug therapy for cancer in the post-genomic era. Nat. Biotechnol., 30, 679–692.
  3. Anchang B. et al. (2009) Modeling the temporal interplay of molecular signaling and gene expression by using dynamic nested effects models. Proc. Natl. Acad. Sci. USA, 106, 6447–6452.
  4. Anchang B. et al. (2018) DRUG-NEM: optimizing drug combinations using single-cell perturbation response to account for intratumoral heterogeneity. Proc. Natl. Acad. Sci. USA, 115, E4294–E4303.
  5. Azur M. et al. (2011) Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res., 20, 40–49.
  6. Bailey M.H. et al. (2018) Comprehensive characterization of cancer driver genes and mutations. Cell, 173, 371–385.e18.
  7. Barrett T. et al. (2012) NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res., 41, D991–D995.
  8. Cibulskis K. et al. (2013) Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol., 31, 213–219.
  9. Colaprico A. et al. (2016) TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res., 44, e71.
  10. Cortes C., Vapnik V. (1995) Support-vector networks. Mach. Learn., 20, 273–297.
  11. Dempster A.P. et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodological), 39, 1–38.
  12. Dimitrakopoulos C. et al. (2018) Network-based integration of multi-omics data for prioritizing cancer genes. Bioinformatics, 34, 2441–2448.
  13. Dinstag G., Shamir R. (2019) PRODIGY: personalized prioritization of driver genes. Bioinformatics, 36, 1831–1839.
  14. Edgar R. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.
  15. Fan Y. et al. (2016) MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol., 17, 178.
  16. Froehlich H. et al. (2011) Fast and efficient dynamic nested effects models. Bioinformatics, 27, 238–244.
  17. Harris C.C. et al. (2012) SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 28, 311–317.
  18. Honghai F. et al. (2005) A SVM regression based approach to filling in missing values. In: Khosla R. et al. (eds.) Knowledge-Based Intelligent Information and Engineering Systems. Springer, Berlin, Heidelberg, pp. 581–587.
  19. Hou J.P., Ma J. (2014) DawnRank: discovering personalized driver genes in cancer. Genome Med., 6, 56.
  20. Hou Y. et al. (2018) MaxMIF: a new method for identifying cancer driver genes through effective data integration. Adv. Sci., 5, 1800640.
  21. Koboldt D.C. et al. (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res., 22, 568–576.
  22. Lawrence M.S. et al. (2013) Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499, 214–218.
  23. Liaw A., Wiener M. (2002) Classification and regression by randomForest. R News, 2, 18–22.
  24. Markowetz F. et al. (2005) Non-transcriptional pathway features reconstructed from secondary effects of RNA interference. Bioinformatics, 21, 4026–4032.
  25. Meyer D. et al. (2019) e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. CRAN R package version 1.7-0.1. https://cran.r-project.org/web/packages/e1071/index.html
  26. Nelwamondo F.V. et al. (2007) Missing data: a comparison of neural network and expectation maximization techniques. Curr. Sci., 93, 1514–1521.
  27. The Cancer Genome Atlas Research Network et al. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455, 1061.
  28. O’Brien J. et al. (2018) Overview of microRNA biogenesis, mechanisms of actions, and circulation. Front. Endocrinol., 9, 402.
  29. Pantanowitz A., Marwala T. (2009) Missing data imputation through the use of the random forest algorithm. In: Yu W., Sanchez E.N. (eds.) Advances in Computational Intelligence. Springer, Berlin, Heidelberg, pp. 53–62.
  30. Pirkl M., Beerenwinkel N. (2018) Single cell network analysis with a mixture of Nested Effects Models. Bioinformatics, 34, i964–i971.
  31. Pirkl M. et al. (2016) Analyzing synergistic and non-synergistic interactions in signalling pathways using Boolean nested effect models. Bioinformatics, 32, 893–900.
  32. Pirkl M. et al. (2017) Inferring modulators of genetic interactions with epistatic nested effects models. PLoS Comput. Biol., 13, e1005496.
  33. Robinson M.D. et al. (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139–140.
  34. Sadeh M.J. et al. (2013) Considering unknown unknowns: reconstruction of nonconfoundable causal relations in biological networks. J. Comput. Biol., 20, 920–932.
  35. Shah A. (2018) CALIBERrfimpute: Imputation in MICE using Random Forest. R package version 1.0-1. https://cran.r-project.org/src/contrib/Archive/CALIBERrfimpute/
  36. Shivdasani R.A. (2006) MicroRNAs: regulators of gene expression and cell differentiation. Blood, 108, 3646–3653.
  37. Siebourg-Polster J. et al. (2015) NEMix: single-cell nested effects models for probabilistic pathway stimulation. PLoS Comput. Biol., 11, e1004078.
  38. Smieja M. et al. (2018) Processing of Missing Data by Neural Networks. Curran Associates Inc., Red Hook, NY, USA.
  39. Srivatsa S. et al. (2018) Improved pathway reconstruction from RNA interference screens by exploiting off-target effects. Bioinformatics, 34, i519–i527.
  40. Stekhoven D.J., Buhlmann P. (2012) MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 112–118.
  41. Sverchkov Y. (2018) Context-specific nested effects models. In: Proceedings of the Annual International Conference on Research in Computational Biology (RECOMB). https://www.springerprofessional.de/context-specific-nested-effects-models/15676774
  42. Szklarczyk D. et al. (2019) STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res., 47, D607–D613.
  43. Tokheim C.J. et al. (2016) Evaluating the evaluation of cancer driver genes. Proc. Natl. Acad. Sci. USA, 113, 14330–14335.
  44. Tresch A., Markowetz F. (2008) Structure learning in nested effects models. Stat. Appl. Genet. Mol. Biol., 7, Article 9.
  45. Venables W.N., Ripley B.D. (2002) Modern Applied Statistics with S, 4th edn. Springer, New York.
  46. Wang X. et al. (2014) Reconstructing evolving signalling networks by hidden Markov nested effects models. Ann. Appl. Stat., 8, 448–480.
  47. Yang B. et al. (2012) A data imputation method with support vector machines for activity-based transportation models. In: Wang Y., Li T. (eds.) Foundations of Intelligent Systems. Springer, Berlin, Heidelberg, pp. 249–257.
  48. Yip S.H. et al. (2017) Linnorm: improved statistical analysis for single cell RNA-seq expression data. Nucleic Acids Res., 45, e179.
