Learning epistatic gene interactions from perturbation screens

Kieran Elmes; Fabian Schmich; Ewa Szczurek; Jeremy Jenkins; Niko Beerenwinkel; Alex Gavryushkin

doi:10.1371/journal.pone.0254491

2021 Jul 13;16(7):e0254491.


Editor: Ruben Artero

PMCID: PMC8277066 PMID: 34255784

Abstract

The treatment of complex diseases often relies on combinatorial therapy, a strategy where drugs are used to target multiple genes simultaneously. Promising candidate genes for combinatorial perturbation often constitute epistatic genes, i.e., genes which contribute to a phenotype in a non-linear fashion. Experimental identification of the full landscape of genetic interactions by perturbing all gene combinations is prohibitive due to the exponential growth of testable hypotheses. Here we present a model for the inference of pairwise epistatic, including synthetic lethal, gene interactions from siRNA-based perturbation screens. The model exploits the combinatorial nature of siRNA-based screens resulting from the high numbers of sequence-dependent off-target effects, where each siRNA apart from its intended target knocks down hundreds of additional genes. We show that conditional and marginal epistasis can be estimated as interaction coefficients of regression models on perturbation data. We compare two methods, namely glinternet and xyz, for selecting non-zero effects in high dimensions as components of the model, and make recommendations for the appropriate use of each. For data simulated from real RNAi screening libraries, we show that glinternet successfully identifies epistatic gene pairs with high accuracy across a wide range of relevant parameters for the signal-to-noise ratio of observed phenotypes, the effect size of epistasis and the number of observations per double knockdown. xyz is also able to identify interactions from lower dimensional data sets (fewer genes), but is less accurate for many dimensions. Higher accuracy of glinternet, however, comes at the cost of longer running time compared to xyz. The general model is widely applicable and allows mining the wealth of publicly available RNAi screening data for the estimation of epistatic interactions between genes. 
As a proof of concept, we apply the model to search for interactions, and potential targets for treatment, among previously published sets of siRNA perturbation screens on various pathogens. The identified interactions include both known epistatic interactions as well as novel findings.

1 Introduction

Genetic interactions are also referred to as epistasis, a term that originates from the field of statistical genetics and describes genetic contributions to the phenotype that are not linear in the effects of single genes [1, 2]. Considering two genes at a time, positive and negative epistasis refer to a greater and smaller effect, respectively, of the double mutant genotype than expected from the two single mutant genotypes relative to the wild type. In genetics, the phenotype of primary interest is the reproductive success of a cell, which is commonly termed fitness [3]. In this context, a fitness landscape is the mapping of each combination of possible configurations of gene mutations to a fitness phenotype [4].

The knowledge of fitness landscapes is highly relevant for personalized disease treatment [5]. In cancer, for example, genetic aberrations result in cells with increased somatic fitness, for instance, by evading apoptosis or gaining the ability to metastasise. This increase subsequently promotes post-metastatic tumour development [6]. A major challenge in cancer therapy is the fact that many genes with driving mutations cannot be adequately targeted for inhibition due to toxic side effects and rapid development of drug resistance [7, 8]. To overcome this challenge, a strategy based on the inhibition of genes that interact with genes with cancer driving alterations was proposed [9]. This strategy is based on the principle of synthetic lethality [5, 10, 11], the extreme case of negative epistasis, where single mutants are compatible with cell viability but the double mutant results in cell death. Identifying synthetic lethal gene interactions allows targeting cancer cells in which one of the two genes is mutated, by using drugs that affect the other. In the presence of this drug, the cancer cell lineage will no longer be viable [12].

The identification of fitness landscapes is, however, a very challenging task, simply due to the exponential growth of the space of interactions. For yeast, for example, it has been shown to be feasible to experimentally perform 75% of all pairwise knockouts [13]. Similarly, [14, 15] study fitness landscapes with a small number of genes in which all or nearly all genotypes of interest have been measured. However, in humans, with approximately 20,000 protein-coding genes, this would amount to almost 200 million experiments to test all pairwise interactions. An approach that has been successfully applied to identify synthetic lethality in vitro is large-scale perturbation screening of human cancer cell lines using RNA interference [16–19]. However, this strategy only allows cataloguing synthetic lethal gene pairs where one gene is always specific to the screened cell line. While these methods may be sufficient for the identification of a few promising targets for cancer therapy, they do not allow us to estimate general pairwise gene interactions at the human exome scale. To our knowledge, there are currently no methods for inferring gene interactions at this scale. We therefore focus on demonstrating that our approach is sound, rather than comparing to existing methods.

Short-interfering RNAs (siRNAs), the reagents used in RNAi perturbation screening, exhibit strong off-target effects, which result in high numbers of false positives and render the perturbations hard to interpret [20]. While this is usually perceived as a problem, here we take advantage of this property for the estimation of genetic interactions [21–23]. We propose a novel approach for the second order approximation of a human fitness landscape by inferring the fitness of single gene perturbations and their pairwise interactions from RNAi screening data (Fig 1). Our approach is not restricted to interactions with mutant genes of a specific cell line or explicit double knockdowns. We leverage the combinatorial nature of sequence-dependent off-target effects of siRNAs, where each siRNA in addition to the intended on-target knocks down hundreds of additional genes simultaneously. Not distinguishing between on- and off-targeted genes, we consider each siRNA knockdown as a combinatorial knockdown of multiple genes. Hence, every large-scale RNAi screen, though unintended, contains large numbers of observations of high-order combinatorial knockdowns and provides a rich source for the extraction of pairwise epistasis. These off-target effects have previously been used to improve inference of signalling pathways among a small number (on the order of a dozen) of genes [22, 23]. Here, however, we attempt to use them to discover epistatic gene pairs in a genome-wide fashion (i.e. among tens of thousands of genes). Our approach is formulated as a regularized regression model. It can also be deployed for the estimation of epistasis from phenotypes other than fitness, for instance phenotypes that measure the activity of disease-relevant pathways, e.g. for pathogen entry [24], TGFβ-signalling [25], or WNT-signalling [26].
Long term, the identification of disease-relevant epistatic gene pairs may allow the design or re-purposing of agents for combinatorial therapy with the potential to improve the efficacy of drugs.

Fig 1 — Black arrows indicate outputs that are actually produced. Red arrows indicate theoretical output.

In solving this model, we adapt two recent statistical learning methods, namely glinternet [27] and xyz [28], to select genes and gene pairs with non-zero effects on fitness, and evaluate both models on simulated data from real RNAi libraries. We vary the signal-to-noise ratios, number of true gene–gene interactions, number of observations per double knockdown and effect size for epistasis. We find that, within ranges that are realistic for real RNAi data, both approaches are capable of inferring pairwise epistasis with favourable precision and sensitivity when only a small number of genes are involved in interactions. In several tests glinternet continued to infer correct interactions up to several thousand genes; however, the run time prohibits more thorough testing. To demonstrate the model on a real data set, we use the perturbation data from [24]. Using glinternet, we search for interactions between kinases, and report the most significant results.

Our simulations are performed using R, and the source code is available at: https://github.com/bioDS/xyz-simulation.

2 Methods

We fix the binary alphabet Σ = {0, 1} representing the two possible states in a perturbation experiment. The value zero denotes the normal state of the gene (unperturbed wild type), whereas the value one indicates knockdown of the gene (perturbed). For p genes we denote by Σ^p the set of binary sequences of length p, indicating the perturbation status of each gene. Any subset $P \subseteq Σ^{p}$ is called a perturbation space and its elements are called perturbation types. If the perturbations are genetic mutations, then the perturbation types are genotypes.

2.1 Fitness landscapes and epistasis

In the following, we focus on fitness landscapes, but would like to note that the theory also holds for any mapping of perturbation type to phenotype. A fitness landscape is a mapping $f : P \to R_{+}$ from perturbation type space to non-negative fitness values. Genetic interactions are a property of the underlying fitness landscape [29]. For p = 2 genes, the perturbation type space $P = \{0, 1\}^{2}$ contains the wild-type 00, two single perturbations 01 and 10, and the double perturbation 11. The fitness landscape $f : \{0, 1\}^{2} \to R_{+}$ can be written as

\begin{matrix} f (0, 0) & = & β_{0} \\ f (1, 0) & = & β_{0} + β_{1} \\ f (0, 1) & = & β_{0} + β_{2} \\ f (1, 1) & = & β_{0} + β_{1} + β_{2} + β_{1, 2} \end{matrix}

for parameters $β_{i} \in R$. β₀ is called the bias, β₁ and β₂ the main effects, and β_1,2 the interaction. Epistasis is defined as

\begin{matrix} ε = f (0, 0) + f (1, 1) - f (0, 1) - f (1, 0) \end{matrix}

(1)

It measures the deviation of the fitness of the double knockdown from the expectation under a linear fitness model in the main effects. We see that ε = β_1,2.
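As a small numerical illustration of Eq 1, the following Python sketch builds the four fitness values of a two-gene landscape and checks that the epistasis equals the interaction parameter. The function name and the parameter values are hypothetical and chosen only for this example.

```python
def epistasis(f00: float, f10: float, f01: float, f11: float) -> float:
    """Eq 1: deviation of the double knockdown from the additive expectation."""
    return f00 + f11 - f01 - f10


# Hypothetical parameters: bias, two main effects, one interaction.
b0, b1, b2, b12 = 10.0, -2.0, -3.0, -4.0

eps = epistasis(
    f00=b0,                    # wild type
    f10=b0 + b1,               # gene 1 knocked down
    f01=b0 + b2,               # gene 2 knocked down
    f11=b0 + b1 + b2 + b12,    # double knockdown
)
print(eps == b12)  # the bias and main effects cancel
```

The bias and both main effects cancel in the sum, which is why only the interaction term remains.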

2.1.1 Fitness landscape model

It is challenging to generalize the notion of epistasis (Eq 1), because in higher dimensions, many more types of genetic interactions exist [29], even when restricting to pairwise interactions. In general, it will be impossible to estimate all interactions encoded in the fitness landscape reliably from data. In the following, we show how to assess marginal and conditional pairwise epistasis. For p ≥ 1 genes, we consider the Taylor expansion of the fitness landscape

\begin{matrix} f (x_{1}, \dots, x_{p}) = β_{0} + \sum_{i} x_{i} β_{i} + \sum_{i < j} x_{i} x_{j} β_{i, j} + \sum_{i < j < k} x_{i} x_{j} x_{k} β_{i, j, k} + \dots \end{matrix}

(2)

Ignoring interactions of order 3 and higher we obtain the more computationally tractable approximation:

\begin{matrix} f (x_{1}, \dots, x_{p}) \approx β_{0} + \sum_{i} x_{i} β_{i} + \sum_{i < j} x_{i} x_{j} β_{i, j} \end{matrix}

(3)

We show in Appendix A in S1 Appendix that in the fitness landscape model (Eq 3), which contains all main effects and pairwise interactions, but no interactions of higher order, the interaction terms β_i,j alone determine conditional and marginal epistasis of the fitness landscape. Note that although we are discussing the Taylor approximation of the fitness function, the resulting pairwise epistasis definition is identical to that in [15].
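The claim that the interaction terms alone determine conditional epistasis can be checked numerically on a toy landscape. The sketch below is an illustration in Python (not the authors' R code); the helper names and parameter values are hypothetical. It evaluates Eq 3 and applies Eq 1 to genes 0 and 1 for each state of a third gene.

```python
from itertools import combinations


def fitness(x, beta0, main, inter):
    """Second-order landscape (Eq 3) for a binary perturbation type x.

    main:  dict mapping gene index -> main effect
    inter: dict mapping (i, j) with i < j -> pairwise interaction
    """
    value = beta0
    value += sum(main.get(i, 0.0) * xi for i, xi in enumerate(x))
    value += sum(inter.get((i, j), 0.0) * x[i] * x[j]
                 for i, j in combinations(range(len(x)), 2))
    return value


def conditional_epistasis(i, j, background, beta0, main, inter):
    """Eq 1 for genes i and j, all other genes fixed as in `background`."""
    def f(xi, xj):
        x = list(background)
        x[i], x[j] = xi, xj
        return fitness(x, beta0, main, inter)
    return f(0, 0) + f(1, 1) - f(0, 1) - f(1, 0)


# Hypothetical three-gene landscape.
beta0 = 1.0
main = {0: 0.5, 1: -1.0, 2: 0.25}
inter = {(0, 1): 2.0, (1, 2): -0.5}

# Epistasis between genes 0 and 1, with gene 2 unperturbed and perturbed.
values = [conditional_epistasis(0, 1, (0, 0, b), beta0, main, inter)
          for b in (0, 1)]
print(values)
```

In both backgrounds the result equals the interaction coefficient of the pair, independent of the state of the third gene.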

2.2 Estimation of epistasis from RNAi perturbation screens

In in vitro RNAi experiments cells are perturbed by reagents, such as siRNA, shRNA, and dsRNA [30], each targeting a specific gene for knockdown. In recent years, it has been shown [20] that siRNAs exhibit strong sequence-dependent off-target effects, such that, in addition to the intended target gene, hundreds of other genes are knocked down. Thus, we can regard siRNA perturbation experiments as combinatorial knockdowns affecting multiple genes simultaneously. On the basis of the fitness landscape model (Eq 3), we propose a regression model for the estimation of epistasis from RNAi data. This inference is only feasible because of the unintended combinatorial nature of siRNA knockdowns.

2.2.1 Perturbation type space

For an RNAi-based perturbation screen, the perturbation type space $P = \{g_{1}, \dots, g_{n}\}$ is represented as the n × p matrix X that contains g_i in row i. Based on the nucleotide sequences of the reagents, perturbations can be predicted by models for micro RNA (miRNA) target prediction [31]. We use X₁, …, X_p to denote the p column vectors of X for genes 1, …, p and denote by X_i ∘ X_j the column vector consisting of the element-wise products of the entries of X_i and X_j. As a measure of fitness, we use the vector $Y \in R_{+}^{n}$, denoting the number of cells present after siRNA knockdown.
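The element-wise product X_i ∘ X_j marks the siRNAs that knock down both genes at once. A minimal sketch in Python, using a small hypothetical perturbation matrix (four siRNAs, three genes) that is not taken from any real library:

```python
def hadamard(col_a, col_b):
    """Element-wise product of two binary columns."""
    return [a * b for a, b in zip(col_a, col_b)]


# Hypothetical perturbation matrix X: rows are siRNAs, columns are genes.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]

columns = [list(col) for col in zip(*X)]      # X_1, ..., X_p
x1_x2 = hadamard(columns[0], columns[1])      # X_1 ∘ X_2
n_double_knockdowns = sum(x1_x2)              # siRNAs hitting both genes
print(x1_x2, n_double_knockdowns)
```

The sum of the product column is the number of observed double knockdowns for that gene pair, the quantity that later determines whether an interaction is identifiable.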

2.2.2 Regression model

We aim to estimate the conditional epistasis β_i,j between the $\binom{p}{2}$ pairs of genes (i, j) ∈ {1, …, p}² from all combinatorial gene perturbations in the screen represented in the n × p matrix X, and the n × 1 vector of fitness phenotypes Y. Based on (Eq 3) we regress phenotype Y on perturbations X,

\begin{matrix} E [Y ∣ X] = β_{0} + \sum_{i} X_{i} β_{i} + \sum_{i < j} (X_{i} \circ X_{j}) β_{i, j} \end{matrix}

(4)

The estimated β_i,j are interpreted as the expected change in the response variable Y per unit change in the predictor variable (X_i ∘ X_j) with all other predictors held fixed [32]. From Corollary 1 it follows that estimates for marginal epistasis ε_i,j can be obtained by multiplication of β_i,j with the constant $2^{p-2}$.
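The origin of the constant can be made explicit with a short worked derivation. It assumes that marginal epistasis is defined as the sum of the conditional epistasis over all states b of the remaining p − 2 genes; this definition is an assumption here, since the formal statement is in the appendix and not reproduced in this section. The symbols C(b), A_i(b) and A_j(b) are shorthand introduced only for this sketch.

```latex
% Under Eq 3, with the other p-2 genes fixed in state b:
%   f(x_i, x_j, b) = C(b) + x_i A_i(b) + x_j A_j(b) + x_i x_j \beta_{i,j},
% where A_i(b) = \beta_i + \sum_{k \neq i,j} b_k \beta_{i,k} (similarly A_j(b)),
% and C(b) collects all terms that do not involve genes i or j.
\begin{aligned}
\varepsilon_{i,j}(b) &= f(0,0,b) + f(1,1,b) - f(0,1,b) - f(1,0,b) \\
                     &= C + (C + A_i + A_j + \beta_{i,j}) - (C + A_j) - (C + A_i)
                      = \beta_{i,j}, \\
\varepsilon_{i,j}    &= \sum_{b \in \{0,1\}^{p-2}} \varepsilon_{i,j}(b)
                      = 2^{p-2}\, \beta_{i,j}.
\end{aligned}
```

For p = 3, for example, there are two backgrounds, so the marginal epistasis under this definition is 2β_i,j.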

2.2.3 Inference

We aim to infer the regression parameters β = (β₀, β_{i:i>0}, β_{i,j:i<j}). Since it is infeasible to directly perform least squares linear regression on the matrix containing all $\binom{p}{2}$ interactions, we use a two-stage process. First, we use either the group lasso regularisation package glinternet [27], or the xyz interaction search algorithm [28] to select non-zero interactions. This variable selection step is the main computational challenge.

When using glinternet, we infer parameters β = (β₀, β_{i:i>0}, β_{i,j:i<j}) by minimising the squared-error loss function

\begin{matrix} L (Y, X; β) = \frac{1}{2} ‖ Y - (β_{0} + \sum_{i} X_{i} β_{i} + \sum_{i < j} (X_{i} \circ X_{j}) β_{i, j}) ‖_{2}^{2} \end{matrix}

(5)

under the strong hierarchy constraint

\begin{matrix} β_{i, j} \neq 0 \Rightarrow β_{i} \neq 0 \text{ and } β_{j} \neq 0 . \end{matrix}

(6)

This constraint allows conditional epistasis between gene i and j, i.e., β_i,j ≠ 0, only if both single-gene effects β_i and β_j are present, which constrains the search space. Lim and Hastie [27] show that this model can be formulated as a linear regression model with overlapped group lasso (OGL) penalty [33], where, in contrast to the group lasso [34], each predictor can be present in multiple groups.
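The strong hierarchy constraint (Eq 6) is easy to state as a check on a set of coefficients. The following Python sketch is only an illustration of the constraint itself, not of how glinternet enforces it through the group lasso penalty; the function name and the dictionary layout are hypothetical.

```python
def satisfies_strong_hierarchy(main, inter):
    """True if every non-zero interaction has two non-zero main effects.

    main:  dict mapping gene index -> main effect
    inter: dict mapping (i, j) -> interaction coefficient
    """
    return all(
        main.get(i, 0.0) != 0 and main.get(j, 0.0) != 0
        for (i, j), coefficient in inter.items()
        if coefficient != 0
    )


# Both genes of the interacting pair carry a main effect: allowed.
ok = satisfies_strong_hierarchy({0: 1.0, 1: -0.5}, {(0, 1): 2.0})
# Gene 1 has no main effect: the interaction violates the constraint.
violated = satisfies_strong_hierarchy({0: 1.0}, {(0, 1): 2.0})
print(ok, violated)
```

A synthetic lethal pair without main effects, as simulated later, would fail this check by construction.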

To perform the variable selection, xyz searches for pairs (i, j) that maximise Y^T (X_i ∘ X_j). These are the interaction effects that account for the largest component of the response Y. While xyz can be used directly to find the largest interactions, we used xyz_regression to estimate all interactions. xyz_regression solves the following elastic-net problem [28]
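To make the search objective concrete, the sketch below scores every pair by the inner product of Y with X_i ∘ X_j using exhaustive enumeration. This brute-force loop is not the xyz algorithm, which avoids enumerating all pairs by using random projections; it only illustrates the quantity being maximised. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def strongest_pairs(X, Y, top=1):
    """Rank gene pairs by Y^T (X_i ∘ X_j), largest first (exhaustive search)."""
    columns = list(zip(*X))
    scores = {}
    for i, j in combinations(range(len(columns)), 2):
        scores[(i, j)] = sum(y * a * b
                             for y, a, b in zip(Y, columns[i], columns[j]))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top], scores


# Hypothetical screen: 5 siRNAs, 3 genes; the phenotype is large only
# when genes 0 and 1 are knocked down together.
X = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
Y = [5.0, 6.0, 1.0, 1.0, 0.0]

best, scores = strongest_pairs(X, Y)
print(best, scores)
```

The exhaustive version costs on the order of n·p² operations, which is what makes a sub-quadratic search attractive for thousands of genes.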

\begin{matrix} \min_{(β_{0}, β) \in R^{p + 1}, θ \in R^{p (p - 1) / 2}} [\frac{1}{2 n} \sum_{i = 1}^{n} {(y_{i} - β_{0} - x_{i}^{T} β - w_{i}^{T} θ)}^{2} + λ (P_{α} (β) + P_{α} (θ))], \end{matrix}

(7)

where

\begin{matrix} W \in R^{n \times p (p - 1) / 2} = (X_{1} \circ X_{2}, X_{1} \circ X_{3}, \dots, X_{1} \circ X_{p}, X_{2} \circ X_{3}, \dots, X_{p - 1} \circ X_{p}) \end{matrix}

(8)

is the matrix of interactions, x_i and w_i denote row i of X and W, respectively, and $θ \in R^{p (p - 1) / 2}$ is the vector of regression coefficients for the columns of W, i.e. for the pairwise combinations of columns of X.

\begin{matrix} P_{α} (β) = (1 - α) \frac{1}{2} \| β \|_{ℓ_{2}}^{2} + α \| β \|_{ℓ_{1}} \end{matrix}

(9)

is the elastic-net penalty.
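The penalty in Eq 9 can be written as a few lines of Python. This is a minimal sketch of the formula only; the function name is hypothetical and the default mixing value is the 0.9 reported in the text.

```python
import math


def elastic_net_penalty(beta, alpha=0.9):
    """Eq 9: (1 - alpha) * 0.5 * ||beta||_2^2 + alpha * ||beta||_1."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return (1 - alpha) * 0.5 * l2_squared + alpha * l1


beta = [3.0, -4.0]                                # hypothetical coefficients
lasso_only = elastic_net_penalty(beta, alpha=1.0)  # pure l1 penalty
ridge_only = elastic_net_penalty(beta, alpha=0.0)  # pure (halved) squared l2
mixed = elastic_net_penalty(beta)                  # alpha = 0.9
print(lasso_only, ridge_only, math.isclose(mixed, 7.55))
```

Setting the mixing parameter to 1 or 0 recovers the lasso and ridge penalties, respectively.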

The parameter α determines the compromise between the ridge-regression penalty (α = 0) and the lasso penalty (α = 1). We kept the default value of α = 0.9. The solution is found iteratively, with only a particular set of beta values allowed to be non-zero at each iteration. In every iteration, the beta values that violate the Karush–Kuhn–Tucker conditions (Eq 10) are added to this set.

\begin{matrix} \text{KKT conditions:} \quad X^{T} (Y - X β) = λ s, \quad s_{i} \in \begin{cases} \{1\} & \text{if } β_{i} > 0 \\ \{- 1\} & \text{if } β_{i} < 0 \\ [- 1, 1] & \text{if } β_{i} = 0 \end{cases} \end{matrix}

(10)

Rather than being computed directly, these beta values are found using the xyz algorithm (see Appendix C in S1 Appendix for details). We followed the recommendation in [28] and used $L = \sqrt{p}$ projections to find the strong interactions. Our own tests in Appendix D in S1 Appendix also suggest that further projections do not improve performance.
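For intuition, the condition in Eq 10 can be checked directly on a small problem: a coefficient that is currently zero violates the conditions when the absolute inner product of its column with the residual exceeds λ. The sketch below performs this direct check, which is exactly the computation the xyz search is meant to avoid at scale. It follows Eq 10 as written (lasso form, no 1/n scaling); the function name and the toy data are hypothetical.

```python
def kkt_violations(X, Y, beta, lam):
    """Indices j with beta_j == 0 and |X_j^T (Y - X beta)| > lam."""
    residual = [y - sum(x_ij * b_j for x_ij, b_j in zip(row, beta))
                for row, y in zip(X, Y)]
    violations = []
    for j, b_j in enumerate(beta):
        if b_j != 0:
            continue  # only inactive coefficients can enter the active set
        gradient = sum(row[j] * r for row, r in zip(X, residual))
        if abs(gradient) > lam:
            violations.append(j)
    return violations


# Hypothetical design with two predictors and all coefficients at zero.
X = [[1, 0], [1, 0], [0, 1]]
Y = [2.0, 2.0, 0.5]
to_add = kkt_violations(X, Y, beta=[0.0, 0.0], lam=1.0)
print(to_add)
```

Here only the first predictor would be added to the active set, because its inner product with the residual (4.0) exceeds λ while that of the second (0.5) lies within [−λ, λ].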

Second, once the non-zero effects have been estimated using either glinternet or xyz, we construct a matrix X′ with all elements of the set {X_i|X_i ≠ 0} ∪ {X_i ∘ X_j|X_i ⋅ X_j ≠ 0} as columns, in an arbitrary order. We then fit Y ∼ X′β using R’s lm least squares linear regression to calculate the coefficient estimates and corresponding p-values, the latter testing whether the coefficient deviates significantly from zero according to a t-test with n − k − 1 degrees of freedom, where k is the number of effects predicted to be non-zero and included in the final regression step. We adjust the p-value to control the false discovery rate with the method of [35], and refer to this adjusted value as the q-value. Given this two-step procedure, we do not expect these values to be the same as if they were calculated using the complete interaction matrix, and it should be noted that they may be biased estimates [44].
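The p-value adjustment step can be illustrated as follows. The sketch assumes that the method of [35] is the Benjamini–Hochberg step-up procedure; this is an assumption, since the reference is not spelled out in this section. It is written in Python for illustration and is not the authors' R code; the function name and p-values are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, pvalues[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


# Hypothetical p-values for four selected effects.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q < 0.05 for q in qvalues]
print(qvalues, significant)
```

Under this assumption, an effect is reported when its q-value falls below the chosen threshold, 0.05 in the results that follow.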

2.3 Software

The overlapped group lasso for strongly hierarchical interaction terms is implemented in the R-package glinternet 1.0.10 by Lim and Hastie [27] and available through the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/glinternet/. The xyz algorithm is implemented in xyz 0.2 by Gian-Andrea Thanei [28] available at https://cran.r-project.org/web/packages/xyz/. The simulations are run using a version of this software that also contains a trivial bug fix, available at https://github.com/bioDS/xyz-simulation. For the data simulation, analysis and visualisation, we used the R-packages Matrix 1.2.6, dplyr 0.4.3, tidyr 0.4.1 and ggplot2 2.1.0. All simulations are performed using R 3.2.4.

2.4 Simulation of RNAi data

The data simulation follows a three-step procedure. First, we simulate the siRNA–gene perturbation matrix X based on real siRNA libraries. Second, main effects β_i and conditional epistasis between pairs of genes β_i,j are sampled. Third, based on X and β, we sample fitness phenotypes Y from our model (Eq 3) and add noise to match specific signal-to-noise ratios [36]

\begin{matrix} SNR = \frac{Var (E [Y ∣ X])}{Var (Y - E [Y ∣ X])} . \end{matrix}

(11)

Details for each step including parameter ranges are as follows.

We simulate siRNA–gene perturbation matrices based on four commercially available genome-wide libraries for 20822 human genes from Qiagen with an overall size of 90000 siRNAs. First, we predict sequence-dependent off-targets using TargetScan [37] for each siRNA as described in [21]. We threshold all predictions to be 1 if larger than zero and 0 otherwise. Then, we sample n = 1000 siRNAs from {1, …, 90000} and p = 100 genes from {1, …, 20822} without replacement and construct the n × p binary matrix X. Hence, each row X_i⋅ contains the perturbation type g_i = (x_i,1, …, x_i,p).
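The thresholding and subsampling described above can be sketched as follows. The score matrix here is a small hypothetical stand-in for the TargetScan predictions, and the function names are illustrative; the sketch is in Python rather than the R used for the actual simulations.

```python
import random


def binarise(scores):
    """Set every prediction larger than zero to 1, everything else to 0."""
    return [[1 if s > 0 else 0 for s in row] for row in scores]


def subsample(matrix, n_rows, n_cols, rng):
    """Draw rows (siRNAs) and columns (genes) without replacement."""
    rows = rng.sample(range(len(matrix)), n_rows)
    cols = rng.sample(range(len(matrix[0])), n_cols)
    return [[matrix[r][c] for c in cols] for r in rows]


# Hypothetical off-target scores: 4 siRNAs x 3 genes.
scores = [
    [0.0, 0.3, 0.0],
    [1.2, 0.0, 0.7],
    [0.0, 0.0, 0.1],
    [0.4, 0.9, 0.0],
]
full = binarise(scores)
X = subsample(full, n_rows=3, n_cols=2, rng=random.Random(0))
print(full, X)
```

With the real libraries, the same two steps would be applied with n = 1000 rows and p = 100 columns.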

We simulate q ∈ {5, 20, 50, 100} non-zero conditional epistasis terms β_i,j between genes i and j from all observed combinatorial knockdowns, i.e. only for pairs where the simulated screen contains siRNAs that target both genes. This is a necessary condition for the identifiability of β_i,j, as otherwise, according to the model (Eq 4), β_i,j will be multiplied by a zero vector X_i ∘ X_j = 0. The effect size of the β_i,j is sampled from N(0, 2). In order to maintain a strong hierarchy, we subsequently simulate for each interaction β_i,j both main effects β_i and β_j. Further, we add r ∈ {0, 20, 50, 100} additional main effects. The effect sizes of the main effects are sampled from N(0, 1), so that the variance in the response fitness phenotypes is split in a ratio of 1:2 between main effects and interactions.
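The sampling scheme for effects can be sketched in Python as below. It assumes that the second argument of N(·, ·) is a variance, which is consistent with the stated 1:2 split; the function name, the toy matrix and the handling of the case where fewer than r non-interacting genes remain are assumptions of this sketch.

```python
import random
from itertools import combinations


def simulate_effects(X, q, r, rng):
    """Sample q interactions among observed pairs, plus hierarchical main effects."""
    p = len(X[0])
    columns = list(zip(*X))
    # Only pairs with at least one double knockdown are identifiable.
    observed = [(i, j) for i, j in combinations(range(p), 2)
                if any(a and b for a, b in zip(columns[i], columns[j]))]
    pairs = rng.sample(observed, q)
    inter = {pair: rng.gauss(0.0, 2 ** 0.5) for pair in pairs}  # variance 2
    genes = {g for pair in pairs for g in pair}   # strong hierarchy
    others = [g for g in range(p) if g not in genes]
    # Assumption: add at most as many extra main effects as genes remain.
    genes.update(rng.sample(others, min(r, len(others))))
    main = {g: rng.gauss(0.0, 1.0) for g in sorted(genes)}      # variance 1
    return main, inter


X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]]  # hypothetical screen
main, inter = simulate_effects(X, q=2, r=1, rng=random.Random(0))
print(main, inter)
```

Every sampled interaction has both of its genes in the set of main effects, so the simulated coefficients respect the strong hierarchy.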

In order to model synthetic lethal pairs, interactions with effect strength of −1000 (on log scale) are added to the simulated data. Since lethal interactions may occur with little or no main effect present [10], we allow these pairs to violate the strong hierarchy and do not add main effects. This is done both for biological plausibility, and to evaluate the performance of xyz and glinternet under less ideal circumstances. Since only glinternet assumes the strong hierarchy, this scenario might favour xyz.

Based on simulated perturbation matrices X, simulated main effects β_i and interaction terms β_i,j, we sampled fitness values with β₀ = 0 according to the fitness landscape model (Eq 3)

\begin{matrix} Y \sim N (\sum_{i} X_{i} β_{i} + \sum_{i < j} (X_{i} \circ X_{j}) β_{i, j}, σ^{2} I), \end{matrix}

where we chose σ² for fixed SNRs s ∈ {2, 5, 10}.
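Given the definition of the SNR in Eq 11, the noise variance for a target ratio s follows as σ² = Var(E[Y | X]) / s. The sketch below illustrates this in Python; it uses the population variance of the noiseless signal, which is an assumption about how the variance is estimated, and the function names and signal values are hypothetical.

```python
import math
import random
from statistics import pvariance


def noise_sd_for_snr(signal, snr):
    """Standard deviation of Gaussian noise that yields the target SNR."""
    return math.sqrt(pvariance(signal) / snr)


def add_noise(signal, snr, rng):
    """Return the signal with independent Gaussian noise added."""
    sd = noise_sd_for_snr(signal, snr)
    return [mu + rng.gauss(0.0, sd) for mu in signal]


# Hypothetical noiseless phenotypes E[Y | X] with variance 5.
signal = [0.0, 2.0, 4.0, 6.0]
sd = noise_sd_for_snr(signal, snr=5)
Y = add_noise(signal, snr=5, rng=random.Random(1))
print(sd, Y)
```

For this toy signal the variance is 5, so an SNR of 5 corresponds to unit noise variance.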

2.5 Evaluation criteria

We focus the evaluation on the estimated parameters of the model, specifically the conditional epistasis terms, ${\hat{β}}_{{i, j : i < j}}$, rather than the model’s performance in predicting the fitness phenotypes Y. Given the ground truth of true conditional epistasis between gene i and j, β_{i,j:i<j}, we assess the performance of the model to identify epistasis, i.e., estimated non-zero coefficients ${\hat{β}}_{i, j}$, by computing the number of true positives (TPs), false positives (FPs) and false negatives (FNs). Here, TPs represent the number of gene pairs (i, j) such that β_i,j ≠ 0 and ${\hat{β}}_{i, j} \neq 0$, FPs the number of gene pairs (i, j) such that β_i,j = 0 and ${\hat{β}}_{i, j} \neq 0$, and FNs the number of gene pairs (i, j) such that β_i,j ≠ 0 and ${\hat{β}}_{i, j} = 0$. The performance is then summarised using the following measures

\begin{matrix} precision & = & \frac{TP}{TP + FP} \\ recall & = & \frac{TP}{TP + FN} \\ F1 & = & 2 \frac{precision \times recall}{precision + recall} \end{matrix}
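These measures can be computed from the sets of truly and estimated non-zero gene pairs. The following Python sketch is illustrative; the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example rather than one stated in the text.

```python
import math


def selection_metrics(true_pairs, estimated_pairs):
    """Precision, recall and F1 for sets of gene pairs with non-zero effects."""
    true_pairs, estimated_pairs = set(true_pairs), set(estimated_pairs)
    tp = len(true_pairs & estimated_pairs)
    fp = len(estimated_pairs - true_pairs)
    fn = len(true_pairs - estimated_pairs)
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical ground truth and estimate: one TP, one FP, two FNs.
truth = {(0, 1), (0, 2), (1, 2)}
estimate = {(0, 1), (2, 3)}
precision, recall, f1 = selection_metrics(truth, estimate)
print(precision, recall, f1)
```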

Furthermore, we investigate whether estimates ${\hat{β}}_{i, j}$ have the same sign as the ground truth conditional epistasis and we quantify the deviation of the magnitude from the truth. Where applicable, we also evaluate the effect of selection of only those β_i,j which significantly deviate from zero on the model’s performance.

3 Results

First, we evaluate the proposed approach to estimating epistatic effects from off-target perturbations on simulated data. The approach depends on a model able to detect non-zero pairwise interactions (Fig 1). Here, we evaluate the approach using two such alternative models, glinternet and xyz.

We evaluate the ability of both xyz and glinternet to identify epistasis between pairs of genes from RNAi screens on simulated data with p = 100 genes and n = 1000 siRNAs. Only for xyz, we also test larger data sets, with p = 1000 and n = 10000. We use off-target information from real siRNAs and investigate the performance for varying signal-to-noise ratios, number of true interactions, number of observations per double knockdown, and effect sizes for epistasis.

We perform a separate set of tests where we specifically assess the performance of the two methods to identify synthetic lethal interactions, the strongest negative interactions. For this purpose, we simulate a separate data set that contains additional synthetic lethal pairs of genes. In this test, we attempt to identify only lethal interactions using xyz and glinternet, given increasingly large numbers of genes.

3.1 Identification of epistasis under varying conditions

Both xyz and glinternet are tested on a series of small simulated data sets. For each combination of parameters q ∈ {5, 20, 50, 100}, r ∈ {0, 20, 50, 100} and s ∈ {2, 5, 10}, controlling the number of true interactions, the number of additional main effects, and the SNRs of the fitness phenotypes, respectively, we sample 50 independent data sets. xyz is tested on a series of larger data sets, with parameters q ∈ {50, 200, 500, 1000}, r ∈ {0, 200, 500, 1000} and s ∈ {2, 5, 10}. Only 10 independent data sets are sampled in these cases. Each data set consists of the perturbation matrix X, phenotypes Y, true conditional epistasis β_i,j and main effects β_i.

The distribution of the number of observations for pairwise knockdowns of gene i and j is shown in Fig 13 in S1 Appendix for an exemplary perturbation matrix X. While only a few genes have many observations, 87% of gene pairs are simultaneously perturbed by at least one siRNA. Note that the distribution seen in Fig 13 in S1 Appendix is similar for both p = 100 and p = 1,000 genes. We also find that the number of additional main effects has relatively little impact on detecting interactions (Appendix B in S1 Appendix), and this value is kept constant during our tests. We select only estimates ${\hat{β}}_{i, j}$ with a magnitude significantly different from zero (q-value < 0.05). This significantly improves precision, at a slight cost to recall, using both glinternet and xyz (Fig 2).

Fig 2 — Top and bottom panels depict gain of precision and loss of recall, respectively. (a) `glinternet`; (b) `xyz`.

3.1.1 Number of double knockdowns per gene pair

We fix the number of additional main effects to 20 and investigate performance with respect to the number of double knockdowns per epistatic gene pair, i.e. siRNAs that target both genes (Fig 3). The results are largely similar for both xyz and glinternet. As expected, for increasing numbers of observations, we observe an increase in precision and recall, with a steeper increase of precision compared to recall, and decreased performance for higher numbers of true interactions. The number of true epistatic gene pairs primarily affects recall, which decreases for higher numbers of true non-zero β_i,j. For gene pairs with more than 80 observations of the double knockdown, glinternet shows strong performance with F1 values between 0.68 and 0.9 across all tested numbers of true interactions and an SNR larger than or equal to 5 (Fig 3a).

Fig 3 — The number of additional main effects not overlapping with the set of interacting genes is fixed to 20. Results using (a) `glinternet` and (b) `xyz`.

xyz shows significantly improved performance for gene pairs with more than 40 observations, with F1 values almost all above 0.25. Accuracy is highest when there are few true interactions, with F1 > 0.5 when there are only 5 such effects (Fig 3b).

The number of times each pair of genes is observed is shown in Fig 4. We see that in the large simulation, in which all parameters are multiplied by ten, the number of observations of each pair of genes is similarly scaled. As a result, the overall distribution is similar to the smaller simulation.

3.1.2 Epistatic effect size

We observe that, for both xyz and glinternet, the performance of the model increases with the magnitude of the conditional epistasis between pairs of genes, |β_i,j| (Fig 5). Both for negative and positive epistasis, recall and precision steeply increase with increasing effect size. For pairs of genes with |β_i,j| > 1 and SNRs ≥ 10, the model performs favourably with F1 values of 0.6 and higher in glinternet, and at least 0.25 in xyz. Overall performance also marginally improves for glinternet at SNR = 5, but no clear effect is seen for xyz or SNR = 10. With both xyz and glinternet, we observe exceptions to the general pattern of the overall V-shape for precision and recall, where strongly negative and positive epistasis and weak epistasis lead to high and low performance of the model, respectively. This effect can be explained by the fact that, after the significance test, an extremely small number of interactions are reported in these ranges (most often only one), with no false positives. The fact that the model’s performance notably decreases for small effect sizes around zero explains why we observe a trend of decreasing performance for increasing numbers of true interactions, when we average over all effect sizes. This is because sampling true epistatic effect sizes from N(0, 2) for increasing numbers of true interactions increases the fraction of interactions with small effects around zero.

Notably, we can see in Fig 5b that even when the overall performance is poor, xyz is still able to find a small number of strong interactions relatively accurately. This is particularly promising, since synthetic lethal pairs would be such interactions.

3.1.3 Direction

We evaluate the ability of each method to distinguish between negative and positive epistasis among epistatic gene pairs identified as true positives (Fig 6). For both glinternet and xyz, the fraction of incorrect estimates of direction (positive vs. negative) is higher for decreasing effect size and increasing number of true interactions. For epistatic effects with an absolute value > 1, we observe at most 3% incorrect predictions with glinternet, and 8% with xyz. We observe at most 9% and 15% incorrect predictions for smaller effect sizes for glinternet and xyz, respectively. Furthermore, we observe that increasing SNRs lead to a subtle decrease of incorrectly predicted direction.
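The direction measure can be sketched as the share of true positives whose estimated sign disagrees with the true sign. The Python function below is an illustration with hypothetical names and values; returning 0.0 when there are no true positives is a convention assumed for this example.

```python
def fraction_wrong_sign(true_effects, estimates):
    """Share of true-positive pairs whose estimated sign differs from the truth.

    Both arguments are dicts mapping (i, j) -> coefficient.
    """
    true_positives = [pair for pair, est in estimates.items()
                      if est != 0 and true_effects.get(pair, 0) != 0]
    if not true_positives:
        return 0.0  # assumed convention when nothing was correctly selected
    wrong = sum(1 for pair in true_positives
                if (estimates[pair] > 0) != (true_effects[pair] > 0))
    return wrong / len(true_positives)


# Hypothetical example: two true positives, one with the wrong sign.
truth = {(0, 1): 1.5, (0, 2): -2.0, (1, 2): 0.4}
estimate = {(0, 1): 1.1, (0, 2): 0.3, (1, 3): 0.9}
wrong_share = fraction_wrong_sign(truth, estimate)
print(wrong_share)
```

False positives such as the pair (1, 3) are excluded, since a sign comparison is only meaningful where a true effect exists.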

Fig 6 — The fraction of incorrectly identified signs between true and estimated epistasis for (a) `glinternet` and (b) `xyz`.

3.1.4 Magnitude

We evaluate the deviation of the magnitude of estimates for epistasis from the ground truth as a function of observed double knockdowns (Fig 7). The deviation in magnitude is computed as $\frac{| β_{i, j} | - | {\hat{β}}_{i, j} |}{| β_{i, j} |}$, i.e. the deviation relative to the magnitude of the true epistasis. We observe that across varying numbers of observations the model predicts the magnitude of epistasis between pairs of genes with high accuracy using both xyz and glinternet.
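The deviation measure above is straightforward to compute. In this Python sketch, with a hypothetical function name and values, positive results indicate that the magnitude is underestimated and negative results that it is overestimated.

```python
def relative_magnitude_deviation(beta_true, beta_hat):
    """(|beta| - |beta_hat|) / |beta| for a true non-zero effect."""
    if beta_true == 0:
        raise ValueError("the deviation is undefined for a true effect of zero")
    return (abs(beta_true) - abs(beta_hat)) / abs(beta_true)


under = relative_magnitude_deviation(2.0, 1.5)    # estimate too small
over = relative_magnitude_deviation(-2.0, -3.0)   # estimate too large
print(under, over)
```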

3.2 Scalability

Running glinternet until it has converged takes a prohibitively long time on larger data sets. While we are able to run our p = 100, n = 1,000 simulations in slightly under two minutes, increasing to p = 1,000, n = 10,000 takes over two days using ten cores. Although using more threads is possible, the running time is already dominated by single-threaded components with ten cores. The multi-threaded performance is therefore limited by Amdahl’s law to approximately the performance we see here. Since fitting with small lambda values takes the majority of the time, we can improve this by changing the minimum value of lambda that gets used. Adjusting this from $\frac{lambdaMax}{100}$ to $\frac{lambdaMax}{4}$, and fitting only five lambdas in this range rather than fifty, glinternet still takes over an hour. This makes the repeated simulations from subsection 3.1 impractical at a larger scale with glinternet, although we do investigate some larger data sets in subsection 3.3.
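The limit imposed by Amdahl's law can be illustrated numerically. In the Python sketch below, the parallel fraction of 0.9 is a hypothetical value chosen for illustration and is not a measurement of glinternet; the function name is likewise illustrative.

```python
import math


def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: speedup when only part of the work is parallelised."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


f = 0.9  # hypothetical parallel fraction, for illustration only
with_ten = amdahl_speedup(f, 10)
with_thousand = amdahl_speedup(f, 1000)
upper_bound = 1.0 / (1.0 - f)   # limit as the thread count grows
print(with_ten, with_thousand, upper_bound)
```

With this hypothetical fraction, ten threads already achieve more than half of the maximum possible speedup, so adding further threads yields diminishing returns.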

It should also be noted that this limits the scale of real-world data that can be analysed using glinternet. While some improvements are possible by disabling cross-validation or setting a high lower limit for lambda, we do not consider analysis of over 1,000 genes to be practical. Further work needs to be done to develop regression methods for genome-scale data.
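The Amdahl's Law bound referred to above can be made concrete with a short sketch. The parallel fraction of 0.8 used here is purely hypothetical, since the text does not report how much of glinternet's run time is parallelisable.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Hypothetical parallel fraction; not a measured property of glinternet.
PARALLEL_FRACTION = 0.8
for cores in (1, 10, 100):
    print(cores, round(amdahl_speedup(PARALLEL_FRACTION, cores), 2))
```

Under this assumed fraction the speedup can never exceed 1 / (1 - 0.8) = 5, however many cores are added. This is the sense in which single-threaded components cap the benefit of further threads.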

Since xyz has significantly shorter run time than glinternet, here we more thoroughly investigate performance on larger data sets. Repeating the earlier simulations with every parameter increased by a factor of 10 (Fig 8), we find that the overall trends remain the same. The fraction of incorrectly identified signs is omitted, as in this test there are no such results.

Fig 8 — (a) number of observations of double knockdown. (b) Precision/recall/f1 by actual effect strength.

There is a significant drop in both precision and recall, and now only effects with a magnitude greater than 3 are found a significant amount of the time (Fig 8b).

3.3 Synthetic lethal pairs

Synthetic lethal pairs are of particular interest, and given that xyz is able to somewhat reliably find extremely strong interactions, it is natural to ask whether it can be used to quickly find lethal pairs, despite its poor performance on weaker interactions. We fix the number of main effects to 10, and simulate 10,000 siRNAs on 1,000 genes. Synthetic lethal pairs are created as interaction effects of magnitude −1000 (log scale). This rather extreme assumption makes synthetic lethals the best possible case for detection via regression. In practice, synthetic lethal detection accuracy would likely be somewhere between what we see here and that of a small negative effect. Since lethal pairs often do not have strong main effects (i.e. do not follow the strong hierarchy assumption), the components of the interaction are not used as main effects in this case.
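To illustrate how a lethal pair without main effects enters the simulated phenotype, the following is a minimal sketch. It draws knocked-down genes uniformly at random, which is a simplifying assumption: the simulations in the paper sample perturbations from real siRNA libraries. The function name, the number of targets per siRNA and the noise level are all hypothetical.

```python
import random


def simulate_screen(n_sirnas, n_genes, main_effects, interactions,
                    targets_per_sirna=5, noise_sd=1.0, seed=0):
    """Simulate binary knockdowns and log-scale fitness values.

    main_effects maps gene -> beta_i; interactions maps (i, j) -> beta_ij.
    Returns a list of knocked-down gene sets and a list of fitness values.
    """
    rng = random.Random(seed)
    knockdowns, fitness = [], []
    for _ in range(n_sirnas):
        # Assumption: each siRNA hits a uniform random subset of genes.
        hit = set(rng.sample(range(n_genes), targets_per_sirna))
        y = sum(beta for gene, beta in main_effects.items() if gene in hit)
        y += sum(beta for (i, j), beta in interactions.items()
                 if i in hit and j in hit)
        knockdowns.append(hit)
        fitness.append(y + rng.gauss(0.0, noise_sd))
    return knockdowns, fitness


LETHAL_EFFECT = -1000.0            # interaction effect used in the text (log scale)
main = {0: 1.5, 1: -0.5}           # hypothetical main effects
pairs = {(2, 3): LETHAL_EFFECT}    # lethal pair; genes 2 and 3 have no main effect
X, y = simulate_screen(200, 20, main, pairs)
```

Because genes 2 and 3 carry no main effect, the only signal for the pair is the large negative fitness observed when both are knocked down by the same siRNA.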

Increasing the number of lethal interactions significantly reduces recall, but does not have a clear effect on precision. At this scale, xyz is often able to correctly identify some lethal interactions (Fig 9), particularly when there are only a few to find.

Fig 9 — Neither side of the lethal interactions are used as main effects, and as far as lethal interactions are concerned, there is no hierarchy present.

3.3.1 Synthetic lethality detection in larger matrices

While we could not run a significant number of tests at this scale using glinternet, we could investigate how well its accuracy scales compared to xyz. To do this, we simulated sets of up to p = 4000 genes, and measured the performance of both xyz and glinternet. In this case, both to avoid allocating more elements to a matrix than R allows, and to keep the run time of glinternet low, only n = 2 × p siRNAs are simulated. The ratio of siRNAs, genes, main effects, interactions, and lethals is fixed to: n = 200 siRNAs, p = 100 genes, b_i = 1 main effect, b_ij = 20 interaction effects, l = 5 lethal interactions. Data sets are then generated with these values multiplied by 5, 10, 20, and 40. As in the previous simulation, components of lethal interactions are not added as main effects. The strong hierarchy assumption is not valid in this case.

Interactions are then found with both xyz and glinternet. Here we focus specifically on synthetic lethal detection, and only correct lethal pairs are considered true positives. Any other pair (including a true interaction that is not a lethal) is considered a false positive.
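The scoring rule described above, in which a true but non-lethal interaction still counts as a false positive, can be sketched as follows. The function name and the gene indices are hypothetical.

```python
def lethal_detection_scores(predicted_pairs, lethal_pairs):
    """Precision, recall and F1 when only true lethal pairs count as hits."""
    # Unordered pairs: (i, j) and (j, i) refer to the same interaction.
    predicted = {frozenset(p) for p in predicted_pairs}
    lethal = {frozenset(p) for p in lethal_pairs}
    tp = len(predicted & lethal)
    fp = len(predicted - lethal)   # includes true non-lethal interactions
    fn = len(lethal - predicted)
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if lethal else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


# Hypothetical gene indices: one of three reported pairs is a true lethal.
found = [(1, 2), (3, 4), (5, 6)]
lethals = [(2, 1), (7, 8)]
precision, recall, f1 = lethal_detection_scores(found, lethals)
```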

We can see in Fig 10a that precision with glinternet remains fairly consistent as p increases. There is a roughly proportional reduction in recall as the number of lethal interactions increases. After a slight increase from 500 to 1000 genes, the actual number of significant interactions found remains fairly consistent. Beyond p = 2000, we found that xyz typically fails to find any of the lethal pairs (Fig 10b).

Fig 11 shows that neither xyz nor glinternet quite demonstrates a linear run time, but the run time for glinternet increases sharply beyond p = 2000. It is possible that this is simply the result of less efficient cache use with larger data, but it is nonetheless worth noting.

3.4 Violations to model assumptions

For the regularised regression model (Eq 4) we assume strong hierarchy (Eq 6) between main effects β_i and interaction terms β_i,j in order to reduce the search space of all possible non-zero coefficients $p + \binom{p}{2}$ during inference. We refer the reader to [27], where Lim and Hastie show how violations to this assumption affect the performance. For instance, the performance of the model is evaluated when the ground truth only obeys weak hierarchy, i.e. only one main effect present, no hierarchy, or anti-hierarchy. Additionally, approximately 2.5% of simulations produced no interactions using xyz, because the estimated interaction frequency of non-interacting pairs was too low. These were fairly evenly distributed across all combinations of parameters (Fig 12), and are not believed to have substantially affected the results.
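To give a sense of scale for the search space mentioned above, the number of candidate coefficients (main effects plus pairwise interactions) can be computed directly for the gene counts used in the simulations. The function name is illustrative only.

```python
from math import comb


def n_candidate_coefficients(p):
    """Number of main effects plus pairwise interactions among p genes."""
    return p + comb(p, 2)


for p in (100, 1000, 4000):
    print(p, n_candidate_coefficients(p))
```

The count grows quadratically in p, from 5,050 candidates at p = 100 to roughly eight million at p = 4000. This is why a constraint such as strong hierarchy is attractive during inference.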

3.5 Summary recommendation

After simulating siRNA knockdown data sets of various sizes and under various conditions, and attempting to reconstruct the interacting pairs using both xyz and glinternet, we arrive at the following recommendations. For data sets containing fewer than 4,000 genes (assuming between 2 and 10 experiments per gene), we recommend using glinternet to find interactions. Where glinternet would have a prohibitively long run time (data sets larger than those mentioned above), xyz continues to run quickly, and may still identify some useful results (Fig 9), particularly when interactions are observed a large number of times in the data and have strong effects (sections 3.1.1 and 3.1.2). When one expects a small number of significant interactions, increasing the number of projections beyond $\sqrt{n}$ may improve performance here (see Appendix D in S1 Appendix).

3.6 Effects in real data

Following the recommendation we have arrived at in subsection 3.5, we apply glinternet (followed by a linear regression analysis) to estimate epistatic effects from a real data set. We use the perturbation data from [24], containing siRNA screens targeting kinases in the presence of five bacterial pathogens and two viruses, and apply the routine as described in subsection 2.2 to identify pairwise kinase-kinase interactions. Specifically, we restrict the data to siRNAs that target kinases from the Qiagen Human Kinase siRNA Set V4.1, and the off-target effects within this set, resulting in an input matrix containing 11214 perturbations × 667 genes. Using $f = \log_{2}\left(\frac{\text{Cells after}}{\text{Cells before}}\right)$ as a fitness measure, we found 1662 effects, 116 of which had a p-value less than 0.05. Since we have assumed that perturbations are binary in our simulations, we continue to do so here. As a result, all non-zero predicted off-target effects are given a value of 1. The ten most significant predicted effects are shown in Table 1 (the full set of results, significant or otherwise, can be found at https://github.com/bioDS/xyz-simulation/blob/master/real_data_results_sorted.csv). Interestingly, the most significantly associated pair of genes, CDK5R1 and RPS6KA2, are both related to a common pathway. Specifically, CDK5R1 activates CDK5, which, along with RPS6KA2, is part of the IL-6 signalling pathway [38]. Searching both the ConsensusPathDB database [39] and the STRING database [40] for relations between the found pairs, we find that a number of the interactions suggested here could be the result of existing known interactions. For each of the identified pairs of genes, we searched for common neighbours (a third gene with which both interact), shared pathways, and whether the produced proteins are found in the same protein complexes, and found the following known relationships:

Table 1. Ten most significant predicted effects of siRNA perturbation screens, targeting all human kinases.

Gene i	Gene j	Type	Combined Effect	P-value	i effect	j effect
CDK5R1	RPS6KA2	interaction	12.52	0.0047	1.71	-2.32
RIPK4	GRK3	interaction	-3.24	0.0056	-24.5	1.87
PHKB	GUK1	interaction	-7.47	0.0061	6.23	-28.4
MAP2K6	UCK1	interaction	-40.89	0.0094	13.8	-22.6
TNIK	PANK4	interaction	-37.41	0.0115	21.3	5.21
RPS6KB2	TTK	interaction	172.04	0.0118	0.5	-20.4
MAPK4	TRPM7	interaction	9.49	0.0120	8.46	16.3
HIPK1	NUAK2	interaction	-13.17	0.0126	18.1	29.5
CDK19	NA	main	3.80	0.0136	3.80
C17orf75	MAPK8IP3	interaction	21.74	0.0136	5.91	20.4


CDK5R1 and RPS6KA2 share a common neighbour, and are present in four of the same enriched pathway-based sets. TTK and RPS6KA2 share nine common neighbours. RIPK4 and GRK3 share one neighbour, and homologs were found interacting in other species. TNIK and PANK4 share one neighbour, as do MAPK4 and TRPM7, MAP2K6 and UCK1, and HIPK1 and NUAK2. As we could not locate the other identified pairs in the database, we hypothesise that they might constitute novel interactions.

Of the interactions present in Table 1, we see that HIPK1 and NUAK2, TNIK and PANK4, and MAP2K6 and UCK1 are predicted to have negative epistatic effects, and may be promising synthetic lethal candidates.
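The two preprocessing choices described for the real data set, the log-ratio fitness measure and the binarisation of predicted off-target effects, can be sketched as follows. The function names, cell counts and off-target strengths are hypothetical and are not taken from the authors' pipeline.

```python
from math import log2


def log2_fitness(cells_before, cells_after):
    """Fitness measure f = log2(cells after / cells before)."""
    if cells_before <= 0 or cells_after <= 0:
        raise ValueError("cell counts must be positive")
    return log2(cells_after / cells_before)


def binarise(predicted_effects):
    """Map every non-zero predicted knockdown strength to 1, and zero to 0."""
    return [[1 if value != 0 else 0 for value in row]
            for row in predicted_effects]


# Hypothetical values: a halving of the cell count gives a fitness of -1.
f = log2_fitness(cells_before=1000, cells_after=500)
X = binarise([[0.0, 0.35, 1.0],
              [0.8, 0.0, 0.0]])
```

Binarising discards the predicted strength of each off-target effect, which keeps the real-data analysis consistent with the binary perturbations assumed in the simulations.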

For comparison we also fit a linear model including all genes, but no interactions. Comparing the R² values for each, we find that individual gene effects explain ≈15% of the variance (R² = 0.150). Including the interactions chosen by glinternet, and removing the main effects it sets to zero, we have R² = 0.392, more than doubling the fraction of explained variance. The adjusted R² is also significantly higher for the pairwise model, 0.286 as opposed to 0.096, indicating that the additional interaction variables are contributing significantly more than random. Moreover, the Akaike Information Criterion (AIC) values indicate the pairwise model is more informative, with an AIC value of −11607 as opposed to the main-effect-only model's −9853. This highlights the importance of accounting for interactions in large-scale genotype-phenotype analyses, and the relevance of bioinformatic tools with this capability.
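The relationship between R² and adjusted R² in this comparison can be checked with the standard formula. The predictor counts below are assumptions, because the text does not state them explicitly: 667 for the main-effects model (one per gene) and 1662 for the pairwise model (all effects found). The number of observations, 11214, is taken from the input matrix.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 11214  # perturbations in the kinase data set

# Assumption: one predictor per gene in the main-effects-only model.
main_only = adjusted_r2(0.150, N_OBS, 667)

# Assumption: all 1662 effects found are predictors in the pairwise model.
pairwise = adjusted_r2(0.392, N_OBS, 1662)

print(round(main_only, 3), round(pairwise, 3))
```

Under these assumed predictor counts the formula gives values close to the reported 0.096 and 0.286. This is consistent with the figures in the text but does not confirm how the models were specified.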

4 Discussion

To the best of our knowledge, the presented model is the first approach that leverages the combinatorial nature of RNAi knockdown data resulting from sequence-dependent off-target effects for the large-scale prediction of epistasis between pairs of genes. To do this, we take the second-order approximation of the fitness landscape, including only individual and pairwise effects, and attempt to infer the parameters of this model. Since glinternet is able to find pairwise interactions among p = 1,000 genes, we speculate that searching for three-way interactions is feasible among $\sqrt[3]{1{,}000^{2}} = 100$ genes. We are not aware of any software currently able to do this, however.
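The reasoning behind this estimate can be spelled out as a rough count of candidate interactions. This is a sketch that assumes feasibility is governed mainly by the number of candidate terms. Matching the orders of growth, $p_3^{3} \approx p_2^{2}$ with $p_2 = 1{,}000$, gives the stated figure, and the exact binomial counts are of the same order of magnitude:

$$
p_3 = \sqrt[3]{p_2^{2}} = \sqrt[3]{10^{6}} = 100, \qquad
\binom{1000}{2} = 499{,}500, \qquad
\binom{100}{3} = 161{,}700 .
$$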

For the majority of our tests, we simulate the presence of a strong hierarchy. This constraint would imply that for the inference of non-zero epistatic effects between gene i and j, β_i,j, we penalise cases where the main effects for both single genes, β_i and β_j, are zero. This constraint significantly decreases the complexity of the search space of interactions. However, in biology there are many examples of epistasis where the marginal effects of individual genes are very small, for instance if both genes redundantly execute the same function within the cell [41]. Costanzo et al. [13] found in their study of experimental double knockouts in yeast that single mutants with decreasing fitness phenotypes tended to exhibit an increasing number of genetic interactions. This observation is reassuring for glinternet, which can pick up the interaction as long as the true single-mutant effects are not exactly zero. Moreover, Lim and Hastie showed in a simulation study that the model is in fact flexible enough to also identify pairwise interactions violating the strong hierarchy constraint [27]. For the detection of strong interactions, specifically synthetic lethal pairs, we have also demonstrated that the strong hierarchy constraint is not required.

In a simulation study, we sampled perturbations for n = 1000 siRNAs and p = 100 genes, and n = 10000 siRNAs with p = 1000 genes. As a consequence of high-throughput genome-wide screening platforms, the setting of n = 10 × p, i.e. ten perturbations with different siRNAs per gene, is realistic even for higher order organisms with tens of thousands of genes [21, 24]. Sampling the perturbations directly from commercially available RNAi libraries allows us to translate results from the simulation study to applications on real data. We observe that increasing SNRs, as expected, results in an overall increase of the number of successfully identified gene pairs with true epistasis.

Nevertheless, we found that even for a moderate SNR of only 2, the model identifies interactions with acceptable performance using glinternet (F1 > 0.5 for 50 true interactions), when we observed each double knockdown over 40 times (Fig 3a) or the effect size of epistasis is larger than 1, i.e. |β_i,j| > 1 (Fig 5a). For an SNR of 5 and across all tested numbers of additional gene pairs and epistatic effect sizes, the performance of the model is approximately constant at around F1 = 0.5, independent of the number of true epistatic gene pairs (Appendix B in S1 Appendix).

Performance in our simulations also suggests that xyz is unable to accurately identify interactions in large data sets. Although xyz has a consistently short run time, and appears capable of running on genome-scale data, we see a significant drop in all other performance measures beyond p = 1000 genes.

The results when using glinternet, however, suggest that the general approach is able to accurately identify pairwise epistasis from large-scale RNAi data sets, given that the SNR of measured fitness phenotypes is larger than 2 and the effect size of epistasis is larger than 1. It is challenging to compare the performance of these models to approaches that estimate genetic interactions from other data, such as double knockout experiments [13], due to different scales of the epistatic effect size. However, the high precision of glinternet seems quite competitive. Moreover, our simulations demonstrated that if true epistatic effects between pairs of genes are identified, the model identifies both the direction of epistasis (positive and negative) and the magnitude of the epistatic effect with high accuracy (Figs 6 and 7).

In detecting lethal interactions specifically, the high precision of glinternet after testing for significant deviations is particularly promising. Using this as a method to detect likely synthetic lethal interactions from RNAi data sets, we could propose candidates for further investigation as anticancer drug targets [9, 12]. While the run time may prevent glinternet from being used as such a method in genome-scale applications, we can recommend it for use with smaller data sets, or where the number of potential interactions can be significantly reduced prior to running glinternet. As the precision does not appear to suffer with larger input, only the run time, we believe combining linear regression with a perturbation matrix is a promising method for further investigation, and work to improve the performance sufficiently for use in genome-scale applications is ongoing.

Demonstrating our method on a set of kinase siRNA knockdowns, we find a number of plausible proposed effects (Table 1). This set is sufficiently small that the true positives may be found experimentally by testing all ≈1.4 million gene pairs (as in [13]). Alternatively, the most likely candidates may be analysed with targeted sequencing and fitness measurements (as in [42]) or clinical trials (as in [43]). It is likely that a significant number of false positives are present among the proposed interactions, and we consider such verification to be an essential second step in discovering true epistatic effects.

While filtering results by p-value does significantly increase accuracy (Fig 2), the p-values we use do not account for the prior variable selection (using glinternet or xyz), and may therefore be biased. Recent work is able to overcome this limitation with regard to lasso regression in some cases [44, 45]. However, existing implementations [45, 46] of these methods require storing the full interaction matrix X′. For non-trivial numbers of interactions this does not typically fit in memory, and we cannot work with it directly. Moreover, the procedure from [47] is not applicable when p ≫ n unless the variance can be efficiently estimated. Given recent progress in variance estimation for lasso regression [48] it may be possible to implement unbiased p-value calculations for lasso regression at this scale, and we suggest this as one possible improvement for future work.

Finally, it is worth noting that this approach is not limited to siRNA perturbation matrices, or to synthetic lethal detection. Any method of suppressing gene expression, combined with an affected proxy for fitness, could be used to find likely candidates for epistasis.

Supporting information

S1 Appendix

(PDF)


Data Availability

Everything necessary to reproduce our results, including data, has been uploaded to Github: https://github.com/bioDS/xyz-simulation.

Funding Statement

This work has partially been funded by SystemsX.ch, the Swiss Initiative in Systems Biology, under IPhD grant 2009/025 and RTD grants 51RT-0 126008 (InfectX) and 51RTP0 151029 (TargetInfectX), evaluated by the Swiss National Science Foundation. AG and KE acknowledge support from the Royal Society Te Apārangi through a Rutherford Discovery Fellowship (RDF-UOO1702) awarded to AG. AG and KE were partially supported by Ministry of Business, Innovation, and Employment of New Zealand through an Endeavour Smart Ideas grant (UOOX1912) and a Data Science Programmes grant (UOAX1932).

References

1. Wright S. The Roles of Mutation, Inbreeding, Crossbreeding, and Selection in Evolution. Proc 6th Int Cong Genet. 1932;1:356–366.
2. Cordell HJ. Epistasis: What It Means, What It Doesn’t Mean, and Statistical Methods to Detect It in Humans. Human Molecular Genetics. 2002;11(20):2463–2468. doi: 10.1093/hmg/11.20.2463
3. Orr HA. Fitness and Its Role in Evolutionary Genetics. Nature Reviews Genetics. 2009;10(8):531–539. doi: 10.1038/nrg2603
4. de Visser JAGM, Cooper TF, Elena SF. The Causes of Epistasis. Proceedings of the Royal Society B: Biological Sciences. 2011;278(1725):3617–3624. doi: 10.1098/rspb.2011.1537
5. Kaelin WG. The Concept of Synthetic Lethality in the Context of Anticancer Therapy. Nature Reviews Cancer. 2005;5(9):689–698. doi: 10.1038/nrc1691
6. Hanahan D, Weinberg RA. Hallmarks of Cancer: The next Generation. Cell. 2011. doi: 10.1016/j.cell.2011.02.013
7. Force T, Kolaja KL. Cardiotoxicity of Kinase Inhibitors: The Prediction and Translation of Preclinical Models to Clinical Outcomes. Nature Reviews Drug Discovery. 2011. doi: 10.1038/nrd3252
8. Holohan C, Van Schaeybroeck S, Longley DB, Johnston PG. Cancer Drug Resistance: An Evolving Paradigm. Nature Reviews Cancer. 2013. doi: 10.1038/nrc3599
9. Ashworth A, Lord CJ, Reis-Filho JS. Genetic Interactions in Cancer Progression and Treatment. Cell. 2011;145(1):30–38. doi: 10.1016/j.cell.2011.03.020
10. Jerby-Arnon L, Pfetzer N, Waldman YY, McGarry L, James D, Shanks E, et al. Predicting Cancer-Specific Vulnerability via Data-Driven Detection of Synthetic Lethality. Cell. 2014;158(5):1199–1209. doi: 10.1016/j.cell.2014.07.027
11. O’Neil NJ, Bailey ML, Hieter P. Synthetic Lethality and Cancer. Nature Reviews Genetics. 2017;18(10):613–623. doi: 10.1038/nrg.2017.47
12. Chan DA, Giaccia AJ. Harnessing Synthetic Lethal Interactions in Anticancer Drug Discovery. Nature Reviews Drug Discovery. 2011;10(5):351–364. doi: 10.1038/nrd3374
13. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, et al. The Genetic Landscape of a Cell. Science. 2010. doi: 10.1126/science.1180823
14. Poelwijk FJ, Socolich M, Ranganathan R. Learning the Pattern of Epistasis Linking Genotype and Phenotype in a Protein. Nature Communications. 2019;10(1):4213. doi: 10.1038/s41467-019-12130-8
15. Otwinowski J, Nemenman I. Genotype to Phenotype Mapping and the Fitness Landscape of the E. Coli Lac Promoter. PLoS ONE. 2013;8(5):e61570. doi: 10.1371/journal.pone.0061570
16. Steckel M, Molina-Arcas M, Weigelt B, Marani M, Warne PH, Kuznetsov H, et al. Determination of Synthetic Lethal Interactions in KRAS Oncogene-Dependent Cancer Cells Reveals Novel Therapeutic Targeting Strategies. Cell Research. 2012. doi: 10.1038/cr.2012.82
17. Laufer C, Fischer B, Billmann M, Huber W, Boutros M. Mapping Genetic Interactions in Human Cancer Cells with RNAi and Multiparametric Phenotyping. Nat Meth. 2013. doi: 10.1038/nmeth.2436
18. McDonald ER, de Weck A, Schlabach MR, Billy E, Mavrakis KJ, Hoffman GR, et al. Project DRIVE: A Compendium of Cancer Dependencies and Synthetic Lethal Relationships Uncovered by Large-Scale, Deep RNAi Screening. Cell. 2017;170(3):577–592.e10. doi: 10.1016/j.cell.2017.07.005
19. DepMap B. DepMap 20Q2 Public. 2020. doi: 10.6084/m9.figshare.12280541.v4
20. Jackson AL, Burchard J, Schelter J, Chau BN, Cleary M, Lim L, et al. Widespread siRNA “off-Target” Transcript Silencing Mediated by Seed Region Sequence Complementarity. RNA. 2006;12(7):1179–1187. doi: 10.1261/rna.25706
21. Schmich F, Szczurek E, Kreibich S, Dilling S, Andritschke D, Casanova A, et al. gespeR: A Statistical Model for Deconvoluting off-Target-Confounded RNA Interference Screens. Genome Biology. 2015;16(1):220. doi: 10.1186/s13059-015-0783-1
22. Srivatsa S, Kuipers J, Schmich F, Eicher S, Emmenlauer M, Dehio C, et al. Improved Pathway Reconstruction from RNA Interference Screens by Exploiting Off-Target Effects. Bioinformatics. 2018;34(13):i519–i527. doi: 10.1093/bioinformatics/bty240
23. Tiuryn J, Szczurek E. Learning Signaling Networks from Combinatorial Perturbations by Exploiting siRNA Off-Target Effects. Bioinformatics. 2019;35(14):i605–i614. doi: 10.1093/bioinformatics/btz334
24. Rämö P, Drewek A, Arrieumerlou C, Beerenwinkel N, Ben-Tekaya H, Cardel B, et al. Simultaneous Analysis of Large-Scale RNAi Screens for Pathogen Entry. BMC Genomics. 2014;15(1):1162. doi: 10.1186/1471-2164-15-1162
25. Schultz N, Marenstein DR, De Angelis DA, Wang WQ, Nelander S, Jacobsen A, et al. Off-Target Effects Dominate a Large-Scale RNAi Screen for Modulators of the TGF-β Pathway and Reveal microRNA Regulation of TGFBR2. Silence. 2011. doi: 10.1186/1758-907X-2-3
26. Tang W, Dodge M, Gundapaneni D, Michnoff C, Roth M, Lum L. A Genome-Wide RNAi Screen for Wnt/Beta-Catenin Pathway Components Identifies Unexpected Roles for TCF Transcription Factors in Cancer. Proceedings of The National Academy Of Sciences Of The United States Of America. 2008. doi: 10.1073/pnas.0804709105
27. Lim M, Hastie T. Learning Interactions via Hierarchical Group-Lasso Regularization. Journal of Computational and Graphical Statistics. 2015;24(3):627–654. doi: 10.1080/10618600.2014.938812
28. Thanei GA, Meinshausen N, Shah RD. The Xyz Algorithm for Fast Interaction Search in High-Dimensional Data. Journal of Machine Learning Research. 2018;19(1):1343–1384.
29. Beerenwinkel N, Pachter L, Sturmfels B. Epistasis and Shapes of Fitness Landscapes. Statistica Sinica. 2007.
30. Singh S, Narang AS, Mahato RI. Subcellular Fate and Off-Target Effects of siRNA, shRNA, and miRNA. Pharmaceutical Research. 2011;28(12):2996–3015. doi: 10.1007/s11095-011-0608-1
31. Lewis BP, Shih Ih, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of Mammalian MicroRNA Targets. Cell. 2003;115(7):787–798. doi: 10.1016/S0092-8674(03)01018-3
32. Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods. 1977.
33. Jacob L, Obozinski G, Vert JP. Group Lasso with Overlap and Graph Lasso. In: Proceedings of the 26th annual international conference on machine learning; 2009. p. 433–440.
34. Yuan M, Lin Y. Model Selection and Estimation in Regression with Grouped Variables. J R Stat Soc Series B Stat Methodol. 2006. doi: 10.1111/j.1467-9868.2005.00532.x
35. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(1):289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x
36. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer; New York; 2009.
37. Garcia DM, Baek D, Shin C, Bell GW, Grimson A, Bartel DP. Weak Seed-Pairing Stability and High Target-Site Abundance Decrease the Proficiency of Lsy-6 and Other microRNAs. Nature Structural & Molecular Biology. 2011. doi: 10.1038/nsmb.2115
38. Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GSS, Venugopal AK, et al. NetPath: A Public Resource of Curated Signal Transduction Pathways. Genome Biology. 2010;11(1):R3. doi: 10.1186/gb-2010-11-1-r3
39. Kamburov A, Wierling C, Lehrach H, Herwig R. ConsensusPathDB-a Database for Integrating Human Functional Interaction Networks. Nucleic Acids Research. 2009;37:D623–D628. doi: 10.1093/nar/gkn698
40. Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, et al. The STRING Database in 2017: Quality-Controlled Protein–Protein Association Networks, Made Broadly Accessible. Nucleic Acids Research. 2017;45(D1):D362–D368. doi: 10.1093/nar/gkw937
41. Puchta O, Cseke B, Czaja H, Tollervey D, Sanguinetti G, Kudla G. Network of Epistatic Interactions within a Yeast snoRNA. Science. 2016;352(6287):840–844. doi: 10.1126/science.aaf0965
42. Vogwill T, Kojadinovic M, MacLean RC. Epistasis between Antibiotic Resistance Mutations and Genetic Background Shape the Fitness Effect of Resistance across Species of Pseudomonas. Proceedings of the Royal Society B: Biological Sciences. 2016;283(1830):20160151. doi: 10.1098/rspb.2016.0151
43. van Gaalen FA, Verduijn W, Roelen DL, Böhringer S, Huizinga TWJ, van der Heijde DM, et al. Epistasis between Two HLA Antigens Defines a Subset of Individuals at a Very High Risk for Ankylosing Spondylitis. Annals of the Rheumatic Diseases. 2013;72(6):974–978. doi: 10.1136/annrheumdis-2012-201774
44. Lee JD, Sun DL, Sun Y, Taylor JE. Exact Post-Selection Inference, with Application to the Lasso. The Annals of Statistics. 2016;44(3):907–927. doi: 10.1214/15-AOS1371
45. Tibshirani RJ, Taylor J, Lockhart R, Tibshirani R. Exact Post-Selection Inference for Sequential Regression Procedures. Journal of the American Statistical Association. 2016;111(514):600–620. doi: 10.1080/01621459.2015.1108848
46. Dezeure R, Bühlmann P, Meier L, Meinshausen N. High-Dimensional Inference: Confidence Intervals, p-Values and R-Software Hdi. Statistical Science. 2015;30(4):533–558. doi: 10.1214/15-STS527
47. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A Significance Test for the Lasso. The Annals of Statistics. 2014;42(2):413–468. doi: 10.1214/13-AOS1175
48. Kennedy C, Ward R. Greedy Variance Estimation for the LASSO. Applied Mathematics & Optimization. 2020;82(3):1161–1182. doi: 10.1007/s00245-019-09561-6

PLoS One. 2021 Jul 13;16(7):e0254491. doi: 10.1371/journal.pone.0254491.r001

Author response to previous submission

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

28 Sep 2020

Attachment

Submitted filename: 31-03-2021-perturbations_response_to_plos_cb_referees.pdf


PLoS One. doi: 10.1371/journal.pone.0254491.r002

Decision Letter 0

Ruben Artero

7 May 2021

PONE-D-20-30514

Learning epistatic gene interactions from perturbation screens

PLOS ONE

Dear Dr. Gavryushkin,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers found the method of interest but also raised a number of major limitations that require a detailed response by the authors. Importantly, reviewer 2 expressed concerns about the appropriateness of the statistical analysis.

Please submit your revised manuscript by Jun 20 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Ruben Artero, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

[This work has partially been funded by SystemsX.ch, the Swiss Initiative in Systems Biology, under IPhD grant 2009/025 and RTD grants 51RT-0126008 (InfectX) and 51RTP0151029 (TargetInfectX), evaluated by the Swiss National Science Foundation. We acknowledge support from the Royal Society Te Apārangi through a Rutherford Discovery Fellowship (RDF-UOO1702) awarded to AG. This work was partially supported by Ministry of Business, Innovation, and Employment of New Zealand through an Endeavour Smart Ideas grant (UOOX1912) and a Data Science Programmes grant (UOAX1932).]

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

[Funded: The required information is available in the Acknowledgements section of the manuscript.]

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. Thank you for stating the following in the Financial Disclosure section:

[Funded: The required information is available in the Acknowledgements section of the manuscript.].

We note that one or more of the authors are employed by a commercial company: Novartis Institutes for BioMedical Research

Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

2. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

5. Please remove your figures from within your manuscript file, leaving only the individual TIFF/EPS image files, uploaded separately. These will be automatically included in the reviewers’ PDF.

6. Please ensure that you refer to Figure 14, 17, 18, 19, 20, 21 and 22 in your text as, if accepted, production will need this reference to link the reader to the figure.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper describes a technique for inferring epistatic interactions from siRNA perturbation screens. The main ideas are (1) to treat the off-target effects of each siRNA together with the on-target effects as a complex perturbation of many genes and then (2) to extract epistatic interactions from a collection of the complex perturbations using a sparse regression framework. The paper sets up this framework, conducts an extensive simulation study where this approach is implemented using two different existing methods for sparse (elastic net) regression, and provides a reanalysis of an existing siRNA dataset.

The contribution here is sound, and indeed a rather creative idea, and is certainly suitable for PLoSOne with a few minor corrections.

- The definition of marginal epistasis could be improved by defining it in terms of averages rather than sums. Then the fitness of a genotype in the marginal fitness landscape would simply be the expected fitness of a genotype given its allelic state at the specified subset of genes, and Corollary 1 simplifies to \epsilon_{i,j}=\beta_{i,j} instead of \epsilon_{i,j}=2^{p-2} \beta_{i,j}.

- The authors should also disambiguate between their concept of marginal epistasis and the (different) marginal epistasis concept studied in e.g. Crawford et al. 2017 PLoS Genetics.

- Sparse regression (Lasso, elastic net, and their Bayesian analogs) has previously been applied both in the fitness landscape and the quantitative genetics literature, and the reader should be aware of how the current contribution fits into this broader context. I am thinking of things like Cai et al. 2011 "Fast empirical Bayesian LASSO for multiple quantitative trait locus mapping" in the quantitative genetics literature and Otwinowski and Nemenman 2013 and Poelwijk et al. 2019 in the fitness landscape literature, but there is additional work in both areas (especially in quantitative genetics). One important feature of L1 regularized regression is that the results are sensitive to the choice of basis, and the Taylor basis used here is one difference with most other treatments of epistasis via regularized regression.

- The re-analysis of the siRNA screen should be clarified. The current text says that the kinases being targeted are kinases of the pathogens, but it appears that these are in fact the kinases of the human host. More generally, the reader needs more background on the experimental design of this study in order to understand the reanalysis. In Table 1, the reader also needs to see the corresponding single mutation effects in order to understand what kind of epistatic interactions these are (e.g. which of these are synthetic lethals), and some of the estimated effects seem awfully large (e.g. 172.04 on a log 2 scale).

Minor additional comments:

- More precise notation would be helpful in Eqn 6

- pg. 6 Karush-Kuhn-Tucker conditions

- pg. 7 \theta needs to be defined

- pg. 17 What does failed to run mean?

Reviewer #2: I would like to thank the editor for the possibility of reviewing this article entitled “LEARNING EPISTATIC GENE INTERACTIONS FROM PERTURBATION SCREENS” by Elmes et al.

The topic presented is attractive, as there is still an unmet need to understand epistatic gene interactions and their role in developing new therapeutic approaches that achieve synthetic lethality.

The abstract summarizes the entire content of the article. The figures and tables help in summarizing its content.

However, this article presents several weaknesses. Here you can find a list of general comments and revisions:

• I have concerns about the two-stage approach to identify interactions. Using a linear model to calculate estimates and p values does not take into account the selection process of the LASSO. Refitting the regression model after variable selection may lead to biased estimates. However, some advances have been made in this area (Taylor & Tibshirani, 2016; J. D. Lee et al., 2016).

• The importance of interactions is judged by p value alone; however, the criterion should be effect size together with p value. Consequently, please review the following sentence from page 8, “We are nonetheless able (…)”; I would suggest removing it.

• I would like the authors to explain whether scalability could be a problem with real-world data.

• As a general comment, strong emphasis is placed on simulated data, but there is a need to further investigate the effects of this methodology on real-world data.

• The last paragraph of the results section compares the R² of two models; however, it is well known that R² increases as more covariates are included. It is recommended to compare models using a complexity-penalized criterion, e.g., AIC.

• I would like to know if the authors studied the role of epistatic gene interactions in any specific disease. Including a clinical example could improve the quality of the article and make it more readable.

• As a minor comment, when referring to the Normal distribution in an equation, it is conventional to write N(μ, σ²) instead of Norm(μ, σ²).

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jul 13;16(7):e0254491. doi: 10.1371/journal.pone.0254491.r003

Author response to Decision Letter 0

8 Jun 2021

This is included in the cover letter PDF file.

Attachment

Submitted filename: response to reviewers1.pdf

Additional data file (145.9 KB, PDF)

PLoS One. doi: 10.1371/journal.pone.0254491.r004

Decision Letter 1

Ruben Artero

29 Jun 2021

Learning epistatic gene interactions from perturbation screens

PONE-D-20-30514R1

Dear Dr. Gavryushkin,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ruben Artero, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. doi: 10.1371/journal.pone.0254491.r005

Acceptance letter

Ruben Artero

2 Jul 2021

PONE-D-20-30514R1

Learning epistatic gene interactions from perturbation screens.

Dear Dr. Gavryushkin:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ruben Artero

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Appendix

(PDF)

Additional data file (3.5 MB, PDF)

Attachment

Submitted filename: 31-03-2021-perturbations_response_to_plos_cb_referees.pdf

Additional data file (1,012.1 KB, PDF)

Attachment

Submitted filename: response to reviewers1.pdf

Additional data file (145.9 KB, PDF)

Data Availability Statement

Everything necessary to reproduce our results, including data, has been uploaded to Github: https://github.com/bioDS/xyz-simulation.

[pone.0254491.ref001] 1. Wright S. The Roles of Mutation, Inbreeding, Crossbreeding, and Selection in Evolution. Proc 6th Int Cong Genet. 1932;1:356–366. [Google Scholar]

[pone.0254491.ref002] 2. Cordell HJ. Epistasis: What It Means, What It Doesn’t Mean, and Statistical Methods to Detect It in Humans. Human Molecular Genetics. 2002;11(20):2463–2468. doi: 10.1093/hmg/11.20.2463 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref003] 3. Orr HA. Fitness and Its Role in Evolutionary Genetics. Nature Reviews Genetics. 2009;10(8):531–539. doi: 10.1038/nrg2603 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref004] 4. de Visser JAGM, Cooper TF, Elena SF. The Causes of Epistasis. Proceedings of the Royal Society B: Biological Sciences. 2011;278(1725):3617–3624. doi: 10.1098/rspb.2011.1537 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref005] 5. Kaelin WG. The Concept of Synthetic Lethality in the Context of Anticancer Therapy. Nature Reviews Cancer. 2005;5(9):689–698. doi: 10.1038/nrc1691 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref006] 6. Hanahan D, Weinberg RA. Hallmarks of Cancer: The next Generation. Cell. 2011. doi: 10.1016/j.cell.2011.02.013 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref007] 7. Force T, Kolaja KL. Cardiotoxicity of Kinase Inhibitors: The Prediction and Translation of Preclinical Models to Clinical Outcomes. Nature Reviews Drug Discovery. 2011. doi: 10.1038/nrd3252 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref008] 8. Holohan C, Van Schaeybroeck S, Longley DB, Johnston PG. Cancer Drug Resistance: An Evolving Paradigm. Nature Reviews Cancer. 2013. doi: 10.1038/nrc3599 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref009] 9. Ashworth A, Lord CJ, Reis-Filho JS. Genetic Interactions in Cancer Progression and Treatment. Cell. 2011;145(1):30–38. doi: 10.1016/j.cell.2011.03.020 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref010] 10. Jerby-Arnon L, Pfetzer N, Waldman YY, McGarry L, James D, Shanks E, et al. Predicting Cancer-Specific Vulnerability via Data-Driven Detection of Synthetic Lethality. Cell. 2014;158(5):1199–1209. doi: 10.1016/j.cell.2014.07.027 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref011] 11. O’Neil NJ, Bailey ML, Hieter P. Synthetic Lethality and Cancer. Nature Reviews Genetics. 2017;18(10):613–623. doi: 10.1038/nrg.2017.47 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref012] 12. Chan DA, Giaccia AJ. Harnessing Synthetic Lethal Interactions in Anticancer Drug Discovery. Nature Reviews Drug Discovery. 2011;10(5):351–364. doi: 10.1038/nrd3374 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref013] 13. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, et al. The Genetic Landscape of a Cell. Science. 2010. doi: 10.1126/science.1180823 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref014] 14. Poelwijk FJ, Socolich M, Ranganathan R. Learning the Pattern of Epistasis Linking Genotype and Phenotype in a Protein. Nature Communications. 2019;10(1):4213. doi: 10.1038/s41467-019-12130-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref015] 15. Otwinowski J, Nemenman I. Genotype to Phenotype Mapping and the Fitness Landscape of the E. Coli Lac Promoter. PLoS ONE. 2013;8(5):e61570. doi: 10.1371/journal.pone.0061570 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref016] 16. Steckel M, Molina-Arcas M, Weigelt B, Marani M, Warne PH, Kuznetsov H, et al. Determination of Synthetic Lethal Interactions in KRAS Oncogene-Dependent Cancer Cells Reveals Novel Therapeutic Targeting Strategies. Cell Research. 2012. doi: 10.1038/cr.2012.82 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref017] 17. Laufer C, Fischer B, Billmann M, Huber W, Boutros M. Mapping Genetic Interactions in Human Cancer Cells with RNAi and Multiparametric Phenotyping. Nat Meth. 2013. doi: 10.1038/nmeth.2436 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref018] 18. McDonald ER, de Weck A, Schlabach MR, Billy E, Mavrakis KJ, Hoffman GR, et al. Project DRIVE: A Compendium of Cancer Dependencies and Synthetic Lethal Relationships Uncovered by Large-Scale, Deep RNAi Screening. Cell. 2017;170(3):577–592.e10. doi: 10.1016/j.cell.2017.07.005 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref019] 19.DepMap B. DepMap 20Q2 Public. DOI. 2020. 10.6084/m9.figshare.12280541.v4 [DOI]

[pone.0254491.ref020] 20. Jackson AL, Burchard J, Schelter J, Chau BN, Cleary M, Lim L, et al. Widespread siRNA “off-Target” Transcript Silencing Mediated by Seed Region Sequence Complementarity. RNA. 2006;12(7):1179–1187. doi: 10.1261/rna.25706 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref021] 21. Schmich F, Szczurek E, Kreibich S, Dilling S, Andritschke D, Casanova A, et al. gespeR: A Statistical Model for Deconvoluting off-Target-Confounded RNA Interference Screens. Genome Biology. 2015;16(1):220. doi: 10.1186/s13059-015-0783-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref022] 22. Srivatsa S, Kuipers J, Schmich F, Eicher S, Emmenlauer M, Dehio C, et al. Improved Pathway Reconstruction from RNA Interference Screens by Exploiting Off-Target Effects. Bioinformatics. 2018;34(13):i519–i527. doi: 10.1093/bioinformatics/bty240 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref023] 23. Tiuryn J, Szczurek E. Learning Signaling Networks from Combinatorial Perturbations by Exploiting siRNA Off-Target Effects. Bioinformatics. 2019;35(14):i605–i614. doi: 10.1093/bioinformatics/btz334 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref024] 24. Rämö P, Drewek A, Arrieumerlou C, Beerenwinkel N, Ben-Tekaya H, Cardel B, et al. Simultaneous Analysis of Large-Scale RNAi Screens for Pathogen Entry. BMC Genomics. 2014;15(1):1162. doi: 10.1186/1471-2164-15-1162 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref025] 25. Schultz N, Marenstein DR, De Angelis DA, Wang WQ, Nelander S, Jacobsen A, et al. Off-Target Effects Dominate a Large-Scale RNAi Screen for Modulators of the TGF-β Pathway and Reveal microRNA Regulation of TGFBR2. Silence. 2011. doi: 10.1186/1758-907X-2-3 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref026] 26. Tang W, Dodge M, Gundapaneni D, Michnoff C, Roth M, Lum L. A Genome-Wide RNAi Screen for Wnt/Beta-Catenin Pathway Components Identifies Unexpected Roles for TCF Transcription Factors in Cancer. Proceedings of The National Academy Of Sciences Of The United States Of America. 2008. doi: 10.1073/pnas.0804709105 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref027] 27. Lim M, Hastie T. Learning Interactions via Hierarchical Group-Lasso Regularization. Journal of Computational and Graphical Statistics. 2015;24(3):627–654. doi: 10.1080/10618600.2014.938812 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref028] 28. Thanei GA, Meinshausen N, Shah RD. The Xyz Algorithm for Fast Interaction Search in High-Dimensional Data. Journal of Machine Learning Research. 2018;19(1):1343–1384. [Google Scholar]

[pone.0254491.ref029] 29. Beerenwinkel N, Pachter L, Sturmfels B. Epistasis and Shapes of Fitness Landscapes. Statistica Sinica. 2007. [Google Scholar]

[pone.0254491.ref030] 30. Singh S, Narang AS, Mahato RI. Subcellular Fate and Off-Target Effects of siRNA, shRNA, and miRNA. Pharmaceutical Research. 2011;28(12):2996–3015. doi: 10.1007/s11095-011-0608-1 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref031] 31. Lewis BP, Shih Ih, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of Mammalian MicroRNA Targets. Cell. 2003;115(7):787–798. doi: 10.1016/S0092-8674(03)01018-3 [DOI] [PubMed] [Google Scholar]

[pone.0254491.ref032] 32. Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods. 1977. [Google Scholar]

[pone.0254491.ref033] 33.Jacob L, Obozinski G, Vert JP. Group Lasso with Overlap and Graph Lasso. In: Proceedings of the 26th annual international conference on machine learning; 2009. p. 433–440.

[pone.0254491.ref034] 34. Yuan M, Lin Y. Model Selection and Estimation in Regression with Grouped Variables. J R Stat Soc Series B Stat Methodol. 2006. doi: 10.1111/j.1467-9868.2005.00532.x [DOI] [Google Scholar]

[pone.0254491.ref035] 35. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(1):289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x [DOI] [Google Scholar]

[pone.0254491.ref036] 36. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer; New York; 2009. [Google Scholar]

[pone.0254491.ref037] 37. Garcia DM, Baek D, Shin C, Bell GW, Grimson A, Bartel DP. Weak Seed-Pairing Stability and High Target-Site Abundance Decrease the Proficiency of Lsy-6 and Other microRNAs. Nature Structural & Molecular Biology. 2011. doi: 10.1038/nsmb.2115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref038] 38. Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GSS, Venugopal AK, et al. NetPath: A Public Resource of Curated Signal Transduction Pathways. Genome Biology. 2010;11(1):R3. doi: 10.1186/gb-2010-11-1-r3 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref039] 39. Kamburov A, Wierling C, Lehrach H, Herwig R. ConsensusPathDB-a Database for Integrating Human Functional Interaction Networks. Nucleic Acids Research. 2009;37:D623–D628. doi: 10.1093/nar/gkn698 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref040] 40. Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, et al. The STRING Database in 2017: Quality-Controlled Protein–Protein Association Networks, Made Broadly Accessible. Nucleic Acids Research. 2017;45(D1):D362–D368. doi: 10.1093/nar/gkw937 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref041] 41. Puchta O, Cseke B, Czaja H, Tollervey D, Sanguinetti G, Kudla G. Network of Epistatic Interactions within a Yeast snoRNA. Science. 2016;352(6287):840–844. doi: 10.1126/science.aaf0965 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref042] 42. Vogwill T, Kojadinovic M, MacLean RC. Epistasis between Antibiotic Resistance Mutations and Genetic Background Shape the Fitness Effect of Resistance across Species of Pseudomonas. Proceedings of the Royal Society B: Biological Sciences. 2016;283(1830):20160151. doi: 10.1098/rspb.2016.0151 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0254491.ref043] 43. van Gaalen FA, Verduijn W, Roelen DL, Böhringer S, Huizinga TWJ, van der Heijde DM, et al. Epistasis between Two HLA Antigens Defines a Subset of Individuals at a Very High Risk for Ankylosing Spondylitis. Annals of the Rheumatic Diseases. 2013;72(6):974–978. doi: 10.1136/annrheumdis-2012-201774 [DOI] [PubMed] [Google Scholar]

44. Lee JD, Sun DL, Sun Y, Taylor JE. Exact Post-Selection Inference, with Application to the Lasso. The Annals of Statistics. 2016;44(3):907–927. doi: 10.1214/15-AOS1371

45. Tibshirani RJ, Taylor J, Lockhart R, Tibshirani R. Exact Post-Selection Inference for Sequential Regression Procedures. Journal of the American Statistical Association. 2016;111(514):600–620. doi: 10.1080/01621459.2015.1108848

46. Dezeure R, Bühlmann P, Meier L, Meinshausen N. High-Dimensional Inference: Confidence Intervals, p-Values and R-Software Hdi. Statistical Science. 2015;30(4):533–558. doi: 10.1214/15-STS527

47. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A Significance Test for the Lasso. The Annals of Statistics. 2014;42(2):413–468. doi: 10.1214/13-AOS1175

48. Kennedy C, Ward R. Greedy Variance Estimation for the LASSO. Applied Mathematics & Optimization. 2020;82(3):1161–1182. doi: 10.1007/s00245-019-09561-6


