Author manuscript; available in PMC: 2022 Feb 1.
Published in final edited form as: Nat Protoc. 2021 Jan 11;16(2):812–840. doi: 10.1038/s41596-020-00436-7

Analysis framework and experimental design for evaluating synergy driving gene expression

Nadine Schrode 1, Carina Seah 2, PJ Michael Deans 2, Gabriel Hoffman 1,#, Kristen J Brennand 1,2,3,4,5,#
PMCID: PMC8609447  NIHMSID: NIHMS1753230  PMID: 33432232

Abstract

The mechanisms by which genetic risk variants interact with each other, as well as environmental factors, to contribute to complex genetic disorders remain unclear. We describe in detail our recently published approach to resolve distinct additive and synergistic transcriptomic impacts following combinatorial manipulation of genetic variants and/or chemical perturbagens. Although first developed for CRISPR-based studies of isogenic human induced pluripotent stem cell (hiPSC)-derived neurons, our methodology can be broadly applied to any RNA sequencing data, provided raw read counts are available. Whereas other differential expression analyses reveal the impact of individual perturbations, here we specifically query interactions between two or more perturbagens, resolving the extent of non-additive (synergistic) interactions between perturbations. We discuss the careful experimental design required to resolve synergistic effects, considerations of statistical power, and how to quantify observed synergy between experiments. Additionally, we speculate on potential future applications and explore the obvious limitations of this approach. Overall, by interrogating the impact of independent factors, alone and in combination, our analytic framework and experimental design facilitate the discovery of convergence and synergy downstream of gene and/or treatment perturbations hypothesized to contribute to complex diseases. We believe that this protocol can be successfully applied by any scientist with bioinformatic skills and basic proficiency in the R programming language; our computational pipeline (https://github.com/nadschro/synergy-analysis) is straightforward, does not require supercomputing support, and can be conducted in a single day upon completion of RNA sequencing experiments.

EDITORIAL SUMMARY

Here, the authors discuss experimental design considerations and describe a computational pipeline to reveal the synergistic and additive effects of combinatorial perturbations on gene expression measured by RNA-sequencing.

TWEET

A new protocol by @kristen.brennand’s team describing a computational pipeline to reveal the synergistic and additive effects of combinatorial perturbations on gene expression.

COVER TEASER

Evaluating synergistic effects on gene expression

INTRODUCTION

The complex genetic nature of many human diseases is becoming increasingly clear, with risk arising from the interplay of common (e.g. 143 schizophrenia 1; 90 Parkinson’s disease 2; 304 coronary artery disease 3; 143 type 2 diabetes 4) and rare (102 autism 5) variants, together with environmental exposures (reviewed in 6,7). While most findings from genome-wide association studies (GWAS) consider the linear effect of a single genetic variant on the trait of interest 8, molecular biology is known to be context-dependent. At the molecular level, the effect of a genetic variant on a molecular trait such as gene expression likely depends on the state of other variables, such as another genetic variant or a chemical stimulus. While findings of gene-gene interactions in GWAS have been lacking due to low statistical power, recent work in identifying stimulus-dependent genetic regulatory mechanisms has demonstrated genome-wide context-dependent effects 9,10. Moreover, genes themselves rarely act in an isolated fashion; rather, complex gene-gene and gene-environment interactions determine the transcriptional landscape we observe in gene expression experiments 11. The consequences of these interactions can prove unexpected in light of the individual effects observed, and so new studies should be undertaken to uncover the effects of gene interplay.

New approaches and technologies are urgently needed to causally link disease-associated genes to the cell types, biological pathways and cellular functions they impact, within the context of the polygenic nature of these diseases 12,13 and context-dependence of molecular biology 14,15. Simultaneous perturbation of gene pairs to inform on the nature of their interaction has long been standard practice in genetics research and such approaches have recently become much more systematic through the use of high-throughput genetic screens, especially in yeast 16. However, the complex genetic landscape of many human diseases requires applying these approaches on a much larger scale. Towards this, coupling the expanded toolbox of CRISPR-based tools for genetic and genomic screening 17 with human induced pluripotent stem cells or cancer cells to produce large numbers of patient-specific cells of the type impacted by disease 18,19, makes possible the multiplexed functional validation of risk variants and genes at an unprecedented scale.

In this protocol, we offer detailed considerations for experimental design, and an analytic framework for evaluating synergistic effects driving gene expression. We provide detailed guidelines for experimental design of perturbation experiments and bioinformatic scripts for data analysis, in order to extend our recently published approach to specifically query interactions between two or more perturbagens and resolve the extent of non-additive interactions between perturbations 20. (The terms “non-additive”, “synergistic”, and “epistatic” are used interchangeably in this context in the field.) Comparing the expected additive effect predicted by summing the result of individual perturbations, to the observed changes in an empirical combinatorial perturbation of those same elements, can reveal those downstream genes with synergistic changes that are likely to result from the interaction of the original manipulations 20. To make these assessments, a specific experimental design and analysis is required, the details of which we discuss here. As our strategy is motivated by first principles, both from the experimental design and statistical perspective, we expect it to be broadly applicable. Commonly, differential expression analyses reveal the impact of individual perturbations in isolation. To our knowledge, no alternative bioinformatic pipelines have been reported by which to conduct analyses similar to those described in this protocol.

Development of the protocol

We reported a functional validation pipeline for genetic variants that incorporated leading genomic, hiPSC- and CRISPR-based approaches 20. Combinatorial perturbation of schizophrenia-associated risk genes resulted in downstream effects that exceeded what would be expected from the additive effect of individually perturbed genes; synergistic genes converged on synaptic function, and included the rare and common variant genes implicated in psychiatric disease risk. Our strategy is suitable to a much larger number of human diseases, as it can be applied to uncover additive and synergistic interactions between risk variants, environmental perturbations and/or drug responses. As an example of the latter, it was recently applied to investigate the synergy of ATP6V1A and amyloid beta, modeling Alzheimer’s disease in hiPSC-neurons 21.

Although disease risk is widely held to be additive at the population level in most statistical models of disease 12, our hypothesis is that within individuals, risk variants (or other perturbations) can sum in different patterns, perhaps dependent upon whether their target genes are expressed in the same cell types or converge within the same biological pathways. With respect to gene-gene interaction studies, observations of synergy might suggest that, for at least a subset of complex genetic disorders or traits, polygenic risk scores should not be strictly additive, summing the predicted effect size of each risk variant identified. More specifically, our findings of biological epistatic effects 20 between common variants suggest that, at least for schizophrenia, additive polygenic risk scores (PRS) might be improved by considering the pathway and/or cellular function of the risk variants being summed 22,23. With respect to drug interactions (either drug × gene or drug × drug), this strategy could help elucidate the mechanisms underlying context-dependent response. For example, analyses could test if risk genes associated with addiction impact response to opioid treatment or withdrawal, whether variants associated with cardiovascular disease disproportionately impact cellular response to inflammatory cues, or how combinatorial drug therapies lead to emergent effects not seen for individual treatments.

Additional biological validation experiments need to be performed in the near future to explore the biological mechanisms of the synergy we detected 20. It is critically important to assess the extent to which the magnitude of observed synergy varies across pathways (both gene number and biological function), cell types, and donor background. First, in order to identify which specific interactions drive epistatic relationships, future experiments must systematically test smaller subsets of perturbations, in order to resolve if specific interactions are disproportionately driving the observed synergy. Second, it is important to test the extent to which the desired genetic manipulations occurred at the level of each cell, which can be addressed through single cell RNA sequencing analyses of gRNA sequence and transcriptome profiles in individual manipulated cells (e.g. ECCITE-seq 24, Perturb-seq 25 or CROP-seq 26). These methods will not only enable confirmation of gRNA integrations per cell, but can simultaneously measure transcriptomic differences at the single cell level. Third, it is necessary to functionally validate changes in the synergistic gene targets (at the RNA and/or protein levels), but also the cellular impact of transcriptional synergy; for example, in neurons this might involve analyses of dendritic morphology, synaptic density and neuronal activity. It is interesting to speculate if combinatorial perturbation of genes such as FURIN, which reduced neuronal outgrowth, and SNAP91, which altered pre-synaptic density and activity, will result in co-occurrence of these independent phenotypes, an exaggerated effect in one or both measures, or novel phenotypes not observed in either single manipulation alone 20.

The competing goals of increasing the scalability of our approach while minimizing experimental variability are difficult to reconcile, as we have only evaluated experiments conducted in a single batch to reduce variance in the data. While we aspire towards a comprehensive evaluation of combinatorial interactions between the hundreds of genes associated with polygenic risk for complex genetic disorders such as schizophrenia, across cell types and donors, the immediate applications of our method are more conservative in scope. Although pooled CRISPR screening approaches 24–26 might prove capable of resolving the combinatorial impact of many genes, pooled screens are not suitable for other types of perturbations (e.g. drugs, inflammatory molecules, etc.), for which automated arrayed screening may be the only solution.

Applications of the method

Gene expression studies, and even large-scale drug screens 27, typically compare just two conditions at a time (e.g. genetic perturbation or drug treatment versus control). Our experimental framework enables the study of gene-gene interactions (i.e. epistasis), gene-environment interactions, genotype-specific drug responses, and drug-drug interactions (Fig. 1). Although developed to explore combinatorial CRISPR-based perturbations in hiPSC-derived neurons, this experimental strategy is amenable to a wide range of straightforward cell culture and animal studies. We envision applications ranging from combinatorial drug screening in cancer cell lines 28, to testing various drug or stress paradigms on wildtype and knockout mice 29. Here, we provide the full, annotated analysis workflow for identifying synergistic interactions affecting gene expression. We also describe the necessary elements of experimental design that must be incorporated, statistical power, and quantification of observed synergy between experiments. We discuss potential future applications and explore current limitations of our approach. Overall, we believe that, with modest changes to experimental design, many genetic and pharmacological studies, both in vitro and in vivo, could incorporate the study of combinatorial and synergistic effects, adding value to research as diverse as addiction (gene-environment), cancer (gene-drug), and toxicology (drug-drug).

Figure 1. General overview of the analysis pipeline, experimental design and differential expression contrast design.


(a) Chart summarizing synergistic effect analysis. General considerations for experimental design are described in the introduction. Note that code for power calculations is also provided in the procedure in step 18. Methods to conduct the actual RNA sequencing are not described in this protocol but should only be undertaken following careful experimental design. Performing the procedure steps of this protocol results in the differential expression of each individual condition, as well as computed synergy from combinatorial and additive conditions. Both are passed on to further study via enrichment analysis. Procedure steps are indicated in italics where applicable. (b) RNA-seq sample setup with 3 samples per condition for same type perturbations (left) and different type perturbations (center and right). Different type perturbations allow for two choices: a separate control has to be set up for each condition (center), or all perturbation types must be present in all samples (right). Shape: perturbed gene, here square: gene 1, hexagon: gene 2; color: type of perturbation, here blue: CRISPRi, orange: drug treatment. (c) Visual representation of the contrasts to be defined during differential expression analysis: Each individual perturbation is compared against its control (upper left). The simultaneous perturbation is compared against its control (upper right). To model the additive effect of the perturbed genes, the individual comparisons are summed (lower left). To model the synergistic effect, the additive comparison is subtracted from the combinatorial perturbation comparison (lower right).

A suitable application for this analysis requires RNA sequencing data from multiple perturbed samples, both separate and in combination. The availability of biological replicates is necessary to improve the power to detect synergistic effects. To take full advantage of the provided script, the RNA sequencing input data should ideally be provided as raw read counts, as opposed to normalized RPKM, as a series of custom normalization steps are built into our analysis pipeline. Data stemming from microarrays are generally not suitable due to their limited scope.
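To illustrate why raw counts are the preferred input, the following minimal sketch computes counts per million (CPM) from a toy count matrix and filters lowly expressed genes using base R only. The toy matrix, the CPM threshold of 1 and the minimum of 3 samples are illustrative assumptions and are not taken from the protocol, whose own normalization and filtering are carried out within the pipeline described in the procedure.

```r
# Toy raw count matrix: 4 genes (rows) x 6 samples (columns); values are made up.
counts <- matrix(
  c(500, 520, 480, 510, 495, 505,
      0,   0,   0,   1,   0,   0,
    100, 110,  90, 105,  95, 100,
    400, 369, 430, 383, 410, 394),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6))
)

# Counts per million: scale each sample (column) by its library size.
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Keep genes with CPM > 1 in at least 3 samples (illustrative thresholds).
keep <- rowSums(cpm > 1) >= 3
filtered <- counts[keep, , drop = FALSE]
```

Starting from values that are already length-normalized (e.g. RPKM) would not allow library sizes to be recovered in this way, which is one reason raw counts are requested.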

There are unfortunately a very limited number of existing datasets that are appropriate for this analysis with which to evaluate the generalizability of our method. Critically, we note that the lack of currently suitable datasets does not suggest that a highly specialized laboratory, personnel, and/or substantial funding are required for successful implementation. We identified many instances of studies that would have been suitable, except that the publicly available datasets were generated using microarrays rather than RNA-sequencing platforms 30–33, or data were provided as RPKM rather than read counts 34–40. We hope that more widespread adoption of this methodology will encourage data to be deposited in suitable formats.

A comprehensive search identified a number of suitable datasets 41–44. We selected one, which investigated combinatorial drug treatment of BET and MEK inhibitors as therapy for MAPK and checkpoint inhibitor-resistant melanoma 44, owing to the availability of perturbations with single agents alone and in combination as well as biological replicates. This melanoma study reported that combining BET and MEK inhibitors synergistically curbed the growth of NRAS-mutant melanoma and prolonged the survival of mice bearing tumors refractory to MAPK inhibitors and immunotherapy. Our analysis found significant synergistic downregulation of genes associated with various cancers, such as alveolar rhabdomyosarcoma, glioblastoma, RB1 knockdown, breast cancer, and thyroid cancer, suggesting that combination therapy with these chemotherapeutic agents synergistically improved cancer targeting as opposed to the additive effects of each agent alone (Extended Data Figure 14). This improves our understanding of biological responses to treatment with multiple drugs in combination, especially in the field of cancer biology, where multiple chemotherapeutic agents are often prescribed together.

Continued progress towards precision medicine requires improvements in genotype-based diagnosis and predictions of drug treatment response. Understanding how polygenic risk adds within and across pathways will improve the calculation of polygenic risk scores. Evaluating the impact of drugs on complex genotypes will improve our ability to match patients to treatment. Moreover, understanding points of convergence and synergy between risk variants could lead to the identification of novel therapeutic targets with which to prevent or treat disease. Overall, the translational impact of our work includes potential improvements to additive PRS that incorporate pathway-specific PRS, better integration of gene ontology and synergistic effects into PRS scores, and/or the prioritization of convergent and synergistic genes for mechanistic follow-up and pathways for potential therapeutic targets.

Limitations

Our analysis has specific limitations that are common to many computational analysis pipelines, most notably regarding statistical power. Because we are considering a difference in fold changes between conditions (in essence a difference of differences), power is notably limited and the underlying data sets must be as consistent as possible, preferably lacking batch effects and other technical variation; otherwise the sample size will need to be dramatically increased. We include a function (Step 18) to calculate the number of samples necessary to be powered for a comparison of differences among them, validating the power of the current experiment and informing the design of subsequent ones. Empirical measurement of variance is a required input in this function, so preliminary data from similar cells and conditions are helpful for accurately estimating the variance in the ultimate experiment and thereby the sample number required. Briefly, for single gene perturbations, power estimates vary by gene, reflecting both the effect size of the eQTL and the standard deviation (SD) in associated expression differences. The SD of expression for isogenic hiPSC comparisons is generally greatly reduced compared to post-mortem datasets, which increases statistical power. Nonetheless, in our analysis of schizophrenia-associated risk genes (Ref. 20), we achieved >75% power for single-gene perturbations of only three of the four genes associated with SZ-GWAS loci, even when comparing two isogenic hiPSC pairs, two replicates each, to twelve other edited lines from the same parental hiPSCs (we discuss the power of isogenic analyses in more detail elsewhere 19). Comparatively, power is increased for synergistic perturbations: our analysis of four SZ-GWAS genes was powered to resolve 1.8-fold logFCs in synergistic differentially expressed (DE) genes at 75% power when considering only four samples per condition (two donors × two replicates).

We propose evaluating two statistical tests of synergy, termed “synergy coefficients”, in order to compare the synergy observed in different pools of genes (Step 19). The first measures the existence of synergy, quantifying the estimated fraction of p-values that are non-null 45 (synergy coefficient, π1); for our four SZ-GWAS genes 20 the π1 is 0.34, and for the melanoma drug study 44 the π1 is 0.4854, whereas in an independent experiment in which we observed additive but not synergistic effects (CRISPRi ATP6V1A neurons +/− Aβ42 treatment) 21 the π1 was 0. The second measures the extent of synergy, calculating the fraction of genes with a synergy p-value < 0.05; the fraction was 0.18 for our four SZ gene dataset 20, 0.23 for the melanoma drug study 44, and 0.02 for the aforementioned non-synergistic study 21.

Fitting of this model for differential expression results in genes that show a difference in the differential expression computed for the additive model and for the combinatorial perturbation. However, interpretation of the resulting differentially expressed genes (DEGs) depends on several factors, such as the direction of fold change in all three models. To identify genes of interest, namely those whose magnitude of change is larger in the combinatorial perturbation than the additive model, we categorize all genes by the direction of their change in both models and their log2 (Fold change) in the synergistic model (Step 20). Following the identification of synergistic effects, we perform over-representation analysis on these subsets of synergistically regulated genes (Step 25). This approach is not always feasible (e.g. if very few genes are significantly differentially expressed), so we suggest other criteria by which to subset genes that might lead to larger numbers of resulting genes for further exploration. Moreover, genes in gene sets may be associated with the respective pathway through repression or activation. In our experience, separately considering the “more up” and “more down” synergistic genes is therefore not always more informative than evaluating all “more” synergistic genes together (i.e. directionality of enrichment is not always meaningful). If no gene subsets fit the data at hand, genes may be individually inspected for interpretation.

Overall, it is only through increased application of our methodology that we will collectively begin to understand the generalizability of synergistic effects on gene expression in psychiatric genetics, or biology as a whole. Does synergy arise more frequently following genetic or pharmacological manipulations? Is the extent of synergy greatest for within or cross-pathway perturbations? Is synergy dependent on the number of perturbations achieved, or does it arise through specific epistatic interactions? How does transcriptional synergy impact downstream cellular phenotypes and ultimately contribute to disease risk?

Experimental design

Sample setup

For successful analysis of synergistic effects, samples of all individual perturbations and all combinatorially perturbed samples, as well as appropriate controls, are necessary. The design of the latter depends on the type of perturbations that are performed. If all perturbations are of the same kind, e.g. two or more drug treatments, only one control (with biological replicates) needs to be set up, e.g. vehicle treatment. However, if more than one type of perturbation is to be combined (e.g. CRISPR, shRNA and/or drug), separate and combined controls are required (Fig. 1b). In that case either a separate control has to be set up for each condition, or all perturbation types must be present in all samples. While both are valid approaches, the latter allows for fewer samples, thereby lessening experimental expenses, while also accounting for all perturbation effects in all cells. This in turn simplifies the computational analysis described in this protocol, as only one control condition has to be added to the contrasts (step 12). However, if perturbation types are difficult to perform or time-consuming, this approach may prove impractical.

To avoid technical variation, all experiments should be carried out at the same time (i.e. in one batch) and any other experimental variability minimized. If this is not feasible, blocking (arranging experimental units in groups, or blocks, that are similar to one another) and randomization should be performed to avoid confounding technical and biological variation, which would impede subsequent analysis. If this is not attainable, our analysis framework also includes built-in consideration for batch effects (Step 9). If batch information is recorded in the metadata, significant contributions of batch effects to clustering can be identified by inspecting a multidimensional scaling (MDS) plot (Step 8). If a batch effect exists, it can be accounted for in the design matrix (Step 9).
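As an illustration of what accounting for batch in the design matrix can look like, the sketch below builds a design with one column per condition plus an additive batch term using base R's model.matrix(). The metadata values and the column names (condition, batch) are hypothetical and would need to be adapted to the metadata file at hand; this is a generic sketch rather than the protocol's Step 9 code.

```r
# Hypothetical metadata: 4 conditions, each measured once in each of 2 batches.
meta <- data.frame(
  condition = factor(rep(c("control", "geneA", "geneB", "combined"), times = 2),
                     levels = c("control", "geneA", "geneB", "combined")),
  batch = factor(rep(c("b1", "b2"), each = 4))
)

# No intercept: one coefficient per condition, plus an additive batch offset.
design <- model.matrix(~ 0 + condition + batch, data = meta)

# Shorten the condition column names for readability in later contrasts.
colnames(design) <- sub("^condition", "", colnames(design))
```

Because every condition appears in every batch in this toy layout, the design is of full rank and the batch offset can be estimated separately from the condition effects. If conditions and batches were completely confounded, this would not be possible, which is why blocking and randomization matter.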

Following RNA isolation, library preparation and sequencing, read alignment and transcript quantification are performed to obtain raw transcript counts. These steps are not described in this protocol but are described in detail elsewhere 46. Some or all can be performed in the laboratory but are often outsourced to core facilities or specialized businesses.

Differential Expression analysis

This protocol begins with a standard differential expression analysis from raw counts, including filtering for lowly expressed genes to avoid artifacts, calculating normalization factors for differing library sizes and principal component analysis to inform covariates to include in the linear model. For complex designs or a large number of possible covariates, we suggest analysis of their contribution to gene expression variance, using the variancePartition package 47. Following log transformation, a linear model is fit for each gene. At this point, the protocol departs from standard DE analysis approaches for the first time (for example, see 48). When defining contrasts for the desired group comparisons, contrasts for the expected additive model and the gene interaction-dependent synergistic effect are added (Fig. 1c). The additive contrast is calculated by summation of all individual contrasts, while the synergistic effect is defined by the difference between the expected and the observed contrasts. We then recalculate the coefficients, standard deviations and correlation matrix in terms of the comparisons of interest and apply empirical Bayes moderation to obtain more precise estimates of gene-wise variability.
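The contrast arithmetic described above can be written out explicitly. The following minimal sketch uses base R only, with hypothetical condition names (control, geneA, geneB, combined) for a same-type design with a single control and made-up mean log2 expression values for one gene. The helper function contrast() is introduced here purely for illustration; in the protocol, the equivalent contrasts are defined within the linear modeling workflow.

```r
conditions <- c("control", "geneA", "geneB", "combined")

# Each contrast is a vector of weights on the per-condition mean log2 expression.
contrast <- function(...) {
  w <- setNames(numeric(length(conditions)), conditions)
  args <- c(...)
  w[names(args)] <- args
  w
}

geneA_vs_ctrl    <- contrast(geneA = 1, control = -1)
geneB_vs_ctrl    <- contrast(geneB = 1, control = -1)
combined_vs_ctrl <- contrast(combined = 1, control = -1)

# Expected additive effect: sum of the individual contrasts.
additive <- geneA_vs_ctrl + geneB_vs_ctrl
# Synergistic effect: observed combinatorial effect minus expected additive effect.
synergy <- combined_vs_ctrl - additive

# Made-up mean log2 expression of one gene in each condition.
mu <- c(control = 5.0, geneA = 5.5, geneB = 6.0, combined = 8.0)
log2fc_additive <- sum(additive * mu[conditions])
log2fc_combined <- sum(combined_vs_ctrl * mu[conditions])
log2fc_synergy  <- sum(synergy * mu[conditions])
```

In this toy case the individual log2 fold changes are 0.5 and 1.0, so the additive expectation is 1.5, whereas the combined perturbation gives 3.0, leaving a synergistic component of 1.5. Note that the synergy contrast reduces to combined − geneA − geneB + control, i.e. a difference of differences.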

Synergistic effect analysis

Once the differential expression analysis has been performed, we can calculate how many samples of each condition are needed to detect synergistic effects with sufficient power. For this reason, it is advisable to perform the differential expression analysis protocol up to this point on a previous similar data set and perform the power calculation using the relevant standard deviations. Since the interaction effect of the perturbed genes is defined as the difference (Δ) between the expected (βE) and observed (βO) fold change compared to the appropriate baseline, the statistical power to detect a significant interaction depends on the precision of the expected and observed fold changes. Increasing the sample size increases the precision of the estimates so that they have a smaller variance and standard error. Here we evaluate the statistical power of Δ = |βE – βO| based on the standard error of each of the fold changes. Using real RNA-seq data to estimate the standard error of the expected and observed fold changes (σE and σO, respectively) from 6 samples, and taking into account the fact that increasing the sample size by a factor of F will decrease the standard errors by a factor of √F, we can compute the statistical power to detect a synergistic interaction as a function of the sample size.
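The reasoning above can be sketched as a small power function. This is not the function provided in Step 18; it is a simplified illustration under stated assumptions: a two-sided z-test at α = 0.05, and expected and observed fold-change estimates treated as independent so that their variances add. The function name synergy_power() and all numeric inputs are hypothetical.

```r
# Approximate power to detect delta = |beta_E - beta_O| with a two-sided z-test.
# se_E, se_O:  standard errors of the expected and observed log2 fold changes,
#              estimated from a pilot data set.
# size_factor: fold increase in sample size relative to the pilot (F in the text).
# Assumption:  the two estimates are treated as independent, so variances add.
synergy_power <- function(delta, se_E, se_O, size_factor = 1, alpha = 0.05) {
  se_delta <- sqrt(se_E^2 + se_O^2) / sqrt(size_factor)
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(delta / se_delta - z_crit) + pnorm(-delta / se_delta - z_crit)
}

# Power as a function of sample size, starting from a hypothetical 6-sample pilot.
size_factor <- c(1, 2, 4, 8)
power_curve <- data.frame(
  samples = 6 * size_factor,
  power = synergy_power(delta = 1, se_E = 0.5, se_O = 0.4,
                        size_factor = size_factor)
)
```

Reading off the smallest sample size whose power exceeds a chosen target (for example 0.8) then gives a rough estimate of the number of samples required; with real data, the standard errors would come from the fitted model rather than being set by hand.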

To be able to compare the extent of synergy between data sets, but also to determine as early as possible how much synergy to expect, a synergy coefficient, π1, can be calculated. π1 is the fraction of non-null synergistic P-values and can give insight into the existence of a synergistic component, even if the P-values themselves are not significant genome-wide. Beyond π1, to get a sense of the extent of synergy, we can calculate the fraction of genes with a synergistic FDR smaller than (e.g.) 10%. If this value is low, but π1 is high, it is an indicator that the study might be underpowered and more samples are needed to resolve the present synergistic effects.
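To make the two quantities concrete, the following sketch estimates π1 and the fraction of genes passing an FDR cutoff from a vector of synergy P-values. It is a simplified stand-in written in base R: estimate_pi1() uses a Storey-type estimator with a single fixed λ = 0.5, which is an assumption made for brevity and is cruder than the estimator implemented in the qvalue package listed under Software. Function names and the toy P-values are hypothetical.

```r
# Simplified Storey-type estimate of pi1, the fraction of non-null p-values,
# using a single fixed lambda (an assumption; dedicated packages refine this).
estimate_pi1 <- function(p, lambda = 0.5) {
  pi0 <- mean(p > lambda) / (1 - lambda)
  1 - min(pi0, 1)
}

# Fraction of genes whose Benjamini-Hochberg adjusted p-value is below a cutoff.
fraction_significant <- function(p, fdr = 0.10) {
  mean(p.adjust(p, method = "BH") < fdr)
}

# Toy synergy p-values: 1,000 evenly spread "null" values plus 1,000 small ones.
p_null  <- (seq_len(1000) - 0.5) / 1000
p_mixed <- c(p_null, rep(1e-4, 1000))

pi1_null   <- estimate_pi1(p_null)    # no excess of small p-values
pi1_mixed  <- estimate_pi1(p_mixed)   # half of the p-values are non-null
frac_mixed <- fraction_significant(p_mixed)
```

A data set in which the estimated π1 is clearly above zero but the fraction of genes passing the FDR cutoff is small would match the underpowered scenario described above.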

Once these values have been evaluated, we can proceed with the analysis of the synergistic effect. To make sense of those genes uniquely perturbed in the combinatorial manipulation, one must consider the nature of their synergistic effect. Genes that are synergistically differentially expressed might exhibit positive or negative synergy. However, the biological meaning of these results depends largely on the fold change direction in the additive and the combinatorial comparisons. In general, genes with significantly increased absolute fold changes (i.e. in either direction) in the combinatorial perturbation compared to the additive model, are biologically of more interest than those more moderately changed, as they represent genes that are particularly affected by synergistic regulation. To take this into account, we categorize synergistically DEGs, or “synergistically regulated genes”, by magnitude of synergy rather than merely the direction of expression change between expected and observed experiments (Fig. 2).

Figure 2. Synergistic effects.


(a) Hypothetical differential expression results showing the effect of individual gene modifications on gene expression (shown as FC), the implementation of the expected additive model based on the individual gene perturbations and the measured combinatorial perturbation. Comparison of the results from the expected additive model to the measured combinational perturbation allows for the detection of synergistic effects on gene expression. (b) log2(fold changes) of three representative genes (CRMP1, FMN1, DLX1) resulting from individual gene perturbations, the computed additive models and combinatorial perturbation effects on expression, illustrating different possible synergistic effects (negative, positive and none, respectively). Panels a and b published previously 20 as Fig. 5a,b.
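One possible way to express this categorization in code is sketched below. The category labels ("more up", "less up", "more down", "less down", "same"), the tolerance argument and the rule of taking the direction from the combinatorial comparison are illustrative assumptions and may differ from the scheme implemented in Step 20. The gene identifiers and fold changes are made up.

```r
# Classify genes by comparing the additive (expected) and combinatorial (observed)
# log2 fold changes. Labels and tolerance are illustrative assumptions.
categorize_synergy <- function(lfc_additive, lfc_combined, tol = 0) {
  lfc_synergy <- lfc_combined - lfc_additive
  # "more": larger absolute change in the combination than expected additively.
  magnitude <- ifelse(abs(lfc_combined) > abs(lfc_additive), "more", "less")
  # Direction is taken from the combinatorial comparison.
  direction <- ifelse(lfc_combined > 0, "up", "down")
  ifelse(abs(lfc_synergy) <= tol, "same", paste(magnitude, direction))
}

genes <- data.frame(
  gene     = c("g1", "g2", "g3", "g4", "g5"),
  additive = c(1.0, 2.0, -1.0, -2.0, 0.5),
  combined = c(2.5, 0.5, -3.0, -0.5, 0.5)
)
genes$category <- categorize_synergy(genes$additive, genes$combined)
```

In practice, such a classification would typically be restricted to genes that are significant in the synergy contrast, so that small numerical differences between the two models are not over-interpreted.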

Gene set enrichment analysis

To gain insight into the differential expression results, and compare individual and combined perturbations on a broad scale, we perform gene set enrichment analysis. It can be advantageous to create a custom gene set collection, consisting of relevant gene sets for the field of interest. If these are then subdivided into gene set categories, plots become more informative 20. In this protocol, the limma::camera() R function is used, which determines enrichment by identifying sets of genes for which the distribution of t-statistics differs from expectation. This allows the evaluation of trends in the data, even if there are few genome-wide significant DE genes. In contrast to the analysis of broad transcriptional changes, determining which gene sets are over-represented in, e.g., a subset of DE genes that was determined to be negatively synergistically regulated, can help to extract the nature and potential function of these specific genes of interest. In these cases we perform over-representation analysis.
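For readers unfamiliar with over-representation analysis, the underlying test can be sketched in a few lines of base R as a one-sided hypergeometric test. This is a generic illustration, not the protocol's implementation; the function name ora_pvalue() and the toy gene identifiers are hypothetical.

```r
# One-sided hypergeometric test for over-representation of a gene set among
# selected genes (e.g. a subset of synergistically regulated genes).
ora_pvalue <- function(selected, gene_set, universe) {
  selected <- intersect(selected, universe)
  gene_set <- intersect(gene_set, universe)
  k <- length(intersect(selected, gene_set))  # selected genes that are in the set
  phyper(k - 1, length(gene_set), length(universe) - length(gene_set),
         length(selected), lower.tail = FALSE)
}

# Toy example with made-up gene identifiers: 5 of 10 selected genes fall in a
# 10-gene set drawn from a universe of 100 expressed genes.
universe <- paste0("gene", 1:100)
gene_set <- paste0("gene", 1:10)
selected <- paste0("gene", c(1:5, 50:54))
p_enrich <- ora_pvalue(selected, gene_set, universe)
```

The choice of universe matters: restricting it to the genes that passed expression filtering, rather than all annotated genes, avoids inflating enrichment for gene sets that are simply well expressed in the cell type under study. When many gene sets are tested, the resulting P-values should also be corrected for multiple testing.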

MATERIALS

EQUIPMENT

Hardware

Computer (we used a machine with 16 GB RAM and a 4-core processor). CRITICAL 150 MB disk space should be available for source and result files.

Software

  • R version 3.5.0 (2018-04-23)

  • R packages:

  • limma_3.38.3 49

  • edgeR_3.24.3 50

  • pheatmap_1.0.12 51

  • RColorBrewer_1.1-2 52

  • ggplot2_3.1.1 53

  • ggpubr_0.2 54

  • qvalue_2.18.0 55

  • plyr_1.8.4 56

  • wesanderson_0.3.6 57

  • GSEABase_1.44.0 58

  • grid_3.6.2 59

  • scales_1.0.0 60

  • WebGestaltR_0.4.0 61

  • stringr_1.4.0 62

Files

  • count matrix: to be provided as raw counts with sample names in columns and Ensembl IDs in rows. For data containing gene symbols rather than Ensembl IDs, please see Step 7.

  • metadata file: containing sample names in rows (must be identical to column names of the count matrix) and metadata information for each sample such as treatment, genotype, perturbation, donor, replicate, cell line, etc. in columns.

  • gene annotation file: containing at a minimum the Ensembl IDs of all genes analyzed in rows and their respective Ensembl ID and gene symbol in columns. For data containing gene symbols rather than Ensembl IDs, please see Step 7.

CRITICAL To use the example data provided, download counts.csv, meta.csv and anno.csv from Supplementary Data 1 and save them in your current working directory. anno.csv contains gene annotations for the majority of genes in the human genome and can likely be used for other data sets as well.

Example Data

See Supplementary Data 1: “data+code_Schrode.zip”, also available from www.synapse.org/#!Synapse:syn20502314. Originally published 20. This example data set stems from our study, which perturbed four risk genes individually and in combination in a hiPSC model of schizophrenia. Three genes (SNAP91, TSNARE1 and CLCN3) were perturbed using CRISPRa, while one (FURIN) was perturbed using RNAi in hiPSC-derived neurons. Each condition (SNAP91a, TSNARE1a, CLCN3a, FURINi, CRISPR control, shRNA control, combined perturbation, combined control) comprised 4 samples from two donors. For detailed information please refer to (ref 20).

See Supplementary Data 2: “data+code_Echevarria-Vargas.zip”. Originally published 44.

PROCEDURE

Setup (<5 mins)

Preparing R environment

  • 2

    Name your experiment. This name will be used to save results files and generate plot titles. In this example data set, we study the perturbation of 4 genes in a hiPSC model of schizophrenia and have thus named the experiment ‘SCZ’. Open R or RStudio and type:

    experiment.title <- "SCZ"
    
  • 3

    Load R packages by typing the following

    pacman::p_load(limma, edgeR, pheatmap, RColorBrewer, ggplot2, ggpubr, qvalue,  plyr, wesanderson, GSEABase, grid, scales, WebGestaltR, stringr)
    ?TROUBLESHOOTING
    
  • 4
    Define custom functions in advance for ease of use: Run each of these function definitions in order to define them for later use. Note that this does not process any data; it only stores custom functions to be used later in the analysis pipeline. Each function description explains what the function will do once applied to the data, starting in step 9.
    • The mds() function is based on plotMDS() in the limma package. When given a DGE object and a column in the meta data table, containing groups of interest, it produces a multidimensional scaling plot, colored by the provided groups.
      mds <- function(normDGE, metacol, title){
       mcol <- as.factor(metacol)
       col <- rainbow(length(levels(mcol)), 1, 0.8, alpha = 0.5)[mcol]
       plotMDS(normDGE, col = col, pch = 16, cex = 2)
       legend("center", fill = rainbow(length(levels(mcol)), 1, 0.8), legend = levels(mcol), horiz = FALSE, bty = "o", box.col = "grey", xpd = TRUE)
       title(main=title)
      }
      
    • The cameraplusplots() function is based on camera() in the limma package. It takes a contrast in the form of a named vector (such as a column of the contrast matrix “cont.matrix”, created in step 12, which defines the comparisons being made, such as combinatorial vs. additive), a list of gene set groups, a voom-transformed object (such as object “v”, created in step 10, which carries precision weights based on the mean-variance relationship of the counts), a design matrix and a color palette for the gene set groups. It produces a scatter plot of all tested gene sets and their adjusted P-values as well as a bar graph of the 10 most significant gene sets, colored by gene set group/category.
      cameraplusplots <- function(contrast, genesetlist, vobject, design,
                               catcolors, title){ 
       tmp.list <- list()
       cam <- data.frame(matrix(ncol = 5, nrow = 0))
       for (i in 1:length(genesetlist)){ 
        cam.s <- camera(vobject, genesetlist[[i]], design, contrast = 
            contrast, inter.gene.cor = 0.01)
        tmp.list[[i]] <- cam.s
        names(tmp.list)[i] <- names(genesetlist)[i] 
        tmp.list[[i]]$category <- names(tmp.list[i]) 
        colnames(cam) <- names(tmp.list[[1]]) 
        cam <- rbind.data.frame(cam, tmp.list[[i]]) 
        print(paste0("Gene set categories run: ", i))
       }
       cam$neglogFDR <- -log10(cam$FDR) 
       ## for plotting purposes only:
       cam$dirNeglogFDR <- cam$neglogFDR
       cam[(cam$Direction == "Down"), "dirNeglogFDR"] <- -cam[(cam$Direction ==
        "Down"), "neglogFDR"]
       grob <- grobTree(textGrob(c("UP", "DOWN"), x = c(0.94, 0.89), y = c(0.95,
        0.05), hjust = 0, gp = gpar(fontsize = 13)))
       q <- ggplot(aes(x = category, y = dirNeglogFDR, color = category),
        data = cam) +
        scale_color_manual(values = catcolors) +
        geom_jitter(aes(size = NGenes, alpha = neglogFDR), pch = 19,
            show.legend = F) +
        scale_size_continuous(range = c(4,16)) +
        scale_alpha_continuous(range = c(0.4, 1)) +
        geom_hline(yintercept = c(-1.3, 1.3), color = "red", alpha = 0.5) +
        geom_hline(yintercept = 0) +
        scale_y_continuous(limits = c(-10, 10), oob = squish, labels = abs) +
        labs(x = "Gene set categories", y = "-log10(FDR)", title = title) +
        theme_bw(14) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
            axis.ticks.x = element_blank(),
            panel.grid.minor = element_blank(),
            panel.grid.major = element_blank()) +
        annotation_custom(grob) 
       print(q)
       cam$geneSet <- row.names(cam)
       cam10 <- as.data.frame(cam)
       cam10 <- cam10[order(cam10$FDR),]
       cam10 <- cam10[1:10,]
       grob <- grobTree(textGrob(c("DOWN", "UP"), x = c(0.03, 0.9), y = c(0.025),
        hjust = 0, gp = gpar(fontsize = 9, col = "grey60")))
       g <- ggplot(aes(x = geneSet, y = dirNeglogFDR, fill = category),
        data = cam10) +
        geom_col()+
        aes(reorder(stringr::str_wrap(geneSet, 60),-FDR), dirNeglogFDR) +
        xlab(NULL) +
        geom_hline(yintercept = c(-1.3, 1.3), color = "red", alpha = 0.3) +
        geom_hline(yintercept = 0) +
        scale_y_continuous(limits = c(-10, 10), oob = squish, labels = abs) +
        labs(y = "-log10(FDR)", title = title) +
        scale_fill_manual(values = catcolors) +
        coord_flip() +
        theme_bw() +
        theme(panel.grid.minor = element_blank(),
            panel.grid.major.y = element_blank()) +
        annotation_custom(grob) 
       print(g) 
       return(cam)
      }
      
    • The oraplot() function takes the data frame resulting from over-representation analysis using WebGestaltR (done in step 30) and a color palette as input and returns a bar graph of the 10 most significant gene sets.
      oraplot <- function(orares, catcolors, name){
       orares.n <- orares[order(orares$FDR),]
       orares.n <- orares.n[1:10,]
       orares.n$neglogFDR <- -log10(orares.n$FDR)
       orares.n <- orares.n[orares.n$neglogFDR>0,]
       orares.n$geneSet <- gsub("_", " ", orares.n$geneSet)
       g <- ggplot(aes(x = reorder(str_wrap(geneSet, 60), neglogFDR), y = neglogFDR, fill = category), data = orares.n) +
        geom_col() +
        geom_hline(yintercept = 1.3, color = "red", alpha = 0.5) +
        labs(y = "-log10(FDR)", x = "", title = paste0(name)) +
        scale_fill_manual(values = catcolors) +
        coord_flip() +
        theme_bw(11)
       return(g)
      }
      
    • The power.compare.logFC() function takes as inputs the standard errors of the combinatorial perturbation and the additive model comparisons, generated in step 18 (sig1, sig2; variance in the additive model is usually higher, in proportion to the number of individual perturbations). It further takes the number of samples used to determine these standard errors (N), a vector of sample numbers of interest (N_other), a significance cutoff (alpha) and the number of tests performed (n_tests, usually the number of transcripts). N_other/N is the relative sample size (n_scale) and the variance of logFC1 - logFC2 is sig1^2 + sig2^2. Since the standard error of the mean is inversely proportional to sqrt(N), multiplying the sample size by a factor F decreases the SE by a factor of sqrt(F). On the variance scale, this corresponds to dividing by n_scale.
      power.compare.logFC <- function(sig1, sig2, N, N_other = c(2,4,6,8,10), alpha = 0.05, n_tests = 20000){
       d <- seq(0, 3, length.out=1000)
       alpha_multiple <- alpha / n_tests
       df <- lapply(N_other/N, function(n_scale){
        sigSq <- (sig1^2 + sig2^2) / n_scale
        cutoff <- qnorm(alpha_multiple/2, 0, sd = sqrt(sigSq), lower.tail = FALSE)
        p1 <- pnorm(-1*cutoff, d, sqrt(sigSq))
        p2 <- 1-pnorm(cutoff, d, sqrt(sigSq))
        data.frame(n_scale, d, power=p1+p2)
       })
       df <- do.call("rbind", df)
       ggplot(df, aes(d, power, color = as.factor(n_scale*N))) +
        geom_line() +
        theme_bw(14) +
        theme(aspect.ratio = 1, plot.title = element_text(hjust = 0.5)) +
        ylim(0, 1) +
        scale_color_discrete(“Samples”) +
        xlab(bquote(abs(logFC[observed] - logFC[expected]))) +
        ggtitle("Power versus difference in logFC")
      }
      
    • The categorize.synergy() function is given the combined matrix of log2FC values of the additive model, the combinatorial perturbation and the synergistic effect differential expression results. It creates a new column in the resulting data frame, assigning synergy categories to each gene.
      categorize.synergy <- function(logFCmatrix, meanSE){
       m <- logFCmatrix
       m$magnitude.syn <- NA
       for (i in 1:length(m$Gene_name)){
        if (m$Synergistic.logFC[i] > meanSE){
         if (m$Additive.logFC[i] < -meanSE){
          if (m$Combinatorial.logFC[i] > meanSE){
           m$magnitude.syn[i] = "more.up"
          } else m$magnitude.syn[i] = "less.down"
         } else m$magnitude.syn[i] = "more.up"
        }
        else if (m$Synergistic.logFC[i] < -meanSE){
         if (m$Additive.logFC[i] > meanSE){
          if (m$Combinatorial.logFC[i] < -meanSE){
           m$magnitude.syn[i] = "more.down"
          } else m$magnitude.syn[i] = "less.up"
         } else m$magnitude.syn[i] = "more.down"
        } else m$magnitude.syn[i] = "same"
       }
       m$magnitude.syn <- as.factor(m$magnitude.syn)
       return(m)
      }
      
    • The stratify.by.syn.cat() function will take a subset of interest from the table created by the categorize.synergy() function, above, as input. It creates a list object containing vectors of gene symbols by synergy category, which is used as input for over-representation analysis with WebGestaltR in step 30.
      stratify.by.syn.cat <- function(log2FC.matrix.sub){
       synergy.cat.list <- list("less.down" = as.character(log2FC.matrix.sub[
        log2FC.matrix.sub$magnitude.syn == "less.down", "Gene_name"]),
        "less.up" = as.character(log2FC.matrix.sub[
              log2FC.matrix.sub$magnitude.syn == "less.up", "Gene_name"]),
        "more.down" = as.character(log2FC.matrix.sub[
              log2FC.matrix.sub$magnitude.syn == "more.down", "Gene_name"]),
        "more.up" = as.character(log2FC.matrix.sub[
              log2FC.matrix.sub$magnitude.syn == "more.up", "Gene_name"]),
        "same" = as.character(log2FC.matrix.sub[
              log2FC.matrix.sub$magnitude.syn == "same", "Gene_name"]))
       return(synergy.cat.list)
      }
      

Loading and preprocessing data

  • 5

    Load data

    Read in metadata (here meta.csv) and expression data (raw counts, here counts.csv) and match the order of the samples in the metadata rows to the sample columns of the count matrix (this step is very important to avoid later confusion).

    meta <- read.csv("meta.csv", row.names = 1)
    counts <- read.csv("counts.csv", row.names = 1)
    meta <- meta[match(colnames(counts), row.names(meta)),]
    ?TROUBLESHOOTING
    
  • 6
    Filter lowly expressed genes
    • Plot counts over counts per million (cpm) and visually inspect the graph. Here, we aim to keep genes with roughly 10 counts in 4 or more samples (the number depends on the total number of samples/replicates) (Fig. 3a). Adjust the cutoff (v = 0.25 in the abline() function and in the filter, which currently corresponds to cpm > 0.25) based on the cpm value at which the horizontal line intersects the plotted counts/cpm(counts) line.
      pdf(paste0("results/", experiment.title, "-1_cpm-counts.pdf"))
      plot(cpm(counts)[, 1], counts[, 1], ylim = c(0, 50), xlim = c(0, 3))
      abline(h = 10, col = "red")
      abline(v = 0.25, col = "red")
      dev.off()
      keep <- rowSums(cpm(counts) > 0.25) >= 4
      gExpr <- counts[keep,]
      dim(gExpr)
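The relationship between a raw count threshold and a cpm cutoff can be checked by hand. The base R sketch below uses made-up counts and assumed library sizes (the sample and gene names are hypothetical) to show why a threshold of 10 counts corresponds to 0.25 cpm only for a library of about 40 million reads, and why the cutoff should be re-derived for each data set.

```r
# Assumed total read counts per sample (hypothetical values).
lib_size <- c(sample1 = 40e6, sample2 = 20e6)

# Made-up raw counts for three hypothetical genes.
toy_counts <- rbind(geneA = c(10, 10), geneB = c(4, 2), geneC = c(400, 150))
colnames(toy_counts) <- names(lib_size)

# Counts per million: counts divided by library size, times one million.
cpm_manual <- sweep(toy_counts, 2, lib_size, "/") * 1e6
print(cpm_manual)

# The cpm value that corresponds to 10 raw counts in each sample.
cpm_for_10_counts <- 10 / lib_size * 1e6
print(cpm_for_10_counts)
```

With deeper sequencing the same 10 counts translate into a smaller cpm value, so the vertical line in the diagnostic plot moves to the left.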
      
  • 7

    Create a DGEList object, which is a data format that holds the read counts, normalization factors, experimental group data, gene annotations and library size information. Note that in this step, gene annotations are manually added to the DGEList object. This step differs if your counts file is already annotated with gene names rather than Ensembl IDs. In this case, the annotations (last 4 lines) do not need to be manually added (see Troubleshooting).

    y <- DGEList(gExpr)
    y <- calcNormFactors(y)
    anno <- read.csv("anno.csv")
    row.names(anno) <- anno$ensembl
    anno <- anno[match(rownames(y), rownames(anno)),]
    y$genes <- anno
    ?TROUBLESHOOTING
    
  • 8

    Create diagnostic plot:

    Plot multidimensional scaling plot to assess variables that should be added as covariates (Fig. 3b). To do this, visually inspect each plot and examine clustering of points. Ideally, points should cluster mostly by experimental group and less by technical/uninteresting covariates such as batch, donor, line, or any other variable defined in your metadata. If points cluster by an undesired variable (for instance, donor instead of perturbation), that variable should be considered a covariate and will need to be added to the linear model in step 9 to account for its effect on the data.

    pdf(paste0("results/", experiment.title, "-2_mds.pdf"))
    for (i in 1:length(colnames(meta))){
     mds(y, meta[ ,i], colnames(meta)[i])
    }
    plotMDS(y)
    dev.off()
    

Figure 3. Differential expression analysis output.

a) Plot showing counts over cpm. Horizontal red line marks 10 counts. Arrow indicates the intersection with the plotted data, which here equals 0.25 cpm (vertical red line). b) MDS plots highlighting two metadata variables, donor and line. Sample data are mainly separated by donor (left, marked by arrow) and cell line (right, marked by arrow). c) Voom mean-variance plot. Red line indicates smoothed curve fitted to the linear model by average expression. d) Volcano and mean difference (MA) plots of differential expression in the additive (left graph pair) and the combinatorial (right graph pair) comparisons. Significantly differentially expressed genes are highlighted in blue and red in volcano plots and the top 10 significant genes are denoted in blue in MA plots. MA plots in panel d published previously 20 in Extended Data Figure 4a.

Fitting a linear model

  • 9
    Design model
    • Define the linear model to be used in the differential expression analysis. Add the variable of interest as well as any metadata variables that showed clustering on visual inspection of the MDS plots in Step 8. In this example “mod.gene” represents our variable of interest, while the MDS plot showed “line” and “donor” as additional covariates. Only add covariates that were identified by clustering in the MDS plot.
      design <- model.matrix(~ 0 + mod.gene + line + donor, meta)
      
    • (Optional) Subsequently remove the variable name from the column names of the created design matrix to ease its use. R automatically creates column names that combine the variable name and its value, which makes them cumbersome to type downstream. Here, the variable name is removed.
      colnames(design)
      colnames(design) <- gsub("mod.gene", "", colnames(design))
      colnames(design)
      ?TROUBLESHOOTING
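To see what the design matrix looks like before and after renaming, the following base R sketch builds one from a small made-up metadata table. The sample names, the “geneX” level and the donor labels are hypothetical; only the column name mod.gene and the formula structure follow the example above.

```r
# Made-up metadata: two conditions, each measured in two donors.
meta_toy <- data.frame(
  mod.gene  = factor(c("ctrl", "ctrl", "geneX", "geneX")),
  donor     = factor(c("d1", "d2", "d1", "d2")),
  row.names = paste0("s", 1:4)
)

# "~ 0 +" removes the intercept, so every mod.gene level gets its own column;
# donor is added as a covariate (its first level is absorbed as baseline).
design_toy <- model.matrix(~ 0 + mod.gene + donor, meta_toy)
print(colnames(design_toy))

# Strip the variable name so the group columns can be typed as "ctrl", "geneX".
colnames(design_toy) <- gsub("mod.gene", "", colnames(design_toy))
print(design_toy)
```

Because each group has its own column, group comparisons can later be written as simple differences between column names.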
      
  • 10
    Voom transform
    • The voom() function prepares the data for linear modeling by converting counts to logCPM and computing weights for heteroscedasticity adjustment. It also creates a diagnostic plot for the mean-variance trend (Fig. 3c). Visually inspect it for fit. Increase the cutoff for lowly expressed genes to improve fit if necessary.
      v <- voom(y, design, plot = TRUE, save.plot = TRUE)
      
    • Save the plot (optional)
      pdf(paste0("results/", experiment.title, "-3_voom.pdf"))
      plot(v$voom.xy, type = "p", pch = 20, cex = 0.16,
      main = "voom: Mean-variance trend",
      xlab = "log2(count size + 0.5)",
      ylab = "Sqrt(standard deviation)")
      lines(v$voom.line, col = "red")
      dev.off()
      
  • 11

    Fit model

    fit <- lmFit(v, design)
    
  • 12
    Define group comparisons (contrasts)
    • Define your comparisons of interest (in the example below, we look at perturbations of four genes, SNAP91, TSNARE1, CLCN3 and FURIN, alone and in combination). Each comparison is assigned a name and a formula using elementary algebra. In addition to the standard comparisons of the individual and combined gene perturbations with their control, we also add equations for the additive and the synergistic model.
      cont.matrix <- makeContrasts(SNAP91a = sanp91 - ctrl,
      TSNARE1a = tsnare1 - ctrl,
      CLCN3a = clcn3 - ctrl,
      FURINi = furin - ctrl,
      Additive = sanp91 + tsnare1 + clcn3 + furin - 4*ctrl,
      Combinatorial = all - all.ctrl,
      Synergy = all - sanp91 - tsnare1 - clcn3 - furin - all.ctrl + 4*ctrl,
      levels = design) 
      ?TROUBLESHOOTING
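The algebra behind these contrasts can be verified without fitting a model. The base R sketch below writes the Additive, Combinatorial and Synergy contrasts as weight vectors over the group names used in the makeContrasts() call above and confirms that Synergy equals Combinatorial minus Additive. The helper contrast_vec() is hypothetical and only serves this illustration.

```r
# Group names as they appear in the makeContrasts() call.
groups <- c("ctrl", "sanp91", "tsnare1", "clcn3", "furin", "all.ctrl", "all")

# Hypothetical helper: named weight vector with zeros for unused groups.
contrast_vec <- function(weights) {
  v <- setNames(numeric(length(groups)), groups)
  v[names(weights)] <- weights
  v
}

additive      <- contrast_vec(c(sanp91 = 1, tsnare1 = 1, clcn3 = 1,
                                furin = 1, ctrl = -4))
combinatorial <- contrast_vec(c(all = 1, all.ctrl = -1))
synergy       <- contrast_vec(c(all = 1, sanp91 = -1, tsnare1 = -1,
                                clcn3 = -1, furin = -1, all.ctrl = -1,
                                ctrl = 4))

# Synergy is the measured combinatorial effect minus the additive expectation.
print(rbind(additive, combinatorial, synergy))
stopifnot(all(synergy == combinatorial - additive))
```

Each row sums to zero, as expected for a comparison between groups.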
      
  • 13

    (Optional) Visualize the contrasts in a heatmap.

    cont.p <- t(cont.matrix)
    h <- pheatmap(cont.p,
     display_numbers = T, number_format = "%.0f",
     breaks = seq(-3, 1, by = 0.5),
     color = colorRampPalette(rev(brewer.pal(n = 10, name = "RdYlBu")))(12),
        cluster_cols = F, cluster_rows = F)
    print(h)
    
  • 14

    Calculate coefficients and standard errors for each contrast.

    fit.cont <- contrasts.fit(fit, cont.matrix)
    

Assessing differential expression

  • 15

    Perform empirical Bayes moderation by executing the eBayes() function from the limma package, then use the decideTests() function to classify each gene in each contrast as significantly upregulated, significantly downregulated or not significant.

    fit.cont <- eBayes(fit.cont)
    plotSA(fit.cont, main = "Final model: Mean-variance trend",
      ylab = "Sqrt(standard deviation)")
    summa.fit <- decideTests(fit.cont, adjust.method = "fdr")
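The “fdr” adjustment requested here is the Benjamini-Hochberg procedure. As a minimal illustration in base R, the sketch below adjusts six made-up P-values with p.adjust() and reproduces the result by hand; the values are invented and are not taken from the example data.

```r
# Made-up P-values for six hypothetical genes in one contrast.
p <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.7)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust()).
fdr <- p.adjust(p, method = "fdr")

# Manual version: p * n / rank, made monotone from the largest P-value down.
n <- length(p)
o <- order(p, decreasing = TRUE)
manual <- numeric(n)
manual[o] <- pmin(1, cummin(p[o] * n / (n:1)))

print(data.frame(p, fdr, manual, significant = fdr < 0.05))
```

In the protocol, this adjustment is applied per contrast across all tested genes.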
    
  • 16
    Save DEG result tables created in Steps 12–15.
    • Create a list of all results. This will be used later to run analyses for all comparisons in Steps 19, 20 and 25.
      res.list <- list()
      for (i in 1:length(colnames(fit.cont$contrasts))){
       x <- topTable(fit.cont, coef = i, sort.by = "p", n = Inf, confint = T)
       res.list[[i]] <- x
       names(res.list)[i] <- colnames(fit.cont$contrasts)[i]
       write.csv(x, paste0("results/", experiment.title, "_DEGs_",
        colnames(fit.cont$contrasts)[i], ".csv"))
      }
      
  • 17

    Save the DEG result plots.

    Create mean difference (MA) and volcano plots for each contrast, with all significant DE genes and the top 10 genes highlighted, respectively (Fig. 3d).

    pdf(paste0("results/", experiment.title, "-4_volcano-md-plots.pdf"))
    par(mfrow = c(1, 2))
    for (i in 1:length(colnames(fit.cont$contrasts))){
     plotMD(fit.cont, coef = i, status = summa.fit[, i], values = c(-1, 1))
     volcanoplot(fit.cont, coef = i, highlight = 10,
      names = fit.cont$genes$Gene_name)
    }
    dev.off()
    par(mfrow = c(1, 1))
    
    Optional: Plot the expression of the top 3 DE genes in each contrast in all samples.
    • Change the meta$mod.gene variable (in the call p <- qplot(meta$mod.gene, […], fill = meta$mod.gene, […])) to your variable of interest.
    • scale_x_discrete() and scale_fill_manual() are commented out (through the use of #) and are optional additions for aesthetics. If used (remove the #), their values have to be changed to reflect the data at hand.
      pdf(paste0("results/", experiment.title, "-5_top3-expression-plots.pdf"))
      par(mfrow = c(1, 1))
      for (i in 1:length(colnames(fit.cont$contrasts))){
       x <- topTable(fit.cont, coef = i, sort.by = "p", n = Inf)
       cat("  \n\n###",  colnames(fit.cont$contrasts)[i], " \n\n")
       for (j in 1:3){
        deg <- as.character(x[j, "ensembl"])
        p <- qplot(meta$mod.gene, v$E[deg, ], geom = "boxplot", fill =
                    meta$mod.gene, ylab = "Normalized expression", xlab = "group",
                    main = paste0(j, ". DEG: ", as.character(x[j, "Gene_name"]))) +
         geom_jitter() +
         #scale_x_discrete(limits = c("ctrl", "sanp91", "tsnare1",
         #                            "clcn3", "furin", "all.ctrl", "all")) +
         #scale_fill_manual(values = (c("orchid4", "grey",
         #                  "steelblue", "grey", "firebrick", "blue", "darkblue"))) +
         rotate_x_text(angle = 45) +
         theme_bw(14) +
         theme(legend.position = "none",
               axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
        print(p)
       }
      }
      dev.off()
      ?TROUBLESHOOTING
      

Determining power to detect synergistic effects

  • 18
    Calculate power. Power calculations from this analysis can inform future studies. Ideally, power calculations should be performed on a similar dataset in advance.
    • Calculate the standard error of each gene for all measured comparisons.
      SE <- sqrt(fit.cont$s2.post) * fit.cont$stdev.unscaled
      
    • Calculate power. Choose the standard error (SE) matrix column names that represent the additive and the combinatorial perturbation to calculate the median standard error. Then use them to run the power.compare.logFC() function, which creates a power plot (Fig. 4a).
      colnames(SE)
      sig1 <- median(SE[, "Additive"])
      sig2 <- median(SE[, "Combinatorial"])
      g <- power.compare.logFC(sig1, sig2, N = 4, N_other = c(4, 6, 8, 10, 14),
      alpha = 0.05, n_tests = 20000)
      pdf(paste0("results/", experiment.title, "-6_synergy-power.pdf"))
      print(g)
      dev.off()
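For a single point on the power curve, the calculation inside power.compare.logFC() can be followed step by step. The sketch below is a simplified base R restatement of that arithmetic for one effect size and one sample size; the helper name synergy_power() and the two standard errors (0.25 and 0.15) are assumptions made for this example, not values from the example data.

```r
# Hypothetical helper restating the power arithmetic for a single effect size.
# sig1, sig2: standard errors of the two comparisons, estimated with N samples.
# N_new: sample size to evaluate; d: true |logFC observed - logFC expected|.
synergy_power <- function(sig1, sig2, N, N_new, d,
                          alpha = 0.05, n_tests = 20000) {
  n_scale <- N_new / N
  # SD of the difference in logFC, rescaled for the new sample size.
  sd_diff <- sqrt((sig1^2 + sig2^2) / n_scale)
  # Two-sided threshold after dividing alpha by the number of tests.
  cutoff <- qnorm(alpha / n_tests / 2, mean = 0, sd = sd_diff,
                  lower.tail = FALSE)
  # Probability that the observed difference falls beyond either threshold.
  pnorm(-cutoff, mean = d, sd = sd_diff) +
    pnorm(cutoff, mean = d, sd = sd_diff, lower.tail = FALSE)
}

# Made-up standard errors; compare 4 vs. 8 samples at a difference of 1.5.
p4 <- synergy_power(0.25, 0.15, N = 4, N_new = 4, d = 1.5)
p8 <- synergy_power(0.25, 0.15, N = 4, N_new = 8, d = 1.5)
print(c(n4 = p4, n8 = p8))
```

Doubling the sample size halves the variance of the difference, which shrinks the threshold and raises the power at a fixed effect size.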
      

Figure 4. Synergistic effect analysis output.

a) Plot visualizing synergistic effect power calculations. X-axis shows synergistic log2FCs. Arrow shows that, in the current example, 10 samples per condition are required to resolve a synergistic log2FC of 1.4 at 75% power. b) Histogram of synergistic P-values. c) Pie chart showing the proportions of genes that fall into different synergistic differential expression categories: same (6816 genes), less up (5490 genes), less down (5129 genes), more up (3798 genes) and more down (2598 genes) in the combinatorial vs. the additive model. Exact numbers and percentages are calculated in step 21. d) Hierarchical clustering of the differential expression log2(fold changes) of all synergy categories, in the additive model versus the combinatorial perturbation comparisons.

Determining the extent of synergy

  • 19

    Calculate synergy coefficient and percentage of synergistic DE genes (FDR<10%). Plot histogram of all synergistic P-values to visualize the distribution (Fig. 4b).

    synergy.pvalues <- res.list$Synergy$P.Value
    pi1 <- 1 - qvalue(synergy.pvalues)$pi0
    print(pi1)
    pdf(paste0("results/", experiment.title, "-7_synergy-coefficient.pdf"))
    plot.new()
    text(0.4, 0.75, labels = paste0("\n", round(pi1 * 100, 2),
    "% non-null \np-values and \n",
    round(sum(res.list$Synergy$adj.P.Val < 0.1) *
                            100/length(res.list$Synergy$ensembl), 2),
    " % of genes with \nsynergy FDR < 0.1"))
    hist(synergy.pvalues)
    dev.off()
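The synergy coefficient pi1 is the estimated fraction of non-null P-values. The base R sketch below illustrates the idea behind this estimate on simulated P-values, using a Storey-type estimator at one fixed tuning value (lambda = 0.5). This is a simplification: by default qvalue() combines estimates over a range of lambda values, so its result will generally differ somewhat. The simulated mixture (80% null, 20% non-null) is an assumption of this example.

```r
set.seed(1)

# Simulated P-values: 8,000 null (uniform) and 2,000 non-null (skewed to 0).
p_null <- runif(8000)
p_alt  <- rbeta(2000, shape1 = 0.2, shape2 = 5)
p_sim  <- c(p_null, p_alt)

# Null P-values are uniform, so the density of P-values above lambda
# estimates the proportion of nulls (pi0).
lambda  <- 0.5
pi0_hat <- min(1, mean(p_sim > lambda) / (1 - lambda))
pi1_hat <- 1 - pi0_hat

cat("estimated pi0:", round(pi0_hat, 3), "\n")
cat("estimated pi1:", round(pi1_hat, 3), "\n")
hist(p_sim, breaks = 20, main = "Simulated P-values")
```

A flat histogram would indicate few or no synergistic effects, whereas a peak near zero, as in this simulation, indicates a non-null fraction.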
    

Identification and categorization of synergistic genes

  • 20
    Define synergistic effect categories.
    • Determine an expression cutoff range. Here +/− the mean standard error for all empirically measured comparisons is used.
      meanSE = mean(SE[,c(1,2,3,4,6)])
      
    • Create a table combining log2 fold change columns of additive, combinatorial and synergy contrasts. Make certain the vectors (here: c(1,2,4,9)) refer to the column indices of ensembl ID, gene name, logFC and adjusted P-value in the DEG result tables. If they do not, adjust the numbers to correctly reflect these column indices.
      colnames(res.list$Additive)
      log2FC.matrix <- Reduce(function(x, y) merge(x, y, by = c("ensembl", "Gene_name"),
        all = TRUE),
       list(res.list$Additive[, c(1,2,4,9)],
               res.list$Combinatorial[, c(1,2,4,9)],
               res.list$Synergy[, c(1,2,4,9)]))
      colnames(log2FC.matrix)
      colnames(log2FC.matrix) <- c("Ensembl", "Gene_name",
                                   "Additive.logFC", "Additive.FDR",
                                   "Combinatorial.logFC", "Combinatorial.FDR",
                                   "Synergistic.logFC", "Synergistic.FDR")
      rownames(log2FC.matrix) <- log2FC.matrix$Ensembl
      
    • Add a column assigning synergistic categories to each gene using the categorize.synergy() function, which takes the previously created matrix and the expression cutoff into account.
      log2FC.matrix <- categorize.synergy(log2FC.matrix, meanSE)
      write.csv(log2FC.matrix, paste0("results/", experiment.title,
      "_LogFC-FDR-synergy_matrix.csv"))
      genes.per.category <- plyr::count(log2FC.matrix, "magnitude.syn")
      print(genes.per.category)
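The decision rules applied by categorize.synergy() can be traced on a handful of genes. The following base R sketch restates those rules in vectorized form and applies them to five made-up genes with a cutoff of 0.2. The helper name categorize_synergy_vec() and all values are hypothetical, and missing values are not handled.

```r
# Hypothetical helper: vectorized restatement of the categorization rules.
categorize_synergy_vec <- function(additive, combinatorial, synergistic,
                                   cutoff) {
  category <- rep("same", length(synergistic))
  up   <- synergistic >  cutoff
  down <- synergistic < -cutoff
  category[up]   <- "more.up"
  category[down] <- "more.down"
  # Additive model predicts downregulation; combination is not clearly up.
  category[up & additive < -cutoff & !(combinatorial > cutoff)] <- "less.down"
  # Additive model predicts upregulation; combination is not clearly down.
  category[down & additive > cutoff & !(combinatorial < -cutoff)] <- "less.up"
  factor(category)
}

# Made-up log2 fold changes for five hypothetical genes.
toy <- data.frame(
  Gene_name           = paste0("gene", LETTERS[1:5]),
  Additive.logFC      = c(-1.0, 0.5, 1.0, -0.5, 0.30),
  Combinatorial.logFC = c(-0.3, 1.5, 0.4, -1.5, 0.35)
)
toy$Synergistic.logFC <- toy$Combinatorial.logFC - toy$Additive.logFC
toy$magnitude.syn <- categorize_synergy_vec(toy$Additive.logFC,
                                            toy$Combinatorial.logFC,
                                            toy$Synergistic.logFC,
                                            cutoff = 0.2)
print(toy)
```

For example, geneA is expected to be downregulated (additive log2FC of -1.0) but is only mildly downregulated in combination, so it is labeled “less.down”.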
      
  • 21
    Visualize the categories in a pie chart.
    • Create a table containing sums of synergistic category genes.
      genes.per.category$category <- factor(genes.per.category$magnitude.syn,
        levels = c("same", "less.up", "less.down", "more.up", "more.down"))
      genes.per.category$percent <- paste0(round(genes.per.category$freq *
        100/sum(genes.per.category$freq), 0), " %")
      write.csv(genes.per.category, paste0("results/", experiment.title,
      "_gene-count_synergy-categories.csv"))
      
    • Plot a pie chart (Fig. 4c).
      zissou <- wes_palette("Zissou1", 6, type = "continuous")
      q <- ggplot(genes.per.category, aes(x = "", y = freq, fill = category)) +
           geom_col() +
           coord_polar("y", start = 0) +
           scale_fill_manual(values = zissou) +
           theme_void()
      pdf(paste0("results/", experiment.title,
      "-8_synergy-categories_pie-chart.pdf"))
      print(q)
      dev.off()
      
  • 22
    Visualize the categories in heatmaps ?TROUBLESHOOTING
    • Create heatmaps of the log2FC in the additive and the combinatorial comparisons for each synergy category (Fig. 4d).
      pdf(paste0("results/", experiment.title,
      "-9_heatmaps_log2FC-Add-vs-Combi.pdf"))
      for (i in 1:length(levels(log2FC.matrix$magnitude.syn))){
       breaks <- c(seq(-6, -0.3, by = 0.1), seq(0.3, 6, by = 0.1))
       breaks <- append(breaks, -9, 0)
       breaks <- append(breaks, 9)
       tmp <- log2FC.matrix[log2FC.matrix$magnitude.syn ==
            levels(log2FC.matrix$magnitude.syn)[i],
            c("Additive.logFC", "Combinatorial.logFC")]
       h <- pheatmap(tmp,
            kmeans_k = 30,
            cellwidth = 70, cellheight = 5,
            border_color = NA,
            breaks = breaks,
            cluster_cols = F,
            show_rownames = F,
            color = colorRampPalette(rev(brewer.pal(n = 9, name = "RdBu")))(117),
            main = paste0("logFC expected vs. measured:\n",
                          levels(log2FC.matrix$magnitude.syn)[i]))
       print(h)
      }
      dev.off()
      ?TROUBLESHOOTING
      

Enrichment analysis

All comparisons: GSEA

  • 23
    Create a list containing the gene set groups of interest. Ensure the gene sets are in the required format (gmt) using the functions ids2indices(), geneIds() and getGmt().
    • In this example these are previously manually curated gene set groups, saved in the “genesets” folder: disorder.gmt, behavior.gmt, connectivity.gmt, head.gmt, neural.gmt, postsynapse.gmt, presynapse.gmt and synapse.gmt.
    • Further gmt files for gene sets of interest can be found at https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp. Custom gene sets can be created by assembling a spreadsheet in the gmt format and naming the file using the “.gmt” file extension. For information on the formatting of gmt files, refer to https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats.
      gs.list <- list(
        "disorder" = ids2indices(geneIds(getGmt("genesets/disorder.gmt")),
           id = v$genes$Gene_name),
        "behavior" = ids2indices(geneIds(getGmt("genesets/behavior.gmt")),
           id = v$genes$Gene_name),
        "connectivity" = ids2indices(geneIds(getGmt("genesets/connectivity.gmt")),
           id = v$genes$Gene_name),
        "head" = ids2indices(geneIds(getGmt("genesets/head.gmt")),
           id = v$genes$Gene_name),
        "neural" = ids2indices(geneIds(getGmt("genesets/neural.gmt")),
           id = v$genes$Gene_name),
        "postsynapse" = ids2indices(geneIds(getGmt("genesets/postsynapse.gmt")),
           id = v$genes$Gene_name),
        "presynapse" = ids2indices(geneIds(getGmt("genesets/presynapse.gmt")),
           id = v$genes$Gene_name),
        "synapse" = ids2indices(geneIds(getGmt("genesets/synapse.gmt")),
           id = v$genes$Gene_name))
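For readers assembling custom gene sets, it can help to see what a gmt file contains and what the index conversion produces. The base R sketch below writes a tiny made-up gmt file (one gene set per line: set name, description, then gene symbols, all tab-separated), reads it back and maps the symbols to positions in an annotation vector. The set names, the reader function read_gmt() and the six-gene universe are hypothetical; in the protocol this work is done by getGmt(), geneIds() and ids2indices().

```r
# Write a tiny two-set GMT file to a temporary location (made-up sets).
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "TOY_SYNAPSE_SET\tillustrative description\tSNAP91\tTSNARE1\tCLCN3",
  "TOY_OTHER_SET\tillustrative description\tFURIN\tCRMP1"
), gmt_path)

# Hypothetical reader: field 1 is the set name, field 2 a description,
# and all remaining fields are gene identifiers.
read_gmt <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  sets <- lapply(fields, function(x) x[-(1:2)])
  names(sets) <- vapply(fields, function(x) x[1], character(1))
  sets
}

toy_sets <- read_gmt(gmt_path)
str(toy_sets)

# Map gene symbols to row positions in an annotation vector, which is the
# kind of index list that camera() expects.
gene_universe <- c("CLCN3", "CRMP1", "DLX1", "FURIN", "SNAP91", "TSNARE1")
toy_idx <- lapply(toy_sets, function(g) which(gene_universe %in% g))
print(toy_idx)
```

Genes in a set that are absent from the annotation vector are silently dropped by this mapping, so it is worth checking how many members of each set remain.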
      
  • 24
    (Optional) Create a custom color palette.
    • Create a custom color palette with at least as many colors as gene set groups to be tested.
      catcols <- c("behavior" = "#8c3800",     #brown
            "disorder" = "#3C4347",            #darkgrey
            "connectivity" = "#e0a81c",        #mustard
            "head" = "#702658",                #purple
            "neural" = "#004878",              #blue
            "postsynapse" = "#486030",         #darkgreen
            "presynapse" = "#a8c018",          #lightgreen
            "synapse" = "#5c9340")             #green
      # additional colors: "#6CA7AD", "#B72415", "#A88C05", "#E06C03", "#653C82", "#0287AA", "#AD5A60", "#9B0420"
      show_col(catcols)
      
  • 25
    Run gene set enrichment for all comparisons using camera.
    • Loop through all contrasts in the cont.matrix object. Perform enrichment and visualize as scatter (Fig. 5a) and bar plots (Fig. 5b) using the custom cameraplusplots() function.
      camera.res.list <- list()
      for (j in 1:length(colnames(cont.matrix))){
       print(paste0("Contrast: ", colnames(cont.matrix)[j]))
       pdf(paste0("results/", experiment.title, "-10_GSEA-",
        colnames(cont.matrix)[j], "-plots.pdf"))
       camera.res <- cameraplusplots(contrast = cont.matrix[ ,j],
             genesetlist = gs.list, vobject = v, design = design,
             catcolors = catcols, title = paste0(colnames(cont.matrix)[j]))
       dev.off()
       camera.res.list[[j]] <- camera.res
       names(camera.res.list)[j] <- colnames(cont.matrix)[j]
       write.csv(data.frame(camera.res), paste0("results/", experiment.title,
        "_GSEA-", colnames(cont.matrix)[j], ".csv"))
      }
      
    • Plot legend (Fig. 5a).
      pdf(paste0("results/", experiment.title, "-10_GSEA-plot-legend.pdf"))
      plot(1, type = 'n', xlab = '', ylab = '', xaxt = 'n', yaxt = 'n', bty = 'n')
      legend("center", names(catcols), cex = 1.2, fill = catcols)
      dev.off()
      
Figure 5. Gene set enrichment analysis (GSEA) output.

a) Competitive GSEA of differential expression in the additive (top) and the combinatorial (bottom) comparisons using limma camera, based on 698 neural gene sets, stratified by 8 neural categories. b) Bar chart showing detailed results of the 10 most significant gene sets as in (a), ranked by significance. Red lines denote enrichment FDR of 5%.

Specific gene subsets: ORA

  • 23
    Create synergistic gene subsets to be analyzed.
    • Adjust FDR cutoff depending on the subtlety of the synergistic effect. In this example a cutoff of synergistic FDR < 1% was chosen.
      log2FC.matrix.sub <- subset(log2FC.matrix, Synergistic.FDR < 0.01)
    • Create additional potentially interesting subsets (commented out):
      ## Synergy genes with combinatorial FDR < 5%
      #log2FC.matrix.sub <- subset(log2FC.matrix, Combinatorial.FDR < 0.05)
      ## Genes with synergistic log2 fold change > 1.5 or < -1.5
      #log2FC.matrix.sub <- subset(log2FC.matrix, Synergistic.logFC > 1.5 |
      #  Synergistic.logFC < -1.5)
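
When choosing cutoffs it helps to keep in mind that the logFC columns are on a log2 scale, so a linear fold change is 2^logFC: a cutoff of ±1 corresponds to a two-fold change and ±1.5 to roughly 2.8-fold. The following sketch shows both kinds of subsetting on a toy table. The table toy.matrix and its values are hypothetical; only the column names mirror those used above.

```r
# Hypothetical toy table with the same column names as log2FC.matrix
toy.matrix <- data.frame(
  Gene_name         = c("A", "B", "C", "D"),
  Synergistic.logFC = c(1.2, -0.4, -1.8, 1.1),
  Synergistic.FDR   = c(0.004, 0.20, 0.008, 0.03),
  stringsAsFactors  = FALSE
)

# Convert log2 fold changes to linear fold changes
toy.matrix$Synergistic.FC <- 2^toy.matrix$Synergistic.logFC

# Subset by significance (synergistic FDR < 1%)
sub.fdr <- subset(toy.matrix, Synergistic.FDR < 0.01)

# Subset by effect size: fold change > 2 or < 0.5, i.e. |log2FC| > 1
sub.fc <- subset(toy.matrix, abs(Synergistic.logFC) > 1)

print(sub.fdr$Gene_name)
print(sub.fc$Gene_name)
```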
      
  • 24

    Stratify chosen subset by synergy category using the custom stratify.by.syn.cat() function.

    syn.cat.list <- stratify.by.syn.cat(log2FC.matrix.sub)
    
  • 25

    Define reference genes as all genes analyzed. Since lowly expressed genes were filtered out at the beginning of the protocol in Step 6, this can be interpreted as all expressed genes.

    allgenes <- as.character(y$genes$Gene_name)
    
  • 26
    Create a list of file paths that specify the directories in which the gene sets of interest are saved.
    • WebGestaltR, which is used for over-representation analysis, requires gene sets to be provided in a different format than camera. Use the following code to reload the gene sets in the WebGestaltR-usable format, as paths to .gmt files (as opposed to the earlier method of loading gs.list for the camera analysis in Step 22).
      gs.list <- list("disorder" = "genesets/disorder.gmt",
      "behavior" = "genesets/behavior.gmt",
      "connectivity" = "genesets/connectivity.gmt",
      "head" = "genesets/head.gmt",
      "neural" = "genesets/neural.gmt",
      "postsynapse" = "genesets/postsynapse.gmt",
      "presynapse" = "genesets/presynapse.gmt",
      "synapse" = "genesets/synapse.gmt")
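
For readers unfamiliar with the .gmt format referenced above: each line holds one gene set as tab-separated fields, with the set name first, a description (often a URL) second, and the member genes after that. The sketch below writes a toy file and reads it back with base R. The helper read.gmt.sketch() and the gene set names are hypothetical and are not part of the protocol's code.

```r
# Write a toy GMT file to a temporary location
gmt.file <- tempfile(fileext = ".gmt")
writeLines(c("toy_set_1\tdescription or URL\tGENE1\tGENE2\tGENE3",
             "toy_set_2\tdescription or URL\tGENE2\tGENE4"),
           gmt.file)

# Hypothetical helper: parse a GMT file into a named list of gene vectors
read.gmt.sketch <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  sets <- lapply(fields, function(x) x[-(1:2)])   # drop name and description
  names(sets) <- vapply(fields, function(x) x[1], character(1))
  sets
}

toy.sets <- read.gmt.sketch(gmt.file)
str(toy.sets)
```

Inspecting a file this way is a quick check that gene identifiers in the gene sets match the identifier type used for the genes of interest.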
      
  • 27
    Run over-representation analysis for “more up” and “more down” synergy categories using the WebGestaltR() function. Here, “more up” refers to synergistically upregulated genes, where upregulation in the combinatorial condition is greater than what would have been predicted by the additive model. In contrast, “more down” refers to synergistically downregulated genes. These categories of interest were created in step 20 using the categorize.synergy() function.
    • Loop through the “more.up” and “more.down” vectors in the previously created (Step 24) list object syn.cat.list.
    • Create a data frame for all results.
    • Visualize using the custom oraplot() function (Fig. 6a, b).
      for (i in 3:4){
        tryCatch({
          # empty data frame collecting the results of all gene set categories
          ora <- data.frame(matrix(ncol = 11, nrow = 0))
          ora.list <- list()
          # genes of interest for the current synergy category
          goi <- syn.cat.list[[i]]
          for (j in seq_along(gs.list)){
            tryCatch({
              ora.s <- WebGestaltR(enrichMethod = "ORA",
                                   organism = "hsapiens",
                                   interestGene = goi,
                                   interestGeneType = "genesymbol",
                                   referenceGene = allgenes,
                                   referenceGeneType = "genesymbol",
                                   enrichDatabase = "others",
                                   enrichDatabaseFile = file.path(gs.list[[j]]),
                                   enrichDatabaseType = "genesymbol",
                                   sigMethod = "top", topThr = 50, minNum = 3,
                                   isOutput = FALSE)
              ora.list[[j]] <- ora.s
              names(ora.list)[j] <- names(gs.list)[j]
              # label each result with its gene set category
              ora.list[[j]]$category <- names(ora.list[j])
              colnames(ora) <- names(ora.list[[1]])
              ora <- rbind.data.frame(ora, ora.list[[j]])
            }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")})
          }
          # save the combined results and plot them for this synergy category
          write.csv(ora, paste0("results/", experiment.title, "_ORA-",
            levels(log2FC.matrix$magnitude.syn)[i], ".csv"))
          g <- oraplot(ora, catcols, paste0(levels(log2FC.matrix$magnitude.syn)[i]))
          pdf(paste0("results/", experiment.title, "-11_ORA-",
            levels(log2FC.matrix$magnitude.syn)[i], "-plots.pdf"))
          print(g)
          dev.off()
        }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")})
      }
      
Figure 6. Over-representation analysis (ORA) output.

a - b) Over-representation analysis, using a hypergeometric test, of 698 curated gene sets and those ‘more downregulated’ (a) and ‘more upregulated’ (b) genes with significant synergistic differential expression (FDR < 1%), ranked by adjusted significance. Red lines denote enrichment FDR of 5%.
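
The hypergeometric test mentioned in the caption can be sketched in a few lines of base R. This is a minimal illustration of the underlying calculation for a single gene set, not the WebGestaltR implementation; the function ora.sketch() and all gene names are hypothetical.

```r
# Hypothetical helper: over-representation of one gene set among genes of interest
ora.sketch <- function(interest, set, reference) {
  # restrict both lists to the reference (all expressed genes)
  interest <- intersect(unique(interest), reference)
  set      <- intersect(unique(set), reference)
  overlap  <- length(intersect(interest, set))
  # P(X >= overlap) when drawing length(interest) genes from the reference
  p <- phyper(overlap - 1, length(set), length(reference) - length(set),
              length(interest), lower.tail = FALSE)
  expected <- length(interest) * length(set) / length(reference)
  c(overlap = overlap, expected = expected, p.value = p)
}

reference <- paste0("gene", 1:1000)                      # all expressed genes
set       <- reference[1:40]                             # one toy gene set
interest  <- c(reference[1:10], reference[901:940])      # e.g. "more up" genes

res <- ora.sketch(interest, set, reference)
print(res)
```

In a full analysis the resulting P-values would be adjusted across all tested gene sets, for example with p.adjust(method = "BH"), before applying the 5% FDR line shown in the figure.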

TIMING

The timing of the biological experiments performed to generate data for the analyses described in the Procedure will vary widely depending on the number of days required to generate and perturb the cell type(s) of interest. To generate the CZN dataset used to illustrate the protocol, we assayed 21-day-old glutamatergic neurons induced by overexpression of NGN2 63 for which CRISPR-perturbations were engineered for the duration of the experimental timeline 64. RNA sequencing library preparation can be completed in 2 days, although RNA sequencing requires 2–6 weeks to be conducted by most fee-for-service arrangements. On a standard laptop, running the entire bioinformatic workflow takes 20 minutes to 2 hours, depending on your level of expertise, the size of your data set and the formatting of your source data. For instance, if your source data are not formatted with Ensembl IDs as rows and condition names as columns, you will have to reshape them to fit this format, which adds time. If your source data use gene names rather than Ensembl IDs, you will be able to skip the annotation steps and save time.

Stem cell culture, CRISPR perturbation and neuronal induction: 21 days

RNA sequencing library preparation: 1–2 days

Sequencing: 2–6 weeks

Synergy analysis, described here: 20 minutes – 2 hours. Step 20 may take a few minutes depending on the number of genes to which you are applying the analysis.

TROUBLESHOOTING

Troubleshooting advice can be found in Table 1.

Table 1.

Troubleshooting table.

Step | Problem | Possible cause | Possible solution
3 | When loading R packages, libraries for WebGestaltR fail to load (may be a macOS-specific problem) | Some packages in R require an X11 Server and/or libraries associated with an X11 server. Apple no longer provides this software with OS X so installation of a third party app is required for full functionality. | Download and install XQuartz (https://www.xquartz.org/) for mac; log out and log back in to complete installation. Packages should now install and load correctly.
3 | Error message: Error in library("PackageName") : there is no package called ‘PackageName’ | The package may not be installed | Use install.packages("PackageName") to first install the package before loading it
5 | Error importing the counts file using read.csv() | The counts file may be in .txt format instead of .csv. | If counts file is in .txt, use read.delim("filename.txt", row.names = 1) when importing counts
5 | When loading data and matching counts file to metadata data frame, metadata data frame generates blank rows | To match the data, counts must have the same number of columns as metadata has rows. Your “Counts” csv file may include extra columns without matching rows in the metadata file. This may be the case for files with extra columns for gene names and info. | Ensure the row names of metadata match column names of counts file. Remove extra columns in “Counts” file and reload data.
5 | Column names in counts data frame are changed to include “X” before the value, which returns an error when matching counts with metadata | R automatically will add an “X” to the beginning of each column name if the first character in a column title is numeric. | Name your columns starting with a character rather than an integer, or ensure your metadata row names also contain the “X”.
5 | Error given: Error in file(file, "rt") : cannot open the connection. | The dataframe cannot be loaded due to an incorrect filename/directory path name given in the loading code. | Check that the filename and directory path listed in this line of code is correct. For example, a metadata csv file named “meta” in the folder “results” would be listed as "results/meta.csv"
7 | Error matching row names of DGEList object to annotations | The code is built to add annotations and gene names for genes in counts file that are listed as Ensembl IDs. If your counts file is already named using gene names and not Ensembl IDs, this matching will not work. | If your counts data already uses gene names rather than Ensembl IDs, skip the two lines of code making annotation row names Ensembl IDs and matching the data frame to the DGEList object
9, 17 | Error given: “mod.gene” not found | “mod.gene” is the name of the variable of interest in the example metadata file. If another name for this variable is used in your own metadata file then these steps will not be able to find this variable in your data. | The “mod.gene” variable in the design model code in step 9 and the qplot code in step 17 must be replaced with the relevant name for the variable of interest in your metadata file.
12 | Error given: The levels must by syntactically valid names in R. | The names for the perturbations in your contrast matrix contain invalid characters (such as “&”) or contain spaces. | Check the list of valid names using the function help(make.names) and change the perturbation names in your metadata file accordingly.
20 | Merging matrices to create log2FC matrix adds extra rows | Duplicate rows in counts data causes incorrect matching | To check if you have duplicate rows check if there is a discrepancy between nrow(unique(res.list$combinatorial[,c("ensembl","Gene_name")])) and nrow(res.list$combinatorial[,c("ensembl","Gene_name")]) by setting them equal to each other. To fix, remove duplicates. This ideally should be done prior to loading data in step 5, but can be accomplished by subsetting using [!duplicated()], i.e.: log2FC.matrix <- merge(a, b[!duplicated(b[, c("ensembl", "Gene_name")]),], all.x=TRUE, by=c("ensembl", "Gene_name"))
22 | .pdf file of the heatmaps generated in step 22 fails to open | May arise from an incomplete generation of one of the figures in this pdf; repeating the execution of the code in this step a second time should produce a working .pdf. | Repeat execution of code in step 22.
Various | .pdf files of multiple figures fail to open | Usually occurs as a result of the code generating the .pdf files being incomplete/not copied over correctly. | Double check execution of pdf-generating code and make sure all formatting is correct when copying code over to your R console.
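
Several of the data-loading problems listed in Table 1 can be checked up front with base R. The sketch below uses small, hypothetical toy inputs (the file contents, sample names and the objects metadata and ann are invented for illustration) to show the “X” prefix behavior of read.csv(), a test for syntactically valid level names, and a duplicate-row check.

```r
# Toy counts file whose sample names start with a digit (hypothetical data)
counts.file <- tempfile(fileext = ".csv")
writeLines(c("gene,1_ctrl,2_treated",
             "ENSG01,10,20",
             "ENSG02,5,0"), counts.file)

# By default read.csv() makes column names syntactically valid, adding "X"
default.names <- colnames(read.csv(counts.file, row.names = 1))
# check.names = FALSE keeps the names exactly as written in the file
kept.names <- colnames(read.csv(counts.file, row.names = 1,
                                check.names = FALSE))

# Metadata row names should match the counts column names
metadata <- data.frame(condition = c("ctrl", "treated"),
                       row.names = c("1_ctrl", "2_treated"))
names.match <- identical(rownames(metadata), kept.names)

# Level names are valid if make.names() leaves them unchanged
levels.in <- c("ctrl", "geneA & geneB", "geneA_geneB")
valid <- levels.in == make.names(levels.in)

# Count and remove duplicated annotation rows before merging
ann <- data.frame(ensembl   = c("ENSG01", "ENSG02", "ENSG02"),
                  Gene_name = c("A", "B", "B"),
                  stringsAsFactors = FALSE)
n.dup <- sum(duplicated(ann[, c("ensembl", "Gene_name")]))
ann.unique <- ann[!duplicated(ann[, c("ensembl", "Gene_name")]), ]
```

Note that names beginning with a digit, although preserved by check.names = FALSE, would still be rejected as contrast level names, so renaming samples and conditions to start with a letter remains the more robust option.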

ANTICIPATED RESULTS

To demonstrate the functionality of this protocol, we re-analyzed data from our study of schizophrenia risk genes, which perturbed four genes individually and in combination, and revealed large synergistic effects 20. To demonstrate the broader applicability of the analysis pipeline we additionally analyzed the results from a study of the interactions between BET and MEK inhibitors to synergistically inhibit the growth of NRAS-mutant melanoma 44.

Differential expression analysis

Step 6 plots gene counts against counts per million (cpm) (Fig. 3a, Extended Data Figure 1a). We aim to retain only genes with more than ten counts in at least four samples. The plot allows us to read off the cpm value corresponding to ten counts (here: 0.25), and this value is subsequently entered to define the “keep” variable.
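
The relationship behind this plot is cpm = count / library size × 10^6, so the cpm value that corresponds to ten counts depends on sequencing depth. The sketch below works through the arithmetic on simulated counts. The library size of 40 million reads is an assumption chosen so that ten counts equal 0.25 cpm; it is not a figure stated in the protocol, and the objects counts, lib.sizes and cpm.mat are hypothetical.

```r
set.seed(42)
n.genes   <- 200
n.samples <- 8

# Assumed total mapped reads per sample (stands in for real library sizes)
lib.sizes <- rep(40e6, n.samples)

# Simulated raw counts for a handful of genes
counts <- matrix(rnbinom(n.genes * n.samples, mu = 20, size = 1),
                 nrow = n.genes,
                 dimnames = list(paste0("gene", seq_len(n.genes)),
                                 paste0("sample", seq_len(n.samples))))

# Counts per million: divide each column by its library size
cpm.mat <- t(t(counts) / lib.sizes) * 1e6

# cpm value that corresponds to ten counts at this depth
cpm.cutoff <- 10 / mean(lib.sizes) * 1e6

# Keep genes above the cutoff in at least four samples
keep <- rowSums(cpm.mat > cpm.cutoff) >= 4
filtered <- counts[keep, , drop = FALSE]
cat("cpm cutoff:", cpm.cutoff, " genes kept:", sum(keep), "\n")
```

With unequal library sizes the cutoff is only approximate for any one sample, which is why reading it off the plot, as described above, is a reasonable way to choose it.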

In Step 8, a multidimensional scaling (MDS) plot displaying sample names is created, as well as color-coded MDS plots for every metadata column. Fig. 3b and Extended Data Figure 1b show two examples of the four plots created from the example data. They show clearly that sample clustering was mainly determined by donor and cell line, which are subsequently added as covariates to the linear model design.

The voom() function used in Step 10 creates a mean-variance plot (Fig. 3c, Extended Data Figure 1c). The fitted trend line is shown in red and can be used as a visual diagnostic tool to assess fit.

Following linear modeling, empirical Bayes moderation is performed in Step 15 and the mean-variance trend for the final model is plotted.

The differential expression results for each comparison are saved as csv files in Step 16, including Ensembl ID, gene symbol and description, logFC, left and right limit of the confidence interval for the logFC, average expression, t-statistic, P-value, adjusted P-value and B-statistic for each gene.

Step 17 then creates the respective mean difference (MA) and volcano plots for each comparison (Fig. 3d, and Extended Data Figure 1d).

Synergistic effect analysis

Step 18 calculates the power to detect synergistic effects, given the variances of the data at hand. The resulting power plot (Fig. 4a, Extended Data Figure 2a) visualizes the power (y) to resolve a specific synergistic logFC (x) for several sample sizes (color).

In Step 19, the synergy coefficient, π1, is calculated to determine the existence of a synergistic component in the data. Similarly, the fraction of genes with a synergistic FDR smaller than 10% is computed to determine the extent of synergy and a histogram of all synergistic p-values is created to visualize their distribution (Fig. 4b, Extended Data Figure 2b). All three are written into a pdf file.
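
To make these three quantities concrete, the sketch below computes them on simulated P-values in base R. It uses Storey's estimator of the null proportion π0 at a single fixed tuning value (λ = 0.5), with π1 = 1 − π0; this is a simplification for illustration, and estimates from the qvalue package may differ because it can smooth over a range of λ values. The vector p.syn and its mixture of null and non-null P-values are hypothetical.

```r
set.seed(7)

# Simulated synergistic P-values: 80% null (uniform), 20% non-null (near zero)
n.null <- 4000
n.alt  <- 1000
p.syn  <- c(runif(n.null), rbeta(n.alt, 0.2, 5))

# Storey's estimator at a fixed lambda: P-values above lambda are mostly null
lambda <- 0.5
pi0 <- min(1, mean(p.syn > lambda) / (1 - lambda))
pi1 <- 1 - pi0

# Fraction of genes with Benjamini-Hochberg FDR below 10%
fdr <- p.adjust(p.syn, method = "BH")
frac.sig <- mean(fdr < 0.10)

# Write the P-value histogram to a pdf, as done for the protocol's figures
pdf(tempfile(fileext = ".pdf"))
hist(p.syn, breaks = 40, main = "Synergistic P-values", xlab = "P-value")
dev.off()

cat("pi1:", round(pi1, 3), " fraction FDR < 10%:", round(frac.sig, 3), "\n")
```

A flat histogram with π1 near zero would indicate little evidence for synergy, whereas a spike near zero, as simulated here, indicates a synergistic component.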

In Step 20, following the creation of a table to combine the logFCs and FDRs of the combinatorial, additive and synergistic comparisons, genes are assigned a synergy category based on these variables. The result is written into a csv file. These categories are then visualized in a pie chart (Step 21, Fig. 4c, Extended Data Figure 2c) and their logFCs are plotted in separate heatmaps (Step 22, Fig. 4d, Extended Data Figure 2d). These categories describe how gene expression varies synergistically between the combinatorial and the additive model. For example, genes that are “more up” have expression levels in the combinatorial condition that are greater than the expected expression based on the additive model.
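
The comparison between the combinatorial and the additive model can be illustrated with toy log2 fold changes. The sketch assumes that the additive expectation is the sum of the individual perturbation logFCs and that the synergistic logFC is the combinatorial logFC minus that sum. The three labels and the tolerance of 0.1 are a simplification for illustration and are not the rules implemented in categorize.synergy(); all values and object names are hypothetical.

```r
# Toy log2 fold changes for two single perturbations and their combination
toy <- data.frame(
  Gene_name           = c("g1", "g2", "g3"),
  logFC.A             = c(0.5, -0.3, 0.2),
  logFC.B             = c(0.4, -0.2, 0.1),
  Combinatorial.logFC = c(1.6, -1.2, 0.3),
  stringsAsFactors    = FALSE
)

# Additive expectation: sum of the individual effects (assumption, see above)
toy$Additive.logFC <- toy$logFC.A + toy$logFC.B

# Synergistic component: observed combination minus additive expectation
toy$Synergistic.logFC <- toy$Combinatorial.logFC - toy$Additive.logFC

# Simplified labels; the tolerance absorbs noise and rounding error
tol <- 0.1
toy$direction <- ifelse(abs(toy$Synergistic.logFC) < tol,
                        "approximately additive",
                        ifelse(toy$Synergistic.logFC > 0,
                               "above additive expectation",
                               "below additive expectation"))
print(toy)
```

In the real analysis the decision also depends on the FDR of each comparison rather than on a fixed tolerance.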

Gene set enrichment analysis

Gene set enrichment analysis of all comparisons is performed in Step 22. The cameraplusplots() function used here also creates a scatter plot of all gene sets tested and their −log10(FDR) for each comparison (Fig. 5a, Extended Data Fig. 3a). It also plots bar charts of the −log10(FDR) for the top 5 gene sets in each comparison (Fig. 5b, Extended Data Figure 3b). In the same loop, the results are saved as csv files for each comparison. Finally, Step 27 creates bar charts for the over-representation analysis of “more up” and “more down” genes (or any other gene subset chosen) and saves the results as csv files (Fig. 6, Extended Data Figure 4).

Extended Data

Extended Data Fig. 1. Differential expression analysis output, related to Figure 3.

A) Plot showing counts over cpm. Horizontal red line marks 10 counts. Arrow indicates the intersection with the plotted data, which here equals 1.4 cpm (vertical red line). B) MDS plots highlighting two metadata variables respectively. Sample data are separated by treatment (left), but not by replicate (right). C) Voom mean-variance plot. D) Volcano and mean difference (MA) plots of differential expression in the additive (left) and the combinatorial (right) comparisons. Significantly differentially expressed genes are highlighted in blue and red (Volcano plot) and the top 10 significant genes are denoted in blue (MA plot).

Extended Data Fig. 2. Synergistic effect analysis output, related to Figure 4.

A) Plot visualizing synergistic effect power calculations. X-axis shows synergistic log2FCs. In the current example, 10 samples per condition are required to resolve a synergistic log2FC of 1.6 at 75% power. B) Histogram of synergistic P-values. C) Pie chart showing the proportions of genes that fall into different synergistic differential expression categories. D) Hierarchical clustering of the differential expression log2(fold changes) of all synergy categories, in the additive model versus the combinatorial perturbation comparisons.

Extended Data Fig. 3. Gene set enrichment analysis (GSEA) output, related to Figure 5.

A) Competitive GSEA of differential expression in the additive (top) and the combinatorial (bottom) comparisons using limma camera, based on two cancer hallmark gene sets. B) Bar chart showing detailed results of the 10 most significant gene sets as in (A). Red lines denote enrichment FDR of 5%.

Extended Data Fig. 4. Over-representation analysis (ORA) output, related to Figure 6.

A - B) Over-representation analysis, using a hypergeometric test, of 2 publicly available gene sets and those ‘more downregulated’ (A) and ‘more upregulated’ (B) genes with significant synergistic differential expression (FDR < 1%), ranked by adjusted significance. Red lines denote enrichment FDR of 5%.

Supplementary Material

data+code 2

Supplementary Data 2. “data+code_Echevarria-Vargas.zip”. Originally published 44.

data+code Schrode

Supplementary Data 1. “data+code_Schrode.zip”, also available from www.synapse.org/#!Synapse:syn20502314. Originally published 20.

ACKNOWLEDGEMENTS

This work was partially supported by National Institutes of Health (NIH) grants R56 MH101454 (K.J.B.) and R01 MH106056 (K.J.B.). This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.

Footnotes

Code availability

Code is available at https://github.com/nadschro/synergy-analysis.

COMPETING FINANCIAL INTEREST STATEMENT

The authors declare no conflicts of interest.

Data availability

RNA-seq data from our study of schizophrenia risk genes 20, including their individual and combined perturbation, are available at www.synapse.org/#!Synapse:syn20502314. Downloading these data requires that you are a registered Synapse user and have agreed to the Synapse terms of use. Figures 3, 4, 5 and 6 were created based on these data. RNA-seq data from the NRAS-mutant melanoma study (ref. 44) can be accessed at… and were reanalyzed here to generate Extended Data Figures 1–4.

REFERENCES

  • 1. Pardinas AF et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat Genet, doi: 10.1038/s41588-018-0059-2 (2018).
  • 2. Nalls MA et al. Expanding Parkinson’s disease genetics: novel risk loci, genomic context, causal insights and heritable risk. bioRxiv, 388165, doi: 10.1101/388165 (2019).
  • 3. Nelson CP et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat Genet 49, 1385–1391, doi: 10.1038/ng.3913 (2017).
  • 4. Xue A et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun 9, 2941, doi: 10.1038/s41467-018-04951-w (2018).
  • 5. Satterstrom FK et al. Autism spectrum disorder and attention deficit hyperactivity disorder have a similar burden of rare protein-truncating variants. Nat Neurosci 22, 1961–1965, doi: 10.1038/s41593-019-0527-8 (2019).
  • 6. Cavalli G & Heard E Advances in epigenetics link genetics to the environment and disease. Nature 571, 489–499, doi: 10.1038/s41586-019-1411-0 (2019).
  • 7. Chaste P & Leboyer M Autism risk factors: genes, environment, and gene-environment interactions. Dialogues Clin Neurosci 14, 281–292 (2012).
  • 8. Visscher PM et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet 101, 5–22, doi: 10.1016/j.ajhg.2017.06.005 (2017).
  • 9. Ye CJ et al. Intersection of population variation and autoimmunity genetics in human T cell activation. Science 345, 1254665, doi: 10.1126/science.1254665 (2014).
  • 10. Moyerbrailean GA et al. High-throughput allele-specific expression across 250 environmental conditions. Genome Res 26, 1627–1638, doi: 10.1101/gr.209759.116 (2016).
  • 11. Phillips PC Epistasis--the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet 9, 855–867, doi: 10.1038/nrg2452 (2008).
  • 12. Wray NR, Wijmenga C, Sullivan PF, Yang J & Visscher PM Common Disease Is More Complex Than Implied by the Core Gene Omnigenic Model. Cell 173, 1573–1580, doi: 10.1016/j.cell.2018.05.051 (2018).
  • 13. Boyle EA, Li YI & Pritchard JK An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell 169, 1177–1186, doi: 10.1016/j.cell.2017.05.038 (2017).
  • 14. Baeza-Centurion P, Minana B, Schmiedel JM, Valcarcel J & Lehner B Combinatorial Genetics Reveals a Scaling Law for the Effects of Mutations on Splicing. Cell 176, 549–563 e523, doi: 10.1016/j.cell.2018.12.010 (2019).
  • 15. Kuzmin E et al. Systematic analysis of complex genetic interactions. Science 360, doi: 10.1126/science.aao1729 (2018).
  • 16. VanderSluis B et al. Integrating genetic and protein-protein interaction networks maps a functional wiring diagram of a cell. Curr Opin Microbiol 45, 170–179, doi: 10.1016/j.mib.2018.06.004 (2018).
  • 17. Shalem O, Sanjana NE & Zhang F High-throughput functional genomics using CRISPR-Cas9. Nat Rev Genet 16, 299–311, doi: 10.1038/nrg3899 (2015).
  • 18. Rehbach K, Fernando MB & Brennand KJ Integrating CRISPR Engineering and hiPSC-Derived 2D Disease Modeling Systems. J Neurosci 40, 1176–1185, doi: 10.1523/JNEUROSCI.0518-19.2019 (2020).
  • 19. Hoffman GE, Schrode N, Flaherty E & Brennand KJ New considerations for hiPSC-based models of neuropsychiatric disorders. Mol Psychiatry 24, 49–66, doi: 10.1038/s41380-018-0029-1 (2019).
  • 20. Schrode N et al. Synergistic effects of common schizophrenia risk variants. Nat Genet 51, 1475–1485, doi: 10.1038/s41588-019-0497-5 (2019).
  • 21. Wang M et al. Molecular Networks and Key Regulators of the Dysregulated Neuronal System in Alzheimer’s Disease. bioRxiv, 788323, doi: 10.1101/788323 (2019).
  • 22. Elam KK, Clifford S, Shaw DS, Wilson MN & Lemery-Chalfant K Gene set enrichment analysis to create polygenic scores: a developmental examination of aggression. Translational psychiatry 9, 212, doi: 10.1038/s41398-019-0513-7 (2019).
  • 23. Choi SW & O’Reilly PF PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience 8, doi: 10.1093/gigascience/giz082 (2019).
  • 24. Mimitou EP et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat Methods 16, 409–412, doi: 10.1038/s41592-019-0392-0 (2019).
  • 25. Dixit A et al. Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell 167, 1853–1866 e1817, doi: 10.1016/j.cell.2016.11.038 (2016).
  • 26. Datlinger P et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat Methods 14, 297–301, doi: 10.1038/nmeth.4177 (2017).
  • 27. Readhead B et al. Expression-based drug screening of neural progenitor cells from individuals with schizophrenia. Nat Commun 9, 4412, doi: 10.1038/s41467-018-06515-4 (2018).
  • 28. Duan Q et al. LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Res 42, W449–460, doi: 10.1093/nar/gku476 (2014).
  • 29. Charbogne P, Kieffer BL & Befort K 15 years of genetic approaches in vivo for addiction research: Opioid receptor and peptide gene knockout in mouse models of drug abuse. Neuropharmacology 76 Pt B, 204–217, doi: 10.1016/j.neuropharm.2013.08.028 (2014).
  • 30. Vasilatos SN et al. Crosstalk between lysine-specific demethylase 1 (LSD1) and histone deacetylases mediates antineoplastic efficacy of HDAC inhibitors in human breast cancer cells. Carcinogenesis 34, 1196–1207, doi: 10.1093/carcin/bgt033 (2013).
  • 31. Shahbazi J et al. The Bromodomain Inhibitor JQ1 and the Histone Deacetylase Inhibitor Panobinostat Synergistically Reduce N-Myc Expression and Induce Anticancer Effects. Clinical cancer research : an official journal of the American Association for Cancer Research 22, 2534–2544, doi: 10.1158/1078-0432.CCR-15-1666 (2016).
  • 32. Walasek MA et al. The combination of valproic acid and lithium delays hematopoietic stem/progenitor cell differentiation. Blood 119, 3050–3059, doi: 10.1182/blood-2011-08-375386 (2012).
  • 33. Slowikowski K et al. CUX1 and IkappaBzeta (NFKBIZ) mediate the synergistic inflammatory response to TNF and IL-17A in stromal fibroblasts. Proc Natl Acad Sci U S A 117, 5532–5541, doi: 10.1073/pnas.1912702117 (2020).
  • 34. Kuchenov D et al. A combinatorial extracellular code tunes the intracellular signaling network activity to distinct cellular responses. bioRxiv, 346957, doi: 10.1101/346957 (2018).
  • 35. Fursova NA et al. Synergy between Variant PRC1 Complexes Defines Polycomb-Mediated Gene Repression. Mol Cell 74, 1020–1036 e1028, doi: 10.1016/j.molcel.2019.03.024 (2019).
  • 36. Glover KP, Chen Z, Markell LK & Han X Synergistic Gene Expression Signature Observed in TK6 Cells upon Co-Exposure to UVC-Irradiation and Protein Kinase C-Activating Tumor Promoters. PLoS One 10, e0139850, doi: 10.1371/journal.pone.0139850 (2015).
  • 37. Licciardello MP et al. A combinatorial screen of the CLOUD uncovers a synergy targeting the androgen receptor. Nature chemical biology 13, 771–778, doi: 10.1038/nchembio.2382 (2017).
  • 38. Sriraman A et al. Cooperation of Nutlin-3a and a Wip1 inhibitor to induce p53 activity. Oncotarget 7, 31623–31638, doi: 10.18632/oncotarget.9302 (2016).
  • 39. Gupta S et al. IL-6 augments IL-4-induced polarization of primary human macrophages through synergy of STAT3, STAT6 and BATF transcription factors. Oncoimmunology 7, e1494110, doi: 10.1080/2162402X.2018.1494110 (2018).
  • 40. Goldstein I, Paakinaho V, Baek S, Sung MH & Hager GL Synergistic gene expression during the acute phase response is characterized by transcription factor assisted loading. Nat Commun 8, 1849, doi: 10.1038/s41467-017-02055-5 (2017).
  • 41. Oner MG et al. Combined Inactivation of TP53 and MIR34A Promotes Colorectal Cancer Development and Progression in Mice Via Increasing Levels of IL6R and PAI1. Gastroenterology 155, 1868–1882, doi: 10.1053/j.gastro.2018.08.011 (2018).
  • 42. Smitheman KN et al. Lysine specific demethylase 1 inactivation enhances differentiation and promotes cytotoxic response when combined with all-trans retinoic acid in acute myeloid leukemia across subtypes. Haematologica 104, 1156–1167, doi: 10.3324/haematol.2018.199190 (2019).
  • 43. Rajaraman S et al. Measles Virus-Based Treatments Trigger a Pro-inflammatory Cascade and a Distinctive Immunopeptidome in Glioblastoma. Mol Ther Oncolytics 12, 147–161, doi: 10.1016/j.omto.2018.12.010 (2019).
  • 44. Echevarria-Vargas IM et al. Co-targeting BET and MEK as salvage therapy for MAPK and checkpoint inhibitor-resistant melanoma. EMBO molecular medicine 10, doi: 10.15252/emmm.201708446 (2018).
  • 45. Storey JD & Tibshirani R Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100, 9440–9445, doi: 10.1073/pnas.1530509100 (2003).
  • 46. Corney DC RNA-seq Using Next Generation Sequencing. Materials and Methods 3 (2013).
  • 47. Hoffman GE & Schadt EE variancePartition: interpreting drivers of variation in complex gene expression studies. BMC bioinformatics 17, 483, doi: 10.1186/s12859-016-1323-z (2016).
  • 48. Hoffman GE et al. Transcriptional signatures of schizophrenia in hiPSC-derived NPCs and neurons are concordant with post-mortem adult brains. Nat Commun 8, 2225, doi: 10.1038/s41467-017-02330-5 (2017).
  • 49. Ritchie ME et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43, e47, doi: 10.1093/nar/gkv007 (2015).
  • 50. Robinson MD, McCarthy DJ & Smyth GK edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140, doi: 10.1093/bioinformatics/btp616 (2010).
  • 51. Kolde R pheatmap: Pretty Heatmaps. R package version 1.0.12, <https://CRAN.R-project.org/package=pheatmap> (2019).
  • 52. Neuwirth E RColorBrewer: ColorBrewer Palettes. R package version 1.1–2, <https://CRAN.R-project.org/package=RColorBrewer> (2014).
  • 53. Wickham H ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York (2016).
  • 54. Kassambara A ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version 0.2.5, <https://CRAN.R-project.org/package=ggpubr> (2020).
  • 55. Storey JD, Bass AJ, Dabney A & Robinson D qvalue: Q-value estimation for false discovery rate control. R package version 2.18.0, <http://github.com/jdstorey/qvalue> (2019).
  • 56. Wickham H The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software 40, 1–29 (2011).
  • 57. Ram K & Wickham H wesanderson: A Wes Anderson Palette Generator. R package version 0.3.6, <https://CRAN.R-project.org/package=wesanderson> (2018).
  • 58. Morgan M, Falcon S & Gentleman R GSEABase: Gene set enrichment data structures and methods. R package version 1.48.0, <https://bioconductor.org/packages/release/bioc/html/GSEABase.html> (2019).
  • 59. R Core Team. R: A language and environment for statistical computing, <https://www.R-project.org/> (2019).
  • 60. Wickham H & Seidel D scales: Scale Functions for Visualization. R package version 1.1.1, <https://CRAN.R-project.org/package=scales> (2020).
  • 61. Wang J & Liao Y WebGestaltR: Gene Set Analysis Toolkit WebGestaltR. R package version 0.4.3, <https://CRAN.R-project.org/package=WebGestaltR> (2020).
  • 62. Wickham H stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, <https://CRAN.R-project.org/package=stringr> (2019).
  • 63. Ho SM et al. Rapid Ngn2-induction of excitatory neurons from hiPSC-derived neural progenitor cells. Methods 101, 113–124, doi: 10.1016/j.ymeth.2015.11.019 (2016).
  • 64. Ho SM et al. Evaluating Synthetic Activation and Repression of Neuropsychiatric-Related Genes in hiPSC-Derived NPCs, Neurons, and Astrocytes. Stem Cell Reports, doi: 10.1016/j.stemcr.2017.06.012 (2017).

Related links

Key reference(s) using this protocol

Schrode, N. et al. Nat Genet 51, 1475–1485 (2019): https://doi.org/10.1038/s41588-019-0497-5

Key data used in this protocol

Orbán-Németh Z. et al. Nat. Protoc. 13, 478–494 (2018) https://doi.org/10.1038/s41596-019-0147-5
