Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2023 Feb 2;51(4):1637–1651. doi: 10.1093/nar/gkad046

Recovering false negatives in CRISPR fitness screens with JLOE

Merve Dede 1, Traver Hart 2,3,
PMCID: PMC9976895  PMID: 36727483

Abstract

It is widely accepted that pooled library CRISPR knockout screens offer greater sensitivity and specificity than prior technologies in detecting genes whose disruption leads to fitness defects, a critical step in identifying candidate cancer targets. However, the assumption that CRISPR screens are saturating has been largely untested. Through integrated analysis of screen data in cancer cell lines generated by the Cancer Dependency Map, we show that a typical CRISPR screen has a ∼20% false negative rate, in addition to library-specific false negatives. Replicability falls sharply as gene expression decreases, while cancer subtype-specific genes within a tissue show distinct profiles compared to false negatives. Cumulative analyses across tissues improves our understanding of core essential genes and suggest only a small number of lineage-specific essential genes, enriched for transcription factors that define pathways of tissue differentiation. To recover false negatives, we introduce a method, Joint Log Odds of Essentiality (JLOE), which builds on our prior work with BAGEL to selectively rescue the false negatives without an increased false discovery rate.

INTRODUCTION

The search for essential genes—genes whose loss of function results in a severe fitness defect—has been of outstanding interest to the scientific community. Prior to advanced genomic technologies, the assumption was that the majority of genes were essential for life (1). This idea was disproven by several studies that utilized saturating random mutagenesis to show that in Caenorhabditis elegans and Saccharomyces cerevisiae, only 12–15% of the genome was estimated to be essential (2,3). These studies were limited by the methods at the time and the lack of the availability of complete genome sequences.

With the advances in genome technologies that enabled sequencing of eukaryotic organisms, systematic gene knockout studies were performed in S. cerevisiae, identifying essential genes by deletion of open reading frames in the yeast genome (4,5). This method identified 17% of yeast genes as essential for growth in rich medium (5). Generation of siRNA and shRNA libraries to conduct genome-wide RNAi screens facilitated the study of essential genes in other model organisms (6–9). In these RNAi screens, 30% of the genome was shown to be essential in D.melanogaster cell lines, (6), compared to only 8.5% of the C. elegans genome in whole worms (7). A later study showed that a binary classification of genes into essential and non-essential was misleading due to the context dependent nature of gene essentiality and that 97% of yeast genes showed some growth phenotype under different environmental conditions (10).

Identifying essential genes in human cancer cells is of special interest in oncology since the cancer-specific essential genes represent genomic vulnerabilities that can potentially be targeted with novel therapeutic agents. Studies using CRISPR/Cas9 technology revealed that mammalian cells have more essential genes than RNAi screens were able to detect and that, at the same false discovery rate, CRISPR screens generated 3–4 times more essential genes (11). Moreover, multiple groups revealed lists of ∼2000 highly concordant human essential genes, and comparison of CRISPR technology to orthogonal techniques such as random insertion of gene traps also showed consistent results (12–14). These findings were initially thought to indicate that the CRISPR-Cas9 screens are near-saturating; that is, that a well-designed screen can detect a cell's full complement of essential genes. However, it is still poorly understood whether possible systematic biases in CRISPR screens affect our understanding of human cell-intrinsic gene essentiality. In the absence of a ground truth, the actual true positive, false positive and false negative rates of a typical genome-wide CRISPR-Cas9 knockout screen are still unknown.

In this study, we examine some of the biases characteristic of genome-wide CRISPR-Cas9 knockout screening. Using publicly available genome wide screen data from 769 genetically heterogeneous cell lines from the Cancer Dependency Map initiative (15,16), we demonstrate the systematic biases in these screens, investigate and model the actual number of essential genes identifiable with CRISPR technology, estimate the false discovery rate (FDR) and false negative rate (FNR) in a CRISPR screen, and describe a computational method, the Joint Log Odds of Essentiality (JLOE), that can rescue these false negatives. We show that JLOE improves screen concordance with the Sanger dataset (17) and reduces the number of background-specific essential genes.

MATERIAL AND METHODS

Essentiality data generation

The raw read count file of genome-wide CRISPR pooled library screens for 769 cell lines using Avana library (Broad DepMap, 2020b; Meyers et al., 2017) (Broad Institute's DepMap Project 20Q2 release) was downloaded from the data depository at https://depmap.org/portal/. The dataset was filtered in order to keep only the protein-coding genes for downstream analysis and the gene names were updated using Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) (18) and Consensus Coding Sequence (CCDS) (19) databases. sgRNAs targeting multiple genes were discarded to avoid genetic interaction effects. The updated raw read counts were processed with the CRISPRcleanR (20) algorithm to correct for gene- independent copy number induced fitness effects and calculate fold change. CRISPRcleanR uses a circular binary segmentation algorithm which was previously used during the analysis for array-based comparative genomic hybridization assay (21,22) and applies it to genome-wide CRISPR screens. For each individual chromosome, CRISPRcleanR detects regions targeted by multiple sgRNAs with reasonably equal log fold changes (logFCs). If the sgRNAs in these regions target a low number of unique genes, this would indicate that the phenotype arises due to gene-independent effects and therefore the logFCs are corrected through mean or median based centering depending on the presence of outliers (20). CRISPRcleanR processed fold changes of each cell line were analyzed through updated BAGEL2 algorithm (23) (https://github.com/hart-lab/bagel). Compared to the previous published version of BAGEL (24), the updated version of BAGEL employs a linear regression model to interpolate outliers and uses 10-fold cross validation for data sampling. BAGEL is a Bayesian classifier that is trained using previously defined gold standard reference sets of core-essential and nonessential gene sets. BAGEL estimates the distribution of fold changes of all gRNAs targeting all genes in either the essential or nonessential training sets and then it calculates the log likelihood of uncharacterized sgRNAs belonging to either the essential or nonessential distributions and gives a log Bayes Factor (BF) as the final output. Essentiality of each gene was measured as BF, which reflects the relative statistical confidence of gene essentiality based on gold standard reference sets of 681 core essential genes and 927 nonessential genes (11,25). Positive BF indicates essential genes whereas negative BF indicates non-essential genes. The list of gold standard core-essential genes and nonessential genes can be found in the same repository as the BAGEL v2 software. The qualities of each screen were evaluated through the ‘precision-recall’ function of BAGEL and the F-measure, which is the harmonic mean of the precision recall, was calculated for each screen at BF of 5. Finally, 659 cell lines for the Avana dataset were selected for downstream analysis by an F-measure threshold of 0.80 to prevent noise from marginal quality of the screens.

Cumulative essentials analysis

A cumulative analysis approach was used to evaluate and display the cumulative distribution of essential genes and calculate the total number of true essentials (true positives) as well as the error rate (FDR) in an average CRISPR-Cas9 knockout screen. This approach is based on the principle described previously (11) that if you screen a sufficient number of cell lines with no false positives, you would expect to see a cumulative essential genes observations plot that flattens out with a slope of zero when the total number of essential genes is reached. However, in reality, if you have some FDR or heterogeneity in the population of cell lines being screened, the slope of the cumulative essential plot would stay positive. This is because the true hits get saturated but in addition, it is likely that false positives as well as cell-line specific context essential genes will also be captured and be added on in each consecutive screen.

To perform the cumulative essentials analysis for the screens in the DepMap project, herein referred to as the Avana data, the BFs for all genes across all cell lines was constructed in a separate matrix and the cumulative essentials plot was modeled in any sets of 20 screens from the dataset. First, an initial set of 20 cell lines were sampled without replacement and the essential genes in the first screen out of 20 were identified with a BF of greater than or equal to 10. Then the subsequent screen was evaluated to obtain the essential genes in that screen and the newly discovered essential genes that were not identified in the prior screen were added to the list of essential gene hits to obtain a cumulative essential gene list. The process was repeated for all 20 cell lines to capture all cumulative essentials in the 20 screens. This random sampling process was repeated 100 times to prevent bias in the modeling process. The resulting mean cumulative essentials curves were plotted with standard deviations reflecting the variability of cumulative essential genes observed in each of 100 iterations. Moreover, for each iteration, the newly discovered essential genes in each set of 20 screens were also identified and displayed on the same plots.

Synthetic genome modeling approach

To estimate the number of essential genes per screen as well as the average error rate, in silico simulations of synthetic screens were conducted. First, for each dataset, a synthetic genome was constructed with the number of genes assayed in that library (i.e. for Avana library N = 18 111). For a given genome with N number of genes in it, there is a number of true essential genes, represented by n, and number of non-essential genes (N – n). Then, the precision (1 – FDR) of the assay is represented by the ratio of true positives to that of the total number of hits. We defined a range of thresholds for false discovery rate to test in our model from 1–15%. We then randomly sampled this synthetic genome with a screen with randomly drawn 1287 hits for the Avana dataset (the mean number of essential gene hits at BF ≥ 10 across all Avana screens) from the essential (n) and non-essential (N − n) populations based on the defined FDR in the simulation (e.g. for 10% FDR, 129 nonessentials and 1158 essentials were randomly selected). We observed the cumulative essential genes across eight iterations since we estimated that sampling eight screens was a good estimate of observing the trend in screen saturation in the Avana data. At the same time, we constructed a mean cumulative essentials curve determined from the Avana data for any 8 screens using bootstrapping for 100 iterations. After running the simulations for a range of different number of essential genes (n) and FDR values, cumulative observation curves were plotted for each simulation. The root mean squared error for every synthetic screen was calculated by evaluating the difference between the observed values and the cumulative essentials curve determined from the Avana data.

Since the Avana data contains cell lines arising from multiple tissue types, it is possible that some of the tissue specific essential genes would be wrongly classified as false positives. Therefore, in order to minimize this effect, the synthetic genome modeling approach was repeated for individual tissue types which were represented by at least 16 high-quality cell lines (as defined by F-measure of greater than or equal to 0.8) using the same parameters described above. All of the analyses were performed using Python version 3.6 using multiple packages including pandas (26), NumPy (27), the sklearn.metrics, sklearn.utils, resample modules in SciKits (28), Matplotlib (29), seaborn (30) and scipy (31).

Saturation modeling approach and identification of high-confidence essential genes

While the in-silico simulations with synthetic genome model enabled an estimation of the average number of essential genes per screen with an upper limit for the FDR, they did not give information about which genes were truly essential. To distinguish truly essential genes from false positives, we identified the tissue types in the Avana data that were well represented (n ≥ 16 screens). Eighteen tissue types in the Avana fit our criteria and for these tissue types, the frequency of essential gene observations in any eight screens was evaluated separately for each dataset. For each tissue type, a set of eight initial screens were selected, the cumulative essential genes in this set were identified and the frequency of essentiality observations was plotted for this set. For the modeling, it was assumed that if these first eight screens have reached saturation (to approximate for our model), then the subsequent hits in the next set of screens would give us false positives. Therefore, a subsequent set of 8 screens from the same tissue type were randomly selected without replacement and the newly discovered cumulative essential genes that were not present in the initial screens were identified to model the frequency distribution of the false positives. This process was repeated 100 times for each tissue type and the resulting distribution of essential gene counts in these screens were plotted, which were used to calculate the bin-wise FDR. For the 100 iterations that were performed per tissue type, genes observed as essential in at least three screens on an average out of 8 were considered as high confidence essential genes in that tissue. Finally, we assessed how many genes were captured as essential in common in all tissues to find the set of ‘common’ essential genes (n = 954) and context essentials in each tissue type. We used the UpsetR package (32) in R programming language to visualize the set of intersections of essential genes across all tissue types for both of the datasets.

Expression data and analysis

The log2 transformed RNA-seq TPM expression data was utilized from DepMap Data Portal for the 2020Q2 release (33) for this analysis. The mean TPM expression levels were calculated for all genes in each bin in their corresponding cell lines in which the gene was a hit. The process was repeated for all the bins from the histogram of the frequency of essentiality observations of the initial set of eight screens representing the true positives and the subsequent set of eight screens representing false positives.

Essentiality profiles of hits in non-essential screens

In this analysis, we evaluated the mean essentiality scores of genes representing true positives in the screens that they were not observed as essential in. For each iteration, we determined the screens in which genes in each bin had a BF <10 and calculated a mean essentiality score for each gene in those screens. We then visualized the distribution of mean essentiality score observations of genes in each bin in the non-essential screens as a violin plot using the Seaborn package (30).

Joint log odds of essentiality (JLOE) approach

In order to explain the methodology behind JLOE, we need to re-visit some details about our BAGEL algorithm. The output of BAGEL is the Bayes factor (BF), the statistical representation of which can be displayed with the equation below:

graphic file with name M0001.gif

This equation can be converted to a posterior log odds as follows by multiplying by an appropriate ratio of priors:

graphic file with name M0001a.gif
graphic file with name M0002.gif

Performing log transformation, we get:

graphic file with name M0003.gif

Based on empirical information from genome-wide CRISPR-Cas9 knockout screens regarding the number of essential genes in a given genome, BAGEL considers Inline graphic (ie. the background expectation of gene essentiality in humans is 10%), which yields Inline graphic and therefore we have:

graphic file with name M0006.gif

The relationship between the posterior probability of essentiality and BFs can be seen in Supplementary Figure S1C, which shows that BF of 10 corresponds to a posterior probability of essentiality of ∼99% and that is why we use this strict threshold for identifying essential genes. During BF calculations with our BAGEL algorithm, we traditionally use this flat prior ratio of −3 for all screens. However, JLOE relies on the notion that this prior ratio gets updated as we gain more observations from the available data. The final outcome is a binary essentiality call and JLOE score based on posterior log odds ratio for each gene instead of a BF.

First, we revisited our saturation modeling approach, where we evaluated sets of eight screens at a time in each tissue type. Then we calculated the posterior probability of essentiality for that gene in each screen sequentially. For each gene, we ranked the screens from the highest BF to lowest. Next, for the top ranked screen, we used the initial log prior ratio of −3 and assigned a binary call of essentiality for posterior log(odds) >7 (corresponding to a posterior Pr(ess) of 99%). For the second-ranked screen, the prior is now based on whether the gene is essential in the top-ranked screen. If so, we use an updated prior ratio derived from the bin-wise FDR calculated from the saturation modeling approach. For FDR ∼ 50%, the log prior ratio is approximately zero:

graphic file with name M0007.gif

Again, a joint posterior Log Odds >7 indicates gene essentiality. The process is repeated for the next ranked screen, until all eight screens are evaluated. Since the bin-wise FDR values are negligible for frequent hits (FDR 0.3% for genes essential in five of eight screens; log prior ratio 8.4), we capped the log prior ratio at nine to prevent instability. In addition, to ensure that all cell lines are properly sampled, we repeated the sampling process 100 times, selecting 8 screens without replacement in each iteration from every tissue type to minimize bias. For every tissue type, we recorded all binary calls and joint posterior Log Odds for all genes in 100 iterations as well as the cell lines that were sampled in each iteration. Finally, we converted these tissue level scores (frequency of observation out of eight screens) to cell line level assignments by calculating the total number of times a binary call of essentiality was made for each gene in each cell line across 100 iterations with JLOE and then normalizing by how many times that cell line was chosen in our 100 iterations. With this method, we obtained normalized binary essentiality calls at a cell line level that ranged from 0 to 1 for each gene as well as normalized JLOE scores. We provide these tables across all tissue types in Supplementary Tables S6 and S7, respectively. Also, we note that our JLOE methodology can be applied to other BF datasets with less that eight cell lines in a tissue type, using the bin-wise priors shown in Supplementary Table S2.

Identification of false negative genes in tissues

Mean frequency observations out of eight screens with JLOE approach were compared to those from the saturation modeling approach. The observations were binned from each method and hits in <8/8 screens at BF >10 with saturation modeling approach but 8/8 screens after JLOE in both Avana and Sanger screens were classified as false-negative genes across all tissue types.

RESULTS

To systematically evaluate the biases in genome-wide CRISPR-Cas9 knockout screening, we processed the raw read counts of loss of function screens performed in 769 genetically heterogeneous cell lines from the 2020Q2 release of the publicly available Avana data (15). We applied our computational pipeline as described in the methods section to correct for copy number effects using the previously described CRISPRcleanR (20) and BAGEL (23,24) algorithms to assign an essentiality score (Bayes Factor, BF) to each gene in every screen. After applying quality control metrics (see Methods), our final dataset included 659 screens meeting the F-measure criteria of 0.80 and above (Supplementary Figure S1A and B, Supplementary TableS1). These 659 cell lines are derived from 25 tissue types, with varying representation (Figure 1A).

Figure 1.

Figure 1.

Synthetic genome modeling. (A) The representation of each tissue type in the 659 high-performing screens in the Avana data. Tissue types with > = 16 high-performing cell lines are indicated with red bars. (B) The cumulative number of essential genes (BF ≥ 10) and newly discovered essential gene curves in any randomly selected set of 20 Avana screens show a curve that converges into a positive slope. Blue circles indicate the number of cumulative essential genes in each screen and the green triangles show the number of newly discovered essential gene hits in each consecutive screen. The error bars represent the standard deviation of the number of cumulative essential genes and the newly discovered essentials in 100 iterations. (C) Diagram overview of the synthetic genome modeling of essential genes. From a synthetic genome of 18 111 genes, 1287 hits were randomly sampled from the essential (n) and non-essential gene (NEG) populations based on the defined false discovery rate (FDR) in the simulation. The resulting cumulative essential hits across eight iterations were plotted and compared to the mean cumulative essentials curve from the Avana data. (D) The synthetic genome modeling reveals the best fitting model with n = 1600 genes and 9.5% FDR (white box). (E) Plot showing the model with the best fit (in blue) to the Avana data (in red). (F) Average screen false negative rate (FNR) as determined by comparing the number of essential genes from the best fit model in each tissue type to the mean number of observed essentials in each tissue type. The red dashed line indicates a mean FNR of 21.2% across all tissue types tested.

Synthetic genome modeling of essential genes in 659 cell lines in the avana data

To estimate the total number of essential genes in a cell, we employed a method based on the cumulative essentiality observations across all screens in the Avana data. Our approach is based on the expectation that, for a sufficient number of identical screens without any false positives, a plot of the cumulative number of essential genes would flatten to a slope of zero when all of the essential genes in the population are identified. In contrast, in a population of screens with either cellular heterogeneity or some degree of false discovery rate (or most likely both), the slope of the cumulative essential genes plot would remain positive, reflecting the continuous accumulation of false positives as well as cell line specific essential genes in otherwise saturated screens. This principle was previously demonstrated to estimate the total population of essential genes detected in shRNA screens (11).

We applied this logic to the Avana dataset and plotted the cumulative essential genes across sets of 20 cell lines that are randomly selected without replacement from all Avana screens (Figure 1B). The sampling process was repeated 100 times to minimize bias in the selection of screens. To define gene essentiality, a strict threshold of BF >10 was used for every gene in each screen since this threshold represents a posterior probability of gene essentiality of ∼99% (Supplementary Figure S1C). Using this BF 10 threshold yielded a cumulative essentials curve that converged to a positive slope, which was consistent with previous observations in the shRNA screens (Figure 1B).

We propose a model of cell-intrinsic gene essentiality and CRISPR screen sensitivity that describes the trend shown in Figure 1B. First, there exists a large population of essential genes across all screens. Second, we posit the existence of some saturation point, here shown at k = 8 screens (n = ∼2500 cumulative essential genes). Given this saturation point, it follows that all essential genes are never captured in a single experiment and each screen carries some unknown false negative rate (FNR). Multiple screens are needed before all essential genes in the population can be identified. Third, after the saturation point is reached, additional screens continue to accumulate a combination of false positives as well as context-specific essential genes that were either not detectable or not present in the prior set of screens, and that the rate at which these genes are observed offers some estimate of the false discovery rate (FDR) of each screen.

To estimate the parameters of this model, we carried out repeated screens in silico and compared synthetic cumulative essential curves to those derived from the data. Starting with a genome of N = 18 111 genes – the number of genes tested in the Avana library—we arbitrarily defined n essential genes, leaving N – n nonessential. We further defined a screen false discovery rate hyperparameter between 1 and 15%. Then we repeatedly sampled this genome with a screen that randomly drew 1287 hits—the mean number of hits at BF >10 in a single Avana screen; Figure 1C—from the essential and nonessential populations based on the FDR hyperparameter (e.g. at 10% FDR, 129 nonessentials and 1158 essentials were randomly selected; Figure 1C). Finally, we determined the cumulative hits across k = 8 replicate screens, estimating that eight samples was a good estimate of screen saturation in the data (Figure 1C) and judging that it was more important to fit the model to our observations in this region than in the saturated region (k >> 8). We calculated the root-mean-squared error from the mean cumulative essentials curve determined from the Avana data and plotted RMSE vs. the two model hyperparameters (Figure 1D) observing that the best fit occurred with n = 1600 essential genes and FDR = 9.5% (Figure 1E). Notably, a region of good fits, with RMSE < 2xRMSEmin, occurs between n =1500–1800 essential genes and a corresponding decrease in per-screen FDR from 9.5% to ∼6.5% (Supplementary Figure S1D).

Given the broad range of lineages from whence the screened cell line models were derived, it is clear that some of the context-specific essential genes will be incorrectly classified as false positives in our synthetic genome modeling approach. In order to minimize the effect of these tissue-specific essential genes, we repeated the analysis using cell lines from only one tissue type, filtering for 18 tissues that are represented by at least 16 high quality screens (Figure 1A). Each tissue type yielded similar trends in cumulative essential curves in the Avana data (Supplementary Figure S2). We repeated this modeling approach in each tissue type and obtained remarkably similar results (Supplementary Figures S3 and S4). Next, we compared the best-fit number of essential genes from the synthetic genome model in each tissue type to the mean number of essential gene hits observed in that same tissue type and calculated an average false negative rate across each tissue (Figure 1F). Across all tissue types we examined, we determined the mean FNR to be ∼20% in each screen.

Saturation modeling to differentiate essential genes and false positives

While our synthetic genome approach gave us an estimate of the expected total number of essential genes and the FDR in an average screen, it does not provide any way to differentiate true hits from false positives. To address this issue, we utilized an alternative view of the saturating behavior of CRISPR screens, similar to the methodology described for shRNA screens (11). Based on our judgment that screening in virtually all lineages achieved saturation after approximately eight cell lines, we again selected the lineages with at least twice this number of screens indicated in Figure 1A. From each lineage, we randomly selected eight screens (‘initial screens’) without replacement, identified the cumulative number of essential genes (defined by BF > 10) and determined the number of cell lines in which each gene was classified as essential. We then randomly selected an additional eight screens (‘subsequent screens’), again without replacement, and determined the number of cell lines in which each gene was classified as new hits – that is, BF >10 but not a hit in any of the initial eight screens (Figure 2A). We assumed that all of these new hits were false positives, and that the histogram of observations of these false positives estimates the frequency of false positives in the initial screens. It is almost certainly not the case that all of these are actually false positives, given the known presence of tumor subtypes within each tissue/lineage, the high likelihood of subtype-specific essential genes, and the probability that any given subtype escaped being selected in the initial eight screens. However, this assumption is useful for modeling purposes, as it provides an estimate of the upper bound of the false discovery rate using this saturation modeling approach. We repeated this process 100 times and plot the resulting distributions of hits versus number of screens (Figure 2A).

Figure 2.

Figure 2.

Saturation modeling approach. (A) Histogram showing the distribution of the number of essential genes and the number of cell lines in which each gene was classified as essential in colorectal cancer cell lines. Blue bars represent the distribution of true positives (TPs) and the orange bars represent the frequency of false positives (FPs). The error bars represent the standard deviation of observed essentials in 100 iterations. (B) For the TPs in colorectal cancer cell lines depicted in (A), the mean mRNA expression (log2(TPM)) of the genes in each bin shows higher expression where more frequently observed essential genes show higher levels of expression. (C) Violin plot showing the distribution of the essentiality scores (Bayes Factor, BFs) of the TP genes in (A) and (B), in the screens where they were not observed as essential.

Notably, the putative false positives follow the expected distribution: most are detected in only a single screen (Figure 2A). We used the distribution of the true and false positives to estimate both binwise and cumulative false discovery rates (Supplementary Table S2) by comparing the ratio of putative false positives (subsequent screens) to the total number of hits in each bin (initial screens). For example, if a gene is a hit in only one of eight colorectal screens, we estimate its FDR to be ∼38% (247 first bin FP / 648 first bin hits). We observed that the bin-wise FDR falls to less than 3% for genes that are observed in three or more out of eight screens. Therefore, we determined that genes observed in three or more of eight randomly selected screens represent the high-confidence set of essential genes in a given tissue type and this set includes both genes that are frequently false negatives as well as those that are subtype-specific within a lineage. A complete table of gene frequency of observation by tissue type is presented in Supplementary Table S3.

False negatives and subtype-specific genes

A closer examination of the U-shaped histogram in Figure 2A reveals significant implications. Genes that are observed as essential in the intermediate number of screens (∼3 to 6 out of 8) are either false positives that are repeatedly observed, false negatives that are repeatedly missed, or cancer subtype-specific genes that are truly only essential in some cell lines, violating our modeling assumption that the cell lines are identical (in reality, some combination of these three possibilities is more likely). We demonstrate from the hit frequency of the subsequent screens (in orange) that these genes are very unlikely to be false positives. We therefore sought to determine whether these genes are false negatives or background-specific essentials.

First, we examined the mean mRNA expression levels of genes in each bin in the corresponding cell lines that they were detected as essential. We observed a clear trend in which more frequently observed essential genes show higher gene expression (Figure 2B, right Y-axis). In contrast, hits in only one screen show markedly lower expression. In addition, putative false positives from subsequent screens show a similarly lower average gene expression than more frequently observed hits (Supplementary Figure S5).

Next, we examine the essentiality profiles of genes in the screens where the gene is not a hit. This measures whether a gene that is essential (BF > 10) in, for example, five screens is truly non-essential in the remaining 3 screens with very negative essentiality scores (BF ←10) or whether it falls in the intermediate range near BF = 0. Figure 2C shows that the more frequently a gene is classified as essential in the initial screens, the higher its average BF in screens where it is not essential. This observation is strongly consistent with false negatives rather than context-specific essential genes.

High confidence context-essential genes and a newly defined set of core essential genes

After judging that genes observed in three or more (out of eight) screens represent the high-confidence set of essential genes in a given tissue type, we identified 992 genes that are essential at that frequency in all 18 lineages we evaluated (Figure 3A). Applying these high-confidence essentials to our Daisy Model (11), we find that each tissue type carries an additional 300–600 context specific essential genes (Figure 3A, inset), which make up the petals of the daisy. Interestingly, these additional context-essentials are also widely, but not universally, shared across backgrounds: each lineage has only three (central nervous system) to 41 (hematopoietic and lymphoid) genes which are uniquely essential in that lineage. We found many known gene-tissue relationships in this set of unique context essentials. For example, the SOX10 transcription factor was found to be essential in only skin cells where it plays a major role in the production and function of melanocytes (34,35). CTNNB1 and TCF7L2 are essential only in colorectal cancer cell lines, where activation of the Wnt pathway results in accumulation of B-catenin that interacts with and acts as a coactivator for TCF7L2 that in turn activates downstream genes responsible for colorectal cancer cell survival as well as resistance to chemo-radiotherapy (36–38). ER+ breast cancer cell lines depend specifically on transcription factors FOXA1 and GATA3, which are overexpressed in ER+ breast carcinomas (39,40). E2F1, which was preferentially essential in pancreatic cancer cells, has been previously shown to regulate both pancreatic B-cell development and cancer growth by increasing the expression of PDK1 and PDK3 which results in increased aerobic glycolysis and growth in pancreatic cancers (41–43). Overall, our analysis indicates that genes uniquely essential in a particular context are very rare, while hundreds are shared across some but not all tissue types. A complete table of common and context essential genes are listed in Supplementary Table S4.

Figure 3.

Figure 3.

High confidence set of core and context essential genes. (A) Upset plot showing the intersection of the number of genes between high-confidence essential genes in each tissue type. Inset: Daisy plot showing the relationship between high-confidence context-essential genes and common essential genes. Genes essential in a given tissue type is represented by the petals of the daisy indicating their numbers for each tissue type. The petals overlap to varying degrees, but all tissues share the common set of essential genes (N = 954). (B) Venn diagram comparing common essential genes to previously defined gold standard set of core essential genes. (C) Box plots showing the mean mRNA (TPM) expression of common core essentials (n = 300), genes unique to new common essentials identified here (n = 314), Core V2 specific essentials (n = 142) and core essential genes specific to the Sanger dataset (n = 58). (D) Gene Ontology (GO) biological process enrichment for core essential genes unique to a specific approach. (E) Comparison of the distribution of essentiality scores of Sanger specific core essentials in common cell lines between the Avana and Sanger data.

Comparing common essentials to previous gold standard core essential genes

The common essentials defined in this study include genes that are identified as hits in every tissue type at a frequency of at least 3/8 screens. Therefore, we reasoned that they should define a superset of previously defined gold standard sets of core essential genes. We compared our set of 992 common essential genes to the Core Essential Genes v2 (CoreV2) that we previously defined as a gold standard training set for our BAGEL algorithm (25), as well as the core essentials defined by the Sanger dataset collected at the Wellcome Trust Sanger Centre (17). The newly identified common essentials comprise 507 of 686 (74%) of CoreV2 genes and 453 of 548 (83%) of Sanger core essentials, while 314 genes were common in all approaches (Figure 3B). We also examined the gene expression profiles of these core essential genes and observed that the 314 genes at the intersection of the three approaches have higher median gene expression compared to genes that are uniquely core essential in each approach (median log2(TPM) 6.3 versus 5.1 for new common specific genes, 5.4 for CEGV2 and 4.7 for Sanger specific genes; Figure 3C). An expanded comparison of the newly identified common essentials with the more recently defined core essential genes using CoRe AdaM (44) and CENtools (45) approaches showing the overlap between the datasets is shown in Supplementary Figure S6. These observations are consistent with an increased false negative rate among essential genes with moderate levels of mRNA expression. Core essential genes specific to each approach are listed in Supplementary Table S5.

Interestingly, the genes unique to each dataset showed a bias reflecting the experimental approach used. Genes unique to common essentials defined in this study are highly enriched for essential processes including ribosome biogenesis and mRNA processing genes (Figure 3D). Moreover, genes unique to CoreV2 show a strong bias towards genes encoding subunits of the mitochondrial translation and electronic transport chain. It should also be noted that experimental assay length is different between the Avana and Sanger screens (21 days versus 14 days, respectively) (46) and therefore different timepoints may result in differences in dependency profiles of certain genes. Consistent with this notion, dependency on oxphos genes, which is correlated with cell growth rate (47), represents a large source of bias in screen hits, and these genes are likely excluded from the Sanger core essentials for the same reason. We did not observe any strong functional enrichment among Sanger-specific core essentials, compared to the other classes. Among 58 Sanger-specific core essentials, 11 were not targeted in the Avana library and the remainder show intermediate BF scores in the 148 high-performing Avana screens of the same cell lines (Figure 3E, red curve). Collectively, these observations are consistent with there being a set of CRISPR library-specific false negatives, as previously reported (48), which may be independent of the expression-associated false negatives discussed in this work.

False negatives are evident in essential protein complexes

In the previous sections, we demonstrated a nontrivial false negative rate in an average genome-wide CRISPR-Cas9 screen (Figure 1F). Moreover, we showed that these false negative genes tended to be observed in an intermediate number of screens in our saturation modeling approach. Here, we will highlight the false negatives in essential protein complexes.

Previous studies in model organisms have shown that the individual subunits of protein complexes tend to be either all essential or non-essential (49) excluding cases where there might be functional buffering due to redundancy within the complex members (50). This notion leads to a model where essential protein complexes are comprised of essential genes whereas non-essential complexes contain only non-essential genes. Therefore, in human protein complexes, if a complex has essential functions, we expect to see all obligate members of that complex to be collectively and entirely essential, unless there is functional buffering or alternative subunit usage. To see if the same trend exists in the Avana data, we examined previously defined essential protein complexes to see how consistently the BAGEL algorithm detects individual essential genes. The proteosome is a large, essential protein complex responsible for degrading intracellular proteins (51) and we expect this complex to be essential in all cell lines since it carries a core-essential function. In Figure 4A, we see the binary essentiality calls (BF> = 10) of the members of the 26S proteosome complex as defined in the Comprehensive Resource of Mammalian Protein complexes (CORUM) database (52) among colorectal cancer cell lines in the Avana data. We observe that most individual screens are unable to capture all of the individual members of this complex, even though every subunit is detected in multiple cell lines.

Figure 4.

Figure 4.

False negative genes in essential protein complexes. Binary essentiality calls of essential protein complexes among colorectal cancer cell lines in the Avana data based on BAGEL derived essentiality scores. Dark blue color indicates essentiality (BF ≥ 10) and white color indicates non-essentiality (BF < 10). (A) Binary essentiality calls of members of 26S proteosome complex. (B) Binary essentiality calls of members of COP9 signalosome complex among colorectal cancer cell lines in the Avana data based on BAGEL derived essentiality scores. The relationship between screen quality (F-measure) and the number of non- essential hits within essential protein complexes. Boxplots showing the distribution of F-measures of screens versus the number of times a non-essential call was made among members of proteosome complex (C) and COPS9 signalosome complex (D). MCM complex is an example of a CRISPR-library specific false negative. While some members of the MCM complex cannot be captured as essential in multiple colorectal screens in the Avana dataset (E), the complex shows more uniform essentiality in Sanger colorectal cell lines (F).

Similarly, in Figure 4B, we see the binary essentiality calls in Avana colorectal cancer cell of the members of the essential, evolutionarily conserved COP9 signalosome complex which plays a role in controlling protein ubiquitinylation (53). Again, we observe that not all screens can identify members of this complex as essential. In this complex, we do not observe COPS7A and COPS7B as essential in any screen since, as we and others have previously shown, COPS7A and COPS7B are synthetic lethal paralog pairs and they encode alternate, replaceable subunits of the complex, while the other subunits are irreplaceable and are uniformly essential (54,55).

We first asked whether these false negatives were associated with underlying screen quality. We plotted the F-measure of each screen versus the number of non-essential hits within the complex for both the proteosome and COP9 signalosome complexes. Even though we only considered high-performing screens in our analysis, we still observed a trend where the screens with lower F-measures tended to miss more of these essential genes (Figure 4C and D). We further investigated whether these false negatives were library-specific. Comparing Sanger and Avana datasets, we observed that this phenomenon indeed exists as shown for the minichromosome maintenance (MCM) complex, an essential helicase regulating DNA replication in eukaryotic cells (56). While multiple Avana colorectal screens fail to capture members of this essential complex (BF ≥ 10), matched Sanger colorectal screens show more consistent essentiality calls for members of the same complex, indicating that library-specific false negatives also exist (Figure 4E and F).

Sharing BAGEL information across screens: joint log odds of essentiality

Having shown that CRISPR-Cas9 screens miss expected hits, we sought to develop a method to recapture these false negatives. Our approach extends fundamental concepts from our BAGEL algorithm (see Materials and Methods for a detailed description). The final output of BAGEL is a log Bayes Factor (BF) which represents the relative statistical confidence of that a gene is a member of one model (essential genes) or another (nonessentials). Based on empirical information from genome-wide CRISPR-Cas9 knockout screens regarding the number of essential genes in a given genome, we estimate the background expectation of gene essentiality in humans as 10%, which yields a log prior ratio ∼−3. When using our BAGEL algorithm, we traditionally use this flat log2 prior ratio of −3 during BF calculations for all screens. In order to rescue false negatives, and to differentiate between false negative genes and subtype-specific genes, we employed an alternative method based on the joint analysis of BAGEL results, which we call JLOE (Joint Log Odds of Essentiality). In this approach, instead of using a flat prior ratio for all screens, we update this prior ratio as we gain information from the screens. Gene essentiality is reported as a posterior log odds instead of a Bayes Factor (BF), and binary hit calls are assigned based on a posterior log odds threshold.

Briefly, we evaluated sets of eight screens at a time in each tissue type. For each gene, we ranked the screens from the highest BF to lowest in each set. Then we calculated the posterior probability of essentiality for that gene in each screen sequentially. For the top ranked screen, we used the initial log prior ratio of −3 and assigned a binary call of essentiality for posterior log(odds) > 7 (corresponding to a posterior Pr(ess) ∼ 99%). For the second-ranked screen, the prior is now based on whether the gene is essential in the top-ranked screen. If so, we use an updated prior ratio derived from the bin-wise FDR calculated from the saturation modeling approach (Supplementary Table S2) and again assigned gene essentiality for a joint posterior Log Odds >7. The process is repeated for the next ranked screen, until all 8 screens are evaluated. Since the bin-wise FDR values are negligible for frequent hits (FDR 0.3% for genes essential in 5 of 8 screens; log prior ratio 8.4), we capped the log prior ratio at 9 to prevent instability. In addition, to ensure that all cell lines are properly sampled, we repeated the sampling process 100 times, selecting 8 screens without replacement in each iteration from every tissue type to minimize bias (Figure 5A). The normalized binary calls and posterior log odds from JLOE approach can be found in Supplementary Tables S6 and S7, respectively.

Figure 5.

Figure 5.

JLOE approach identifies more essential genes in Avana genome-wide CRISPR-Cas9 knockout screens compared to saturation modeling approach. (A) Schematic description of JLOE method. (B) Comparison of the frequency of essential gene hits in saturation modeling versus JLOE. (C) Comparison of violin plots showing the distribution of the essentiality scores (Bayes Factor, BFs) of the hits in (A), in the screens where they were not observed as essential. JLOE can rescue the false negatives with a less significant increase in false positives compared to assigning binary calls using a lower BF threshold. (D) The number of essential gene hits using different thresholds (binary calls with BF 10, binary calls with BF 5 and mean binary call of 1 with JLOE across 100 iterations. (E) The number of false positive genes detected across colorectal cancer cell lines in the Avana data using same thresholds in (D). P-value annotation from t-test ind. samples with Bonferroni correction legend as follows; ns: 0.05 < P, *: 0.01 < P ≤ 0.05, **: 0.001 < P ≤ 0.01, ***: 0.0001 < P ≤ 0.001.

JLOE selectively rescues false negatives

After applying JLOE to all tissue types in the Avana and Sanger datasets, we compared the observations in both datasets to our saturation modeling approach to evaluate differences in the frequency of hits. JLOE resulted in higher frequency of gene essentiality calls, with fewer number of essential gene hits in fewer screens and a higher number of essential gene hits in six or more screen—including a large increase in the number of universally essential genes compared to the saturation modeling approach (Figure 5B for Avana data). Strikingly, the previously observed trend whereby genes that are hits in a higher number of screens have a higher average BF in the cell lines where they are not hits is completely abrogated by the JLOE approach (Figure 5C), suggesting that JLOE rescues false negatives.

A concern with boosting scores of marginal results is that false positives are also boosted. If this were the case, JLOE results would be equivalent to simply lowering the BF threshold for binary hit calls. To test this, we compared binary calls at BF > 10 (posterior log odds > 7 with uninformative prior) with binary calls at BF > 5 (posterior log odds > 2, or Pr(ess) > 80%) and binary calls with JLOE > 7 (Supplementary Table S6). Both relaxing the BF threshold and applying the JLOE framework increase the number of hits in Avana colorectal screens compared to the BF 10 baseline (Figure 5D). However, JLOE yields a much smaller increase in the number of false positives (defined here as expression log(TPM) < 1) compared to BF > 5 (Figure 5E), indicating JLOE successfully discriminates marginal essential genes from nonessentials. Strikingly, the mean number of JLOE essentials identified among colorectal cancer cell lines (N = 1565) is approximately what we had predicted using our synthetic genome modeling (N = 1600), indicating the consistency of our findings using different approaches.

JLOE increases enrichment of essential processes among high frequency hits

To further confirm that JLOE rescues true essentials, we compared always-essentials (hits in 8/8 screens at BF > 10) with JLOE-classified false negatives (hits in <8/8 screens at BF > 10 but 8/8 screens after JLOE) in both Avana and Sanger screens in colorectal cell lines (n = 25 each). Always-essentials in Avana (n = 359 genes) and Sanger (n = 403 genes) screens show significant overlap (Figure 6A – Jaccard coefficient 0.339), but more disjoint sets of false negatives (Figure 6B; Jaccard coefficient 0.182). Combining the always-essentials with the JLOE-rescued false negatives yields not only more genes (Avana, n = 937; Sanger, n = 839), but also better concordance between the two sets (intersection n = 610 genes, Jaccard coefficient = 0.523; Figure 6C). Moreover, boosting always-essentials with JLOE-rescued false negatives improves gene annotation enrichment statistics for core essential processes including transcription, translation, and cell cycle regulation in both Avana (Figure 6D) and Sanger (Figure 6E) screens. Overall, these results show that the JLOE approach can rescue false negative genes from essential pathways and improves the concordance between the Avana and Sanger datasets.

Figure 6.

Figure 6.

JLOE increases the concordance between Avana and Sanger datasets. (A) The overlap of always-essential genes (with frequency observations of 8/8 in saturation modeling approach among colorectal cancer cell lines in Avana and Sanger datasets. (B) Venn diagram showing the overlap of the false negative genes among colorectal cancer cell lines in Avana and Sanger datasets. (C) Combination of false negative genes with always-essential hits in saturation modeling approach in both datasets. The recovered false negatives increase the enrichment of essential biological processes in colorectal cell lines among high frequency hits defined in saturation modeling approach in (D) Avana colorectal cell lines and (E) Sanger colorectal cell lines. Blue bars indicate the enrichment among always-essentials and orange bars depict the increase in enrichment with the addition of false negative genes recovered with JLOE.

DISCUSSION

The vast amount of publicly available human cancer cell line data generated by the international Cancer Dependency Map initiative offers an opportunity to re-evaluate the assumptions under which these genome-scale experiments have been carried out. One such assumption is that CRISPR screens are saturating and that a well-designed single whole genome CRISPR-Cas9 knockout screen can detect virtually all of the essential genes in that background. Although library-specific biases have been identified (48) and copy number amplifications are known induce false positives, our knowledge of the systematic biases in CRISPR screens is likely still incomplete. Simply put, the true positive, false positive and false negative rates in an average genome-wide CRISPR-Cas9 knockout screen are fuzzy at best.

Here, we describe modeling approaches to estimate the true positive rate, false positive rate, and false negative rate from a typical genome-scale CRISPR screen. Importantly, these analyses established a baseline for the expected number of essential genes and the FDR in an average screen, with the broad definition of ‘essential gene’ being one whose knockout fitness defect causes the mutant to be outcompeted by cells growing at wildtype proliferation rates. Our modeling suggests that a typical cell expresses 1600–1900 essential genes, but a single knockout screen only detects ∼80% of these, and multiple screens are required to saturate the essential genes of any tissue or tumor subtype. We show that hits among highly expressed genes are often replicated but false negatives are more prevalent among genes expressed at moderate levels. These characteristics may limit our ability to identify meaningful differentially essential genes, in particular using isogenic cell lines to identify synthetic lethal interactions, since it suggests numerous replicates need to be screened in order to confidently discriminate cell-line-specific hits from false negatives and false positives. This knowledge is critical for evaluating the significance of findings from current experiments as well as for the design of future experiments for gene essentiality and synthetic lethality.

Another implication of the false negative rate is that most tissue types carry a larger number of overlapping essential genes than had been previously recognized. We identify nearly a thousand genes that are observed across all lineages that are frequent flyers in the Avana screens. This improved new set of core essential genes are enriched in core essential processes of the cells and, like the previous set, can be used as a metric for the quality control of genome-wide CRISPR-Cas9 knockout screens. In contrast, we find only 300–600 tissue-specific genes per lineage, with the overwhelming majority of genes showing overlap between related tissues. In fact, each tissue only carries at most a few dozen tissue-unique genes, and these are highly enriched for lineage-specific transcription factors.

Genes we classify as likely false negatives can be recovered though an approach we call the Joint Log Odds of Essentiality, or JLOE, which uses the Bayesian framework of BAGEL to improve analysis across multiple cell line screens. An unexpected, and unexplained, finding of our study is that these false negatives exhibit more moderate mRNA expression levels than previously described always-essential genes. This opens new questions about the biological relationship between expression level and the severity of loss of function phenotype, on one hand, and concerns about the effects of chromatin state on the relative efficacy of the CRISPR/Cas9 knockout paradigm on the other.

DATA AVAILABILITY

All code is available through Github (https://github.com/mervedede/JLOE_paper) and FigShare (https://doi.org/10.6084/m9.figshare.21865167.v1).

Supplementary Material

gkad046_Supplemental_Files

Contributor Information

Merve Dede, Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.

Traver Hart, Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Cancer Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

M.D. and T.H. were supported by NIGMS [R35GM130119]; M.D. was supported by a Schissler Foundation fellowship; T.H. is a CPRIT Scholar in Cancer Research; Andrew Sabin Family Fellow; MD Anderson Cancer Center Support Grant [P30 CA016672]. Funding for open access charge: NIGMS [R35GM130119].

Conflict of interest statement. None declared.

REFERENCES

  • 1. Horowitz N.H., Leupold U.. Some recent studies bearing on the one gene-one enzyme hypothesis. Cold Spring Harb. Symp. Quant. Biol. 1951; 16:65–74. [DOI] [PubMed] [Google Scholar]
  • 2. Brenner S. The genetics of Caenorhabditis elegans. Genetics. 1974; 77:71–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Goebl M.G., Petes T.D.. Most of the yeast genomic sequences are not essential for cell growth and division. Cell. 1986; 46:983–992. [DOI] [PubMed] [Google Scholar]
  • 4. Giaever G., Chu A.M., Ni L., Connelly C., Riles L., Véronneau S., Dow S., Lucau-Danila A., Anderson K., André B.et al.. Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002; 418:387–391. [DOI] [PubMed] [Google Scholar]
  • 5. Winzeler E.A., Shoemaker D.D., Astromoff A., Liang H., Anderson K., Andre B., Bangham R., Benito R., Boeke J.D., Bussey H.et al.. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science. 1999; 285:901–906. [DOI] [PubMed] [Google Scholar]
  • 6. Dietzl G., Chen D., Schnorrer F., Su K.-C., Barinova Y., Fellner M., Gasser B., Kinsey K., Oppel S., Scheiblauer S.et al.. A genome-wide transgenic RNAi library for conditional gene inactivation in Drosophila. Nature. 2007; 448:151–156. [DOI] [PubMed] [Google Scholar]
  • 7. Kamath R.S., Fraser A.G., Dong Y., Poulin G., Durbin R., Gotta M., Kanapin A., Bot N.L., Moreno S., Sohrmann M.et al.. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature. 2003; 421:231–237. [DOI] [PubMed] [Google Scholar]
  • 8. Meister G., Tuschl T.. Mechanisms of gene silencing by double-stranded RNA. Nature. 2004; 431:343–349. [DOI] [PubMed] [Google Scholar]
  • 9. Moffat J., Sabatini D.M.. Building mammalian signalling pathways with RNAi screens. Nat. Rev. Mol. Cell Biol. 2006; 7:177–187. [DOI] [PubMed] [Google Scholar]
  • 10. Hillenmeyer M.E., Fung E., Wildenhain J., Pierce S.E., Hoon S., Lee W., Proctor M., St.Onge R.P., Tyers M., Koller D.et al.. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008; 320:362–365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Hart T., Brown K.R., Sircoulomb F., Rottapel R., Moffat J.. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol. Syst. Biol. 2014; 10:733. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Blomen V.A., Májek P., Jae L.T., Bigenzahn J.W., Nieuwenhuis J., Staring J., Sacco R., Diemen F.R.van, Olk N., Stukalov A.et al.. Gene essentiality and synthetic lethality in haploid human cells. Science. 2015; 350:1092–1096. [DOI] [PubMed] [Google Scholar]
  • 13. Hart T., Chandrashekhar M., Aregger M., Steinhart Z., Brown K.R., MacLeod G., Mis M., Zimmermann M., Fradet-Turcotte A., Sun S.et al.. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell. 2015; 163:1515–1526. [DOI] [PubMed] [Google Scholar]
  • 14. Wang T., Birsoy K., Hughes N.W., Krupczak K.M., Post Y., Wei J.J., Lander E.S., Sabatini D.M.. Identification and characterization of essential genes in the human genome. Science. 2015; 350:1096–1101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Meyers R.M., Bryan J.G., McFarland J.M., Weir B.A., Sizemore A.E., Xu H., Dharia N.V., Montgomery P.G., Cowley G.S., Pantel S.et al.. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 2017; 49:1779–1784. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Tsherniak A., Vazquez F., Montgomery P.G., Weir B.A., Kryukov G., Cowley G.S., Gill S., Harrington W.F., Pantel S., Krill-Burger J.M.et al.. Defining a cancer dependency map. Cell. 2017; 170:564–576. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Behan F.M., Iorio F., Picco G., Gonçalves E., Beaver C.M., Migliardi G., Santos R., Rao Y., Sassi F., Pinnelli M.et al.. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature. 2019; 568:511–516. [DOI] [PubMed] [Google Scholar]
  • 18. Braschi B., Denny P., Gray K., Jones T., Seal R., Tweedie S., Yates B., Bruford E.. Genenames.org: the HGNC and VGNC resources in 2019. Nucleic Acids Res. 2019; 47:D786–D792. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Farrell C.M., O’Leary N.A., Harte R.A., Loveland J.E., Wilming L.G., Wallin C., Diekhans M., Barrell D., Searle S.M.J., Aken B.et al.. Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res. 2014; 42:D865–D872. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Iorio F., Behan F.M., Gonçalves E., Bhosle S.G., Chen E., Shepherd R., Beaver C., Ansari R., Pooley R., Wilkinson P.et al.. Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. BMC Genomics. 2018; 19:604. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Olshen A.B., Venkatraman E.S., Lucito R., Wigler M.. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004; 5:557–572. [DOI] [PubMed] [Google Scholar]
  • 22. Venkatraman E.S., Olshen A.B.. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007; 23:657–663. [DOI] [PubMed] [Google Scholar]
  • 23. Kim E., Hart T.. Improved analysis of CRISPR fitness screens and reduced off-target effects with the BAGEL2 gene essentiality classifier. Genome Med. 2021; 13:2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Hart T., Moffat J.. BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinf. 2016; 17:164. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Hart T., Tong A.H.Y., Chan K., Van Leeuwen J., Seetharaman A., Aregger M., Chandrashekhar M., Hustedt N., Seth S., Noonan A.et al.. Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. Genes Genomes Genetics. 2017; 7:2719–2727. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Reback J., McKinney W., Jbrockmendel, Bossche J.V.D., Augspurger T., Cloud P., Gfyoung, Sinhrks, Klein A., Roeschke M.et al.. 2020; pandas-dev/pandas: Pandas 1.0.3. 10.5281/ZENODO.3715232.
  • 27. Harris C.R., Millman K.J., van der Walt S.J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N.J.et al.. Array programming with NumPy. Nature. 2020; 585:357–362. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Buitinck L., Louppe G., Blondel M., Pedregosa F., Mueller A., Grisel O., Niculae V., Prettenhofer P., Gramfort A., Grobler J.et al.. API design for machine learning software: experiences from the scikit-learn project. 2013; arXiv doi:01 September 2013, preprint: not peer reviewedhttps://arxiv.org/abs/1309.0238.
  • 29. Hunter J.D. Matplotlib: a 2D Graphics Environment. Comput. Sci. Eng. 2007; 9:90–95. [Google Scholar]
  • 30. Waskom M., Botvinnik O., Gelbart M., Ostblom J., Hobson P., Lukauskas S., Gemperline D.C., Augspurger T., Halchenko Y., Warmenhoven J.et al.. 2020; mwaskom/seaborn: v0.11.0 (Sepetmber 2020). 10.5281/ZENODO.592845.
  • 31. SciPy 1.0 Contributors Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W.et al.. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020; 17:261–272. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Conway J.R., Lex A., Gehlenborg N.. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics. 2017; 33:2938–2940. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Broad DepMap DepMap 20Q2 Public. 2020; 10.6084/M9.FIGSHARE.12280541.V4. [Google Scholar]
  • 34. Harris M.L., Baxter L.L., Loftus S.K., Pavan W.J.. Sox proteins in melanocyte development and melanoma. Pigment Cell Melanoma Res. 2010; 23:496–513. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Nonaka D., Chiriboga L., Rubin B.P.. Sox10: a pan-Schwannian and melanocytic marker. Am. J. Surg. Pathol. 2008; 32:1291–1298. [DOI] [PubMed] [Google Scholar]
  • 36. Albuquerque C., Pereira Pebre. Jordan P. Wnt Signalling-Targeted Therapy in the CMS2 Tumour Subtype: a New Paradigm in CRC Treatment?. Targeted Therapy of Colorectal Cancer Subtypes. 2018; Cham: Springer International Publishing; 75–100.Advances in Experimental Medicine and Biology. [DOI] [PubMed] [Google Scholar]
  • 37. Emons G., Spitzner M., Reineke S., Möller J., Auslander N., Kramer F., Hu Y., Beissbarth T., Wolff H.A., Rave-Fränk M.et al.. Chemoradiotherapy resistance in colorectal cancer cells is mediated by Wnt/β-catenin signaling. Mol. Cancer Res. 2017; 15:1481–1490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Murphy M., Chatterjee S.S., Jain S., Katari M., DasGupta R.. TCF7L1 modulates colorectal cancer growth by inhibiting expression of the tumor-suppressor gene EPHB3. Sci. Rep. 2016; 6:28299. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Albergaria A., Paredes J., Sousa B., Milanezi F., Carneiro V., Bastos J., Costa S., Vieira D., Lopes N., Lam E.W.et al.. Expression of FOXA1 and GATA-3 in breast cancer: the prognostic significance in hormone receptor-negative tumours. Breast Cancer Res. 2009; 11:R40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Davis D.G., Siddiqui M.T., Oprea-Ilies G., Stevens K., Osunkoya A.O., Cohen C., Li X.B.. GATA-3 and FOXA1 expression is useful to differentiate breast carcinoma from other carcinomas. Hum. Pathol. 2016; 47:26–31. [DOI] [PubMed] [Google Scholar]
  • 41. Denechaud P.-D., Fajas L., Giralt A.. E2F1, a novel regulator of metabolism. Front. Endocrinol. 2017; 8:311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Kim S.Y., Rane S.G.. The Cdk4-E2f1 pathway regulates early pancreas development by targeting Pdx1+ progenitors and Ngn3+ endocrine precursors. Dev. Camb. Engl. 2011; 138:1903–1912. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Wang L.-Y., Hung C.-L., Chen Y.-R., Yang J.C., Wang J., Campbell M., Izumiya Y., Chen H.-W., Wang W.-C., Ann D.K.et al.. KDM4A coactivates E2F1 to regulate the PDK-dependent metabolic switch between mitochondrial oxidation and glycolysis. Cell Rep. 2016; 16:3016–3027. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Vinceti A., Karakoc E., Pacini C., Perron U., De Lucia R.R., Garnett M.J., Iorio F.. CoRe: a robustly benchmarked R package for identifying core-fitness genes in genome-wide pooled CRISPR-Cas9 screens. BMC Genomics. 2021; 22:828. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Sharma S., Dincer C., Weidemüller P., Wright G.J., Petsalaki E.. cen -tools: an integrative platform to identify the contexts of essential genes. Mol. Syst. Biol. 2020; 16:e9698. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Dempster J.M., Pacini C., Pantel S., Behan F.M., Green T., Krill-Burger J., Beaver C.M., Younger S.T., Zhivich V., Najgebauer H.et al.. Agreement between two large pan-cancer CRISPR-Cas9 gene dependency data sets. Nat. Commun. 2019; 10:5817. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Rahman M., Billmann M., Costanzo M., Aregger M., Tong A.H.Y., Chan K., Ward H.N., Brown K.R., Andrews B.J., Boone C.et al.. A method for benchmarking genetic screens reveals a predominant mitochondrial bias. Mol Sys Biol. 2021; 17:e10013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Ong S.H., Li Y., Koike-Yusa H., Yusa K.. Optimised metrics for CRISPR-KO screens with second-generation gRNA libraries. Sci. Rep. 2017; 7:7384. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Hart G.T., Lee I., Marcotte E.R.. A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinf. 2007; 8:236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Ryan C.J., Krogan N.J., Cunningham P., Cagney G.. All or nothing: protein complexes flip essentiality between distantly related eukaryotes. Genome Biol. Evol. 2013; 5:1049–1059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Tanaka K. The proteasome: overview of structure and functions. Proc. Jpn. Acad. Ser. B. 2009; 85:12–36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Giurgiu M., Reinhard J., Brauner B., Dunger-Kaltenbach I., Fobo G., Frishman G., Montrone C., Ruepp A.. CORUM: the comprehensive resource of mammalian protein complexes—2019. Nucleic Acids Res. 2019; 47:D559–D563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Gutierrez C., Chemmama I.E., Mao H., Yu C., Echeverria I., Block S.A., Rychnovsky S.D., Zheng N., Sali A., Huang L.. Structural dynamics of the human COP9 signalosome revealed by cross-linking mass spectrometry and integrative modeling. Proc. Natl. Acad. Sci. U.S.A. 2020; 117:4088–4098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. De Kegel B., Quinn N., Thompson N.A., Adams D.J., Ryan C.J.. Comprehensive prediction of robust synthetic lethality between paralog pairs in cancer cell lines. Cell Syst. 2021; 12:1144–1159. [DOI] [PubMed] [Google Scholar]
  • 55. Dede M., McLaughlin M., Kim E., Hart T.. Multiplex enCas12a screens detect functional buffering among paralogs otherwise masked in monogenic Cas9 knockout screens. Genome Biol. 2020; 21:262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Lei M. The MCM complex: its role in DNA replication and implications for cancer therapy. Curr. Cancer Drug Targets. 2005; 5:365–380. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

gkad046_Supplemental_Files

Data Availability Statement

All code is available through Github (https://github.com/mervedede/JLOE_paper) and FigShare (https://doi.org/10.6084/m9.figshare.21865167.v1).


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES