Skip to main content
NAR Genomics and Bioinformatics logoLink to NAR Genomics and Bioinformatics
. 2021 Nov 30;3(4):lqab110. doi: 10.1093/nargab/lqab110

Identifying essential genes across eukaryotes by machine learning

Thomas Beder 1,2,3,, Olufemi Aromolaran 4,5, Jürgen Dönitz 6,7, Sofia Tapanelli 8, Eunice O Adedeji 9,10, Ezekiel Adebiyi 11,12, Gregor Bucher 13,, Rainer Koenig 14,15,
PMCID: PMC8634067  PMID: 34859210

Abstract

Identifying essential genes on a genome scale is resource intensive and has been performed for only a few eukaryotes. For less studied organisms essentiality might be predicted by gene homology. However, this approach cannot be applied to non-conserved genes. Additionally, divergent essentiality information is obtained from studying single cells or whole, multi-cellular organisms, and particularly when derived from human cell line screens and human population studies. We employed machine learning across six model eukaryotes and 60 381 genes, using 41 635 features derived from the sequence, gene function information and network topology. Within a leave-one-organism-out cross-validation, the classifiers showed high generalizability with an average accuracy close to 80% in the left-out species. As a case study, we applied the method to Tribolium castaneum and Bombyx mori and validated predictions experimentally yielding similar performances. Finally, using the classifier based on the studied model organisms enabled linking the essentiality information of human cell line screens and population studies.

Graphical Abstract

Graphical Abstract.

Graphical Abstract

CLEARER is a machine learning approach for predicting essential genes across eukaryotes. The classifier is trained on multiple species, allowing identification of essential genes in a new species with high accuracy.

INTRODUCTION

Essential genes are defined as indispensable for the reproductive success of an organism (1). Consequently, information on essentiality is used in a broad range of life science research, prominently to identify drug targets, as in cancer therapy (2) or identifying insecticidal targets, but also for the design of a minimal genome in synthetic biology. Many genome wide screens exploring phenotypes and gene functions have been performed using forward genetic methods (3–5). Later, reverse genetic methods were developed which allowed targeting individual genes specifically (6). Some large scale screens were focused on the question, which genes are essential for an organism (7). Functionally, these screens revealed that essential genes are involved in fundamental cellular maintenance processes like DNA, RNA and protein synthesis (8). Besides this, their encoded proteins are highly connected in protein-protein interaction (PPI) and metabolic networks (9,10). Furthermore, core essential genes were identified across different model systems (11), revealing their evolutionary conservation. Many essential genes show their essentiality not as a binary trait nor are they fixed across all intrinsic and extrinsic conditions within the evolutionary niche and may be influenced by the environment and the genetic context (1). For example, in yeast, many non-essential genes for growth in rich media are actually important in other growth conditions (12). Consequently, quantitative values for essentiality have been defined accounting for the degree of dependency on external influences, as well as the likelihood that a compensatory mutation occurs (13) and typically statistical score values are calculated for gene essentiality (14–16). Essential genes in humans were identified by studying cancer cell lines and more recently by genomic population studies, comprehensively reviewed by Bartha et al. (13). In cell lines, gene essentiality is assessed by cell viability after gene knock-out or knock-down, whereas in the population studies of humans it is assessed by scoring loss-of an allele or the depletion of variants in a gene. Interestingly, the essential gene lists of the two distinct approaches hardly overlap (13). This was unexpected since a cellular essential gene (CEG) should also be an organismal essential gene (OEG), even though not necessarily vice versa. This discrepancy between human CEG and OEG remains to be elucidated.

Despite their great value, experimental screens and population studies are very resource intensive. Consequently, on a genome scale, essential genes have been experimentally identified for several bacteria but only for a few eukaryotes, while population studies were performed only for human (Table 1). Because of experimental challenges and costs, the computational prediction of essential genes is of great interest and machine learning can considerably facilitate the search for essential genes in an organism. Following this approach, classifiers have been trained on a set of genes with known essentiality that are described by various features. For this, features can be based directly on the DNA or protein sequence (17,18), such as GC content or amino acid frequencies, or on more complex characteristics, e.g. the topology in PPI or co-expression networks (9,19). Subsequently, a trained classifier was used to predict a new set of genes finding novel essential gene candidates (9). The next milestone was the prediction of essential genes across species. For bacteria, the software Geptop 2.0 (20) calculates an essentiality score of an unknown gene based on information of 37 (bacterial) species and sequence similarity. For eukaryotes the task is more challenging, considering their complex multi-cellular architecture. Besides this, only a few model organisms have been experimentally screened for essential genes to date, even though the number is growing (21,22). To our knowledge only two studies were published predicting essential genes across eukaryotes using machine learning. Lloyd et al. (23) predicted essential genes in two plant species and Saccharomyces cerevisiae. They showed that inter-species prediction is feasible across plants, however, they observed a drastically reduced performance in cross plant-fungal species predictions. The other study, by Campos et al. (18), predicted essential genes in six eukaryotes using a leave-one-organism-out approach and it based merely on protein sequence features. These studies laid the foundations in machine learning based essential gene prediction highlighting the necessity to achieve a robust performance for predictions across organisms including humans.

Table 1.

Assembled essential gene information based on OGEE (21), DEG (26) and the literature

Species Source Data set abbreviation Method Essential genes Non essential genes
H. sapiens Population sequencing data Hs OE Loss-of fuction-variants (50–54) 2828 12 844
H. sapiens Cell line Hs CE RNAi (43,47,63), shRNA (48), CRISPR (14–16) 833 13 743
M. musculus Organism Mm OE Biallelic inactivation (56,62) 1966 4505
M. musculus Cell line Mm CE CRISPR (57) 939 7472
D. melanogaster Organism Dm OE P-element (59), RNAi (44), CRISPR (58) 246 271
D. melanogaster Cell line Dm CE RNAi (61), CRISPR (55) 1227 10 320
C. elegans Organism Ce OE RNAi (60) 737 10 128
S. cerevisiae Single cell Sc CE PCR based (7,45,21) 1036 4373
S. pombe Single cell Sp CE PCR based (46), Transposon (49) 1226 3379
Total 11 038 67 035

The aim of this study was to develop a classifier capable of identifying essential genes in eukaryotes even if no experimental data is available. For this, we trained and validated classifiers using essentiality information from six, well described model organisms. Within a leave-one-organism-out cross-validation, the classifiers were trained on data derived from five organisms and validated with the sixth organism. As case studies, we applied the classifier to the red flour beetle Tcastaneum and the silk moth Bombyx mori. We used the available RNA interference (RNAi) (24) and CRISPR (25) screen with defined lethality status as a control, while validating further predictions experimentally. Moreover, we aimed to fill the gap between human cell line screens and population studies, by integrating information from the model organisms to improve human essential gene assignments.

MATERIALS AND METHODS

Assembling the gold standard

We assembled essentiality information for genes from the six species C. elegans, D.melanogaster, H. sapiens, M. musculus, S. cerevisiae and S. pombe. For fly, mouse and human we could collect screens for CEG and OEG, for worm only for OEG, and for the yeasts (obviously) only for CEG. This essentiality information was collected from the databases OGEE (21) and DEG (26) and the literature as listed in Table 1. For genes with different essentially status in different screens, we performed a majority voting. For human cell line screens, a gene had to be studied in at least five experiments as previously suggested by Guo et al. (17). All genes, their class labels and predictions can be found in Supplementary Table S4.

Feature generation

A total of 41 635 features were generated based on seven different sources including protein and gene sequence, functional domains, topological features, evolution/conservation, subcellular localization, and gene sets from Gene Ontology (Figure 2). Protein and gene sequences of the organisms were obtained using biomaRt (27). For genes with isoforms, the features were generated individually for each isoform and the median of all was calculated. For deriving protein and gene sequence features, various numerical representations characterizing the nucleotide and amino acid sequences and compositions of the query gene were calculated using seqinR (28), protr (29), CodonW (http://codonw.sourceforge.net/) and rDNAse (30). With seqinR simple protein sequence information including the number of residues, the percentage of physico-chemical classes and the theoretical isoelectric point were calculated. Most protein sequence features were obtained using protr including autocorrelation, CTD, conjoint triad, quasi-sequence order and pseudo amino acid composition. CodonW was used to calculate simple gene characteristics like length and GC content but also frequency of optimal codons and effective number of codons. Using rDNAse, DNA descriptors like auto covariance or pseudo nucleotide composition, and kmer frequencies (n = 2–7) were calculated. Domain features including post-translational modifications were generated based on the tools provided by the Technical University of Denmark (http://www.cbs.dtu.dk/services/) and included prediction of membrane helices and β-turns, cofactor binding, acetylation and glycosylation sites. Topology features were computed based on protein-protein-associations (PPA) derived from STRING v11 (31) including degree, degree distribution, betweenness, closeness and clustering coefficient using the Python library NetworkX.

Figure 2.

Figure 2.

Functional characterization of essential genes. Distribution of CEG and OEG in major biological processes. Enriched gene sets were assigned to one of eleven major categories. The proportions were derived by dividing the number of essential genes in each gene set by the total number of essential genes of the according CEG or OEG entity. CEG were enriched in processes describing cellular biogenesis and cell cycle/proliferation, whereas OEG showed enrichment in regulation, development/morphogenesis and signaling.

Conservation features included the number of homologous proteins of a query protein in the complete RefSeq (32) database using PSI-BLAST (33). The number of proteins found with e-value cutoffs from 1e−5 to 1e-100 (in 1e−5 multiplication steps) were used as features. In addition, an alignment coverage score (ACS) was calculated for hits with a cutoff ≤1e−30 as described by Vinayagam et al. (34). The ACS is the average of the query coverage score (QCS) and the subject coverage score (SCS):

QCS and SCS are combined measures of size, identity and E-value of the alignment concerning the query or subject sequence,

graphic file with name M0001.gif (1)

and

graphic file with name M0001a.gif (2)

where AL denotes the alignment length, QL the length of the query sequence, SL the length of the subject sequence, and I the fraction of identical amino acids in the alignment. Next, the number of homologous sequences with a score from 0 to 0.95 in 0.05 steps were calculated. Analogously, the number of paralogous sequences were calculated, but blastn (35) alignment results with an e-value cutoff ≤1e−30 were used as input for the score. Subcellular localization features were predicted using DeepLoc (36), which assigns a score for each protein to its localization in eleven eukaryotic cell compartments. Gene set features were computed based on 3,874 Gene Ontology (GO) terms present in all analyzed organisms, similar to Chen et al. (37). By this, not only the characterization of the query gene was taken into account, but also of its neighbors in the protein association network making the features more robust against false gene set annotations. We assembled the neighbors of the query gene employing the gene network definitions of STRING v11 (31). For each of the gene sets, a Fisher's exact test for enrichment of interaction partners was performed. The −log10 values of the P-values were used as features.

Data normalization, feature selection and machine learning

Data analysis was performed using R. Values of each feature were z-transformed and each value was assigned to deciles. Next, we performed two steps for feature selection prior to the training procedure. The data was randomly split into training (4/5) and testing (1/5). Based on the training set, Least Absolute Shrinkage and Selection Operator (LASSO) was applied for feature selection using the glmnet package (38) in R (cv.glmnet function with parameters alpha = 1, type.measure = ‘auc’). To avoid collinearity, highly correlating features with Pearson correlation coefficients r ≥ 0.70 were removed.

To overcome class imbalances when training the classifiers, we used SMOTE (39) and trained with the classification algorithms Random Forest (RF) and Extreme Gradient Boosting (XGB) from the caret (40) package. For RF the tuneLength parameter in the train function was set to 3 resulting in three mtry values (number of predictors randomly sampled at each split). For XGB eta, nrounds, max_depth, min_child_weight and colsample_bytree were optimized in a tune-grid whereas gamma and subsample parameters were kept constant at 0 and 1, respectively. This resulted in 216 different parameter combinations for XGB tuning. To improve generalizability, for each organism we performed a stratified randomized 5-fold cross-validation in which feature selection, hyper-parameter tuning and training of the classifiers was done on 80% of the data. Twenty percent of the data was exclusively used for testing the performance, thus leaving this data completely untouched by the machines during training (Supplementary Figure S1). By this, we derived the training and test performances for each organism individually (Supplementary Figure S2), ensuring the capability of the machines to predict essential genes and to avoid over-fitting.

Leave-one-organism-out cross-validation scheme for the CLassifier of Essentiality AcRoss EukaRyotes (CLEARER)

For each individual species (five species for CEG, four for OEG predictions), five machines were trained (Supplementary Figure S1). Essential genes for the left-out species were predicted with machines trained on the according CEG or OEG data sets of the other organisms. Thereby the classifiers for each (non-left out) species supplied an essentiality prediction score between zero and one and the average of these scores was used for the prediction of a gene to be essential in the left-out species. This approach allowed testing the generalizability of CLEARER as the data from the left-out species was only used for testing.

Orthology-based essential gene prediction

We derived orthologs from the OrthoDB v10 (41) database and assigned the essentiality to the genes according to the data sources listed in Table 1. For predicting essential genes in an organism we selected the essential and non-essential orthologs in the other organisms and performed a hypergeometric test. P-values were FDR corrected for multiple testing and values lower than 0.05 considered to be significant.

RNAi experiments in T. castaneum

RNAi was performed according to the procedure described for the larval injection screen in Schmitt-Engel et al. (24) with minor modifications: we used another strain (San Bernardino) and scored lethality after 7 days after injection (instead of 11). We defined a gene as lethal if the lethality in the pupal or larval screen was at least 50%.

Functional enrichment analysis of essential genes

To study in which cellular processes essential genes were enriched, an enrichment analysis was performed using Gene Ontology (version from 2020-11-18), biological process, molecular function and cellular compartment. Enriched GO-terms (Fisher's exact test P < 0.05, FDR corrected) were selected and compared across the six species. Gene sets were removed if they showed high redundancy according to the following method. Redundancy between two gene sets was quantified using Jaccard similarity coefficients,

graphic file with name M0002.gif (3)

in which A and B are gene sets enriched for essential genes. An undirected graph G = (X, E) is introduced, with X being gene sets as vertices and E being gene set pairs with J(A, B) ≥ 0.3 as edges of the graph. A mixed integer linear model (weighted stable set problem) was setup with a constraint for each edge to select at most one of the vertices of an edge:

graphic file with name M0003.gif (4)

subjected to

graphic file with name M0004.gif (5)

where wi is the weight of a gene set. The weight is derived from the enrichment test P-value and maximization was performed employing linear integer programming solved by the software Gurobi (version 7.5.2, https://www.gurobi.com). This led to an optimal selection of at most one gene set from a pair in such a way that the overall number of non-redundant gene sets were maximized. Moreover, very general gene sets containing ≥1000 genes (in any organism) were removed and gene sets comprising ≤ 0.1% genes of the corresponding organism were not considered. For illustration, each gene set was assigned to one of eleven major groups (cell cycle, cellular structure, development, immune response, metabolism, neural processes, protein biogenesis, regulation, repair, RNA biogenesis and signaling).

Clustering

Scores of human cell lines were combined calculating the rank products leading to a combined rank for CEG. Similarly, combined ranks were calculated for OEG based on the scores of the population studies. Hierachical clustering of human genes was based on the percentiles of the combined scores from the cell lines (experimental CEG), population studies (experimental OEG) and CLEARER (OEG and CEG predictions). Clustering was performed using euclidian distance and average linkage of the R package pheatmap.

Associating human genes to phenotypes

To investigate how the predicted essential genes in human associate with human diseases, we extracted 225 443 phenotype-to-gene associations from the Human Phenotype Ontology (42) database. For each of the phenotypes, an enrichment test (Fisher's exact test) for the according genes in the clusters was performed following FDR correction and phenotypes with P < 0.05 were considered to be significantly enriched. Word clouds illustrating phenotypes were generated using the R package wordcloud2.

RESULTS

Essential genes are different in single cells compared to multi-cellular organisms

Our study was performed based on the six model organisms Homo sapiens (human), Mus musculus (mouse), Drosophila melanogaster (fly), Caenorhabditis elegans (worm), Saccharomyces cerevisiae and Schizosaccharomyces pombe (yeasts). Essentiality information for each of these organisms was taken from the databases Online GEne Essentiality (21), Database of Essential Genes (26) and the literature (7,14–16,43–63). Table 1 lists the number of genes for which this essentiality information could be assembled. To our knowledge we compiled the most comprehensive collection of essentiality information for these six eukaryotes comprising 11 038 essential and 67 035 non-essential labeled genes.

For humans, it has been shown that essentiality information derived from viability screens of cancer cell lines does not well agree with essentiality information derived from in vivo genetic studies (13). We speculated that this phenomenon can be generalized, i.e. CEG identified from cell lines or unicellular organisms (yeasts) are distinct from OEG. E.g. genes involved in embryo development or neural morphogenesis may be substantial for multicellular organisms, but not for a unicellular organism or a cell line. Hence, we defined two categories of gene essentiality, i.e., CEG and OEG, depending on the cellular or organismal nature of the experimental study (Table 1). We compared gene essentiality based on OEG and CEG experiments across the six investigated organisms based on orthologous gene groups. As shown before (13), gene essentiality inferred from human in vivo population studies correlated well with each other (r = 0.53 ± 0.15) and the correlation was even better between human cell line studies (r = 0.74 ± 0.07, Figure 1A). Interestingly, we also observed good pairwise correlations of CEG of human, fly, mouse and the yeasts (r = 0.38 ± 0.17). Moreover, we observed reasonable pairwise correlations of OEG across organisms (r = 0.20 ± 0.06). In contrast, there were lower correlations between CEG and OEG across organisms (r = 0.15 ± 0.13) and the lowest correlation between OEG and CEG of the same species was found in human (r = 0.13 ± 0.06). Despite that, the overlap of cellular and organismal essential genes was significant for all three organisms, by far the highest overlap was found in mouse (Figure 1B). The complete list of orthologous groups and essentiality information is provided in Supplementary Table S1.

Figure 1.

Figure 1.

Correlation of cellular and organismal essential genes. (A) Correlation analysis of essentiality across organisms shows that both cellular (CEG) and organismal essential genes (OEG) are conserved across the investigated eukaryotes. Essentiality on the cellular level correlated better than on the organismal level. The lowest correlation of CEG and OEG was observed for human. Human OEG scores (from population studies) and human CEG scores (from cell based knockout/knockdown screens) were obtained from Bartha et al. (13). Human population studies are denoted as scores used to define essentiality. The human cell line studies are denoted by the name of the corresponding cell lines and, in brackets, the first author of the study. Additionally, the OEG and CEG scores for Celegans, D. melanogaster, M. musculus, S. cerevisiae and S. pombe were included. (B) Venn diagrams showing the overlap of essential genes found in organisms with CEG and OEG data sets. P-values show the significance of the overlap.

Next, we studied the involvement in cellular processes of CEG and OEG. For this, we performed gene set enrichment analyses based on the gene set definitions of Gene Ontology. To get a better overview, each gene set was assigned to one of eleven major groups (cell cycle, cellular structure, development, immune response, metabolism, neural processes, protein biogenesis, regulation, repair, RNA biogenesis and signaling).

Whereas CEG were enriched in processes describing cellular macromolecule biogenesis and cell cycle/proliferation, OEG showed enrichment in regulation, development/morphogenesis, neural related processes and signaling (Figure 2). This reflects the need of multicellular organisms for functional organ systems, which do not only depend on the survival of the respective cells but also their concerted function within and between organs. Figure 2 shows the proportions of CEG and OEG of these different cellular processes for each organism. Notably, for human OEG less developmental gene sets were found compared to M.musculus, D. melanogaster and C. elegans potentially reflecting the different way to identify essential genes. The simple multicellular nematode C.elegans shows processes we observed in CEG of the other multi cellular organisms. All enriched gene sets are listed in Supplementary Table S2.

We observed that CEG are similar among the individual species, and similarly OEG. However, CEG and OEG are substantially different suggesting learning our machines with CEG and OEG separately.

Setting up the data sets for machine learning

We set up a machine learning procedure to predict essential genes across eukaryotes called CLassifier of Essentiality AcRoss EukaRyotes (CLEARER). Each data set from Table 1 served as the gold standard, and with this we trained individual classifiers for each organism for CEG and OEG based on 41 635 features from seven categories comprising information from protein and DNA sequences, domains, of gene network topology, evolutionary conservation, subcellular localization, biological processes and further gene set definitions (Figure 3). The complete list and description of features used in this study is shown in Supplementary Table S3. To predict essential genes across species, a leave-one-organism-out cross-validation was applied. An example is shown in Figure 3 where CEG of human are predicted by machines trained on fly, mouse and the yeasts. The entire machine learning workflow is depicted in Supplementary Figure S1. Notably, following a leave-one-organism-out cross-validation allowed realistic performance evaluations as the class labels of the left-out organism were only uncovered when evaluating the predictions. OEG were predicted separately, using essentiality information of the four available organisms comprising worm, fly, mouse and human (Table 1).

Figure 3.

Figure 3.

The workflow of CLEARER. The workflow is exemplified for the prediction of human cellular essential genes (CEG). For each gene 41 635 features based on seven different categories were assessed. Machine learning was performed and the intra-species classification performance evaluated. For the inter-species classification, the machines trained on the other organisms were used to predict human CEG.

Identifying essential genes within and across species

First, we tested our approach by predicting essential genes within the same species using a stratified randomized 5-fold cross-validation. Thereby, 80% of the data was used for feature selection, hyperparameter tuning and training of the classifiers, and 20% for testing (Supplementary Figure S1). We evaluated the performance of the two machine learning strategies Random Forests and Extreme Gradient Boosting. Random Forests performed slightly better on the test sets (Supplementary Figure S2) with an average ROC–AUC of 0.857 ± 0.057. The best performance was obtained for the human cell lines (ROC–AUC = 0.955 ± 0.018), which also reflects the consistency of the essentiality information within the six data sets (Table 1). Notably, we observed no difference between the performance in predicting CEG and OEG. For both, we got consistently high performances (Figure 4A, ROC–AUCCEG = 0.873, ROC–AUCOEG = 0.845, P = 0.13).

Figure 4.

Figure 4.

Performance of CLEARER. (A) Comparison of the prediction performance within and across species from all cross-validations. Species abbreviations are as listed in Table 1. (B) Line graph illustrating the maximal F1-score and accuracy cutoff for Hsapiens after combining CLEARER and the orthology-based approach. Dotted lines indicate maxima. (C) Box plots showing the percentiles and performance metrics for the maximal F1- score and maximal accuracy for the six model organisms. (D) Bar graphs showing the total number of correct and incorrect predicted genes.

Furthermore, we investigated the performance when combining CEG and OEG information. For this, we trained and tested using combined lists. As expected, we observed a distinctively reduced performance in terms of the ROC–AUC for the organisms with CEG and OEG information (Supplementary Figure S3). For the following, we based our analysis on classifiers which were trained separately either with OEG or CEG. In conclusion, our method predicted essentiality well, when CEG and OEG were trained separately and when applied to the same species.

Next, CLEARER (Figure 3) was applied to predict CEG and OEG across organisms, following a leave-one-organism-out cross-validation. We observed an ROC–AUC of 0.744 ± 0.084 on average (Figure 4A, Supplementary Figure S4 shows the performance for each organism separately). The overall accuracy of the final predictions of the within species classifiers and CLEARER were similar (no significant difference). Next, we compared CLEARER to the approach previously reported by Campos et al. (18), which used protein sequence features and the leave-one-organism-out cross-validation approach investigating the same species. In their approach CEG and OEG were not handled separately and machines were not trained on the individual organisms, but on combined lists of essential and non-essential genes (18). Following this way, we observed a reduced performance compared to CLEARER when using the same protein sequence features (ROC-AUC = 0.599 ± 0.067, P < 0.0001) and also when using our set of diverse features (ROC–AUC = 0.696 ± 0.044, P < 0.05). This demonstrated the improvement when integrating the comprehensive set of the above described features and an ensemble classifier trained on CEG and OEG of the individual species.

CLEARER performed well to predict essential genes for each of the model organisms investigated when trained on the other model organisms and significantly outperformed the previous approach.

Comparing CLEARER with an orthology-based prediction approach

The most common approach to find essential genes is based on inference by orthology. For all genes with orthologs in at least one of the other investigated organisms we performed a test for enrichment of essential orthologs. This enrichment test showed a similar accuracy as simple majority voting (accuracyenrichment = 0.784 ± 0.052, accuracymajority = 0.797 ± 0.047, P = 0.1563) but an increased sensitivity (sensitivityenrichment = 0.398 ± 0.188, sensitivitymajority = 0.314 ± 0.141, P = 0.031). Consequently, we used the enrichment test-based assignment. A major disadvantage of such an approach is that for 17 994 genes (29.8%) no ortholog with known essentiality was assigned, leaving these genes unpredicted (Figure 4D). However, for the genes with identified orthologs the enrichment test performed similar to CLEARER in terms of the accuracy (accucacyenrichment = 0.77, accucacyCLEARER = 0.74, Figure 4D). Despite that, CLEARER classified more essential (31.6%) and non-essential genes (24.4%) correctly (Figure 4D). Notably, both approaches shared a substantial number of true positive predictions (Supplementary Figure S5) suggesting combining both approaches for the best performance.

Combining CLEARER with the orthology-based approach improves the overall predictions

To obtain a unique classifier predicting CEG and OEG with the best performance, we now combined the predictions from CEG, OEG and the orthology-based approach. For the yeasts only predictions for CEG and from the orthology-based approach were considered. A combined score for each gene was calculated based on the ranking of the predictions for each approach. This score allowed to order the genes such that an optimal percentile (cutoff) could be selected above which a gene was regarded to be essential. The optimal cutoff needed to be selected to balance between high accuracy, sensitivity and precision. For this, two measures were regarded, i.e. (1) the maxima of the F1-score (harmonic mean of sensitivity and precision) and (2) the maximal accuracy. Figure 4b illustrates this tradeoff exemplarily for the prediction of essential genes for human. Across all six organisms, the best cutoff based on the maximal F1 approach was on average the 75% percentile, yielding an accuracy of 0.769 ± 0.99 (Figure 4C). The best cutoff for the maximal accuracy was the 95% percentile and yielded a higher accuracy (0.832 ± 0.077) and precision (0.606 ± 0.059), but the sensitivity was lower (0.186 ± 0.117). Even though the latter criterion yielded lower sensitivities, depending on the application, high precision can be very beneficial if validation experiments are costly or technically complex. We observed improved predictions (compared to CLEARER alone) with the combined approach yielding n = 1,060 (18.7%) more correctly identified essential genes and only a marginal decrease in specificity (1.1%) when applying the 75% percentile cutoff (Figure 3d). We used this combined classifier to predict essential genes for the case study applications (next section). The results of the predictions for all model organisms are shown in Supplementary Table S4.

CLEARER combined with the orthology-based approach performs well for Tribolium castaneum and Bombyx mori

As two case studies, we applied the unified classifier (CEGCLERARER, OEGCLEARER and orthology-based approach) to predict essential genes for T. castaneum and B. mori.

T. castaneum is an insect with emerging interest because in many respects its biology is more representative of insects than that of D. melanogaster (64), e.g. Tcastaneum is used as a model organism for pest control (65). Gene essentiality information was available from previously published knockdown screens (24,65,66), and this information was used to validate the predictions. We made predictions for 12 859 genes of which 2,783 had been tested by RNAi (Supplementary Table S5). Following the F1-based cutoff criterion, the top 25% predictions were considered to be essential. The achieved performance (sensitivity = 0.409, specificity = 0.809, and precision = 0.712) was comparable to the leave one organism out cross-validation results of the model organisms (sensitivity = 0.569 ± 0.0958, specificity = 0.802 ± 0.129, precision = 0.415 ± 0.087). The overlap of essential gene predictions and experimental results was highly significant (P < 0.0001, odds ratio = 2.63). Following the maximal accuracy cutoff, the top 5% of predictions were considered to be essential yielding high precision (0.859) and specificity (0.973), but only low sensitivity (0.143). The results show that CLEARER predicts essential genes for T.castaneum with a very similar performance as achieved for the model organisms when applying the leave-one-organism-out cross-validation. Next, we selected 200 genes with the highest prediction scores and validated them experimentally performing RNAi in vivo. Indeed, n = 160 genes (accuracy = 80.5%) proved experimentally to be essential (Supplementary Table S5). In addition, we randomly selected 200 genes and tested their essentiality in vivo (Supplementary Table S5). Again, the overlap of essential gene predictions and RNAi validation experiments was significant (P < 0.01, odds ratio = 2.78, precision = 0.646 and specificity = 0.850).

Essential genes of B. mori were investigated in the second case study. B. mori is a lepidopteran insect with agricultural importance in silk production but also in research (67). Gene essentiality information was available from a previously published CRISPR/Cas9 based knockout screen (25), and was used to validate the predictions. For B. mori, we made predictions for 8,150 genes with known lethality information from the knockout screen (listed in Supplementary Table S5). Following the F1-based cutoff criterion, we yielded a good classification performance, again similar to the results from the model organisms (accuracy = 0.786, sensitivity = 0.591 and specificity = 0.806). Using the maximal accuracy as the cutoff criterion resulted in higher accuracy (0.917) and specificity (0.983), but reduced sensitivity (0.267).

In summary, applying our approach to T. castaneum and B. mori yielded a similar performance as observed in the leave-one-organism-out cross-validation of the six model organisms showing the good applicability of our approach.

CLEARER supports defining human essential genes

Essential gene information for human derived from cancer cell line and population sequence studies hardly overlap, indicating a missing link between both approaches. We aimed to provide this link by integrating the available experimental information for human with predictions from the model organisms.

Essentiality scores of the experimental data from ten human cell line screens were combined (using rank products) to obtain an experimental CEG score. Similarly, a combined score was obtained for OEG based on the five population studies. As described above, human CEG and OEG poorly overlap, and hence the correlation of the according scores was low (r = 0.12, Figure 5A). In contrast, the scores from the combination of CLEARER and the orthology-based approach correlated much better with the scores from the cell line screens (r = 0.50) and the population studies (r = 0.24). Next, experimentally derived CEG and OEG scores were combined with the computational predictions using the rank product. This combined score showed the best correlation to the cell line screens (r = 0.53), population studies (r = 0.45) and CLEARER (r = 0.71).

Figure 5.

Figure 5.

CLEARER supports identifying essential genes for human. (A) Correlation of experimental and computational scores. Scores from five population studies and ten cell line screens were combined and compared to predictions from CLEARER. The highest correlation was found when combining the experimental data and computational predictions. (B) Clustering of computational and experimental scores separates essential and non-essential genes. (C) Essential gene clusters associate with early death and lethality from Full Spectrum of Intolerance to Loss-of-function categories by Cacheiro et al. (68). (D) Word cloud illustrating enriched phenotype-to-gene associations of genes from the essential gene clusters. (E) Box plots of the scores of the four essential gene clusters illustrate how CLEARER supported the final decision-making towards or against essentiality. (F) Venn diagrams showing the number of overlapping and non-overlapping phenotypes being enriched in genes of the according clusters. Notably, for the non-essential gene clusters 6, 7, 8 and 11 no enriched phenotype was identified.

We were now interested whether this combined score improved elucidating human essential gene definitions. For this, we clustered the essential genes based on their experimentally derived scores and CLEARER predictions. We identified eleven clusters distinctively separating essential and non-essential genes according to their combined scores (Figure 5B). In total, 7,739 genes (38.9%) with high combined essentiality scores were found in four clusters (2–4,10). Genes from these four clusters account for the majority of genes associated to an early death (72.6%, Figure 5C). Cacheiro et al. (68), recently proposed a categorization of human genes according to their essentiality. Genes from clusters 2, 3, 4 and 10 were highly enriched in the categories ‘cellular lethal’ and ‘developmental lethal’ of Cacheiro et al. (Figure 5C). In contrast, these genes were depleted in the ‘viable’ categories, which again indicates an accumulation of essential genes in clusters 2, 3, 4 and 10. Another indicator for essentiality was their association to human diseases. Genes from our essential gene clusters were combined and compared to genes from the non-essential gene clusters (clusters 1, 5, 6, 7, 8, 9 and 11). We observed a distinctively high enrichment of diseases and impairments associated to the genes from essential gene clusters. In total, 490 phenotypic descriptions were found to be enriched mostly associated with abnormal development but also hypoplasia, which is associated with an inadequate or below-normal number of cells (Figure 5D). In contrast, the 12,175 genes from the other seven (non-essential gene) clusters were not enriched for any phentoypic description. These results highly suggest a strong enrichment of essential genes in clusters 2, 3, 4 and 10, compared to the other clusters.

Another interesting finding was that in clusters where experimental and computational scores differed, CLEARER supported making the final decision. Genes from cluster 4 and cluster 10 had high scores either in the cell line screens (cluster 4) or population studies (cluster 10). Both were supported by the predictions from CLEARER. In contrast, genes from clusters 5 and 6 were not supported by CLEARER (Figure 5E). Next, the phenotypic associations of the individual clusters were compared (Figure 5F) again showing the overlap of associated diseases in the essential gene clusters. On the contrary, genes from the non-essential clusters were considerably less associated to diseases (contributing only 3.6% of all identified phenotypes) and their phenotypes did not overlap (Figure 5F). In fact, for the non-essential gene clusters 6, 7, 8 and 11 no associated phenotype was found. The combined score for each gene and enriched phenotypes for the clusters are listed in Supplementary Table S6.

The results show that the combined score supports defining human essential genes and may contribute filling the gap between cell line screens and population studies by integrating information from the other model organisms. Genes with high scores are associated with death, abnormal morphology, cancer and other diseases. Moreover, the combined score appears to be particularly supportive when the results of the cell line screens and population studies are divergent.

DISCUSSION

We employed machine learning across six model eukaryotes and 60 381 genes, using 41 635 features derived from sequence, gene functions and network topology. The approach enabled to predict essential genes in an organism based on data of other organisms, here the above described and well-studied model organisms of human, mouse, fly, worm and the yeasts. Within a leave-one-organism-out cross-validation, the classifiers showed a high generalizability with an average accuracy close to 80% in the left-out species. Applying our combined approach consisting of machines from all these organisms and a homology based predictor we could predict gene essentiality with a similar performance for two (unseen during training) organisms T. castaneum and B. mori. For T. castaneum, we validated the predictions by a random selection of our predictions experimentally.

During the study, we came across several hurdles for these predictions. Regarding the gold standard, we observed a considerable discrepancy between organismal and cellular essential gene assignments. The correlation of their essentiality scores was low, particularly for human. Still, the overlap of cellular and organismal essential genes was significant for all three studied organisms for which such a comparison was possible (human, mouse, fly). The highest overlap was observed in mouse. Some genes seem to be essential in both, the single cells and in the whole organism, others seem to be essential specifically either in the organism or the single cells. Particularly for human, there is a large discrepancy between the available essentiality information from cell line screens compared to population studies (13). Human cell lines are mostly derived from cancer cells. Notably, these cells have been investigated to identify cancer-specific and even cancer-sub-type-specific essential genes. For example, more than a thousand cancer cell lines have been studied by Ghandi et al. to identify pan- and cancer-specific essential genes as new chemotherapeutic targets (69). Essential genes in such cell lines may have specific tasks in these cells, such as to provide high proliferation, cellular repair or biogenesis of cellular compounds. In turn, depletion of mutations/conservation of genes in human populations rather lead to genes being essential for the regulation, development and differentiation of the organism. Pan-cancer essentiality information has been used to provide targets for cancer therapy. However, these therapies often come along with high cell toxicity enforcing low dosage treatments, which, in turn limit their effect on tumor elimination (70). However, a systematic machine learning based differentiation of gene essentiality of cancer cells and population studies might be an intriguing approach to define drug targets against cancer with reduced side effects, and we suggest such studies as a future perspective. For human and fly, essential gene information based on studying immortal cell lines and the overlap of essential genes was much lower than for mouse. For mouse, cellular essential gene information was derived from experiments with embryonic stem cells indicating that these correspond much better to the organismal development. The different biological sources and experimental methods led to rather diverse lists of identified essential genes. To cope with these inconsistencies we integrated essentiality information from various studies for the individual organisms thus balancing for differences between the studies. In addition, predictions based on machines trained across several organisms. Particularly for human, we showed that combining essentiality information of other, non-human organisms using CLEARER with orthology information and experimental data for human led to a combined score highly correlating with scores from both, cell line and population studies. This approach may hence suit to bridge human CEG and OEG information.

Genes with high scores associated with early death, abnormal morphology and cancer. The predictions from CLEARER supported the final decision of genes with higher scores from the cell line or population studies. This suggests that the combined score, integrating experimental data and computational predictions based on the model organisms, may provide a valuable resource for genetic studies of human health and diseases.

The presented method base on a recent development by us (9). Methodologically, orthology-based classification was included and more than 14,000 additional features were implemented. In comparison to the previous study, here cross-species essential gene predictions were performed, instead of predicting essential genes within the same organism (in the previous study, training and validation was done for D. melanogaster or for human cell lines). In this study, two case studies for application to unknown organisms were performed demonstrating the generalizability of the classifier, including 400 RNAi experiments for validation of the essential gene predictions in T. castaneum. Our approach highly outperformed a previously published method (18), which evaluated the performance of protein sequence derived features using the same model organisms. Moreover, our approach advantaged from distinguishing between CEG and OEG in the gold standard. Good prediction of essential genes for prokaryotes were provided by Geptop, a well developed prediction tool basing on sequence homology (20), and this was further developed for eukaryotic cells adding information of orthologous groups (71). However, essential gene prediction particularly for eukaryotes is challenging as few genome wide experimental screens are available. In addition, the experimental approaches are very heterogeneous comprising knockout as well as knockdown methods, and investigating cell lines, single cell organisms, whole multi-cellular organisms, and population studies. Recently, Cacheiro et al. (68), combined essential gene information of mouse and human orthologous genes resulting in 3819 predicted human essential genes. The assignment of their groups is in good accordance to our combined score. However, the clear advantage of our approach is that it is applied to all known genes and not just to orthologs. Another approach elucidating essential gene information is provided by HEGIAP (37). HEGIAP is a collection of analytical tools enabling to analyze the epigenetics, gene structure and evolution of human essential genes. To note, HEGIAP considers only human essential genes from cell line screens neglecting organismal essentiality. Besides this, HEGIAP is rather a descriptive tool collection not providing a final prediction for gene essentiality.

As a further perspective, more complex sequence-based and topology features, which have been developed and shown to be powerful in essential gene prediction of prokaryotes (72,73), may be added to CLEARER to further improve its performance. In this study, we focused on the prediction of essential genes. As another perspective, using the same approach but defining another gold standard, also other gene to phenotype associations can be predicted. For this, several databases (42,56,66,74,75) with gene to phenotype descriptions for model organisms can be used to set up the appropriate gold standard. Moreover, here we only considered loss-of-function of single genes. A further promising application will be to predict synthetic lethality, in which combinations of loss-of-functions lead to death.

Our method CLEARER allows the prediction of essential genes, in principle, for any eukaryote without the need of a screening experiment. This can simplify the search for essential genes in non-model organisms e.g. to find targets for pest and vector control. Instead of large scale experimental screens, CLEARER can provide a shortlist of putative essential genes, which can subsequently be tested experimentally on a smaller scale.

DATA AVAILABILITY

The source code is available at https://github.com/ThomasBeder/CLEARER and archived with complete data sets and trained models at Zenodo https://doi.org/10.5281/zenodo.5557738.

Supplementary Material

lqab110_Supplemental_Files

ACKNOWLEDGEMENTS

We thank Martin Milner for dsRNA injections.

Contributor Information

Thomas Beder, Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany; Institute of Infectious Diseases and Infection Control, Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany; Department of Internal Medicine II, University Medical Center Schleswig-Holstein, Campus Kiel, 24105 Kiel, Germany.

Olufemi Aromolaran, Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria; Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria.

Jürgen Dönitz, Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany; Department of Medical Bioinformatics, University Medical Center Göttingen (UMG), 37099 Göttingen, Germany.

Sofia Tapanelli, Department of Life Sciences, Imperial College London, London SW7 2AZ, UK.

Eunice O Adedeji, Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria; Department of Biochemistry, Covenant University, Ota, Ogun State, Nigeria.

Ezekiel Adebiyi, Department of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria; Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria.

Gregor Bucher, Department of Evolutionary Developmental Genetics, GZMB, University of Göttingen, Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany.

Rainer Koenig, Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany; Institute of Infectious Diseases and Infection Control, Jena University Hospital, Am Klinikum 1, 07747 Jena, Germany.

SUPPLEMENTARY DATA

Supplementary Data are available at NARGAB Online.

FUNDING

Deutsche Forschungsgemeinschaft (https://www.dfg.de/) [KO 3678/5-1]; German Federal Ministry of Education and Research (BMBF) within the project Center for Sepsis Control and Care (CSCC) [01EO1002, 01EO1502]; Bayer CropScience; DFG for the development of iBeetle-Base [DFG LIS project 417202192].

Conflict of interest statement. None declared.

REFERENCES

  • 1. Rancati G., Moffat J., Typas A., Pavelka N.. Emerging and evolving concepts in gene essentiality. Nat. Rev. Genet. 2018; 19:34–49. [DOI] [PubMed] [Google Scholar]
  • 2. Sharma A.K., Eils R., König R.. Copy number alterations in enzyme-coding and cancer-causing genes reprogram tumor metabolism. Cancer Res. 2016; 76:4058–4067. [DOI] [PubMed] [Google Scholar]
  • 3. Brenner S. The genetics of Caenorhabditis elegans. Genetics. 1974; 77:71–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Nüsslein-Volhard C., Wieschaus E.. Mutations affecting segment number and polarity in Drosophila. Nature. 1980; 287:795–801. [DOI] [PubMed] [Google Scholar]
  • 5. Driever W., Solnica-Krezel L., Schier A.F., Neuhauss S.C., Malicki J., Stemple D.L., Stainier D.Y., Zwartkruis F., Abdelilah S., Rangini Z.et al.. A genetic screen for mutations affecting embryogenesis in zebrafish. Development. 1996; 123:37–46. [DOI] [PubMed] [Google Scholar]
  • 6. Fire A., Xu S., Montgomery M.K., Kostas S.A., Driver S.E., Mello C.C.. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature. 1998; 391:806–811. [DOI] [PubMed] [Google Scholar]
  • 7. Giaever G., Chu A.M., Ni L., Connelly C., Riles L., Véronneau S., Dow S., Lucau-Danila A., Anderson K., André B.et al.. Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002; 418:387–391. [DOI] [PubMed] [Google Scholar]
  • 8. Kim D.U., Hayles J., Kim D., Wood V., Park H.O., Won M., Yoo H.S., Duhig T., Nam M., Palmer G.et al.. Analysis of a genome-wide set of gene deletions in the fission yeast Schizosaccharomyces pombe. Nat. Biotechnol. 2010; 28:617–623. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Aromolaran O., Beder T., Oswald M., Oyelade J., Adebiyi E., Koenig R.. Essential gene prediction in Drosophila melanogaster using machine learning approaches based on sequence and functional features. Comput. Struct. Biotechnol. J. 2020; 18:612–621. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Plaimas K., Eils R., König R.. Identifying essential genes in bacterial metabolic networks with machine learning methods. BMC Syst. Biol. 2010; 4:56. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Lin Y., Zhang R.R.. Putative essential and core-essential genes in Mycoplasma genomes. Sci. Rep. 2011; 1:1–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Hillenmeyer M.E., Fung E., Wildenhain J., Pierce S.E., Hoon S., Lee W., Proctor M., St.Onge R.P., Tyers M., Koller D.et al.. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008; 320:362–365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Bartha I., Di Iulio J., Venter J.C., Telenti A.. Human gene essentiality. Nat. Rev. Genet. 2018; 19:51–62. [DOI] [PubMed] [Google Scholar]
  • 14. Blomen V.A., Majek P., Jae L.T., Bigenzahn J.W., Nieuwenhuis J., Staring J., Sacco R., van Diemen F.R., Olk N., Stukalov A.et al.. Gene essentiality and synthetic lethality in haploid human cells. Science. 2015; 350:1092–1096. [DOI] [PubMed] [Google Scholar]
  • 15. Hart T., Chandrashekhar M., Aregger M., Steinhart Z., Brown K.R., MacLeod G., Mis M., Zimmermann M., Fradet-Turcotte A., Sun S.et al.. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell. 2015; 163:1515–1526. [DOI] [PubMed] [Google Scholar]
  • 16. Wang T., Birsoy K., Hughes N.W., Krupczak K.M., Post Y., Wei J.J., Lander E.S., Sabatini D.M.. Identification and characterization of essential genes in the human genome. Science. 2015; 350:1096–1101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Guo F.-B., Dong C., Hua H.-L., Liu S., Luo H., Zhang H.-W., Jin Y.-T., Zhang K.-Y.. Accurate prediction of human essential genes using only nucleotide composition and association information. Bioinformatics. 2017; 33:1758–1764. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Campos T.L., Korhonen P.K., Gasser R.B., Young N.D.. An evaluation of machine learning approaches for the prediction of essential genes in eukaryotes using protein sequence-derived features. Comput. Struct. Biotechnol. J. 2019; 17:785–796. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Acencio M.L., Lemke N.. Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinformatics. 2009; 10:290. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Wen Q.-F., Liu S., Dong C., Guo H.-X., Gao Y.-Z., Guo F.-B.. Geptop 2.0: an updated, more precise, and faster geptop server for identification of prokaryotic essential genes. Front. Microbiol. 2019; 10:1236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Chen W.-H., Lu G., Chen X., Zhao X.-M., Bork P.. OGEE v2: an update of the online gene essentiality database with special focus on differentially essential genes in human cancer cell lines. Nucleic Acids Res. 2017; 45:D940–D944. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Gurumayum S., Jiang P., Hao X., Campos T.L., Young N.D., Korhonen P.K., Gasser R.B., Bork P., Zhao X.-M., He L.et al.. OGEE v3: Online GEne Essentiality database with increased coverage of organisms and human cell lines. Nucleic Acids Res. 2021; 49:D998–D1003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Lloyd J.P., Seddon A.E., Moghe G.D., Simenc M.C., Shiua S.H.. Characteristics of plant essential genes allow for within- and between-species prediction of lethal mutant phenotypes. Plant Cell. 2015; 27:2133–2147. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Schmitt-Engel C., Schultheis D., Schwirz J., Ströhlein N., Troelenberg N., Majumdar U., Dao V.A., Grossmann D., Richter T., Tech M.et al.. The iBeetle large-scale RNAi screen reveals gene functions for insect development and physiology. Nat. Commun. 2015; 6:7822. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Chang J., Wang R., Yu K., Zhang T., Chen X., Liu Y., Shi R., Wang X., Xia Q., Ma S.. Genome-wide CRISPR screening reveals genes essential for cell viability and resistance to abiotic and biotic stresses in Bombyx mori. Genome Res. 2020; 30:757–767. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Luo H., Lin Y., Gao F., Zhang C.-T., Zhang R.. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements: Table 1. Nucleic. Acids. Res. 2014; 42:D574–D580. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Durinck S., Spellman P.T., Birney E., Huber W.. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 2009; 4:1184–1191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Charif D., Lobry J.R.. Bastolla U., Porto M., Roman H.E., Vendruscolo M.. SeqinR 1.0-2: A Contributed Package to the R Project for Statistical Computing Devoted to Biological Sequences Retrieval and Analysis. Structural Approaches to Sequence Evolution: Molecules, Networks, Populations. 2007; Berlin, Heidelberg: Springer Berlin Heidelberg; 207–232. [Google Scholar]
  • 29. Xiao N., Cao D.-S., Zhu M.-F., Xu Q.-S.. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics. 2015; 31:1857–1859. [DOI] [PubMed] [Google Scholar]
  • 30. Zhu M., Dong J., Cao D.. rDNAse: generating various numerical representation schemes of DNA sequences. 2016; R package version 1.1-1. [DOI] [PubMed] [Google Scholar]
  • 31. Szklarczyk D., Gable A.L., Lyon D., Junge A., Wyder S., Huerta-Cepas J., Simonovic M., Doncheva N.T., Morris J.H., Bork P.et al.. STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019; 47:D607–D613. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Pruitt K.D., Tatusova T., Maglott D.R.. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic. Acids. Res. 2007; 35:D61–D65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Altschul S. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389–3402. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Vinayagam A., König R., Moormann J., Schubert F., Eils R., Glatting K.-H., Suhai S.. Applying support vector machines for gene ontology based gene function prediction. BMC Bioinformatics. 2004; 5:116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J.. Basic local alignment search tool. J. Mol. Biol. 1990; 215:403–410. [DOI] [PubMed] [Google Scholar]
  • 36. Almagro Armenteros J.J., Sønderby C.K., Sønderby S.K., Nielsen H., Winther O.. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics. 2017; 33:3387–3395. [DOI] [PubMed] [Google Scholar]
  • 37. Chen H., Zhang Z., Jiang S., Li R., Li W., Zhao C., Hong H., Huang X., Li H., Bo X.. New insights on human essential genes based on integrated analysis and the construction of the HEGIAP web-based platform. Brief. Bioinform. 2019; 21:1397–1410. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Friedman J., Hastie T., Tibshirani R.. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010; 33:1–22. [PMC free article] [PubMed] [Google Scholar]
  • 39. Chawla N. V., Bowyer K.W., Hall L.O., Kegelmeyer W.P.. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002; 16:321–357. [Google Scholar]
  • 40. Kuhn M. Building predictive models in R using the caret package. J. Stat. Softw. 2008; 28:1–26.27774042 [Google Scholar]
  • 41. Kriventseva E.V, Kuznetsov D., Tegenfeldt F., Manni M., Dias R., Simão F.A., Zdobnov E.M.. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res. 2019; 47:D807–D811. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Robinson P.N., Köhler S., Bauer S., Seelow D., Horn D., Mundlos S.. The human phenotype ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 2008; 83:610–615. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Luo J., Emanuele M.J., Li D., Creighton C.J., Schlabach M.R., Westbrook T.F., Wong K.-K., Elledge S.J.. A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene. Cell. 2009; 137:835–848. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Chen S., Zhang Y.E., Long M.. New genes in Drosophila quickly become essential. Science (80-.). 2010; 330:1682–1685. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Dowell R.D., Ryan O., Jansen A., Cheung D., Agarwala S., Danford T., Bernstein D.A., Rolfe P.A., Heisler L.E., Chin B.et al.. Genotype to phenotype: a complex problem. Science. 2010; 328:469–469. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Han T.X., Xu X.-Y., Zhang M.-J., Peng X., Du L.-L.. Global fitness profiling of fission yeast deletion strains by barcode sequencing. Genome Biol. 2010; 11:R60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Cheung H.W., Cowley G.S., Weir B.A., Boehm J.S., Rusin S., Scott J.A., East A., Ali L.D., Lizotte P.H., Wong T.C.et al.. Systematic investigation of genetic vulnerabilities across cancer cell lines reveals lineage-specific dependencies in ovarian cancer. Proc. Natl. Acad. Sci. U.S.A. 2011; 108:12372–12377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Marcotte R., Brown K.R., Suarez F., Sayad A., Karamboulas K., Krzyzanowski P.M., Sircoulomb F., Medrano M., Fedyshyn Y., Koh J.L.Y.et al.. Essential gene profiles in breast, pancreatic, and ovarian cancer cells. Cancer Discov. 2012; 2:172–189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Guo Y., Park J.M., Cui B., Humes E., Gangadharan S., Hung S., FitzGerald P.C., Hoe K.-L., Grewal S.I.S., Craig N.L.et al.. Integration profiling of gene function with dense maps of transposon integration. Genetics. 2013; 195:599–609. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Petrovski S., Wang Q., Heinzen E.L., Allen A.S., Goldstein D.B.. Genic intolerance to functional variation and the interpretation of personal genomes. PLos Genet. 2013; 9:e1003709. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Samocha K.E., Robinson E.B., Sanders S.J., Stevens C., Sabo A., McGrath L.M., Kosmicki J.A., Rehnström K., Mallick S., Kirby A.et al.. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 2014; 46:944–950. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Bartha I., Rausell A., McLaren P.J., Mohammadi P., Tardaguila M., Chaturvedi N., Fellay J., Telenti A.. The characteristics of heterozygous protein truncating variants in the human genome. PLoS Comput. Biol. 2015; 11:e1004647. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Fadista J., Oskolkov N., Hansson O., Groop L.. LoFtool: a gene intolerance score based on loss-of-function variants in 60 706 individuals. Bioinformatics. 2016; 33:471–474. [DOI] [PubMed] [Google Scholar]
  • 54. Cassa C.A., Weghorn D., Balick D.J., Jordan D.M., Nusinow D., Samocha K.E., O’Donnell-Luria A., MacArthur D.G., Daly M.J., Beier D.R.et al.. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 2017; 49:806–810. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Viswanatha R., Li Z., Hu Y., Perrimon N.. Pooled genome-wide CRISPR screening for basal and context-specific fitness gene essentiality in Drosophila cells. Elife. 2018; 7:1–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Bult C.J., Blake J.A., Smith C.L., Kadin J.A., Richardson J.E., Anagnostopoulos A., Asabor R., Baldarelli R.M., Beal J.S., Bello S.M.et al.. Mouse Genome Database (MGD) 2019. Nucleic. Acids. Res. 2019; 47:D801–D806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Shohat S., Shifman S.. Genes essential for embryonic stem cells are associated with neurodevelopmental disorders. Genome Res. 2019; 29:1910–1918. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Port F., Strein C., Stricker M., Rauscher B., Heigwer F., Zhou J., Beyersdörffer C., Frei J., Hess A., Kern K.et al.. A large-scale resource for tissue-specific CRISPR mutagenesis in Drosophila. Elife. 2020; 9:e53865. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Spradling A.C., Stern D., Beaton A., Rhem E.J., Laverty T., Mozden N., Misra S., Rubin G.M.. The Berkeley Drosophila Genome Project gene disruption project: Single P-element insertions mutating 25% of vital Drosophila genes. Genetics. 1999; 153:135–177. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Kamath R.S., Fraser A.G., Dong Y., Poulin G., Durbin R., Gotta M., Kanapin A., Le Bot N., Moreno S., Sohrmann M.et al.. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature. 2003; 421:231–237. [DOI] [PubMed] [Google Scholar]
  • 61. Boutros M. Genome-wide RNAi analysis of growth and viability in Drosophila cells. Science. 2004; 303:832–835. [DOI] [PubMed] [Google Scholar]
  • 62. Liao B.-Y., Zhang J.. Mouse duplicate genes are as essential as singletons. Trends Genet. 2007; 23:378–381. [DOI] [PubMed] [Google Scholar]
  • 63. Silva J.M., Marran K., Parker J.S., Silva J., Golding M., Schlabach M.R., Elledge S.J., Hannon G.J., Chang K.. Profiling essential genes in human mammary cells by multiplex RNAi screening. Science. 2008; 319:617–620. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Brown S.J., Shippy T.D., Miller S., Bolognesi R., Beeman R.W., Lorenzen M.D., Bucher G., Wimmer E.A., Klingler M.. The red flour beetle, Tribolium castaneum (Coleoptera): a model for studies of development and pest biology. Cold Spring Harb. Protoc. 2009; 2009: 10.1101/pdb.emo126. [DOI] [PubMed] [Google Scholar]
  • 65. Ulrich J., Dao V.A., Majumdar U., Schmitt-Engel C., Schwirz J., Schultheis D., Ströhlein N., Troelenberg N., Grossmann D., Richter T.et al.. Large scale RNAi screen in Tribolium reveals novel target genes for pest control and the proteasome as prime target. BMC Genomics. 2015; 16:674. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Dönitz J., Schmitt-Engel C., Grossmann D., Gerischer L., Tech M., Schoppmeier M., Klingler M., Bucher G.. iBeetle-Base: a database for RNAi phenotypes in the red flour beetle Tribolium castaneum. Nucleic Acids Res. 2015; 43:D720–D725. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Xia Q., Li S., Feng Q.. Advances in silkworm studies accelerated by the genome sequencing of Bombyx mori. Annu. Rev. Entomol. 2014; 59:513–536. [DOI] [PubMed] [Google Scholar]
  • 68. Cacheiro P., Muñoz-Fuentes V., Murray S.A., Dickinson M.E., Bucan M., Nutter L.M.J., Peterson K.A., Haselimashhadi H., Flenniken A.M., Morgan H.et al.. Human and mouse essentiality screens as a resource for disease gene discovery. Nat. Commun. 2020; 11:655. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Ghandi M., Huang F.W., Jané-Valbuena J., Kryukov G.V., Lo C.C., McDonald E.R., Barretina J., Gelfand E.T., Bielski C.M., Li H.et al.. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature. 2019; 569:503–508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Chang L., Ruiz P., Ito T., Sellers W.R.. Targeting pan-essential genes in cancer: challenges and opportunities. Cancer Cell. 2021; 39:466–479. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Liu S., Wang S.-X., Liu W., Wang C., Zhang F.-Z., Ye Y.-N., Wu C.-S., Zheng W.-X., Rao N., Guo F.-B.. CEG 2.0: an updated database of clusters of essential genes including eukaryotic organisms. Database (Oxford). 2020; 2020:baaa112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Song K., Tong T., Wu F.. Predicting essential genes in prokaryotic genomes using a linear method: ZUPLS. Integr. Biol. 2014; 6:460–469. [DOI] [PubMed] [Google Scholar]
  • 73. Azhagesan K., Ravindran B., Raman K.. Network-based features enable prediction of essential genes across diverse organisms. PLoS One. 2018; 13:e0208722. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Harris T.W., Arnaboldi V., Cain S., Chan J., Chen W.J., Cho J., Davis P., Gao S., Grove C.A., Kishore R.et al.. WormBase: a modern Model Organism Information Resource. Nucleic Acids Res. 2019; 48:D762–D767. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Larkin A., Marygold S.J., Antonazzo G., Attrill H., Santos G., Garapati P. V, Goodman J.L., Gramates L.S., Millburn G., Strelets V.B.et al.. FlyBase: updates to the Drosophila melanogaster knowledge base. Nucleic Acids Res. 2021; 49:D899–D907. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

lqab110_Supplemental_Files

Data Availability Statement

The source code is available at https://github.com/ThomasBeder/CLEARER and archived with complete data sets and trained models at Zenodo https://doi.org/10.5281/zenodo.5557738.


Articles from NAR Genomics and Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES