Abbreviations: ML, machine learning; RNAi, RNA interference; CRISPR, Clustered Regularly Interspaced Short Palindromic Repeats; SNP, single nucleotide polymorphism; CDS, coding sequence; TSS, transcription start site; EST, expressed sequence tag; VCF, variant call file; GFF, general feature format; ES, Essentiality Score; PPI, protein-protein interaction; SPLS, Sparse Partial Least Squares; GO, gene ontology; GLM, Generalised Linear Model; NN, Artificial Neural Network; GBM, Gradient Boosting Machine; XGB, eXtreme Gradient Boosting; SVM, Support-Vector Machine; RF, Random Forest; ROC-AUC, Area Under the Receiver Operating Characteristic Curve; PR-AUC, Area Under the Precision-Recall Curve; TEA, Tissue Enrichment Analysis tool (WormBase)
Keywords: Caenorhabditis elegans, Machine-learning, Essential genes, Essentiality predictions
Abstract
Defining genes that are essential for life has major implications for understanding critical biological processes and mechanisms. Although essential genes have been identified and characterised experimentally using functional genomic tools, it remains challenging to predict such genes with confidence from molecular and phenomic data sets using computational methods. Using extensive data sets available for the model organism Caenorhabditis elegans, we constructed a machine-learning (ML)-based workflow for the prediction of essential genes on a genome-wide scale. We identified strong predictors for such genes and showed that trained ML models consistently achieve highly accurate classifications. Complementary analyses revealed an association between essential genes and chromosomal location. Our findings reveal that essential genes in C. elegans tend to be located in or near the centre of autosomal chromosomes; are positively correlated with low single nucleotide polymorphism (SNP) densities and epigenetic markers in promoter regions; are involved in protein and nucleotide processing; are transcribed in most cells; and are enriched in reproductive tissues or are targets for small RNAs bound to the Argonaute CSR-1. Based on these results, we hypothesise an interplay between epigenetic markers and small RNA pathways in the germline, with transcription-based memory; this hypothesis warrants testing. From a technical perspective, further work is needed to evaluate whether the present ML-based approach will be applicable to other metazoans (including Drosophila melanogaster) for which comprehensive data sets (i.e. genomic, transcriptomic, proteomic, variomic, epigenetic and phenomic) are available.
1. Introduction
Model organisms, such as the free-living nematode Caenorhabditis elegans, have been utilised extensively to explore the biology of multicellular (metazoan) organisms [1], [2], [3]. The sequencing of the C. elegans genome [4] and subsequent development of functional genomics tools, such as double-stranded RNA interference (RNAi), transgenesis and, more recently, CRISPR/Cas9, combined with genetic mapping, have underpinned studies of gene function [5], [6], [7], [8], [9]. A key research focus has been to identify or define genes which are functionally essential for life in cells, tissues and/or the organism (thus called ‘essential genes’) using such gene knock-down or knock-out approaches [7], [10], [11], [12]. These efforts have led to a wealth of experimental data and information on essential genes, now publicly available in the WormBase database [13]. While these data are rich and highly informative, there have been some discrepancies in the assignment of gene essentiality among studies using phenotypic data. Such discrepancies can be due to some genes being ‘conditionally-essential’ [1] depending, for example, on developmental stage, strain or experimental/environmental conditions. However, it is also possible that some discrepancies might relate to possible off-target effects in RNAi [14] and/or human error during large-scale experiments [15]. Despite such variation among experimental studies, there appears to be a consensus set of essential genes in C. elegans.
In recent years, computational approaches have been evaluated for the prediction of the complement of essential genes on a genome-wide scale employing functional genomic-phenotypic data sets for C. elegans. Such approaches could become important tools for predicting essential genes in less-studied organisms, such as many parasitic helminths, for which extensive genome, transcriptome and/or proteome data are available, but for which genome-wide functional genomic data have been lacking (e.g., [16], [17]). Some studies of C. elegans data sets have used genome-wide genetic interaction networks [18], [19] or single-nucleotide polymorphism (SNP) analyses [20], [21]. Others have identified features, such as gene size, evolutionary rate, phyletic retention, transcription level, protein–protein interaction (PPI) network connectivity and/or cellular or subcellular localisation, which correlate with gene essentiality [1], [22], [23]. Despite the apparent utility or promise of these computational approaches, some discrepancies in experimental results among functional genomic studies, variation in the nature and extent of data sets used, and the limited curation of some data sets can markedly affect the confidence of predicting essential genes [1], [24], [25], [26]. Here, we tackle this problem by employing a scoring-system to assign essentiality to genes from phenotypic data and by establishing procedures for large-scale extraction/engineering and selection of features associated with those genes from extensive ‘omics data sets. Using these essentiality annotations and selected predictive features, we constructed and systematically evaluated a machine-learning (ML)-based workflow for the genome-wide prediction of essential genes in C. elegans.
2. Materials and methods
2.1. Data sets
We obtained extensive data and annotations from three sources (i.e. WormBase [27], the Ensembl database [28] and/or published studies). Functional genomic/phenomic data sets from RNAi studies and annotated data (genomic, transcriptomic, proteomic and epigenetic; in GFF) linked to the C. elegans genome were from WormBase (WS270 release – 25/02/2019) [27]. Genomic, coding sequences (CDSs) and proteins (canonical) were from Ensembl. Gene transcription data for different developmental stages [29]; transcription start site (TSS) locations in the genome [30]; multi-cell or single-cell transcriptomic data [31], [32]; Ribo-seq annotations [33]; epigenetic markers (ChIP-seq and ATAC-seq) [34], [35], [36]; and variomic data containing genome-wide SNPs (high-quality VCF; release 20180527) [37] were obtained from the peer-reviewed literature.
2.2. Scoring of gene essentiality and provisional assignment
From WormBase, we extracted phenotypic data from all published RNAi studies of C. elegans and corresponding ontology terms using established scripts (see Data and code availability). We extracted all ‘lethal’ terms and their descendants from the phenotype_ontology.WS270.obo file and all ‘not lethal’ terms from the association file (phenotype_association.WS270.wb; column 4). We used the latter file to identify individual genes reported (in the peer-reviewed literature) to be linked to ‘lethal’ or ‘not lethal’ phenotypes upon RNAi. For each gene, we then calculated an essentiality score (ES), defined as the square of the number of RNAi experiments reporting essential/lethal terms (E) divided by the square of the total number of experiments reporting either essential/lethal or non-essential/viable terms (T), i.e. ES = E²/T². A gene was provisionally assigned as “essential” (ES > 0.9) or “non-essential” (ES < 0.1); all other genes (0.1 ≤ ES ≤ 0.9) were assigned as “conditionally-essential”.
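The scoring rule described above can be illustrated with a minimal Python sketch. It is not the authors' script (their code is in R); the function names are hypothetical, and it assumes that T is the sum of the ‘lethal’ and ‘not lethal’ experiment counts for a gene.

```python
def essentiality_score(n_lethal: int, n_viable: int) -> float:
    """Return ES = E^2 / T^2, with E = lethal reports and T = E + viable reports.

    Assumption: T counts every RNAi experiment reporting either outcome.
    """
    total = n_lethal + n_viable
    if total == 0:
        raise ValueError("gene has no RNAi phenotype reports")
    return (n_lethal ** 2) / (total ** 2)


def provisional_class(es: float) -> str:
    """Apply the thresholds given in the text (ES > 0.9, ES < 0.1, otherwise)."""
    if es > 0.9:
        return "essential"
    if es < 0.1:
        return "non-essential"
    return "conditionally-essential"
```

Because the ratio E/T is squared, a gene needs more than about 95% of its experiments to report lethality (E/T > √0.9) to be called essential, whereas it is called non-essential when fewer than about 32% do (E/T < √0.1).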
2.3. Feature extraction or engineering
For individual genes, features were extracted from six (i.e. genomic, CDSs, overlapping-gene, transcriptomic, protein and ‘variome’) data sets derived from WormBase, Ensembl and/or published studies (see Data sets, above).
From genomic data, we extracted features including length, number of exons, distance from the chromosome centre (average distance between start codon of the first gene and the stop codon of the last gene in a chromosome), number of isoforms and presence/absence of associated Pfam-domains using “biomaRt” for R. From CDSs, we extracted nucleotide composition and correlation features using rDNAse (R package) as well as codon usage features using codonW (http://codonw.sourceforge.net).
For overlapping gene regions, we engineered new features (e.g., occurrence of chromatin state-domains; [34], [35], [36]) using the program BEDTools. The same approach was used to count features of overlapping genes defined in the GFF file (column 2) obtained from WormBase. In addition, we engineered features by establishing whether genes overlap outron- and/or exon-mapping transcription start sites (TSSs) (https://wormtss.utgenome.org) [30].
For ‘pooled’ transcriptomic data, we individually queried all designated ‘essential’, ‘conditionally-essential’ or ‘non-essential’ genes against the WormExp database, and then recorded the presence/absence of each gene in each of the first 30 returned data sets. For developmental transcriptomic data [29], we used the transcription levels of individual genes in each developmental stage as features. For single-cell transcriptomic data [31], we recorded the transcription level of each gene in each cell and enumerated the cells transcribing a particular gene.
From protein sequences, we extracted features using “protr” utilising all descriptors defined in this R package as well as the numbers of predicted transmembrane domains and signal peptides per protein employing TMHMM [38] and SignalP [39], respectively. We also obtained features from predicted protein subcellular localisations using WolfPsort [40] and DeepLoc [41] as well as protein disorder features employing DisEMBL [42].
For the variomes of C. elegans (variomics-natural file; see Data sets), we calculated the numbers of SNPs in individual genes using BEDTools and inferred the effect(s) of individual SNPs on gene function using SnpEff [43]; these data were employed as features. The Ka/Ks ratio was calculated from the SnpEff output using an available script (https://github.com/MerrimanLab/selectionTools/blob/master/extrascripts/kaks.py). The data sets and code used to extract or engineer features are in the “R Markdown” script available at https://bitbucket.org/tuliocampos/essential_elegans.
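The per-gene SNP count can be sketched in plain Python as a stand-in for the BEDTools step. This is an illustration only, not the command used in the study; the function name and the tuple layouts are assumptions, and coordinates are taken to be 1-based and inclusive, as in GFF.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_snps_per_gene(genes, snps):
    """Count SNPs falling inside each gene interval.

    genes: iterable of (gene_id, chromosome, start, end), 1-based inclusive.
    snps:  iterable of (chromosome, position).
    Returns a dict mapping gene_id to its SNP count.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()

    counts = {}
    for gene_id, chrom, start, end in genes:
        positions = by_chrom.get(chrom, [])
        # Number of sorted positions p with start <= p <= end.
        counts[gene_id] = bisect_right(positions, end) - bisect_left(positions, start)
    return counts
```

A SNP lying in two overlapping genes is counted once for each gene, which is the usual behaviour when intersecting a gene track with a variant track.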
2.4. Feature sets
We combined all extracted/engineered features with the respective gene essentiality annotations and stacked this information into a matrix using R. In this feature matrix, each row represented a gene, each column represented an extracted feature and the last column represented the essentiality annotation (“essential” or “non-essential”); this matrix contained all data (“FULL”). To create a non-redundant (“NR”) set of features, we first clustered protein sequences using USEARCH (parameters: -cluster_fast -centroids) [44], obtained gene identifiers and then removed genes and associated features if multiple amino acid sequences had ≥25% identity, retaining only the centroid sequences of all individual clusters. Subsequently, we removed features with low variance from both the “FULL” and “NR” feature sets using the nearZeroVar method in “caret”. For “FULL”, we also assessed statistical differences in the features between “essential” and “non-essential” genes using two-tailed pairwise t-tests (95% confidence interval) in R (t.test), recording p-values and Holm-Bonferroni corrected (p.adjust) values.
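For readers unfamiliar with the Holm-Bonferroni step-down correction applied to the per-feature p-values, a minimal Python sketch follows. It mirrors the standard procedure (the sorted p-values are multiplied by m, m − 1, ..., 1, then made monotone and capped at 1); the function name is hypothetical, and the study itself used p.adjust in R.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = (m - rank) * p_values[idx]
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

A feature would then be reported as significantly different between essential and non-essential genes when its adjusted value falls below the chosen threshold (0.05 is assumed here, consistent with the 95% confidence level stated above).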
2.5. Feature selection, ML training and performance assessment
Features were selected by random subsampling from 10% to 90% of data representing “essential” or “non-essential” genes (in 10% stepwise increments) based on a consensus between elasticNet (alpha = 0.5) and ensemble Sparse Partial Least Squares (SPLS) methods using “glmnet” and “enspls” in R, respectively [26]. The features were then used to train each of six ML-models, i.e. GBM (Gradient Boosting Machine), GLM (Generalised Linear Model), NN (Neural Network – perceptron), RF (Random Forest), SVM (Support-Vector Machine) [26] and XGB (eXtreme Gradient Boosting – xgbTree), in the “caret” R-package. During the training process, we employed parameter-tuning and 5-fold cross-validation, ultimately selecting the models with the highest ROC-AUC. Following subsampling, we employed the remaining data (90%–10%) to evaluate the performance of the final models using ROC-AUC and PR-AUC.
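The two evaluation metrics can be illustrated with a dependency-free Python sketch. ROC-AUC is computed here as the probability that a randomly chosen essential gene receives a higher score than a randomly chosen non-essential gene (ties count one half). PR-AUC is approximated by average precision, which is one of several conventions and is an assumption; the R tooling used in the study may interpolate the curve differently. Function names are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    true_pos = 0
    area = 0.0
    i = 0
    while i < len(ranked):
        # Treat all items sharing one score as a single threshold.
        j = i
        group_pos = 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            group_pos += ranked[j][1]
            j += 1
        true_pos += group_pos
        precision = true_pos / j          # j items are predicted positive
        area += (group_pos / n_pos) * precision  # recall gain x precision
        i = j
    return area
```

With far more non-essential than essential genes, a classifier can reach a high ROC-AUC while still ranking many non-essential genes above essential ones; average precision penalises this, which is why both metrics are reported.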
Subsequently, we trained each of the six ML-models with 100% of each set, and calculated the ‘importance’ of each feature for each ML algorithm for each feature set using the varImp method in the “caret” package. For each ML-model, we calculated ROC-AUCs using 5-fold cross-validation and plotted them against the parameters tested. We ranked the predictors according to the median feature-importance among the best three ML-models and selected 40 consensus-features that were highly predictive of gene essentiality employing the “FULL” or “NR” data set. Then, we assessed whether these consensus-features correlated with essentiality using “correlationfunnel”, and evaluated pairwise correlations among features using “corrplot” (R). Using this reduced set of consensus-features (NR_SELECTED), we then trained the ML-methods and evaluated their prediction-performance using ROC-AUC and PR-AUC. Finally, we assessed variation in these metrics using bootstrapping (1000 times), employing 90% of the data for training and the remaining 10% for testing.
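The ranking of predictors by median importance can be sketched as follows. This is a simplified illustration under stated assumptions: the importance values are taken to be already computed per model (as varImp would return them), the function name and input layout are hypothetical, and ties are broken alphabetically for reproducibility.

```python
from statistics import median


def rank_by_median_importance(importance, models, top_k):
    """Rank features by their median importance across the chosen models.

    importance: dict mapping model name -> {feature name: importance value}.
    models:     names of the models to combine (e.g. the best three).
    top_k:      number of top-ranked features to return.
    """
    # Only features scored by every chosen model can be compared.
    shared = set.intersection(*(set(importance[m]) for m in models))
    medians = {f: median(importance[m][f] for m in models) for f in shared}
    ranked = sorted(medians.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Running this once for the FULL set and once for the NR set, and intersecting the two top-ranked lists, would give the features shared between both data sets.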
2.6. Distribution of genes and SNPs on chromosomes
We counted the number of SNPs in each 1000-bp window on each chromosome using published variomic data (high-quality VCF; release 20180527) [37]. We established the locations of genes provisionally assigned as “essential”, “non-essential” or “conditionally-essential” (see Subsection 2.2) using the WormBase GFF file, and generated individual density plots showing the distribution of genes for each chromosome (“ggplot” for R). We compared the distributions of genes by essentiality annotation using Kolmogorov-Smirnov tests (ks.test in R) [36].
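Both steps in this subsection can be illustrated with a short Python sketch: binning SNP positions into 1000-bp windows, and computing the two-sample Kolmogorov-Smirnov statistic D (the largest gap between two empirical distribution functions). The function names are hypothetical, positions are assumed to be 1-based, and the p-value reported by ks.test is not reproduced here.

```python
from bisect import bisect_right
from collections import Counter


def snps_per_window(positions, window=1000):
    """Count SNPs per window; window index 0 covers positions 1..window."""
    return Counter((pos - 1) // window for pos in positions)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum gap
    # is reached at one of the observed values.
    for x in a + b:
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d
```

Here the samples would be the chromosomal positions of genes in two essentiality classes; a D close to 0 indicates similar positional distributions, and a D close to 1 indicates almost no overlap.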
2.7. Gene ontology (GO), transcription and tissue enrichment analyses
Using the GBM, RF and XGB methods trained with NR_SELECTED data, we identified the 500 C. elegans genes with the highest median probabilities of being essential and then conducted gene ontology (GO), transcription and tissue enrichment analyses. For these 500 genes, GO enrichment (for biological process, molecular function and cellular component) was carried out using the Gene Set Enrichment Analysis available at WormBase [27], DAVID [45] and WebGestalt (‘over-representation analysis’) [46], after which the WormExp database/website [32] was interrogated for transcription enrichment. Then, we queried WormBase using the Tissue Enrichment Analysis (TEA) tool [47].
2.8. Validation of ML predictions using mutant allele data
First, we ranked all genes used in the present study by their final ML predictions (see Subsection 2.5). Second, a list of all C. elegans genes with at least one report of a “lethal” phenotype in the GExplore database [48] was created. Third, we incrementally searched for all genes in GExplore, according to ML probability, in descending and also in ascending order, and then calculated cumulative ratios. These ratios were displayed in a graph using “ggplot” in R.
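The cumulative-ratio calculation can be sketched in Python as follows. This reflects one reading of the procedure described above (at each rank, the fraction of genes seen so far that have a “lethal” record); the function name and inputs are hypothetical.

```python
def cumulative_lethal_ratio(probabilities, lethal_genes, descending=True):
    """Cumulative fraction of genes with a lethal record along an ML ranking.

    probabilities: dict mapping gene identifier -> ML probability of essentiality.
    lethal_genes:  set of gene identifiers with at least one lethal record.
    descending:    rank from highest to lowest probability if True.
    """
    ranked = sorted(probabilities, key=lambda g: probabilities[g], reverse=descending)
    ratios = []
    hits = 0
    for n_seen, gene in enumerate(ranked, start=1):
        if gene in lethal_genes:
            hits += 1
        ratios.append(hits / n_seen)
    return ratios
```

Whichever direction is used, the last value equals the overall fraction of genes with a lethal record, so the descending and ascending curves converge on the same ratio.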
3. Results
We built and then employed a well-defined workflow (Fig. 1) to: (i) annotate genes for essentiality from phenomic data; (ii) extract features predictive of gene essentiality; (iii) train and test ML approaches using selected features; (iv) map essential genes and SNPs to their locations on chromosomes; and (v) explore gene ontology (GO) and transcription enrichments linked to essential genes.
3.1. Annotating genes for essentiality from phenomic data
We first categorised sets of genes as ‘essential’, ‘non-essential’ or ‘conditionally-essential’ – with the latter category reflecting discrepant experimental results between or among published studies. For this categorisation, we inspected the hierarchical phenotype ontology for C. elegans (in WormBase), obtained 150 ontology-identifiers and then used them to calculate individual essentiality scores (ESs) (Table S1). Using these ESs, we provisionally assigned 670 genes in C. elegans as essential, 16,070 as non-essential, and 1721 as conditionally-essential using RNAi data sets (Fig. 2a; Tables S2–S4). A small percentage of genes annotated as essential (23 of 670; 3.4%) or non-essential (1616 of 16,070; 10%) were recorded as having both lethal/essential and viable/non-essential entries in the phenotype association file from WormBase. Most gene annotations were supported by results from at least three published RNAi experiments (via WormBase): 527 (78.6%) for essential, 13,579 (84.5%) for non-essential and 1592 (92.5%) for conditionally-essential.
3.2. Predictive features identified from multiple sources
For all individual genes annotated previously, 55,694 features were identified. Following the removal of features exhibiting low variance, 1609 features (per gene) were retained and used in subsequent analyses. After p-value correction (Holm-Bonferroni), 801 features displayed significant differences between essential and non-essential genes (Table S5). More than half (n = 416) of these features were from protein sequences, 193 from nucleotide sequences, 42 from transcriptomic data (from the WormExp database), 16 from SNP data, 14 related to subcellular localisation, 9 to single-cell RNA-seq (scRNA-seq) data, 5 from genomic locations or gene models, and 4 related to evidence of transcription in different developmental stages. In addition, we identified 102 predictive features that overlap with the genomic locations of genes, including 50 features derived from WormBase, 49 from epigenetic markers, 2 from transcription start sites (outron/exon) and 1 from Ribo-seq (Table S5).
3.3. Systematic feature selection, and training/evaluation of ML approaches
First, we selected a complete (FULL) set of features from ‘essential’ and ‘non-essential’ genes (filtered) (n = 1609 per gene). Then, we used subsets of the FULL set (10–90% random samples) to train six individual ML methods (Gradient Boosting Machine, GBM; Generalised Linear Model, GLM; Neural Network, NN; Random Forest, RF; Support-Vector Machine, SVM; and eXtreme Gradient Boosting, XGB) to predict the same subsets, usually achieving high prediction performances (ROC-AUC of ~1 and PR-AUC of ~1; Fig. S1). Nonetheless, NN and GLM did exhibit a decrease in ROC-AUC (~0.97 and ~0.97, respectively) and in PR-AUC (~0.97 and ~0.8, respectively). Having trained individual ML methods, we then predicted gene essentiality from nine independent test-sets (not used for model training). Each of the six ML models achieved a high ROC-AUC of 0.94 to 1.0, with PR-AUCs of 0.75–0.95 for GBM, RF and XGB, and 0.65 to 0.76 for GLM, NN and SVM (Fig. S2). Only the last of these models (SVM) showed a decrease in PR-AUC as more data were added to individual training sets. Subsequently, we used the FULL set for the final selection of features and to train each of the six ML methods. Using this approach, we identified 418 predictors of gene essentiality, with the relative importance of these predictors being recorded for each model (Table S6).
Second, we created a non-redundant (NR) set of features by clustering protein sequences, retaining the centroid sequences with <25% identity representing all individual clusters. This NR dataset represented 615 essential and 12,193 non-essential genes, each having 1609 features. We employed this data set for the systematic selection of features as well as the training and testing of all six ML methods. The prediction performances of most ML models were commensurate with those achieved using the FULL data set (Fig. 2b – left), with SVM achieving a superior PR-AUC performance when trained using the NR set (Tables S6 and S7). Following feature-selection and training with the NR data set, 291 features were selected as the ‘best’ predictors of essentiality (representing a reduction of 30% compared with the FULL set).
Third, we established the minimum number (n = 40) of features that were highly-predictive for essentiality in the FULL or the NR data set (Fig. S3); 28 of these 40 features were shared between the two data sets. These highly-predictive features included: exon number; gene length; GC content; presence of an encoded signal peptide; sequence characteristics (e.g., nucleotide sequence composition, which considers order and physicochemical properties [PseKNC_5_Xc1.CGT] or amino acid triads in a protein sequence [CTriad_VS115]); epigenetic chromatin-state markers relating to promoter regions or exon transcription elongation, three of which were associated with the early embryo (EE_1, EE_2 and EE_3) and one with the third-stage larva (L3_2); subcellular localisation; expressed sequence tags (ESTs) ‘best-hit’ by BLAT (BLAT_Caen_EST_BEST in WormBase); RNAi probes (RNAi_primary) and peptide fragments from mass spectrometry (mass_spec_genome); scRNA-seq data (number of cells with transcription – num_cells_expressed) and transcription profiles of selected cells (e.g., cele.010.023.TCGTAGAGAA – in the germline) (Table S8).
Fourth, we assessed the correlation between 28 individual (highly-predictive) features and gene essentiality upon pairwise comparison (Fig. 3a). The correlations ranged between 0.1 and 0.35, showing that no single feature correlated perfectly with essentiality, which justified the use of multivariate methods for prediction using ML models. When we assessed the pairwise correlations among the 28 features (378 tests; Fig. 3b), most (>99%) values were between −0.5 and +0.5, and 12 (<1%) were >0.5. A strong correlation was recorded for chromatin-state markers in EE_1 to EE_3 and L3_2; num_cells_expressed; and scRNA-seq for cele.010.023.TCGTAGAGAA. Interestingly, num_cells_expressed also correlated positively with BLAT_Caen_EST_BEST, and the subcellular localisations ‘cytoplasm’ and ‘nucleus’ correlated negatively with ‘endoplasmic reticulum’ (Fig. 3b).
Fifth, we assessed the performances of the six individual ML models to predict essentiality from the NR data set using the final set of 28 highly-predictive features (NR_SELECTED data set). High ROC-AUCs (>0.95) were achieved for training sets. PR-AUCs were consistently ~1.0 for the XGB, GBM and RF models, compared with performances of ~0.98–0.85 for NN, 0.88–0.84 for SVM and 0.78–0.74 for GLM (Fig. 2b). For test sets, ROC-AUCs were >0.92 for all six ML models, and PR-AUCs were 0.85–0.96 for XGB, GBM and RF, and 0.65–0.77 for SVM, NN and GLM. An evaluation of the median importance of each of the 28 highly-predictive features for all six ML models showed that ‘num_cells_expressed’ (71.26), ‘BLAT_Caen_EST_BEST’ (66.62) and ‘RNAi_primary’ (54.98) were the strongest predictors using NR_SELECTED data (Table S8). Using the same data set, we assessed variation in the ROC-AUCs and PR-AUCs by bootstrapping (random subsampling; 90% of the data for training; 10% for testing; n = 1000) employing XGB, GBM or RF (Fig. 2c); ROC-AUCs were consistently ≥0.90 for these three ML models, with XGB and GBM each achieving a median ROC-AUC of >0.98. PR-AUCs were consistently >0.7 for these three models, occasionally achieving ~1, with a median of between 0.85 and 0.90.
Sixth, the entire NR_SELECTED data set was used to predict essentiality for each individual gene included here employing each of the six models, and essentiality probabilities were calculated (Tables S9 and S10). Using the best performing models (i.e. GBM, RF and XGB), 755 genes were assigned as ‘essential’ based on high median probabilities (>0.70). Almost 65% of these genes (n = 490) had been annotated previously, based on ESs, as essential, 34% (n = 255) as conditionally-essential and 1% (n = 10) as non-essential. For each of the data sets (i.e. FULL, NR and NR_SELECTED), we then assessed the effects of parameter-tuning on ROC-AUC using 5-fold cross-validation for each of the six final ML models (Figs. S4–S6). For the parameters tested, we observed that the prediction performance (ROC-AUC) was superior using a regularisation-parameter value of <0.02 for GLM; a sigma-parameter of <0.02 for SVM; >1000 boosting iterations and a max-tree-depth of ≥3 for both XGB and GBM; >10 hidden-layer units for NN; and 10–50 randomly selected predictors for RF.
Finally, the validation of the final ML predictions against independent mutant allele data available in the GExplore database [48] (Fig. 4) showed that 7.25% of all C. elegans genes studied here have at least one “lethal” phenotype recorded in GExplore. The ratios of genes with a “lethal” phenotype were higher (>20%) for genes with higher ML probabilities (>0.7), and these ratios decreased to 7.25%, as more genes with lower probabilities were included in the search. Conversely, the ratios were consistently low (<5%) for genes with the lowest ML prediction probabilities (<0.1), and increased to 7.25% as more genes with higher ML prediction probabilities were included.
3.4. Essential genes are usually located centrally, and SNPs on the arms, of autosomal chromosomes of C. elegans
We calculated the numbers of SNPs per 1000 bp and then plotted them onto chromosomes (Fig. 3c). Interestingly, there were considerably more SNPs along chromosome arms than in the centres, except for sex chromosome X, where SNPs were evenly distributed. Then, we investigated the respective distributions (density plots) of essential, conditionally-essential and non-essential genes on chromosomes (Fig. 3d). We showed that essential genes usually have a higher density in the middle of autosomal chromosomes I to V than on their arms, whereas the density of non-essential genes was higher on the arms of autosomal chromosomes (I–V; Fig. 3d). Interestingly, essential and conditionally-essential genes had similar distributions on all autosomal chromosomes, except chromosome III, where the distributions of conditionally-essential and non-essential genes were similar. On sex chromosome X, there appeared to be a preference for essential genes on its left arm.
The gene density patterns appeared to match SNP densities on chromosomes. Notably, essential genes are preferentially located within regions of low SNP density, as these genes tend to be more conserved than non-essential ones. Moreover, most essential genes are found on autosomal chromosomes (n = 173 on chromosome I; 124 on II; 131 on III; 120 on IV; 103 on V), and only a small number (n = 19) on the sex chromosome. Using Kolmogorov-Smirnov tests, we compared gene densities along chromosomes; there were significant differences between essential and non-essential genes (p = 1.243e−05), and between non-essential and conditionally-essential genes (p = 4.79e−12), but not between essential and conditionally-essential genes (p = 4.651e−1).
3.5. Gene ontology (GO) and transcription enrichments pertaining to essential genes
Multiple separate GO enrichment analyses (WormBase, WebGestalt and DAVID) revealed information on the biological processes, cellular components and molecular functions in which essential genes play a role. For biological processes, the three most significant terms were ‘peptide biosynthetic process’ (99 genes), ‘cellular macromolecule localisation’ (73) and ‘embryo development ending in birth or egg hatching’ (66) (WormBase; p ≤ 1.3e−10; Table S11); ‘embryo development ending in birth or egg hatching’, ‘ribonucleoprotein complex biogenesis’ and ‘translation’ (WebGestalt; Fig. S7); ‘translation’ (88 genes), ‘protein transport’ (26) and ‘intracellular protein transport’ (24) (DAVID; p ≤ 2e−10; Table S12). For cellular components, predominating terms were ‘organelle’ (412 genes), ‘cytoplasm’ (325) and ‘envelope’ (60) (WormBase; p ≤ 1.7e−08; Table S11); ‘cytosolic large ribosomal subunit’, ‘cytosolic ribosome’ and ‘large ribosomal subunit’ (WebGestalt; Fig. S8); ‘intracellular ribonucleoprotein complex’ (63 genes), ‘ribosome’ (62) and ‘cytosolic large ribosomal subunit’ (30) (DAVID; p ≤ 1.5e−07; Table S12). For molecular functions, highly-enriched terms were ‘structural constituent of ribosome’ (62 genes), ‘protein heterodimerisation activity’ (31) and ‘primary active transmembrane transporter activity’ (18) (WormBase; p ≤ 2.9e−05; Table S11); ‘ATPase activity, coupled to transmembrane movement of ions’, ‘structural constituent of ribosome’ and ‘structural molecule activity’ (WebGestalt; Fig. S9); ‘nucleotide binding’ (104 genes), ‘ATP binding’ (78) and ‘structural constituent of ribosome’ (61) (DAVID; p ≤ 1.3e−08; Table S12).
For transcription (WormExp database; Table S13), there was an enrichment of: targets for small RNAs bound to CSR-1, an Argonaute responsible for chromosome segregation and the protection of germline gene expression [49], [50]; gene down-regulation in gonad-ablated C. elegans; constitutive post-embryonic gene expression; and matches to orthologues in D. melanogaster and S. cerevisiae (Table S14). The transcription of most essential genes (92.6% of 500) was enriched in the ‘reproductive system’ (including germline and gonad tissues) (WormBase; Table S14).
4. Discussion
Here, we demonstrate that gene essentiality in C. elegans can be reliably predicted using ML models trained using: (i) sets of genes which are well-annotated for essentiality and (ii) features selected and/or engineered from ‘omics data. We also identify highly-predictive features and use multiple gene ontology and tissue enrichment analyses to infer the functions of essential genes in the worm.
The prediction of essentiality from published functional genomic (i.e. RNAi) experiments can be challenging because of ambiguous or contradictory results arising from variation in C. elegans strains, techniques (soaking vs. injection) and experimental conditions, a lack of repeatability or reproducibility of findings and, in some instances, off-target effects in RNAi [51]. In order not to exclude data for genes that might be essential, we created a scoring system for the inclusion of conditionally-essential genes with ambiguous or variable results from previously published studies. Indeed, the present investigation using well-trained ML models showed that some of the genes provisionally assigned as ‘conditionally-essential’ (e.g., dpy-23 [WBGene00001082]; rpl-7 [WBGene00004418] and vha-15 [WBGene00020507]) are highly likely to be essential (Table S10). Consistent with this, “lethal” phenotypes have been recorded for dpy-23 (WBGene00001082) and vha-15 (WBGene00020507) in gene knockout data sets in the GExplore database. In addition, 10 genes provisionally assigned as non-essential appear to be essential based on ML predictions. For instance, phenotype information linked to essentiality upon knockout (‘L1 arrest’ and ‘reduced brood size’) has been reported for vav-1 (WBGene00006887) in GExplore. Nonetheless, further work is required to verify essentiality predictions experimentally using classical or modern (e.g., CRISPR/Cas9) gene knock-out methods [52].
Employing large-scale feature engineering, we identified strong essentiality predictors, not previously described, and showed that it is possible to predict gene essentiality reliably without protein–protein interaction network data, which can be error-prone [53]. We identified a small number of features (n = 28) that, collectively, contributed to a marked improvement in ML prediction performance. Some of these predictors relate to exon number, GC content and subcellular localisation, identified previously by other workers [23], whereas others are novel genomic features such as scRNA-seq or epigenetic markers. Particularly exciting were the four epigenetic markers, EE_1, EE_2, EE_3 and L3_2, identified as being strong predictors of essential genes. For instance, EE_1 and EE_2 corresponded to chromatin states defined in early embryos by the markers H3K4me3 and H3K4me2, respectively [34]. These markers are known to be involved in cellular differentiation [54], lifespan [55] and/or aging [56], are present in germline cells [57] and are represented throughout the life cycle of C. elegans [34]. Interestingly, H3K4me3 has also been associated with gene essentiality in human cells [58]. Previous work [59] has shown that chromatin organisation is highly variable among select metazoans, which would partially explain the distinctiveness in the spectra of essential genes among species [26]. This variability invites studies to explore which features predictive of essentiality are common to, or distinct among, eukaryotic species representing closely and distantly related groups.
The ML models trained using selected features reliably predicted essential genes in C. elegans, based on a thorough evaluation using multiple independent test sets and threshold-independent metrics (ROC-AUC and PR-AUC). PR-AUC is recognised to be more informative than ROC-AUC for ‘imbalanced’ data sets (e.g., those with markedly more non-essential than essential genes) [60]. In our systematic evaluation, we showed that predictions were consistent among the six ML methods and among data sets of different sizes, with high prediction performance being achieved using a data set (i.e. NR_SELECTED) that was less prone to sequence bias. Moreover, the ensemble-based ML methods (XGB, GBM and RF) were shown to be most suitable for essentiality prediction, in accordance with other recent findings [26], [61]. Here, we calculated probabilities for gene essentiality based on predictions made using high-performing ML methods trained with the NR_SELECTED data set. In addition, a validation conducted using independent functional genomic (mutant allele) data revealed a clear relationship between the ML predictions and the likelihood of a “lethal” phenotype upon knockout. Future work should focus on experimentally confirming our ML-based predictions.
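The distinction between ROC-AUC and PR-AUC on imbalanced data can be illustrated with a minimal sketch. This is not the evaluation code used in the present study (which is available in the repository cited under Data and code availability); the function names `roc_auc` and `average_precision`, the toy gene ranking and the 5% prevalence of essential genes are all assumptions made purely for illustration. PR-AUC is approximated here by average precision, a common step-wise estimate.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision values measured at the
    rank of each positive. Assumes scores contain no ties."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy data: 100 genes, 5 of them essential (label 1).
# Genes are scored by rank; the essential genes sit at ranks 3, 6, 9, 12, 15.
n_genes = 100
positive_ranks = {3, 6, 9, 12, 15}
scores = [n_genes - rank for rank in range(1, n_genes + 1)]
labels = [1 if rank in positive_ranks else 0 for rank in range(1, n_genes + 1)]

print(f"ROC-AUC:    {roc_auc(labels, scores):.3f}")
print(f"PR-AUC:     {average_precision(labels, scores):.3f}")
print(f"Prevalence: {sum(labels) / len(labels):.3f}")
```

In this toy ranking, ROC-AUC is approximately 0.94, whereas PR-AUC is approximately 0.33 (against a chance-level baseline equal to the prevalence, 0.05). The large number of correctly ranked non-essential genes inflates ROC-AUC, while PR-AUC reflects that two of every three top-ranked genes are false positives, which is why both metrics are worth reporting for imbalanced data sets.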
We showed that essential genes in C. elegans tend to be located in or near the centre of autosomal chromosomes, and are positively correlated with low SNP densities and epigenetic markers in promoter regions [34], [62]. GO results indicated that essential genes in C. elegans are involved in protein and nucleotide processing, are transcribed in most cells, are enriched in reproductive tissues and/or are targets for small RNAs bound to the Argonaute CSR-1. It has been reported that CSR-1 and its targets are involved in chromosome segregation [49] and in the protection of germline cells against piRNA-mediated silencing [50]. This Argonaute appears to be responsible for holocentromere organisation [63], [64], particularly in nematodes of evolutionary clades V and III [64], [65]. Collectively, this information should stimulate future investigations of the chromosomal structures and intricate molecular mechanisms linked to gene essentiality, which likely govern the life/survival of nematodes of these clades. Interestingly, selected (non-conserved) essential genes in C. elegans are known to be involved in chromosome segregation [66] and exhibit characteristics of house-keeping genes [67], which might suggest an interplay between epigenetic markers and small RNA pathways in the germline [68], linked to a transcription ‘memory’ profile of gene essentiality that is transmitted to the next generation of cells.
5. Conclusion
This study shows that well-trained ML methods can be useful tools to predict essential genes in C. elegans. From a biological perspective, our findings show that essential genes tend to be located in or near the centre of autosomal chromosomes; are positively correlated with low SNP densities and epigenetic markers in promoter regions; are involved in protein and nucleotide processing; are transcribed in most cells; are enriched in reproductive tissues or are targets for small RNAs bound to the Argonaute CSR-1. Based on these results, we speculate that there is an intimate interplay between epigenetic markers and small RNA pathways in the germline, with one or more transcription-based memory profile(s). From an informatic perspective, although the present ML approach seems promising for broader application, it remains to be established whether essentiality can be reliably predicted in distantly related taxa, based on evidence for C. elegans (cf. [26]). This aspect requires in-depth evaluation. As a first step, we propose to predict/explore gene essentiality in D. melanogaster, for which extensive data and feature sets are available, using the present ML approach, and then to compare findings with those achieved here for C. elegans. Such an investigation would establish whether there is a panel of concordant features that are strong predictors of essentiality in both of these model organisms (superphylum Ecdysozoa). If successful, the next step would be to assess the applicability of our approach to a range of metazoan (invertebrate) taxa for which suitably large and informative genomic, transcriptomic and/or proteomic data sets are available (in the absence of functional genomic and PPI network data sets), so that a panel of “universal” strong predictors of essentiality can be defined for invertebrates.
6. Data and code availability
The data used herein, the code developed to perform the systematic ML approaches as well as information regarding software versions and attached libraries are available at: https://bitbucket.org/tuliocampos/essential_elegans. A static version linked to this publication is available at: https://doi.org/10.6084/m9.figshare.11533101.
CRediT authorship contribution statement
Tulio L. Campos: Conceptualization, Methodology, Software, Validation, Data curation, Writing - original draft, Visualization, Investigation, Writing - review & editing. Pasi K. Korhonen: Conceptualization, Supervision, Software, Validation, Visualization, Investigation, Writing - review & editing. Paul W. Sternberg: Visualization, Investigation, Writing - review & editing. Robin B. Gasser: Conceptualization, Supervision, Visualization, Investigation, Writing - review & editing. Neil D. Young: Conceptualization, Supervision, Visualization, Investigation, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This research was funded by grants from the National Health and Medical Research Council (NHMRC) of Australia and the Australian Research Council (ARC) to RBG, PKK and/or NDY. Other support to RBG was from Melbourne Water. NDY was supported by a Career Development Fellowship, and PKK by an Early Career Research Fellowship, from NHMRC. TLC was a recipient of a Research Training Program Scholarship from the Australian Government and is also supported by the Oswaldo Cruz Foundation (Fiocruz/Brazil). PWS was supported by U.S. National Institutes of Health grant U24-HG002223.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.csbj.2020.05.008.
Contributor Information
Robin B. Gasser, Email: robinbg@unimelb.edu.au.
Neil D. Young, Email: nyoung@unimelb.edu.au.
References
- 1. Zhan T., Boutros M. Towards a compendium of essential genes - From model organisms to synthetic lethality in cancer cells. Crit Rev in Biochem Mol Biol. 2016;51:74–85. doi: 10.3109/10409238.2015.1117053.
- 2. Howe D.G., Blake J.A., Bradford Y.M., Bult C.J., Calvi B.R., Engel S.R. Model organism data evolving in support of translational medicine. Lab Anim (NY) 2018;47:277–289. doi: 10.1038/s41684-018-0150-4.
- 3. Giansanti M.G., Fraschini R. Editorial: Model organisms: a precious resource for the understanding of molecular mechanisms underlying human physiology and disease. Front Genet. 2019;10:822. doi: 10.3389/fgene.2019.00822.
- 4. Caenorhabditis elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282:2012–2018.
- 5. Clark D.V., Rogalski T.M., Donati L.M., Baillie D.L. The unc-22(IV) region of Caenorhabditis elegans: genetic analysis of lethal mutations. Genetics. 1988;119:345–353. doi: 10.1093/genetics/119.2.345.
- 6. Kamath R.S., Ahringer J. Genome-wide RNAi screening in Caenorhabditis elegans. Methods. 2003;30:313–321. doi: 10.1016/s1046-2023(03)00050-1.
- 7. Kamath R.S., Fraser A.G., Dong Y., Poulin G., Durbin R., Gotta M. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature. 2003;421:231–237. doi: 10.1038/nature01278.
- 8. Sönnichsen B., Koski L.B., Walsh A., Marschall P., Neumann B., Brehm M. Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature. 2005;434:462–469. doi: 10.1038/nature03353.
- 9. Wang H., Park H., Liu J., Sternberg P.W. An efficient genome editing strategy to generate putative null mutants in Caenorhabditis elegans using CRISPR/Cas9. G3 (Bethesda) 2018;8:3607–3616. doi: 10.1534/g3.118.200662.
- 10. Rogalski T.M., Moerman D.G., Baillie D.L. Essential genes and deficiencies in the unc-22 IV region of Caenorhabditis elegans. Genetics. 1982;102:725–736. doi: 10.1093/genetics/102.4.725.
- 11. Meneely P.M., Herman R.K. Lethals, steriles and deficiencies in a region of the X chromosome of Caenorhabditis elegans. Genetics. 1979;92:99–115. doi: 10.1093/genetics/92.1.99.
- 12. Dickinson J.D., Goldstein B. CRISPR-Based methods for Caenorhabditis elegans genome engineering. Genetics. 2016;202:885–901. doi: 10.1534/genetics.115.182162.
- 13. Harris T.W., Arnaboldi V., Cain S., Chan J., Chen W.J., Cho J. WormBase: a modern model organism information resource. Nucleic Acids Res. 2019;8:D762–D767. doi: 10.1093/nar/gkz920.
- 14. Zhou X., Xu F., Mao H., Ji J., Yin M., Feng X. Nuclear RNAi contributes to the silencing of off-target genes and repetitive sequences in Caenorhabditis elegans. Genetics. 2014;197:121–132. doi: 10.1534/genetics.113.159780.
- 15. Mohr S.E., Perrimon N. RNAi screening: new approaches, understandings, and organisms. Wiley Interdiscip Rev RNA. 2012;3:145–158. doi: 10.1002/wrna.110.
- 16. Hagen J., Lee E.F., Fairlie W.D., Kalinna B.H. Functional genomics approaches in parasitic helminths. Parasite Immunol. 2012;34:163–182. doi: 10.1111/j.1365-3024.2011.01306.x.
- 17. Castelletto M.L., Gang S.S., Hallem E.A. Recent advances in functional genomics for parasitic nematodes of mammals. J Exp Biol. 2020;7:223 (Pt Suppl 1).
- 18. Zhong W., Sternberg P.W. Genome-wide prediction of C. elegans genetic interactions. Science. 2006;311:1481–1484. doi: 10.1126/science.1123287.
- 19. Lee I., Lehner B., Crombie C., Wong W., Fraser A.G., Marcotte E.M. A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet. 2008;40:181–188. doi: 10.1038/ng.2007.70.
- 20. Qin Z., Johnsen R., Yu S., Chu J.S., Baillie D.L., Chen N. Genomic identification and functional characterization of essential genes in Caenorhabditis elegans. G3 (Bethesda) 2018;8:981–997. doi: 10.1534/g3.117.300338.
- 21. Yu S., Zheng C., Zhou F., Baillie D.L., Rose A.M., Deng Z. Genomic identification and functional analysis of essential genes in Caenorhabditis elegans. BMC Genomics. 2018;19:871. doi: 10.1186/s12864-018-5251-3.
- 22. Doyle M.A., Gasser R.B., Woodcroft B.J., Hall R.S., Ralph S.A. Drug target prediction and prioritization: using orthology to predict essentiality in parasite genomes. BMC Genomics. 2010;11:222. doi: 10.1186/1471-2164-11-222.
- 23. Dong C., Jin Y.T., Hua H.L., Wen Q.F., Luo S., Zheng W.X. Comprehensive review of the identification of essential genes using computational methods: focusing on feature implementation and assessment. Brief Bioinform. 2018;21:171–181. doi: 10.1093/bib/bby116.
- 24. Li M., Wang J.X., Wang H., Pan Y. Identification of essential proteins from weighted protein-protein interaction networks. J Bioinf Comput Biol. 2013;11:1341002. doi: 10.1142/S0219720013410023.
- 25. Zhang X., Acencio M.L., Lemke N. Predicting essential genes and proteins based on machine learning and network topological features: a comprehensive review. Front Physiol. 2016;7:75. doi: 10.3389/fphys.2016.00075.
- 26. Campos T.L., Korhonen P.K., Gasser R.B., Young N.D. An evaluation of machine learning approaches for the prediction of essential genes in eukaryotes using protein sequence-derived features. Comput Struct Biotechnol J. 2019;17:785–796. doi: 10.1016/j.csbj.2019.05.008.
- 27. Howe K.L., Bolt B.J., Cain S., Chan J., Chen W.J., Davis P. WormBase 2016: expanding to enable helminth genomic research. Nucleic Acids Res. 2016;44:D774–D780. doi: 10.1093/nar/gkv1217.
- 28. Birney E., Andrews T.D., Bevan P., Caccamo M., Chen Y., Clarke L. An overview of Ensembl. Genome Res. 2004;14:925–928. doi: 10.1101/gr.1860604.
- 29. Spencer W.C., Zeller G., Watson J.D., Henz S.R., Watkins K.L., McWhirter R.D. A spatial and temporal map of C. elegans gene expression. Genome Res. 2011;21:325–341. doi: 10.1101/gr.114595.110.
- 30. Saito T.L., Hashimoto S., Gu S.G., Morton J.J., Stadler M., Blumenthal T. The transcription start site landscape of C. elegans. Genome Res. 2013;23:1348–1361. doi: 10.1101/gr.151571.112.
- 31. Cao J., Packer J.S., Ramani V., Cusanovich D.A., Huynh C., Daza R. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science. 2017;357:661–667. doi: 10.1126/science.aam8940.
- 32. Yang W., Dierking K., Schulenburg H. WormExp: a web-based application for a Caenorhabditis elegans-specific gene expression enrichment analysis. Bioinformatics. 2016;32:943–945. doi: 10.1093/bioinformatics/btv667.
- 33. Kiniry S.J., O'Connor P.B.F., Michel A.M., Baranov P.V. Trips-Viz: a transcriptome browser for exploring Ribo-Seq data. Nucleic Acids Res. 2019;47:D847–D852. doi: 10.1093/nar/gky842.
- 34. Evans K.J., Huang N., Stempor P., Chesney M.A., Down T.A., Ahringer J. Stable Caenorhabditis elegans chromatin domains separate broadly expressed and developmentally regulated genes. Proc Natl Acad Sci USA. 2016;113:E7020–E7029. doi: 10.1073/pnas.1608162113.
- 35. Ikegami K., Egelhofer T.A., Strome S., Lieb J.D. Caenorhabditis elegans chromosome arms are anchored to the nuclear membrane via discontinuous association with LEM-2. Genome Biol. 2010;11:R120. doi: 10.1186/gb-2010-11-12-r120.
- 36. Daugherty A.C., Yeo R.W., Buenrostro J.D., Greenleaf W.J., Kundaje A., Brunet A. Chromatin accessibility dynamics reveal novel functional enhancers in C. elegans. Genome Res. 2017;27:2096–2107. doi: 10.1101/gr.226233.117.
- 37. Cook D.E., Zdraljevic S., Roberts J.P., Andersen E.C. CeNDR, the Caenorhabditis elegans natural diversity resource. Nucleic Acids Res. 2017;45:D650–D657. doi: 10.1093/nar/gkw893.
- 38. Krogh A., Larsson B., von Heijne G., Sonnhammer E.L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305:567–580. doi: 10.1006/jmbi.2000.4315.
- 39. Petersen T.N., Brunak S., von Heijne G., Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011;8:785–786. doi: 10.1038/nmeth.1701.
- 40. Horton P., Park K.-J., Obayashi T., Fujita N., Harada H., Adams-Collier C.J. WoLF PSORT: protein localization predictor. Nucleic Acids Res. 2007;35:W585–W587. doi: 10.1093/nar/gkm259.
- 41. Almagro Armenteros J.J., Sonderby C.K., Sonderby S.K., Nielsen H., Winther O. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics. 2017;33:3387–3395. doi: 10.1093/bioinformatics/btx431.
- 42. Linding R., Jensen L.J., Diella F., Bork P., Gibson T.J., Russell R.B. Protein disorder prediction: implications for structural proteomics. Structure. 2003;11:1453–1459. doi: 10.1016/j.str.2003.10.002.
- 43. Cingolani P., Platts A., Wang le L., Coon M., Nguyen T., Wang L. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92. doi: 10.4161/fly.19695.
- 44. Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461.
- 45. Huang D.W., Sherman B.T., Tan Q., Collins J.R., Alvord W.G., Roayaei J. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8:R183. doi: 10.1186/gb-2007-8-9-r183.
- 46. Wang J., Vasaikar S., Shi Z., Greer M., Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45:W130–W137. doi: 10.1093/nar/gkx356.
- 47. Angeles-Albores D., Lee R.Y.N., Chan J., Sternberg P.W. Tissue enrichment analysis for C. elegans genomics. BMC Bioinf. 2016;17:366. doi: 10.1186/s12859-016-1229-9.
- 48. Hutter H., Suh J. GExplore 1.4: an expanded web interface for queries on Caenorhabditis elegans protein and gene function. Worm. 2016;5 doi: 10.1080/21624054.2016.1234659.
- 49. Claycomb J.M. The Argonaute CSR-1 and its 22G-RNA cofactors are required for holocentric chromosome segregation. Cell. 2009;139:123–134. doi: 10.1016/j.cell.2009.09.014.
- 50. Wedeles C.J., Wu M.Z., Claycomb J.M. Protection of germline gene expression by the C. elegans Argonaute CSR-1. Dev Cell. 2013;27:664–671. doi: 10.1016/j.devcel.2013.11.016.
- 51. Fellmann C., Lowe S.W. Stable RNA interference rules for silencing. Nat Cell Biol. 2014;16:10–18. doi: 10.1038/ncb2895.
- 52. Evers B., Jastrzebski K., Heijmans J.P., Grernrum W., Beijersbergen R.L., Bernards R. CRISPR knockout screening outperforms shRNA and CRISPRi in identifying essential genes. Nat Biotechnol. 2016;34:631–633. doi: 10.1038/nbt.3536.
- 53. Kuchaiev O., Rasajski M., Higham D.J., Przulj N. Geometric de-noising of protein-protein interaction networks. PLoS Comput Biol. 2009;5 doi: 10.1371/journal.pcbi.1000454.
- 54. Benayoun B.A., Pollina E.A., Ucar D., Mahmoudi S., Karra K., Wong E.D. H3K4me3 breadth is linked to cell identity and transcriptional consistency. Cell. 2014;158:673–688. doi: 10.1016/j.cell.2014.06.027.
- 55. Han S., Schroeder E.A., Silva-Garcia C.G., Hebestreit K., Mair W.B., Brunet A. Mono-unsaturated fatty acids link H3K4me3 modifiers to C. elegans lifespan. Nature. 2017;544:185–190. doi: 10.1038/nature21686.
- 56. Pu M., Wang M., Wang W., Velayudhan S.S., Lee S.S. Unique patterns of trimethylation of histone H3 lysine 4 are prone to changes during aging in Caenorhabditis elegans somatic cells. PLoS Genet. 2018;14 doi: 10.1371/journal.pgen.1007466.
- 57. Kelly W.G. Transgenerational epigenetics in the germline cycle of Caenorhabditis elegans. Epigenetics Chromatin. 2014;7:6. doi: 10.1186/1756-8935-7-6.
- 58. Chen H., Zhang Z., Jiang S., Li R., Li W., Zhao C., et al. New insights on human essential genes based on integrated analysis and the construction of the HEGIAP web-based platform. Brief Bioinform. 2019;pii: bbz072.
- 59. Ho J.W., Jung Y.L., Liu T., Alver B.H., Lee S., Ikegami K. Comparative analysis of metazoan chromatin organization. Nature. 2014;512:449–452. doi: 10.1038/nature13415.
- 60. Saito T., Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10 doi: 10.1371/journal.pone.0118432.
- 61. Zhong J., Sun Y., Peng W., Xie M., Yang J., Tang X. XGBFEMF: An XGBoost-based framework for essential protein prediction. IEEE Trans Nanobiosci. 2018;17:243–250. doi: 10.1109/TNB.2018.2842219.
- 62. Garrigues J.M., Sidoli S., Garcia B.A., Strome S. Defining heterochromatin in C. elegans through genome-wide analysis of the heterochromatin protein 1 homolog HPL-2. Genome Res. 2015;25:76–88. doi: 10.1101/gr.180489.114.
- 63. Subirana J.A., Messeguer X. A satellite explosion in the genome of holocentric nematodes. PLoS ONE. 2013;8 doi: 10.1371/journal.pone.0062221.
- 64. Wedeles C.J., Wu M.Z., Claycomb J.M. A multitasking Argonaute: exploring the many facets of C. elegans CSR-1. Chromosome Res. 2013;21:573–586. doi: 10.1007/s10577-013-9383-7.
- 65. Tu S., Wu M.Z., Wang J., Cutter A.D., Weng Z., Claycomb J.M. Comparative functional characterization of the CSR-1 22G-RNA pathway in Caenorhabditis nematodes. Nucleic Acids Res. 2015;43:208–224. doi: 10.1093/nar/gku1308.
- 66. Verster A.J., Styles E.B., Mateo A., Derry W.B., Andrews B.J., Fraser A.G. Taxonomically restricted genes with essential functions frequently play roles in chromosome segregation in Caenorhabditis elegans and Saccharomyces cerevisiae. G3 (Bethesda) 2017;7:3337–3347. doi: 10.1534/g3.117.300193.
- 67. Eisenberg E., Levanon E.Y. Human housekeeping genes, revisited. Trends Genet. 2013;29:569–574. doi: 10.1016/j.tig.2013.05.010.
- 68. Gushchanskaia E.S., Esse R., Ma Q.C., Lau N.C., Grishok A. Interplay between small RNA pathways shapes chromatin landscapes in C. elegans. Nucleic Acids Res. 2019;47:5603–5616. doi: 10.1093/nar/gkz275.