Abstract
A first clue to gene function can be obtained by examining whether a gene is required for life in certain standard conditions, that is, whether a gene is essential. In bacteria, essential genes are usually identified by high-density transposon mutagenesis followed by sequencing of insertion sites (Tn-seq). These studies assign the term “essential” to whole genes rather than the protein domain sequences that encode the essential functions. However, genes can code for multiple protein domains that evolve their functions independently. Therefore, when essential genes code for more than one protein domain, it is possible that only one of the domains is essential. In this study, we defined this subset of genes as “essential domain-containing” (EDC) genes. Using a Tn-seq data set built in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify EDC genes and the essential protein domains they encode. We found forty candidate EDC genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). This analysis included two knockdowns of genes encoding the protein domains of unknown function DUF2213 and DUF4148. These putative essential domains are conserved in more than two hundred bacterial species, including human and plant pathogens. Together, our study suggests that essentiality should be assigned to individual protein domains rather than genes, contributing to a first functional characterization of protein domains of unknown function.
Subject terms: Bacterial genomics, Bacterial genetics, Sequence annotation, Computational biology and bioinformatics, Microbiology, Molecular biology
Introduction
A first step when characterizing gene function should be asking whether a given gene encodes an essential cellular function, that is, whether the gene is necessary for the survival of the organism. A widely accepted method to identify essential genes in bacteria is high-density transposon mutagenesis, followed by Illumina sequencing of the transposon insertion junctions (Tn-seq)1. During Tn-seq, transposon mutant cells are pooled and grown in optimal conditions, allowing cells with a transposon insertion located in a non-essential element to survive. Cells with a transposon insertion in an essential element should be lost or depleted from the population. When transposon insertions are identified by Illumina sequencing, read counts per gene in the central 70–90% of the open reading frame (disruptive insertions) are normalized by gene length and used to predict essentiality. The terminal 5–15% of the sequence at the 3′ and 5′ ends is usually excluded from the analysis, as insertions within the terminal regions are likely non-disruptive2–5. While disrupted genes are regarded as “non-essential,” the method yields a list of putative essential genes as those with zero or very few mapped reads (Fig. 1a, b)3.
Figure 1.
Schematics of Tn-seq reads mapped to the insertion sites in non-essential (a), essential (b), and essential domain-containing (EDC) genes (c–d). The number of transposon insertions relative to the length of the gene (minus the non-informative 10% at the 5′ and 3′ ends) is quantified and used to classify genes as non-essential (a) or essential (b) according to the relative number of reads mapped to that gene. Tn-seq analysis may miss EDC genes, which are genes that contain an essential domain that does not span the whole length of the gene (c–d). Genes are represented by arrows. Tn-seq reads that map to regions of those genes are represented by black boxes. Essential and non-essential regions are colored in red and green, respectively.
Another step towards identifying gene function is the annotation of the protein domains encoded by genes. Protein domains are functional or structural units that can fold, evolve, and function independently. Homology-based protein domain prediction and function assignment are effective starting points for understanding protein function, even when diverse protein architectures add complexity to functional annotations6,7. While domain databases such as Pfam8 and InterPro9 aim to provide maximum sequence coverage to predict protein domain identity, approximately 30% of all domains listed in these databases (Pfam 33.1 and InterPro 81.0) are ‘domains of unknown function (DUFs).’ Single DUFs are usually predicted to span through functionally uncharacterized proteins. However, studies suggest that at least some of these proteins may contain more than one domain10,11.
Although Tn-seq analyses are robust and comprehensive, very few Tn-seq studies12–14 consider that genes may encode more than one protein domain. Tn-seq analysis may classify a gene as “non-essential” due to the presence of transposon insertions in a non-essential coding region, even though the gene also codes for a second, possibly essential domain that does not span the whole gene length3,15,16. We operationally defined this subclass of essential genes as “essential domain-containing” (EDC) genes (Fig. 1c, d) and present a computational pipeline to identify them in a Tn-seq dataset built in Burkholderia cenocepacia K56-217. Unlike the previously reported methods, our method does not require in-depth understanding of computational platforms and generates a list of candidate EDC genes. By analyzing biases in transposon density in genes previously identified as “non-essential”, we found 40 genes where the encoded proteins contained putative essential and non-essential domains. Using a CRISPR interference (CRISPRi)18 platform we developed for Burkholderia19, we experimentally confirmed growth defects, representing the loss of a putative essential function, in 27 EDC gene knockdowns. The identified EDC genes include ten encoding known multidomain proteins and two entirely uncharacterized genes encoding different N-terminal DUFs, demonstrating the utility of the approach. This study highlights that gene essentiality depends on the function of individual protein domains rather than entire proteins.
Results
Identification of EDC genes from Tn-seq data
To identify EDC genes in B. cenocepacia K56-2, we built a custom script that used our previous Tn-seq data17 to select genes that (i) were not previously found to be essential in B. cenocepacia K56-217, and (ii) had an asymmetric distribution of transposon insertions (Fig. 2). The script split each gene into two equal parts and selected genes with reads in only one region to identify genes with transposon insertion biases. We worked under the assumption that (i) each half could represent one functional domain and (ii) one of the domains may be essential while the other may not. We arbitrarily set the parameters “min ratio” and “min reads” to 0 and 0.14, respectively (see Material and Methods and Supplementary Fig. 1). These settings looked for genes that had zero reads at one end, while the number of reads in the non-empty end was at least 14% of that region's length. For example, if a section of a gene was 100 bp in length, it would require at least 14 reads mapped to that section to be considered non-essential. With these settings, the script produced an extensive list of 178 candidate EDC genes (Supplementary Table 1).
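The filtering rule described above can be illustrated with a minimal Python sketch. This is not the authors' script: the function name `is_candidate_edc` and its argument names are hypothetical, and the reading of “min ratio” as the ratio of reads in the emptier half to reads in the fuller half is an assumption made for illustration.

```python
def is_candidate_edc(reads_half_a, reads_half_b, half_len_bp,
                     min_ratio=0.0, min_reads=0.14):
    """Flag a gene whose reads fall in only one half (illustrative sketch).

    Assumption: "min ratio" is compared against (reads in emptier half) /
    (reads in fuller half), so min_ratio = 0 demands an empty half.
    "min reads" is the read count of the fuller half divided by the
    length of that half in bp.
    """
    if half_len_bp <= 0:
        raise ValueError("half_len_bp must be positive")
    low, high = sorted((reads_half_a, reads_half_b))
    if high == 0:
        # No reads in either half: not a biased gene under this rule.
        return False
    ratio_ok = (low / high) <= min_ratio
    density_ok = (high / half_len_bp) >= min_reads
    return ratio_ok and density_ok


# Example from the text: a 100 bp section needs at least 14 mapped reads.
print(is_candidate_edc(0, 14, 100))
print(is_candidate_edc(0, 13, 100))
```

Raising `min_ratio` above 0 would, under this assumed definition, also admit genes with a few reads in the depleted half.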
Figure 2.
Identification of putative essential domain-containing (EDC) genes from a Tn-seq dataset. Tn-seq reads are first mapped to the reference genome. A custom-built script identifies genes with biased location of transposon insertions towards one half of the gene. The script parameters “min ratio” and “min reads” were set such that genes were selected when (i) a half-region of that gene (at least 40% of the total gene length) showed no insertions (min ratio = 0), and (ii) the other half contained mapped reads in at least 14% of that gene-half length (min reads = 0.14) (see Supplementary Fig. 1 and Material and Methods for details). Reads mapping to each 5′ or 3′ 10% end of the gene were discarded from the analysis.
Bioinformatic analysis of the candidate EDC genes
We reasoned that if EDC genes contained essential protein domains, then the essential protein domains may be encoded by essential genes in at least some other bacteria. We therefore searched for essential orthologs of the 178 candidate EDC genes by BLASTx searches against the Database of Essential Genes (DEG)20 using 50% sequence alignment and 30% sequence identity as the cut-off. We found that 40 of the 178 genes had orthologs annotated as “essential” in other bacterial species. We then interrogated the domains encoded by these 40 genes using UniProt21 based on InterPro domains9. InterPro predicts the domain information by matching the protein or nucleic acid sequences against the member databases (collectively known as the InterPro consortium) to identify ‘signatures’ associated with known domains. Thus, the InterPro prediction relies on the availability of sequence characterization and annotation. This analysis showed that, of the 40 candidate EDC genes predicted to be essential by homology with other essential genes, 10 genes encoded multidomain proteins, and 7 of them were well characterized, such as the N-terminal domains of DnaK and NusA (Fig. 3a). The remaining genes were predicted to have one single annotated domain that did not span the whole gene length (19 genes) or encoded uncharacterized proteins (11 genes) (Supplementary Table 2). All 40 genes had transposon insertions located in one half of the gene, showing that the script was able to identify genes with biased transposon insertions (Supplementary Fig. 2). Taken together, these results suggest that the identified genes could be essential due to the presence of essential protein domain orthologues. Notably, 17 DNA regions were identified as coding for new putative essential protein domains (Table 1).
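The ortholog cut-off described above can be sketched in Python as a simple filter over BLAST hits. This is an illustration under stated assumptions, not the authors' procedure: the `Hit` record, the function names, and the example values are hypothetical, and “50% sequence alignment” is interpreted here as the aligned length covering at least 50% of the query, with both lengths in the same units (for example, amino acids).

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout for illustration)."""
    query: str           # candidate EDC gene
    subject: str         # entry in the essential-gene database
    pct_identity: float  # percent identity of the alignment
    aligned_len: int     # alignment length
    query_len: int       # query length, same units as aligned_len


def passes_cutoff(hit, min_identity=30.0, min_coverage=50.0):
    """Apply the identity and alignment-coverage thresholds to one hit."""
    coverage = 100.0 * hit.aligned_len / hit.query_len
    return hit.pct_identity >= min_identity and coverage >= min_coverage


def genes_with_essential_orthologs(hits):
    """Return the sorted set of query genes with at least one passing hit."""
    return sorted({h.query for h in hits if passes_cutoff(h)})


# Made-up hits, used only to exercise the filter.
example_hits = [
    Hit("geneA", "DEG_x1", 45.0, 120, 200),  # 60% coverage: passes
    Hit("geneB", "DEG_x2", 28.0, 190, 200),  # identity below 30%: fails
    Hit("geneC", "DEG_x3", 60.0, 80, 200),   # 40% coverage: fails
]
print(genes_with_essential_orthologs(example_hits))
```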
Figure 3.
Biased transposon insertion identifies putative essential domains of uncharacterized hypothetical proteins. Tn-seq reads from a previous study17 were mapped to the B. cenocepacia K56-2 genome to identify genes predicted to contain essential domains. (a) The script identified the well-characterized essential N-terminal domains of DnaK (BCAL3270) and NusA (BCAL1506). Their respective CRISPRi mutants demonstrated a conditional growth defect. (b) Two uncharacterized genes, BCAM1066 (WQ49_RS16145) and BCAS0158 (WQ49_RS10495), contain the Pfam domains DUF2213 (PF09979) and DUF4148 (PF13663), respectively, at the N-terminal end. The Tn-seq reads map to the C-terminal end of these genes, suggesting that DUF2213 and DUF4148 are essential. Putative essential domains are highlighted in blue. Black triangles represent the transposon insertion sites. Numbers on top of the domains denote amino acid sequence positions. Blue and red lines in the growth curves (a and b) represent growth in the absence and presence of rhamnose, respectively. Growth curves are shown for the most efficient sgRNAs. Growth curve values are the average of three independent biological replicates. Error bars indicate ± SD.
Table 1.
Putative essential genes and domains identified based on biased transposon insertions.
K56-2 locus tag | Homolog J2315 locus tag | Product name | Function | Reads at 5′ half | Reads at 3′ half | Identified putative essential domain |
---|---|---|---|---|---|---|
WQ49_RS00050 | BCAL3469 | Cell division protein FtsL | Essential cell division protein | 0 | 23 | Domain (FtsL) |
WQ49_RS00770 | BCAL3328 | NUDIX hydrolase | Nucleoside-diphosphatase | 0 | 49 | Domain (Nudix hydrolase) |
WQ49_RS00885 | BCAL3305 | Preprotein translocase subunit YajC | Secretase/insertase | 21 | 0 | New |
WQ49_RS01035 | BCAL3270 | DnaK | Chaperone | 0 | 227 | N-terminal Domain |
WQ49_RS02920 | BCAM1451 | Hypothetical protein | Unknown | 43 | 0 | New |
WQ49_RS03160 | BCAM1502 | Hypothetical protein | Unknown | 59 | 0 | New |
WQ49_RS03550 | QU43_RS62245 | Hypothetical protein | Unknown | 33 | 0 | New |
WQ49_RS03805 | BCAM1624 | MaoC family dehydratase | MaoC-like dehydratase | 46 | 0 | New |
WQ49_RS04450 | BCAM1749 | Hypothetical protein | Unknown | 17 | 0 | New |
WQ49_RS07360 | BCAM2338 | Glycosyl transferase family 1 | UDP-glycosyltransferase | 0 | 152 | Domain (Glyco_transf_28) |
WQ49_RS07395 | QU43_RS66100 | Hypothetical protein | Unknown | 0 | 58 | New |
WQ49_RS09185 | BCAS0417 | Cytochrome biogenesis protein CcdA | Electron transfer | 0 | 38 | New |
WQ49_RS10495 | BCAS0158 | Hypothetical protein | Unknown | 0 | 34 | Domain (DUF4148) |
WQ49_RS11915 | BCAL0324 | TatB | Protein Transmembrane transporter | 0 | 57 | Domain (TatA_B_E) |
WQ49_RS12045 | BCAL0298 | Thiamine biosynthesis protein ThiS | Thiamine biosynthesis protein ThiS | 0 | 50 | Domain (ThiS) |
WQ49_RS12280 | BCAL0250 | 50S ribosomal protein L18 | Structural constituent of ribosome | 0 | 65 | Domain (Ribosomal_L18p) |
WQ49_RS12305 | BCAL0245 | RplX | Structural constituent of ribosome | 20 | 0 | Domain (L24-Pfam) |
WQ49_RS12315 | BCAL0243 | 30S ribosomal protein S17 | Structural constituent of ribosome | 0 | 64 | New |
WQ49_RS12365 | BCAL0233 | RpsJ | Structural constituent of ribosome | 0 | 25 | New |
WQ49_RS16145 | BCAM1066 | Hypothetical protein | Unknown | 0 | 425 | Domain (DUF2213) |
WQ49_RS18705 | BCAM0549 | Molecular chaperone GroES | Chaperone | 0 | 21 | Domain (Cpn10) |
WQ49_RS22170 | BCAM2699 | alpha/beta hydrolase | Putative hydrolase | 120 | 0 | Domain (Abhydrolase_3) |
WQ49_RS23945 | BCAL0558 | Cca | 3′-Cytidine-cytidine-tRNA adenylyltransferase | 0 | 79 | Domain (PolyA Polymerase)/Domain (Binding) |
WQ49_RS24070 | BCAL0585 | Hypothetical protein | Unknown | 0 | 23 | New |
WQ49_RS25525 | BCAL0878 | FmdB family transcriptional regulator | Regulatory activity | 0 | 30 | Domain (CxxC_CXXC_SSSS) |
WQ49_RS25680 | BCAL0909 | 16S rRNA maturation RNase YbeY | Endoribonuclease activity | 68 | 0 | Domain (UPF0054) |
WQ49_RS26625 | BCAL2715 | RpmG | Structural constituent of ribosome | 0 | 31 | Domain (Ribosomal_L33) |
WQ49_RS27920 | BCAL2334 | NADH-quinone oxidoreductase subunit K | NADH dehydrogenase | 0 | 21 | Domain (Oxidored_q2) |
WQ49_RS28635 | BCAL2199 | Fe–S cluster assembly transcriptional regulator IscR | DNA-binding transcription factor | 39 | 0 | Domain (Rrf2) |
WQ49_RS29230 | BCAL2091 | 30S ribosomal protein S2 | Structural constituent of ribosome | 0 | 86 | Domain (Ribosomal_S2) |
WQ49_RS30770 | BCAL1788 | Biopolymer transporter ExbD | Transmembrane transporter | 0 | 47 | Domain (ExbD) |
WQ49_RS31735 | NA | Hypothetical protein | Unknown | 0 | 42 | New |
WQ49_RS31805 | BCAL1585 | Transcriptional regulator | DNA binding | 44 | 0 | New |
WQ49_RS32210 | BCAL1506 | NusA | DNA-binding transcription factor | 0 | 93 | Domain (NusA_N) |
WQ49_RS32225 | BCAL1503 | SMC-Scp complex | Cell Division/chromosome separation | 0 | 94 | Domain (SMC) |
WQ49_RS32625 | BCAL1424 | ABC transporter | ATPase | 63 | 0 | New |
WQ49_RS34660 | BCAL0990 | 50S ribosomal protein L32 | Structural constituent of ribosome | 27 | 0 | New |
WQ49_RS34895 | BCAL2925 | 50S ribosomal protein L19 | Structural constituent of ribosome | 0 | 26 | Domain (Ribosomal_L19) |
WQ49_RS35060 | BCAL2958 | Membrane protein | Porin activity | 43 | 0 | Domain (OmpA) |
WQ49_RS03390 | BCAM1545 | LuxR family transcriptional regulator | DNA binding | 251 | 0 | Domain (HTH luxR-type) |
CRISPRi knockdowns of EDC genes show growth defects
To phenotypically characterize the effect of knocking down EDC genes, we used CRISPR interference or CRISPRi19 to create knockdown mutants of the genes of interest. CRISPRi comprises a chromosomally integrated dcas9 under the control of a rhamnose-inducible promoter and plasmid-borne sgRNA driven by a constitutively active synthetic promoter, PJ2311919. Simultaneous expression of dcas9 and a target-specific sgRNA allows the dCas9 to bind the target DNA region and, thus, sterically interfere with transcription by RNA polymerase18,19. To inhibit the expression of the candidate genes, we designed two sgRNAs against each of the candidate genes targeting the start codon and adjacent region on the non-template strand (Supplementary Fig. 3a,c). For phenotypic characterization, we grew the cells in LB with and without rhamnose. Upon induction of dCas9 with rhamnose, 27 out of the 40 candidate genes showed at least 25% growth inhibition relative to the uninduced condition (Supplementary Fig. 3b,d).
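The 25% growth inhibition criterion can be illustrated with a short Python sketch. The text does not state which optical density measurement was compared, so this example assumes replicate endpoint OD600 values for the induced and uninduced cultures; the function names and the example readings are hypothetical.

```python
from statistics import mean


def percent_inhibition(od_uninduced, od_induced):
    """Percent growth inhibition of induced vs. uninduced cultures.

    Assumption: each argument is a list of replicate endpoint OD600
    readings; inhibition = 100 * (1 - mean induced / mean uninduced).
    """
    baseline = mean(od_uninduced)
    if baseline <= 0:
        raise ValueError("uninduced OD must be positive")
    return 100.0 * (1.0 - mean(od_induced) / baseline)


def shows_growth_defect(od_uninduced, od_induced, threshold=25.0):
    """True when inhibition reaches the threshold (25% in the text)."""
    return percent_inhibition(od_uninduced, od_induced) >= threshold


# Made-up triplicate readings, used only to exercise the calculation.
print(shows_growth_defect([1.0, 1.0, 1.0], [0.5, 0.5, 0.5]))
print(shows_growth_defect([1.0, 1.0, 1.0], [0.9, 0.9, 0.9]))
```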
DUF2213 and DUF4148 appear to be essential domains
The presence of DUFs is a common feature of hypothetical or uncharacterized proteins. To initiate functional characterization of DUFs, we focused on two genes containing DUF-coding sequences whose respective CRISPRi mutants demonstrated a conditional growth defect (Fig. 3b). WQ49_RS16145 (BCAM1066) and WQ49_RS10495 (BCAS0158) contain DUF2213 (Pfam accession PF09979) and DUF4148 (Pfam accession PF13663), respectively, at the N-terminal end of the proteins (Fig. 3b). BLAST searches using the BCAM1066 and BCAS0158 genes as queries against the DEG20 showed that BCAM1066 (WQ49_RS16145) had 30% sequence similarity with lysK (B8GXH3) from Caulobacter crescentus, and BCAS0158 (WQ49_RS10495) had 52% sequence identity with a predicted amino acid permease (BPSS1112) from Burkholderia pseudomallei K96243 (data not shown). Mining of the Pfam database (https://pfam.xfam.org/) showed that these DUFs are well conserved across bacterial species: DUF2213 is present in 209 bacterial species, including human pathogens (Acinetobacter baumannii, Enterobacter cloacae, Haemophilus influenzae, Burkholderia cepacia, Shigella flexneri), plant pathogens (Agrobacterium tumefaciens), and biotechnologically relevant species (Pseudomonas putida) (Fig. 4a and Supplementary Table 5). DUF4148 is found in 204 bacterial species, primarily in Burkholderia species (e.g., Burkholderia cepacia, Burkholderia mallei, Burkholderia vietnamiensis) and plant pathogens such as Ralstonia solanacearum (Fig. 4b; Supplementary Table 5). DUF2213 is also present in many phage-related proteins (Fig. 4a). Eight unique domain architectures were observed for proteins containing DUF2213 and five for DUF4148 (Fig. 4c, d). DUF2213 is associated with another essential domain, PF00293, a NUDIX hydrolase (Fig. 4c). In other proteins, DUF2213 is associated with the LPD3 domain (PF18798) and DUF1073 (PF06381), which is also conserved across bacterial species11 (Fig. 4c).
On the other hand, Pfam analysis of DUF4148 shows that DUF4148 differs in domain length among species and is associated with the Pfam domain PF00144, known to confer resistance against β-lactams (Fig. 4d)22. Nonetheless, the encoded N-terminus was highly conserved, suggesting that it is functionally significant. The Pfam-based analysis of species distribution also revealed that DUF2213 is present in six eukaryotic species (five metazoans and one fungal species), whereas DUF4148 is present in five eukaryotic species (three Viridiplantae species and two metazoan species). The widespread distribution of these DUFs suggests that these putative essential domains are functionally important, creating an impetus for further characterization.
Figure 4.
Phylogenetic trees with taxonomic information of DUF2213 (PF09979) and DUF4148 (PF13663) and domain architectures of proteins containing these domains. (a–b) Phylogenetic trees of DUF2213 (a) and DUF4148 (b) across species with taxonomic annotations. DUF2213 is widely distributed within bacterial, archaeal, phage and eukaryotic species, whereas DUF4148 is mostly distributed in bacteria (primarily in Proteobacterial species). Trees shown here are the majority rule consensus trees. Taxonomic annotations were labelled based on the NCBI taxonomy database. Representative bacterial, archaeal, phage and eukaryotic species are highlighted in lilac, yellow, grey and green, respectively. The orange circles on the branches represent the bootstrap values. (c–d) Domain architectures of proteins containing DUF2213 (c) and DUF4148 (d) across species. Numbers on top of the domains in (c) and (d) represent amino acid sequence positions.
Discussion
A first step in the functional characterization of genes is performed through gene deletion or gene silencing and growth phenotype characterization. For genes that encode multidomain proteins performing multiple functions driven by the activity of their individual domains23, the function assigned to a gene could indeed correspond to one of its encoded protein domains and not to the whole protein. That is the case of essential genes identified by Tn-seq1. In standard Tn-seq analysis the condition of essentiality is assigned to genes and not to encoded domains, resulting in incorrect classification of many essential genes as non-essential. Rather, the essentiality assignment pipeline should be revised to analyze the essentiality of encoded individual protein domains24. Indeed, essentiality can be assigned to individual domains of a multidomain protein rather than the entire protein15,16. In this work, we defined as essential-domain-containing (EDC) genes those genes that encode more than one protein domain, with one of the domains coding for an essential function. By analyzing a Tn-seq dataset17 for transposon insertion biases, we show that standard Tn-seq analysis pipelines may miss EDC genes, whose detection often requires either manual curation or additional considerations25.
We validated our approach by identifying genes encoding previously characterized multidomain essential proteins in which the essential function is assigned to one single domain. For instance, our analysis of biases in the Tn-seq dataset showed that the gene region coding for the N-terminal domain of NusA26 is sufficient to mediate the essential function, in agreement with previous work27. Similarly, the B. cenocepacia K56-2 dnaK gene was previously defined as non-essential17; however, we found that the Tn-seq reads mapped onto dnaK were biased toward the region encoding the C-terminal domain (CTD), suggesting that only the N-terminal domain (NTD) is necessary for its essential function (Fig. 3a; Supplementary Fig. 2). DnaK is a multidomain protein and a master regulator of the chaperone network28. It comprises an N-terminal ATPase domain (the NTD) and a C-terminal substrate-binding domain (the CTD)28. Perturbations either within the NTD that abrogate the ATPase activity or within the conserved linker peptide that impair the interdomain mechanistic interaction abolish the in vivo activity of DnaK29,30.
While 14 EDC genes that demonstrated a growth defect when knocked down code for proteins annotated to have a single domain, none of these domains span the entire gene, and transposon insertions are only mapped to the annotated domain (Supplementary Fig. 2). Thus, it is possible that the remaining regions code for novel domains that perform the essential biological functions independently of the adjacent sequences. Indeed, multidomain proteins that are involved in direct protein–protein interactions are more often detected as essential than proteins with a single domain15, hinting towards the functional contribution of individual domains within a protein complex. However, it should be noted that the presence of multiple domains in an essential protein does not necessarily mean that the protein is composed of essential and non-essential domains. An example is the Bacillus subtilis SMC, a multidomain essential protein involved in chromosomal segregation31,32.
We demonstrated a conditional growth defect in 27 out of 40 CRISPRi mutants of EDC genes. It remains a possibility that the sgRNAs designed for CRISPRi-mediated gene silencing of the remaining 13 genes were not efficient in target binding, thus yielding no growth defect. CRISPRi is more effective in blocking transcription initiation than elongation, and is most efficient in silencing gene expression when promoter regions are targeted with gRNAs18,33–35. However, as promoter regions in B. cenocepacia genomes remain largely unannotated, we targeted translation start sites. It remains to be investigated whether targeting the promoter region to block transcription initiation rather than elongation might yield a conditional growth phenotype in the remaining 13 genes.
Eighteen of the 27 EDC gene CRISPRi mutants that demonstrated a conditional growth defect target genes located in an operon (Supplementary Fig. 3). It is therefore possible that, due to the polar effect of CRISPRi, the observed growth defect could result from the transcriptional silencing of any other gene(s) in the same operon. However, we consider this possibility unlikely. These genes (other than the candidate gene in the operon) had transposon insertions greater than the defined threshold in the script across the entire genes (data not shown), suggesting that they are dispensable. The only exceptions are BCAL0245 and BCAL0250, which are located in the same operon (Supplementary Fig. 3). Thus, it remains a possibility that the observed growth defect could be due to transcriptional silencing of either or both genes.
A large portion of the protein domains that lack functional assignment can be grouped within the DUF category. DUFs are members of ever-increasing uncharacterized protein families; they are the object of experimental and computational efforts towards their functional characterization10,36–38. Determining if a DUF is essential is among the first steps in functional characterization. In this study, we focused on two EDC genes that encode putative essential DUFs: DUF2213 and DUF4148. Both domains have a high degree of conservation across diverse phyla, which highlights their biological relevance. DUF2213, a phage-associated domain (PF09979), is well distributed across bacteria and phages. Interestingly, we found that DUF4148 (PF13663) is putatively essential and associated with β-lactamase (PF00144) (Fig. 4).
In summary, our study identified 27 EDC genes whose knockdown produced a growth defect, suggesting the essential nature of one of their protein domains. By leveraging a Tn-seq dataset in B. cenocepacia K56-217, we demonstrate that the essential nature of protein-coding genes is a function of the individual protein domains they encode. The utility of our work lies in the identification of gene regions encoding essential and conserved protein domains, which will help de-orphan the many remaining proteins of unknown function. Therefore, we propose that determining the essentiality of a domain of unknown function should be the first step in the process of defining its function.
Methods
Bacterial strains and growth conditions
The list of bacterial strains and plasmids used in this study is provided in Supplementary Table 3. Bacterial strains were grown in LB-Lennox medium (Difco) at 37 °C. E. coli strain MM290 carrying the helper plasmid pRK2013 was selected in kanamycin 40 µg/mL (Fisher Scientific). Donor strains of E. coli DH5α and B. cenocepacia K56-2 carrying the sgRNA plasmids were selected in trimethoprim 50 µg/mL and 100 µg/mL (Sigma), respectively.
Identification of EDC genes from Tn-Seq dataset
Candidate EDC genes were identified with a custom Python script using the Tn-seq dataset17. The script analyzed every gene previously classified as “non-essential” by splitting it into two equal halves and counting the number of reads mapped to each half-gene. The script then used the parameters “min ratio” and “min reads” as filtering criteria to call EDC genes. “Min ratio” was defined as the desired ratio of reads between the halves of the gene. “Min reads” was defined as the minimum number of reads required in the non-empty half, expressed as a fraction of that half's length in bp. Min ratio was set to 0 and min reads to 0.14, so that one half had no reads and the other half had a number of reads equal to at least 14% of its length. For each gene, 10% from each end of the gene was discarded from the analysis. The parameters can be changed to yield either more stringent or more permissive results. The script is available at https://github.com/cardonalab/EssentialDomains
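The trimming and splitting step can be sketched as follows. This is a minimal illustration and not the published script: the function name `reads_per_half`, the inclusive coordinate convention, the `(position, read_count)` input format, and the example values are all assumptions.

```python
def reads_per_half(gene_start, gene_end, insertions, trim_percent=10):
    """Count reads in each half of a gene after trimming both ends.

    gene_start, gene_end: inclusive coordinates with gene_start <= gene_end.
    insertions: iterable of (position, read_count) pairs.
    Returns (reads_first_half, reads_second_half, half_length_bp).
    Strand is ignored in this sketch, so "first half" means the half with
    lower coordinates, which is not necessarily the 5' half.
    """
    insertions = list(insertions)  # allow generators; we iterate twice
    length = gene_end - gene_start + 1
    trim_bp = length * trim_percent // 100  # integer bp removed per end
    lo = gene_start + trim_bp
    hi = gene_end - trim_bp
    half = (hi - lo + 1) // 2  # for odd windows the second half is 1 bp longer
    midpoint = lo + half       # first coordinate of the second half
    first = sum(n for pos, n in insertions if lo <= pos < midpoint)
    second = sum(n for pos, n in insertions if midpoint <= pos <= hi)
    return first, second, half


# Made-up insertions for a hypothetical 1000 bp gene at coordinates 1..1000.
# Positions 50 and 950 fall in the trimmed 10% ends and are ignored.
example = [(50, 5), (150, 3), (500, 2), (501, 7), (900, 1), (950, 9)]
print(reads_per_half(1, 1000, example))
```

With 10% trimmed from each end, each half covers 40% of the gene length, which matches the half-region size given in the Figure 2 legend.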
Bioinformatic analysis
Orthologous essential genes were identified using BLASTx against DEG 1520. Multidomain information was fetched from the UniProt database based on Pfam8 and InterPro9 domain features. DUF-containing genes were characterized using the Pfam tool available on the Pfam website (https://pfam.xfam.org/). Domain sequences were retrieved in FASTA format from the Pfam database8 and aligned with Clustal Ω39. Maximum-likelihood phylogenetic trees were generated with MEGA-X40 using a Jones-Taylor-Thornton (JTT)-based model41 with 100 bootstrap replicates. Phylogenetic trees were visualized and edited, and taxonomic labels were assigned, using Interactive Tree Of Life (i-TOL)42. Bootstrap values are represented on a scale of 0 to 1. Taxonomic annotations were labelled based on the NCBI taxonomy database using UniProt identifiers.
Creating knockdown mutants of the candidate EDC genes with CRISPRi
CRISPRi mutants of the EDC genes were created as previously described19. Briefly, pSCB2-sgRNAv2, a modified plasmid from pSCB2-sgRNA19, was used as the template for inverse PCR to insert a 20 bp target-specific sgRNA sequence. Inverse PCR was performed using Q5 high-fidelity polymerase (NEB), forward primers with individual sgRNAs as a 5′ tail, and primer 1092 as the reverse primer. The resultant fragments were ligated to create circular plasmids by incubating 0.5 µL of the respective PCR products with quick ligation buffer (NEB), 0.25 µL DpnI, 0.25 µL T4 polynucleotide kinase (NEB), and 0.25 µL T4 ligase (NEB) for 30 min at 37 °C. Resultant plasmids were transformed into E. coli DH5α, recovered for 2 h and selected in LB supplemented with trimethoprim 50 µg/mL (Sigma). The transformants were further confirmed by colony PCR using primers 1409 and 848. E. coli strains carrying the sgRNA plasmids were used as donors, and E. coli MM290/pRK2013 as the helper, for triparental mating to introduce the sgRNA plasmids into B. cenocepacia K56-2 containing the chromosomally integrated dCas9 under the control of a rhamnose-inducible promoter, as described previously43. Trimethoprim-resistant colonies (100 µg/mL) were selected and screened by colony PCR using the primers 1409 and 848. The list of all the primers used in this study is provided in Supplementary Table 4.
Conditional growth phenotype analysis of the CRISPRi mutants
To determine the conditional growth phenotype of the candidate genes, overnight cultures of the CRISPRi mutants were back-diluted to an OD600nm of 0.01. The cultures were grown at 37 °C for 20–24 h with continuous shaking in a 384-well plate containing LB broth supplemented with trimethoprim 100 µg/mL, with or without 1% rhamnose. OD600nm readings were taken at 1-h intervals using a BioTek Synergy 2 microplate reader.
Acknowledgements
This work was supported by grants from the Canadian Institutes of Health Research (CIHR), the Cystic Fibrosis Foundation, and Cystic Fibrosis Canada to STC; ASMZR was supported by a University of Manitoba Graduate Fellowship (UMGF). The authors thank Dr. Georg Hausner, Andrew Hogan, Dustin Maydaniuk and the rest of the Cardona lab members for critically reading the manuscript.
Author contributions
A.S.M.Z.R.—performed the majority of the experiments and wrote the manuscript; L.T.—created the python script and contributed to manuscript editing; F.G.—created CRISPRi mutants and contributed to manuscript editing; S.T.C.—conceived the idea, supervised the work, provided financial support, and edited the final version of the manuscript.
Funding
Funding is provided by Canadian Institutes of Health Research (Grant no. 5211, project Grant), Cystic Fibrosis Canada (Grant no. 50501).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-022-05028-x.
References
- 1. van Opijnen T, Bodi KL, Camilli A. Tn-seq: High-throughput parallel sequencing for fitness and genetic interaction studies in microorganisms. Nat. Methods. 2009;6:767–772. doi: 10.1038/nmeth.1377.
- 2. Akerley BJ, et al. Systematic identification of essential genes by in vitro mariner mutagenesis. Proc. Natl. Acad. Sci. USA. 1998;95:8927–8932. doi: 10.1073/pnas.95.15.8927.
- 3. Chao MC, Abel S, Davis BM, Waldor MK. The design and analysis of transposon insertion sequencing experiments. Nat. Rev. Microbiol. 2016;14:119–128. doi: 10.1038/nrmicro.2015.7.
- 4. Langridge GC, et al. Simultaneous assay of every Salmonella Typhi gene using one million transposon mutants. Genome Res. 2009;19:2308–2316. doi: 10.1101/gr.097097.109.
- 5. Shields RC, Zeng L, Culp DJ, Burne RA. Genomewide identification of essential genes and fitness determinants of Streptococcus mutans UA159. mSphere. 2018;3:e00031-18. doi: 10.1128/mSphere.00031-18.
- 6. Forslund, S. K., Kaduk, M. & Sonnhammer, E. L. L. Evolution of protein domain architectures. in Evolutionary Genomics (ed. Anisimova, M.) vol. 1910 469–504 (Springer, 2019).
- 7. Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: Misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. 2009;5:e1000605. doi: 10.1371/journal.pcbi.1000605.
- 8. El-Gebali S, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47:D427–D432. doi: 10.1093/nar/gky995.
- 9. Mitchell AL, et al. InterPro in 2019: Improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019;47:D351–D360. doi: 10.1093/nar/gky1100.
- 10. Bateman A, Coggill P, Finn RD. DUFs: Families in search of function. Acta Crystallograph. Sect. F Struct. Biol. Cryst. Commun. 2010;66:1148–1152. doi: 10.1107/S1744309110001685.
- 11. Goodacre NF, Gerloff DL, Uetz P. Protein domains of unknown function are essential in bacteria. mBio. 2014;5:e00744-13. doi: 10.1128/mBio.00744-13.
- 12. DeJesus MA, et al. Bayesian analysis of gene essentiality based on sequencing of transposon insertion libraries. Bioinformatics. 2013;29:695–703. doi: 10.1093/bioinformatics/btt043.
- 13. Zhang YJ, et al. Global assessment of genomic regions required for growth in Mycobacterium tuberculosis. PLoS Pathog. 2012;8:e1002946. doi: 10.1371/journal.ppat.1002946.
- 14. Michel AH, et al. Functional mapping of yeast genomes by saturated transposition. eLife. 2017;6:e23570. doi: 10.7554/eLife.23570.
- 15. Lluch-Senar M, et al. Defining a minimal cell: Essentiality of small ORFs and ncRNAs in a genome-reduced bacterium. Mol. Syst. Biol. 2015;11:780. doi: 10.15252/msb.20145558.
- 16. Lu Y, et al. A novel essential domain perspective for exploring gene essentiality. Bioinformatics. 2015;31:2921–2929. doi: 10.1093/bioinformatics/btv312.
- 17. Gislason AS, Turner K, Domaratzki M, Cardona ST. Comparative analysis of the Burkholderia cenocepacia K56-2 essential genome reveals cell envelope functions that are uniquely required for survival in species of the genus Burkholderia. Microb. Genomics. 2017;3:e000140. doi: 10.1099/mgen.0.000140.
- 18. Qi LS, et al. Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell. 2013;152:1173–1183. doi: 10.1016/j.cell.2013.02.022.
- 19. Hogan AM, Rahman ASMZ, Lightly TJ, Cardona ST. A broad-host-range CRISPRi Toolkit for silencing gene expression in Burkholderia. ACS Synth. Biol. 2019;8:2372–2384. doi: 10.1021/acssynbio.9b00232.
- 20. Luo H, et al. DEG 15, an update of the database of essential genes that includes built-in analysis tools. Nucleic Acids Res. 2021;49:D677–D686. doi: 10.1093/nar/gkaa917.
- 21. The UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515. doi: 10.1093/nar/gky1049.
- 22. Gao M, Glenn AE, Blacutt AA, Gold SE. Fungal Lactamases: Their occurrence and function. Front. Microbiol. 2017;8:1775. doi: 10.3389/fmicb.2017.01775.
- 23. Kanaan, S. P., Huang, C., Wuchty, S., Chen, D. Z. & Izaguirre, J. A. Inferring protein–protein interactions from multiple protein domain combinations. In Computational Systems Biology (eds. Ireton, R., Montgomery, K., Bumgarner, R., Samudrala, R. & McDermott, J.) vol. 541 43–59 (Humana Press, 2009).
- 24. Miravet-Verde S, Burgos R, Delgado J, Lluch-Senar M, Serrano L. FASTQINS and ANUBIS: Two bioinformatic tools to explore facts and artifacts in transposon sequencing and essentiality studies. Nucleic Acids Res. 2020;48:e102. doi: 10.1093/nar/gkaa679.
- 25. Goodall ECA, et al. The essential genome of Escherichia coli K-12. mBio. 2018;9:e02096-17. doi: 10.1128/mBio.02096-17.
- 26. Qayyum MZ, Dey D, Sen R. Transcription elongation factor NusA is a general antagonist of rho-dependent termination in Escherichia coli. J. Biol. Chem. 2016;291:8090–8108. doi: 10.1074/jbc.M115.701268.
- 27. Ha KS, Toulokhonov I, Vassylyev DG, Landick R. The NusA N-terminal domain is necessary and sufficient for enhancement of transcriptional pausing via interaction with the RNA exit channel of RNA polymerase. J. Mol. Biol. 2010;401:708–725. doi: 10.1016/j.jmb.2010.06.036.
- 28. Wu C-C, Naveen V, Chien C-H, Chang Y-W, Hsiao C-D. Crystal structure of DnaK protein complexed with nucleotide exchange factor GrpE in DnaK chaperone system: Insight into intermolecular communication. J. Biol. Chem. 2012;287:21461–21470. doi: 10.1074/jbc.M112.344358.
- 29. Barthel TK, Zhang J, Walker GC. ATPase-defective derivatives of Escherichia coli DnaK that behave differently with respect to ATP-induced conformational change and peptide release. J. Bacteriol. 2001;183:5482–5490. doi: 10.1128/JB.183.19.5482-5490.2001.
- 30. Vogel M, Mayer MP, Bukau B. Allosteric regulation of Hsp70 chaperones involves a conserved interdomain linker. J. Biol. Chem. 2006;281:38705–38711. doi: 10.1074/jbc.M609020200.
- 31. Britton RA, Lin DC-H, Grossman AD. Characterization of a prokaryotic SMC protein involved in chromosome partitioning. Genes Dev. 1998;12:1254–1259. doi: 10.1101/gad.12.9.1254.
- 32. Minnen A, et al. Control of Smc coiled coil architecture by the ATPase heads facilitates targeting to chromosomal ParB/parS and release onto flanking DNA. Cell Rep. 2016;14:2003–2016. doi: 10.1016/j.celrep.2016.01.066.
- 33. Bikard D, et al. Programmable repression and activation of bacterial gene expression using an engineered CRISPR-Cas system. Nucleic Acids Res. 2013;41:7429–7437. doi: 10.1093/nar/gkt520.
- 34. Hawkins JS, Wong S, Peters JM, Almeida R, Qi LS. Targeted transcriptional repression in bacteria using CRISPR interference (CRISPRi). Methods Mol. Biol. 2015;1311:349–362. doi: 10.1007/978-1-4939-2687-9_23.
- 35. Vigouroux A, Oldewurtel E, Cui L, Bikard D, van Teeffelen S. Tuning dCas9’s ability to block transcription enables robust, noiseless knockdown of bacterial genes. Mol. Syst. Biol. 2018;14:e7899. doi: 10.15252/msb.20177899.
- 36. Bastard K, et al. Revealing the hidden functional diversity of an enzyme family. Nat. Chem. Biol. 2014;10:42–49. doi: 10.1038/nchembio.1387.
- 37. Dessailly BH, et al. PSI-2: Structural genomics to cover protein domain family space. Structure. 2009;17:869–881. doi: 10.1016/j.str.2009.03.015.
- 38. Zhang X, et al. Assignment of function to a domain of unknown function: DUF1537 is a new kinase family in catabolic pathways for acid sugars. Proc. Natl. Acad. Sci. 2016;113:E4161–E4169. doi: 10.1073/pnas.1605546113.
- 39. Sievers F, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011;7:539. doi: 10.1038/msb.2011.75.
- 40. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 2018;35:1547–1549. doi: 10.1093/molbev/msy096.
- 41. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Bioinformatics. 1992;8:275–282. doi: 10.1093/bioinformatics/8.3.275.
- 42. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: Recent updates and new developments. Nucleic Acids Res. 2019;47:W256–W259. doi: 10.1093/nar/gkz239.
- 43. Hogan AM, et al. Competitive fitness of essential gene knockdowns reveals a broad-spectrum antibacterial inhibitor of the cell division protein FtsZ. Antimicrob. Agents Chemother. 2018;62:e01231-18. doi: 10.1128/AAC.01231-18.