Nucleic Acids Research. 2024 Dec 9;53(2):gkae1175. doi: 10.1093/nar/gkae1175

Domainator, a flexible software suite for domain-based annotation and neighborhood analysis, identifies proteins involved in antiviral systems

Sean R Johnson, Peter R Weigele, Alexey Fomenkov, Andrew Ge, Anna Vincze, James B Eaglesham, Richard J Roberts, Zhiyi Sun
PMCID: PMC11754643  PMID: 39657740

Abstract

The availability of large databases of biological sequences presents an opportunity for in-depth exploration of gene diversity and function. Bacterial defense systems are a rich source of diverse but difficult-to-annotate genes with biotechnological applications. In this work, we present Domainator, a flexible and modular software suite for domain-based gene neighborhood and protein search, extraction and clustering. We demonstrate the utility of Domainator through three examples related to bacterial defense systems. First, we cluster CRISPR-associated Rossmann fold (CARF) containing proteins with difficult-to-annotate effector domains, classifying most of them as likely transcriptional regulators and a subset as likely RNases. Second, we extract and cluster P4-like phage satellite defense hotspots, identify an abundant variant of Lamassu defense systems and demonstrate its in vivo activity against several T-even phages. Third, we integrate a protein language model into Domainator and use it to identify restriction endonucleases with low similarity to known reference sequences, validating the activity of one example in vitro. Domainator is made available as an open-source package with detailed documentation and usage examples.

Graphical Abstract


Introduction

The rapid growth of publicly available biological sequence databases provides scientists with unprecedented opportunities to explore biological sequence diversity and co-association across virtually any range of phylogenetic and evolutionary distance. For example, multiple sequence alignments including huge numbers of homologous sequences produced by metagenome sequencing efforts around the world have been critical to the training of large-scale models for protein structure prediction (1,2). For the individual researcher, however, new tools can increase access to the biological patterns and meanings that can be found in biological sequence space (3). Specifically, improvement of two functionalities can enhance the ability of the experimenter to extract, from biological sequences, those human-readable patterns necessary for the formation of testable hypotheses. The first is easy generation and curation of accurate and uniform genome-scale annotation of genes at the level of functional domains. The second is a rich ecosystem of downstream tools for automated retrieval and summarization of those annotated gene sequences within their genomic context. Gene annotation traditionally has been limited by approaches that use a one-to-one mapping of a known annotation onto a query sequence, for example, by transferring the annotation of the best scoring hit in a BLAST (4) search to the query. This approach is highly vulnerable to mis-annotation, and the resulting errors can unfortunately be propagated throughout databases well beyond their original scope (5). Sequence profile-based approaches, such as HMMER (6), together with community curated models of protein functional domains such as Pfam (7), KOFam (8), CAZy (9), NCBI PGAP (10) and others greatly mitigate this problem.
Accurate domain-level annotations can not only help provide a consistent description of gene function, putative or otherwise, but can also aid in the prediction of unknown functions through the co-association of functional domains within a single gene product. A clear example can be found in the polymorphic toxin systems (11), which utilize a single polypeptide composed of a conserved carrier domain followed by a variable payload effector domain for delivery via extracellular secretion machinery: new toxin functions can be identified through association with the more readily identifiable carrier domain. In another example, Lutz et al. (12) discovered novel modification dependent DNA endonucleases by searching for PUA RNA epitranscriptomic ‘reader’ domains fused to HNH (histidine asparagine histidine) or PD-(D/E)XK catalytic deoxyendonuclease domains. These two examples show how accurate domain annotation coupled with observation of intra-gene domain associations can be utilized to help identify gene function.

Similarly, genome context can be used to infer potential gene function by observing what other genes are in the surrounding ‘neighborhood’ of the gene in question. Genome contextual analysis takes advantage of a general phenomenon where genes whose products function together in a biosynthetic pathway, or in a macromolecular complex, tend to cluster together at a genomic locus. Functional gene clustering is especially pronounced in bacteria and archaea and may be an adaptation for co-regulation of genes or driven by lateral gene transfer since genes that function as a ‘team’ are more likely to provide a fitness benefit when they are transferred together into a naive genome. Regardless of the origins of gene clustering, the phenomenon can be taken advantage of when attempting to find new gene functions. If, for example, a gene encodes an enzyme in the biosynthetic pathway producing an antibiotic or other cellular metabolite, then the neighboring genes may encode enzymes either producing a substrate for that enzyme or consuming the product of the enzyme in question. The co-association of genes across multiple genomes implies that the association is under selection and therefore can be taken as further evidence of a functional interdependence.

Publicly available tools enabling the retrieval of sets of genes based on their proximity include but are not limited to cblaster (13), the EFI-Genome Neighborhood Tool (14), antiSMASH (15), HaMBURGER (16), STRING (17,18), MicrobesOnline (19), KEGG (20) and BioCyc (21). These tools have been highly cited and have proven to be valuable resources for many research groups. AntiSMASH (15) and HaMBURGER (16) are specialized for biosynthetic gene clusters and Type VI secretion systems, respectively. Cblaster (13) allows searches, in local or remote databases, for homologs of the query protein, then finds the genomic coordinates and neighboring genes to generate cross-species locus visualizations. The EFI-Genome Neighborhood Tool (14) combines sequence similarity networks with genome neighborhood networks to visualize the relationship between protein similarity and neighborhood context similarity. The STRING database (18) combines multiple lines of evidence, including gene neighborhoods, gene fusions, gene co-occurrence, co-expression, protein homology and text mining, to provide multilevel protein-protein interaction information for 12 535 distinct organisms. The STRING database may be especially useful for information about protein associations in model organisms, such as Homo sapiens, which have rich source data. KEGG (20), BioCyc (21) and MicrobesOnline (19) feature attractive and user-friendly web interfaces and pathway diagrams that can be customized for individual species or datasets.

Positional/locus-based approaches have been used to find entirely new sets of gene functions without prior knowledge of any of the functions of the individual genes recovered. This approach takes advantage of the phenomenon where a genetic locus itself is an ‘island’ or ‘hotspot’ for diverse collections of genes that can be loosely grouped together under a broader functional definition. For example, genome defense genes, that is genes involved in the protection of a cell from the incursion of non-self genes, such as from viruses or mobile genetic elements, can be found at conserved locations within genomes across a range of species (22). The composition of genes at these loci can vary widely across even closely related species, indicative of a high rate of gene loss/acquisition/exchange. The Immigration Control Region of the Enterobacteriaceae (23) is a compelling example of such a dynamic locus. This region encodes many restriction endonuclease (RE) genes believed to protect the host genome against phages. Similarly, prophages themselves have been shown to encode hotspots of genome defense functions, presumably to confer some fitness advantage to their hosts during lysogeny by preventing infection from rival phages (24). In both cases, the dynamic locus is bounded by conserved genes. Homology searches for genomic sub-sequences bounded by conserved ‘framework’ genes can be used to recover sequences of diverse composition enriched for a common function. Investigation of defense systems would benefit from additional flexible and composable automated tools to annotate, search, extract and organize intra-gene, gene neighborhood and island/hotspot configurations.

In addition to profile- and neighborhood-based methods, profile-profile searches (25) and structure-based searches show high sensitivity in finding distant homologs based on shared protein fold between query and target (26,27). However, for many use cases, these approaches may be limited by the computational resources needed to obtain multiple sequence alignments and/or structural models for input query sequence as well as the sequences in the target database. The availability of a variety of structural databases, either experimentally validated such as the PDB, or precomputed predictions of large datasets such as AlphaFold Protein Structure Database (28) has alleviated the need for individual users to do structure prediction at scale, although there remain limitations, such as poor representation of predicted structures of phage proteins. These databases, coupled with structure-based search tools such as Foldseek (27), Rupee (29) and Reseek (30), have resulted in a quantum leap in the ability to search for structural homologs of a protein of interest. But on the user side, the computational demands of de novo structure prediction for query sequences can still limit the number of searches possible. Methods for converting amino acid sequences directly to Foldseek-compatible encodings, without full atomic coordinate prediction, can greatly expand the search space and scale accessible to the users (31,32).

We present here Domainator, an extended suite of command line software for sequence database search, genome annotation and extraction and comparison of sub-genomic regions, enabling diverse workflows for genomic contextual analyses and comparative genomics. We demonstrate the utility of Domainator through application to three examples. The examples presented follow a common pattern of search, annotate, report and cluster, analyze and apply, highlighting the range of functionalities of Domainator along the way (Table 1, Figures 1A and 2A). In the first example, we show how Domainator can be used to cluster individual proteins and examine their domain composition, focusing on CRISPR-associated Rossmann fold (CARF) domain containing proteins (33,34). In the second example, we show how Domainator can be used to extract and analyze gene neighborhoods, focusing on P4-like phage satellite defense hotspots in Escherichia coli (24). In the third example, we augment Domainator with a protein language model (pLM) (2,32), enabling sensitive search with structural information to identify restriction modification (RM) systems in which the RE has low sequence identity to any previously characterized RE. We validate the endonuclease activity of one of the newly identified RE candidates using experimental assays.

Table 1.

Overview of examples of practical applications of Domainator presented in this work

| Example | Search | Annotate | Report and cluster | Analyze and apply |
|---|---|---|---|---|
| CARF domain containing proteins | Search results reported in the literature of CARF domain containing proteins with no other reported domains | Hmmsearch with CARF domains, CARF effector domains and Pfam domains | Cluster proteins based on sequence similarity of the largest protein regions not containing the CARF domain | Examine Pfam and effector domain annotations and run AlphaFold2 on cluster representatives, followed by visual inspection of the predicted structure to assign putative functions to the largest clusters |
| P4-like phage satellite defense hotspots | Hmmsearch of E. coli genomes for regions flanked by genes encoding Psu and Phage_integrase domains | Curate protein orthogroups from the extracted regions and annotate the regions with orthogroup assignments using Phmmer, as well as hmmsearch of Pfam domains | Cluster regions by Jaccard index of orthogroup contents, which calculates pairwise scores for contigs based on the proportion of domain annotations in common between them | Examine Pfam and PADLOC annotations of the largest clusters and compare to defense hotspots reported in the literature |
| Cryptic RM systems | Phmmer search of prokaryote genomes for MTs, then extract the MTs plus four ORFs on either side | Phmmer of extracted neighborhoods against all REBASE gold standard proteins. Extract neighborhoods with no annotated RE and re-annotate against REBASE gold standard Type II REs using a pLM + Foldseek | Cluster putative divergent Type II REs by sequence similarity | Examine gene arrangement in the clusters and look up representative cluster members in REBASE to select putative RM systems for experimental validation |

Figure 1.

Domainator is designed around a modular architecture. (A) Examples of the different kinds of programs in Domainator (not all programs are shown). (B) Editors are a special kind of program whose input and output files have the same format. Editor programs in Domainator can apply sequential transformations to GenBank files, similar to how image editors can apply sequential transformations to image files.

Figure 2.

General strategy for using Domainator to annotate and compare (A) proteins and gene neighborhoods and (B) HMM profiles.

Materials and methods

Domainator implementation

Domainator is written in Python and built on top of some critical dependencies, including BioPython (35), HMMER3 (6) via PyHMMER (36), Numpy (37), SciPy (38), Pandas (39), CD-HIT (40), DIAMOND (41), Prodigal (42) via Pyrodigal (43), Foldseek (27), ESM2 (2), seaborn (44), umap-learn (45), h5py, jsonargparse, psutil, tqdm, bashplotlib and requests. Sequence similarity network generation draws inspiration from and uses the sequence similarity score proposed by EFI (46). Compare_contigs uses the Jaccard (47) and adjacency (48) indexes to compare proteins or gene neighborhoods based on their domain content, whereas seq_dist uses local alignment scores, for example via phmmer (6), Diamond (41), hmmsearch (6) or the Viterbi profile-comparison algorithm (6,49,50).

Key formulas

EFI alignment score for calculating sequence similarity (46):

EFI score = −log10(2^(−bit score) × query length × subject length)

Where the bit score can come from any local alignment algorithm, in our case, Diamond (41). The EFI score is roughly proportional to the negative logarithm of the E-value.
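
As a rough illustration of the score described above, the following sketch computes an EFI-style alignment score from a bit score and the two sequence lengths. It assumes the EFI-EST form of the score, −log10(2^(−bit score) × query length × subject length). The function name and arguments are hypothetical and are not taken from the Domainator code base.

```python
import math


def efi_score(bit_score: float, query_len: int, subject_len: int) -> float:
    """EFI-style alignment score (assumed form):
    -log10(2**(-bit_score) * query_len * subject_len)."""
    if query_len <= 0 or subject_len <= 0:
        raise ValueError("sequence lengths must be positive")
    # Evaluate in log space so that large bit scores do not underflow.
    return bit_score * math.log10(2) - math.log10(query_len * subject_len)


# Toy values: a 200-bit alignment between proteins of 300 and 320 residues.
print(round(efi_score(200.0, 300, 320), 2))
```

Because the E-value of a local alignment also scales with 2^(−bit score) times the product of the search-space lengths, a score of this form tracks the negative logarithm of the E-value while remaining independent of database size.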

Jaccard index for calculating similarity of domain content (47):

Jaccard index = number of unique domain annotations shared between the two contigs / total number of unique domain annotations present in either contig
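
The Jaccard index defined above can be written in a few lines of Python. This is a minimal sketch on toy domain lists, not the compare_contigs implementation. The function name and the handling of two unannotated contigs are assumptions made for illustration.

```python
def jaccard_index(domains_a, domains_b) -> float:
    """Unique annotations shared by both contigs divided by the
    unique annotations present in either contig."""
    a, b = set(domains_a), set(domains_b)
    union = a | b
    if not union:
        # Assumption: two contigs without annotations score 0.
        return 0.0
    return len(a & b) / len(union)


# Toy annotation lists; repeated domains count once.
contig_1 = ["Psu", "Phage_integrase", "HNH", "HNH"]
contig_2 = ["Psu", "Phage_integrase", "AAA_15"]
print(jaccard_index(contig_1, contig_2))  # 2 shared / 4 unique = 0.5
```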

Examples

Computational workflows for the three examples are outlined in the Results section, including as detailed flow charts in Supplementary Figures S1, S3 and S6. Code for replicating all three examples is available on GitHub (https://github.com/nebiolabs/domainator_examples). Sequence similarity networks were visualized with Cytoscape (51), and structure predictions were performed with AlphaFold2 (1) via local Colabfold (52) and rendered in Pymol (53).

Bacteria, bacteriophage strains and plasmid constructs

E. coli MG1655 was a gift from Mehmet Berkmen. ER2985 (MG1655 ICR::Kan) is a clean deletion of the E. coli immigration control region (ΔyjiT-mrr) and was a gift from Lise Raleigh (54). Phage T2, T4, T6 and Lambda CI were also gifts from Lise Raleigh. Phage 9g was a gift from Shuang-yong Xu. RB69 was a gift from Jim D. Karam. Phage P1 was purchased from ATCC (cat# 25404-B1). The bacteriophage T7 strain used here is a synthetic phage from which BsmBI restriction sites were removed to facilitate genome assembly, but which is otherwise wild type. This phage strain was a gift from Gregory Lohman (55). High-titer stocks were created according to previously published protocols via confluent lysis, soak-out of phage particles into SM buffer (Teknova) and 0.2 μm filtration. Phage stocks were stored long-term at 4°C (56). The Lamassu-var system from CP038295 (nucleotide location 4 116 517–4 120 474) was cloned by GenScript into pACYCDuet-1 between the XbaI and Bsu36I restriction sites, generating a plasmid containing just the Lamassu system, p15A origin and chloramphenicol resistance cassette and without any promoters or other elements, similar to plasmids used for validation of defense systems in other studies (57). This construct, or an empty vector backbone, was transformed into MG1655 or ER2895 and selected on chloramphenicol to create the strains used in the spot titer assay. These E. coli strains were maintained in LB media (1% Difco soy peptone, 0.5% Thermo Scientific yeast extract, 0.5% Millipore Sigma NaCl), supplemented with 25 μg/ml chloramphenicol (Sigma) when necessary to maintain plasmids.

Assessment of in vivo Lamassu-var antiphage activity

To determine the efficiency of plating and plaque phenotype, spot titrations were performed on either MG1655 or ER2895 harboring either empty vector or the Lamassu-var construct. Around 250 μl of each strain was overlaid in top agar (1% soy peptone, 0.5% yeast extract, 0.05% Millipore Sigma MgCl2-6H2O, 0.75% agar, 25 μg/ml chloramphenicol) in 10 cm LB + 25 μg/ml chloramphenicol petri dishes. High titer phage stocks were serially diluted by factors of 10 from 10^−1 to 10^−8 in SM buffer (Teknova), and 2 μl of each dilution for each phage were spotted on overlays with a multichannel pipette. After adsorption of the drops, plates were inverted and incubated at 37°C for 4 h (phage T7) or overnight (all other phages) before imaging with a digital camera (Nikon).

DNA assembly, PCR amplification and PURExpress analysis of putative restriction endonucleases

Two candidate open reading frames (ORFs) for putative RE genes, codon-optimized for expression in E. coli, locus tag H1R19_03490 (GenBank protein: QMT02248.1) from Gordoniaceae jinghuaiqii str. zg-686 (GenBank nucleotide: CP059491) and locus tag PUW25_24865 (GenBank protein: WDI02380.1) from Paenibacillus urinalis (GenBank nucleotide: CP118108), were assembled into the pJB321 vector (Julie Beaulieu, unpublished, NEB, MA) under control of the T7 promoter with pJB321_Nde_68R and pJB321_BamHI_68F primers (Supplementary Table S8) using 1 or 2 synthetic gBlocks (IDT, IA). The sizes of the gBlocks were validated on a 1% DNA agarose gel (Supplementary Figure S7). The synthetic gBlocks were assembled into pJB321 with NEBuilder HiFi DNA Assembly Master Mix (E2621, NEB, MA), and the resulting T7 polymerase chain reaction (PCR) DNA templates were amplified with Q5 ‘Hot Start’ DNA polymerase (M0543, NEB, MA) using the T7 forward S1248 and T7 reverse S1271 primers (Supplementary Table S8) for in vitro protein synthesis. PCR products were purified with the Monarch PCR & DNA Cleanup Kit (5 μg) (T1030, NEB, MA) and validated by 1% agarose gel electrophoresis (Supplementary Figure S7). All synthetic DNA gBlocks and PCR fragments were quantified on a Qubit fluorimeter (Invitrogen, OR). The correct sequences of PCR products were validated by Sanger sequencing using an ABI 373 instrument. Enzymes were produced in the PURExpress system (E6800, NEB, MA) following the manufacturer's recommended protocol. For RE reactions, varying amounts of PURExpress enzyme product (1, 3 or 9 μl) were incubated for 1 h in 1X CutSmart buffer with 1 μg of DNA substrate. Two types of DNA substrates were used: phage lambda (λ) genomic DNA with Dam methylation (methylation at the N6 position of adenine within the GATC sequence) (N3011, NEB, MA) and λ genomic DNA free of Dam methylation (N3013, NEB, MA). All REs, DNA substrates, and DNA and protein markers were from New England Biolabs (Ipswich, MA). Sequences of PCR primers and synthetic DNA ‘gBlocks’ encoding putative RE genes are listed in Supplementary Table S8.

Results

The Domainator toolkit

Domainator provides more than two dozen discrete, feature-rich programs that can be composed into a broad range of genome mining and comparative genomics workflows via command line or Python scripting (Figure 1A). The high degree of modularity built into Domainator stands in contrast to other genome mining tools, such as EFI (58), antiSMASH (15), BiG-SCAPE/CORASON (48) and MacSyFinder (59), which supply dedicated end-to-end workflows for specific tasks. Another unique feature is that Domainator supports a rich set of operations on hidden Markov model profiles (HMM-profiles), for example, subsetting of .hmm files and comparison of HMM-profiles, including the construction of profile versus profile similarity networks and trees (Figure 2B). The ‘Domain’ part of the name ‘Domainator’ derives from the key role that local (i.e. subsequence) alignments play in Domainator workflows.

Domainator uses the GenBank file format (60) (https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) as a carrier of both sequence and annotation data. Independence from a fixed set of sequence sources and the co-location of sequences and all their annotation data in a single file increase data portability and decrease complexity for end-users. Domainator can add functional annotations to sequences by local alignments against databases of HMM-profiles, protein sequences, or both at the same time. For example, in a single call to the domainate program, a set of genome or metagenome contigs can be annotated with hits to Pfam HMM-profiles (7,61) and hits to REBASE Gold Standard protein sequences (62) at the same time. De novo annotations derived by Domainator can be added atop pre-existing metadata in GenBank format files, but the software also provides options to filter or even replace earlier annotations.

The individual programs that make up Domainator can be roughly grouped into six categories corresponding to their general roles in genome mining and comparative genomics workflows as diagrammed in Figure 1A. The first steps in most workflows often involve passing sequence data through one or more editors.

  • Editors are programs whose output format is the same as their input format. Each individual editor performs a simple task, such as adding putative functional annotations to a GenBank file or extracting a subset of contigs, but they can also be combined in arbitrarily long chains to accomplish complex transformations (Figure 1B). Examples are domain_search, which outputs a subset of the input sequences, based on the presence of a hit to a reference sequence or profile; domainate, which outputs all the input sequences but adds domain annotations based on hits to user-specified reference sequences; deduplicate_genbank, which performs similarity clustering using CD-HIT (40) or USEARCH (63) and outputs only the cluster representatives of the input sequences; and select_by_cds, which extracts genome neighborhoods (including sequence, features and annotations) surrounding domains of interest.

  • Summary report programs summarize data into graphs and statistics, for example, the number of sequences in a file, the count of each kind of domain and the distribution of taxonomic origins of the sequences. Reports are provided in human readable format either as text displayed in the console or as HTML files.

  • Record-wise report programs produce tab-separated files, for example where each row corresponds to a genome contig, a protein, or a domain, and columns are data such as length, taxonomy ID, domain content, etc. Record-wise reports are useful for exporting data to programs, such as Excel, which can’t read GenBank or hmm files, and they also find use as intermediary files between some programs in Domainator.

  • Comparison programs generate pairwise score or distance matrices between proteins, contigs, or HMM-profiles. Compare_contigs uses the Jaccard (47) and adjacency (48) indexes to compare proteins or gene neighborhoods based on their domain content, whereas seq_dist uses local alignment scores, for example via phmmer (6), Diamond (41), hmmsearch (6) or the Viterbi profile-comparison algorithm (6,49,50). Comparison programs output pairwise distance or score matrices that can be presented in tables or used to plot trees or similarity network diagrams.

  • Plotting programs convert data into formats appropriate for graphical visualization, for example converting distance or score matrices and tabular metadata into trees or similarity networks which can be viewed in Cytoscape (51) or other external visualization tools, depending on the data type. Plotting programs that take matrices as input can also plot data from programs outside the Domainator suite. For example, build_ssn could be used to generate a structure similarity network by supplying it with a text file containing a table of pairwise structural similarity scores between a set of .pdb structure files, as calculated by TM-align (64) or some other program.

  • Finally, there are a few other programs that defy categorization. These programs perform functions such as downloading data from NCBI or UniProt, converting files between formats, or generating profile-profile alignments.
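
The editor pattern in the list above, programs whose output has the same format as their input and which can therefore be chained, can be illustrated with a small Python sketch. Records here are toy dictionaries standing in for GenBank records, and every function name is hypothetical. None of this is Domainator's actual API; it only shows why same-format input and output makes arbitrary chaining possible.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]  # toy stand-in for a GenBank record
Editor = Callable[[Iterable[Record]], Iterator[Record]]


def keep_with_domain(domain: str) -> Editor:
    """Editor that keeps only records annotated with `domain`."""
    def edit(records):
        return (r for r in records if domain in r["domains"])
    return edit


def drop_duplicates() -> Editor:
    """Editor that keeps the first record seen for each sequence."""
    def edit(records):
        seen = set()
        for r in records:
            if r["seq"] not in seen:
                seen.add(r["seq"])
                yield r
    return edit


def chain(*editors: Editor) -> Editor:
    """Compose editors left to right; the result is itself an editor."""
    def edit(records):
        return reduce(lambda recs, ed: ed(recs), editors, iter(records))
    return edit


records = [
    {"id": "a", "seq": "MKT", "domains": ["CARF"]},
    {"id": "b", "seq": "MKT", "domains": ["CARF"]},
    {"id": "c", "seq": "MSS", "domains": ["HTH"]},
]
pipeline = chain(keep_with_domain("CARF"), drop_duplicates())
print([r["id"] for r in pipeline(records)])  # ['a']
```

Because `chain` returns another editor, pipelines built this way can themselves be nested inside longer pipelines.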

The modularity of Domainator makes it straightforward to integrate the next generation of protein annotation methods into Domainator workflows, including methods such as pLM embeddings (2,65). We recently reported a method (32) to greatly improve the sensitivity of remote homology detection between proteins using the ESM-2 3B pLM (2) fine-tuned to convert amino acid sequences directly into the Foldseek 3Di structure alphabet (27). The fine-tuned model, ESM-2 3B 3Di, generates 3Di sequences roughly 1000 times faster than AlphaFold2. The 3Di sequences predicted by the model perform well when used as queries and targets in Foldseek searches, outperforming both phmmer and hmmscan for sequences less than 20% identical to each other (32,66). We integrated ESM-2 3B 3Di into the domainate program, allowing annotation of protein sequences by on-the-fly conversion to 3Di sequences followed by searches against Foldseek databases.

Comparing CARF domain-containing proteins

CARF containing proteins are ancillary proteins, commonly associated with type III CRISPR-Cas systems (34,67–69) (Figure 3A). In some type III systems, the Cas10 subunit has a cyclic oligoadenylate (cOA) synthase activity that becomes active when the type III complex binds to target RNA. CARF containing proteins in these systems typically have two domains, the CARF domain and an effector domain. Binding of the cOA signaling molecule by the CARF domain induces homo-dimerization and activation of the effector domain. The effector domain is highly variable, but often functions as a non-specific nuclease. Activation of the effector domain provides an additional layer of immunity to the targeted nuclease activity of the CRISPR-Cas system, in some cases slowing down host growth or killing the host in an abortive infection process. Characterization of the variable domains fused to the conserved CARF domains can lead to discovery of new defense-related enzymatic activities.

Figure 3.

Annotating variable effector domains of CARF domain containing proteins. (A) Schematic of effector activation. (B) Overview of bioinformatics strategy. (C) Sequence similarity network of effector domains colored based on their associated CARF domain.

To extend the analysis of CARF domain containing proteins recently catalogued by Makarova, Koonin and colleagues (34), we used Domainator in a workflow assigning putative annotations to fused domains left unannotated in the earlier work (Figure 3B,C and Supplementary Figure S1, Supplementary Table S1). Searching through a database of 13 116 completely assembled archaeal and bacterial genomes, the earlier work compiled a list of 6665 CARF domain containing proteins, sorted them into 25 clades and annotated the fused domains (34). In 1844 of the CARF domain containing proteins, they identified a single CARF domain and no additional domain annotations.

For each of these CARF-only proteins, we identified the footprint of the putative CARF domains and extracted the largest region not covered by those domains, provided it was 50 amino acids or longer, producing a list of 1524 protein fragments. We annotated the fragments using hmmscan against Pfam (v 36) (7) and profiles built from CARF effector domain alignments reported in the earlier work (34). We built a sequence similarity network from the fragments (Figure 3C), grouping them into 94 homology clusters. Within each cluster, almost all the members are fusions to CARF domains of the same clade. This co-occurrence is notable because we clustered based on just the fusion domain rather than the CARF domain or entire protein, suggesting coevolution. The top 12 homology clusters contained 1391 of the 1524 protein fragments, with 916 fragments belonging to the largest cluster (Supplementary Tables S1 and S2). We attempted to assign a putative functional annotation to each cluster (Supplementary Table S1) using Pfam and effector domain annotations, CARF clade and examination of AlphaFold2 predicted structures (1,52) for two randomly selected examples from each cluster (Supplementary Figure S2, Supplementary Table S3).
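
The fragment-extraction step described above (take the largest region not covered by any CARF domain footprint, and keep it only if it is at least 50 amino acids long) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used in the study: coordinates are assumed to be 0-based and end-exclusive, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def largest_uncovered_region(
    protein_len: int,
    domains: List[Tuple[int, int]],
    min_len: int = 50,
) -> Optional[Tuple[int, int]]:
    """Return the longest stretch (0-based, end-exclusive) not covered by
    any domain interval, or None if that stretch is shorter than min_len."""
    best = (0, 0)
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(domains):
        # Clip each interval to the protein boundaries.
        start = min(max(start, 0), protein_len)
        end = min(max(end, 0), protein_len)
        if start - cursor > best[1] - best[0]:
            best = (cursor, start)
        cursor = max(cursor, end)
    # Region between the last domain and the C-terminus.
    if protein_len - cursor > best[1] - best[0]:
        best = (cursor, protein_len)
    return best if best[1] - best[0] >= min_len else None


# A 400-residue protein with one domain footprint at residues 10-180.
print(largest_uncovered_region(400, [(10, 180)]))  # (180, 400)
# The longest uncovered stretch is only 30 residues, so nothing is kept.
print(largest_uncovered_region(220, [(5, 190)]))   # None
```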

AlphaFold2 structures of 9 out of the 12 clusters (Supplementary Figure S2) contain apparent Helix-Turn-Helix (HTH) or winged HTH motifs, indicating that they likely bind DNA and may serve as transcriptional regulators. In some cases, the DNA binding motif comprises the entire non-CARF region. In the case of clusters 2, 3 and 9, additional protein sequence is present which may confer catalytic activity or some other function. Clusters 6 and 9 appear to be divergent members of the previously catalogued CARF7_DUF2103 domain family (34). Cluster 11 appears to contain HEPN nucleases (70), non-specific RNases that are one of the most common effector domains found fused to CARF domains. Based on similarity of tertiary structure, cluster 10 most likely represents a variant of the HEPN domain. However, it lacks the critical R-X4-6-H motif, which is characteristic of the HEPN domain (70,71). In addition, there are no conserved arginine or histidine residues within the proteins of cluster 10. Without experimental data or some additional line of computational evidence, we are hesitant to assign a putative nuclease annotation to this cluster.

The example of CARF effector domain analysis demonstrates some of the capabilities of Domainator to extract, cluster and annotate individual proteins and protein fragments. In our next example, we show how Domainator enables the discovery of multigene defense hotspots through the same paradigm of search, annotate, cluster and analyze.

Comparing P4-like phage satellite defense hotspots in E. coli

Diverse defense systems in bacteria are often found in bacterial genomes as variable defense islands, hotspots or cassettes, flanked by conserved ‘framework’ genes. We use the example of P4-like satellites (Figure 4A), previously catalogued by Rousset et al. (2022) (24), to demonstrate the capability of the Domainator suite for identifying, extracting and analyzing variable defense regions. Using a series of Domainator commands (Supplementary Figure S3), we started from a set of 2718 E. coli genomes (totaling about 30 gigabytes in size) and extracted 910 regions flanked by Psu and Phage_integrase Pfam (version 36) domains (7), between 1 and 9 coding sequences (CDSs) in size. We then produced a tabular summary of each instance of the hotspot (Supplementary Table S4), and a graphical neighborhood similarity network (Figure 4B) based on the pairwise Jaccard indexes (48) of orthogroup contents (Supplementary Table S5), grouping the hotspots into clusters with shared orthogroup contents. The Jaccard index is calculated by dividing the number of unique orthogroups in common between two neighborhoods by the total number of unique orthogroups in the two neighborhoods. A limitation of similarity networks is that they do not show the relationship between clusters. To get a more granular view of cluster relationships, Domainator can also produce Jaccard index trees by calculating the Jaccard distance as 1 − Jaccard index and then building an unweighted pair group method with arithmetic mean (UPGMA) tree (Supplementary Figure S4) from the pairwise Jaccard distances.
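
The tree-building step described above (Jaccard distance as 1 − Jaccard index, followed by UPGMA) can be sketched in plain Python. The code below is a compact illustration on toy orthogroup sets, not Domainator's implementation. All names are hypothetical, and the tree is returned as nested tuples giving the topology only, without branch lengths.

```python
from itertools import combinations


def jaccard_distance(a, b) -> float:
    """1 - Jaccard index of two collections of orthogroup labels."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)


def upgma(labels, dist):
    """Average-linkage (UPGMA) clustering.
    labels: leaf names; dist: dict mapping frozenset({x, y}) -> distance.
    Returns nested tuples describing the tree topology."""
    clusters = {name: 1 for name in labels}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        nx, ny = clusters.pop(x), clusters.pop(y)
        for other in clusters:
            # Size-weighted mean, so every original leaf counts equally.
            d[frozenset((merged, other))] = (
                nx * d[frozenset((x, other))] + ny * d[frozenset((y, other))]
            ) / (nx + ny)
        clusters[merged] = nx + ny
    return next(iter(clusters))


# Toy hotspots described by their orthogroup contents.
hotspots = {
    "h1": {"og1", "og2", "og3"},
    "h2": {"og1", "og2", "og4"},
    "h3": {"og5", "og6"},
    "h4": {"og5", "og6", "og7"},
}
dist = {
    frozenset((a, b)): jaccard_distance(hotspots[a], hotspots[b])
    for a, b in combinations(hotspots, 2)
}
tree = upgma(list(hotspots), dist)
print(tree)  # (('h3', 'h4'), ('h1', 'h2'))
```

For larger datasets, an established implementation such as average linkage in SciPy's hierarchical clustering module would be a more practical choice than this quadratic-memory sketch.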

Figure 4.

Identifying and categorizing P4-like phage satellite defense hotspots in E. coli. (A) P4-like phage satellite defense hotspots are highly variable regions of 1–9 genes found between Psu and Phage integrase genes. Pfam and PADLOC annotations of representatives from the top 12 clusters are shown. (B) A similarity network of P4-like phage satellite defense hotspots. Edges are drawn between clusters with an orthogroup content Jaccard index greater than 0.7; colors represent different Pfam domain arrangements within the hotspot.

The results of our search were broadly consistent with the previously reported results (24) (Table 2), despite our use of a different set of starting genomes and different analysis methods. Of the 12 most frequent defense systems found through our analyses, 9 were among the 12 most abundant systems reported by Rousset et al. and 11 were among their top 20. These systems include REs (72), the Kiwa (73) and Gabija (73,74) systems, among others. Surprisingly, the most frequent system in our analysis was not previously described but has homology to Type II Lamassu systems (75,76), based on annotation with the PADLOC database (77). Upon further investigation, these hotspots were not from P4-like phages, but apparently from satellites derived from other Caudoviricetes phages, as determined by BLAST search of the satellites and their flanking regions against viral genomes in NCBI nr. Furthermore, their flanking Psu sequences had much lower identities to the reference Psu sequence (GenBank: WP_000446153) than did Psu sequences flanking other common clusters. Psu sequences from cluster 1 had an average of 31% amino-acid sequence identity to the reference P4 Psu, whereas Psu sequences from other clusters averaged 98% or 99% identity to the reference (Table 2 and Supplementary Table S6).

Table 2.

P4-like phage satellite defense hotspot annotations. Psu % identity is relative to the P4 phage reference sequence (GenBank: WP_000446153)

Cluster | Count | Number of CDSs | System # in Rousset et al. 2022 | Average Psu %ID | Annotation
1 | 259 | 3 to 9 | None | 31 | Related to Lamassu systems
2 | 93 | 2 to 3 | 1 | 99 | Type III RE
3 | 31 | 3 to 5 | 3 | 99 | EcoO109I (Type II RM)
4 | 24 | 2 | 4 | 99 | PDC-M18 (PADLOC)
5 | 20 | 1 | 5 | 99 | Unknown
6 | 19 | 1 | 11 | 99 | NRL-like protein
7 | 19 | 2 | 6 | 98 | Gabija
8 | 16 | 2 | 2 | 99 | Reverse transcriptase
9 | 15 | 3 to 4 | 7 | 99 | gop-beta-cII
10 | 14 | 1 | 20 | 98 | Nuclease
11 | 12 | 3 | 14 | 99 | Kiwa
12 | 12 | 1 | 9 | 98 | Sir2-like (Likely a NADase)
All others | 376 | | | |

To determine if the Lamassu system variant (hereafter Lamassu-var) detected using our Domainator analysis functions in antiphage defense, we synthesized a representative hotspot encoding this system and tested whether it conferred resistance to a panel of diverse phages (Supplementary Figure S5). Among the phages tested, all T-even phages (T2, T4, T6 and RB69) showed a reduction in plaque size compared to cell lines harboring an empty vector, demonstrating an in vivo function for this common system (Supplementary Figure S5B). Plaque size reduction compared to the control was apparent only at the most dilute phage titers, suggesting that we may not have tested against the phage species most susceptible to restriction by this defense system. Our observations are consistent with a previous report of a weak defense phenotype of a different Lamassu system variant against T4 phage (75).

The example of extracting and comparing P4-like phage satellite defense hotspots demonstrates the power of Domainator in the analysis of gene neighborhoods, enabling the discovery of variable defense regions, and uncovering potential new defense systems from Caudoviricetes, distantly homologous to the previously described defense hotspots in P4-like satellites. In our next example, we augment Domainator with a pLM to assign useful functional annotations to proteins even deeper into the twilight zone of similarity to well-characterized reference sequences.

Identification of cryptic RM systems using a protein language model

REs are enzymes that cut foreign DNA upon entry into a bacterial cell and thus play an important role in bacterial defense against bacteriophages (78). As a whole, REs are highly diverse but can be categorized according to Types I-IV. Type I, II and III enzymes utilize DNA methylation patterns established by methyltransferases (MTs) in the host cell to distinguish self from non-self DNA. These two enzymatic activities are dependent on each other for a functioning RM system. Methylation and cleavage are specific to a shared DNA sequence motif, and methylation of DNA blocks cleavage of DNA by the cognate RE. In this way, the MT protects the bacterial genome from cleavage by its own RE.

Foreign DNA, such as bacteriophage DNA originating from an environment lacking an appropriate MT, will be susceptible to being cut by an RE. The MT and corresponding RE are typically encoded by genes within four ORFs of each other (62,79) (Figure 5A). MTs tend to exhibit slower evolution and sequence divergence than REs (62), perhaps because of their more complex chemistry and the weaker selection pressure in the absence of phage threats. The implication is that MTs are more easily detectable than REs by primary sequence search such as BLASTP (80) or phmmer (6). Therefore, it can be expected that bioinformatic discovery of RM systems using sequence similarity search methods may fail to identify RM systems where the MT is detectable by sequence search, but a neighboring RE gene has diverged too far from the reference sequences to be detectable.

Figure 5.

Identifying cryptic REs using Domainator and a pLM. (A) Schematic of how Type II RM systems protect bacteria from phages. (B) Overview of bioinformatics strategy. (C) Sequence similarity network based on similarity of predicted REs and colored based on the best reference MT hit to the neighbor MT.

We leveraged the REBASE Gold Standards set of reference MTs, REs and associated proteins (62) together with the Domainator interface to phmmer, Foldseek (27) and neighborhood extraction functionalities to identify RM systems with divergent REs (Figure 5B and Supplementary Figure S6). We started by downloading a set of high-quality genomes from GenBank (60) (a database totaling about 366 gigabytes in size) and extracting 205 039 neighborhoods containing MT phmmer hits plus four CDSs on either side of the query match. Filtering out those neighborhoods containing REs detectable by phmmer, resulted in 124 853 putative orphan MTs and their surrounding features. We then used the ESM-2 3B 3Di (32) pLM to recode the sequences into a tertiary structure alphabet and used Foldseek (27) to find structural matches to Type II REs within these neighborhoods, yielding 2628 contigs containing putative cryptic RM systems (Supplementary Table S7). We then generated a sequence similarity network of the REs encoded in these systems (Figure 5C), and manually inspected representative systems selected from each cluster. We selected two candidate RM systems for experimental validation, prioritizing candidates where the putative RE is immediately adjacent to the putative MT and where the RE is absent from the REBASE (62) entry for the corresponding genome.
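The neighborhood extraction and orphan-MT filtering steps of this workflow can be sketched in a few lines of Python. This is only an illustration of the logic under simplifying assumptions, not the Domainator commands used in this work (those are given in Supplementary Figure S6). The toy contigs, locus tags, labels and helper names are hypothetical, and each CDS is assumed to carry at most one label from the phmmer search ("MT", "RE" or none).

```python
FLANK = 4  # CDSs kept on either side of each MT hit, as described in the text

def make_contig(name, n_cds, labels):
    """Build a toy contig as a list of (locus_tag, label); label is None if unannotated."""
    return [(f"{name}_{i:02d}", labels.get(i)) for i in range(n_cds)]

def mt_neighborhoods(contig, flank=FLANK):
    """Yield the window of CDSs around every MT hit, clipped at the contig ends."""
    for i, (_, label) in enumerate(contig):
        if label == "MT":
            yield contig[max(0, i - flank): i + flank + 1]

def has_detectable_re(neighborhood):
    """True if any CDS in the window was already recognized as an RE by sequence search."""
    return any(label == "RE" for _, label in neighborhood)

# Hypothetical input: contig A has an MT next to a recognizable RE,
# contig B has an MT with no recognizable RE nearby.
contigs = [
    make_contig("A", 12, {5: "MT", 6: "RE"}),
    make_contig("B", 12, {1: "MT"}),
]

neighborhoods = [nb for contig in contigs for nb in mt_neighborhoods(contig)]

# Putative orphan MTs: neighborhoods with no RE detectable by sequence similarity.
orphans = [nb for nb in neighborhoods if not has_detectable_re(nb)]

# Unannotated neighbors of orphan MTs are the proteins that would be passed
# to the more sensitive structure-alphabet search.
candidates = [tag for nb in orphans for tag, label in nb if label is None]

print([len(nb) for nb in neighborhoods])
print(candidates)
```

Clipping the window at contig ends means neighborhoods near a contig boundary contain fewer than nine CDSs, as for contig B here.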

The candidate REs selected for activity screening were: locus tag H1R19_03490 (GenBank protein: QMT02248.1) from Gordoniaceae jinghuaiqii str. zg-686 (GenBank nucleotide: CP059491), and locus tag PUW25_24865 (GenBank protein: WDI02380.1) from Paenibacillus urinalis (GenBank nucleotide: CP118108). A schematic outlining the construction of templates for in vitro protein expression, beginning with synthetic gene blocks, is shown in Figure 6A. Following expression using the PURExpress in vitro transcription/translation system, 1, 3 or 9 μl of the PURExpress enzyme product was incubated for 1 h in 1X CutSmart buffer with 1 μg of DNA substrate: 42 kb long phage λ genomic DNA either with Dam methylation (methylation at the N6 position of the adenine in the sequence GATC) or free of Dam methylation. DNA from the cleavage assay reactions was resolved by agarose gel electrophoresis and visualized by ethidium bromide staining. Reactions expressing H1R19_03490 from Gordonia showed specific cleavage activity towards unmethylated λ DNA (Figure 6B). As a result of the confirmed activity, H1R19_03490 has been renamed Gba686I, according to accepted nomenclature (81). Reactions expressing PUW25_24865 from P. urinalis showed no detectable activity in our assay (Figure 6B). The digestion pattern produced by Gba686I reactions was identical to that produced by BamHI (GenBank: QDP93514.1; R0136, NEB, MA), indicating that Gba686I recognizes the sequence GGATCC. However, unlike BamHI, Gba686I is blocked by Dam methylation (Figure 6C). We were unable to detect any amino-acid sequence similarity between Gba686I and BamHI.
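The relationship between the GGATCC recognition sequence and Dam methylation can be made concrete with a small in silico digest. The Python sketch below is illustrative only: the toy sequence is hypothetical, Dam methylation is assumed to be complete on methylated substrate, and the cut position (after the first G, as for BamHI) is an assumption for Gba686I, since only the recognition sequence is inferred here from the matching digestion pattern.

```python
import re

SITE = "GGATCC"    # recognition sequence of BamHI, inferred here for Gba686I
DAM_SITE = "GATC"  # Dam methylates the adenine within this sequence

def site_positions(seq, site=SITE):
    """0-based start of every occurrence of the (palindromic) site in a linear sequence."""
    return [m.start() for m in re.finditer(f"(?={site})", seq)]

def fragments(seq, dam_methylated, blocked_by_dam, cut_offset=1):
    """Fragment lengths after complete digestion of a linear molecule.

    cut_offset=1 places the cut after the first G (G^GATCC), as for BamHI;
    using the same offset for Gba686I is an assumption.
    """
    cuts = []
    for pos in site_positions(seq):
        # Every GGATCC contains a nested GATC, so on fully Dam-methylated DNA
        # every recognition site carries the methyl group.
        site_is_methylated = dam_methylated and DAM_SITE in SITE
        if site_is_methylated and blocked_by_dam:
            continue  # enzyme cannot cut this site
        cuts.append(pos + cut_offset)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

# Hypothetical 23-nt substrate with two GGATCC sites.
substrate = "AAAAGGATCCTTTTTGGATCCCC"

bamhi_dam_plus = fragments(substrate, dam_methylated=True, blocked_by_dam=False)
gba_dam_minus = fragments(substrate, dam_methylated=False, blocked_by_dam=True)
gba_dam_plus = fragments(substrate, dam_methylated=True, blocked_by_dam=True)

print(bamhi_dam_plus, gba_dam_minus, gba_dam_plus)
```

Because the GATC sequence is nested in every GGATCC site, a Dam-sensitive enzyme with this specificity leaves fully methylated substrate uncut, while producing the same fragments as BamHI on unmethylated substrate.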

Figure 6.

Putative RE candidate in vitro assay. (A) The workflow of expression experiments, including gene assembly, preparation of expression templates by PCR and in vitro expression in the PURExpress system. (B) Endonuclease activity assay on λ dam- DNA substrate, where Gba686I shows specific cleavage, but PUW25_24865 is inactive. (C) Gba686I produces the same cleavage pattern as BamHI, which recognizes a GGATCC motif but, in contrast to BamHI, this isoschizomer is sensitive to Dam methylation at the GATC sequence nested within the RE cleavage site. The negative control was performed on the λ dam- DNA substrate without adding the PURExpress protein product. Note the presence of a DNA band at 1 kb corresponding to the expression template and not derived from the cleavage of the input substrate DNA.

Discussion

In this work, we introduce the Domainator software suite and show how it can be used to find, extract and cluster proteins and gene neighborhoods. We highlighted three examples of how Domainator can be used and extended to answer different kinds of questions in the context of bacterial defense systems. The three examples all follow the general pattern of search, annotate, report and cluster, analyze and apply (Figure 2A), but instantiate that pattern using different combinations of Domainator programs to study various aspects of diverse biological systems (Table 1).

In the first example, we clustered CARF-containing proteins and added putative annotations to examples that were previously difficult to annotate, identifying a subset that appear to contain HEPN RNase domains. In the second example, we extracted and clustered P4-like phage satellite defense hotspots from E. coli, identified a distantly related defense hotspot with homology to Type II Lamassu systems and demonstrated anti-phage activity of the system in vivo. In the final example, we integrated a neural network pLM into Domainator and used the new functionality as part of a workflow to identify divergent REs that were undetectable using primary sequence-based searches. We verified the sequence-specific endonuclease activity of one of these REs in vitro.

The power and flexibility of Domainator were enabled by several key design choices. One key choice was to use the GenBank file format as the primary carrier of sequence information and annotations. The main advantage of this choice is that it makes data at each intermediate step of the analysis self-contained and portable. The biggest tradeoffs of using GenBank files are that search speed can suffer due to a lack of pre-indexing of the GenBank files and the slowness of parsing GenBank files in Python. Another tradeoff is that the GenBank format is not rigorously standardized, with different software parsing and writing different variants of the format. To mitigate this problem as much as possible, the code in Domainator for reading and writing GenBank files is based on BioPython (35), which has been developed and improved over many years to handle the various GenBank dialects.

Another key choice was the separation of editor, comparison, reporting and plotting tools into distinct scripts with modular, interchangeable and, where possible, portable input and output formats. While this separation of functions increases the complexity of using Domainator compared to other software packages for gene neighborhood and protein annotation, it gives the user much greater flexibility in composition of analysis workflows. One way we take advantage of this flexibility is by using the build_ssn program to build not only sequence similarity networks but also neighborhood domain composition similarity networks, and to incorporate into those networks annotations produced by enum_report and by external sources of tab-separated data. Another advantage of the modularity is that it makes it straightforward to integrate additional functionality into Domainator, for example, the pLM.

To get started with Domainator, all a user needs are the nucleotide or protein sequences they want to filter or annotate, in fasta or GenBank format, and protein sequences or HMM profiles to use as references for the annotation, in fasta or hmm format. The included script, domainator_db_download, facilitates downloading sequences from NCBI or UniProt, with options for filtering by taxonomic origin of the sequences. Curated HMM profiles can be obtained from InterPro (7,61) (https://www.ebi.ac.uk/interpro/), NCBI PGAP (10) (https://ftp.ncbi.nih.gov/hmm/current/), KEGG KOfam (8) (https://www.genome.jp/tools/kofamkoala/) or various more specialized sources, such as CAZY/dbCAN (9,82) (https://bcb.unl.edu/dbCAN2/) for carbohydrate-active enzymes, or PADLOC (77) (https://github.com/padlocbio/padloc-db) or DefenseFinder (83) (https://github.com/mdmparis/defense-finder-models/) for proteins related to defense and conflict systems. Documentation, tutorials and example workflows are available on the Domainator GitHub page (https://github.com/nebiolabs/domainator).

The three examples presented in this work just scratch the surface of what is possible using Domainator. Other potential applications include, but are not limited to, identification, extraction and analysis of novel domain fusions and multi-domain proteins, prophages, natural product biosynthetic gene clusters, serotype clusters, pathogenicity islands, phage inducible chromosomal islands and other mobile genetic elements, such as integrative conjugative elements (ICEs). Furthermore, Domainator seamlessly annotates and extracts intron-containing genes, so it can be used to study eukaryotic genomes as well as eubacterial and archaeal genomes. We intend to continue adding new functionality in the coming years as we use it in our own research projects. By making Domainator available free and open source, we hope that it will serve as a useful tool for the life sciences community, and we invite suggestions and contributions for its continued improvement.

Supplementary Material

gkae1175_Supplemental_Files

Acknowledgements

We thank Yu-Cheng Lin for writing a script that became the precursor to Domainator, the many users and beta testers from the research department at New England Biolabs, and Gary Smith and the NEB IT team for help with computing infrastructure. We are grateful for support from New England Biolabs (NEB), without which this work would not have been possible. The authors are employees of NEB, a manufacturer and vendor of molecular biology reagents. This affiliation does not affect the authors’ impartiality, adherence to journal standards and policies, or availability of data and software.

Contributor Information

Sean R Johnson, New England Biolabs Inc., Ipswich, MA 01938, USA.

Peter R Weigele, New England Biolabs Inc., Ipswich, MA 01938, USA.

Alexey Fomenkov, New England Biolabs Inc., Ipswich, MA 01938, USA.

Andrew Ge, New England Biolabs Inc., Ipswich, MA 01938, USA.

Anna Vincze, New England Biolabs Inc., Ipswich, MA 01938, USA.

James B Eaglesham, New England Biolabs Inc., Ipswich, MA 01938, USA.

Richard J Roberts, New England Biolabs Inc., Ipswich, MA 01938, USA.

Zhiyi Sun, New England Biolabs Inc., Ipswich, MA 01938, USA.

Data availability

All data related to the analyses in this work are available from GitHub (https://github.com/nebiolabs/domainator_examples) and Zenodo (https://doi.org/10.5281/zenodo.10989173).

Code availability

The source code for Domainator is available from https://github.com/nebiolabs/domainator. Scripts for reproducing the analyses presented in this work are available from https://github.com/nebiolabs/domainator_examples and Zenodo at https://zenodo.org/records/14056380. We’ve also made a Google Colab notebook that automatically installs Domainator into an interactive sandbox environment and shows how to run some common data analysis tasks: https://colab.research.google.com/github/nebiolabs/domainator_examples/blob/main/colab_notebooks/Domainator.ipynb.

Supplementary data

Supplementary Data are available at NAR Online.

Funding

New England Biolabs (NEB). Funding for open access charge: New England Biolabs, Inc.

Conflict of interest statement. The authors are employees of New England Biolabs, a manufacturer and vendor of molecular biology reagents. This affiliation does not affect the authors’ impartiality, adherence to journal standards and policies, or availability of data.

References

  • 1. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A. et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–589.
  • 2. Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., Smetanin N., Verkuil R., Kabeli O., Shmueli Y. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023; 379:1123–1130.
  • 3. De Crécy-Lagard V., Amorin De Hegedus R., Arighi C., Babor J., Bateman A., Blaby I., Blaby-Haas C., Bridge A.J., Burley S.K., Cleveland S. et al. A roadmap for the functional annotation of protein families: a community perspective. Database. 2022; 2022:baac062.
  • 4. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389–3402.
  • 5. Devos D., Valencia A. Intrinsic errors in genome annotation. Trends Genet. 2001; 17:429–431.
  • 6. Eddy S.R. Accelerated profile HMM searches. PLOS Comput. Biol. 2011; 7:e1002195.
  • 7. Mistry J., Chuguransky S., Williams L., Qureshi M., Salazar G.A., Sonnhammer E.L.L., Tosatto S.C.E., Paladin L., Raj S., Richardson L.J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021; 49:D412–D419.
  • 8. Aramaki T., Blanc-Mathieu R., Endo H., Ohkubo K., Kanehisa M., Goto S., Ogata H. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020; 36:2251–2252.
  • 9. Cantarel B.L., Coutinho P.M., Rancurel C., Bernard T., Lombard V., Henrissat B. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for glycogenomics. Nucleic Acids Res. 2009; 37:D233–D238.
  • 10. Tatusova T., DiCuccio M., Badretdin A., Chetvernin V., Nawrocki E.P., Zaslavsky L., Lomsadze A., Pruitt K.D., Borodovsky M., Ostell J. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016; 44:6614–6624.
  • 11. Ruhe Z.C., Low D.A., Hayes C.S. Polymorphic toxins and their immunity proteins: diversity, evolution, and mechanisms of delivery. Annu. Rev. Microbiol. 2020; 74:497–520.
  • 12. Lutz T., Flodman K., Copelas A., Czapinska H., Mabuchi M., Fomenkov A., He X., Bochtler M., Xu S. A protein architecture guided screen for modification dependent restriction endonucleases. Nucleic Acids Res. 2019; 47:9761–9776.
  • 13. Gilchrist C.L.M., Booth T.J., van Wersch B., van Grieken L., Medema M.H., Chooi Y.-H. cblaster: a remote search tool for rapid identification and visualization of homologous gene clusters. Bioinforma. Adv. 2021; 1:vbab016.
  • 14. Oberg N., Zallot R., Gerlt J.A. EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) web resource for genomic enzymology tools. J. Mol. Biol. 2023; 435:168018.
  • 15. Blin K., Shaw S., Kloosterman A.M., Charlop-Powers Z., van Wezel G.P., Medema M.H., Weber T. antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucleic Acids Res. 2021; 49:W29–W35.
  • 16. Mariano G., Trunk K., Williams D.J., Monlezun L., Strahl H., Pitt S.J., Coulthurst S.J. A family of type VI secretion system effector proteins that form ion-selective pores. Nat. Commun. 2019; 10:5484.
  • 17. Snel B., Lehmann G., Bork P., Huynen M.A. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 2000; 28:3442–3444.
  • 18. Szklarczyk D., Kirsch R., Koutrouli M., Nastou K., Mehryary F., Hachilif R., Gable A.L., Fang T., Doncheva N.T., Pyysalo S. et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023; 51:D638–D646.
  • 19. Dehal P.S., Joachimiak M.P., Price M.N., Bates J.T., Baumohl J.K., Chivian D., Friedland G.D., Huang K.H., Keller K., Novichkov P.S. et al. MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 2010; 38:D396–D400.
  • 20. Kanehisa M., Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000; 28:27–30.
  • 21. Karp P.D., Billington R., Caspi R., Fulcher C.A., Latendresse M., Kothari A., Keseler I.M., Krummenacker M., Midford P.E., Ong Q. et al. The BioCyc collection of microbial genomes and metabolic pathways. Brief. Bioinform. 2019; 20:1085–1093.
  • 22. Makarova K.S., Wolf Y.I., Snir S., Koonin E.V. Defense islands in bacterial and archaeal genomes and prediction of novel defense systems. J. Bacteriol. 2011; 193:6039–6056.
  • 23. Sibley M.H., Raleigh E.A. Cassette-like variation of restriction enzyme genes in Escherichia coli C and relatives. Nucleic Acids Res. 2004; 32:522–534.
  • 24. Rousset F., Depardieu F., Miele S., Dowding J., Laval A.-L., Lieberman E., Garry D., Rocha E.P.C., Bernheim A., Bikard D. Phages and their satellites encode hotspots of antiviral systems. Cell Host Microbe. 2022; 30:740–753.
  • 25. Remmert M., Biegert A., Hauser A., Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods. 2012; 9:173–175.
  • 26. Holm L. Dali server: structural unification of protein families. Nucleic Acids Res. 2022; 50:W210–W215.
  • 27. van Kempen M., Kim S.S., Tumescheit C., Mirdita M., Lee J., Gilchrist C.L.M., Söding J., Steinegger M. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 2023; 42:243–246.
  • 28. Varadi M., Anyango S., Deshpande M., Nair S., Natassia C., Yordanova G., Yuan D., Stroe O., Wood G., Laydon A. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022; 50:D439–D444.
  • 29. Ayoub R., Lee Y. RUPEE: a fast and accurate purely geometric protein structure search. PLoS One. 2019; 14:e0213712.
  • 30. Edgar R.C. Protein structure alignment by reseek improves sensitivity to remote homologs. Bioinformatics. 2024; 40:btae687.
  • 31. Heinzinger M., Weissenow K., Sanchez J.G., Henkel A., Mirdita M., Steinegger M., Rost B. Bilingual language model for protein sequence and structure. NAR Genom. Bioinform. 2024; 6:lqae150.
  • 32. Johnson S.R., Peshwa M., Sun Z. Sensitive remote homology search by local alignment of small positional embeddings from protein language models. eLife. 2024; 12:RP91415.
  • 33. Makarova K.S., Anantharaman V., Grishin N.V., Koonin E.V., Aravind L. CARF and WYL domains: ligand-binding regulators of prokaryotic defense systems. Front. Genet. 2014; 5:102.
  • 34. Makarova K.S., Timinskas A., Wolf Y.I., Gussow A.B., Siksnys V., Venclovas Č., Koonin E.V. Evolutionary and functional classification of the CARF domain superfamily, key sensors in prokaryotic antivirus defense. Nucleic Acids Res. 2020; 48:8828–8847.
  • 35. Cock P.J.A., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25:1422–1423.
  • 36. Larralde M., Zeller G. PyHMMER: a Python library binding to HMMER for efficient sequence analysis. Bioinformatics. 2023; 39:btad214.
  • 37. Harris C.R., Millman K.J., van der Walt S.J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N.J. et al. Array programming with NumPy. Nature. 2020; 585:357–362.
  • 38. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020; 17:261–272.
  • 39. McKinney W. Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference. 2010; 56–61.
  • 40. Fu L., Niu B., Zhu Z., Wu S., Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28:3150–3152.
  • 41. Buchfink B., Reuter K., Drost H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods. 2021; 18:366–368.
  • 42. Hyatt D., Chen G.-L., LoCascio P.F., Land M.L., Larimer F.W., Hauser L.J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 2010; 11:119.
  • 43. Larralde M. Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. J. Open Source Softw. 2022; 7:4296.
  • 44. Waskom M.L. seaborn: statistical data visualization. J. Open Source Softw. 2021; 6:3021.
  • 45. McInnes L., Healy J., Saul N., Großberger L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 2018; 3:861.
  • 46. Gerlt J.A., Bouvier J.T., Davidson D.B., Imker H.J., Sadkhin B., Slater D.R., Whalen K.L. Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): a web tool for generating protein sequence similarity networks. Biochim. Biophys. Acta BBA - Proteins Proteomics. 2015; 1854:1019–1037.
  • 47. Lin K., Zhu L., Zhang D.-Y. An initial strategy for comparing proteins at the domain architecture level. Bioinformatics. 2006; 22:2081–2086.
  • 48. Navarro-Muñoz J.C., Selem-Mojica N., Mullowney M.W., Kautsar S.A., Tryon J.H., Parkinson E.I., De Los Santos E.L.C., Yeong M., Cruz-Morales P., Abubucker S. et al. A computational framework to explore large-scale biosynthetic diversity from large-scale genomic data. Nat. Chem. Biol. 2020; 16:60–68.
  • 49. Söding J. Protein homology detection by HMM–HMM comparison. Bioinformatics. 2005; 21:951–960.
  • 50. Steinegger M., Meier M., Mirdita M., Vöhringer H., Haunsberger S.J., Söding J. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinf. 2019; 20:473.
  • 51. Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N., Schwikowski B., Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13:2498–2504.
  • 52. Mirdita M., Schütze K., Moriwaki Y., Heo L., Ovchinnikov S., Steinegger M. ColabFold: making protein folding accessible to all. Nat. Methods. 2022; 19:679–682.
  • 53. Schrödinger, LLC. The PyMOL Molecular Graphics System, Version 2.5. 2015; https://pymol.org/support.html.
  • 54. Kingston A.W., Roussel-Rossin C., Dupont C., Raleigh E.A. Novel recA-independent horizontal gene transfer in Escherichia coli K-12. PLoS One. 2015; 10:e0130813.
  • 55. Pryor J.M., Potapov V., Bilotti K., Pokhrel N., Lohman G.J.S. Rapid 40 kb genome construction from 52 parts through data-optimized assembly design. ACS Synth. Biol. 2022; 11:2036–2042.
  • 56. Bonilla N., Barr J.J. Phage on tap: a quick and efficient protocol for the preparation of bacteriophage laboratory stocks. Methods Mol. Biol. Clifton NJ. 2018; 1838:37–46.
  • 57. Gao L., Altae-Tran H., Böhning F., Makarova K.S., Segel M., Schmid-Burgk J.L., Koob J., Wolf Y.I., Koonin E.V., Zhang F. Diverse enzymatic activities mediate antiviral immunity in prokaryotes. Science. 2020; 369:1077–1084.
  • 58. Zallot R., Oberg N., Gerlt J.A. The EFI web resource for genomic enzymology tools: leveraging protein, genome, and metagenome databases to discover novel enzymes and metabolic pathways. Biochemistry. 2019; 58:4169–4182.
  • 59. Néron B., Denise R., Coluzzi C., Touchon M., Rocha E.P.C., Abby S.S. MacSyFinder v2: improved modelling and search engine to identify molecular systems in genomes. Peer Community J. 2023; 3:e28.
  • 60. Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020; 48:D84–D86.
  • 61. Paysan-Lafosse T., Blum M., Chuguransky S., Grego T., Pinto B.L., Salazar G.A., Bileschi M.L., Bork P., Bridge A., Colwell L. et al. InterPro in 2022. Nucleic Acids Res. 2023; 51:D418–D427.
  • 62. Roberts R.J., Vincze T., Posfai J., Macelis D. REBASE: a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res. 2023; 51:D629–D630.
  • 63. Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010; 26:2460–2461.
  • 64. Zhang Y., Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005; 33:2302–2309.
  • 65. Elnaggar A., Heinzinger M., Dallago C., Rehawi G., Yu W., Jones L., Gibbs T., Feher T., Angerer C., Steinegger M. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 2021; 44:7112–7127.
  • 66. Bileschi M.L., Belanger D., Bryant D.H., Sanderson T., Carter B., Sculley D., Bateman A., DePristo M.A., Colwell L.J. Using deep learning to annotate the protein universe. Nat. Biotechnol. 2022; 40:932–937.
  • 67. Makarova K.S., Wolf Y.I., Iranzo J., Shmakov S.A., Alkhnbashi O.S., Brouns S.J.J., Charpentier E., Cheng D., Haft D.H., Horvath P. et al. Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 2020; 18:67–83.
  • 68. Steens J.A., Salazar C.R.P., Staals R.H.J. The diverse arsenal of type III CRISPR–Cas-associated CARF and SAVED effectors. Biochem. Soc. Trans. 2022; 50:1353–1364.
  • 69. Stella G., Marraffini L. Type III CRISPR-Cas: beyond the Cas10 effector complex. Trends Biochem. Sci. 2024; 49:28–37.
  • 70. Pillon M.C., Gordon J., Frazier M.N., Stanley R.E. HEPN RNases – An emerging class of functionally distinct RNA processing and degradation enzymes. Crit. Rev. Biochem. Mol. Biol. 2021; 56:88–108.
  • 71. Niewoehner O., Jinek M. Structural basis for the endoribonuclease activity of the type III-A CRISPR-associated protein Csm6. RNA. 2016; 22:318.
  • 72. Kita K., Tsuda J., Kato T., Okamoto K., Yanase H., Tanaka M.. Evidence of horizontal transfer of theEcoO109I restriction-modification gene to Escherichia coli chromosomal DNA. J. Bacteriol. 1999; 181:6822–6827. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Doron S., Melamed S., Ofir G., Leavitt A., Lopatina A., Keren M., Amitai G., Sorek R.. Systematic discovery of antiphage defense systems in the microbial pangenome. Science. 2018; 359:eaar4120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Cheng R., Huang F., Wu H., Lu X., Yan Y., Yu B., Wang X., Zhu B.. A nucleotide-sensing endonuclease from the Gabija bacterial defense system. Nucleic Acids Res. 2021; 49:5216–5229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Millman A., Melamed S., Leavitt A., Doron S., Bernheim A., Hör J., Garb J., Bechon N., Brandis A., Lopatina A.et al.. An expanded arsenal of immune systems that protect bacteria from phages. Cell Host Microbe. 2022; 30:1556–1569. [DOI] [PubMed] [Google Scholar]
  • 76. Jaskólska M., Adams D.W., Blokesch M.. Two defence systems eliminate plasmids from seventh pandemic Vibrio cholerae. Nature. 2022; 604:323–329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Payne L.J., Todeschini T.C., Wu Y., Perry B.J., Ronson C.W., Fineran P.C., Nobrega F.L., Jackson S.A.. Identification and classification of antiviral defence systems in bacteria and archaea with PADLOC reveals new system types. Nucleic Acids Res. 2021; 49:10868–10878. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Loenen W.A.M., Dryden D.T.F., Raleigh E.A., Wilson G.G., Murray N.E.. Highlights of the DNA cutters: a short history of the restriction enzymes. Nucleic Acids Res. 2014; 42:3–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Card C.O., Wilson G.G., Weule K., Hasapes J., Kiss A., Roberts R.J.. Cloning and characterization of the HpaII methylase gene. Nucleic Acids Res. 1990; 18:1377–1383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., Madden T.L.. BLAST+: architecture and applications. BMC Bioinf. 2009; 10:421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81. Roberts R.J., Belfort M., Bestor T., Bhagwat A.S., Bickle T.A., Bitinaite J., Blumenthal R.M., Degtyarev S.Kh., Dryden D.T.F., Dybvig K.et al.. A nomenclature for restriction enzymes, DNA methyltransferases, homing endonucleases and their genes. Nucleic Acids Res. 2003; 31:1805–1812. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Zheng J., Ge Q., Yan Y., Zhang X., Huang L., Yin Y.. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Res. 2023; 51:W115–W121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Tesson F., Hervé A., Mordret E., Touchon M., d’Humières C., Cury J., Bernheim A.. Systematic and quantitative view of the antiviral arsenal of prokaryotes. Nat. Commun. 2022; 13:2561. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

gkae1175_Supplemental_Files

Data Availability Statement

All data related to the analyses in this work are available from GitHub (https://github.com/nebiolabs/domainator_examples) and Zenodo (https://doi.org/10.5281/zenodo.10989173).


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press