Systematic prediction of functionally linked genes in bacterial and archaeal genomes

Sergey A Shmakov; Guilhem Faure; Kira S Makarova; Yuri I Wolf; Konstantin V Severinov; Eugene V Koonin

doi:10.1038/s41596-019-0211-1

Author manuscript; available in PMC: 2020 Oct 1.

Published in final edited form as: Nat Protoc. 2019 Sep 13;14(10):3013–3031. doi: 10.1038/s41596-019-0211-1

PMCID: PMC6938587 NIHMSID: NIHMS1063598 PMID: 31520072

Abstract

Functionally linked genes in bacterial and archaeal genomes are often organized into operons. However, the composition and architecture of operons are highly variable and frequently differ even among closely related genomes. Therefore, to efficiently extract reliable functional predictions for uncharacterized genes from comparative analyses of the rapidly growing genomic databases, dedicated computational approaches are required. We developed a protocol to systematically and automatically identify genes that are likely to be functionally associated with a ‘bait’ gene or locus by using relevance metrics. Given a set of bait loci and a genomic database defined by the user, this protocol compares the genomic neighborhoods of the baits to identify genes that are likely to be functionally linked to the baits by calculating the abundance of a given gene within and outside the bait neighborhoods and the distance to the bait. We exemplify the performance of the protocol with three test cases, namely, genes linked to CRISPR–Cas systems using the ‘CRISPRicity’ metric, genes associated with archaeal proviruses and genes linked to Argonaute genes in halobacteria. The protocol can be run by users with basic computational skills. The computational cost depends on the sizes of the genomic dataset and the list of reference loci and can vary from one CPU-hour to hundreds of hours on a supercomputer.

Introduction

Functionally linked genes in bacteria and archaea often form operons, arrays of co-expressed and coregulated genes. The composition and organization of homologous operons are notably variable even among closely related prokaryotes, with many genes occurring in operons only sporadically^1,2. The rapid growth of the genomic sequence databases makes ‘guilt by association’ (GBA) an attractive and potentially highly productive principle for detecting functional connections between genes^3,4. Capitalizing on the fact that operons consist of functionally linked genes, the GBA approach involves inferring likely functions of uncharacterized genes that are recurrently found in the vicinity of genes with defined functions in multiple bacterial or archaeal genomes^1,5–8. This methodology has been widely used to predict functions of diverse genes^9–11 and to search for genes that are associated with particular functional systems. The prediction of the archaeal exosome—the elaborate RNA degradation machinery¹², which was subsequently experimentally validated¹³—is a characteristic example of the successful use of GBA. Recently, multiple novel CRISPR–Cas systems^14–18, CRISPR–Cas ancillary genes¹⁹ and anti-CRISPR genes^20,21 have been discovered in this manner and, in many cases, validated experimentally. However, given the variability of operon organization among bacteria and archaea, productive use of GBA requires a dedicated computational procedure that involves evaluation of the relevance of associated genes. Recently, we developed such a computational strategy to systematically predict genes with functional links to the CRISPR–Cas systems, by using a relevance metric we denoted ‘CRISPRicity’¹⁹. Here, we present the Icity protocol, which implements this strategy in a general context.

Random genome rearrangements often reshuffle genes, constantly breaking existing adjacent gene pairs and creating novel pairings never seen before, often bringing together functionally unrelated genes. Therefore, functionally relevant gene association can be reliably detected only when the conservation of gene pairs or longer gene arrays stands above the background and the respective loci are sufficiently represented in the genomic databases. At least three factors complicate the assessment of functional relevance. First, those combinations that reflect biological functionality but are rare in nature are correspondingly rare in the databases. Second, the sampling of sequenced genomes is extremely biased toward species and strains that are important for medicine or biotechnology²², complicating the statistical assessment of observations. For both of these reasons, the GBA approach can either underestimate or overestimate the relevance of the observed association and produce spurious results. Finally, an important source of problems is the extreme diversification of many functional systems, especially those that are involved in the arms races between prokaryotes and their parasites (viruses and other mobile genetic elements), as well as between coexisting bacteria and archaea^23,24. Fast evolution of the components and architectures of the functional systems that are involved in such conflicts creates an enormous diversity of variants that appear nearly unique and can be extremely difficult to recognize as representatives of homologous and functionally analogous systems. This lack of statistical resolution often complicates the selection of strong candidate genes using GBA.

The Icity protocol was designed to search for protein-coding genes that are associated with a set of baits (protein-coding genes or otherwise-defined loci) in large genomic datasets with variable taxonomical representation. For the selected set of baits, the genomic context is analyzed to rank the candidates according to the strength of their linkage to the baits. The functional relevance of the candidates is assessed using the following variables: (i) the number of occurrences of members of a cluster of genes coding for homologous proteins in the vicinity of the baits, (ii) the number of occurrences of the members of the same cluster in the entire genomic dataset and (iii) the distance between members of the cluster and the (nearest) bait. The combination of these variables is used to evaluate and rank the candidate genes. The design of this approach is based on our previous work on successful prediction of novel CRISPR–Cas systems^14,15 and ancillary CRISPR–Cas genes¹⁹. A conceptually similar approach has been shown to be productive for identifying genes that are over-represented in defense gene islands and thus are likely to represent novel defense systems^25,26.

A variety of other computational methods for identifying partially conserved gene neighborhoods in bacterial and archaeal genomes and predicting functional associations between genes and proteins have also been developed^2,5–8,27–29. Most of these procedures aim at operon prediction, which is then used for functional inference, with others identifying neighborhoods with evolutionarily conserved co-localization of genes²⁷ or those enriched in genes with a particular functional annotation²⁵. Notably, one of these methods²⁷ has been successfully applied to describe, for the first time, the cas gene neighborhoods as a distinct functional network¹⁸.

Without addressing the details of these approaches here, we note that the distinctive feature of the Icity strategy is the focus on a specific set of baits that have been functionally characterized or are simply highly conserved in the evolution of bacteria and archaea. Thus, the Icity protocol is best suited for in-depth investigation of functional networks for which some prior information is already available, rather than de novo search for functional associations. This focused approach makes the procedure highly scalable and allows one to quantitatively measure the specificity of the connections between the baits and other genes and thus to predict, with reasonable confidence, even comparatively rare functional associations.

The Icity protocol

Rationale and feasibility

The ready availability of expansive databases of diverse microbial genomes provides ample opportunity for expanding our knowledge of gene functions and genome evolution in bacteria and archaea. However, this diversity makes brute-force computational analyses prohibitively costly because of the enormous amounts of genomic sequence data. To carry out exhaustive searches in comprehensive datasets, the protocol relies on permissive clustering parameters to define protein families and thus minimize the search space. This method is suitable for massive and effective parallelization, making it readily scalable and able to keep up with the growth of prokaryotic sequence databases.

The GBA is a general approach for finding functional associations for genes that are otherwise difficult to classify and, as such, has been successfully used for the past two decades^10,11,30. Here, we combine several features that have been previously used in GBA analyses, namely, the frequency of occurrence of a gene in the vicinity of the bait, the relative frequency of such occurrence compared to the overall abundance of the gene, and the distance from the bait.

Applications and limitations

The procedure can be applied to any genomic multi-component system for which GBA is relevant, i.e., to any gene sets that form operons in at least some microbial genomes. A clear limitation of the Icity protocol is its reliance on the physical proximity of the functionally associated genes in microbial genomes. Consequently, groups of genes that interact functionally but are never encoded in the same operon cannot be predicted. Nevertheless, owing to the evolutionary fluidity of operons, partially overlapping gene neighborhoods from different genomes form extensive connected networks of genes^2,27,31, which makes the Icity approach applicable to a variety of microbial functional systems, especially considering the rapid growth of the genomic databases. Arguably, some of the most promising areas of Icity application are the analyses of defense, signal transduction, antibiotic biosynthesis and resistance, and secondary metabolism networks, which are characterized by extensive shuffling of large sets of genes among partially conserved operons^32–34. Below, we illustrate this broad applicability of Icity with three independent examples. This methodology should be of interest and utility to a broad range of researchers in microbiology and molecular biology who are engaged in microbial genome mining for novel functions and activities, including those with potential applications in biotechnology.

Experimental design

The entire procedure can be presented in seven stages: (1) defining the set of baits in the genomic database (Steps 1–4); (2) reconstruction of the gene neighborhoods around the baits (Step 5); (3) clustering of proteins encoded in the bait neighborhoods by sequence similarity, with a permissive cutoff (Steps 6–8); (4) analysis of the distributions of the protein families from the bait neighborhoods in the genomic database (Steps 9–11); (5) definition of the relevance metric and the method for its calculation (Step 12); (6) scoring of the candidate protein families according to their abundance, enrichment in the bait neighborhoods and distance to the bait (Step 13); and (7) manual curation of the selected candidates (Step 14). This pipeline takes the genomic database (with information on the chromosome or contig structure) and the list of baits as the inputs (both defined by the user) and produces the list of candidate protein families ranked by their relevance scores. The protocol description given in this section covers stages 2–5 in detail, whereas for stages 1, 6 and 7, best practices are suggested. The overall pipeline and a detailed step-by-step schematic of the protocol are shown in Figs. 1 and 2.

Fig. 1 | Seven stages of the pipeline are shown as boxes; each box contains information on the main action and output for the stage.

Fig. 2 | Each stage of the protocol is represented by a gray box. Stage 1: contigs are shown as gray lines and the baits as red stripes within the contigs. Stage 2: ORFs in the contigs are shown as gray polygons. Stage 3: clustering procedure; the color of each ORF reflects the cluster assignment. Stage 4: profile construction from a set of proteins; PSIBLAST hits are shown as red rectangles within ORFs; sorting and filtering of proteins in clusters is performed. Stage 5: strict clustering procedure, Icity calculation and 3D metrics space: Icity, abundance in the genomic database and distance to the baits (red crosses denote clusters that contain Cas proteins, green dots denote clusters containing predicted ancillary CRISPR-linked proteins and blue circles denote clusters that do not include any CRISPR-related proteins). Stage 6: approaches to classify metrics space. Stage 7: methods of manual curation.

Stage 1 (Steps 1–4): the genome database and selection of baits

It is critical that the results remain consistent between the steps of the procedure. The typical rate of update of public genomic databases is commensurate with the amount of time required to run the pipeline, so users should consider performing all steps on a ‘frozen’ copy of the genomic database and the accompanying metadata, especially if the database is maintained by third parties. The database should be customized so as to be representative for the particular case under analysis. For example, if the baits are present mostly in one phylum, then, at least, all available genomes from that phylum should be included in the database. The baits for the Icity protocol are defined by their coordinates in genome partitions (chromosomes), scaffolds or contigs in the genomic database. The baits can be coding or non-coding sequences, including individual protein-coding or RNA genes, entire operons or non-coding sequence features such as CRISPR arrays. The set of baits should be explicitly defined and as complete as possible, to ensure maximum resolution power in the subsequent stages. The protocol works with protein sequences. Therefore, it is essential that the screened genomic database include accurately annotated protein-coding genes. When metagenomic databases are searched, gene prediction software such as GeneMarkS³⁵ should be used to predict the coding sequence (CDS) regions. The existing annotation should be checked for coding density (we used >0.6 coding sequences per kilobase as the threshold), and sequences with low coding density should be reannotated.
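
The coding-density check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the published scripts: the function names and the input layout (a dictionary mapping each contig ID to its CDS count and length) are assumptions made for this example, and only the 0.6 CDS per kilobase threshold comes from the text.

```python
def coding_density(n_cds, contig_length_bp):
    """Return the number of annotated coding sequences per kilobase."""
    if contig_length_bp <= 0:
        raise ValueError("contig length must be positive")
    return n_cds / (contig_length_bp / 1000.0)


def contigs_to_reannotate(contigs, threshold=0.6):
    """Return the sorted IDs of contigs whose coding density does not exceed
    the threshold.

    contigs: dict mapping contig ID -> (number of CDSs, contig length in bp).
    The input layout is hypothetical; it is not a format used by the protocol.
    """
    flagged = []
    for contig_id, (n_cds, length_bp) in contigs.items():
        if not coding_density(n_cds, length_bp) > threshold:
            flagged.append(contig_id)
    return sorted(flagged)


if __name__ == "__main__":
    example = {"contigA": (900, 1000000), "contigB": (300, 1000000)}
    print(contigs_to_reannotate(example))
```

Contigs returned by `contigs_to_reannotate` would then be passed to a gene prediction tool for reannotation.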

Stage 2 (Step 5): annotation of protein-coding genes in the bait neighborhoods

At this stage, the set of protein-coding genes from the vicinity of the baits is assembled. The extent of the region of interest surrounding the baits depends on the expected operon structure of the functional systems that include the baits. For example, in cases in which the system is expected to consist of two genes, such as toxin–antitoxin modules, the bait-flanking region should be limited to one to two protein-coding genes or 1–2 kilobase pairs (kbp) of the genome sequence. For cases such as CRISPR–Cas systems, with the characteristic large operons and extended non-coding functional elements (CRISPR arrays), the distance threshold should be relaxed to 10–15 genes or 10–20 kbp. The extent of the flanking regions determines the trade-off between the capture of all possible candidates and the costs of the computational analysis and manual curation. Prior knowledge of the biology of the system of interest can be used as an additional pre-filtering step. Annotation of protein-coding genes in the bait neighborhoods using databases of protein family profiles^36,37 allows one to exclude obvious non-candidates, such as various housekeeping genes, that are irrelevant to the (predicted) functions of the analyzed system.
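
As an illustration of how a bait neighborhood might be assembled, the sketch below selects all CDSs that lie on the same contig as a bait and overlap a window extending a fixed number of base pairs on either side of it. The record type, the function name and the overlap rule are assumptions made for this example; the scripts distributed with the protocol may define the neighborhood differently (e.g., by gene count rather than by distance).

```python
from collections import namedtuple

# Hypothetical minimal CDS record; field names are illustrative only.
Cds = namedtuple("Cds", ["locus_id", "contig", "start", "stop"])


def neighborhood(cds_list, bait_contig, bait_start, bait_stop, flank_bp=10000):
    """Return the CDSs on the bait's contig that overlap the window
    [bait_start - flank_bp, bait_stop + flank_bp]."""
    window_lo = bait_start - flank_bp
    window_hi = bait_stop + flank_bp
    return [
        cds for cds in cds_list
        if cds.contig == bait_contig
        and cds.stop >= window_lo
        and cds.start <= window_hi
    ]


if __name__ == "__main__":
    genes = [
        Cds("g1", "c1", 100, 900),
        Cds("g2", "c1", 12000, 13000),
        Cds("g3", "c1", 30000, 31000),
        Cds("g4", "c2", 12000, 13000),
    ]
    # A bait at 15000..16000 on contig c1 with a 10-kbp flank on each side
    print([g.locus_id for g in neighborhood(genes, "c1", 15000, 16000)])
```

A smaller `flank_bp` (e.g., 1000–2000) would correspond to the two-gene systems mentioned above, and a larger one to systems with extended operons.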

Stage 3 (Steps 6–8): permissive protein clustering

The candidate protein set constructed at stage 2 is clustered by sequence similarity with permissive parameters. This stage transforms a set of individual genes with their unique genomic locations into families of homologous proteins for which collective trends can be detected and analyzed. Protein components of many cellular systems, in particular, those involved in inter-genomic conflicts, are highly variable, so the clustering threshold should be pushed down to the lowest safe setting, that is, to the point at which clustering of non-homologous sequences begins to become a problem. The choice of the specific clustering method is flexible and should be made by the user. For our clustering tool of choice, MMseqs2 (ref.³⁸), which appears to provide the optimal sensitivity–speed trade-off³⁸, we recommend using a sequence identity threshold of 0.3 and a coverage threshold of 0.1 to achieve the maximum sensitivity. However, with these liberal thresholds, a cluster quality check is essential. For such checking, a PSIBLAST search³⁹ against all members of the cluster can be run using the profile derived from the cluster alignment as the query, and those proteins that are not detected in this search at an appropriate e-value threshold (1 × 10⁻⁴ was optimal for our searches) can be discarded. We recommend re-clustering the singleton sequences and repeating the cluster quality check iteratively three to five times, and then clustering the leftover singletons with a safer clustering threshold (sequence identity 0.5 and coverage 0.3 for MMseqs2).

Stage 4 (Steps 9–11): search for cluster members in the genomic database

In this stage, profiles built from the alignments of protein families assembled in stage 3 are used to scan the entire genomic database to identify all the members of the clusters and their homologs. This search can be performed using PSIBLAST or other profile search tools. The choice of tools should be informed by the size of the genomic database and the required sensitivity and/or specificity. Tuning of the search parameters may be needed, depending on the task, to achieve higher sensitivity or, conversely, to restrict the search space. Certain issues, however, must be addressed during the post-processing of the search results. First, all hits that have an alignment length lower than a predefined threshold (we used 25% of the cluster profile length) are removed. Second, multiple hits into the same sequence from different profiles are decomposed into a set of non-overlapping segments with the highest scores. Hits from the candidate profiles to very distant homologs (e.g., generic NTPase or beta-propeller domains) are eliminated using the following procedure: the new hits are clustered together with the original candidate sequences from the corresponding family using the same permissive clustering parameters, and clusters that do not contain any of the original sequences are rejected. This procedure ensures that the new hits are bona fide members of the original clusters in terms of sequence similarity. These post-processing steps ensure that the representation of the candidate families in the genomic database is optimized in terms of both sensitivity and specificity.
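
The first two post-processing steps can be sketched as follows. This is a simplified illustration under stated assumptions: the `Hit` record and function names are hypothetical, the alignment length is approximated by the span of the hit on the target sequence, and overlapping hits are resolved greedily (highest score first), which is one possible reading of 'non-overlapping segments with the highest scores' rather than the procedure used in the published scripts.

```python
from collections import namedtuple

# Hypothetical hit record: coordinates are 1-based and inclusive on the target.
Hit = namedtuple("Hit", ["profile", "target", "start", "end", "score"])


def filter_short(hits, profile_len, min_fraction=0.25):
    """Drop hits whose span is shorter than min_fraction of the profile length.

    profile_len: dict mapping profile name -> profile length.
    """
    return [
        h for h in hits
        if (h.end - h.start + 1) >= min_fraction * profile_len[h.profile]
    ]


def pick_non_overlapping(hits):
    """Greedily keep the highest-scoring hits that do not overlap, per target."""
    kept = []
    for h in sorted(hits, key=lambda x: x.score, reverse=True):
        overlaps = any(
            h.target == k.target and h.start <= k.end and h.end >= k.start
            for k in kept
        )
        if not overlaps:
            kept.append(h)
    return sorted(kept, key=lambda x: (x.target, x.start))


def postprocess(hits, profile_len, min_fraction=0.25):
    """Apply the length filter, then resolve overlapping hits."""
    return pick_non_overlapping(filter_short(hits, profile_len, min_fraction))
```

The third step (re-clustering new hits with the original candidate sequences and rejecting clusters that lack original members) depends on the clustering tool and is not shown here.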

Stage 5 (Step 12): relevance metrics

To assess the potential functional relevance of the candidate protein clusters to the baits, we introduce three metrics to score the candidates.

The fraction of the candidate gene occurrence in the vicinity of the baits relative to the overall abundance of the respective family, or the ‘icity’ value that is calculated as the ratio of the corresponding weighted counts (ref.¹⁹ and see below), was employed as the most important relevance metric. The ‘icity’ values range from 0 to 1, for which numbers close to 0 indicate that proteins of this cluster rarely can be found around the baits (weak linkage, low relevance), and numbers close to 1 indicate that members of the cluster occur mostly in the vicinity of the baits (strong linkage, high relevance). The preceding stages provide the counts of occurrences for all candidate families. Owing to both the sequencing bias in the current genomic databases and the highly non-uniform distribution of microbial taxa in nature, these raw numbers are strongly biased (e.g., the numbers of observed occurrences of a gene found in Escherichia coli in a particular context and in Buttiauxella gaviniae in a different context are likely to be biased >10,000:1, whereas the ratio of independent occurrences would be close to 1:1). To mitigate this bias, we use the effective number of sequences to calculate the fraction. To this end, the members of a permissive cluster are clustered for the second time using a strict clustering threshold (sequence identity 0.9, coverage by default for MMseqs2), and the number of these strict clusters is used as the effective number of sequences. Genes in the over-represented strains typically cluster together in one strict cluster, reducing the effective number of sequences and bringing the bias under control.
Another useful relevance metric is the median proximity of the genes in a cluster to the baits. The closer the gene is to the bait, the more likely it is that their association is functionally relevant rather than random. To avoid bias caused by the same factors as for the ‘icity’ metric, median distance for representatives of strict clusters should be used.
Finally, taking into account the actual abundance of a gene family in the bait neighborhoods and in the entire genomic dataset (again, counted as the effective number of sequences) helps to differentiate functional associations from random occurrences that could yield high scores for the first two metrics.
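
A minimal sketch of the three metrics for one permissive cluster is given below. It assumes that each cluster member has already been assigned to a strict cluster, and that a strict cluster counts toward the bait vicinity if at least one of its members lies in a bait neighborhood; this counting rule, the function names and the input layout are assumptions made for illustration, and the exact weighting used in ref.¹⁹ may differ.

```python
from statistics import median


def effective_counts(members):
    """Return (effective number near baits, effective number overall).

    members: iterable of (strict_cluster_id, near_bait) pairs, one per protein
    in the permissive cluster; near_bait is True if the gene lies in a bait
    neighborhood. Each strict cluster is counted once.
    """
    members = list(members)
    all_strict = set(sid for sid, _ in members)
    near_strict = set(sid for sid, near in members if near)
    return len(near_strict), len(all_strict)


def icity(members):
    """Fraction of strict clusters that occur in the vicinity of the baits."""
    near, total = effective_counts(members)
    if total == 0:
        raise ValueError("cluster has no members")
    return near / total


def median_bait_distance(representative_distances):
    """Median distance to the nearest bait over strict-cluster representatives."""
    return median(representative_distances)


if __name__ == "__main__":
    cluster = [("s1", True), ("s1", True), ("s1", False),
               ("s2", False), ("s3", True), ("s4", False)]
    print(icity(cluster), effective_counts(cluster))
```

In this toy cluster, the three members of strict cluster `s1` count only once, so an over-represented strain does not inflate either the numerator or the denominator.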

Stage 6 (Step 13): candidate selection and ranking

Multiple approaches can be applied to select or rank candidate families using the metrics described in stage 5. In cases in which a training dataset is available (i.e., other known components of the system under study that can be used to annotate the permissive cluster set), classification can be guided by this information in two ways. Under the first approach, the segment of interest is selected in the metrics space (‘icity’ > the threshold value, bait distance < the threshold value and abundance > the threshold value), the known true and false positives in this segment are counted, and the thresholds are adjusted, optimizing recall and precision (using, e.g., the F-score). The uncharacterized genes that fall within this sector form the list of viable candidates. Alternatively, the metrics space can be broken into segments of equal size (voxels), with each voxel characterized by its true positive to true negative ratio. Voxels with these ratios exceeding the threshold, set by the user, contain the candidates of interest.
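
The first, threshold-based approach can be sketched as a small grid search that maximizes the F-score. The input layout, the function names and the candidate grids are assumptions made for this example; in particular, families not annotated as known components are treated here as negatives, which is a simplification.

```python
from itertools import product


def f_score(tp, fp, fn, beta=1.0):
    """F-beta score computed from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_thresholds(items, icity_grid, distance_grid, abundance_grid):
    """Return (F-score, icity cutoff, distance cutoff, abundance cutoff).

    items: list of (icity, distance, abundance, is_known_component) tuples.
    A family is selected if icity > cutoff, distance < cutoff and
    abundance > cutoff; the first best-scoring combination is returned.
    """
    best = None
    for t_ic, t_dist, t_ab in product(icity_grid, distance_grid, abundance_grid):
        tp = fp = fn = 0
        for ic, dist, ab, known in items:
            selected = ic > t_ic and dist < t_dist and ab > t_ab
            if selected and known:
                tp += 1
            elif selected:
                fp += 1
            elif known:
                fn += 1
        score = f_score(tp, fp, fn)
        if best is None or score > best[0]:
            best = (score, t_ic, t_dist, t_ab)
    return best
```

Uncharacterized families that pass the returned cutoffs would then form the candidate list; the voxel-based alternative is not shown.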

In the absence of a training set, cutoffs for the metrics can be defined from external considerations or, for well-characterized systems, taken from earlier studies. For example, cutoffs derived from our previous work¹⁹ for CRISPR-like systems are ‘icity’ > 0.7, effective distance < 2 (distance < 5 is optimal as well), effective abundance in the entire genomic dataset ≥ 1.

Stage 7 (Step 14): manual curation

Ultimately, the list of the candidate families that is obtained in stage 6 should pass through the expert curation stage, in which the candidates are informally assessed with respect to their potential relevance to the system in question. Almost by definition, such a procedure cannot be specified in general and should be tailored for each case separately. Nevertheless, we can offer some guidance on the features that are useful at this stage. One of the most productive approaches is domain identification combined with analysis of the domain architectures of the candidate proteins. To do this, a representative protein from a candidate family could be analyzed using a Conserved Domain Database search³⁶ and a HHpred search⁴⁰. To achieve higher sensitivity, a manually curated alignment of a candidate family can be used as a query for the HHpred search. Domain composition and architecture of the proteins can point to the function of the candidate family and to a functional linkage to the set of baits. Conservation of specific genomic contexts over larger taxonomic distances (beyond the genus level), especially in the case of predicted operons, can also serve as a strong argument in favor of the functional relevance of the corresponding candidate families as opposed to coincidental gene adjacency due to synteny conservation in closely related genomes.

Materials

Equipment

Data

An up-to-date genomic database is available at the NCBI and can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/refseq/. Any other genomic database can be used, as long as the genome partitions and contigs are sufficiently well annotated for protein-coding genes so that the gene locations (coding sequence coordinates and strands) are available. ▲CRITICAL Filtering out taxonomically irrelevant data (e.g., eukaryotic and virus sequences for a project that involves prokaryotic baits) is essential for both accurate statistics and efficient performance. For large datasets that include shotgun genomic sequences and metagenomic assembly, pre-filtering for contig length (excluding contigs that are shorter than the characteristic size of the system) can markedly improve the performance.

Example datasets

▲CRITICAL We selected three example datasets (files with ‘Set1’, ‘Set2’ and ‘Set3’ prefixes). These datasets and the files generated by the protocol for them are available from ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/.

Database.tar.gz contains the ProteinsDB and CDS.pty files (use Set2.Database.tar.gz for the third dataset). ▲CRITICAL In the first example dataset, the search focused on identifying genes associated with cas10, the signature gene of type III CRISPR–Cas systems⁴¹, in the genomes of the Thermotogae bacteria. This dataset was chosen because of the limited size and wide spread of type III CRISPR–Cas in this phylum⁴¹, which makes it an illustrative example. Example database ProteinsDB (see Database folder on the FTP site) was generated with the makeblastdb command (see NCBI BLAST suite) for all protein sequences annotated in the genomes of the bacteria in the phylum Thermotogae that were present in the NCBI databases in March 2016. The Icity protocol running time for this dataset (57 genomes, 107,957 protein sequences and 1,358 sequences of bait neighborhoods) is <1.5 h. In the CDS.pty file that was generated using the data available at the NCBI FTP site¹⁹, the last column (Generated ID) contains custom protein IDs because, owing to the poor annotation available in public databases, some protein-coding genes were predicted de novo using open reading frame (ORF) prediction software and therefore do not have global accession numbers. The Seeds.tsv file contains coordinates of cas10 genes that are present in ProteinsDB and were identified by running a PSIBLAST search as previously described¹⁹ against the ProteinsDB and then filtering CDS.pty with the protein IDs detected in this search to retrieve the coordinates.
Seeds.tar.gz contains coordinates of Cas10 proteins in the Thermotogae dataset (Set1 prefix), coordinates for the major capsid protein (MCP) of His2 for the second dataset (Set2) and coordinates for Argonaute proteins for the third dataset (Set3). ▲CRITICAL The second dataset includes 133 genomes of Halobacteria (438,885 proteins) that were available in the NCBI RefSeq database as of March 2018. The search of this dataset focused on the identification of genes linked to the MCP of the His2 family of spindle-shaped haloviruses⁴². In the ±10-kb vicinity of 124 seeds, 2,350 proteins were identified (including the seeds) that formed 885 clusters. Running the Icity protocol on this dataset took 5 h.
ProtocolFiles.tar.gz contains all files generated by the pipeline. ▲CRITICAL For the third example, the same dataset was used as for the second one (133 Halobacteria genomes), but the search focused on identifying genes linked to genes of the Argonaute protein family⁴³. The search in the ±10-kb vicinity of 36 seeds recovered 569 proteins that formed 390 clusters. Running the Icity protocol for this example took 2.5 h.

Scripts

The scripts used in this protocol are accessible via the NCBI GitHub page https://github.com/ncbi/ICITY and are designed to be executed in a Unix environment. The source code for the scripts is available under NCBI license; see LICENSE.txt. All scripts are provided as Python code or BASH scripts that should be run with Python or BASH commands accordingly. The protocol is designed to use the file formats specified below. All indicated times for the example datasets are for a single CPU (using only one process/thread).

Hardware

The type and requirements for data storage depend on the size of the analyzed genomic database. As of December 2018, the up-to-date NCBI prokaryotic database takes ~340 GB of storage space; temporary files might require approximately half of that space.
Access to computational nodes (computer farm or multiple CPUs). Given that most of the computationally costly steps are highly parallelizable, access to computational nodes that have sufficient memory (see BLAST memory requirements) to execute searches in the genomic database is essential.

Software

▲CRITICAL As presented, the protocol is designed to run using the specified suite of tools in a Unix environment (the protocol was tested on CentOS Linux 7 (Core), Ubuntu 18.10 and macOS High Sierra v.10.13.6) and requires the following dependencies (the tools and commands are described for Unix).

Python 3.4+: https://www.python.org/downloads/ (scripts provided for the protocol were implemented with Python 3.4)
NCBI BLAST suite: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (v.2.7.1 was used for the example dataset and scripts provided)
Clustering tools: MMseqs2 (preferred) can be downloaded from https://github.com/soedinglab/MMseqs2 (the MMseqs2 version used in the protocol was downloaded in July 2018 (v.8306271e1e0d40f05963d6a7d10533767914cdca))
Sequence alignment: Muscle⁴⁴, http://www.drive5.com/muscle/ (Muscle v.3.7 was used in the protocol)

Equipment setup

Environment and tools

The scripts used in the protocol require a Unix environment and the standard Python package (v.3.4+). Installation instructions for the Python package can be found at https://docs.python.org/3/using/unix.html. For the BASH scripts provided, Python 3.4+ should be available by calling python in the command line (if Python is available by another name, change the BASH scripts accordingly). Owing to interdependencies in all scripts provided, they should be locally available (all scripts must be located in one folder, where they are executed).

Other dependencies

The following dependencies, Python, Muscle, MMseqs2, blastdbcmd and PSIBLAST, should be available in the PATH variable of Unix. Tools should be globally available in the environment to call them by name without specifying the direct path. For example, the command blastdbcmd should be executable from any folder. This can be done with the export command in BASH in the Unix environment (run export PATH=$PATH:/path to the program in BASH to add programs to the path variable). Owing to dependencies in the scripts provided for the example dataset, they should be placed in the folder in which the pipeline will be executed or should reside in a folder accessible via the $PATH Unix environment variable.
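
Before running the pipeline, it can be convenient to confirm that the dependencies are reachable through PATH. The sketch below uses `shutil.which` from the Python standard library; the executable names `muscle` and `mmseqs` are assumptions about how these tools are installed locally and may need to be adjusted.

```python
import shutil

# Executable names as they are assumed to appear on PATH; adjust if your
# installation uses different names (e.g., a versioned Muscle binary).
REQUIRED_TOOLS = ("python", "muscle", "mmseqs", "blastdbcmd", "psiblast")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Any tool reported as missing can be made available by extending PATH with the export command as described above.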

Procedure

Stage 1: database/metadata preparation and selection of the loci of interest ● Timing Variable

Preparation of BLAST database and file containing CDS coordinates

To create a new dataset, proceed with option A; to run the procedure with the example dataset provided (CRISPR dataset; see ‘Equipment’ section), follow instructions in option B.

Generation of a new dataset ● Timing Variable, depending on the data size

Create new BLAST database with the makeblastdb command from the BLAST tools package:
```
makeblastdb -in ProteinSequencesFasta.faa -out ProteinsDB -dbtype prot -parse_seqids
```
The ProteinSequencesFasta.faa file provided by the user contains sequence information for CDSs extracted from genomes of interest. ProteinsDB = output name of the generated database.

To prepare a file containing CDS coordinates, create a CDS.pty file (text file, tab-separated values) containing the descriptions of ORFs in the ProteinsDB, using the following format:

LocusID	ORFStart.. ORFStop	Strand	OrganismID	ContigID	Accession no.	Generated ID
CTN_0007	5030..5731	+	Thermotoga_neapolitana_DSM_4359_GCA_000018945.1	CP000916.1	ACM22183.1	1001945397
CTN_0008	5733..5990	-	Thermotoga_neapolitana_DSM_4359_GCA_000018945.1	CP000916.1	ACM22184.1	1001945398


Generated ID here defines a unique decimal number assigned to a protein sequence.

For a custom dataset, this file could be generated using the information produced by ORF prediction software, e.g., GeneMarkS³⁵.
▲CRITICAL STEP ORF lists must be sorted by contig and by start position.
▲CRITICAL STEP File names (e.g., ProteinsDB and CDS.pty) are optional; if custom names are used, all commands should be modified accordingly.
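
Because the ORF list must be sorted by contig and by start position, a custom CDS.pty file can be checked before use. The sketch below assumes the seven-column, tab-separated layout shown above and simple start..stop coordinates; the function and field names are illustrative and do not come from the pipeline scripts.

```python
from typing import Iterable, NamedTuple


class Orf(NamedTuple):
    locus_id: str
    start: int
    stop: int
    strand: str
    organism: str
    contig: str
    accession: str
    generated_id: str


def parse_pty_line(line: str) -> Orf:
    """Parse one tab-separated CDS.pty line in the seven-column layout."""
    locus, coords, strand, organism, contig, accession, gen_id = (
        line.rstrip("\n").split("\t")
    )
    start, stop = coords.split("..")
    return Orf(locus, int(start), int(stop), strand, organism, contig,
               accession, gen_id)


def is_sorted_by_contig_and_start(orfs: Iterable[Orf]) -> bool:
    """True if each contig forms one block with non-decreasing ORF starts."""
    seen = set()
    prev_contig, prev_start = None, None
    for orf in orfs:
        if orf.contig != prev_contig:
            if orf.contig in seen:
                return False  # contig appears in more than one block
            seen.add(orf.contig)
        elif orf.start < prev_start:
            return False  # start positions go backwards within a contig
        prev_contig, prev_start = orf.contig, orf.start
    return True
```

Applying is_sorted_by_contig_and_start to the parsed lines of a file returns False as soon as a contig reappears or a start position decreases.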

Downloading of example BLAST database ● Timing ~1 min
1. Download and unzip the example database (for first dataset), using the following commands:
```
wget ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/Set1.Database.tar.gz
tar -zxvf Set1.Database.tar.gz
```
  After running these commands, the database can be accessed under the name Database/ProteinsDB. The file containing CDS information will be stored as Database/CDS.pty.
  
  ▲CRITICAL STEP All the commands should be executed in the same working folder; otherwise, custom paths should be specified for the scripts in the subsequent parts of the procedure.

2
Generation of file containing coordinates of the seeds (baits). To create a seeds file for a new dataset, proceed with option A; to use the seeds file from the example dataset, proceed with option B.
1. Creation of a new seeds file ● Timing Variable
  1. Create the file containing seed coordinates (text file, tab-separated values). Arbitrary loci, e.g., protein- or RNA-coding genes, operons or repeat regions, can serve as seeds. During the subsequent steps, the neighborhoods around the seeds, specified in this file, will be analyzed. The file should have the following format:
    
    LociID	ContigID	Start	Stop
    1001710480	CP000771.1	1497646	1500138
    
2. Download a seeds file for the example dataset ● Timing ~1 min
  1. Run the following command to download the seeds file.
    wget ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/Set1.Seeds.tar.gz
    tar -zxvf Set1.Seeds.tar.gz

Downloading of pipeline scripts to the working folder ● Timing ~1 min

3
Download an archive with all files from https://github.com/ncbi/ICITY; extract all files from the archive to the working folder.

■PAUSE POINT Steps 1–3 generate the input data for the computational part of the pipeline. The subsequent computational part of the pipeline (Steps 5–12, Stages 2–5) can be run with different parameters using the same input from this step.

Running of Icity pipeline in automatic mode ● Timing ~1 h for the example dataset

Run the Icity pipeline batch file to run the computational part (Steps 5–12, stages 2–5) of the procedure in automatic mode with the following command:

python icity.py

Pass input parameters to the pipeline in the config.py configuration file, which has the following values:

Parameter name	Value for the example dataset	Description
PTYFile	Database/CDS.pty	Path to the file generated in Step 1
SeedsFile	Seeds.tsv	Path to the file generated in Step 2
NeighborhoodVicinitySize	10000	Width of the region around the seed (base pairs)
PathToDatabase	Database/ProteinsDB	Path to the database generated in Step 1
PermissiveClusteringThreshold	0.3	Sequence similarity clustering threshold
SortingOverlapThreshold	0.4	Overlap threshold to sort BLAST hits
SortingCoverageThresold	0.25	Coverage threshold to sort BLAST hits


And specify the file names for output information:

Parameter name	Example value	Description
ICITYFileName	Relevance_09.tsv	File with relevance values for protein clusters
VicinityClustersFileName	VicinityPermissiveClustsLinear.tsv	File with protein cluster information
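
For orientation, the parameters in the two tables above could be written out as follows. This is only a sketch that assumes config.py holds plain module-level assignments; the actual layout of the file in the repository may differ, so the provided file should be edited rather than replaced.

```python
# Sketch of config.py values for the example dataset (assumed layout).

# Input data
PTYFile = "Database/CDS.pty"            # CDS coordinates file from Step 1
SeedsFile = "Seeds.tsv"                 # bait coordinates file from Step 2
PathToDatabase = "Database/ProteinsDB"  # BLAST protein database from Step 1

# Analysis parameters
NeighborhoodVicinitySize = 10000        # region around each seed, in base pairs
PermissiveClusteringThreshold = 0.3     # sequence similarity clustering threshold
SortingOverlapThreshold = 0.4           # overlap threshold for sorting BLAST hits
SortingCoverageThresold = 0.25          # coverage threshold (name spelled as in the table)

# Output files
ICITYFileName = "Relevance_09.tsv"      # relevance values for protein clusters
VicinityClustersFileName = "VicinityPermissiveClustsLinear.tsv"  # cluster information
```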


If you have run the pipeline in automatic mode using this step, proceed directly to Step 13.

▲CRITICAL STEP Running the pipeline in batch mode without modifications is advised only for small datasets (<1,000 genomes); for larger datasets, parallelization is required. To control and vary parameters without running the whole pipeline again, we advise running the pipeline step by step, following Steps 5–12.

? TROUBLESHOOTING

Stage 2: annotation of protein-coding genes in the bait neighborhoods

Identification of protein-coding genes around the baits

With a custom dataset, proceed with option A; with the example dataset, proceed with option B.

Identify protein-coding genes around the baits in the new dataset ● Timing Variable

Select the width of the neighborhood (in base pairs) that will determine the number of genes in the upstream and downstream regions to be analyzed. Run the SelectNeighborhood.py script with the following parameters to find the protein-coding genes in these regions:

-h, --help	Shows help message
-p P	PTYDataFileName, complete .pty file for contigs
-s S	SeedsFileName, seeds .tsv file
-o O	ResultFileName, output .pty file
-d D	Offset around seed (base pairs)


The script will save results into a file with the following format (tab-separated values file):

GI	ORF coordinates	Strand	Genome	Contig
===
1001946080	691014..692177	+	Thermotoga_neapolitana_DSM_4359_GCA_000018945.1	CP000916.1
1001946081	692179..692862	-	Thermotoga_neapolitana_DSM_4359_GCA_000018945.1	CP000916.1


The selected loci contained in the file are separated by lines containing ‘===’; each line contains information on an ORF in the selected area around the baits.

Identify protein-coding genes around the baits in the example dataset ● Timing ~1 min
1. Annotate protein-coding genes in the ±10-kb vicinity of the baits in the example dataset and save the annotation in the Vicinity.tsv file with the following command:
```
python SelectNeighborhood.py -p Database/CDS.pty -s Seeds.tsv -o Vicinity.tsv -d 10000
```
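
The idea behind neighborhood selection can be illustrated with a short Python sketch. It is not the code of SelectNeighborhood.py: it assumes that an ORF belongs to a neighborhood if it overlaps the seed interval widened by the offset on both sides, and the type and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Orf = Tuple[str, int, int]        # (protein ID, ORF start, ORF stop)
Seed = Tuple[str, str, int, int]  # (locus ID, contig ID, start, stop)


def select_neighborhood(orfs_by_contig: Dict[str, List[Orf]],
                        seed: Seed, offset: int) -> List[Orf]:
    """Return ORFs on the seed's contig that overlap the widened seed window."""
    _, contig, seed_start, seed_stop = seed
    window_start = min(seed_start, seed_stop) - offset
    window_stop = max(seed_start, seed_stop) + offset
    return [
        orf for orf in orfs_by_contig.get(contig, [])
        if orf[2] >= window_start and orf[1] <= window_stop
    ]


# Seed taken from the seeds file example; the ORF coordinates are made up.
example_orfs = {
    "CP000771.1": [
        ("orf_a", 1480000, 1481000),
        ("orf_b", 1490000, 1491000),
        ("orf_c", 1505000, 1506000),
        ("orf_d", 1520000, 1521000),
    ]
}
example_seed = ("1001710480", "CP000771.1", 1497646, 1500138)
print(select_neighborhood(example_orfs, example_seed, 10000))
```

With a 10,000-bp offset, only orf_b and orf_c fall inside the window in this made-up example.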

Stage 3: permissive clustering of the protein set

Collection of protein IDs from bait neighborhoods ● Timing ~1 min

6
Run the following command in the Unix environment to get IDs from the file generated in Step 5:
```
grep -v "===" Vicinity.tsv | cut -f1 | sort -u > VicinityIDs.lst
```
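
For users who prefer to stay within Python, the same ID collection can be sketched as below. This is an illustrative equivalent and not one of the provided scripts; unlike the shell command, it also skips empty lines, and its sort order may differ from that of the Unix sort utility depending on locale.

```python
def collect_vicinity_ids(lines):
    """Return sorted unique first-column IDs, skipping '===' separator lines."""
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or "===" in line:
            continue
        ids.add(line.split("\t")[0])
    return sorted(ids)


example_lines = [
    "===\n",
    "1001946080\t691014..692177\t+\tGenome\tCP000916.1\n",
    "1001946081\t692179..692862\t-\tGenome\tCP000916.1\n",
    "===\n",
    "1001946080\t691014..692177\t+\tGenome\tCP000916.1\n",
]
print(collect_vicinity_ids(example_lines))
```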

Getting protein sequences from the database ● Timing ~1 min

7
Run the following command, using the file generated in Step 6 to get protein sequences from the database created in Step 1:
```
blastdbcmd -db Database/ProteinsDB -entry_batch VicinityIDs.lst -long_seqids > Vicinity.faa
```

Run clustering with permissive parameters ● Timing ~1 min

8
Run the following command to cluster protein sequences contained in the file Vicinity.faa using a sequence similarity cutoff value of 0.3 and save results in the VicinityPermissiveClustsLinear.tsv file:
```
bash RunClust.sh Vicinity.faa 0.3 VicinityPermissiveClustsLinear.tsv
```
Use the following parameters:
```
RunClust.sh
Argument 1: FASTA file name
Argument 2: sequence similarity clustering threshold
Argument 3: result clusters FileName
```
Clustering is performed using the MMseqs2 library. The entire clustering procedure is implemented in one shell script. See Supplementary Data for clustering procedure description.

▲CRITICAL STEP Shell scripts should be called using a BASH shell. Running them with other shells may produce errors in some Unix environments.

▲CRITICAL STEP The results of clustering might differ from run to run because of the stochastic approach implemented in MMseqs2.

▲CRITICAL STEP Clustering sequences by similarity with permissive parameters is always fraught with the risk of overclustering, and some protein families are more prone to it than others. If overclustering is evident or an additional safeguard is desired, one can use an iterative cluster refinement procedure (Supplementary Data).

▲CRITICAL STEP For parallelization of this step, see Help for multiprocessing parameters for MMseqs2 library (run mmseqs cluster -h).

Stage 4: searching for cluster representatives in a genomic database

Creation of protein profiles from representatives of a permissive cluster ● Timing ~5 min

9
To make profiles for the clusters, run the following script:
```
python MakeProfiles.py -f VicinityPermissiveClustsLinear.tsv -c CLUSTERS/ -d Database/ProteinsDB
```
Use the following parameters:

-h, --help	Shows help message
-f F	Clusters file name
-c C	Folder name where profiles will be saved
-d D	Path to the protein database


This script will create a protein profile for each permissive cluster for proteins from the genomic example database using the Muscle program and will save the profiles to the CLUSTERS folder with an ‘.ali’ extension and CLUSTER_ prefix with line number after the prefix as cluster ID (if the directory does not exist, the script will create it).

▲CRITICAL STEP In the provided scripts, protein profiles are constructed using a simplified approach; for a more advanced approach, see ref.¹⁹.

▲CRITICAL STEP This step could be time consuming with large datasets. Parallelization is strongly recommended.

Running BLAST to search for generated protein profiles ● Timing ~5 min

10
Run the following script, which will execute a PSIBLAST search of the genomic database with the profiles created at Step 9 used as queries and save results for each cluster with a ‘.hits’ extension in the CLUSTERS folder:
```
python RunPSIBLAST.py -c CLUSTERS/ -d Database/ProteinsDB
```
Use the following parameters:

-h, --help	Shows help message
-c C	Folder name where profiles will be saved
-d D	Path to the protein database

This script runs a PSIBLAST search for the created profiles with the following parameters:
```
psiblast -db <Database> -outfmt "7 qseqid sseqid slen sstart send evalue qseq sseq qstart qend score" -seg no -evalue <Evalue> -dbsize 20000000 -max_target_seqs 10000 -comp_based_stats no <Query> > <BLASTHitsFileName>
```
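
The tabular output requested by the -outfmt string above can be read with a few lines of Python. The sketch below assumes tab-separated columns in the order given in the command and skips comment lines (which start with '#') as well as any line that does not have exactly 11 fields; the function name is illustrative and is not part of the pipeline.

```python
FIELDS = ["qseqid", "sseqid", "slen", "sstart", "send", "evalue",
          "qseq", "sseq", "qstart", "qend", "score"]
INT_FIELDS = ("slen", "sstart", "send", "qstart", "qend")


def parse_hits(lines):
    """Parse tabular PSIBLAST output lines into a list of dictionaries."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            continue  # skip status lines that are not hit records
        hit = dict(zip(FIELDS, values))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        hit["evalue"] = float(hit["evalue"])
        hit["score"] = float(hit["score"])
        hits.append(hit)
    return hits


# Made-up record for illustration only.
example_output = [
    "# PSIBLAST\n",
    "CLUSTER_1\t1002058134\t283\t1\t283\t1e-50\tMKV\tMKV\t1\t283\t1492\n",
    "Search has CONVERGED!\n",
]
print(parse_hits(example_output))
```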

Sorting of blast hits between clusters ● Timing ~2 min

Run the following command, which will read the BLAST hits from the CLUSTERS folder and save the sorted results for each cluster in the CLUSTERS/Sorted/ folder with a ‘.hits_sorted’ extension:

python SortBLASTHitsInMemory.py -c CLUSTERS/ -o CLUSTERS/Sorted/ -p Database/CDS.pty -i VicinityIDs.lst -s Seeds.tsv -v Vicinity.tsv -z 0.4 -x 0.25

Use the following parameters:

-h, --help	Shows help message
-c C	Folder name where hits are stored
-o O	Folder name where sorted result hits are stored
-p P	PTY file containing coordinates of ORFs in the genomic database
-i I	List of protein IDs in the vicinity of baits
-s S	Bait coordinates
-v V	Vicinity of the baits, needed to calculate distances to the baits
-z Z	Overlap threshold; hits are subject to sorting between two profiles if they overlap by more than the threshold value
-x X	Coverage threshold; hits are stored if they cover original profile by more than the threshold value


Output files have the following format:

ProteinID	BLAST score	Alignment start	Alignment stop	Alignment sequence	CLUSTERID	Contig	Is in vicinity islands	ORF start	ORF stop	Distance to the bait
1002058134	1492	1	283	MKV…	CLUSTER_1	CP000969.1	1	289168	290019	7


The script can be described in pseudo-code, as shown below:

Load cluster PSIBLAST hits:
  For all cluster profile hits, for each hit found:
    Filter out hits that cover less than 25% of the original alignment
  For each protein ID that was hit by a cluster profile:
    Store the best-scoring hits (with a reference to the profile that generated each hit)
    If two hits overlap by more than 40%, store only the one with the higher score
  For each cluster protein profile:
    Collect and store the hits that remain after the sorting procedure
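
The pseudo-code can be turned into a compact Python sketch. This is an illustration under stated assumptions rather than the code of SortBLASTHitsInMemory.py: the overlap between two hits is measured relative to the shorter hit, competing hits are resolved greedily in order of decreasing score, and the Hit fields are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    cluster_id: str   # profile that produced the hit
    protein_id: str   # database protein that was hit
    start: int        # hit start on the protein
    stop: int         # hit stop on the protein
    score: float
    coverage: float   # fraction of the profile alignment covered by the hit


def overlap_fraction(a: Hit, b: Hit) -> float:
    """Overlap length divided by the shorter hit length (assumed definition)."""
    shared = min(a.stop, b.stop) - max(a.start, b.start) + 1
    shorter = min(a.stop - a.start, b.stop - b.start) + 1
    return max(shared, 0) / shorter


def sort_hits(hits: Iterable[Hit], overlap_threshold: float = 0.4,
              coverage_threshold: float = 0.25) -> Dict[str, List[Hit]]:
    """Assign each protein region to its best-scoring profile."""
    by_protein = defaultdict(list)
    for hit in hits:
        if hit.coverage >= coverage_threshold:  # drop poorly covering hits
            by_protein[hit.protein_id].append(hit)

    by_cluster = defaultdict(list)
    for protein_hits in by_protein.values():
        kept = []
        for hit in sorted(protein_hits, key=lambda h: h.score, reverse=True):
            # Keep a hit only if it does not overlap a better one too much.
            if all(overlap_fraction(hit, other) <= overlap_threshold
                   for other in kept):
                kept.append(hit)
        for hit in kept:
            by_cluster[hit.cluster_id].append(hit)
    return dict(by_cluster)
```

In this sketch, two profiles that hit the same region of a protein compete, and only the higher-scoring hit is retained for the relevance calculation.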

▲CRITICAL STEP With large datasets, this script will require sufficient RAM to load the information on all BLAST hits into memory.

Stage 5: relevance metrics

Calculation of relevance metrics for all protein clusters ● Timing ~30 min

12
Run the following script, which will calculate effective cluster sizes and the relevance metrics for all sorted hits in CLUSTERS/Sorted/, created in the previous step, and save results into the file specified in arguments:
```
bash CalculateICITY.sh CLUSTERS/Sorted/ Database/ProteinsDB VicinityPermissiveClustsLinear.tsv Relevance.tsv
```
Use the following parameters:
```
CalculateICITY.sh
```
Argument 1: clusters folder path

Argument 2: path to protein database

Argument 3: path to the file with clusters information

Argument 4: result file name with cluster relevance information

Effective cluster sizes and the relevance metrics are calculated in this script for all clusters in the specified folder. The output file has the following format:

ClusterID	Effective size in vicinity of baits	Effective size in entire database	Median distance to bait (in ORFs)	Icity
CLUSTER_10	2	6	3	0.3333
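
As a worked example, the Icity value in the row above is the effective size in the vicinity of the baits divided by the effective size in the entire database. The following minimal sketch reproduces that arithmetic; the function name is illustrative, and the check for non-positive sizes is an added safeguard rather than documented behavior of the scripts.

```python
def icity(size_in_vicinity: float, size_in_database: float) -> float:
    """Ratio of effective cluster size near the baits to the database-wide size."""
    if size_in_database <= 0:
        # Non-positive sizes indicate a failed clustering or filtering step.
        raise ValueError("effective size in the database must be positive")
    return size_in_vicinity / size_in_database


# Values from the CLUSTER_10 example row: 2 in the vicinity, 6 in the database.
print(round(icity(2, 6), 4))  # 0.3333
```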

To perform this procedure for a single selected cluster, use the following script:
```
GetIcityForBLASTHits.py
```
-h, --help	Shows help message
-f F	Sorted PSIBLAST hits file name
-o O	Result .tsv file
-d D	Genomic database
-c C	Permissive clusters file name

where Sorted PSIBLAST hits file name is one of the files generated by SortBLASTHitsInMemory.py. A result file with effective sizes (using a sequence similarity threshold of 0.9 with MMseqs2) and distance will be saved into the specified result file.
▲CRITICAL STEP This process is computationally expensive. For large datasets, we suggest parallelizing this procedure. To do this, modify CalculateICITY.sh to run the GetIcityForBLASTHits.py script (currently called by the shell script in a loop) in parallel.
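
One possible way to parallelize this step on a single multi-core machine is sketched below in Python. It is an assumption-laden illustration, not a replacement for CalculateICITY.sh: it uses only the GetIcityForBLASTHits.py options listed above, writes one hypothetical result file per cluster into an output folder that must already exist, and leaves the merging of these files into a single table to the user.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(hits_file, database, clusters_file, out_dir):
    """Build the GetIcityForBLASTHits.py call for one sorted hits file."""
    hits_path = Path(hits_file)
    result_file = Path(out_dir) / (hits_path.stem + ".tsv")
    return [
        "python", "GetIcityForBLASTHits.py",
        "-f", str(hits_path),
        "-o", str(result_file),
        "-d", database,
        "-c", clusters_file,
    ]


def build_commands(sorted_dir, database, clusters_file, out_dir):
    """Build one command per '.hits_sorted' file in the sorted hits folder."""
    return [
        build_command(path, database, clusters_file, out_dir)
        for path in sorted(Path(sorted_dir).glob("*.hits_sorted"))
    ]


def run_parallel(commands, workers=4):
    """Run the commands concurrently and return their exit codes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode,
                             commands))
```

Threads are sufficient here because the heavy computation happens in the external processes. On a computer cluster, the same command list could instead be submitted to a job scheduler.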

? TROUBLESHOOTING

■PAUSE POINT Steps 4–12 generate all the data required for manual analysis.

Stage 6: candidate selection

Classification of clusters and selection of linked candidates ● Timing ~5 min

13
Score each cluster and select top candidates for manual curation, using the values calculated in stage 5. In our study with the CRISPR–Cas dataset, we found that the ‘Icity’ value should be >0.6 and the distance should be <5; these values can be used for systems resembling CRISPR–Cas in terms of the characteristic size (for an alternative approach to selecting candidates, see Box 1). Clusters with an effective abundance in the genomic database equal to 1 should be discarded or manually reviewed.
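
The cutoffs described in this step can be applied to the relevance file with a short Python sketch. It assumes the five-column, tab-separated layout of the output of Step 12 and that any header line has already been skipped; the names are illustrative, and clusters with an effective database abundance of 1 are set aside for manual review rather than silently dropped.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ClusterRelevance(NamedTuple):
    cluster_id: str
    size_in_vicinity: float
    size_in_database: float
    median_distance: float
    icity: float


def parse_relevance_line(line: str) -> ClusterRelevance:
    """Parse one tab-separated data line of the relevance file."""
    cid, vicinity, database, distance, value = line.rstrip("\n").split("\t")
    return ClusterRelevance(cid, float(vicinity), float(database),
                            float(distance), float(value))


def select_candidates(
    clusters: Iterable[ClusterRelevance],
    min_icity: float = 0.6,
    max_distance: float = 5.0,
) -> Tuple[List[ClusterRelevance], List[ClusterRelevance]]:
    """Split clusters passing the cutoffs into candidates and review cases."""
    candidates, needs_review = [], []
    for cluster in clusters:
        if cluster.icity > min_icity and cluster.median_distance < max_distance:
            if cluster.size_in_database == 1:
                needs_review.append(cluster)  # too little statistical support
            else:
                candidates.append(cluster)
    return candidates, needs_review
```

The default thresholds are those reported above for the CRISPR–Cas dataset and would likely need to be adjusted for systems of a different characteristic size.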

Box 1 |. Alternative approach to selection of candidates.

Derive relevance cutoff using the information on known associated genes:

When other genes associated with the selected set of baits are known, the relevance cutoff can be estimated from these examples using various classification methods. The Relevance_types.tsv file (see Supplementary_data file on the FTP site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/Supplementary_data.tar.gz)) contains annotation for clusters from ref.¹⁹ using the same format as Relevance.tsv, with an additional column that contains information on the association with CRISPR. Divide the metrics space as follows: split Icity by 0.1, log abundance by 0.5, distance by 1. For each sector in this space, calculate the F_0.5 score⁴⁵ using CRISPR–Cas and associated genes as true positives and non-Cas genes as false positives. Select candidates from the sector with the highest F_0.5 score. The division of the metrics space into sectors is illustrated in Fig. 4 for the F_0.5 classification results.
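
The search over the metrics space can be sketched in Python as follows. This is one possible reading of the procedure, not the authors' implementation: each sector is taken to be the region defined by a triple of cutoffs (Icity above a cutoff, log abundance at or above a cutoff, distance below a cutoff), clusters annotated as linked that fall outside the sector count as false negatives, and the base of the logarithm is left to the user. All names are hypothetical.

```python
import itertools
from typing import Iterable, NamedTuple, Optional, Sequence, Tuple


class LabeledCluster(NamedTuple):
    icity: float
    log_abundance: float
    distance: float
    is_linked: bool  # True for Cas or associated, False for non-Cas


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta score from true positive, false positive and false negative counts."""
    denominator = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denominator if denominator else 0.0


def best_sector(
    clusters: Iterable[LabeledCluster],
    icity_cuts: Sequence[float],
    abundance_cuts: Sequence[float],
    distance_cuts: Sequence[float],
) -> Tuple[float, Optional[Tuple[float, float, float]]]:
    """Return the highest F0.5 score and the cutoff triple that achieves it."""
    clusters = list(clusters)
    total_linked = sum(c.is_linked for c in clusters)
    best_score, best_cuts = 0.0, None
    for cuts in itertools.product(icity_cuts, abundance_cuts, distance_cuts):
        t_icity, t_abundance, t_distance = cuts
        inside = [c for c in clusters
                  if c.icity > t_icity
                  and c.log_abundance >= t_abundance
                  and c.distance < t_distance]
        tp = sum(c.is_linked for c in inside)
        fp = len(inside) - tp
        score = f_beta(tp, fp, total_linked - tp)
        if score > best_score:
            best_score, best_cuts = score, cuts
    return best_score, best_cuts
```

Cutoff grids matching the suggested step sizes could be built, for example, as [i / 10 for i in range(10)] for Icity and range(1, 11) for distance.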

Stage 7: manual curation

Analysis of protein clusters with domain detection tools ● Timing Variable

14
Identify domains in the candidate clusters by running an RPSBLAST search against the Conserved Domain Database (CDD) profiles (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and/or an HHpred search (https://toolkit.tuebingen.mpg.de/#/tools/hhpred). We recommend adding the CDD to the PDB database for the HHpred search; other databases (PFAM (https://pfam.xfam.org/) and COG (https://www.ncbi.nlm.nih.gov/COG/)/KOG (https://mycocosm.jgi.doe.gov/Tutorial/tutorial/kog.html)) also might be useful in specific contexts.

Troubleshooting

Troubleshooting advice can be found in Table 1.

Table 1 |.

Troubleshooting table

Step	Problem	Possible reason	Solution
4	icity.py returns message that some package is missing Example: ‘mmseqs not available required software missing’	Required software package not installed or not available	Required package should be installed and available in the PATH variable of Unix; see ‘Software’ section in ‘Materials’
	icity.py returns message that step has failed. Example: ‘Step 5: Selecting neighborhoods failed’	Error occurred during execution of script responsible for a certain step	Run the step script manually to see detailed error message
12	0 in ‘Effective size’ column	Indicates that there was a problem with the clustering procedure. The current version of MMseqs2 produces an error when clustering two identical proteins, so that the cluster file contains no information	Clusters should be manually corrected or discarded
	−1 in ‘Effective size’ column	Indicates that there are no sequences left after filtering of the BLAST hits result file	Clusters should be manually corrected or discarded


Timing

The most important variables that affect performance are the size of the genomic database, the number of baits and the size of the analyzed bait neighborhoods. Because most of the time-consuming stages (cluster consistency check in stages 3, 4 and 5) are easy to parallelize, access to multiple computational nodes (computer farm or cluster) considerably improves the overall performance. Our previous analysis of CRISPR–Cas systems¹⁹ was performed with a database containing ~50,000 prokaryotic genomes and a set of 40,000 baits (all identified CRISPR–Cas systems). Running the parallelized pipeline for this dataset, with 20-kbp neighborhoods and access to 100 computational nodes, takes ~100 h. Running the same pipeline for the provided example database (57 Thermotogae genomes, 80 baits, 20-kbp neighborhoods) takes less than an hour on a computer with one CPU. The manual curation stage requires a skilled bioinformatician, and analysis of 1,000 candidates can take up to a month.

For the first example dataset, the timing was as follows:

Steps 1–3, downloading of data and scripts: ~10 min
Step 4, running the pipeline in automatic mode: ~1 h
Steps 5–14, running the pipeline in manual mode: ~1 h
Step 5, identification of protein-coding genes around the baits: ~1 min
Step 6, collection of protein IDs from bait neighborhoods: ~1 min
Step 7, obtaining protein sequences from the database: ~1 min
Step 8, running clustering with permissive parameters: ~1 min
Step 9, creation of protein profiles from representatives of a permissive cluster: ~5 min
Step 10, running a BLAST search for generated protein profiles: ~5 min
Step 11, sorting of blast hits between clusters: ~2 min
Step 12, calculation of relevance metrics for all protein clusters: ~30 min
Step 13, classification of clusters and selection of linked candidates (using predefined cutoffs): ~5 min
Step 14, analysis of protein clusters with domain detection tools: variable

Anticipated results

This protocol will yield an annotated list of protein families from the provided genomic database that are associated with and are likely to be functionally linked to the provided set of baits. The main output includes the list of protein clusters assembled from the neighborhoods of the baits; relevance metrics for these clusters, which show the number of proteins that belong to the cluster in the specified neighborhoods and in the entire database; and the median distance between the members of the cluster and the baits. For the example datasets provided, information on clusters is contained in the VicinityPermissiveClustsLinear.tsv file and the relevance metric is in the Relevance.tsv file (see Fig. 3 and the Set1.ProtocolsFiles.tar.gz archive on the FTP site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/Set1.ProtocolFiles.tar.gz) for the Thermotogae dataset, Supplementary_data.tar.gz, and the Set2.ProtocolsFiles.tar.gz and Set3.ProtocolsFiles.tar.gz archives on the FTP site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/Set2.ProtocolFiles.tar.gz, ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/Set3.ProtocolFiles.tar.gz) for the second and third examples, respectively). Clusters are represented as lines in the VicinityPermissiveClustsLinear.tsv file, where the cluster number is indicated by the line number. The relevance metric file contains information on cluster abundance in the neighborhood of the baits, the abundance in the entire genomic set and the distance to the baits. Application of the proposed classification approach (see Fig. 4 for the Thermotogae example and the Supplementary_data.tar.gz archive on the FTP site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/Supplementary_data.tar.gz) for the two Halobacteria examples) outputs the list of candidate protein clusters that are linked to the baits. 
In general, the higher the ‘Icity’ (number of the proteins from a given cluster encoded in the bait neighborhoods normalized by the number of proteins from the same cluster in the entire database) and the shorter the distance to the baits, the more likely it is that the association of the given cluster with the baits is functionally relevant. Low-abundance clusters should be treated with caution because of the absence of sufficient statistical support.

Fig. 3 | Protein clusters characterized by their Icity, effective abundance and effective distance to the baits are shown. Annotation for each cluster was performed by using PSIBLAST to classify the clusters into categories: ‘Cas’, a known Cas protein; ‘Associated’, predicted ancillary Cas proteins; and ‘Non-Cas’, no CRISPR-related proteins.

Fig. 4 | The yellow area shows the sector with the maximum F score (optimized recall/precision). Annotation for each cluster was performed by using PSIBLAST to classify the clusters into categories: ‘Cas’, a known Cas protein; ‘Associated’, predicted ancillary Cas proteins; and ‘Non-Cas’, no CRISPR-related proteins.

The datasets used as the examples here are too small for meaningful statistical analysis. With our previously published data¹⁹, the sector of the metrics space with Icity > 0.7, abundance ≥ 1 and target distance < 2 provides the maximum F_0.5 precision-recall measure⁴⁵ of 0.98, with precision of 0.98 (48 known non-Cas families among the 2,624 protein families in the sector) and recall of 0.96 (2,576 known Cas families in the sector out of the total of 2,678).
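
The reported values can be checked from the counts given above. The short Python sketch below uses the standard precision, recall and F-beta definitions with beta = 0.5; the function name is illustrative.

```python
def precision_recall_f(tp, fp, total_positives, beta=0.5):
    """Precision, recall and F-beta from counts inside and outside a sector."""
    precision = tp / (tp + fp)
    recall = tp / total_positives
    f_score = ((1 + beta ** 2) * precision * recall
               / (beta ** 2 * precision + recall))
    return precision, recall, f_score


# 2,624 families in the sector, 48 of them non-Cas; 2,678 Cas families overall.
p, r, f = precision_recall_f(tp=2624 - 48, fp=48, total_positives=2678)
print(f"precision={p:.2f} recall={r:.2f} F0.5={f:.2f}")
# precision=0.98 recall=0.96 F0.5=0.98
```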

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data and code availability

The source code of the Icity pipeline is freely available under open-source NCBI license (https://github.com/ncbi/ICITY/blob/master/LICENSE.txt) at the NCBI GitHub page (https://github.com/ncbi/ICITY). Questions and comments can be addressed to authors through the GitHub portal or by email. All example datasets and the results of their analysis presented in the paper are available at the NCBI FTP site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/).

Supplementary Material

Supplement

NIHMS1063598-supplement-Supplement.pdf (517.7 KB, pdf)

Acknowledgements

This research was funded through the Intramural Research Program of the National Institutes of Health of the USA, the RFBR (for research project 18-34-00012, S.A.S.), a systems biology fellowship funded by Philip Morris Sales and Marketing (to S.A.S.), the Ministry of Education and Science of the Russian Federation (subsidy agreement 14.606.21.0006; project identifier RFMEFI60617X0006; to S.A.S. and K.V.S.) and an NIH grant (R01 GM10407 to K.V.S.).

Footnotes

Competing interests

The authors declare no competing interests.

Supplementary information is available for this paper at https://doi.org/10.1038/s41596-019-0211-1.

Peer review information Nature Protocols thanks Christine Pourcel and other anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Wolf YI, Rogozin IB, Kondrashov AS & Koonin EV Genome alignment, evolution of prokaryotic genome organization and prediction of gene function using genomic context. Genome Res. 11, 356–372 (2001). [DOI] [PubMed] [Google Scholar]
2.Rogozin IB, Makarova KS, Wolf YI & Koonin EV Computational approaches for the analysis of gene neighbourhoods in prokaryotic genomes. Brief Bioinform. 5, 131–149 (2004). [DOI] [PubMed] [Google Scholar]
3.Aravind L Guilt by association: contextual information in genome analysis. Genome Res. 10, 1074–1077 (2000). [DOI] [PubMed] [Google Scholar]
4.Galperin MY & Koonin EV Who’s your neighbor? New computational approaches for functional genomics. Nat. Biotechnol 18, 609–613 (2000). [DOI] [PubMed] [Google Scholar]
5.Janga SC, Collado-Vides J & Moreno-Hagelsieb G Nebulon: a system for the inference of functional relationships of gene products from the rearrangement of predicted operons. Nucleic Acids Res. 33, 2521–2530 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Moreno-Hagelsieb G The power of operon rearrangements for predicting functional associations. Comput. Struct. Biotechnol. J 13, 402–406 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Moreno-Hagelsieb G & Santoyo G Predicting functional interactions among genes in prokaryotes by genomic context. Adv. Exp. Med. Biol 883, 97–106 (2015). [DOI] [PubMed] [Google Scholar]
8.Price MN, Huang KH, Alm EJ & Arkin AP A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 33, 880–892 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
9.de Crecy-Lagard V & Hanson AD Finding novel metabolic genes through plant-prokaryote phylogenomics. Trends Microbiol. 15, 563–570 (2007). [DOI] [PubMed] [Google Scholar]
10.Zhao S et al. Discovery of new enzymes and metabolic pathways by using structure and genome context. Nature 502, 698–702 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Calhoun S et al. Prediction of enzymatic pathways by integrative pathway mapping. Elife 7, e31097 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Koonin EV, Wolf YI & Aravind L Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach. Genome Res. 11, 240–252 (2001). [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Evguenieva-Hackenberg E, Hou L, Glaeser S & Klug G Structure and function of the archaeal exosome. Wiley Interdiscip. Rev. RNA 5, 623–635 (2014). [DOI] [PubMed] [Google Scholar]
14.Shmakov S et al. Discovery and functional characterization of diverse class 2 CRISPR–Cas systems. Mol. Cell 60, 385–397 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Shmakov S et al. Diversity and evolution of class 2 CRISPR–Cas systems. Nat. Rev. Microbiol 15, 169–182 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Burstein D et al. Major bacterial lineages are essentially devoid of CRISPR–Cas viral defence systems. Nat. Commun 7, 10613 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Yan WX et al. Cas13d is a compact RNA-targeting type VI CRISPR effector positively modulated by a WYL-domain-containing accessory protein. Mol. Cell 70, 327–339.e5 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Makarova KS, Aravind L, Grishin NV, Rogozin IB & Koonin EV A DNA repair system specific for thermophilic archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res. 30, 482–496 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Shmakov SA, Makarova KS, Wolf YI, Severinov KV & Koonin EV Systematic prediction of genes functionally linked to CRISPR–Cas systems by gene neighborhood analysis. Proc. Natl Acad. Sci. USA 115, E5307–E5316 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Pawluk A et al. Naturally occurring off-switches for CRISPR–Cas9. Cell 167, 1829–1838e1829 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Pawluk A, Davidson AR & Maxwell KL Anti-CRISPR: discovery, mechanism and function. Nat. Rev. Microbiol 16, 12–17 (2018). [DOI] [PubMed] [Google Scholar]
22.Lasken RS & McLean JS Recent advances in genomic DNA sequencing of microbial species from single cells. Nat. Rev. Genet 15, 577–584 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Stern A & Sorek R The phage-host arms race: shaping the evolution of microbes. Bioessays 33, 43–51 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Koonin EV, Makarova KS & Wolf YI Evolutionary genomics of defense systems in archaea and bacteria. Annu. Rev. Microbiol 71, 233–261 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Makarova KS, Wolf YI, Snir S & Koonin EV Defense islands in bacterial and archaeal genomes and prediction of novel defense systems. J. Bacteriol 193, 6039–6056 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Doron S et al. Systematic discovery of antiphage defense systems in the microbial pangenome. Science 359, eaar4120 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Rogozin IB et al. Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res. 30, 2212–2223 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Zheng Y, Szustakowski JD, Fortnow L, Roberts RJ & Kasif S Computational identification of operons in microbial genomes. Genome Res. 12, 1221–1230 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Yan Y & Moult J Detection of operons. Proteins 64, 615–628 (2006). [DOI] [PubMed] [Google Scholar]
30.Mitra K, Carvunis AR, Ramesh SK & Ideker T Integrative approaches for finding modular structure in biological networks. Nat. Rev. Genet 14, 719–732 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Burroughs AM, Zhang D, Schaffer DE, Iyer LM & Aravind L Comparative genomic analyses reveal a vast, novel network of nucleotide-centric systems in biological conflicts, immunity and signaling. Nucleic Acids Res. 43, 10633–10654 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Makarova KS, Wolf YI & Koonin EV Comparative genomics of defense systems in archaea and bacteria. Nucleic Acids Res. 41, 4360–4377 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Galperin MY Bacterial signal transduction network in a genomic perspective. Environ. Microbiol 6, 552–567 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Mishra V, Lal R & Srinivasan Enzymes and operons mediating xenobiotic degradation in bacteria. Crit. Rev. Microbiol 27, 133–166 (2001). [DOI] [PubMed] [Google Scholar]
35.Besemer J, Lomsadze A & Borodovsky M GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618 (2001). [DOI] [PMC free article] [PubMed] [Google Scholar]
36. Marchler-Bauer A et al. CDD: NCBI’s conserved domain database. Nucleic Acids Res. 43, D222–226 (2015).
37. Finn RD et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–285 (2016).
38. Steinegger M & Soding J MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol 35, 1026–1028 (2017).
39. Altschul SF et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
40. Soding J Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951–960 (2005).
41. Makarova KS et al. An updated evolutionary classification of CRISPR–Cas systems. Nat. Rev. Microbiol 13, 722–736 (2015).
42. Bath C, Cukalac T, Porter K & Dyall-Smith ML His1 and His2 are distantly related, spindle-shaped haloviruses belonging to the novel virus group, Salterprovirus. Virology 350, 228–239 (2006).
43. Swarts DC et al. The evolutionary journey of argonaute proteins. Nat. Struct. Mol. Biol 21, 743–753 (2014).
44. Edgar RC MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
45. Sasaki Y The truth of the F-measure. Teach Tutor Mater. 1, 1–5 (2007).

Associated Data

Supplementary Materials

Supplement

NIHMS1063598-supplement-Supplement.pdf (517.7 KB, PDF)

-h, --help	Shows help message
-f F	Clusters file name
-c C	Folder name where profiles will be saved
-d D	Path to the protein database
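
The option list above has the layout that Python's `argparse` module prints for short options with default destination names (for example `-f F`). The following is a minimal sketch of a parser that would accept these options. It is an illustration only, not the pipeline's actual code: the function name `build_parser`, the description string, the `required=True` settings and the example file names are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Return a parser that mirrors the option table above (illustrative sketch)."""
    # The description text is an assumption; the table does not name the script.
    parser = argparse.ArgumentParser(
        description="Illustrative parser for the -f/-c/-d options listed above."
    )
    # With no explicit dest, argparse derives it from the flag ("f", "c", "d"),
    # which is why the help output shows "-f F", "-c C" and "-d D".
    # required=True is an assumption made for this sketch.
    parser.add_argument("-f", required=True, help="Clusters file name")
    parser.add_argument("-c", required=True, help="Folder name where profiles will be saved")
    parser.add_argument("-d", required=True, help="Path to the protein database")
    return parser


# Usage example with hypothetical file and folder names.
args = build_parser().parse_args(
    ["-f", "clusters.txt", "-c", "profiles", "-d", "proteins_db"]
)
print(args.f, args.c, args.d)
```

The `-h, --help` entry needs no explicit definition because `argparse` adds it automatically.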
