Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2016 Apr 29;44(10):4539–4550. doi: 10.1093/nar/gkw319

Identification and analysis of integrons and cassette arrays in bacterial genomes

Jean Cury 1,2,*, Thomas Jové 3, Marie Touchon 1,2, Bertrand Néron 4, Eduardo PC Rocha 1,2
PMCID: PMC4889954  PMID: 27130947

Abstract

Integrons recombine gene arrays and favor the spread of antibiotic resistance. Their broader roles in bacterial adaptation remain mysterious, partly due to lack of computational tools. We made a program – IntegronFinder – to identify integrons with high accuracy and sensitivity. IntegronFinder is available as a standalone program and as a web application. It searches for attC sites using covariance models, for integron-integrases using HMM profiles, and for other features (promoters, attI site) using pattern matching. We searched for integrons, integron-integrases lacking attC sites, and clusters of attC sites lacking a neighboring integron-integrase in bacterial genomes. All these elements are especially frequent in genomes of intermediate size. They are missing in some key phyla, such as α-Proteobacteria, which might reflect selection against cell lineages that acquire integrons. The similarity between attC sites is proportional to the number of cassettes in the integron, and is particularly low in clusters of attC sites lacking integron-integrases. The latter are unexpectedly abundant in genomes lacking integron-integrases or their remains, and have a large novel pool of cassettes lacking homologs in the databases. They might represent an evolutionary step between the acquisition of genes within integrons and their stabilization in the new genome.

INTRODUCTION

Integrons are gene-capturing platforms playing a major role in the spread of antibiotic resistance genes (reviewed in (1–4)). They have two main components (Figure 1). The first is made of the integron-integrase gene (intI) and its promoter (PintI), an integration site named attI (attachment site of the integron), and a constitutive promoter (Pc) for the gene cassettes integrated at the attI site (5). The second component is a cluster with up to 200 gene cassettes (6), most frequently transcribed in the opposite direction relative to the integron-integrase (7). Typical gene cassettes have an open reading frame (ORF) surrounded by attC recombination sites (attachment site of the cassette), but the presence of the ORF is not mandatory. Cassettes carrying their own promoters are expressed independently of PC (8).

Figure 1.

Figure 1.

Schema of an integron and the three types of elements detected by IntegronFinder. (A) The integron is composed of a specific integron integrase gene (intI, orange), an attI recombination site (red), and an array of gene cassettes (blue, yellow and green). A cassette is typically composed of an ORF flanked by two attC recombination sites. The integron integrase has its own promoter (PintI). There is one constitutive promoter (Pc) for the cluster of cassettes. Cassettes rarely contain promoters. The integrase can excise a cassette Inline graphic and/or integrate it at the attI site Inline graphic. (B) Complete integrons include an integrase and at least one attC site. (C) The In0 elements are composed of an integron integrase and no attC sites. (D) The clusters of attC sites lacking integron-integrases (CALIN) are composed of at least two attC sites.

Integrase-mediated recombination between two adjacent attC sites leads to the excision of a circular DNA fragment composed of an ORF and an attC site. The recombination of the attC site of this circular DNA fragment with an attI site leads to integration of the fragment at the location of the latter (9,10). Integrons can use this mechanism to capture cassettes from other integrons or to rearrange the order of their cassettes (11). The mechanism responsible for the creation of new cassettes is unknown.

The most distinctive features of the integron are thus the integron-integrase gene (intI) and a cluster of attC sites (Figure 1). The integron-integrase is a site-specific tyrosine recombinase closely related to Xer proteins (12). Contrary to most other tyrosine recombinases, IntI recombines nucleotide sequences of low similarity (13,14), by recognizing specific structural features of the attC site (15,16) (Figure 2A). This is partly caused by the presence of a ∼35 residues domain near the patch III region of the integron-integrase that is lacking in the other tyrosine recombinases (17). The integration of the attC site at attI produces chimeric attI/attC sites on one side and chimeric attC/attC sites on the other side of the cassette. This results in a cluster of chimerical attC sites with similar palindromic structures.

Figure 2.

Figure 2.

Characteristics of the attC sites. (A) Scheme of the secondary structure of a folded attC site. EHB stands for Extra Helical Bases. (B) Analysis of the attC sites used to build the model, including the WebLogo (73) of the R and L box and unpaired central spacers (UCS) and the histogram (and kernel density estimation) of the size of the variable terminal structure (VTS). The Weblogo represents the information contained in a column of a multiple sequence alignment (using the log2 transformation). The taller the letter is, the more conserved is the character at that position. The width of each column of symbols takes into account the presence of gaps. Thin columns are mostly composed of gaps. (C) Same as (B) but with the set of attC sites identified in complete integrons found in complete bacterial genomes. (D) Secondary structure used in the model in WUSS format, colors match those of (A).

Previous literature focused on integrons carrying antibiotic resistance genes. These integrons are often mobile, due to their association with transposons, and carry few cassettes (18,19,20). Most of the so-called mobile integrons can be classed in five classes, numbered 1 to 5. The IntI within each class show little genetic diversity, indicating their recent emergence from a much larger and diverse pool of integrons (7,20–22), possibly chromosomally encoded (23). For example, prototypical class 1 integrons were found on many chromosomes of non-pathogenic soil and freshwater β-Proteobacteria (24). By contrast, so-called chromosomal integrons are found in most strains of a species and carry cassettes encoding a wide range of functions. For example, the Vibrio spp. chromosomal integrons (initially called super-integrons due to the large number of cassettes they carry (25)) encode virulence factors, secreted proteins, and toxin-antitoxin modules (20). The high similarity of attC sites within chromosomal, but not mobile, integrons of Vibrio spp has prompted the hypothesis that cassettes are created by chromosomal and spread by mobile integrons (26). However, the dichotomy between chromosomal and mobile integrons has been criticized (7) because integrons encoded in the chromosome may be in mobile elements (27,28) and/or have small arrays of cassettes (29).

The analysis of metagenomics data has unraveled a vast pool of novel cassettes in microbial communities (30). Although antibiotic-resistance integrons were found to be abundant in human-associated environments such as sewage (31–33), most cassettes in environmental datasets encode different functions (or genes of unknown function) (31–33). The study of these functions, and of the adaptive impact of integrons, has been hindered by the difficulty in identifying integron cassettes. The bottleneck in these analyses is the recognition of attC sites, for which few tools were made available. The program XXR identifies attC sites in the large Vibrio integrons using pattern-matching techniques (20). The programs ACID (6) (no longer available) and ATTACCA (34) (now a part of RAC, available under private login) were designed to search for class 1 to class 3 mobile integrons. Such classical motif detection tools based on sequence conservation identify attC sites only within restricted classes of integrons. They are inadequate to identify or align distantly related attC sites because their sequences are too dissimilar. Yet, these sites have conserved structural constraints that can be used to identify highly divergent sequences.

We built a program named IntegronFinder (Figure 3, https://github.com/gem-pasteur/Integron_Finder) to detect integrons and their most distinctive components: the integron-integrase with the use of HMM profiles and the attC sites with the use of a covariance model (Figure 2). Covariance models use stochastic context-free grammars to model the constraints imposed by sequence pairing to form secondary structures. Such models have been previously used to detect structured motifs, such as tRNAs (35). They provide a good balance between sensitivity, the ability to identify true elements even if very diverse in sequence, and specificity, the ability to exclude false elements (36). They are ideally suited to model elements with high conservation of structure and poor conservation of sequence, such as attC sites. IntegronFinder also annotates known attI sites, PintI and PC, and any pre-defined type of protein coding genes in the cassettes (e.g., antibiotic resistance genes). IntegronFinder was built to accurately identify integron-integrases and attC sites of any generic integron. Importantly, we provide the program on a webserver that is free, requires no login, and has a long track record of stability (37) (http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::integron_finder). We also provide a standalone application for large-scale genomics and metagenomics projects. We used IntegronFinder to identify integrons in bacterial genomes and to characterize their distribution and diversity.

Figure 3.

Figure 3.

Diagram describing the different steps used by IntegronFinder to identify and annotate integrons. Solid lines represent the default mode, dotted lines optional modes. Blue boxes indicate the main dependency used for a given step. Green boxes indicate the format of the file needed for a given step.

MATERIALS AND METHODS

Data

The sequences and annotations of complete genomes were downloaded from NCBI RefSeq (last accessed in November 2013, http://ftp.ncbi.nih.gov/genomes/refseq/bacteria/). Our analysis included 2484 bacterial genomes (see Supplementary Table S1). We used the classification of replicons in plasmids and chromosomes as provided in the GenBank files. Our dataset included 2626 replicons labeled as chromosomes and 2006 as plasmids. The attC sites used to build the covariance model and the accession numbers of the replicons manually curated for the presence or absence of attC sites were retrieved from INTEGRALL, the reference database of integron sequences (http://integrall.bio.ua.pt/) (38). We used a set of 291 attC sites (Supplementary File 1) to build and test the model, and a set of 346 sequences with expert annotation of 596 attC sites to analyze the quality of the program predictions (Supplementary Tables S2a and S2b).

Protein profiles

We built a protein profile for the region specific to the integron tyrosine recombinase. For this, we retrieved the 402 IntI homologs from the Supplementary file 11 of Cambray et al. (39). These proteins were clustered using uclust 3.0.617 (40) with a threshold of 90% identity to remove very closely related proteins (the largest homologs were kept in each case). The resulting 79 proteins were used to make a multiple alignment using MAFFT (41) (–globalpair –maxiterate 1000). The position of the specific region of the integron-integrase in V. cholerae was mapped on the multiple alignments using the coordinates of the specific region taken from (17). We recovered this section of the multiple alignment to produce a protein profile with hmmbuild from the HMMer suite version 3.1b1 (42). This profile was named intI_Cterm (Supplementary File 2).

We used 119 protein profiles of the Resfams database (core version, last accessed on January 20, 2015 v1.1), to search for genes conferring resistance to antibiotics (http://www.dantaslab.org/resfams, (43)). We retrieved from PFAM the generic protein profile for the tyrosine recombinases (PF00589, phage_integrase, http://pfam.xfam.org/, (44)). All the protein profiles were searched using hmmsearch from the HMMer suite version 3.1b1. Hits were regarded as significant when their e-value was smaller than 0.001 and their alignment covered at least 50% of the profile.

Construction and analysis of attC models

We built a covariance model for the attC sites (Supplementary File 3). These models score a combination of sequence and secondary structure consensus (35) (with the limitation that these are DNA not RNA structures). To produce the attC models, 96 attC sites (33%) were chosen randomly from 291 known attC (see Data). The alignments were manually curated to keep the known conserved regions of the R and L boxes aligned in blocks. The unpaired central spacers (UCS) and the variable terminal structure (VTS) were not aligned because they were poorly conserved in sequence and length. Gaps were inserted in the middle of the VTS sequence as needed to keep the blocks of R and L boxes aligned. The consensus secondary structure was written in WUSS format beneath the aligned sequences (Supplementary File 4). The model was then built with INFERNAL 1.1 (36) using cmbuild with the option ‘- - hand’. This option allows the user to set the columns of the alignment that are actual matches (consensus). This is crucial for the quality of the model, because most of the columns in the R and L boxes would otherwise be automatically assigned as inserts due to the lack of sequence conservation. The R-UCS-L sections of the alignment were chosen as the consensus region, and the VTS was designed as a gap region. We used cmcalibrate from INFERNAL 1.1 to fit the exponential tail of the covariance model e-values, with default options. The model was used to identify attC sites using INFERNAL with two alternative modes. The default mode uses heuristics to reduce the sequence space of the search. The Inside algorithm is more accurate, but computationally much more expensive (typically 104 times slower) (36). By default, the attC sites were kept for further analysis when their e-value were below 1. The user can set this value (option - -evalue_attc).

Identification of promoters and attI sites

The sequences of the Pc promoters, of the PintI promoters and of the attI site were retrieved from INTEGRALL for the integrons of class 1, 2 and 3 when available (see Supplementary Table S3). We searched for exact matches of these sequences (accepting no indel nor mismatch) using pattern matching as implemented in the search function of the Bio.motifs module of Biopython v1.65 (45).

Overview of IntegronFinder: a program for the identification of attC sites, intI genes, integrons and CALIN elements

The input of IntegronFinder is a sequence of DNA in FASTA format. The sequence is annotated with Prodigal v2.6.2 (46) using the default mode for replicons larger than 200 kb and the metagenomic mode for smaller replicons ('-p meta’ in Prodigal) (Figure 3). In the present work, we omitted the annotation part and used the NCBI RefSeq annotations because they are curated. The annotation step is particularly useful to study newly acquired sequences or poorly annotated ones.

The program searches for the two protein profiles of the integron-integrase using hmmsearch with default parameters from HMMER suite version 3.1b1 and for the attC sites with the default mode of cmsearch from INFERNAL 1.1 (Figure 3). Two attC sites are put in the same cluster if they are less than 4 kb apart on the same strand. The clusters are built by transitivity: an attC site less than 4 kb from any attC site of a cluster is integrated in that cluster. Clusters are merged when localized less than 4 kb apart. The threshold of 4kb was determined empirically as a compromise between sensitivity (large values decrease the probability of missing cassettes) and specificity (small values are less likely to put together two independent integrons). More precisely, the threshold is twice the size of the largest known cassettes (∼2 kb (6)). This guarantees that even in the worst case (largest known cassettes) two attC sites will be clustered if an intervening site was not detected. Importantly, the user can set this threshold (‘- - distance_thresh’ in IntegronFinder).

The results of the searches for the elements of the integron are put together to class the loci in three categories (Figure 1 - B, C, D). (i) The elements with intI and at least one attC site were named complete integrons. The word complete is meant to characterize the presence of both elements; we cannot ascertain the functionality or expression of the integron. (ii) The In0 elements have intI but no recognizable attC sites. We do not strictly follow the original definition of In0, which also includes the presence of an attI (47), because this sequence is not known for most integrons (and thus cannot be searched for). (iii) The cluster of attC site lacking integron-integrase (CALIN) has at least two attC sites and lacks nearby intI.

To obtain a better compromise between accuracy and running time, IntegronFinder can re-run INFERNAL to search for attC sites with more precision using the Inside algorithm (‘- - max’ option in INFERNAL), but only around previously identified elements (‘- - local_max’ option in IntegronFinder). More precisely, if a locus contains an integron-integrase and attC sites (complete integron), the search is constrained to the strand encoding attC sites between the end of the integron-integrase and 4 kb after its most distant attC. If other attC sites are found after this one, the search is extended by 4 kb in that direction until no more new sites are found. If the element contains only attC sites (CALIN), the search is performed on the same strand on both directions. If the integron is In0, the search for attC sites is done on both strands in the 4 kb flanking the integron-integrase on each side. The program then searches for promoters and attI sites near the integron-integrase. Finally, it can annotate the integron genes’ cassettes (defined in the program as the CDS found between intI and 200 bp after the last attC site, or 200 bp before the first and 200 bp after the last attC site if there is no integron-integrase) using a database of protein profiles (option ‘- - func_annot’). For example, in the present study we used the ResFams database to search for antibiotic resistance genes. One can use any hmmer-compatible profile databases with the program.

The program outputs tabular and GenBank files listing all the identified genetic elements associated with an integron. The program also produces a figure in pdf format representing each complete integron. For an interactive view of all the hits, one can use the GenBank file as input in specific programs such as Geneious (48).

The user can change the profiles of the integrases and the covariance model of the attC site. Thus, if novel models of attC sites were to be built in the future, e.g., for novel types of attC sites, they could easily be plugged in IntegronFinder.

Phylogenetic analyses

We have made two phylogenetic analyses. One analysis encompasses the set of all tyrosine recombinases and the other focuses on IntI. The phylogenetic tree of tyrosine recombinases (Supplementary Figure S1) was built using 204 proteins, including: 21 integrases adjacent to attC sites and matching the PF00589 profile but lacking the intI_Cterm domain, seven proteins identified by both profiles and representative of the diversity of IntI, and 176 known tyrosine recombinases from phages and from the literature (12). We aligned the protein sequences with Muscle v3.8.31 with default options (49). We curated the alignment with BMGE using default options (50). The tree was then built with IQ-TREE multicore version 1.2.3 with the model LG+I+G4. This model was the one minimizing the Bayesian Information Criterion (BIC) among all models available (‘-m TEST’ option in IQ-TREE). We made 10 000 ultra fast bootstraps to evaluate node support (Supplementary Figure S1, Tree S1).

The phylogenetic analysis of IntI was done using the sequences from complete integrons or In0 elements (i.e., integrases identified by both HMM profiles) (Supplementary Figure S2). We added to this dataset some of the known integron-integrases of class 1, 2, 3, 4 and 5 retrieved from INTEGRALL. Given the previous phylogenetic analysis we used known XerC and XerD proteins to root the tree. Alignment and phylogenetic reconstruction were done using the same procedure; except that we built ten trees independently, and picked the one with best log-likelihood for the analysis (as recommended by the IQ-TREE authors (51)). The robustness of the branches was assessed using 1000 bootstraps (Supplementary Figure S2, Tree S2, Table S4).

Pan-genomes

Pan-genomes are the full complement of genes in the species. They were built by clustering homologous proteins into families for each of the species (as previously described in (52)). Briefly, we determined the lists of putative homologs between pairs of genomes with BLASTP (53) (default parameters) and used the e-values (<10−4) to cluster them using SILIX (54). SILIX parameters were set such that a protein was homologous to another in a given family if the aligned part had at least 80% identity and if it included more than 80% of the smallest protein. We built pan-genomes for the 12 species having at least four complete genomes available in Genbank RefSeq and encoding at least one IntI. The genomes of these species carried 40% of the complete integrons in our dataset. We did not build a pan-genome for Xanthomonas oryzae because it contained too many rearrangements and repeated elements (55).

For a given species we computed the pattern of presence and absence of each integron-integrase protein family and the frequency of the integron-integrase within the species.

Integron classification

We used two criteria to class integrons: frequency within the species’ genomes (Supplementary Figure S3) and number of cassettes (Supplementary Figure S4). Integrons were classed as sedentary chromosomal integrons (as named by (4)) when their frequency in the pan-genome was 100%, or when they contained more than 19 attC sites. They were classed as mobile integrons when missing in more than 40% of the species' genomes, when present on a plasmid, or when the integron-integrase was from classes 1 to 5. The remaining integrons were classed as ‘other’.

Pseudo-genes detection

We translated the six reading frames of the region containing the CALIN elements (10 kb on each side) to detect intI pseudo-genes. We then ran hmmsearch with default options from HMMER suite v3.1b1 to search for hits matching the profile intI_Cterm and the profile PF00589 among the translated reading frames. We recovered the hits with e-values lower than 10−3 and alignments covering more than 50% of the profiles.

IS detection

We identified insertion sequences (IS) by searching for sequence similarity between the genes present 4 kb around or within each genetic element and a database of IS from ISFinder (56). Details can be found in (57).

Detection of cassettes in INTEGRALL

We searched for sequence similarity between all the CDS of CALIN elements and the INTEGRALL database using BLASTN from BLAST 2.2.30+. Cassettes were considered homologous to those of INTEGRALL when the BLASTN alignment showed more than 40% identity.

RESULTS

Models for attC sites

We selected a manually curated set of 291 attC sites representative of the diversity of sequences available in INTEGRALL (see Methods). We randomly sampled a third of them to build a covariance model of the attC site and set aside the others for subsequent validation. The characteristics of these sequences were studied in detail (Figure 2A), notably concerning the R and L boxes, the UCS and the EHB (15). The positions of the so-called Conserved Triplet (AAC and the complementary GTT) were more conserved than the others (Figure 2B and D). The length and sequence of the VTS were highly variable, between 20 and 100 nts long, as previously observed (1).

We used the covariance model to search for attC sites on 2484 complete bacterial genomes. The genomic attC sites showed stronger consensus sequences and more homogeneous VTS lengths than those used to build the model (Figure 2C). The analyses of sensitivity in the next paragraph show that our model missed very few sites. Hence, the differences between the initial and the genomic attC sites might be due to our explicit option of using diverse sequences to build the model (to maximize diversity). They may also reflect differences between mobile integrons (very abundant in INTEGRALL) and integrons in sequenced bacterial genomes (where a sizeable fraction of cassettes were identified in Vibrio spp.).

We tested the ability of the covariance model to identify known sequences within pseudo-genomes built by randomizing dinucleotides from genomes with varying G+C content (Supplementary Table S5). We integrated in each pseudo-genome five attC sites (among the 195 of the validation set) at 2 kb intervals. We searched for attC sites in these genomes and found very few false positives in both run modes (∼0.03 FP/Mb, Figure 4, see Methods for details). The proportion of true attC sites actually identified (sensitivity), was 61% for the default mode and 88% for the most accurate mode (with option ‘- - local_max’). We identified at least two of the attC sites in 99% of the clusters (with the most accurate mode). Hence, clusters could be identified even when some attC sites were missed. The sensitivity of the model showed very little dependency on genome G+C composition in all cases (Figure 4).

Figure 4.

Figure 4.

Quality assessment of the attC sites covariance model on pseudo-genomes with varying G+C content and depending on the run mode (default and ‘- - local_max’). (Top) Table resuming the results. The mean time is the average running time per pseudo-genome on a Mac Pro, 2 × 2.4 GHz 6-Core Intel Xeon, 16 Gb RAM, with options - - cpu 20 and - - no-proteins. (Middle) Rate of false positives per mega-base (Mb) as function of the G+C content. (Bottom) Sensitivity (or true positive rate) as function of the G+C content. The red line depicts results obtained with the default parameters, and the blue line represents results obtained with the accurate parameters (‘- - local_max’ option). Vertical lines represent standard error of the mean. There is no correlation with G+C content (all spearman ρ ∈ [−0.12; −0.04] and all P-values > 0.06).

We then searched for attC sites in sequences annotated for the presence of integrons in INTEGRALL (Supplementary Table S2a). The search was performed in 346 sequences containing 596 known attC sites. We found 570 attC sites with the most accurate mode (96% of sensitivity, Supplementary Table S2b). We missed 26 of the known attC sites, among which 15 were on the integron edges, and were probably missed because of the absence of R’ box on the 3′ side. All the 57 sequences annotated as In0 in INTEGRALL also lacked attC sites in our analysis. We found 247 attC sites missing in the annotations of INTEGRALL. If attC annotations in INTEGRALL were perfect and all these sites were false, the rate of false positives of our analysis would be 0.72 per Mb. However, about 90% of these non-annotated attC sites were found in clusters of two attC sites or more, which suggests that they are real attC sites. If all isolated attC sites were false positives (and the only ones), then the false positive rate would be 0.07 FP/Mb, i.e., less than one false attC site per genome.

These analyses showed a rate of false positives between 0.03 FP/Mb and 0.72 FP/Mb. The probability of having a cluster of two or more false attC sites by chance (within 4 kb) given this density of false positives is between 4.10−6 and 7.10−9 depending on the false positive rate (assuming a Poisson process). Hence, the clusters of attC sites given by our model are extremely unlikely to be false positives.

Identification of integron-integrases

We identified tyrosine recombinases using the protein profile PF00589 (from PFAM). To distinguish IntI from the other tyrosine recombinases, we built an additional protein profile corresponding to the IntI specific region near the patch III domain (17) (henceforth named intI_Cterm, see Methods). We found 215 proteins matching both profiles in the complete genomes of bacteria. Only six genes matched intI_Cterm but not PF00589. There were 18 808 occurrences of PF00589 not matching intI_Cterm, among which only 50 co-localized with an attC site. Among the latter, 29 were in genomes that encoded IntI elsewhere in the replicon (Supplementary Figure S5). The remaining 21 integrases were scattered in the phylogenetic tree of tyrosine recombinases, and only four of them were placed in an intermediate position between IntI and Xer (Supplementary Figure S1). These four sequences resembled typical phage integrases at the region of the patch III domain characteristic of IntI and they co-localized with very few attC sites (always less than three). This analysis strongly suggests that tyrosine recombinases lacking the intI_Cterm domain identified near attC sites are not IntI.

Most intI genes identified in bacterial genomes co-localized with attC sites (76%, Supplementary Figure S5). It is difficult to assess if the remaining intI genes are true or false, since In0 elements have often been described in the literature (47,58). We were able to identify IntI in the integrons of class 1 to class 5, as well as in well-known chromosomal integrons (e.g., in Vibrio super-integrons). We also identified all In0 elements in the INTEGRALL dataset mentioned above. Overall, these results show that IntI could be identified accurately using the intersection of both protein profiles.

We built a phylogenetic tree of the 215 IntI proteins identified in genomes (Supplementary Figure S2). Together with the analysis of the broader phylogenetic tree of tyrosine recombinases (Supplementary Figure S1), this extends and confirms previous analyses (1,7,22,59): (i) The XerC and XerD sequences are close outgroups. (ii) The IntI are monophyletic. (iii) Within IntI, there are early splits, first for a clade including class 5 integrons, and then for Vibrio super-integrons. On the other hand, a group of integrons displaying an integron-integrase in the same orientation as the attC sites (inverted integron-integrase group) was previously described as a monophyletic group (7), but in our analysis it was clearly paraphyletic (Supplementary Figure S2, column F). Notably, in addition to the previously identified inverted integron-integrase group of certain Treponema spp., a class 1 integron present in the genome of Acinetobacter baumannii 1656-2 had an inverted integron-integrase.

Integrons in bacterial genomes

We built a program—IntegronFinder—to identify integrons in DNA sequences. This program searches for intI genes and attC sites, clusters them in function of their co-localization and then annotates cassettes and other accessory genetic elements (see Figure 3 and Methods). The use of this program led to the identification of 215 IntI and 4597 attC sites in complete bacterial genomes. The combination of this data resulted in a dataset of 164 complete integrons, 51 In0 and 279 CALIN elements (see Figure 1 for their description). The observed abundance of complete integrons is compatible with previous data (7). While most genomes encoded a single integron-integrase, we found 36 genomes encoding more than one, suggesting that multiple integrons are relatively frequent (20% of genomes encoding integrons). Interestingly, while the literature on antibiotic resistance often reports the presence of integrons in plasmids, we only found 24 integrons with integron-integrase (20 complete integrons, 4 In0) among the 2006 plasmids of complete genomes. All but one of these integrons were of class 1 (96%).

The taxonomic distribution of integrons was very heterogeneous (Figure 5 and Supplementary Figure S6). Some clades contained many elements. The foremost clade was the γ-Proteobacteria among which 20% of the genomes encoded at least one complete integron. This is almost four times as much as expected given the average frequency of these elements (∼6%, χ2 test in a contingency table, P < 0.001). The β-Proteobacteria also encoded numerous integrons (∼10% of the genomes). In contrast, all the genomes of Firmicutes, Tenericutes and Actinobacteria lacked complete integrons. Furthermore, all 243 genomes of α-Proteobacteria, the sister-clade of γ and β-Proteobacteria, were devoid of complete integrons, In0 and CALIN elements. Interestingly, much more distantly related bacteria such as Spirochaetes, Chlorobi, Chloroflexi, Verrucomicrobia and Cyanobacteria encoded integrons (Figure 5 and Supplementary Figure S6). The complete lack of integrons in one large phylum of Proteobacteria is thus very intriguing.

Figure 5.

Figure 5.

Taxonomic distribution of integrons in clades with more than 50 complete genomes sequenced. The gray bar represents the number of genomes sequenced for a given clade. The blue bar represents the number of complete integrons, the red bar the number of In0 and the yellow bar the number of CALIN. The colored text boxes refer to the colors in Supplementary Figure S2.

We searched for genes encoding antibiotic resistance in integron cassettes (see Methods). We identified such genes in 105 cassettes, i.e., in 3% of all cassettes from complete integrons (3116 cassettes). Most resistance cassettes were found in class 1 to 5 integrons (90% of them), even if the latter contained only 4.5% of all cassettes. This fits previous observations that integrons lacking antibiotic resistance determinants are very frequent in natural populations (24,30).

The association between genome size and the frequency of integrons has not been studied before. We binned the genomes in terms of their size and analyzed the frequency of complete integrons, In0 and CALIN. This showed a clearly non-monotonic trend (Figure 6). This distribution was not homogeneous in the different size categories (χ2 test in a contingency table, P-values <1.10−4 for complete, CALIN and In0). The same result was observed in a complementary analysis using only integrons from Gamma-Proteobacteria (Supplementary Figure S7). Very small genomes lack complete integrons, intermediate size genomes accumulate most of the integrons and the largest genomes encode few. Importantly, the same trends were observed for In0 and CALIN. Hence, the frequency of integrons is maximal for genomes of intermediate size (4–6 Mb).

Figure 6.

Figure 6.

Frequency of integrons and related elements as a function of the genome size. Vertical bar represents standard error of the mean. The sample size in each bin is: 608 [0-2], 912 [2-4], 712 [4-6] and 247 [>6].

Unexpected abundance of CALIN elements

The number of attC sites lacking nearby integron-integrases was unexpectedly high. We found 431 occurrences of isolated single attC sites among the 1879 attC sites lacking an integrase. If these sites were all false, and were the only false ones, then the observed rate of false positives can be estimated at 0.047 FP/Mb. This is within the range of the rates of false positives observed in the sensitivity analysis (between 0.03 FP/Mb and 0.72 FP/Mb). The probability that CALIN elements are false positives is exceedingly small for these rates of false positives. Therefore, we discarded single attC sites and kept the 279 clusters with two or more sites (CALIN) for the subsequent analyses. The CALIN resemble mobile integrons in terms of the number of cassettes: 83% had fewer than six attC sites and only 6.6% had more than 10 (Supplementary Figure S8). Nevertheless, some few CALIN were very large, with up to 114 attC sites. Furthermore, their cassettes were remarkably different from those of mobile integrons: only 147 out of the 1933 cassettes were homologous to those reported in INTEGRALL and only 31 carried antibiotic resistance genes (to be compared with 70% among class 1 to class 5 integrons and with 0.4% among the other complete integrons). Hence, CALIN are relatively small on average (5 attC sites) but may contain several tens of attC sites, and have many previously unknown gene cassettes.

The CALIN elements might have arisen from the loss of the integrase in a previously complete integron. Therefore, we searched for pseudogenes matching the specific IntI_Cterm domain less than 10 kb away from CALIN. We found such pseudo-genes near 15 out of 279 CALIN elements. It is worth noting that out of the 15 hits, 11 pseudo-genes were also matched by the PF00589 profile, which is consistent with the idea that they previously encoded IntI. Overall, our analysis showed that most CALIN (95%) are not close to recognizable intI pseudogenes.

We enquired on the possibility that some CALIN might actually be part of an integron and that we have missed this association because of the small 4 kb threshold used in the definition of the clusters. To test this hypothesis, we re-run IntegronFinder with a distance threshold of 10 kb. This analysis found 252 of the previously identified 279 CALIN elements, the remaining 27 being merged with a integron or In0 element with the 10 kb threshold. This shows that increasing the distance threshold in the clustering procedure does not significantly change the observed abundance of CALIN.

Chromosomal rearrangements (integrations, translocations or inversions) may split integrons and separate some cassettes from the neighborhood of the integron-integrase, thus producing CALIN elements in genomes encoding IntI. The CALIN elements might also result from integration of cassettes at secondary sites in the chromosome (9,10,60). We found some cases where IntI was actually encoded in another replicon (3.5% of CALIN). Overall, half of the CALIN were found in genomes encoding IntI and half in genomes lacking this gene.

Insertion sequences (IS) may create CALIN by promoting chromosomal rearrangements in a previously complete integron. The frequency of these events depends on the frequency of IS inside integrons. We therefore searched for IS inside or near CALIN, In0 and complete integrons (see Methods). We found that 12% of CALIN and 23% of the complete integrons encoded at least one IS within their cassettes. Upon IS-mediated rearrangements, the CALIN should be close to an IS. Indeed, 38% of the CALIN had a neighboring IS. Such co-localization was more frequent for CALIN in genomes encoding IntI than in the others (P < 0.001, χ2 contingency table). These results are consistent with the hypothesis that IS contribute to disrupt integrons and create CALIN. They may explain the origin of many CALIN elements, especially in the genomes encoding IntI in other locations.

Divergence of attC sites

Since attC sites were too poorly conserved in sequence to align using standard sequence alignment methods, we aligned them using the covariance model. We used these alignments to assess the sequence similarity between the R-UCS-L box and the difference in length of VTS sequences between attC sites. As expected, both measures showed that attC sites were more similar within than between integrons (Supplementary Figure S9). We then quantified the relationship between the number of attC sites in an integron and the average within-integron sequence dissimilarity in attC sites. The sequence similarity increased with the number of attC sites (Figure 7), i.e., the integrons carrying the longest arrays of cassettes had more homogeneous attC sites. Conversely, arrays of heterogeneous attC sites were almost always small.

Figure 7.

Figure 7.

Relationship between the number of attC sites in an integron and the mean sequence distance between attC sites within an integron. The x-axis is in log10 scale. The association is significant: spearman ρ = -0.53, P < 0.001.

Considering that many previous studies opposed mobile to chromosomal integrons, we tested if our results remained valid when following this dichotomy. We split our dataset into sedentary chromosomal integrons, mobile integrons and others (unclassified) (see Materials and Methods). Integrons from all three sets were found in the major clades of the IntI phylogeny (Supplementary Figure S2). Around 67% of the integrons encoded in chromosomes were classed as mobile in the species with computed pan-genomes (see Materials and Methods), showing that the separation between chromosomal and mobile integrons may be misleading. Expectedly, given their longer arrays of cassettes (Figure 7), the sedentary chromosomal integrons showed more similar attC sites than the mobile ones (Supplementary Figure S9). The similarity of attC sites within CALIN elements was between that of sedentary and mobile integrons (Supplementary Figure S9). As proposed before (3), our results suggest that the dichotomy between sedentary chromosomal and mobile integrons may be informative because these two sets are quantitatively different, but may not reflect qualitative biological differences because there seems to be a continuum between large and small integrons.

DISCUSSION

IntegronFinder, limitations and perspectives

IntegronFinder identifies the vast majority of known attC sites and intI genes and is unaffected by genomic G+C content. The high sensitivity with which it identifies individual attC sites leads to a very small probability (0.02%) of missing all elements in a cluster of four attC sites. Nevertheless, it may be necessary to interpret with care the results of IntegronFinder in certain circumstances. For example, a genome rearrangement that splits an integron in two will result in the identification of a CALIN and an integron (eventually an In0 if the rearrangement takes place near the attI site). IntegronFinder accurately identifies these two genetic elements, which are independent from the transcriptional point of view since PC cannot promote expression of the CALIN's cassettes. On the other hand, these elements may remain functionally linked because cassettes from the CALIN may be excised by the integron-integrase and re-inserted in the integron at its attI site. It is unclear if the two elements should be regarded as independent, as it is done by default, or as a single integron. One should note that such cases might be difficult to distinguish from alternative evolutionary scenarii involving the loss of the integron-integrase in one of multiple integrons of a genome.

IntegronFinder detects few false positives among integrons and CALIN. Yet, we have identified 431 single attC sites in bacterial genomes whose relevance is less clear. Some of these sites might be false positives because their frequency in genomes is close to the upper limit of the false positive rates obtained in our validation procedure. Others might result from the genetic degradation of integron cassettes.

Our study was restricted to the analysis of complete bacterial genomes to avoid the complications of dealing with inaccurate genome assemblies. However, IntegronFinder can be used to analyze draft genomes or metagenomes as long as one is aware of the limitations of the procedure in such data. The difficulty in the analysis of draft genomes results from the presence of contig breaks that often coincide with repeated sequences, such as transposable elements. Their high frequency in integrons implicate that these might be scattered in different contigs. Under these circumstances, IntegronFinder will identify several genetic elements (typically an integron and several CALIN) even if the genome actually encodes one single complete integron. Metagenomics data are even more challenging because it includes numerous small contigs where it is difficult to identify complete integrons. Yet, since the models for attC sites and intI are very accurate they can be used to identify cassettes and integron-integrases in assembled metagenomes. This might dramatically improve the detection of novel gene cassettes in environmental data.

Determinants of integron distribution

Our analysis highlighted associations between the frequency of integrons and certain genetic traits. The frequency of CALIN, complete integrons and In0 is often highly correlated in relation to all of these traits, e.g., all three types of elements show roughly similar distributions among bacterial phyla and in terms of genome size. This association between the three types of elements is most likely caused by their common evolutionary history.

Integrons have well-known roles in the spread of antibiotic resistance. Nevertheless, we identified very few known antibiotic resistance genes in complete integrons outside the class 1 to class 5 integrons. Interestingly, we also found few resistance genes in CALIN elements. This supports previous suggestions that integrons carry a much broader set of adaptive traits, than just antibiotic resistance, in natural populations (30).

We found an under-representation of integrons in both small and large bacterial genomes. Since integrons are gene-capturing platforms, one would expect a positive association between the frequency of integrons and that of horizontal transfer. Accordingly, the lack of integrons in bacteria with small genomes might be caused by the sexual isolation endured by these bacteria, which typically also have few or no transposable elements, plasmids or phages (61–63). The causes for the low frequency of integrons in the largest genomes must be different, since they are thought to engage in very frequent horizontal transfer (64,65). We can only offer a speculation to explain this puzzling result. Horizontal transfer is often brought by mobile genetic elements. These elements can be very large and costly, while encoding few adaptive traits (if any) (66). The cost of these elements should scale with the inverse of genome size, if larger genomes have fewer constraints on the amount of incoming genetic material and if they select for more frequent horizontal transfer. Hence, the distribution of integrons might result from the combined effect of the frequency of transfer (increasing with genome size) and selection for compact transfer (decreasing with genome size). Further work will be necessary to test this hypothesis.

Most integrons with taxonomic identification available in INTEGRALL are from γ-proteobacteria (90%) (38). Our dataset is more diverse; we found many integrons in β-Proteobacteria and in other large phyla (such as Spirochaetes, Chloroflexi, Chlorobi or Planctomycetes). This shows that our method identifies integrons in clades distant from γ-proteobacteria. Surprisingly, the genomes from α-Proteobacteria had no integrons, even if they encoded many tyrosine recombinases involved in the integration of a variety of mobile genetic elements. The complete absence of integrons, In0 and CALIN in α-Proteobacteria is extremely puzzling. It cannot solely be ascribed to the frequency of small genomes in certain branches of α-Proteobacteria, since our dataset included 99 genomes larger than 4 Mb in the clade. We also did not find complete integrons in Gram-positive bacteria. It is well known that differences in the translation machinery hinder the expression of transferred genetic information from Proteobacteria to Firmicutes (e.g., due specificities of the protein S1 (67)), but these differences cannot explain the lack of integrons in α-Proteobacteria. Transfer of genetic information between clades of Proteobacteria and between Proteobacteria and Gram-positive bacteria is well documented (68,69). Accordingly, integrons have occasionally been identified in Firmicutes and α-Proteobacteria (38,70), and we found CALIN in Firmicutes and In0 in Actinobacteria. Some unknown mechanism probably hinders the establishment of integrons in these lineages after transfer.

The evolution of integrons

Our study sheds new light on the evolution of integrons. The use of the covariance model confirmed that attC sites are more similar within than between integrons. It also uncovered a positive association between the homogeneity of attC sites and the number of cassettes in integrons. If homogeneous attC sites result from the creation of cassettes by the integron-integrase of the same integron, then the largest integrons might be those creating more cassettes.

We found many CALIN in genomes. Previous works have identified intI pseudo-genes in bacterial genomes (22,29), and showed IntI-mediated creation of CALIN at secondary integration sites (9,10,60). However, the frequency of CALIN, and especially in genomes lacking integron-integrases, is surprising. These elements may have arisen in several ways. (i) By the unknown mechanism creating novel cassettes if this mechanism does not depend on IntI. (ii) By integration of cassettes at secondary integration sites by an integron encoded elsewhere in the genome. (iii) By loss of intI, even if most CALIN lacked recognizable neighboring pseudogenes of intI. (iv) By genome rearrangements separating a group of cassettes from the neighborhood of intI (as observed in (71)). Mechanisms #3 and #4 are consistent with the presence of IS in a fourth of complete integrons. Mechanisms #1, #2 and #4 might explain why half of the CALIN are in replicons encoding IntI.

There are some similarities, but also key differences, between CALIN and mobile integrons. The average number of cassettes is similar in both elements, but the within-element sequence identity of attC sites is different and few CALIN have many cassettes. CALIN have very few cassettes homologous to those of mobile integrons and far fewer antibiotic resistant genes. This shows that most CALIN do not derive from the class 1 to 5 mobile integrons carrying antibiotic resistance genes.

The lack of an integron-integrase in CALIN elements does not imply that these cassettes cannot be mobilized. We found many co-occurrences of complete integrons, In0 and CALIN in genomes. They might facilitate the exchange of cassettes between elements. Integron-integrases have relaxed sequence similarity requirements to mediate recombination between divergent attC sites. It is thus tempting to speculate that integrons transferred into a genome encoding a CALIN might be able to integrate CALIN cassettes in their own cluster of cassettes. Alternatively, CALIN might provide cassettes to naturally transformable bacteria, in case their stable circular forms are able to survive in the environment and be taken up by transformation (72). If integrons often capture cassettes from CALIN elements then many genomes currently lacking integrons but carrying CALIN might be important reservoirs of novel cassettes.

It is unknown whether CALIN genes are expressed or have an adaptive value. Their high abundance in genomes suggests that at least some of them might provide advantageous traits for the host bacterium. CALIN with very degenerate attC sites might thus represent an intermediate step between the acquisition of a gene by an integron and its definitive stabilization in the genome by loss of the IntI-based cassette mobilizing activity.

AVAILABILITY

The program was written in Python 2.7. It is freely available on a webserver (http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::integron_finder). The standalone program is distributed under an open-source GPLv3 license and can be downloaded from Github (https://github.com/gem-pasteur/Integron_Finder/) to be run using the command line. Supplementary materials include tables containing all integrons found at different level (elements, integrons, genomes, in Supplementary Tables S6, S7 and S8). It includes the list of the 596 attC sites with their annotated position (Supplementary Table S2a), and the corresponding file with observed position (Supplementary Table S2b). We provide the intI_Cterm HMM profile (File S2) and the covariance model for the attC site (File S3).

Supplementary Material

Supplementary Data
SUPPLEMENTARY DATA

Acknowledgments

J.C. is a member of the ‘Ecole Doctorale Frontière du Vivant (FdV) – Programme Bettencourt’. We thank Didier Mazel, Jose A. Escudero, Céline Loot, Aleksandra Nivina, Philippe Glaser, Claudine Médigue, Alexandra Moura and Julian E. Davies for fruitful discussions and comments on the manuscript.

Author contributions: J.C. EPCR designed the study. J.C. made the analysis. J.C. and B.N. wrote the software and webserver. M.T. T.J. contributed with data. J.C. EPCR. drafted the manuscript. All authors contributed to the final text of the manuscript.

FUNDING

European Research Council [EVOMOBILOME, 281605 to E.P.C.R.]. Funding for open access charge: ERC EVOMOBILOME.

Conflict of interest statement. None declared.

REFERENCES

  • 1.Mazel D. Integrons: agents of bacterial evolution. Nat. Rev. Microbiol. 2006;4:608–620. doi: 10.1038/nrmicro1462. [DOI] [PubMed] [Google Scholar]
  • 2.Partridge S.R. Analysis of antibiotic resistance regions in Gram-negative bacteria. FEMS Microbiol. Rev. 2011;35:820–855. doi: 10.1111/j.1574-6976.2011.00277.x. [DOI] [PubMed] [Google Scholar]
  • 3.Gillings M.R. Integrons: past, present, and future. Microbiol. Mol. Biol. Rev. 2014;78:257–277. doi: 10.1128/MMBR.00056-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Escudero J.A., Loot C., Nivina A., Mazel D. The Integron: adaptation on demand. Microbiol. Spectr. 2015;3 doi: 10.1128/microbiolspec.MDNA3-0019-2014. MDNA3-0019-2014. [DOI] [PubMed] [Google Scholar]
  • 5.Collis C.M., Hall R.M. Expression of antibiotic resistance genes in the integrated cassettes of integrons. Antimicrob. Agents Chemother. 1995;39:155–162. doi: 10.1128/aac.39.1.155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Joss M.J., Koenig J.E., Labbate M., Polz M.F., Gillings M.R., Stokes H.W., Doolittle W.F., Boucher Y. ACID: annotation of cassette and integron data. BMC Bioinformatics. 2009;10:118. doi: 10.1186/1471-2105-10-118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Boucher Y., Labbate M., Koenig J.E., Stokes H.W. Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends Microbiol. 2007;15:301–309. doi: 10.1016/j.tim.2007.05.004. [DOI] [PubMed] [Google Scholar]
  • 8.Michael C.A., Labbate M. Gene cassette transcription in a large integron-associated array. BMC Genet. 2010;11:82. doi: 10.1186/1471-2156-11-82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Recchia G.D., Stokes H.W., Hall R.M. Characterisation of specific and secondary recombination sites recognised by the integron DNA integrase. Nucleic Acids Res. 1994;22:2071–2078. doi: 10.1093/nar/22.11.2071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Recchia G.D., Hall R.M. Plasmid evolution by acquisition of mobile gene cassettes: plasmid pIE723 contains the aadB gene cassette precisely inserted at a secondary site in the incQ plasmid RSF1010. Mol. Microbiol. 1995;15:179–187. doi: 10.1111/j.1365-2958.1995.tb02232.x. [DOI] [PubMed] [Google Scholar]
  • 11.Hall R.M., Brookes D.E., Stokes H.W. Site-specific insertion of genes into integrons: role of the 59-base element and determination of the recombination cross-over point. Mol. Microbiol. 1991;5:1941–1959. doi: 10.1111/j.1365-2958.1991.tb00817.x. [DOI] [PubMed] [Google Scholar]
  • 12.Nunes-Duby S.E., Kwon H.J., Tirumalai R.S., Ellenberger T., Landy A. Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucleic Acids Res. 1998;26:391–406. doi: 10.1093/nar/26.2.391. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Collis C.M., Recchia G.D., Kim M.J., Stokes H.W., Hall R.M. Efficiency of recombination reactions catalyzed by class 1 integron integrase IntI1. J. Bacteriol. 2001;183:2535–2542. doi: 10.1128/JB.183.8.2535-2542.2001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.MacDonald D., Demarre G., Bouvier M., Mazel D., Gopaul D.N. Structural basis for broad DNA-specificity in integron recombination. Nature. 2006;440:1157–1162. doi: 10.1038/nature04643. [DOI] [PubMed] [Google Scholar]
  • 15.Bouvier M., Ducos-Galand M., Loot C., Bikard D., Mazel D. Structural features of single-stranded integron cassette attC sites and their role in strand selection. PLoS Genet. 2009;5:e1000632. doi: 10.1371/journal.pgen.1000632. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Frumerie C., Ducos-Galand M., Gopaul D.N., Mazel D. The relaxed requirements of the integron cleavage site allow predictable changes in integron target specificity. Nucleic Acids Res. 2010;38:559–569. doi: 10.1093/nar/gkp990. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Messier N., Roy P.H. Integron integrases possess a unique additional domain necessary for activity. J. Bacteriol. 2001;183:6699–6706. doi: 10.1128/JB.183.22.6699-6706.2001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Partridge S.R., Tsafnat G., Coiera E., Iredell J.R. Gene cassettes and cassette arrays in mobile resistance integrons. FEMS Microbiol. Rev. 2009;33:757–784. doi: 10.1111/j.1574-6976.2009.00175.x. [DOI] [PubMed] [Google Scholar]
  • 19.Hall R.M., Collis C.M. Mobile gene cassettes and integrons: capture and spread of genes by site-specific recombination. Mol. Microbiol. 1995;15:593–600. doi: 10.1111/j.1365-2958.1995.tb02368.x. [DOI] [PubMed] [Google Scholar]
  • 20.Rowe-Magnus D.A., Guerout A.M., Biskri L., Bouige P., Mazel D. Comparative analysis of superintegrons: engineering extensive genetic diversity in the Vibrionaceae. Genome Res. 2003;13:428–442. doi: 10.1101/gr.617103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Diaz-Mejia J.J., Amabile-Cuevas C.F., Rosas I., Souza V. An analysis of the evolutionary relationships of integron integrases, with emphasis on the prevalence of class 1 integrons in Escherichia coli isolates from clinical and environmental origins. Microbiology. 2008;154:94–102. doi: 10.1099/mic.0.2007/008649-0. [DOI] [PubMed] [Google Scholar]
  • 22.Nemergut D.R., Robeson M.S., Kysela R.F., Martin A.P., Schmidt S.K., Knight R. Insights and inferences about integron evolution from genomic data. BMC Genomics. 2008;9:261. doi: 10.1186/1471-2164-9-261. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Hall R.M. Integrons and gene cassettes: hotspots of diversity in bacterial genomes. Ann. N Y Acad. Sci. 2012;1267:71–78. doi: 10.1111/j.1749-6632.2012.06588.x. [DOI] [PubMed] [Google Scholar]
  • 24.Gillings M., Boucher Y., Labbate M., Holmes A., Krishnan S., Holley M., Stokes H.W. The evolution of class 1 integrons and the rise of antibiotic resistance. J. Bacteriol. 2008;190:5095–5100. doi: 10.1128/JB.00152-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Mazel D., Dychinco B., Webb V.A., Davies J. A distinctive class of integron in the Vibrio cholerae genome. Science. 1998;280:605–608. doi: 10.1126/science.280.5363.605. [DOI] [PubMed] [Google Scholar]
  • 26.Rowe-Magnus D.A., Guerout A.M., Ploncard P., Dychinco B., Davies J., Mazel D. The evolutionary history of chromosomal super-integrons provides an ancestry for multiresistant integrons. Proc. Natl Acad. Sci. U.S.A. 2001;98:652–657. doi: 10.1073/pnas.98.2.652. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Hochhut B., Lotfi Y., Mazel D., Faruque S.M., Woodgate R., Waldor M.K. Molecular analysis of antibiotic resistance gene clusters in vibrio cholerae O139 and O1 SXT constins. Antimicrob. Agents Chemother. 2001;45:2991–3000. doi: 10.1128/AAC.45.11.2991-3000.2001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Iwanaga M., Toma C., Miyazato T., Insisiengmay S., Nakasone N., Ehara M. Antibiotic resistance conferred by a class I integron and SXT constin in Vibrio cholerae O1 strains isolated in Laos. Antimicrob. Agents Chemother. 2004;48:2364–2369. doi: 10.1128/AAC.48.7.2364-2369.2004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Gillings M.R., Holley M.P., Stokes H.W., Holmes A.J. Integrons in Xanthomonas: a source of species genome diversity. Proc. Natl Acad. Sci. U.S.A. 2005;102:4419–4424. doi: 10.1073/pnas.0406620102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Holmes A.J., Gillings M.R., Nield B.S., Mabbutt B.C., Nevalainen K.M., Stokes H.W. The gene cassette metagenome is a basic resource for bacterial genome evolution. Environ. Microbiol. 2003;5:383–394. doi: 10.1046/j.1462-2920.2003.00429.x. [DOI] [PubMed] [Google Scholar]
  • 31.Moura A., Henriques I., Ribeiro R., Correia A. Prevalence and characterization of integrons from bacteria isolated from a slaughterhouse wastewater treatment plant. J. Antimicrobial Chemother. 2007;60:1243–1250. doi: 10.1093/jac/dkm340. [DOI] [PubMed] [Google Scholar]
  • 32.Stalder T., Barraud O., Casellas M., Dagot C., Ploy M.C. Integron involvement in environmental spread of antibiotic resistance. Front. Microbiol. 2012;3:119. doi: 10.3389/fmicb.2012.00119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Gillings M.R., Gaze W.H., Pruden A., Smalla K., Tiedje J.M., Zhu Y.G. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME J. 2015;9:1269–1279. doi: 10.1038/ismej.2014.226. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Tsafnat G., Coiera E., Partridge S.R., Schaeffer J., Iredell J.R. Context-driven discovery of gene cassettes in mobile integrons using a computational grammar. BMC Bioinformatics. 2009;10:281. doi: 10.1186/1471-2105-10-281. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Eddy S.R., Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Res. 1994;22:2079–2088. doi: 10.1093/nar/22.11.2079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Nawrocki E.P., Eddy S.R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Neron B., Menager H., Maufrais C., Joly N., Maupetit J., Letort S., Carrere S., Tuffery P., Letondal C. Mobyle: a new full web bioinformatics framework. Bioinformatics. 2009;25:3005–3011. doi: 10.1093/bioinformatics/btp493. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Moura A., Soares M., Pereira C., Leitao N., Henriques I., Correia A. INTEGRALL: a database and search engine for integrons, integrases and gene cassettes. Bioinformatics. 2009;25:1096–1098. doi: 10.1093/bioinformatics/btp105. [DOI] [PubMed] [Google Scholar]
  • 39.Cambray G., Sanchez-Alberola N., Campoy S., Guerin E., Da Re S., Gonzalez-Zorn B., Ploy M.C., Barbe J., Mazel D., Erill I. Prevalence of SOS-mediated control of integron integrase expression as an adaptive trait of chromosomal and mobile integrons. Mobile DNA. 2011;2:6. doi: 10.1186/1759-8753-2-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461. [DOI] [PubMed] [Google Scholar]
  • 41.Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Eddy S.R. Accelerated profile HMM searches. PLoS Comput. Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Gibson M.K., Forsberg K.J., Dantas G. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology. ISME J. 2015;9:207–216. doi: 10.1038/ismej.2014.106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Finn R.D., Tate J., Mistry J., Coggill P.C., Sammut S.J., Hotz H.R., Ceric G., Forslund K., Eddy S.R., Sonnhammer E.L., et al. The Pfam protein families database. Nucleic Acids Res. 2008;36:D281–D288. doi: 10.1093/nar/gkm960. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Cock P.J., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Hyatt D., Chen G.L., Locascio P.F., Land M.L., Larimer F.W., Hauser L.J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. doi: 10.1186/1471-2105-11-119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Bissonnette L., Roy P.H. Characterization of In0 of Pseudomonas aeruginosa plasmid pVS1, an ancestor of integrons of multiresistance plasmids and transposons of gram-negative bacteria. J. Bacteriol. 1992;174:1248–1257. doi: 10.1128/jb.174.4.1248-1257.1992. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Kearse M., Moir R., Wilson A., Stones-Havas S., Cheung M., Sturrock S., Buxton S., Cooper A., Markowitz S., Duran C., et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28:1647–1649. doi: 10.1093/bioinformatics/bts199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Edgar R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Criscuolo A., Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 2010;10:210. doi: 10.1186/1471-2148-10-210. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 2015;32:268–274. doi: 10.1093/molbev/msu300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Touchon M., Cury J., Yoon E.-J., Krizova L., Cerqueira G.C., Murphy C., Feldgarden M., Wortman J., Clermont D., Lambert T., et al. The Genomic Diversification of the Whole Acinetobacter Genus: Origins, Mechanisms, and Consequences. Genome Biol. Evol. 2014;6:2866–2882. doi: 10.1093/gbe/evu225. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Miele V., Penel S., Duret L. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics. 2011;12:116. doi: 10.1186/1471-2105-12-116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Lee B.M., Park Y.J., Park D.S., Kang H.W., Kim J.G., Song E.S., Park I.C., Yoon U.H., Hahn J.H., Koo B.S., et al. The genome sequence of Xanthomonas oryzae pathovar oryzae KACC10331, the bacterial blight pathogen of rice. Nucleic Acids Res. 2005;33:577–586. doi: 10.1093/nar/gki206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Siguier P., Perochon J., Lestrade L., Mahillon J., Chandler M. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 2006;34:D32–D36. doi: 10.1093/nar/gkj014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Touchon M., Rocha E.P. Causes of insertion sequences abundance in prokaryotic genomes. Mol. Biol. Evol. 2007;24:969–981. doi: 10.1093/molbev/msm014. [DOI] [PubMed] [Google Scholar]
  • 58.Brown H.J., Stokes H.W., Hall R.M. The integrons In0, In2, and In5 are defective transposon derivatives. J. Bacteriol. 1996;178:4429–4437. doi: 10.1128/jb.178.15.4429-4437.1996. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Cambray G., Guerout A.M., Mazel D. Integrons. Annu. Rev. Genet. 2010;44:141–166. doi: 10.1146/annurev-genet-102209-163504. [DOI] [PubMed] [Google Scholar]
  • 60.Segal H., Victoria Francia M., Garcia Lobo J.M., Elisha G. Reconstruction of an active integron recombination site after integration of a gene cassette at a secondary site. Antimicrob. Agents Chemother. 1999;43:2538–2541. doi: 10.1128/aac.43.10.2538. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Silva F.J., Latorre A., Moya A. Why are the genomes of endosymbiotic bacteria so stable? Trends Genet. 2003;19:176–180. doi: 10.1016/S0168-9525(03)00041-6. [DOI] [PubMed] [Google Scholar]
  • 62.Canback B., Tamas I., Andersson S.G. A phylogenomic study of endosymbiotic bacteria. Mol. Biol. Evol. 2004;21:1110–1122. doi: 10.1093/molbev/msh122. [DOI] [PubMed] [Google Scholar]
  • 63.McCutcheon J.P., Moran N.A. Extreme genome reduction in symbiotic bacteria. Nat. Rev. Microbiol. 2012;10:13–26. doi: 10.1038/nrmicro2670. [DOI] [PubMed] [Google Scholar]
  • 64.Ochman H., Lawrence J.G., Groisman E.A. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304. doi: 10.1038/35012500. [DOI] [PubMed] [Google Scholar]
  • 65.Cordero O.X., Hogeweg P. The impact of long-distance horizontal gene transfer on prokaryotic genome size. Proc. Natl Acad. Sci. U.S.A. 2009;106:21748–21753. doi: 10.1073/pnas.0907584106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Baltrus D.A. Exploring the costs of horizontal gene transfer. Trends Ecol. Evol. 2013;28:489–495. doi: 10.1016/j.tree.2013.04.002. [DOI] [PubMed] [Google Scholar]
  • 67.Salah P., Bisaglia M., Aliprandi P., Uzan M., Sizun C., Bontems F. Probing the relationship between Gram-negative and Gram-positive S1 proteins by sequence analysis. Nucleic Acids Res. 2009;37:5578–5588. doi: 10.1093/nar/gkp547. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Mazodier P., Davies J. Gene transfer between distantly related bacteria. Annu. Rev. Genet. 1991;25:147–171. doi: 10.1146/annurev.ge.25.120191.001051. [DOI] [PubMed] [Google Scholar]
  • 69.Kloesges T., Popa O., Martin W., Dagan T. Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths. Mol. Biol. Evol. 2011;28:1057–1074. doi: 10.1093/molbev/msq297. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Nandi S., Maurer J.J., Hofacre C., Summers A.O. Gram-positive bacteria are a major reservoir of Class 1 antibiotic resistance integrons in poultry litter. Proc. Natl Acad. Sci. U.S.A. 2004;101:7118–7122. doi: 10.1073/pnas.0306466101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Le Roux F., Zouine M., Chakroun N., Binesse J., Saulnier D., Bouchier C., Zidane N., Ma L., Rusniok C., Lajus A., et al. Genome sequence of Vibrio splendidus: an abundant planctonic marine species with a large genotypic diversity. Environ. Microbiol. 2009;11:1959–1970. doi: 10.1111/j.1462-2920.2009.01918.x. [DOI] [PubMed] [Google Scholar]
  • 72.Gestal A.M., Liew E.F., Coleman N.V. Natural transformation with synthetic gene cassettes: new tools for integron research and biotechnology. Microbiology. 2011;157:3349–3360. doi: 10.1099/mic.0.051623-0. [DOI] [PubMed] [Google Scholar]
  • 73.Crooks G.E., Hon G., Chandonia J.M., Brenner S.E. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–1190. doi: 10.1101/gr.849004. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data
SUPPLEMENTARY DATA

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES