ABSTRACT
Genomic islands (GIs) are integrative mobile DNAs shared among bacteria or archaea, often bringing cargo genes that affect bacterial phenotypes like virulence and metabolism; many are prophages with utility in therapy. We applied our TIGER and Islander software to ~460,000 prokaryotic genomes, yielding a large database containing ~1.75 million GIs.
KEYWORDS: genomic island, sequence database, comparative genomics, bioinformatics
ANNOUNCEMENT
Genomic islands (GIs), a diverse class of mobile genetic elements encoding integrases, promote horizontal gene transfer in bacteria and archaea. Identifying and precisely mapping GIs has been challenging: their gene content is often poorly understood, and their genomic endpoints are difficult to identify and validate. However, precise mapping provides insights into mechanisms of transfer and integration, identifies prophages useful for various applications (1), reveals novel cases of regulated gene integrity (2), and enables discovery and characterization of auxiliary metabolic genes, virulence factors, and other genes that alter host phenotype (3). Improved knowledge of GI sequences and genes also facilitates characterization of host–GI defense systems (4) and development of novel tools for synthetic biology (e.g., integrases [5], CRISPR systems [6], polymerases [7, 8]).
We aimed to develop the most precise and phylogenetically comprehensive prokaryotic GI database possible with current data. We downloaded 458,681 prokaryotic genomes from GenBank that could be assigned to species in Genome Taxonomy Database (GTDB) release 214, as described in reference (9). We applied two orthogonal GI-finding algorithms notable for their precision in mapping GI termini: Islander 1.0 (10), which is limited to finding single-contig islands in tRNA/tmRNA genes, and TIGER 2.0 (2, 11, 12), which compares each genome to reference genomes and can detect cross-contig islands. TIGER was run in “circle-junction” mode for presumably complete genomes with five or fewer contigs and in “cross-contig” mode otherwise. Cross-contig islands have precisely defined termini but may be missing internal sequences. Additional software in TIGER 2.0 resolved overlaps and tandem arrays. GIs shorter than 2 kbp or longer than 200 kbp were excluded. This process yielded 1,757,053 island assignments, likely including some false positives. Most GIs have a simple structure, internal to a single contig (Fig. 1A).
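The mode selection and size filter described above can be illustrated with a minimal sketch. This is not code from TIGER or Islander: the function and constant names are hypothetical, and the sketch assumes that contig count alone decides whether a genome is treated as presumably complete and that islands of exactly 2 kbp or 200 kbp are retained.

```python
# Illustrative sketch only; names are hypothetical and not part of TIGER or Islander.

MIN_GI_BP = 2_000        # GIs shorter than 2 kbp were excluded
MAX_GI_BP = 200_000      # GIs longer than 200 kbp were excluded
MAX_CONTIGS_FOR_CIRCLE_JUNCTION = 5


def choose_tiger_mode(n_contigs: int) -> str:
    """Return the TIGER run mode for a genome with the given contig count.

    Assumption: a genome with five or fewer contigs is treated as
    presumably complete; no other completeness criterion is modeled here.
    """
    if n_contigs < 1:
        raise ValueError("a genome must have at least one contig")
    if n_contigs <= MAX_CONTIGS_FOR_CIRCLE_JUNCTION:
        return "circle-junction"
    return "cross-contig"


def passes_size_filter(length_bp: int) -> bool:
    """Keep islands whose length lies within the retained size range."""
    return MIN_GI_BP <= length_bp <= MAX_GI_BP
```

Under these assumptions, a six-contig assembly would be run in cross-contig mode, and a 1.5-kbp candidate island would be discarded.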
Our software requires that each GI contain at least one integrase gene, most commonly in the tyrosine recombinase family (68%, Fig. 1B). Among GIs containing integrases in the serine recombinase family, members of the Serine Core group (lacking the Pfam domain “Recombinase” characteristic of integrases) are usually annotated as DNA invertases or resolvases; Serine Core GIs may be false positives, or they may reveal novel Recombinase-lacking integrases or the use of exogenous helper integrases. The site-promiscuous IS607 subfamily is known to mobilize a group of transposon-like elements (13). Our precise detection methods specify the locus of integration for each GI. We find that 33.8% of GIs integrate into a tRNA or tmRNA gene, 43.4% into protein-coding sequences, and 22.8% into intergenic regions (Fig. 1C). Based on gene content, 34.9% of GIs were categorized as prophages, 7.0% as integrative and conjugative elements (ICEs), and <0.1% as phage-ICE tandems. The remaining 58% were not readily categorizable by current gene content algorithms (Fig. 1D). The GI size distribution has two prominent peaks, at 10 kbp and 40 kbp; the latter is attributed to our top category of prophages (Phage1) (Fig. 1E). The database provides a large, diverse survey of prokaryotic GIs that is unique for the precision of its integration site mapping.
Fig 1.
Statistics of GI assignments. (A) GI composition. TIGER can detect split (cross-contig or circle-junction) GIs as well as simple GIs internal to a single contig. (B) Integrase type used in each GI. “Serine core” recombinases lack the characteristic Recombinase domain, are typically not integrases, and are usually annotated as invertases or resolvases. “S-Core_IS607” is a site-promiscuous clade associated with the insertion sequence IS607. (C) Integration site. CDS: protein-coding sequence. (D) GI type based on gene content. (E) GI length distribution. The top panel shows the entire data set; the lower panels show the Phage1, Phage2, and combined ICE1/2 types.
ACKNOWLEDGMENTS
This material is based upon work supported by the Laboratory Directed Research and Development program of Sandia National Laboratories and by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under the Secure Biosystems Design Initiative. Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC (NTESS), a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (DOE/NNSA) under contract DE-NA0003525.
Contributor Information
Kelly P. Williams, Email: kpwilli@sandia.gov.
Jason E. Stajich, University of California Riverside, Riverside, California, USA
DATA AVAILABILITY
The database is available as a flat file at https://figshare.com/s/fe1d563b00782d187880 (file extension .tsv.gz; size 108.4 Mb, inflating to 635.4 Mb). Also at that website is a table listing the genomes analyzed.
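As a rough sketch of how the gzip-compressed tab-separated file might be read without inflating it to disk, the following uses only the Python standard library. It assumes that the file begins with a header row. The column names are not specified here, so any column name passed to `tally_column` is a placeholder that must be replaced with a real header from the file.

```python
import csv
import gzip
from collections import Counter


def read_gi_table(path):
    """Yield one dict per data row of a gzip-compressed TSV file.

    Assumption: the first line of the file is a header row naming the columns.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def tally_column(path, column):
    """Count how often each value occurs in one column of the table."""
    return Counter(row[column] for row in read_gi_table(path))
```

Because rows are streamed one at a time, memory use stays modest even though the inflated file is several hundred megabytes. A call such as `tally_column("islands.tsv.gz", "site_type")` would summarize one column, where both the file name and the column name are hypothetical.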
REFERENCES
- 1. Mageeney CM, Sinha A, Mosesso RA, Medlin DL, Lau BY, Rokes AB, Lane TW, Branda SS, Williams KP. 2020. Computational basis for on-demand production of diversified therapeutic phage cocktails. mSystems 5:e00659-20. doi: 10.1128/mSystems.00659-20
- 2. Mageeney CM, Lau BY, Wagner JM, Hudson CM, Schoeniger JS, Krishnakumar R, Williams KP. 2020. New candidates for regulated gene integrity revealed through precise mapping of integrative genetic elements. Nucleic Acids Res 48:4052–4065. doi: 10.1093/nar/gkaa156
- 3. Jones JM, Grinberg I, Eldar A, Grossman AD. 2021. A mobile genetic element increases bacterial host fitness by manipulating development. Elife 10:e65924. doi: 10.7554/eLife.65924
- 4. Botelho J. 2023. Defense systems are pervasive across chromosomally integrated mobile genetic elements and are inversely correlated to virulence and antimicrobial resistance. Nucleic Acids Res 51:4385–4397. doi: 10.1093/nar/gkad282
- 5. Fogg PCM, Colloms S, Rosser S, Stark M, Smith MCM. 2014. New applications for phage integrases. J Mol Biol 426:2703–2716. doi: 10.1016/j.jmb.2014.05.014
- 6. Al-Shayeb B, Skopintsev P, Soczek KM, Stahl EC, Li Z, Groover E, Smock D, Eggers AR, Pausch P, Cress BF, Huang CJ, Staskawicz B, Savage DF, Jacobsen SE, Banfield JF, Doudna JA. 2022. Diverse virus-encoded CRISPR-Cas systems include streamlined genome editors. Cell 185:4574–4586. doi: 10.1016/j.cell.2022.10.020
- 7. Morcinek-Orłowska J, Zdrojewska K, Węgrzyn A. 2022. Bacteriophage-encoded DNA polymerases-beyond the traditional view of polymerase activities. Int J Mol Sci 23:635. doi: 10.3390/ijms23020635
- 8. Nair A, Kis Z. 2024. Bacteriophage RNA polymerases: catalysts for mRNA vaccines and therapeutics. Front Mol Biosci 11:1504876. doi: 10.3389/fmolb.2024.1504876
- 9. Nawrocki EP, Petrov AI, Williams KP. 2025. Expansion of the tmRNA sequence database and new tools for search and visualization. NAR Genom Bioinform 7:lqaf019. doi: 10.1093/nargab/lqaf019
- 10. Hudson CM, Lau BY, Williams KP. 2015. Islander: a database of precisely mapped genomic islands in tRNA and tmRNA genes. Nucleic Acids Res 43:D48–53. doi: 10.1093/nar/gku1072
- 11. Mageeney CM, Trubl G, Williams KP. 2022. Improved mobilome delineation in fragmented genomes. Front Bioinform 2:866850. doi: 10.3389/fbinf.2022.866850
- 12. Yu SL, Mageeney CM, Shormin F, Ghaffari N, Williams KP. 2024. Speeding genomic island discovery through systematic design of reference database composition. PLoS One 19:e0298641. doi: 10.1371/journal.pone.0298641
- 13. Boocock MR, Rice PA. 2013. A proposed mechanism for IS607-family serine transposases. Mob DNA 4:24. doi: 10.1186/1759-8753-4-24

