Abstract
The first insect genome assembly (Drosophila melanogaster) was published two decades ago. Today, nuclear genome assemblies are available for a staggering 601 insect species representing 20 orders. In this study, we analyzed the most-contiguous assembly for each species and provide a “state-of-the-field” perspective, emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technologies. Relative to species richness, genomic efforts have been biased toward four orders (Diptera, Hymenoptera, Collembola, and Phasmatodea), Coleoptera are underrepresented, and 11 orders still lack a publicly available genome assembly. The average insect genome assembly is 439.2 Mb in length with 87.5% of single-copy benchmarking genes intact. Most notable has been the impact of long-read sequencing; assemblies that incorporate long reads are ∼48× more contiguous than those that do not. We offer four recommendations as we collectively continue building insect genome resources: 1) seek better integration between independent research groups and consortia, 2) balance future sampling between filling taxonomic gaps and generating data for targeted questions, 3) take advantage of long-read sequencing technologies, and 4) expand and improve gene annotations.
Keywords: Insecta, Arthropoda, arthropod genomics, long-read sequencing, Pacific Biosciences, Oxford Nanopore
Since the first insect genome was sequenced ∼20 years ago, sequencing technologies and the availability of insect genome assemblies have both advanced dramatically. In this study, we curated, analyzed, and summarized the field of insect genomics in terms of taxonomic representation, assembly quality, gene completeness, and sequencing technology. We show that 601 insect species have genome assemblies available, with some groups heavily overrepresented (e.g., Diptera) relative to others (e.g., Coleoptera). The major takeaway of our study is that genome assemblies produced with long reads are ∼48× more contiguous than short-read assemblies.
Since the publication of the Drosophila melanogaster genome (Adams et al. 2000), sequencing and analytical technologies have developed rapidly, bringing the power of genome sequencing to an ever-expanding pool of researchers. More than 600 insects have now had their nuclear genome sequenced and made publicly available in the GenBank repository (Sayers et al. 2021). Although representing just 0.06% of the ∼1 million described insects (Stork 2018), this breadth of insect genome sequencing still spans ∼480 Myr of evolution (Misof et al. 2014) and roughly two orders of magnitude in genome size, from the tiny 99 Mb genome of Belgica antarctica (Kelley et al. 2014) to the massive 6.5 Gb genome of Locusta migratoria (Wang et al. 2014).
Accumulating genomic resources have transformed biological research and precipitated major advances in our understanding of the origins of biodiversity (Seehausen et al. 2014; Hug et al. 2016; McKenna et al. 2019; McGee et al. 2020). Considerable progress has been driven by large-scale consortia (e.g., Human Genome Project [Collins et al. 2003]; Vertebrate Genome Project [Rhie et al. 2021]), and for insects, the most prominent consortium has been the i5K initiative to sequence genomes for 5,000 different arthropods (Robinson et al. 2011; i5K Consortium 2013). The rise of long-read sequencing technologies, primarily Oxford Nanopore and Pacific Biosciences (PacBio), has also changed the landscape of genome sequencing by providing an economical means for high-throughput generation of reads that are commonly 25 kb or longer (Amarasinghe et al. 2020), thereby greatly increasing the average size of sequences used in assemblies. Genome sequencing efforts in insects, however, have not been spread evenly. Aquatic insects, as a group, are underrepresented relative to their terrestrial counterparts (Hotaling et al. 2020). Some orders (e.g., Diptera) are represented by far more genome assemblies than their species diversity alone would warrant, likely reflecting the model organisms within them, whereas many orders still have no genomic representation.
Here, we curated and analyzed the best available genome assembly for 601 insects (species or subspecies). We provide a “state-of-the-field” perspective emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technology. We focused on taxonomic breadth rather than within-group efforts (e.g., The Anopheles gambiae 1000 Genomes Consortium [2017]) to gain a more holistic overview of the field. Following similar studies (e.g., Misof et al. 2014; Petersen et al. 2019; Hotaling et al. 2020), we defined insects to include all groups within the subphylum Hexapoda. We downloaded metadata from GenBank for all nuclear hexapod genome assemblies on an order-by-order basis (Sayers et al. 2021; accessed November 2, 2020). We culled this data set to include only the assembly with the highest contig N50 for each taxon and downloaded these assemblies for analysis. We acknowledge that this filtering approach may bias the data set toward the present day for taxa whose assemblies have been improved over the years. Assemblies were classified as “short read,” “long read,” or “not provided” based on whether only short reads (e.g., Illumina) were used, any amount of long-read sequence data (e.g., PacBio) was used, or no information was provided. If an assembly used both short and long reads (a “hybrid” assembly), it was classified as a long-read assembly in our analysis.
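The filtering and classification rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the scripts used in the study: the `Assembly` record, its field names, the accession strings, and the `LONG_READ_PLATFORMS` set are all hypothetical, and any platform not in that set is assumed to be a short-read technology.

```python
from dataclasses import dataclass


# Hypothetical record layout; field names are illustrative, not GenBank's.
@dataclass(frozen=True)
class Assembly:
    taxon: str
    accession: str
    contig_n50: int        # contig N50 in bp
    technologies: tuple    # e.g. ("Illumina", "PacBio"); empty if not reported


# Assumed list of long-read platforms (the two named in the text).
LONG_READ_PLATFORMS = {"pacbio", "oxford nanopore"}


def classify(technologies):
    """Return 'long read', 'short read', or 'not provided'."""
    names = {t.strip().lower() for t in technologies}
    if not names:
        return "not provided"
    if names & LONG_READ_PLATFORMS:
        return "long read"     # hybrid assemblies count as long read
    return "short read"        # assumption: everything else is short read


def best_per_taxon(assemblies):
    """Keep only the assembly with the highest contig N50 for each taxon."""
    best = {}
    for asm in assemblies:
        current = best.get(asm.taxon)
        if current is None or asm.contig_n50 > current.contig_n50:
            best[asm.taxon] = asm
    return best


# Made-up example records.
records = [
    Assembly("Species A", "asm-1", 12_000, ("Illumina",)),
    Assembly("Species A", "asm-2", 2_500_000, ("Illumina", "PacBio")),
    Assembly("Species B", "asm-3", 45_000, ()),
]
best = best_per_taxon(records)
for taxon, asm in best.items():
    print(taxon, asm.accession, classify(asm.technologies))
```

In this toy input, the hybrid assembly is retained for "Species A" because it has the higher contig N50, and it is labeled as a long-read assembly.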
To test if insect orders were under- or overrepresented in terms of genome assembly availability, we compared the observed number of taxa with assemblies to the expected number given the described diversity for a given order. We obtained totals for the number of insects described overall and for each order from previous studies (Zhang 2011; Bellinger et al. 2020). We assessed the significance of differences between observed and expected representation with Fisher’s exact tests. To assess gene completeness, we ran “Benchmarking Universal Single-Copy Orthologs” (BUSCO) v.4.1.4 (Seppey et al. 2019) on each assembly using the 1,367 reference genes in the OrthoDB v.10 Insecta gene set (Kriventseva et al. 2019). It should be noted that Collembola genome assemblies may have received slightly lower BUSCO scores in this analysis because noninsect hexapod genomes were not used to generate the Insecta gene set. We tested for differences in distributions of contig N50 or assembly size between short- and long-read assemblies with Welch’s t-tests. Next, using the BUSCO gene set, we tested whether longer genes were more likely to be missing or fragmented depending on sequencing technology (short or long read) with Spearman’s correlations. We defined BUSCO gene length as the full nucleotide sequence for the protein-coding portions of the consensus “ancestral” genes included in the OrthoDB v.10 Insecta gene set. An extended version of the methods and the scripts used for analysis are provided in the supplementary material, Supplementary Material online and in a GitHub repository (https://github.com/pbfrandsen/insect_genome_assemblies, last accessed July 15, 2021).
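As a rough illustration of the representation test, the sketch below computes an expected assembly count for one order and a two-sided Fisher's exact P value from a 2 × 2 table. The table layout (order vs. all other insects, with vs. without an assembly) is an assumption about how such a test could be set up, not a description of the study's scripts. The total of 1,000,000 described insects is the rounded figure quoted in the text rather than the exact total used in the analysis, so the expected count will differ slightly from the reported value.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_pmf(k):
        return (log_comb(row1, k) + log_comb(row2, col1 - k)
                - log_comb(total, col1))

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    log_observed = log_pmf(a)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        lp = log_pmf(k)
        if lp <= log_observed + 1e-7:   # small tolerance for float error
            p_value += math.exp(lp)
    return min(1.0, p_value)


# Assumption: rounded total of described insects, not the study's exact total.
TOTAL_DESCRIBED = 1_000_000
TOTAL_ASSEMBLIES = 601
coleoptera_described = 387_100
coleoptera_assemblies = 41

expected = TOTAL_ASSEMBLIES * coleoptera_described / TOTAL_DESCRIBED
other_assemblies = TOTAL_ASSEMBLIES - coleoptera_assemblies
p = fisher_exact_two_sided(
    coleoptera_assemblies,                                      # order, with assembly
    coleoptera_described - coleoptera_assemblies,               # order, without
    other_assemblies,                                           # others, with assembly
    TOTAL_DESCRIBED - coleoptera_described - other_assemblies,  # others, without
)
print(f"expected ~{expected:.0f}, observed {coleoptera_assemblies}, P = {p:.3g}")
```

With the rounded total, the expected Coleoptera count comes out near 233 rather than the ∼228 obtained with the exact totals, but the conclusion of strong underrepresentation is unchanged.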
As of November 2020, 601 different insect species representing 20 orders had nuclear genome assemblies available in GenBank. These data were dominated by Diptera (n = 169 assemblies), Hymenoptera (n = 164), and Lepidoptera (n = 118; fig. 1a). Four orders were overrepresented relative to their species diversity: Collembola, Diptera, Hymenoptera, and Phasmatodea (P, Fisher’s < 0.03; fig. 1a). Coleoptera, with 387,100 described species (Zhang 2011), was significantly underrepresented (41 assemblies vs. ∼228 expected; P, Fisher’s < 0.01). Six orders were represented by only one genome assembly and 11 orders had no publicly available assembly. This lack of representation was particularly striking for Neuroptera (5,868 described species, Zhang 2011).
On average, insect genome assemblies were 439.2 Mb in length (SD = 448.4 Mb; fig. 2a) with a mean contig N50 of 1.09 Mb (SD = 4.01 Mb) and 87.5% (SD = 21%) BUSCO completeness (single and duplicated genes, combined). Substantial variation existed in all three metrics, however, with assemblies ranging from the highly incomplete assembly of Piezodorus guildini at just 3.2 Mb (contig N50 = 1.5 kb, BUSCO completeness = 0.2%) to the exceptionally high-quality 140.7 Mb assembly of D. melanogaster (contig N50 = 22.4 Mb, BUSCO completeness = 99.9%; fig. 2 and supplementary table S1, Supplementary Material online). For orders represented by >10 taxa, Hymenoptera assemblies were the most complete (BUSCO completeness = 94%, SD = 14.3%) and Lepidoptera the least (74.6%, SD = 28.2%; fig. 2b). At 15.3%, Lepidoptera had the lowest percentage of long-read assemblies (supplementary fig. S1, Supplementary Material online) and Heliconius assemblies were particularly fragmented (fig. 1d). For families represented by >10 taxa, Drosophilidae assemblies were the most complete (BUSCO completeness = 98.4%, SD = 2%) followed closely by Apidae assemblies (97.9%, SD = 3.7%; figs. 1d and 2b). As expected, assemblies with higher contig N50 lengths were also more complete (fig. 2f) but assembly size had little to no effect on gene completeness (supplementary fig. S3, Supplementary Material online).
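For readers less familiar with the contiguity metric used throughout, contig N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that standard definition follows; the function name and the toy contig lengths are illustrative only.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Two toy assemblies with the same total length (290 units):
contiguous = [80, 70, 50, 40, 30, 20]
fragmented = [10] * 29

print(contig_n50(contiguous))   # 70
print(contig_n50(fragmented))   # 10
```

The two toy assemblies have identical total size but very different N50 values, which is why assembly length alone says little about contiguity.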
The type(s) of sequence data used for genome assembly were obtained for ∼82% of assemblies (long read = 126, short read = 365; supplementary table S1, Supplementary Material online). Long-read assemblies were more contiguous than short-read assemblies (fig. 1b; P, Welch’s t-test < 0.0001), averaging contig N50 values that were ∼4.4 Mb higher despite no difference in assembly size (P, Welch’s t-test = 0.12; supplementary fig. S4, Supplementary Material online). Gene regions were also far more complete in long-read assemblies (mean BUSCO completeness = 96%, SD = 7%) versus those generated from short reads (89.1%, SD = 19%; P, Welch’s t-test < 1e-8; fig. 2c) with 70% fewer fragmented genes (P, Welch’s t-test < 1e-11; fig. 2d). Long-read assemblies, however, had ∼2.6× more duplicated genes (4.4% vs. 1.7%; P, Welch’s t-test = 0.003; fig. 2e). Longer BUSCO genes were also more likely to be fragmented in both short-read (Spearman’s ρ: 0.24, P < 2.2e-16) and long-read assemblies (Spearman’s ρ: 0.08, P = 0.002; fig. 2g) but they were less likely to be missing in both when compared with shorter genes (short read: Spearman’s ρ: −0.08, P = 0.002; long read: Spearman’s ρ: −0.18, P = 9.7e-12; supplementary fig. S5, Supplementary Material online).
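The two statistics reported above can be sketched with the Python standard library alone. The code below computes the Welch's t statistic with its Welch–Satterthwaite degrees of freedom, and Spearman's rank correlation with tied values given their average rank. It is a hedged illustration of the textbook formulas, not the analysis code; P values are omitted because they require a t-distribution routine, and all input values are made up.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up contig N50 values (Mb) for two groups of assemblies.
long_read_n50 = [2.0, 4.0, 6.0]
short_read_n50 = [1.0, 2.0, 3.0]
t_stat, dof = welch_t(long_read_n50, short_read_n50)
print(f"t = {t_stat:.3f}, df = {dof:.3f}")

# Made-up gene lengths and a monotonically increasing response.
gene_length = [500, 1200, 3000, 9000]
fragmented_count = [1, 4, 9, 16]
print(spearman_rho(gene_length, fragmented_count))
```

Welch's test is the natural choice here because the long-read and short-read groups differ in both sample size and variance, and Spearman's ρ only assumes a monotonic relationship between gene length and fragmentation.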
The rate at which new insect genome assemblies are becoming available is clearly accelerating (fig. 1c). Nearly 50% (n = 292) of the best-available insect assemblies were accessioned in 2019–2020 (supplementary tables S1 and S2, Supplementary Material online). The same period also represented a high-water mark of contiguity (mean contig N50, 2019–2020 = 1.77 Mb; supplementary table S2, Supplementary Material online). Much of the increase in contiguity was driven by long-read assemblies which rose in frequency from 0% of all assemblies in 2011–2012 to 36.1% in 2019–2020. The contiguity of long-read assemblies also sharply increased in 2017 (supplementary fig. S6 and table S2, Supplementary Material online).
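The 2019–2020 count can be converted into an approximate deposition rate. The short calculation below assumes the window runs from January 1, 2019 to the GenBank access date given in the methods (November 2, 2020); the start date is an assumption made for illustration.

```python
from datetime import date

window_start = date(2019, 1, 1)   # assumption: window opens at the start of 2019
window_end = date(2020, 11, 2)    # GenBank access date stated in the methods
n_recent = 292                    # best-available assemblies accessioned in 2019-2020
n_total = 601                     # all taxa with an assembly

days = (window_end - window_start).days
days_per_assembly = days / n_recent
share_recent = 100 * n_recent / n_total

print(f"{days} days / {n_recent} assemblies = {days_per_assembly:.1f} days per assembly")
print(f"{share_recent:.1f}% of best-available assemblies date from this window")
```

Under these assumptions, the window spans 671 days, which corresponds to roughly one newly represented taxon every 2.3 days.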
We have entered a new era of insect genome biology. Since 2019, a new species has had its genome assembly deposited in GenBank every 2.3 days. These new assemblies are, on average, markedly more contiguous than those of just a few years ago. As we continue developing these resources, we offer four recommendations. First, we should recognize the community-driven nature of these data and seek better integration between research groups and consortia in terms of data sharing, best practices, and taxonomic focus. Progress toward these goals is occurring (e.g., a proposed metric system for describing genome assembly quality with associated benchmark standards from the Earth BioGenome Project, Lewin et al. 2018) and will accelerate as more researchers integrate these standards into their own workflows. Second, new sequencing efforts should strive to balance sampling that fills taxonomic gaps and improves existing resources with targeted sampling motivated by specific questions. Both approaches are valuable and not mutually exclusive. The former (filling taxonomic gaps) is critical to broadly understanding the evolution of insects, the most diverse animal group on Earth, whereas the latter (targeted, question-driven sequencing) is critical to understanding specific aspects of genome biology, which are often best addressed using dense sampling of specific groups. Importantly, success for this recommendation will depend, in part, on our first recommendation. Better integration and communication will limit redundancy of efforts where the same species’ genome is sequenced by multiple groups simultaneously. Third, we echo the findings of the Vertebrate Genome Project (Rhie et al. 2021), namely that long-read assemblies are vastly more contiguous than short-read assemblies, and recommend that these technologies be embraced by insect genome scientists. Fourth, as of 2019, only 40% of insect genome assemblies had corresponding gene annotations in GenBank (Li et al. 2019).
Expanding and refining the availability of gene annotations for insects will drive corresponding increases in the scale of taxonomic comparisons that are possible for many analyses. Overcoming this challenge of annotation quality and availability can be subdivided into two more specific calls: 1) whenever possible, annotations should be made available alongside genome assemblies in GenBank or similar public repositories and 2) researchers should consider using annotations produced by the NCBI Eukaryotic Genome Annotation Pipeline (Thibaud-Nissen et al. 2016) to limit variation introduced by differing annotation approaches and maximize compatibility.
Beyond resource development, we must continue to leverage this data set to conduct new studies of insect genome biology and evolution. These efforts are beginning to emerge and are paying dividends. For instance, 76 arthropod genome assemblies were used to better understand 500 Myr of evolution by characterizing changes in gene and protein content in a temporal and phylogenetic context, including the identification of novel gene families that arose during diversification with links to key adaptations including flight (Thomas et al. 2020). Similarly, a study of 195 insect genomes revealed the high diversity of transposable elements across insects with varying levels of conservation depending upon phylogenetic position (Gilbert et al. 2021). With genome assemblies representing 600+ taxa and ∼480 Myr of evolution available in a public repository, the power and promise of insect genome research have never been greater. Although our focus was on insects, long reads are likely revolutionizing genome science in virtually all taxonomic groups, with untapped genomic potential existing in public repositories across the Tree of Life. The rise of long-read assemblies will, in particular, spur new understanding of aspects of the genome that were previously difficult to characterize (e.g., genome structure, highly repetitive regions). By continuing to build, curate, and make genomic resources publicly available, we will gain tremendous insight into genome biology and evolution at broad phylogenetic scales. We will also create a more inclusive and equitable discipline by expanding access to resources for scientists whose participation has historically been limited by financial or technological barriers.
Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.
Acknowledgments
S.H. and J.L.K. were supported by an NSF award (OPP-1906015). J.H. and S.U.P. were supported by the LOEWE-Centre for Translational Biodiversity Genomics, which was funded by the Hessen State Ministry of Higher Education, Research and the Arts. J.S.S. was supported by an NSF Postdoctoral Research Fellowship in Biology (DBI-1811930) and an NIH General Medical Sciences Award (R35GM119515) to A.M.L.
Data Availability
The data underlying this article are available in the supplementary material, primarily in supplementary table S1, with associated scripts for analysis on GitHub: https://github.com/pbfrandsen/insect_genome_assemblies.
Literature Cited
- Adams MD, et al. 2000. The genome sequence of Drosophila melanogaster. Science 287:2185–2195.
- Amarasinghe SL, et al. 2020. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21(1):30.
- Bellinger PF, Christiansen KA, Janssens F. 2020. Checklist of the Collembola of the world. Available from: http://www.collembola.org.
- Collins FS, Morgan M, Patrinos A. 2003. The Human Genome Project: lessons from large-scale biology. Science 300(5617):286–290.
- Consortium AgG. 2017. Genetic diversity of the African malaria vector Anopheles gambiae. Nature 552:96.
- Gilbert C, Peccoud J, Cordaux R. 2021. Transposable elements and the evolution of insects. Annu Rev Entomol. 66(1):355–372.
- Hotaling S, Kelley JL, Frandsen PB. 2020. Aquatic insects are dramatically underrepresented in genomic research. Insects 11(9):601.
- Hug LA, et al. 2016. A new view of the tree of life. Nat Microbiol. 1(5):1–6.
- i5K Consortium. 2013. The i5K Initiative: advancing arthropod genomics for knowledge, human health, agriculture, and the environment. J Hered. 104:595–600.
- Kelley JL, et al. 2014. Compact genome of the Antarctic midge is likely an adaptation to an extreme environment. Nat Commun. 5:4611.
- Kriventseva EV, et al. 2019. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res. 47(D1):D807–D811.
- Lewin HA, et al. 2018. Earth BioGenome Project: sequencing life for the future of life. Proc Natl Acad Sci U S A. 115(17):4325–4333.
- Li F, et al. 2019. Insect genomes: progress and challenges. Insect Mol Biol. 28(6):739–758.
- McGee MD, et al. 2020. The ecological and genomic basis of explosive adaptive radiation. Nature 586(7827):75–79.
- McKenna DD, et al. 2019. The evolution and genomic basis of beetle diversity. Proc Natl Acad Sci U S A. 116(49):24729–24737.
- Misof B, et al. 2014. Phylogenomics resolves the timing and pattern of insect evolution. Science 346(6210):763–767.
- Petersen M, et al. 2019. Diversity and evolution of the transposable element repertoire in arthropods with particular reference to insects. BMC Evol Biol. 19(1):1–15.
- Rhie A, et al. 2021. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592(7856):737–746.
- Robinson GE, et al. 2011. Creating a buzz about insect genomes. Science 331(6023):1386.
- Sayers EW, et al. 2021. GenBank. Nucleic Acids Res. 48:D84–D86.
- Seehausen O, et al. 2014. Genomics and the origin of species. Nat Rev Genet. 15(3):176–192.
- Seppey M, Manni M, Zdobnov EM. 2019. BUSCO: assessing genome assembly and annotation completeness. In: Gene prediction: methods in molecular biology. New York (NY): Humana. p. 227–245.
- Stork NE. 2018. How many species of insects and other terrestrial arthropods are there on Earth? Annu Rev Entomol. 63:31–45.
- Thibaud-Nissen F, et al. 2016. The NCBI eukaryotic genome annotation pipeline. J Anim Sci. 94(Suppl 4):184–184.
- Thomas GW, et al. 2020. Gene content evolution in the arthropods. Genome Biol. 21(1):1–14.
- Wang X, et al. 2014. The locust genome provides insight into swarm formation and long-distance flight. Nat Commun. 5(1):1–9.
- Zhang Z-Q. 2011. Animal biodiversity: an outline of higher-level classification and survey of taxonomic richness. Waco (TX): Magnolia Press.