Proceedings of the National Academy of Sciences of the United States of America. 2023 Apr 24;120(18):e2221528120. doi: 10.1073/pnas.2221528120

Allelic resolution of insect and spider silk genes reveals hidden genetic diversity

Paul B Frandsen a,b,1, Scott Hotaling c, Ashlyn Powell a, Jacqueline Heckenhauer b,d, Akito Y Kawahara e, Richard H Baker f, Cheryl Y Hayashi f, Blanca Ríos-Touma g, Ralph Holzenthal h, Steffen U Pauls b,d,i, Russell J Stewart j
PMCID: PMC10161007  PMID: 37094147

Abstract

Arthropod silk is vital to the evolutionary success of hundreds of thousands of species. The primary proteins in silks are often encoded by long, repetitive gene sequences. Until recently, sequencing and assembling these complex gene sequences proved intractable because of their repetitive structure. Here, using high-quality long-read sequencing, we show that there is extensive variation, in both length and repeat motif order, between alleles of silk genes within individual arthropods. Further, this variation exists across two deep, independent origins of silk in lineages that diverged more than 500 Mya: the insect clade containing caddisflies and butterflies, and spiders. This remarkable convergence in previously overlooked patterns of allelic variation across multiple origins of silk suggests common mechanisms for the generation and maintenance of structural protein-coding genes. Future genomic efforts to connect genotypes to phenotypes should account for such allelic variation.

Keywords: genomics, silk, insects, long-read sequencing, alleles


Silk is fundamental to the life histories of hundreds of thousands of arthropods (1). The genes that encode silk proteins are often long and repetitive, and their variation is directly tied to silk phenotypes. Despite major headway in resolving difficult-to-assemble regions in de novo genome assembly [e.g., the human telomere-to-telomere consortium (2)], assemblies of biodiverse organisms vary widely in quality and contiguity (3, 4), and long repetitive regions pose a particularly difficult challenge (5). Arthropod silk genes are difficult to assemble due to their repetitive internal region, which forms the semicrystalline protein structure underlying the unique properties of silk (1). In the sister orders Lepidoptera (butterflies and moths) and Trichoptera (caddisflies), the gene that encodes the primary protein component of silk is heavy chain fibroin (H-fibroin), which originated in their common ancestor. However, structural silk protein-coding genes have independently evolved multiple times across the arthropod tree of life. For example, in spiders, there are multiple repetitive silk genes that encode a suite of proteins collectively known as spidroins (6).

The first full-length H-fibroin and spidroin genes were published in 2000 and 2007, respectively, through sequencing of bacterial artificial chromosome and fosmid libraries (7, 8). Subsequent attempts using high-throughput sequencing to resolve full-length H-fibroin and spidroin sequences were unsuccessful due to the difficulty in resolving long repetitive regions with short-read sequencing (9). This lack of full-length silk gene sequences has hindered the analysis of the variation present in these large (>20 kbp), repetitive proteins, leaving a significant gap in our understanding of their evolution and structure. Recent advances in long-read sequencing enabled recovery of these regions (10–13). However, only single consensus sequences were recovered, leaving allelic variation hidden. New sequencing technologies, e.g., PacBio HiFi, generate accurate long reads that can be assembled into haploid-resolved genomes, even for regions of the genome that were previously intractable to assemble (2, 5). Here we present the first examination of allelic structure in insect and spider silk genes and show that there is substantial diversity within individuals, demonstrating a wealth of genomic variation that was previously overlooked. Ultimately, characterizing and understanding this allelic variation is essential to uncovering the molecular mechanisms that shaped these highly modular structural proteins that are central to the success of hundreds of thousands of animals.

Results and Discussion

We obtained high-quality reference genomes from single individuals of a butterfly (Vanessa cardui), three caddisfly species across a diversity of silk use (Hesperophylax magnus, a case-making caddisfly; Atopsyche davidsoni, a cocoon-maker; and Arctopsyche grandis, a retreat-maker), and a spider (Argiope argentata). Two of these were newly generated and the primary assemblies of the others were previously published (Table 1).

Table 1.

Comparison of genome and gene sequences from organisms used in this study

| Order | Species | Contig N50 (Mbp) | BUSCO % complete | Gene | No. of amino acids, allele 1 | No. of amino acids, allele 2 | % complete repeat indels (CRI) | Study |
| Trichoptera | Arctopsyche grandis | 9.4 | 98.9 | H-fibroin | 6,375 | 5,696 | 97.6 | Present study |
| Trichoptera | Atopsyche davidsoni | 14.1 | 98.8 | H-fibroin | 7,878 | 6,992 | 94.1 | Ríos-Touma et al. 2021 |
| Trichoptera | Hesperophylax magnus | 11.2 | 95.6 | H-fibroin | 8,624 | 6,728 | 91.7 | Hotaling et al. 2022 |
| Lepidoptera | Vanessa cardui | 7.0 | 98.8 | H-fibroin | 5,675 | 4,326 | 95.3 | Lohse et al. 2021 |
| Araneae | Argiope argentata | 32.3 | 98.4 | MaSp2 | 4,534 | 4,183 | 94.7 | Present study |
| Araneae | Argiope argentata | 32.3 | 98.4 | AgSp2 | 5,465 | 5,313 | 69.1 | Present study |

BUSCO scores reflect the insecta_odb10 dataset for the Trichoptera and Lepidoptera and the arachnida_odb10 dataset for the spider. Percent of complete repeat indels (%CRI) indicates the percentage of total allelic variation that results from complete repeat indels between alleles.

We recovered full-length, fully resolved sequences for both alleles of H-fibroin across all three caddisflies, unveiling substantial, and previously hidden, heterozygosity within each individual (Fig. 1). The variation between alleles can largely be ascribed to indels resulting in allele sequences with considerable differences in length (Table 1). Because the origin of H-fibroin can be traced back to the common, silk-spinning ancestor of Trichoptera and Lepidoptera more than 290 million years ago, we also investigated the H-fibroin sequence in the butterfly V. cardui to determine whether such patterns of heterozygosity were consistent across the evolutionary history of the gene. As with the caddisflies, we recovered two distinct H-fibroin alleles. Across all samples of Trichoptera and Lepidoptera, the structure of the H-fibroin gene was conspicuously conserved. For example, each H-fibroin sequence included conserved termini with a repetitive internal region consisting of modular repetitive units. Each gene was structured by a short initial exon followed by an intron and a long terminal exon that contained the entire repetitive region in a single open reading frame. Nevertheless, within each individual, allelic variation was dramatic, resulting from apparent deletions and insertions of repetitive modules (Fig. 1). However, despite the similarity in H-fibroin gene structure, the sequences and number of repeats in the internal regions varied widely across species and orders.
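To make the length differences between alleles concrete, the following minimal sketch recomputes them from the amino acid counts reported in Table 1. The function and variable names are illustrative assumptions and are not part of the authors' pipeline; only the allele lengths come from the table.

```python
# Allele lengths in amino acids, copied from Table 1.
ALLELE_LENGTHS = {
    ("Arctopsyche grandis", "H-fibroin"): (6375, 5696),
    ("Atopsyche davidsoni", "H-fibroin"): (7878, 6992),
    ("Hesperophylax magnus", "H-fibroin"): (8624, 6728),
    ("Vanessa cardui", "H-fibroin"): (5675, 4326),
    ("Argiope argentata", "MaSp2"): (4534, 4183),
    ("Argiope argentata", "AgSp2"): (5465, 5313),
}


def length_difference(allele_1, allele_2):
    """Return (absolute difference, difference as % of the longer allele)."""
    longer, shorter = max(allele_1, allele_2), min(allele_1, allele_2)
    diff = longer - shorter
    return diff, 100.0 * diff / longer


def summarize(lengths=ALLELE_LENGTHS):
    """Map each (species, gene) pair to its allele length difference."""
    return {key: length_difference(*pair) for key, pair in lengths.items()}


if __name__ == "__main__":
    for (species, gene), (diff, pct) in summarize().items():
        print(f"{species} {gene}: {diff} aa ({pct:.1f}% of longer allele)")
```

Under this calculation, the two H. magnus alleles differ by 1,896 residues, roughly 22% of the longer allele, whereas the AgSp2 alleles of A. argentata differ by only 152 residues.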

Fig. 1.

Allelic variation in silk protein sequences of caddisflies, a butterfly, and a spider. Illustrations depict variation in silk use across these distantly related organisms. Bars of the same color represent the same repetitive motif within an individual, but not among individuals, and blue ribbons represent aligned regions highlighting insertions and deletions between alleles. The height of the bars indicates proportional unit length of each repeat.

While the level of variation that we observed between alleles of H-fibroin within individual butterflies and caddisflies was striking, it is perhaps unsurprising that a functionally important gene with a common evolutionary origin would share similar patterns of variation. To determine if our findings extended to silk with a different evolutionary origin, we analyzed two spidroin silk genes from the spider Argiope argentata: major ampullate spidroin-2 (MaSp2) and aggregate spidroin-2 (AgSp2). For both genes, we observed patterns of allelic variation that were strikingly convergent with H-fibroin, including modular heterozygosity between alleles (Fig. 1). Indeed, across all samples, “complete repeat indels,” defined as indels that encompassed one or more repeat unit(s), accounted for the majority of the variation between alleles (Table 1).
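The %CRI statistic in Table 1 can be illustrated with a small sketch. The exact procedure used in the study is described in the SI Appendix and is not reproduced here; this version assumes that each indel between two aligned alleles has already been recorded with its length in residues and a flag marking whether it spans one or more complete repeat units, and that total variation is counted as residues in indels plus substituted residues. The `Indel` class, `percent_cri` function, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indel:
    length: int            # residues inserted or deleted between the alleles
    complete_repeat: bool  # True if the indel spans >= 1 whole repeat unit


def percent_cri(indels, substitutions=0):
    """Percent of variable residues that fall in complete repeat indels.

    Assumption: total variation = residues in all indels + substituted residues.
    """
    cri = sum(i.length for i in indels if i.complete_repeat)
    total = sum(i.length for i in indels) + substitutions
    if total == 0:
        raise ValueError("no variation between alleles")
    return 100.0 * cri / total


if __name__ == "__main__":
    # Made-up example: two whole-repeat indels, one partial indel, five substitutions.
    example = [Indel(120, True), Indel(60, True), Indel(15, False)]
    print(percent_cri(example, substitutions=5))
```

In the made-up example, 180 of 200 variable residues lie in complete repeat indels, giving 90%.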

In this study, we uncovered remarkable convergence in the structure and variation between alleles of silk protein-coding genes across deeply divergent arthropod orders, including two independent origins of silk. Historically, this allelic variation has been overlooked because short-read sequences are not well-suited to resolving long, highly repetitive regions. It is clear from these comparisons that common methods for assessing genomic variation, such as the analysis of single nucleotide variants, cannot fully capture the extent of variation in these repetitive regions. The patterns we observed lend insight into shared mechanisms driving silk gene evolution across two divergent groups of arthropods (Fig. 1). For example, the prevalence of large indels between alleles is consistent with unequal crossing over, which could play a role in generating allelic variation. As such, selection may tolerate the presence of diverse alleles that differ in the organization of repeat units while maintaining nearly identical sequence within each repeat. Furthermore, since sequence length is linked to the properties of silk fibers (12), variation among alleles may be critical to the function of silk fibers and represent the product of selection. However, fully uncovering the evolutionary mechanisms underlying variation in silk genes will require population-level sampling. Ultimately, to effectively connect genotype to phenotype, we must account for the full suite of allelic variation that exists within structural protein-coding genes. Our findings echo a broader trend: advances in long-read sequencing are leading to the discovery of substantial, and previously unobserved, genomic variation across the tree of life (14).
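As a toy illustration of why unequal crossing over would produce the pattern described above (alleles that differ by whole repeat units while each unit keeps its sequence), the sketch below models an allele as a list of repeat-unit labels and recombines two alleles at possibly different unit boundaries. This is a conceptual model under simplifying assumptions, not an analysis from the study; the function name and labels are hypothetical.

```python
def unequal_crossover(allele_a, allele_b, break_a, break_b):
    """Recombine two alleles (lists of repeat-unit labels) at unit boundaries.

    break_a and break_b are boundary indices. When they differ, the exchange
    is 'unequal' and the products gain or lose whole repeat units, while the
    units themselves are left unchanged.
    """
    if not (0 <= break_a <= len(allele_a) and 0 <= break_b <= len(allele_b)):
        raise ValueError("breakpoints must lie on repeat-unit boundaries")
    product_1 = allele_a[:break_a] + allele_b[break_b:]
    product_2 = allele_b[:break_b] + allele_a[break_a:]
    return product_1, product_2


if __name__ == "__main__":
    # N and C stand for the conserved termini; A and B are repeat modules.
    parent = ["N", "A", "B", "A", "B", "A", "B", "C"]
    longer, shorter = unequal_crossover(parent, parent, 5, 3)
    print(len(longer), "".join(longer))
    print(len(shorter), "".join(shorter))
```

Starting from two identical 8-unit alleles, a misaligned exchange yields one 10-unit and one 6-unit product. Both retain the conserved termini and intact repeat modules, so the difference between them is a complete repeat indel.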

Materials and Methods

We extracted and sequenced genomic DNA from four of the five species, two of which were available previously, and retrieved the other species from publicly available data (Table 1). For each species, we generated de novo assemblies using Hifiasm v.0.13-r307 (15) and used a custom pipeline to retrieve, annotate, and align the H-fibroin and spidroin sequences (SI Appendix). We then used a custom script to visualize and compare the alleles of fibroin and spidroin genes within each species (16).
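Hifiasm writes its assemblies in GFA format, so a first step before searching contigs for silk genes is typically to convert the segment records to FASTA. The sketch below shows one way to do that. It is not the authors' custom pipeline (see SI Appendix); the function name is hypothetical, and no particular hifiasm output file name is assumed.

```python
def gfa_segments_to_fasta(gfa_lines, width=60):
    """Yield FASTA-formatted lines for each segment (S) record in a GFA file.

    GFA segment records are tab-separated: 'S', segment name, sequence,
    followed by optional tags. Records without a stored sequence ('*') are skipped.
    """
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, sequence = fields[1], fields[2]
        if sequence == "*":
            continue
        yield f">{name}"
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]


if __name__ == "__main__":
    # Tiny in-memory example; in practice, pass an open file handle instead.
    demo = ["H\tVN:Z:1.0\n", "S\tctg1\tACGTAC\tLN:i:6\n"]
    print("\n".join(gfa_segments_to_fasta(demo, width=4)))
```

Because the function is a generator over lines, it can stream a large assembly file without loading it into memory.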

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We thank Christine Frandsen for help with Fig. 1. This project was funded by the United States National Science Foundation grant DEB-2217155, the Dirección General de Investigación, Universidad de Las Américas (Ecuador) AMB.BRT.19.02, and Deutsche Forschungsgemeinschaft 502865717.

Author contributions

P.B.F., S.H., A.Y.K., R.H.B., C.Y.H., S.U.P., and R.J.S. designed research; P.B.F., A.P., J.H., R.H.B., C.Y.H., B.R.-T., and R.H. performed research; P.B.F., R.H.B., C.Y.H., and B.R.-T. contributed new reagents/analytic tools; P.B.F., A.P., J.H., R.H.B., and C.Y.H. analyzed data; and P.B.F., S.H., R.H.B., C.Y.H., and R.J.S. wrote the paper.

Competing interests

The authors declare no competing interest.

Data, Materials, and Software Availability

Gene sequences have been deposited in GenBank with the following accession numbers: OQ787675, OQ787676, OQ787677, OQ787678, OQ787679, OQ787680, BK063240, BK063241, OQ291291, OQ291292, OQ291293, and OQ291294. Script data has been deposited in Zenodo (https://doi.org/10.5281/zenodo.7783469).

References

  • 1. Sutherland T. D., Young J. H., Weisman S., Hayashi C. Y., Merritt D. J., Insect silk: One name, many materials. Annu. Rev. Entomol. 55, 171–188 (2010).
  • 2. Nurk S., et al., The complete sequence of a human genome. Science 376, 44–53 (2022).
  • 3. Hotaling S., Kelley J. L., Frandsen P. B., Toward a genome sequence for every animal: Where are we now? Proc. Natl. Acad. Sci. U.S.A. 118, e2109019118 (2021).
  • 4. Marks R. A., Hotaling S., Frandsen P. B., VanBuren R., Representation and participation across 20 years of plant genome sequencing. Nat. Plants 7, 1571–1578 (2021).
  • 5. Wagner J., et al., Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
  • 6. Gatesy J., Hayashi C., Motriuk D., Woods J., Lewis R., Extreme diversity, conservation, and convergence of spider silk fibroin sequences. Science 291, 2603–2605 (2001).
  • 7. Zhou C.-Z., et al., Fine organization of Bombyx mori fibroin heavy chain gene. Nucleic Acids Res. 28, 2413–2419 (2000).
  • 8. Ayoub N. A., Garb J. E., Tinghitella R. M., Collin M. A., Hayashi C. Y., Blueprint for a high-performance biomaterial: Full-length spider dragline silk genes. PLoS One 2, e514 (2007).
  • 9. Yonemura N., Mita K., Tamura T., Sehnal F., Conservation of silk genes in Trichoptera and Lepidoptera. J. Mol. Evol. 68, 641–653 (2009).
  • 10. Frandsen P. B., et al., Exploring the underwater silken architectures of caddisworms: Comparative silkomics across two caddisfly suborders. Philos. Trans. R. Soc. B. Biol. Sci. 374, 20190206 (2019).
  • 11. Kawahara A. Y., et al., Long-read HiFi sequencing correctly assembles repetitive heavy fibroin silk genes in new moth and caddisfly genomes. Gigabyte 2022, 1–14 (2022).
  • 12. Arakawa K., et al., 1000 spider silkomes: Linking sequences to silk physical properties. Sci. Adv. 8, eabo6043 (2022).
  • 13. Baker R. H., Corvelo A., Hayashi C. Y., Rapid molecular diversification and homogenization of clustered major ampullate silk genes in Argiope garden spiders. PLoS Genet. 18, e1010537 (2022).
  • 14. Mukamel R. E., et al., Protein-coding repeat polymorphisms strongly shape diverse human phenotypes. Science 373, 1499–1505 (2021).
  • 15. Cheng H., Concepcion G. T., Feng X., Zhang H., Li H., Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
  • 16. Powell A., h-fibroin-visual, GitHub, 10.5281/zenodo.7783469, 16 December 2022.

