BMC Genomics. 2026 Feb 27;27:346. doi: 10.1186/s12864-026-12659-1

Gene retroposition and functional diversification of retrocopies in the Rhus gall aphid Schlechtendalia chinensis

Hira Tazeen 1, Aftab Ahmad 2, Syed Sibt E Hassan 1, Zhumei Ren 1
PMCID: PMC13049721  PMID: 41761077

Abstract

Gene duplication via retroposition is a crucial mechanism for gene family expansion, and the gene copies produced are referred to as retrocopies. In this study, we assessed gene retroposition in the Rhus gall aphid, Schlechtendalia chinensis, which induces galls on its primary host plant, Rhus chinensis. We identified a total of 491 retrocopies and classified them into four categories based on their structures: 62 putative retrogenes (12.6%), 163 chimeric genes (33.2%), 170 pseudogenes (34.6%), and 96 intact retrocopies (19.6%). Of the putative retrogenes, 37 could be assigned to 23 gene families; among these, four belong to the heat shock protein superfamily and account for 15.4% of all heat shock protein genes in the S. chinensis genome. Of the chimeric genes, 90 were classified into 42 distinct protein families; these genes acquired conserved domains from their parent genes, which may enable the emergence of new gene functions and physiological adaptations. Most chimeric genes in S. chinensis are under purifying selection and a few are under positive selection, while a large number of putative retrogenes are under purifying selection. Ks values indicate that 33.74% of chimeric genes and 27.42% of putative retrogenes show ancient evolutionary divergence, whereas 19.35% of putative retrogenes appear relatively young and have diverged recently. These genetic variations contribute to the evolution of retrocopies in the S. chinensis genome.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12864-026-12659-1.

Keywords: Genome, Retrocopy, Retroposition, Schlechtendalia chinensis

Introduction

Gene duplication is an important evolutionary process that plays a significant role in gene family expansion and genome evolution [1]. Gene duplication can occur through various mechanisms, which may result in the formation of two or more identical gene copies within a genome [2]. One such process is gene retroposition, which uses the enzymatic machinery of retrotransposons (RTs) to convert mRNAs into cDNAs and insert them back into new locations within the genome to generate copies of the gene, referred to as retrocopies [3, 4].

Retrocopies are DNA sequences copied from mRNA and therefore usually lack introns and regulatory regions. These missing parts are a key hallmark of retrocopies [5]. A retrocopy can evolve new regulatory regions or acquire them from nearby genes; once functional, it is known as a retrogene. A retrocopy that is inserted into another gene or fused with the exon of another gene is referred to as a chimera [5, 6]. Retrocopies that cannot be activated because they lack regulatory regions are called processed pseudogenes [7–11]. Retrocopies, especially retrogenes and chimeric genes, significantly influence genome evolution and the emergence of new gene structures and physical characteristics in eukaryotic organisms [5, 12–14].

Transposable elements (TEs) are the main cause of gene duplication via retroposition and the formation of retrocopies. RTs such as LINE L1 and long terminal repeat retrotransposons (LTR-RTs) provide the enzymatic machinery that can capture gene mRNAs and reverse transcribe them to produce gene retrocopies [5]. The newly produced retrocopy is inserted into a new position in the genome and, compared to the parent gene, often lacks introns and cis-regulatory regions [15]. In short, transposable elements play a vital role in gene duplication and the creation of retrocopies, contributing to genomic diversity.

The expansion of gene families is a powerful driver of evolutionary innovation, allowing organisms to adapt to their environments in remarkable ways [16, 17]. Through gene duplication events, gene families grow in size, leading to increased genetic diversity and functional specialization [18]. This process has been observed in various insect species, where the expansion of gene families is connected with changes related to specific environments and ecological niches [16].

Retrogenes have been called evolution’s seeds because of their substantial contributions to molecular evolution [19]. Retrogenes have been shown to play a significant role in transcriptome and proteome diversification, as well as being accountable for a variety of traits specific to different species [20]. The Jingwei chimeric gene is the first retrogene reported in Drosophila melanogaster [21]. Comprehensive investigations of retrogene evolution in mosquitoes [22] and silkworms [23, 24] have revealed large numbers of retrogenes. However, retrocopy dynamics in the majority of non-model insect taxa remain largely unknown.

Some retroposed genes produce functional proteins, contributing to novel gene functions and adaptive traits in various organisms [10, 25–28]. Retrogenes have been associated with brain development, courting behavior, distinct hormone-pheromone metabolism phenotypes, and antiviral defenses [29–32]. A retrogene may function similarly to the parental gene or may evolve new functions [33].

Retrocopies have been reported in various insects, including aphids and other members of the order Hemiptera, where they contribute to genome evolution and functional diversification. Within Hemiptera, species such as Adelges cooleyi, Aphis gossypii, Acyrthosiphon pisum, Aphis craccivora, Daktulosphaira vitifoliae, Diaphorina citri, Eriosoma lanigerum, Halyomorpha halys, Homalodisca vitripennis, Hormaphis cornu, Melanaphis sacchari, Myzus persicae, Nilaparvata lugens, and Pseudococcus longispinus exhibit wide variation in the number of retrocopies. Among these species, Eriosoma lanigerum shows the fewest retrocopies (84), while Aphis craccivora possesses the most (1211), indicating that retroposition has been an active and dynamic evolutionary process shaping aphid genomes [34]. However, there are currently no reports on retrocopies in the Rhus gall aphid Schlechtendalia chinensis. It is a common Rhus gall aphid that feeds on its host plant, Rhus chinensis, and induces galls, which are used as raw materials in fields such as medicine, food, industry, and dyeing [35].

Schlechtendalia chinensis has become a well-established model organism for industrial and applied research on insect gall formation. It has been reported that transposable elements (TEs) occupy approximately 26.04% of its whole genome. Among the 308,196 TEs, 96,053 are retrotransposons, which span 33.04 Mb and account for 9.59% of the genome. Notably, 2,777 of these retrotransposons are either complete or nearly complete, contributing 3.12% of the total genome. RTs were classified into two major orders (LTR-RTs and non-LTR RTs). LTR-RTs covered 15.43 Mb, accounting for 4.48%, and non-LTR RTs covered 16.37 Mb, comprising 5.11% of the S. chinensis whole genome. The LINE superfamily is the most abundant group of non-LTR RTs, containing 46,644 elements and covering 16.34 Mb (4.74%) of the entire genome [36].

Rhus gall aphids have a distinct life cycle and are commercially advantageous, yet there is a significant gap in genomic knowledge about this aphid species. This study examined the retrocopies’ dynamics in the genome of the Rhus gall aphid Schlechtendalia chinensis. The investigation of retroposition in S. chinensis helps us to understand how this process contributes to genome evolution, gene family expansion, new gene functions, and adaptation in this species. We observed the expansion of protein superfamilies in S. chinensis due to retroposition and characterized the chimeric gene structures.

Results

We investigated a high-quality 344.59 Mb genome assembly of S. chinensis, in which 91.71% of the sequence (315.55 Mb) is anchored to 13 chromosomes and BUSCO completeness is 94.34%. We identified retrocopies in the S. chinensis genome using bioinformatic tools designed for scanning, annotating, and displaying retrocopies, and then manually validated and curated the results.

Identification of retrocopies in Schlechtendalia chinensis

We identified a total of 491 retrocopies in the genome of S. chinensis using RetroScan v1.0 [37]. Retrocopies were detected by the loss of two or more introns relative to the parent gene (see Materials and methods for details). These retrocopies were further classified into four categories based on their structures. Retrocopies that maintain intact open reading frames (ORFs) and contain regulatory regions such as promoters were designated putative retrogenes, rather than confirmed retrogenes, because RNA sequence data were not available to verify their expression (File S1).

A total of 62 (12.6%) retrocopies were classified as putative retrogenes. In addition, 163 (33.2%) were found to be chimeric genes, representing retrocopies that have fused with the coding region of another gene. A large proportion, 170 (34.6%), were categorized as pseudogenes due to the loss of their functional ORF, rendering them non-functional. A total of 96 (19.6%) retrocopies were identified as intact retrocopies, which contain intact ORFs but lack promoter regions, so their functional status remains unclear (Fig. 1A). A Chi-square goodness-of-fit test was performed to assess the distribution of retrocopy types across the four categories, with the null hypothesis that the retrocopy types are equally distributed across categories. The observed distribution deviated significantly from the expected equal distribution (χ² = 67.28, df = 3, p < 0.0001), indicating that the retrocopies are not equally distributed across the categories. This uneven distribution may be influenced by factors such as selective pressure and insertion sites.
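The goodness-of-fit statistic can be recomputed directly from the four reported counts. The following is a minimal sketch in plain Python, not the authors' R code; the function names are illustrative, and the p-value uses the closed-form survival function of the chi-square distribution with 3 degrees of freedom.

```python
import math

# Observed retrocopy counts per category, as reported in the text.
observed = {
    "putative retrogenes": 62,
    "chimeric genes": 163,
    "pseudogenes": 170,
    "intact retrocopies": 96,
}


def chi_square_equal(counts):
    """Chi-square goodness-of-fit statistic against equal expected counts."""
    expected = sum(counts) / len(counts)
    stat = sum((obs - expected) ** 2 / expected for obs in counts)
    return stat, len(counts) - 1


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


chi2_stat, df = chi_square_equal(list(observed.values()))
p_value = chi2_sf_df3(chi2_stat)
print(f"chi2 = {chi2_stat:.2f}, df = {df}")  # chi2 = 67.28, df = 3
print(f"p = {p_value:.2e}")
```

With 491 retrocopies and four categories, the expected count under the null hypothesis is 122.75 per category, which reproduces the reported χ² of 67.28.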

Fig. 1.

Identification of retrocopies in the gall aphid Schlechtendalia chinensis genome. A Percentage of putative retrogenes, chimeric genes, pseudogenes, and intact retrocopies. B Percentage of the parent gene covered by each retrocopy. C Percentage identity of retrocopies with their parent genes

Further analysis was conducted to evaluate the extent to which these retrocopies cover their corresponding parental genes (Table S1). A substantial portion of the retrocopies (55.2%) covers at least 90% of the sequence of the parental gene, indicating high structural conservation, and another 26.1% of the retrocopies cover 60% of their parental gene. Only a small fraction (4.1%) covers 50% of the parent gene sequence. This distribution of sequence coverage suggests a diverse range of conservation levels, with the majority of retrocopies closely resembling their parent gene structure (Fig. 1B).

In addition to coverage, the sequence identity between the retrocopies and their parental genes was assessed. A large proportion (45%) of the retrocopies exhibit 90% sequence identity with their parent gene, indicating a high degree of conservation at the nucleotide level. Following this, 20.2% of the retrocopies show 80% identity, while lower proportions, 8.1% and 7.5% exhibit 70% and 60% identity, respectively. Interestingly, 19.1% of the retrocopies display 50% or lower sequence identity with their parent genes, suggesting more significant divergence. These results reflect varying degrees of genetic conservation and may point to different evolutionary pressures or functional adaptations among the retrocopies (Fig. 1C).

Retroposition has produced 491 retrocopies in S. chinensis, and the distribution of retrocopies among parent genes shows retroposition activity across the genome. A total of 68 parent genes produced a single retrocopy each, and 48 parent genes produced between 2 and 19 retrocopies (Fig. 2). These 116 genes represent the majority of parent genes involved in retrocopy formation and contribute to small-scale duplications. In addition, five parent genes (IDs: Schi01G045660.1, Schi01G047610.1, Schi01G051990.1, Schi01G041920.1, and Schi01G043000.1) produced a large number of retrocopies: 29, 31, 32, 35, and 39, respectively.

Fig. 2.

Distribution of retrocopies in Schlechtendalia chinensis owned by each parent gene. The circle plot represents the TEs density in the 13 chromosomes of S. chinensis genome. Bar chart categorizes parent genes based on the number of retrocopies

Our analysis identified several parent genes that gave rise to a notably high number of retrocopies (29–39) (File S2). Notably, these parent genes, including craniofacial development protein (CFDP2) (ID: Schi01G051700.1), uncharacterized protein genes (Schi01G043000.1), (Schi01G051990.1), (Schi01G047610.1), (Schi01G045660.1), (Schi01G044030.1), and proteins designated as mobile element-derived, have a suspected origin from transposable elements (TEs). CFDP2, for example, is a documented gene that recruited the APE-like domain from a retrotransposon LINE-RTE [38].

A Poisson random-generation model was evaluated with a mean of λ = 3.99 retrocopies per parental gene. The observed retrocopy distribution deviated significantly from the Poisson expectation [χ² (16) ≈ 3.22 × 10²², p < 0.0001]. Poisson upper-tail probabilities with Benjamini–Hochberg FDR correction (q < 0.05) identified parental genes producing ≥ 10 retrocopies as statistically significant hotspots of retroposition; these genes may play substantial roles in gene family expansion.
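The hotspot test can be illustrated with a short sketch that computes the Poisson upper-tail probability P(X ≥ k) for each parental gene and then applies the Benjamini–Hochberg adjustment. Only λ = 3.99 and the q < 0.05 threshold come from the text; the gene names and the four-gene count table are hypothetical, and the q-values for the real data depend on the full set of parental genes.

```python
import math

LAMBDA = 3.99  # mean retrocopies per parental gene, as reported in the text


def poisson_upper_tail(k, lam, n_terms=500):
    """P(X >= k) for X ~ Poisson(lam), summed directly to avoid 1 - CDF cancellation."""
    total = 0.0
    for i in range(k, k + n_terms):
        total += math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
    return min(total, 1.0)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical retrocopy counts per parental gene (illustrative only).
counts = {"gene_a": 1, "gene_b": 4, "gene_c": 10, "gene_d": 39}

names = list(counts)
pvals = [poisson_upper_tail(counts[g], LAMBDA) for g in names]
qvals = benjamini_hochberg(pvals)
hotspots = [g for g, q in zip(names, qvals) if q < 0.05]
print(hotspots)
```

Summing the tail directly, rather than computing one minus the cumulative probability, keeps very small p-values (such as that for 39 retrocopies) from being lost to floating-point cancellation.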

Evolution of putative retrogenes and their role in protein family expansion

Retrocopies having intact ORFs and regulatory regions were grouped as putative retrogenes. All 62 identified putative retrogenes were annotated using the NCBI RefSeq and UniprotKB/Swiss-Prot databases. Of these, 25 putative retrogenes matched database entries labeled as ‘uncharacterized’ or ‘unknown function’, and were therefore excluded from protein-family-level analysis. The remaining 37 putative retrogenes were categorized into 23 gene families, with the heat shock protein superfamily accounting for the largest number (four putative retrogenes) (Fig. 3A, File S3). The uncharacterized putative retrogenes are listed in the supplementary table (Table S2). Retroposition thus contributed to the emergence of both characterized and uncharacterized putative retrogenes in S. chinensis.

Fig. 3.

Annotation of putative retrogenes and chimeric genes into protein families. Bar graphs show (A) the 23 protein families in putative retrogenes and (B) the 42 protein families in chimeric genes

We next examined how putative retrogenes contributed to protein family expansion in S. chinensis. In the S. chinensis genome, 26 heat shock protein (HSP) genes were identified based on nr annotation (Table S3). Putative retrogenes accounted for 15.4% of these HSP genes (4 out of 26), indicating that retroposition has contributed to the diversification of the heat shock protein family.

The heat shock protein gene (ID: Schi06G007100) in S. chinensis contains six coding sequences (CDS1–CDS6) (Fig. 4). Comparative structural analysis revealed four candidate putative retrogenes, each derived from this parent gene. All four putative retrogenes retain the parental HSP70 CDS structure but lack introns, consistent with retroposition-derived gene structures. The mRNA lengths of the candidate putative retrogenes range from 1,938 to 2,330 bp, reflecting lineage-specific sequence variation after retroposition. The loss of intronic regions, together with the retention of all six CDS, supports their origin through reverse transcription of the parental HSP70 gene. These putative retrogenes may serve distinct roles, potentially contributing uniquely to the overall heat shock protein function.

Fig. 4.

Structural comparison of the S. chinensis parent gene (Schi06G007100) heat shock protein 70 B2-like and its candidate putative retrogenes. Four putative retrogenes exhibit the parent gene’s six CDS regions, indicating retroposition

Formation of chimeric genes due to retroposition in S. chinensis

Chimeric genes are formed by retroposition when a retrocopy integrates into a new genomic location and fuses with the existing coding sequences of another gene. This fusion may result in the new gene acquiring a conserved domain or functional innovation [39]. We identified 163 chimeric genes in S. chinensis using RetroScan with the following parameters: at least 50% similarity to the parent genes, at least 50% coverage of the gene length, and small gaps allowed. To avoid duplicates, we kept only high-quality matches (80% similarity and coverage).

Chimeric genes were classified into 42 distinct protein families, while 73 chimeric genes matched uncharacterized database entries and were therefore excluded from protein-family-level analysis (Fig. 3B, File S4, Table S2). In the S. chinensis genome, chimeric genes accounted for 41.17% of GPI-anchored adhesin-like protein PGA55 isoform X1, 20.5% of putative nuclease HARBI1, and 10.4% of zinc finger proteins within their respective protein categories. The conserved domains in chimeric genes are primarily inherited from parental genes but may also be acquired through mechanisms such as gene fusion, recombination, and horizontal gene transfer [40].

Conserved domains are structural elements of proteins that are frequently required for protein function and are evolutionarily conserved across species. In S. chinensis, we highlight 137 chimeric genes with conserved domains that were produced by retroposition. The parent gene is transcribed into precursor mRNA (pre-mRNA), which is spliced into mature mRNA. The mature mRNA is reverse transcribed into complementary DNA (cDNA) by a reverse transcriptase and inserted into a new location within the host genome. The resulting gene copies can then fuse with nearby genes to produce chimeric genes (Fig. 5A). Chimeric genes acquire domains from both of the genes involved in their formation (File S5, File S6). For example, the gene C411Schi05G009250 was formed by the fusion of a host gene and an inserted retrocopy; it encodes a cyclophilin-10-like protein that carries both a cyclophilin superfamily domain and a RING U-box superfamily domain. The presence of both a chaperone-like cyclophilin domain and a U-box domain suggests that this chimeric gene may be involved in protein regulation and gene expression (Fig. 5B).

Fig. 5.

Formation of a chimeric gene due to retroposition. A In the parent gene, the black line represents gene length, and the blue and pink boxes represent exons (E) and introns (I). The mRNA of the parent gene is reverse transcribed and inserted into a new location within the host genome to form a retrocopy, which fuses with another gene to form a chimeric gene. B Chimeric gene C411Schi05G009250 retains the cyclophilin superfamily domain and the RING U-box superfamily domain from the two contributing genes

Selection pressure on putative retrogenes and chimeric genes

Ks values were used to estimate the divergence time of chimeric genes and putative retrogenes, and the Ka/Ks ratio was used to determine the type of selection pressure acting on these genes. Both were estimated with the Ka/Ks calculator bundled with RetroScan. Among putative retrogenes, 27.42% have Ks values greater than 1, with the highest observed Ks value being 4.1, and 19.35% have Ks values < 1. Putative retrogenes with low Ks values can be considered relatively young and have undergone recent evolutionary divergence (Fig. 6A). Among chimeric genes, 30.67% have relatively low Ks values (Ks < 1) and 33.74% have relatively high Ks values (Ks ≥ 1), indicating that retroposition events occurred across evolutionary time, from more recent to more ancient divergence (Fig. 6B). These results suggest that retroposition in S. chinensis involved genes with both lower (Ks < 1) and higher (Ks > 1) Ks values, reflecting retroposition across a range of evolutionary divergence times (File S7).

Fig. 6.

Graphs showing the Ks and Ka/Ks values of putative retrogenes and chimeric genes in Schlechtendalia chinensis. A Ks value of the putative retrogenes produced by retroposition. Most of the putative retrogenes have a Ks value > 1. B Ks value of the chimeric genes, 58.3% of these genes showing Ks values equal or greater than 1. C Distribution of Ka/Ks ratios of putative retrogenes. In putative retrogenes, purifying selection is taking place (Ka/Ks < 1). D Distribution of Ka/Ks ratios of chimeric genes, indicating that chimeric genes are under purifying selection

We calculated the Ka/Ks ratio for selected retrocopies generated by retroposition in S. chinensis (putative retrogenes and chimeric genes) to determine the selective pressure acting on them. Among putative retrogenes, 48.39% show Ka/Ks values smaller than 1, indicating that these genes are under purifying selection (Fig. 6C). Out of 163 chimeric genes, 11.4% show Ka/Ks values equal to or greater than 1, and 57.06% have Ka/Ks values below 1 (Fig. 6D), indicating that most of the chimeric genes in S. chinensis are under purifying selection.

Both chimeric genes and putative retrogenes displayed Ka/Ks ratios significantly lower than 1, consistent with purifying selection as expected in molecular evolution [41]. Chimeric genes exhibited a median Ka/Ks of 0.337, and the Wilcoxon signed-rank test confirmed that Ka/Ks values were significantly < 1 (V = 459, p = 9.0 × 10⁻¹⁴). Putative retrogenes showed a median Ka/Ks of 0.191 with a highly significant deviation from neutrality (V = 0, p = 9.1 × 10⁻⁷). These findings were further supported using log-transformed Ka/Ks values (chimeric: t = −11.72, df = 102, p = 6.61 × 10⁻²¹; retrogenes: t = −9.80, df = 29, p = 5.26 × 10⁻¹¹), suggesting that both categories of retrocopies evolve under purifying selection.
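The one-sided Wilcoxon signed-rank test against Ka/Ks = 1 can be sketched as follows. This is an illustrative plain-Python version, not the authors' R code: it assumes the normal approximation with continuity correction and no tie correction of the variance, and the 30 input ratios are synthetic placeholders. The sample size of 30 is inferred from the reported t-test (df = 29), which is itself an assumption.

```python
import math


def wilcoxon_less_than(values, mu=1.0):
    """One-sided Wilcoxon signed-rank test of H1: median < mu.

    Returns (V, p), where V is the sum of ranks of the positive differences
    and p comes from the normal approximation with continuity correction.
    """
    diffs = [v - mu for v in values if v != mu]  # zero differences are dropped
    n = len(diffs)
    # Assign average ranks to |diff|, handling ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    v_stat = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (v_stat - mean + 0.5) / sd
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # lower-tail normal probability
    return v_stat, p


# Synthetic Ka/Ks ratios, all below 1 (placeholders, not the study data).
ratios = [0.01 * (i + 1) for i in range(30)]
v_stat, p_value = wilcoxon_less_than(ratios)
print(f"V = {v_stat:.0f}, p = {p_value:.2e}")
```

When every ratio is below 1, no difference is positive and V = 0; under these assumptions, 30 such values give a p-value of about 9.1 × 10⁻⁷, the same order of magnitude as the value reported for putative retrogenes.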

In summary, our results show the contribution of retroposition to the creation of duplicate genes and expansion of the heat shock protein family. Putative retrogenes account for 15.4% of the heat shock protein genes present in the S. chinensis genome. Heat shock proteins may be involved in responding to stress and promoting adaptability, both of which are crucial for insect survival [42]. This interpretation is hypothetical without transcriptomic verification. Most of the chimeric genes in S. chinensis are under purifying selection, and few are under positive selection, while a large number of putative retrogenes are under purifying selection. The Ks values indicate that 33.74% of chimeric genes and 27.42% of putative retrogenes show ancient evolutionary divergence, while 19.35% of putative retrogenes can be considered relatively young and have undergone recent evolutionary divergence. The presence of young and old genes in S. chinensis indicates both ancient and recent retroposition activity. In chimeric genes, conserved domains potentially lead to new biological roles. The fusion of domains from different protein families may influence gene regulation, protein folding, enzymatic activity, and cellular interactions. These genetic variations contribute to evolutionary changes in the genome of S. chinensis.

Discussion

In this study, we performed a comprehensive analysis of retrocopies in the S. chinensis genome, shedding light on their evolutionary significance and functional potential. Retrocopies are genetic elements derived from reverse-transcribed mRNA and play diverse roles in genome evolution, including gene duplication, innovation, and regulation. We analyzed these elements and also explored their contribution to the expansion of protein families and the structural complexity of chimeric genes.

Previous studies provided evidence that there are 308,196 TEs in the S. chinensis genome. The TE content in S. chinensis is 26.04%, and the LINE superfamily, a widely distributed group of non-LTR RTs, occupies 4.74% of the whole genome [43]. The substantial proportion of the S. chinensis genome covered by TEs may facilitate the emergence of retrocopies by providing enzymatic machinery, specifically that of L1. In this study, 491 retrocopies were identified in the S. chinensis genome (DNA data). Retrocopies were classified into four categories: putative retrogenes (so called because RNA sequence data were unavailable to confirm expression), chimeric genes, pseudogenes, and intact retrocopies. Of the retrocopies, 12.6% were classified as putative retrogenes, 33.2% as chimeric genes, 34.6% as pseudogenes, and 19.6% as intact retrocopies. This distribution is broadly consistent with previous studies, such as the identification of 9,930 retrocopies in 40 insect species [34].

The exceptionally high number of retrocopies (29–39) identified for parent genes such as CFDP2 and other TE-derived or uncharacterized proteins is likely an analytical artifact of the tool. RetroScan identifies retrocopies based on sequence alignment and similarity to a reference parent gene. However, if the parent gene itself is of TE origin, the tool cannot distinguish between a true, single-locus retrocopy and the numerous, fragmented TE sequences that are already dispersed throughout the genome. Consequently, the “huge number of copies” reported for these specific genes probably reflects the tool correctly identifying similarity to a family of related TE sequences but misannotating them as retrocopies derived from the annotated parent gene, effectively counting the ancestral TE copies as its descendants.

We retained the original RetroScan output, including the high-copy-number loci derived from TE-related genes, as a conservative measure. These retrocopies (n = 39) were mostly classified as processed pseudogenes or intact copies lacking regulatory elements, and only a few were identified as putative retrogenes or functional chimeric genes. Therefore, most of them do not represent putative functional genes and do not contribute to evolutionary innovation, leaving our core analysis unaffected.

Schlechtendalia chinensis actively feeds on its primary host plant, Rhus chinensis, and may be affected by host defenses and by stress from host metabolites. In this study, we highlight the heat shock protein superfamily. Many biological processes in insects depend on heat shock proteins, which also maintain homeostasis and protect cells from stress-induced damage. These proteins are produced when insects are under stress, such as heat stress, cold stress, toxic substances, infection, and inflammation. They enhance adaptability and play a crucial role in insect survival [42, 44, 45]. In S. chinensis, the expansion of the HSP70/68-like putative retrogenes is similar to the expansion of stress-responsive genes in other insects [34]. In Drosophila, for example, HSP retrogenes enhance thermal adaptation [46].

Our analysis highlighted the prevalence of chimeric genes, particularly those formed when retrocopies insert into the non-coding region of other genes. These fusion events often yield novel or multifunctional proteins and are thought to be a major driver of protein diversity [47]. Many of the chimeric retrocopies in S. chinensis incorporate conserved protein domains such as zinc fingers and cyclophilins. The ability of conserved domains to be inherited, modified, and recombined enables the emergence of new gene functions and complex physiological adaptations [40].

Most chimeric genes in S. chinensis are under purifying selection, while a few are under positive selection. A large number of putative retrogenes are also under purifying selection. For chimeric genes, the presence of relatively high Ks values together with purifying selection signals suggests that these genes have been retained under long-term selective constraint and have been conserved over extended evolutionary timescales. In contrast, only 27.42% of putative retrogenes show the relatively high Ks values that indicate ancient divergence. The presence of young (Ks < 1) and old (Ks > 1) genes in S. chinensis indicates both recent and ancient retroposition activity during evolution. By comparing the genomes of closely related aphid species, future studies may uncover lineage-specific retroposition events and provide insights into the evolutionary pressures acting on these genes.

Conclusion

In this study, we have documented the landscape of retrocopy dynamics in the S. chinensis genome, illustrating their potential to evolve into functional putative retrogenes and chimeric transcripts. This work underscores the impact of transposable element-driven retroposition as a mechanism for gene family expansion and offers a valuable genomic resource for future comparative studies of gall-forming aphids.

Materials and methods

Sample collection and genome sequencing

The mature galls formed by Schlechtendalia chinensis were collected in Wufeng county (30°19′ N, 110°67′ E, 329 m above sea level), Hubei Province, China, in October 2019. A single gall contains thousands of aphids produced from the same clone of a single fundatrix. Approximately 200 live aphids from a single gall were collected for total DNA extraction, and the whole genome was sequenced using the PacBio platform. Another 150 individuals were used for transcriptome analysis and 100 for Hi-C sequencing analysis. The sequencing generated 85 GB of data, and Hi-C libraries were built to study DNA interactions, helping assemble 91.7% of the genome into 13 chromosomes by mapping spatial connections. Hi-C data improved assembly accuracy by showing how DNA folds in 3D. Final chromosomes were manually verified for errors. Detailed protocols are described in [43, 48].

A portion of the aphids was preserved in 75% ethanol for morphological identification, serving as voucher specimens. The gall aphid studied in this research was officially identified by the corresponding author, and the specimens (Voucher no. Ren_A1798) were deposited in the herbarium of the School of Life Science, Shanxi University.

Retrocopy identification

Retrocopies were identified using the bioinformatic tool RetroScan version 1.0 [37]. RetroScan was run on a Linux/Unix system with the following parameters: at least 50% similarity to the parent genes, at least 50% coverage of the gene length, and small gaps allowed. To avoid duplicates, we kept only high-quality matches (80% similarity and coverage). RetroScan combines tools such as LAST [49], BEDTools [50], ClustalW2 [51], KaKs Calculator [52], HISAT2 [53], StringTie [54], SAMtools [55], and Shiny [56] to compare genes to the genome, and it sorts retrocopies into categories: putative retrogenes, chimeric genes, intact retrocopies, and pseudogenes. This study uses genomic DNA-based data only; because RNA sequence data were not available, transcriptomic analysis could not be conducted. We manually checked each hit to confirm that it matched its parent gene and verified putative retrogenes using the NCBI RefSeq database.

RetroScan automatically computed the coverage and sequence identity based on pairwise BLASTn alignments. Coverage represents the percentage of the parental coding sequence (CDS) aligned to the retrocopy, while identity refers to the percentage of identical nucleotides within the aligned region. For deduplication, RetroScan applied an identity cutoff and a coverage cutoff of 80% each. Additionally, a similarity-based clustering step was used, whereby retrocopy alignments meeting these cutoffs were grouped only if they belonged to a cluster of at least 10 homologous sequences. This minimum cluster size was chosen to ensure robust and reproducible clustering and to minimize the inclusion of spurious matches or assembly artifacts. Within each such cluster, if multiple retrocopies aligned to the same parental gene, only the hit with the highest BLAST alignment bitscore was retained as the representative non-redundant retrocopy, while all other members were filtered out as probable duplicates or assembly artifacts. Candidate retrocopies not meeting the minimum cluster size threshold were excluded from cluster-based analyses but were retained in overall copy number estimates and supplementary datasets.
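The coverage and identity definitions, together with the threshold filter and best-bitscore selection, can be illustrated with a simplified sketch. It is not RetroScan's implementation: the `Hit` record, its field names, and the example identifiers are hypothetical, and the minimum-cluster-size step described above is deliberately omitted, so the selection is applied per parent gene only.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One retrocopy-to-parent alignment (hypothetical record layout)."""
    retrocopy_id: str
    parent_id: str
    aligned_parent_bases: int  # parental CDS bases covered by the alignment
    parent_cds_length: int
    identical_bases: int
    alignment_length: int
    bitscore: float

    @property
    def coverage(self):
        """Percentage of the parental CDS aligned to the retrocopy."""
        return 100.0 * self.aligned_parent_bases / self.parent_cds_length

    @property
    def identity(self):
        """Percentage of identical nucleotides within the aligned region."""
        return 100.0 * self.identical_bases / self.alignment_length


def best_hit_per_parent(hits, min_identity=80.0, min_coverage=80.0):
    """Keep hits passing both cutoffs, then the top-bitscore hit per parent gene."""
    best = {}
    for hit in hits:
        if hit.identity < min_identity or hit.coverage < min_coverage:
            continue
        current = best.get(hit.parent_id)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.parent_id] = hit
    return best


# Illustrative alignments; identifiers and numbers are made up.
hits = [
    Hit("rc1", "parentA", 900, 1000, 850, 900, 1200.0),
    Hit("rc2", "parentA", 950, 1000, 930, 950, 1500.0),
    Hit("rc3", "parentA", 600, 1000, 590, 600, 1800.0),  # coverage 60%, rejected
    Hit("rc4", "parentB", 480, 500, 350, 480, 400.0),    # identity ~73%, rejected
]
kept = best_hit_per_parent(hits)
print({parent: hit.retrocopy_id for parent, hit in kept.items()})
```

Filtering before selecting the best bitscore matters here: hit `rc3` has the highest bitscore for its parent gene but fails the coverage cutoff, so it is never considered as the representative.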

Annotation of retrocopies

The sequences of the retrocopies (putative retrogenes and chimeric genes) were extracted into FASTA files and imported into Geneious Prime version 2023. Within Geneious Prime, they were searched with BLASTn against the NCBI RefSeq nr database and with BLASTx against the RefSeq protein and UniProtKB/Swiss-Prot protein databases. The best-hit proteins were used to assign protein families.

Ka/Ks analysis and divergence time estimation

We estimated divergence times using the KaKs_Calculator included in RetroScan. RetroScan aligns the corresponding protein sequences of each retrocopy and its parental gene with ClustalW2, and then estimates the age of each retrocopy from the nucleotide substitutions that separate it from its parental gene. Ka is the rate of nonsynonymous substitutions, which change the encoded amino acid, while Ks is the rate of synonymous substitutions, which do not. The Ka/Ks ratio indicates the mode of selection acting on a retrocopy: values below 1 suggest purifying selection, values close to 1 suggest neutral evolution, and values above 1 suggest positive selection.
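As an illustration of how Ka/Ks values are conventionally read, the following Python sketch labels a retrocopy by its ratio. The function name and the handling of undefined values are assumptions made for this example. It is not part of RetroScan or KaKs_Calculator, and it ignores statistical uncertainty around a ratio of 1.

```python
import math


def classify_selection(ka, ks):
    """Return the selection regime implied by Ka/Ks, or None if undefined.

    Hypothetical helper: the ratio is undefined when either value is
    missing or when Ks is zero (no synonymous substitutions).
    """
    if ka is None or ks is None or math.isnan(ka) or math.isnan(ks) or ks <= 0:
        return None
    ratio = ka / ks
    if ratio < 1:
        return "purifying"   # nonsynonymous changes are selected against
    if ratio > 1:
        return "positive"    # nonsynonymous changes are favoured
    return "neutral"
```

In practice a ratio is rarely exactly 1, so whether a value counts as neutral would normally be decided by a statistical test rather than by strict equality.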

Chimeric genes domain detection

We extracted the nucleotide sequences of chimeric genes and their parent genes in FASTA format for the detection of conserved domains. Conserved domains are central to the function and evolution of chimeric genes. Because these domains can be inherited, modified, and combined into new gene structures, they allow for a rich variety of functional adaptations. Such gene fusion events are a driving force in the evolution of novel biochemical functions and complex physiological processes in organisms [40]. Conserved domains in the chimeric genes were identified with the NCBI Conserved Domain Search online tool (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi).

Statistical analysis

All statistical analyses were performed using R statistical software version 4.5.1. Differences in retrocopy classification proportions were assessed using Chi-square goodness-of-fit tests against the null hypothesis of equal distribution across categories. The significance threshold was set at p < 0.05. Retrocopy formation bias was evaluated by fitting a Poisson model (λ = 3.99) to the distribution of retrocopy counts per parental gene. Goodness-of-fit was assessed using χ² testing, and overdispersion relative to the Poisson expectation was taken as evidence of non-random retroposition. Poisson upper-tail probabilities were computed for each gene, followed by Benjamini–Hochberg false discovery rate (FDR) correction (q < 0.05) to identify statistically supported hotspot genes.
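The tests described above can be sketched with the Python standard library. The analyses in this study were run in R, so the code below is only an illustrative translation under stated assumptions: the function names are hypothetical, and the per-gene counts passed in would be supplied by the reader.

```python
import math


def chi_square_gof_equal(counts):
    """Chi-square statistic against equal expected counts (df = len(counts) - 1)."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Subtracting from 1 loses precision for very small tail probabilities,
    which is acceptable for a sketch but not for extreme counts.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):   # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


# Retrocopy category counts reported in the abstract (491 in total).
stat = chi_square_gof_equal([62, 163, 170, 96])
```

For the four category counts reported in the abstract, the statistic is about 67.3 on 3 degrees of freedom, well above the 5% critical value of 7.815. Hotspot parental genes would then be those whose q-value from `benjamini_hochberg` falls below 0.05 when `poisson_upper_tail` is applied to each gene's retrocopy count with λ = 3.99.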

For the selection analysis, we used a one-sample Wilcoxon signed-rank test to assess whether Ka/Ks values were significantly less than 1. Only genes with valid Ka/Ks estimates were included; genes with undefined or missing values (NA) were excluded because Ka/Ks could not be computed owing to zero synonymous substitutions or poor sequence alignment. As confirmation, a one-sample t-test was performed on log-transformed Ka/Ks values. A significance level of p < 0.05 was applied in all tests.
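A one-sided, one-sample Wilcoxon signed-rank test against 1 can be sketched as follows. This is an illustrative Python version, not the R code used in the study. It relies on the normal approximation without tie or continuity correction, and the function name is hypothetical.

```python
import math


def wilcoxon_less_than(values, mu=1.0):
    """One-sided one-sample Wilcoxon signed-rank test of H1: median < mu.

    Missing values (None or NaN) and values equal to mu are dropped.
    Uses the normal approximation, so it is only a rough guide for
    small samples. Returns (W_plus, z, p_value).
    """
    diffs = [v - mu for v in values
             if v is not None and not math.isnan(v) and v != mu]
    n = len(diffs)
    if n == 0:
        raise ValueError("no non-zero differences to rank")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # P(Z <= z)
    return w_plus, z, p_value
```

A small W+ (few or low-ranked values above 1) gives a strongly negative z and a small p-value, which is the pattern expected when most retrocopies are under purifying selection.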

Structural characterization of putative retrogenes and chimeric genes

The structures of putative retrogenes and chimeric genes in Schlechtendalia chinensis were characterized by sequence annotation, structural validation, and manual visualization. Each retrocopy was annotated in Geneious Prime v2023: nucleotide sequences were queried with BLASTn against the NCBI RefSeq nr database and with BLASTx against the UniProtKB/Swiss-Prot database to assign putative functions. For structural validation of putative retrogenes, RetroScan confirmed intron loss and mapped retrocopies back to their parental genes. Alignments between retrocopies and their respective parental genes were further inspected in Geneious Prime for verification.

A retrocopy was classified as a chimeric gene when the retroposed sequence was fused with exons of another gene. Conserved domains in Schlechtendalia chinensis chimeric genes were annotated using the NCBI Conserved Domain Database (CDD). Each annotation entry includes the query ID, domain hit type, PSSM accession, domain coordinates, E-value, bitscore, and functional definition. Domains such as RING-Ubox and Cyclophilin_RING were identified, supporting the predicted protein superfamilies of the chimeric genes shown in Fig. 5. Gene structures were manually modeled and illustrated using Microsoft Excel and PowerPoint to highlight protein family expansion in putative retrogenes and conserved domains in chimeric genes.

Supplementary Information

12864_2026_12659_MOESM1_ESM.xlsx (54.2KB, xlsx)

Supplementary Material 1: Table S1. Coverage and identity values for all retrocopies in S. chinensis. Table S2. List of uncharacterized putative retrogenes and chimeric genes. Table S3. List of heat shock protein (HSP) genes identified in Schlechtendalia chinensis.

12864_2026_12659_MOESM2_ESM.docx (323.5KB, docx)

Supplementary Material 2: Figure S1. Parent–child exon–intron comparison of HSP putative retrogenes in Schlechtendalia chinensis.

12864_2026_12659_MOESM3_ESM.docx (246.6KB, docx)

Supplementary Material 3: Figure S2. Conserved domain validation of chimeric gene in the Schlechtendalia chinensis genome.

12864_2026_12659_MOESM4_ESM.xlsx (455KB, xlsx)

Supplementary Material 4: File S1. List of retrocopies produced by various parent genes in S. chinensis. File S2. Distribution of retrocopies per parent gene in S. chinensis. File S3. List of the protein families in putative retrogenes. File S4. Protein families in chimeric genes. File S5. Formation of chimeric genes in S. chinensis. File S6. Conserved domain in chimeric genes. File S7. List of Ka, Ks, and Ka/Ks values for putative retrogenes and chimeric genes in S. chinensis.

Acknowledgements

Not applicable.

Authors’ contributions

**Hira Tazeen**: Conceptualization, Investigation, Methodology, Data analysis, Writing original draft, Review and editing. **Aftab Ahmad**: Data analysis, Review and editing, Software and investigation. **Syed Sibt E Hassan**: Review and editing. **Zhumei Ren**: Conceptualization, Supervision, Writing - Review and editing, Project administration, Funding acquisition.

Funding

This study was partially supported by the Joint Funds of the National Natural Science Foundation of China (U24A20358), Central Guiding Local Technology Development Fund (YDZJSX2024D012), Biomedical and Health Laboratory in Shanxi Province, and Research Project Supported by the Shanxi Scholarship Council of China (2020-018).

Data availability

The high-throughput sequencing data analyzed in this project and the whole-genome project are deposited in NCBI GenBank under BioProject PRJNA833747 and BioSample SAMN28016330.

Declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Innan H, Kondrashov F. The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet. 2010;11(2):97–108.
  • 2. Magadum S, Banerjee U, Murugan P, Gangapur D, Ravikesavan R. Gene duplication as a major force in evolution. J Genet. 2013;92(1):155–61.
  • 3. Miller D, Chen J, Liang J, Betrán E, Long M, Sharakhov IV. Retrogene duplication and expression patterns shaped by the evolution of sex chromosomes in malaria mosquitoes. Genes. 2022;13(6):968.
  • 4. Markova DN, Ruma FB, Casola C, Mirsalehi A, Betrán E. Recurrent co-domestication of PIF/Harbinger transposable element proteins in insects. Mob DNA. 2022;13(1):28.
  • 5. Casola C, Betrán E. The genomic impact of gene retrocopies: what have we learned from comparative genomics, population genomics, and transcriptomic analyses? Genome Biol Evol. 2017;9(6):1351–73.
  • 6. Wang W, Zheng H, Fan C, Li J, Shi J, Cai Z, et al. High rate of chimeric gene origination by retroposition in plant genomes. Plant Cell. 2006;18(8):1791–802.
  • 7. Vanin EF. Processed pseudogenes: characteristics and evolution. Annu Rev Genet. 1985;19:253–72.
  • 8. Weiner AM, Deininger PL, Efstratiadis A. Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu Rev Biochem. 1986;55(1):631–61.
  • 9. Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
  • 10. Emerson J, Kaessmann H, Betrán E, Long M. Extensive gene traffic on the mammalian X chromosome. Science. 2004;303(5657):537–40.
  • 11. Kaessmann H, Vinckenbosch N, Long M. RNA-based gene duplication: mechanistic and evolutionary insights. Nat Rev Genet. 2009;10(1):19–31.
  • 12. Xu P, Feuda R, Lu B, Xiao H, Graham RI, Wu K. Functional opsin retrogene in nocturnal moth. Mob DNA. 2016;7:1–6.
  • 13. Matsumura K, Imai H, Go Y, Kusuhara M, Yamaguchi K, Shirai T, Ohshima K. Transcriptional activation of a chimeric retrogene PIPSL in a hominoid ancestor. Gene. 2018;678:318–23.
  • 14. Xu P, Lu B, Chao J, Holdbrook R, Liang G, Lu Y. The evolution of opsin genes in five species of mirid bugs: duplication of long-wavelength opsins and loss of blue-sensitive opsins. BMC Ecol Evol. 2021;21(1):66.
  • 15. Fablet M, Bueno M, Potrzebowski L, Kaessmann H. Evolutionary origin and functions of retrogene introns. Mol Biol Evol. 2009;26(9):2147–56.
  • 16. Tolman ER, Beatty CD, Frandsen PB, Bush J, Bruchim OR, Driever ES, et al. Newly sequenced genomes reveal patterns of gene family expansion in select Dragonflies (Odonata: Anisoptera). Insect Syst Divers. 2025;9(4):8.
  • 17. Ravikanthachari N, Boggs CL. Gene family evolution in brassicaceous-feeding insects: implications for adaptation and host plant range. bioRxiv; 2023.
  • 18. Ohno S. Evolution by gene duplication. Berlin: Springer; 2013.
  • 19. Brosius J. Retroposons-seeds of evolution. Science. 1991;251(4995):753.
  • 20. Van de Peer Y, Mizrachi E, Marchal K. The evolutionary significance of polyploidy. Nat Rev Genet. 2017;18(7):411–24.
  • 21. Long M, Langley CH. Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science. 1993;260(5104):91–5.
  • 22. Toups MA, Hahn MW. Retrogenes reveal the direction of sex-chromosome evolution in mosquitoes. Genetics. 2010;186(2):763–6.
  • 23. Toups MA, Pease JB, Hahn MW. No excess gene movement is detected off the avian or lepidopteran Z chromosome. Genome Biol Evol. 2011;3:1381–90.
  • 24. Wang J, Long M, Vibranovski MD. Retrogenes moved out of the Z chromosome in the silkworm. J Mol Evol. 2012;74:113–26.
  • 25. Marques AC, Dupanloup I, Vinckenbosch N, Reymond A, Kaessmann H. Emergence of young human genes after a burst of retroposition in primates. PLoS Biol. 2005;3(11):e357.
  • 26. Bai Y, Casola C, Feschotte C, Betrán E. Comparative genomics reveals a constant rate of origination and convergent acquisition of functional retrogenes in Drosophila. Genome Biol. 2007;8:1–9.
  • 27. Betrán E, Thornton K, Long M. Retroposed new genes out of the X in Drosophila. Genome Res. 2002;12(12):1854–9.
  • 28. Vinckenbosch N, Dupanloup I, Kaessmann H. Evolutionary fate of retroposed gene copies in the human genome. Proc Natl Acad Sci. 2006;103(9):3220–5.
  • 29. Burki F, Kaessmann H. Birth and adaptive evolution of a hominoid gene that supports high neurotransmitter flux. Nat Genet. 2004;36(10):1061–3.
  • 30. Dai H, Chen Y, Chen S, Mao Q, Kennedy D, Landback P, Eyre-Walker A, Du W, Long M. The evolution of courtship behaviors through the origination of a new gene in Drosophila. Proc Natl Acad Sci. 2008;105(21):7478–83.
  • 31. Sayah DM, Sokolskaja E, Berthoux L, Luban J. Cyclophilin A retrotransposition into TRIM5 explains owl monkey resistance to HIV-1. Nature. 2004;430(6999):569–73.
  • 32. Zhang J, Dean AM, Brunet F, Long M. Evolving protein functional diversity in new genes of Drosophila. Proc Natl Acad Sci. 2004;101(46):16246–50.
  • 33. Ciomborowska J, Rosikiewicz W, Szklarczyk D, Makałowski W, Makałowska I. Orphan retrogenes in the human genome. Mol Biol Evol. 2012;30(2):384–96.
  • 34. Ahmad A, Zhang W. Genomic exploration of retrocopies in Insect pests of plants and their role in the expansion of heat shock proteins superfamily as evolutionary targets. BMC Genomics. 2024;25(1):1116.
  • 35. Wang H, Cui K, Shao S, Liu J, Chen H, Wang C, et al. Molecular response of gall induction by aphid Schlechtendalia chinensis (Bell) attack on Rhus chinensis Mill. J Plant Interact. 2017;12(1):465–79.
  • 36. Ahmad A, Ren Z. Mobilome of the Rhus gall aphid Schlechtendalia chinensis provides insight into TE insertion-related inactivation of functional genes. Int J Mol Sci. 2022;23(24):15967.
  • 37. Wei Z, Sun J, Li Q, Yao T, Zeng H, Wang Y. RetroScan: an easy-to-use pipeline for Retrocopy annotation and visualization. Front Genet. 2021;12:719204.
  • 38. Iwashita S, Nakashima K, Sasaki M, Osada N, Song SY. Multiple duplication of the bucentaur gene family, which recruits the APE-like domain of retrotransposon: identification of a novel homolog and distinct cellular expression. Gene. 2009;435(1–2):88–95.
  • 39. Rogers RL, Hartl DL. Chimeric genes as a source of rapid evolution in Drosophila melanogaster. Mol Biol Evol. 2012;29(2):517–29.
  • 40. Ponting CP, Russell RR. The natural history of protein domains. Annu Rev Biophys Biomol Struct. 2002;31(1):45–71.
  • 41. Nei M, Suzuki Y, Nozawa M. The neutral theory of molecular evolution in the genomic era. Annu Rev Genom. 2010;11(1):265–89.
  • 42. King AM, MacRae TH. Insect heat shock proteins during stress and diapause. Annu Rev Entomol. 2015;60(1):59–75.
  • 43. Ahmad A, von Dohlen C, Ren Z. A chromosome-level genome assembly of the Rhus gall aphid Schlechtendalia chinensis provides insight into the endogenization of Parvovirus-like DNA sequences. BMC Genomics. 2024;25(1):16.
  • 44. Zhao L, Jones WA. Expression of heat shock protein genes in insect stress responses. Invertebr Sur J. 2012;9(1):93–101.
  • 45. Feng H, Wang L, Liu Y, He L, Li M, Lu W, et al. Molecular characterization and expression of a heat shock protein gene (HSP90) from the carmine spider mite, Tetranychus cinnabarinus (Boisduval). J Insect Sci. 2010;10(1):112.
  • 46. Bettencourt BR, Hogan CC, Nimali M, Drohan BW. Inducible and constitutive heat shock gene expression responds to modification of Hsp70 copy number in Drosophila melanogaster but does not compensate for loss of thermotolerance in Hsp70 null flies. BMC Biol. 2008;6:1–15.
  • 47. Fu B, Chen M, Zou M, Long M, He S. The rapid generation of chimerical genes expanding protein diversity in zebrafish. BMC Genomics. 2010;11:1–9.
  • 48. He H, Crabbe MJ, Ren Z. Genome-wide identification and characterization of the chemosensory relative protein genes in Rhus gall aphid Schlechtendalia chinensis. BMC Genomics. 2023;24(1):222.
  • 49. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21(3):487–93.
  • 50. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.
  • 51. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–8.
  • 52. Wang D, Zhang Y, Zhang Z, Zhu J, Yu J. KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies. Genom Proteom Bioinform. 2010;8(1):77–80.
  • 53. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–60.
  • 54. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5.
  • 55. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
  • 56. Ge SX, Jung D, Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36(8):2628–9.


Articles from BMC Genomics are provided here courtesy of BMC
