Abstract
Nucleotide base composition plays an influential role in the molecular mechanisms involved in gene function, phenotype, and amino acid composition. GC content (proportion of guanine and cytosine in DNA sequences) shows a high level of variation within and among species. Many studies measure GC content in a small number of genes, which may not be representative of genome-wide GC variation. One challenge when assembling extensive genomic data sets for these studies is the significant amount of resources (monetary and computational) associated with data processing, and many bioinformatic tools have not been optimized for resource efficiency. Using a high-performance computing (HPC) cluster, we manipulated resources provided to the targeted gene assembly program, automated target restricted assembly method (aTRAM), to determine an optimum way to run the program to maximize resource use. Using our optimum assembly approach, we assembled and measured GC content of all of the protein-coding genes of a diverse group of parasitic feather lice. Of the 499 426 genes assembled across 57 species, feather lice were GC-poor (mean GC = 42.96%) with a significant amount of variation within and between species (GC range = 19.57%-73.33%). We found a significant correlation between GC content and standard deviation per taxon for overall GC and GC3, which could indicate selection for G and C nucleotides in some species. Phylogenetic signal of GC content was detected in both GC and GC3. This research provides a large-scale investigation of GC content in parasitic lice laying the foundation for understanding the basis of variation in base composition across species.
Keywords: Bioinformatics, computational resource efficiency, base composition, feather lice, protein-coding genes, phylogenetic signal, AT (adenine/thymine) rich
Introduction
Genomic characteristics, such as base composition, play an important role in the evolution and ecology of organisms. These features can be influential in molecular mechanisms involved in gene function, phenotype, and amino acid composition.1-3 Base composition is typically measured as GC content (proportion of guanine and cytosine in DNA), which has been directly linked to amino acid composition. 4 As amino acids are the building blocks of proteins, variation in amino acid composition is a critical component of protein evolution. 4 Measuring GC content within and among species may be the first step for understanding adaptation at a molecular level. For example, identifying adaptive alleles, which may be indirectly constrained based on GC content,5,6 in threatened populations is imperative for species and ecosystem conservation. 7 GC content has been found to be highly variable among and within species as well as at different organizational levels (ie, proteome vs genome8,9). Here, we mainly focus on the GC content of protein-coding genes, and it is important that comparisons between studies be made using analogous data sets. Even with this large-scale variability, some patterns stand out, such as higher recombination rates in GC-rich genes 10 and a negative relationship between GC content of third codon positions and chromosome length.11,12 The patterns of GC content variation across organisms are thought to be linked to genomic characteristics such as methylation throughout the genome, 13 expression levels of coding genes14,15, and genome-wide gene conversion. 16 Many hypotheses have been suggested to explain variation in GC content across different regions of the genome, such as molecular mechanics, environmental factors, natural selection, or a combination thereof.16-18 Before any of these mechanisms can be investigated, determining GC content across an organism’s genes and comparing the variation found among closely related species is needed to understand how base composition has influenced diversification and adaptation. 19
A disproportionate amount of genomic research has focused on vertebrate groups, particularly mammals and birds.20,21 Insects have been given much less attention yet are the most diverse group of animals in the world 22 and are facing catastrophic declines. 23 Parasitic lice (Phthiraptera) are of particular interest due to their global distribution, fast rate of evolution, and high level of diversification,24-28 providing an ideal system to study GC content and protein evolution in a closely related group of organisms 29 . Lice have among the smallest genomes within insects 30 ; however, only a few studies have examined GC content of coding genes in this group with conflicting results. A small number of genes were found to be GC-rich, whereas a much larger set were GC-poor.31,32 Many insects have low GC content overall,33-36 which is consistent with the results of GC content of coding genes in parasitic lice found by Virrueta Herrera et al, 32 even though the mtDNA of some parasitic lice genes is GC-rich compared with other insects31,37. A large-scale genome-wide investigation would provide a better understanding of the patterns in base composition of parasitic lice and allow for more comprehensive comparisons with other organisms.
These genomic studies often use gene assembly programs run on a high-performance computing (HPC) cluster. A significant amount of computational resources are necessary for large-scale data, and these resources are not always available or accessible. 38 Researchers often pay for an allotted amount of resources and are charged for these resources even if they are idle, leading to wasted time and money. 39 Given that gene assemblies are generally resource-intensive, inefficient resource use can quickly become wasteful. A critical component of continuing the advancement and accessibility of molecular studies is the development of programs that can efficiently prepare and analyze these genome-scale data sets.
Automated target restricted assembly method (aTRAM)40,41 is a targeted gene assembly program, which assembles specific loci from unassembled sequences using a closely or distantly related locus as the reference. 42 Automated target restricted assembly method begins by creating a library from the unassembled sequence reads, which consists of BLAST formatted databases split into multiple groups of paired-end sequences and a relational database to associate read-pairs. Next, a target sequence is blasted against these groups to identify homologous reads which are then assembled into a contiguous piece of DNA, termed contig. This process is repeated using the newly assembled contig as the query sequence and so on for multiple iterations until the target locus is assembled. Through this process, aTRAM breaks up large tasks (eg, Basic Local Alignment Search Tool (BLAST)) into multiple small tasks, using central processing units (CPUs). Automated target restricted assembly method allows the user to determine the number of CPUs in each run and uses them to blast the groups of paired-end reads within a library in parallel, increasing the overall rate of gene assembly. However, it is unclear if increasing the number of CPUs results in the most efficient use of resources. Adding more CPUs might be lowering the computational efficiency as more of these resources might be sitting idle.
Here, we assembled all nuclear protein-coding genes for several feather louse (Ischnocera) genera. We used a reference set of genes from the recently annotated pigeon louse (Columbicola columbae) genome 43 to measure GC content across the proteome of this diverse group of ectoparasites. We compared GC content among genes and across species and estimated the phylogenetic signal of GC content. Our aim was to gain an understanding of the level of variation in base composition found within this group of insects and to provide the data needed to continue investigating protein evolution.
We used aTRAM, which was designed to maximize resource use by allowing the user to incorporate as many cores as available during the assembly process. However, based on the time and resources used to assemble a large number of genes from a single taxon, it is unclear if aTRAM is using those resources efficiently, or if there are more optimal ways to use resources (eg, in parallel) to maximize efficiency. Before assembling the genes for our full set of taxa, we first investigated aTRAM resource efficiency given different amounts of computational resources to see how the available resources are being used during the assembly process. Focusing on 2 of the most commonly manipulated computational resources (CPUs and tasks), we measured the rate and computational efficiency of gene assembly using a varying number of resources. Our goal is to maximize computational efficiency while optimizing the rate at which genes can be assembled.
Methods
Our data set included 57 species in 54 genera of parasitic feather lice. Raw data were obtained from the NCBI (National Center for Biotechnology Information) Sequence Read Archive (SRA) (see Supplemental Data for SRA). All paired-end reads were trimmed using Trimmomatic v0.39 in paired-end mode to remove areas of low quality and to clip adapters (Illumina universal adapter). 44 We used a sliding window of 4 base pairs with a minimum quality of 20 (Phred + 33), and all reads shorter than 100 base pairs were dropped. All genes were assembled with aTRAM v2.4.3 42 using the 13 364 annotated genes from the pigeon louse genome 43 as a reference.
aTRAM
The software aTRAM was designed to maximize the use of computational resources. For example, running BLAST searches on each group of paired-end reads can be done in parallel using many CPUs or sequentially using a single CPU. It is unknown which of the following methods would improve gene assembly rate and resource efficiency: a) adding more CPUs to a single aTRAM run or b) running multiple instances of aTRAM in parallel with fewer CPUs each. Although HPC clusters generally have multiple options for resource manipulation, we focus on 2 common resources users have to select when running programs with HPC: numbers of CPUs and tasks. All computational tests were run on the University of Nevada, Reno HPC cluster, Pronghorn, which uses Slurm as a workload manager. For all tests, the same gene (chr1-aug-0.14-mRNA-1) and library (Alcedoecus: 55 groups of paired-end reads) were used. We examined resource efficiency by measuring the percent of available resources used with an increasing scale of CPUs and tasks.
CPUs
We altered the Slurm sbatch option --cpus-per-task to determine the change in gene assembly rate with an increasing number of CPUs. Efficiency was measured by the number of genes assembled per CPU per hour. All tests were run exclusively on a single node with 1 task and a 2 hour time limit. These tests were run using 1, 2, 4, 8, 16, and 32 CPUs, setting the CPU argument in aTRAM to the same value (--cpus; details about aTRAM arguments here: https://github.com/juliema/aTRAM). We tested these methods with 2 different assemblers, Trinity 45 and ABySS. 46
Tasks
The next test focused on parallelization to measure CPU use efficiency and gene assembly rate. All tests were run exclusively on a single node with the Slurm argument --cpus-per-task=1 (see Results). By adding tasks, we effectively increase the number of instances of aTRAM running simultaneously. We used the same scaling (1, 2, 4, 8, 16, and 32) for tasks and for the aTRAM --cpus argument. With ABySS, we used message passing interface (MPI) mode using --abyss-np. This changes the number of processors used in parallel execution; using this mode with ABySS is no longer recommended, but this does not change the outcome of our research. We used the same scaling for --abyss-np. This argument was not an option to use with Trinity. Values for these 3 parameters (--ntasks, aTRAM CPUs, and --abyss-np) matched for all tests. Each test assembled 64 copies of the same gene, and no time limit was given. Due to the high level of I/O (input/output) during assembly, we used a temporary file system to reduce the amount of memory needed to store files during gene assembly. Central processing unit and memory efficiency were obtained with the Slurm seff command for each job. Tests were run with both Trinity and ABySS.
Full data set assembly
We assembled a target set of 13 364 protein-coding genes for 57 feather lice taxa based on the results from the CPU and task tests with aTRAM using ABySS. These genes were all of the annotated genes from the pigeon louse genome. The amino acid sequences from these genes were used as the reference for tblastn searches. Exonerate 47 was used to stitch together assembled contigs and concatenate exons using the pigeon louse amino acid reference sequences. Once genes were assembled, we ran a reciprocal best BLAST search and removed any gene that was not the reciprocal best hit (RBH) for its associated pigeon louse target. Using the pigeon louse gene set, we ran analysis of variance (ANOVA) tests in R v4.3.1 48 to compare the length of all genes, the length of genes that assembled for at least 1 taxon, and the length of genes that did not assemble for any taxa.
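The reciprocal best hit filter described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name, the two dictionaries of best hits, and the gene identifiers are hypothetical, and the sketch assumes the best hit in each search direction has already been extracted from the BLAST output.

```python
def reciprocal_best_hits(forward_best, reverse_best):
    """Keep only (query, hit) pairs whose best hit points back at the query.

    forward_best: hypothetical dict mapping each assembled gene to its best
                  hit in the reference gene set.
    reverse_best: hypothetical dict mapping each reference gene to its best
                  hit among the assembled genes.
    """
    return {
        query: hit
        for query, hit in forward_best.items()
        if reverse_best.get(hit) == query
    }


# Made-up identifiers: asm2 and asm3 both hit refB, but refB's best hit is asm3.
forward = {"asm1": "refA", "asm2": "refB", "asm3": "refB"}
reverse = {"refA": "asm1", "refB": "asm3"}
kept = reciprocal_best_hits(forward, reverse)
print(kept)
```

In this toy example, asm2 is dropped because its reference hit does not point back to it, which mirrors the removal of genes that failed the RBH test.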
GC content
GC content was measured for all remaining genes (n = 499 426) using an original Python (v3.8.5) script excluding Ns from calculations. Removing Ns from sequences did not impact our results of GC content. Automated target restricted assembly method outputs assembled genes in reading frame so GC content was also calculated at each codon position (GC1, GC2, and GC3). From here on, we refer to GC content across all codon positions for an entire gene or taxon as simply GC. Phylogenetic linear regression models (R package phylolm 49 ) were used to assess the correlation between overall GC content and variation in GC content (standard deviation [SD]). This was done in R for GC and GC3 per taxon. We focused on GC3 over GC1 and GC2 because it is the least evolutionarily constrained codon position. 50 Because changing the base at the third codon position typically does not change the amino acid, third positions are less constrained and more likely to reflect underlying mutational biases, and this may be linked to GC bias.51,52 GC, GC1, GC2, and GC3 values were mapped onto phylogenetic trees using the packages ape v5.7-1 53 and ggplot2 v3.3.6 54 in R. Phylogenetic trees were obtained from de Moya et al 28 and pruned to taxa in our data set.
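As an illustration of the calculation described above, the following Python sketch computes GC, GC1, GC2, and GC3 for an in-frame sequence. It is not the original script: the function names are hypothetical, and it assumes that every character other than A, C, G, or T (including N) is simply left out of the denominator.

```python
def gc_fraction(seq):
    """Return the GC proportion of seq, counting only A, C, G, and T.

    Ns and any other ambiguity codes are excluded from the calculation
    (an assumption of this sketch). Returns NaN if no countable base remains.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def gc_by_codon_position(seq):
    """Return (GC, GC1, GC2, GC3) for a coding sequence that starts in frame."""
    # seq[i::3] collects every base at codon position i + 1.
    per_position = tuple(gc_fraction(seq[i::3]) for i in range(3))
    return (gc_fraction(seq),) + per_position


# Toy sequence with one N at a third codon position.
gc, gc1, gc2, gc3 = gc_by_codon_position("ATGGCNGCC")
print(f"GC={gc:.2%} GC1={gc1:.2%} GC2={gc2:.2%} GC3={gc3:.2%}")
```

Slicing by codon position relies on the sequences being output in reading frame, as stated above for aTRAM.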
Phylogenetic signal
We tested for phylogenetic signal of GC and GC3 to see whether closely related feather lice taxa have similar GC content compared with more distant relatives. Analyses were done using the phylosignal v1.3 55 and phylobase v0.8.10 56 packages in R. We first measured global phylogenetic signal across the entire phylogeny with both Moran I and Abouheif Cmean with 100 simulations and 999 repetitions. Moran I and Abouheif Cmean are measures of spatial autocorrelation used to test the level of similarity of a characteristic between branch tips that are close in proximity.57,58 Abouheif Cmean is a slight variation of Moran I by ignoring branch lengths and focusing on the mean of multiple topology possibilities with a weighted matrix of relatedness. 59 This accounts for any inaccuracies in the tree and when paired with Moran I can provide more confidence that any signal found is not reliant on the estimated branch lengths. In addition, Abouheif Cmean is a robust analysis that provides accurate statistics under many conditions (ie, polytomies or tree size 60 ). Both indices calculate a value from −1 to 1, meaning the absence (−1) or presence (1) of similarities in a trait at certain phylogenetic distances. Second, because global measures of phylogenetic signal do not specify which lineages show trait similarities, we ran a Local Indicators of Phylogenetic Association (LIPA) analysis to identify local hotspots of autocorrelation. 55
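To make the autocorrelation idea concrete, the sketch below computes Moran I in plain Python using the standard formula I = (n / W) × Σ w_ij (x_i − x̄)(x_j − x̄) / Σ (x_i − x̄)², where W is the sum of the weights. It does not reproduce the phylosignal implementation, which derives its own proximity weights from the tree; the weight matrix and GC values here are made up for illustration.

```python
def morans_i(values, weights):
    """Moran I for trait values given a square proximity weight matrix.

    values:  trait value per tip (eg, mean GC per taxon).
    weights: hypothetical n x n matrix; weights[i][j] is the proximity of
             tips i and j. Diagonal entries are ignored.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    w_sum = sum(weights[i][j] for i, j in pairs)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    variance_term = sum(d * d for d in dev)
    return (n / w_sum) * (cross / variance_term)


# Four tips forming two sister pairs (0,1) and (2,3).
sisters = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
clustered = morans_i([40.0, 40.0, 46.0, 46.0], sisters)    # sisters alike
alternating = morans_i([40.0, 46.0, 40.0, 46.0], sisters)  # sisters differ
print(clustered, alternating)
```

When sister tips share the same value the index is at its positive extreme, and when they always differ it is at its negative extreme, matching the interpretation of the index given above.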
Results
Central processing units
Trinity and ABySS had similar patterns of gene assembly rate and resource efficiency, as measured by genes assembled per CPU per hour, although ABySS assembled genes faster than Trinity overall (Figure 1A and B). We found that aTRAM used computational resources most efficiently (ie, assembled the most genes per CPU) when given 1 CPU (Trinity: 7 genes/CPU/h; ABySS: 8 genes/CPU/h; Table 1). This is a common outcome for most parallelization problems, as any increase in parallelization introduces overhead in the form of thread communication and synchronization. When addressing parallelization of aTRAM using tasks, we therefore used the Slurm argument --cpus-per-task=1 for all tests.
Figure 1.
Gene assembly rate (A and C) and resource use efficiency (B and D) across an increasing number of CPUs (A and B) and aTRAM instances (C and D). An instance in this context represents a task, specifically the number of concurrently running aTRAM instances. Two different assemblers were used: ABySS (pink circles) and Trinity (blue triangles).
Table 1.
Statistics from assembling genes with an increasing number of CPUs on a high-performance computing cluster using 2 different assemblers (Trinity and ABySS).
Scaling CPUs: Trinity vs ABySS
Assembler | Number of CPUs | Run time (min) | Number of genes assembled | Gene assembly rate (genes/min) | Resource efficiency (genes/CPU/h) |
---|---|---|---|---|---|
Trinity | 1 | 120 | 14 | 0.12 | 7.00 |
Trinity | 2 | 120 | 23 | 0.19 | 5.75 |
Trinity | 4 | 120 | 43 | 0.36 | 5.38 |
Trinity | 8 | 110 | 64 | 0.58 | 4.36 |
Trinity | 16 | 68 | 64 | 0.94 | 3.53 |
Trinity | 32 | 47 | 64 | 1.36 | 2.55 |
ABySS | 1 | 120 | 16 | 0.13 | 8.00 |
ABySS | 2 | 120 | 26 | 0.22 | 6.50 |
ABySS | 4 | 120 | 50 | 0.42 | 6.25 |
ABySS | 8 | 87 | 64 | 0.74 | 5.52 |
ABySS | 16 | 51 | 64 | 1.25 | 4.71 |
ABySS | 32 | 31 | 64 | 2.06 | 3.87 |
Each row indicates a single run and how many CPUs were used. Resource efficiency was calculated by the number of genes assembled per available CPU per hour.
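The two derived columns in Table 1 follow directly from the run time, gene count, and CPU count. The short Python sketch below reproduces them for three rows of the table; the function names are ours and are used only for illustration.

```python
def assembly_rate(genes, run_minutes):
    """Genes assembled per minute of run time."""
    return genes / run_minutes


def resource_efficiency(genes, cpus, run_minutes):
    """Genes assembled per available CPU per hour."""
    return genes / cpus / (run_minutes / 60)


# (assembler, CPUs, run time in minutes, genes assembled), taken from Table 1.
rows = [
    ("Trinity", 1, 120, 14),
    ("Trinity", 8, 110, 64),
    ("ABySS", 32, 31, 64),
]
for assembler, cpus, minutes, genes in rows:
    rate = assembly_rate(genes, minutes)
    eff = resource_efficiency(genes, cpus, minutes)
    print(f"{assembler} {cpus} CPUs: {rate:.2f} genes/min, {eff:.2f} genes/CPU/h")
```

The comparison makes the trade-off explicit: the 32-CPU ABySS run has the highest assembly rate in the table, yet each CPU contributes fewer genes per hour than in the single-CPU runs.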
Tasks
For both assemblers, 32 tasks resulted in the highest gene assembly rate and CPU efficiency (Trinity: 1.60 genes/min and 22.04% CPU efficiency; ABySS: 2.29 genes/min and 35.23% CPU efficiency; Table 2). Speedup, which measures the relative improvement (eg, in wall time) when increasing processing elements during execution of a program, was calculated for each stepwise increase in the number of tasks. 61 We found that ABySS had a greater speedup than Trinity for each test and therefore scales better with larger data sets. Although memory efficiency for both assemblers was quite low, ABySS had a much steeper increase with increasing tasks, whereas Trinity memory efficiency peaked at 16 tasks and declined at 32 tasks (Figure 1C and D). Because assembly rate and CPU efficiency were still increasing at 32 tasks, the largest number we tested, we likely have not yet reached the parallelization plateau where synchronization and communication costs outweigh the advantages of parallel computation. Thus, a greater number of tasks would likely further increase assembly rate and efficiency.
Table 2.
Statistics from assembling genes with an increasing number of tasks on a high-performance computing cluster using 2 different assemblers (Trinity and ABySS).
Scaling tasks: resource efficiency, Trinity vs ABySS
Assembler | Number of tasks | Genes per minute | Speedup (t(1)/t(N)) | CPU efficiency (%) | Memory efficiency (%) |
---|---|---|---|---|---|
Trinity | 1 | 0.18 | 1.00 | 1.58 | 0.42 |
Trinity | 2 | 0.27 | 1.56 | 3.05 | 0.37 |
Trinity | 4 | 0.50 | 2.82 | 5.73 | 0.48 |
Trinity | 8 | 0.75 | 4.28 | 10.24 | 0.55 |
Trinity | 16 | 1.08 | 6.17 | 15.15 | 1.42 |
Trinity | 32 | 1.60 | 9.10 | 22.04 | 1.10 |
ABySS | 1 | 0.19 | 1.00 | 1.52 | 0.35 |
ABySS | 2 | 0.31 | 1.63 | 2.92 | 0.34 |
ABySS | 4 | 0.56 | 2.95 | 5.66 | 1.27 |
ABySS | 8 | 0.97 | 5.09 | 11.00 | 2.51 |
ABySS | 16 | 1.52 | 8.00 | 18.36 | 5.00 |
ABySS | 32 | 2.29 | 12.00 | 35.23 | 9.98 |
Each row indicates a single run and how many tasks were used. All runs were given 1 CPU per task based on the results from scaling CPUs. Speedup measures the relative improvement of the same run with differing resources. Central processing unit and memory efficiency were obtained from the workload manager (Slurm) output (CPU efficiency = cpu_time / (run_time × number_of_cpus); memory efficiency is the amount of allocated memory used in a run).
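The speedup and CPU efficiency definitions given in the footnote can be written as a short Python sketch. The function names are ours, the run times are back-calculated from the rounded rates in Table 2 (so the result only approximates the tabulated speedup), and the CPU time figures in the second example are hypothetical.

```python
def speedup(t_1, t_n):
    """Relative improvement t(1)/t(N) of a run with N tasks over 1 task."""
    return t_1 / t_n


def cpu_efficiency(cpu_time, run_time, number_of_cpus):
    """CPU efficiency in percent: cpu_time / (run_time x number_of_cpus)."""
    return 100 * cpu_time / (run_time * number_of_cpus)


# Each test assembled 64 gene copies, so run time = 64 / (genes per minute).
genes = 64
t_1 = genes / 0.19    # ABySS, 1 task (rounded rate from Table 2)
t_32 = genes / 2.29   # ABySS, 32 tasks (rounded rate from Table 2)
abyss_speedup = speedup(t_1, t_32)
print(f"ABySS speedup at 32 tasks: about {abyss_speedup:.2f}")

# Hypothetical job: 16 CPU hours consumed over 2 h of wall time on 32 CPUs.
example_eff = cpu_efficiency(cpu_time=16, run_time=2, number_of_cpus=32)
print(f"CPU efficiency: {example_eff:.1f}%")
```

Because the same 64 genes were assembled in every test, the ratio of run times equals the ratio of assembly rates, which is why the speedup column tracks the genes-per-minute column.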
Full data set assembly
Based on the results from the CPU and task scaling tests, we assembled all of the protein-coding genes from 57 feather lice taxa using the following arguments: --nodes=1, --cpus-per-task=1, and --ntasks=32 for Slurm; --cpus 32 for aTRAM; and --abyss-np 32 to run ABySS as a parallel MPI job. The average run time was 3.25 days, with a range between 1.38 and 6.25 days. The average number of loci assembled per taxon was 9055 with a range between 6637 and 12 139. To assemble all of these loci, we used 1824 CPUs and 146 477.04 CPU hours. For the 57 taxa, we assembled 516 176 genes total. After removing the genes that did not pass the reciprocal best BLAST test, our final data set (used for all further analyses) included 499 426 protein-coding genes with an average of 8761 genes per taxon.
Within our data set, 11 780 genes assembled for at least 1 taxon and 7182 of those genes were assembled for a minimum of 54 taxa. We found that 1584 genes did not assemble for any taxon. Using the nucleotide sequences from the pigeon louse reference, we compared gene length between all of the reference genes (n = 13 364), genes that we assembled for at least 1 taxon (n = 11 780), and genes that did not assemble for any taxon (n = 1584). Data were log transformed to fit with assumptions of normality. Gene lengths were significantly different between these groups with a large effect size, showing that shorter genes were less likely to assemble (ANOVA: F(2, 26 726) = 2270, P < .001, η2 = 0.15). Pairwise t tests with Bonferroni correction showed a significant difference between all group pairings (P ⩽ .001; Figure 2).
Figure 2.
Differences in gene length (measured in DNA) from Columbicola columbae between 3 groups: all C. columbae protein-coding genes (n = 13 364; green), genes that assembled for at least 1 taxon in our data set (n = 11 780; pink), and genes that did not assemble for any taxon (n = 1584; yellow). The group of genes that did not assemble exhibited significantly shorter gene lengths compared with the other 2 groups. The presented data use logged values, with the y-axis indicating non-logged values.
GC content
We measured GC content for all genes for each taxon (n = 499 426), as well as GC content for each codon position (GC1, GC2, and GC3). Data followed a normal distribution for GC, GC1, and GC2. The distribution for GC3 was generally normal with a long right tail, indicating a disproportionate group of genes that are GC-rich. The range of GC content for all genes was 19.57% to 73.33% (M = 42.96%; SD = 6.53; see Supplemental Data for all measures of GC content per taxon). Only 15% of all genes in our data set had a GC content of 50% or greater with most genes being GC-poor. Overall GC content was significantly different between taxa (n = 499 426; ANOVA: F(56, 496 925) = 1410, P ⩽ .001) with a large effect size (η2 = 0.14). To be sure the significance of the ANOVA was not due to only a few groups with significant differences, we ran a pairwise t test with Bonferroni correction and found a significant difference in GC content between 91% of groups (P ⩽ .05). The mean GC content across all protein-coding genes for the pigeon louse reference (n = 13 364) was 41.40% (SD = 5.17), similar to the GC content found in our data set. Both of these findings, however, are higher than GC content found across the entire pigeon louse genome, which is 36%. 42 For each of the 3 codon positions, mean GC content was 47.80% (SD = 5.32) for GC1, 37.70% (SD = 5.96) for GC2, and 43.37% (SD = 14.15) for GC3. As expected, there was much more variation found at GC3 compared with GC1 and GC2. The distribution of GC content per taxon for GC, GC1, GC2, and GC3 is shown in Supplemental Figures 1 to 4.
We found a positive correlation between GC content and variation in GC content (measured as SD) for overall GC (r = 0.61, P ⩽ .001; Figure 3) and for GC3 (r = 0.70, P ⩽ .001; Figure 3). Species with higher GC content have significantly more variation across their genes, whereas those that are more GC-poor seem to have a more consistent base composition. We found a weak positive relationship between GC content and gene length (ANOVA: F(1, 496 980) = 4355, P ⩽ .001; η2 = 0.0087) among the genes. There is a slightly stronger positive relationship between GC and gene length for the pigeon louse reference; however, the effect size is still quite small (ANOVA: F(1, 13 362) = 158.2, P ⩽ .001; η2 = 0.01).
Figure 3.
A strong positive correlation was found between both mean GC (blue, circles) and GC3 (GC content at codon position 3; orange, triangles) and their respective SDs. When accounting for phylogeny, R2 values were 0.61 for GC and 0.70 for GC3. Mean GC, GC3, and SD were calculated for all genes assembled per taxon. Each data point represents a distinct genus, and regression lines are depicted in blue (GC) and orange (GC3). The y-axis shows the fitted values calculated from the phylogenetic linear regression model.
GC content on tree and phylogenetic signal
GC content was mapped onto the subsampled phylogenetic tree from de Moya et al, 28 and in some cases, related taxa had similar GC levels (Figure 4). The pattern of GC content seen across the tree is also seen at all codon positions (Supplemental Figures 5 and 6), particularly for GC3 (Supplemental Figure 7). This pattern suggests that transitions between GC-rich and GC-poor genes may have occurred multiple times across this group, as opposed to a more continuous change from older to more recently diverged taxa (see 19 ).
Figure 4.
Phylogenetic tree (modified from de Moya et al 28 ) with mean GC content mapped onto branches. Local hotspots exhibiting significant phylogenetic signal, as determined by the LIPA analysis, are indicated by asterisks. Purple circles represent a positive relationship, whereas orange rectangles denote a negative relationship. Darker shades of blue indicate higher GC levels, whereas lighter shades of green represent lower GC levels.
Phylogenetic signal for GC and GC3 was not found using Moran I but was found using Abouheif Cmean. For overall GC, Moran I was nonsignificant and close to 0 (I = −0.0099, P = .100), whereas Abouheif Cmean was significant with a low positive signal (CM = 0.190, P = .020). A similar pattern appeared for GC3, with no global phylogenetic signal detected with Moran I (I = −0.010, P = .077) but a signal detected with Abouheif Cmean (CM = 0.194, P = .015). Although no significant phylogenetic signal was found regarding GC using Moran I at the global scale, the LIPA analysis found significant positive autocorrelation for 13 taxa using Moran I, as well as significant autocorrelation for 15 taxa using Abouheif Cmean (Figure 4). Two of the taxa identified as having significant phylogenetic signal with Abouheif Cmean were negatively autocorrelated, whereas all of the taxa that exhibited significant phylogenetic signal using Moran I were positively autocorrelated. For GC3, the LIPA analysis for both Moran I and Abouheif Cmean revealed significant autocorrelation for the same 14 taxa; however, 2 of these were negatively autocorrelated with Abouheif Cmean (Supplemental Figure 7).
Discussion
Considering the disproportionately lower amount of genetic research on insects relative to their abundance and global distribution, 20 we set out to accomplish 2 main goals. First, we aimed to improve gene assembly efficiency with aTRAM by decreasing the amount of time and resources needed using parallelization with HPC, allowing for easier access to genetic data for these understudied groups. Second, we investigated the base composition of the largest set to date of protein-coding genes of feather lice. We found that while 32 CPUs assembled the most genes per minute, resource use was most efficient when using 1 CPU (Trinity: 7 genes/CPU/h; ABySS: 8 genes/CPU/h; Figure 1A and B). Using a single CPU per task, 32 tasks offered the highest rate of genes assembled per minute and the best CPU efficiency for both assemblers, as well as the best memory efficiency for ABySS (Trinity: 1.60 genes/min; ABySS: 2.29 genes/min; Figure 1C and D; Table 2). By manipulating the resources given to aTRAM, we obtained a gene assembly speedup of 9.10 (Trinity) and 12.00 (ABySS) times (Table 2). This reduced the assembly time of a set of 13 364 genes from 48.72 to 3.8 days. As seen in our analysis, increasing the number of CPUs does not necessarily improve efficiency of a program (Figure 1B). This becomes increasingly complicated when addressing additional components of HPC, such as RAM, permanent disk space, and threading. In general, we recommend running similar tests on other software to gain a general understanding of a program’s resource usage with an HPC system.
Our final data set included 499 426 genes and on average had a GC content below 50% (mean GC = 42.96%; SD = 6.53). For third codon positions only, we found a similar average GC3 to that found by Virrueta Herrera et al; 32 however, we found a higher average GC1 and lower average GC2. We also found higher variation in GC content (19.57%-73.33%) across all genes. One explanation for the increased GC content in our data set compared with Baldwin-Brown et al 43 is the bias toward easily aligned genes. We retained only those genes that could be identified by reciprocal best-hit BLAST to the pigeon louse genome. These genes are less likely to contain repetitive sequences, and repetitive genome features are known to be GC-poor. By only retaining easily aligned genes, we have likely removed GC-poor genes from the data set.
Our results suggest feather lice may have higher GC content in coding sequences compared with many other insects that have been investigated. For example, GC content of protein-coding genes from 2 species of parasitoid wasp is around 30% (Aphidius ervi and Lysiphlebus fabarum 36 ) and between 33% and 39% for the honeybee (Apis mellifera 62 ). Some species of insects do have similar GC content to feather lice, such as silkworms (Bombyx 63 ). Unfortunately, the genetic studies that focus on insects are overrepresented by a small number of model species, such as members of Drosophila and Lepidoptera,20,64 ignoring the enormous diversity of insects. To properly understand how GC content of feather lice compares to other insects, more studies are needed on non-model species.
For overall GC and GC3, we found a significant correlation between %GC for a species and SD among genes within that species (Figure 3), a pattern also seen in many vertebrates. 11 Species with higher values for GC and GC3 showed much more variation compared with species with lower GC and GC3. This could indicate that GC-rich genes are being selected for in some species, resulting in more GC variation across their genes. Alternatively, this could be a consequence of GC-biased gene conversion (gBGC), which favors the use of G and C bases. 65 This is particularly prominent in species with high rates of recombination. When an allelic mismatch occurs during the repair of a meiotic double-strand break, gBGC results in a biased frequency of G:C compared with A:T conversions, thus increasing the chance of GC substitution.10,66 It is thought that gBGC plays a significant role in the variation in GC content within genomes of some taxonomic groups. 67 Many studies have found evidence that suggests gBGC plays a prominent role in the GC-rich base composition of avian 68 and mammalian isochores.69,70 In addition, Pessia et al 71 examined the genomes of a broader range of organisms (Unikonts, Excavates, Chromalveolates, and Plantae) and found that gBGC is evident in most eukaryotic groups. It is important to note that disentangling gBGC from directional selection requires further analysis. Although the mechanism is different, both result in a biased increase in 1 allele. 72 Because feather lice have low GC content and high rates of substitution, it may be that gBGC is not a strong influence on GC content in this group and other factors play a larger role. Alternatively, these results could indicate a shift toward higher GC heterogeneity, where different mechanisms result in GC-rich and GC-poor regions of the genome (eg, 73 ).
Another prominent hypothesis that could also explain GC variation and evolution focuses on selection acting on the molecular machinery that results in nucleotide biases. Specifically, deamination of methyl-cytosine produces thymine, causing a T:G mismatch that requires repair. 74 Methyl-cytosine deamination and GC content create a positive feedback loop that can either decrease or increase GC content in an organism. 75 In addition, some studies have found a positive correlation between the GC content of transposable elements (TEs) and genome-wide GC content (eg, 76 ). GC-rich TEs can increase GC content within the region of insertion if that region has a lower %GC than the TE, and the same rule would follow for GC-poor TEs. However, parasitic lice have a relatively low percentage of TEs in their genomes (<5%).42,77,78 Interestingly, this fraction of genome-wide TEs is lower than that of silkworms (~40% of the genome made up of TEs 79 ), with which feather lice share a similar average GC content. 73 Continuing to investigate the mechanisms driving GC content and the evolutionary consequences of GC-rich or GC-poor genomes is critical for deepening our understanding of molecular evolution and, ultimately, species diversification.80,81 To determine the driving force behind changes in base composition in feather lice and other organisms, future research needs to focus on a combination of the ratio of non-synonymous to synonymous substitutions (dN/dS), 82 effective population size (Ne), 83 and codon usage bias. 84
Phylogenetic signal is the tendency of related organisms to exhibit similar ecological or genetic traits 85 because of phylogenetic relatedness alone. We estimated the phylogenetic signal of GC content in feather lice, which revealed that close relatives had more similar GC content than would be expected by chance. Specifically, local phylogenetic signal of GC content was detected for overall GC (Figure 4) and GC3 (Supplemental Figure 7). In addition, using Abouheif's Cmean, we identified positive global phylogenetic signal of mean GC and GC3 (GC: CM = 0.190, P = .020; GC3: CM = 0.194, P = .015). The LIPA analysis found 15 (GC) and 14 (GC3) species with significant autocorrelation, most of it positive. Thus, GC content is likely a feature of phylogenetically closely related lineages. Because of this, we can have more confidence that base composition alone will not bias the construction of phylogenetic trees, a common concern in systematics. 86
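As a rough illustration of how a global autocorrelation statistic of this kind behaves, the sketch below computes Moran's I for a trait over a matrix of pairwise proximities and obtains a one-sided P value by permuting trait values among tips. Abouheif's Cmean can be expressed as Moran's I computed with a particular matrix of phylogenetic proximities 59 ; the matrix used here is a hand-made, hypothetical stand-in (1 for sister tips, 0 otherwise), not that matrix, and the trait values are invented. The function names are also assumptions, not part of any package used in this study.

```python
import random


def morans_i(values, weights):
    """Moran's I of values given a square matrix of pairwise proximities."""
    n = len(values)
    centre = sum(values) / n
    dev = [v - centre for v in values]
    denom = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    if denom == 0 or w_sum == 0:
        raise ValueError("trait has no variance or weights sum to zero")
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    return (n / w_sum) * num / denom


def permutation_test(values, weights, n_perm=999, seed=0):
    """Observed Moran's I and one-sided P value from shuffling tip values."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Hypothetical example: 4 tips forming 2 sister pairs, (t1, t2) and (t3, t4).
sister_weights = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
clustered_gc = [0.40, 0.41, 0.50, 0.51]    # sisters share similar GC
alternating_gc = [0.40, 0.50, 0.41, 0.51]  # sisters differ in GC

obs, p = permutation_test(clustered_gc, sister_weights)
print("clustered:", round(obs, 3), "P =", p)
print("alternating:", round(morans_i(alternating_gc, sister_weights), 3))
```

When sister tips carry similar values the statistic is positive, and when they carry contrasting values it is negative, which is the sense in which a positive Cmean indicates that close relatives resemble each other. With only 4 tips the permutation P value is not informative; the toy case is meant only to show the sign of the statistic.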
Feather lice genera have been grouped into different ecomorphs based on similar morphological characteristics. These specialized phenotypes allow them to live in different areas of their avian host’s body to escape host defense mechanisms.24,87-89 Previous work found evidence of convergent evolution in these ecomorphs, but it is unknown whether each ecomorph type experiences the same selective pressures on the same genes or expresses similar genetic pathways. Signatures of convergent evolution have been found in other species that exhibit ecomorphs, which can tell us more about the genetic architecture of closely and distantly related organisms and the role selective pressures play in diversification.90,91 Furthermore, these data provide new opportunities for future studies to explore insect protein evolution and adaptive evolution between feather lice genera and species. 92
Scientists without a background in bioinformatics or computer science, however, often struggle to use the programs necessary for these evolutionary studies. Because this software often needs to be executed on HPC systems, researchers frequently hire specialists if they have available funds. This burden is much heavier on minorities, as they are less likely to receive funding.93,94 These programs need to be built in a way that is more accessible and resource-efficient when used by scientists at large. Future development should focus on improving programs’ ability to identify and allot available computational resources on various HPC architectures without the need for complex manipulation by the user. We have increased accessibility to aTRAM by developing a Singularity container that provides a simpler way to install and use the program (https://github.com/averygrant/atram_singularity). Our aTRAM container consists of all of the files and dependencies needed for running aTRAM, leading to fewer installation steps.
Genomic base composition is a fundamental feature of the genome of all organisms. Our results show that most feather louse protein-coding genes are GC-poor, with the greatest variation found in GC3. However, GC content varies considerably between species. On average, lice have a higher GC content than many other insects studied to date, although there is considerable variation in GC content among insect species. For example, feather lice GC content is similar to that of silkworms, 63 while being very different from that of the pea aphid (Acyrthosiphon pisum). 33 More detailed comparisons of GC content among insects will require examination of this feature at a genomic scale, ideally comparing orthologous genes. It is expected that GC content may vary between different gene regions,31,95 such as nuclear vs mtDNA96,97 and introns vs exons, 98 and this variation needs to be taken into account. As insect genomes become available for a greater diversity of species, more research can focus on untangling the impact of molecular mechanisms and selective pressures on GC-rich or GC-poor organisms and, ultimately, shed light on protein evolution in insects.
Supplemental Material
Supplemental material, sj-png-2-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Supplemental material, sj-png-3-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Supplemental material, sj-png-4-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Supplemental material, sj-png-5-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Supplemental material, sj-png-6-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Supplemental material, sj-png-7-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Supplemental material, sj-png-8-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Supplemental material, sj-xlsx-1-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights
Acknowledgments
The authors thank Sebastian Smith and John Anderson for their help with the computational components of this publication.
Footnotes
Author Contributions: ARG and JMA conceived and designed the study; ARG performed the formal analyses, created all visualizations, and wrote the original draft; all authors contributed to data collection, and reviewing and editing; ARG, JMA, and KPJ analyzed results; JMA supervised and funded the research; all authors approved the final manuscript.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by NSF grants 1925312 to JMA, NSF DEB 1925487 and DEB 1926919 to KPJ. The computational work in this publication was made possible by a grant from the National Institute of General Medical Sciences (GM103440) from the National Institutes of Health. The authors would like to acknowledge the support of Research & Innovation and the Cyberinfrastructure Team in the Office of Information Technology at the University of Nevada, Reno for facilitation and access to the Pronghorn High-Performance Computing Cluster.
ORCID iD: Avery R Grant
https://orcid.org/0000-0002-7623-4060
Supplemental Material: Supplemental material for this article is available online.
References
- 1. Wernegreen JJ. Ancient bacterial endosymbionts of insects: genomes as sources of insight and springboards for inquiry. Exp Cell Res. 2017;358:427-432. doi: 10.1016/J.YEXCR.2017.04.028
- 2. Du MZ, Zhang C, Wang H, Liu S, Wei W, Guo FB. The GC content as a main factor shaping the amino acid usage during bacterial evolution process. Front Microbiol. 2018;9:2948. doi: 10.3389/fmicb.2018.02948
- 3. Pellicer J, Hidalgo O, Dodsworth S, Leitch IJ. Genome size diversity and its impact on the evolution of land plants. Genes. 2018;9:88. doi: 10.3390/GENES9020088
- 4. du Toit Z, du Plessis M, Dalton DL, Jansen R, Grobler JP, Kotzé A. Mitochondrial genomes of African pangolins and insights into evolutionary patterns and phylogeny of the family Manidae. BMC Genomics. 2017;18:746. doi: 10.1186/s12864-017-4140-5
- 5. Luo H, Thompson LR, Stingl U, Hughes AL. Selection maintains low genomic GC content in marine SAR11 lineages. Mol Biol Evol. 2015;32:2738-2748. doi: 10.1093/molbev/msv149
- 6. Castillo AI, Nelson ADL, Lyons E. Tail wags the dog? Functional gene classes driving genome-wide GC content in Plasmodium spp. Genome Biol Evol. 2019;11:497-507. doi: 10.1093/gbe/evz015
- 7. Supple MA, Shapiro B. Conservation of biodiversity in the genomics era. Genome Biol. 2018;19:1-12. doi: 10.1186/S13059-018-1520-3/FIGURES/3
- 8. Cao J, Wu X, Jin Y. Lower GC-content in editing exons: implications for regulation by molecular characteristics maintained by selection. Gene. 2008;421:14-19. doi: 10.1016/j.gene.2008.05.012
- 9. Li X, Du D. Variation, evolution, and correlation analysis of C+G content and genome or chromosome size in different kingdoms and phyla. PLoS ONE. 2014;9:88339. doi: 10.1371/journal.pone.0088339
- 10. Duret L, Arndt PF. The impact of recombination on nucleotide substitutions in the human genome. PLoS Genet. 2008;4:e1000071. doi: 10.1371/journal.pgen.1000071
- 11. Romiguier J, Ranwez V, Douzery EJP, Galtier N. Contrasting GC-content dynamics across 33 mammalian genomes: relationship with life-history traits and chromosome sizes. Genome Res. 2010;20:1001-1009. doi: 10.1101/GR.104372.109
- 12. Matsubara K, Kuraku S, Tarui H, et al. Intra-genomic GC heterogeneity in sauropsids: evolutionary insights from cDNA mapping and GC 3 profiling in snake. BMC Genomics. 2012;13:604. doi: 10.1186/1471-2164-13-604
- 13. Mugal CF, Arndt PF, Holm L, Ellegren H. Evolutionary consequences of DNA methylation on the GC content in vertebrate genomes. G3. 2015;5:441-447. doi: 10.1534/G3.114.015545/-/DC1
- 14. Rao YS, Chai XW, Wang ZF, Nie QH, Zhang XQ. Impact of GC content on gene expression pattern in chicken. Genet Sel Evol. 2013;45:9. doi: 10.1186/1297-9686-45-9
- 15. Sémon M, Mouchiroud D, Duret L. Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance. Hum Mol Genet. 2005;14:421-427. doi: 10.1093/hmg/ddi038
- 16. Niu Z, Xue Q, Wang H, et al. Mutational biases and GC-biased gene conversion affect GC content in the plastomes of Dendrobium genus. Int J Mol Sci. 2017;18:2307. doi: 10.3390/ijms18112307
- 17. Eyre-Walker A, Hurst LD. The evolution of isochores. Nat Rev Genet. 2001;2:549-555. doi: 10.1038/35080577
- 18. Hildebrand F, Meyer A, Eyre-Walker A. Evidence of selection upon genomic GC-content in bacteria. PLoS Genet. 2010;6:e1001107. doi: 10.1371/JOURNAL.PGEN.1001107
- 19. Šmarda P, Bureš P, Horová L, et al. Ecological and evolutionary significance of genomic GC content diversity in monocots. Proc Natl Acad Sci USA. 2014;111:E4096-E4102. doi: 10.1073/pnas.1321152111
- 20. Hotaling S, Kelley JL, Frandsen PB. Toward a genome sequence for every animal: where are we now? Proc Natl Acad Sci USA. 2021;118:e2109019118. doi: 10.1073/pnas.2109019118
- 21. Matoulek D, Ježek B, Vohnoutová M, Symonová R. Advances in vertebrate (cyto)genomics shed new light on fish compositional genome evolution. Genes (Basel). 2023;14:244. doi: 10.3390/genes14020244
- 22. Grimaldi D, Engel MS. Evolution of the Insects. 1st ed. Cambridge University Press; 2005.
- 23. Kunin WE. Robust evidence of declines in insect abundance and biodiversity. Nature. 2019;574:641-642. doi: 10.1038/d41586-019-03241-9
- 24. Price RD, Hellenthal RA, Palma RL, Johnson KP, Clayton DH. Chewing Lice: World Checklist and Biological Overview. Illinois Natural History Survey; 2003.
- 25. Johnson KP, Shreve SM, Smith VS. Repeated adaptive divergence of microhabitat specialization in avian feather lice. BMC Biol. 2012;10:52. doi: 10.1186/1741-7007-10-52
- 26. Johnson KP, Allen JM, Olds BP, et al. Rates of genomic divergence in humans, chimpanzees and their lice. Proc R Soc B. 2014;281:20132174. doi: 10.1098/rspb.2013.2174
- 27. Clayton DH, Bush SE, Johnson KP. Coevolution of Life on Hosts: Integrating Ecology and History. University of Chicago Press; 2015.
- 28. de Moya RS, Allen JM, Sweet AD, et al. Extensive host-switching of avian feather lice following the Cretaceous-Paleogene mass extinction event. Commun Biol. 2019;2:445. doi: 10.1038/s42003-019-0689-7
- 29. Huttener R, Thorrez L, In’t Veld T, et al. GC content of vertebrate exome landscapes reveal areas of accelerated protein evolution. BMC Evol Biol. 2019;19:144. doi: 10.1186/s12862-019-1469-1
- 30. Johnston JS, Yoon KS, Strycharz JP, Pittendrigh BR, Clark JM. Body lice and head lice (Anoplura: Pediculidae) have the smallest genomes of any hemimetabolous insect reported to date. J Med Entomol. 2007;44:1009-1012. doi: 10.1603/0022-2585
- 31. Yoshizawa K, Johnson KP. Changes in base composition bias of nuclear and mitochondrial genes in lice (Insecta: Psocodea). Genetica. 2013;141:491-499. doi: 10.1007/s10709-013-9748-z
- 32. Virrueta Herrera S, Sweet AD, Allen JM, Walden KKO, Weckstein JD, Johnson KP. Extensive in situ radiation of feather lice on tinamous. Proc R Soc B Biol Sci. 2020;287:20193005. doi: 10.1098/rspb.2019.3005
- 33. International Aphid Genomics Consortium. Genome sequence of the pea aphid Acyrthosiphon pisum. PLoS Biol. 2010;8:e1000313. doi: 10.1371/journal.pbio.1000313
- 34. Xue J, Zhou X, Zhang CX, et al. Genomes of the rice pest brown planthopper and its endosymbionts reveal complex complementary contributions for host adaptation. Genome Biol. 2014;15:521. doi: 10.1186/s13059-014-0521-0
- 35. Wang L, Tang N, Gao X, et al. Genome sequence of a rice pest, the white-backed planthopper (Sogatella furcifera). Gigascience. 2017;6:1-9. doi: 10.1093/GIGASCIENCE/GIW004
- 36. Dennis AB, Ballesteros GI, Robin S, et al. Functional insights from the GC-poor genomes of two aphid parasitoids, Aphidius ervi and Lysiphlebus fabarum. BMC Genomics. 2020;21:376. doi: 10.1186/s12864-020-6764-0
- 37. Sweet AD, Johnson KP, Cao Y, et al. Structure, gene order, and nucleotide composition of mitochondrial genomes in parasitic lice from Amblycera. Gene. 2021;5:145312. doi: 10.1016/j.gene.2020.145312
- 38. Dominguez Del Angel V, Hjerde E, Sterck L, et al. Ten steps to get started in genome assembly and annotation. F1000Res. 2018;7:149. doi: 10.12688/f1000research.13598.1
- 39. Tavares WFC, Roberto Miranda Assis M, Borin E. Quantifying detecting HPC resource wastage in cloud environments. Paper presented at: 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW); October 26-29, 2021:41-46; Belo Horizonte. doi: 10.1109/SBAC-PADW53941.2021.00017
- 40. Allen JM, Boyd B, Nguyen N-P, et al. Phylogenomics from whole genome sequences using aTRAM. Syst Biol. 2017;66:786-798. doi: 10.1093/sysbio/syw105
- 41. Allen JM, Huang DI, Cronk QC, Johnson KP. ATRAM—automated target restricted assembly method: a fast method for assembling loci across divergent taxa from next-generation sequencing data. BMC Bioinformatics. 2015;16:98. doi: 10.1186/s12859-015-0515-2
- 42. Allen JM, LaFrance R, Folk RA, Johnson KP, Guralnick RP. aTRAM 2.0: an improved, flexible locus assembler for NGS data. Evol Bioinform Online. 2018;14:1176934318774546. doi: 10.1177/1176934318774546
- 43. Baldwin-Brown JG, Villa SM, Vickrey AI, et al. The assembled and annotated genome of the pigeon louse Columbicola columbae, a model ectoparasite. G3. 2021;11:jkab009. doi: 10.1093/g3journal/jkab009
- 44. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114-2120. doi: 10.1093/bioinformatics/btu170
- 45. Grabherr MG, Haas BJ, Yassour M, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644-652. doi: 10.1038/nbt.1883
- 46. Jackman SD, Vandervalk BP, Mohamadi H, et al. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res. 2017;27:768-777. doi: 10.1101/gr.214346.116
- 47. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31
- 48. R Core Team. R: a language and environment for statistical computing. R J. https://www.R-project.org/
- 49. Ho LST, Ané C. A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Syst Biol. 2014;63:397-408. doi: 10.1093/sysbio/syu005
- 50. Bofkin L, Goldman N. Variation in evolutionary processes at different codon positions. Mol Biol Evol. 2007;24:513-521. doi: 10.1093/molbev/msl178
- 51. Kliman RM, Bernal CA. Unusual usage of AGG and TTG codons in humans and their viruses. Gene. 2005;352:92-99. doi: 10.1016/j.gene.2005.04.001
- 52. Palidwor GA, Perkins TJ, Xia X. A general model of codon bias due to GC mutational bias. PLoS ONE. 2010;5:e13431. doi: 10.1371/journal.pone.0013431
- 53. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526-528. doi: 10.1093/bioinformatics/bty633
- 54. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer; 2009.
- 55. Keck F, Rimet F, Bouchez A, Franc A. Phylosignal: an R package to measure, test, and explore the phylogenetic signal. Ecol Evol. 2016;6:2774-2780. doi: 10.1002/ece3.2051
- 56. Hackathon R, Bolker B, Butler M, et al. Phylobase: base package for phylogenetic structures and comparative data. R package version 0.8.10. Published 2010. https://github.com/fmichonneau/phylobase
- 57. Gittleman JL, Kot M. Adaptation: statistics and a null model for estimating phylogenetic effects. Syst Biol. 1990;39:227-241. doi: 10.2307/2992183
- 58. Abouheif E. A method for testing the assumption of phylogenetic independence in comparative data. Evol Ecol Res. 1999;1:895-909.
- 59. Pavoine S, Ollier S, Pontier D, Chessel D. Testing for phylogenetic signal in phenotypic traits: new matrices of phylogenetic proximities. Theor Popul Biol. 2008;73:79-91. doi: 10.1016/J.TPB.2007.10.001
- 60. Münkemüller T, Lavergne S, Bzeznik B, et al. How to measure and test phylogenetic signal. Methods Ecol Evol. 2012;3:743-756. doi: 10.1111/J.2041-210X.2012.00196.X
- 61. Ristov S, Prodan R, Gusev M, Skala K. Superlinear speedup in HPC systems: why and when? FedCSIS. 2016;8:889-898. doi: 10.15439/2016F498
- 62. Qi W, Yan C, Li W, et al. Distinct patterns of simple sequence repeats and GC distribution in intragenic and intergenic regions of primate genomes. Aging. 2016;8:2635-2654. doi: 10.18632/aging.101025
- 63. Zhou QZ, Fang SM, Zhang Q, Yu QY, Zhang Z. Identification and comparison of long non-coding RNAs in the silk gland between domestic and wild silkworms. Insect Sci. 2018;25:604-616. doi: 10.1111/1744-7917.12443
- 64. Kyriacou RG, Mulhair PO, Holland PWH. GC content across insect genomes: phylogenetic patterns, causes and consequences. J Mol Evol. 2024;92:138-152.
- 65. Eyre-Walker A. Recombination and mammalian genome evolution. Proc R Soc B Biol Sci. 1993;252:237-243. doi: 10.1098/RSPB.1993.0071
- 66. Webster MT, Hurst LD. Direct and indirect consequences of meiotic recombination: implications for genome evolution. Trends Genet. 2012;28:101-109. doi: 10.1016/j.tig.2011.11.00
- 67. Duret L, Galtier N. Biased gene conversion and the evolution of mammalian genomic landscapes. Annu Rev Genomics Hum Genet. 2009;10:285-311. doi: 10.1146/annurev-genom-082908-150001
- 68. Bolívar P, Mugal CF, Nater A, Ellegren H. Recombination rate variation modulates gene sequence evolution mainly via GC-biased gene conversion, not Hill-Robertson interference, in an avian system. Mol Biol Evol. 2015;33:216-227. doi: 10.1093/MOLBEV/MSV214
- 69. Galtier N. Gene conversion drives GC content evolution in mammalian histones. Trends Genet. 2003;19:65-68. doi: 10.1016/S0168-9525(02)00002-1
- 70. Kudla G, Helwak A, Lipinski L. Conversion and GC-content evolution in mammalian Hsp70. Mol Biol Evol. 2004;21:1438-1444. doi: 10.1093/molbev/msh146
- 71. Pessia E, Popa A, Mousset S, Rezvoy C, Duret L, Marais GA. Evidence for widespread GC-biased gene conversion in eukaryotes. Genome Biol Evol. 2012;4:675-682. doi: 10.1093/gbe/evs052
- 72. Borges R, Szöllősi GJ, Kosiol C. Quantifying GC-biased gene conversion in great ape genomes using polymorphism-aware models. Genetics. 2019;212:1321-1336. doi: 10.1534/genetics.119.302074
- 73. Jørgensen FG, Schierup MH, Clark AG. Heterogeneity in regional GC content and differential usage of codons and amino acids in GC-poor and GC-rich regions of the genome of Apis mellifera. Mol Biol Evol. 2007;24:611-619. doi: 10.1093/MOLBEV/MSL190
- 74. Shen JC, Rideout WM, 3rd, Jones PA. The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res. 1994;22:972-976. doi: 10.1093/nar/22.6.972
- 75. Fryxell KJ, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol. 2000;17:1371-1383. doi: 10.1093/oxfordjournals.molbev.a026420
- 76. Symonová R, Suh A. Nucleotide composition of transposable elements likely contributes to AT/GC compositional homogeneity of teleost fish genomes. Mob DNA. 2019;10:49. doi: 10.1186/s13100-019
- 77. Kirkness EF, Haas BJ, Sun W, et al. Genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle. Proc Natl Acad Sci USA. 2010;107:12168-12173. doi: 10.1073/pnas.1003379107
- 78. Xu Y, Ma L, Liu S, et al. Chromosome-level genome of the poultry shaft louse Menopon gallinae provides insight into the host-switching and adaptive evolution of parasitic lice. Gigascience. 2024;13:1-15. doi: 10.1093/gigascience/giae004
- 79. Xu HE, Zhang HH, Xia T, Han MJ, Shen YH, Zhang Z. BmTEdb: a collective database of transposable elements in the silkworm genome. Database (Oxford). 2013;2013:bat055. doi: 10.1093/database/bat055
- 80. Figuet E, Ballenghien M, Romiguier J, Galtier N. Biased gene conversion and GC-content evolution in the coding sequences of reptiles and vertebrates. Genome Biol Evol. 2015;7:240-250. doi: 10.1093/gbe/evu277
- 81. Trávníček P, Čertner M, Ponert J, Chumová Z, Jersáková J, Suda J. Diversity in genome size and GC content shows adaptive potential in orchids and is closely linked to partial endoreplication, plant life-history traits and climatic conditions. New Phytol. 2019;224:1642-1656. doi: 10.1111/nph.15996
- 82. Bolívar P, Guéguen L, Duret L, Ellegren H, Mugal CF. GC-biased gene conversion conceals the prediction of the nearly neutral theory in avian genomes. Genome Biol. 2019;20:5. doi: 10.1186/s13059-018-1613-z
- 83. Gossmann TI, Keightley PD, Eyre-Walker A. The effect of variation in the effective population size on the rate of adaptive molecular evolution in eukaryotes. Genome Biol Evol. 2012;4:658-667. doi: 10.1093/gbe/evs027
- 84. Clément Y, Sarah G, Holtz Y, et al. Evolutionary forces affecting synonymous variations in plant genomes. PLoS Genet. 2017;13:e1006799. doi: 10.1371/journal.pgen.1006799
- 85. Jombart T, Pavoine S, Devillard S, Pontier D. Putting phylogeny into the analysis of biological traits: a methodological approach. J Theor Biol. 2010;264:693-701. doi: 10.1016/j.jtbi.2010.03.038
- 86. Van Den Bussche RA, Baker RJ, Huelsenbeck JP, Hillis DM. Base compositional bias and phylogenetic analyses: a test of the “flying DNA” hypothesis. Mol Phylogenet Evol. 1998;10:408-416. doi: 10.1006/mpev.1998.0531
- 87. Clay T. Some problems in the evolution of a group of ectoparasites. Evolution. 1949;3:279-299. doi: 10.2307/2405715
- 88. Bush SE, Kim D, Reed M, Clayton DH. Evolution of cryptic coloration in ectoparasites. Am Nat. 2010;176:529-535. doi: 10.1086/656269
- 89. Kolencik S, Stanley EL, Punnath A, et al. Parasite escape mechanisms drive morphological diversification in avian lice. Proc Biol Sci. 2024;29:20232665. doi: 10.1098/rspb.2023.2665
- 90. Brewer MS, Carter RA, Croucher PJ, Gillespie RG. Shifting habitats, morphology, and selective pressures: developmental polyphenism in an adaptive radiation of Hawaiian spiders. Evolution. 2015;69:162-178. doi: 10.1111/evo
- 91. McGlothlin JW, Kobiela ME, Wright HV, Kolbe JJ, Losos JB, Brodie ED, 3rd. Conservation and convergence of genetic architecture in the adaptive radiation of Anolis lizards. Am Nat. 2022;200:E207-E220. doi: 10.1086/721091
- 92. Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005;39:309-338. doi: 10.1146/annurev.genet.39.073003.114725
- 93. Ginther DK, Schaffer WT, Schnell J, et al. Race, ethnicity, and NIH research awards. Science. 2011;333:1015-1019. doi: 10.1126/science.1196783
- 94. Carnethon MR, Kershaw KN, Kandula NR. Disparities research, disparities researchers, and health equity. JAMA. 2020;323:211-212. doi: 10.1001/jama.2019.19329
- 95. Sparks ME, Hebert FO, Johnston JS, et al. Sequencing, assembly and annotation of the whole-insect genome of Lymantria dispar dispar, the European gypsy moth. G3. 2021;11:jkab150. doi: 10.1093/g3journal/jkab150
- 96. Sadílek D, Urfus T, Vilímová J, Hadrava J, Suda J. Nuclear genome size in contrast to sex chromosome number variability in the human bed bug, Cimex lectularius (Heteroptera: Cimicidae). Cytometry A. 2019;95:746-756. doi: 10.1002/cyto.a.23729
- 97. Li X, Yan L, Pape T, Gao Y, Zhang D. Evolutionary insights into bot flies (Insecta: Diptera: Oestridae) from comparative analysis of the mitochondrial genomes. Int J Biol Macromol. 2020;149:371-380. doi: 10.1016/j.ijbiomac.2020.01.249
- 98. Amit M, Donyo M, Hollander D, et al. Differential GC content between exons and introns establishes distinct strategies of splice-site recognition. Cell Rep. 2012;1:543-556. doi: 10.1016/j.celrep.2012.03.013
Supplemental material, sj-xlsx-1-bbi-10.1177_11779322241257991 for Rapid Targeted Assembly of the Proteome Reveals Evolutionary Variation of GC Content in Avian Lice by Avery R Grant, Kevin P Johnson, Edward L Stanley, James Baldwin-Brown, Stanislav Kolenčík and Julie M Allen in Bioinformatics and Biology Insights