Abstract
Lateral gene transfer (LGT) from microbial symbionts to invertebrate animals is described at an increasing rate, particularly between Wolbachia endosymbionts and their diverse invertebrate hosts. We sought to assess the use of a capture system to cost-effectively sequence such LGT from the host genome. The sequencing depth of Illumina paired end data obtained with a Wolbachia capture system correlated well with that for an Illumina paired end data set used to detect LGT in Wolbachia-depleted B. malayi (p-value: <2e-16). Using a sequencing depth threshold of two or three standard deviations above the mean, 96.9% or 96.7% of positions, respectively, are predicted in the same manner between the two datasets, with 24.7% or 42.5% of the known 49.0 kbp of LGT sequence predicted correctly, respectively. Prior qPCR results for nuwts showed similar correlations for both datasets supporting our conclusion that oligonucleotide-based capture methods can be used to obtain sequences from Wolbachia-host LGT. However, at least 121 positions had a minority of the reads supporting the endosymbiont reference base call using the capture data, illustrating that sequence reads from endosymbiont-host LGTs can confound endosymbiont genome projects, erroneously altering the called consensus genome, a problem that is irrespective to the sequencing technology or platform.
Introduction
The transfer of DNA between diverse organisms through lateral gene transfer (LGT) can allow organisms to acquire novel functional genes. Although most described LGT events occur within a single domain of life, LGT of functional genes or genetic elements has also been described between different domains of life such as bacteria and eukaryotes, including animals. For example, Bartonella henselae can naturally transfer its plasmid via the Type IV secretion system to human cells1. Bdelloid rotifers (small freshwater invertebrates that lack sexual reproduction) contain DNA from bacteria, fungi, and plants2. Hypothenemus hampei, the coffee berry borer, acquired a Bacillus mannanase gene3. Likewise, several plant parasitic nematodes have acquired plant cell wall-degrading enzymes from bacteria including cellulases that allow the nematodes to invade plant tissues4–6.
Numerous cases of LGT have been detected between Wolbachia endosymbionts and the genomes of diverse invertebrate taxa that are the hosts of these endosymbionts7–22. Wolbachia endosymbionts colonize a wide range of arthropods and filarial nematodes, including ~40% of insect species23–25. Like mitochondria, Wolbachia cells are maternally inherited and transferred through the egg cytoplasm23, 24, providing ample opportunity for LGT of bacterial genes to eukaryotic genomes. We have adopted the nomenclature of nuwts for such nuclear Wolbachia transfers following the existing standard for nuclear mitochondrial transfers (numts) and nuclear plastid transfers (nupts).
In 2001, Kondo et al. first described a nuwt in the bean beetle Callosobruchus chinensis using X-linked inheritance and inverse PCR12. Numerous Wolbachia genomes have been integrated into chromosome 4 of the D. ananassae Hawaii 2L chromosome constituting >2% of the fly genome and 20% of chromosome 426, with at least 28 D. ananassae genes of Wolbachia origin being transcribed, albeit at low levels7. Nuwts were detected in four lines of D. ananassae from Asia and the Pacific indicating that the nuwt in D. ananassae may be widely distributed7. In 2007, most (8/11) of the genome sequencing projects of invertebrates that harbored Wolbachia showed evidence of having LGT between the endosymbiont genome and the host chromosomes7 suggesting that nuwts are common. Nuwts were experimentally confirmed in all five of the hosts examined further7.
In nematodes, nuwts have been widely identified. One stronglyoidean nematode22 and all filarial nematodes examined to date have nuwts including Brugia malayi 7, Brugia pahangi 7, Brugia timori 7, Acanthocheilonema viteae 14, Acanthocheilonema spirocauda 27, Onchocerca volvulus 8, Onchocerca flexuosa 14, Dirofilaria immitis 7, and Loa loa 28. The stronglyoidean nematode as well as four filarial nematodes have nuwts despite currently lacking the Wolbachia endosymbiont14, 22, 28, demonstrating that these nematodes were once infected and have lost their endosymbiont. In filarial nematodes, one nuwt likely predates the divergence of O. volvulus and Onchocerca ochengi 8 while another predates the divergence of B. malayi, B. pahangi, and B. timori 7. In O. flexuosa, a nematode that today lacks Wolbachia, 97 nuwts were identified through transcriptome sequencing with some nuwt sequences being co-transcribed with nematode genes29. Proteomic analysis resulted in the identification of three peptides that map to two Wolbachia ABC transport-related proteins29. In a subsequent study, anti-sense probes to nuwt transcripts coding for HlyD, aminopeptidase P, and a hypothetical protein showed in situ hybridization based-labeling of the lateral chords and intestine of both sexes, the hypodermis, the empty uteri of young females, and the testis and developing sperm of males, while sense probes (negative controls) showed no labeling30. Immunohistochemical labeling of worm sections by antibodies raised against these peptides gave broadly similar results, indicative of translation30.
Despite these observations, detection of nuwts in filarial nematode genomes is still not routine. In arthropods, where Wolbachia is not an obligate mutualist, nuwts can be identified by sequencing insects treated with an antibiotic for multiple generations, such that the Wolbachia endosymbiont is eliminated and sequences generated are from the nuwts. However, this is not possible in filarial nematodes where Wolbachia endosymbionts are essential for nematode survival. In a model filarial nematode system, Brugia malayi, we demonstrated that nuwts could be detected from DNA collected from nematodes prior to the worm dying from an antibiotic treatment used to severely deplete Wolbachia abundance17, but this depletion strategy is not possible in all filarial systems as many are not routinely cultured in the laboratory.
Capture-based whole genome sequencing of Wolbachia endosymbionts has been successfully demonstrated for Wolbachia endosymbionts of filarial nematodes31 and arthropods31 and their phage32. Wolbachia genomes range in size from ~0.9–1.5 Mb. The generally smaller genome sizes of Wolbachia from nematodes is largely due to an absence of phage WO sequences, a lower level of repetitive DNA and a small number of gene losses33. To account for different genome sizes and DNA content across the most intensively studied Wolbachia supergroups, capture baits were designed on 11 complete Wolbachia genome sequences from supergroups A, B, C, and D available in the NCBI databases at the time of design. Approximately 215,000 oligonucleotide baits (120-mers) were tiled across these genome sequences with 60 base overlaps31. The entire genome of wBm, the Wolbachia endosymbiont of B. malayi that was used in the capture bait design, was captured when total B. malayi DNA was used as a test case31. Similarly, >95% of the Wolbachia DNA was captured from total DNA isolated from the pill bug Armadillidium vulgare, which contains an endosymbiont whose genome was not used in the design31.
Here, we present data that the baits also capture nuwts through a comparison of two existing, published Illumina paired-end data sets for B. malayi 17, 31. The first data set, which is referred to as the depletion data, includes >138 million reads from a paired end library constructed directly from DNA from tetracycline-treated, and thus Wolbachia-depleted, B. malayi filarial nematodes17 (SRX142902). In this way, it is a gold standard for detecting nuwts in B. malayi since the Wolbachia endosymbiont genome is depleted to every extent possible, enabling the accurate detection of nuwts. The second data set, which is referred to as the capture data, includes >91 million reads from a paired end library constructed from B. malayi DNA where the Wolbachia endosymbionts were not depleted, but Wolbachia sequences were selectively sequenced after capture using Agilent Sure Select RNA baits (SRX1057997), as previously described31. The capture system enables a greater recovery of Wolbachia reads from both the endosymbiont genome and nuwts in the host genome. Such enrichment of Wolbachia sequences leads to a decrease in sequencing cost and/or an increase in sequencing depth. The capture dataset used here contains the Wolbachia sequences from both the endosymbiont and nematode genomes, which were sequenced simultaneously, but reads could alternatively be captured from endosymbiont-depleted samples in insects. While Wolbachia levels can sometimes be depleted, in many cases the levels of Wolbachia endosymbionts cannot be manipulated, and therefore the capture data set is derived from a DNA sample that represents a frequent DNA sample in filarial nematode genomics.
We rely on sequencing depth to identify and compare nuwts in both data sets as described in prior published work on B. malayi nuwts, where a subset were validated by qPCR and an analysis of SNPs17. Using this approach we are able to identify some, but not all, nuwts. We also find that nuwt sequences can inadvertently alter the consensus genome sequence of the endosymbiont relative to the actual genome sequence of the endosymbiont, a problem not unique to the use of a capture-based system.
Results
Identifying nuwts in the presence of sequences from a Wolbachia endosymbiont genome
The two datasets being compared differ in two important ways: (a) absolute sequencing depth (where capture depletion) and (b) the ratio of the sequencing depth for a Wolbachia sequence found as a nuwt relative to the same sequence found in the endosymbiont genome (where depletion capture) (Fig. 1). More specifically, the entire 90 Mbp Brugia genome was sequenced in the depletion dataset to a sequencing depth of only 100–150X which is lower than the 2000–3000X sequencing depth in the capture dataset across the Wolbachia genome. A direct comparison of the coverage on the Brugia genome or the Wolbachia genome is not possible given that the depletion data has the Wolbachia endosymbiont sequencing reads significantly reduced while the capture data has the Brugia sequencing reads reduced. In the depletion data, since few reads are derived from the Wolbachia endosymbiont genome, when all reads are mapped to the endosymbiont genome, we expect clearly delineated peaks in the sequencing depth that correspond to nuwts and the height of the peak is expected to be proportional to the number of copies of that nuwt (Fig. 1A). In the capture data we expect these peaks still occur, but they are added to a background sequencing depth from the reads derived from the Wolbachia endosymbiont genome (Fig. 1B). As a result of these two factors, the capture data has a much larger sequencing depth, but the antibiotic depletion of Wolbachia in the depletion data leads to higher ratio differences in sequencing depth. Despite these differences, the sequencing depth for each nt position in the wBm genome correlates well between the capture data and the depletion data (Fig. 2, p-value: <2e-16, R-squared = 0.307). Therefore, we expect that regions of increased sequencing depth in the capture data correspond to multiple copies of Wolbachia sequences present in the host’s genome as multi-copy nuwts.
While a correlation is observed between the sequencing depth values for the capture data and those for the depletion data, as described above, such a correlation does not necessarily imply that sequencing depth can accurately predict nuwts. There are three factors that will influence the ability to predict nuwts using the capture data: (1) the copy number of the nuwt, (2) the relative abundance of Wolbachia chromosomes compared to the nematode chromosomes, and (3) the presence of suitable oligonucleotides in the capture pool. Factors 1 and 2 are interdependent. For example, if there are four Wolbachia genome equivalents for every nematode genome equivalent in the sample, then it may only be possible to predict nuwts that have at least four copies per nuclear genome, resulting in a 2-fold increase in sequencing depth for diploid organisms. However, the third factor is independent as it relates to how the oligonucleotide capture pool was designed. In this case, the major concern is the known removal from the capture pool of baits corresponding to reference nematode sequences, which included some nuwts. The exclusion of nuwts will lead to gaps in the tiling of the Wolbachia genome, although these missing regions may be partially or completely captured by probes to adjacent sequences in a fragment. With regards to all three of these factors, we expect this depletion data set to better predict nuwts than this capture data set. However, as discussed in the Background, Wolbachia-depletion is not always feasible.
Previously17, we described the limits of applying statistical tests and models to predict nuwts given that the sequencing depth distribution of the depletion data was not bell-shaped and did not have a unimodal distribution (Fig. 3A, Table 1). In that study, for detection of nuwts, we were able to justify an empirically derived threshold of 16x sequencing depth through visual comparison of the sequencing depth distributions of the B. malayi sequencing data and the Wolbachia data17. With the oligonucleotide capture system presented here, sequencing data from the B. malayi genome lacking homology to Wolbachia sequence is not available for such an analysis. However, the sequencing depth distribution of the capture data appears to have a bell-shaped distribution (Fig. 3B) with similar values for the mean and median suggesting it is unimodal, or nearly unimodal (Table 1). This is most likely the result of the increased sequencing depth across the entire Wolbachia genome and the smaller relative difference in sequencing depth between any single copy Wolbachia genes and any single copy nuwts. Given that the sequencing depth distribution appears to be bell-shaped, we can use the standard deviation to set a threshold for distinguishing nuwts. Using two or three standard deviations, the threshold for detecting nuwts would be 5327x or 6348x, respectively (Fig. 3B).
Table 1.
Depletion (SRX142902) | Capture (SRX1057997) | |
---|---|---|
Million read pairs sequenced | 69.3 | 45.9 |
Bases (Gbp) | 13.7 | 4.6 |
Read length (bp) | 99 | 50 |
Median insert size ± absolute deviation (bp) | 206 ± 39 | 180 ± 44 |
Reads pairs mapped to Wolbachia genome | 76,596 | 35,843,198 |
Mean sequencing depth ± standard deviation | 13.0 ± 65.1 | 3286 ± 1020 |
Median sequencing depth | 4 | 3271 |
Maximum sequencing depth | 1,613 | 14,305 |
Figure 2 can then be divided into quadrants based on these thresholds and the one previously established for the depletion data. The upper right quadrant (UR) will contain the data for nuwts that were predicted with both datasets while the lower left (LL) quadrant contains the data for regions predicted to not include nuwts for both datasets. The remainder of the data points are found in the upper left quadrant (UL) and lower right quadrant (LR). The UL should contain regions only predicted as nuwts using the Wolbachia-enriched capture data while the LR contains nuwts only predicted using the Wolbachia-depleted data. Of note, the three factors identified above, which could influence our predictions (i.e. nuwt copy number, Wolbachia-nematode relative abundance, and suitable probes), will all lead to data appearing in the LR; we could not identify any factors that would lead to data appearing in the UL.
To examine how the threshold for predicting nuwts effects our observations and the placement of data in these three quadrants, we sought to examine the difference between using two and three standard deviations as a threshold cutoff for the capture data. If three standard deviations are used yielding a threshold of 6348x, then 1,042,854 of 1,080,085 genomic positions (96.6%) are predicted in the same manner between the two datasets. However, while 49,049 nucleotide positions were predicted as being in nuwts in the depletion data, only 12,122 such positions are predicted with the capture data. This suggests that a three standard deviation threshold may be too stringent. By comparison, if two standard deviations are used, yielding a threshold of 5327x, then 1,045,613 of 1,080,085 positions (96.8%) are predicted in the same manner with the two datasets. However, while 49,049 nucleotide positions were predicted as being in nuwts in the depletion data, 20,861 such positions are predicted with the capture data. Therefore, while the lower threshold increases the overall similar predictions by only a small percentage (96.7% v. 96.9%), it nearly doubles the number of predictions of nucleotides in nuwts that are similar to the depletion data (24.7% v. 42.5%). This might argue in favor of the less stringent threshold.
However, a different picture emerges upon examining the quadrants. The more stringent three standard deviation threshold places 37,079 points in the LR quadrant, whereas the less stringent threshold places 31,330 points in this area (Table 2). However, the number of predictions in the UL increases almost 20-fold, going from 152 with the more stringent threshold to 3,142 with the less stringent threshold (Table 2). Data points in the UL are not easily attributed to any given factor of the experimental design, so it might be desirable to minimize them, arguing in favor of the more stringent threshold. Clearly there are trade-offs that need to be balanced in assigning the threshold. Ultimately, the best threshold employed should be based on the hypothesis being tested and further downstream analyses that are undertaken, including the manner in which the predictions are validated.
Table 2.
Positions with ≥16X Sequencing Depth in Depletion Data | Positions with <16X Sequencing Depth Depletion Data | |
---|---|---|
Positions with ≥6348X in the Capture Data | 11,970 | 152 |
Positions with <6348X in the Capture Data | 37,079 | 1,030,884 |
Positions with ≥5327X in the Capture Data | 17,719 | 3,142 |
Positions with <5327X in the Capture Data | 31,330 | 1,027,894 |
Contribution of factors resulting in under-identifying nuwts
As described above, we expected to have points in the LR quadrant due to three factors. The third factor, that the oligonucleotide capture design did not include some nuwt sequences, would result in nuwts predicted by the depletion data, but having very low sequencing depth in the capture data. There are 13,839 data points that are more than two standard deviations below the mean in the capture data, or <1246x sequencing depth, suggesting absence of a probe (factor 3). Of those, ~16%, or 2,225 data points are from regions above the threshold for predicting nuwts using the depletion data, meaning the two data sets are in conflict. These positions were found to occur in 14 regions of the Wolbachia genome. For all of these regions, we confirmed that probes had been removed from the capture design due to homology to a nuwt in a filarial nematode genome. Therefore, the decrease in sequencing depth is likely due to the absence of suitable probes in the capture design.
SNPs identify LGTs
We expect the sequence reads mapping to wBm in the capture data to contain some proportion of SNPs, given that both nuwts and the endosymbiont genome will have accumulated mutations following transfer, particularly nuwts that are no longer under selection. In some cases, the base call for the endosymbiont genome may not even be the dominant base call due to the high copy number of some nuwts in the B. malayi genome17. There are four factors that influence the presence and extent of variation in the sequence reads: (a) differences between the endosymbiont consensus genome being sequenced and the reference endosymbiont genome, (b) heterogeneity in the genome of the Wolbachia endosymbiont population sequenced, (c) the presence of a SNP in the nuwt at a detectable level, and (d) the presence of SNPs in nuwt paralogs.
Differences between the endosymbiont genome queried and the reference genome sequenced would be observed as homogenous variation that is present in almost 100% of the underlying reads. Across the entire reference 1.08 Mbp wBm genome34, there are 17 nt positions (0.0016%) where >99.9% of the sequencing reads obtained from the oligonucleotide capture system support an alternate base call relative to the reference wBm genome. Eight of these positions had an alternate base call supported by 100% of reads while nine positions had an alternate base call supported by >99% of the reads, indicative of a sequencing error in a single read. This indicates that either (a) there are sequencing errors in the reference genome at these 17 positions or (b) there is sequence variation between the endosymbionts used to generate the oligonucleotide capture data set and those used to generate the BACs for sequencing the reference wBm genome. These positions are located in six regions of the genome, including in a portion of an intergenic region at wBm coordinates 678, 738–678, 827; a tryptophanyl-tRNA synthetase at 692, 659–692, 669; a 4-hydroxybenzoate polyprenyltransferase at 695, 065–695, 066; a region containing nine putative genes at 832, 790–838, 394; DNA polymerase I at 948, 208–949, 008; and a hypothetical protein at 951,478. This analysis is not possible for the antibiotic-depleted B. malayi since the non-nuwt regions on the wBm genome typically had <5X sequencing depth.
The other three factors (i.e. Wolbachia population heterogeneity within the nematodes sequenced, nuwt SNPs, and nuwt paralog SNPs) would result in heterogeneous variation at a specific position. In the case of variation in the Wolbachia endosymbiont population, the position should have total sequence read depth similar to the mean. Using either the second or third deviations above the mean as a cutoff, as described above, we find a higher proportion of positions with >20% variation in the underlying reads for putative nuwts with increased sequencing depth than those regions with sequencing depth values predictive of placement in the endosymbiont genome. For the positions with sequencing depth values within two standard deviations of the mean, we find only 0.06% have >20% variation in the underlying reads, a number that may be artificially inflated since nuwts are under-predicted using either of these standard deviation thresholds. In comparison, for regions predicted to contain nuwts with either the two standard deviation or the three standard deviation threshold we find 1.8% of positions have >20% variation in the underlying reads. For the 12,122 positions that have capture data with ≥6,348x sequencing depth, 121 positions have >50% of the reads underlying that position supporting a variant base, meaning a majority of the reads support an alternate base call relative to the reference. Yet the reference base call was always supported by >18% of the underlying capture reads for these 12,122 positions. This suggests that the reference base call was correct but that a majority of the sequence arose from higher sequencing depth of Wolbachia sequences from the nuclear genome.
We also sought to examine the converse – if genetically variable positions had increased sequencing depth. There are 1,020 positions with >20% variation in the underlying reads. These positions have an average sequencing depth of 5139X with a standard deviation of 1976X with values ranging from 324X to 12,865X and a median sequencing depth of 4849X. Since sequencing depth of 5327x corresponds to 2 standard deviations above the mean sequencing depth, it follows that nearly half of the genetically variable sites had a sequencing depth value that was more than 2 standard deviations higher than the average sequencing depth across the genome. As such, it is likely that SNPs alone can be used to predict some nuwts in at least some hosts of Wolbachia endosymbionts.
Comparison with prior validation results
Nuwts in B. malayi have been described and validated twice previously7, 17. The first validation was of a subset of nuwts detected in the assembled B. malayi genome7. These nuwts were validated by PCR amplification of the junctions of nuwts and the nuclear genome followed by end sequencing verification of the products. Unfortunately, given that the probe set used here had probes removed that had homology to the filarial nematode genome, we anticipate that those nuwts would be poorly recovered with uneven results. Consistent with this, the 12 genes that were expected to be found in these amplification products had sequencing depth ranging from 754x (low sequencing depth) to 6275x (~3 standard deviations above the mean sequencing depth).
The second validation of nuwts was conducted on nuwts detected using sequencing depth and SNPs in the previously published depletion data17 that is used for comparisons here. Validation was conducted by using qPCR to measure the copy number of nuwts for comparison to the sequencing depth, which should also reflect the copy number17. A comparison of this previous qPCR data with the capture data presented here reveals similar R2 values for both data sets, suggesting that at least a subset of nuwts can be predicted accurately (Fig. 4). The plots reveal different y-intercepts, which is expected given the differing presence of sequencing reads from the endosymbiont genome between the two data sets (Fig. 4). The slopes are different, reflecting different levels of sequencing depth (Fig. 4).
Discussion
Identifying nuwts using capture-based sequencing
Numerous LGTs from Wolbachia to its many hosts have been described, which are relatively recent in nature7, 8, 11, 12, 14, 15, 17. Evidence for most of these suggests that they are evolving neutrally, accumulating mutations at a slow rate, including deleterious mutations7, 8, 11, 12, 14, 16. Most of the sequences are still very recognizable using BLASTN-based searches or using mapping algorithms like BWA35, BOWTIE36, 37, MOSAIK38, and STAMPY39. Therefore, it is not surprising that they are also captured using an oligonucleotide-based approach. Using two existing datasets, we demonstrate that nuwts can be identified using sequence data following selection with an oligonucleotide-based capture system even in the presence of the endosymbiont genome.
However, fewer nuwts were detected in the capture data than in the depletion data likely owing to a combination of factors including that (a) the oligonucleotides for capture design excluded some nuwts in the reference B. malayi genome and (b) the dynamic range between sequencing depth of the endosymbiont genome and nuwts is smaller. Absence of some nuwt sequences from the oligonucleotide bait pool used for capture appears to account for ~6.0–7.1% of the missing data points, depending on which threshold is applied. This has been corrected with the most recent capture design by adding back regions of the Wolbachia genome that were removed due to similarity with nematode genomes. Although certain nuwts were under-represented, the capture approach nonetheless provided sequence of the Wolbachia genome31, suggesting that adjacent baits can capture these missing sequences, albeit at lower sequencing depth levels.
The differences between these datasets and the ability to detect nuwts with these datasets highlights that there is not a single method that can be applied to appropriately detect nuwts in all datasets. The appropriate method will always be related to the sequencing technology, the relative proportion of Wolbachia and host reads, and the characteristics of the most closely related reference genomes. The copy number of the nuwts and the relative abundance of, and thus sequencing depth difference between, the Wolbachia and B. malayi chromosomes appear to be the dominant factors influencing misidentification of nuwts. Where the dynamic range of the ratio of the sequencing depth across multi-copy nuwts relative to the average sequencing depth across the genome is larger, as in the depletion data, it is much easier to detect nuwts (Fig. 5). Therefore, while nuwts can be predicted in the presence of the endosymbiont, it is better to predict nuwts following Wolbachia depletion when possible. When depletion is not possible, care must be taken to identify thresholds best suited to the data at hand including the ratio of sequencing depth or copy number between the nuclear genome and endosymbiont genome, as well as the evenness of sequencing depth, which can be influenced by factors like library construction.
Using the capture data, nuwts can be predicted using sequencing depth and SNP variation in the underlying sequence reads. While the best strategy and thresholds used likely depend on the nature of the data, it is likely that the combination of sequencing depth and genetic variation would produce the best result, and the union of the two predictions would produce the most conservative estimate of the true nuwt content in a genome. Such predictions could guide researchers interested in Wolbachia-host LGT to candidate regions of the genome to be examined with more focused approaches. Of course each prediction method has its own benefits and limitations, most notably the sole use of polymorphisms may prevent the detection of recent nuwts that have yet to accumulate mutations.
Effect of nuwts on genome consensus of Wolbachia endosymbionts
Given that we have already shown that nuwts can be present in the B. malayi genome in many copies, it is not surprising that sometimes the reads derived from nuwts may outnumber the reads derived from the Wolbachia genome. When >50% of reads originate from a SNP-containing nuwt, the consensus genome sequence for the Wolbachia genome will be incorrect. This appears to have happened for 227 positions in the current data, which range in sequencing depth from 2546x to 12,865x. Of these, 17 SNPs were supported by ≥99.9% of the underlying reads, which may indicate either a sequence error in the reference, or a difference in this endosymbiont genome sequence relative to the reference.
The remaining 210 positions with intermediate levels of genetic variation (>50% and <90%) demonstrate that the sequencing depth and the polymorphisms arise from a contribution of the nuwts and the endosymbiont genome with the endosymbiont base being the minor variant. In these cases, a majority rules genome consensus calling algorithm is likely to yield an incorrect consensus endosymbiont genome. It is important to note that this is not a problem limited to capture-based sequencing. This will occur with any sequence-based method involving multiple alleles. In the filarial nematode system, it may be easier to detect because there are many relatively small nuwts (<10-kbp) with multiple copies allowing the use of both sequencing depth and polymorphism for detection. However, this will also be an issue in the insect systems and sequencing depth may prove more difficult to measure since much larger nuwts are typically observed. It might seem that the use of long reads, like those from the Pacific Biosciences RS, would remedy this problem. However, we have already observed this problem in at least one Wolbachia genome assembled using Pacific Biosciences data. These errors are likely dependent on the error-correction algorithms used. As such, as the technology develops and long, error-free reads can be obtained, it would likely eliminate these problems. Using a BAC-based sequencing approach to obtain endosymbiont genomes can alleviate this problem, as is the case with the reference wBm genome. However, such BAC-based sequencing is no longer considered cost effective. In the absence of a BAC-based approach, care should be taken in assessing the functionality of genes, or the lack thereof, in systems where LGT to the host is likely. More specifically, studies examining pseudogenes in endosymbiont genomes and proposing the loss of functionality (e.g. refs 40–42) need sufficient experimental validation of the responsible SNPs to ensure their location in the endosymbiont genome, since any analogous nuwts are likely to be pseudogenized.
Conclusions
DNA from Wolbachia is found in the genomes of a great number of its hosts as nuwts. The detection of nuwts is often linked to careful analyses of the host genome sequencing project. Low-cost methods to detect and sequence nuwts are needed in order to better understand the extent of such transfers in host genomes. Such studies lay the groundwork for how such integrations occur and if they have any functional significance. Here, we demonstrate that oligonucleotide-based capture systems can be used to capture and sequence nuwt sequences. Such sequences can be distinguished from the bacterial endosymbiont genome by an increase in sequencing depth as well as genetic heterogeneity of the sequences. However, the predictions are likely to be more limited than when a depletion based strategy can be employed. Lastly, we demonstrate that the presence of nuwts can confound genome consensus calling, yielding an erroneous genome sequence of the endosymbiont.
Methods
Data sets
This manuscript compares two Illumina paired end datasets. The first data set, which is referred to as the depletion data, is described by Ioannidis et al.17 (SRX142902). This data set includes >138 million reads from a 300-bp paired end library constructed directly from DNA from tetracycline-treated, and thus Wolbachia-depleted, B. malayi filarial nematodes. The second data set, which is referred to as the capture data, is described by Geniez et al.31 (SRX1057997). This data set includes >91 million reads from a paired end library constructed from B. malayi DNA library from which Wolbachia-derived sequences were captured using the Agilent Sure Select protocol and RNA baits as previously described31.
Sequence alignments
For the depletion data set, the alignments from Ioannidis et al.17 were used, which were aligned with BWA version 0.5.9-r1635 to the reference wBm genome (AE017321.1)34 with default parameters and had duplicates removed with MarkDuplicates as implemented in Picard 1.4843. To ensure comparability of the data, the capture data was re-aligned with BWA version 0.5.9-r1635 with default parameters to the reference wBm genome (AE017321.1)34 and had duplicates removed with MarkDuplicates as implemented in Picard version 1.4843. Statisics about both libraries and sequencing were collected with Picard version 1.4843.
Sequencing depth, SNP identification, and other calculations
The sequencing depth of the sequences was measured with MPILEUP in SAMTOOLS version 0.1.19-44428cd44 using the alignments described above. SNPs were identified using this MPILEUP output. PERL version 5.8.8 and LINUX commands (e.g. CUT, AWK, and SED) were used to parse data and perform the calculations presented. Figures were constructed and statistical tests performed in R version 2.15.2.
Acknowledgements
This work was funded by the National Institutes of Health through the NIH Director’s New Innovator Award Program (1-DP2-OD007372), National Institute of Allergy and Infectious Diseases (U19AI110820), and internal funding from New England Biolabs, Inc.
Author Contributions
J.C.D.H. conceived the study, carried out the bioinformatics analyses, and drafted the manuscript. B.S. conceived the study and edited the manuscript. J.F. conceived the study and edited the manuscript. All authors read and approved the final manuscript.
Competing Interests
The authors declare that they have no competing interests.
Footnotes
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Schroder G, Schuelein R, Quebatte M, Dehio C. Conjugative DNA transfer into human cells by the VirB/VirD4 type IV secretion system of the bacterial pathogen Bartonella henselae. Proc Natl Acad Sci USA. 2011;108:14643–14648. doi: 10.1073/pnas.1019074108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Gladyshev EA, Meselson M, Arkhipova IR. Massive horizontal gene transfer in bdelloid rotifers. Science. 2008;320:1210–1213. doi: 10.1126/science.1156407. [DOI] [PubMed] [Google Scholar]
- 3.Acuna R, et al. Adaptive horizontal transfer of a bacterial gene to an invasive insect pest of coffee. Proc Natl Acad Sci USA. 2012;109:4197–4202. doi: 10.1073/pnas.1121190109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Whiteman NK, Gloss AD. Parasitology: Nematode debt to bacteria. Nature. 2010;468:641–642. doi: 10.1038/468641a. [DOI] [PubMed] [Google Scholar]
- 5.Danchin EGJ, et al. Multiple lateral gene transfers and duplications have promoted plant parasitism ability in nematodes. Proc Natl Acad Sci USA. 2010;107:17651–17656. doi: 10.1073/pnas.1008486107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Mitreva M, Smant G, Helder J. Role of horizontal gene transfer in the evolution of plant parasitism among nematodes. Methods Mol Biol. 2009;532:517–535. doi: 10.1007/978-1-60327-853-9_30. [DOI] [PubMed] [Google Scholar]
- 7.Dunning Hotopp JC, et al. Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes. Science. 2007;317:1753–1756. doi: 10.1126/science.1142490. [DOI] [PubMed] [Google Scholar]
- 8.Fenn K, et al. Phylogenetic relationships of the Wolbachia of nematodes and arthropods. PLoS Pathog. 2006;2:e94. doi: 10.1371/journal.ppat.0020094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Nikoh N, et al. Bacterial genes in the aphid genome: absence of functional gene transfer from Buchnera to its host. PLoS Genet. 2010;6:e1000827. doi: 10.1371/journal.pgen.1000827. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Werren JH, et al. Functional and evolutionary insights from the genomes of three parasitoid Nasonia species. Science. 2010;327:343–348. doi: 10.1126/science.1178028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Nikoh N, et al. Wolbachia genome integrated in an insect chromosome: evolution and fate of laterally transferred endosymbiont genes. Genome Res. 2008;18:272–280. doi: 10.1101/gr.7144908. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Kondo N, Nikoh N, Ijichi N, Shimada M, Fukatsu T. Genome fragment of Wolbachia endosymbiont transferred to X chromosome of host insect. Proc Natl Acad Sci USA. 2002;99:14280–14285. doi: 10.1073/pnas.222228199. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Dunning Hotopp JC. Horizontal gene transfer between bacteria and animals. Trends Genet. 2011;27:157–163. doi: 10.1016/j.tig.2011.01.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.McNulty SN, et al. Endosymbiont DNA in endobacteria-free filarial nematodes indicates ancient horizontal genetic transfer. PLoS One. 2010;5:e11029. doi: 10.1371/journal.pone.0011029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Aikawa T, et al. Longicorn beetle that vectors pinewood nematode carries many Wolbachia genes on an autosome. Proc Biol Sci. 2009;276:3791–3798. doi: 10.1098/rspb.2009.1022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Brelsfoard C, et al. Presence of extensive Wolbachia symbiont insertions discovered in the genome of its host Glossina morsitans morsitans. PLoS Negl Trop Dis. 2014;8:e2728. doi: 10.1371/journal.pntd.0002728. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Ioannidis P, et al. Extensively duplicated and transcriptionally active recent lateral gene transfer from a bacterial Wolbachia endosymbiont to its host filarial nematode Brugia malayi. BMC Genomics. 2013;14:639. doi: 10.1186/1471-2164-14-639. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Klasson L, Kambris Z, Cook PE, Walker T, Sinkins SP. Horizontal gene transfer between Wolbachia and the mosquito Aedes aegypti. BMC Genomics. 2009;10:33. doi: 10.1186/1471-2164-10-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Nikoh N, Nakabachi A. Aphids acquired symbiotic genes via lateral gene transfer. BMC Biol. 2009;7:12. doi: 10.1186/1741-7007-7-12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Woolfit M, Iturbe-Ormaetxe I, McGraw EA, O’Neill SL. An ancient horizontal gene transfer between mosquito and the endosymbiotic bacterium Wolbachia pipientis. Mol Biol Evol. 2009;26:367–374. doi: 10.1093/molbev/msn253. [DOI] [PubMed] [Google Scholar]
- 21.Doudoumis V, et al. Detection and characterization of Wolbachia infections in laboratory and natural populations of different species of tsetse flies (genus Glossina) BMC Microbiol. 2012;12(Suppl 1):S3. doi: 10.1186/1471-2180-12-S1-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Koutsovoulos G, Makepeace B, Tanya VN, Blaxter M. Palaeosymbiosis revealed by genomic fossils of Wolbachia in a strongyloidean nematode. PLoS Genet. 2014;10:e1004397. doi: 10.1371/journal.pgen.1004397. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Stouthamer R, Breeuwer JA, Hurst GD. Wolbachia pipientis: microbial manipulator of arthropod reproduction. Annu Rev Microbiol. 1999;53:71–102. doi: 10.1146/annurev.micro.53.1.71. [DOI] [PubMed] [Google Scholar]
- 24.Werren JH. Biology of Wolbachia. Annu Rev Entomol. 1997;42:587–609. doi: 10.1146/annurev.ento.42.1.587. [DOI] [PubMed] [Google Scholar]
- 25.Hilgenboecker K, Hammerstein P, Schlattmann P, Telschow A, Werren JH. How many species are infected with Wolbachia?–A statistical analysis of current data. FEMS Microbiol Lett. 2008;281:215–220. doi: 10.1111/j.1574-6968.2008.01110.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Klasson L, et al. Extensive duplication of the Wolbachia DNA in chromosome four of Drosophila ananassae. BMC Genomics. 2014;15:1097. doi: 10.1186/1471-2164-15-1097. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Keroack CD, et al. Absence of the Filarial Endosymbiont Wolbachia in Seal Heartworm (Acanthocheilonema spirocauda) but Evidence of Ancient Lateral Gene Transfer. J Parasitol. 2016;102:312–318. doi: 10.1645/15-872. [DOI] [PubMed] [Google Scholar]
- 28.Desjardins CA, et al. Genomics of Loa loa, a Wolbachia-free filarial parasite of humans. Nat Genet. 2013;45:495–500. doi: 10.1038/ng.2585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.McNulty SN, et al. Transcriptomic and proteomic analyses of a Wolbachia-free filarial parasite provide evidence of trans-kingdom horizontal gene transfer. PLoS One. 2012;7:e45777. doi: 10.1371/journal.pone.0045777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.McNulty SN, et al. Localization of wolbachia-like gene transcripts and peptides in adult onchocerca flexuosa worms indicates tissue specific expression. Parasites & vectors. 2013;6:2. doi: 10.1186/1756-3305-6-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Geniez S, et al. Targeted genome enrichment for efficient purification of endosymbiont DNA from host DNA. Symbiosis. 2013;58:201–207. doi: 10.1007/s13199-012-0215-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Kent BN, et al. Complete bacteriophage transfer in a bacterial endosymbiont (Wolbachia) determined by targeted genome capture. Genome Biol Evol. 2011;3:209–218. doi: 10.1093/gbe/evr007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Gerth M, Gansauge MT, Weigert A, Bleidorn C. Phylogenomic analyses uncover origin and spread of the Wolbachia pandemic. Nature communications. 2014;5:5117. doi: 10.1038/ncomms6117. [DOI] [PubMed] [Google Scholar]
- 34.Foster J, et al. The Wolbachia genome of Brugia malayi: endosymbiont evolution within a human pathogenic nematode. PLoS Biol. 2005;3:e121. doi: 10.1371/journal.pbio.0030121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Langmead, B. Aligning short sequencing reads with Bowtie. Curr Protoc Bioinformatics Chapter 11, Unit 11 17 (2010). [DOI] [PMC free article] [PubMed]
- 37.Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Lee WP, et al. MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One. 2014;9:e90581. doi: 10.1371/journal.pone.0090581. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Lunter G, Goodson M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011;21:936–939. doi: 10.1101/gr.111120.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Nikoh N, et al. Evolutionary origin of insect-Wolbachia nutritional mutualism. Proc Natl Acad Sci USA. 2014;111:10257–10262. doi: 10.1073/pnas.1409284111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Darby AC, et al. Analysis of gene expression from the Wolbachia genome of a filarial nematode supports both metabolic and defensive roles within the symbiosis. Genome Res. 2012;22:2467–2477. doi: 10.1101/gr.138420.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Husnik F, et al. Horizontal gene transfer from diverse bacteria to an insect genome enables a tripartite nested mealybug symbiosis. Cell. 2013;153:1567–1578. doi: 10.1016/j.cell.2013.05.040. [DOI] [PubMed] [Google Scholar]
- 43.Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003;31:371–373. doi: 10.1093/nar/gkg128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]