Can we have it all? Repurposing target capture for repeat genomics. A commentary on: ‘Aiming off the target: recycling target capture sequencing reads for investigating repetitive DNA’

Tony Heitkam; Sònia Garcia

Annals of Botany. 2021 Jul 21;128(7):iii–v. doi: 10.1093/aob/mcab080

PMCID: PMC8577196 PMID: 34289009

Abstract

This article comments on:

Lucas Costa, André Marques, Chris Buddenhagen, William Wayt Thomas, Bruno Huettel, Veit Schubert, Steven Dodsworth, Andreas Houben, Gustavo Souza and Andrea Pedrosa-Harand, Aiming off the target: recycling target capture sequencing reads for investigating repetitive DNA, Annals of Botany, Volume 128, Issue 7, 2 December 2021, Pages 835–848 https://doi.org/10.1093/aob/mcab063

Keywords: Target capture sequencing, repetitive DNA, phylogenomics

With increasingly fast developments in DNA sequencing and still decreasing sequencing costs, we are now in a position to examine the genomic basis of whole populations and large species groups to clarify their contributions to plant biodiversity and their environments. To span large species sample sets, a range of high-throughput and reduced representation sequencing methods are currently emerging, such as genome skimming, genotyping by sequencing and target capture sequencing. In this issue of Annals of Botany, Costa et al. repurpose reads from target capture sequencing to identify and analyse repetitive DNA for cyto- and phylogenomics research.

Target capture methods make use of selected low-copy regions, usually genes. These genes are captured by matching single-stranded DNA baits, which retrieve the selected targets with high coverage. In contrast to other approaches such as RNA-seq, target capture (1) produces datasets that are strictly focused on phylogenetically relevant genes, while avoiding unspecific sequences; (2) can potentially be applied to any material, including herbarium or museum specimens; (3) does not depend on a specific developmental stage or tissue; and (4) can be much cheaper. These enrichment studies are on the rise (Fig. 1A), and newly available phylogenomics probe sets (Johnson et al., 2018) will further increase their popularity. Necessarily, target capture sequencing excludes the largest part of the genome, the repetitive fraction, which accounts for up to 80 % of the genome in many plant species. However, as repetitive sequences are among the main sources of genomic innovation, an understanding of them is needed to obtain a complete picture of plant evolution. For example, repeats drive evolution and speciation processes, build the structural backbone of genomes and underpin the formation of regulatory networks (Biscotti et al., 2015). In addition, they provide valuable targets for generating cytogenetic markers and for tracking karyotype evolution (Schmidt et al., 2019).

Fig. 1. — (A) Target capture sequencing approaches are becoming increasingly popular. A Web of Science search showed a steady growth in publications over the last decade. The small drop in 2020 is likely to be a side effect of the global pandemic. The query was performed with the search phrase ‘TOPIC: [‘target capture sequencing’ OR ‘hyb-seq’ OR (‘target enrichment’ AND ‘sequencing’)]’. (B) The target capture repurposing strategy as suggested by the study of Costa *et al.* (2021). Existing target capture sets of five *Rhynchospora* species were investigated for their suitability to provide information on the repeat content in *Rhynchospora* genomes.

Besides these new, exciting and universal phylogenomics approaches, is it possible to derive additional benefits from these target capture datasets? To paraphrase, can we have it all: a gene-based phylogenomics overview and a wide characterization of a plant’s repeat landscape from a single, low-cost sequencing strategy? To go deeper into this question, it is helpful to look closely into the read output of the target capture methods. Since the focus is on enriching low-copy sequence regions, most of the repetitive genomic fraction is excluded. However, depending on the enrichment efficiency, many off-target reads may persist, potentially representing the whole repeat content of the plant. Hence, target capture data with low enrichment efficiency may be used to reconstruct the repeat landscape of an unknown genome.
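The dependence on enrichment efficiency can be made concrete with a little arithmetic. The following is a minimal sketch, not a method from Costa et al.: it assumes that reads have already been classified as on- or off-target by some mapping step, and the function name, its parameters and the example numbers are all hypothetical. Fold enrichment is taken here as the observed on-target read fraction divided by the fraction expected if reads were drawn uniformly from the genome.

```python
def enrichment_summary(total_reads, on_target_reads, target_bp, genome_bp):
    """Summarize capture efficiency from simple counts (illustrative only).

    All inputs are assumed to come from an upstream mapping step that is
    not shown; nothing here reflects a specific published pipeline.
    """
    if total_reads <= 0 or not 0 <= on_target_reads <= total_reads:
        raise ValueError("read counts are inconsistent")
    if not 0 < target_bp <= genome_bp:
        raise ValueError("target size must be positive and no larger than the genome")

    on_target_fraction = on_target_reads / total_reads
    # Fraction of reads expected on target without any enrichment
    expected_fraction = target_bp / genome_bp

    return {
        "on_target_fraction": on_target_fraction,
        "off_target_reads": total_reads - on_target_reads,
        "fold_enrichment": on_target_fraction / expected_fraction,
    }


# Made-up example: 1 M reads, 40 % on target, 1 Mb of targets in a 500 Mb genome
summary = enrichment_summary(1_000_000, 400_000, 1_000_000, 500_000_000)
```

In this made-up case, 600 000 off-target reads remain as potential input for repeat analysis, even though the targets themselves are strongly enriched.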

We have to keep in mind that it is crucial to have access to unbiased datasets to understand the genome’s repetitive fraction. To ensure this, genome skimming approaches are considered the gold standard (Novák et al., 2010). Although Illumina sequencing introduces some biases (Benjamini and Speed, 2012), low coverage sequencing of whole shotgun genomic DNA is the best option for calculating repeat abundances. Many studies follow this approach, such as the comparative overview of the repeat landscapes of tomato and potato published recently in Annals of Botany (Gaiero et al., 2019).

In contrast to the established methods for characterizing a plant’s repeat profile from genome skimming reads, target capture datasets are highly biased in their sequence composition. This is by design: the capture strategy skews read coverage towards the target sequences and away from the remaining genome fraction. This bias probably influences all subsequent quantification approaches, so the enrichment efficiency is crucial for every follow-up step. Whereas repeat analysis works best if enrichment is low or even absent, phylogenomics requires the highest capture efficiencies. Optimally, both target capture and genome skimming are performed in parallel. If, instead, target capture alone is selected to serve both phylogenomics and repeat genomics, a compromise degree of enrichment should be chosen, although it will probably affect both analyses. So, are there ways to determine a genome’s accurate repeat composition using target capture data? How do the results compare with those obtained from genome skimming data, the current standard in the field?

Costa et al. addressed this question and investigated the usefulness of target capture reads for a comprehensive analysis of the repeat fraction. For this, they comparatively analysed genome skimming and target capture data from five species of the sedge genus Rhynchospora. All analysed Rhynchospora species harbour chromosomes with holocentromeres, i.e. centromeres that span the complete chromosomal length and are marked by specific repeats (Marques et al., 2015). Large-scale repeat analyses would offer insights into genome, karyotype and centromere evolution of this genus and, hence, also address a biologically relevant question. Testing all their analyses against skewed target capture and low-coverage skimming data, Costa et al. acquired a good estimation of the existing biases produced by sequence capture. Using this information, they developed a range of filtering strategies to separate the on- and off-target reads and proceeded with various repeat characterization approaches, such as identification, quantification, repeat-based phylogenomics and cytogenetics (Fig. 1B). A similar approach has recently been successful in leveraging off-target reads for clinical and diagnostic purposes (Mangul et al., 2021).
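To illustrate the general idea of separating on- and off-target reads, here is a minimal sketch. It is not one of the filtering strategies developed by Costa et al., whose details are not given in this commentary. It assumes a hypothetical upstream step that has mapped reads against the bait target sequences and reports, for each alignment, the read identifier and the reference it hit (or `None` if unmapped); all names are illustrative.

```python
def split_on_off_target(alignments, target_names):
    """Partition read IDs into on-target and off-target sets.

    alignments   : iterable of (read_id, reference_name or None) tuples,
                   assumed to come from a mapping step not shown here
    target_names : names of the captured target sequences (hypothetical)

    A read counts as on-target if any of its alignments hits a target.
    """
    alignments = list(alignments)
    targets = set(target_names)
    on_target = {read_id for read_id, ref in alignments if ref in targets}
    all_reads = {read_id for read_id, _ in alignments}
    return on_target, all_reads - on_target


# Made-up alignments: r1 hits a target, r2 is unmapped, r3 hits a non-target
example = [("r1", "geneA"), ("r2", None), ("r3", "plastid"), ("r1", None)]
on_reads, off_reads = split_on_off_target(example, ["geneA", "geneB"])
```

Only the off-target set would then be passed on to repeat identification; how strictly the two sets are separated is a design choice that can itself affect the resulting repeat estimates.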

The capture data were effective in identifying all of the most prominent Rhynchospora repeat families. Nevertheless, as predicted, their abundances deviated from the expected values; some low-copy repeats or even genes were quantified in higher numbers, whereas some high-copy repeats were quantified in lower numbers. These types of repeat-specific biases were not reproducible or predictable between the different Rhynchospora species. Satellite DNA, for example, was sometimes over- and sometimes largely under-represented. Notably, in some instances, repeats with low abundance were detected that did not make it into the genome skimming dataset. In addition, the authors observed a greater diversity of retrotransposons in target capture datasets as compared with genome skimming, probably by amplification of older, more diverged repeats using the target capture strategy.
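One simple way to express such deviations is a per-family log2 ratio of the abundance estimated from capture data to that estimated from skimming data. The sketch below is an assumption for illustration, not the comparison performed by Costa et al.; the family names, the abundance values and the pseudocount are made up. A pseudocount is added so that families detected in only one dataset still yield a finite ratio.

```python
import math


def abundance_log2_ratios(capture, skimming, pseudocount=1e-6):
    """Compare repeat abundances between two datasets (illustrative only).

    capture, skimming : dicts mapping repeat family -> genome proportion (0..1)
    Returns family -> log2(capture / skimming); positive values indicate
    over-representation in the capture data, negative values the opposite.
    """
    families = sorted(set(capture) | set(skimming))
    return {
        family: math.log2(
            (capture.get(family, 0.0) + pseudocount)
            / (skimming.get(family, 0.0) + pseudocount)
        )
        for family in families
    }


# Hypothetical genome proportions for three made-up repeat families
capture = {"satellite_X": 0.02, "retro_Y": 0.10, "rare_Z": 0.001}
skimming = {"satellite_X": 0.08, "retro_Y": 0.10}
ratios = abundance_log2_ratios(capture, skimming)
```

In this made-up example, `satellite_X` is under-represented in the capture data, `retro_Y` is unchanged, and `rare_Z`, absent from the skimming data, receives a large positive ratio.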

Importantly, the repeats identified in the target capture datasets were effective for the development of fluorescent in situ hybridization (FISH) probes, indicating that the corresponding sequences were representative of the individual repeat families. This implies that repurposing available target capture sequences can be an economical alternative for enabling large-scale cytogenomics, especially given the cost of genome skimming for a large number of species. Using repeat abundances, Costa et al. also managed to reproduce a phylogeny that was based on 256 targeted genes. Given the repeat quantification bias of this approach, we could imagine that repeat similarities may provide a less distorted phylogenetic signal (Vitales et al., 2020). In the future, it would also be interesting to see how the different filtering approaches impact the resulting phylogenies. We look forward to seeing results for larger and more diverse species groups as well.

In conclusion, we come back to the question: can we have it all, repeat characterization and phylogenomics, from a single low-cost experiment? Although we conclude that target capture strategies do not represent the method of choice for repeat quantification, the exploration of further uses is highly encouraged. The identification of suitable cytogenetic markers for FISH, as proposed here, is especially promising and may enable larger, comparative cytogenetic studies at lower costs which could complement phylogenomics approaches. This combined cytogenomics and phylogenomics strategy will certainly generate new questions regarding the chromosomal contribution to speciation and diversification. Considering this, the study of Costa et al. is timely and valuable. It investigates many potential added-value strategies for target capture data and gives a good overview of what can be achieved and how it can be implemented. Given its growth trend, we foresee that the near future will bring many target capture datasets spanning the angiosperm phylogeny and beyond, leading to many insights into plant systematics, biodiversity and gene evolution. All approaches that attempt to repurpose these data are useful and will benefit many plant scientists.

ACKNOWLEDGEMENTS

We thank Andreas Houben for providing feedback for this commentary.

LITERATURE CITED

Benjamini Y, Speed TP. 2012. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research 40: e72.
Biscotti MA, Olmo E, (Pat) Heslop-Harrison JS. 2015. Repetitive DNA in eukaryotic genomes. Chromosome Research 23: 415–420.
Costa L, Marques A, Buddenhagen C, et al. 2021. Aiming off the target: recycling target capture sequencing reads for investigating repetitive DNA. Annals of Botany 128: 835–848.
Gaiero P, Vaio M, Peters SA, Schranz ME, de Jong H, Speranza PR. 2019. Comparative analysis of repetitive sequences among species from the potato and the tomato clades. Annals of Botany 123: 521–532.
Johnson MG, Pokorny L, Dodsworth S, et al. 2018. A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering. Systematic Biology 68: 594–606.
Mangul S, Brito JJ, Groha S, et al. 2021. Seeing beyond the target: leveraging off-target reads in targeted clinical tumor sequencing to identify prognostic biomarkers. bioRxiv doi: 10.1101/2021.05.28.446240.
Marques A, Ribeiro T, Neumann P, et al. 2015. Holocentromeres in Rhynchospora are associated with genome-wide centromere-specific repeat arrays interspersed among euchromatin. Proceedings of the National Academy of Sciences, USA 112: 13633–13638.
Novák P, Neumann P, Macas J. 2010. Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11: 378.
Schmidt T, Heitkam T, Liedtke S, Schubert V, Menzel G. 2019. Adding color to a century-old enigma: multi-color chromosome identification unravels the autotriploid nature of saffron (Crocus sativus) as a hybrid of wild Crocus cartwrightianus cytotypes. New Phytologist 222: 1965–1980.
Vitales D, Garcia S, Dodsworth S. 2020. Reconstructing phylogenetic relationships based on repeat sequence similarities. Molecular Phylogenetics and Evolution 147: 106766.

