Annals of Botany. 2021 May 29;128(7):835–848. doi: 10.1093/aob/mcab063

Aiming off the target: recycling target capture sequencing reads for investigating repetitive DNA

Lucas Costa 1, André Marques 2, Chris Buddenhagen 3, William Wayt Thomas 4, Bruno Huettel 5, Veit Schubert 6, Steven Dodsworth 7, Andreas Houben 6, Gustavo Souza 1, Andrea Pedrosa-Harand 1
PMCID: PMC8577205  PMID: 34050647

Abstract

Background and Aims

With the advance of high-throughput sequencing, reduced-representation methods such as target capture sequencing (TCS) emerged as cost-efficient ways of gathering genomic information, particularly from coding regions. As the off-target reads from such sequencing are expected to be similar to genome skimming (GS), we assessed the quality of repeat characterization in plant genomes using these data.

Methods

Repeat composition obtained from TCS datasets of five Rhynchospora (Cyperaceae) species was compared with GS data from the same taxa. In addition, a FISH probe was designed based on the most abundant satellite found in the TCS dataset of Rhynchospora cephalotes. Finally, repeat-based phylogenies of the five Rhynchospora species were constructed based on the GS and TCS datasets and the topologies were compared with a gene-alignment-based phylogenetic tree.

Key Results

All the major repetitive DNA families were identified in TCS, including repeats that showed abundances as low as 0.01 % in the GS data. Rank correlations between GS and TCS repeat abundances were moderately high (r = 0.58–0.85), increasing after filtering out the targeted loci from the raw TCS reads (r = 0.66–0.92). Repeat data obtained by TCS were also reliable in developing a cytogenetic probe of a new variant of the holocentromeric satellite Tyba. Repeat-based phylogenies from TCS data were congruent with those obtained from GS data and the gene-alignment tree.

Conclusions

Our results show that off-target TCS reads can be recycled to identify repeats for cyto- and phylogenomic investigations. Given the growing availability of TCS reads, driven by global phylogenomic projects, our strategy represents a way to recycle genomic data and contribute to a better characterization of plant biodiversity.

Keywords: Genome skimming, holocentric, reduced-representation sequencing, RepeatExplorer, Rhynchospora, satellite DNA, transposable elements

INTRODUCTION

One intriguing aspect of plant genomes is the staggering 2400-fold variation in DNA content among species (Pellicer et al., 2010). Advances in genomics have led to the discovery that most of this diversity is the result of variable amounts of repetitive DNA, commonly divided into tandem repeats and dispersed repeats (Elliott and Gregory, 2015; Weiss-Schneeweiss et al., 2015). Tandemly distributed satellite DNAs are remarkable for their fast evolution and variance in abundance and structure at all hierarchical levels (Novák et al., 2017; Ávila Robledillo et al., 2018). As for dispersed repetitive sequences, transposable elements are especially abundant in flowering plant genomes (Jurka et al., 2011; Galindo-González et al., 2017). Retrotransposons, particularly the ones possessing a long terminal repeat (LTR-retrotransposons), account for most of this abundance (Weiss-Schneeweiss et al., 2015), with two major superfamilies being recognized (Ty1/copia and Ty3/gypsy) based on the order of the protein-coding domains, and further divided into a number of lineages according to phylogenetic distance (Neumann et al., 2019).

Contrasting with the previous notion that repetitive DNA was no more than ‘junk DNA’, cytogenomic studies in the last few decades helped to uncover possible roles for tandem and dispersed repeats. Specific satellite DNAs and retrotransposons have been found to have a functional and/or structural role in centromeres (Cheng et al., 2002; Houben and Schubert, 2003; Nagaki et al., 2003; Macas et al., 2015; Marques et al., 2015; Ribeiro et al., 2017). The role of LTR-retrotransposons in genome size variation led to investigations correlating these to heterochromatin distribution, ecological variables, community structure and plant distribution (Guignard et al., 2016; Van-Lume et al., 2017; Lyu et al., 2018; Souza et al., 2019). The activity of transposable elements in the host genome can also cause modifications to gene regulation and the formation of retrogenes, generating morphological innovations and impacting speciation processes (Schrader and Schmitz, 2019). Moreover, lineage-specific satellite DNAs have been widely used as efficient cytomolecular markers, allowing the identification of chromosome pairs and elucidating chromosome rearrangement and duplication events (Koo et al., 2011; Čížková et al., 2013; Ávila Robledillo et al., 2018; Ribeiro et al., 2020).

The fast evolution of repetitive DNA, with many satellite families being genus- or species-specific, impairs their use in phylogenetic studies, as concerted evolution and homogenization reduce sequence variability for comparative studies across taxa (Macas et al., 2015; Mascagni et al., 2020; Ribeiro et al., 2020). Their abundances, however, have phylogenetic significance, as demonstrated by a method to reconstruct phylogenetic relationships using the abundance of different repetitive elements (Dodsworth et al., 2015). This method has proven useful to elucidate relationships in different groups of plants and animals (Dodsworth et al., 2017; Bolsheva et al., 2019; Martín-Peciña et al., 2019). Other methods have assessed the usefulness of repeat-based phylogenetic analysis. More recently, it was demonstrated that sequence similarity measures of repeated sequences can also be used as characters to resolve phylogenetic relationships (Vitales et al., 2020). In a similar approach, assembly and alignment-free (AAF) methods can be applied to high-complexity fractions of the genome, such as repetitive DNA, in a phylogenomic framework (Fan et al., 2015; Sarmashghi et al., 2019).

The RepeatExplorer pipeline was created to identify and characterize the diversity of repeats in a genome. It uses a graph-based clustering algorithm to group high-copy sequences based on similarity, followed by identification of protein-coding domains by cross-referencing with an extensive group of up-to-date databases (Novak et al., 2013). RepeatExplorer can identify repetitive elements from sequencing at ~0.1× genome coverage, an approach commonly known as genome skimming (GS), which is often sufficient to study high-copy nuclear and organellar DNA (Straub et al., 2012; Dodsworth, 2015; Dodsworth et al., 2019). The GS method is just one of several ‘reduced representation’ methods of high-throughput sequencing. Throughout the last decade, a number of these sequencing methods have been proposed, generating high-quality sequencing data at decreasing costs, such as restriction site-associated DNA sequencing (RAD-seq, Eaton et al., 2016) and transcriptome sequencing (RNA-seq, Wang et al., 2017).

Another widely used reduced representation method is target capture sequencing (TCS), in which several genomic probes are designed to ‘capture’ and enrich specific low-copy coding regions of the nuclear genome (Albert et al., 2007; Gnirke et al., 2009). These probes can be designed based on conserved regions retrieved from the alignment of several genomes of a divergent set of organisms (e.g. anchored hybrid enrichment, Lemmon et al., 2012) or by comparing transcriptomic data to search for a conserved set of orthologues across a group (Johnson et al., 2019). Since the development of TCS (Albert et al., 2007), many sets of probes have been developed and applied in both plants and animals (Cosart et al., 2011; Faircloth et al., 2012; Ilves and López-Fernández, 2014; Mandel et al., 2014; Heyduk et al., 2016; Sass et al., 2016; Schmickl et al., 2016). Moreover, the nature of these probe design approaches has allowed the development of universal probe sets, with the potential to be used across all of the angiosperms (e.g. Buddenhagen et al., 2016; Johnson et al., 2019). In addition to the advantage of being universal, a high recovery rate of target regions (or the number of enriched targeted loci in the final library) is often achievable with very low ‘enrichment efficiency’ (percentage of library reads successfully mapped to a target sequence) (Johnson et al., 2019), meaning that a TCS enriched library will often contain a high number of ‘off-target’ reads. The use of these off-target reads, in combination with the enriched reads, has been referred to as ‘Hyb-Seq’ (hybrid sequencing, Weitemier et al., 2014). As the off-target reads are often rich in high-copy DNA, Hyb-Seq approaches have been used to reconstruct whole plastomes and to obtain ribosomal DNA profiles (Weitemier et al., 2014; Schmickl et al., 2016; Sproul et al., 2020).

To check whether off-target reads could also be recycled for identifying and potentially quantifying repetitive DNA, we selected the sedge Rhynchospora as a model. Rhynchospora is one of the largest genera of Cyperaceae, comprising ~350 species, but it is under-studied from a phylogenetic point of view (Thomas et al., 2009; Buddenhagen et al., 2016). Cytologically, Rhynchospora has been of great interest due to its holocentric chromosomes, which present the centromere dispersed along the sister chromatids rather than localized in a primary constriction (Bureš et al., 2013). Moreover, cytogenomic studies on Rhynchospora pubera have led to the discovery of the first centromere-specific satellite DNA reported in a holocentric organism, Tyba (Marques et al., 2015). Subsequent studies confirmed the presence of Tyba in other Rhynchospora species (Ribeiro et al., 2017). Nevertheless, other non-centromeric satellites found in Rhynchospora species showed the typical block-like pattern on localized chromosomal regions (Ribeiro et al., 2017).

Large-scale repeat analysis covering all major clades of Rhynchospora would provide valuable insights into the repeat evolution of this genus. Recently, using a set of probes developed by anchored hybrid enrichment (Buddenhagen et al., 2016), a number of Rhynchospora species were sequenced using a TCS approach. Here, we assessed the quality of repeat characterization in five Rhynchospora species (for which we also possessed GS data) from this dataset. As a considerable part of the off-target reads can be sequences close to the original targeted loci (Dodsworth et al., 2019), we searched for sequencing bias by comparing the target results with results obtained from GS (i.e. unenriched libraries). We specifically addressed three questions: (1) can we characterize the repetitive DNA fraction of Rhynchospora genomes using TCS data, compared with GS data?; (2) can we develop cytological markers from TCS clustering data?; and (3) can we use TCS data to reconstruct repeat-based phylogenetic trees?

MATERIALS AND METHODS

Plant material and sequence data

To assess whether the off-target reads from TCS could be used in a similar manner as GS data to identify high-copy repeats, a comparison between these data types was necessary. At first, only three species fitted this requirement (Rhynchospora globosa, R. pubera and R. tenuis); thus, two additional species were collected for GS sequencing. Individuals of Rhynchospora cephalotes (L.) Vahl and R. exaltata Kunth were collected near the towns of Jacaraú (Paraíba, Brazil, voucher UFP87625) and Jaqueira (Pernambuco, Brazil, voucher JPB51537), respectively. These individuals were cultivated in (1) the experimental garden of the Laboratory of Plant Cytogenetics and Evolution at the Universidade Federal de Pernambuco (Brazil), (2) the greenhouse of the Max Planck Institute for Plant Breeding Research (Cologne, Germany) and (3) the greenhouse of the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK Gatersleben, Germany).

We downloaded available short-read archive data of R. pubera (Vahl) Boeckeler (Marques et al., 2015, BioProject PRJEB9643) from the NCBI GenBank (www.ncbi.nlm.nih.gov). Genome skimming sequences of R. globosa (Kunth) Roem. & Schult. and R. tenuis Link were obtained from Ribeiro et al. (2017) and deposited in GenBank under BioProject PRJNA672922. Target capture sequencing reads (150 bp) for R. cephalotes, R. exaltata, R. globosa, R. pubera and R. tenuis were obtained from Buddenhagen (2016) and deposited in GenBank under BioProject PRJNA672127.

DNA extraction and sequencing

Genomic DNA from R. cephalotes and R. exaltata was extracted from leaves with a NucleoBond HMW DNA kit (Macherey-Nagel, Düren, Germany). Quality was assessed with an Agilent TapeStation and the gDNA was quantified by Qubit BR assay (Thermo). An Illumina-compatible library was then prepared from 400 ng of input gDNA with an NEBNext Ultra™ II FS DNA Library Prep Kit for Illumina (New England Biolabs) with a total of four PCR cycles to introduce dual indexed barcodes. Libraries were sequenced in the Max Planck Genome Centre Cologne on a HiSeq2500 system with 2 × 250 bp rapid mode using a HiSeq Rapid PE Cluster and Rapid SBS v2 kit. The new sequence data were deposited in GenBank under BioProject PRJNA672693.

Genome size measurements

Novel DNA content measurements for R. cephalotes and R. exaltata were estimated by flow cytometry. Sample preparation was done according to Loureiro et al. (2007). Young leaves of each of the studied plants were chopped simultaneously with their respective reference standard, Solanum lycopersicum ‘Stupicke’ (2C = 1.96 pg, Dolezel et al., 1992) for R. cephalotes and Raphanus sativus ‘Saxa’ (2C = 1.11 pg, Dolezel et al., 1992) for R. exaltata in a Petri dish (kept on ice) containing 2 mL of Woody Plant Buffer (WPB). The sample was then filtered through a 30-µm disposable mesh filter (CellTrics, Sysmex, Norderstedt, Germany) with subsequent addition of 50 μg mL−1 propidium iodide (from a stock of 1 mg mL−1; Sigma–Aldrich) and 50 µg mL−1 RNase (Sigma–Aldrich). Nine replicates per species were made. The samples were measured in a CyFlow Space flow cytometer (Sysmex) equipped with a green laser (532 nm). Histograms of relative fluorescence were obtained using the software Flomax v. 2.3.0. (Sysmex, Norderstedt, Germany). Mean fluorescence and coefficient of variation were assessed at half of the fluorescence peak. The absolute DNA content (pg per 2C) was calculated by multiplying the ratio of the G1 peaks by the genome size of the internal standard.
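The calculation in the last sentence can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: the function names and the peak values used in the usage line are hypothetical, and the conversion factor of 1 pg = 978 Mbp is the commonly used value, assumed here because the text does not state which factor was applied.

```python
PG_TO_MBP = 978.0  # assumed conversion: 1 pg of DNA is roughly 978 Mbp


def absolute_dna_content(sample_g1_mean, standard_g1_mean, standard_2c_pg):
    """Return the 2C DNA content (pg) of the sample.

    The ratio of the sample and standard G1 peak means is multiplied
    by the known 2C value of the internal standard.
    """
    if sample_g1_mean <= 0 or standard_g1_mean <= 0:
        raise ValueError("G1 peak means must be positive")
    return (sample_g1_mean / standard_g1_mean) * standard_2c_pg


def pg_2c_to_mbp_1c(dna_2c_pg):
    """Convert a 2C value in pg to a 1C value in Mbp."""
    return dna_2c_pg / 2.0 * PG_TO_MBP


# Hypothetical peak means; 1.96 pg is the 2C value of the tomato standard.
example_2c = absolute_dna_content(50.0, 100.0, 1.96)
```

Under the assumed factor, a 2C value of 0.73 pg corresponds to about 357 Mbp per 1C.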

Filtering of TCS reads

As our aim was to characterize the repeat fraction of the sequenced species, we were only interested in the off-target reads, whose abundance is inversely proportional to the efficiency of target sequence enrichment. To remove the target reads, we filtered the raw TCS data by mapping them to a set of 256 sequences representing consensus sequences of the target loci that were enriched in the Rhynchospora dataset (Buddenhagen, 2016), saving the unmapped reads for the repeat characterization. Two different mapping algorithms were used for comparison: the Geneious Read Mapper v. 6.0.3, with custom sensitivity settings (60 % maximum mismatch per read, index word length = 12, maximum ambiguity = 8, Kearse et al., 2012) and the Bowtie2 v. 2.4.1 mapper with high-sensitivity preset settings (end-to-end alignment, 0–800 insert size, report all matches, Langmead and Salzberg, 2012), both implemented in the software Geneious v. 7.1.9 (Kearse et al., 2012). These settings were chosen after testing for computational time and mapping results. The Bowtie2 ‘highest sensitivity’ setting presented no major improvement in the mapping results when compared with the faster ‘high sensitivity’ preset. We applied ‘end-to-end’ instead of ‘local’ alignment to avoid sequence trimming and to better compare with the Geneious Read Mapper results. In addition to the two mapping algorithms, we uploaded a FASTA file with the 256 target sequences to RepeatExplorer (https://repeatexplorer-elixir.cerit-sc.cz/) prior to the analysis as a Custom Repeat Database (Novak et al., 2013). With this option we could exclude from the analysis clusters of enriched gene sequences mistakenly identified as repeats. In summary, we ended with one GS dataset (comprising previously published datasets for R. globosa, R. pubera and R. tenuis and the newly sequenced R. cephalotes and R. exaltata data) and four different ‘target datasets’ for each species: (1) raw target capture reads; (2) reads left after mapping with Geneious Read Mapper; (3) reads left after mapping with Bowtie2 mapper and (4) reads left after exclusion of ‘enriched gene clusters’ annotated according to a Custom Repeat Database (RepeatExplorer).
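As an illustration of the step in which unmapped reads are saved, the sketch below scans alignments in SAM text format (the format written by mappers such as Bowtie2) and returns the names of reads for which no mate aligned to a target locus. It is a simplified stand-in for the Geneious workflow described above. The function name, and the rule that a pair is discarded when either mate maps, are assumptions made for the example.

```python
UNMAPPED = 0x4  # SAM FLAG bit set when the read itself did not align


def off_target_read_names(sam_lines):
    """Return read names (in input order) with no aligned mate.

    Header lines (starting with '@') are skipped. A read name is
    treated as on-target as soon as one of its records is mapped.
    """
    mapped = set()
    order = []
    seen = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if name not in seen:
            seen.add(name)
            order.append(name)
        if not flag & UNMAPPED:
            mapped.add(name)
    return [name for name in order if name not in mapped]
```

The names returned could then be used to extract the corresponding reads from the original FASTQ files before clustering.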

In silico repeat analysis

In order to compare the repeat composition observed in all different datasets, we employed the RepeatExplorer pipeline (Novak et al., 2013). Reads from all datasets of R. cephalotes, R. exaltata, R. globosa, R. pubera and R. tenuis were uploaded to the platform, filtered by quality with default settings (95 % of bases equal to or above the quality cut-off value of 10) and interlaced. Clustering was performed with default settings of 90 % similarity over a 55 % minimum sequence overlap. The Find RT Domains tool and additional database searches (BLASTx) were used to identify protein domains for repeat annotation, and graph layouts of individual clusters were examined interactively using the SeqGrapheR tool (Novak et al., 2013).
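The quality criterion quoted above (95 % of bases at or above a cut-off of 10) can be expressed as a short check on a FASTQ quality string. This is a sketch of the rule as stated, not RepeatExplorer's own code; the function name is hypothetical and Phred+33 encoding of quality characters is assumed.

```python
def passes_quality_filter(quality_string, cutoff=10, min_fraction=0.95,
                          phred_offset=33):
    """Return True if enough bases reach the quality cut-off.

    quality_string is the fourth line of a FASTQ record. Each character
    is decoded as ord(char) - phred_offset (Phred+33 is assumed).
    """
    if not quality_string:
        return False
    good = sum(1 for ch in quality_string
               if ord(ch) - phred_offset >= cutoff)
    return good / len(quality_string) >= min_fraction
```

For a 20-base read, one base below the cut-off is tolerated (19/20 = 95 %), whereas two are not.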

Although the entire set of reads from each dataset was uploaded to RepeatExplorer, we used the Read Sampling option on the clustering analysis to manually input the number of reads to be analysed, accounting for 0.13× of the genome of each species (Table 1). Only for R. globosa was this not possible, as no information on genome size was available for this species. In this case, we ran tests with 500 000, 1 000 000 and 2 000 000 reads. However, independently of the number of reads used as input, only 200 061 reads were analysed, always with similar results. Clusters with at least 0.01 % genome abundance were automatically annotated and manually checked. We used the TAREAN tool (Novák et al., 2017) available in the RepeatExplorer pipeline to annotate satellite DNAs. Satellites were named based on previous publications (Marques et al., 2015; Ribeiro et al., 2017) or by using the species abbreviation followed by SAT, a number based on the decreasing order of abundance and a hyphen followed by the number of base pairs of the monomer. The consensus monomer sequences of the identified satellite DNAs of each species were compared using DOTTER (Sonnhammer and Durbin, 1995) in order to confirm tandem organization and to identify similarity among repeats from the same family.
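The number of reads needed for a given fraction of genome coverage follows from N = coverage × genome size / read length. The sketch below shows this arithmetic; the function name and the example values are hypothetical, and because the read length entering the calculation is not restated here, the output is not expected to reproduce the counts in Table 1.

```python
def reads_for_coverage(genome_size_mbp, coverage, read_length_bp):
    """Return the number of reads giving the requested coverage.

    genome_size_mbp: haploid (1C) genome size in Mbp
    coverage: target depth, e.g. 0.13 for 0.13x
    read_length_bp: length of a single read in bp (an assumption here)
    """
    if genome_size_mbp <= 0 or coverage <= 0 or read_length_bp <= 0:
        raise ValueError("all arguments must be positive")
    total_bases = genome_size_mbp * 1_000_000 * coverage
    return round(total_bases / read_length_bp)


# Hypothetical example: 200 Mbp genome, 0.5x coverage, 100 bp reads.
n_reads = reads_for_coverage(200, 0.5, 100)
```

When no genome size is available, as for R. globosa, this calculation cannot be made and a fixed number of reads has to be supplied instead.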

Table 1.

Genome size (1C), number of reads from GS and raw TCS datasets, percentage of mapped reads on Bowtie2 Mapper and Geneious Read Mapper datasets, percentage of reads identified on ‘target clusters’ of RepeatExplorer (RE) dataset and number of reads analysed for all datasets on each species.

| Species | 1C (Mbp) | Number of GS read pairs | Number of raw TCS reads^a | Filtered TCS reads (%)*, Bowtie2 | Filtered TCS reads (%)*, Geneious | Filtered TCS reads (%)*, RE | Read pairs analysed |
|---|---|---|---|---|---|---|---|
| Rhynchospora cephalotes | 356.97 | 11 737 291 | 6 932 932 | 11.77 | 23.12 | 16.80 | 600 000 |
| Rhynchospora exaltata | 244.5 | 21 384 261 | 5 151 874 | 13.02 | 20.65 | 18.45 | 422 199 |
| Rhynchospora globosa | – | 4 000 000^b | 3 713 422 | 4.10 | 14.95 | 8.82 | 200 061 |
| Rhynchospora pubera | 1613.7^c | 4 000 000^c | 4 164 982 | 3.96 | 12.29 | 8.66 | 2 797 919 |
| Rhynchospora tenuis | 381.42^d | 4 000 000^b | 4 826 084 | 10.18 | 20.64 | 6.36 | 661 128 |

*Percentage of target reads identified by the different strategies. RE, RepeatExplorer.

Testing method performance

To assess whether the abundances of the different repeats observed in our raw and filtered target capture datasets were similar, we compared their abundances with the ones observed in the GS datasets. For this, we used the abundance values of the individual repeat families (Supplementary Data Table S1). As we wanted to check whether the order of abundance of different classes of repetitive elements was similar between the different datasets, we applied Spearman’s rank correlation using a t-distribution with N – 2 degrees of freedom to calculate test significance (P-value). The analysis was undertaken with the package stats implemented in the software R v. 4.0.2 (R Core Team, 2019). Correlation plots were constructed with the R package ggplot2 (Wickham, 2016).
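For readers who want to see the statistic spelled out, the following sketch computes Spearman's rank correlation (with average ranks for ties) and the corresponding t statistic, t = r·sqrt((N − 2)/(1 − r²)), on N − 2 degrees of freedom. It uses only the Python standard library and is not the R code used for the analysis; the function names are hypothetical, and the P-value itself would still have to be read from a t-distribution.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_with_t(x, y):
    """Return (rho, t statistic, degrees of freedom) for paired data."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equal-length samples with n >= 3")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("ranks have zero variance")
    rho = sxy / math.sqrt(sxx * syy)
    denom = 1.0 - rho ** 2
    if denom <= 0:
        t_stat = math.copysign(math.inf, rho)  # perfect monotone relation
    else:
        t_stat = rho * math.sqrt((n - 2) / denom)
    return rho, t_stat, n - 2
```

Here x and y would be the GS and TCS abundances of the same repeat lineages, listed in the same order.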

Repeat amplification, probe labelling and in situ hybridization

Primers for the R. cephalotes Tyba variant found in the TCS datasets (see Results section) were designed based on the most conserved region of the consensus sequence (F, 5′-AAGCTATTTGAATGCAATTATGTGC; R, 5′-AGCGTTTCTAGCCACATTTGA). Genomic DNA (40 ng) of R. cephalotes was used in a PCR reaction with 1× PCR buffer, 2 mM MgCl2, 0.1 mM of each dNTP, 0.4 μM of each primer, 0.025 U of Taq polymerase (Qiagen) and water. The PCR conditions were as follows: 94 °C for 2 min; 30 cycles of 94 °C for 50 s, 58 °C for 50 s and 72 °C for 1 min; and 72 °C for 10 min. PCR products were labelled with Atto488-dUTP (Jena Bioscience) with a nick translation labelling kit (Jena Bioscience).

Mitotic chromosomes of R. cephalotes were prepared from root tips, pretreated in 2 mM 8-hydroxyquinoline at 10 °C for 20 h, fixed in ethanol:acetic acid (3:1 v/v) for 2 h at room temperature and stored at −20 °C. Fixed root tips were digested with 2 % cellulase, 2 % pectinase and 2 % pectolyase in citrate buffer (0.01 M sodium citrate and 0.01 M citric acid) for 120 min at 37 °C and squashed in a drop of 45 % acetic acid. Fluorescent in situ hybridization (FISH) was performed as described by Aliyeva-Schnorr et al. (2015). The hybridization mixture contained 50 % (v/v) formamide, 10 % (w/v) dextran sulphate, 2 × SSC and 5 ng μL−1 of the probe. Slides were denatured at 75 °C for 5 min, and the final stringency of hybridization was 76 %.

Immuno-FISH

To visualize the centromeres of R. cephalotes, we performed immunostaining of the centromere-specific histone variant CENH3 with polyclonal antibodies developed for R. pubera (RpCENH3, Marques et al., 2015). Mitotic preparations were made from root meristems fixed in 4 % paraformaldehyde in Tris buffer (10 mM Tris, 10 mM EDTA, 100 mM NaCl, 0.1 % Triton, pH 7.5) for 5 min on ice under vacuum and for another 25 min only on ice. After washing twice in Tris buffer, the roots were chopped in LB01 lysis buffer (15 mM Tris, 2 mM Na2EDTA, 0.5 mM spermine 4HCl, 80 mM KCl, 20 mM NaCl, 15 mM β-mercaptoethanol, 0.1 % Triton X-100, pH 7.5), filtered through a 50-μm filter (CellTrics, Sysmex), and diluted 1:10; subsequently, 100 μL of the diluted suspension was centrifuged onto microscopic slides using a Cytospin3 (Shandon, Germany) as described by Jasencakova et al. (2001). Immuno-FISH with anti-RpCENH3 antibodies and the Tyba repeat was performed according to Ishii et al. (2015). We used rabbit anti-RpCENH3 (diluted 1:200) as primary antibody and detected it with Cy3-conjugated anti-rabbit IgG (Dianova) secondary antibody (diluted 1:200). Slides were incubated overnight at 4 °C and washed three times in 1 × PBS before the secondary antibody was applied.

Microscopy

For widefield microscopy, we used an epifluorescence microscope (BX61, Olympus) equipped with a cooled CCD camera (Orca ER, Hamamatsu). To achieve super-resolution of ~120 nm (with a 488 nm laser excitation), we applied spatial structured illumination microscopy (3D-SIM) using a 63×/1.40 Oil Plan-Apochromat objective of an Elyra PS.1 microscope system and the software ZENBlack from Carl Zeiss (Weisshart et al., 2016).

Comparative repeat phylogenomics

We employed a repeat abundance-based phylogenetic inference method (see details in Dodsworth et al., 2015) to assess if repeat abundance identified in our TCS reads could be used to resolve phylogenetic relationships, using one of our filtered datasets (Bowtie) and the GS dataset for comparison. First, we concatenated reads for our five species with 0.065× coverage, with species-specific codes for each set of reads, and ran a comparative clustering analysis (simultaneous clustering of all species on the dataset) on RepeatExplorer with default settings (Novak et al., 2013). As the sequences were coded with the species names, we could identify the number of reads that each species contributed to each of the generated clusters, which is proportional to the abundance of each repeat in the genome of each species. Parsimony analysis using repeat abundances as quantitative characters was undertaken as described by Dodsworth et al. (2015).
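The bookkeeping behind the comparative clustering, counting how many reads each species contributes to each cluster, can be sketched as follows. This is an illustration only: the input layout (a mapping from cluster name to read names), the species codes and the function name are hypothetical and do not reflect RepeatExplorer's actual output files.

```python
from collections import Counter


def abundance_matrix(clusters, species_codes):
    """Count reads per species in each cluster.

    clusters: dict mapping a cluster id to a list of read names, where
              each read name starts with its species code
    species_codes: list of prefixes, e.g. ["Rc", "Re", "Rp"]
    Returns (sorted cluster ids, dict of species code -> list of counts
    in the same cluster order).
    """
    cluster_ids = sorted(clusters)
    matrix = {code: [] for code in species_codes}
    for cid in cluster_ids:
        counts = Counter()
        for read_name in clusters[cid]:
            for code in species_codes:
                if read_name.startswith(code):
                    counts[code] += 1
                    break
        for code in species_codes:
            matrix[code].append(counts[code])
    return cluster_ids, matrix
```

Each row of the resulting matrix is one species and each column one cluster, which is the layout needed when abundances are treated as quantitative characters.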

To assess the phylogenetic potential of repetitive elements based on sequence similarity, we used the AAF approach (Fan et al., 2015) with all reads identified as repeats by RepeatExplorer in the Bowtie dataset. AAF constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. It also calculates the statistical properties of the pairwise distances between genomes, allowing it to optimize parameter selection and to perform bootstrapping.
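To give an intuition for how a distance can be obtained from unassembled reads, the sketch below compares two read sets by the k-mers they share. It deliberately uses a plain Jaccard distance on k-mer sets, which is a simpler measure than the estimator implemented in AAF, and it ignores reverse complements and k-mer coverage filtering. All names and the default k are assumptions made for the example.

```python
def kmer_set(reads, k):
    """Collect all k-mers made only of A, C, G and T from the reads."""
    valid = set("ACGT")
    kmers = set()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= valid:
                kmers.add(kmer)
    return kmers


def jaccard_distance(reads_a, reads_b, k=5):
    """Return 1 - |shared k-mers| / |all k-mers| for two read sets."""
    a, b = kmer_set(reads_a, k), kmer_set(reads_b, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

A matrix of such pairwise distances between species is the kind of input from which a distance-based tree can be built.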

In order to compare our repeat abundance-based phylogeny with a nuclear marker-based phylogeny, we extracted the aligned sequences of 256 loci gathered by target capture (Buddenhagen, 2016) of our five Rhynchospora species. For simplification, we used the most general model of DNA substitution, GTR + I + G (Abadi et al., 2019). Phylogenetic relationships were inferred using Bayesian inference as implemented in BEAST v. 1.8.3 (Drummond and Rambaut, 2007). Two independent runs with four Markov chain Monte Carlo (MCMC) chains each were conducted, sampling every 1000 generations for 10 000 000 generations. Each run was evaluated in TRACER v.1.7 (Rambaut et al., 2018) to assess MCMC convergence and a burn-in of 25 % was applied. We then obtained the consensus phylogeny and clade posterior probabilities with the ‘sumt’ command.

RESULTS

Efficiency of target sequence filtering

We generated GS data and genome size estimates for R. cephalotes and R. exaltata to add to the already sequenced data of the other three Rhynchospora species analysed here in order to compare repeat composition from TCS and GS data (Table 1). The percentage of filtered reads presented in Table 1 is indicative of the enrichment efficiency, or how much of the final library corresponds to one of the target genes. The Geneious datasets presented a higher number of filtered reads than the Bowtie datasets, with an average of 18.3 % of filtered reads for Geneious and 8.6 % for Bowtie (Table 1). Rhynchospora pubera presented the smallest number of filtered reads among all five species, with both the Bowtie2 and Geneious filters (3.96 and 12.29 % respectively). Rhynchospora cephalotes had the highest number of filtered reads among the Geneious datasets, while R. exaltata presented the largest proportion among the Bowtie datasets. Using the Custom Repeat Database option of RepeatExplorer, the highest amount of ‘target clusters’ was found in R. exaltata (18.45 %) and the lowest amount was found in R. tenuis (6.36 %).

Repetitive DNA content of different datasets

To evaluate the quality of repeat characterization from TCS datasets, the genomic proportions of different repetitive element lineages were compared with those observed in the GS datasets (Fig. 1). Generally, the proportion of the total repeat fraction observed in the GS datasets was smaller than the ones observed in all the TCS datasets (Supplementary Data Fig. S1). In order to demonstrate whether our filtering strategies were able to improve repeat mining in TCS data, we included the raw TCS datasets in the analysis. As these raw TCS reads contained enriched target sequences, it was expected that these sequences would be present as unclassified clusters by RepeatExplorer. The raw TCS datasets presented the highest values for total repetitive fraction in almost all species (including the unclassified, putative target sequences), with the exception of R. globosa, in which the Geneious dataset showed a total of 49.41 % repeat proportion against 46.74 % on the raw dataset. This is also reflected in the difference in the number of clusters representing at least 0.01 % of total genomic proportion formed in the clustering analysis. While the GS datasets presented cluster numbers varying from 155 to 294, filtered TCS datasets ranged from 462 to 589 clusters and raw TCS datasets ranged from 628 to 751. These differences are mostly due to the discrepancy in the proportion of unclassified repetitive elements in the different datasets (Supplementary Data Fig. S1). The unclassified repeat proportion on GS varied from 6.21 % in R. exaltata to 14.70 % in R. globosa, while in TCS it accounted for up to 43.63 % (raw TCS of R. exaltata). Overall, the proportion of unclassified repeats was smaller in all filtered datasets when compared with raw TCS (Fig. 2A). Additional mapping to the original 256 target regions and separate BLAST searches with conserved domains did not produce any matches for the unclassified clusters. In contrast, BLASTx of highly abundant unclassified clusters of the TCS datasets showed similarity to coding sequences for proteins such as glycosylphosphatidylinositol anchor protein and oligomeric Golgi complex subunits. This confirmed that at least part of the excess of unclassified clusters in the TCS data was a by-product of accidentally enriched non-repetitive genomic regions. Furthermore, there was a high number of repetitive element lineages that were not found in the GS dataset but were found in TCS datasets (Supplementary Data Table S1). These additional repeats could be low-abundance elements that, although not abundant enough to be detected in the GS datasets, could have been accidentally enriched in the TCS datasets.

Fig. 1.


Barplots representing genomic abundance of classified repeat types identified in every dataset (GS, genome skimming; Raw TCS, raw target capture sequencing; Geneious, Geneious filtered dataset; Bowtie, Bowtie2 filtered dataset; RE, RepeatExplorer custom database filtered dataset) of Rhynchospora cephalotes (A), R. exaltata (B), R. globosa (C), R. pubera (D) and R. tenuis (E). Bar colours represent different repeat types according to the key at the lower right corner.

Fig. 2.


Boxplots comparing genomic proportions of unclassified elements (A) and correlation index (r) with GS data (B) between raw and filtered target capture datasets [RepeatExplorer (RE), Bowtie and Geneious] of all Rhynchospora species.

The filtering strategies did not largely impact satellite DNA abundance in the analysed species, with the exception of R. exaltata, in which it was possible to identify ~4-fold more satellite reads in the Bowtie2 and Geneious datasets than in the raw datasets. Also in R. exaltata, there was a huge discrepancy in the abundance of satellites, much higher in the GS dataset when compared with all TCS datasets (Fig. 1). However, the two satellite DNAs responsible for this abundance difference were also found in all TCS datasets (Supplementary Data Table S2, Fig. S2), although in smaller proportions. The amount of satellite DNA was generally lower in the TCS datasets, with the exception of R. tenuis, in which all TCS datasets presented a small increase in satellite abundance when compared with GS (Fig. 1). In this case, clusters formed by unfiltered enriched sequences could have masked the identification of low-abundance repeats. Although in some species some of the satellites found in GS data were not present in every dataset, the most abundant for each species could be identified in all of the TCS datasets. Similarly, some satellites found in TCS datasets were not found in the GS datasets, possibly being low-abundance satellites accidentally enriched (Supplementary Data Table S2, Fig. S2).

For mobile elements, there was a general agreement in the order of abundance of repeat types found in the GS and TCS datasets, with filtered datasets showing an increase in the proportion of annotated elements when compared with the raw TCS dataset (Fig. 1), with a few exceptions. For example, in R. exaltata, LTRs from the Ty3/gypsy superfamily were more abundant in the raw TCS, Bowtie2 and RepeatExplorer datasets than in the GS datasets (Fig. 1). We also compared the abundances at lineage level (Supplementary Data Table S1). Patterns of abundance of LTR families from GS and TCS datasets were similar, with most of the genomic abundance of Ty1/copia and Ty3/gypsy superfamilies being the result of the amplification of up to four main lineages (Supplementary Data Table S1). Generally, LTRs found in GS with abundance as low as 0.01 % could also be identified in the target datasets. Surprisingly, in all five species, a greater diversity of LTR retroelements was observed in the target capture datasets when compared with GS (Supplementary Data Table S1). This led to a few interesting discrepancies, such as in R. pubera, where target capture datasets showed high abundance of Ty3/gypsy/Retand (1–1.3 %) and Ty3/gypsy/Tekay elements (0.46–0.50 %), despite these not being found in GS data.

To statistically compare the results obtained in the different datasets, we checked for a correlation between the classified repeat abundances observed (at lineage level, when possible) in all target capture datasets with the ones observed in the GS datasets (Fig. 3). Although all tests showed significant correlations (P < 0.05), the strength of the correlation varied depending on the dataset. Raw TCS datasets had the weakest correlation in all five species, with filtering of the targeted sequences generally improving the correlation with the GS dataset (Fig. 2B). The strongest correlations for R. globosa (r = 0.92), R. pubera (r = 0.74) and R. tenuis (r = 0.79) were observed with the Geneious dataset, while for R. cephalotes and R. exaltata the best correlation was observed on the Bowtie dataset (r = 0.78 and 0.75). The RepeatExplorer filtering for R. globosa and R. tenuis did not improve the correlation when compared with the raw TCS dataset (r = 0.85 and r = 0.66 respectively). However, it also showed as strong a correlation as Bowtie for R. cephalotes (r = 0.78) and as Geneious for R. pubera (r = 0.74). In addition to this, we checked if there was a significant correlation between enrichment efficiency (the proportion of the genomic library that hybridized to a target probe) and the repeat characterization efficiency (the proportion of classified repeats given by the RepeatExplorer analysis). Even though our sampling was small (n = 5), we could see a clear trend of inverse correlation between these values (P = 0.02, R² = 0.83), which shows that highly efficient library enrichment may impair repeat characterization in off-target reads.
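The rank correlation used in this comparison can be illustrated with a minimal sketch. The abundance values below are hypothetical and are not taken from the datasets analysed here; the function names are likewise illustrative. The code ranks each set of abundances (tied values share the mean of their ranks) and then computes Pearson's coefficient on the ranks, which is the standard definition of Spearman's r.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances (%) of five repeat lineages in two datasets.
gs_abundance = [4.0, 2.5, 1.0, 0.3, 0.01]
tcs_abundance = [3.1, 2.9, 0.4, 0.5, 0.02]

print(round(spearman(gs_abundance, tcs_abundance), 2))
```

Because only ranks enter the calculation, two datasets can correlate strongly even when their absolute abundance values differ, which is why this statistic suits a comparison of the order of abundance rather than of exact proportions.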

Fig. 3.

Correlation between repeat abundances observed on GS and target capture datasets of Rhynchospora species. Genome skimming abundance values are represented on the x-axis of each plot, while target dataset abundances are on the y-axis. Spearman’s correlation index (r) for each case is shown in the lower right corner. RE, RepeatExplorer.

Chromosomal localization of the satellite DNA found in the TCS dataset

To test whether it was possible to use the TCS data to investigate the chromosomal repeat distribution by FISH, we chose the most abundant repetitive element found in the R. cephalotes TCS datasets. This repeat was a 172-bp satellite DNA with 60 % sequence similarity to Tyba of R. pubera (Marques et al., 2015). The in situ hybridization pattern of the R. cephalotes Tyba (RcTyba) variant was similar to the distribution reported in R. pubera. Small foci appeared in interphase nuclei, and a continuous line along both condensed chromatids occurred at all pro- and metaphase chromosomes. Immuno-FISH combining a CENH3-specific antibody with the RcTyba probe showed co-localization of CENH3 and RcTyba, confirming the holocentric centromere structure of R. cephalotes (Fig. 4).

Fig. 4.

Co-localization of R. cephalotes Tyba repeats and CENH3 in interphase nuclei (A) and prometaphase (B) and metaphase chromosomes (C) via 3D-SIM imaging. Both Tyba and CENH3 clearly indicate the presence of holocentromeres in the condensed mitotic chromosomes. DAPI, 4′,6-diamidino-2-phenylindole.

Repeat abundance and structure found in TCS data reflect phylogenetic relationships

We used repeat abundances obtained by comparative clustering analysis of Bowtie datasets of our five Rhynchospora species to reconstruct phylogenetic relationships. In the comparative clustering analysis of the GS dataset, 1 754 326 concatenated reads were analysed, forming 450 clusters with at least 0.01 % genomic abundance. For the Bowtie dataset, 1 186 605 concatenated reads were analysed, with 582 clusters representing at least 0.01 % of total genomic abundance. Repeat composition varied among species, with the largest clusters of each species being almost or completely absent in the others (Fig. 5A). By using the first 150 most abundant clusters of the comparative analysis, we were able to reconstruct the phylogenetic relationships among the five Rhynchospora species for both the Bowtie (Fig. 5B) and GS datasets (Fig. 5C) with high bootstrap support (BS) (mean BS = 100 and 98.3, respectively). Using the reads from all clusters identified in the Bowtie dataset comparative analysis, the AAF analysis yielded the same relationships with high bootstrap support (mean BS = 100, Fig. 5D). Branch lengths of the abundance-based analysis were significantly higher than for the AAF analysis. Despite this, the species relationships observed in the repeat-based phylogenies were congruent with the ones retrieved in the Bayesian analysis with 256 concatenated target loci, with R. cephalotes + R. exaltata forming a clade sister to R. pubera + R. tenuis, and R. globosa sister to both clades (Fig. 5E).
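The cluster selection described above (retain clusters above an abundance threshold, then keep the most abundant ones) can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in this study: the data structure, the function name `top_clusters` and the toy read counts are hypothetical, and abundance is assumed to be the percentage of all analysed reads assigned to a cluster.

```python
def top_clusters(cluster_reads, total_reads, min_percent=0.01, top_n=150):
    """Return (cluster_id, percent) pairs above min_percent, largest first.

    cluster_reads maps a cluster id to a dict of per-species read counts.
    """
    kept = []
    for cluster_id, per_species in cluster_reads.items():
        # Abundance is taken as the share of all analysed reads in the cluster.
        percent = 100.0 * sum(per_species.values()) / total_reads
        if percent >= min_percent:
            kept.append((cluster_id, percent))
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


# Hypothetical read counts for three clusters in two species.
toy_clusters = {
    "CL1": {"species_1": 5000, "species_2": 0},
    "CL2": {"species_1": 1000, "species_2": 2000},
    "CL3": {"species_1": 3, "species_2": 2},  # 0.005 %, below the threshold
}
selected = top_clusters(toy_clusters, total_reads=100_000, top_n=2)
print([cluster_id for cluster_id, _ in selected])
```

The per-species read counts of the retained clusters would then serve as the characters for tree reconstruction; that step is not shown here.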

Fig. 5.

Phylogenomics of Rhynchospora species using GS and target capture datasets. (A) Graphic representation of the 30 most abundant clusters originated from the comparative clustering analysis with the Bowtie dataset. Height of rectangles represents the genomic abundance of each cluster, with colours indicating the repeat type, according to the key. Asterisks (*) indicate the clusters that represent Tyba, which is absent in R. globosa. (B–E) Phylogenetic trees obtained by repeat abundance of the GS (B) and Bowtie (C) datasets, AAF of total reads in repetitive clusters (D) and Bayesian inference based on 256 target regions (E). Star in (B) represents the only node with support <100 (BS = 99).

DISCUSSION

TCS data can be used to identify highly abundant repeats

We were able to find most of the repeat diversity of five Rhynchospora species using filtered and unfiltered TCS reads. Depending on the filtering strategy employed, the Rhynchospora data used here showed low abundance of on-target reads, indicating a considerable proportion of off-target reads suitable for repeat analysis. Target-based sequencing approaches often have varied enrichment efficiency, with on-target enriched reads representing as little as 5 % of the final genomic library (Johnson et al., 2019). This is particularly the case with universal kits that are designed to work across large taxonomic groups, at the cost of enrichment efficiency. Thus, it is believed that the off-target sequence reads from such libraries can be used in a similar fashion to GS sequencing, an approach often described as Hyb-Seq (Weitemier et al., 2014; Dodsworth et al., 2019). This approach has been used for plastome assembly in several species (Weitemier et al., 2014; Schmickl et al., 2016), as well as for ribosomal DNA profile comparisons, in which GS and TCS data showed satisfactory correlation (Sproul et al., 2020). However, this method was yet to be tested for identification of high-copy repeats such as satellite DNA and transposable elements.

Our results show that there is a moderate rank correlation between the abundances of annotated repeats obtained by analysing GS and raw target capture datasets, and that this correlation increases when analysing off-target reads only. The significant correlation indexes showed that although annotated repeat proportions of our filtered target datasets are not identical to those observed in GS, high-copy repeats can be sufficiently identified in off-target reads. Although the Geneious dataset presented the highest proportion of filtered reads and highest correlation index with GS data for three of the five species, the Bowtie2 and RepeatExplorer approaches were also sufficient for increasing the correlation index in most cases. Furthermore, different settings for the mappers, such as the ‘local’ read alignment setting for Bowtie2, can be tested in order to improve mapping to target sequences. Therefore, any of the filtering strategies presented here can produce sufficiently accurate results in regard to repeat identification and order of abundance when compared with the traditional GS approach.
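One plausible way to separate off-target reads after mapping against the target sequences is to inspect the FLAG field of the mapper's SAM output, in which bit 0x4 marks a read that did not align. The sketch below is an assumption about how such a filter could be written and does not reproduce the exact procedure used for the Bowtie2 or Geneious datasets; the function name and the toy SAM records are hypothetical. A read pair is treated as off-target only if neither mate aligned.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read did not align to the reference


def off_target_read_names(sam_lines):
    """Return sorted names of reads (or pairs) with no mate aligned to a target."""
    any_mate_mapped = {}
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        mapped = not (flag & UNMAPPED)
        any_mate_mapped[name] = any_mate_mapped.get(name, False) or mapped
    return sorted(name for name, hit in any_mate_mapped.items() if not hit)


# Toy SAM records (hypothetical): read1 is fully unmapped, read2 is a
# properly mapped pair, read3 has one mapped and one unmapped mate.
toy_sam = [
    "@HD\tVN:1.6",
    "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read1\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "read2\t99\tlocus1\t10\t42\t4M\t=\t60\t54\tACGT\tIIII",
    "read2\t147\tlocus1\t60\t42\t4M\t=\t10\t-54\tCCGA\tIIII",
    "read3\t73\tlocus2\t5\t30\t4M\t=\t5\t0\tGGCA\tIIII",
    "read3\t133\tlocus2\t5\t0\t*\t=\t5\t0\tTTTT\tIIII",
]
print(off_target_read_names(toy_sam))
```

Discarding pairs in which only one mate aligns is the stricter choice: it removes more target-adjacent sequence at the cost of fewer retained reads.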

Although the total abundance of satellite DNAs varied between GS and TCS datasets, we were able to identify the most abundant satellite DNA families of all five Rhynchospora species in all target datasets. While some low-abundance satellites (<0.2 % of genomic abundance) found in GS data were not found in the target datasets, others with abundances as low as 0.09 % were found in all datasets, suggesting that satellite abundance does not necessarily affect its presence in the off-target reads (Supplementary Data Table S2). More importantly, various satellites found in the TCS datasets were not found in the GS dataset (Supplementary Data Table S2), probably being low-abundance satellites accidentally enriched in the sequencing process. Low-abundance satellites may sometimes not be detected by RepeatExplorer, requiring additional filtering to be detected (Ruiz-Ruano et al., 2016), which could explain the absence of these satellites in our GS datasets.

One of the benefits of using Rhynchospora as a model for this study was the fact that it possesses satellite DNAs with varying chromosomal distributions. Tyba clusters are dispersed, forming a linear distribution along the metaphase holocentromeres of R. pubera, R. ciliata, R. cephalotes and R. tenuis (Marques et al., 2015; Ribeiro et al., 2017; this study). However, other satellites in Rhynchospora, such as RgSAT1-186 in R. globosa, form localized blocks (Ribeiro et al., 2017). Off-target sequences used in Hyb-Seq are often adjacent to the targeted gene regions (Weitemier et al., 2014; Dodsworth et al., 2019), raising the possibility that our analysis would preferentially identify widespread repeats such as Tyba, which are interspersed with genic regions (Marques et al., 2015). However, RgSAT1-186, identified as subterminal clusters in the chromosomes of R. globosa (Ribeiro et al., 2017), was the most abundant cluster in all the R. globosa TCS datasets, showing that the chromosomal distribution of a satellite DNA did not interfere with the randomness of the off-target reads (Supplementary Data Fig. S1).

As well as confirming its presence in the target datasets of R. pubera and R. tenuis, we were able to find novel Tyba variants in R. cephalotes and in R. exaltata, with 60 and 57 % sequence similarity to RpTyba, respectively. Rhynchospora globosa did not present any Tyba variant, in concordance with the results of Ribeiro et al. (2017). The abundance of Tyba in R. pubera was very low in the target datasets (~0.14 %) compared with the GS results (~2.8 % here, 3.6 % in Marques et al., 2015). In our R. pubera TCS datasets, the most abundant tandem repeat was RpSAT5-287, which appeared as only the fifth most abundant in the GS dataset (Supplementary Data Table S1). This satellite was not found in the previous R. pubera characterization and may indicate the potential to discover additional low-abundance sequences using TCS datasets. Although fast rates of evolution for satellite DNAs may lead to intraspecific abundance variation (Ceccarelli et al., 2011), this 20-fold difference is probably too high to be a product of differences between the individuals used for each sequencing method. Satellite abundances estimated by TAREAN depend on several factors, such as sequence coverage, monomer homogeneity and similarity with other repeats (Novák et al., 2017). As abundance information gathered by short reads can grossly underestimate the true abundance of this repeat type (e.g. Ribeiro et al., 2020), caution is needed when interpreting TCS-yielded genomic abundances, though similar caution is needed with GS data as well.

Transposable elements, particularly LTR-retrotransposons, are often the largest fraction of repetitive DNA in plants (Galindo-González et al., 2017). In our GS datasets, LTR-Ty1/copia was the most abundant repeat type in three out of five species. The TCS datasets yielded similar results to GS, with a few discrepancies, especially in R. exaltata (predominance of Ty3/gypsy instead of Ty1/copia) and R. pubera (significantly higher proportion of Ty3/gypsy than in GS). As discussed for the satellites, these discrepancies could be the result of some repeat lineages being accidentally enriched during the TCS procedure, which should demand caution when interpreting repeat abundances in these types of data. Additionally, the intraspecific variability of repeat abundance could account for some of the variation observed here, since GS and TCS datasets came from different individuals. Intraspecific repeat variation ranges from low (Renny-Byfield and Baumgarten, 2020) to high among natural populations and even within an organism (Shams and Raskina, 2018). Thus, it is possible that part of the discrepancies could also be caused by natural differences in the genomic abundance of some repetitive lineages within species. In order to do a more detailed comparison, we used the individual lineage abundance values for our correlation analysis. The high correlation rates indicate that different LTR lineages, although varying in abundance, contributed similarly to the repetitive fraction in GS and filtered target datasets. Our results show that in addition to being able to find the majority of amplified lineages of LTR-retrotransposons, we could also find lineages with abundances as low as 0.01 % in the target datasets. We also found a higher diversity of LTR lineages in the target datasets than in the GS data, with a few of these ‘extra’ lineages being over-abundant when compared with the GS dataset (Supplementary Data Table S1). These lineages were absent in the GS results probably due to masking by the highly abundant satellite DNA clusters, which were underestimated in some of our TCS datasets. Nonetheless, the fact that we could identify even low-abundance LTRs in the target datasets, coupled with the moderate correlation with the GS-yielded abundances, indicates that off-target reads from target sequencing may be sufficient to identify most of the LTR-retrotransposons in a genome.

It is important to note that, although the patterns of repeat abundance are highly similar, the numerical difference observed for some repeat classes shows that combining GS and TCS data may produce inconsistent results. Although it is possible to mine off-target reads for predominant repeats, the exact abundance values are not as accurate as those yielded by GS data, which is still the most recommended way to gather such information (Dodsworth, 2015). Another important point is that technical bias could still be a major factor in the reliability of TCS data for repeat mining. With our limited sample, it was already possible to find a high inverse correlation between the proportion of classified repeats and enrichment efficiency (or the proportion of on-target reads in the final TCS library). Enrichment efficiency of a single target-capture kit can vary greatly among distant species (e.g. 5–68 % variation in the species tested by Johnson et al., 2019). Future uses of off-target TCS reads may have to take into account that species with highly efficient enrichment may not produce reliable repeat characterization. As good capture efficiency (number of target genes in the final library) can be achieved with low enrichment efficiency (Dodsworth, 2015; Johnson et al., 2019), one possible strategy for future projects would be to sequence a portion of the unenriched library, in order to maximize the usefulness of the final dataset.
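The inverse relationship between enrichment efficiency and the proportion of classified repeats can be examined with an ordinary least-squares fit and its coefficient of determination. The sketch below assumes a simple linear model; the five data points are hypothetical and are not the values measured in this study, and the function name is illustrative.

```python
def linear_fit_r2(x, y):
    """Least-squares line y = intercept + slope * x and its R-squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical values for five libraries (both in % of reads).
enrichment_efficiency = [5.0, 10.0, 20.0, 30.0, 40.0]
classified_repeats = [30.0, 27.0, 22.0, 15.0, 12.0]

slope, intercept, r_squared = linear_fit_r2(enrichment_efficiency, classified_repeats)
print(slope < 0, round(r_squared, 2))
```

A negative slope with a high R-squared corresponds to the trend described above; with only five points, such a fit remains sensitive to any single library.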

TCS data can be used to develop cytogenetic probes

We tested whether we could use repeat information obtained from TCS data to develop probes for cytogenetic techniques such as FISH. Highly abundant repetitive elements are frequently chosen for FISH experiments, as they are easier to visualize on chromosomes than low-abundance repeats and can often be important components of predominantly heterochromatic and centromeric regions (Marques et al., 2015; Bilinski et al., 2017; Ávila Robledillo et al., 2018). In our Rhynchospora species, the most abundant cluster in all datasets was a satellite DNA. For R. tenuis and R. globosa we found the same satellites found previously by Ribeiro et al. (2017) as the most abundant satellites (Tyba and RgSAT1-186 respectively).

Similar to previous results on other Rhynchospora (Marques et al., 2015; Ribeiro et al., 2017), the Tyba variant from R. cephalotes presented a dispersed distribution in interphase nuclei and a line-like pattern along the sister chromatids of condensed chromosomes. Co-localization with CENH3 further confirmed the holocentromere-specific localization of Tyba. The conservation of the holocentromeric distribution of RcTyba on R. cephalotes strengthens its putative role in centromere function as proposed previously (Marques et al., 2015; Ribeiro et al., 2017). It also points to a remarkably old origin for this satellite and its holocentromeric association, as the species carrying it shared a common ancestor between 35 and 46 My ago (95 % credible interval; Buddenhagen, 2016). A large-scale survey of Tyba in the entire Rhynchospora genus could help to further elucidate the evolution of this satellite and its association with the holocentromere. Although GS is still the most reliable way to gather repeat information, it is still expensive to sequence a great number of species. With the growing availability of TCS data (Johnson et al., 2019; Andermann et al., 2020), the possibility of recycling off-target reads to mine highly abundant repeats demonstrated here can be a cost-efficient alternative for large-scale cytogenomic investigations.

TCS data can be used to construct repeat-based phylogenies

In order to test if repeat abundances observed in target datasets are accurate enough to reconstruct phylogenetic relationships, we applied the methodology described by Dodsworth et al. (2015) to our Bowtie and GS datasets, and also applied an AAF method (Fan et al., 2015) to the total set of reads output by the comparative repeat analysis on the Bowtie dataset. Dodsworth et al.’s approach rests on the assumption that, as repeat abundances change primarily through random genetic drift (Jurka et al., 2011), they can be used as selection-free characters for phylogenetic reconstruction. On the other hand, AAF methods have been shown to identify potentially useful markers for taxonomic resolution from GS datasets (Bohmann et al., 2020). Both analyses were successful in reconstructing the major relationships between the five Rhynchospora species with maximal bootstrap support. Our results not only corroborate that AAF methods can be useful for repetitive sequence data, but also show that they can be employed on the off-target portion of TCS data. These sequence similarity-based approaches (e.g. Vitales et al., 2020) may be more appropriate for groups where no GS data are available for verifying the accuracy of repeat abundance in the target dataset, when abundance-based approaches provide less resolved trees or when genome sizes are unknown.
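The idea behind an assembly- and alignment-free distance can be illustrated by comparing the sets of k-mers present in the reads of two samples. The sketch below assumes a distance of the form D = (1/k) ln(n_t / n_s), where n_s is the number of shared k-mers and n_t is the size of the smaller k-mer set. This form, the function names and the toy sequences are assumptions for illustration and do not reproduce the AAF software, which additionally handles reverse complements, k-mer count filtering and sequencing-error corrections.

```python
import math


def kmers(seq, k):
    """Set of all k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(reads_a, reads_b, k):
    """Shared-k-mer distance between two read sets (infinite if none shared)."""
    set_a = set().union(*(kmers(read, k) for read in reads_a))
    set_b = set().union(*(kmers(read, k) for read in reads_b))
    n_shared = len(set_a & set_b)
    n_total = min(len(set_a), len(set_b))
    if n_shared == 0:
        return math.inf
    return math.log(n_total / n_shared) / k


# Toy sequences (hypothetical): each has four 3-mers, two of them shared.
sample_1 = ["ACGTAC"]
sample_2 = ["ACGTTT"]
print(kmer_distance(sample_1, sample_2, k=3))
```

A matrix of such pairwise distances can be passed to any distance-based tree-building method; since no abundance values enter the calculation, the approach does not depend on accurate repeat proportions in the target dataset.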

The phylogenetic relationships retrieved by the repeat abundance and AAF-based phylogeny were not only congruent with the GS-based analysis and a Bayesian tree of the 256 target regions, but also with recent studies in the genus based on other phylogenetic data (Buddenhagen et al., 2016; Ribeiro et al., 2018). The fine resolution in both repeat abundance analyses is mainly due to the significant interspecific difference in repeat composition observed in the Bowtie dataset and corroborated by the GS analysis. Although it is common for closely related species to share similar repeat profiles, Rhynchospora is a fairly old genus and some of the species here analysed diverge by several million years (Buddenhagen, 2016). Our results show the potential of repeats from off-target reads to be used as an additional phylogenetically informative dataset, from a part of the genome that is completely different from those typically used for phylogenetic studies. Target capture-based sequencing already offers the opportunity to construct robust phylogenies with hundreds of informative markers and also to assemble whole plastomes and other organellar DNAs via the off-target reads to infer phylogenetic relationships and nuclear–organellar discordance (Dodsworth et al., 2019). Repeat-based phylogenies offer an additional strategy, based on the same TCS datasets, potentially uncovering nuclear intragenomic (in)congruence, while further increasing the usefulness of TCS datasets. Even in cases where repeat-based phylogenies cannot offer additional species evolution insights, they can help to understand the evolution of certain repetitive sequences across a group. The robustness of the repetitive DNA information obtained from our target datasets can prove useful in a variety of phylogenomic approaches, such as similarity-based repeat phylogenies (Vitales et al., 2020), as well as the alignment-free and abundance-based methods presented here.

SUPPLEMENTARY DATA

Supplementary data are available online at https://academic.oup.com/aob and consist of the following. Table S1: detailed annotation of repetitive elements at lineage level for all datasets on all Rhynchospora species. Table S2: names and genomic abundances of the satellites found in all datasets of the five Rhynchospora species. Figure S1: barplots of genomic abundance of unclassified repeats in every dataset of all analysed Rhynchospora species. Figure S2: dotplot comparison of satellites found in all datasets of all analysed Rhynchospora species.

ACKNOWLEDGEMENTS

The authors are grateful to Dr Magdalena Vaio (Facultad de Agronomía, Uruguay) for providing comments and suggestions for the manuscript and to M.Sc. Erton Almeida for the collection of Rhynchospora cephalotes. All authors have declared that there are no conflicts of interest regarding this article.

FUNDING

This study was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior–Brasil (CAPES) (Finance Code 001), CAPES-PRINT (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Programa Institucional de Internacionalização) [project number 88887.363884/2019-00 (L.C.)] and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) [grant number 141037/2018-0 (L.C.)].

LITERATURE CITED

  1. Abadi S, Azouri D, Pupko T, Mayrose I. 2019. Model selection may not be a mandatory step for phylogeny reconstruction. Nature Communications 10: 934. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Albert TJ, Molla MN, Muzny DM, et al. 2007. Direct selection of human genomic loci by microarray hybridization. Nature Methods 4: 903–905. [DOI] [PubMed] [Google Scholar]
  3. Aliyeva-Schnorr L, Beier S, Karafiátová M, et al. 2015. Cytogenetic mapping with centromeric bacterial artificial chromosomes contigs shows that this recombination-poor region comprises more than half of barley chromosome 3H. Plant Journal 84: 385–394. [DOI] [PubMed] [Google Scholar]
  4. Andermann T, Torres Jiménez MF, Matos-Maraví P, et al. 2020. A guide to carrying out a phylogenomic target sequence capture project. Frontiers in Genetics 10: 1407. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Ávila Robledillo L, Koblížková A, Novák P, et al. 2018. Satellite DNA in Vicia faba is characterized by remarkable diversity in its sequence composition, association with centromeres, and replication timing. Scientific Reports 8: 5838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Bilinski P, Albert PS, Berg JJ, et al. 2017. Parallel altitudinal clines reveal adaptive evolution of genome size in Zea mays. PLoS Genetics 14: e1007162. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Bohmann K, Mirarab S, Bafna V, Gilbert MTP. 2020. Beyond DNA barcoding: the unrealized potential of genome skim data in sample identification. Molecular Ecology 29: 2521–2534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Bolsheva NL, Melnikova NV, Kirov IV, et al. 2019. Characterization of repeated DNA sequences in genomes of blue-flowered flax. BMC Evolutionary Biology 19: 49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Buddenhagen C, Lemmon AR, Lemmon EM, et al. 2016. Anchored phylogenomics of angiosperms I: assessing the robustness of phylogenetic estimates. BioRxiv, doi: 10.1101/086298, preprint. [DOI] [Google Scholar]
  10. Buddenhagen CE. 2016. A view of Rhynchosporeae (Cyperaceae) diversification before and after the application of anchored phylogenomics across the angiosperms. PhD Thesis, Florida State University, USA. [Google Scholar]
  11. Bureš P, Zedek F, Markova M. 2013. Holocentric chromosomes. In: Plant genome diversity, Vol. 2. Vienna: Springer, 187–204. [Google Scholar]
  12. Ceccarelli M, Sarri V, Caceres ME, Cionini PG. 2011. Intraspecific genotypic diversity in plants. Genome 54: 701–709. [DOI] [PubMed] [Google Scholar]
  13. Cheng Z, Dong F, Langdon T, et al. 2002. Functional rice centromeres are marked by a satellite repeat and a centromere-specific retrotransposon. Plant Cell 14: 1691–1704. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Čížková J, Hřibová E, Humplíková L, Christelová P, Suchánková P, Doležel J. 2013. Molecular analysis and genomic organization of major DNA satellites in banana (Musa spp.). PLoS ONE 8: e54808. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Cosart T, Beja-Pereira A, Chen S, Ng SB, Shendure J, Luikart G. 2011. Exome-wide DNA capture and next generation sequencing in domestic and wild species. BMC Genomics 12: 347. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Dodsworth S. 2015. Genome skimming for next-generation biodiversity analysis. Trends in Plant Science 20: 525–527. [DOI] [PubMed] [Google Scholar]
  17. Dodsworth S, Chase MW, Kelly LJ, et al. 2015. Genomic repeat abundances contain phylogenetic signal. Systematic Biology 64: 112–126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Dodsworth S, Jang T-S, Struebig M, Chase MW, Weiss-Schneeweiss H, Leitch AR. 2017. Genome-wide repeat dynamics reflect phylogenetic distance in closely related allotetraploid Nicotiana (Solanaceae). Plant Systematics and Evolution 303: 1013–1020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Dodsworth S, Pokorny L, Johnson MG, et al. 2019. Hyb-Seq for flowering plant systematics. Trends in Plant Science 24: 887–891. [DOI] [PubMed] [Google Scholar]
  20. Dolezel J, Sgorbati S, Lucretti S. 1992. Comparison of three DNA fluorochromes for flow cytometric estimation of nuclear DNA content in plants. Physiologia Plantarum 85: 625–631. [Google Scholar]
  21. Drummond AJ, Rambaut A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7: 214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Eaton DAR, Spriggs EL, Park B, Donoghue MJ. 2016. Misconceptions on missing data in RAD-seq phylogenetics with a deep-scale example from flowering plants. Systematic Biology 66: 399–412. [DOI] [PubMed] [Google Scholar]
  23. Elliott TA, Gregory TR. 2015. What’s in a genome? The C-value enigma and the evolution of eukaryotic genome content. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 370: 20140331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. 2012. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Systematic Biology 61: 717–726. [DOI] [PubMed] [Google Scholar]
  25. Fan H, Ives AR, Surget-Groba Y, Cannon CH. 2015. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16: 522. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Galindo-González L, Mhiri C, Deyholos MK, Grandbastien M-A. 2017. LTR-retrotransposons in plants: engines of evolution. Gene 626: 14–25. [DOI] [PubMed] [Google Scholar]
  27. Gnirke A, Melnikov A, Maguire J, et al. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotechnology 27: 182–189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Guignard MS, Nichols RA, Knell RJ, et al. 2016. Genome size and ploidy influence angiosperm species’ biomass under nitrogen and phosphorus limitation. New Phytologist 210: 1195–1206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Heyduk K, Trapnell DW, Barrett CF, Leebens-Mack J. 2016. Phylogenomic analyses of species relationships in the genus Sabal (Arecaceae) using targeted sequence capture. Biological Journal of the Linnean Society 117: 106–120. [Google Scholar]
  30. Houben A, Schubert I. 2003. DNA and proteins of plant centromeres. Current Opinion in Plant Biology 6: 554–560. [DOI] [PubMed] [Google Scholar]
  31. Ilves KL, López-Fernández H. 2014. A targeted next-generation sequencing toolkit for exon-based cichlid phylogenomics. Molecular Ecology Resources 14: 802–811. [DOI] [PubMed] [Google Scholar]
  32. Ishii T, Sunamura N, Matsumoto A, Eltayeb AE, Tsujimoto H. 2015. Preferential recruitment of the maternal centromere-specific histone H3 (CENH3) in oat (Avena sativa L.) × pearl millet (Pennisetum glaucum L.) hybrid embryos. Chromosome Research 23: 709–718. [DOI] [PubMed] [Google Scholar]
  33. Jasencakova Z, Meister A, Schubert I. 2001. Chromatin organization and its relation to replication and histone acetylation during the cell cycle in barley. Chromosoma 110: 83–92. [DOI] [PubMed] [Google Scholar]
  34. Johnson MG, Pokorny L, Dodsworth S, et al. 2019. A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering. Systematic Biology 68: 594–606. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Jurka J, Bao W, Kojima KK. 2011. Families of transposable elements, population structure and the origin of species. Biology Direct 6: 44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Kearse M, Moir R, Wilson A, et al. 2012. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28: 1647–1649.
  37. Koo D-H, Hong CP, Batley J, et al. 2011. Rapid divergence of repetitive DNAs in Brassica relatives. Genomics 97: 173–185.
  38. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9: 357–359.
  39. Lemmon AR, Emme SA, Lemmon EM. 2012. Anchored hybrid enrichment for massively high-throughput phylogenomics. Systematic Biology 61: 727–744.
  40. Loureiro J, Rodriguez E, Dolezel J, Santos C. 2007. Two new nuclear isolation buffers for plant DNA flow cytometry: a test with 37 species. Annals of Botany 100: 875–888.
  41. Lyu H, He Z, Wu C-I, Shi S. 2018. Convergent adaptive evolution in marginal environments: unloading transposable elements as a common strategy among mangrove genomes. New Phytologist 217: 428–438.
  42. Macas J, Novák P, Pellicer J, et al. 2015. In depth characterization of repetitive DNA in 23 plant genomes reveals sources of genome size variation in the legume tribe Fabeae. PLoS ONE 10: e0143424.
  43. Mandel JR, Dikow RB, Funk VA, et al. 2014. A target enrichment method for gathering phylogenetic information from hundreds of loci: an example from the Compositae. Applications in Plant Sciences 2: 1300085.
  44. Marques A, Ribeiro T, Neumann P, et al. 2015. Holocentromeres in Rhynchospora are associated with genome-wide centromere-specific repeat arrays interspersed among euchromatin. Proceedings of the National Academy of Sciences of the USA 112: 13633–13638.
  45. Martín-Peciña M, Ruiz-Ruano FJ, Camacho JPM, Dodsworth S. 2019. Phylogenetic signal of genomic repeat abundances can be distorted by random homoplasy: a case study from hominid primates. Zoological Journal of the Linnean Society 185: 543–554.
  46. Mascagni F, Vangelisti A, Giordani T, Cavallini A, Natali L. 2020. A computational comparative study of the repetitive DNA in the genus Quercus L. Tree Genetics & Genomes 16: 11.
  47. Nagaki K, Talbert PB, Zhong CX, Dawe RK, Henikoff S, Jiang J. 2003. Chromatin immunoprecipitation reveals that the 180-bp satellite repeat is the key functional DNA element of Arabidopsis thaliana centromeres. Genetics 163: 1221–1225.
  48. Neumann P, Novák P, Hoštáková N, Macas J. 2019. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mobile DNA 10: 1.
  49. Novák P, Neumann P, Pech J, Steinhaisl J, Macas J. 2013. RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next-generation sequence reads. Bioinformatics 29: 792–793.
  50. Novák P, Ávila Robledillo L, Koblížková A, Vrbová I, Neumann P, Macas J. 2017. TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Research 45: e111.
  51. Pellicer J, Fay MF, Leitch IJ. 2010. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society 164: 10–15.
  52. R Core Team. 2019. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
  53. Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. 2018. Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Systematic Biology 67: 901–904.
  54. Renny-Byfield S, Baumgarten A. 2020. Repetitive DNA content in the maize genome is uncoupled from population stratification at SNP loci. BMC Genomics 21: 98.
  55. Ribeiro T, Marques A, Novák P, et al. 2017. Centromeric and non-centromeric satellite DNA organisation differs in holocentric Rhynchospora species. Chromosoma 126: 325–335.
  56. Ribeiro T, Buddenhagen CE, Thomas WW, Souza G, Pedrosa-Harand A. 2018. Are holocentrics doomed to change? Limited chromosome number variation in Rhynchospora Vahl (Cyperaceae). Protoplasma 255: 263–272.
  57. Ribeiro T, Vasconcelos E, Dos Santos KGB, Vaio M, Brasileiro-Vidal AC, Pedrosa-Harand A. 2020. Diversity of repetitive sequences within compact genomes of Phaseolus L. beans and allied genera Cajanus L. and Vigna Savi. Chromosome Research 28: 139–153.
  58. Ruiz-Ruano FJ, López-León MD, Cabrero J, Camacho JPM. 2016. High-throughput analysis of the satellitome illuminates satellite DNA evolution. Scientific Reports 6: 28333.
  59. Sarmashghi S, Bohmann K, Gilbert MT, Bafna V, Mirarab S. 2019. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biology 20: 34.
  60. Sass C, Iles WJD, Barrett CF, Smith SY, Specht CD. 2016. Revisiting the Zingiberales: using multiplexed exon capture to resolve ancient and recent phylogenetic splits in a charismatic plant lineage. PeerJ 4: e1584.
  61. Schmickl R, Liston A, Zeisek V, et al. 2016. Phylogenetic marker development for target enrichment from transcriptome and genome skim data: the pipeline and its application in southern African Oxalis (Oxalidaceae). Molecular Ecology Resources 16: 1124–1135.
  62. Schrader L, Schmitz J. 2019. The impact of transposable elements in adaptive evolution. Molecular Ecology 28: 1537–1549.
  63. Shams I, Raskina O. 2018. Intraspecific and intraorganismal copy number dynamics of retrotransposons and tandem repeat in Aegilops speltoides Tausch (Poaceae, Triticeae). Protoplasma 255: 1023–1038.
  64. Sonnhammer ELL, Durbin R. 1995. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167: GC1–GC10.
  65. Souza G, Costa L, Guignard MS, et al. 2019. Do tropical plants have smaller genomes? Correlation between genome size and climatic variables in the Caesalpinia group (Caesalpinioideae, Leguminosae). Perspectives in Plant Ecology, Evolution and Systematics 38: 13–23.
  66. Sproul JS, Barton LM, Maddison DR. 2020. Repetitive DNA profiles reveal evidence of rapid genome evolution and reflect species boundaries in ground beetles. Systematic Biology 69: 1137–1148.
  67. Straub SCK, Parks M, Weitemier K, Fishbein M, Cronn RC, Liston A. 2012. Navigating the tip of the genomic iceberg: next-generation sequencing for plant systematics. American Journal of Botany 99: 349–364.
  68. Thomas WW, Araújo AC, Alves MV. 2009. A preliminary molecular phylogeny of the Rhynchosporeae (Cyperaceae). Botanical Review 75: 22–29.
  69. Van-Lume B, Esposito T, Diniz-Filho JAF, Gagnon E, Lewis GP, Souza G. 2017. Heterochromatic and cytomolecular diversification in the Caesalpinia group (Leguminosae): relationships between phylogenetic and cytogeographical data. Perspectives in Plant Ecology, Evolution and Systematics 29: 51–63.
  70. Vitales D, Garcia S, Dodsworth S. 2020. Reconstructing phylogenetic relationships based on repeat sequence similarities. Molecular Phylogenetics and Evolution 147: 106766.
  71. Wang H-J, Li W-T, Liu Y-N, Yang F-S, Wang X-Q. 2017. Resolving interspecific relationships within evolutionarily young lineages using RNA-seq data: an example from Pedicularis section Cyathophora (Orobanchaceae). Molecular Phylogenetics and Evolution 107: 345–355.
  72. Weisshart K, Fuchs J, Schubert V. 2016. Structured illumination microscopy (SIM) and photoactivated localization microscopy (PALM) to analyze the abundance and distribution of RNA polymerase II molecules on flow-sorted Arabidopsis nuclei. Bio-Protocol 6: e1725.
  73. Weiss-Schneeweiss H, Leitch AR, McCann J, Jang T-S, Macas J. 2015. Employing next generation sequencing to explore the repeat landscape of the plant genome. In: Hörandl E, Appelhans M, eds. Next generation sequencing in plant systematics. Königstein: Koeltz Scientific Books.
  74. Weitemier K, Straub SCK, Cronn RC, et al. 2014. Hyb-Seq: combining target enrichment and genome skimming for plant phylogenomics. Applications in Plant Sciences 2: 1400042.
  75. Wickham H. 2016. ggplot2: elegant graphics for data analysis. New York: Springer.

Associated Data

Supplementary Materials

mcab063_suppl_Supplementary_Table_1
mcab063_suppl_Supplementary_Table_2
mcab063_suppl_Supplementary_Figure_1
mcab063_suppl_Supplementary_Figure_2

Articles from Annals of Botany are provided here courtesy of Oxford University Press