Author manuscript; available in PMC: 2023 Feb 14.
Published in final edited form as: Plant Genome. 2021 Sep 25;14(3):e20143. doi: 10.1002/tpg2.20143

K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

Bruno Contreras-Moreira 1, Carla V Filippi 1,2,3, Guy Naamati 1, Carlos García Girón 1, James E Allen 1, Paul Flicek 1
PMCID: PMC7614178  EMSID: EMS164607  PMID: 34562304

Abstract

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.

1. Introduction

Besides genes, plant genomes contain intergenic sequences, whose repetitive content increases as the genome size grows. The growth in repeat content is roughly linear up to a genome size of 10 Gbp, including most known angiosperms, and then plateaus (Novák et al., 2020). The repetitive fraction of the genome is made up of low-copy repeats, simple repeats (such as satellite DNA), and transposable elements (TEs), which were discovered by Barbara McClintock in maize (Zea mays L.) (McClintock, 1950).

Transposable elements can be important for explaining observed phenotypes or domestication [see, for instance, Studer et al. (2011)] and are used as a source of genetic variability in breeding programs (Thieme et al., 2017). The hypothesis is that the copy-and-paste and cut-and-paste mechanisms of TEs might leave footprints in the genome and can potentially affect the expression, regulation, or coding sequences of neighboring genes. Moreover, TEs are increasingly receiving attention in studies tackling plant pangenomes (e.g., Gordon et al., 2017). According to the Wicker classification, plant TEs can be classified either as Class I RNA retrotransposons or Class II DNA transposons (Wicker et al., 2007). Software resources such as RepeatMasker (RM) (Smit et al., 2015), RepBase (Bao et al., 2015), or RepetDB (Amselem et al., 2019), which are typically used to annotate TEs and other repeats in plant genomes, use the Wicker classification rules and repeat libraries (Lerat, 2010). These libraries can be generic, such as RepBase, which is available for subscribers only, or customized for a genome of interest with RepeatModeler (RepMod) (Flynn et al., 2020). These repeat annotation strategies can take up to several days on a computer cluster, depending on the genome size, and often mask disease resistance (R) genes, which are of great interest in plant breeding (Bayer et al., 2018).

In addition to the intrinsic biological value of TEs, the annotation of repeats can be used to estimate assembly quality (Wierzbicki et al., 2020) as an alternative to gene completeness (Van Bel et al., 2019). For other genomic analyses, the bulk of repeated sequences may disrupt common computational genomic analyses and are thus often masked out, without any classification attempt. For instance, whole-genome alignment, promoter analysis, and the construction of graph genomes require the computation of frequency tables of k-mers, which are nucleotide words of size k. If repeated sequences are not masked, the frequency tables are severely biased and can affect the results obtained (Hickey et al., 2020). Although annotation approaches based on sequence similarity are computationally expensive, k-mer masking strategies are orders of magnitude faster (Beier et al., 2020; da Cruz et al., 2020; Girgis, 2015; Kurtz et al., 2008) and, in our experience, are much better for preparing whole-genome alignments of barley (Hordeum vulgare L.) and wheat (Triticum aestivum L.) cultivars via LASTZ (Harris, 2007).

In this study, we benchmarked a two-step approach for annotating repeated sequences in plants. First, repeats were called by k-mer counting with the Repeat Detector (Red). Second, the discovered repeated sequences were annotated by sequence alignment to a newly curated metacollection of repeats called nrTEplants. We compared this approach with the conventional RM pipeline on a set of 20 angiosperms from Ensembl with nrTEplants, REdat (Nussbaumer et al., 2013) and custom RepMod libraries. We then compared their performance and discussed the results. The nrTEplants library is bundled with documentation on how to update it and scripts to mask and annotate plant genomes, enabling interoperability, reuse, and reproducible analyses (Wilkinson et al., 2016).

2. Materials and Methods

2.1. Plant repeat libraries

We searched the literature for plant-specific libraries of repeated sequences and selected those in Table 1. Although some are specific for a species or repeat family, others comprise repeats from mixed species, such as REdat from PlantsDB (Nussbaumer et al., 2013) or RepetDB (Amselem et al., 2019). FASTA files with nucleotide sequences of repeats were downloaded from the indicated URLs or obtained from the authors.

Table 1. Collections of plant repeated sequences used as components of nrTEplants.

| Dataset | Description and source | Last updated | Total sequences | Median length (bp) |
|---|---|---|---|---|
| TREP | TEs^a from Triticeae and various other species. https://botserv2.uzh.ch/kelldata/trep-db/index.html | 2019 | 4,162 | 4,234 |
| SINEbase | Consensus sequences of short interspersed nuclear element families (Vassetzky & Kramerov, 2013). http://sines.eimb.ru | 2020 | 60 | 183 |
| REdat | Repeats from several sources and species in PlantsDB (Nussbaumer et al., 2013). https://pgsb.helmholtz-muenchen.de | 2013 | 61,730 | 7,504 |
| RepetDB | Repeats detected and classified by TEdenovo and used by TEannot (Amselem et al., 2019). http://urgi.versailles.inra.fr/repetdb | 2019 | 33,416 | 3,567 |
| EDTArice | Extensive de novo TE Annotator (Ou et al., 2019). https://github.com/oushujun/EDTA | 2019 | 2,431 | 984 |
| EDTAmaize | Extensive de novo TE Annotator (Ou et al., 2019). https://github.com/oushujun/EDTA | 2019 | 1,362 | 3,308 |
| SoyBaseTE | Comprehensive database of soybean TEs (Du et al., 2010). https://www.soybase.org/soytedb | 2010 | 38,664 | 1,716 |
| TAIR10TE | Arabidopsis thaliana TEs. https://www.arabidopsis.org | 2019 | 31,189 | 305 |
| SunflowerTE | Staton et al. (2012). https://www.sunflowergenome.org | 2016 | 73,627 | 4,709 |
| SUNREP | The repetitive component of the sunflower genome (Natali et al., 2013). pgagl.agr.unipi.it/sequence-repository | 2013 | 47,441 | 616 |
| MelonTE | Castanera et al. (2019) | 2020 | 1,560 | 3,981 |
| RosaTE | Hibrand Saint-Oyant et al. (2018). https://iris.angers.inra.fr/obh/downloads | 2017 | 355,304 | 226 |

^a TE, transposable element.

2.2. Plant transcript sequences

Plant species in Ensembl Plants release 46 (November 2020) (Howe et al., 2020) were ranked in terms of the number of proteins reviewed in Uniprot on 22 Feb. 2020 (UniProt Consortium, 2019). This was considered as an indicator of annotation quality, as UniProt protein sequences are commonly used during prediction and validation of gene models. A list of the best-annotated dicot and monocot species was produced, including Arabidopsis thaliana (L.) Heynh., Brassica napus L., Glycine max (L.) Merr., sunflower (Helianthus annuus L.), Medicago truncatula Gaertn., Phaseolus vulgaris L., Populus trichocarpa Torr. & A.Gray ex Hook., Solanum lycopersicum L., Vitis vinifera L., Brachypodium distachyon (L.) P.Beauv., Hordeum vulgare, Oryza sativa subsp. japonica L., Sorghum bicolor (L.) Moench., and Zea mays. Transcripts (cDNA) from these species were downloaded with the script ens_sequences.pl from https://github.com/Ensembl/plant-scripts.

2.3. Sequence clustering

Transcripts and TE sequences were clustered with GET_HOMOLOGUES-EST version 10042020 (Contreras-Moreira et al., 2017). This software runs BLASTN and the MCL algorithm, and computes coverage by combining local alignments. The sequence identity cut-off was 95% and the alignment coverage 75%. Global variables in the script get_homologues-est.pl, lines L36-7, were set to $MAXSEQLENGTH = 55000 and $MINSEQLENGTH = 90. Sequences were clustered with the command get_homologues-est.pl -d repeats -m cluster -M -t 0 -i 100. The longest sequence in each cluster was taken as a representative.
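The last step, choosing the longest sequence of each cluster as its representative, can be illustrated with a minimal sketch. It assumes that cluster membership is already available as a plain dictionary (a hypothetical structure, not the output format of GET_HOMOLOGUES-EST) and that the two length bounds are inclusive, which is also an assumption.

```python
MIN_SEQ_LENGTH = 90      # mirrors $MINSEQLENGTH quoted in the text
MAX_SEQ_LENGTH = 55000   # mirrors $MAXSEQLENGTH quoted in the text


def pick_representatives(clusters):
    """Return {cluster_id: (name, sequence)} holding the longest member per cluster.

    `clusters` is a hypothetical dict: cluster_id -> list of (name, sequence).
    Members outside the length bounds are ignored (bounds assumed inclusive);
    clusters left without members are dropped.
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        kept = [(name, seq) for name, seq in members
                if MIN_SEQ_LENGTH <= len(seq) <= MAX_SEQ_LENGTH]
        if kept:
            # max() returns the first longest member when there are ties
            representatives[cluster_id] = max(kept, key=lambda m: len(m[1]))
    return representatives
```

Ties are resolved by input order here; the actual software may use a different rule.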

2.4. Positive control Pfam domains

A list of 22 Pfam domains found in TEs was curated (Mistry et al., 2021), available at https://github.com/Ensembl/plant_tools/blob/master/bench/repeat_libs/control_pos.list.

2.5. Negative control: Pfam domains of disease resistance genes

For the identification and curation of Pfam domains encoded by disease resistance (R) genes, the following steps were performed. First, a set of 153 protein sequences encoded by reference R genes (i.e., cloned and/or with robust evidence) was retrieved from http://www.prgdb.org/prgdb (Osuna-Cruz et al., 2018). Second, the program hmmscan from HMMER Version 3.2.1 (Eddy, 1998) was used for initial Pfam domain identification (Version 32, default settings), yielding a total of 60 Pfam hidden Markov models. The observed order and combinations of Pfam domains were retrieved. Third, the proteins of six plant species (A. thaliana, B. distachyon, G. max, H. annuus, H. vulgare, and T. aestivum) containing at least one of the 60 Pfam domains previously identified were retrieved from https://plants.ensembl.org/biomart/martview (Kinsella et al., 2011). These proteins were subsequently filtered, retaining only those with the ordered combinations of Pfam domains observed in the reference R proteins, and were considered as potential R proteins (428 in A. thaliana, 577 in B. distachyon, 1,008 in G. max, 849 in H. annuus, 838 in H. vulgare, and 3,607 in T. aestivum). From the initial set of Pfam domains, only 43 were consistently identified in our final panel of potential encoded proteins of R genes and used as a negative control. Note that one of them (PF02892, zf-BED) is often found in transposases (Mistry et al., 2021). The list is available at https://github.com/Ensembl/plant_tools/blob/master/bench/repeat_libs/control_neg_NLR.list.
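The third step, retaining only proteins whose ordered combination of Pfam domains matches one seen in the reference R proteins, might be sketched as follows. The input structures and function names are hypothetical, and real hmmscan output would first need to be parsed into (start, accession) pairs.

```python
def domain_architecture(domain_hits):
    """Return the Pfam accessions of one protein ordered by start coordinate.

    `domain_hits` is a list of (start, pfam_accession) pairs.
    """
    return tuple(accession for _, accession in sorted(domain_hits))


def filter_candidates(candidates, reference_architectures):
    """Keep proteins whose ordered domain combination occurs in a reference protein.

    `candidates` maps protein id -> list of (start, pfam_accession) pairs;
    `reference_architectures` is a set of accession tuples.
    """
    return {
        protein_id: hits
        for protein_id, hits in candidates.items()
        if domain_architecture(hits) in reference_architectures
    }
```

For example, with a reference set containing only the ordered pair (PF01582, PF00931), a candidate carrying the same two domains in the reverse order would be discarded.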

2.6. De novo annotation of nucleotide-binding and leucine-rich repeat immune receptor genes

The NLR-annotator software package (Steuernagel et al., 2020) was used for de novo annotation of nucleotide-binding and leucine-rich repeat immune receptor (NLR) genes, which are the most abundant R genes characterized to date, in whole genome sequences. Briefly, the 20 plant genomes were dissected into fragments 20 kb in length, with 5 kb overlaps, via the ChopSequence.jar routine. The cut sequences were then scanned to find NLR-associated sequence motifs with the NLR-Parser.jar command. Finally, NLR-Annotator.jar was used to integrate the annotated motifs and retrieve the actual NLR loci in BED format. In order to compute intersections with repeats, only NLR loci with an overlap of > 50 bp were considered. Moreover, to account for the fact that the tested masking strategies covered different fractions of the genome, odds ratios of NLR masking were computed via Equation 1:

$$\mathrm{OR} = \frac{\mathrm{NLR}_{\mathrm{masked}} / \mathrm{Gen}_{\mathrm{masked}}}{\mathrm{NLR} / \mathrm{Gen}} \tag{1}$$

where OR is the odds ratio, NLRmasked is the masked NLR space, Genmasked is the masked genome space, NLR is the NLR space, and Gen is the genome space.
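Equation 1 translates directly into a few lines of code. The sketch below assumes that all four quantities are expressed in base pairs; the function and argument names are illustrative only.

```python
def nlr_masking_odds_ratio(nlr_masked_bp, genome_masked_bp, nlr_bp, genome_bp):
    """Odds ratio of Equation 1: (NLRmasked / Genmasked) / (NLR / Gen).

    Values above 1 indicate that NLR loci make up a larger share of the masked
    space than of the whole genome.
    """
    if min(genome_masked_bp, nlr_bp, genome_bp) <= 0:
        raise ValueError("masked genome, NLR and genome spaces must be positive")
    return (nlr_masked_bp / genome_masked_bp) / (nlr_bp / genome_bp)
```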

2.7. Masking and annotation of repeats in plant genomes

RepeatMasker Version 4.0.5 and a fork of Repeat Detector (Red) Version 2.0 adapted for Ensembl, available at https://github.com/EnsemblGenomes/Red, were used to call repeats in plant genomes with the libraries REdat Version 9.3 and nrTEplants Version 0.3. In addition, RepeatMasker Version 4.1.2-p1 was also run to call repeats with custom repeat libraries produced by 20 parallel jobs in RepeatModeler-2.0.2a (Flynn et al., 2020). Note that custom libraries were obtained for only 10 species, as the remaining RepMod jobs were killed after 7 d in a computer farm. RepMod repeat coordinates were converted to BED format and overlapping intervals were merged. Low complexity sequences were called with dustmasker Version 1.0.0 (Morgulis et al., 2006). Tandem repeats were discovered with trf Version 4.0 with the parameters 2 5 7 80 10 40 500 -d -h (Benson, 1999). Red was called from the script https://github.com/Ensembl/plant-scripts/blob/master/repeats/Red2Ensembl.py, which can run several sequences in parallel and feed the results into an Ensembl core database (Stabenau et al., 2004). In addition, minimap2 version 2.17-r974-dirty (Li, 2018) was used to annotate the repeats called by Red with sequences from nrTEplants as follows: minimap2 -K100M --score-N 0 -x map-ont nrTEplants. Minimap2 is called from the script https://github.com/Ensembl/plant-scripts/blob/master/repeats/AnnotRedRepeats.py, which parses its output to annotate the repeats. By default, only repeats with a length of > 90 bp are processed. Transposable element classification terms are parsed from the FASTA header of the library after a hash (#; e.g., RLG_43695:mipsREdat_9.3p_ALL#LTR/Gypsy). Elapsed runtime and RAM consumption were measured with the command time -v tool.
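The header convention described above (classification term after a hash) can be illustrated with a short sketch. This is not the code in AnnotRedRepeats.py; it assumes that the classification ends at the first whitespace of the header, and the function name is hypothetical.

```python
def parse_te_classification(header):
    """Return the TE classification found after '#' in a FASTA header.

    Only the first whitespace-delimited token of the header is inspected
    (an assumption). Returns None when no classification is present.
    """
    fields = header.lstrip(">").split()
    if not fields or "#" not in fields[0]:
        return None
    classification = fields[0].split("#", 1)[1]
    return classification or None
```

With the example header from the text, RLG_43695:mipsREdat_9.3p_ALL#LTR/Gypsy, the function returns LTR/Gypsy.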

Genomic intersections among repeated sequences called by Red and RM, and genomic features (i.e., protein-coding genes, exons, proximal downstream and upstream 500-bp windows, and NLR loci) were computed with Bedtools (Version 2.26.0) (Quinlan & Hall, 2010) using bedtools intersect -a bed/genes.bed -b repeat.bed -sorted -wo. To avoid redundancy, exons were extracted from Ensembl canonical transcripts (see http://plants.ensembl.org/info/website/glossary.html). When we retrieved downstream and upstream genomic intervals, intersecting neighbor genes were first subtracted to eliminate any potential coding sequences.
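To make explicit what an intersection reported with the -wo option contains, the following naive sketch pairs each feature with every overlapping repeat and reports the overlap in base pairs. It assumes BED-style coordinates (0-based, half-open) and compares all pairs on the same sequence, whereas Bedtools uses a far more efficient algorithm on sorted input; the function name is illustrative.

```python
def intersect_with_overlap(features, repeats):
    """Return (feature, repeat, overlap_bp) for every overlapping pair.

    Both inputs are lists of (chrom, start, end) tuples in BED convention
    (0-based start, end exclusive).
    """
    repeats_by_chrom = {}
    for chrom, start, end in repeats:
        repeats_by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, f_start, f_end in features:
        for r_start, r_end in repeats_by_chrom.get(chrom, []):
            overlap = min(f_end, r_end) - max(f_start, r_start)
            if overlap > 0:  # adjacent intervals share no bases
                hits.append(((chrom, f_start, f_end), (chrom, r_start, r_end), overlap))
    return hits
```

A minimum-overlap filter such as the > 50 bp threshold applied to NLR loci could be added by comparing the third element of each hit.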

2.8. K-mer analysis of repeats in downstream and upstream windows

Repeats overlapping proximal downstream or upstream 500-bp windows were extracted via bedtools intersect analysis and the sequences were cut with bedtools getfasta. Canonical k-mers with k = [16,21,31] were counted with Jellyfish Version 2.3.0 (Marçais & Kingsford, 2011) by the commands jellyfish-linux count -C -m K -s 2G -t 4 and jellyfish-linux dump -L 20.
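For readers unfamiliar with canonical k-mers, the pure-Python sketch below mimics on a small scale what the -C (canonical counting) and -L (lower count limit) options achieve. It is far slower than Jellyfish and is meant only as an illustration; skipping k-mers with non-ACGT characters is an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(sequences, k, min_count=1):
    """Count canonical k-mers and keep those seen at least `min_count` times."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_count}
```

Calling count_canonical_kmers(sequences, 21, min_count=20) would correspond to one of the three k values and the threshold of 20 copies used here.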

2.9. Enrichment of Pfam domains

Enrichment was computed by the R function fisher.test (R Core Team, 2020) and Pfam domains (Mistry et al., 2021) were retrieved by Recipe B4 of https://github.com/Ensembl/plant-scripts (Contreras-Moreira et al., 2021). Pfam domain counts for the complete proteome were used as the expected frequencies. Only genes with an overlap of > 50 bp and domains with adjusted false discovery rates (p < .05) were considered.
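As a rough illustration of the enrichment test, the sketch below computes a one-sided exact p-value from the hypergeometric distribution, treating the genes overlapping repeats as a sample drawn from the complete proteome. This is an assumption made for illustration: R's fisher.test is two-sided by default, and the multiple-testing adjustment applied in the analysis is not reproduced here.

```python
from math import comb


def enrichment_p_value(hits_with_domain, hits_total, genome_with_domain, genome_total):
    """One-sided exact p-value for over-representation of a Pfam domain.

    Returns P(X >= hits_with_domain), where X counts genes carrying the domain
    in a random sample of `hits_total` genes taken from a proteome of
    `genome_total` genes, `genome_with_domain` of which carry the domain.
    """
    upper = min(hits_total, genome_with_domain)
    tail = sum(
        comb(genome_with_domain, x)
        * comb(genome_total - genome_with_domain, hits_total - x)
        for x in range(hits_with_domain, upper + 1)
    )
    return tail / comb(genome_total, hits_total)
```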

2.10. Control sets of annotated repeated sequences

Repeated sequences annotated by the sequencing consortia of olive tree (Olea europaea L.)(Jiménez-Ruiz et al., 2020), Rosa chinensis Jacq. (Hibrand Saint-Oyant et al., 2018), and sunflower (Badouin et al., 2017) were downloaded from https://genomaolivar.dipujaen.es/db/downloads.php, https://iris.angers.inra.fr/obh/downloads, and https://sunflowergenome.org/annotations-data, respectively.

3. Results and Discussion

3.1. Construction and benchmarking of a nonredundant library of repeats: nrTEplants

A set of plant TE libraries and annotated repeats from selected species, listed in Table 1, plus transcript sets from the best functionally annotated plant species in Ensembl were curated and their TE classification terms uniformized. Next, they were merged and clustered (95% identity, 75% coverage of shortest sequence). Of the resulting 994,349 clusters, the 174,426 that contained TE sequences were six-frame translated and assigned Pfam domains. Of these, a subset of 8,910 were mixed clusters, comprising both TE and transcript sequences, and required further processing (see the example in Supplemental Figure S1). After empirical assessment, we decided to take only clusters (a) containing sequences from at least six different TE libraries (six replicates), which eventually left out Rosa TE repeats; and (b) with a fraction of sequences marked as a ‘potential host gene’ in RepetDB below 0.00. The resulting nrTEplants library contained 171,104 sequences (see Supplemental Table S1 and Supplemental Table S2). Note that different cut-off values might have been selected with different input sequences or control sets. For example, increasing the number of replicates equates to computing an intersection set. Instead, to get a union set, the cut-off will need to be lowered.
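The replicate criterion can be sketched as follows, under one reading of the paragraph above: clusters made only of TE sequences are kept, whereas mixed clusters are kept only when their TE members come from enough different libraries. The data structure and the cDNA label are hypothetical, and the second criterion (the RepetDB ‘potential host gene’ fraction) is deliberately left out of this sketch.

```python
TRANSCRIPT_LABEL = "cDNA"  # hypothetical label for transcript-derived sequences


def select_te_clusters(cluster_sources, min_libraries=6):
    """Return ids of clusters retained for a nonredundant TE library.

    `cluster_sources` maps cluster id -> list with one source label per
    sequence (a TE library name or TRANSCRIPT_LABEL).
    """
    selected = []
    for cluster_id, sources in cluster_sources.items():
        libraries = {s for s in sources if s != TRANSCRIPT_LABEL}
        if not libraries:
            continue  # transcript-only cluster, not a TE cluster
        is_mixed = TRANSCRIPT_LABEL in sources
        if not is_mixed or len(libraries) >= min_libraries:
            selected.append(cluster_id)
    return selected
```

Raising min_libraries moves the selection towards an intersection of the input libraries, and lowering it moves the selection towards their union.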

In order to benchmark the newly constructed library, we compiled a positive control comprising 22 Pfam domains found in TEs, and a negative control: a list of 43 Pfam domains found in disease resistance NLR genes. Among these controls, we observed 20 true positives, 2 false negatives, 36 true negatives, and 2 false positives, yielding a sensitivity of 0.91 and a specificity of 0.95. The nrTEplants library can be obtained at https://github.com/Ensembl/plant-scripts/releases/tag/v0.3. A step-by-step guide on how to produce a nonredundant repeat library, including sample files with the control Pfam domains, is available at https://github.com/Ensembl/plant_tools/tree/master/bench/repeat_libs.
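The two performance figures follow from the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch using the counts quoted above:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of positive controls recovered
    specificity = tn / (tn + fp)   # fraction of negative controls excluded
    return sensitivity, specificity


# Counts as reported in the text
sens, spec = sensitivity_specificity(tp=20, fn=2, tn=36, fp=2)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```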

3.2. Masking repeats within plant genomes

Twenty plant genomes were selected from Ensembl (Howe et al., 2020) to benchmark the repeat calling strategies. These are listed in Table 2 next to the genomic fraction of repeats reported in the literature and their guanine–cytosine content. All these genome sequences were annotated with RM (Smit et al., 2015) with several repeat libraries (nrTEplants and REdat) (Nussbaumer et al., 2013) and species-specific custom libraries (RepMod). In addition, the fraction of repeats called by Red, based on k-mer enrichment, is also shown. Note that Red automatically selected k values from 13 to 16 as the genomes increased in length.
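The range of k values mentioned above is consistent with choosing k as the ceiling of log4(n) minus 1, where n is the genome length. This rule is an assumption about how Red chooses its default k and is not stated in this paper. The sketch below uses integer arithmetic to avoid rounding issues with logarithms.

```python
def guess_default_k(genome_length_bp):
    """Return ceil(log4(n)) - 1 for a genome of n bp (assumed default rule)."""
    if genome_length_bp < 1:
        raise ValueError("genome length must be positive")
    exponent = 0
    while 4 ** exponent < genome_length_bp:
        exponent += 1  # smallest exponent with 4**exponent >= n
    return exponent - 1
```

Under this assumption, the smallest genome in Table 2 (119.7 Mbp) gives k = 13 and the largest (10,463.1 Mbp) gives k = 16.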

TABLE 2. Plant genomes from release 49 (November 2020) of Ensembl Plants (Howe et al., 2020) used in this work and their reported repeated fractions in the literature.

| Species | GC^a (%) | Assembled genome size (Mbp) | Reported repeated fraction (%) | Literature source |
|---|---|---|---|---|
| Arabidopsis thaliana | 36.1 | 119.7 | 19.0 | Legrand et al. (2019) |
| Arabidopsis halleri (L.) O’Kane & Al-Shehbaz | 36.0 | 196.2 | 32.7 | Legrand et al. (2019) |
| Prunus dulcis (Mill.) D.A.Webb | 37.6 | 227.5 | 37.6 | Alioto et al. (2020) |
| Brachypodium distachyon | 46.4 | 271.2 | 21.4 | International Brachypodium Initiative (2010) |
| Brassica rapa L. | 35.3 | 283.8 | 32.3 | Zhang et al. (2018) |
| Trifolium pratense | 32.4 | 304.8 | 41.8 | De Vega et al. (2015) |
| Arabis alpina L. | 36.8 | 308.0 | 47.9 | Willing et al. (2015) |
| Cucumis melo L. | 33.5 | 357.9 | 44.0 | Ruggieri et al. (2018) |
| Citrullus lanatus (Thunb.) Matsum. & Nakai | 33.6 | 365.5 | 45.2 | Guo et al. (2013) |
| Oryza sativa | 43.6 | 375.0 | 35 | International Rice Genome Sequencing Project (2005) |
| Setaria viridis (L.) P.Beauv. | 46.2 | 395.7 | 46 | Thielen et al. (2020) |
| Vitis vinifera | 34.5 | 486.3 | 41.4 | French-Italian Public Consortium for Grapevine Genome Characterization (2007) |
| Rosa chinensis | 38.8 | 515.6 | 67.9 | Raymond et al. (2018) |
| Camelina sativa (L.) Crantz | 36.6 | 641.4 | 28 | Kagale et al. (2014) |
| Malus domestica Borkh. | 38.0 | 702.9 | 59.5 | Daccord et al. (2017) |
| Olea europaea | 35.4 | 1,140.9 | 43 | Unver et al. (2017) |
| Zea mays | 46.9 | 2,135.1 | 85 | Schnable et al. (2009) |
| Helianthus annuus | 38.5 | 3,027.8 | 74.7 | Badouin et al. (2017) |
| Aegilops tauschii | 46.3 | 4,224.9 | 85.9 | Zhao et al. (2017) |
| Triticum turgidum | 46.0 | 10,463.1 | 82.2 | Maccaferri et al. (2019) |

^a GC, guanine-cytosine content.

In Figure 1, the resulting percentages of repeated sequences are plotted next to the values reported in the literature. The median difference between the REdat repeated fraction and the literature reports is 26.5%. This number is 9.8% for nrTEplants, 4.3% for Red, and 6.3% for RepMod (over 10 genomes). These results suggest that Red can successfully mask genomes without previous knowledge of the repetitive sequence repertoire of a species. As shown in Supplemental Table S3, Red-masked fractions were also consistent among cultivars of the wheat pangenome. Moreover, repeats called by Red generally overlapped sequences masked with REdat (66.6%), nrTEplants (73.8%), and RepMod (94.1%) (see Supplemental Table S4). In contrast, the overlap with low complexity regions (in dustmasker) and tandem repeats (in trf) is small (2.8% and 4.9%, respectively).

Figure 1. Fraction of repeated sequences in plant genomes.

Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated with RepeatMasker (Smit et al., 2015) and the libraries REdat (Nussbaumer et al., 2013) and nrTEplants. The results for 10 genomes masked with RepMod custom libraries are also shown (Flynn et al., 2020). The percentage of repeated sequences is plotted next to the values reported in the literature for those genomes and the fraction of repeats provided by Repeat Detector (Red), based on k-mer enrichment (Girgis, 2015). Species are sorted by genome size from smallest to largest

Table 3 summarizes the number and length of repeats called by all the strategies tested. We observed that Red called more repeats than nrTEplants and REdat but fewer than custom RepMod libraries (a median of 845 per Mbp, compared with 391 for nrTEplants, 221 for REdat, and 961 for RepMod). In terms of N50, the sequence length of the shortest contig at 50% of the total sequence length, the performance depended on the species, but it seems that repeats called by RepMod are generally shorter.
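For reference, the N50 statistic applied here to repeat lengths can be computed with a few lines of code. The sketch takes a plain list of lengths in bp; the function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated; N50 is the
    length of the element at which the running sum first reaches half of the
    total length.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
```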

TABLE 3. Summary of repeated sequences annotated with Repeat Detector (Red) (Girgis, 2015) and RepeatMasker (Smit et al., 2015) with the libraries nrTEplants and REdat (Nussbaumer et al., 2013) and with custom libraries obtained for some species by RepeatModeler (Flynn et al., 2020). Total repeats and N50 estimates of repeats are shown, where N50 is the sequence length of the shortest contig at 50% of the total sequence length.

| Species | Red repeats | Red N50 | nrTEplants repeats | nrTEplants N50 | REdat repeats | REdat N50 | RepMod repeats | RepMod N50 |
|---|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | 172,935 | 445 | 48,144 | 1,779 | 28,797 | 2,211 | 72,138 | 1,178 |
| Arabidopsis halleri | 226,080 | 554 | 81,857 | 1,380 | 57,901 | 1,431 | - | - |
| Prunus dulcis | 190,357 | 1,627 | 105,546 | 2,528 | 36,891 | 1,025 | 243,499 | 1,422 |
| Brachypodium distachyon | 150,191 | 4,986 | 74,215 | 6,260 | 67,632 | 6,665 | 222,710 | 2,125 |
| Brassica rapa | 348,258 | 642 | 160,157 | 1,046 | 69,345 | 777 | 303,119 | 628 |
| Trifolium pratense | 277,811 | 555 | 139,254 | 326 | 155,808 | 265 | - | - |
| Arabis alpina | 279,129 | 1,040 | 146,057 | 2,245 | 98,017 | 1,050 | - | - |
| Cucumis melo | 305,083 | 1,939 | 148,925 | 3,141 | 51,833 | 1,338 | 407,579 | 1,819 |
| Citrullus lanatus | 323,894 | 2,596 | 151,980 | 1,020 | 52,941 | 1,103 | - | - |
| Oryza sativa | 278,406 | 2,931 | 160,371 | 4,479 | 129,121 | 6,077 | - | - |
| Setaria viridis | 247,732 | 3,124 | 116,459 | 1,727 | 105,088 | 1,722 | - | - |
| Vitis vinifera | 423,876 | 1,753 | 185,204 | 3,369 | 69,315 | 1,550 | 496,352 | 1,604 |
| Rosa chinensis | 463,880 | 2,125 | 189,086 | 1,479 | 93,715 | 950 | 499,475 | 1,958 |
| Camelina sativa | 709,160 | 878 | 267,290 | 1,272 | 201,059 | 1,176 | 611,700 | 1,105 |
| Malus domestica | 531,496 | 2,416 | 211,929 | 4,729 | 126,487 | 1,268 | - | - |
| Olea europaea | 901,519 | 3,153 | 291,445 | 1,956 | 375,614 | 1,218 | - | - |
| Zea mays | 847,205 | 13,137 | 365,978 | 11,806 | 372,467 | 11,419 | 853,432 | 1,1380 |
| Helianthus annuus | 2,387,122 | 5,018 | 355,890 | 8,716 | 479,400 | 1,317 | - | - |
| Aegilops tauschii | 1,506,690 | 10,133 | 777,962 | 9,973 | 847,592 | 9,431 | 1,758,407 | 7,894 |
| Triticum turgidum | 4,291,533 | 9,066 | 1,914,776 | 9,947 | 1,784,719 | 10,124 | - | - |

Figure 2 summarizes how the called repeats overlapped with genes, exons, and 500-bp windows upstream and downstream. It can be seen that Red repeats overlapped a larger fraction of the gene space (23.2%) than REdat (12.4%) and nrTEplants (18.8%), as did RepMod repeats (24.4%). When only exons were considered, REdat repeats overlapped 4.1% of these, with nrTEplants, Red, and RepMod behaving similarly (11.6, 11.9, and 11.7%, respectively). The figure also shows that Red and RepMod mask more of the proximal upstream and downstream space, which will probably have a positive impact on k-mer counting strategies for promoter analysis (Ksouri et al., 2021). The analysis in Supplemental Table S5 shows that Red identified four times more k-mers with 20+ copies in this regulatory space, which agrees with recent work showing that unidentified TEs are over-represented in specific regulatory networks (Baud et al., 2019).

Figure 2.

Fraction of exons, genes, and 500-bp upstream and downstream regions overlapping annotated repeats in plant genomes. Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated by Red (Girgis, 2015) or RepeatMasker (Smit et al., 2015) with the libraries REdat (Nussbaumer et al., 2013) and nrTEplants. The results for 10 genomes masked with RepMod custom libraries are also shown (Flynn et al., 2020)

In order to check whether the compared approaches preferentially masked genes from certain families, a Pfam enrichment analysis was carried out; this is summarized in Figure 3. It can be seen that RepMod and Red repeats show the least enrichment. Nevertheless, we found that Red repeats preferentially overlapped four domains (enriched in three or more genomes: reverse transcriptase-like, TIR, NB-ARC, and integrase core domains). Similarly, RepMod repeats were enriched in two protein kinase domains. In contrast, a few Pfam domains were enriched in 10+ genomes in genes overlapping repeats annotated with REdat (153 domains) and nrTEplants (87 domains) (see Supplemental Table S6).

Figure 3. Enriched Pfam domains of protein-coding genes overlapping repeats.

Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated with Repeat Detector (Red) (Girgis, 2015) and RepeatMasker (Smit et al., 2015) with the libraries REdat (Nussbaumer et al., 2013) and nrTEplants. The results for 10 genomes masked with RepMod custom libraries are also shown (Flynn et al., 2020)

As gene annotation is frequently performed after repeat masking, we reasoned this could affect the Pfam enrichment analyses. Therefore, we carried out a complementary analysis where NLR genes were called de novo on the genomic sequences instead of using the Ensembl gene annotation. The results, summarized in Supplemental Table S7, confirm that Red tends to mask fewer NLR genes than expected at genomic scale, with only one species (Trifolium pratense L.) with an odds ratio > 1. In contrast, we obtained odds ratios greater than 1 for several species with REdat (n = 7), nrTEplants (n = 12), and RepMod (n = 6 out of 10 species).

3.3. Annotating Red-masked repeats within genomes with nrTEplants and minimap2

In the previous analyses, we showed that Red masking is an effective way of calling repeats in plant genomes, comparable with RepMod. Moreover, we observed that nrTEplants behaved better than REdat in most cases. Therefore, we wanted to check whether repeats called with Red could be annotated and classified. For that, we aligned the repeat sequences against the nonredundant nrTEplants library with minimap2. The results are plotted in Figure 4, where it can be seen that in most species, more than half of the repeat space could be annotated (median: 65.9%). As our library contained only TEs, we expected a fraction of the unmapped space to contain simple repeats or satellite DNA. However, in some species, only a small fraction of repeats could be classified. We reasoned this was caused by a repeat consensus not represented in the library. This was confirmed in a separate experiment, where the repeated sequences of olive and R. chinensis obtained from their authors were mapped to Red repeats, as seen in Figure 4 (control). Another positive control was also carried out with sunflower repeated sequences in order to confirm that no valuable repeats had been lost during the construction of nrTEplants. These results indicated that in species where a curated library did not work well, the repeats could be classified with a custom collection of repeated sequences for that taxon. As we saw in the previous section, this can also be achieved with species-specific libraries produced with RepMod; however, note that in our tests three-fourths of repeat families discovered by RepMod remained unclassified (see Supplemental Table S8).
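One way to estimate the annotated fraction of the repeat space from minimap2 output is sketched below. It assumes that the Red repeats were the query sequences, that alignments are in PAF format (columns 1, 3 and 4 hold the query name, start and end), and that overlapping alignments on the same repeat should be counted once. It is an illustration and not the logic implemented in AnnotRedRepeats.py.

```python
def mapped_fraction(paf_lines, repeat_lengths):
    """Return the fraction of repeat bases covered by at least one alignment.

    `paf_lines` is an iterable of PAF records (tab-separated text lines);
    `repeat_lengths` maps every repeat name, mapped or not, to its length in bp.
    """
    spans_by_repeat = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # PAF has 12 mandatory columns; skip malformed lines
        spans_by_repeat.setdefault(fields[0], []).append((int(fields[2]), int(fields[3])))

    covered = 0
    for spans in spans_by_repeat.values():
        spans.sort()
        current_start, current_end = spans[0]
        for start, end in spans[1:]:
            if start <= current_end:          # overlapping or adjacent: extend
                current_end = max(current_end, end)
            else:                             # gap: close the current block
                covered += current_end - current_start
                current_start, current_end = start, end
        covered += current_end - current_start

    total = sum(repeat_lengths.values())
    return covered / total if total else 0.0
```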

Figure 4. Fraction of Repeat Detector (Red) repeats mapped to nrTEplants sequences.

Twenty genomes from release 49 (November 2020) of Ensembl Plants were annotated with Red (Girgis, 2015). The resulting repeats were subsequently mapped to the library nrTEplants with minimap2 (Li, 2018), producing the genome fractions shown. Repeats from three species (R. chinensis, O. europaea, and H. annuus) were also mapped to annotated repeats provided by the respective sequencing consortia as a control. Species are sorted by genome size from smallest to largest

The results in the previous paragraph were obtained with the default map-ont setting of minimap2. Note that we also tried the map-pb and asm20 settings, but obtained similar results. Red clover (Trifolium pratense) was reanalyzed replacing minimap2 with the BLAST algorithms megablast, dc-megablast, blastn, and rmblastn (Altschul et al., 1997). Compared with the mapped fraction produced by minimap2 (0.4%), a maximum value of 6.1% was obtained with blastn. This modest gain in sensitivity required 1,412 min. The algorithm rmblastn, used by RM, yielded a mapped fraction of 0.7%. We concluded that the alternatives to minimap2 offered little gain at the cost of spiralling computing time.

Figure 5 shows the runtime and RAM required by the two-step protocol presented in this paper, measured on a CentOS 7.9 computer using four cores of a Xeon E5-2620 v4 (2.10 GHz) central processing unit. Panels A and B correspond to the first step, Red masking. It can be seen that all genomes tested take less than 40 min to run, with the exception of tetraploid Triticum turgidum L., which took 71 min. The memory consumption was below 20 GB in most cases, but climbed to 22.7 GB and 29.9 GB for Aegilops tauschii Coss. and T. turgidum. Panel C illustrates the runtime of the second step, the mapping of nrTEplants. It can be seen that all plants required less than 27 min, except A. tauschii and T. turgidum, which took 3 and 1 h, respectively. The memory consumed by minimap2 was ∼3.8 GB in all cases. A comparison with the data in Supplemental Table S8 indicated that the protocol presented in this paper was up to two orders of magnitude faster than the combination of RepMod and RM, even with only four central processing unit cores.

Figure 5.

Runtime and memory requirements of a two-step repeat annotation protocol based on the Repeat Detector (Girgis, 2015), minimap2 (Li, 2018), and the nrTEplants library. The protocol was tested on 20 genomes from release 49 (November 2020) of Ensembl Plants. Similar values were measured on an Ubuntu box with a four-core i5-6600 (3.30 GHz) central processing unit.

4. Conclusions

The hybrid two-step methodology presented in this paper was tested on 20 angiosperms with genome sizes ranging from 0.12 to 10.46 Gbp. Overall, we observed that Red consistently produced repeated fractions similar to the expected values from the literature. Comparable results were obtained for 10 species analyzed with RepMod custom libraries. The meta-library nrTEplants, built by Pfam-informed sequence clustering, also showed good performance in most species but failed to recover the expected repeat fraction in cases such as melon (Cucumis melo L.) or sunflower. This observation highlights the problem of using repeat libraries that do not include sequences similar to the genome of interest. This is the most likely explanation for the low masking values observed for REdat, as that library was produced before many of these genomes were available. For that reason, separating the tasks of calling and classifying repeats, as performed here, seems a promising strategy.

On the one hand, Red k-mer masking does not have a preference for masking particular protein-coding families, in contrast to repeats annotated with RM using REdat and nrTEplants. In fact, it also behaved better than custom RepMod libraries with respect to NLR genes annotated de novo. On the other hand, Red appropriately masked plant genomes for which no repeat libraries have been curated yet. If there is a need to classify the repeats called by Red, a curated repeat library can be obtained directly from Ensembl Plants (see https://github.com/Ensembl/plant-scripts/blob/master/repeats/get_repeats_ensembl.sh) or the INSDC archives (see, for example, https://www.ebi.ac.uk/ena/browser/view/CACTIH01), or by clustering repeats from different sources, as demonstrated in this study. Our protocol took less than 2 h to run and up to 30 GB of RAM, and can use nrTEplants or any repeat library in FASTA format. This is about two orders of magnitude faster than building species-specific custom libraries with RepMod for the species tested in this benchmark. We thus conclude that the approach presented here is an efficient way of annotating repeated sequences in plant genomes.
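The classification step can be sketched in Python as a best-hit assignment over minimap2 alignments in PAF format. The sketch assumes that masked segments are the queries and library sequences are the targets. The function name, the 0.5 aligned-fraction cutoff, and the sequence names in the example are hypothetical and do not reproduce the scripts distributed with this work.

```python
def best_hits_from_paf(paf_lines, min_aligned_fraction=0.5):
    """Return {query: target}, keeping per query the hit with most matching bases.

    PAF columns used (0-based): 0 query name, 1 query length, 2 query start,
    3 query end, 5 target name, 9 number of matching bases.
    """
    best = {}
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip malformed or truncated lines
        qname, qlen = cols[0], int(cols[1])
        qstart, qend = int(cols[2]), int(cols[3])
        tname, nmatch = cols[5], int(cols[9])
        # Ignore hits that cover too little of the masked segment.
        if (qend - qstart) / qlen < min_aligned_fraction:
            continue
        if qname not in best or nmatch > best[qname][1]:
            best[qname] = (tname, nmatch)
    return {q: t for q, (t, _) in best.items()}


if __name__ == "__main__":
    example = [
        "seg1\t1000\t0\t900\t+\tTE_A#LTR/Gypsy\t5000\t100\t1000\t850\t900\t60",
        "seg1\t1000\t0\t600\t+\tTE_B#DNA/MuDR\t3000\t0\t600\t500\t600\t60",
        "seg2\t1000\t0\t100\t-\tTE_B#DNA/MuDR\t3000\t0\t100\t95\t100\t60",
    ]
    print(best_hits_from_paf(example))  # {'seg1': 'TE_A#LTR/Gypsy'}
```

Segments without a qualifying hit, such as seg2 above, remain masked but unclassified, which matches the separation of repeat calling from repeat classification described in this section.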

Supplementary Material

Supplemental Figure 1
Supplemental Tables

Core Ideas.

  • Control Pfam domains minimize unrelated coding sequences in repeat libraries.

  • Repeat calling by k-mer counting with Red does not preferentially mask NLR genes.

  • Repeats called by Red can be efficiently classified by sequence similarity with minimap2.

Acknowledgments

We are grateful to Doreen Ware and Vasili Sitnik for comments on drafts of this manuscript. We thank the Gramene team for their continual support and cooperation, as well as members of the Ensembl team for developing and maintaining the front-end and back-end software and infrastructure that underpin Ensembl Plants. This work was funded by The UK Biosciences and Biotechnology Research Council [BB/P016855/1, BB/P027849/1, and Ensembl-4-Breeders workshop support], the National Sciences Foundation [1127112], the ELIXIR implementation studies FONDUE and ‘Apple as a Model for Genomic Information Exchange’, and the European Molecular Biology Laboratory. Funding for open access charges was provided by the UK Biosciences and Biotechnology Research Council [BB/P016855/1].

Abbreviations

NLR: nucleotide-binding and leucine-rich repeat immune receptor

R genes: disease resistance genes

Red: Repeat Detector

RepMod: RepeatModeler

RM: RepeatMasker

TE: transposable element

Footnotes

AUTHOR CONTRIBUTIONS

Bruno Contreras-Moreira: formal analysis, funding acquisition, investigation, resources, software, supervision, writing—original draft, writing—review and editing. Carla V Filippi: data curation, investigation, methodology, resources, writing—original draft, writing—review and editing. Guy Naamati: resources, software, writing—review and editing. James E Allen: resources, software, writing—review and editing. Paul Flicek: funding acquisition, resources, writing—original draft, writing—review and editing.

CONFLICT OF INTEREST

Paul Flicek is a member of the Scientific Advisory Boards of Fabric Genomics, Inc. and Eagle Genomics Ltd.

Data and Source Code Availability

The repeat library and the scripts used to mask and annotate the plant genomes, together with the benchmark scripts and data, can be obtained at https://github.com/Ensembl/plant-scripts.

References

  1. Alioto T, Alexiou KG, Bardil A, Barteri F, Castanera R, Cruz F, Dhingra A, Duval H, Fernández i Martí Á, Frias L, Galán B, et al. Transposons played a major role in the diversification between the closely related almond and peach genomes: Results from the almond genome sequence. The Plant Journal. 2020;101:455–472. doi: 10.1111/tpj.14538.
  2. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
  3. Amselem J, Cornut G, Choisne N, Alaux M, Alfama-Depauw F, Jamilloux V, Maumus F, Letellier T, Luyten I, Pommier C, Adam-Blondon A-F, et al. RepetDB: A unified resource for transposable element references. Mobile DNA. 2019;10:6. doi: 10.1186/s13100-019-0150-y.
  4. Badouin H, Gouzy J, Grassa CJ, Murat F, Staton SE, Cottret L, Lelandais-Brière C, Owens GL, Carrère S, Mayjonade B, Legrand L, et al. The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature. 2017;546:148–152. doi: 10.1038/nature22380.
  5. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9.
  6. Baud A, Wan M, Nouaud D, Anxolabehere D, Quesneville H. Traces of past transposable element presence in Brassicaceae genome dark matter. BioRxiv. 2019. doi: 10.1101/547877.
  7. Bayer PE, Edwards D, Batley J. Bias in resistance gene prediction due to repeat masking. Nature Plants. 2018;4:762–765. doi: 10.1038/s41477-018-0264-0.
  8. Beier S, Ulpinnis C, Schwalbe M, Münch T, Hoffie R, Koeppel I, Hertig C, Budhagatapalli N, Hiekel S, Pathi KM, Hensel G, et al. Kmasker plants—A tool for assessing complex sequence space in plant species. The Plant Journal. 2020;102:631–642. doi: 10.1111/tpj.14645.
  9. Benson G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Research. 1999;27:573–580. doi: 10.1093/nar/27.2.573.
  10. Castanera R, Ruggieri V, Pujol M, Garcia-Mas J, Casacuberta JM. An improved melon reference genome with single-molecule sequencing uncovers a recent burst of transposable elements with potential impact on genes. Frontiers in Plant Science. 2019;10:1815. doi: 10.3389/fpls.2019.01815.
  11. Contreras-Moreira B, Cantalapiedra CP, García-Pereira MJ, Gordon SP, Vogel JP, Igartua E, Casas AM, Vinuesa P. Analysis of plant pan-genomes and transcriptomes with GET_HOMOLOGUES-EST, a clustering solution for sequences of the same species. Frontiers in Plant Science. 2017;8:184. doi: 10.3389/fpls.2017.00184.
  12. Contreras-Moreira B, Naamati G, Rosello M, Allen JE, Hunt SE, Muffato M, Gall A, Flicek P. Ensembl/Plant-Scripts. GitHub. 2021. https://github.com/Ensembl/plant_tools
  13. da Cruz MHP, Domingues DS, Saito PTM, Paschoal AR, Bugatti PH. TERL: Classification of transposable elements by convolutional neural networks. Briefings in Bioinformatics. 2020;22(3):bbaa185. doi: 10.1093/bib/bbaa185.
  14. Daccord N, Celton J-M, Linsmith G, Becker C, Choisne N, Schijlen E, van de Geest H, Bianco L, Micheletti D, Velasco R, Di Pierro EA, et al. High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development. Nature Genetics. 2017;49:1099–1106. doi: 10.1038/ng.3886.
  15. De Vega JJ, Ayling S, Hegarty M, Kudrna D, Goicoechea JL, Ergon Å, Rognli OA, Jones C, Swain M, Geurts R, Lang C, et al. Red clover (Trifolium pratense L.) draft genome provides a platform for trait improvement. Scientific Reports. 2015;5:17394. doi: 10.1038/srep17394.
  16. Du J, Grant D, Tian Z, Nelson RT, Zhu L, Shoemaker RC, Ma J. SoyTEdb: A comprehensive database of transposable elements in the soybean genome. BMC Genomics. 2010;11:113. doi: 10.1186/1471-2164-11-113.
  17. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763. doi: 10.1093/bioinformatics/14.9.755.
  18. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences of the United States of America. 2020;117:9451–9457. doi: 10.1073/pnas.1921046117.
  19. French–Italian Public Consortium for Grapevine Genome Characterization. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007;449:463–467. doi: 10.1038/nature06148.
  20. Girgis HZ. Red: An intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinformatics. 2015;16:227. doi: 10.1186/s12859-015-0654-5.
  21. Gordon SP, Contreras-Moreira B, Woods DP, Des Marais DL, Burgess D, Shu S, Stritt C, Roulin AC, Schackwitz W, Tyler L, Martin J, et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nature Communications. 2017;8:2184. doi: 10.1038/s41467-017-02292-8.
  22. Guo S, Zhang J, Sun H, Salse J, Lucas WJ, Zhang H, Zheng Y, Mao L, Ren Y, Wang Z, Min J, et al. The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nature Genetics. 2013;45:51–58. doi: 10.1038/ng.2470.
  23. Hall TA. BioEdit: A user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symposium Series. 1999;41:95–98.
  24. Harris RS. Improved pairwise alignment of genomic DNA. Doctoral dissertation. The Pennsylvania State University; 2007.
  25. Hibrand Saint-Oyant L, Ruttink T, Hamama L, Kirov I, Lakhwani D, Zhou NN, Bourke PM, Daccord N, Leus L, Schulz D, Van de Geest H, et al. A high-quality genome sequence of Rosa chinensis to elucidate ornamental traits. Nature Plants. 2018;4:473–484. doi: 10.1038/s41477-018-0166-1.
  26. Hickey G, Heller D, Monlong J, Sibbesen JA, Sirén J, Eizenga J, Dawson ET, Garrison E, Novak AM, Paten B. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biology. 2020;21:35. doi: 10.1186/s13059-020-1941-7.
  27. Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser DM, Cambell L, Carbajo M, et al. Ensembl Genomes 2020—Enabling non-vertebrate genomic research. Nucleic Acids Research. 2020;48:D689–D695. doi: 10.1093/nar/gkz890.
  28. International Brachypodium Initiative. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature. 2010;463:763–768. doi: 10.1038/nature08747.
  29. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature. 2005;436:793–800. doi: 10.1038/nature03895.
  30. Jiménez-Ruiz J, Ramírez-Tejero JA, Fernández-Pozo N, Leyva-Pérez M de la O, Yan H, de la Rosa R, Belaj A, Montes E, Rodríguez-Ariza MO, Navarro F, et al. Transposon activation is a major driver in the genome evolution of cultivated olive trees (Olea europaea L.). The Plant Genome. 2020;13:e20010. doi: 10.1002/tpg2.20010.
  31. Kagale S, Koh C, Nixon J, Bollina V, Clarke WE, Tuteja R, Spillane C, Robinson SJ, Links MG, Clarke C, Higgins EE, et al. The emerging biofuel crop Camelina sativa retains a highly undifferentiated hexaploid genome structure. Nature Communications. 2014;5:3706. doi: 10.1038/ncomms4706.
  32. Kinsella RJ, Kähäri A, Haider S, Zamora J, Proctor G, Spudich G, Almeida-King J, Staines D, Derwent P, Kerhornou A, Kersey P, et al. Ensembl BioMarts: A hub for data retrieval across taxonomic space. Database. 2011;2011:bar030. doi: 10.1093/database/bar030.
  33. Ksouri N, Castro-Mondragón JA, Montardit-Tardá F, van Helden J, Contreras-Moreira B, Gogorcena Y. Tuning promoter boundaries improves regulatory motif discovery in nonmodel plants: The peach example. Plant Physiology. 2021;185(3):1242–1258. doi: 10.1093/plphys/kiaa091.
  34. Kurtz S, Narechania A, Stein JC, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008;9:517. doi: 10.1186/1471-2164-9-517.
  35. Legrand S, Caron T, Maumus F, Schvartzman S, Quadrana L, Durand E, Gallina S, Pauwels M, Mazoyer C, Huyghe L, Colot V, et al. Differential retention of transposable element-derived sequences in outcrossing Arabidopsis genomes. Mobile DNA. 2019;10:30. doi: 10.1186/s13100-019-0171-6.
  36. Lerat E. Identifying repeats and transposable elements in sequenced genomes: How to find your way through the dense forest of programs. Heredity. 2010;104:520–533. doi: 10.1038/hdy.2009.165.
  37. Li H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191.
  38. Maccaferri M, Harris NS, Twardziok SO, Pasam RK, Gundlach H, Spannagl M, Ormanbekova D, Lux T, Prade VM, Milner SG, Himmelbach A, et al. Durum wheat genome highlights past domestication signatures and future improvement targets. Nature Genetics. 2019;51:885–895. doi: 10.1038/s41588-019-0381-3.
  39. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011.
  40. McClintock B. The origin and behavior of mutable loci in maize. Proceedings of the National Academy of Sciences of the United States of America. 1950;36:344–355. doi: 10.1073/pnas.36.6.344.
  41. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, et al. Pfam: The protein families database in 2021. Nucleic Acids Research. 2021;49:D412–D419. doi: 10.1093/nar/gkaa913.
  42. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. Journal of Computational Biology. 2006;13:1028–1040. doi: 10.1089/cmb.2006.13.1028.
  43. Natali L, Cossu RM, Barghini E, Giordani T, Buti M, Mascagni F, Morgante M, Gill N, Kane NC, Rieseberg L, Cavallini A. The repetitive component of the sunflower genome as shown by different procedures for assembling next generation sequencing reads. BMC Genomics. 2013;14:686. doi: 10.1186/1471-2164-14-686.
  44. Novák P, Guignard MS, Neumann P, Kelly LJ, Mlinarec J, Koblížková A, Dodsworth S, Kovařík A, Pellicer J, Wang W, Macas J, et al. Repeat-sequence turnover shifts fundamentally in species with large genomes. Nature Plants. 2020;6:1325–1329. doi: 10.1038/s41477-020-00785-x.
  45. Nussbaumer T, Martis MM, Roessner SK, Pfeifer M, Bader KC, Sharma S, Gundlach H, Spannagl M. MIPS PlantsDB: A database framework for comparative plant genome research. Nucleic Acids Research. 2013;41:D1144–D1151. doi: 10.1093/nar/gks1153.
  46. Osuna-Cruz CM, Paytuvi-Gallart A, Di Donato A, Sundesha V, Andolfo G, Aiese Cigliano R, Sanseverino W, Ercolano MR. PRGdb 3.0: A comprehensive platform for prediction and analysis of plant disease resistance genes. Nucleic Acids Research. 2018;46:D1197–D1201. doi: 10.1093/nar/gkx1119.
  47. Ou S, Su W, Liao Y, Chougule K, Agda JRA, Hellinga AJ, Lugo CSB, Elliott TA, Ware D, Peterson T, Jiang N, Hirsch CN, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology. 2019;20:275. doi: 10.1186/s13059-019-1905-y.
  48. Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
  49. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2020.
  50. Raymond O, Gouzy J, Just J, Badouin H, Verdenaud M, Lemainque A, Vergne P, Moja S, Choisne N, Pont C, Carrère S, et al. The Rosa genome provides new insights into the domestication of modern roses. Nature Genetics. 2018;50:772–777. doi: 10.1038/s41588-018-0110-3.
  51. Ruggieri V, Alexiou KG, Morata J, Argyris J, Pujol M, Yano R, Nonaka S, Ezura H, Latrasse D, Boualem A, Benhamed M, et al. An improved assembly and annotation of the melon (Cucumis melo L.) reference genome. Scientific Reports. 2018;8:8088. doi: 10.1038/s41598-018-26416-2.
  52. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, et al. The B73 maize genome: Complexity, diversity, and dynamics. Science. 2009;326:1112–1115. doi: 10.1126/science.1178534.
  53. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. Institute for Systems Biology. 2015. https://www.repeatmasker.org
  54. Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E. The Ensembl core software libraries. Genome Research. 2004;14:929–933. doi: 10.1101/gr.1857204.
  55. Staton SE, Bakken BH, Blackman BK, Chapman MA, Kane NC, Tang S, Ungerer MC, Knapp SJ, Rieseberg LH, Burke JM. The sunflower (Helianthus annuus L.) genome reflects a recent history of biased accumulation of transposable elements. The Plant Journal. 2012;72:142–153. doi: 10.1111/j.1365-313X.2012.05072.x.
  56. Steuernagel B, Witek K, Krattinger SG, Ramirez-Gonzalez RH, Schoonbeek H-J, Yu G, Baggs E, Witek AI, Yadav I, Krasileva KV, Jones JDG, et al. The NLR-Annotator tool enables annotation of the intracellular immune receptor repertoire. Plant Physiology. 2020;183:468–482. doi: 10.1104/pp.19.01273.
  57. Studer A, Zhao Q, Ross-Ibarra J, Doebley J. Identification of a functional transposon insertion in the maize domestication gene tb1. Nature Genetics. 2011;43:1160–1163. doi: 10.1038/ng.942.
  58. Thielen PM, Pendelton AL, Player RA, Bowden KV, Lawton TJ, Wisecaver JH. Reference genome for the highly transformable Setaria viridis cultivar ME034V. G3: Genes, Genomes, Genetics. 2020;10(10):3467–3478. doi: 10.1534/g3.120.401345.
  59. Thieme M, Lanciano S, Balzergue S, Daccord N, Mirouze M, Bucher E. Inhibition of RNA polymerase II allows controlled mobilisation of retrotransposons for plant breeding. Genome Biology. 2017;18:134. doi: 10.1186/s13059-017-1265-4.
  60. UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Research. 2019;47:D506–D515. doi: 10.1093/nar/gky1049.
  61. Unver T, Wu Z, Sterck L, Turktas M, Lohaus R, Li Z, Yang M, He L, Deng T, Escalante FJ, Llorens C, et al. Genome of wild olive and the evolution of oil biosynthesis. Proceedings of the National Academy of Sciences of the United States of America. 2017;114:E9413–E9422. doi: 10.1073/pnas.1708621114.
  62. Van Bel M, Bucchini F, Vandepoele K. Gene space completeness in complex plant genomes. Current Opinion in Plant Biology. 2019;48:9–17. doi: 10.1016/j.pbi.2019.01.001.
  63. Vassetzky NS, Kramerov DA. SINEBase: A database and tool for SINE analysis. Nucleic Acids Research. 2013;41:D83–D89. doi: 10.1093/nar/gks1263.
  64. Walkowiak S, Gao L, Monat C, Haberer G, Kassa MT, Brinton J, Ramirez-Gonzalez RH, Kolodziej MC, Delorean E, Thambugala D, Klymiuk V, et al. Multiple wheat genomes reveal global variation in modern breeding. Nature. 2020;588:277–283. doi: 10.1038/s41586-020-2961-x.
  65. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, Paux E, et al. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics. 2007;8:973–982. doi: 10.1038/nrg2165.
  66. Wierzbicki F, Schwarz F, Cannalonga O, Kofler R. Generating high quality assemblies for genomic analysis of transposable elements. BioRxiv. 2020. doi: 10.1101/2020.03.27.011312.
  67. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, Bouwman J, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3:160018. doi: 10.1038/sdata.2016.18.
  68. Willing E-M, Rawat V, Mandáková T, Maumus F, James GV, Nordström KJV, Becker C, Warthmann N, Chica C, Szarzynska B, Zytnicki M, et al. Genome expansion of Arabis alpina linked with retrotransposition and reduced symmetric DNA methylation. Nature Plants. 2015;1:14023. doi: 10.1038/nplants.2014.23.
  69. Zhang L, Cai X, Wu J, Liu M, Grob S, Cheng F, Liang J, Cai C, Liu Z, Liu B, Wang F, et al. Improved Brassica rapa reference genome by single-molecule sequencing and chromosome conformation capture technologies. Horticulture Research. 2018;5:50. doi: 10.1038/s41438-018-0071-9.
  70. Zhao G, Zou C, Li K, Wang K, Li T, Gao L, Zhang X, Wang H, Yang Z, Liu X, Jiang W, et al. The Aegilops tauschii genome reveals multiple impacts of transposons. Nature Plants. 2017;3:946–955. doi: 10.1038/s41477-017-0067-8.
