PLOS Genetics. 2021 Oct 14;17(10):e1009768. doi: 10.1371/journal.pgen.1009768

The genomic ecosystem of transposable elements in maize

Michelle C Stitzer 1,*, Sarah N Anderson 2, Nathan M Springer 2, Jeffrey Ross-Ibarra 1,3
Editor: Kirsten Bomblies 4
PMCID: PMC8547701  PMID: 34648488

Abstract

Transposable elements (TEs) constitute the majority of flowering plant DNA, reflecting their tremendous success in subverting, avoiding, and surviving the defenses of their host genomes to ensure their selfish replication. More than 85% of the sequence of the maize genome can be ascribed to past transposition, providing a major contribution to the structure of the genome. Evidence from individual loci has informed our understanding of how transposition has shaped the genome, and a number of individual TE insertions have been causally linked to dramatic phenotypic changes. Genome-wide analyses in maize and other taxa have frequently represented TEs as a relatively homogeneous class of fragmentary relics of past transposition, obscuring their evolutionary history and interaction with their host genome. Using an updated annotation of structurally intact TEs in the maize reference genome, we investigate the family-level dynamics of TEs in maize. Integrating a variety of data, from descriptors of individual TEs like coding capacity, expression, and methylation, as well as similar features of the sequence they inserted into, we model the relationship between attributes of the genomic environment and the survival of TE copies and families. In contrast to the wholesale relegation of all TEs to a single category of junk DNA, these differences reveal a diversity of survival strategies of TE families. Together these generate a rich ecology of the genome, with each TE family representing the evolution of a distinct ecological niche. We conclude that while the impact of transposition is highly family- and context-dependent, a family-level understanding of the ecology of TEs in the genome can refine our ability to predict the role of TEs in generating genetic and phenotypic diversity.

Author summary

Transposable elements (TEs) are pieces of DNA that can jump to new positions in the genome. When they land at a new location, they generate a mutation. Such mutations in genes affecting kernel and plant pigmentation allowed the discovery of TEs in maize in the 1940s. Since then, we have learned that TEs are a ubiquitous feature of eukaryotic genomes, and that TEs make up over 85% of all the DNA in a maize genome. Here, we investigate the roles of individual TE copies and TE species interacting within the maize genome, and how these relationships are analogous to ecological communities. The community of transposable elements within a single genome represents a rich diversity of ecological strategies for survival in the complex, hostile environment of a host genome.

Introduction

‘Lumping our beautiful collection of transposons into a single category is a crime’

-Michael R. Freeling, Mar. 10, 2017

Transposable elements (TEs) are pieces of DNA that can replicate or move themselves within a genome. The majority of DNA in plant genomes is TE derived, and their activity is the largest contributor to differences in genome size within and between taxa [1]. When they transpose, TEs also generate mutations as they insert into novel positions in the genome [2, 3]. These two linked processes, replication of the TE and mutation suffered by the host genome, generate a conflict between individual lineages of TEs and their host genome. Individual TE lineages gain evolutionary advantage by increasing in copy number, while the host genome gains fitness if it can reduce deleterious mutations arising from transposition. As a result of this conflict, many genomes are littered with a bulk of TE-derived DNA that is often both transcriptionally and recombinationally inert [4]. While this conflict between TEs and their host has long been noted to shape general patterns of TE evolution [5–8], the details of how this conflict unfolds are tenuous and rarely well understood [9, 10].

The staggering diversity of TEs presents a major challenge in understanding their conflict with the host genome. For example, although they are united by their ability to move between positions in the host genome, the mechanisms by which TEs do so differ between the major TE classes. Class I retrotransposons, often the major contributor of TE DNA in plants [11], can be further divided into three orders—long terminal repeat (LTR), long interspersed nuclear element (LINE), and short interspersed nuclear element (SINE). All class I TEs are transcribed to mRNA by host polymerases, some are translated to produce reverse transcriptase and other enzymes, and all use TE encoded enzymes for reverse transcription of a cDNA copy that can be integrated at a new position in the host genome. In contrast, the two major orders of class II DNA TEs transpose in different ways. TIR elements are physically excised from one position on the chromosome and moved by TE-encoded transposase proteins that recognize short, diagnostic, terminal inverted repeats (TIRs). Helitron elements are thought to transpose via a rolling circle mechanism that generates a new copy after a single strand nick by an element-encoded protein and subsequent strand invasion and repair [12]. The process of transposition for most TEs (all LTR, TIR; some LINE, SINE) generates a target site duplication (TSD) in the host DNA at the integration site, and thus the identification of a TSD bordering a TE can confirm transposition. These well-described mechanisms of transposition generate predictable sequence organization that can be recognized computationally, but also generate differences in the genomic localization of these elements, via enzymatic site preference of TE encoded proteins [13, 14]. Most of these orders are further subdivided into TE superfamilies based on differences in the sequence, arrangement, and function of proteins encoded by the TE to ensure its transposition [15].
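
The structural signatures described above (terminal inverted repeats at the ends of a TIR element, and a target site duplication in the flanking host DNA) can be checked with simple string comparisons. The following is a minimal sketch in Python, not the annotation pipeline used here; the function names, the exact-match default, and the idea of passing the TIR and TSD lengths as parameters are assumptions made for illustration, since the expected lengths differ among superfamilies.

```python
# Illustrative sketch only: function names and parameters are hypothetical,
# and real annotation tools tolerate mismatches and indels more carefully.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def has_tir(element: str, tir_len: int, max_mismatch: int = 0) -> bool:
    """True if the first tir_len bases match the reverse complement
    of the last tir_len bases, allowing up to max_mismatch differences."""
    if tir_len <= 0 or 2 * tir_len > len(element):
        return False
    left = element[:tir_len].upper()
    right_rc = reverse_complement(element[-tir_len:])
    mismatches = sum(a != b for a, b in zip(left, right_rc))
    return mismatches <= max_mismatch


def has_tsd(genome: str, start: int, end: int, tsd_len: int) -> bool:
    """True if the tsd_len bases immediately upstream of the element
    (genome[start:end], half-open) are repeated immediately downstream."""
    if tsd_len <= 0 or start - tsd_len < 0 or end + tsd_len > len(genome):
        return False
    upstream = genome[start - tsd_len:start].upper()
    downstream = genome[end:end + tsd_len].upper()
    return upstream == downstream


# Toy example: an element with 8 bp TIRs inserted at a 3 bp TSD ("TAA").
element = "CAGGGTTT" + "ACGTACGTAA" + "AAACCCTG"
genome = "GGCC" + "TAA" + element + "TAA" + "CCGG"
start = 7
end = start + len(element)
print(has_tir(element, tir_len=8), has_tsd(genome, start, end, tsd_len=3))
```

A candidate that passes both checks is consistent with a transposition event, although short TSDs can also match by chance.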

The process of transposition generates new TE copies within a genome, forming relationships between TEs that allow their systematic grouping into families. Many taxonomic schemes for TEs exist [15–19], but the most widely-applied approach for genome-scale data [15] relies on sequence homology between copies. Although not entirely representative of TE evolutionary history [20, 21], such approaches nonetheless reflect to some degree the ability of TE encoded proteins to bind TE DNA and move other TE copies in trans, as recognition of specific nucleic acid sequences by TE-encoded proteins is a necessary step in the transposition process. The resulting TE families thus represent groups of related TEs that share both evolutionary history and transposition machinery, and are the groupings most naturally analogous to species in higher eukaryotes.

TE families differ from one another in many ways, including their total copy number, where they insert in the genome, which tissues they are expressed in, and how they are restricted epigenetically by the host genome. In the maize genome, some families are small, consisting of a few (e.g. Bs [22]) or tens (e.g. Ds1 [23]) of copies, while others contain tens of thousands of copies (e.g. huck, cinful-zeon [24–28]). Some TE families are expressed in certain tissues (e.g. Misfit [29]), while others are expressed more broadly across many tissues (e.g. cinful [29]). Some families preferentially insert into genic regions (e.g. Mu1 [30]), others in the centromere (e.g. CRM1 [31]). And while some families lack DNA methylation, others are methylated across the entire body of the TE, and yet others act to spread methylation outwards into flanking sequences [32, 33].

Although the major classes of TEs are found across taxa, their relative abundances differ [34] and there is no clear consensus as to the factors that explain the diversity of TEs within a genome [35–38]. One approach to understand the diversity of TEs is to consider the genome as a community and apply principles of community ecology to understand their distribution and abundance [39]. Initially proposed in terms of a dichotomy between TEs that have specialized in heterochromatic or euchromatic niches [7], thoughts about the ecology of the genome have been refined into a continuum of space, with different TE lineages existing in different genomic niches [8, 39, 40]. Empirical descriptions of TEs in a community ecology context, however, have been limited to a few families [41–43]. The genome reflects the interface between ecological and evolutionary processes, as TEs alter their environment by inserting. This in turn affects how TEs evolve and adapt, making the genome a system to explore the interface between ecology and evolution [44–46]. Due to the long time scales over which TEs can persist in genomes, distinguishing whether processes occurring within the genome reflect ecological or evolutionary time scales can be difficult, although the two can be separated [47].

Here, we utilize the genomic ecosystem as a framework to describe patterns we observe in the extant maize genome. We take advantage of the diversity of TEs in maize, the record of past transposition still detectable in the genome, and the rich developmental and tissue-specific resources of maize to investigate the family-level ecological and evolutionary dynamics of TEs in maize. We integrate many metrics that can be measured at the level of TE family to present a natural history of TEs in the B73 maize genome, characterizing and describing the genomic features that differentiate superfamilies and families of TEs. We model survival of individual copies and families in the genome to facilitate an understanding of the complex and interactive strategies TEs use to associate with their host and each other, and identify suites of traits that act to define specific genomic niches and survival strategies. We conclude that understanding the diversity of TEs in the maize genome helps not only to describe TE function, but also that of the host genome.

Results

General features of TE orders and superfamilies

We identified members of each of the 13 superfamilies of transposable elements (TEs) previously identified in plants [15] in our structural annotation of the maize B73 reference genome. This annotation resolves nested insertions of TEs within other elements, resulting in a total of 143,067 LTR retrotransposons (RLC (Ty1/Copia), RLG (Ty3/Gypsy), and RLX (Unknown LTR) superfamilies), 1,640 LINE and SINE (nonLTR) retrotransposons (RIL, RIT, and RST superfamilies), 171,570 TIR transposons (DTA (hAT), DTC (CACTA), DTH (Pif/Harbinger), DTM (Mutator), DTT (Tc1/Mariner), and DTX (Unknown TIR) superfamilies), and 22,234 Helitrons (DHH superfamily) (Table 1 and Fig 1A). We determined the number of families, median length, median age, distance to the nearest gene, and the number of base pairs each superfamily contributes to the genome (Fig 2; Interactive distributions per family: https://mcstitzer.shinyapps.io/maize_te_families/). For each family and superfamily, we determined the proportion of elements that are nested within another TE and the proportion of elements that are split into multiple pieces by other TE insertions.

Table 1. TE superfamilies in the maize genome.

| Class | Order | Superfamily | Common name | Number of copies | Number of families |
| --- | --- | --- | --- | --- | --- |
| DNA transposon | Helitron | DHH | Helitron | 22,339 | 1,722 |
| DNA transposon | TIR | DTA | hAT | 5,096 | 275 |
| DNA transposon | TIR | DTC | CACTA | 2,768 | 73 |
| DNA transposon | TIR | DTH | Pif/Harbinger | 63,216 | 458 |
| DNA transposon | TIR | DTM | Mutator | 928 | 67 |
| DNA transposon | TIR | DTT | Tc1/Mariner | 67,533 | 269 |
| DNA transposon | TIR | DTX | Unknown TIR | 34,778 | 76 |
| Retrotransposon | LTR | RLC | Ty1/Copia | 46,553 | 2,788 |
| Retrotransposon | LTR | RLG | Ty3/Gypsy | 75,761 | 7,719 |
| Retrotransposon | LTR | RLX | Unknown LTR | 20,789 | 13,290 |
| Retrotransposon | nonLTR | RIT | RTE | 296 | 2 |
| Retrotransposon | nonLTR | RIL | L1 | 477 | 29 |
| Retrotransposon | nonLTR | RST | SINE | 892 | 533 |

Fig 1. Abundance of TEs.

The relative copy number (A) and size in million base pairs (Mb) (B) of families and superfamilies shown by the size of the rectangle. Superfamilies are denoted by color, and each family is bounded by gray lines within the superfamily. Superfamily names begin with a two letter code: ‘DT’ belong to the order of Terminal Inverted Repeat transposons, ‘DH’ refers to the order Helitron, ‘RL’ belong to the order Long Terminal Repeat retrotransposons, and ‘RI’ and ‘RS’ are nonLTR retrotransposons (LINEs and SINEs). Superfamily names beginning with ‘D’ are Class II DNA transposons, while those starting with ‘R’ are Class I retrotransposons.

Fig 2. Characteristics of each superfamily of TE.

Superfamilies are classified into orders and classes, as shown at the bottom of the plot. (A-D) Family characteristics of each of the most numerous 10 families (with ≥ 10 copies) of each superfamily. Family names are listed in S1 Table. (A) TE length, (B) Distance to the closest gene, (C) proportion of TE copies found within another TE, and (D) TE age. In (A, B, & D) family medians are shown as points, with lines representing upper to lower quartiles. Superfamilies are shown as colored rectangles, where the dotted line reflects the median and box boundaries reflect lower and upper quartiles. In (C), families are shown as points and superfamily proportions as a barplot.

Even at the broad taxonomic level of order, there are considerable differences among TEs. Because of their size (median length 8.4 kb; Fig 2A and S1(C) Fig), LTR retrotransposons contribute more total base pairs to the genome (1,363 Mb; Fig 1B) and are commonly disrupted by another TE copy (2/3 disrupted; S1(D) Fig). LTR retrotransposons are also typically far from genes (median distance 16.4 kb; Fig 2B, only 3.5% within a gene transcript; S1(A) Fig, median distance to a syntenic gene 31.9 kb; S1(B) Fig) and 1/2 of copies insert into a preexisting TE copy (Fig 2C). The median time since insertion of LTR retrotransposons is 315,000 years (Fig 2D). In contrast, despite having more copies (Table 1), TIR elements contribute fewer base pairs to the genome (74.1 Mb) and are rarely disrupted by the insertion of another TE copy (< 5% disrupted) (S1(D) Fig), presumably due to their much smaller size (median length 306 bp; Fig 2A and S1(C) Fig). TIR elements as a group are slightly further from genes (median distance 17.2 kb; Fig 2B, 1.7% within a gene transcript; S1(A) Fig, median distance to a syntenic gene 29.0 kb; S1(B) Fig), and commonly insert into preexisting TE copies (≈ 70% of copies; Fig 2C). They represent the most recent insertions, with a median age of 185,000 years (Fig 2D). Although Helitron elements are fewer in number than TIR elements, they contribute more base pairs to the genome (93.8 Mb) and are more commonly disrupted by the insertion of another TE (1/4 of copies; S1(D) Fig) due to their increased length (median length 2.4 kb). Helitrons are also closer to genes than TIR elements (median distance 10.4 kb; Fig 2B, with 22.9% overlapping a gene transcript; S1(A) Fig, median distance 25.4 kb from a syntenic gene; S1(B) Fig), and less frequently insert into a preexisting copy (50% of copies are found within another TE; Fig 2C). Helitrons are represented by relatively old copies, with a median age of 500,000 years (Fig 2D).
NonLTR retrotransposons (LINEs and SINEs) contribute only 2.9 Mb, are relatively short (median length 548 bp), and only 5% of copies are disrupted by the insertion of another TE (S1(D) Fig). LINEs and SINEs are however often close to genes (median distance 2.3 kb; Fig 2B, 18.6% in a gene transcript; S1(A) Fig, median distance to a syntenic gene 10.1 kb; S1(B) Fig) and only 37% insert into another TE copy (Fig 2C). These nonLTR retroelements have a median age of 350,000 years (Fig 2D).

Within these orders, variation also exists among superfamilies (Fig 2). For example, TE superfamilies are found nonuniformly along chromosomes (Fig 3 and S2 Fig): while some superfamilies like RLG (Ty3/Gypsy) and DTC (CACTA) are enriched in centromeric and pericentromeric regions, others, like RLC (Ty1/Copia) and DTA (hAT) are found more commonly on chromosome arms. As maize genes are enriched on chromosome arms, this distribution is reflected in the distance each superfamily is found from genes (Fig 2B). Similarly, while most TIR superfamilies are found far from genes (median 17.2 kb), DTM (Mutator) elements are only a median distance of 2.4 kb away from genes (Fig 2B). And although TIR elements are often short (median 311 bp), DTC elements have a median length of 2886 base pairs (Fig 2A).

Fig 3. Chromosomal distribution of superfamilies and example families.

Counts of number of insertions in 1 Mb bins across chromosome 1 for (A) TE superfamilies and (B-E) the 5 families with highest copy number in each of four superfamilies, DHH (B), DTT (C), RLC (D), and RLG (E). Family names are listed in S1 Table.

Features of TE families

These descriptive statistics measured at the order and superfamily level are an aggregate across many TE families. TE families are defined based on sequence homology between copies [48], using an 80% sequence similarity cutoff described in Wicker et al. (2007) [15]. This results in thousands of families of LTR retrotransposon and Helitron elements, and hundreds of families of DNA TIR elements (Table 1). Although the majority of all TE families have fewer than ten copies (Fig 1A), the largest LTR retrotransposon and Helitron families in the genome consist of thousands of copies. Consistent with previous analyses built on subsets of maize bacterial artificial chromosomes (BACs) [26, 49], a majority (75%) of maize LTR retrotransposon families are present only as a single copy in the B73 genome. The average LTR family contains 6.1 copies, with this distribution ranging from 1 to 16,289 copies. In contrast, the family size distribution of TIR transposons is more uniform, with the average family containing 142 elements (range 1 to 9,953) and only 10% of families represented by a single copy. Helitron families are smaller, with 14 copies on average (66% represented by a single copy), and nonLTR retrotransposon families have an average of 3 copies (77% consist of a single copy).
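
To make the family definition concrete, the sketch below groups copies whose pairwise similarity meets an 80% threshold and then summarizes family sizes (mean copies per family, fraction of single-copy families). It is a simplified illustration under stated assumptions, not the procedure used for this annotation: the ungapped identity function is a crude stand-in for a real alignment, single-linkage grouping is our choice for the example, and all function names are hypothetical.

```python
from itertools import combinations

# Illustrative sketch: a crude ungapped identity stands in for a real
# pairwise alignment, and single-linkage grouping is an assumption.


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return matches / max(len(a), len(b))


def cluster_families(copies, threshold=0.8, identity=ungapped_identity):
    """Group copy names into families by single linkage at the threshold."""
    parent = {name: name for name in copies}

    def find(name):
        # Follow parent pointers to the family representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(copies, 2):
        if identity(copies[a], copies[b]) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for name in copies:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())


def family_size_summary(families):
    """Number of families, mean copies per family, single-copy fraction."""
    sizes = [len(members) for members in families]
    return {
        "families": len(sizes),
        "mean_copies": sum(sizes) / len(sizes),
        "single_copy_fraction": sum(s == 1 for s in sizes) / len(sizes),
    }


copies = {
    "te1": "ACGTACGTAC",
    "te2": "ACGTACGTAA",
    "te3": "ACGTACGAAA",
    "te4": "TTTTTTTTTT",
}
families = cluster_families(copies)
print(families, family_size_summary(families))
```

Because single linkage joins any two copies connected by a chain of similar sequences, a family defined this way can contain pairs that fall below the threshold when compared directly.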

Families are also found nonuniformly along chromosomes (Fig 3B, 3C, 3D and 3E and S3 Fig). Sometimes, the distribution of copies in the largest families in a superfamily matches the pattern seen when summarized across all members of a superfamily, such as the five largest RLC families which all share an enrichment on chromosome arms (Fig 3D). There are also families that differ from the aggregate superfamily distribution. For example, the second largest RLG family (RLG00003) is enriched on chromosome arms, and the third largest RLG family (RLG00005) is more uniformly distributed along the chromosome (Fig 3E).

The ages of different TE families also vary greatly (Fig 4 and S4 Fig). We determine ages of individual copies based on terminal branch lengths of TE phylogenetic trees. For LTR retrotransposons, we additionally measure divergence between the two LTRs of each insertion (see Methods). Some families have not had a new insertion in the last 100,000 years, while others have expanded rapidly in that time frame (Fig 4B, 4C, 4D and 4E). Some families display cyclical dynamics, readily generating new insertions that are retained, with periods of stasis in between (e.g. DTA00073, Fig 4C). Others show sustained activity in the past (e.g. DHH00004, Fig 4B). In total, 70% of TIR families, 20% of LTR families (estimated with LTR-LTR divergence), 15% of nonLTR families and only 7% of Helitron families have been active in the last 100,000 years.
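
The logic of dating an LTR retrotransposon from LTR-LTR divergence can be sketched as follows: the two LTRs are identical at insertion, so the divergence K between them accumulates along both copies and the age is approximately K / (2μ). The code below is a minimal illustration and not the method detailed in the Methods. It assumes the two LTRs are already aligned to equal length, applies a Jukes-Cantor correction (our choice for the example), and uses a placeholder substitution rate of 3.3e-8 per site per year; the function names and the rate are assumptions.

```python
import math

# Illustrative sketch: the substitution rate and the Jukes-Cantor
# correction are assumptions, not values confirmed by this section.
ASSUMED_RATE = 3.3e-8  # substitutions per site per year (placeholder)


def p_distance(ltr_5prime: str, ltr_3prime: str) -> float:
    """Proportion of differing sites between two pre-aligned LTRs,
    skipping columns with gaps or ambiguous bases."""
    if len(ltr_5prime) != len(ltr_3prime):
        raise ValueError("LTRs must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(ltr_5prime.upper(), ltr_3prime.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        mismatches += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("p-distance too large for Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age_years(ltr_5prime, ltr_3prime, rate=ASSUMED_RATE):
    """Age = K / (2 * rate), since both LTRs accumulate substitutions."""
    k = jc69_distance(p_distance(ltr_5prime, ltr_3prime))
    return k / (2 * rate)


# Toy example: 100 bp LTRs differing at 2 sites.
ltr_a = "ACGT" * 25
ltr_b = "T" + ltr_a[1:50] + "A" + ltr_a[51:]
print(p_distance(ltr_a, ltr_b), round(insertion_age_years(ltr_a, ltr_b)))
```

Under these assumptions, an age cutoff such as 100,000 years corresponds to a fixed divergence threshold, so a family can be scored as recently active by asking whether any of its copies falls below it.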

Fig 4. Age distribution of (A) superfamilies and (B-E) five largest families of (B) DHH, (C) DTA, (D) RLC, and (E) RLG.

Family names are listed in S1 Table. Counts of number of insertions in 10,000 year bins are shown. As they are rare, TE copies older than 1.1 million years are not shown. Ages are calculated with terminal branch lengths for all TEs except LTR retrotransposons, which are calculated with LTR-LTR divergence. See S5 Fig for LTR retrotransposon plots with terminal branch length ages.

Features of the transposition process

Here, we address features that restrict and allow movement of TE copies, as well as influence their survival after insertion.

TE proteins

Numerous sequence features of the TE itself are required for the complex transposition process to occur, and these are best understood at the level of TE family. One requirement is the presence of TE encoded proteins that catalyze movement. Functional characterization of TE protein coding capacity is complicated by difficulty in identifying the effect of stop codons or nonsynonymous changes on transposition. Instead we measure homology to TE proteins (see Methods for details), although we recognize this does not fully reflect whether a TE copy can produce a transpositionally-competent protein product. Although TE-encoded proteins are often of similar length within a TE superfamily due to domain conservation and shared ancestry, the longest ORF in a TE varies by family (Fig 5A). Sometimes this is due to the presence of nonautonomous or noncoding copies. While nonautonomous copies rely on protein production in trans by other family members, autonomous TE copies encode their own transposition machinery in cis. The majority (52%) of LTR families have at least one family member that retains some remnant of coding capacity for all the TE proteins necessary for transposition. In contrast, only 0.6% of TIR families, 0.3% of Helitron families, and 0.2% of nonLTR families have at least one family member with protein coding sequence that matches known TE proteins. For all TEs, coding capacity varies substantially between families (Fig 5B). Several LTR retrotransposon families have a small proportion of potentially autonomous copies (S12(C) Fig), and in yet other families coding potential for required proteins is split between different TE copies (e.g. RLG00001, where 1.6% of copies code only for GAG and 13.5% of copies code only for POL, although both proteins are required for retrotransposition; S12(A), S12(B) and S12(C) Fig).
Also, families range from having almost exclusively potentially autonomous coding copies (≥ 75% of copies in 14 families of DTC, RLC, and RLG, S3 Table), to having exclusively nonautonomous noncoding copies (842 families, spanning all 13 superfamilies; S4 Table).

Fig 5. TEs code for proteins that are expressed, and expression varies by family across tissues.

In A-D, families are in the same order as presented in Fig 2, and listed in S1 Table. (A) Length of longest open reading frame within the TE, measured in amino acids. (B) Proportion of family with all proteins required for transposition. (C) log10 median TE expression across tissues, per-TE copy. (D) Tissue specificity of TE expression τ, with low values representing constitutive expression, and high values representing tissue specificity. (E) Per copy TE expression across tissues (RPM, reads per million), clustered by expression level. Families with greater than 10 copies are shown in rows, and tissues in columns.

Coding capacity for TE proteins likely dictates the ability to generate new insertions, and as such is associated with TE age and the timing of activity of the family. Averaged across all orders, TEs that code for proteins are younger than those that do not code for proteins (median age of 198 kya vs. 263 kya; significant effect of protein coding in Wilcoxon rank sum test, p < 2e-16). Further, noncoding copies from families that lack a coding member in B73 show an elevated median age (266 kya) when compared to noncoding copies in families with coding members (174 kya) (all pairwise Wilcoxon rank sum tests show a significant effect of coding status, all p < 2.1e-11). For most superfamilies, coding members are older than noncoding copies from families with coding members (S7 Fig).

TE expression

Beyond simply coding for TE proteins, another requirement for TE transposition and transgenerational inheritance is expression of the TE itself such that the TE-encoded protein can be generated. Mapping of RNA-seq reads to repetitive TE families is a challenge, as it can be impossible to identify the exact copy that is expressed when a read maps equally well to multiple TE copies [50]. We choose to summarize multiply mapping reads and TE expression at the level of per-copy RPM of the family. This likely averages over relevant variation in expression known to exist between copies within maize TE families [51, 52], but reflects patterns observed at the level of the family. Measured in this way, large families are generally transcriptionally repressed (Fig 5C), while small families show higher median per-copy expression levels. Most families are not expressed in any tissues surveyed (Fig 5E). While superfamily medians and the median expression per copy of the ten largest families per superfamily are below 0.1 RPM per copy (Fig 5C), per-copy rates of expression can be higher for small families. For example, the 19 copies of RLC00184 (also known as stonor) show high median expression of 4.33 RPM per copy.

Tissue specificity can reflect different strategies for TE survival; for example, a TE must transpose in germline tissue to ensure its transgenerational inheritance at a new locus. Tissue specificity, measured as τ (see Methods), ranges from 0, when a family is constitutively expressed at identical levels across all tissues, to a maximum of 1, indicating the highest tissue specificity. Helitrons and most LTR retrotransposon superfamilies (RLC and RLG) show lower median τ than TIR and nonLTR retrotransposon superfamilies (all pairwise Wilcoxon rank sum tests significant (p < 4.2e-12) except the TIR and nonLTR comparison; Fig 5D). Tissue specificity can be extreme, with some families showing expression in only one tissue (Fig 5D). For example, DTH00434 shows maximal per copy expression in mature pollen (4.3 RPM), with highly tissue specific expression (τ = 0.998).
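
As an illustration of how such an index behaves, the sketch below computes a commonly used form of τ, the sum over tissues of (1 - x_i / x_max) divided by (n - 1). We assume this form here because it matches the stated limits (0 for identical expression everywhere, 1 for expression in a single tissue); the exact definition, including any transformation of expression values, is given in the Methods and may differ. The function name is hypothetical.

```python
# Illustrative sketch: assumes the commonly used tau index,
# tau = sum(1 - x_i / x_max) / (n - 1); any log transformation or
# filtering applied in the actual analysis is not reproduced here.


def tau(expression):
    """Tissue specificity of one family from per-tissue expression values.

    Returns None when the family is not expressed in any tissue,
    where the index is undefined.
    """
    values = [max(0.0, float(x)) for x in expression]
    n = len(values)
    if n < 2:
        raise ValueError("tau requires at least two tissues")
    x_max = max(values)
    if x_max == 0:
        return None
    return sum(1 - x / x_max for x in values) / (n - 1)


# Constitutive expression, expression in a single tissue, and a mixed case.
print(tau([5, 5, 5, 5]), tau([0, 0, 4.3, 0]), tau([1, 2, 4, 8]))
```

Families with no detectable expression are returned as undefined rather than assigned a value, so they would need to be excluded before comparing medians across superfamilies.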

TE regulation

TE expression is likely limited by regulation of the TE by the host genome, which we measure via DNA methylation and MNase hypersensitivity in the TE and regions surrounding it. TEs on average are heavily regulated by their host genome: average cytosine methylation across TEs is high (averaged across five tissues, 82% of cytosines in a CG context in a TE are methylated, 67% in a CHG context, and 4% in a CHH context), although this varies across superfamilies (S5 Table) and families (Fig 6A, 6B and 6C). Only a small fraction of base pairs within TEs is in chromatin accessible to MNase, averaging 0.2% in shoot tissue, and 0.08% in root tissue (S9(K) and S9(M) Fig), both lower than genome-wide proportions (0.5% in shoot, 0.2% in root; significantly different from genome-wide values, one-sample Wilcoxon signed rank test p < 2.2e-16 for both shoot and root). Despite this overall pattern of regulation, the host genome restricts some families of TEs differently. For example, the median CG methylation of the family DTM00796 is only 52% in anther tissue (Fig 6A), while most other families show higher methylation. There is even more extreme variation in CHG methylation across TE families (Fig 6B), and although many TE families show low CHH methylation across the body of the TE, some families of DNA transposons show relatively high CHH methylation (Fig 6C). Although the numbers presented here are for anther tissue, these patterns are robust across tissues (S8 Fig).

Fig 6. TEs and their flanking sequences are regulated by their host genome.

Families are presented in the same order as in Fig 2, and listed in S1 Table. CG methylation in TE (A), CHG methylation in TE (B), and CHH methylation in TE (C). CG methylation in 2 kb flanking the TE (D), CHG methylation in 2 kb flanking the TE (E), and CHH methylation in 2 kb flanking the TE (F). All methylation data are from anther tissue; other tissues are shown in S8 Fig. In (A–C), superfamily median is shown as a dashed line with the interquartile range in the shaded box. In (D–E), median methylation for regions up to 2 kb up and downstream of the TE are plotted for each family, with family size denoted by line transparency (darker lines are larger families).

Some TEs can preferentially insert into particular genomic locations, often based on local chromatin state [13, 14]. Others can modify the methylation patterns in flanking regions after insertion [32, 33]. Genome-wide, methylation levels in the region surrounding a TE insertion reflect these processes, and are variable both across families and DNA methylation contexts. Of the 1,243 TE families with ten or more copies, median methylation levels averaged across all tissues are elevated within the TE compared to 500 bp away for 734 TE families for CG methylation, 957 for CHG methylation, and 1,086 families for CHH methylation. This pattern can be visualized as the decay of methylation moving away from the TE (Fig 6D, 6E and 6F and S8 Fig). The magnitude of reduction in local CG and CHG methylation moving away from the TE differs in extent and pattern, from families where methylation is reduced immediately adjacent to the TE to others with minimal reductions even 2 kb away from the TE (Fig 6D and 6E). In contrast, most families show rapid reductions in CHH methylation within 100 bp of the edge of the TE (Fig 6F).

TE base composition

Observed DNA methylation levels may be impacted by the base composition of the TE, as cytosines must be present to be methylated. TE families differ in GC content (S9(A) Fig), with extremes ranging from 21% (DTT13542) to 84% (DTH14236) median GC content. This appears to be a consequence of bases carried by the TE itself and not of regional mutation pressure, as variation in GC content in the TE is greater than that of the flanking sequence (S9(B) Fig). For example, GC content in the 1 kb flanking DTH14236 is over 30 percentage points lower than that in the TE (52% GC in the flanking region). Beyond the proportion of cytosines in the sequence, the context in which these cytosines are found can impact whether and how they are methylated. For example, 51 families have a median of 0 cytosines that can be methylated in either the CG or CHG context (S6 Table). And even with similar GC content, families differ in the contexts in which they have those cytosines, as families can have moderate GC proportions, but high proportions of these in a CG context (e.g. DTM00473; S9(A) and S9(C) Fig). This is reflected in increased TG content, potentially a consequence of deamination of methylated cytosines (S9(I) and S10 Figs). Notably, TEs with high amounts of methylatable cytosines within the TE do not always share high methylatable cytosine proportions for the region flanking the TE (S9(C), S9(D), S9(E), S9(F), S9(G) and S9(H) Fig).
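
The distinction between GC content and methylatable cytosine context can be made concrete with a short sketch that classifies every cytosine on both strands as CG, CHG, or CHH (where H is A, C, or T). This is an illustration rather than the pipeline used for the supplementary figures; the function names are hypothetical, and cytosines too close to the sequence end to be classified are simply skipped.

```python
# Illustrative sketch: function names are hypothetical. Cytosines whose
# context runs off the end of the sequence, or includes N, are skipped.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
_H = {"A", "C", "T"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def cytosine_contexts(seq: str) -> dict:
    """Count cytosines in CG, CHG and CHH contexts on both strands."""
    counts = {"CG": 0, "CHG": 0, "CHH": 0}
    for strand in (seq.upper(), reverse_complement(seq)):
        for i, base in enumerate(strand):
            if base != "C":
                continue
            first = strand[i + 1] if i + 1 < len(strand) else ""
            second = strand[i + 2] if i + 2 < len(strand) else ""
            if first == "G":
                counts["CG"] += 1
            elif first in _H and second == "G":
                counts["CHG"] += 1
            elif first in _H and second in _H:
                counts["CHH"] += 1
    return counts


seq = "CGCAGCTT"
print(gc_content(seq), cytosine_contexts(seq))
```

Two sequences with the same GC content can therefore present very different numbers of CG, CHG, and CHH sites, which is the comparison drawn between families above.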

Although difficulty in mapping short reads to a highly repetitive genome precludes a comprehensive analysis of population frequencies of TEs across maize individuals, we use the proportion of segregating sites within TEs as a proxy for copy number. We measure segregating sites in a panel that includes 1,218 maize and teosinte individuals [53]. While as a whole TEs have fewer segregating sites per base pair (median 0.022) than the genome-wide proportion (0.039) (one-sided Wilcoxon signed rank test, p < 2.2e-16; S9(O) Fig), some TE families show high numbers of segregating sites (e.g. DTH10060, 0.177 segregating sites per bp). In contrast to the sequence carried by the TE, variation in the region the TE is inserted into is considerably closer to genome-wide averages than that of the TE itself (median 0.034 segregating sites per bp; S9(P) Fig), but still significantly different (one-sided Wilcoxon signed rank test, p < 2.2e-16).

Features structuring TE survival after insertion

The recombinational environment that a TE exists in can impact the efficacy of natural selection on the TE, as higher recombination can unlink deleterious variation from adaptive mutations [54], leading to a positive relationship between recombination and diversity. While LTR retrotransposons are more commonly found in low recombination regions (median 0.30 cM/Mb), Helitrons and TIR elements are more commonly found in higher recombination regions (both show a median 0.43 cM/Mb), and nonLTR retrotransposons are found in the highest recombination regions (median 0.57 cM/Mb) (significant effect of TE order on recombination rate, all pairwise Wilcoxon rank sum tests are significant at p < 1.4e-06 except the Helitron-TIR comparison). This varies by family: the two largest families of DTT differ in median recombination regions from 0.14 cM/Mb to 0.53 cM/Mb (S11(A) Fig).

Additionally, selection can act on TEs if they have an impact on the expression of genes they land near. Although it is impossible to determine whether a TE insertion causes changes in nearby gene expression using only the B73 genome, we observe differences among superfamilies and families of TEs in the expression levels of the closest gene. Across tissues, genes near TIR and nonLTR elements have higher median expression (1.37 RPKM for TIR and 1.83 RPKM for nonLTR) than genes near LTR (1.04 RPKM) and Helitrons (0 RPKM) (S11(C) Fig) (all pairwise Wilcoxon rank sum tests significant, p < 0.023). Notably, this pattern intensifies for genes within 1 kb of the TE, where median gene expression is over 4 RPKM for genes near TIR and nonLTR elements, but 0 RPKM for genes close to LTR and Helitron elements (S11(D) Fig). Much of this signal is driven by non-syntenic genes: average expression is much higher for the closest syntenic genes (≈12 RPKM) but shows no significant difference amongst orders (Kruskal-Wallis rank sum test, p-value = 0.2353) (S11(F), S11(G) and S11(H) Fig). Some families are often found near highly expressed genes (e.g. DTA00133, median expression 22.38 RPKM), while median expression of the closest gene for 1/3 of families is close to zero. However, when genes near TEs are expressed, their expression is much more constitutive than that of TE families (S11(E) Fig and Fig 5E), with mean τ values of 0.75 for genes near TEs and 0.93 for TE families themselves. Tissue specificity varies by family and superfamily as well, and there is a weak correlation between tissue specificity of expression of TE families and expression of the genes they are closest to (Pearson’s correlation 0.067, p = 4e-12).

The maize genome arose from an allopolyploidy event [55], and has been sorted into two extant subgenomes [56]. Subgenome A has retained more genes and base pairs than subgenome B, accounting for 64.8% of sequence [48], and 64% of all TEs (S11(B) Fig). Additionally, the median age of TEs in subgenome B is lower (0.24 Mya) than that of TEs in subgenome A (0.26 Mya) (Wilcoxon signed-rank test shows significant effect of subgenome; p < 2.2e-16). These differences are likely due to the effect of ongoing transposition erasing any signature of TE differences between parents of the allopolyploidy event, as genome-wide the family with the oldest median age (DTH16531) is only 8.5 million years old.

Modeling survival of TEs

To account for the myriad differences of these 341,426 TE copies in 27,444 families, we approach our understanding of the survival of TEs in the genome by modeling age as a response to the TE-level features and the genomic regions in which TEs exist today. Age reflects survival of TEs, measuring the amount of time since transposition that they have persisted at a genomic position without being lost to selection or drift. Hence, we measure the predictive ability of features of the TE itself and the genomic region it inserted into on TE survival as measured by age.

Random forest regressions using age as a response variable and features that are measured at the level of the individual TE explain moderate amounts of variance (27.7%), and show low mean squared error (0.014). Across all TEs, information on the superfamily a TE belongs to contributes the most to prediction accuracy for age; after permuting their values, the square root of mean squared error (RMSE) increases by 162 kya (Fig 7B). Other features that increase RMSE by over 100 kya include the size of the family the TE comes from, the length of the TE (both in total bp and when including bases coming from copies nested within it), the TE family, the number of segregating sites per bp within the TE, and the median expression of the TE family across all sampled tissues. In aggregate, features of the region flanking each TE explain approximately as much variation in age as features of the TE itself, but there are more flanking features than those measured on the TE. On average, each feature of the TE contributes over 4 times more predictive power than that of a flanking feature (square root mean squared error of 39 kya for a TE feature, 8 kya for a flanking feature) (Fig 7A and 7B).
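The permutation-based importance described above can be illustrated with a minimal sketch. It is not the random forest implementation used for the analysis: the model here is a hypothetical stand-in function (`toy_predict`), the feature names `te_length` and `flank_gc` are invented for illustration, and the data are synthetic. The idea it shows is general: shuffle one feature, re-predict, and record how much RMSE increases.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y, feature, n_repeats=20, seed=0):
    """Mean increase in RMSE after shuffling the values of one feature."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the permuted feature.
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        increases.append(rmse(y, [predict(r) for r in permuted]) - baseline)
    return sum(increases) / n_repeats


# Hypothetical stand-in for a fitted model: age depends on te_length only.
def toy_predict(row):
    return 0.001 * row["te_length"]


# Synthetic TE copies (assumed values, for illustration only).
rows = [{"te_length": 100 * i, "flank_gc": ((i * 7) % 10) / 10} for i in range(1, 21)]
ages = [toy_predict(r) for r in rows]

imp_length = permutation_importance(toy_predict, rows, ages, "te_length")
imp_flank = permutation_importance(toy_predict, rows, ages, "flank_gc")
```

In this toy case shuffling `flank_gc` leaves predictions unchanged, so its importance is zero, while shuffling `te_length` degrades them. With a real fitted model the same loop would be run once per feature.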

Fig 7. Features ranked by importance.

Fig 7

(A) Reduction in mean squared error gained by including a feature in a model, summarized into categories. (B) Reduction in mean squared error for top 30 individual features. Colors match categories in (A). (C) Correlations of each of the top 30 features with age for the five largest families in each superfamily. Features are labeled as in (B). Size of point is scaled by correlation coefficient, and color by whether the relationship is positive (blue) or negative (red). Rows without values are features that are fixed within a family, thus have no variance. (D) Raw correlations between age and segregating sites per base pair. (E) Model predictions for the relationship between age and segregating sites per base pair. (F) Raw correlations between age and anther CHH methylation of the TE. (G) Model predictions for the relationship between age and anther CHH methylation of the TE.

These generalities reflect underlying nonlinearities in the relationships between individual features and age, which are often family-specific. Indeed, correlations of these top features with age differ not only in magnitude, but even in sign between individual families (Fig 7C). To provide additional insight into the local behavior of the relationship between a feature of interest and age, we use the fitted random forest models to predict age for TE copies as we vary the feature. For example, the number of segregating sites in the TE is positively correlated with age in the raw data (Fig 7D), and is confirmed via this permutation approach (Fig 7E). Yet despite this overall pattern, individual families vary in sign (Fig 7C) and slope (Fig 7E) of the relationship. Other features, like CHH methylation of the TE in anther tissue, show relationships that vary by superfamily, where RIT and DHH appear older with increasing CHH methylation of the TE in the anther, while other superfamilies show decreasing age (Fig 7G), a pattern less apparent in the raw correlations (Fig 7F). Across all features, there are largely family-specific combinations of both the direction and strength of correlation with age (Fig 7C). In total, while genomic and TE features contribute to prediction of age, interactions among these features make it difficult to predict the survival of any single family.

Discussion

General patterns

As 85% of the maize genome is repetitive sequence [26, 49], and 63% structurally recognizable TE sequence [48], TEs contribute more to the maize genome than sequence that is uniquely ‘maize.’ Like most plant genomes [11], retrotransposons contribute more base pairs to the maize genome than do DNA transposons (Table 1 and Fig 2B). This is a consequence of the high number of copies (Fig 2A) and the large size of individual retrotransposons (Fig 2C), likely due to a ‘copy and paste’ replication mode that leaves existing copies intact when generating new copies. Also like other plant genomes [57, 58], several superfamilies of DNA transposon in the maize genome are found closer to genes than are retrotransposons (Fig 2B). This is likely due to targeted insertion into euchromatic sequences [59, 60], and differences in removal through natural selection after insertion [61, 62].

TE superfamilies

The bulk of TE sequence is often described at a finer scale, that of individual superfamilies of TEs. Each TE superfamily defined in the maize genome has representatives across the tree of life [63–65], suggesting an ancient origin of these genomic parasites. Some superfamilies have retained dramatic and consistent differences in their spatial patterning across chromosomes over hundreds of millions of years. For example, the superfamily RLG is enriched near centromeres in all plants [66–68] including maize (Fig 3A), highlighting a genomic niche that allows long-term survival near the centromere. Similar patterns exist at deep time scales for DNA transposon superfamilies, which preferentially insert near genes in both monocots and dicots [57, 58, 69, 70] and in maize are enriched on chromosome arms where genes are concentrated (Fig 3A).

These patterns likely reflect the evolution of different ecological strategies of TEs in the genome. Kidwell and Lisch (1997) [7] described two extremes to the ‘ecology of the genome’—one, a TE that preferentially inserts far from genes, into low recombination heterochromatic regions, and a second, risky TE that inserts near low copy sequences, more likely to disrupt gene function. We observe these extremes at play in the maize genome, in that LTR retrotransposons dominate the heterochromatic space, with over half of all copies greater than 16 kb from a gene (Fig 2B), and most copies heavily methylated (Fig 6A, 6B and 6C). The alternate strategy also exists in the maize genome, with risky insertions near genes and transcribed regions seen for several TIR superfamilies. For example, over half of Mutator transposons (DTM) are found within 1 kb of a gene (and over one quarter of DTM within 100 bp of a gene) (Fig 2B). This likely results from the preferential insertion of DTM elements upstream of genes [30, 60, 71, 72]. We note that we find TIR copies are found further from genes (17.2 kb) than previously reported for grass genomes [49, 58]. We believe this may be due to previous analyses based on preferential assembly and identification of genic TEs—indeed subsetting to the 893 TIR families found in the Maize TE Database [49] results in a much reduced 1.6 kb median distance to genes. On a genome-wide scale at the level of all TEs, the spatial patterns we observe could result from either preferential insertion or differential removal by selection after insertion. Further characterization of these ecological strategies will be facilitated by investigating TE polymorphism across maize individuals [33, 73] and de novo recent insertions that selection has not yet acted on [74].

TE families

While superfamily level observations are useful for gaining an overview of the distribution and survival of TEs in a genome, more detailed study on a time scale relevant to the evolution of the genus Zea comes from studying TE families. Maize TE families are shared with closely related host species, but the number of shared families rapidly decreases with phylogenetic distance. Many families are shared with congeners Zea diploperennis [75–77] and Zea luxurians [78], but few families investigated are found in maize’s sister genus Tripsacum (1 mya divergence; [79]) [75–77, 80, 81], and the only families shared between maize and Sorghum (12 mya; [55]) are shared only as a result of horizontal transfer events between the species [82]. This suggests that in order to understand TE evolution at a timescale relevant to maize as a species, it is essential to understand families of TEs, rather than the aggregate properties of superfamilies or orders.

Indeed, our family-level analysis also reveals patterns obscured when TEs are averaged together at the level of superfamily. For example, despite the fact that the RLG superfamily is enriched in centromeric and pericentromeric domains (Fig 3A), the second largest family RLG00003 (homologous to the RLG family huck [83]) is predominantly found on chromosome arms (Fig 3D). While many RLG elements contain a chromodomain targeting domain in their polyprotein [84] allowing targeted insertion to centromeres, RLG00003 does not (S12(G) Fig). This lack of a chromodomain may explain a proximal cause of the observed niche of RLG00003, although other factors are certainly at play, as other families with centromeric enrichment also lack chromodomains (S12(G) Fig). DNA transposons are also best described at the family level. While Mutator (DTM) elements are found a median distance of 2.5 kb from genes (Fig 2B) and have long been observed to target insertions near genes in maize [30, 60, 72], the second largest family, DTM13640, is found a median distance of 34 kb away from genes (Fig 2B). The mechanism for gene targeting seems to be mediated through recognition of open chromatin [60, 71], but precise details of the targeting are unknown. Further investigation into the families that insert near and far from genes may pinpoint how their molecular mechanisms of targeting may differ.

Furthermore, the timing of transpositional activity varies extensively between families. Most TE families in maize have had most new insertions in the last 1 million years (Fig 4). Some TE families have bursts of activity, punctuated by a lack of surviving new insertions, while others appear to be headed towards extinction. All of these timings are much more recent than allopolyploidy in maize (≈ 12 mya) [55] and families show little subgenome bias in their distribution (S11(B) Fig), suggesting that these represent lineages evolving within maize.

Maize was domesticated from teosinte (Zea mays subsp. parviglumis) 9,000 years ago [85, 86]. It is tempting to address the contribution of TEs to this major transition, especially given the contribution of TE insertions to maize domestication and improvement [87–89]. Although we caution that mutation rates and estimation can complicate ascertainment (see below), 46,949 TEs across all 13 superfamilies have an estimated age of less than 9,000 years, and 24,630 TEs have an estimated age of 0. This suggests that transposition has been ongoing since the divergence of maize from its wild ancestor, but we caution that we lack appropriate confidence intervals for these estimates, especially as non-zero age requires observing at least one nucleotide mutation.

The family-level ecology of the genome

It can be difficult to predict exactly why a particular TE family differs from other families. Community ecologists aim to understand the environmental factors that give rise to the observed diversity of organisms living in one place, including not just features of the environment but also interactions between species. TE families are analogous to species in the genomic ecosystem, and because the genomic environment a TE experiences is constrained to the cell, TEs are forced to interact in both time (Fig 4) and space (Fig 3). We predict each family of TE is adapted to its genomic ecological niche, where the genomic features we measure represent the environmental conditions and resources limiting a species’ ecological niche [90]. TEs additionally can act as ecosystem engineers, modifying the environment they insert into, and generating new habitat for future colonization [10, 44].

In the genomic ecosystem, we can observe interactions between species much like we would see in a traditional ecosystem. We see a number of patterns, including cyclical dynamics of TE activity through time for several families, sustained activity through time, and a reduction in new copies towards the present (Fig 4 and S4 Fig). This means that the genomic environment a newly inserted TE experiences is affected by the activity and abundance of all other TE families in the genome. At one extreme, members of the same family can even encode different proteins required for retrotransposition in different TE copies, where both types are required to be transcribed and translated for either to transpose. Such a system approaches a mutualism, where the success of one type depends on another. Previous knowledge of these systems was limited to the maize retrotransposon families Cinful, which codes for polyprotein domains, and Zeon, which codes for GAG [25] (represented here by a single family, RLG00001). This strategy has been successful in maize, and RLG00001 [48, 77] for example makes up 135 Mb of sequence. Sorghum, in contrast, has a genome 1/3 the size of maize [91] and lacks homologs to RLG00001. Such symbiotic relationships within a TE family have been thought of as remarkably rare [92, 93]; however, we identify 25 LTR retrotransposon families where GAG and POL protein domains are found in separate TE copies but less than 1% of copies contain both, suggesting that this pattern is much more prevalent than previously described. These types of elements are best classified as subtypes of a single family, because the cis components of the LTR are recognized by protein domains of both GAG and POL proteins, leading to homogenization of sequence signals. As noted by Le Rouzic et al. (2007) [92], symbiotic TE families face a major barrier in being horizontally transferred, as both copies must be transmitted through an already rare process. Their prevalence in the maize genome thus supports instead a long term coevolution of the maize genome and the TEs that live within it, specializing and diversifying with different ecological strategies.

Unlike most contemporary ecological communities, which are censused when a researcher surveys them, the genomic ecosystem carries a record of past transposition. We can investigate this past ‘fossil’ record using the age of individual TE copies. This allows a robust analysis of the features that define TE survival across time. The TEs we see today are a readout of the joint processes of new transposition—which may not be uniform through time—and removal through selection, deletion, and drift [62]. Survival of a TE can be measured by its age or time since insertion, as our observation of a TE is conditioned on the fact it has not been removed by either neutral processes or selection. Changes in the TE community over time give rise to evolution.

Although relative age differences between TE insertions are limited only by our ability to count mutations, absolute age estimates can be shifted by mutation rate estimates. We use a maize-specific mutation rate [94], which leads to a five times younger estimated age of maize LTR retrotransposons than the 3–6 million years originally estimated by SanMiguel et al. (1998) [95]. Additionally, as nucleotide mutation rates in TEs may be higher than other parts of the genome (≈ 2 fold higher in TEs in Arabidopsis thaliana, [96]), we consider our age estimates to represent an upper bound of TE age. Nonetheless, age represents a comparable metric of survival in the genome, especially when summarized across multiple copies and families. Our random forest model predicting age of TEs thus relates the action of transposition to the processes that occur afterwards on an evolutionary time scale. The model shows that TE superfamily and family size provide best predictive power for age (Fig 7B).

Another TE feature with high predictive power for age and survival in the genome is the length of the TE, both of itself and its length including copies nested within it. For most families, there is a negative correlation between TE length and age (Fig 7C). However, we find that the relationship between TE length and age in maize is often nuanced, with some long TEs surviving over millions of years (Fig 7C and S13(A) and S13(B) Fig). In other taxa, selection is stronger on long TEs, mediated by a higher potential for nonhomologous (ectopic) recombination [97–99]. Although a number of factors may contribute to these patterns, it also seems likely that a genome as repetitive and TE-rich as maize perhaps could not have evolved without mechanisms to prevent improper pairing of nonhomologous sequences with high nucleotide similarity [100].

Other predictors of age are expected. For example, we expect a new insertion to be younger if we show that the TE disrupts another TE, which we see for most families shown in Fig 7C. Additionally, we expect the proportion of segregating sites in the TE and the region flanking a TE insertion will be positively associated with TE age, as they reflect a count of the mutations that have accumulated on the haplotype carrying the TE. There is a positive relationship between age and segregating sites for most families shown in Fig 7C. We note that imprecise repair of a double stranded break after excision of a TIR element [70] could obscure this signal to some extent, increasing the number of flanking SNPs while decreasing the average frequency of the TE. Consistent with this mechanism, the superfamily DTT, which excises precisely without introducing nucleotide mutations [101], shows lower median flanking segregating sites per base pair (0.0295) than TIR elements from other superfamilies (0.0310) (Wilcoxon signed-rank test shows significant effect of DTT superfamily; p < 2.2e-16).

Elevated CHH methylation of TEs has been found in recently activated TEs in Arabidopsis thaliana [102] and in TEs near genes in maize [103, 104]. We find complicated, nonlinear relationships of CHH methylation with age (Fig 7F and 7G). These differences between and within families may reflect a natural senescence of TE copies. Young copies not yet silenced by the genome lack CHH methylation, intermediate age copies are effectively silenced with higher CHH methylation levels, and the oldest TEs with low CHH methylation are defunct copies incapable of transposition that are no longer silenced. More detailed study of recently active maize TE families will allow understanding of the temporal dynamics of transcriptional and post-transcriptional silencing of TEs.

In spite of previous predictions, distance to a gene and recombination are not found in the top 30 explanatory variables of age. Old TEs are underrepresented near genes in humans and Arabidopsis thaliana [105, 106], consistent with selection against such insertions. Recombination has been implicated in both the removal of TEs and in modifying their impact on fitness via ectopic recombination [6]. We believe that both distance to a gene and recombination rate reflect broad-level summaries of genomic regions, such that they are not predictive in our model once other local features are included. For example, regions with high recombination rate generally show low CG methylation in maize [107], but a subset of genes in such regions show CG methylation across the gene body. Since CG methylation plays a role in TE survival (Fig 7B), inclusion of this feature in our models will thus reduce the importance of recombination rate. Similarly, CHH methylation is most prominent in regions of the genome close to genes, presumably a result of RNA-directed DNA methylation reinforcing the boundary between heterochromatin and euchromatin [103, 104]. As this elevated CHH methylation is often over the TE closest to a gene [104], the distance of a TE to the closest gene may provide largely redundant information beyond what is captured by measurements of CHH methylation. Finally, despite many other features being correlated with either gene density or recombination rate, the two are inextricably linked, as recombination in maize primarily initiates in genes [108]. Together, these combine to reveal few patterns in the relationship between distance to gene, recombination rate, and age of TE copies (S13(C) and S13(D) Fig).

Finally, in spite of the fact our model includes more than 400 features of the genomic environment, TE taxonomy contributes substantially to prediction of TE age (Fig 7A and 7B). We have seen that the relevance and direction of effect of individual features can differ among families (Fig 7C), essentially generating family-specific niches in the genomic ecosystem. In fact, there is no genomic feature we measure which shows even the same direction of correlation with age across all families. The importance of taxonomy in our model suggests that there are unmeasured latent variables that are best captured with superfamily and family labels. This further emphasizes that the analysis of TEs in maize should focus on family, as each family is surviving in a slightly different way, exploiting a unique genomic niche.

Conclusion

Genes in the maize genome are ‘buried in non-genic DNA’ [109] consisting predominately of TEs. The interaction between TEs and the genes of the host genome can structure and inform genome function. The diversity of TEs in an elaborate genome like maize generates a complex ecosystem with many interdependencies and nuances, limiting the ability to predict the functional consequences of a particular TE based only on superfamily or order. Instead, TE families represent a biologically relevant level on which to understand TE evolution, and the features most important for determining survival of individual copies represent dimensions of the ecological niche they inhabit. These observations suggest that the co-evolution between TE and host is ongoing, and inference of the impacts of transposons requires a multifaceted approach. The nuanced understanding generated from exhaustive analysis of genomic features and survival of individual families of TEs serves as a starting point to begin to understand not only TE evolution, but also the evolution of the host genomes they have coevolved with.

Methods

Scripts for generating summaries from data sources and links to summarized data are available at http://www.github.com/mcstitzer/maize_genomic_ecosystem. Interactive distributions per family can be found at https://mcstitzer.shinyapps.io/maize_te_families/.

TE sequence properties

We base our analysis on the TE annotation of the maize inbred line B73 [48], updated to more fully capture TIR elements (see S1 Text). TEs that are nested inside of other TEs are divided for further analyses, by assigning each TE base pair in the genome to a single copy by iteratively removing copies in order of arrival. We remove from analysis any TE for which less than 50 bp remains after resolving nested copies. We add the positions of retrotransposon long terminal repeats (LTRs) to these annotations as produced by LTRharvest [110], and delimit the internal protein coding genes of LTR TEs using LTRdigest [111] and GyDb 2.0 retrotransposon gene HMMs [112]. We additionally identify the longest open reading frame (ORF) in each TE model using transdecoder [113], and identify whether this longest ORF is homologous to known transposases, integrases, and replicases respectively for TIRs, nonLTR retrotransposons, and helitrons (JCVI GenProp1044 http://www.jcvi.org/cgi-bin/genome-properties/GenomePropDefinition.cgi?prop_acc=GenProp1044 and PFAM PF02689, PF14214, PF05970) using hmmscan [114] with default parameters. We characterize copies as autonomous based on the content of their protein coding domains, requiring evidence of all 5 proteins (GAG, AP, RT, RNaseH, INT) for LTR retrotransposons, a reverse transcriptase match for LINEs, a transposase profile match for TIR transposons, and a Rep/Hel profile match for Helitrons. This measure is lenient in defining coding content, as it does not penalize stop codons and frameshifts throughout these coding regions.
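As a minimal sketch of how nested copies can be resolved so that each base pair is assigned to a single TE, the function below subtracts from every TE the bases covered by copies lying entirely inside it, then applies the 50 bp filter. It assumes half-open coordinates on one chromosome and treats a fully contained copy as the later arrival. The function name, the example coordinates, and that arrival-order rule are illustrative assumptions, not the annotation pipeline itself.

```python
def resolve_nested(tes, min_bp=50):
    """Return bp remaining per TE after removing bases of copies nested inside it.

    tes: dict of name -> (start, end), half-open coordinates on one chromosome.
    TEs with fewer than min_bp bases remaining are dropped.
    Assumes no two TEs share identical coordinates.
    """
    kept = {}
    for name, (start, end) in tes.items():
        # Copies that lie entirely within this TE, sorted by start position.
        inner = sorted(
            (s, e)
            for other, (s, e) in tes.items()
            if other != name and start <= s and e <= end
        )
        covered, cursor = 0, start
        for s, e in inner:
            if e > cursor:  # count only bases not already covered
                covered += e - max(s, cursor)
                cursor = e
        remaining = (end - start) - covered
        if remaining >= min_bp:
            kept[name] = remaining
    return kept


# Hypothetical coordinates for illustration.
example = {
    "outer": (0, 10_000),
    "middle": (2_000, 6_000),
    "inner": (3_000, 4_000),
    "mostly_overwritten": (7_000, 7_540),
    "late": (7_020, 7_520),
}
result = resolve_nested(example)
```

Here `mostly_overwritten` retains only 40 bp once `late` is subtracted, so it is removed from the result, while `outer` keeps the bases not claimed by any nested copy.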

After insertion, TE copies accumulate nucleotide substitutions that can be used to estimate their age. To estimate age based on divergence of a TE copy from others in the genome, we generated phylogenies of TE copies by first aligning the entire TE sequence of each copy in each superfamily using Mafft [115] (allowing sequences to be reverse complemented with the option --adjustdirection) and then building an unrooted tree using FastTree [116]. To make tree building computationally efficient in spite of the high number of TE copies and large element size, we use a maximum of 1000 bp for tree building for the largest 5 superfamilies (3’ terminal for Helitrons, 5’ terminal for LTR retrotransposons and TIR elements). The terminal branch length of each copy is used as a measure of its age, representing nucleotide substitutions since divergence from the closest related copy in the B73 reference genome. This measure of age makes a number of assumptions about the tempo and mode of transposition—for example, we assume nucleotide mutations in a TE arose at its current location, which may not be true for TIR elements that excise and move to a new location. Nonetheless, it is the only approach to calculate ages of individual TIR and Helitron elements [117, 118] without relying on a consensus element generated from a multiple sequence alignment that can be biased towards recently transposed copies that have not yet been removed by natural selection or genetic drift [118, 119].

Because the 5’ and 3’ LTR of LTR retrotransposons are identical upon insertion [83], we also estimate their time since insertion using the number of substitutions that occur between the two LTRs. For each LTR retrotransposon copy, we align both LTRs with Mafft [115] and calculate nucleotide divergence with a K2P correction using dist.dna in the ape package of R [120, 121]. For all age measures, we relate nucleotide divergence to absolute time using a mutation rate of 3.3 × 10⁻⁸ substitutions per site per year [94]. These LTR-LTR estimates are generally in line with terminal branch length age estimates (Spearman’s correlation 0.65), with LTR-LTR ages often older than terminal branch length ages (S6 Fig).
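The LTR-LTR calculation can be sketched in a few lines. The function below implements the standard Kimura two-parameter (K2P) distance for a pair of aligned sequences; it is a plain-Python illustration rather than the R code used for the analysis. Converting divergence to time as d / (2μ), with both LTRs accumulating substitutions independently, is the conventional formula for LTR dating and is an assumption here, since the text above states only the mutation rate.

```python
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
MU = 3.3e-8  # substitutions per site per year, as given in the text


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites in alignment")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * log((1 - 2 * p - q) * sqrt(1 - 2 * q))


def ltr_age_years(divergence, mu=MU):
    """Insertion age from LTR-LTR divergence (assumed form: d / (2 * mu))."""
    return divergence / (2 * mu)


# Hypothetical aligned LTR pair differing by one transition in ten sites.
d = k2p_distance("ACGTACGTAC", "GCGTACGTAC")
age = ltr_age_years(d)
```

Identical LTRs give a divergence and age of zero, which is why a non-zero age requires at least one observed substitution.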

TE environment and regulation

We characterize the genomic environment of the TE and features that overlap the TE. For each TE, we characterize the distance to the closest gene (gene annotation AGPv4, Zm00001d.2, Ensembl Plants v40) irrespective of strand using GenomicRanges [122]. We additionally measure expression of these closest genes across a developmental atlas of the maize inbred line B73 [123] (accessed from MaizeGDB as walley_fpkm.txt using AGPv4 gene names). In order to estimate the overall dynamics and tissue-specificity of expression, we calculated both the median expression and τ [124] for each of these genes. τ is calculated as the summed deviance of each tissue from the tissue of maximal expression, divided by total number of tissues minus 1. τ values thus range from 0 to 1, with low values representing constitutive expression and high values indicating tissue-specific expression. We further characterize whether the closest gene is found in a syntenic position in Sorghum bicolor (typically indicative of higher conservation) using curated lists across grass genomes, excluding maize genes matched to multiple Sorghum orthologs [125, 126].
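The definition of τ given above translates directly into code. This is a minimal sketch assuming untransformed expression values, one per tissue; returning `None` for a gene with no expression in any tissue is our own choice for the undefined case, not something specified in the text.

```python
def tau(expression):
    """Tissue-specificity index: sum(1 - x_i / x_max) / (n - 1).

    Returns 0 for perfectly constitutive expression and 1 when expression
    is confined to a single tissue. Returns None if the maximum is zero.
    """
    values = list(expression)
    n = len(values)
    if n < 2:
        raise ValueError("tau requires at least two tissues")
    x_max = max(values)
    if x_max <= 0:
        return None  # not expressed anywhere; tau is undefined
    return sum(1 - x / x_max for x in values) / (n - 1)


# Hypothetical expression profiles across four tissues.
constitutive = tau([5.0, 5.0, 5.0, 5.0])
tissue_specific = tau([0.0, 0.0, 0.0, 8.0])
intermediate = tau([2.0, 4.0, 4.0, 8.0])
```

The same function applies to per-family TE expression values, so gene and TE tissue specificity can be compared on one scale.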

In addition to host genes, TEs themselves can be transcribed. Using RNAseq reads from the Walley et al. (2016) [123] expression atlas (NCBI SRP029238), we counted reads that align uniquely to a specific member of a TE family, as well as multiply mapped reads that align to a single family, as in Anderson et al. (2018) [51] and Anderson et al. (2019) [52]. This allows estimation of the expression level of a TE family, despite the repetitive nature of TEs that limits unique mapping of reads. Reads that map to TEs located within genic sequences (generally within introns) were excluded because their expression is indistinguishable from transcription from the gene promoter. We take the mean value of reads per million across the two to three replicates per tissue, and divide by the total family size to get a per-copy metric of expression. As with genes, we calculate median expression across tissues and tissue specificity using τ.
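The per-copy expression metric is a short calculation: average the replicates within each tissue, then divide by the number of copies in the family. The sketch below uses invented reads-per-million values and a hypothetical family size of four.

```python
from statistics import mean, median


def per_copy_expression(rpm_by_tissue, family_size):
    """Mean RPM across replicates per tissue, divided by family copy number.

    rpm_by_tissue: dict of tissue -> list of replicate RPM values.
    """
    return {
        tissue: mean(replicates) / family_size
        for tissue, replicates in rpm_by_tissue.items()
    }


# Hypothetical family with two to three replicates per tissue.
family_rpm = {
    "root": [10.0, 14.0],
    "anther": [0.0, 0.0, 0.0],
    "leaf": [3.0, 5.0, 4.0],
}
per_copy = per_copy_expression(family_rpm, family_size=4)
median_expression = median(per_copy.values())
```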

To identify the recombinational environment in which each TE exists, we use a 0.2 cM genetic map of maize generated from the Nested Association Mapping (NAM) panel [127]. We convert AGPv2 coordinates to AGPv4 coordinates using the Ensembl variant converter [128]. To approximate the recombination rate in genomic regions, we fit a monotonic polynomial function to each chromosome [129]. Using this function and the start and end positions of each TE, we calculate the genetic distance (in cM) spanned by the TE, and convert it to cM/Mb by dividing by the length of the TE in megabases.
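The conversion from a genetic map to a per-TE rate can be sketched as below. Note that this sketch swaps in piecewise-linear interpolation between map markers in place of the monotonic polynomial fit described above, purely to stay self-contained. The function names and marker values are hypothetical.

```python
from bisect import bisect_right


def genetic_position(bp, marker_bp, marker_cm):
    """Interpolate a cM position for a physical position.

    marker_bp must be strictly increasing and marker_cm non-decreasing.
    Linear interpolation is a stand-in (assumption of this sketch) for the
    monotonic polynomial used in the text.
    """
    if bp <= marker_bp[0]:
        return marker_cm[0]
    if bp >= marker_bp[-1]:
        return marker_cm[-1]
    i = bisect_right(marker_bp, bp)  # marker_bp[i - 1] <= bp < marker_bp[i]
    x0, x1 = marker_bp[i - 1], marker_bp[i]
    y0, y1 = marker_cm[i - 1], marker_cm[i]
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)


def te_recombination_rate(start, end, marker_bp, marker_cm):
    """cM spanned by the TE divided by its length in megabases (cM/Mb)."""
    cm_span = genetic_position(end, marker_bp, marker_cm) - genetic_position(
        start, marker_bp, marker_cm
    )
    length_mb = (end - start) / 1e6
    return cm_span / length_mb


# Toy map: 1 cM/Mb over the first megabase, 2 cM/Mb over the second.
toy_bp = [0, 1_000_000, 2_000_000]
toy_cm = [0.0, 1.0, 3.0]
print(te_recombination_rate(500_000, 600_000, toy_bp, toy_cm))
print(te_recombination_rate(1_500_000, 1_600_000, toy_bp, toy_cm))
```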

The chromatin environment a TE exists in can impact transposition [60]. We converted data on MNase hypersensitive sites in roots and shoots [130] from the AGPv3 reference genome to AGPv4 coordinates using the Ensembl variant converter [128]. We counted how many hypersensitive sites exist in each TE, as well as the proportion of base pairs of the TE that are hypersensitive. We also calculate these metrics for the 1 kb region flanking the TE on both sides.
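The two hypersensitivity metrics can be illustrated with a short interval-overlap sketch. It assumes half-open coordinates and non-overlapping hypersensitive sites on the same chromosome as the TE. The function names are hypothetical and do not come from the authors' scripts.

```python
def hypersensitivity(start, end, sites):
    """Count sites overlapping [start, end) and the proportion of bp covered.

    sites: iterable of (site_start, site_end) half-open intervals, assumed
           not to overlap one another.
    """
    count = 0
    covered = 0
    for site_start, site_end in sites:
        overlap = min(end, site_end) - max(start, site_start)
        if overlap > 0:
            count += 1
            covered += overlap
    return count, covered / (end - start)


def flank_hypersensitivity(te_start, te_end, sites, flank=1000):
    """Apply the same metrics to the flanking regions on each side of the TE."""
    upstream = hypersensitivity(max(0, te_start - flank), te_start, sites)
    downstream = hypersensitivity(te_end, te_end + flank, sites)
    return upstream, downstream


# Toy example: a 1 kb TE with one site straddling its left edge.
toy_sites = [(900, 1100), (1500, 1600), (2500, 2600)]
print(hypersensitivity(1000, 2000, toy_sites))
print(flank_hypersensitivity(1000, 2000, toy_sites))
```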

Regulation of TEs by the host genome is often mediated via epigenetic modifications. We use bisulfite sequencing reads from shoot apical meristem, anther, ear shoot, seedling leaf, and flag leaf [104, 131]. We trim adapters using TrimGalore, map using bsmap 2.7.4 with parameters (-v 5 -r 0 -q 20) [132], and summarize in 100 bp windows as in Li et al. (2015) [104], to characterize the local proportion of methylated cytosines in all three contexts (CG, CHG, CHH; where H is any base but G). We average each measure over each TE copy and over each of twenty 100 bp windows of flanking sequence on either side, imputing missing data with the family mean.
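The averaging and family-mean imputation steps can be sketched as follows. The data layout (window values per copy, with None marking windows lacking coverage) and the function names are assumptions made for illustration.

```python
def mean_over_windows(window_values):
    """Average methylation over windows, ignoring windows with no data (None)."""
    observed = [v for v in window_values if v is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)


def impute_with_family_mean(value_by_copy, family_of):
    """Replace missing per-copy values with the mean of observed family members.

    value_by_copy: dict copy id -> value or None (hypothetical layout).
    family_of:     dict copy id -> family name.
    Copies from families with no observed value at all remain None.
    """
    sums, counts = {}, {}
    for copy_id, value in value_by_copy.items():
        if value is not None:
            family = family_of[copy_id]
            sums[family] = sums.get(family, 0.0) + value
            counts[family] = counts.get(family, 0) + 1
    imputed = {}
    for copy_id, value in value_by_copy.items():
        family = family_of[copy_id]
        if value is None and family in counts:
            value = sums[family] / counts[family]
        imputed[copy_id] = value
    return imputed


# Toy example: te2 lacks data and receives the mean of its family (A).
toy_values = {"te1": 0.8, "te2": None, "te3": 0.4, "te4": None}
toy_family = {"te1": "A", "te2": "A", "te3": "A", "te4": "B"}
print(impute_with_family_mean(toy_values, toy_family))
```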

To identify differences between TE copies in their base composition, we calculate GC content plus the number of di- and trinucleotide sites containing cytosines in a methylatable context (CG, CHG, CHH). We count these contexts in each TE using the bedtools nuc command [133] and divide by TE length to determine the proportion of the sequence that is methylatable for each context. We also calculate these measures of methylatability for the 1 kb of sequence flanking the TE on each side.

We also measure the number of segregating sites per base pair in each TE and its 1 kb flanking sequence in the Zea mays Hapmap3.2.1 dataset [53], and record the subgenome [48] in which each TE is found.
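A simplified stand-in for these base composition measures is sketched below. It is not a reimplementation of bedtools nuc: it classifies each cytosine on the given strand into exactly one context, which is an assumption of this sketch, and the function name is hypothetical.

```python
def base_composition(seq):
    """Return GC content and the proportion of sites in CG, CHG and CHH contexts.

    Each forward-strand cytosine is assigned to one context by the two bases
    that follow it (H is A, C or T). Proportions are counts divided by the
    sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    h_bases = set("ACT")
    counts = {"CG": 0, "CHG": 0, "CHH": 0}
    for i, base in enumerate(seq):
        if base != "C":
            continue
        following = seq[i + 1:i + 3]
        if len(following) >= 1 and following[0] == "G":
            counts["CG"] += 1
        elif len(following) == 2 and following[0] in h_bases:
            if following[1] == "G":
                counts["CHG"] += 1
            elif following[1] in h_bases:
                counts["CHH"] += 1
    gc = (seq.count("G") + seq.count("C")) / n
    return gc, {context: c / n for context, c in counts.items()}


# Toy sequence containing one cytosine in each context.
print(base_composition("CGCAGCTTAA"))
```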

As we cannot calculate accurate summaries of genomic features for families with a small number of TE copies, we include only those families with more than ten copies when presenting results in the text that identify specific outlier families, such as the family with highest GC content. When presenting summaries at the superfamily and order level or results modeling TE age, we include information from all TE copies, including those from smaller families.

Analysis and interpretation

We implement random forest regression models (in the R package ‘randomForest’ [134]) to understand the importance of different genomic features to TE survival in the genome, measured as the age of individual extant copies. We train models on 50% of copies, and summarize 1000 iterations of trees. The remaining TEs are retained as a test set to estimate model performance. Any missing data is assigned a value of -1, and the categorical variable of superfamily is considered as a factor. Because of limitations in converting numbers to binary, we limit the categorical variable of family to the 31 largest families, and code all others as ‘smaller.’ We summarize the overall importance of each feature in predicting age by permuting its values across individual TE copies and observing the change in mean squared error of the model prediction of the actual value, scaled by its standard deviation. We summarize features into categories reflecting TE taxonomy, TE base composition, TE methylation and chromatin accessibility, TE expression, TE-encoded proteins, nearest gene expression, regional base composition, regional methylation and chromatin accessibility, and regional recombination and selection. A full description of the individual measurements that go into each category is found in S2 Table.
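The data preparation and the permutation measure of importance can be sketched in Python as below. This is an illustration under stated assumptions, not the randomForest implementation: `predict` stands for any already-fitted model, all function names are hypothetical, and the scaling of importance by its standard deviation is omitted.

```python
import random
from collections import Counter


def lump_families(families, keep=31, other="smaller"):
    """Keep the `keep` largest families as levels; recode the rest as `other`."""
    largest = {family for family, _ in Counter(families).most_common(keep)}
    return [family if family in largest else other for family in families]


def code_missing(row, sentinel=-1):
    """Assign a sentinel value to missing measurements."""
    return [sentinel if value is None else value for value in row]


def mean_squared_error(predict, rows, ages):
    return sum((predict(row) - age) ** 2 for row, age in zip(rows, ages)) / len(rows)


def permutation_importance(predict, rows, ages, column, rng):
    """Increase in mean squared error after shuffling one feature across copies."""
    baseline = mean_squared_error(predict, rows, ages)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled)
    ]
    return mean_squared_error(predict, permuted, ages) - baseline


# Toy model in which age depends only on the first feature.
toy_rows = [[1.0, 9.0], [2.0, 8.0], [3.0, 7.0], [4.0, 6.0]]
toy_ages = [2.0, 4.0, 6.0, 8.0]


def toy_predict(row):
    return 2.0 * row[0]


rng = random.Random(1)
print(permutation_importance(toy_predict, toy_rows, toy_ages, 0, rng))
print(permutation_importance(toy_predict, toy_rows, toy_ages, 1, rng))
```

Shuffling a feature the model ignores leaves the error unchanged, whereas shuffling an informative feature can only increase it in this toy case, which is the intuition behind ranking features by this measure.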

In order to interpret family-specific relationships for the top predictors of age, we perform further analyses. We calculate the Pearson’s correlation coefficient of each predictor with age, using samples from each family. To visualize the nonlinear relationships and interactions produced by such models, we calculate Individual Conditional Expectations (ICE plots [135], R package ‘pdp’ [136]), which summarize the contributions of permuted values of a variable of interest to the response, while conditioning on observed values at all other variables. We provide permuted values summarizing 95% of the observed data, to provide predictions in a region of parameter space the model is trained on. We summarize these responses as deviation of the predicted value generated with permuted data from the true value, and plot as individual lines and superfamily averages.
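The construction of an ICE curve can be sketched as follows. This does not use the ‘pdp’ package. Here `predict` stands for any fitted model, and the function names, the evenly spaced grid, and the reading of "true value" as the observed age of each copy are all assumptions of this sketch.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def ice_curves(predict, rows, ages, column, n_grid=10):
    """One curve per TE copy: prediction deviation as one feature is varied.

    The grid spans the central 95% of observed values of the feature, and all
    other features keep their observed values. Each curve is reported as the
    prediction minus the observed age of that copy.
    """
    observed = sorted(row[column] for row in rows)
    low, high = quantile(observed, 0.025), quantile(observed, 0.975)
    grid = [low + (high - low) * k / (n_grid - 1) for k in range(n_grid)]
    curves = []
    for row, age in zip(rows, ages):
        curve = [
            predict(row[:column] + [value] + row[column + 1:]) - age
            for value in grid
        ]
        curves.append(curve)
    return grid, curves


# Toy additive model: curves for different copies are parallel lines.
toy_rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
toy_ages = [10.0, 21.0, 32.0]


def toy_predict(row):
    return row[0] + 10.0 * row[1]


grid, curves = ice_curves(toy_predict, toy_rows, toy_ages, column=0, n_grid=3)
print(grid)
print(curves)
```

Averaging the curves within a superfamily at each grid point gives the superfamily-level summary, while non-parallel curves indicate interactions with other features.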

Supporting information

S1 Fig. Family characteristics of each of the largest 10 families of each superfamily with at least 10 copies.

(A) Proportion of TEs within the transcript of a gene, including introns and UTRs. (B) TE span along the genome, summing both the base pairs of the TE and the base pairs of the TEs nested within it. (C) Proportion of TEs that are intact, that is, uninterrupted by the insertion of another TE. In (A and C), families are shown as points and superfamily proportions as a barplot, and in (B) families are shown with medians as points and lines representing ranges of upper to lower quartiles, with superfamilies shown as colored rectangles.

(TIF)

S2 Fig. Chromosomal distribution of superfamilies across all 10 maize chromosomes.

Count of TE copies of each superfamily in 1 megabase bins across each chromosome.

(TIF)

S3 Fig. Distribution on chromosome 1 of five largest families with at least ten copies in each superfamily.

Count of TE copies in 1 megabase bins along chromosome 1. (A) DHH, (B) DTA, (C) DTC, (D) DTH, (E) DTM, (F) DTT, (G) DTX, (H) RLC, (I) RLG, (J) RLX, (K) RIL, (L) RIT, (M) RST. Note that some families have no copies on chromosome 1, including DTT10880 and DTX10177. Additionally, the RIT superfamily only has two families.

(TIF)

S4 Fig. Ages in 10,000 year bins across each of the largest 10 families of each superfamily with at least 10 copies.

(A) DHH, (B) DTA, (C) DTC, (D) DTH, (E) DTM, (F) DTT, (G) DTX, (H) RLC, (I) RLG, (J) RLX, (K) RIT, (L) RIL, (M) RST. The RIT superfamily only contains two families.

(TIF)

S5 Fig. LTR-LTR ages and terminal branch length ages for LTR retrotransposons.

Ages in 10,000 year bins across each of the largest 10 families of each superfamily with at least 10 copies. Left plots (A-D) show LTR-LTR ages, right plots (E-H) show terminal branch length (TBL) ages. (A) all copies, LTR-LTR, (B) RLC families, LTR-LTR, (C) RLG families, LTR-LTR, (D) RLX families, LTR-LTR, (E) all copies, TBL, (F) RLC families, TBL, (G) RLG families, TBL, (H) RLX families, TBL.

(TIF)

S6 Fig. LTR-LTR ages vs. terminal branch length ages for LTR retrotransposon superfamilies.

Spearman’s correlation coefficient shown on plot for each superfamily.

(TIF)

S7 Fig. Age of TE copies split by coding potential of self and family.

Violin plots with three lines, at median and 25th and 75th percentile. Only ages younger than 2 million years are shown. “Coding copy” refers to those copies that code for protein, “noncoding copy” refers to those copies that don’t code for protein, but a family member does, and “noncoding family” refers to copies from families without a coding member in B73.

(TIF)

S8 Fig. Methylation in TE and flanking sequence, across tissues.

A-J: mCG; K-T: mCHG; U-end: mCHH. Tissues on y-axis, from top to bottom: Anther, SAM (shoot apical meristem), Earshoot, Flagleaf, Seedling leaf.

(TIF)

S9 Fig. Features of the TE and flanking sequences.

GC content in the TE (A) and 1kb flanking sequence (B). Proportion of sites methylatable in the CG context in the TE (C) and 1kb flanking sequence (D), methylatable in the CHG context in the TE (E) and 1kb flanking sequence (F), proportion of sites methylatable in the CHH context in the TE (G) and 1kb flanking sequence (H). Proportion of sites containing a TG or CA dinucleotide in the TE (I) and 1kb flanking sequence (J). Proportion of sites in MNase hypersensitive regions in root in TE (K) and 1kb flank (L), and shoot in TE (M) and 1kb flank (N). Proportion of segregating sites in the TE (O) and 1kb flank (P).

(TIF)

S10 Fig. The proportion of methylatable cytosines is negatively correlated with the proportion of TG/CA dinucleotides.

The x-axis reflects the proportion of cytosines in a CG context within the TE, and the y-axis reflects the proportion of dinucleotides in the TE that contain a TG or CA.

(TIF)

S11 Fig. Recombination, subgenome, and expression of closest gene.

(A) Recombination rate across the TE, (B) proportion of TEs in subgenome A, (C) log10 median expression of the closest gene to each TE, (D) log10 median expression of genes within 1kb of the TE, (E) Tau of closest gene to each TE, (F) log10 median expression of the closest syntenic gene, (G) log10 median expression of closest syntenic genes within 1 kb, and (H) Tau of the closest syntenic gene.

(TIF)

S12 Fig. Protein coding gene presence of individual LTR GAG and POL domains.

Shown are (A) the proportion of TEs with evidence of a group-specific antigen (GAG) domain present, (B) the proportion of TEs with evidence of all polyprotein domains present (aspartic proteinase, integrase, reverse transcriptase, and RNaseH), (C) the proportion of TEs with both GAG and Polyprotein present in the same element. Families are shown as points and superfamily proportions as a barplot.

(TIF)

S13 Fig. Predicted and observed relationship of age to TE length and distance to gene.

Raw relationship (A & C) and predicted relationship (B & D) of TE length (A & B) and distance to gene (C & D).

(TIF)

S1 Table. Ten largest families in each superfamily, as shown left to right in plots.

(TXT)

S2 Table. Categories that each feature measured for each TE is classified into.

(TXT)

S3 Table. 14 families with at least 10 copies in the B73 genome, with at least 75% of copies coding for transposition related proteins.

(TXT)

S4 Table. 842 families with at least 10 copies in the B73 genome that lack coding representatives.

(TXT)

S5 Table. Mean methylation levels across superfamilies, averaged across all tissues, and averaged within a tissue.

(TXT)

S6 Table. TE families that lack methylatable cytosines (presented as family median values).

(TXT)

S1 Text. TIR annotation methods.

(PDF)

Acknowledgments

This work was inspired by the concept of transposable elements existing within “niches in the ecology of the genome,” as introduced by Margaret Kidwell and Damon Lisch [7].

Data Availability

All relevant data are within the manuscript and its Supporting information files. Scripts for generating summaries from data sources and links to summarized data are available at http://www.github.com/mcstitzer/maize_genomic_ecosystem.

Funding Statement

M.C.S. and J.R.-I. are supported by the National Science Foundation Plant Genome award 1238014. M.C.S. acknowledges support from the National Science Foundation Graduate Research Fellowship under Grant No. 1650042; J.R.-I. acknowledges support from the USDA Hatch project CA-D-PLS-2066-H. S.N.A. and N.M.S. are supported by a grant from USDA-NIFA (2016-67013-24747). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bennetzen JL, Kellogg EA. Do plants have a one-way ticket to genomic obesity? The Plant Cell. 1997;9(9):1509. doi: 10.1105/tpc.9.9.1509
  • 2. Lisch D. How important are transposons for plant evolution? Nature Reviews Genetics. 2013;14(1):49–61.
  • 3. Oliver KR, McComb JA, Greene WK. Transposable elements: powerful contributors to angiosperm evolution and diversity. Genome Biology and Evolution. 2013;5(10):1886–1901. doi: 10.1093/gbe/evt141
  • 4. Fedoroff NV. Transposable elements, epigenetics, and genome evolution. Science. 2012;338(6108):758–767. doi: 10.1126/science.338.6108.758
  • 5. Charlesworth B, Charlesworth D. The population dynamics of transposable elements. Genetical Research. 1983;42(01):1–27.
  • 6. Charlesworth B, Langley CH. The population genetics of Drosophila transposable elements. Annual Review of Genetics. 1989;23(1):251–287. doi: 10.1146/annurev.ge.23.120189.001343
  • 7. Kidwell MG, Lisch D. Transposable elements as sources of variation in animals and plants. Proceedings of the National Academy of Sciences. 1997;94(15):7704–7711. doi: 10.1073/pnas.94.15.7704
  • 8. Venner S, Feschotte C, Biémont C. Dynamics of transposable elements: towards a community ecology of the genome. Trends in Genetics. 2009;25(7):317–323. doi: 10.1016/j.tig.2009.05.003
  • 9. Linquist S, Cottenie K, Elliott TA, Saylor B, Kremer SC, Gregory TR. Applying ecological models to communities of genetic elements: the case of Neutral Theory. Molecular Ecology. 2015. doi: 10.1111/mec.13219
  • 10. Kremer SC, Linquist S, Saylor B, Elliott TA, Gregory TR, Cottenie K. Transposable element persistence via potential genome-level ecosystem engineering. BMC Genomics. 2020;21:1–15. doi: 10.1186/s12864-020-6763-1
  • 11. Bennetzen JL. Transposable element contributions to plant gene and genome evolution. Plant Molecular Biology. 2000;42(1):251–269. doi: 10.1023/A:1006344508454
  • 12. Thomas J, Pritham EJ. Helitrons, the eukaryotic rolling-circle transposable elements. In: Mobile DNA III. American Society of Microbiology; 2015. p. 893–926.
  • 13. Labrador M, Corces VG. Interactions between transposable elements and the host genome. Mobile DNA II. 2002; p. 1008–1023.
  • 14. Sultana T, Zamborlini A, Cristofari G, Lesage P. Integration site selection by retroviruses and transposable elements in eukaryotes. Nature Reviews Genetics. 2017;18(5):292. doi: 10.1038/nrg.2017.7
  • 15. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, et al. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics. 2007;8(12):973–982. doi: 10.1038/nrg2165
  • 16. Finnegan DJ. Eukaryotic transposable elements and genome evolution. Trends in Genetics. 1989;5:103–107. doi: 10.1016/0168-9525(89)90039-5
  • 17. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research. 2005;110(1-4):462–467. doi: 10.1159/000084979
  • 18. Kapitonov VV, Jurka J. A universal classification of eukaryotic transposable elements implemented in Repbase. Nature Reviews Genetics. 2008;9(5):411–412. doi: 10.1038/nrg2165-c1
  • 19. Piégu B, Bire S, Arensburger P, Bigot Y. A survey of transposable element classification systems–a call for a fundamental update to meet the challenge of their diversity and complexity. Molecular Phylogenetics and Evolution. 2015;86:90–109. doi: 10.1016/j.ympev.2015.03.009
  • 20. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, et al. Reply: A unified classification system for eukaryotic transposable elements should reflect their phylogeny. Nature Reviews Genetics. 2009;10(4):276. doi: 10.1038/nrg2165-c3
  • 21. Wicker T. So many repeats and so little time: how to classify transposable elements. In: Plant Transposable Elements. Springer; 2012. p. 1–15.
  • 22. Johns M, Mottinger J, Freeling M. A low copy number, copia-like transposon in maize. The EMBO journal. 1985;4(5):1093–1101. doi: 10.1002/j.1460-2075.1985.tb03745.x
  • 23. Sutton W, Gerlach W, Peacock W, Schwartz D. Molecular analysis of Ds controlling element mutations at the Adh1 locus of maize. Science. 1984;223(4642):1265–1268. doi: 10.1126/science.223.4642.1265
  • 24. Hake S, Walbot V. The genome of Zea mays, its organization and homology to related grasses. Chromosoma. 1980;79(3):251–270. doi: 10.1007/BF00327318
  • 25. Sanz-Alferez S, SanMiguel P, Jin YK, Springer PS, Bennetzen JL. Structure and evolution of the Cinful retrotransposon family of maize. Genome. 2003;46(5):745–752. doi: 10.1139/g03-061
  • 26. Baucom RS, Estill JC, Chaparro C, Upshaw N, Jogi A, Deragon JM, et al. Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLoS Genetics. 2009;5(11):e1000732. doi: 10.1371/journal.pgen.1000732
  • 27. SanMiguel P, Vitte C. The LTR-retrotransposons of maize. In: Handbook of Maize. Springer; 2009. p. 307–327.
  • 28. Diez CM, Meca E, Tenaillon MI, Gaut BS. Three Groups of Transposable Elements with Contrasting Copy Number Dynamics and Host Responses in the Maize (Zea mays ssp. mays) Genome. PLoS Genetics. 2014;10(4):e1004298. doi: 10.1371/journal.pgen.1004298
  • 29. Vicient CM. Transcriptional activity of transposable elements in maize. BMC Genomics. 2010;11(1):601. doi: 10.1186/1471-2164-11-601
  • 30. Cresse AD, Hulbert SH, Brown WE, Lucas JR, Bennetzen JL. Mu1-related transposable elements of maize preferentially insert into low copy number DNA. Genetics. 1995;140(1):315–324. doi: 10.1093/genetics/140.1.315
  • 31. Zhong CX, Marshall JB, Topp C, Mroczek R, Kato A, Nagaki K, et al. Centromeric retroelements and satellites interact with maize kinetochore protein CENH3. The Plant Cell. 2002;14(11):2825–2836. doi: 10.1105/tpc.006106
  • 32. Eichten SR, Ellis NA, Makarevitch I, Yeh CT, Gent JI, Guo L, et al. Spreading of heterochromatin is limited to specific families of maize retrotransposons. PLoS Genetics. 2012;8(12):e1003127. doi: 10.1371/journal.pgen.1003127
  • 33. Noshay JM, Anderson SN, Zhou P, Ji L, Ricci W, Lu Z, et al. Monitoring the interplay between transposable element families and DNA methylation in maize. PLoS Genetics. 2019;15(9):e1008291. doi: 10.1371/journal.pgen.1008291
  • 34. Elliott TA, Gregory TR. Do larger genomes contain more diverse transposable elements? BMC Evolutionary Biology. 2015;15(1):69. doi: 10.1186/s12862-015-0339-8
  • 35. Ågren JA, Wright SI. Co-evolution between transposable elements and their hosts: a major factor in genome size evolution? Chromosome Research. 2011;19(6):777.
  • 36. Ågren JA, Greiner S, Johnson MT, Wright SI. No evidence that sex and transposable elements drive genome size variation in evening primroses. Evolution. 2015;69(4):1053–1062. doi: 10.1111/evo.12627
  • 37. Sotero-Caio CG, Platt RN, Suh A, Ray DA. Evolution and diversity of transposable elements in vertebrate genomes. Genome Biology and Evolution. 2017;9(1):161–177. doi: 10.1093/gbe/evw264
  • 38. Bast J, Jaron K, Schuseil D, Roze D, Schwander T. Asexual reproduction drives the reduction of transposable element load. eLife. 2019;8:e48548. doi: 10.7554/eLife.48548
  • 39. Brookfield JF. The ecology of the genome—mobile DNA elements and their hosts. Nature Reviews Genetics. 2005;6(2):128–136. doi: 10.1038/nrg1524
  • 40. Kidwell MG, Lisch DR. Transposable elements as sources of genomic variation. Mobile DNA II. 2002; p. 59–90.
  • 41. Promislow DE, Jordan IK, McDonald J. Genomic demography: a life-history analysis of transposable element evolution. Proceedings of the Royal Society of London B: Biological Sciences. 1999;266(1428):1555–1560. doi: 10.1098/rspb.1999.0815
  • 42. Abrusán G, Krambeck HJ. Competition may determine the diversity of transposable elements. Theoretical Population Biology. 2006;70(3):364–375. doi: 10.1016/j.tpb.2006.05.001
  • 43. Saylor B, Elliott TA, Linquist S, Kremer SC, Gregory TR, Cottenie K. A novel application of ecological analyses to assess transposable element distributions in the genome of the domestic cow, Bos taurus. Genome. 2013;56(9):521–533. doi: 10.1139/gen-2012-0162
  • 44. Jones CG, Lawton JH, Shachak M. Organisms as ecosystem engineers. In: Ecosystem management. Springer; 1994. p. 130–147.
  • 45. Pelletier F, Garant D, Hendry A. Eco-evolutionary dynamics. Philosophical Transactions of the Royal Society B: Biological Sciences. 2009;364(1523):1483. doi: 10.1098/rstb.2009.0027
  • 46. Post DM, Palkovacs EP. Eco-evolutionary feedbacks in community and ecosystem ecology: interactions between the ecological theatre and the evolutionary play. Philosophical Transactions of the Royal Society B: Biological Sciences. 2009;364(1523):1629–1640. doi: 10.1098/rstb.2009.0012
  • 47. Linquist S, Saylor B, Cottenie K, Elliott TA, Kremer SC, Ryan Gregory T. Distinguishing ecological from evolutionary approaches to transposable elements. Biological Reviews. 2013;88(3):573–584. doi: 10.1111/brv.12017
  • 48. Jiao Y, Peluso P, Shi J, Liang T, Stitzer MC, Wang B, et al. Improved maize reference genome with single molecule technologies. Nature. 2017;546(7659):524–527. doi: 10.1038/nature22971
  • 49. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326(5956):1112–1115. doi: 10.1126/science.1178534
  • 50. Slotkin RK. The case for not masking away repetitive DNA. Mobile DNA. 2018;9(1):15. doi: 10.1186/s13100-018-0120-9
  • 51. Anderson SN, Zynda G, Song J, Han Z, Vaughn M, Li Q, et al. Subtle Perturbations of the Maize Methylome Reveal Genes and Transposons Silenced by Chromomethylase or RNA-Directed DNA Methylation Pathways. G3: Genes, Genomes, Genetics. 2018; p. g3–200284. doi: 10.1534/g3.118.200284
  • 52. Anderson SN, Stitzer MC, Zhou P, Ross-Ibarra J, Hirsch CD, Springer NM. Dynamic Patterns of Transcript Abundance of Transposable Element Families in Maize. G3: Genes, Genomes, Genetics. 2019. doi: 10.1534/g3.119.400431
  • 53. Bukowski R, Guo X, Lu Y, Zou C, He B, Rong Z, et al. Construction of the third-generation Zea mays haplotype map. GigaScience. 2018;7(4):1. doi: 10.1093/gigascience/gix134
  • 54. Hill WG, Robertson A. The effect of linkage on limits to artificial selection. Genetics Research. 1966;8(3):269–294. doi: 10.1017/S0016672300010156
  • 55. Swigoňová Z, Lai J, Ma J, Ramakrishna W, Llaca V, Bennetzen JL, et al. Close split of sorghum and maize genome progenitors. Genome Research. 2004;14(10a):1916–1923. doi: 10.1101/gr.2332504
  • 56. Schnable JC, Springer NM, Freeling M. Differentiation of the maize subgenomes by genome dominance and both ancient and ongoing gene loss. Proceedings of the National Academy of Sciences. 2011;108(10):4069–4074. doi: 10.1073/pnas.1101368108
  • 57. Bennetzen J, Liu R, Ma J, Pontaroli A. Maize genome structure and rearrangement. Maydica. 2005;50(3/4):387.
  • 58. Han Y, Qin S, Wessler SR. Comparison of class 2 transposable elements at superfamily resolution reveals conserved and distinct features in cereal grass genomes. BMC Genomics. 2013;14(1):71. doi: 10.1186/1471-2164-14-71
  • 59. Jiang N, Wessler SR. Insertion preference of maize and rice miniature inverted repeat transposable elements as revealed by the analysis of nested elements. The Plant Cell. 2001;13(11):2553–2564. doi: 10.1105/tpc.010235
  • 60. Liu S, Yeh CT, Ji T, Ying K, Wu H. Mu transposon insertion sites and meiotic recombination events co-localize with epigenetic marks for open chromatin across the maize genome. PLoS Genetics. 2009;5(11):e1000733–e1000733. doi: 10.1371/journal.pgen.1000733
  • 61. Wright SI, Le QH, Schoen DJ, Bureau TE. Population dynamics of an Ac-like transposable element in self-and cross-pollinating Arabidopsis. Genetics. 2001;158(3):1279–1288. doi: 10.1093/genetics/158.3.1279
  • 62. Tenaillon MI, Hollister JD, Gaut BS. A triptych of the evolution of plant transposable elements. Trends in Plant Science. 2010;15(8):471–478. doi: 10.1016/j.tplants.2010.05.003
  • 63. Eickbush TH, Jamburuthugoda VK. The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Research. 2008;134(1-2):221–234. doi: 10.1016/j.virusres.2007.12.010
  • 64. Yuan YW, Wessler SR. The catalytic domain of all eukaryotic cut-and-paste transposase superfamilies. Proceedings of the National Academy of Sciences. 2011;108(19):7884–7889. doi: 10.1073/pnas.1104208108
  • 65. Kapitonov VV, Jurka J. Helitrons on a roll: eukaryotic rolling-circle transposons. Trends in Genetics. 2007;23(10):521–529. doi: 10.1016/j.tig.2007.08.004
  • 66. Du J, Tian Z, Hans CS, Laten HM, Cannon SB, Jackson SA, et al. Evolutionary conservation, diversity and specificity of LTR-retrotransposons in flowering plants: insights from genome-wide analysis and multi-specific comparison. The Plant Journal. 2010;63(4):584–598. doi: 10.1111/j.1365-313X.2010.04263.x
  • 67. Neumann P, Navrátilová A, Koblížková A, Kejnovskỳ E, Hřibová E, Hobza R, et al. Plant centromeric retrotransposons: a structural and cytogenetic perspective. Mobile DNA. 2011;2(1):4. doi: 10.1186/1759-8753-2-4
  • 68. Slotkin RK, Nuthikattu S, Jiang N. The impact of transposable elements on gene and genome evolution. In: Plant Genome Diversity Volume 1. Springer; 2012. p. 35–58.
  • 69. Bureau TE, Wessler SR. Stowaway: a new family of inverted repeat elements associated with the genes of both monocotyledonous and dicotyledonous plants. The Plant Cell. 1994;6(6):907–916. doi: 10.1105/tpc.6.6.907
  • 70. Wicker T, Yu Y, Haberer G, Mayer KF, Marri PR, Rounsley S, et al. DNA transposon activity is associated with increased mutation rates in genes of rice and other grasses. Nature Communications. 2016;7:12790. doi: 10.1038/ncomms12790
  • 71. Singer T, Yordan C, Martienssen RA. Robertson’s Mutator transposons in A. thaliana are regulated by the chromatin-remodeling gene Decrease in DNA Methylation (DDM1). Genes & Development. 2001;15(5):591–602. doi: 10.1101/gad.193701
  • 72. Jiang N, Ferguson AA, Slotkin RK, Lisch D. Pack-Mutator–like transposable elements (Pack-MULEs) induce directional modification of genes through biased insertion and DNA acquisition. Proceedings of the National Academy of Sciences. 2011; p. 201010814.
  • 73. Anderson SN, Stitzer MC, Brohammer AB, Zhou P, Noshay JM, O’Connor CH, et al. Transposable elements contribute to dynamic genome content in maize. The Plant Journal. 2019. doi: 10.1111/tpj.14489
  • 74. Dooner HK, Wang Q, Huang JT, Li Y, He L, Xiong W, et al. Spontaneous mutations in maize pollen are frequent in some lines and arise mainly from retrotranspositions and deletions. Proceedings of the National Academy of Sciences. 2019;116(22):10734–10743. doi: 10.1073/pnas.1903809116
  • 75. Zhang Q, Arbuckle J, Wessler SR. Recent, extensive, and preferential insertion of members of the miniature inverted-repeat transposable element family Heartbreaker into genic regions of maize. Proceedings of the National Academy of Sciences. 2000;97(3):1160–1165. doi: 10.1073/pnas.97.3.1160
  • 76. Meyers BC, Tingey SV, Morgante M. Abundance, distribution, and transcriptional activity of repetitive elements in the maize genome. Genome Research. 2001;11(10):1660–1676. doi: 10.1101/gr.188201
  • 77. Estep M, DeBarry J, Bennetzen J. The dynamics of LTR retrotransposon accumulation across 25 million years of panicoid grass evolution. Heredity. 2013;110(2):194–204. doi: 10.1038/hdy.2012.99
  • 78. Tenaillon MI, Hufford MB, Gaut BS, Ross-Ibarra J. Genome Size and Transposable Element Content as Determined by High-Throughput Sequencing in Maize and Zea luxurians. Genome Biology and Evolution. 2011;3:219–229. doi: 10.1093/gbe/evr008
  • 79. Ross-Ibarra J, Tenaillon M, Gaut BS. Historical divergence and gene flow in the genus Zea. Genetics. 2009. doi: 10.1534/genetics.108.097238
  • 80. Gerlach W, Dennis E, Peacock W, Clegg M. The Ds1 controlling element family in maize and Tripsacum. Journal of Molecular Evolution. 1987;26(4):329–334. doi: 10.1007/BF02101151
  • 81. Purugganan MD, Wessler SR. Molecular evolution of magellan, a maize Ty3/gypsy-like retrotransposon. Proceedings of the National Academy of Sciences. 1994;91(24):11674–11678. doi: 10.1073/pnas.91.24.11674
  • 82. Roulin A, Piegu B, Fortune PM, Sabot F, D’hont A, Manicacci D, et al. Whole genome surveys of rice, maize and sorghum reveal multiple horizontal transfers of the LTR-retrotransposon Route66 in Poaceae. BMC Evolutionary Biology. 2009;9(1):58. doi: 10.1186/1471-2148-9-58
  • 83. SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, et al. Nested retrotransposons in the intergenic regions of the maize genome. Science. 1996;274(5288):765. doi: 10.1126/science.274.5288.765
  • 84. Malik HS, Eickbush TH. Modular evolution of the integrase domain in the Ty3/Gypsy class of LTR retrotransposons. Journal of Virology. 1999;73(6):5186–5190. doi: 10.1128/JVI.73.6.5186-5190.1999
  • 85. Matsuoka Y, Vigouroux Y, Goodman MM, Sanchez J, Buckler E, Doebley J. A single domestication for maize shown by multilocus microsatellite genotyping. Proceedings of the National Academy of Sciences. 2002;99(9):6080–6084. doi: 10.1073/pnas.052125199
  • 86. Piperno DR, Ranere AJ, Holst I, Iriarte J, Dickau R. Starch grain and phytolith evidence for early ninth millennium BP maize from the Central Balsas River Valley, Mexico. Proceedings of the National Academy of Sciences. 2009;106(13):5019–5024. doi: 10.1073/pnas.0812525106
  • 87. Studer A, Zhao Q, Ross-Ibarra J, Doebley J. Identification of a functional transposon insertion in the maize domestication gene tb1. Nature Genetics. 2011;43(11):1160–1163. doi: 10.1038/ng.942
  • 88. Yang Q, Li Z, Li W, Ku L, Wang C, Ye J, et al. CACTA-like transposable element in ZmCCT attenuated photoperiod sensitivity and accelerated the postdomestication spread of maize. Proceedings of the National Academy of Sciences. 2013;110(42):16969–16974. doi: 10.1073/pnas.1310949110
  • 89. Huang C, Sun H, Xu D, Chen Q, Liang Y, Wang X, et al. ZmCCT9 enhances maize adaptation to higher latitudes. Proceedings of the National Academy of Sciences. 2018;115(2):E334–E341. doi: 10.1073/pnas.1718058115
  • 90. Hutchinson GE. Concluding Remarks. Cold Spring Harbor Symposia on Quantitative Biology. 1957;22:415–427. doi: 10.1101/SQB.1957.022.01.039
  • 91. Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, et al. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009;457(7229):551. doi: 10.1038/nature07723
  • 92. Le Rouzic A, Dupas S, Capy P. Genome ecosystem and transposable elements species. Gene. 2007;390(1):214–220. doi: 10.1016/j.gene.2006.09.023
  • 93. Wagner A. Cooperation is fleeting in the world of transposable elements. PLoS Computational Biology. 2006;2(12):e162. doi: 10.1371/journal.pcbi.0020162 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94. Clark RM, Tavaré S, Doebley J. Estimating a nucleotide substitution rate for maize from polymorphism at a major domestication locus. Molecular Biology and Evolution. 2005;22(11):2304–2312. doi: 10.1093/molbev/msi228 [DOI] [PubMed] [Google Scholar]
  • 95. SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL. The paleontology of intergene retrotransposons of maize. Nature Genetics. 1998;20(1):43–45. doi: 10.1038/1695 [DOI] [PubMed] [Google Scholar]
  • 96. Weng ML, Becker C, Hildebrandt J, Rutter MT, Shaw RG, Weigel D, et al. Fine-Grained Analysis of Spontaneous Mutation Spectrum and Frequency in Arabidopsis thaliana. Genetics. 2018; p. genetics–301721. doi: 10.1534/genetics.118.301721 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97. Petrov DA, Aminetzach YT, Davis JC, Bensasson D, Hirsh AE. Size matters: non-LTR retrotransposable elements and ectopic recombination in Drosophila. Molecular Biology and Evolution. 2003;20(6):880–892. doi: 10.1093/molbev/msg102
  • 98. Shen JJ, Dushoff J, Bewick AJ, Chain FJ, Evans BJ. Genomic dynamics of transposable elements in the western clawed frog (Silurana tropicalis). Genome Biology and Evolution. 2013;5(5):998–1009. doi: 10.1093/gbe/evt065
  • 99. Hollister JD, Gaut BS. Population and evolutionary dynamics of Helitron transposable elements in Arabidopsis thaliana. Molecular Biology and Evolution. 2007;24(11):2515–2524. doi: 10.1093/molbev/msm197
  • 100. Fu H, Zheng Z, Dooner HK. Recombination rates between adjacent genic and retrotransposon regions in maize vary by 2 orders of magnitude. Proceedings of the National Academy of Sciences. 2002;99(2):1082–1087. doi: 10.1073/pnas.022635499
  • 101. Gilbert DM, Bridges MC, Strother AE, Burckhalter CE, Burnette JM, Hancock CN. Precise repair of mPing excision sites is facilitated by target site duplication derived microhomology. Mobile DNA. 2015;6(1):15. doi: 10.1186/s13100-015-0046-4
  • 102. Cavrak VV, Lettner N, Jamge S, Kosarewicz A, Bayer LM, Scheid OM. How a retrotransposon exploits the plant’s heat stress response for its activation. PLoS Genetics. 2014;10(1):e1004115. doi: 10.1371/journal.pgen.1004115
  • 103. Gent JI, Ellis NA, Guo L, Harkess AE, Yao Y, Zhang X, et al. CHH islands: de novo DNA methylation in near-gene chromatin regulation in maize. Genome Research. 2013;23(4):628–637. doi: 10.1101/gr.146985.112
  • 104. Li Q, Gent JI, Zynda G, Song J, Makarevitch I, Hirsch CD, et al. RNA-directed DNA methylation enforces boundaries between heterochromatin and euchromatin in the maize genome. Proceedings of the National Academy of Sciences. 2015;112(47):14728–14733. doi: 10.1073/pnas.1514680112
  • 105. Medstrand P, Van De Lagemaat LN, Mager DL. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Research. 2002;12(10):1483–1495. doi: 10.1101/gr.388902
  • 106. Hollister JD, Gaut BS. Epigenetic silencing of transposable elements: a trade-off between reduced transposition and deleterious effects on neighboring gene expression. Genome Research. 2009;19(8):1419–1428. doi: 10.1101/gr.091678.109
  • 107. Rodgers-Melnick E, Bradbury PJ, Elshire RJ, Glaubitz JC, Acharya CB, Mitchell SE, et al. Recombination in diverse maize is stable, predictable, and associated with genetic load. Proceedings of the National Academy of Sciences. 2015; p. 201413864. doi: 10.1073/pnas.1413864112
  • 108. Dooner HK, He L. Polarized gene conversion at the bz locus of maize. Proceedings of the National Academy of Sciences. 2014;111(38):13918–13923. doi: 10.1073/pnas.1415482111
  • 109. Bennetzen J. In: Maize Genome Structure and Evolution; 2009. p. 179–199.
  • 110. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9(1):18. doi: 10.1186/1471-2105-9-18
  • 111. Steinbiss S, Willhoeft U, Gremme G, Kurtz S. Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Research. 2009; p. gkp759. doi: 10.1093/nar/gkp759
  • 112. Llorens C, Futami R, Covelli L, Domínguez-Escribá L, Viu JM, Tamarit D, et al. The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Research. 2010; p. gkq1061. doi: 10.1093/nar/gkq1061
  • 113. Brian H, Papanicolaou A. Transdecoder (Find Coding Regions Within Transcripts); 2018. Available from: http://transdecoder.github.io.
  • 114. Eddy S. Hmmer 3.1; 2018. Available from: http://hmmer.org/.
  • 115. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution. 2013;30(4):772–780. doi: 10.1093/molbev/mst010
  • 116. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PloS One. 2010;5(3):e9490. doi: 10.1371/journal.pone.0009490
  • 117. Bergman CM, Bensasson D. Recent LTR retrotransposon insertion contrasts with waves of non-LTR insertion since speciation in Drosophila melanogaster. Proceedings of the National Academy of Sciences. 2007;104(27):11340–11345. doi: 10.1073/pnas.0702552104
  • 118. Fiston-Lavier AS, Vejnar CE, Quesneville H. Transposable element sequence evolution is influenced by gene context. arXiv preprint arXiv:1209.0176. 2012.
  • 119. Brookfield JF, Johnson LJ. The evolution of mobile DNAs: when will transposons create phylogenies that look as if there is a master gene? Genetics. 2006;173(2):1115–1123. doi: 10.1534/genetics.104.027219
  • 120. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2018;xx:xxx–xxx.
  • 121. R Core Team. R: A Language and Environment for Statistical Computing; 2018. Available from: https://www.R-project.org/.
  • 122. Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, Gentleman R, et al. Software for computing and annotating genomic ranges. PLoS Computational Biology. 2013;9(8):e1003118. doi: 10.1371/journal.pcbi.1003118
  • 123. Walley JW, Sartor RC, Shen Z, Schmitz RJ, Wu KJ, Urich MA, et al. Integration of omic networks in a developmental atlas of maize. Science. 2016;353(6301):814–818. doi: 10.1126/science.aag1125
  • 124. Kryuchkova-Mostacci N, Robinson-Rechavi M. A benchmark of gene expression tissue-specificity metrics. Briefings in Bioinformatics. 2016;18(2):205–214.
  • 125. Zhang Y, Ngu DW, Carvalho D, Liang Z, Qiu Y, Roston RL, et al. Differentially regulated orthologs in sorghum and the subgenomes of maize. The Plant Cell. 2017;29(8):1938–1951. doi: 10.1105/tpc.17.00354
  • 126. Schnable J. Grass Syntenic Gene List sorghum v3 maize v3/4 with teff and oropetium v2. 2019.
  • 127. Ogut F, Bian Y, Bradbury PJ, Holland JB. Joint-multiple family linkage analysis predicts within-family variation better than single-family analysis of the maize nested association mapping population. Heredity. 2015;114(6):552–563. doi: 10.1038/hdy.2014.123
  • 128. Monaco MK, Stein J, Naithani S, Wei S, Dharmawardhana P, Kumari S, et al. Gramene 2013: Comparative plant genomics resources. Nucleic Acids Research. 2014;42(D1):D1193–D1199. doi: 10.1093/nar/gkt1110
  • 129. Murray K, Müller S, Turlach B. Fast and flexible methods for monotone polynomial fitting. Journal of Statistical Computation and Simulation. 2016;86(15):2946–2966. doi: 10.1080/00949655.2016.1139582
  • 130. Rodgers-Melnick E, Vera DL, Bass HW, Buckler ES. Open chromatin reveals the functional maize genome. Proceedings of the National Academy of Sciences. 2016;113(22):E3177–E3184. doi: 10.1073/pnas.1525244113
  • 131. Eichten SR, Vaughn MW, Hermanson PJ, Springer NM. Variation in DNA methylation patterns is more common among maize inbreds than among tissues. The Plant Genome. 2013;6(2). doi: 10.3835/plantgenome2012.06.0009
  • 132. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics. 2009;10(1):232. doi: 10.1186/1471-2105-10-232
  • 133. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–842. doi: 10.1093/bioinformatics/btq033
  • 134. Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2(3):18–22.
  • 135. Goldstein A, Kapelner A, Bleich J, Pitkin E. Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics. 2015;24(1):44–65. doi: 10.1080/10618600.2014.907095
  • 136. Greenwell BM. pdp: An R Package for Constructing Partial Dependence Plots. The R Journal. 2017;9(1):421–436. doi: 10.32614/RJ-2017-016

Decision Letter 0

Kirsten Bomblies, Jesse Hollister

17 Sep 2019

Dear Dr Stitzer,

Thank you very much for submitting your Research Article entitled 'The Genomic Ecosystem of Transposable Elements in Maize' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. In particular, the revision should address the specific comments of Reviewer 1 regarding clarity of figures, description of transposable element taxonomic categories for the general reader, and statistical support for some assertions. Additionally, both reviewers point out that the "genome ecology" concept, while intriguing, is largely employed in this manuscript as an analogy rather than a mode of inference. This should be addressed in a revised manuscript. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Jesse Hollister

Guest Editor

PLOS Genetics

Kirsten Bomblies

Section Editor: Evolution

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This is a “bird’s eye” view of transposons in the maize genome. It makes the assertion that the maize genome constitutes an ecosystem, in which different families of elements occupy distinct niches. Certainly, the sheer diversity of elements with respect to their location, methylation and expression supports this idea. I’m not sure the authors really support the idea that elements actually interact with each other, which weakens the argument. For instance, evidence that one family actually negatively affects the activity of another family would support the hypothesis, as would evidence that some elements tend to insert into other elements rather than themselves. However, evidence of cooperation between families (Cinful and Zeon) is intriguing. Based on the data supplied here, I would suggest that the genome ecology hypothesis is, at this point, mostly a useful metaphor, and a good place to start for further analysis. As indicated below, I think presentation of the data in the figures could be improved for greater clarity. (If I was confused, I’m guessing other readers will be confused as well.) I would have also liked to have superfamilies and families more precisely defined, and I would have liked to have had instances in which the variation observed really does support an ecological model. Overall, because the subject is so broad (hard to believe when we are looking at a single genome!), I think the authors did a pretty good job, particularly since all of the raw data and code have been made available. I very much hope that this information is integrated into TE models in the current maize genome version, since it’s a mess right now. Each section could be (and likely will be) a manuscript on its own, but this manuscript opens doors for numerous other, more in-depth analyses.

Line 28. That’s a pretty big claim. The manuscript should support it, but I’m not sure it does.

Line 41. A quibble, but maybe say maize genome?

Line 93. As far as I know, this is still a hypothetical mechanism for eukaryotic Helitrons, since no one has demonstrated this mechanism in these organisms. Best to say, “are thought to...”

Line 176. It might be useful to remind the reader as to how orders and superfamilies are defined. Just a couple of sentences for those not familiar with the terminology.

Line 190. Typo.

Line 209. Not sure it’s worth doing, but it would be interesting to know if some TEs are more attractive targets for insertion - this would fit into a competition model. For instance, do some TEs avoid insertions into themselves? That would be kind of cool.

Line 213. Again, it sure would be interesting to see if there are any biases in insertions. I’m not trying to make more work for the authors, but this would be really interesting. Also, given my admittedly anecdotal evidence, I’m surprised that these TIR elements are not closer to genes.

Figure 2. For the non-specialist, the names in the figure don’t mean much, so it is hard for them to get the major point, other than length of different TEs varies. Perhaps it would be worth adding something in the legend saying, “Element designations beginning in a DT represent TIR DNA elements, those beginning in a DH are helitrons,” and so on. The same perhaps in the other Figures with these designations? I know it’s redundant, but I think it’s always a good idea to keep the reader oriented, so that if they see an exceptional family, they will know right away what kind of element it is.

Figure 3. It is very difficult to match the color designations below with the graphics above. In addition to the label, I would go ahead and individually label each tract (particularly for color blind readers, this would be a big help).

Line 219. Again, a reasonable claim is made concerning length, but you could do the math and see if that is the case. That is, is the length of an element the sole determinant of whether or not it has an insertion into it?

Line 233. “Arrived in the genome?” Does that mean that they were horizontally transferred from another species? Do you mean that a given copy of these elements at a given position is 350,000 years old?

Line 249. Previous analysis has suggested that the average distance of a TSS to a TE is roughly 300 bp (with a lot of variation). On net, these distances seem much farther. Do you have a calculation of the average distance upstream of TSSs of TEs?

Line 254. It might be useful to remind the reader as to how families are defined.

Line 289. “estimated with LTR-LTR” should be “estimated using LTR-LTR”. Also, if you are going to say this, then how are the ages of the other elements established? I’m assuming by divergence from the consensus sequence of a given family. Perhaps it would be a good idea to reference your materials and methods here, since dating TEs is a non-trivial task.

Line 294. Well, yes and no, since you are looking at a combination of any restrictions and allowances and selection following insertion. It’s important that this distinction is made clear.

Line 302. Which proteins? Those that are known to be functional (hAT, CACTA and MULEs have functional transposases) and proteins that are full length and are potentially functional?

Line 315. Confusing wording. 0.6% of families within a superfamily (implied by the wording), or 0.6% of elements within each family?

Line 319. Should read “potentially autonomous elements”. Autonomy is a functional designation.

Figure 5. As with the other figures, this figure could be more clear. In 5A and B, since each dot has a range around it, I’m assuming there are multiple proteins represented. Are these families (i.e. RLG0001)? And for each color, all of the families within each superfamily that encode a protein are shown? I get it now, but I think the more explicit you can be the better. In 5C, “Presence of GAG, all five domains (GAG and Pol), and Pol (which encodes four domains) in LTR retrotransposons.” So, GAG has five domains that includes a Pol, and Pol (a different Pol?). A bit confusing. Why are all the superfamilies listed with and without colors? What is the point of the list in which they are black? In 5G, it’s not clear how the two axes are organized. Is the X axis organized by similarity of patterns of expression? If so, how is the Y axis organized? The color bar on the left seems to suggest that patterns of expression are more or less randomly distributed, but other than that, I’m not sure how informative this is. And what is “sup”?

Line 354. Maybe or maybe not, since you are dividing the expression level by the total copy number, so higher copy number elements will inevitably show lower per element expression levels. It’s certainly possible that a small subset of a large family is expressing at a relatively high level. It might have been useful to only look at RNAseq data that maps perfectly to intact elements (good tsds, good ORFs), but there are problems with this as well.

Line 372. It is well known that tissue specificity in pollen is likely due to transient relaxation of silencing in the vegetative nucleus and is related to potential activity rather than actual activity. The tough part of this analysis is that TEs are often transiently expressed in response to stresses, which are not represented in this panel. Actually, if data existed for stressed plants in a subset of tissues, you might have gotten some interesting data. As it is, I think the most you can say is that most TEs, under normal conditions, are probably expressed at very low levels.

Line 397. Did the level of methylation correlate with expression level? Also, it is worth pointing out that CHH methylation is largely restricted to TEs adjacent to genes, so position, rather than the type of element, can be important. If a given family (e.g. MULEs) is closer to genes, its members are more likely to have CHH methylation. This may or may not have to do with how effectively they are silenced.

Line 399, 400. typos.

Line 399. I don’t know what this means. Preferential insertion based on methylation?

Line 400. Or that they insert into regions that are already methylated. How many of the flanking sequences are methylated because they are TEs? For many of these flanking sequences, the only way to know for sure if the TE is causing an effect is if it is polymorphic. I believe the W22 methylome and sequence are available; a comparison would have been illuminating.

Line 415. Yes, because these are TEs that nucleate CHH islands.

Figure 6. Now the colors are no longer designated at all. Here and throughout, I would indicate right in the figure, above the data, what the superfamilies are. Also, since the point of this figure is, in general, to compare the TE and the flanking sequences, why not put B directly under A so that a visual comparison is possible? Frankly, although I accept the overall conclusions, with this and other figures I had trouble getting the point, other than that different families have differences in methylation in the TEs and their flanking sequences.

Line 413. This certainly seems to be true for some, but certainly not for all. Here and throughout some statistical support for what seems visually apparent would be helpful. Also, were the flanking sequences filtered for being TEs or not? What would TEs inserted into other TEs look like? What would families that are more likely to be inserted into other TEs look like?

Line 431. Are these older elements where there was a lot of C to T conversion due to methylation?

Line 477. Statistical support for a difference?

Line 486. The maize genome is notorious for having large numbers of genes that are unlikely to actually be genes. I wonder what this analysis would look like if only syntenic genes were examined. I think this would be more informative.

Line 494. This kind of makes sense, since it is likely that some elements insert into regions of relatively open chromatin and are thus more likely to be expressed.

Line 504. Interesting. Statistically significant?

Line 536. What about the effects of C to T mutations due to methylation?

Line 565. Is this because there are more likely to be more polymorphisms?

Line 571. This is surprising, given what we know about selection against TEs near genes and the fact that CHH is mostly associated with TEs near genes.

Line 616. There are better, more recent references for this.

Line 631. This is reasonable, but what if selection acts more efficiently to remove large LTR insertions near genes?

Line 633. This is also supported by de novo insertion data.

Line 650. That is fascinating. I had no idea that families diversified so quickly. Then again, this depends on how families are defined.

Line 695. Awkward wording.

Line 726. Do you mean newly inserted?

Line 732. Any evidence for that?

Line 738. Any evidence that Zeon and Cinful work together? This is an intriguing hypothesis, but it is speculative and should be presented as such. Still, this is a really intriguing idea!

Line 779. I’m not clear as to why this is not a somewhat trivial observation.

Line 813. As is evidenced by the very low recombination rates outside of genes in maize.

Reviewer #2: This paper provides a detailed, rigorous description of the properties, abundances, distributions, and ages of transposable elements in the maize genome. The analyses are impressive and, in general, the paper is well-written (but see below for some significant issues). I think it will make a valuable contribution to the literature. However, I do have some comments for revision, which I present in the spirit of improving an already strong paper before publication.

Major points:

- I think the data presented in this paper are very useful, and the paper provides a very detailed overview of TE content in the maize genome. However, I struggle to see just how this is an example of "genome ecology" properly understood, for two reasons:

1) The paper does not actually use any methods from ecology. So, here -- as in most previous uses of the concept -- "ecology" serves as analogy rather than analysis. It's a bioinformatics study that describes TE properties and then layers on the ecological analogy. This is fine, but it may be worth noting that this is what is going on as opposed to, say, actually using ecological methods with genome data.

Examples of studies that are explicitly "genome ecology" in methodology and not just metaphor include:

Saylor et al. (2013). A novel application of ecological analyses to assess transposable element distributions in the genome of the domestic cow, Bos taurus. Genome 56: 521-533.

Serra et al. (2013). Neutral theory predicts the relative abundance and diversity of genetic elements in a broad array of eukaryotic genomes. PLOS One 8: E63915.

Linquist et al. (2015). Applying ecological models to communities of genetic elements: the case of neutral theory. Molecular Ecology 24: 3232-3242.

2) As noted by Linquist et al. (2013; Bio. Rev. 8: 573-584), most previous examples of "genome ecology" were actually "genome evolution". They used population genetics models or phylogenetics, focused on evolutionary timescales, involved evolution of TEs and genomes, and so on. This paper actually tends to drift into this category as well. Notably, most of the section on "The Family-level Ecology of the Genome" talks about evolutionary mechanisms and patterns; there is actually very little ecology discussed.

- The Introduction is very well written, and it is refreshing to see a nuanced overview of TEs rather than the trope of "long dismissed as junk DNA...". Where the paper needs work is in the Results and Discussion. Here, the paper suffers from what is sometimes called "thesis syndrome", where a paper for publication reads very much like a graduate thesis chapter.

1) Results -- This section contains way too much material that should be in the Discussion (or Introduction). Interpretations of findings, reviews of previous literature, etc., do not belong in the Results section. Here, the findings should be presented without interpretation, allowing the reader to agree with or challenge the patterns reported independently of whether they agree with your interpretation.

2) Discussion -- This section spends far too much time discussing the work of others. The Discussion in a data paper (versus a review paper) should discuss your results first and foremost. Citing the literature should mostly be done to put your findings in broader context (after you have discussed them, not "so and so said... and we found that too") and/or to back up an interpretation that you wish to present or a claim you are making. Major sections of the Discussion are more like a review paper than a discussion of the results in a data paper.

Minor comments:

- Despite what elementary school students are taught, there is no specific grammatical rule that forbids the use of "But" to start a sentence. However, doing so 10 times in a single paper is excessive and inelegant (the tendency to over-use it is probably why elementary school students are told not to do it at all).

- p.7 -- "(Figure 2; Interactive distributions per family: )." Not sure what this is supposed to say.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Decision Letter 1

Kirsten Bomblies

13 Jul 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Stitzer,

Thank you very much for submitting your Research Article entitled 'The Genomic Ecosystem of Transposable Elements in Maize' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Kirsten Bomblies

Section Editor: Evolution

PLOS Genetics

Thanks for your edits! The paper is much improved. The reviewer however makes some remaining comments and I think that some of these are worth addressing. So I am giving this a "minor revision" - it will not need to go back out for review after these changes. Thanks!

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Please note that I think the manuscript is fine as it is (although I do think some aspects of figure 7 could be clearer). My comments below are simply meant to provide food for thought.

Line 176. I think I get what the authors are getting at here, but it occurs to me from the text here, as well as the other reviewer’s comments, that I’m actually not entirely clear as to the distinction between ecological and evolutionary processes. That’s probably just because I’m not a specialist in either field, but what exactly is the distinction? In this context, does “ecological” mean the interaction between different TEs and between TEs and their hosts? I’m assuming here that it does not mean the effects that TEs have on ecological processes occurring at the level of the host. Does it also mean interactions that have an impact on survival of a given TE lineage? A Google definition (sorry, I mentioned that I’m not a specialist) of niche is “all of the interactions of a species with the other members of its community, including competition, predation, parasitism, and mutualism. A variety of abiotic factors, such as soil type and climate, also define a species’ niche.” What are the analogies here? Taking a gene-centric view, I suppose each group of TEs that can cross mobilize constitutes a “species”, and TEs “speciate” when one lineage gives rise to two lineages that can no longer cross mobilize. Individuals would be individual stretches of DNA that can replicate (TEs), or can contribute to host function (host genes). However, with TEs it gets more squirrelly, since any given insertion is, in situ, “alive” only to the extent that it retains the capacity to replicate, which is a function of its particular niche. Or something like that. Ecological functions would have to do with all of the particular features that would allow a given TE (or, more properly, TE lineage) to thrive over time, as well as the impact that this TE has on both the other TEs and on host fitness. In contrast, Evolutionary process would be what? Time scale? Neutral processes that result in large changes in genome architecture that are random with respect to ecological processes (at the genome level)?

Would this be an example of the distinction? Some TEs target heterochromatin (and thus often other TEs) and some target TSSs by tethering to components of the POLII complex. These two distinct strategies represent “niches”, with both costs and benefits, so this is Ecological. In contrast, Gout has shown that selection over time tends to remove TEs from regions immediately adjacent to genes, so TEs near genes tend to be younger. So if you have one TE family that is older than a second family, the older family will be distributed differently than the younger family even if the TEs have exactly the same targeting initially. These TEs do not occupy distinct “niches”; the differences are purely a function of selection at the level of the host and time. So in figure 3, if age were mapped on to these distributions, you might find a correlation between age and presence in the pericentromeres (CACTAs are, as I recall, older and found more frequently in these regions). Similarly, one can imagine a variety of stochastic events that end up favoring a given TE family and that have nothing to do with particular strategies employed by a given TE that impact distribution. Horizontal transfer, for example, that moves a TE from a highly suppressive environment to a more permissive environment. These would be “Evolutionary” processes. Please forgive me for going on, but I’m just trying to give the authors a sense of what comes up for me when reading the text. Hopefully it is of some use to the authors.

Line 208. Thank you for not using the word “respectively” : )

Line 246. Is “years ago” correct? Is the median age of people in nursing homes 82 years ago?

Line 267. Good call. Since plant genomes have such a high turnover rate, even if a given lineage did “arrive in the genome” a very long time ago, we could not know when.

Line 299. This makes sense, as the copy number of older families, and the overall number of families as defined would go up as time passed. So if you started with a massive bloom and 10,000 copies, all of which were DOA, then, as copies were deleted and the sequences diverged to about 80% similar, you would end up with hundreds of distinct “families”, each with a very low copy number.

Line 303. Both the TIR and helitron examples make me think about gene capture. In rice, there are about 3000 Pack Mules, the average copy number of each of which is about two. Because they do not share the same internal sequences, they could be placed within different families (as defined), but many of them could be the result of capture events mediated by a particular autonomous MULE, which could theoretically mobilize them. Mu elements in maize are a clear example of this. The autonomous element has high homology at the TIRs with the non-autonomous elements but none in the internal sequences. Nevertheless, when the autonomous element is active, all of the non-autonomous elements can replicate. So I’m not sure if they are part of the same “species” or not. Or perhaps the non-autonomous elements can be thought of as parasites? If so, then how does this fit into an ecological model?

Figure 4. What really jumps out for me are two things. First, many of the superfamilies appear to have experienced big jumps in copy number recently. Is this really a thing? Second, it really looks like the key distinction with respect to age is at the family, not the superfamily level. Interesting.

Line 368. A quibble, but saying “partition” suggests that this is a distribution derived from functional constraints rather than a consequence of random loss of function within multiple copies.

Line 501. A median of zero is just zero, right?

Line 508. Again, not for this publication, but I wonder if the proportion of converted cytosines in the CG and CHG versus CHH contexts tells a history about a given element. CHH islands tend to be near genes, so a TE lineage that has spent a lot of time near genes could have a different proportion than a TE lineage that didn’t.

Line 550. “...by family, and for example the two..” awkward

Line 562. One wonders if this is because many of the “genes” near helitrons are actually transduplicated parts of the helitron.

Line 567. Again, I’m probably just being thick, but a median of zero means half were above zero and half were below (which they can’t be), so none of them expressed at all?

Figure 6. In legend. D-F?

Line 635. That’s a bit surprising given figure 4, which suggests that there is more variation at the family level.

Line 655. It’s not clear what each line on the Y axis represents here. Does each correlate with the Y axis names to the left? F-G. What do the colors correspond to? The colors of the labels of families in C? I’m afraid these need a bit more explanation in the legend.

Line 747. And by analysis of de novo, unselected insertions.

Line 779. See also https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6763-1

Line 794. See my comment earlier about partitioning.

Line 952. “elaborate genome” : )

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Decision Letter 2

Kirsten Bomblies

10 Aug 2021

Dear Dr Stitzer,

We are pleased to inform you that your manuscript entitled "The genomic ecosystem of transposable elements in maize" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Kirsten Bomblies

Section Editor: Evolution

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Thanks for your edits! The paper looks really nice and we are happy to see it in our journal.

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-01176R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Kirsten Bomblies

4 Oct 2021

PGENETICS-D-19-01176R2

The genomic ecosystem of transposable elements in maize

Dear Dr Stitzer,

We are pleased to inform you that your manuscript entitled "The genomic ecosystem of transposable elements in maize" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Family characteristics of each of the largest 10 families of each superfamily with at least 10 copies.

    (A) Proportion of TEs within the transcript of a gene, including introns and UTRs. (B) TE span along the genome, summing both the base pairs of the TE and the base pairs of the TEs nested within it. (C) Proportion of TEs that are intact, that is, uninterrupted by the insertion of another TE. In (A and C), families are shown as points and superfamily proportions as a barplot, and in (B) families are shown with medians as points and lines representing ranges of upper to lower quartiles, with superfamilies shown as colored rectangles.

    (TIF)

    S2 Fig. Chromosomal distribution of superfamilies across all 10 maize chromosomes.

    Count of TE copies of each superfamily in 1 megabase bins across each chromosome.

    (TIF)

    S3 Fig. Distribution on chromosome 1 of five largest families with at least ten copies in each superfamily.

    Count of TE copies in 1 megabase bins along chromosome 1. (A) DHH, (B) DTA, (C) DTC, (D) DTH, (E) DTM, (F) DTT, (G) DTX, (H) RLC, (I) RLG, (J) RLX, (K) RIL, (L) RIT, (M) RST. Note that some families have no copies on chromosome 1, including DTT10880 and DTX10177. Additionally, the RIT superfamily only has two families.

    (TIF)

    S4 Fig. Ages in 10,000 year bins across each of the largest 10 families of each superfamily with at least 10 copies.

    (A) DHH, (B) DTA, (C) DTC, (D) DTH, (E) DTM, (F) DTT, (G) DTX, (H) RLC, (I) RLG, (J) RLX, (K) RIT, (L) RIL, (M) RST. The RIT superfamily only contains two families.

    (TIF)

    S5 Fig. LTR-LTR ages and terminal branch length ages for LTR retrotransposons.

    Ages in 10,000 year bins across each of the largest 10 families of each superfamily with at least 10 copies. Left plots (A-D) show LTR-LTR ages, right plots (E-H) show terminal branch length (TBL) ages. (A) all copies, LTR-LTR, (B) RLC families, LTR-LTR, (C) RLG families, LTR-LTR, (D) RLX families, LTR-LTR, (E) all copies, TBL, (F) RLC families, TBL, (G) RLG families, TBL, (H) RLX families, TBL.

    (TIF)

    S6 Fig. LTR-LTR ages vs. terminal branch length ages for LTR retrotransposon superfamilies.

    Spearman’s correlation coefficient shown on plot for each superfamily.

    (TIF)

    S7 Fig. Age of TE copies split by coding potential of self and family.

    Violin plots with three lines, at median and 25th and 75th percentile. Only ages younger than 2 million years are shown. “Coding copy” refers to those copies that code for protein, “noncoding copy” refers to those copies that don’t code for protein, but a family member does, and “noncoding family” refers to copies from families without a coding member in B73.

    (TIF)

    S8 Fig. Methylation in TE and flanking sequence, across tissues.

    A-J: mCG; K-T: mCHG; U-end: mCHH. Tissues on y-axis, from top to bottom: Anther, SAM (shoot apical meristem), Earshoot, Flagleaf, Seedling leaf.

    (TIF)

    S9 Fig. Features of the TE and flanking sequences.

    GC content in the TE (A) and 1kb flanking sequence (B). Proportion of sites methylatable in the CG context in the TE (C) and 1kb flanking sequence (D), methylatable in the CHG context in the TE (E) and 1kb flanking sequence (F), proportion of sites methylatable in the CHH context in the TE (G) and 1kb flanking sequence (H). Proportion of sites containing a TG or CA dinucleotide in the TE (I) and 1kb flanking sequence (J). Proportion of sites in MNase hypersensitive regions in root in TE (K) and 1kb flank (L), and shoot in TE (M) and 1kb flank (N). Proportion of segregating sites in the TE (O) and 1kb flank (P).

    (TIF)

    S10 Fig. The proportion of methylatable cytosines is negatively correlated with the proportion of TG/CA dinucleotides.

    The x-axis reflects the proportion of cytosines in a CG context within the TE, and the y-axis reflects the proportion of dinucleotides in the TE that contain a TG or CA.

    (TIF)

    S11 Fig. Recombination, subgenome, and expression of closest gene.

    (A) Recombination rate across the TE, (B) proportion of TEs in subgenome A, (C) log10 median expression of the closest gene to each TE, (D) log10 median expression of genes within 1kb of the TE, (E) Tau of closest gene to each TE, (F) log10 median expression of the closest syntenic gene, (G) log10 median expression of closest syntenic genes within 1 kb, and (H) Tau of the closest syntenic gene.

    (TIF)

    S12 Fig. Protein coding gene presence of individual LTR GAG and POL domains.

    Shown are (A) the proportion of TEs with evidence of agglutination factor (GAG) domain present, (B) the proportion of TEs with evidence of all polyprotein domains present (aspartic proteinase, integrase, reverse transcriptase, and RNaseH), (C) the proportion of TEs with both GAG and Polyprotein present in the same element. Families are shown as points and superfamily proportions as barplot.

    (TIF)

    S13 Fig. Predicted and observed relationship of age to TE length and distance to gene.

    Raw relationship (A & C) and predicted relationship (B & D) of TE length (A & B) and distance to gene (C & D).

    (TIF)

    S1 Table. Ten largest families in each superfamily, as shown left to right in plots.

    (TXT)

    S2 Table. Categories that each feature measured for each TE is classified into.

    (TXT)

    S3 Table. 14 families with at least 10 copies in the B73 genome, with at least 75% of copies coding for transposition related proteins.

    (TXT)

    S4 Table. 842 families with at least 10 copies in the B73 genome that lack coding representatives.

    (TXT)

    S5 Table. Mean methylation levels across superfamilies, averaged across all tissues, and averaged within a tissue.

    (TXT)

    S6 Table. TE families that lack methylatable cytosines (presented as family median values).

    (TXT)

    S1 Text. TIR annotation methods.

    (PDF)

    Attachment

    Submitted filename: ecology_of_the_genome_RevisionResponse.pdf

    Attachment

    Submitted filename: response_to_second_review.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files. Scripts for generating summaries from data sources and links to summarized data are available at http://www.github.com/mcstitzer/maize_genomic_ecosystem.


    Articles from PLoS Genetics are provided here courtesy of PLOS