Abstract
The growing availability of genomes from non-model organisms offers new opportunities to identify functional loci underlying trait variation through comparative genomics. While cis-regulatory regions drive much of phenotypic evolution, linking them to specific functions remains challenging. We identified 514 cis-regulatory motifs enriched in regulatory regions of five diverse grass species, with 73% consistently enriched across all, suggesting a deeply conserved regulatory code. Leveraging 57 new contig-level genome assemblies, we then quantified shared occupancy of specific motif instances within gene-proximal regions across 589 grass species, revealing widespread gain and loss over evolutionary time. Shared occupancy declined rapidly over the first few million years of divergence, yet ∼50% of motif instances were shared back to the origin of grasses ∼100 million years ago. We used phylogenetic mixed models to identify motif gains and losses associated with ecological niche transitions. Our models revealed significant environmental associations across 1282 motif–orthogroup combinations, including convergent gains of HSF/GARP motifs at an alpha-N-acetylglucosaminidase gene associated with occurrence in temperate environments. Our findings support a “stable motifs, variable binding sites” model in which cis-regulatory evolution involves turnover of thousands of individual binding site instances while largely preserving transcription factors’ binding preferences. Our results highlight the potential of comparative genomics and phylogenetic mixed models to reveal the genetic basis of complex traits.
Keywords: regulatory evolution, plants, comparative genomics, cis-regulation
Introduction
Cis-regulatory changes are arguably the most important genetic mechanism of evolutionary innovation, particularly over relatively shallow evolutionary time scales (King and Wilson 1975; Carroll 2008). While the trans-acting regulators such as transcription factors (TFs) typically have widespread effects genome-wide (Signor and Nuzhdin 2018), the evolution of cis-regulatory regions enables finely tuned gene expression evolution at individual genes (Wittkopp et al. 2004; Prud’homme et al. 2007; Wray 2007). Variation within noncoding regions explains roughly half of additive trait variance within maize populations with much of this attributable to open chromatin regions found within 1 kbp of genes (Rodgers-Melnick et al. 2016), underscoring the importance of gene-proximal regulatory regions in driving phenotypic evolution within species.
TF binding sites (TFBS) are key components of noncoding regulatory regions (Schmitz et al. 2021). TFBS are typically composed of a core DNA sequence motif that is recognized and bound by its cognate TF. In plants, TFBS variants have been shown to be associated with the evolution of traits including inflorescence branching (Hendelman et al. 2021), abiotic stress tolerance (Jiang et al. 2022; Zeng et al. 2025), and fruit shape (Hu et al. 2024). Variants in the 5′ UTR and proximal promoter region are often particularly important for modulating expression levels (Cui et al. 2023; Voichek et al. 2024). Despite the importance of these TF binding variants in driving phenotypic evolution, characterizing key variants remains a major challenge. TF binding assays such as Chromatin Immunoprecipitation Sequencing (ChIP-seq), MNase-defined cistrome-Occupancy Analysis (MOA-seq), and DNA affinity purification sequencing (DAP-seq) have allowed precise characterization of where TFs bind throughout plant genomes (O’Malley et al. 2016; Savadel et al. 2021; Marand et al. 2023), but adapting them to work in non-model systems can be difficult. In particular, the increasing number of sequenced taxa requires scalable methods to discover key cis-regulatory features.
Characterizing evolutionary constraint and convergence across large numbers of taxa has emerged as a powerful strategy for determining genetic function from DNA sequence alone (Smith et al. 2020). Conserved noncoding sequences have been used to identify highly important cis-regulatory features across non-model taxa (Haudry et al. 2013; Partha et al. 2017; Song et al. 2021; Stitzer et al. 2025). A limitation of this approach, however, is that many key cis-regulatory regions cannot be aligned reliably (Schmidt et al. 2010; Phan et al. 2025). Additionally, the extent of nucleotide-level conservation does not reliably predict the degree of shared function (Yang et al. 2015; Wong et al. 2020). Characterization approaches that do not depend on nucleotide-level conservation can offer a broader view of cis-regulatory evolution (Kaplow et al. 2022, 2023).
Due to these advances, our understanding of cis-regulatory evolution and its phenotypic implications is rapidly increasing. Still, some of the fundamental principles of regulatory evolution remain incompletely understood. A growing body of evidence indicates that TFs and their binding site preferences remain largely intact within lineages, suggesting a conserved “regulatory code” (Stergachis et al. 2014; Nitta et al. 2015; Tu et al. 2020). Changes to the regulatory code can occur via expansions or contractions of TF families that lead to the birth or extinction of particular TF binding preferences (Lehti-Shiu et al. 2017). Across deep evolutionary divergence, such as between plants and animals, widespread differences are observed between TFs and their binding preferences (Riechmann et al. 2000). The field lacks a clear consensus on the extent to which changes to TF binding preferences occur over varying time scales, as well as how and when such changes occur. A deeper understanding of regulatory code evolution will inform how to effectively transfer genetic knowledge across species.
Another emerging model of regulatory evolution is that while TF binding preferences tend to be deeply conserved, individual instances of TFBS are extensively gained and lost over evolutionary time. We refer to this as the “stable motifs, variable binding sites” hypothesis. Previous studies have estimated that only 20% of TFBS instances are conserved between humans and mice (Stergachis et al. 2014), and 20% to 40% of GOLDEN2-LIKE binding sites were found to be conserved between maize and rice (Tu et al. 2022). Relatively few studies so far have traced TFBS evolution across a large number of taxa (but see Andrews et al. 2023) or associated particular gain and loss events with diversification and adaptation at the macroevolutionary scale.
Grasses (Poaceae) have diversified widely over roughly 100 million years of evolution (Gallaher et al. 2022), occupying habitats ranging from tropical savannas to the Arctic tundra (Bouchenak-Khelladi et al. 2010; Edwards and Smith 2010; Spriggs et al. 2014; Lehmann et al. 2019). Grasses’ ecological diversification has been intertwined with repeated innovations in life history strategies (annual vs. perennial) (Kellogg 2015), photosynthetic pathway transitions (Grass Phylogeny Working Group II 2012), and photoperiod changes (Fjellheim et al. 2014). Despite these significant shifts, aspects of genome structure such as gene content and collinearity are largely preserved across grass lineages (Bennetzen and Freeling 1993; McSteen and Kellogg 2022; Mascher et al. 2024), although ploidy (Stebbins 1985; Zhang et al. 2024), genome size (Bennett and Smith 1976), and gene regulatory patterns (Meng et al. 2021; Stitzer et al. 2025) vary widely across the family. Comparative genomic analyses across grasses are therefore well-suited to illuminate how broad and repeated environmental adaptation unfolds in complex genomes (Buell 2009).
We hypothesized that gain/loss of many individual TFBS across the genome—rather than changes at a few master regulators—underlies grass diversification, enabling grasses to fine-tune gene regulation while maintaining regulatory logic. To test this hypothesis, we performed large-scale comparative genomic analyses across a collection of 727 genome assemblies representing 589 grass species. After establishing that grasses share a conserved set of TF motifs enriched in cis-regulatory regions, we quantified gain and loss of motif instances within gene-proximal regions (500 bp or 1 kbp upstream of aligned translation start sites). Using this set of motif instances, we performed association mapping across species to identify examples of motif gain and loss associated with environmental variables. We documented widespread gain and loss of motif instances as grasses diversified. Additionally, we revealed 1282 environmentally associated motif gain/loss instances with moderate to weak levels of convergence. Together, our findings support a model of cis-regulatory evolution in which adaptation proceeds over macroevolutionary time scales via hundreds to thousands of cis-regulatory changes at small-effect downstream genes rather than at a few large-effect regulators.
Results
Grasses share a deeply conserved cis-regulatory code
We hypothesized that a highly similar repertoire of motifs is enriched in cis-regulatory regions across grasses. We estimated cis-regulatory regions using unmethylated regions (UMRs) available from five species (Brachypodium distachyon, Oryza sativa, Sorghum bicolor, Zea mays, and Hordeum vulgare). These five species last shared a common ancestor before the BOP (Bambusoideae, Oryzoideae, and Pooideae) and PACMAD (Panicoideae, Aristidoideae, Chloridoideae, Micrairoideae, Arundinoideae, and Danthonioideae) clades diverged approximately 80 million years ago (Gallaher et al. 2022) (Fig. 1a). UMRs stably mark functional regulatory and genic regions of plant genomes and are rich in TFBS (Crisp et al. 2020). To determine which cis-regulatory motifs are commonly enriched in regulatory regions, we quantified UMR enrichment of 704 experimentally derived plant TF binding motifs from the JASPAR 2024 database (Rauluseviciute et al. 2024). As background, we used randomized UMR sequences with preserved dinucleotide frequencies, measuring the enrichment of each JASPAR motif in true UMRs relative to these dinucleotide-shuffled UMRs. One caveat to this approach is that, because of the high sequence similarity among motifs within some TF families, similar motifs cannot be reliably distinguished and may be “double counted” in the analysis.
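The enrichment statistic described above can be illustrated as a log2 ratio of motif hits in true UMRs to hits in shuffled UMRs. This is a minimal sketch rather than the pipeline used in the study: the function names and the pseudocount are assumptions, exact string matching stands in for position weight matrix scanning, and the shuffled sequences are assumed to come from a separate dinucleotide-preserving shuffler that is not shown.

```python
import math

def count_hits(sequences, motif):
    """Count exact occurrences of `motif` (overlaps allowed) across sequences."""
    total = 0
    for seq in sequences:
        start = seq.find(motif)
        while start != -1:
            total += 1
            # Advance by one base so overlapping occurrences are also counted.
            start = seq.find(motif, start + 1)
    return total

def log2_enrichment(true_seqs, shuffled_seqs, motif, pseudocount=1.0):
    """log2 fold change of motif hits in true versus shuffled sequences.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a motif is absent from the background.
    """
    foreground = count_hits(true_seqs, motif)
    background = count_hits(shuffled_seqs, motif)
    return math.log2((foreground + pseudocount) / (background + pseudocount))

# Toy sequences, not real UMRs.
true_umrs = ["ACGTGACGTG", "TTACGTGTT"]
shuffled_umrs = ["ACGTGTTTTT", "GTGCAGTCAG"]
fold_change = log2_enrichment(true_umrs, shuffled_umrs, "ACGTG")
```

A positive value indicates more hits in the true sequences than in the shuffled background; a real analysis would also need a significance test, which this sketch omits.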
Figure 1.
Grasses share a conserved set of UMR-enriched motifs. a) Phylogenetic tree of 589 Poaceae species. Five representative species are starred: S. bicolor (orange), Z. mays (red), O. sativa (dark green), B. distachyon (blue), and H. vulgare (dark purple). b) Enrichment of TF motifs in UMRs across five representative grass species. The intersection bars show the number of enriched motifs for each species set. Intersections with fewer than five motifs are not shown. c) Log2 fold change UMR enrichments for 114 motifs with varied enrichment patterns. Rows and columns are clustered by enrichment similarity.
Enrichment fold change correlations were 96% to 99% across all species pairs (Figure S1). Of the 514 motifs that were enriched in at least one species, 73% (377/514) were commonly enriched across all five species (Fig. 1b), with a similar set of motifs enriched in accessible chromatin regions across species (Figure S2). One hundred and fourteen motifs were UMR-enriched for at least one species but not across all five. Some of these motifs showed highly variable lineage-specific patterns; several bHLH and SPL motifs were depleted in H. vulgare and B. distachyon yet enriched in O. sativa (Fig. 1c). While these variably enriched motifs are intriguing, we focused our subsequent analyses of motif gain and loss on the set of 377 commonly enriched motifs (Table S1) to trace how instances of evolutionarily stable motifs have turned over.
Characterization of motif instances across 589 widely adapted species
To characterize cis-regulatory evolution on a large scale, we built a pipeline to quantify and compare motif occurrences across orthologous regions of hundreds of species (Fig. 2a). For this study, we used publicly available whole-genome sequencing (WGS) short reads to generate genome assemblies for 57 species. While not highly contiguous, our WGS assemblies recovered a median of 4206 genes (75%) from a set of 5592 conserved grass genes in a BUSCO-like analysis (Fig. 2b; Table S2). In total, we amassed a dataset of 727 genome assemblies representing 589 diverse grass species using the 57 new WGS assemblies along with 211 public genome assemblies, 368 newly generated short-read assemblies (Schulz et al., unpublished data), 58 short-read assemblies from Schulz et al. (2023), and 33 highly contiguous assemblies from Stitzer et al. (2025) (Figure S4 and Table S2). Our dataset captures wide and repeated environmental adaptation across the grass family (Hsu et al., unpublished data).
Figure 2.
Summary of TF motif profiling across 727 assemblies. a) Depiction of bioinformatic pipeline to identify TF motif instances upstream of orthologous genes across taxa. Motif scanning was performed 500 bp upstream of the translation start sites of orthologous genes. b) Presence of single-copy grass orthologs (“TABASCO genes”) across 57 newly generated contig-level genome assemblies. The proportion of complete, duplicated, fragmented, and missing genes out of 5592 total TABASCO genes is shown. c) Distribution of taxa representation across 21,387 orthogroups in our dataset. The median number of taxa per orthogroup is plotted as a dashed vertical line.
To enable cross-species comparisons, we identified orthologous genes across taxa by querying a set of ancestrally reconstructed protein sequences against each assembly and retaining the primary alignment. We retained 21,387 orthogroups after filtering, each containing an average of 390 taxa (SD = 186) (Fig. 2c), indicating a high level of taxonomic representation suitable for robust comparative analyses. We scanned intervals 500 bp upstream of the aligned translation start sites for the set of 377 motifs with conserved UMR enrichment. While many important regulatory elements are present further upstream (or downstream) from genes, we considered only 500 bp upstream to maximize sample size given the limited contig lengths available in our short-read assemblies. To reduce redundancy in subsequent analyses, we collapsed overlapping instances of similar motifs into a single merged interval, using 35 motif clusters delineated by JASPAR 2024 (Table S1). The 35 motif cluster types differed widely in abundance and variability across taxa (Figure S5).
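The collapsing step described above amounts to merging overlapping intervals within each motif cluster. The following sketch shows one way this could be done; the function name, the tuple layout, and the use of half-open coordinates are assumptions for illustration, and the cluster labels and positions are toy values.

```python
from collections import defaultdict

def merge_motif_hits(hits):
    """Merge overlapping hits that belong to the same motif cluster.

    `hits` is an iterable of (cluster, start, end) tuples with half-open
    coordinates [start, end). Returns {cluster: [(start, end), ...]}.
    """
    by_cluster = defaultdict(list)
    for cluster, start, end in hits:
        by_cluster[cluster].append((start, end))

    merged = {}
    for cluster, intervals in by_cluster.items():
        intervals.sort()
        collapsed = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = collapsed[-1]
            if start < last_end:
                # Overlap with the previous interval: extend it.
                collapsed[-1] = (last_start, max(last_end, end))
            else:
                collapsed.append((start, end))
        merged[cluster] = collapsed
    return merged

# Toy hits within a 500 bp upstream window.
hits = [
    ("HSF/GARP", 10, 20),
    ("HSF/GARP", 15, 25),
    ("HSF/GARP", 40, 50),
    ("BZIP", 12, 22),
]
merged = merge_motif_hits(hits)
```

Hits from different clusters are never merged with each other, even when they overlap, so each cluster keeps its own count of merged instances per region.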
Shared motif occupancy decays nonlinearly with increasing evolutionary divergence
To track the evolutionary gain and loss of motif instances in grasses, we quantified shared motif occupancy across species by comparing the number of motifs found at maize orthologs with those found at orthologs of the other 588 species (see Methods). Because many motif instances are not easily alignable, we chose to track the number of occurrences of each motif 500 bp upstream of the translation start site of each orthogroup, agnostic of position. This means that shared motifs may lack nucleotide-level conservation and that sharing can arise either from conservation or via independent gains at different sequence positions.
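A position-agnostic comparison of this kind can be sketched by comparing per-cluster motif counts in the two upstream regions. The exact formula used in the study is given in its Methods and is not reproduced here; the min-based definition below, along with the function and variable names, is an assumption chosen for illustration.

```python
def shared_occupancy(ref_counts, other_counts):
    """Fraction of reference motif instances matched in another species.

    Both arguments map motif cluster -> number of instances in the upstream
    window of one ortholog. For each cluster, an instance in the reference
    counts as shared if the other species has at least as many instances,
    regardless of position (an assumed definition for this sketch).
    """
    total_ref = sum(ref_counts.values())
    if total_ref == 0:
        return None  # undefined when the reference region has no motifs
    shared = sum(
        min(count, other_counts.get(cluster, 0))
        for cluster, count in ref_counts.items()
    )
    return shared / total_ref

# Toy counts for one orthogroup.
maize = {"HSF/GARP": 2, "BZIP": 1, "NAC": 1}
other = {"HSF/GARP": 1, "BZIP": 3}
fraction = shared_occupancy(maize, other)
```

Because positions are ignored, this measure cannot distinguish a retained motif from one that was lost and independently regained elsewhere in the window, which matches the caveat stated in the text.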
Shared motif occupancy decayed nonlinearly with increasing evolutionary divergence. Roughly 60% of maize motif instances were shared with Sorghum (∼15 million years diverged) at orthologous upstream regions, yet 50% were still shared with rice across ∼80 million years of evolution (Fig. 3a). On average, shared motif occupancy between orthologs did not decay to the level observed between random maize genes, approaching an asymptote of 48% shared occupancy (Fig. 3a). Decay curves were similar when defining shared motif occupancy relative to Sorghum (Figure S7a) and rice (Figure S7b) instead of maize. Additionally, motifs at maize orthologs that overlap a ChIP-seq peak for their cognate TF showed similar turnover levels to motifs without in vivo binding evidence (Figure S8). This suggests our motif turnover estimates are not substantially inflated by the many motifs lacking in vivo binding support. Shared occupancy of motif instances varied widely by gene, with most orthogroups approaching 40% to 60% shared occupancy between the most distantly related grasses (Fig. 3a). Coding sequence conservation explained minimal variance in shared motif occupancy at individual orthogroups (R-squared = 0.02) (Figure S9). A small number of highly conserved proteins, such as ribosomal proteins and proteins encoded in mitochondria and chloroplasts, exhibited very high rates of shared motif occupancy. However, the motifs detected near mitochondrial and chloroplast genes are likely artifacts of motif scanning, as these genes have distinct regulatory mechanisms compared to nuclear genes (Liere et al. 2011).
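The decay toward a nonzero asymptote reported at the start of this section can be written as an exponential decay with an offset. The parameterization below is an assumed form, not necessarily the one fitted in the study, and the fitting routine is a simple log-linear least-squares fit with the asymptote held fixed. The input values are synthetic and noise-free; only the 0.48 asymptote echoes the text.

```python
import math

def decay(distance, asymptote, start, rate):
    """Shared occupancy predicted at a given genetic distance."""
    return asymptote + (start - asymptote) * math.exp(-rate * distance)

def fit_decay(distances, shared, asymptote):
    """Estimate (start, rate) for a fixed asymptote via a log-linear fit."""
    # Only points above the asymptote can be log-transformed.
    points = [
        (d, math.log(s - asymptote))
        for d, s in zip(distances, shared)
        if s > asymptote
    ]
    if len(points) < 2:
        raise ValueError("need at least two points above the asymptote")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("distances must not all be identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    intercept = mean_y - slope * mean_x
    return asymptote + math.exp(intercept), -slope

# Synthetic example: the rate of 12.0 is illustrative, not from the study.
distances = [0.0, 0.05, 0.1, 0.2, 0.4]
shared = [decay(d, 0.48, 1.0, 12.0) for d in distances]
start, rate = fit_decay(distances, shared, asymptote=0.48)
```

With real, noisy data the asymptote would normally be estimated jointly with the other parameters using a nonlinear least-squares routine; fixing it here keeps the sketch dependency-free.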
Figure 3.
Shared motif occupancy decays nonlinearly and heterogeneously across gene classes. a) Percentage of maize motif instances retained or regained (shared occupancy) across 589 Poaceae species. Genetic distance was estimated using pairwise distances between maize and Poaceae species at the Angiosperms353 loci. Solid cyan lines represent exponential decay curves fit for 13,114 orthogroups. Points show the mean percentage of maize motifs with shared occupancy in each Poaceae species across all orthogroups, with an exponential decay curve depicted in a dark blue line. Dashed green line represents the mean percentage of motifs with shared occupancy across 100,000 random pairs of maize genes. Approximate divergence times from maize are shown for key taxa (Chen et al. 2022; Gallaher et al. 2022). b) Enriched GO terms for the orthogroups in the top quartile of shared motif occupancy. c) Shared motif occupancy at TF orthogroups (n = 1,190) versus at non-TF orthogroups (n = 11,482). Median values for each class are shown by dotted vertical lines. P value is from an asymptotic two-sample Kolmogorov–Smirnov test.
Orthogroups with higher shared occupancy were significantly enriched (Fisher's exact test, P < 0.05) for 126 Gene Ontology (GO) terms. Many of the strongest enrichments were related to regulatory function by TFs (Fig. 3b; Table S3). Specifically, TF genes (n = 1,190) exhibited slightly higher shared motif occupancy than non-TFs (n = 11,482) (TF mean = 57%, non-TF mean = 54%; Kolmogorov–Smirnov test, P = 9.99e−15) (Fig. 3c). The difference between TFs and non-TFs persisted when comparing shared motif occupancy between maize and sorghum at syntenic orthologs from Li et al. (2024) (Figure S10).
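For readers unfamiliar with the GO enrichment test mentioned above, a one-sided Fisher's exact test is equivalent to the upper tail of a hypergeometric distribution. The sketch below computes that tail directly; the one-sided form, the function name, and the argument layout are assumptions, since the text does not specify how the test was configured.

```python
from math import comb

def fisher_enrichment_p(k, n_set, n_annotated, n_total):
    """One-sided P value for over-representation of a GO term.

    k           -- genes in the test set carrying the term
    n_set       -- size of the test set (e.g. high shared occupancy genes)
    n_annotated -- genes in the background carrying the term
    n_total     -- size of the background

    Returns P(X >= k) for X ~ Hypergeometric(n_total, n_annotated, n_set).
    """
    upper = min(n_set, n_annotated)
    denominator = comb(n_total, n_set)
    numerator = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_set - i)
        for i in range(max(k, 0), upper + 1)
    )
    return numerator / denominator

# Toy example: all 5 annotated genes fall in a test set of 5 out of 10.
p_value = fisher_enrichment_p(k=5, n_set=5, n_annotated=5, n_total=10)
```

Testing many GO terms in this way calls for multiple-testing correction, which this sketch leaves out.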
In addition to regulatory functions, we observed many significant GO terms among high shared occupancy genes related to signal transduction (e.g. “calmodulin binding”, “protein phosphorylation”, and “phosphorelay signal transduction system”), cytoskeleton (e.g. “actin filament organization”, “microtubule cytoskeleton organization”, and “cortical microtubule organization”), and development (e.g. “flower development”, “abaxial cell fate specification”, and “positive regulation of long-day photoperiodism, flowering”). Low shared occupancy genes were enriched for 114 terms including “ADP binding”, “defense response”, and “plant-type cell wall” (Figure S11 and Table S3). GO terms that were not significantly enriched among either high or low shared occupancy genes included “response to abiotic stimulus”, “photosynthesis”, and “reproduction”.
Since many key regulatory elements are found further upstream than 500 bp from the translation initiation site, we re-ran our analyses using a 1 kbp upstream window and found a very similar motif decay pattern, the main difference being an elevated “baseline” level of shared motif occupancy due to the larger search space (Figure S12a). We also observed the same enrichment for TF activity among orthogroups with high shared occupancy (Figure S12b and c), indicating that our results are not very sensitive to window size, at least when considering gene-proximal regions.
Environmentally associated motif gain/loss occurs at over a thousand genes
To investigate which motif instances are associated with environmental niche, we employed phylogenetic mixed models, which can be used to control for phylogenetic relatedness while associating genomic features with traits across species (Housworth et al. 2004). We characterized each species’ ecological niche as described in Hsu et al. (2025), summarizing into ten environmental principal coordinate (envPC) axes that together explain 75% of total ecological variability among the natural environments of grasses across the globe. The envPC1 primarily captures thermal variability across taxa, while envPC2 correlates with water availability (Figure S13).
We hypothesized that repeated gains or losses of particular motifs underlie environmental adaptation in grasses. We first tested whether global occurrence rates of motifs across taxa are associated with environmental niche, using the envPCs as proxies of environmental diversity (Fig. 4a). Since variable genome sizes have been hypothesized to alter the spacing of regulatory regions (Schmitz et al. 2021), we controlled for the overall density of all motifs as well as the mono- and dinucleotide content of genomic background regions, which could influence the probability of detecting a motif by chance. Motifs with FDR < 0.01 were considered significant. Occurrence rates of the following motifs were significantly associated with envPC1 (Fig. 4b; Table S4): C2H2-type zinc finger (C2H2/DOF); DNA binding with one finger (DOF/CDF); NAM, ATAF, and CUC (NAC); homeodomain-leucine zipper/Plant Zinc Finger/AT-HOOK MOTIF NUCLEAR LOCALIZED (HDZIP/PLINC/AHL); cysteine-rich polycomb-like protein (CPP); and SCHLAFMUTZE (SMZ/RAP2.7/TOE2). Additionally, the occurrence rate of basic leucine zipper (BZIP) motifs was significantly associated with envPC2. Each of these motifs was also significant (all P ≤ 0.002, not FDR-corrected) when using the permulation approach described by Saputra et al. (2021), indicating that our associations are unlikely to be an artifact of phylogenetic structure. Using a 1 kbp upstream window for motif scanning produced largely similar results, with several additional significant associations uncovered (Figure S14 and Table S4).
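The FDR threshold used above can be illustrated with the Benjamini-Hochberg procedure. The text does not state which FDR method was applied, so Benjamini-Hochberg is an assumption here, as are the function name and the toy P values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Toy P values for five hypothetical model terms.
raw = [0.001, 0.008, 0.039, 0.041, 0.042]
adjusted = benjamini_hochberg(raw)
significant = [q < 0.01 for q in adjusted]
```

In this toy case only the first term passes the 0.01 threshold, even though two raw P values fall below 0.01, which shows how the correction tightens the cutoff as the number of tests grows.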
Figure 4.
Turnover at over one thousand motif instances is associated with environmental niche. a) Setup of global occurrence association models. One phylogenetic mixed model was run for each of the 35 motif types. b) Quantile–quantile plot showing observed P values from the global occurrence models versus P values expected under the null hypothesis. Observed P values were calculated from Wald tests across 350 fixed effect envPC terms. Terms with FDR < 0.01 are shown in yellow. c) Setup of orthogroup-specific models. One model was run for each motif–orthogroup combination for a total of 540,015 models. d) Quantile–quantile plot for orthogroup-specific models. Observed P values were calculated for ∼5.1 million fixed effect envPC terms. Terms with FDR < 0.01 are shown in yellow.
We further hypothesized that repeated gains or losses of motif instances at particular orthogroups are associated with environmental adaptation. For each motif–orthogroup combination, we associated the number of motifs per assembly at an orthogroup with the ten environmental PCs after controlling for phylogenetic relatedness (Fig. 4c). Across 540,015 models, we scanned for signals of convergent motif gain or loss. Under an FDR < 0.01 significance threshold, 1282 unique motif–orthogroup combinations (1578 total) were significantly associated with at least one envPC (Fig. 4d; Table S5), suggesting convergent motif gain or loss across independent lineages. Of these top associations, 1562 (99%) were significant under permulation (P < 0.05, not FDR-corrected). The 16 associations that were not significant under permulation tended to have very high variance explained by phylogeny or other envPCs, leading to unstable estimates for the focal envPC. To more broadly compare our results with those derived from permulation, we also calculated P values using permulation for 100 random motif/orthogroup/envPC models. Our original P values very closely tracked those derived from permulation (Spearman rho = 0.98; Figure S15), indicating that our results do not show aberrant statistical behavior due to phylogenetic structure. Re-running the orthogroup-specific models with a 1 kbp upstream window yielded different top associations (Figure S14 and Table S5), indicating that this model setup is sensitive to the choice of window size.
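Permulation-based P values like those reported above are empirical: an observed test statistic is compared against statistics obtained from null trait values. Generating those null values (which in permulation combines phylogenetic simulation with permutation) is not shown here. Given a list of null statistics, one common way to compute the P value is sketched below; the two-sided comparison and the add-one correction are assumptions, not details taken from the study.

```python
def empirical_p(observed, null_stats):
    """Two-sided empirical P value with an add-one correction.

    `null_stats` holds the test statistic computed on each null replicate.
    The add-one correction keeps the P value strictly above zero.
    """
    extreme = sum(1 for stat in null_stats if abs(stat) >= abs(observed))
    return (extreme + 1) / (len(null_stats) + 1)

# Toy null distribution with nine replicates.
null_stats = [0.5, -1.0, 2.0, -3.5, 1.2, 0.1, -0.4, 2.9, 0.0]
p_value = empirical_p(3.0, null_stats)
```

The smallest attainable P value is 1 / (number of replicates + 1), so the number of null replicates sets the resolution of the test.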
Gain of HSF/GARP motifs at an alpha-N-acetylglucosaminidase gene predicts cold adaptation
We reasoned that the orthogroups most strongly associated with envPCs would be enriched for pathways and processes linked to abiotic stress responses. Indeed, a GO enrichment analysis of the 1167 significant unique orthogroups (the genes downstream of the regulatory regions scanned for motifs) identified 55 enriched terms (P < 0.05), many of which were related to known abiotic stress response processes such as oxidative stress response (“oxidoreductase activity”, “heme binding”) and degradation of misfolded proteins (“peptidase activity”, “ubiquitin-dependent ERAD pathway”, “proteolysis involved in protein catabolic process”, “ubiquitin-specific protease binding”, “endoplasmic reticulum unfolded protein response”, “threonine-type endopeptidase activity”) (Fig. 5a; Table S6).
Figure 5.
Exploration of top motif–orthogroup pairs. a) GO enrichment analysis for 1167 significantly associated unique orthogroups. The top ten most significant terms are shown. b) Environmental PC1 values for taxa with varying numbers of HSF/GARP motifs at OG0018131. Dashed line shows a linear model fit to envPC1∼HSF/GARP motif count. P value for the HSF/GARP model term is shown. c) Repeated gain and loss of HSF/GARP motifs at OG0018131 across the grass family. Taxa with three or more copies of HSF/GARP are starred. d) HSF/GARP motif variation between Z. palustris and O. sativa at an open chromatin region upstream of OG0018131 (OsZS97_03G017950). FAIRE-seq and RNA Polymerase II ChIP-seq data generated from mature leaves are shown.
We investigated some of the individual motif–orthogroup combinations related to these enriched processes. One of the motif–orthogroup combinations most strongly associated with envPC1 was a Heat Shock Factor/Golden2, ARR-B, Psr1 (HSF/GARP) motif at OG0018131 encoding an alpha-N-acetylglucosaminidase family protein. Alpha-N-acetylglucosaminidase is involved in N-glycan degradation (Ronceret et al. 2008), which has been connected to protein misfolding response under chilling in Arabidopsis (Ma et al. 2016) and alfalfa (Xu et al. 2024). Additionally, HSF TFs are strongly implicated in thermal stress responses (Guo et al. 2016), and in grasses, various HSFs have been linked to misfolded protein responses (Cheng et al. 2015) and chilling tolerance (Gao et al. 2024).
Increasing counts of HSF/GARP motifs at OG0018131 predicted lower envPC1 values (P < 0.001; Fig. 5b), with the association primarily driven by taxa containing three or more HSF/GARP instances. At least seven independent lineages had three or more copies of the motif (Fig. 5c), representing frost-tolerant genera including Zizania (wild rice), Phragmites, and Poa. Notably, only a fraction of independent cold adaptation events (very roughly 25%) were associated with expansions of this motif, indicating that the strength of convergence at this gene was relatively weak despite strong statistical support (Wald test, P = 1.89e−10). Rice contains multiple copies of the motif in an open chromatin region, and its cold-tolerant relative Zizania palustris contains additional copies (Fig. 5d).
A recent study documented allele-specific expression of the closest maize ortholog (Zm00001eb394460) under drought conditions and found that the gene overlaps a MOA-seq bQTL, indicating variation in TF occupancy (Engelhorn et al. 2025). In rice, the gene is found within known QTLs for abiotic stress tolerance, including a 200 kbp GWAS peak (36 genes) that is associated with survival rate under chilling conditions (Khatab et al. 2022). Furthermore, by querying the EMBL expression atlas, we found that this gene and its homologs are differentially expressed in response to multiple abiotic stressors in diverse grasses (maize, sorghum, rice, Brachypodium, and barley) as well as poplar and soybean (Table S7).
Discussion
Our results support a “stable motifs, variable binding sites” model of cis-regulatory evolution under which TF binding preferences remain conserved across deep divergence while individual TFBS across the genome are frequently gained and lost. The high overlap of UMR-enriched motifs across species suggests that minimal lineage-specific loss of binding preferences has occurred in grasses, consistent with the deep conservation of DNA binding domains observed within many TF families (Yamasaki et al. 2008; Weirauch and Hughes 2011; de Mendoza et al. 2013) and recent studies using DAP-seq to uncover conserved TF binding preferences across plant evolution (Baumgart et al. 2025; Zenker et al. 2025).
A few considerations should be kept in mind when interpreting these findings. First, the close similarity of motifs recognized by related TFs could lead to an overestimation of motif enrichment across multiple TFs if binding for a single TF is enriched. Additionally, our analysis does not rule out lineage-specific gains of binding site preference (perhaps via diversification of TF families), which could be investigated using de novo motif characterization. Quantitative estimates of binding affinity would be needed to more sensitively detect subtle shifts in binding preferences. Regardless, the high conservation of motif sequences we observed across distantly related species highlights that TFs in grasses share similar binding preferences, likely with transferable regulatory functions across species. Our implicit assumption is that UMR enrichment is evidence of motif function, although motif depletion can also be meaningful since selection on cis-regulatory regions may act to inhibit particular TF binding events (He et al. 2011). Additionally, many UMRs do not act as cis-regulatory elements, though we did observe similar enrichments in accessible chromatin regions for most motifs.
While much of grasses’ regulatory code appears to have been conserved across a hundred million years of evolution, our work demonstrates that extensive remodeling of individual cis-regulatory regions has occurred. Although some TFBS instances have been shown to be as strongly conserved as coding sequences (Heyndrickx et al. 2014), just 50% of the motif occurrences we observed were shared between maize and rice, which are roughly as genetically divergent as humans and mice (Gallaher et al. 2022). Since we quantified motif occurrences within a 500 bp window upstream of the translation start site, a “shared” motif across species does not imply nucleotide-level conservation, but rather that the motif occurrence is shared either by conservation or independent gains. Our collapsing of similar motifs could potentially underestimate TFBS turnover, since motifs counted as “shared” in our analysis may be divergent enough to be recognized by distinct TFs. On the other hand, TFBS present beyond 500 bp upstream of the translation start site might be misleadingly labeled as “lost” by our analysis. Overall, our estimates of shared occupancy are similar to those derived independently within Andropogoneae species by Stitzer et al. (2025) (∼60% shared between maize and sorghum) and from ChIP-seq in maize and rice (20% to 50% shared GLK binding events in proximal promoter regions). Estimates relying on nucleotide-level alignment between placental mammals of comparable divergence are even lower (10% to 22% shared) (Schmidt et al. 2010).
Despite extensive turnover overall, some motif instances appear to persist over deep evolutionary time scales. Shared motif occupancy at orthologous regions decays nonlinearly and tends to approach a minimum level of shared occupancy between 40% and 60% for most grass orthogroups. This suggests that regulatory regions may often contain a handful of strongly conserved motif instances alongside many motifs that are rapidly turned over. Our finding that regulatory genes, and specifically TFs, have slightly higher shared motif occupancy suggests that a large amount of cis-regulatory evolution within regulatory networks occurs at terminal target genes rather than just at a handful of key regulators. This finding supports a model in which TFs and other highly pleiotropic genes exhibit greater constraint, in line with past theoretical and empirical work (Prud’homme et al. 2007; Chesmore et al. 2016; Nocchi et al. 2024), but see Khaipho-Burch et al. (2023). Future studies could explicitly test whether the degree of pleiotropy of a gene predicts conservation of its TFBS. Alternatively, background selection on the target gene could be driving conservation of proximal TFBS rather than selection on the TFBS itself, though the low correspondence between coding sequence and shared motif occupancy suggests that this effect is minimal.
Our environmental association models suggest that cis-regulatory changes in grasses are not concentrated at a handful of key loci, but occur pervasively across a wide set of diversely functioning genes. We observed weakly convergent motif gain and loss at over a thousand genes with diverse molecular functions, underscoring that environmental adaptation involves altering regulation of a highly diverse set of genes, not just modifying a handful of pathways or processes. These results are consistent with theoretical findings that suggest that global adaptation of complex traits tends to be associated with many small-effect regulatory changes (Prud’homme et al. 2007). This challenges attempts to leverage cross-species insights for crop genetic engineering. Still, characterization and editing of key TFBS remains a promising strategy for precisely tuning expression levels.
Our results suggest that phylogenetic mixed models may be an effective approach to nominate key candidate loci using large-scale comparative genomics. With more sequenced genomes available, the ability of such comparative genomic approaches to detect convergently evolving genome features will improve (Smith et al. 2020), particularly for loci of small effect. Future studies could build on this framework by pairing large-scale comparative genomics screens with detailed molecular characterization of the top candidate loci. For example, a variety of mechanisms could underlie the HSF/GARP motif variation we observed at OG0018131. Gains of HSF/GARP motifs at this locus could alter TF binding affinity of HSF/GARP TFs to the region, or perhaps permit combinatorial binding of multiple distinct HSF/GARP TFs that recognize similar motifs. Further work could validate the functional effect of this motif and its relationship with environmental adaptation.
A number of trade-offs were required to scale up our analyses across hundreds of complex plant genomes. The relatively high proportion of gene-proximal motifs intersecting open chromatin and TF binding regions suggests that many gene-proximal motifs are likely to be bound by TFs in vivo. Still, a significant fraction of the motif instances we characterized are likely to be nonfunctional. Additionally, studies have found that only a fraction of TF binding changes are due to motif variants (Reddy et al. 2012; Krieger et al. 2022), underscoring that TF binding assays, though far less scalable, currently offer more reliable estimates of TFBS than sequence-based prediction. Deep learning models hold promise going forward for scalable characterization of cis-regulatory regions. The ability of such models to represent the orientation, positioning, and co-binding context of cis-regulatory elements can offer a more nuanced picture of cis-regulatory evolution beyond simple gain and loss of motifs.
Our study focused on characterizing motif variants in proximal upstream regions due to limited assembly contiguity. However, changes to distal cis-regulatory elements (Clark et al. 2006) and trans-acting factors also contribute strongly to regulatory adaptation and were not considered in our analyses of motif gain and loss. Another limitation of our approach is that many duplicated gene copies, particularly in the numerous polyploid taxa represented in the dataset, were filtered out. Additionally, because we filtered out small orthogroups with fewer than 200 taxa represented in our orthogroup-specific association models, lineage-specific genes that may be key to adaptation were not considered. Similarly, the reliance on convergent evolution in our association models may miss adaptive mechanisms that are influenced by evolutionary contingencies and constraints; e.g. lineages using C4 photosynthesis may have different physiological strategies “available” to them than C3 lineages. Future comparative genomic analyses will benefit from complementing broad phylogenetic comparisons with comparisons of closely related species that largely share the same physiology and genetics. Moreover, integrating more detailed estimates of ecological context into population genetic analyses of key taxa would offer greater resolution and nuance beyond our broad species-level estimates of adaptation.
Conclusion
We performed large-scale comparative genomic analyses across 589 grass species to investigate the cis-regulatory basis of diversification and environmental adaptation in grasses. We found that grasses share a deeply conserved regulatory code, with 377 cis-regulatory sequence motifs conserved across diverse species. We documented widespread gain and loss of specific motif instances across the grass family, suggesting extensive reorganization of orthologous cis-regulatory regions. Cis-regulatory changes do not appear to be highly concentrated at particular classes of genes, though they occur slightly less frequently at regulatory genes such as TFs. In total, 1,282 motif–orthogroup combinations show evidence of convergent gain and loss of motifs associated with environmental variables. However, even the most strongly significant motifs are only weakly repeatable, underscoring the diversity of cis-regulatory routes to environmental adaptation.
Materials and methods
TF motif scanning
We downloaded 805 TF motifs and clusters from the 2024 JASPAR CORE non-redundant plant collection (Rauluseviciute et al. 2024). This set of motifs has strong experimental evidence from TF binding assays such as DAP-seq and ChIP-seq primarily performed in Arabidopsis. All motif scans were performed using FIMO from the MEME suite v5.5.7 (Grant et al. 2011) with a P value threshold of 0.0001 and the parameters --max-strand --no-qvalue --skip-matched-sequence --max-stored-scores 100000000. Homogenous background mono- and dinucleotide frequencies (=0.25 per nucleotide) were used to avoid species-specific biases in motif detection thresholds. Seven hundred four of the 805 motifs in the JASPAR collection were detectable using our parameters, with the rest containing insufficient information content to enable statistically significant matches with FIMO. While we experimented with using quantitative scores based on the strength of the FIMO match to represent motif occupancy rather than simple presence/absence, we found that this did not increase biological signal in our analyses so we proceeded with presence/absence counts. A visual example of our motif scanning results upstream of the ZmICE1 gene can be seen in Figure S6.
Motif enrichment analysis
We downloaded processed UMRs for five diverse grass species (S. bicolor, O. sativa, B. distachyon, Z. mays, H. vulgare) (Crisp et al. 2020). Background sequences were generated by dinucleotide shuffling each UMR region 100 times, preserving local sequence composition. Motif scanning was performed using FIMO on both UMR and background sequences. The fitdistrplus::fitdist function (Delignette-Muller and Dutang 2015) was used to fit probability distributions to background motif counts. Two-tailed P values were then calculated from the fitted distributions to identify significantly over- and under-represented motifs in UMR regions. We quantified how many motifs were significantly over-represented (FDR < 0.01) within and between species. The 377 commonly enriched motifs were used for subsequent analyses.
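As a rough illustration of the enrichment test described above, the sketch below compares an observed motif count against counts from shuffled background sequences. It makes two assumptions that are not stated in the text: a normal distribution is fit to the background counts (the distribution family fit with fitdistrplus is not specified), and the Benjamini-Hochberg procedure is used for FDR control. Function names and counts are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def two_tailed_p(observed, background_counts):
    """Two-tailed P value of an observed count under a normal fit to
    background counts (normality is an assumption made for this sketch)."""
    fitted = NormalDist(mean(background_counts), stdev(background_counts))
    lower_tail = fitted.cdf(observed)
    return 2 * min(lower_tail, 1 - lower_tail)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def is_over_represented(observed, background_counts, adjusted_p, fdr=0.01):
    """Enriched if significant and above the mean background count."""
    return adjusted_p < fdr and observed > mean(background_counts)
```

In practice each motif would contribute one observed count (from the UMRs) and 100 background counts (one per shuffle), and the adjusted P values would be computed across all motifs within a species.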
For accessible chromatin enrichment, we downloaded bulk accessible chromatin regions for S. bicolor, O. sativa, B. distachyon, Z. mays, H. vulgare, Setaria viridis, and Arabidopsis thaliana as processed by Lu et al. (2019) and performed enrichments as described above. We also compared motif enrichments in maize using Andropogoneae conserved noncoding regions (Stitzer et al. 2025) with intronic regions filtered out, processed scATAC peaks merged across all cell types (Marand et al. 2021) that we uplifted to B73 v5 coordinates, and processed MOA-seq peaks from 25 maize hybrids that had been mapped onto B73 v5 coordinates and merged (Engelhorn et al. 2025). For the maize comparison, we compared enrichments using two strategies to generate background regions: (i) the dinucleotide shuffling approach described above and (ii) permuted genomic regions that we generated using bedtools shuffle v2.31.1 (Quinlan and Hall 2010) with default parameters. We found from this analysis that dinucleotide shuffling yields much more enrichment consistency across feature types compared to using permuted genomic regions as background (Figure S3). The high enrichment variability across feature types with the permuted background is likely due to variable dinucleotide content across feature types which influences the odds of a motif match occurring by chance alone.
Compiling genomic dataset
We compiled a dataset of 727 genome assemblies representing 589 diverse grass species by combining 217 publicly available assemblies from NCBI (Sayers et al. 2024), Phytozome (Goodstein et al. 2012), and CoGE (Lyons and Freeling 2008) with 33 long-read assemblies from Stitzer et al. (2025) and 550 additional assemblies generated from short reads using the high-throughput assembly pipeline described in Schulz et al. (2023) (Table S2). For this study, we assembled 57 new short-read assemblies from WGS raw reads deposited in the NCBI SRA database (Sayers et al. 2024). We downloaded WGS data from all Poaceae species lacking an existing genome assembly if at least 15 GB of WGS data was available for that species. We then assembled the SRA genomes de novo using Megahit v1.2.9 (Li et al. 2015) with minimum k-mer size = 31 and default values for all other parameters, as described in Schulz et al. (2023). If >30 GB WGS data was available for the largest WGS accession for a species, we generated a single assembly using the largest accession. If <30 GB data was available for the largest accession, we merged WGS data from multiple accessions to improve assembly completeness. We performed three QC steps on our SRA assemblies. First, we ran Kraken2 v2.1.3 (Lu et al. 2022) on raw reads (subsampled to a depth of 10M reads) using the PlusPFP database to verify sample identity on a rough taxonomic scale and flag accessions with high levels of bacterial or fungal contamination. Second, we visualized the assemblies on a matK phylogenetic tree against labeled matK sequences from the BOLD database (Ratnasingham et al. 2024) to rule out obviously mislabeled accessions. To generate matK alignments, we downloaded a complete matK CDS sequence for Streptochaeta angustifolia (GenBank: AF164382.1) (Hilu et al. 1999). We queried the S. angustifolia sequence against all WGS short-read assemblies using minimap2 v2.17 (Li 2018) with the parameters -ax asm20 --eqx -I 100g --secondary=no, extracting the primary alignment in each assembly. Using the extracted matK sequences from our WGS assemblies and 7,338 Poaceae matK sequences from the BOLD database, we performed multiple alignment with MAFFT v7.520 (Katoh et al. 2002) using the --auto option. Then we constructed a phylogenetic tree using RAxML v8.2.13 (Stamatakis 2014) with the GTRGAMMA model. We calculated assembly contiguity statistics using assembly-stats v1.0 (Challis 2017). As a final assessment of assembly quality, we quantified assembly completeness using TABASCO, a BUSCO-like assembly metric designed specifically for use with grasses (Schulz et al. 2023), which labels a set of 5592 query genes as “complete”, “duplicated”, “fragmented”, or “missing”.
Orthogroup construction
Across the 727 genome assemblies, we selected 32 representative high quality long-read assemblies to construct orthogroups. In order to avoid potential annotation biases, we ran Helixer (Stiehler et al. 2021) to annotate each of the representative genomes and extracted the protein sequences. Based on protein sequence homology, orthogroups were constructed using OrthoFinder v2.6.4 (Emms and Kelly 2019). In total, we obtained 22,503 orthogroups with homologous sequences represented by more than eight of the representative genomes. From the multiple sequence alignment of each orthogroup, we reconstructed the ancestral protein sequence using the R/phangorn package (Schliep 2011). We used ancestral sequences of the orthogroups to query the orthologs in all 727 genomes with miniProt v0.13.0 (Li 2023).
Phylogenetic characterization
We calculated phylogenetic relatedness among the 727 studied genomes using the Angiosperms353 loci (McDonnell et al. 2021). We identified the orthogroups homologous to the Angiosperms353 loci using miniProt (v0.13.0) and generated gene trees for each of the orthogroups using RAxML v8.2.12 (GAMMA + GTR). ASTRAL-Pro v2 (Zhang and Mirarab 2022) was used to reconcile the species tree based on the gene trees. We generated a matrix of shared branch length among any pair of tips in the species tree to represent the relatedness between them. This matrix (hereafter phyloK matrix) was included in our phylogenetic mixed models to account for shared macroevolutionary history among species. Using the concatenated alignment of the Angiosperms353 loci, we calculated pairwise genetic distance among the 727 taxa based on the K81 model implemented in R/ape::dist.dna() function. We obtained divergence time estimates for Zea-Tripsacum from Chen et al. (2022) and estimated divergence time across Poaceae from Gallaher et al. (2022).
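The shared-branch-length matrix described above can be illustrated with a small sketch. It assumes the species tree is available as a child-to-parent map with branch lengths; this representation, the function names, and the toy tree are hypothetical and are not taken from the authors' code.

```python
def root_path(node, parent):
    """Edges (keyed by child node) from `node` up to the root, with lengths."""
    edges = {}
    while node in parent:
        ancestor, length = parent[node]
        edges[node] = length
        node = ancestor
    return edges

def shared_branch_length(tip_a, tip_b, parent):
    """Total length of the edges that are ancestral to both tips."""
    path_a = root_path(tip_a, parent)
    path_b = root_path(tip_b, parent)
    return sum(length for edge, length in path_a.items() if edge in path_b)

def relatedness_matrix(tips, parent):
    """Square matrix of shared branch lengths; the diagonal holds
    root-to-tip distances."""
    return [[shared_branch_length(a, b, parent) for b in tips] for a in tips]

# Hypothetical toy tree in Newick notation: ((A:1,B:1)AB:2,C:3)root
toy_parent = {
    "A": ("AB", 1.0),
    "B": ("AB", 1.0),
    "AB": ("root", 2.0),
    "C": ("root", 3.0),
}
toy_matrix = relatedness_matrix(["A", "B", "C"], toy_parent)
```

Here tips A and B share the 2.0-unit internal branch, while C shares no branch with either, so the corresponding off-diagonal entries are zero.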
Motif annotation in orthologous upstream regions
Alignments were filtered to retain primary alignments that did not contain a frameshift or a premature stop codon, and that began with a start codon. To approximate 5′ UTR and promoter regions across assemblies, 500 bp of sequence was extracted upstream of the aligned start codon. We chose 500 bp as a conservative estimate of 5′ UTR and promoter regions to maximize species representation in the dataset given the limited contig lengths of the short-read assemblies. If 500 bp of upstream sequence was not available, or if the sequence contained >5% Ns, the sequence was discarded from analysis. To minimize redundant, overlapping motif annotations, we collapsed similar overlapping motifs into a single interval. We defined motif similarity based on membership within the same matrix cluster, as designated by JASPAR 2024 (Rauluseviciute et al. 2024) (Table S1). The 377 motifs with conserved UMR enrichment represented 35 unique clusters. We manually annotated each of the 35 clusters with a descriptor of the motifs contained within (e.g. “HSF/GARP” if the cluster contained motifs from HSF and GARP TFs). After collapsing overlapping motif instances by cluster, we quantified the number of cluster instances per upstream region for each assembly.
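The collapsing step can be sketched as a standard interval merge applied separately to each motif cluster. The tuple layout, the inclusive coordinate convention, and the cluster label "cluster_X" are assumptions made for illustration only.

```python
from collections import defaultdict

def collapse_hits(hits):
    """Merge overlapping motif hits that belong to the same cluster.

    `hits` is an iterable of (cluster, start, end) tuples with inclusive
    coordinates inside one upstream region. Returns a dict mapping each
    cluster to its list of merged (start, end) intervals.
    """
    by_cluster = defaultdict(list)
    for cluster, start, end in hits:
        by_cluster[cluster].append((start, end))

    merged = {}
    for cluster, intervals in by_cluster.items():
        intervals.sort()
        collapsed = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = collapsed[-1]
            if start <= last_end:  # overlaps the previous interval
                collapsed[-1] = (last_start, max(last_end, end))
            else:
                collapsed.append((start, end))
        merged[cluster] = collapsed
    return merged

# Hypothetical hits in a single 500 bp upstream region.
example_hits = [
    ("HSF/GARP", 10, 20),
    ("HSF/GARP", 15, 28),
    ("HSF/GARP", 100, 110),
    ("cluster_X", 12, 19),
]
example_counts = {c: len(v) for c, v in collapse_hits(example_hits).items()}
```

Hits from different clusters are never merged with one another, even when they overlap, so the per-cluster instance counts remain independent.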
Motif abundance and variability
We used our motif counts in 500 bp upstream regions to estimate (i) the median occurrence rate and (ii) coefficient of variation of each motif cluster across assemblies. We calculated occurrence rates of each motif for each assembly by dividing the total number of instances of a particular motif in 500 bp upstream regions by the number of 500 bp upstream regions represented in that assembly, effectively calculating the average number of motif instances per upstream region. We then calculated the median and coefficient of variation for motif occurrence rates across all assemblies.
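A minimal sketch of these summary statistics follows. It assumes the coefficient of variation is the sample standard deviation divided by the mean; the text does not say whether the sample or population standard deviation was used, and the numbers are hypothetical.

```python
from statistics import mean, median, stdev

def occurrence_rate(total_instances, n_upstream_regions):
    """Average number of motif instances per 500 bp upstream region."""
    return total_instances / n_upstream_regions

def summarize_rates(rates):
    """Median and coefficient of variation of per-assembly occurrence rates."""
    return median(rates), stdev(rates) / mean(rates)

# Hypothetical occurrence rates of one motif cluster in three assemblies.
example_rates = [
    occurrence_rate(50, 100),
    occurrence_rate(200, 200),
    occurrence_rate(150, 100),
]
example_median, example_cv = summarize_rates(example_rates)
```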
Shared occupancy of motif instances
Shared motif occupancy was defined for each assembly relative to maize B73, with the percent shared occupancy of maize motifs at a single orthogroup defined as (# motifs present at the target assembly gene/# of motifs present at the maize ortholog). As with most estimates of TFBS turnover, we did not consider motifs present in other lineages but absent in maize. For example, shared occupancy at a single ortholog with three motif types present might be quantified as follows:
| | Motif A instances | Motif B instances | Motif C instances | Total maize motifs present |
|---|---|---|---|---|
| Target assembly | 1 | 0 | 2 | 2 |
| Maize | 1 | 1 | 1 | 3 |
| % shared occupancy | 100% | 0% | 100% | 67% |
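The worked example in the table can be reproduced with a short sketch. Occupancy is treated here as presence or absence of each motif type, so two instances of Motif C in the target still count once; the function name and dictionary layout are hypothetical.

```python
def shared_occupancy(maize_counts, target_counts):
    """Fraction of motif types present at the maize ortholog that are also
    present at the target assembly gene. Motifs absent from maize are ignored.
    Returns None when maize has no motifs at this orthogroup."""
    maize_present = [motif for motif, n in maize_counts.items() if n > 0]
    if not maize_present:
        return None
    shared = sum(1 for motif in maize_present if target_counts.get(motif, 0) > 0)
    return shared / len(maize_present)

maize = {"A": 1, "B": 1, "C": 1}
target = {"A": 1, "B": 0, "C": 2}
percent_shared = round(100 * shared_occupancy(maize, target))  # 67
```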
We calculated shared motif occupancy relative to maize both at individual orthogroups (“local”) and across all 13,114 orthogroups present in maize (“global”). For local and global shared occupancy, we plotted shared motif occupancy against genetic distance (calculated using the Angiosperms353 loci as described above). We fit exponential decay curves to the relationship between genetic distance (x) and shared motif occupancy (y).
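The fitted equation is not reproduced in this text. As a reading aid only, one common three-parameter exponential decay with a nonzero asymptote, which would be consistent with the plateau in shared occupancy described in the Discussion, is shown below. The parameter names a, b, and c are assumptions introduced here and are not the authors' notation.

```latex
y = a\,e^{-b x} + c, \qquad b > 0
```

Under this assumed form, c is the level of shared occupancy approached at large genetic distance, a + c is the expected shared occupancy at x = 0, and b controls how quickly occupancy decays.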
To identify functional enrichments for orthogroups with high and low shared motif occupancy, we calculated the mean percent shared occupancy across taxa for maize motifs. To minimize biases from lineage-specific genes when calculating mean shared occupancy, we only considered orthogroups containing at least 200 taxa. Using the mean shared occupancy values for each orthogroup, we identified orthogroups within the top and bottom quartiles of shared motif occupancy. The top and bottom quartiles each corresponded to 3,168 maize genes, which we mapped to orthogroups using miniProt (v0.13.0) between the maize sequences and the reconstructed ancestral sequences of the orthogroups in this study. We then performed GO enrichment using the R/topGO package (Alexa and Rahnenführer 2009) with the “weight01” algorithm, which considers hierarchical relationships among terms. For background genes, we used the remaining 10,822 orthogroups not contained in the target quartile. Using maize B73 v5 GO annotations downloaded from MaizeGDB (Woodhouse et al. 2021), we tested for significant cellular component, biological process, and molecular function terms and calculated P values from Fisher statistics. Due to the topology-aware approach that topGO's “weight01” algorithm uses, the P values returned are considered to be already corrected for multiple testing. Therefore, we chose to present raw P values for the top GO terms rather than FDR-corrected values.
To compare shared motif occupancy at TF versus non-TF genes, we downloaded a list of maize TFs from Grassius (Gavgani et al. 2023) corresponding to 1,190 distinct orthogroups. Using orthogroup-level measures of shared motif occupancy, we used an asymptotic two-sample Kolmogorov–Smirnov test to evaluate the difference in shared occupancy between TFs and non-TFs (all orthogroups not contained in the Grassius list).
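For readers unfamiliar with the test, the sketch below implements a two-sample Kolmogorov–Smirnov statistic with a P value from the limiting Kolmogorov distribution, using only the Python standard library. It illustrates the general technique and is not the authors' implementation; the cutoff used for very small statistics and the number of series terms are choices made for this sketch.

```python
import math
from bisect import bisect_right

def ks_two_sample_asymptotic(x, y):
    """Two-sample KS statistic D and its asymptotic two-sided P value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # The largest gap between the two empirical CDFs occurs at a data point.
    d = max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        for v in xs + ys
    )
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.04:
        # The series converges slowly here; the P value is effectively 1.
        return d, 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(2.0 * series, 0.0), 1.0)

# Hypothetical shared-occupancy values for TF and non-TF orthogroups.
tf_values = [0.62, 0.58, 0.71, 0.66, 0.60]
non_tf_values = [0.51, 0.55, 0.49, 0.57, 0.53]
d_stat, p_value = ks_two_sample_asymptotic(tf_values, non_tf_values)
```

With orthogroup counts in the thousands, as in this study, the asymptotic approximation is generally considered adequate; for very small samples an exact P value would be preferable.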
We wanted to determine whether false-positive motif hits without evidence of in vivo binding were causing us to overestimate motif turnover rates. To quantify this, we compared shared motif occupancy at genes with ChIP-seq support versus all genes. We downloaded processed ChIP-seq data from MaizeGDB for four maize TFs (Tu et al. 2020) matched to motifs used in our analyses: ereb17 (MA1818.1), glk53 (MA1830.2), bhlh47 (MA1834.2), and hb34 (MA1824.2). For each TF, we identified a set of orthogroups for which the 500 bp upstream region of the maize ortholog intersected with a ChIP-seq peak. Then, for the motif clusters corresponding to each ChIP-seq TF, we quantified shared occupancy of motif instances between maize and every assembly across (i) all orthogroups and (ii) all orthogroups intersecting a maize ChIP-seq peak.
Environmental characterization
We followed an environmental characterization pipeline previously described by Hsu et al. (2025) to characterize the environmental niches of the studied species. Briefly, for each species, we retrieved the geographical coordinates of occurrence from BIEN (Maitner et al. 2017) and GBIF (GBIF.org 2025; derived dataset) and obtained various environmental features for each occurrence (Hsu 2025). We were able to characterize the habitat environments for 706 out of 727 taxa and excluded from subsequent analysis the taxa for which we were unable to curate environmental data (Hsu et al., unpublished data). The variation in habitat environments among species was summarized into environmental principal components (“envPCs”). Each envPC captures a different combination of spatial and climatic patterns (Figure S13), with the multivariate decomposition process resulting in synthetic variables (the envPCs) that preserve important statistical properties, such as orthogonality, lack of autocorrelation, and normality. The top 10 environmental PCs were modeled as predictor variables in our phylogenetic mixed models.
Cross-species environmental association models
For models associating global motif occurrence rates with environment, we fit 35 phylogenetic mixed models of the following form with ASReml-R v4.2 (Butler et al. 2009), with one model per motif and one observation per assembly in each model:
The phylogenetic relationship matrix (phyloK) was fit as a random effect to control for shared evolution across species. All other predictor variables were fit as fixed effects. Occurrence rates for each focal motif were calculated as described in the “Motif abundance and variability” section. We quantified global motif density (across all motifs) for each assembly by summing all 35 individual motif occurrence rates. Mono/dinucleotide principal coordinates were used as covariates to control for background genome characteristics such as GC content, which influence motif detection rates. To estimate mononucleotide and dinucleotide frequencies in background genomic regions, we extracted sequences for all introns shorter than 150 bp using the miniProt alignments previously described in the “Orthogroup construction” section. We used the fasta-get-markov utility from the MEME suite with -m 1 to calculate mono/dinucleotide frequencies, and five principal coordinate axes were generated from these frequencies using the R/ade4 package (Dray and Dufour 2007). We fit the envPC terms as fixed effects, along with the mono/dinucleotide content and global motif density covariate terms. We calculated P values for each envPC term using Wald tests. We plotted the distribution of the 350 observed P values against P values expected under a uniform distribution to identify deviations from the null hypothesis.
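The model formula is not reproduced in this text. Based on the description above (ten envPCs, five mono/dinucleotide principal coordinate axes, and global motif density as fixed effects, with phyloK as a random effect), one way to write such a model is sketched below. All symbols, and the assumption of Gaussian residuals, are introduced here for illustration and are not the authors' notation.

```latex
y_i = \mu
    + \sum_{k=1}^{10} \beta_k \,\mathrm{envPC}_{k,i}
    + \sum_{j=1}^{5} \gamma_j \,\mathrm{PCo}_{j,i}
    + \delta\, d_i
    + u_i + \varepsilon_i,
\qquad
\mathbf{u} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_u^{2}\,\mathbf{K}\right),
\quad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_\varepsilon^{2}\,\mathbf{I}\right)
```

Here y_i is the occurrence rate of the focal motif in assembly i, d_i is the global motif density of that assembly, and K is the phyloK matrix. With one model per motif cluster and ten envPC terms per model, this yields the 35 × 10 = 350 Wald tests mentioned above.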
For orthogroup-specific models of motif-environment association, we used ASReml-R v4.2 to run phylogenetic mixed models of the following form, with one observation per assembly:
Environmental PCs (envPCs) were specified as fixed effects with the phylogenetic relationship matrix (phyloK) as a random effect to control for shared evolution across species. In total 540,015 models were run (# of motifs × # of orthogroups). To minimize model instability, we filtered out motifs occurring at an orthogroup in fewer than ten taxa and required orthogroups to contain at least 200 taxa. Wald tests were performed on each envPC term to calculate P values. We plotted the distribution of the ∼5 million observed P values against P values expected under a uniform distribution to identify deviations from the null hypothesis (no association between environment and motif occurrence). To visualize how motif gain/loss is associated with temperature adaptation events, we reconstructed ancestral temperature values across the Poaceae phylogeny using the contMap function from phytools v2.3-0 (Revell 2024).
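The comparison of observed and expected P values can be sketched as a quantile-quantile calculation. The plotting position (i − 0.5)/n is one common convention and is an assumption here, as is the choice of a −log10 scale; the example values are hypothetical.

```python
import math

def qq_points(pvalues):
    """(expected, observed) pairs on the -log10 scale for a QQ plot against
    the Uniform(0, 1) null. P values must be strictly greater than zero."""
    n = len(pvalues)
    observed = sorted(pvalues)
    expected = [(i - 0.5) / n for i in range(1, n + 1)]
    return [
        (-math.log10(e), -math.log10(o))
        for e, o in zip(expected, observed)
    ]

# Hypothetical P values; points above the diagonal indicate an excess of
# small P values relative to the null.
example_points = qq_points([0.75, 0.001, 0.25, 0.02])
```

The returned pairs are ordered from the smallest to the largest P value and can be passed directly to any plotting library.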
We used phylogenetic permulations (Saputra et al. 2021) to check the robustness of our association model results. It was computationally intractable to run permulations across all 5 million associations, so we permulated (i) 100 random motif/orthogroup/envPC models and (ii) all of our top association model hits from the global and orthogroup-specific models. For each model, we permulated the focal envPC trait 1000 times while holding the other envPCs constant and ran association models to generate null distributions of Wald statistics for comparison with the empirical statistics.
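Once a null distribution of Wald statistics has been generated, an empirical P value can be computed as sketched below. The text does not state how the empirical and null statistics were compared, so the one-sided comparison and the add-one correction used here are assumptions; generating the permulated traits themselves is not shown.

```python
def empirical_p(observed_stat, null_stats):
    """One-sided empirical P value: the share of null statistics at least as
    large as the observed one, with an add-one correction so that the
    estimate is never exactly zero."""
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return (exceed + 1) / (len(null_stats) + 1)

# Hypothetical null distribution from 1000 permulations.
null_wald = [float(i) for i in range(1000)]
p_strong = empirical_p(2000.0, null_wald)
p_weak = empirical_p(990.0, null_wald)
```

With 1000 permulations the smallest attainable P value under this convention is 1/1001, which bounds the resolution of the robustness check.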
Visual comparison of orthologous regions
To compare orthologous upstream regions, we performed pairwise whole-genome alignments between O. sativa var. ZhenShan97 and Zizania palustris using AnchorWave v1.2.2 (Song et al. 2022) with parameters -R 2 -Q 1. We generated chain files from the AnchorWave alignments using a custom “MAFtoChain” script, then lifted over motifs into rice coordinates using CrossMap v0.7.3 (Zhao et al. 2014). To visualize open chromatin regions, we used RiceENCODE to download processed FAIRE-seq and RNA Polymerase II ChIP-seq tracks, both sampled from mature leaves (Zhao et al. 2020).
Acknowledgments
We thank Chelsea Specht, Charles Danko, Jian Hua, and the entirety of the Buckler lab for the valuable feedback throughout the project. Additionally, we thank two anonymous reviewers and the editors for valuable feedback on the manuscript. Thanks to Julia Engelhorn for early access to the MOA-seq data. We disclose the usage of GitHub Copilot (with the GPT-4.1 and Claude Sonnet 3.5 models) and ChatGPT 4o and GPT-5 to assist with code development. All AI-generated code was manually verified and edited, and the authors take full responsibility for the contents of our code. Obtaining plant material for this project would not be possible without our many collaborators in the field and herbaria. We are grateful for their expertise, willingness to aid in field collecting, and sharing of precious herbarium material. In no particular order, we thank the following individuals, governmental organizations, and research institutions: Individuals: Taylor AuBuchon-Elder led the germplasm collection and plant maintenance with help from the following individuals: Donald Danforth Plant Science Center Plant Growth Facility staff, Rémy Pasquet, Cassiano Welker, Chrissy McAllister, Pat Minx, Bess Bookout, Maria Vorontsova, Jordan Teisher, Pete Lowry, Sarah Mathews, Richard Jobson, Tim Teetaert, Julie Pelc, Rev. Dennis Testerman, Russel Juelg, James Cole, Ron Day, Courtney Angelo, Chris Matson, Douglas Rogers, Michael McKain, Marshall Shaw, Arthur Stiles, Chuck Byrd, Kyle Dillard, Robert Findling, Nancy Sferra, Jonathan Bailey, Lynn Riedel, Brian Anacker, Kirsti Harms, Brandon Crawford, Charlotte Reemts, Matt McCaw, Kevin Thuesen, Michelle Bertelsen, Ryan Middleton, and Wesley Newman. Plant vouchers from available material were deposited at Missouri Botanical Garden Herbarium; and for Australian wild collections, duplicates were deposited at Australia's National Herbarium in Canberra. 
Herbaria (silica dried leaf tissue and herbarium specimen tissue): Missouri Botanical Gardens, Royal Botanic Garden Kew, Muséum national d'Histoire naturelle, Australia National Herbarium, Queensland Herbarium, Northern Territory Herbarium, and National Herbarium of New South Wales. Governmental organizations (permits, permissions, and live plant or seed material): Australian National Government, New South Wales National Parks and Wildlife Service, Queensland Parks and Wildlife Service, Queensland Department of Environment Service, Northern Territory Government, Northern Territory Department of Natural Resources, Victoria Department of Environment, Land, Water, and Planning, North Carolina Department of Agriculture and Consumer Services, NC Plant Conservation Program, US National Park Service, US Department of Agriculture, City of Boulder Open Space and Mountain Parks, Florida Park Service and Department of Environmental Protection, and City of Austin Water Quality Protection Lands. Non-profits, NGOs, research institutes (permits, permissions, and live plant material): Katy Prairie Conservatory, Lady Bird Johnson Wildflower Center, The Nature Conservancy (Alabama, Arizona, Florida, Maine, Manitoba, Missouri, New Mexico, Texas), Native Prairie Association of Texas, University of Alabama, New Jersey Conservation Foundation, and University of Puerto Rico at Mayagüez. All collections were done in compliance with the Nagoya Protocol, with permits as required by local authorities. We acknowledge and honor the many custodians and stewards of wild and domesticated grass diversity worldwide who have shaped the study, preservation, and cultivation of grasses.
Contributor Information
Charles O Hale, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853 USA.
Sheng-Kai Hsu, Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853 USA.
Jingjing Zhai, Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853 USA.
Aimee Schulz, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853 USA.
Taylor Aubuchon-Elder, Donald Danforth Plant Science Center, St. Louis, MO 63132 USA.
Germano Costa-Neto, Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853 USA.
Allen Gelfond, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853 USA.
Mohamed Z El-Walid, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853 USA.
Matthew Hufford, Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011 USA.
Elizabeth A Kellogg, Donald Danforth Plant Science Center, St. Louis, MO 63132 USA.
Thuy La, Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853 USA.
Alexandre P Marand, Department of Genetics, University of Michigan, Ann Arbor, MI 48109 USA.
Arun S Seetharam, Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011 USA.
Armin Scheben, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA.
Michelle C Stitzer, Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853 USA.
Travis Wrightsman, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853 USA.
Maria Cinta Romay, Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853 USA.
Edward S Buckler, Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853 USA; Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853 USA; USDA-ARS, Robert W. Holley Center for Agriculture and Health, Ithaca, NY 14853, USA.
Supplementary material
Supplementary material is available at Molecular Biology and Evolution online.
Funding
This work was supported by the National Science Foundation (#1822330) and the United States Department of Agriculture (USDA) Agricultural Research Service (8062-21000-052-004-A). E.S.B. is supported by the USDA Agricultural Research Service (8062-21000-052-000-D). A.J.S. was supported by the National Science Foundation Graduate Research Fellowship Program (DGE-2139899). M.C.S. was supported by the National Science Foundation (#1907343). A.P.M. is supported by the National Institutes of Health (1R00GM144742). Computational resources were provided by the USDA Agricultural Research Service via SCINet (0201-88888-003-000D) and the AI Center of Excellence (0201-88888-002-000D). Additional computational resources and data management were provided by the Bioinformatics Facility (RRID:SCR_021757) at the Cornell Institute of Biotechnology.
Data availability
The short-read genome assemblies generated for this study are available at https://figshare.com/s/bc19e8f5dae558834cc2. Code to generate the analyses and figures can be found at https://github.com/maize-genetics/poaceae_tfbs.
References
- Alexa A, Rahnenführer J. Gene set enrichment analysis with topGO. 2009. 10.18129/B9.bioc.topGO. [DOI]
- Andrews G et al. Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science. 2023:380:eabn7930. 10.1126/science.abn7930. [DOI] [PubMed] [Google Scholar]
- Baumgart LA et al. Recruitment, rewiring and deep conservation in flowering plant gene regulation. Nat Plants. 2025:11:1514–1527. 10.1038/s41477-025-02047-0. [DOI] [PubMed] [Google Scholar]
- Bennett MD, Smith JB. Nuclear dna amounts in angiosperms. Philos Trans R Soc Lond B Biol Sci. 1976:274:227–274. 10.1098/rstb.1976.0044. [DOI] [PubMed] [Google Scholar]
- Bennetzen JL, Freeling M. Grasses as a single genetic system: genome composition, collinearity and compatibility. Trends Genet. 1993:9:259–261. 10.1016/0168-9525(93)90001-X. [DOI] [PubMed] [Google Scholar]
- Bouchenak-Khelladi Y, Verboom GA, Savolainen V, Hodkinson TR. Biogeography of the grasses (Poaceae): a phylogenetic approach to reveal evolutionary history in geographical space and geological time. Bot J Linn Soc. 2010:162:543–557. 10.1111/j.1095-8339.2010.01041.x. [DOI] [Google Scholar]
- Buell CR. Poaceae genomes: going from unattainable to becoming a model clade for comparative plant genomics. Plant Physiol. 2009:149:111–116. 10.1104/pp.108.128926. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Butler D, Cullis B, Gilmour A, Gogel B. Mixed models for S language environments ASReml-R reference manual ASReml estimates variance components under a general linear mixed model by residual maximum likelihood (REML). 2009. https://asreml.kb.vsni.co.uk/wp-content/uploads/sites/3/ASReml-R-Reference-Manual-4.2.pdf.
- Carroll SB. Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell. 2008:134:25–36. 10.1016/j.cell.2008.06.030. [DOI] [PubMed] [Google Scholar]
- Challis R. 2017. rjchallis/assembly-stats 17.02. Zenodo. 10.5281/zenodo.322347. [DOI]
- Chen L et al. Genome sequencing reveals evidence of adaptive variation in the genus Zea. Nat Genet. 2022:54:1736–1745. 10.1038/s41588-022-01184-y. [DOI] [PubMed] [Google Scholar]
- Cheng Q et al. An alternatively spliced heat shock transcription factor, OsHSFA2dI, functions in the heat stress-induced unfolded protein response in rice. Plant Biol. (Stuttg.). 2015:17:419–429. 10.1111/plb.12267. [DOI] [PubMed] [Google Scholar]
- Chesmore KN, Bartlett J, Cheng C, Williams SM. Complex patterns of association between pleiotropy and transcription factor evolution. Genome Biol Evol. 2016:8:3159–3170. 10.1093/gbe/evw228. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Clark RM, Wagler TN, Quijada P, Doebley J. A distant upstream enhancer at the maize domestication gene tb1 has pleiotropic effects on plant and inflorescent architecture. Nat Genet. 2006:38:594–597. 10.1038/ng1784. [DOI] [PubMed] [Google Scholar]
- Crisp PA et al. Stable unmethylated DNA demarcates expressed genes and their cis-regulatory space in plant genomes. Proc Natl Acad Sci U S A. 2020:117:23991–24000. 10.1073/pnas.2010250117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cui Y, Cao Q, Li Y, He M, Liu X. Advances in cis-element- and natural variation-mediated transcriptional regulation and applications in gene editing of major crops. J Exp Bot. 2023:74:5441–5457. 10.1093/jxb/erad248. [DOI] [PubMed] [Google Scholar]
- Delignette-Muller ML, Dutang C. Fitdistrplus: An R Package for fitting distributions. J Stat Softw. 2015:64:1–34. 10.18637/jss.v064.i04. [DOI] [Google Scholar]
- de Mendoza A et al. Transcription factor evolution in eukaryotes and the assembly of the regulatory toolkit in multicellular lineages. Proc Natl Acad Sci U S A. 2013:110:E4858–E4866. 10.1073/pnas.1311818110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Derived dataset GBIF.org (4 March 2025) Filtered export of GBIF occurrence data. 2025. 10.15468/dd.rbp65u. [DOI]
- Dray S, Dufour A. The ade4 package: implementing the duality diagram for ecologists. J Stat Softw. 2007:22:1–20. 10.18637/jss.v022.i04. [DOI] [Google Scholar]
- Edwards EJ, Smith SA. Phylogenetic analyses reveal the shady history of C4 grasses. Proc Natl Acad Sci U S A. 2010:107:2532–2537. 10.1073/pnas.0909672107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019:20:238. 10.1186/s13059-019-1832-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Engelhorn J et al. Genetic variation at transcription factor binding sites largely explains phenotypic heritability in maize. Nat Genet. 2025:57:2313–2322. 10.1038/s41588-025-02246-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fjellheim S, Boden S, Trevaskis B. The role of seasonal flowering responses in adaptation of grasses to temperate climates. Front Plant Sci. 2014:5:431. 10.3389/fpls.2014.00431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gallaher TJ et al. Grasses through space and time: an overview of the biogeographical and macroevolutionary history of Poaceae. J Syst Evol. 2022:60:522–569. 10.1111/jse.12857. [DOI] [Google Scholar]
- Gao L et al. Genetic variation in a heat shock transcription factor modulates cold tolerance in maize. Mol Plant. 2024:17:1423–1438. 10.1016/j.molp.2024.07.015. [DOI] [PubMed] [Google Scholar]
- Gavgani HN, Grotewold E, Gray J. Methodology for constructing a knowledgebase for plant gene regulation information. Methods Mol. Biol. 2023:2698:277–300. 10.1007/978-1-0716-3354-0_17. [DOI] [PubMed] [Google Scholar]
- Goodstein DM et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012:40:D1178–D1186. 10.1093/nar/gkr944. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011:27:1017–1018. 10.1093/bioinformatics/btr064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grass Phylogeny Working Group II . New grass phylogeny resolves deep evolutionary relationships and discovers C4 origins. New Phytol. 2012:193:304–312. 10.1111/j.1469-8137.2011.03972.x. [DOI] [PubMed] [Google Scholar]
- Guo M et al. The plant heat stress transcription factors (HSFs): structure, regulation, and function in response to abiotic stresses. Front Plant Sci. 2016:7:114. 10.3389/fpls.2016.00114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haudry A et al. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat Genet. 2013:45:891–898. 10.1038/ng.2684. [DOI] [PubMed] [Google Scholar]
- He BZ, Holloway AK, Maerkl SJ, Kreitman M. Does positive selection drive transcription factor binding site turnover? A test with Drosophila cis-regulatory modules. PLoS Genet. 2011:7:e1002053. 10.1371/journal.pgen.1002053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hendelman A et al. Conserved pleiotropy of an ancient plant homeobox gene uncovered by cis-regulatory dissection. Cell. 2021:184:1724–1739.e16. 10.1016/j.cell.2021.02.001. [DOI] [PubMed] [Google Scholar]
- Heyndrickx KS, Van de Velde J, Wang C, Weigel D, Vandepoele K. A functional and evolutionary perspective on transcription factor binding in Arabidopsis thaliana. Plant Cell. 2014:26:3894–3910. 10.1105/tpc.114.130591. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hilu KW, Alice LA, Liang H. Phylogeny of Poaceae inferred from matK sequences. Ann Mo Bot Gard. 1999:86:835. 10.2307/2666171. [DOI] [Google Scholar]
- Housworth EA, Martins EP, Lynch M. The phylogenetic mixed model. Am Nat. 2004:163:84–96. 10.1086/380570. [DOI] [PubMed] [Google Scholar]
- Hsu SK et al. The Genomic basis of environmental adaptation in Poaceae. Unpublished data.
- Hsu S-K. Occurrence data for 614 Poaceae species with associated genome assemblies. Zenodo. 2025. 10.5281/zenodo.14968186. [DOI]
- Hsu S-K et al. Contrasting rhizosphere nitrogen dynamics in Andropogoneae grasses. Plant J. 2025:123:1–16. 10.1111/tpj.70319. [DOI] [Google Scholar]
- Hu Z-C et al. Evolution of a SHOOTMERISTEMLESS transcription factor binding site promotes fruit shape determination. Nat Plants. 2024:10:1–13. 10.1038/s41477-024-01622-1. [DOI] [PubMed] [Google Scholar]
- Jiang H et al. Natural polymorphism of ZmICE1 contributes to amino acid metabolism that impacts cold tolerance in maize. Nat Plants. 2022:8:1176–1190. 10.1038/s41477-022-01254-3. [DOI] [PubMed] [Google Scholar]
- Kaplow IM et al. Inferring mammalian tissue-specific regulatory conservation by predicting tissue-specific differences in open chromatin. BMC Genomics. 2022:23:291. 10.1186/s12864-022-08450-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kaplow IM et al. Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science. 2023:380:eabm7993. 10.1126/science.abm7993. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Katoh K, Misawa K, Kuma K-I, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002:30:3059–3066. 10.1093/nar/gkf436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kellogg EA. Flowering plants. In: Monocots: Poaceae. Springer International Publishing; 2015. p. 3–416. [Google Scholar]
- Khaipho-Burch M et al. Elucidating the patterns of pleiotropy and its biological relevance in maize. PLoS Genet. 2023:19:e1010664. 10.1371/journal.pgen.1010664. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Khatab AA et al. Global identification of quantitative trait loci and candidate genes for cold stress and chilling acclimation in rice through GWAS and RNA-seq. Planta. 2022:256:82. 10.1007/s00425-022-03995-z. [DOI] [PubMed] [Google Scholar]
- King MC, Wilson AC. Evolution at two levels in humans and chimpanzees. Science. 1975:188:107–116. 10.1126/science.1090005. [DOI] [PubMed] [Google Scholar]
- Krieger G, Lupo O, Wittkopp P, Barkai N. Evolution of transcription factor binding through sequence variations and turnover of binding sites. Genome Res. 2022:32:1099–1111. 10.1101/gr.276715.122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lehmann CER et al. 2019 Mar 21. Functional diversification enabled grassy biomes to fill global climate space [preprint]. bioRxiv. https://www.biorxiv.org/content/10.1101/583625v1.full.pdf
- Lehti-Shiu MD, Panchy N, Wang P, Uygun S, Shiu S-H. Diversity, expansion, and evolutionary novelty of plant DNA-binding transcription factor families. Biochim Biophys Acta Gene Regul Mech. 2017:1860:3–20. 10.1016/j.bbagrm.2016.08.005. [DOI] [PubMed] [Google Scholar]
- Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015:31:1674–1676. 10.1093/bioinformatics/btv033. [DOI] [PubMed] [Google Scholar]
- Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018:34:3094–3100. 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H. Protein-to-genome alignment with miniprot. Bioinformatics. 2023:39:btad014. 10.1093/bioinformatics/btad014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li T et al. Modeling 0.6 million genes for the rational design of functional cis-regulatory variants and de novo design of cis-regulatory sequences. Proc Natl Acad Sci U S A. 2024:121:e2319811121. 10.1073/pnas.2319811121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liere K, Weihe A, Börner T. The transcription machineries of plant mitochondria and chloroplasts: composition, function, and regulation. J Plant Physiol. 2011:168:1345–1360. 10.1016/j.jplph.2011.01.005. [DOI] [PubMed] [Google Scholar]
- Lu J et al. Metagenome analysis using the Kraken software suite. Nat Protoc. 2022:17:2815–2839. 10.1038/s41596-022-00738-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu Z et al. The prevalence, evolution and chromatin signatures of plant regulatory elements. Nat Plants. 2019:5:1250–1259. 10.1038/s41477-019-0548-z. [DOI] [PubMed] [Google Scholar]
- Lyons E, Freeling M. How to usefully compare homologous plant genes and chromosomes as DNA sequences: how to usefully compare plant genomes. Plant J. 2008:53:661–673. 10.1111/j.1365-313X.2007.03326.x. [DOI] [PubMed] [Google Scholar]
- Ma J et al. Endoplasmic reticulum-associated N-glycan degradation of cold-upregulated glycoproteins in response to chilling stress in Arabidopsis. New Phytol. 2016:212:282–296. 10.1111/nph.14014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maitner BS et al. The BIEN R package: a tool to access the botanical information and ecology network (BIEN) database. Methods Ecol Evol. 2017:9:373–379. 10.1111/2041-210X.12861. [DOI] [Google Scholar]
- Marand AP, Chen Z, Gallavotti A, Schmitz RJ. A cis-regulatory atlas in maize at single-cell resolution. Cell. 2021:184:3041–3055.e21. 10.1016/j.cell.2021.04.014. [DOI] [PubMed] [Google Scholar]
- Marand AP, Eveland AL, Kaufmann K, Springer NM. Cis-regulatory elements in plant development, adaptation, and evolution. Annu Rev Plant Biol. 2023:74:111–137. 10.1146/annurev-arplant-070122-030236. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mascher M, Marone MP, Schreiber M, Stein N. Are cereal grasses a single genetic system? Nat Plants. 2024:10:719–731. 10.1038/s41477-024-01674-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McDonnell AJ et al. Exploring Angiosperms353: developing and applying a universal toolkit for flowering plant phylogenomics. Appl Plant Sci. 2021:9:1–5. 10.1002/aps3.11443. [DOI] [Google Scholar]
- McSteen P, Kellogg EA. Molecular, cellular, and developmental foundations of grass diversity. Science. 2022:377:599–602. 10.1126/science.abo5035. [DOI] [PubMed] [Google Scholar]
- Meng X et al. Predicting transcriptional responses to cold stress across plant species. Proc Natl Acad Sci U S A. 2021:118:e2026330118. 10.1073/pnas.2026330118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nitta KR et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. Elife. 2015:4:e04837. 10.7554/eLife.04837. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nocchi G, Whiting JR, Yeaman S. Repeated global adaptation across plant species. Proc Natl Acad Sci U S A. 2024:121:e2406832121. 10.1073/pnas.2406832121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O’Malley RC et al. Cistrome and epicistrome features shape the regulatory DNA landscape. Cell. 2016:165:1280–1292. 10.1016/j.cell.2016.04.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Partha R et al. Subterranean mammals show convergent regression in ocular genes and enhancers, along with adaptation to tunneling. Elife. 2017:6:e25884. 10.7554/eLife.25884. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Phan MHQ et al. Conservation of regulatory elements with highly diverged sequences across large evolutionary distances. Nat. Genet. 2025:57:1524–1534. 10.1038/s41588-025-02202-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Prud’homme B, Gompel N, Carroll SB. Emerging principles of regulatory evolution. Proc Natl Acad Sci U S A. 2007:104:8605–8612. 10.1073/pnas.0700488104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010:26:841–842. 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ratnasingham S et al. BOLD v4: a centralized bioinformatics platform for DNA-based biodiversity data. Methods Mol. Biol. 2024:2744:403–441. 10.1007/978-1-0716-3581-0_26. [DOI] [PubMed] [Google Scholar]
- Rauluseviciute I et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2024:52:D174–D182. 10.1093/nar/gkad1059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reddy TE et al. Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res. 2012:22:860–869. 10.1101/gr.131201.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Revell LJ. Phytools 2.0: an updated R ecosystem for phylogenetic comparative methods (and other things). PeerJ. 2024:12:e16505. 10.7717/peerj.16505. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Riechmann JL et al. Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science. 2000:290:2105–2110. 10.1126/science.290.5499.2105. [DOI] [PubMed] [Google Scholar]
- Rodgers-Melnick E, Vera DL, Bass HW, Buckler ES. Open chromatin reveals the functional maize genome. Proc Natl Acad Sci U S A. 2016:113:E3177–E3184. 10.1073/pnas.1525244113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ronceret A, Gadea-Vacas J, Guilleminot J, Devic M. The alpha-N-acetyl-glucosaminidase gene is transcriptionally activated in male and female gametes prior to fertilization and is essential for seed development in Arabidopsis. J. Exp. Bot. 2008:59:3649–3659. 10.1093/jxb/ern215. [DOI] [PubMed] [Google Scholar]
- Saputra E, Kowalczyk A, Cusick L, Clark N, Chikina M. Phylogenetic permulations: a statistically rigorous approach to measure confidence in associations in a phylogenetic context. Mol Biol Evol. 2021:38:3004–3021. 10.1093/molbev/msab068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Savadel SD et al. The native cistrome and sequence motif families of the maize ear. PLoS Genet. 2021:17:e1009689. 10.1371/journal.pgen.1009689. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sayers EW et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2024:52:D33–D43. 10.1093/nar/gkad1044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schliep KP. Phangorn: phylogenetic analysis in R. Bioinformatics. 2011:27:592–593. 10.1093/bioinformatics/btq706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schmidt D et al. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science. 2010:328:1036–1040. 10.1126/science.1186176. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schmitz RJ, Grotewold E, Stam M. Cis-regulatory sequences in plants: their importance, discovery, and future challenges. Plant Cell. 2021:34:718–741. 10.1093/plcell/koab281. [DOI] [Google Scholar]
- Schulz AJ et al. The molecular evolution of perenniality across the grasses. Unpublished data.
- Schulz AJ et al. 2023. Fishing for a reelGene: evaluating gene models with evolution and machine learning. bioRxiv. https://www.biorxiv.org/content/10.1101/2023.09.19.558246.abstract
- Signor SA, Nuzhdin SV. The evolution of gene expression in cis and trans. Trends Genet. 2018:34:532–544. 10.1016/j.tig.2018.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Smith SD, Pennell MW, Dunn CW, Edwards SV. Phylogenetics is the new genetics (for Most of Biodiversity). Trends Ecol. Evol. 2020:35:415–425. 10.1016/j.tree.2020.01.005. [DOI] [PubMed] [Google Scholar]
- Song B et al. Conserved noncoding sequences provide insights into regulatory sequence and loss of gene expression in maize. Genome Res. 2021:31:1245–1257. 10.1101/gr.266528.120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Song B et al. AnchorWave: sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proc Natl Acad Sci U S A. 2022:119:e2113075119. 10.1073/pnas.2113075119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Spriggs EL, Christin P-A, Edwards EJ. C4 photosynthesis promoted species diversification during the Miocene grassland expansion. PLoS One. 2014:9:e97722. 10.1371/journal.pone.0097722. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014:30:1312–1313. 10.1093/bioinformatics/btu033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stebbins GL. Polyploidy, hybridization, and the invasion of new habitats. Ann Mo Bot Gard. 1985:72:824. 10.2307/2399224. [DOI] [Google Scholar]
- Stergachis AB et al. Conservation of trans-acting circuitry during mammalian regulatory evolution. Nature. 2014:515:365–370. 10.1038/nature13972. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stiehler F et al. Helixer: cross-species gene annotation of large eukaryotic genomes using deep learning. Bioinformatics. 2021:36:5291–5298. 10.1093/bioinformatics/btaa1044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stitzer MC et al. 2025 Jan 24. Extensive genome evolution distinguishes maize within a stable tribe of grasses [preprint]. bioRxiv 2025.01.22.633974. https://www.biorxiv.org/content/10.1101/2025.01.22.633974v1.abstract
- Tu X et al. Reconstructing the maize leaf regulatory network using ChIP-seq data of 104 transcription factors. Nat Commun. 2020:11:5089. 10.1038/s41467-020-18832-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tu X et al. Limited conservation in cross-species comparison of GLK transcription factor binding suggested wide-spread cistrome divergence. Nat Commun. 2022:13:7632. 10.1038/s41467-022-35438-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Voichek Y, Hristova G, Mollá-Morales A, Weigel D, Nordborg M. Widespread position-dependent transcriptional regulatory sequences in plants. Nat Genet. 2024:56:1–9. 10.1038/s41588-023-01621-6. [DOI] [PubMed] [Google Scholar]
- Weirauch MT, Hughes TR. A catalogue of eukaryotic transcription factor types, their evolutionary origin, and Species distribution. In: Hughes TR, editors. A handbook of transcription factors. Springer Netherlands; 2011. p. 25–73. [Google Scholar]
- Wittkopp PJ, Haerum BK, Clark AG. Evolutionary changes in cis and trans gene regulation. Nature. 2004:430:85–88. 10.1038/nature02698. [DOI] [PubMed] [Google Scholar]
- Wong ES et al. Deep conservation of the enhancer regulatory code in animals. Science. 2020:370:eaax8137. 10.1126/science.aax8137. [DOI] [PubMed] [Google Scholar]
- Woodhouse MR et al. A pan-genomic approach to genome databases using maize as a model system. BMC Plant Biol. 2021:21:385. 10.1186/s12870-021-03173-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wray GA. 2007. The evolutionary significance of cis-regulatory mutations. Nat Rev Genet. 8:206–216. 10.1038/nrg2063. [DOI] [PubMed] [Google Scholar]
- Xu H et al. Study on molecular response of alfalfa to low temperature stress based on transcriptomic analysis. BMC Plant Biol. 2024:24:1244. 10.1186/s12870-024-05987-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yamasaki K et al. Structures and evolutionary origins of plant-specific transcription factor DNA-binding domains. Plant Physiol Biochem. 2008:46:394–401. 10.1016/j.plaphy.2007.12.015. [DOI] [PubMed] [Google Scholar]
- Yang S et al. Functionally conserved enhancers with divergent sequences in distant vertebrates. BMC Genomics. 2015:16:882. 10.1186/s12864-015-2070-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zeng R et al. A natural variant of COOL1 gene enhances cold tolerance for high-latitude adaptation in maize. Cell. 2025:188:1315–1329. 10.1016/j.cell.2024.12.018. [DOI] [PubMed] [Google Scholar]
- Zenker S et al. Many transcription factor families have evolutionarily conserved binding motifs in plants. Plant Physiol. 2025:198:kiaf205. 10.1093/plphys/kiaf205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang C, Mirarab S. ASTRAL-Pro 2: ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics. 2022:38:4949–4950. 10.1093/bioinformatics/btac620. [DOI] [PubMed] [Google Scholar]
- Zhang T et al. Phylogenomic profiles of whole-genome duplications in Poaceae and landscape of differential duplicate retention and losses among major Poaceae lineages. Nat Commun. 2024:15:3305. 10.1038/s41467-024-47428-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao H et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 2014:30:1006–1007. 10.1093/bioinformatics/btt730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao L et al. Integrative analysis of reference epigenomes in 20 rice varieties. Nat Commun. 2020:11:2658. 10.1038/s41467-020-16457-5. [DOI] [PMC free article] [PubMed] [Google Scholar]