Abstract
Transcription factors control gene expression during development and in response to a broad range of internal and external stimuli. They regulate promoter activity by directly binding cis-regulatory elements in DNA. The angiosperm Arabidopsis (Arabidopsis thaliana) contains more than 1,500 annotated transcription factors, each containing a DNA-binding domain that is used to define transcription factor families. Analyzing the binding motifs of 686 and the binding sites of 335 Arabidopsis transcription factors, as well as motifs of 92 transcription factors from other plants, we identified a constrained vocabulary of 74 conserved motifs spanning 50 families in plants. Among 21 transcription factor families, we found 1 core motif for all analyzed members and between 2% and 72% overlapping binding sites. Five families show conservation of the motif along phylogenetic clades. Five families, including the C2H2 zinc finger family, show high diversity among motifs in plants, suggesting potential for the neofunctionalization of duplicated transcription factors based on the motif recognized. We tested whether conserved motifs remained conserved since at least 450 million years ago by determining the binding motifs of 17 transcription factors from 11 families in Marchantia (Marchantia polymorpha) using amplified DNA affinity purification sequencing. We detected nearly identical binding motifs as predicted from the angiosperm data. Our findings show a large repertoire of overlapping binding sites within a transcription factor family and species and a high degree of binding motif conservation for at least 450 million years, indicating more potential for evolution in cis- rather than trans-regulatory elements.
High degrees of transcription factor binding motif and peak conservation from bryophytes to angiosperms suggest higher likelihood of evolution in cis-regulatory regions.
Introduction
Plant transcriptional regulation is a highly diversified process with, for example, around 27,000 nuclear target genes and more than 1,500 transcription factors (TFs) in Arabidopsis (A. thaliana) (Riechmann et al. 2000). The plant kingdom is highly diverse with about 374,000 existing species (Christenhusz and Byng 2016), which evolved from an ancestral charophycean alga (Bowman 2022). Species diversity drastically expanded in the angiosperm lineages 125 to 112 million years ago (De Bodt et al. 2005) after the colonization of land by streptophytes. In all plants, development and responses to biotic and abiotic challenges require acclimation via changes in gene activity including regulation of transcription. Transcriptional regulation is mediated by sequence-specific TFs, which directly bind DNA. Binding sites (cis-regulatory elements) can be characterized in vivo and in vitro. For in vivo characterization, chromatin immunoprecipitation sequencing (ChIP-Seq) is most frequently used; its results are influenced by additional in vivo factors, such as chromatin structure and partner proteins (Gordân et al. 2009; Li et al. 2011). In contrast, in vitro methods, such as protein binding microarrays, high-throughput in vitro selection, and DNA affinity purification sequencing (DAP-Seq), allow identification of pure DNA-binding sites without chromatin and methylation influence, but also lack potential binding partners (Berger et al. 2006; Jolma et al. 2013; O’Malley et al. 2016). DAP-Seq also enables comparison of bound sites. Identified binding sites of a TF can be aligned and summarized in a position weight matrix (PWM), resulting in a descriptive transcription factor binding motif (TFBM). A TFBM could be, for example, the G-box CACGTG.
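The step from aligned binding sites to a PWM can be illustrated with a minimal sketch. It assumes equal-length, already aligned sites, a uniform background of 0.25 per base, and a pseudocount of 0.5; the example sites and all function names are invented for illustration and are not the pipeline used in this study.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequency_matrix(sites, pseudocount=0.5):
    """Per-position base frequencies for equal-length, aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        matrix.append({b: (counts.get(b, 0) + pseudocount) / total for b in BASES})
    return matrix

def position_weight_matrix(freqs, background=0.25):
    """Log2 odds of each base against a uniform background (assumption)."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freqs]

def consensus(freqs):
    """Most frequent base at each position."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in freqs)

# Hypothetical aligned sites resembling a G-box; not real data.
example_sites = ["CACGTG", "CACGTG", "CACGTT", "GACGTG", "CACGTG"]
freqs = position_frequency_matrix(example_sites)
pwm = position_weight_matrix(freqs)
print(consensus(freqs))
```

Positive PWM entries mark bases that are enriched relative to the background, negative entries mark depleted bases.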
The type and position of binding sites on the DNA in regulatory regions spell out a code, which is read out by the TFs (Seeman et al. 1976; Rohs et al. 2009; O’Malley et al. 2016). Thus, the TFBMs are hypothesized to be a main component to guide the regulatory activity of TFs (Weirauch et al. 2014), but other factors like DNA shape or protein-protein interactions modulate binding sites as well (Appelhagen et al. 2011; Sielemann et al. 2021). Some studies have demonstrated that structurally similar TFs have similar TFBMs (Rushton et al. 1995; Berger et al. 2008; Weirauch et al. 2014; O’Malley et al. 2016; Galli et al. 2018; Lambert et al. 2019) while other studies show that small changes in TF amino acid sequence lead to changes in binding sites and therefore TFBMs (Cook et al. 1994; Noyes et al. 2008; Aggarwal et al. 2010).
TFs are grouped into families based on their shared DNA-binding domains (DBDs) and additional domains, for example for protein-protein interactions (Wilhelmsson et al. 2017). The majority of angiosperm TF families can already be found in the bryophyte Marchantia (Marchantia polymorpha), which has not undergone detectable whole genome duplications (Bowman et al. 2017). The divergence of bryophytes and angiosperms can be dated to around 458 to 467 million years ago (Bowman 2022), which we will refer to for simplicity as at least 450 million years ago from here on.
TF families in Marchantia generally consist of fewer members compared to the angiosperm model organism Arabidopsis. A subsequent increase in TF number is likely due to duplications within families (Catarino et al. 2016). Duplication allows for sequence and function divergence under relaxed selective pressure (Ohno 1970; Zhang 2003). Whole genome duplication evidently leads to higher retention rates for duplicated genes by balancing gene dosage effects as opposed to potentially detrimental effects created by single-copy duplications (Edger and Pires 2009; Schmitz et al. 2016). Although most duplicated copies of genes become nonfunctional within a short evolutionary time by accumulating deleterious mutations leading to pseudogenization (Lynch and Conery 2000), some gene copies neofunctionalize and adopt completely new functions compared to their paralog. Alternatively, both copies subfunctionalize so that each copy finally covers parts of the function of the original version (Ohno 1970; Zhang 2003).
In different origins of multicellularity and therefore in different origins of complex development, different families of TFs expanded and still expand, but the evolutionary source of the TF family frequently predates the expansion event (De Mendoza et al. 2013). Expansion corresponds to phylogeny, as phylogenetic branches share specific family expansion patterns (Lang et al. 2010; De Mendoza et al. 2013). In plants, there are several well-described examples of TF families that have expanded. The MYELOBLASTOSIS (MYB) superfamily, for example, can be found across all eukarya, but has increased by 9-fold in member number from the green alga Chlamydomonas (Chlamydomonas reinhardtii) to Arabidopsis (Feller et al. 2011; De Mendoza et al. 2013). MYB TFs are defined by their DNA-binding MYB domain, which consists of a variable number of (imperfect) MYB repeats, each forming 3 α-helices, with the second and third forming a helix-turn-helix structure to interact with the binding site on DNA. The MYB family can be divided into subfamilies based on the number of MYB repeats (Stracke et al. 2001). MYB-related TFs have a single or partial MYB repeat, which can be either of the R3-type or the R1/R2-type (Dubos et al. 2010). MYB-related TFs are involved in the regulation of diverse functions, including the flavonoid biosynthesis (Dubos et al. 2008) and the circadian clock (Lu et al. 2009).
Very large-scale cross-kingdom analyses have suggested that TF binding specificity to particular TFBMs is predicted by DBD amino acid sequence similarity and that extensive similarity in binding can be detected (Weirauch et al. 2014; Lambert et al. 2019). In yeast (Saccharomyces cerevisiae), however, 60% of TFs have evolved differential preferences to binding sites due to variations mainly outside of the DBD (Gera et al. 2022). For example, in the zinc finger family in yeast, predominantly intrinsically disordered regions dictate the binding specificity of a TF (Brodsky et al. 2020). In contrast, analyses of TFBMs in fly (Drosophila melanogaster) and human (Homo sapiens) have shown striking conservation between structurally similar TFs despite expansion and divergence of families over the span of 600 million years (Nitta et al. 2015). For example, activating transcription factor-2 has evolved different functions from its orthologue ATF7 in fly and divergent binding preferences in human and fly were detected (Sano et al. 2005; Nitta et al. 2015). Overall, in animals, extensive conservation of TFBMs is detected in some TF families but not all. The TF family with C2H2 zinc finger as DBD, for example, shows much higher variation in TFBMs despite high protein sequence conservation in metazoans and plants (Lambert et al. 2019). In contrast to animals, most plant genera are tolerant to autopolyploidy and allopolyploidy, which may provide an opportunity for different TF family expansions and for extensive neofunctionalization.
Here, we explored in vitro TF binding data from Arabidopsis TFs and showed that experimentally verified in vitro binding of multiple TFs to the same binding site is detectable. We show that TFBMs can be highly similar within a TF family or be conserved along phylogenetic boundaries within a TF family and quantify the overlapping binding sites. We hypothesized that common TFBMs are a product of TF family expansion and that this conservation dates back to the last common ancestor of angiosperms and bryophytes. To test this hypothesis, we determined the TFBMs of 17 TFs from the bryophyte Marchantia using amplified DNA affinity purification sequencing (ampDAP-Seq).
Results
To study the experimentally generated binding data of TFs in the genome of Arabidopsis, we mapped (amp)DAP-Seq peaks from O’Malley et al. (2016) and López-Vidriero et al. (2021) onto individual promoters of genes, here defined as the 1 kb region upstream of the transcription start site (TSS) (Fig. 1A). Visualizations were amended with accessible chromatin regions from 5 publications (Lu et al. 2017; Maher et al. 2018; Sijacic et al. 2018; Lu et al. 2019; Sullivan et al. 2019) to visualize accessibility. It is immediately apparent that peak stacks can be observed over defined promoter regions (Fig. 1A), showing multiple TFs binding to the same binding site in the promoter. Many of these binding sites are indeed accessible in leaves, roots, and whole seedlings (Fig. 1A). Across the 27,206 promoters of nuclear protein-coding genes in Arabidopsis, there is experimental evidence for 0 to 357 (1 outlier with 1,076) binding events per promoter, with a median of 37 and a mean of 39 binding events (−1 kb to TSS; Fig. 1B). These numbers represent a lower bound, as larger promoter definitions increase the number of detectable binding events (Fig. 1B). The large positional overlap in TF binding events, which was also qualitatively observed in other promoter regions, suggests large overlap in TFBMs.
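Counting binding events per promoter amounts to an interval-overlap test. The sketch below assumes half-open coordinates and counts every peak that overlaps the promoter by at least 1 base; whether the original analysis used full peaks or peak summits is not stated here, and the coordinates are invented for illustration.

```python
def promoter_interval(tss, strand, upstream=1000, downstream=0):
    """Half-open promoter interval relative to the TSS, strand-aware."""
    if strand == "+":
        return (tss - upstream, tss + downstream)
    return (tss - downstream, tss + upstream)

def count_binding_events(promoter, peaks):
    """Number of peaks overlapping the promoter by at least 1 base."""
    start, end = promoter
    return sum(1 for p_start, p_end in peaks if p_start < end and p_end > start)

# Hypothetical gene and peaks (start, end); not real data.
promoter_1kb = promoter_interval(tss=5000, strand="+")
promoter_ext = promoter_interval(tss=5000, strand="+", downstream=500)
example_peaks = [(3900, 4100), (4500, 4700), (5000, 5200), (4990, 5010)]

n_1kb = count_binding_events(promoter_1kb, example_peaks)
n_ext = count_binding_events(promoter_ext, example_peaks)
print(n_1kb, n_ext)
```

Extending the promoter definition can only add overlapping peaks, which is why counts for the −1 kb definition act as a lower bound.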
Figure 1.
TFBM conservation analysis. A) Representative visualization of (amplified) DAP-Seq binding events (peaks) of TFs (data reanalyzed from O’Malley et al. 2016; López-Vidriero et al. 2021) relative to the TSS on the AT2G46310 (CRF5) promoter (all Arabidopsis [A. thaliana] promoter figures as interactive visualizations are deposited under https://doi.org/10.4119/unibi/2982196). Binding events are colored by log10-transformed signal value and open chromatin regions are depicted by four lines at the bottom. B) Histogram showing the number of experimentally determined in vitro binding sites of 534 TFs (same data as in A) on 2 different definitions (−1 kilobase [kb] and −1 kb, + 500 bases [b]) of 27,206 nuclear protein coding gene promoters in Arabidopsis (1 outlier each with >1,000 binding sites not shown). C) Schematic phylogenetic tree depicting major groups with TFBM data available (black) and chlorophytes as an outgroup (gray). The angiosperm Arabidopsis and bryophyte Marchantia (M. polymorpha) are highlighted as representative model organisms. Approximate million years from the last common ancestor below. D) Workflow for TFBM conservation analysis. Parts of this figure were created with BioRender.com. DAP-Seq, DNA affinity purification sequencing; TFs, transcription factors; TSS, transcription start site.
To analyze TFBM similarity within and across TF families in plants, we retrieved all available plant TFBMs in databases (O’Malley et al. 2016; Jin et al. 2017; Castro-Mondragon et al. 2022). To reduce bias, we recalculated the (amp)DAP-Seq-derived TFBMs for each TF through a common pipeline, and selected 1 representative TFBM (Supplementary Table S1) preferring ampDAP-Seq-based TFBMs due to the coverage of all genomic binding sites without methylation influence. All TFs with a TFBM were grouped into families based on their DBDs according to TAPscan TF family definitions for Arabidopsis (Wilhelmsson et al. 2017), and they were aligned and phylogenetically clustered (Fig. 1D). We then evaluated each TF family based on how many different TFBMs exist within this family and classified the families into either conserved, semi-conserved, or diverse (see methods for details, Supplementary Table S2). We determined in vitro TFBMs using ampDAP-Seq of orthologues as defined by OrthoFinder2 (hereafter orthologues) from Marchantia to test for TFBM conservation of at least 450 million years (Fig. 1, C and D).
Database queries yielded a total of 2,190 redundant entries from the Plant Cistrome (O’Malley et al. 2016), JASPAR (Castro-Mondragon et al. 2022), and PlantTFDB (Jin et al. 2017) databases. This dataset was amended with DAP-Seq data from López-Vidriero et al. (2021) (Table 1). After removal of redundant and inferred entries, 764 different TFs from 13 different plant species and 50 TF families with TFBM data remained. Data density for plants other than Arabidopsis is low. Of the 1,725 DNA-binding TFs in Arabidopsis, 686 have TFBM data available (Table 1), mainly derived from in vitro experiments. These 686 TFs represent 50 of 71 annotated families in the TAPscan database for Arabidopsis (Wilhelmsson et al. 2017). Within these 50 families, we know on average 40.5% of TFBMs of all annotated TFs in Arabidopsis. We tested for overall TFBM similarity and found that the currently known plant TFBMs represent 74 different core TFBMs (Supplementary Table S3), which are between 5 and 21 bp in length with an average length of 8.9 bp (Table 1). Some TFBMs like the WRKY W-box TTGAC are limited to 1 specific family, while the E-/G-box (C)ACGTG can be found as a TFBM for BRI1-EMS-SUPPRESSOR (BES1), basic helix-loop-helix (bHLH), basic leucine zipper (bZIP) TF family members, and 1 Trihelix family member (Supplementary Table S3).
Table 1.
Overview of general TFBM statistics from public databases
| Name | Value |
|---|---|
| Total TFBMs JASPAR | 757 |
| Total TFBMs Plant Cistrome DB | 814 |
| Total TFBMs PlantTFDB | 619 |
| Total TFs (A. thaliana) | 1,725 |
| TFs with TFBM (A. thaliana) | 686 |
| TFs with TFBM (other species) | 78 |
| Distinct consensus TFBMs | 74 |
| Average TFBM length [bp] | 8.9 |
| TF families (A. thaliana) | 71 |
| TF families with TFBM (A. thaliana) | 50 |
| TFBMs not assigned to TAPscan family | 18 |
| Average TFBMs known per family (A. thaliana) [%] | 40.5 |
Many families show high levels of TFBM conservation and overlapping binding sites
The TF families with the most TFBMs are the APETALA2/ETHYLENE-RESPONSIVE ELEMENT BINDING PROTEIN (AP2/EREBP) family (107) and the MYB family consisting of R2R3- and 3R-type MYB TFs (67). Phylogenetic analyses were performed on complete amino acid sequences for the 32 TF families which contained at least 4 TFBMs to assess the relationship between amino acid sequence and TFBM similarity. To date the age of a phylogenetic subgroup, we assessed the presence of Marchantia orthologues determined by Orthofinder2 (Emms and Kelly 2019) which, if present, indicate an age of at least 450 million years for the branch (Bowman 2022).
Based on the low number of distinct TFBMs (Table 1) and known conservation for individual families from literature (Ciolkowski et al. 2008; Franco-Zorrilla et al. 2014; O’Malley et al. 2016), we hypothesized that some TF families retain conserved TFBMs despite extensive family expansion. A phylogeny of the WRKY TFs resolved 6 subclades of which all except clade II-a contain at least 1 Marchantia orthologue (Fig. 2A). We confirmed the previously described TFBM conservation in the WRKY family (Ciolkowski et al. 2008), with TFBMs available for 45 out of 73 members in Arabidopsis, all binding the TFBM TTGAC across all clades with minimal variation in the flanking sequences (Fig. 2A). Subclade II-d has an additional annotated Zn cluster domain, which does not appear to influence the recognized TFBM (Fig. 2A). Based on the conserved TFBM, WRKY TFs could theoretically compete for binding at all TTGAC occurrences and previous analyses have shown substantial overlap in in vitro binding sites (O’Malley et al. 2016; Sielemann et al. 2021). To remove potential bias in TFBM detection and to quantify the degree of overlap in binding sites, we analyzed the set of all common and distinct binding peaks of 33 WRKY TFs for which (amp)DAP-Seq data are available from O’Malley et al. (2016). Of 35,345 total peaks, 292 are bound by more than 75% of WRKY TFs, 1,568 are bound by one-half to 75%, 21,523 are bound by at least 2 and up to one-half of all WRKY TFs, and 11,962, the minority, are detected as uniquely bound by 1 WRKY TF (Fig. 2C, Supplementary Table S4). To test whether competition can be excluded via different spatial-temporal expression patterns, we compiled expression data from 6,033 Arabidopsis RNA-Seq experiments and tested for expression similarity by Pearson correlation. Of the 45 WRKY family members with a TFBM in Arabidopsis, 15 share expression with at least 1 other family member based on a correlation coefficient of >0.7. Thirty-three have similar expression with at least 1 other family member based on a correlation coefficient of >0.5 (Fig. 2D). Contrasting expression patterns indicated by negative Pearson correlation are in the minority (Fig. 2D).
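The binning of merged peaks by the share of TFs bound can be sketched as follows. The exact bin boundaries (for example, whether exactly one-half falls into the lower or upper bin) are an assumption of this sketch; only the four reported WRKY counts are taken from the text.

```python
def sharing_category(n_bound, n_tfs):
    """Assign a merged peak to a sharing bin by the number of TFs bound."""
    if not 1 <= n_bound <= n_tfs:
        raise ValueError("n_bound must be between 1 and n_tfs")
    fraction = n_bound / n_tfs
    if n_bound == 1:
        return "unique"
    if fraction > 0.75:
        return ">75%"
    if fraction > 0.5:
        return "50-75%"
    return "2 to 50%"

# WRKY peak counts per bin as reported in the text (33 TFs).
wrky_counts = {">75%": 292, "50-75%": 1568, "2 to 50%": 21523, "unique": 11962}
total_peaks = sum(wrky_counts.values())
shared_fraction = 1 - wrky_counts["unique"] / total_peaks

print(total_peaks)                 # 35345, matching the reported total
print(round(shared_fraction, 3))   # share of peaks bound by at least 2 WRKY TFs
```

With these counts, roughly two-thirds of all WRKY peaks are bound by at least 2 family members.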
Figure 2.
Analysis of TF families with high TFBM conservation. A) Unrooted phylogenetic tree calculated using RAxML of full-length amino acid sequences aligned with MUSCLE of the WRKY family TFs with TFBMs. Support values at the nodes are based on 1,000 bootstrap iterations. The scale bar is in units of amino acid substitutions per site. Clade annotations are from Eulgem et al. (2000) and Interpro domain annotations. Collapsed phylogenetic tree is shown with indication of orthologous genes from Marchantia polymorpha (M.p.) in each subgroup. B) Consensus TFBMs of conserved families generated by merging individual TFBMs of TF family members for each TF family. Base height corresponds to information content. C) Merged peak set of 33 WRKY (amplified) DAP-Seq samples showing the subsets of shared and unique peaks. Darker color corresponds to a higher percentage of TFs sharing this peak subset and tile size encodes relative number of peaks in a given subset. D) Expression correlation of 45 WRKY family members in A. thaliana. The Pearson correlation coefficient is indicated by color and dot size. ARF, auxin response factor; AS2LOB, ASYMMETRIC LEAVES2/lateral organ boundary domain; BBRBPC, BARLEY B RECOMBINANT/BASIC PENTACYSTEINE; BES1, BRI1-EMS-SUPPRESSOR; bHLH TCP, basic helix-loop-helix TEOSINTE BRANCHED1/CYCLOIDEA/PROLIFERATING CELL FACTOR; Dof, DNA binding 1 finger; CAMTA, CALMODULIN-BINDING TRANSCRIPTION ACTIVATOR; CPP, cysteine-rich polycomb-like protein; E2FDP, eukaryotic 2 transcription factor/dimerization partner; GARP, GOLDEN2/ARR-B/PSR1; HD, homeodomain; HSF, heat shock factor; MADS MIKC; NAC, NAM/ATAF1/CUC2; SBP, SQUAMOSA promoter binding protein.
Examination of the other 31 TF families (Supplementary Data Set S1, Supplementary Table S2) with at least 4 characterized TFBMs revealed that 20 additional TF families show highly similar TFBMs within the TF family (Fig. 2B). For example, TFs of the C2C2 GATA zinc finger family unanimously bind the consensus TFBM GATC with little variation in the adjacent bases independent of the method used for TFBM determination (Supplementary Fig. S1). Three orthologs were found in Marchantia, suggesting potential evolutionary conservation of this TFBM. Among the conserved TF families, 21 of the 74 different consensus TFBMs (Fig. 2B) are represented by 332 TFs.
A detailed analysis of shared in vitro binding sites in each family with a conserved TFBM shows that, for all families, sites bound by more than 75% of family members were detected (Supplementary Table S4). For 7 families including the plant-specific C2C2 DNA binding with 1 finger (C2C2 Dof) and NAM/ATAF1/CUC2 (NAC) families, the sites with multiple binding TFs also exceed the uniquely bound sites. The other extreme is represented by the BARLEY B RECOMBINANT/BASIC PENTACYSTEINE (BBR/BPC) family in which, despite the shared TFBM, only 253 out of 15,492 detected binding sites are shared between at least 2 TFs (Supplementary Table S4). Analysis of the expression divergence of members within each family shows that, for each of the 21 TFBMs, in vitro binding sites and expression patterns overlap to some degree (Supplementary Tables S4, S5). In the analysis, we identified a total of 21 families with 1 conserved family TFBM (Fig. 2B), pointing to a large constraint in the de novo evolution of TFBMs as a way for neofunctionalization of TFs in at least these families.
Semi-conservation of TFBMs follows phylogenetic relationships
Not all TF families are under a similarly strict constraint for the TFBMs of their members. We expected that some families acquired more variation in the TFBM as a potential way to neofunctionalize and regulate different pathways in more than 450 million years of evolution. Our analysis revealed a continuous transition between TF families with a high degree of TFBM conservation (Fig. 2B) and families in which the TFs bind up to 13 different TFBMs across 33 analyzed members (C2H2 family) (Supplementary Fig. S2). Therefore, we established the classification of semi-conservation to cover TF families with up to 4 different consensus TFBMs and no more than 15% outliers (TFBMs represented by 1 member). Under this rule, the 5 families AT-rich interaction domain, bHLH, bZIP, MYB, and MYB-related are considered as semi-conserved (Supplementary Table S2).
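The thresholds for semi-conservation can be written as a small decision rule. The sketch below takes the two thresholds from the text (up to 4 consensus TFBMs, no more than 15% outliers) and treats everything beyond them as diverse; the criterion for "conserved" (exactly 1 consensus TFBM and no outliers) is an assumption, since the full definition is given in the methods rather than here.

```python
def classify_family(n_consensus_tfbms, n_outliers, n_members):
    """Classify a TF family by TFBM diversity (sketch, see assumptions above)."""
    if n_members <= 0:
        raise ValueError("n_members must be positive")
    outlier_fraction = n_outliers / n_members
    if n_consensus_tfbms > 4 or outlier_fraction > 0.15:
        return "diverse"
    if n_consensus_tfbms == 1 and n_outliers == 0:
        # Assumed criterion for a conserved family.
        return "conserved"
    return "semi-conserved"

# MYB-related: 3 consensus TFBMs, 3 outliers, 33 Arabidopsis TFs plus 3 from other species.
print(classify_family(3, 3, 36))
# C2H2: 13 consensus TFBMs across 33 members; the outlier count is not given
# in the text, so 0 is a placeholder that does not affect the result.
print(classify_family(13, 0, 33))
```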
The MYB-related TF family consists of 62 annotated TFs based on TAPscan for A. thaliana, out of which TFBMs for 33 TFs have been experimentally determined, as well as 1 TFBM each in maize (Zea mays), tomato (Solanum lycopersicum), and rice (Oryza sativa) (Fig. 3A). We observed 3 different TFBMs in the MYB-related family, GATAA, GATATT, and TAGGG, which generally coincide with phylogenetic subclades of the TF family members. We also observed 3 outliers in this family, including the determined TFBM from S. lycopersicum, which differs from other TFBMs in the phylogenetic tree (Fig. 3B). The monocot TFs, however, recognize the same TFBM as the most similar Arabidopsis TFs. The TFBMs thus reflect phylogeny of TF family members, highlighting conservation of these TFBMs across the tested plant species. Each of the 4 major branches in the phylogenetic tree contains at least 1 Marchantia TF in the orthogroup indicating that the subgroups have existed since the last common ancestor of bryophytes and angiosperms.
Figure 3.
Analysis of TF families with semi-conserved TFBMs. A) Unrooted phylogenetic tree calculated using RAxML of full-length amino acid sequences aligned with MUSCLE of the MYB-related family with support values based on 1,000 bootstraps. The scale bar is in units of amino acid substitutions per site. Interpro domain annotations indicate structural similarities. Clade annotations are from Chang et al. (2020) and domain annotations are from Interpro. Collapsed phylogenetic tree (not drawn to scale) with indication of orthologous TFs from Marchantia polymorpha (M.p.) in each subgroup. B) Collapsed phylogenetic trees (not drawn to scale) with orthologous M. polymorpha TFs found in the different semi-conserved TF families. Light gray indicates orthologues were not found. TFBMs represent the consensus TFBMs present in the clades. C) Merged peak sets for each of the MYB-related TFBM subgroup TF samples showing the subsets of shared and unique peaks. Darker color corresponds to a higher percentage of TFs sharing this peak subset and tile size encodes relative number of peaks in a given subset. ARID, AT-rich interaction domain; bHLH, basic helix-loop-helix; bZIP, basic leucine zipper; MYB, MYELOBLASTOSIS.
Within each subclade, the MYB-related TFs potentially compete for binding at binding sites. Expression analysis in Arabidopsis shows that within the 13 members binding the TFBM GATAA, 4 show similar expression patterns with at least 1 other family member based on a correlation coefficient of >0.5 (Supplementary Table S5) and 54.7% of all binding sites of the 12 members with DAP-Seq data within this subgroup are shared by at least 2 TFs (Fig. 3C, Supplementary Table S4). Among the 9 members binding the TFBM GATATT, 4 share expression patterns with at least 1 other family member based on a correlation coefficient of at least 0.7 (Supplementary Table S5). The majority of 78.3% of in vitro binding sites is not unique (Fig. 3C). Among the 7 members binding the TFBM TAGGG, 3 have similar expression patterns (>0.5) with at least 1 other family member (Supplementary Table S5). In this subgroup a very large proportion (94.1%) of uniquely bound sites is observed, which are mostly contributed by 1 of the 3 TFs (Fig. 3C). Although the higher level of divergence of the TFBMs reduces the potential for competition, we detected similar expression patterns between at least 2 members binding the same TFBM for 40 TFs, indicating that, also in the MYB-related family, different spatial-temporal expression patterns do not exclude the TFs from competition.
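Counting members that share expression with at least 1 other member reduces to pairwise Pearson correlation against a threshold. The following is a minimal sketch with invented expression vectors and hypothetical TF names; the study itself used 6,033 RNA-Seq experiments, and any normalization applied there is not reproduced.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def members_with_partner(expression, threshold):
    """TFs whose expression correlates above threshold with at least 1 other TF."""
    names = sorted(expression)
    hits = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(expression[first], expression[second]) > threshold:
                hits.update((first, second))
    return hits

# Hypothetical expression values across 4 samples; not real data.
toy_expression = {
    "TF_A": [1, 2, 3, 4],
    "TF_B": [2, 4, 6, 8],
    "TF_C": [4, 3, 2, 1],
    "TF_D": [2, 1, 4, 3],
}
print(sorted(members_with_partner(toy_expression, 0.7)))
print(sorted(members_with_partner(toy_expression, 0.5)))
```

Lowering the threshold from 0.7 to 0.5 can only add members, which matches the pattern of the counts reported for both cutoffs.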
The bHLH family members bind 2 different TFBMs that differ predominantly in 2 base positions. The initial C of the G-box TFBM CACGTG is missing in the second TFBM and the first G is replaced by a T, leaving the TFBM ACTTG, which was found to be bound by 4 bHLH TFs (Fig. 3B). The bHLH TFs binding to the TFBM ACTTG are more distant from all other members in the tree and the previously analyzed TFs of this subgroup have no Orthofinder2-based orthologues in Marchantia (Fig. 3B). Again, binding site analyses and expression analysis suggest potential competition of different bHLHs for binding at the same binding sites (Supplementary Tables S4, S5). Out of 10,646 binding sites, 4,117 sites are bound by >1 TF. Four of the 28 members binding the CACGTG TFBM subgroup share expression pattern (>0.7) and 13 have similar expression patterns (>0.5). However, the 4 members with the TFBM ACTTG rarely share binding sites and they have distinct expression patterns with a correlation coefficient <0.5 (Supplementary Tables S4, S5), suggesting binding specificity is potentially established through both site preferences and different spatial-temporal expression.
Grouping of TFBMs along the phylogenetic relationship is also observed in the bZIP family and the MYB family (Fig. 3B), suggesting stability of TFBMs during clade-specific expansion. Binding site analyses show that many MYB TF binding peaks and the majority of bZIP binding sites are shared by at least 2 TFs in vitro (Supplementary Table S4). Expression analyses show that there is extensive sharing of expression patterns within members binding the same TFBM (Supplementary Table S5). This group of 5 semi-conserved TF families covers 18 of the 74 consensus TFBMs, including TFBMs that are only bound by 1 member in a family and were excluded from consensus TFBM generation. The level of competition for each TFBM is reduced compared to the TF families for which the TFBMs are highly conserved, with a lower number of TFs that share expression or have a similar expression pattern.
Diverse TF families bind a variety of different TFBMs
We identified 5 TF families that have more than 4 different consensus TFBMs or more than 15% outliers and classified them as diverse. The diverse TF families are ABSCISIC ACID INSENSITIVE3/VIVIPAROUS1 (ABI3/VP1), AP2/EREBP, C2H2, C3H, and Trihelix (Supplementary Table S2). The 188 members in the TF families with less TFBM conservation cover 39 of the 74 identified distinct TFBMs (Table 1).
Trihelix TFs have 1 or 2 DBDs similar to the MYB domain and are therefore called MYB/SANT-like (Fig. 4). The Trihelix family is divided into 5 clades based on amino acid sequence similarities and the phylogeny recapitulates the grouping (Kaplan-Levy et al. 2012) (Fig. 4). As of now, TFBMs are available for 14 out of 26 members in Arabidopsis. All previously defined clades of the Trihelix family have at least 1 known TFBM (Fig. 4). The 5 TFs in GT-2 clade have a conserved TFBM of TTTAC. Two TFs in the GT-1 clade, as well as 1 TF in the SIP1 clade bind to the well-described GT-element with a consensus sequence of GGTTAA (Kaplan-Levy et al. 2012). The third TFBM from the GT-1 clade for TF GT-3a (AT5G01380) is CACGTG (Ayadi et al. 2004), which is also bound by bZIP and bHLH TFs (Fig. 3B). The GTγ and SH4 clades are only represented by 1 TFBM which are both different to others in the tree. The TFs AT1G76870, AT1G76880 and AT1G76890 are encoded directly next to each other likely as a result of tandem duplication (Kaplan-Levy et al. 2012), but AT1G76870 (GTγ clade) has only 1 elongated DBD and a different TFBM compared to the other 2 TFs (GT-2 clade) (Fig. 4).
Figure 4.
The trihelix family is an example of a TF family with diverse TFBMs. Unrooted phylogenetic tree calculated using RAxML of full-length amino acid sequences aligned with MUSCLE of the Trihelix TF family members with TFBM determined. Support values at the nodes are based on 1,000 bootstrap iterations and domain annotations from Interpro. The scale bar is in units of amino acid substitutions per site.
We also identified a high diversity of TFBMs bound by the C2H2 TF family members in plants (Supplementary Fig. S2). Within the C2H2 TF family in plants, a TFBM has been experimentally determined for 32 TFs from Arabidopsis and 1 from tomato. These TFs bind 13 different consensus TFBMs, which is the highest number of different consensus TFBMs within 1 TF family reported in this analysis (Supplementary Table S2).
Many TFBMs have been conserved for at least 450 million years
The bioinformatic analyses suggested that TFBMs of particular TF families or subfamilies have been constant since the last common ancestor of bryophytes and angiosperms. To test these hypotheses with laboratory experiments, we selected orthologous TF candidates in Marchantia as a bryophyte representative and performed ampDAP-Seq to experimentally determine the binding sites in vitro and derive the TFBMs.
Of the 14 WRKY TFs in Marchantia, we selected MpWRKY11 and MpWRKY14 for experimental testing. Mapping of bound DNA sequences on the Marchantia BoGa genome sequence (Beaulieu et al. 2025; https://doi.org/10.4119/unibi/2982437) and peak calling with a negative vector control as background revealed 682 binding sites of MpWRKY11 and 27,902 binding sites of MpWRKY14. The number of peaks loosely corresponds to the number of reads per sample, indicating potentially more low affinity binding sites were detected for MpWRKY14. Both experiments yielded a TFBM of TTGAC, which is stable when using only the top 10% of peaks for TFBM detection (Supplementary Fig. S3). The NAC family was represented by MpNAC3, which has 41,258 binding sites in the Marchantia genomic sequences and yielded the gapped TFBM C(G/T)T…AAG, as was predicted from the analyses of Arabidopsis TFBMs in the TF family. The representatives MpHD3 (HD Zip I-II, 2 members), MpDEL1 (eukaryotic 2 transcription factor/dimerization partner (E2F/DP) family, 6 members), MpGATA2 (C2C2 GATA family, 3 members), and MpBPCV (BBR/BPC family, 2 members) yielded the TFBMs (A/C)ATNAT, (G/C)GCGGG, GATC, and GAGAGA, respectively (Fig. 5A). All of these TFBMs were predicted as the likely family TFBM in the analysis of available seed plant TF data (Pearson correlation coefficient of PWMs >0.68) (Fig. 5A, Supplementary Table S6). Marchantia TFs in the semi-conserved MYB-related family were tested for each TFBM subgroup GATAA, GATATT, and TAGGG. Mp1R-MYB7 was chosen to test for GATAA, MpRVE for GATATT and Mp1R-MYB8 for TAGGG. Mp1R-MYB7 yielded a TFBM of GATNA, MpRVE yielded GATATT, and Mp1R-MYB8 yielded TAGGG with minor changes in the flanking bases (Fig. 5B). All TFBMs are highly similar to the TFBMs predicted based on phylogeny and a correlation coefficient above 0.93 in the MYB-related family (Fig. 5B, Supplementary Table S6). 
To pick candidates for testing in the semi-conserved bHLH family, all 42 TFs in Marchantia were integrated into the phylogenetic tree (Supplementary Fig. S4). Although OrthoFinder2 did not yield an orthologous TF for the ACTTG TFBM subclade (Fig. 3B), bHLH TFs in Marchantia clustered between Arabidopsis TFs binding the TFBM ACTTG (Supplementary Fig. S4). Therefore, we tested the 2 M. polymorpha TFs MpBHLH27 and MpBHLH44, which yielded C(A/C)ACTTG and CACGTG (correlation >0.88), respectively, as predicted based on phylogeny (Supplementary Fig. S4, Supplementary Table S6). Experimentally determined TFBMs of Marchantia TFs in the semi-conserved MYB and bZIP families also matched the TFBM predicted from Arabidopsis data for 1 subgroup each (Supplementary Fig. S5). Taken together, we detected a highly conserved TFBM within the tested phylogenetic clades for at least 450 million years based on the experimental evidence in Marchantia. These results support the likely expansion of TFs within the subclades and an evolution of the differential TFBMs predating the split of bryophytes and angiosperms.
Figure 5.
TFBMs in M. polymorpha show high conservation in comparison with A. thaliana TFBMs. A) Consensus TFBMs determined for each family (left) and experimentally determined TFBMs of one orthologous TF family member in M. polymorpha by ampDAP-Seq (right). B) Consensus TFBM of each clade (left) and TFBM of one orthologous TF in M. polymorpha from each subclade. Tree not drawn to scale. Parts of this figure were created with BioRender.com. BBRBPC, BARLEY B RECOMBINANT/BASIC PENTACYSTEINE; E2FDP, eukaryotic 2 transcription factor/dimerization partner; HD, homeodomain; NAC, NAM/ATAF1/CUC2; MYB, MYELOBLASTOSIS.
Discussion
There is limited potential for TFBM evolution in some families
DNA-binding TFs of different families have different potential to evolve new TFBMs as a mechanism to neofunctionalize in plants. In families like auxin response factor (ARF), BES1, C2C2 GATA, E2F/DP, MADS MIKC, NAC, WRKY, and all other conserved families, all member TFs bind the same consensus TFBM based on currently available data (Fig. 2B), despite the fact that many of these families have existed at least since the last common ancestor of all land plants. Inclusion of TFBMs directly downloaded from JASPAR and PlantTFDB does not alter the conclusions about conservation, indicating that TFBM detection is robust against different technologies (Supplementary Fig. S1, also see O’Malley et al. 2016). This long conservation suggests that, very much unlike the biotechnologically programmable C2H2 TFs (Ichikawa et al. 2023), the defining DBD of these TF families is highly constrained. Conservation may extend to the last eukaryotic common ancestor in some cases as, for example, the strict E2F binding TFBM TTT(G/C)(G/C)CGC in humans (Rabinovich et al. 2008) nearly overlaps with the conserved E2F/DP TFBM TTTGGCG(C/G) determined in Arabidopsis. The C2C2 GATA TF family, named for (A/C/T)GATA(G/A) as their consensus cis-regulatory element sequence in metazoans, binds the TFBM GATC in plants. In Arabidopsis and Marchantia, all analyzed members bind to the GATC consensus TFBM, suggesting a shift in TFBMs from the last eukaryotic common ancestor to land plants is possible. GATA TFs contain 1 or 2 type IV zinc finger domains and are found in fungi, metazoans, and plants. In fungi and metazoans, GATA TFs generally respond to abiotic stimuli (light, nutrients) (Schwechheimer et al. 2022). Inactivating mutations in mammalian GATA TFs are associated with developmental diseases, and expression of the 6 TF family members in humans is tissue specific to regulate different functions (Tremblay et al. 2018).
In plants, a conservation of involvement of GATA TFs in control of light-dependent responses is indicated (Reyes et al. 2004). The role of GATA TFs in chloroplast biogenesis in Arabidopsis was found to not be conserved in Marchantia (Frangedakis et al. 2024) despite a conserved TFBM (Fig. 5A) underscoring evolution in cis-regulatory elements rather than in trans-acting TFs.
On the opposite end of the spectrum, the C2H2 zinc finger family was reported as diverse in previous studies in metazoans fly and human (Nitta et al. 2015; Lambert et al. 2019), suggesting that some families are inherently more diverse across multiple kingdoms of life, despite unique evolutionary patterns in plants compared to metazoans (Bonchuk and Georgiev 2024). In human, zinc finger clusters need to be able to evolve at a high rate to repress endogenous retroviruses (Lukic et al. 2014). Analysis of the Trihelix family reveals 1 example of potential neofunctionalization of TFs after gene duplication, where AT1G76870 only has 1 DBD and a distinct TFBM (Fig. 4) compared to the 2 Trihelix TFs encoded directly next to it in the Arabidopsis genome. The TF families identified as diverse readily evolve TFBMs de novo compared to conserved and semi-conserved families, thus creating possibility for neofunctionalization of TFs via their TFBM.
Plant-specific TF families also show high degrees of TFBM conservation
The TF families ARF, ASYMMETRIC LEAVES2/lateral organ boundary domain (AS2/LOB), bHLH TCP, BBR/BPC, BES1, C2C2 Dof, GOLDEN2/ARR-B/PSR1 (GARP) ARR-B, GARP G2-like, homeodomain (HD) Zip I-II, -IV, -PLINC, MADS MIKC, NAC, SQUAMOSA promoter binding protein and WRKY are among the conserved families and found in the plant kingdom only (Bowman et al. 2017; Jin et al. 2017; Wilhelmsson et al. 2017; Blanc-Mathieu et al. 2024). Many semi-conserved TF families (Fig. 3) radiated at or before the last common ancestor of all land plants. Thus, it is impossible to determine their degree of conservation by comparison to data from other kingdoms. Within seed plant TF families, we identified a single consensus TFBM in 21 families, including 14 plant-specific families and semi-conservation along phylogenetic clades in 5 additional families. Given the presence of Marchantia orthologues in the subclades of the phylogenetic tree, we hypothesized that the subclades and TFBM differences are at least 450 million years old (e.g. Figures 2, 3 and Supplementary Data Set S1).
We tested for conservation in Marchantia for 7 conserved TF families, all subgroups of the semi-conserved MYB-related and bHLH family, as well as 1 subclade for the bZIP and MYB family each. Unsurprisingly, the E2F/DP family TFBM in Marchantia matches the TFBM from Arabidopsis E2F/DP TFs and is a subset of the strict human E2F TFBM. Within the plant-specific MADS MIKC, NAC, WRKY, and BBR/BPC TF families, the TFBM determined for Marchantia TFs matches the family TFBM determined in seed plants (Fig. 5A). These plant-specific TF families are thus similarly constrained as the E2F/DP TFs (Rabinovich et al. 2008) in their neofunctionalization with regard to their TFBM. This constraint appears true even though the in vitro binding site specificity varies greatly with BBR/BPC TFs binding rather specifically in the seed plant Arabidopsis while the other families show large overlap in in vitro binding sites between their members (Supplementary Table S4).
The duplication of TFs in families with conserved TFBMs, however, has not enabled evolution of new TFBMs as a mechanism for neofunctionalization in the samples examined (Fig. 5A). The analysis of all phylogenetic TFBM subgroups of the semi-conserved MYB-related family shows that each subgroup contains at least 1 Marchantia orthologue (Fig. 3A) and that the tested orthologue from each subgroup matches or nearly matches the TFBM (Fig. 5B, Supplementary Table S6). While these TF families are able to evolve new TFBMs, they do so only very rarely. The data suggest that they have not evolved a new TFBM since the last common ancestor of all land plants approximately 450 million years ago despite an increase of TF family members (Fig. 5B, Supplementary Fig. S4, S5). The evolution of a new TFBM in the single copy TF LEAFY between algae and the evolution of Marchantia has been documented (Sayou et al. 2014). It is tempting to speculate that the colonization of a new and empty niche, despite or perhaps because of the many challenges, enabled radiation among the usually constrained TF families. The presence of outliers, defined here as TFBMs that occur only once within a TF family, such as the TFBM CATCAT in the MADS MIKC TF family (Supplementary Data Set S1), may either reflect technical errors, such as mislabeling or co-occurring binding sites under the peak, or represent novel TFBM evolution within seed plants.
Since the last common ancestor of all land plants, seed plants have evolved a number of innovations, such as flowers, seeds, pollen, and secondary growth, among others (Bowman et al. 2017). Many of the TFs known to control these traits in the seed plant Arabidopsis are already present in the last common ancestor of all land plants. For example, the MADS MIKC TF AGAMOUS-LIKE 2 is involved in flowering in Arabidopsis, and its orthogroup member in the non-flowering liverwort Marchantia, MpMADS2, already binds the MADS typical “CArG box” (Fig. 5A, Supplementary Data Set S1). GATA TFs in Arabidopsis are also involved in flowering (Richter et al. 2013) and have thus acquired novel functions since the last common ancestor of Marchantia and Arabidopsis despite no changes in the conserved TFBM GATC (Figs. 2B and 5A). The hypothesis of a new “master regulator” for a new trait, such as posited, for example, for C4 photosynthesis (Westhoff and Gowik 2010) is tempting. However, the extensive TFBM conservation among many TF families (Figs. 2, 3, and 5) suggests that, if at all, evolution of a new TFBM is more likely in diverse families (Fig. 4, Supplementary Fig. S2). Although such an event appears unlikely based on the data presented here (Figs. 2, 3, and 5), evolution of a new TFBM within the single-copy pioneer TF LEAFY was based on 3 amino acid substitutions in the DBD (Sayou et al. 2014). On balance, we postulate that the likelihood of changes in cis, that is to the promoter syntax of target genes, to enter an existing regulon (van den Bergh et al. 2014) is higher. The observed conservation of TFBMs between bryophytes and angiosperms (450 million years) echoes a similar finding in metazoans (600 million years) (Lambert et al. 2018) and yeast (Nitta et al. 2015). As in animals, we observe a number of diverse TF families in planta that have evolved multiple TFBMs over time (Fig. 4, Supplementary Table S2).
Multiple factors mediate binding specificity of TFs
While both anecdotal and large-scale systematic analyses of TFBMs suggested that phylogenetically related TFs bind the same or a very similar TFBM (Fig. 2, A and B) (Weirauch et al. 2014; O’Malley et al. 2016; Lambert et al. 2019; Tu et al. 2020), analysis of TF knockout mutants suggests that phylogenetically related TFs have very different functions. For example, TFs of the R2R3-MYB family are involved in many processes including development, regulation of flavonoid biosynthesis, and resistance despite conserved DBDs (Stracke et al. 2001; Dubos et al. 2010). The similarity of TFBMs makes it difficult to conceptualize specific transcriptional regulation. Although Arabidopsis contains over 1,700 TFs, transcriptional regulation operates with a limited vocabulary of 74 TFBMs based on 686 TFs analyzed experimentally (Table 1). Similarly small numbers of 72 consensus TFBMs based on 619 TFs from Arabidopsis (Jores et al. 2021) and 85 consensus TFBMs based on 529 Arabidopsis TFs (O’Malley et al. 2016) have been reported previously based on slightly different methods to compare TFBMs. The PWMs reported here (Figs. 2 to 5) were generated robustly, that is by random sampling of all peaks, and were not different from PWMs generated by subsampling only the 10% of peaks with the highest signal values (Supplementary Fig. S3). This method does not weight for affinity (Machanick and Bailey 2011) and thus generates a consensus TFBM of all possible binding events rather than of only high-affinity binding events. With this approach, low-affinity binding sites, which have been shown to be critical for function in animals (Tsai et al. 2017), are included. TFBMs represented by PWMs are, although very widely used, not an ideal representation of in vivo TF binding specificity. A reverse search with a PWM from a TF using FIMO (Bailey et al. 2009) demonstrates that the number of sequences matching the PWM far exceeds the number of experimentally determined binding sites in a plant genome (Sielemann et al. 2021). The observation is independent of whether ChIP-Seq (the determination of binding sites in vivo) or DAP-Seq (the determination of binding sites in vitro) was used (Sielemann et al. 2021). PWMs alone are not a sufficient predictor for actual binding sites even when in vitro binding is tested (Figs. 2B and 3C) (Sielemann et al. 2021; Yan et al. 2021). Since binding of a protein to DNA depends on both the affinity between the binding partners and their concentrations (Anashkina 2023), it cannot be excluded that at very high TF concentrations all sites with a core TFBM are indeed bound. New approaches utilizing machine learning are able to better capture and predict TF binding specificities (Yan et al. 2021; Avsec et al. 2021b). Complex formation of TFs with different proteins is also known to mediate binding specificity (Ramsay and Glover 2005). While DAP-Seq allows for homodimerization, progress has only recently been made in also analyzing heterodimers (Li et al. 2023).
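The excess of PWM matches over bound sites can be illustrated with a minimal sketch of a forward-strand PWM scan. This is not FIMO and not the procedure used in the study: the scoring (summed log-odds against a uniform background), the pseudocount, the score threshold, and the toy matrix for the G-box CACGTG are all illustrative assumptions.

```python
import math

BASES = "ACGT"

def log_odds(pwm, background=0.25, pseudocount=0.01):
    """Convert a probability matrix (list of dicts base -> probability) to log2-odds scores."""
    scores = []
    for column in pwm:
        scores.append({
            b: math.log2((column[b] + pseudocount) / (1 + 4 * pseudocount) / background)
            for b in BASES
        })
    return scores

def scan(sequence, pwm, threshold):
    """Return 0-based start positions of forward-strand windows scoring >= threshold."""
    lo = log_odds(pwm)
    width = len(lo)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous bases
        score = sum(lo[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append(start)
    return hits

def near_one_hot(consensus, p=0.97):
    """Hypothetical helper: build a toy matrix that strongly favors one consensus sequence."""
    return [{b: (p if b == c else (1 - p) / 3) for b in BASES} for c in consensus]

GBOX = near_one_hot("CACGTG")  # toy values, not a matrix from this study
hits = scan("AACACGTGTTCACGTGA", GBOX, threshold=8.0)
```

Because a short core such as CACGTG occurs by chance many times in any genome, lowering the hypothetical threshold quickly increases the number of reported matches, which is consistent with the observation that PWM matches alone overestimate binding.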
Potentially, DNA shape conferred by TFBM adjacent bases yields specificity (Gordân et al. 2009; Rohs et al. 2009; Sielemann et al. 2021). Here, we show that an in vitro binding site comparison complements the TFBM analysis and demonstrates extensive overlap of in vitro binding sites (Figs. 2C and 3C, Supplementary Table S4). While a minority of families and subfamilies, notably the BBR/BPC and the MYB-related TAGGG, have more than 90% uniquely bound sites despite shared TFBMs, the majority of TF families and subfamilies show large overlaps in TFBMs and in vitro binding sites (Figs. 2C and 3C, Supplementary Table S4). The reported overlap in binding sites provides a lower bound, as deeper sequencing might lead to detection of additional binding sites. The proportion of overlaps with more than half of the members of a family binding a particular site demonstrates that the overlap is not a function of the most recent whole genome duplication event (Figs. 1A, 2C and 3C, Supplementary Table S4) (Bowers et al. 2003).
In principle, specificity may also be achieved via different spatial-temporal expression patterns (Breuninger et al. 2008), different partner proteins (Kim et al. 2008; Appelhagen et al. 2011), histone modifications (Charron et al. 2010; Zhao et al. 2019), chromatin accessibility (Lu et al. 2019), and binding site number and spacing (O’Malley et al. 2016; Galli et al. 2018). Cooperativity has been demonstrated for plant TF binding sites (Jores et al. 2024). The redundancy of binding sites creates room for evolution through single point mutations in one of the binding sites. Here, we explored whether expression patterns differ between TFs that share a TFBM using 6,033 Arabidopsis wildtype RNA-Seq experiments and found that similar expression patterns among TFs with a similar TFBM do occur (Fig. 2D, Supplementary Table S5). While this data covers temporal variation, it cannot be excluded that expression variation at the single cell level occurs. Genetic experiments have demonstrated redundancy among TFs in many families, for example the AP2/EREBPs C-REPEAT BINDING FACTOR 1 to 3 in cold regulation (Jia et al. 2016), the NACs SOMBRERO, BEARSKIN 1 and 2 in root cap development (Bennett et al. 2010) and MYBs in anthocyanin regulation (Gonzalez et al. 2008), which points to a functional role of the observed similarity in binding and expression (Figs. 2C and D, 3C, Supplementary Table S4, S5).
Chromatin accessibility may also only play a minor role in mediating specificity as TFs not only share a TFBM but also share binding sites (Figs. 1A, 2C, and 3C, Supplementary Table S4). Binding site spacing and protein-protein interactions were not explored in this work. For a subset of bHLH MYC-related TFs, a relative but not absolute preference for binding site spacings is known (López-Vidriero et al. 2021). If we assume that all effects in general are similarly small compared to expression patterns, DNA shape and chromatin accessibility, the overlap in in vitro binding sites and expression patterns (Figs. 2 and 3, Supplementary Tables S4, S5) opens up the possibility that TFs compete for the same binding site.
Competition at binding sites provides an explanation for the high variation in complementation experiments observed for TFs on binding sites, which are bound by many partners (Lee et al. 2007; Stracke et al. 2010; Gangappa and Botto 2016). Competition at binding sites and different binding affinities may also provide an explanation for the still substantial gap in predictability of gene expression from sequence and from binding events alone (Li et al. 2018; Avsec et al. 2021; de Almeida et al. 2022). It may also explain why TFs counterintuitively act as both activators and repressors (Mahendrawada et al. 2025), since, intriguingly, HY5 experiments with the TF carrying an enhancer or repressor domain still fail to clarify if native HY5 acts as a repressor or activator on its targets (Burko et al. 2020).
Conclusions
Taken together, our results show that TFs in plants act on a limited vocabulary of TFBMs, which has been conserved for at least 450 million years in selected families, and that they act on overlapping binding sites in vitro. In conserved and semi-conserved families, further conservation of TFBMs along phylogenetic clades can be predicted from phylogenetic analyses of available plant TFBMs and overlapping in vitro binding sites detected. The combination of limited neofunctionalization of TFs via variant TFBMs with the evolution of major innovations in seed plants supports the hypothesis that most evolution must occur in cis- rather than trans-regulatory elements, at least if the innovations are regulated by TF families with conserved or semi-conserved TFBMs. Competition at binding sites may play a larger role than previously appreciated.
Materials and methods
Amplified DAP-Seq (ampDAP-Seq)
Plants were grown for 6 weeks on half-strength Gamborg's medium (Gamborg B5; Duchefa Biochemie B.V., Netherlands) in petri dishes with a 16 h/8 h light/dark cycle at room temperature. DNA was extracted from male G2 generation Marchantia (M. polymorpha subsp. ruderalis BoGa) (Busch et al. 2019) with the cetyltrimethylammonium bromide method (https://dx.doi.org/10.17504/protocols.io.bcvyiw7w).
DNA (4 to 5 µg) was fragmented by sonication to 200 bp with the M220 Focused-Ultrasonicator (Covaris, USA). End-repair, A-tailing and Y-adaptor ligation were performed following the protocol of Bartlett et al. (2017) with two modifications: for sample clean-up, the DNA was purified using AMPure XP beads (Beckman Coulter, USA) instead of ethanol precipitation, and the NEBNext End Repair Module was used for end-repair. To obtain an ampDAP library, 15 ng of the DAP library was amplified with 11 PCR cycles. For the binding assay, 17 genes were individually cloned in pFN19A (N-terminal HaloTag) T7 SP6 Flexi vector (Promega, USA; discontinued) by Gibson assembly (Gibson et al. 2009) using AsiSI and DraI restriction sites. All primer sequences are available in Supplementary Table S7. TFs as well as an empty HaloTag vector control were expressed with TnT Coupled Wheat Germ Extract System (Promega, USA) using 2 µg plasmid DNA. Halo-fusion proteins were purified with Magne HaloTag Beads (Promega, USA) and then incubated with 50 ng ampDAP library. DNA was recovered, amplified and indexed with 16 to 20 PCR cycles. Fragments between 200 and 400 bp were extracted from a 1% agarose gel with QIAquick Gel Extraction Kit (Qiagen, Netherlands). The final library was sequenced as 85 bp long single-end reads on a NextSeq 550 or NextSeq 2000 (Illumina, USA).
TFBM determination
TF families and their members in Arabidopsis (A. thaliana) were retrieved from TAPscan (Lang et al. 2010; Wilhelmsson et al. 2017). DAP- and ampDAP-Seq data from O’Malley et al. (2016) was obtained from the Gene Expression Omnibus database (Barrett et al. 2013) under the accession GSE60143 and data from López-Vidriero et al. (2021) under GSE155321 and analyzed according to Sielemann et al. (2021): peak sequences were extracted from the TAIR10 reference genome (https://www.arabidopsis.org) of Arabidopsis and TFBMs were determined using MEME-ChIP (Machanick and Bailey 2011). The TFBM with the lowest e-value and less than 21 bases was chosen to avoid long artifacts. AmpDAP-Seq data from Marchantia was mapped to the reference genome (https://doi.org/10.4119/unibi/2982437) using Bowtie v.2.4.2 (Langmead and Salzberg 2012). Output files were converted to BAM format with SAMtools (Danecek et al. 2021) view and sorted (samtools sort -n). To remove duplicates, the SAMtools commands fixmate, sort and markdup were executed with default parameters. Peaks were called using GEM (Guo et al. 2012) or MACS3 (Zhang et al. 2008) including the empty vector sample as a control. Peak sequences centered around the peak summit or the top 10% of peaks selected based on the fold enrichment (signal value) were submitted to MEME-ChIP for TFBM extraction and yielded the same TFBMs (Supplementary Fig. S3). Additionally, we used experimentally determined TFBMs directly from the open access databases JASPAR2022 Plant Core (Castro-Mondragon et al. 2022) and PlantTFDB (Jin et al. 2017), if no DAP-Seq data was available. All used TFBMs for analyses in this study with their respective identifiers, source and the original experimental method are listed in Supplementary Table S1.
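The two selection rules described above, choosing the TFBM with the lowest e-value among motifs shorter than 21 bases and subsetting the top 10% of peaks by signal value, can be sketched as follows. This is a minimal illustration and not the code used in the study: the record layout (dictionaries with the keys `width`, `evalue`, and `signal`) is a hypothetical simplification of MEME-ChIP and peak-caller output.

```python
def select_motif(motifs, max_width=20):
    """Pick the motif with the lowest E-value among motifs of fewer than 21 bases.

    `motifs` is a list of dicts with hypothetical keys "width" and "evalue".
    Returns None if no motif passes the width filter.
    """
    candidates = [m for m in motifs if m["width"] <= max_width]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m["evalue"])

def top_fraction(peaks, fraction=0.10):
    """Return the top fraction of peaks ranked by signal value (at least one peak).

    `peaks` is a list of dicts with a hypothetical key "signal" (fold enrichment).
    """
    ranked = sorted(peaks, key=lambda p: p["signal"], reverse=True)
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n]

# Toy input, not data from the study
motifs = [
    {"id": "long_artifact", "width": 25, "evalue": 1e-300},
    {"id": "core", "width": 8, "evalue": 1e-50},
    {"id": "weak", "width": 6, "evalue": 1e-10},
]
peaks = [{"name": f"peak{i}", "signal": float(i)} for i in range(20)]

chosen = select_motif(motifs)
strongest = top_fraction(peaks)
```

In this toy input the 25-base motif is excluded despite its lower e-value, which mirrors the stated intent of avoiding long artifacts.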
Family allocation
All TFs were assigned to TAPscan TF families in Arabidopsis. If a corresponding TF family in another database exists in TAPscan, annotations were converted. Accurate allocation was validated via annotated Interpro domains (Blum et al. 2021). TFBMs from other species were assigned to the TF families using blastp (Altschul et al. 1990) with the corresponding protein sequence against the Arabidopsis TAIR10 proteome (https://www.arabidopsis.org). The protein sequences of the TFs were retrieved from Uniprot or the respective proteome from Phytozome (Goodstein et al. 2012) if available. The best blast hit ranked by e-value and secondly percentage of identity was chosen to determine the nearest Arabidopsis putative orthologue, whose TF family annotation was then transferred. The databases contain 18 TFBMs, which could not be assigned to a distinct TAPscan TF family. The MYB family was manually reduced to contain only MYB3R and R2R3-MYB factors in accordance with annotations from Stracke et al. (2001) and Plant Cistrome database annotations.
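The transfer of a family annotation via the best blast hit can be sketched as below. This is a minimal sketch under stated assumptions: the hit records (keys `subject`, `evalue`, `pident`) and the lookup table `family_of` are hypothetical stand-ins for parsed blastp output and TAPscan annotations, and the gene identifiers are placeholders.

```python
def best_hit(hits):
    """Return the hit with the lowest e-value; ties are broken by highest percent identity."""
    if not hits:
        return None
    return min(hits, key=lambda h: (h["evalue"], -h["pident"]))

def transfer_family(hits, family_of):
    """Assign the TF family of the best-scoring Arabidopsis hit, or None if unavailable."""
    hit = best_hit(hits)
    if hit is None:
        return None
    return family_of.get(hit["subject"])

# Placeholder identifiers and annotations, for illustration only
family_of = {"AT_GENE_A": "WRKY", "AT_GENE_B": "NAC"}
hits = [
    {"subject": "AT_GENE_B", "evalue": 1e-80, "pident": 61.0},
    {"subject": "AT_GENE_A", "evalue": 1e-80, "pident": 74.5},
    {"subject": "AT_GENE_B", "evalue": 1e-20, "pident": 90.0},
]
family = transfer_family(hits, family_of)
```

Sorting on the tuple (e-value, negative identity) implements the two-level ranking in one pass; a TFBM whose best hit has no family annotation returns `None`, analogous to the TFBMs that could not be assigned to a distinct family.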
Some TFs had multiple TFBMs generated by different methods, requiring the selection of 1 representative TFBM per protein. We preferred ampDAP-Seq data over DAP-Seq data followed by other determination methods, because DAP-Seq is a high throughput method capturing the most complete set of binding sites on gDNA (O’Malley et al. 2016) and ampDAP-binding is solely based on sequence and not influenced by methylation state. We reassigned the JASPAR TFBMs originally retrieved from ReMap to the original method of either ChIP-Seq or DAP-Seq.
Phylogenetic analyses
Full protein sequences were aligned per family using MUSCLE v3.8.31 (Edgar 2004) with default parameters. Phylogenetic trees were generated with RAxML v7.4.4 (Stamatakis 2014) with 1,000 bootstraps and the PROTGAMMAJTT matrix (method modified from Guedes Corrêa et al. 2008) for similarity measure. The phylogenetic trees with TFBMs were visualized using the R package motifStack (Ou et al. 2018) (Supplementary Data Set S1). Protein sequences were scanned for domain annotations from Pfam and Prosite using InterProScan (Jones et al. 2014).
The similarity of TFBMs within a family was assessed using compare_motifs from the R package universalmotif to calculate a Pearson correlation coefficient. Clusters were generated by cutting a dendrogram resulting from the distances of the TFBMs at 0.5 followed by manual curation (method modified from Jores et al. 2021). Since gapped TFBMs are more difficult to detect (Bailey et al. 2009), we considered sufficiently similar parts of a gapped TFBM as the same cluster. Consensus TFBMs were generated with mergeMotifs (motifStack) and for gapped TFBMs merge_motifs (universalmotif) for each cluster and trimmed with trim_motifs (universalmotif). We defined TFBMs appearing only once in a TF family as outliers and excluded them from consensus TFBM generation (method modified from Jores et al. 2021). Distinct TFBMs across all TF families were established by TFBM comparison as described before for individual families and the clusters generated from an hclust analysis in R and cutting the tree with a cutoff of h = 0.2.
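The idea of comparing two TFBMs by a Pearson correlation coefficient with a minimum overlap can be sketched as follows. This is not the universalmotif implementation, whose alignment, reverse-complement handling, and score summarization may differ; the sketch simply slides one matrix along the other, requires at least `min_overlap` aligned columns, correlates the flattened overlapping cells, and reports the best value. The one-hot toy matrices are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def best_alignment_pcc(pwm_a, pwm_b, min_overlap=4):
    """Best Pearson correlation over all ungapped forward alignments.

    Each PWM is a list of columns [A, C, G, T]. Returns None when no alignment
    reaches `min_overlap` columns.
    """
    best = None
    for offset in range(-(len(pwm_b) - min_overlap), len(pwm_a) - min_overlap + 1):
        start_a = max(0, offset)   # first aligned column in pwm_a
        start_b = max(0, -offset)  # first aligned column in pwm_b
        overlap = min(len(pwm_a) - start_a, len(pwm_b) - start_b)
        if overlap < min_overlap:
            continue
        x = [v for col in pwm_a[start_a:start_a + overlap] for v in col]
        y = [v for col in pwm_b[start_b:start_b + overlap] for v in col]
        r = pearson(x, y)
        if best is None or r > best:
            best = r
    return best

def one_hot(consensus):
    """Hypothetical helper: a matrix with probability 1 for each consensus base."""
    return [[1.0 if b == c else 0.0 for b in "ACGT"] for c in consensus]

same_core = best_alignment_pcc(one_hot("TTGAC"), one_hot("GTTGACC"))
different = best_alignment_pcc(one_hot("TTGAC"), one_hot("GAGAGA"))
```

A distance such as 1 minus the correlation could then feed a hierarchical clustering; the dendrogram cut and the manual curation described above are not reproduced here.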
We considered TF families with at least 4 TFs with a known TFBM for our assessment of conservation level within the family, as this is the cutoff for the generation of a phylogenetic tree with RAxML. Families with 1 consensus TFBM, subtracting outliers, and less than 15% outlier TFBMs were considered as a conserved family. Up to 4 different consensus TFBMs and less than 15% outliers classified families as semi-conserved (see Supplementary Table S2).
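The classification thresholds above can be written as a small decision function. This is a sketch under assumptions that go beyond the text: the outlier percentage is computed relative to the number of TFs with a known TFBM, families failing both criteria are labeled "diverse", and families with fewer than 4 TFs are labeled "not assessed".

```python
def classify_family(n_tfs, n_consensus, n_outliers):
    """Classify a TF family by TFBM conservation.

    n_tfs:       number of TFs in the family with a known TFBM
    n_consensus: number of distinct consensus TFBMs after removing outliers
    n_outliers:  number of TFBMs that occur only once in the family
    """
    if n_tfs < 4:
        return "not assessed"  # below the cutoff for tree generation
    outlier_fraction = n_outliers / n_tfs  # assumed denominator
    if outlier_fraction < 0.15:
        if n_consensus == 1:
            return "conserved"
        if 2 <= n_consensus <= 4:
            return "semi-conserved"
    return "diverse"  # assumed label for all remaining families

# Toy counts, not values from Supplementary Table S2
examples = [
    classify_family(n_tfs=40, n_consensus=1, n_outliers=2),
    classify_family(n_tfs=60, n_consensus=3, n_outliers=5),
    classify_family(n_tfs=30, n_consensus=9, n_outliers=3),
    classify_family(n_tfs=3, n_consensus=1, n_outliers=0),
]
```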
Orthologous proteins in the bryophyte Marchantia were determined using OrthoFinder2 (Emms and Kelly 2019) for the TFs with a TFBM in a given phylogenetic clade of the tree. TFBMs derived from the Marchantia ampDAP-Seq were correlated with the corresponding consensus family TFBM using the compare_motifs function with the parameter of min.overlap = 4 to account for short core TFBMs.
Additional analyses
Wild-type RNA-Seq experiments of A. thaliana were downloaded from the SRA (Leinonen et al. 2011) and mapped onto the TAIR10 reference genome (as described in Halpape et al. 2023). Pearson correlation was calculated within conserved families and within groups binding to the same TFBM in semi-conserved families and coefficients visualized using the corrplot R package. A coefficient above 0.5 was considered as similar and above 0.7 as shared expression pattern.
Common and distinct peaks between TFs with the same TFBM within families were identified using Homer (Heinz et al. 2010) mergePeaks -d given. Peak sets were visualized using the treemaps R package.
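The distinction between common and distinct peaks can be illustrated with a simple interval-merging sketch. This is not Homer's mergePeaks algorithm; it assumes half-open (BED-like) coordinates, merges any peaks that literally overlap on the same chromosome, and records which TFs contribute to each merged region. All names and coordinates are illustrative.

```python
def merge_peaks(peak_sets):
    """Merge overlapping peaks across TFs.

    peak_sets: dict mapping TF name -> list of (chrom, start, end), half-open.
    Returns a list of (chrom, start, end, frozenset_of_TFs).
    """
    tagged = sorted(
        (chrom, start, end, tf)
        for tf, peaks in peak_sets.items()
        for chrom, start, end in peaks
    )
    merged = []
    for chrom, start, end, tf in tagged:
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start < last[2]:
            last[2] = max(last[2], end)  # extend the current region
            last[3].add(tf)
        else:
            merged.append([chrom, start, end, {tf}])
    return [(c, s, e, frozenset(tfs)) for c, s, e, tfs in merged]

def shared_fraction(merged, min_tfs=2):
    """Fraction of merged regions bound by at least `min_tfs` different TFs."""
    if not merged:
        return 0.0
    return sum(1 for region in merged if len(region[3]) >= min_tfs) / len(merged)

# Toy peak sets, not data from the study
peak_sets = {
    "TF1": [("chr1", 100, 200), ("chr1", 500, 600)],
    "TF2": [("chr1", 150, 250), ("chr2", 100, 200)],
}
merged = merge_peaks(peak_sets)
fraction = shared_fraction(merged)
```

Regions supported by a single TF correspond to distinct peaks, and raising `min_tfs` gives the proportion of sites bound by several family members.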
Accession numbers
Sequencing data generated in this article can be found in the NCBI SRA under BioProject IDs PRJNA1007631 and PRJNA1148438. Accession numbers of the genes described in this paper are listed in Supplementary Tables S1 and S6.
Acknowledgments
We gratefully acknowledge support by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure, and the CeBiTec compute cluster for computational resources.
Contributor Information
Sanja Zenker, Computational Biology, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany; Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany.
Donat Wulf, Computational Biology, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany; Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany.
Anja Meierhenrich, Computational Biology, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany; Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany.
Prisca Viehöver, Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany; Genetics and Genomics of Plants, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany.
Sarah Becker, Computational Biology, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany.
Marion Eisenhut, Computational Biology, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany; Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany.
Ralf Stracke, Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany; Genetics and Genomics of Plants, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany.
Bernd Weisshaar, Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany; Genetics and Genomics of Plants, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany.
Andrea Bräutigam, Computational Biology, Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany; Center of Biotechnology (CeBiTec), Bielefeld University, 33615 Bielefeld, Germany.
Author contributions
S.Z. analyzed the data and co-wrote the manuscript, D.W. conceived of the study, A.M. and S.Z. produced and analyzed ampDAP-Seq data in Marchantia, P.V. performed ampDAP-Sequencing, S.B. re-analyzed HY5 experiments, M.E. edited the manuscript, R.S. edited the manuscript, B.W. edited the manuscript, A.B. suggested analyses and co-wrote the manuscript.
Supplementary data
The following materials are available in the online version of this article.
Supplementary Figure S1. Phylogenetic tree of the C2C2 GATA TF family.
Supplementary Figure S2. Phylogenetic tree of the C2H2 zinc finger TF family.
Supplementary Figure S3. TFBMs determined from Marchantia ampDAP-seq experiments.
Supplementary Figure S4. Phylogenetic tree of the bHLH family with Marchantia orthologues.
Supplementary Figure S5. TFBMs of Marchantia orthologues in the bZIP and MYB family.
Supplementary Data Set S1. Phylogenetic trees of all TF families analyzed in this study.
Supplementary Table S1. All TFs with binding motif used in this study.
Supplementary Table S2. Overview of TF family centered data with conservation classification.
Supplementary Table S3. Unique core motifs identified across all currently characterized TFs in the plant kingdom.
Supplementary Table S4. Statistics of the merged subsets of peaks for (amp)DAP-Seq TF data.
Supplementary Table S5. Pearson correlation coefficients of expression patterns in Arabidopsis.
Supplementary Table S6. Pearson correlation coefficient of Marchantia ampDAP-seq TFBMs with consensus TFBMs.
Supplementary Table S7. All primer sequences used in this study.
Funding
This study was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft; DFG) via grant TRR175-D04: “The Green Hub, Central Coordinator of Acclimation in Plants”, and through “Evolutionary network analysis based on the transcriptome atlas of Marchantia polymorpha” Funding ID: BR4617/1-1.
Data availability
Raw ampDAP-Seq data for Marchantia is available under the Bioprojects PRJNA1007631 and PRJNA1148438 on the NCBI SRA. Processed peak files are deposited on GitLab (https://gitlab.ub.uni-bielefeld.de/sanja.zenker/tfbm-evolution), as well as code used in this study and all consensus and individual TFBMs in MEME-format.
Visualizations of all Arabidopsis (amp)DAP-Seq binding sites on nuclear encoded gene promoters (see Fig. 1A) are deposited under https://doi.org/10.4119/unibi/2982196.
References
- Aggarwal P, Das Gupta M, Joseph AP, Chatterjee N, Srinivasan N, Nath U. Identification of specific DNA binding residues in the TCP family of transcription factors in Arabidopsis. Plant Cell. 2010:22(4):1174–1189. 10.1105/tpc.109.066647
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990:215(3):403–410. 10.1016/S0022-2836(05)80360-2
- Anashkina AA. Protein-DNA recognition mechanisms and specificity. Biophys Rev. 2023:15(5):1007–1014. 10.1007/s12551-023-01137-7
- Appelhagen I, Jahns O, Bartelniewoehner L, Sagasser M, Weisshaar B, Stracke R. Leucoanthocyanidin Dioxygenase in Arabidopsis thaliana: characterization of mutant alleles and regulation by MYB–BHLH–TTG1 transcription factor complexes. Gene. 2011:484(1–2):61–68. 10.1016/j.gene.2011.05.031
- Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, Assael Y, Jumper J, Kohli P, Kelley DR. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021a:18(10):1196–1203. 10.1038/s41592-021-01252-x
- Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet. 2021b:53(3):354–366. 10.1038/s41588-021-00782-6
- Ayadi M, Delaporte V, Li YF, Zhou DX. Analysis of GT-3a identifies a distinct subgroup of trihelix DNA-binding transcription factors in Arabidopsis. FEBS Lett. 2004:562(1–3):147–154. 10.1016/S0014-5793(04)00222-4
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME suite: tools for motif discovery and searching. Nucleic Acids Res. 2009:37(Web Server):W202. 10.1093/nar/gkp335
- Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013:41(D1):D991–D995. 10.1093/nar/gks1193
- Bartlett A, O’Malley RC, Huang SC, Galli M, Nery JR, Gallavotti A, Ecker JR. Mapping genome-wide transcription-factor binding sites using DAP-seq. Nat Protoc. 2017:12(8):1659–1672. 10.1038/nprot.2017.055
- Beaulieu C, Libourel C, Mbadinga Zamar DL, El Mahboubi K, Hoey D, Greiff GRL, Keller J, Girou C, Helene SC, et al. The Marchantia polymorpha pangenome reveals ancient mechanisms of plant adaptation to the environment. Nat Genet. 2025:57(3):729–740. 10.1038/s41588-024-02071-4
- Bennett T, van den Toorn A, Sanchez-Perez GF, Campilho A, Willemsen V, Snel B, Scheres B. SOMBRERO, BEARSKIN1, and BEARSKIN2 regulate root cap maturation in Arabidopsis. Plant Cell. 2010:22(3):640–654. 10.1105/tpc.109.072272
- Berger MF, Badis G, Gehrke AR, Talukder S, Philippakis AA, Peña-Castillo L, Alleyne TM, Mnaimneh S, Botvinnik OB, Chan ET, et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008:133(7):1266–1276. 10.1016/j.cell.2008.05.024
- Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW, Bulyk ML. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat Biotechnol. 2006:24(11):1429–1435. 10.1038/nbt1246
- Blanc-Mathieu R, Dumas R, Turchi L, Lucas J, Parcy F. Plant-TFClass: a structural classification for plant transcription factors. Trends Plant Sci. 2024:29(1):40–51. 10.1016/j.tplants.2023.06.023
- Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, Qureshi M, Raj S, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2021:49(D1):D344–D354. 10.1093/nar/gkaa977
- Bonchuk AN, Georgiev PG. C2H2 proteins: evolutionary aspects of domain architecture and diversification. BioEssays. 2024:46(8):e2400052. 10.1002/bies.202400052
- Bowers JE, Chapman BA, Rong J, Paterson AH. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature. 2003:422(6930):433–438. 10.1038/nature01521
- Bowman JL. The origin of a land flora. Nat Plants. 2022:8(12):1352–1369. 10.1038/s41477-022-01283-y
- Bowman JL, Kohchi T, Yamato KT, Jenkins J, Shu S, Ishizaki K, Yamaoka S, Nishihama R, Nakamura Y, Berger F, et al. Insights into land plant evolution garnered from the Marchantia polymorpha genome. Cell. 2017:171(2):287–304.e15. 10.1016/j.cell.2017.09.030
- Breuninger H, Rikirsch E, Hermann M, Ueda M, Laux T. Differential expression of WOX genes mediates apical-basal axis formation in the Arabidopsis embryo. Dev Cell. 2008:14(6):867–876. 10.1016/j.devcel.2008.03.008
- Brodsky S, Jana T, Mittelman K, Chapal M, Kumar DK, Carmi M, Barkai N. Intrinsically disordered regions direct transcription factor in vivo binding specificity. Mol Cell. 2020:79(3):459–471.e4. 10.1016/j.molcel.2020.05.032
- Burko Y, Seluzicki A, Zander M, Pedmale UV, Ecker JR, Chory J. Chimeric activators and repressors define HY5 activity and reveal a light-regulated feedback mechanism. Plant Cell. 2020:32(4):967–983. 10.1105/tpc.19.00772
- Busch A, Deckena M, Almeida-Trapp M, Kopischke S, Kock C, Schüssler E, Tsiantis M, Mithöfer A, Zachgo S. MpTCP1 controls cell proliferation and redox processes in Marchantia polymorpha. New Phytol. 2019:224(4):1627–1641. 10.1111/nph.16132
- Castro-Mondragon JA, Riudavets-Puig R, Rauluseviciute I, Lemma RB, Turchi L, Blanc-Mathieu R, Lucas J, Boddie P, Khan A, As Manosalva P, et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2022:50(D1):D165–D173. 10.1093/nar/gkab1113
- Catarino B, Hetherington AJ, Emms DM, Kelly S, Dolan L. The stepwise increase in the number of transcription factor families in the Precambrian predated the diversification of plants on land. Mol Biol Evol. 2016:33(11):2815–2819. 10.1093/molbev/msw155
- Chang X, Xie S, Wei L, Lu Z, Chen ZH, Chen F, Lai Z, Lin Z, Zhang L. Origins and stepwise expansion of R2R3-MYB transcription factors for the terrestrial adaptation of plants. Front Plant Sci. 2020:11:2086. 10.3389/fpls.2020.575360
- Charron J-BF, He H, Elling AA, Deng XW. Dynamic landscapes of four histone modifications during deetiolation in Arabidopsis. Plant Cell. 2010:21(12):3732–3748. 10.1105/tpc.109.066845
- Christenhusz MJM, Byng JW. The number of known plants species in the world and its annual increase. Phytotaxa. 2016:261(3):201–217. 10.11646/phytotaxa.261.3.1
- Ciolkowski I, Wanke D, Birkenbihl RP, Somssich IE. Studies on DNA-binding selectivity of WRKY transcription factors lend structural clues into WRKY-domain function. Plant Mol Biol. 2008:68(1–2):81–92. 10.1007/s11103-008-9353-1
- Cook WJ, Mosley SP, Audino DC, Mullaney DL, Rovelli A, Stewart G, Denis CL. Mutations in the zinc-finger region of the yeast regulatory protein ADR1 affect both DNA binding and transcriptional activation. J Biol Chem. 1994:269(12):9374–9379. 10.1016/S0021-9258(17)37118-1
- Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021:10(2):giab008. 10.1093/gigascience/giab008
- de Almeida BP, Reiter F, Pagani M, Stark A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet. 2022:54(5):613–624. 10.1038/s41588-022-01048-5
- De Bodt S, Maere S, Van De Peer Y. Genome duplication and the origin of angiosperms. Trends Ecol Evol. 2005:20(11):591–597. 10.1016/j.tree.2005.07.008
- De Mendoza A, Sebé-Pedrós A, Šestak MS, Matejčić M, Torruella G, Domazet-Lošo T, Ruiz-Trillo I. Transcription factor evolution in eukaryotes and the assembly of the regulatory toolkit in multicellular lineages. Proc Natl Acad Sci U S A. 2013:110(50):E4858–E4866. 10.1073/pnas.1311818110
- Dubos C, Le Gourrierec J, Baudry A, Huep G, Lanet E, Debeaujon I, Routaboul J, Alboresi A, Weisshaar B, Lepiniec L. MYBL2 is a new regulator of flavonoid biosynthesis in Arabidopsis thaliana. Plant J. 2008:55(6):940–953. 10.1111/j.1365-313X.2008.03564.x
- Dubos C, Stracke R, Grotewold E, Weisshaar B, Martin C, Lepiniec L. MYB transcription factors in Arabidopsis. Trends Plant Sci. 2010:15(10):573–581. 10.1016/j.tplants.2010.06.005
- Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004:5(1):113. 10.1186/1471-2105-5-113
- Edger PP, Pires JC. Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosome Res. 2009:17(5):699–717. 10.1007/s10577-009-9055-9
- Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019:20(1):238. 10.1186/s13059-019-1832-y
- Eulgem T, Rushton PJ, Robatzek S, Somssich IE. The WRKY superfamily of plant transcription factors. Trends Plant Sci. 2000:5(5):199–206. 10.1016/S1360-1385(00)01600-9
- Feller A, Machemer K, Braun EL, Grotewold E. Evolutionary and comparative analysis of MYB and bHLH plant transcription factors. Plant J. 2011:66(1):94–116. 10.1111/j.1365-313X.2010.04459.x
- Franco-Zorrilla JM, López-Vidriero I, Carrasco JL, Godoy M, Vera P, Solano R. DNA-binding specificities of plant transcription factors and their potential to define target genes. Proc Natl Acad Sci U S A. 2014:111(6):2367–2372. 10.1073/pnas.1316278111
- Frangedakis E, Yelina NE, Billakurthi K, Hua L, Schreier T, Dickinson PJ, Tomaselli M, Haseloff J, Hibberd JM. MYB-related transcription factors control chloroplast biogenesis. Cell. 2024:187(18):4859–4876.e22. 10.1016/j.cell.2024.06.039
- Galli M, Khakhar A, Lu Z, Chen Z, Sen S, Joshi T, Nemhauser JL, Schmitz RJ, Gallavotti A. The DNA binding landscape of the maize AUXIN RESPONSE FACTOR family. Nat Commun. 2018:9(1):4526. 10.1038/s41467-018-06977-6
- Gangappa SN, Botto JF. The multifaceted roles of HY5 in plant growth and development. Mol Plant. 2016:9(10):1353–1365. 10.1016/j.molp.2016.07.002
- Gera T, Jonas F, More R, Barkai N. Evolution of binding preferences among whole-genome duplicated transcription factors. eLife. 2022:11:e73225. 10.7554/ELIFE.73225
- Gibson DG, Young L, Chuang R-Y, Venter JC, Hutchison CA, Smith HO. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat Methods. 2009:6(5):343–345. 10.1038/nmeth.1318
- Gonzalez A, Zhao M, Leavitt JM, Lloyd AM. Regulation of the anthocyanin biosynthetic pathway by the TTG1/bHLH/Myb transcriptional complex in Arabidopsis seedlings. Plant J. 2008:53(5):814–827. 10.1111/j.1365-313X.2007.03373.x
- Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012:40(D1):D1178–D1186. 10.1093/nar/gkr944
- Gordân R, Hartemink AJ, Bulyk ML. Distinguishing direct versus indirect transcription factor–DNA interactions. Genome Res. 2009:19(11):2090–2100. 10.1101/gr.094144.109
- Guedes Corrêa LG, Riaño-Pachón DM, Guerra Schrago C, Vicentini dos Santos R, Mueller-Roeber B, Vincentz M. The role of bZIP transcription factors in green plant evolution: adaptive features emerging from four founder genes. PLoS One. 2008:3(8):e2944. 10.1371/journal.pone.0002944
- Guo Y, Mahony S, Gifford DK. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol. 2012:8(8):e1002638. 10.1371/journal.pcbi.1002638
- Halpape W, Wulf D, Verwaaijen B, Stasche AS, Zenker S, Sielemann J, Tschikin S, Viehöver P, Sommer M, Weber APM, et al. Transcription factors mediating regulation of photosynthesis. bioRxiv. 2023. 10.1101/2023.01.06.522973
- Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010:38(4):576–589. 10.1016/j.molcel.2010.05.004
- Ichikawa DM, Abdin O, Alerasool N, Kogenaru M, Mueller AL, Wen H, Giganti DO, Goldberg GW, Adams S, Spencer JM, et al. A universal deep-learning model for zinc finger design enables transcription factor reprogramming. Nat Biotechnol. 2023:41(8):1117–1129. 10.1038/s41587-022-01624-4
- Jia Y, Ding Y, Shi Y, Zhang X, Gong Z, Yang S. The cbfs triple mutants reveal the essential functions of CBFs in cold acclimation and allow the definition of CBF regulons in Arabidopsis. New Phytol. 2016:212(2):345–353. 10.1111/nph.14088
- Jin J, Tian F, Yang DC, Meng YQ, Kong L, Luo J, Gao G. PlantTFDB 4.0: toward a central hub for transcription factors and regulatory interactions in plants. Nucleic Acids Res. 2017:45(D1):D1040–D1045. 10.1093/nar/gkw982
- Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, et al. DNA-binding specificities of human transcription factors. Cell. 2013:152(1–2):327–339. 10.1016/j.cell.2012.12.009
- Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014:30(9):1236–1240. 10.1093/bioinformatics/btu031
- Jores T, Tonnies J, Mueth NA, Romanowski A, Fields S, Cuperus JT, Queitsch C. Plant enhancers exhibit both cooperative and additive interactions among their functional elements. Plant Cell. 2024:36(7):2570–2586. 10.1093/plcell/koae088
- Jores T, Tonnies J, Wrightsman T, Buckler ES, Cuperus JT, Fields S, Queitsch C. Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters. Nat Plants. 2021:7(6):842–855. 10.1038/s41477-021-00932-y
- Kaplan-Levy RN, Brewer PB, Quon T, Smyth DR. The trihelix family of transcription factors—light, stress and development. Trends Plant Sci. 2012:17(3):163–171. 10.1016/j.tplants.2011.12.002
- Kim K-C, Lai Z, Fan B, Chen Z. Arabidopsis WRKY38 and WRKY62 transcription factors interact with histone deacetylase 19 in basal defense. Plant Cell. 2008:20(9):2357–2371. 10.1105/tpc.107.055566
- Lambert SA, Jolma A, Campitelli LF, Das PK, Yin Y, Albu M, Chen X, Taipale J, Hughes TR, Weirauch MT. The human transcription factors. Cell. 2018:172(4):650–665. 10.1016/j.cell.2018.01.029
- Lambert SA, Yang AWH, Sasse A, Cowley G, Albu M, Caddick MX, Morris QD, Weirauch MT, Hughes TR. Similarity regression predicts evolution of transcription factor sequence specificity. Nat Genet. 2019:51(6):981–989. 10.1038/s41588-019-0411-1
- Lang D, Weiche B, Timmerhaus G, Richardt S, Riaño-Pachón DM, Corrêa LGG, Reski R, Mueller-Roeber B, Rensing SA. Genome-wide phylogenetic comparative analysis of plant transcriptional regulation: a timeline of loss, gain, expansion, and correlation with complexity. Genome Biol Evol. 2010:2:488–503. 10.1093/gbe/evq032
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012:9(4):357–359. 10.1038/nmeth.1923
- Lee J, He K, Stolc V, Lee H, Figueroa P, Gao Y, Tongprasit W, Zhao H, Lee I, Deng XW. Analysis of transcription factor HY5 genomic binding sites revealed its hierarchical role in light regulation of development. Plant Cell. 2007:19(3):731–749. 10.1105/tpc.106.047688
- Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011:39(Database):D19–D21. 10.1093/nar/gkq1019
- Li M, Yao T, Lin W, Hinckley WE, Galli M, Muchero W, Gallavotti A, Chen J-G, Huang SC. Double DAP-seq uncovered synergistic DNA binding of interacting bZIP transcription factors. Nat Commun. 2023:14(1):2600. 10.1038/s41467-023-38096-2
- Li X-Y, Thomas S, Sabo PJ, Eisen MB, Stamatoyannopoulos JA, Biggin MD. The role of chromatin accessibility in directing the widespread, overlapping patterns of Drosophila transcription factor binding. Genome Biol. 2011:12(4):R34. 10.1186/gb-2011-12-4-r34
- Li Y, Shi W, Wasserman WW. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinformatics. 2018:19(1):202. 10.1186/s12859-018-2187-1
- López-Vidriero I, Godoy M, Grau J, Peñuelas M, Solano R, Franco-Zorrilla JM. DNA features beyond the transcription factor binding site specify target recognition by plant MYC2-related bHLH proteins. Plant Commun. 2021:2(6):100232. 10.1016/j.xplc.2021.100232
- Lu SX, Knowles SM, Andronis C, Ong MS, Tobin EM. CIRCADIAN CLOCK ASSOCIATED1 and LATE ELONGATED HYPOCOTYL function synergistically in the circadian clock of Arabidopsis. Plant Physiol. 2009:150(2):834–843. 10.1104/pp.108.133272
- Lu Z, Hofmeister BT, Vollmers C, DuBois RM, Schmitz RJ. Combining ATAC-seq with nuclei sorting for discovery of cis-regulatory regions in plant genomes. Nucleic Acids Res. 2017:45(6):e41–e41. 10.1093/nar/gkw1179
- Lu Z, Marand AP, Ricci WA, Ethridge CL, Zhang X, Schmitz RJ. The prevalence, evolution and chromatin signatures of plant regulatory elements. Nat Plants. 2019:5(12):1250–1259. 10.1038/s41477-019-0548-z
- Lukic S, Nicolas J-C, Levine AJ. The diversity of zinc-finger genes on human chromosome 19 provides an evolutionary mechanism for defense against inherited endogenous retroviruses. Cell Death Differ. 2014:21(3):381–387. 10.1038/cdd.2013.150
- Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000:290(5494):1151–1155. 10.1126/science.290.5494.1151
- Machanick P, Bailey TL. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics. 2011:27(12):1696–1697. 10.1093/bioinformatics/btr189
- Mahendrawada L, Warfield L, Donczew R, Hahn S. Low overlap of transcription factor DNA binding and regulatory targets. Nature. 2025. 10.1038/s41586-025-08916-0
- Maher KA, Bajic M, Kajala K, Reynoso M, Pauluzzi G, West DA, Zumstein K, Woodhouse M, Bubb K, Dorrity MW, et al. Profiling of accessible chromatin regions across multiple plant species and cell types reveals common gene regulatory principles and new control modules. Plant Cell. 2018:30(1):15–36. 10.1105/tpc.17.00581
- Nitta KR, Jolma A, Yin Y, Morgunova E, Kivioja T, Akhtar J, Hens K, Toivonen J, Deplancke B, Furlong EEM, et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife. 2015:4. 10.7554/eLife.04837
- Noyes MB, Christensen RG, Wakabayashi A, Stormo GD, Brodsky MH, Wolfe SA. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell. 2008:133(7):1277–1289. 10.1016/j.cell.2008.05.023
- Ohno S. Evolution by gene duplication. Berlin Heidelberg: Springer-Verlag; 1970.
- O’Malley RC, Huang SSC, Song L, Lewsey MG, Bartlett A, Nery JR, Galli M, Gallavotti A, Ecker JR. Cistrome and epicistrome features shape the regulatory DNA landscape. Cell. 2016:165(5):1280–1292. 10.1016/j.cell.2016.04.038
- Ou J, Wolfe SA, Brodsky MH, Zhu LJ. motifStack for the analysis of transcription factor binding site evolution. Nat Methods. 2018:15(1):8–9. 10.1038/nmeth.4555
- Rabinovich A, Jin VX, Rabinovich R, Xu X, Farnham PJ. E2F in vivo binding specificity: comparison of consensus versus nonconsensus binding sites. Genome Res. 2008:18(11):1763–1777. 10.1101/gr.080622.108
- Ramsay NA, Glover BJ. MYB–bHLH–WD40 protein complex and the evolution of cellular diversity. Trends Plant Sci. 2005:10(2):63–70. 10.1016/j.tplants.2004.12.011
- Reyes JC, Muro-Pastor MI, Florencio FJ. The GATA family of transcription factors in Arabidopsis and rice. Plant Physiol. 2004:134(4):1718–1732. 10.1104/pp.103.037788
- Richter R, Bastakis E, Schwechheimer C. Cross-repressive interactions between SOC1 and the GATAs GNC and GNL/CGA1 in the control of greening, cold tolerance, and flowering time in Arabidopsis. Plant Physiol. 2013:162(4):1992–2004. 10.1104/pp.113.219238
- Riechmann JL, Heard J, Martin G, Reuber L, Jiang CZ, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, et al. Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science. 2000:290(5499):2105–2110. 10.1126/science.290.5499.2105
- Rohs R, West SM, Sosinsky A, Liu P, Mann RS, Honig B. The role of DNA shape in protein-DNA recognition. Nature. 2009:461(7268):1248–1253. 10.1038/nature08473
- Rushton PJ, Macdonald H, Huttly AK, Lazarus CM, Hooley R. Members of a new family of DNA-binding proteins bind to a conserved cis-element in the promoters of α-amy2 genes. Plant Mol Biol. 1995:29(4):691–702. 10.1007/BF00041160
- Sano Y, Akimaru H, Okamura T, Nagao T, Okada M, Ishii S. Drosophila activating transcription factor-2 is involved in stress response via activation by p38, but not c-jun NH2-terminal kinase. Mol Biol Cell. 2005:16(6):2934–2946. 10.1091/mbc.e04-11-1008
- Sayou C, Monniaux M, Nanao MH, Moyroud E, Brockington SF, Thévenon E, Chahtane H, Warthmann N, Melkonian M, Zhang Y, et al. A promiscuous intermediate underlies the evolution of LEAFY DNA binding specificity. Science. 2014:343(6171):645–648. 10.1126/science.1248229
- Schmitz JF, Zimmer F, Bornberg-Bauer E. Mechanisms of transcription factor evolution in Metazoa. Nucleic Acids Res. 2016:44(13):6287–6297. 10.1093/nar/gkw492
- Schwechheimer C, Schröder PM, Blaby-Haas CE. Plant GATA factors: their biology, phylogeny, and phylogenomics. Annu Rev Plant Biol. 2022:73(1):123–148. 10.1146/annurev-arplant-072221-092913
- Seeman NC, Rosenberg JM, Rich A. Sequence-specific recognition of double helical nucleic acids by proteins. Proc Natl Acad Sci U S A. 1976:73(3):804–808. 10.1073/pnas.73.3.804
- Sielemann J, Wulf D, Schmidt R, Bräutigam A. Local DNA shape is a general principle of transcription factor binding specificity in Arabidopsis thaliana. Nat Commun. 2021:12(1):1–8. 10.1038/s41467-021-26819-2
- Sijacic P, Bajic M, McKinney EC, Meagher RB, Deal RB. Changes in chromatin accessibility between Arabidopsis stem cells and mesophyll cells illuminate cell type-specific transcription factor networks. Plant J. 2018:94(2):215–231. 10.1111/tpj.13882
- Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014:30(9):1312–1313. 10.1093/bioinformatics/btu033
- Stracke R, Favory J-J, Gruber H, Bartelniewoehner L, Bartels S, Binkert M, Funk M, Weisshaar B, Ulm R. The Arabidopsis bZIP transcription factor HY5 regulates expression of the PFG1/MYB12 gene in response to light and ultraviolet-B radiation. Plant Cell Environ. 2010:33(1):88–103. 10.1111/j.1365-3040.2009.02061.x
- Stracke R, Werber M, Weisshaar B. The R2R3-MYB gene family in Arabidopsis thaliana. Curr Opin Plant Biol. 2001:4(5):447–456. 10.1016/S1369-5266(00)00199-0
- Sullivan AM, Arsovski AA, Thompson A, Sandstrom R, Thurman RE, Neph S, Johnson AK, Sullivan ST, Sabo PJ, Neri FV, et al. Mapping and dynamics of regulatory DNA in maturing Arabidopsis thaliana siliques. Front Plant Sci. 2019:10:1434. 10.3389/fpls.2019.01434
- Tremblay M, Sanchez-Ferras O, Bouchard M. GATA transcription factors in development and disease. Development. 2018:145(20):dev164384. 10.1242/dev.164384
- Tsai A, Muthusamy AK, Alves MR, Lavis LD, Singer RH, Stern DL, Crocker J. Nuclear microenvironments modulate transcription from low-affinity enhancers. eLife. 2017:6:e28975. 10.7554/eLife.28975
- Tu X, Mejía-Guerra MK, Valdes Franco JA, Tzeng D, Chu P-Y, Shen W, Wei Y, Dai X, Li P, Buckler ES, et al. Reconstructing the maize leaf regulatory network using ChIP-seq data of 104 transcription factors. Nat Commun. 2020:11(1):5089. 10.1038/s41467-020-18832-8
- van den Bergh E, Külahoglu C, Bräutigam A, Hibberd JM, Weber APM, Zhu X-G, Eric Schranz M. Gene and genome duplications and the origin of C4 photosynthesis: birth of a trait in the Cleomaceae. Curr Plant Biol. 2014:1:2–9. 10.1016/j.cpb.2014.08.001
- Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014:158(6):1431–1443. 10.1016/j.cell.2014.08.009
- Westhoff P, Gowik U. Evolution of C4 photosynthesis—looking for the master switch. Plant Physiol. 2010:154(2):598–601. 10.1104/pp.110.161729
- Wilhelmsson PKI, Mühlich C, Ullrich KK, Rensing SA. Comprehensive genome-wide classification reveals that many plant-specific transcription factors evolved in streptophyte algae. Genome Biol Evol. 2017:9(12):3384–3397. 10.1093/gbe/evx258
- Yan J, Qiu Y, Ribeiro dos Santos AM, Yin Y, Li YE, Vinckier N, Nariai N, Benaglio P, Raman A, Li X, et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature. 2021:591(7848):147–151. 10.1038/s41586-021-03211-0
- Zhang J. Evolution by gene duplication: an update. Trends Ecol Evol. 2003:18(6):292–298. 10.1016/S0169-5347(03)00033-8
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W, et al. Model-based analysis of ChIP-seq (MACS). Genome Biol. 2008:9(9):1–9. 10.1186/gb-2008-9-9-r137
- Zhao L, Peng T, Chen C-Y, Ji R, Gu D, Li T, Zhang D, Tu Y-T, Wu K, Liu X. HY5 interacts with the histone deacetylase HDA15 to repress hypocotyl cell elongation in photomorphogenesis. Plant Physiol. 2019:180(3):1450–1466. 10.1104/pp.19.00055