Abstract
Transcription factors (TFs), which are central to the regulation of gene expression, are usually members of multigene families. In plants, they are involved in diverse processes such as developmental control and elicitation of defense and stress responses. To investigate if differences exist in the expansion patterns of TF gene families between plants and other eukaryotes, we first used Arabidopsis (Arabidopsis thaliana) TFs to identify TF DNA-binding domains. These DNA-binding domains were then used to identify related sequences in 25 other eukaryotic genomes. Interestingly, among 19 families that are shared between animals and plants, more than 14 are larger in plants than in animals. After examining the lineage-specific expansion of TF families in two plants, eight animals, and two fungi, we found that TF families shared among these organisms have undergone much more dramatic expansion in plants than in other eukaryotes. Moreover, this elevated expansion rate of plant TFs is due not simply to higher duplication rates of plant genomes but also to a higher degree of expansion compared to other plant genes. Further, in many Arabidopsis-rice (Oryza sativa) TF orthologous groups, the degree of lineage-specific expansion in Arabidopsis is correlated with that in rice. This pattern of parallel expansion is much more pronounced than the whole-genome trend in rice and Arabidopsis. The high rate of expansion among plant TF genes and their propensity for parallel expansion suggest frequent adaptive responses to selection pressure common among higher plants.
Regulation of gene expression is central to a myriad of biological processes at the molecular level and is to a significant extent controlled by transcription factors (TFs). Most TFs are modular proteins consisting of a DNA-binding domain that interacts with cis-regulatory elements of its target genes and a protein-protein interaction domain that facilitates oligomerization between TFs or other regulators (Wray et al., 2003). Sequence divergence in the DNA-binding domains of related TFs may lead to differences in affinities to a set of cis-regulatory elements. Together with the propensity for TFs to homodimerize and/or heterodimerize, the large TF repertoire in a eukaryote genome provides a wide range of combinatorial relationships for transcriptional regulation. TFs usually form gene families that vary considerably in size among organisms (Riechmann et al., 2000; Wray et al., 2003). The reasons behind such differences are not known, although it is suggested that organismal complexity correlates with an increase in the absolute number and the proportion of TFs in a proteome (Levine and Tjian, 2003).
In Arabidopsis (Arabidopsis thaliana), at least 1,500 genes are TFs, and 45% of these TFs belong to families common to Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae (Riechmann et al., 2000). Some of these TF families are much larger in Arabidopsis, suggesting differential expansion. For example, the Myb family has 190 members in Arabidopsis but only six in fly, three in worm, and 10 in yeast. These differences lead to the suggestion that they may be involved in plant-specific regulatory functions (Riechmann et al., 2000). Interestingly, genes involved in signal transduction and transcription have been preferentially retained after the most recent whole-genome duplication event in the Arabidopsis lineage (Blanc and Wolfe, 2004; Seoighe and Gehring, 2004). These findings suggest important roles of TF duplicates in plant evolution. However, it remains to be determined if the TF family expansion in plants is in general more dramatic than that in other organisms. In addition, it is not known if the preferential retention of TFs has occurred at a longer time scale, such as after the divergence between rice (Oryza sativa) and Arabidopsis but before the last whole-genome duplication in Arabidopsis.
In this study, we evaluated whether plant TFs have a higher rate of lineage-specific expansion than TFs in other eukaryotes. We first identified TF families among 26 eukaryotes based on known Arabidopsis TFs. We then selected five genome pairs, including plants, animals, and fungi, that diverged approximately 100 to 250 million years ago (MYA) to evaluate the degrees of lineage-specific expansions of TF families. To see if TFs have expanded more than other plant genes, we compared the GeneOntology (GO; Harris et al., 2004) annotation of Arabidopsis genes to see if GO categories enriched in TFs have a significantly higher than average number of duplicates per category. Finally, we examined the TF expansion at the orthologous group (OG) level to determine if lineage-specific expansions in the Arabidopsis and rice lineages are correlated with each other.
RESULTS
TF Family Sizes and Organismal Complexity
For the identification of TFs in 26 eukaryote genomes, we first consolidated the Arabidopsis TF family annotations from three databases (see “Materials and Methods”). The DNA-binding domain sequences of these Arabidopsis TFs (Fig. 1; Table I) were then used to recover related protein sequences in other genomes. Figure 1 shows the number of genes in each TF family among the eukaryote genomes analyzed. There are more plant TF families because some of the plant TFs may contain domains that are (1) too divergent from homologous sequences in other genomes or (2) plant specific. Due to this methodological bias, only relevant shared families are analyzed in all subsequent cross-species comparisons.
Figure 1.
The prevalence of different TF families in eukaryotes. The tree on the left indicates the phylogenetic relationships among the eukaryote genomes analyzed. The top row contains the names of the representative TF domains we analyzed. Each number indicates the count of a representative protein domain in a genome. The metazoan and the plant sections are enclosed in solid and dotted lines, respectively. The numbers of TFs in domain families shared between metazoans and higher plants are in black boxes if they are the highest count in each column.
Table I.
Plant TF gene families and their defining protein domains
–, Family name not assigned, domain functions unknown, or domain reference only.
Defining Domain [a] | Family [b] | Domain Functions | OG [c] | GainA [c] | GainO [c] | % Parallel [d] | r [e] | P Value [e] (smaller than) | Reference(s) |
---|---|---|---|---|---|---|---|---|---|
AP2 | AP2-EREBP | DNA binding | 37 | 23 (63) | 28 (76) | 64.5 | 0.6884 | 1.87e-05 | Ohme-Takagi and Shinshi (1995); Weigel (1995) |
ARID | ARID | DNA binding | 2 | 0 (0) | 1 (50) | 0.0 | n.d. | n.d. | Herrscher et al. (1995) |
AT_hook | EMF1 | DNA binding | 8 | 4 (50) | 1 (13) | 0.0 | −1 | 2.20e-16 | Reeves and Nissen (1990) |
B3 | ABI3VP1, ARF | DNA binding | 15 | 5 (34) | 8 (54) | 62.5 | −0.1324 | 0.755 | Suzuki et al. (1997); Ulmasov et al. (1997) |
bZIP | bZIP | DNA binding and protein-protein interaction | 27 | 15 (56) | 13 (49) | 33.3 | −0.2346 | 0.306 | Landschulz et al. (1988) |
CBF | CCAAT-HAP2 | DNA binding and protein-protein interaction | 5 | 4 (80) | 4 (80) | 100.0 | 0.9272 | 0.073 | Edskes et al. (1998) |
CBFD_NFYB_HMF | CCAAT-DR1, CCAAT-HAP3, CCAAT-HAP5 | DNA binding and protein-protein interaction | 9 | 4 (45) | 3 (34) | 40.0 | 0.0857 | 0.891 | Li et al. (1992) |
CG-1 | AtSR | DNA binding | 2 | 0 (0) | 0 (0) | n.d. | n.d. | n.d. | da Costa e Silva (1994) |
CXC | CPP | DNA binding | 2 | 2 (100) | 1 (50) | 50.0 | n.d. | n.d. | Hobert et al. (1996) |
DUF573 | GeBP | Unknown | 2 | 2 (100) | 2 (100) | 100.0 | n.d. | n.d. | – |
E2F_TDP | E2F-DP | DNA binding | 3 | 0 (0) | 3 (100) | 0.0 | n.a. | n.a. | Zheng et al. (1999) |
EIN3 | EIL | Unknown | 2 | 1 (50) | 2 (100) | 50.0 | n.d. | n.d. | Solano et al. (1998) |
FLO_LFY | LFY | Unknown | 1 | 0 (0) | 0 (0) | n.d. | n.d. | n.d. | Weigel et al. (1992) |
GATA | C2C2-Gata | DNA binding | 6 | 6 (100) | 6 (100) | 100.0 | 0.9289 | 0.007 | Omichinski et al. (1993) |
GRAS | GRAS | Unknown | 16 | 3 (19) | 5 (32) | 14.3 | −0.6860 | 0.132 | Pysh et al. (1999) |
HLH | bHLH | DNA binding | 52 | 31 (60) | 26 (50) | 32.6 | 0.1739 | 0.265 | Littlewood and Evan (1995) |
Homeobox | HB | DNA binding | 28 | 10 (36) | 12 (43) | 46.7 | −0.0108 | 0.970 | Scott et al. (1989) |
HSF_DNA-bind | HSF | DNA binding | 8 | 1 (13) | 4 (50) | 25.0 | −0.4937 | 0.506 | Fujita et al. (1989) |
Myb_DNA-binding | MYB, G2-like, MYB-related, Trihelix | DNA binding | 90 | 47 (53) | 46 (52) | 47.6 | 0.1835 | 0.150 | Klempnauer and Sippel (1987) |
NAM | NAC | DNA binding and protein-protein interaction | 26 | 9 (35) | 16 (62) | 38.9 | 0.6043 | 0.008 | Ernst et al. (2004) |
PHD | Alfin-like | Unknown | 30 | 12 (40) | 7 (24) | 35.7 | −0.1399 | 0.633 | Aasland et al. (1995) |
RWP-RK | NIN | Unknown | 5 | 3 (60) | 2 (40) | 66.7 | −0.5000 | 0.667 | Schauser et al. (1999) |
SBP | SBP | DNA binding | 5 | 2 (40) | 3 (60) | 66.7 | 0.9878 | 0.099 | Klein et al. (1996) |
SRF-TF | MADS | DNA binding | 15 | 9 (60) | 10 (67) | 46.2 | 0.6352 | 0.020 | Pellegrini et al. (1995) |
TCP | TCP | Unknown | 8 | 5 (63) | 6 (75) | 57.1 | 0.7201 | 0.068 | Cubas et al. (1999) |
Tub | TUB | DNA binding | 5 | 0 (0) | 3 (60) | 0.0 | n.a. | n.a. | Kleyn et al. (1996) |
WRKY | WRKY | DNA binding | 21 | 10 (48) | 10 (48) | 33.3 | 0.8513 | 5.690e-05 | Eulgem et al. (2000) |
YABBY | C2C2-YABBY | DNA binding | 3 | 1 (34) | 2 (67) | 50.0 | n.d. | n.d. | Bowman (2000) |
zf-B_box | C2C2-CO-like, STO | Unknown | 10 | 8 (80) | 8 (80) | 60.0 | 0.0696 | 0.849 | Borden (1998) |
zf-C2H2 | C2H2 | Nucleic acid binding | 40 | 17 (43) | 15 (38) | 39.1 | 0.8706 | 6.600e-08 | Klug and Rhodes (1987) |
zf-C3HC4 | C3H | Protein-protein interaction | 94 | 36 (39) | 35 (38) | 39.2 | 0.8706 | 6.600e-08 | Borden and Freemont (1996) |
zf-Dof | C2C2-Dof | DNA binding | 5 | 2 (40) | 4 (80) | 50.0 | 0.9683 | 0.032 | Shimofurutani et al. (1998) |
n.a. | GRF | – | 2 | 1 (50) | 1 (50) | 50.0 | n.d. | n.d. | van der Knaap et al. (2000) |
All TF families | – | – | 582 | 272 (47) | 286 (49) | 44.4 | 0.4623 | 2.200e-16 | – |
All families | – | – | 9,553 | 2,387 (25) | 2,549 (27) | 31.5 | 0.0699 | 1.800e-05 | – |
[a] For each family, only sequences with the specified Pfam domain were analyzed. n.a., Pfam domain designation not available, and the family was defined by similarity criteria outlined in “Materials and Methods.”
[b] TF families are designated according to Riechmann et al. (2000), AGRIS (http://arabidopsis.med.ohio-state.edu/), and Jen Sheen's transcription regulator site (http://genetics.mgh.harvard.edu/sheenweb/AraTRs.html).
[c] GainA, Number of OGs with expansion in the Arabidopsis lineage. GainO, Number of OGs with expansion in the rice lineage. The percentages of OGs with expansion in Arabidopsis or rice are shown in parentheses.
[d] Percentage of OGs with lineage-specific gains in both Arabidopsis and rice for each domain family. n.d., Not determined because there is no OG with gains in the domain family.
[e] r, Pearson's correlation coefficient of Arabidopsis versus rice gains for each family. OGs with no Arabidopsis or rice gain were excluded. The P values indicate the significance level of the associated correlation. n.d., Not determined because the sample size is too small. n.a., Not available because this family has no gain in any OG in Arabidopsis or rice.
The numbers of members in the TF families shared among plants, animals, and fungi roughly correlate with organismal complexity. The TF families in animals and land plants are larger than those in fungi. The multicellular land plants have a much larger TF repertoire compared to the unicellular alga Chlamydomonas reinhardtii. In addition, human, chicken, and Takifugu rubripes in general have larger TF families than those of other animals with simpler body plans. However, the TF families in multicellular fungi are only slightly larger than those in unicellular fungi, which may be explained by the limited levels of tissue/organ differentiation in some of these multicellular fungi. Among the 19 families shared between plants and animals, most families are larger in plants than in animals. Between Arabidopsis and human, 14 shared families are larger in Arabidopsis, only four are larger in human, and one family is of equal size. This finding suggests that the TF duplication and/or retention rate is higher in plants than in animals.
Higher Plant TFs Have Undergone Dramatic Lineage-Specific Expansions
Comparisons of family sizes between genomes are rather rudimentary measures of expansion because family sizes tell us little about the timing and degree of expansion. To determine if plant TFs have undergone more dramatic expansions than their animal counterparts, we examined TFs in five species pairs (Arabidopsis-rice, human-chicken, fly-mosquito, C. elegans-Caenorhabditis briggsae, and Magnaporthe-Neurospora) diverged approximately 100 to 250 MYA and evaluated the lineage-specific gains in OGs. Expansion has occurred if any lineage-specific clade in an OG has more than one gene. For example, the GATA family in Arabidopsis and rice contains 10 OGs (Fig. 2A; Table I), and nine of them have expansion in at least one plant lineage and seven in both.
Figure 2.
Determination of OGs and the degrees of lineage-specific expansion in different genome pairs. A, The similarity cluster of the GATA family members from Arabidopsis and rice. The tree is subdivided into OGs. Within OGs, the Arabidopsis and rice members are in orange and blue, respectively. The gene names are in black if they are not classified into OGs. The OG sizes are shown on the far right representing the numbers of genes in the Arabidopsis and rice clades. B, The OGs are classified into four types based on their OG sizes (with x > 1 and y >1). The percentage of total OGs in each type is determined for five species pairs. The species abbreviations are taken from the species names shown in Figure 1.
The degree of expansion was evaluated in 14 TF families shared among plants, animals, and fungi. We separated the OGs into four different classes: no expansion (1:1), expansion in one lineage only (x:1 or 1:y; x,y > 1), or parallel expansion in two lineages (x:y; Fig. 2B). In the animal and fungal genome pairs examined, less than 10% of the OGs have undergone lineage-specific expansion. In contrast, 68% of the OGs between Arabidopsis and rice have expanded in at least one lineage. The OGs for other species pairs are mostly 1:1. The Arabidopsis-rice divergence was approximately 150 MYA (Chaw et al., 2004). The C. briggsae/C. elegans divergence is estimated to have occurred 80 to 110 MYA (Stein et al., 2003). The chicken and human lineages diverged approximately 310 MYA (Reisz and Modesto, 1996). The D. melanogaster and Anopheles gambiae lineages diverged approximately 247 to 283 MYA (Gaunt and Miles, 2002). The Magnaporthe grisea-Neurospora crassa divergence is reported to have occurred approximately 200 MYA (Hamer et al., 2001). Two points can be made from our findings and the divergence dates of the eukaryotes analyzed. First, except in plants, TF expansion is relatively rare regardless of divergence time. Second, the plant TF family expansion is not simply a consequence of longer divergence time. A large number of lineage-specific expansions also occur, for example, in D. melanogaster and A. gambiae (Zdobnov et al., 2002), in nematodes (Stein et al., 2003), and in mammals (S.-H. Shiu and W.-H. Li, unpublished data). Our findings indicate TF families have expanded at a much higher rate in plants than in other organisms.
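The four OG classes described above can be illustrated with a minimal Python sketch. The function names and the example OG sizes are hypothetical and are not taken from the study; the sketch only assumes that each OG is summarized by the number of genes in its two lineage-specific clades.

```python
from collections import Counter

def classify_og(n_first, n_second):
    """Return the expansion class of an OG from its two lineage-specific clade sizes."""
    if n_first < 1 or n_second < 1:
        raise ValueError("an OG needs at least one gene from each lineage")
    if n_first == 1 and n_second == 1:
        return "1:1"   # no expansion in either lineage
    if n_second == 1:
        return "x:1"   # expansion in the first lineage only
    if n_first == 1:
        return "1:y"   # expansion in the second lineage only
    return "x:y"       # parallel expansion in both lineages

def class_percentages(og_sizes):
    """Percentage of OGs falling in each of the four classes."""
    counts = Counter(classify_og(a, b) for a, b in og_sizes)
    total = sum(counts.values())
    return {label: 100.0 * counts[label] / total
            for label in ("1:1", "x:1", "1:y", "x:y")}

# Hypothetical OG sizes (first lineage, second lineage); not data from the study.
example_ogs = [(1, 1), (3, 1), (1, 2), (4, 5)]
percentages = class_percentages(example_ogs)
```

Summing the percentages of the three non-1:1 classes gives the share of OGs expanded in at least one lineage, which is the quantity compared across species pairs in the text.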
Higher Duplicability of TFs than Other Genes in Plants
Since whole-genome duplications occur at a higher frequency in plants than in animals and fungi, the TF family expansion may simply be the consequence of a higher gene duplication rate in plants. Alternatively, the expansion of TF families may be due to elevated rates of retention, i.e. higher duplicability. To determine if TFs have higher duplicability than other genes, we examined the degrees of expansion of GO categories of Arabidopsis. We classified 7,298 OGs between Arabidopsis and rice into two classes: unexpanded (1:1) and expanded (x:1 and x:y) in the Arabidopsis lineage after the Arabidopsis-rice split. For each GO category, we compared the numbers of genes in expanded and unexpanded OGs against the average numbers of the whole dataset. The four functional categories related to transcriptional regulation all have higher proportions of genes derived from lineage-specific expansion than most other categories (Fig. 3, A and B). Nearly all the genes in these four categories are TF family members.
Figure 3.
The overrepresentation of transcription regulation categories. For each GO category, the total percentage of genes that are in expanded OGs is determined. The distributions of these percentages are shown for molecular function (A) and biological process (B) categories. Note that each transcription regulation-related category has a higher than average percentage of genes in expanded OGs.
We then determined the expected numbers of genes in expanded and unexpanded OGs in these categories based on the whole data set (see “Materials and Methods”). These two numbers are compared to the observed numbers with chi-squared tests (Table II). In the four TF-related categories, the proportions of genes in expanded OGs are significantly higher than the average of all annotated genes. These findings indicate that TF families in general have higher duplicability than genes involved in most other functions in Arabidopsis. Interestingly, three of the same four categories in human and mouse have significantly lower than average duplicability. The only TF-related category with higher than average duplicability in these two mammals is DNA-dependent regulation of transcription, contributed only by the zinc-finger C2H2 family that has undergone lineage-specific expansions in both human and mouse. The fact that most TF-related categories have low duplicability in human and mouse is consistent with our conclusion that most TF families have expanded at much higher rates in plants than in other organisms. In addition, plant TFs are retained at higher rates compared to most other plant genes.
Table II.
Overrepresentation of Arabidopsis GO categories related to transcriptional regulation
Categories | Description | Oi,E [a] | Oi,U [a] | Obs. % in EOG [b] | Ei,E [a] | Ei,U [a] | Exp. % in EOG [b] | P Value [c] |
---|---|---|---|---|---|---|---|---|
Molecular function | | | | | | | | |
GO:0016563 | Transcriptional activator activity | 19 | 4 | 82.6 | 12.2 | 10.8 | 53.0 | 0.005 |
GO:0003700 | TF activity | 565 | 242 | 70.0 | 428.7 | 378.3 | 53.0 | 6.890e-22 |
Biological process | | | | | | | | |
GO:0045449 | Regulation of transcription | 170 | 73 | 70.0 | 126.6 | 116.4 | 52.0 | 2.560e-08 |
GO:0006355 | Regulation of transcription, DNA dependent | 398 | 188 | 67.9 | 305.4 | 280.6 | 52.0 | 1.870e-14 |
[a] For each category i, Oi,E is the observed number of genes in expanded OGs, Oi,U is the observed number of genes in unexpanded OGs, Ei,E is the expected number of genes in expanded OGs, and Ei,U is the expected number of genes in unexpanded OGs.
[b] Observed (Obs.) or expected (Exp.) percentage of genes in expanded OGs (EOG).
[c] For each category, the observed numbers of genes were compared to the expected numbers in a 2 × 2 contingency table with a chi-squared test. The significance (P value) of the chi-squared statistic is shown.
Pronounced Parallel Expansions of TF Families in Arabidopsis and Rice
We showed above that 69% of OGs in plant TF families have undergone expansion (Fig. 2B). Among these expanded TF OGs, 98 and 115 expanded in only the Arabidopsis and in only the rice lineage, respectively. The rest of the TF OGs have expanded in a parallel fashion. While lineage-specifically expanded TFs may be responsible for lineage-specific adaptation, the parallel expansion suggests common selection pressure contributing to the retention of certain TFs in both lineages. Of all the OGs between Arabidopsis and rice, 39% (3,672 out of 9,345) show various degrees of parallel expansion. However, the expansion of OGs in general does not occur in parallel as indicated by the rather poor linear fit (r2 = 0.07; Fig. 4A). In contrast, the OGs of TF families have a much better linear fit (r2 = 0.46; Fig. 4B). Although the degrees of parallel expansion vary greatly among TF families (Table I), our findings indicate that, if a particular ancestral TF is duplicated and retained in the Arabidopsis lineage, the corresponding gene in the rice lineage will tend to be retained. In addition, this parallel expansion is more prominent in TFs than in most other plant genes.
Figure 4.
Pronounced parallel expansion of TF families between rice and Arabidopsis. The number of Arabidopsis genes is plotted against the number of rice genes in the same OGs for all gene families (A) and TF families only (B). The equations for the linear fits and the r2 values are shown.
Whole-genome duplications have occurred in both lineages after the Arabidopsis-rice split (Blanc et al., 2000; Yu et al., 2002). In addition, we showed in the previous section that TFs have a higher retention rate compared to other plant genes. Therefore, the higher degree of parallel expansion among TF OGs may simply be the consequence of independent gains of large numbers of TFs. To determine if this is the case, we randomly shuffled the number of gains in the Arabidopsis and the rice lineages independently, and the correlation coefficient of the shuffled dataset was calculated. This random shuffling was repeated 10,000 times, and none of the r2 values in the randomly shuffled dataset was larger than 0.037, substantially lower than the r2 value of Arabidopsis-rice TF gains (Fig. 4B). This finding indicates that the parallel retention of TFs in plants is not simply due to independent gain but likely a property of the ancestral TF.
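The shuffling procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function names, the random seed, and the example gain counts are hypothetical, and r2 is computed as the squared Pearson correlation coefficient.

```python
import random

def r_squared(xs, ys):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation in one lineage: treat as uncorrelated
    return sxy * sxy / (sxx * syy)

def shuffle_test(gains_a, gains_b, n_shuffles=10000, seed=0):
    """Compare the observed r2 with r2 values from independently shuffled gains.

    Returns (observed r2, largest shuffled r2, number of shuffles with r2 >= observed).
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    observed = r_squared(gains_a, gains_b)
    a, b = list(gains_a), list(gains_b)
    largest_shuffled = 0.0
    n_at_least_observed = 0
    for _ in range(n_shuffles):
        rng.shuffle(a)  # break the pairing of gains within each OG
        rng.shuffle(b)
        value = r_squared(a, b)
        largest_shuffled = max(largest_shuffled, value)
        if value >= observed:
            n_at_least_observed += 1
    return observed, largest_shuffled, n_at_least_observed

# Hypothetical per-OG gains in two lineages; not data from the study.
gains_first = [1, 2, 2, 3, 5, 8, 1, 4]
gains_second = [1, 3, 2, 4, 6, 7, 2, 4]
observed_r2, largest_null_r2, n_exceeding = shuffle_test(gains_first, gains_second, n_shuffles=1000)
```

If no shuffled r2 reaches the observed value, as reported in the text for the TF OGs, the observed correlation is unlikely to arise from independent gains alone.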
DISCUSSION
It is commonly believed that changes in cis-regulatory systems more often underlie the evolution of morphological diversity than do changes in gene number or protein function (Doebley and Lukens, 1998; Carroll, 2000). While this may be true, plant TF families tend to expand at much higher rates compared to their animal counterparts. It is known that polyploidization occurs frequently in plants (for review, see Wendel, 2000). At first glance, our finding may simply be the consequence of a higher gene duplication rate in plants. However, the sizes of the TF families are in general similar between human and T. rubripes. Since one round of whole-genome duplication likely occurred in the teleost lineage (Taylor et al., 2001), the similarity in TF family sizes between human and T. rubripes indicates that most of the duplicated TFs are not retained. In addition, similar to the findings of prior studies (Blanc and Wolfe, 2004; Seoighe and Gehring, 2004), we showed that the functional categories containing predominantly TFs have significantly higher rates of expansion compared to most other categories in Arabidopsis. Judging from the gene family and OG sizes of rice TFs, this is most likely to be true in rice as well. Our findings indicate TF duplication likely contributes to regulatory novelties in development and/or responses to external stimuli much more significantly in plants than in animals. However, other nonadaptive scenarios may be involved as well, as discussed below.
We showed that TF OGs have a significantly higher degree of parallel expansion than OGs in general. It should be noted that genes with higher duplicability do not necessarily expand in parallel. For example, the receptor-like kinase family has high duplicability, but most of the OGs in this gene family have not expanded in parallel (Shiu et al., 2004). There are several possible explanations. First, parallel expansion may indicate the presence of common selection pressure (Hughes and Friedman, 2003). One possible common selection pressure faced by plants is environmental stresses, biotic or abiotic. It is conceivable that elaborate systems were selected for perceiving environmental stresses and for adjusting plant growth and development accordingly. The expansion of the disease resistance gene family is consistent with the first role, whereas the latter may be fulfilled by TF duplicates. Second, the parallel expansion in TFs may be due to the requirement for dosage balance (Papp et al., 2003; Yang et al., 2003). The dosage balance hypothesis asserts that duplications of all genes involved in any protein complex, as one would expect from whole-genome duplication, would be more tolerable than single gene duplication. Whole-genome duplications have occurred in both the Arabidopsis and rice lineages (Blanc et al., 2000; Vision et al., 2000; Yu et al., 2002). The higher duplicability of plant TFs may be due to stronger deleterious effects of TF losses than losses of most other plant genes after whole-genome duplication. If this is true, TFs are more likely to form complexes than most other plant genes. Finally, parallel expansion may be due to the ease of subfunctionalization among TF duplicates. Subfunctionalization is the process by which duplicates lose different subfunctions of their common ancestor, resulting in the indispensability of both copies (Force et al., 1999). If ease of subfunctionalization explains the higher duplicability of TFs, TFs will have on average more functional modules than other plant genes.
These three explanations are not necessarily mutually exclusive, as dosage effect and subfunctionalization may result in the initial retention of duplicates followed by functional divergence. To elucidate their relative importance, it will be of great interest to examine the dosage effect of TF duplicates and the expression patterns and functional differences of duplicates with outgroup species that do not have duplication. Since TF families have various degrees of expansion (Table I), between-family comparison should provide insights into their differential expansions. Regardless of the mechanisms of retention, we found the degree of TF family expansion in plants is substantially higher than that in other eukaryotes or other plant genes. Given the importance of plant TFs in plant development and responses to environmental factors, we argue that the larger repertoire of recently acquired TF duplicates in plants plays a more significant role in developmental or other regulatory novelties than do their animal counterparts. Several gene families and functional categories have similar or even greater rates of expansion compared to TF families, e.g. the kinase family, the proteolysis category, and the defense response category. The relative importance of different mechanisms in retaining genes with diverse functions remains an intriguing question.
MATERIALS AND METHODS
Identification of TFs
A list of Arabidopsis (Arabidopsis thaliana) TFs was compiled based on two resources: the Arabidopsis Gene Regulatory Information Server (AGRIS; http://arabidopsis.med.ohio-state.edu/AtTFDB/index.jsp; Davuluri et al., 2003) and the Arabidopsis Transcription Regulators homepage from the Jen Sheen laboratory (http://genetics.mgh.harvard.edu/sheenweb/AraTRs.html). The protein domains in these putative TFs were identified by searching against the SMART (Schultz et al., 2000) and Pfam (Sonnhammer et al., 1998) databases. We included only those with at least one DNA-binding domain that has been demonstrated to directly modulate gene expression. With this criterion, a total of 32 DNA-binding domains were present in the Arabidopsis TF set. To identify these TF-associated DNA-binding domains from other eukaryotes, the hidden Markov models were retrieved from Pfam and used to search the protein sequences from 26 eukaryote genomes (Fig. 1). The Arabidopsis genome was analyzed in the same way to account for potential changes in annotation.
Lineage-Specific Expansion of TFs
Lineage-specific expansions are gene-gain events that occur specifically in a lineage. The lineage-specific expansion is determined by lineage-specific gains in putative OGs. The OGs were defined by the Cross-Species Best Match criterion detailed below. A distance matrix of each TF domain family of each organism was constructed by determining the pairwise scores in an all-against-all member BLAST search. For each TF (X) in species A, we first identified the highest scoring hit (Y) in species B. Then within-species searches were conducted to identify all TFs in A that have a higher score to X than to Y (referred to as the X set). Similarly, all TFs in B that have a higher score to Y than to X were identified as the Y set. In this example, X, the X set, Y, and the Y set all belong to the same OG.
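One reading of the Cross-Species Best Match criterion for a single query X can be sketched in Python as follows. The function names, gene names, and scores are hypothetical, and the sketch makes assumptions the text does not specify: pairwise scores are treated as symmetric, missing pairs are scored as 0, and the merging of groups obtained from different queries is not addressed.

```python
def pair_score(scores, p, q):
    """Symmetric lookup of a pairwise similarity score; missing pairs count as 0."""
    return scores.get((p, q), scores.get((q, p), 0.0))

def orthologous_group(x, species_a, species_b, scores):
    """Build the OG seeded by gene x of species A.

    Returns (members from species A, members from species B).
    """
    # Y: the highest scoring hit of X in species B.
    y = max(species_b, key=lambda b: pair_score(scores, x, b))
    # X set: genes in A that score higher to X than to Y.
    x_set = {a for a in species_a
             if a != x and pair_score(scores, a, x) > pair_score(scores, a, y)}
    # Y set: genes in B that score higher to Y than to X.
    y_set = {b for b in species_b
             if b != y and pair_score(scores, b, y) > pair_score(scores, b, x)}
    return {x} | x_set, {y} | y_set

# Hypothetical genes and similarity scores; not data from the study.
species_a = ["At1", "At2", "At3"]
species_b = ["Os1", "Os2"]
scores = {
    ("At1", "Os1"): 100, ("At1", "Os2"): 40, ("At1", "At2"): 150,
    ("At2", "Os1"): 90, ("At2", "Os2"): 35, ("At1", "At3"): 30,
    ("At2", "At3"): 30, ("At3", "Os1"): 45, ("At3", "Os2"): 120,
    ("Os1", "Os2"): 40,
}
a_members, b_members = orthologous_group("At1", species_a, species_b, scores)
```

In this toy example the group seeded by At1 contains two genes from species A and one from species B, which would correspond to an x:1 OG with a gain in the first lineage.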
Delineation of Arabidopsis Gene Families
The sequences of Arabidopsis proteins were used in an all-against-all BLAST search. The expected (E) values were transformed by taking the absolute values of their logarithm. A score matrix constructed with these transformed E values was used for similarity clustering with Markov Clustering (http://micans.org/mcl/; Van Dongen, 2000). The clusters generated were regarded as gene families. OGs in each family were identified as described in the previous section.
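The E value transformation described above can be sketched in Python as follows. The base-10 logarithm, the floor applied to E values reported as 0, the choice to keep the larger score when a pair is hit in both directions, and all names are assumptions made for illustration; the clustering step itself is left to the Markov Clustering program and is not shown.

```python
import math

def transformed_score(e_value, floor=1e-180):
    """Absolute value of log10(E); E values below the floor (including 0) are clamped."""
    return abs(math.log10(max(e_value, floor)))

def score_matrix(hits):
    """Build a sparse symmetric score matrix from (query, subject, E value) hits.

    Self hits are skipped; when a pair appears more than once, the larger
    transformed score is kept (an assumption, not stated in the text).
    """
    scores = {}
    for query, subject, e_value in hits:
        if query == subject:
            continue
        key = tuple(sorted((query, subject)))
        scores[key] = max(scores.get(key, 0.0), transformed_score(e_value))
    return scores

# Hypothetical BLAST hits; not data from the study.
example_hits = [("g1", "g2", 1e-50), ("g2", "g1", 1e-40), ("g1", "g1", 0.0)]
example_scores = score_matrix(example_hits)
```

Because the absolute value is taken, an E value greater than 1 also yields a positive score, so in practice hits would normally be filtered by an E value cutoff before this step.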
GO Categories and Identification of Overrepresented Categories
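The GO annotations of Arabidopsis genes were obtained from The Arabidopsis Information Resource (ftp://ftp.arabidopsis.org/home/tair/Genes/Gene_Ontology/). Only GO categories with at least 10 genes were analyzed to provide sufficient data points for statistical analyses. Using rice (Oryza sativa) genes as references, we determined the numbers of genes residing in OGs with or without expansion in the Arabidopsis lineage for each category X ($G_{X,\mathrm{Obs},E}$ and $G_{X,\mathrm{Obs},U}$, respectively). We also determined the numbers of genes in expanded and unexpanded OGs for all categories ($G_{\mathrm{All},E}$ and $G_{\mathrm{All},U}$, respectively). For each category X, the expected numbers of genes in expanded or unexpanded OGs ($G_{X,\mathrm{Exp},E}$ and $G_{X,\mathrm{Exp},U}$, respectively) are generated by the following.

$$G_{X,\mathrm{Exp},E} = \left(G_{X,\mathrm{Obs},E} + G_{X,\mathrm{Obs},U}\right) \times \frac{G_{\mathrm{All},E}}{G_{\mathrm{All},E} + G_{\mathrm{All},U}}$$

$$G_{X,\mathrm{Exp},U} = \left(G_{X,\mathrm{Obs},E} + G_{X,\mathrm{Obs},U}\right) \times \frac{G_{\mathrm{All},U}}{G_{\mathrm{All},E} + G_{\mathrm{All},U}}$$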
The expected values were then compared to the observed numbers of genes in expanded and unexpanded OGs with a chi-squared test to determine if the observed values were significantly different from the expected values.
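The comparison of observed and expected numbers can be sketched in Python as follows. This is an illustration under assumptions rather than the procedure used in the study: expected counts are obtained by scaling the category size by the whole-dataset proportions, and the test is written as a chi-squared comparison of observed against expected counts with one degree of freedom, whereas the text describes a 2 × 2 contingency table. All function names are hypothetical.

```python
import math

def expected_counts(obs_expanded, obs_unexpanded, all_expanded, all_unexpanded):
    """Expected genes in expanded and unexpanded OGs for one category,
    scaled from the proportions in the whole dataset."""
    category_total = obs_expanded + obs_unexpanded
    fraction_expanded = all_expanded / (all_expanded + all_unexpanded)
    return (category_total * fraction_expanded,
            category_total * (1.0 - fraction_expanded))

def chi_squared_df1(observed, expected):
    """Chi-squared statistic and P value for one degree of freedom.

    For one degree of freedom the upper tail probability is erfc(sqrt(stat / 2)).
    """
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Observed and expected counts for GO:0016563 as listed in Table II.
statistic, p_value = chi_squared_df1((19, 4), (12.2, 10.8))
```

With the Table II values for GO:0016563, this sketch gives a statistic of about 8.07 and a P value of roughly 0.0045, close to the 0.005 reported in the table.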
Acknowledgments
We thank Donna E. Fernandez, Melissa D. Lehti-Shiu, Geoffrey Morris, and Arnar Palsson for comments and for discussion.
This work was supported by a National Institutes of Health (NIH) fellowship to S.-H.S. and NIH grants to W.-H.L.
References
- Aasland R, Gibson TJ, Stewart AF (1995) The PHD finger: implications for chromatin-mediated transcriptional regulation. Trends Biochem Sci 20: 56–59
- Blanc G, Barakat A, Guyot R, Cooke R, Delseny M (2000) Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12: 1093–1101
- Blanc G, Wolfe KH (2004) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16: 1679–1691
- Borden KL (1998) RING fingers and B-boxes: zinc-binding protein-protein interaction domains. Biochem Cell Biol 76: 351–358
- Borden KL, Freemont PS (1996) The RING finger domain: a recent example of a sequence-structure family. Curr Opin Struct Biol 6: 395–401
- Bowman JL (2000) The YABBY gene family and abaxial cell fate. Curr Opin Plant Biol 3: 17–22
- Carroll SB (2000) Endless forms: the evolution of gene regulation and morphological diversity. Cell 101: 577–580
- Chaw SM, Chang CC, Chen HL, Li WH (2004) Dating the monocot-dicot divergence and the origin of core eudicots using whole chloroplast genomes. J Mol Evol 58: 424–441
- Cubas P, Lauter N, Doebley J, Coen E (1999) The TCP domain: a motif found in proteins regulating plant growth and development. Plant J 18: 215–222
- da Costa e Silva O (1994) CG-1, a parsley light-induced DNA-binding protein. Plant Mol Biol 25: 921–924
- Davuluri RV, Sun H, Palaniswamy SK, Matthews N, Molina C, Kurtz M, Grotewold E (2003) AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics 4: 25
- Doebley J, Lukens L (1998) Transcriptional regulators and the evolution of plant form. Plant Cell 10: 1075–1082
- Edskes HK, Ohtake Y, Wickner RB (1998) Mak21p of Saccharomyces cerevisiae, a homolog of human CAATT-binding protein, is essential for 60 S ribosomal subunit biogenesis. J Biol Chem 273: 28912–28920
- Ernst HA, Nina Olsen A, Skriver K, Larsen S, Lo Leggio L (2004) Structure of the conserved domain of ANAC, a member of the NAC family of transcription factors. EMBO Rep 5: 297–303
- Eulgem T, Rushton PJ, Robatzek S, Somssich IE (2000) The WRKY superfamily of plant transcription factors. Trends Plant Sci 5: 199–206
- Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531–1545
- Fujita A, Kikuchi Y, Kuhara S, Misumi Y, Matsumoto S, Kobayashi H (1989) Domains of the SFL1 protein of yeasts are homologous to Myc oncoproteins or yeast heat-shock transcription factor. Gene 85: 321–328
- Gaunt MW, Miles MA (2002) An insect molecular clock dates the origin of the insects and accords with palaeontological and biogeographic landmarks. Mol Biol Evol 19: 748–761
- Hamer L, Pan H, Adachi K, Orbach MJ, Page A, Ramamurthy L, Woessner JP (2001) Regions of microsynteny in Magnaporthe grisea and Neurospora crassa. Fungal Genet Biol 33: 137–143
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res (Database issue) 32: D258–D261 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Herrscher RF, Kaplan MH, Lelsz DL, Das C, Scheuermann R, Tucker PW (1995) The immunoglobulin heavy-chain matrix-associating regions are bound by Bright: a B cell-specific trans-activator that describes a new DNA-binding protein family. Genes Dev 9: 3067–3082 [DOI] [PubMed] [Google Scholar]
- Hobert O, Jallal B, Ullrich A (1996) Interaction of Vav with ENX-1, a putative transcriptional regulator of homeobox gene expression. Mol Cell Biol 16: 3066–3073 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hughes AL, Friedman R (2003) Parallel evolution by gene duplication in the genomes of two unicellular fungi. Genome Res 13: 794–799 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Klein J, Saedler H, Huijser P (1996) A new family of DNA binding proteins includes putative transcriptional regulators of the Antirrhinum majus floral meristem identity gene SQUAMOSA. Mol Gen Genet 250: 7–16 [DOI] [PubMed] [Google Scholar]
- Klempnauer KH, Sippel AE (1987) The highly conserved amino-terminal region of the protein encoded by the v-myb oncogene functions as a DNA-binding domain. EMBO J 6: 2719–2725 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kleyn PW, Fan W, Kovats SG, Lee JJ, Pulido JC, Wu Y, Berkemeier LR, Misumi DJ, Holmgren L, Charlat O, et al (1996) Identification and characterization of the mouse obesity gene tubby: a member of a novel gene family. Cell 85: 281–290 [DOI] [PubMed] [Google Scholar]
- Klug A, Rhodes D (1987) Zinc fingers: a novel protein fold for nucleic acid recognition. Cold Spring Harb Symp Quant Biol 52: 473–482 [DOI] [PubMed] [Google Scholar]
- Landschulz WH, Johnson PF, McKnight SL (1988) The leucine zipper: a hypothetical structure common to a new class of DNA binding proteins. Science 240: 1759–1764 [DOI] [PubMed] [Google Scholar]
- Levine M, Tjian R (2003) Transcription regulation and animal diversity. Nature 424: 147–151 [DOI] [PubMed] [Google Scholar]
- Li XY, Mantovani R, Hooft van Huijsduijnen R, Andre I, Benoist C, Mathis D (1992) Evolutionary variation of the CCAAT-binding transcription factor NF-Y. Nucleic Acids Res 20: 1087–1091 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Littlewood TD, Evan GI (1995) Transcription factors 2: helix-loop-helix. Protein Profile 2: 621–702 [PubMed] [Google Scholar]
- Ohme-Takagi M, Shinshi H (1995) Ethylene-inducible DNA binding proteins that interact with an ethylene-responsive element. Plant Cell 7: 173–182 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Omichinski JG, Clore GM, Schaad O, Felsenfeld G, Trainor C, Appella E, Stahl SJ, Gronenborn AM (1993) NMR structure of a specific DNA complex of Zn-containing DNA binding domain of GATA-1. Science 261: 438–446 [DOI] [PubMed] [Google Scholar]
- Papp B, Pal C, Hurst LD (2003) Dosage sensitivity and the evolution of gene families in yeast. Nature 424: 194–197 [DOI] [PubMed] [Google Scholar]
- Pellegrini L, Tan S, Richmond TJ (1995) Structure of serum response factor core bound to DNA. Nature 376: 490–498 [DOI] [PubMed] [Google Scholar]
- Pysh LD, Wysocka-Diller JW, Camilleri C, Bouchez D, Benfey PN (1999) The GRAS gene family in Arabidopsis: sequence characterization and basic expression analysis of the SCARECROW-LIKE genes. Plant J 18: 111–119 [DOI] [PubMed] [Google Scholar]
- Reeves R, Nissen MS (1990) The A.T-DNA-binding domain of mammalian high mobility group I chromosomal proteins. A novel peptide motif for recognizing DNA structure. J Biol Chem 265: 8573–8582 [PubMed] [Google Scholar]
- Reisz RR, Modesto SP (1996) Archerpeton anthracos from the Joggins formation of Nova Scotia: a microsaur, not a reptile. Can J Earth Sci 33: 703–709 [Google Scholar]
- Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, et al (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 290: 2105–2110 [DOI] [PubMed] [Google Scholar]
- Schauser L, Roussis A, Stiller J, Stougaard J (1999) A plant regulator controlling development of symbiotic root nodules. Nature 402: 191–195 [DOI] [PubMed] [Google Scholar]
- Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res 28: 231–234 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Scott MP, Tamkun JW, Hartzell GW III (1989) The structure and function of the homeodomain. Biochim Biophys Acta 989: 25–48 [DOI] [PubMed] [Google Scholar]
- Seoighe C, Gehring C (2004) Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet 20: 461–464 [DOI] [PubMed] [Google Scholar]
- Shimofurutani N, Kisu Y, Suzuki M, Esaka M (1998) Functional analyses of the Dof domain, a zinc finger DNA-binding domain, in a pumpkin DNA-binding protein AOBP. FEBS Lett 430: 251–256 [DOI] [PubMed] [Google Scholar]
- Shiu SH, Karlowski WM, Pan R, Tzeng YH, Mayer KF, Li WH (2004) Comparative analysis of the receptor-like kinase family in Arabidopsis and rice. Plant Cell 16: 1220–1234 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Solano R, Stepanova A, Chao Q, Ecker JR (1998) Nuclear events in ethylene signaling: a transcriptional cascade mediated by ETHYLENE-INSENSITIVE3 and ETHYLENE-RESPONSE-FACTOR1. Genes Dev 12: 3703–3714 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26: 320–322 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, et al (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol 1: E45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Suzuki M, Kao CY, McCarty DR (1997) The conserved B3 domain of VIVIPAROUS1 has a cooperative DNA binding activity. Plant Cell 9: 799–807 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Taylor JS, Van de Peer Y, Braasch I, Meyer A (2001) Comparative genomics provides evidence for an ancient genome duplication event in fish. Philos Trans R Soc Lond B Biol Sci 356: 1661–1679 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ulmasov T, Hagen G, Guilfoyle TJ (1997) ARF1, a transcription factor that binds to auxin response elements. Science 276: 1865–1868 [DOI] [PubMed] [Google Scholar]
- van der Knaap E, Kim JH, Kende H (2000) A novel gibberellin-induced gene from rice and its potential regulatory role in stem growth. Plant Physiol 122: 695–704 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Van Dongen SM (2000) Graph clustering by flow simulation. PhD thesis. University of Utrecht, The Netherlands
- Vision TJ, Brown DG, Tanksley SD (2000) The origins of genomic duplications in Arabidopsis. Science 290: 2114–2117 [DOI] [PubMed] [Google Scholar]
- Weigel D (1995) The APETALA2 domain is related to a novel type of DNA binding domain. Plant Cell 7: 388–389 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Weigel D, Alvarez J, Smyth DR, Yanofsky MF, Meyerowitz EM (1992) LEAFY controls floral meristem identity in Arabidopsis. Cell 69: 843–859 [DOI] [PubMed] [Google Scholar]
- Wendel JF (2000) Genome evolution in polyploids. Plant Mol Biol 42: 225–249 [PubMed] [Google Scholar]
- Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA (2003) The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol 20: 1377–1419 [DOI] [PubMed] [Google Scholar]
- Yang J, Lusk R, Li WH (2003) Organismal complexity, protein complexity, and gene duplicability. Proc Natl Acad Sci USA 100: 15661–15665 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79–92 [DOI] [PubMed] [Google Scholar]
- Zdobnov EM, von Mering C, Letunic I, Torrents D, Suyama M, Copley RR, Christophides GK, Thomasova D, Holt RA, Subramanian GM, et al (2002) Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298: 149–159 [DOI] [PubMed] [Google Scholar]
- Zheng N, Fraenkel E, Pabo CO, Pavletich NP (1999) Structural basis of DNA recognition by the heterodimeric cell cycle transcription factor E2F-DP. Genes Dev 13: 666–674 [DOI] [PMC free article] [PubMed] [Google Scholar]