Abstract
Transcription factors (TFs) recognize specific DNA sequences to control chromatin and transcription, forming a complex system that guides expression of the genome. We present a catalog of 1,639 likely human TFs and their binding specificities. Decoding how TFs find their genomic targets and how TF binding relates to regulation of transcription remains challenging. Cooperativity among TFs and associated chromatin factors can impart specificity, but relevant data are generally lacking. Major classes of human TFs differ markedly in their evolutionary trajectories and expression patterns, underscoring distinct functions. TFs likewise underlie many different aspects of human physiology, disease, and variation, highlighting the importance of understanding TF-mediated gene regulation.
Introduction
Transcription factors (TFs) directly interpret the genome, performing the first step in decoding the DNA sequence. Many function as “master regulators” and “selector genes”, exerting control over processes that specify cell types and developmental patterning (Lee and Young, 2013) and controlling specific pathways such as immune responses (Singh et al., 2014). In the laboratory, TFs can drive cell differentiation (Fong and Tapscott, 2013), and even de-differentiation and trans-differentiation (Takahashi and Yamanaka, 2016). Mutations in TFs and TF binding sites underlie many human diseases. Their protein sequences, regulatory regions, and physiological roles are often deeply conserved among metazoans (Bejerano et al., 2004; Carroll, 2008), suggesting that global gene regulatory “networks” may be similarly conserved. And yet, there is high turnover in individual regulatory sequences (Weirauch and Hughes, 2010), and over longer timescales, TFs duplicate and diverge. The same TF can regulate different genes in different cell types (e.g., GR in breast and endometrial cell lines (Gertz et al., 2012)), indicating that regulatory networks are dynamic even within the same organism. Determining how TFs are assembled in different ways to recognize binding sites and control transcription is daunting, yet paramount to understanding their physiological roles, decoding specific functional properties of genomes, and mapping how highly-specific expression programs are orchestrated in complex organisms.
This review considers our current understanding of TFs and their global functions to provide context for thinking about how TFs work individually and as an ensemble. We also provide a catalog of the human TF complement, and a comprehensive assessment of whether a DNA binding motif is known for each TF. We use this catalog to survey human TF function, expression, and evolution, highlighting the roles played by TFs in human disease, including the effect of variation within TF proteins and TF binding sites. A comprehensive review of ~1,600 proteins is impossible; instead, we attempt to exemplify emerging trends and techniques, as well as shortcomings in existing data.
Historically, the term transcription factor (TF) has been applied to describe any protein involved in transcription and/or capable of altering gene expression levels. In the current vernacular, however, the term is reserved for proteins capable of (i) binding DNA in a sequence-specific manner and (ii) regulating transcription (Figure 1A) (Fulton et al., 2009; Vaquerizas et al., 2009). TFs can have thousand-fold or greater preference for specific binding sequences relative to other sequences (Damante et al., 1994; Geertz et al., 2012). Because TFs can act by occluding the DNA binding site of other proteins (e.g. the classic lambda, lac and trp repressors (Ptashne, 2011)), the ability to bind to specific DNA sequences alone is often taken as an indicator of ability to regulate transcription.
Thus, these proteins cannot be understood functionally without accompanying detailed knowledge of the DNA sequences they bind.
Figure 1. The human transcription factor repertoire.

A. Schematic of a prototypical TF. B. Number of TFs and motif status for each DBD family. Inset displays the distribution of the number of C2H2-ZF domains for classes of effector domains (KRAB, SCAN or BTB domains); “Classic” indicates the related and highly conserved SP, KLF, EGR, GLI, GLIS, ZIC and WT proteins. C. DBD configurations of human TFs. In the network diagram, edge width reflects the number of TFs with each combination of DBDs. D. Number of auxiliary (non DNA-binding) domains (from Interpro) present in TFs, broken down by DBD family.
TF DNA binding specificities are frequently summarized as “motifs” - models representing the set of related, short sequences preferred by a given TF, which can be used to scan longer sequences (e.g. promoters) to identify potential binding sites. Determining a DNA-binding motif is often the first step towards detailed examination of the function of a TF because identification of potential binding sites provides a gateway to further analyses. Our ability to generate both motifs and genomic binding sites has improved dramatically over the last decade, leading to an unprecedented wealth of data on TF-DNA interactions. To develop the current TF catalog, we have drawn heavily upon motif collections such as TRANSFAC (Matys et al., 2006), JASPAR (Mathelier et al., 2016), HT-SELEX (Jolma et al., 2013; Jolma et al., 2015; Yin et al., 2017), UniPROBE (Hume et al., 2015), and CisBP (Weirauch et al., 2014), along with previous catalogues of human TFs (Fulton et al., 2009; Vaquerizas et al., 2009; Wingender et al., 2015).
However, there is typically only partial overlap between experimentally determined binding sites in the genome and sequences matching the motif, and even experimentally determined binding sites are relatively poor predictors of the genes that the TFs actually regulate (Cusanovich et al., 2014). At the same time, motif matches are often among the most enriched sequences in a ChIP-seq dataset, indicating that intrinsic DNA binding specificity is important for TF binding in vivo. In retrospect, this outcome should have been expected: most TF binding sites are small (usually 6–12 bases) and flexible, so a typical human gene (>20 kb) will contain multiple potential binding sites for most TFs (Wunderlich and Mirny, 2009). Well-established concepts such as cooperativity and synergy between TFs provide a ready solution to this deficit in specificity (most human TFs have to work together to get anything done), but the details of their interactions and relationships are generally lacking. The biochemical effects of TFs subsequent to binding DNA are also largely unmapped and known to be context dependent. As a result, decoding how gene regulation relates to TF binding motifs and gene sequences remains a major practical challenge; the resulting frustration has been embodied in the term “futility theorem” (Wasserman and Sandelin, 2004).
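The argument that short, flexible motifs match by chance many times in a typical gene can be illustrated with a back-of-the-envelope calculation. The sketch below is an illustration only, not an analysis from the cited studies. It assumes random DNA with equal base frequencies and independent positions, and it summarizes motif flexibility as information content in bits (a fully specified base contributes 2 bits). The function name and the example motif values are hypothetical.

```python
def expected_chance_matches(region_length, motif_width, info_bits, both_strands=True):
    """Expected number of chance motif matches in random DNA.

    Assumes equal base frequencies and independent positions, so a motif
    carrying `info_bits` bits of information matches any given window with
    probability 2 ** -info_bits.
    """
    if motif_width > region_length:
        return 0.0
    # Number of windows of the motif's width that fit in the region.
    windows = region_length - motif_width + 1
    # A double-stranded region can be read on either strand.
    strands = 2 if both_strands else 1
    return strands * windows * 2.0 ** (-info_bits)


if __name__ == "__main__":
    # Hypothetical motifs: (width in bases, information content in bits).
    for width, bits in [(6, 12.0), (10, 14.0), (12, 16.0)]:
        count = expected_chance_matches(20_000, width, bits)
        print(f"width={width:2d} bits={bits:4.1f} expected matches={count:.2f}")
```

Under these assumptions, a motif carrying 12 bits (equivalent to six fully specified bases) is expected to match roughly ten times by chance in a 20 kb region, which is consistent with the point that motif matches alone cannot single out functional sites.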
How transcription factors are identified
The major TF families in eukaryotes, such as C2H2-zinc finger (ZF), Homeodomain, basic Helix-Loop-Helix (bHLH), basic Leucine Zipper (bZIP), and Nuclear Hormone Receptor (NHR) were initially described in the 1980s (reviewed in (Johnson and McKnight, 1989)). Knowledge of binding sites, often identified by methods such as DNase footprinting or mobility shift, led to identification of the particular binding proteins using N-terminal peptide sequencing, phage libraries, or one hybrid screening. Similarities in amino acid composition and structure were then noted among different DNA binding proteins. New DNA-binding proteins continue to be identified by experimental methods; for example, one-hybrid assays (see (Reece-Hoyes and Walhout, 2012)), DNA affinity purification-mass spectrometry (reviewed in (Tacheny et al., 2013)), and protein microarrays (Hu et al., 2009) can all screen for new DNA binding proteins. Today, most known and putative TFs are instead identified by sequence homology to a previously-characterized DNA binding domain (DBD), which is also used to classify the TF (see (Weirauch and Hughes, 2011) for review). With the possible exception of the very simple AT-hook (Aravind and Landsman, 1998), all extant examples of DBDs are assumed to be derived from a small set of common ancestors representing the major DBD folds, with the families arising by duplication. There are ~100 known eukaryotic DBD types, which are catalogued in Pfam (Finn et al., 2016), SMART (Letunic et al., 2015) or Interpro (Finn et al., 2017) as Hidden Markov Models (HMMs), which are used to scan protein sequences for these domains. DBD structures in complex with DNA are currently available in the Protein Data Bank (PDB) (Berman et al., 2000) for most families of human TFs, with AP2, BED-ZF, CP2, SAND and NRF being notable exceptions. To date, all but a handful of well-characterized mammalian TFs contain a known DBD (Fulton et al., 2009).
It is likely that additional DBDs remain to be discovered; for example, extended homologous regions in polycomb-like proteins were recently found to bind motifs containing CG dinucleotides (Li et al., 2017).
Care must be taken when inferring function based only on a homology match to a DBD, because not all instances of these domains will necessarily bind specific DNA sequences. The CERS/Lass-type Homeodomains, for example, are not likely to be DNA binding proteins at all; they instead appear to have been co-opted to function in sphingolipid synthesis (Mesika et al., 2007). Likewise, only a subset of Myb/SANT, HMG, and ARID domain containing proteins bind specific DNA sequences. In addition, domains with similar names should not be confused. For example, C2H2-ZFs and CCCH-ZFs are structurally and evolutionarily distinct, and while C2H2-ZFs generally bind double stranded DNA, CCCH-ZFs typically bind single stranded RNA (reviewed in (Font and Mackay, 2010)).
Determining TF DNA binding motifs
Motifs are typically displayed as a sequence logo (Schneider and Stephens, 1990), which in turn represents an underlying table or “Position Weight Matrix” (PWM) of relative preference of the TF for each base in the binding site (Stormo and Zhao, 2010). At each base position, each of the four bases has a score, and multiplying these scores for each base of a sequence yields a predicted relative affinity of the TF to that sequence. In many cases, these logos reflect strong preference to one or a small number of related sequences, although they can also represent weak base preferences that nonetheless contribute to binding. In addition, complications can arise that are not captured by a PWM: there may be dependencies among base positions (Bulyk et al., 2002; Jolma et al., 2013), for example, due to DNA shape or deformability (Rohs et al., 2009); the TF may have multiple binding modes (e.g., different physical configurations of the protein leading to separate, distinct motifs) (Badis et al., 2009); cooperative interactions may influence the sites bound by a TF (Jolma et al., 2015); or DNA methylation can impact binding, positively or negatively (Yin et al., 2017). To account for these complexities, more complicated models have been developed, often incorporating preferences to dinucleotides and higher-order k-mers (reviewed in (Slattery et al., 2014)), with improvement in accuracy depending on the TF and its family. In many cases, however, the improvement is minor or even undetectable, especially when comparing across different data sets (Weirauch et al., 2013), and the PWM remains the most commonly used model for analysis of TF binding. Hereafter, we use the term “motif” to signify PWM.
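The multiplicative scoring described above, and the use of a motif to scan a longer sequence, can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring scheme of any particular database or tool. The matrix `TOY_PWM` and the example sequence are hypothetical, and the scores are scaled so that the preferred base at each position has a value of 1.0.

```python
from math import prod

# Hypothetical 4-position motif (consensus ACGG). Each column holds the
# relative preference for each base, scaled so the best base scores 1.0.
TOY_PWM = [
    {"A": 1.0, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.05, "C": 1.0, "G": 0.3, "T": 0.05},
    {"A": 0.2, "C": 0.1, "G": 1.0, "T": 0.1},
    {"A": 0.5, "C": 0.1, "G": 1.0, "T": 0.1},
]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def relative_affinity(pwm, site):
    """Multiply the per-position scores to get a predicted relative affinity."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif width")
    return prod(column[base] for column, base in zip(pwm, site))


def scan(pwm, sequence):
    """Yield (start, strand, score) for every window on both strands."""
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield start, "+", relative_affinity(pwm, window)
        yield start, "-", relative_affinity(pwm, reverse_complement(window))


if __name__ == "__main__":
    hits = sorted(scan(TOY_PWM, "TTACGGTT"), key=lambda hit: hit[2], reverse=True)
    for start, strand, score in hits[:3]:
        print(start, strand, round(score, 4))
```

Because each position contributes independently to the product, this model cannot represent the dependencies among positions, multiple binding modes, or methylation effects discussed above; that limitation is what the more complex models attempt to address.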
The sequence preferences and binding sites of TFs can be assessed by a wide variety of techniques both in vitro and in vivo (reviewed in (Jolma and Taipale, 2011)); Table 1 outlines the most prevalent methods and their attributes. As a predictor of relative binding affinity, motifs are most accurately obtained from quantitative affinity measurements for a large number of sequences, preferably using purified proteins and DNA (Stormo and Zhao, 2010). Nonetheless, motifs for many well-studied proteins were initially obtained from very few sequences (e.g. dozens of Sanger reads) and used in thousands of subsequent studies (Mathelier et al., 2016; Matys et al., 2006), illustrating the utility of even approximate descriptions of binding ability.
Table 1. Experimental methods for determining and validating TF binding specificities.
In vitro and in vivo methods currently used to experimentally derive and confirm TF binding sites and motifs.
| Method | Description | Pro (+) and Con (−) | Ref |
|---|---|---|---|
| a. Low throughput assays | |||
| EMSA | Tests whether a sequence is bound by a TF through observing a shift in the electrophoretic migration of DNA | (−) Low throughput; (−) measurement errors; (+) Can be used as a step in other methods such as SELEX | 1 |
| SELEX or CASTing | TF is allowed to find its target sites from DNA pool with a randomized region. SELEX needs to be repeated many times in order to enrich target sites sufficiently for Sanger sequencing | (+) Suited for TFs with up to ~25N target sites; (−) Low throughput; (−) Sampling error and false positives due to many SELEX cycles required; (−) non-specific DNA carryover | 2, 3 |
| Yeast one hybrid | A library of proteins is screened for ability to bind to a putative TF target site cloned into a minimal promoter that drives expression of a selection marker | (−) Low throughput; (+) Method for identifying an unknown protein that binds to a sequence of interest | 4 |
| Footprinting | DNA is labeled at one end, incubated with the TF and then degraded using DNase I or hydroxyl radicals, resulting in cuts in all positions except those that were protected by the bound TF. After digestion, the protected region is identified through gel electrophoresis | (−) Low throughput, TF concentration determines stringency | 5, 6 |
| DNA affinity purification | DNA sequence is immobilized and then used to select proteins from a cell extract followed by identification using SDS-PAGE, antibody detection, peptide sequencing or mass spectrometry | (−) Low throughput; (+) Method is suitable for identifying an unknown protein that binds to a sequence of interest | 7 |
| ITC, SPR, MSTP | Isothermal titration calorimetry (ITC), Surface plasmon resonance (SPR) and Microscale thermophoresis (MSTP) measure the binding affinity of TF to DNA | (−) Low throughput; (−) No de novo capability; (+) Determine physical rather than relative binding affinities | 8–10 |
| b. In vivo methods | |||
| ChIP assays | Proteins are crosslinked to DNA using formaldehyde and the specific protein is precipitated with an antibody followed by detection of bound DNA with qPCR, microarray (ChIP-chip; ref. 11) or sequencing (ChIP-seq; ref. 12). The ChIP-exo variant incorporates exonuclease treatment to enhance resolution (ref. 13) | (−) Low throughput; (−) Requires high quality antibody and a cell line that expresses the TF; (−/+) Bound sites reflect both indirect and direct binding and the chromatin state of the cells; (−) Affected by the inherently skewed genomic sequence distribution | 11–13 |
| DamID-seq | A TF is expressed in mammalian cells as a fusion to bacterial Dam-methylase that methylates its consensus sequence, which can then be cleaved with a restriction enzyme that requires Dam methylation and sequenced to determine the bound sites. | (−) Low throughput; (−) Affected by the inherently skewed genomic sequence distribution; (−/+) In vivo, modified sites reflect both indirect and direct binding and the chromatin state of the cells. | 14 |
| c. Current high throughput methods – PBMs, HT-SELEX, B1H | |||
| Bacterial one-hybrid | TF binding sites are selected in living bacterial cells from randomized library that is cloned in front of selection marker genes. Adaptations include reversed configuration (Naja), where a constant DNA target sequence is used to select from a variable protein sequence-coding library. | Suitable for analysis of TFs with <14 bp long target sites; (+) High throughput; (−) Saturation of selection marker expression, (−) Artefacts due to endogenous protein. Requires negative selection against sequences that cause auto-activation | 15–19 |
| HT-SELEX | SELEX combined with modern sequencing. Variations include SELEX-seq and Bind-n-Seq. Adaptations include versions for CpG methylated library (ref. 20), heterodimeric TF complexes (ref. 21) and a version known as SMiLE-seq that uses a microfluidic platform for the selection | (+) Compatible with TFs with up to ~25N target sites; (+) High throughput; (−) Cycles 1 and 2 are skewed by saturation approaching conditions (too sloppy motifs) while later cycles are affected by exponential enrichment (too stringent motifs) | 20–25 |
| DAP-seq | Single step SELEX using genomic sequence derived library. Diversity is much poorer than in the regular SELEX but the genome has co-evolved with the TF and thus sites should be present | (+) High throughput; (+) Potentially well suited for TFs with long sites; (−) Inherently skewed sequence distribution of the genome; | 26 |
| HiTS-FLIP or CHAMP | Using a sequencer flowcell as a PBM chip. To our knowledge the methods have been used only for a handful of TFs meaning that they are likely too difficult to implement in practice. | (+) Potentially High throughput; (+) Capable of analyzing up to 17N target sites | 27, 28 |
| Protein microarray | Screening individual DNA sequences one at a time against a large panel of proteins that have been immobilized into a microarray. An adaptation analyses CpG methylated DNA molecules. | (+) High throughput; (+) Identifying an unknown TF that binds to a site; (−) High incidence of likely false positives that are not reproducible using other methods such as PBM or HT-SELEX | 29, 30 |
| d. Medium throughput assays for validation and refinement of DNA-binding parameters of predetermined target sequences | |||
| Spec-seq | Single cycle HT-SELEX with microarray synthesized library. Less complex library helps with the saturation issues and makes the enrichment analysis more quantitative | (+) High throughput; (−) Limited de novo capability; (+) more quantitative than regular SELEX | 31 |
| MITOMI | Microfluidics device is used to isolate DNA-protein complexes from free DNA instantaneously to measure relative binding affinities of TFs from few to tens of sites. | (+) High throughput; (−) No de novo capability; (+) Particularly useful for analysis of TFs with weak affinities | 32 |
| Competition ELISA | Determining accurate relative binding affinities to a few tens of pre-determined sequences by testing variant sequences against a reference sequence in wells of 96-well plates. | (+) High throughput; (−) No de novo capability; (−) well-to-well variation | 33–35 |
ChIP-seq (Johnson et al., 2007) has revolutionized the study of TF binding sites in vivo, by enabling the genome-wide identification of regions occupied by a TF of interest. The semi-quantitative measurements obtained have several limitations with regard to motif derivation, however. First, binding is influenced by chromatin state (many TFs bind almost exclusively in open chromatin) as well as biases in the sequence content of the genome. Second, ChIP-seq can clearly detect indirect binding, which can lead to identification of motifs for proteins other than the one ChIPped (Wang et al., 2013; Worsley Hunt and Wasserman, 2014). Third, due to the use of cross-linkers, ChIP does not measure equilibrium binding. Finally, ChIP data is highly dependent on antibody quality: many antibodies cross-react, and ChIP-grade antibodies are not available for many TFs. It is thus often helpful to use prior knowledge regarding the motif expected; for example, the C2H2-ZF “recognition code” (which relates DNA-contacting residues to preferred base positions in the binding site (Najafabadi et al., 2015)) can be used to restrict the analysis to those motifs that resemble computational-based specificity predictions. Some of these issues are in theory addressed by higher resolution approaches such as ChIP-exo (Rhee and Pugh, 2011), but relatively few examples are currently available.
In summary, we now appear to possess the tools needed to identify TF motifs globally. Having these motifs, however, is only a first step in decoding the functions of these proteins in gene regulation; we outline additional complexities in the following sections.
TF cooperativity and interactions with nucleosomes
Both theoretical arguments and practical observations indicate that metazoan TFs must, in general, work together to achieve needed specificity in both DNA binding and effector function – hence the “futility theorem” (Reiter et al., 2017; Wasserman and Sandelin, 2004; Wunderlich and Mirny, 2009). In human, it appears that very few proteins occupy most of their motif matches under physiological conditions; the only clear example out of hundreds that have been examined by ChIP-seq is CTCF, which occupies most of the ~14,000 matches to its ~14-base motif in the human genome with most of the sites occupied across the tested cell types (Fu et al., 2008; Kim et al., 2007). There are myriad ways that TFs are known to collaborate, including aiding each other in binding DNA (cooperative binding), or by impacting chromatin state or transcription through different mechanisms (synergistic regulation). TFs can also bind cooperatively as homodimers (e.g., bZIPs and bHLHs), trimers (e.g., heat shock factors), or higher-order structures (see below). TF interplay is intrinsically related to enhancer function and “logic” (reviewed in (Reiter et al., 2017; Spitz and Furlong, 2012)). Here we mainly consider how cooperative binding is achieved, as it is germane to TF function.
Cooperative binding can occur by several means (reviewed in (Morgunova and Taipale, 2017)). It is most easily understood when it is mediated by protein-protein interactions, which confer additional stability when two (or more) interacting proteins bind DNA in a compatible spacing and orientation. High-throughput in vitro studies indicate that cooperative binding often impacts the sequence preferences of TFs in a complex, and can also introduce constraints on intervening sequence between the two binding sites, presumably due to stereochemical requirements (Jolma et al., 2015; Slattery et al., 2011). Results from single-molecule imaging studies confirm that binding sites are occupied longer when multiple TFs bind together (Chen et al., 2014; Gebhardt et al., 2013).
Recent evidence suggests that DNA-mediated cooperative binding also plays an important role in TF function. A test of 9,400 human TF pairs using Consecutive Affinity Purification (CAP)-SELEX identified 315 pairs with clear spacing and orientation preferences between their binding sites (Jolma et al., 2015). Molecular modeling and structural analyses indicated that in some cases cooperativity was due to DNA facilitating contacts between the proteins. In other cases, the proteins bound on the opposite sides of the DNA, or relatively far from each other, suggesting that DNA directly mediated the cooperativity. That is, binding of one TF influenced the shape of the DNA in a manner that promoted the binding of the second TF. Indeed, one of the best-studied enhancers, the highly ordered IFNβ enhanceosome, appears to exemplify this mechanism. At this ~50bp locus, constrained spacing and orientation of binding sites for eight TFs facilitates interactions, allowing for the recruitment of three non-DNA binding cofactors. Structural analysis, however, reveals relatively few contacts among the TFs (Panne, 2008), with stability conferred instead by induced changes in DNA structure, and interactions with cofactors. DNA-mediated cooperative binding for TFs bound within ~10 bases of each other can also be mediated by DNA vibrational modes, which is predicted to occur to some extent between all possible pairs of TFs (Jolma et al., 2015; Kim et al., 2013).
In order to bind to nucleosomal DNA, TFs must either compete with nucleosomes or interact with nucleosomes or nucleosomal DNA in some way to access their sites. TFs can inherently cooperate with each other to compete with nucleosomes (reviewed in (Wunderlich and Mirny, 2009)), and indeed binding sites identified in ChIP-seq are often biased towards homotypic clusters, especially when low-affinity motifs are considered (Gotea et al., 2010). In addition, some TFs can initiate the displacement of nucleosomes, or at least change their conformations (e.g., Foxa1 (Iwafuchi-Doi et al., 2016; Swinstead et al., 2016a)), most likely by recruiting ATP-dependent chromatin remodelers and other TFs (reviewed in (Swinstead et al., 2016b)). The activity of these TFs may also be dependent on their ability to bind nucleosomal DNA, which can be influenced by the rotational positioning of the binding site on the nucleosome (e.g. the Yamanaka factors POU5F1, SOX2, KLF4, and MYC (Soufi et al., 2015)). An additional intriguing observation is that different chromatin remodelers possess preferences for specific DNA sequences and/or nucleosome conformations (Rippe et al., 2007), suggesting that both nucleosomes and nucleosome positioning mechanisms impart additional DNA sequence specificity to TF action.
TF effector functions
TFs vary dramatically in how they impact transcription upon DNA binding. Some human TFs (e.g. TBP) can directly recruit RNA polymerase, while others recruit accessory factors that promote specific phases of transcription (reviewed in (Frietze and Farnham, 2011)). As in bacteria, human TFs can lack a specific effector function, and instead act by steric mechanisms, such as occlusion (blocking other proteins from binding to the same site) (Akerblom et al., 1988). Most eukaryotic TFs, however, are thought to act by recruiting cofactors (Reiter et al., 2017). Such “coactivators” and “corepressors”, initially identified as mediators of TF effector activity, are frequently large multi-subunit protein complexes or multi-domain proteins that regulate transcription via multiple mechanisms. They commonly contain domains involved in chromatin binding, nucleosome remodeling and/or covalent modification of histones or other proteins, including TFs and RNA polymerase (Frietze and Farnham, 2011). The IFNβ enhanceosome is a classic illustration of coactivator recruitment, with the binding of multiple TFs resulting in the recruitment of GCN5/KAT2A and CBP/p300 histone acetyltransferases (reviewed in (Panne, 2008)). The resulting changes to the local chromatin environment recruit nucleosome remodelers such as the SWI/SNF complex to create room for RNA polymerase to bind and initiate transcription. Some coactivators and corepressors appear to be more widely used than others. p300 is often used as a marker of enhancers (Visel et al., 2009), associating with dozens of TFs (Frietze and Farnham, 2011). The Mediator complex, which bridges TFs and RNA Polymerase II, is similarly associated with thousands of loci – possibly, the majority of transcribed genes (Kagey et al., 2010) - and is recruited by dozens of TFs (Malik and Roeder, 2010).
Dedicated effector domains often mediate the recruitment of specific cofactors by TFs. The KRAB domain, for instance, is found in ~350 human C2H2-ZF proteins. It recruits TRIM28/KAP1, which in turn recruits HP1/CBX5 and SETDB1, catalyzing deposition of the repressive H3K9me3 histone mark (reviewed in (Ecco et al., 2017)). Likewise, ligand-binding domains of nuclear hormone receptors facilitate interactions with coactivators, corepressors, and other TFs in a ligand- and context-dependent manner (reviewed in (Rosenfeld et al., 2006)). Many TFs do not contain well-defined effector domains, however. Some are comprised almost entirely of a single DBD, and are thus unlikely to contain separable activation domains, especially in the bZIP (e.g. BATF, CREBL2, and MAFK) and bHLH (e.g. MAX, NHLH1, and ATOH7) families. Classical transcriptional activator sequences present in well-studied proteins (e.g. the acidic sequences found in TP53, E2F, and SP1) are often unstructured low-complexity sequences with small functional regions dubbed short linear motifs (Garza et al., 2009). The LxxLL motif, for instance, was originally identified as a protein-protein interaction interface of nuclear hormone receptors with their cofactors (NCoA, CBP, Mediator, etc.), but is also present in unrelated TF families (e.g., Myb/SANT and STAT) (Plevin et al., 2005). Many of the best-characterized C2H2-ZF TFs are also known to exploit unstructured regions and/or DBDs to interact with cofactors (Brayer and Segal, 2008).
TFs have traditionally been classified as either “activators” or “repressors”; however, this notion has been repeatedly questioned. Many TFs can recruit multiple cofactors that have opposite effects (Frietze and Farnham, 2011; Rosenfeld et al., 2006; Schmitges et al., 2016), dependent on the local sequence context and availability of cofactors (Meijsing et al., 2009; Wong and Struhl, 2011). MAX, for example, functions as an inhibitor when binding to DNA as a heterodimer with MNT or MXD1, and as an activator when binding as a heterodimer with MYC (reviewed in (Amati and Land, 1994)). A recent study used a complex pool of >4 million sequences to survey the effect on gene expression of the relative positions of various TF binding sites in diverse contexts, uncovering numerous motifs capable of both activation and repression in the same cell type (Ernst et al., 2016).
Because effects on transcription are so frequently context dependent, more precise terminology may be warranted in general, for example, reflecting the biochemical activities of TFs and their cofactors. On a global level, however, there is no comprehensive catalogue of cofactors recruited by TFs. Moreover, the biochemical functions required for gene activation or communication between enhancers and promoters remain largely unknown (Zabidi and Stark, 2016). As many as 443 different chromatin modification proteins have been catalogued in human, and many interactions among cofactors and chromatin proteins have been described (e.g. (Marcon et al., 2014)). However, the same studies detected few TFs, suggesting that TF-cofactor interactions are weak/transient, or that relative stoichiometry is skewed against TFs. Given the large number of factors involved, it is conceivable that a complex network of thousands of interactions among TFs and cofactors exists, providing a ready explanation for context dependency.
The human TF repertoire
A key starting point in the global analysis of human TFs and gene regulation is a simple index of high-confidence human TFs, and what is known about them. There is no one-size-fits-all solution to automate the generation of such a list: domain structures do not perfectly predict TFs, the literature is highly heterogeneous, and electronic annotations are non-uniform. To our knowledge, the latest comprehensive reviews of human TFs were published in 2009 (Fulton et al., 2009; Vaquerizas et al., 2009). Fulton et al. curated a list of putative mouse and human TFs based on evidence of TF activity, including both DNA binding and regulation of transcription, identifying a total of 535 human TFs. Vaquerizas et al. annotated putative DBDs and proteins that contain them with confidence levels based on selectivity for known TFs and their likelihood of involvement in transcription. This list was then appended with Gene Ontology and Transfac TF annotations to yield a total of 1,391 human TFs. In recent years, the field has advanced substantially with dramatic expansions in data collection, including hundreds of motifs generated in vitro (Badis et al., 2009; Jolma et al., 2013; Wei et al., 2010; Weirauch et al., 2013; Weirauch et al., 2014; Yin et al., 2017). There have also been updates to gene annotations. We therefore undertook a de novo manual curation of the human TF collection, which forms the basis of the remainder of this review.
The overall approach is depicted in Figure S1A. We manually examined 2,765 proteins compiled by combining putative TF lists from several sources: the aforementioned papers (Fulton et al., 2009; Vaquerizas et al., 2009), domain searches (using HMMs and parameters from CisBP (Weirauch et al., 2014) and Interpro, as well as the Transfac-related database TFClass (Wingender et al., 2015)), Gene Ontology, and crystal and NMR structures of proteins in complex with DNA taken from the PDB (Berman et al., 2000). We created a web page for each protein containing all relevant information and links to external databases. We then assigned two curators (among the authors of this manuscript) to classify the protein’s status as a TF (“TF with a known motif”, “TF with a motif inferred from a close homolog”, “likely TF” (due to presence of a DBD or literature information), “ssDNA/RNA binding protein”, or “unlikely TF”), and its DNA binding mode (binds as a monomer or homomultimer, binds as an obligate heteromer, binds with low specificity, or does not bind DNA). Curators could also submit notes and citations supporting their assessments. Using data from CisBP and other sources, we recorded whether motifs are known for each TF (or a close homolog), along with the availability of a protein-DNA structure. We considered global sequence alignments and known DNA binding residues to make decisions for poorly characterized proteins within families where only a subset bind DNA (e.g., ARID, HMG, and Myb/SANT). To make the task feasible, we did not explore or record complexities such as protein modifications or binding partners. Three senior authors (TRH, MTW, JT) resolved cases of disagreement between reviewers, and manually reviewed all cases where both curators agreed that a protein without a canonical DBD is a likely TF. Table S1 contains the full curation results. 
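The two-curator workflow described above can be sketched as a simple reconciliation step. This is a minimal illustration, not the pipeline actually used for the curation: the function name `reconcile`, its arguments, and its return convention are hypothetical, and only the status labels are taken from the text.

```python
# Status labels as listed in the text.
TF_STATUSES = {
    "TF with a known motif",
    "TF with a motif inferred from a close homolog",
    "likely TF",
    "ssDNA/RNA binding protein",
    "unlikely TF",
}


def reconcile(call_a, call_b, has_canonical_dbd):
    """Return (status, needs_senior_review) for one protein.

    Hypothetical rule set mirroring the text: disagreements between the
    two curators go to senior review, as do agreed "likely TF" calls for
    proteins that lack a canonical DBD.
    """
    for call in (call_a, call_b):
        if call not in TF_STATUSES:
            raise ValueError(f"unknown status: {call!r}")
    if call_a != call_b:
        # No consensus status yet; a senior author must decide.
        return None, True
    needs_review = call_a == "likely TF" and not has_canonical_dbd
    return call_a, needs_review
```

Under these assumptions, `reconcile("likely TF", "unlikely TF", True)` flags the protein for senior review without assigning a status, whereas two matching "TF with a known motif" calls pass through unchanged.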
The “HumanTFs” web site (http://humantfs.ccbr.utoronto.ca/) displays the results, with a separate page for each TF, along with all known motifs, and information and sequence alignments for each DBD type. The site also has an option for users to submit additional information.
The final tally encompasses 1,639 known or likely human TFs. Most contain at least one of only two DBD types (C2H2-ZFs (747) and Homeodomains (196)). Nearly half of the remainder (46%) are accounted for by an additional six (bHLH (108), bZIP (54), Forkhead (49), Nuclear Hormone Receptor (46), HMG/Sox (30), and ETS (27)) (Figure 1B). There are far fewer Myb/SANT and HMG domain TFs than previously estimated (Vaquerizas et al., 2009) (14 vs. 38 and 40 vs. 55, respectively) after accounting for known subclasses that lack DNA sequence specificity. The vast majority (93%) of the 1,639 TFs are known or expected to bind DNA as either a monomer or homomultimer. Many contain multiple copies of the same DBD type (Figure 1C), but most of these are C2H2-ZFs, which bind DNA as an array (Figure 1A). The number of C2H2-ZFs per protein varies substantially, depending partly on the effector domain (Figure 1B). The large numbers of C2H2-ZFs in the KRAB-containing subtype may be due to the specificity required to target individual transposable elements (see below). Only a small fraction of TFs (47, or ~3%) contain more than one type of DBD, with POU:Homeodomain being the most prevalent (Figure 1C). Most human TFs also contain additional protein domains (Figure 1D): in total, 426 different types of non-DNA-binding domains are represented, consistent with the notion of a diverse and extensive network of TF effector functions.
This survey includes 348 TFs not included in the Vaquerizas list. Notable additions include 134 C2H2-ZFs, 22 bHLHs, 14 AT-hooks, 13 Homeodomains, and the 12 recently described THAP finger proteins (Campagne et al., 2010) (Figure S1B). The individual proteins in previous lists are, however, almost completely reconfirmed. Of the 1,391 proteins identified by Vaquerizas et al., 1,292 (93%) were also in our compilation, with 50 removed due to changes in gene annotations (pseudogenes and duplicates) and 49 removed using the guidelines above. Likewise, 98% of the TFs identified by manual curation by Fulton et al. (523/535) are considered to be TFs in our study.
It is likely that our current TF list is still incomplete, and entire DBD families may remain undiscovered. Indeed, 69 of the TFs in our list are categorized as “Unknown family”, due to the lack of a canonical DBD. Most of these proteins lack motifs (see below), crystal structures are largely unavailable, and the evidence for DNA binding typically includes only a handful of sequences identified in a single manuscript. Thus, TFs in this category should be treated with caution until further experimental data are available.
In addition, some known DBD families might be larger than is currently appreciated. For example, the simple AT-hook domain (represented by a 13 AA consensus) is predicted to be present in 3 and 21 human genes according to the Interpro and SMART databases, respectively. A more lenient definition, however, requiring only the presence of a GRP tripeptide flanked by multiple basic residues over a 22 base window (Aravind and Landsman, 1998) is present in hundreds of human proteins, each of which could represent a bona fide TF. The set of C2H2-ZFs will also warrant revisiting as better models emerge for recognizing these short (~23 AAs) domains and distinguishing those involved in DNA binding from those facilitating interactions with RNA or other proteins (Brayer and Segal, 2008), although most do appear to bind DNA in large surveys (Imbeault et al., 2017; Schmitges et al., 2016).
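To make the lenient AT-hook definition concrete, the following sketch scans a protein sequence for GRP tripeptides embedded in a basic-rich window. It is an illustration under stated assumptions, not the procedure of Aravind and Landsman (1998): the function name `find_lenient_at_hooks` is hypothetical, the window is interpreted as 22 amino acid positions roughly centered on the tripeptide, and the threshold of four basic residues (`min_basic`) is an arbitrary stand-in for “multiple”.

```python
import re

BASIC = set("KR")  # lysine and arginine


def find_lenient_at_hooks(seq, window=22, min_basic=4):
    """Return 0-based start positions of GRP tripeptides whose surrounding
    window holds at least `min_basic` basic residues (K or R), not
    counting the arginine of the GRP core itself.

    Assumptions: the window is counted in amino acid positions and is
    roughly centered on the core; `min_basic` is an illustrative cutoff.
    """
    seq = seq.upper()
    left_flank = (window - 3) // 2
    right_flank = window - 3 - left_flank
    hits = []
    # The lookahead reports every GRP occurrence, including overlapping ones.
    for match in re.finditer(r"(?=GRP)", seq):
        start = match.start()
        left = seq[max(0, start - left_flank):start]
        right = seq[start + 3:start + 3 + right_flank]
        n_basic = sum(residue in BASIC for residue in left + right)
        if n_basic >= min_basic:
            hits.append(start)
    return hits
```

For a toy AT-hook-like peptide such as `"TPKRPRGRPKGSKNK"`, the function reports the GRP at position 6, while a GRP flanked only by alanines is ignored. Relaxing or tightening `min_basic` changes how many candidates are returned, which is the kind of definition-dependent variation in family size that the text describes.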
Sequence specificities of the human TFs
Roughly three-quarters (1,211) of the human TFs currently have a binding motif (1,107 “known”, i.e. measured experimentally, and a further 104 inferred from a closely-related homolog) (Weirauch et al., 2014). 913 of the known motifs were obtained from high-throughput in vitro assays such as HT-SELEX or PBM, and hence provide a profile of their intrinsic relative preferences to many DNA sequences. Figure 1B illustrates that most classes of TFs have high or complete motif coverage, while a handful have major gaps. Almost all Homeodomains (188/196), for example, have a known or inferred motif, likely due to their relative ease of study in vitro and their deep conservation enabling inference by homology. The C2H2-ZF class, in contrast, currently lacks hundreds of motifs (267/747) (Figure 1B, inset), possibly because they are difficult to study in vitro (many are large proteins) and relatively few are well conserved (Stubbs et al., 2011). By proportion, the AT-hook proteins, THAP finger, BED-ZF, and those with no known DBD are also poorly characterized.
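The coverage figures in this paragraph can be checked with a few lines of arithmetic. The counts below are those quoted in the text; the variable names and the `percent` helper are illustrative only.

```python
# Counts quoted in the text.
TOTAL_TFS = 1639
KNOWN_MOTIF = 1107     # measured experimentally
INFERRED_MOTIF = 104   # inferred from a closely related homolog
IN_VITRO_KNOWN = 913   # known motifs from HT-SELEX, PBM, or similar assays


def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


with_motif = KNOWN_MOTIF + INFERRED_MOTIF
summary = {
    "TFs with any motif": percent(with_motif, TOTAL_TFS),
    "known motifs from in vitro assays": percent(IN_VITRO_KNOWN, KNOWN_MOTIF),
    "Homeodomains with a motif": percent(188, 196),
    "C2H2-ZFs lacking a motif": percent(267, 747),
}

for label, value in summary.items():
    print(f"{label}: {value:.1f}%")
```

The known and inferred counts sum to the 1,211 quoted above, which is about 74% of the 1,639 TFs and consistent with “roughly three-quarters”.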
Among the 1,107 proteins with a known motif, less than 2% (19) lack a canonical DBD, with only six of 69 such proteins having an in vitro derived motif - the other 13 are based on experiments such as ChIP-seq, and thus may describe binding through a cofactor. Nevertheless, the additional 50 non-canonical TFs were included in our list due to some evidence for direct sequence-specific DNA binding. An example of a bona fide non-canonical TF is NRF1, which was initially characterized in 1993 (Virbasius et al., 1993), with further high-throughput characterization occurring 20 years later (Jolma et al., 2013). Some of the likely TFs that do not contain a canonical DBD are obligate heterodimers that contribute to protein–DNA contacts in crystal structures of sequence specific protein complexes, but are unlikely to bind DNA on their own (e.g. NFYB and NFYC, which form a trimeric complex with NFYA (Nardini et al., 2013)).
Many TFs recognize similar motifs, typically corresponding to TF families or subfamilies, consistent with intuition and with many previous studies (e.g., (Badis et al., 2009; Wei et al., 2010)) (Figure 2A). Notably, C2H2-ZF proteins contribute most of the diversity to the motif collection (Figure 2B), as expected from previous studies and from the diversity in their DNA contacting residues (Emerson and Thomas, 2009; Imbeault et al., 2017; Najafabadi et al., 2015; Schmitges et al., 2016; Stubbs et al., 2011). Figure 2C shows motifs for the NHR family, illustrating that TF diversity can involve changes in both monomeric DNA sequence preference and protein complex formation: many motifs in Figure 2C are recognized by dimers. In total, over 500 motif specificity groups are present in human (Table S2), indicative of the wide range of DNA sequences capable of functioning as human TF binding sites.
Figure 2. DNA binding specificities of the human transcription factors.

A. Heatmap showing similarity of human TF DNA binding motifs. Representative motif(s) were selected for each TF from the set of motifs directly determined by a high throughput in vitro assay. Pairwise motif similarities were calculated using MoSBAT energy scores (Lambert et al., 2016) and arranged by hierarchical clustering using Pearson dissimilarity and average linkage. B. Motif diversity within each family, as measured by the number of clusters supported by the optimal silhouette value (Lovmar et al., 2005). C. Detailed view of representative motifs for Nuclear Hormone Receptors, displayed on a phylogram according to DBD sequence similarity using motifStack (Ou et al., 2018).
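The clustering described in panel A can be illustrated with a small pure-Python sketch of average-linkage clustering on Pearson dissimilarity. It is not the code used to produce the figure: it assumes each TF is represented by its row of pairwise similarity scores, the function names are hypothetical, and the naive search over all cluster pairs is only practical for toy inputs. Profiles with zero variance are not handled.

```python
from itertools import combinations
from statistics import mean


def pearson_dissimilarity(x, y):
    """1 - Pearson correlation between two equal-length score profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / (var_x * var_y) ** 0.5


def average_linkage(profiles):
    """Naive agglomerative clustering; returns the list of merges.

    `profiles` maps a label to its score profile. Each merge is a tuple
    (members_a, members_b, average_dissimilarity_between_them).
    """
    labels = list(profiles)
    dist = {
        frozenset((a, b)): pearson_dissimilarity(profiles[a], profiles[b])
        for a, b in combinations(labels, 2)
    }

    def linkage(pair):
        # Average linkage: mean dissimilarity over all cross-cluster pairs.
        ca, cb = pair
        return mean(dist[frozenset((a, b))] for a in ca for b in cb)

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        ca, cb = min(combinations(clusters, 2), key=linkage)
        merges.append((ca, cb, linkage((ca, cb))))
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca + cb]
    return merges
```

With four toy profiles in which two pairs are each strongly correlated, the two tight pairs merge first and are joined to each other last, which is the block structure that such a heatmap makes visible.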
Conservation and evolution of human TFs
Evolution of TFs is typically much slower than evolution of their regulatory sites. TF orthologs between human and Drosophila often display virtually identical sequence specificity (Nitta et al., 2015). Physiological roles of TFs are also often conserved - the HOX proteins, which specify the anterior-posterior body plan, are perhaps the best-known example (Burglin, 2011) - but there are numerous others, e.g. the regulation of cilia genes by RFX TFs (Choksi et al., 2014). Nonetheless, TFs do evolve, changing their motifs, binding partners, and expression patterns (Arendt et al., 2016; Grove et al., 2009; Lynch and Wagner, 2008; Schmitges et al., 2016). A striking example of duplication and divergence among human TFs is the hundreds of KRAB-containing C2H2-ZF proteins encoded by most mammalian genomes, many of which display hallmarks of diversifying selection (Emerson and Thomas, 2009) with complex orthology patterns even between human and mouse (Huntley et al., 2006). In human, KRAB C2H2-ZF proteins generally bind transposable elements (TEs) (mainly LINEs and endogenous retroviruses), presumably silencing them – at least initially - via the repressive function of the KRAB domain (Imbeault et al., 2017; Jacobs et al., 2014; Rowe et al., 2010; Schmitges et al., 2016). An “arms race” between the TEs and TFs provides a ready explanation for their rapid diversification. A “domestication” model is also supported, however, in which the KRAB-TE interaction is evolutionarily maintained to co-opt the TE for host gene regulation long after TEs degrade beyond pathogenic potential (reviewed in (Ecco et al., 2017; Imbeault et al., 2017)).
Based on their distribution across eukaryotic genomes (Figure 3A), the 1,639 TFs in our updated catalog fall into major groups with close relatives extending to metazoans, vertebrates, tetrapods, placental mammals, or primates. Strikingly, nearly all Homeodomain proteins have recognizable counterparts across vertebrates, while virtually all of the mammal-specific proteins contain C2H2-ZFs. Indeed, the divergence times between Ensembl-defined human TF-TF paralogs display a bimodal division: a first wave of duplications across diverse TF families occurred at the base of Bilateria, and a second wave of duplications, dominated by KRAB C2H2-ZFs, began in Amniota (Figure 3B, left). The earlier wave, with duplications across diverse TF families, is consistent with the postulation that two rounds of whole genome duplication occurred at or near the base of Vertebrates (Dehal and Boore, 2005). This event is roughly coincident with the expansion of cell type diversity, possibly facilitated by duplicated TFs available to regulate novel cell types (Arendt et al., 2016). The expansive KRAB radiation may be partly explained by the increased opportunity for retroviral transmission facilitated by the placenta (Hayward et al., 2015). Remarkably, TF-TF duplications during the KRAB radiation era dominate the distribution of all human paralog pairs arising over the last 300M years (Figure 3B, right).
Figure 3. Orthologs and paralogs of the human transcription factors.

A. Presence and absence of human TF orthologs across eukaryotic species. Amino acid percent identity is plotted for the most similar non-human TF gene in 32 eukaryotic species (from Ensembl Compara database (Herrero et al., 2016)). TFs are ordered first by conservation level (approximated gene age), based on similarity to expected conservation patterns for each of the clades plotted. B. Left, number of human TF-TF paralog pairs that diverged in each clade shown; Right, proportion of all human paralog pairs from each clade that are a TF-TF pair.
Expression of human TFs across tissues and cell types
Tissue and cell type-specific expression of genes, including TFs, is often indicative of corresponding specific functions. We examined expression patterns for 1,554 TFs detected in 37 adult tissues using RNA-seq data from the Human Tissue Atlas (Figure 4A), adopting its quantitative definitions for tissue specific expression (tissue enriched, group enriched, or tissue enhanced) (Uhlen et al., 2015). This global view of gene expression patterns captures known roles for many well-characterized TFs. For example, SOX2, OLIG1, and POU3F2 (OCT7) are expressed almost exclusively in the cerebral cortex, and GATA4 and TBX20 are highly expressed only in cardiac muscle. Roughly one-third (543) of the human TFs in this dataset displayed tissue specific expression, including many with poorly characterized physiological roles.
Figure 4. Functional properties of the human transcription factors.

A. RNA-seq gene expression profiles for 1,554 human TFs across 37 human tissues (from the Human Tissue Atlas version 17 (Uhlen et al., 2015)), normalized by row and column. Tissues and TFs are arranged using hierarchical clustering by Pearson Correlation. Mean Expression Level indicates the mean pre-normalization mRNA expression level of each TF (in TPM) across all tissues in which the TF was expressed (TPM ≥ 1). B. TF gene set over-representation for human disease phenotypes (Kohler et al., 2014). Y-axis indicates the significance of the size of the intersection between the set of human TFs and the indicated gene set. Values indicate the number of TFs in the gene set. C. Diseases with GWAS signal (P<5×10−8) located proximal to TF-encoding genes. Loci containing multiple variants were restricted to the single most strongly associated variant, and subsequently expanded to incorporate variants in strong linkage disequilibrium (LD) (r2>0.8) with this variant using Plink (Purcell et al., 2007). The full set of genetic variants and sources for each disease are provided in Tables S3 and S4. Each resulting variant was assigned to its nearest gene, creating a gene set for each disease. For each gene set, the significance of its overlap with the list of human TFs was estimated using the hypergeometric distribution. P-values were corrected using Bonferroni’s method. Values indicate the number of TF-encoding loci associated with the given disease.
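The overlap test described for panel C (hypergeometric distribution followed by Bonferroni correction) can be sketched as follows. This is a minimal illustration rather than the analysis code behind the figure: the function names are hypothetical, and the background gene count passed as `n_genes` is an assumption that the caller must supply, since the legend does not state it.

```python
from math import comb


def hypergeom_overlap_pvalue(n_genes, n_tfs, set_size, overlap):
    """P(X >= overlap) when `set_size` genes are drawn without replacement
    from a background of `n_genes` genes, `n_tfs` of which encode TFs."""
    upper = min(n_tfs, set_size)
    tail = sum(
        comb(n_tfs, k) * comb(n_genes - n_tfs, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(n_genes, set_size)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(pvalues)
    return [min(1.0, p * n_tests) for p in pvalues]
```

In use, one p-value would be computed per disease gene set and the whole list passed through `bonferroni`. Exact integer binomials keep the tail probability accurate even when it is very small, at the cost of speed for very large gene sets.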
Comparing between TF classes, a striking trend emerges, mirroring the evolutionary sequence analysis above. C2H2-ZFs are markedly depleted for tissue specificity – only 19% vs. 49% for other types of TFs (P<10−13, Bonferroni-corrected Fisher’s Exact Test) (also visible at right in Figure 4A). Only 12% (41/339) of KRAB-containing C2H2-ZFs are tissue specific, possibly due to their role in the repression of transposable elements, which may be beneficial broadly across cell types. The majority of these are testes-specific (26/41), consistent with a role for KRAB C2H2-ZFs in retroelement silencing during gametogenesis (Ecco et al., 2017). Homeodomain TFs, in contrast, are highly enriched for tissue specific expression (133/162, 82%, P<10−13), and are also the only group overrepresented in the list of TFs that are not detected in the Human Tissue Atlas dataset (34/84; P<10−7), presumably reflecting well-established roles in early embryonic cell fate specification and/or roles in the maintenance of differentiation in adult tissues (Burglin, 2011; Dunwell and Holland, 2016). Across all other TF families, about half (49%) are tissue specific, providing a clue as to their specific physiological functions. Higher-resolution data – e.g. from single-cell RNA sequencing, which can resolve the different cell types that comprise tissues – will almost certainly lead to a more refined view of the associations between TFs, cell identity, and the genes regulated by the TFs.
Human TFs in genetics and disease
TFs represent ~8% of all human genes, and are associated with a wide array of diseases and phenotypes. TF mutations are often highly deleterious, presumably explaining why genomic loci encoding TFs are enriched for ultraconserved elements (Bejerano et al., 2004), and depleted of common variation within their DBDs (Barrera et al., 2016). The genetic analysis of TFs can be complicated by functional redundancies inherent to gene regulatory networks, because phenotypes might be difficult to detect or manifest only under specific conditions, or because variants with highly deleterious effects will be absent at the population level. Nonetheless, a global perspective on human TFs in clinical phenotypes does reveal common themes. Figure 4B illustrates human disease phenotypes that involve a significant number of mutations within or near genes encoding TFs, as compiled by The Human Phenotype Ontology (Kohler et al., 2014). The strongest enrichment is observed for Anterior pituitary hypoplasia, which occurs in association with congenital growth hormone deficiency – of the 15 genes known to be involved in this phenotype, 12 are TFs (P<10−11), including multiple Homeodomain and Sox family TFs. Overall, 313 (19.1%) of the human TFs are currently associated with at least one phenotype, a significantly higher fraction than that observed for all genes (16.2%) (P=0.002, proportions test). In contrast, TFs are depleted from the core set of essential genes in human cancer cell lines, based on data from recent CRISPR screens (3% vs. 10% (Hart et al., 2015)), perhaps because the human TF repertoire is utilized mainly for developmental or tissue-specific functions. Phenotypes have been associated with genetic perturbations of 304 (18.6%) of the 1,198 one-to-one human/mouse TF orthologs in mouse (Blake et al., 2017), often yielding phenotypes which are consistent with the TF’s known function in human. For example, perturbation of six of the ten Rel family TFs results in “decreased B cell proliferation.”
Genome Wide Association Study (GWAS) signals for some polygenic diseases are also enriched for loci encoding TFs (Figure 4C). Many of these diseases have a strong immune-dependent component, suggesting a prominent role for the many immune-responsive TFs (reviewed in (Smale, 2014)). In addition, many individual TF loci harbour strong GWAS signals for multiple diseases. For example, variants within the loci encoding the Ikaros-family C2H2-ZFs IKZF1 and IKZF3, which play critical roles in the adaptive immune response (John and Ward, 2011), reach genome-wide significance in 10 different GWAS studies; most of these studies involve autoimmune diseases with strong B and T cell-specific genetic signals (Hu et al., 2011).
The modular structure of TFs facilitates identification of the mechanistic impact of mutations. DBD mutations can alter sequence specificity; such mutations in HOXD13 have been associated with limb malformations (Barrera et al., 2016). Profound effects on gene expression can also result from mutations located outside of the DBD. For example, multiple variants within the TP53 protein affect its activity by altering protein interactions (reviewed in (Muller and Vousden, 2013)). In cancer, chromosomal abnormalities can create onco-fusion proteins with novel functions, such as the Ets factors ERG and FLI1 fusing with the RNA binding protein EWSR1 (Sizemore et al., 2017). Similarly, as for any gene, a mutation can fall within a regulatory region controlling the expression of a TF, ultimately resulting in altered TF function. For instance, weakening of a TCF7L2 (TCF-4) binding site within an enhancer that drives expression of MYC can decrease risk for tumorigenesis in the colon (reviewed in (Sur and Taipale, 2016)).
TFs are unique as a gene class in that they represent the proteins whose binding sites are impacted by variation or mutation in regulatory DNA. Numerous such examples have been established, covering a wide range of TF families and diseases (reviewed in (Deplancke et al., 2016)). For example, an intronic obesity-associated polymorphism in the FTO locus alters enhancer function by modulating the binding of ARID5B, leading to an increase in IRX3 and IRX5 expression, changing adipocyte cell fate and overall mitochondrial thermogenesis in adipose tissue (Claussnitzer et al., 2015). Deeper knowledge of how TFs find their targets and control gene expression patterns will be vastly beneficial for our understanding of the estimated 85–93% of common disease-associated genetic variation that is likely to impact gene regulation (Hindorff et al., 2009; Maurano et al., 2012).
Perspective: learning to read the genome
In 2003, Eric Lander presented a seven-word nano-lecture summary: “Genome: bought the book, hard to read,” emphasizing the difficulty of mechanistic interpretation of DNA sequence. Fifteen years later, the task of interpreting the function of noncoding sequence is still challenging – the “futility theorem” still holds. As an illustration, it is now known that many TFs bind preferentially within open chromatin, but the open chromatin itself is presumably controlled by TFs, and there is currently no algorithm that predicts open chromatin directly from sequence with both high sensitivity and precision: a leading model achieves 20%–35% sensitivity at a 20% False Discovery Rate (Kelley et al., 2016), and is most effective at identifying promoters.
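For readers less familiar with the two metrics quoted for open chromatin prediction, the sketch below shows how sensitivity and False Discovery Rate are computed from prediction counts. The function name and the example counts are hypothetical and are not taken from Kelley et al. (2016).

```python
def sensitivity_and_fdr(true_positives, false_positives, false_negatives):
    """Return (sensitivity, false discovery rate) from prediction counts.

    sensitivity = TP / (TP + FN): fraction of real sites that are recovered.
    FDR         = FP / (TP + FP): fraction of predictions that are wrong.
    """
    sensitivity = true_positives / (true_positives + false_negatives)
    fdr = false_positives / (true_positives + false_positives)
    return sensitivity, fdr


# Hypothetical example: 1,000 truly open sites; the model predicts 375 sites,
# of which 300 are truly open and 75 are not.
sens, fdr = sensitivity_and_fdr(
    true_positives=300, false_positives=75, false_negatives=700
)
```

In this made-up case the model operates at 30% sensitivity and a 20% False Discovery Rate, meaning that 70% of truly open sites are missed even though one in five predictions is already wrong.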
This ongoing challenge can no longer be explained by a general lack of motifs for known TFs (Table S1). A clear hurdle to be addressed now is how to learn relevant combinations of binding sites and other sequence features. On a global scale, TF-TF cooperativity and TF-nucleosome interactions are largely unmapped, although both are likely to be prevalent. Because the number of factors involved is high, the number of functionally interacting combinations may be astronomical - the limited size of the human genome will likely pose challenges for the systematic detection of such higher-order interactions, due to a lack of statistical power.
Most of the functional DNA in the genome is likely regulatory (Kellis et al., 2014), with TFs playing a central role in its recognition and utilization. There is a clear role for TFs in many human diseases, highlighting the importance of continued efforts for understanding TF-mediated gene regulatory mechanisms. Other current challenges include addressing synergy and redundancy among multiple elements regulating the same gene, predicting enhancer-promoter contacts, the relevance of large-scale arrangement of regulatory features along chromosomes and in three dimensions, and various types of epigenetic memory. Computational methods examining these themes are a topic of ongoing research, and experimental techniques probing the role of TFs in nucleating and mediating these phenomena likewise continue to be developed. These advances will be instrumental in conquering what is likely to be the next frontier in human genetics - decoding the genome the way TFs do.
Supplementary Material
Figure S1. Identification of the human transcription factors. A. Overview of the strategy for identifying the human TF repertoire. B. Comparison between TFs reported in this study and a previous study (Vaquerizas et al., 2009). The number of TFs identified in the previous and the current study are depicted as bars. Disagreements between the two studies are shown in higher detail in the insets. “Database issues” consists of problems relating to ID mapping, re-categorization of a gene as a pseudogene, etc.
Table S1. Related to Figure 1B. Human TF annotations and curator comments.
Table S2. Related to Figure 2B. Motif diversity within each family – collection of TF motif clusters.
Table S3. Related to Figure 4C. GWAS variants included in TF enrichment analyses.
Table S4. Related to Figure 4C and Table S3. Sources of GWAS variants included in TF enrichment analyses.
Acknowledgements
We apologize to colleagues whose important primary studies could not be cited due to space constraints. This work was supported by NIH R01 NS099068–01A1, Lupus Research Alliance “Novel Approaches”, Cincinnati Children’s Hospital “Center for Pediatric Genomics pilot study”, “Trustee Award”, and “Endowed Scholar award” (M.T.W.), and a CIHR Foundation Award (T.R.H.). A.J. was supported by Swedish Research Council “Vetenskapsrådet” postdoctoral grant (2016–00158). S.A.L. and L.F.C. both have NSERC doctoral scholarships. T.R.H. is a Senior Fellow of CIFAR and the Billes Chair of Medical Research at the University of Toronto.
References
- Akerblom IE, Slater EP, Beato M, Baxter JD, and Mellon PL (1988). Negative regulation by glucocorticoids through interference with a cAMP responsive enhancer. Science 241, 350–353. [DOI] [PubMed] [Google Scholar]
- Amati B, and Land H (1994). Myc-Max-Mad: a transcription factor network controlling cell cycle progression, differentiation and death. Current opinion in genetics & development 4, 102–108. [DOI] [PubMed] [Google Scholar]
- Aravind L, and Landsman D (1998). AT-hook motifs identified in a wide variety of DNA-binding proteins. Nucleic Acids Res 26, 4413–4421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arendt D, Musser JM, Baker CVH, Bergman A, Cepko C, Erwin DH, Pavlicev M, Schlosser G, Widder S, Laubichler MD, et al. (2016). The origin and evolution of cell types. Nat Rev Genet 17, 744–757. [DOI] [PubMed] [Google Scholar]
- Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, et al. (2009). Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barrera LA, Vedenko A, Kurland JV, Rogers JM, Gisselbrecht SS, Rossin EJ, Woodard J, Mariani L, Kock KH, Inukai S, et al. (2016). Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science 351, 1450–1454. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, and Haussler D (2004). Ultraconserved elements in the human genome. Science 304, 1321–1325. [DOI] [PubMed] [Google Scholar]
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE (2000). The Protein Data Bank. Nucleic Acids Res 28, 235–242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blake JA, Eppig JT, Kadin JA, Richardson JE, Smith CL, Bult CJ, and the Mouse Genome Database, G. (2017). Mouse Genome Database (MGD)-2017: community knowledge resource for the laboratory mouse. Nucleic Acids Res 45, D723–D729. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brayer KJ, and Segal DJ (2008). Keep your fingers off my DNA: protein-protein interactions mediated by C2H2 zinc finger domains. Cell Biochem Biophys 50, 111–131. [DOI] [PubMed] [Google Scholar]
- Bulyk ML, Johnson PL, and Church GM (2002). Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res 30, 1255–1261. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Burglin TR (2011). Homeodomain subtypes and functional diversity. Subcell Biochem 52, 95–122. [DOI] [PubMed] [Google Scholar]
- Campagne S, Saurel O, Gervais V, and Milon A (2010). Structural determinants of specific DNA-recognition by the THAP zinc finger. Nucleic Acids Res 38, 3466–3476. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carroll SB (2008). Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell 134, 25–36. [DOI] [PubMed] [Google Scholar]
- Chen J, Zhang Z, Li L, Chen BC, Revyakin A, Hajj B, Legant W, Dahan M, Lionnet T, Betzig E, et al. (2014). Single-molecule dynamics of enhanceosome assembly in embryonic stem cells. Cell 156, 1274–1285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Choksi SP, Lauter G, Swoboda P, and Roy S (2014). Switching on cilia: transcriptional networks regulating ciliogenesis. Development 141, 1427–1441. [DOI] [PubMed] [Google Scholar]
- Claussnitzer M, Dankel SN, Kim KH, Quon G, Meuleman W, Haugen C, Glunk V, Sousa IS, Beaudry JL, Puviindran V, et al. (2015). FTO Obesity Variant Circuitry and Adipocyte Browning in Humans. The New England journal of medicine 373, 895–907. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cusanovich DA, Pavlovic B, Pritchard JK, and Gilad Y (2014). The functional consequences of variation in transcription factor binding. PLoS Genet 10, e1004226. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Damante G, Fabbro D, Pellizzari L, Civitareale D, Guazzi S, Polycarpou-Schwartz M, Cauci S, Quadrifoglio F, Formisano S, and Di Lauro R (1994). Sequence-specific DNA recognition by the thyroid transcription factor-1 homeodomain. Nucleic Acids Res 22, 3075–3083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dehal P, and Boore JL (2005). Two rounds of whole genome duplication in the ancestral vertebrate. PLoS biology 3, e314. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Deplancke B, Alpern D, and Gardeux V (2016). The Genetics of Transcription Factor DNA Binding Variation. Cell 166, 538–554. [DOI] [PubMed] [Google Scholar]
- Dunwell TL, and Holland PW (2016). Diversity of human and mouse homeobox gene expression in development and adult tissues. BMC developmental biology 16, 40. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ecco G, Imbeault M, and Trono D (2017). KRAB zinc finger proteins. Development 144, 2719–2729. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Emerson RO, and Thomas JH (2009). Adaptive evolution in zinc finger transcription factors. PLoS Genet 5, e1000325. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ernst J, Melnikov A, Zhang X, Wang L, Rogov P, Mikkelsen TS, and Kellis M (2016). Genome-scale high-resolution mapping of activating and repressive nucleotides in regulatory regions. Nat Biotechnol 34, 1180–1190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, Chang HY, Dosztanyi Z, El-Gebali S, Fraser M, et al. (2017). InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res 45, D190–D199. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, et al. (2016). The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44, D279–285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fong AP, and Tapscott SJ (2013). Skeletal muscle programming and re-programming. Curr Opin Genet Dev 23, 568–573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Font J, and Mackay JP (2010). Beyond DNA: zinc finger domains as RNA-binding modules. Methods in molecular biology 649, 479–491. [DOI] [PubMed] [Google Scholar]
- Frietze S, and Farnham PJ (2011). Transcription factor effector domains. Sub-cellular biochemistry 52, 261–277. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu Y, Sinha M, Peterson CL, and Weng Z (2008). The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet 4, e1000138.
- Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman WW, Roach JC, and Sladek R (2009). TFCat: the curated catalog of mouse and human transcription factors. Genome Biol 10, R29.
- Garza AS, Ahmad N, and Kumar R (2009). Role of intrinsically disordered protein regions/domains in transcriptional regulation. Life Sci 84, 189–193.
- Gebhardt JC, Suter DM, Roy R, Zhao ZW, Chapman AR, Basu S, Maniatis T, and Xie XS (2013). Single-molecule imaging of transcription factor binding to DNA in live mammalian cells. Nature methods 10, 421–426.
- Geertz M, Shore D, and Maerkl SJ (2012). Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proc Natl Acad Sci U S A 109, 16540–16545.
- Gertz J, Reddy TE, Varley KE, Garabedian MJ, and Myers RM (2012). Genistein and bisphenol A exposure cause estrogen receptor 1 to bind thousands of sites in a cell type-specific manner. Genome Res 22, 2153–2162.
- Gotea V, Visel A, Westlund JM, Nobrega MA, Pennacchio LA, and Ovcharenko I (2010). Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res 20, 565–577.
- Grove CA, De Masi F, Barrasa MI, Newburger DE, Alkema MJ, Bulyk ML, and Walhout AJ (2009). A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell 138, 314–327.
- Hart T, Chandrashekhar M, Aregger M, Steinhart Z, Brown KR, MacLeod G, Mis M, Zimmermann M, Fradet-Turcotte A, Sun S, et al. (2015). High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell 163, 1515–1526.
- Hayward A, Cornwallis CK, and Jern P (2015). Pan-vertebrate comparative genomics unmasks retrovirus macroevolution. Proc Natl Acad Sci U S A 112, 464–469.
- Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M, Vilella AJ, Searle SM, Amode R, Brent S, et al. (2016). Ensembl comparative genomics resources. Database: the journal of biological databases and curation 2016.
- Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, and Manolio TA (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362–9367.
- Hu S, Xie Z, Onishi A, Yu X, Jiang L, Lin J, Rho HS, Woodard C, Wang H, Jeong JS, et al. (2009). Profiling the human protein-DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling. Cell 139, 610–622.
- Hu X, Kim H, Stahl E, Plenge R, Daly M, and Raychaudhuri S (2011). Integrating autoimmune risk loci with gene-expression data identifies specific pathogenic immune cell subsets. American journal of human genetics 89, 496–506.
- Hume MA, Barrera LA, Gisselbrecht SS, and Bulyk ML (2015). UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res 43, D117–122.
- Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J, Gordon L, Branscomb E, and Stubbs L (2006). A comprehensive catalog of human KRAB-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome Res 16, 669–677.
- Imbeault M, Helleboid PY, and Trono D (2017). KRAB zinc-finger proteins contribute to the evolution of gene regulatory networks. Nature 543, 550–554.
- Iwafuchi-Doi M, Donahue G, Kakumanu A, Watts JA, Mahony S, Pugh BF, Lee D, Kaestner KH, and Zaret KS (2016). The Pioneer Transcription Factor FoxA Maintains an Accessible Nucleosome Configuration at Enhancers for Tissue-Specific Gene Activation. Mol Cell 62, 79–91.
- Jacobs FM, Greenberg D, Nguyen N, Haeussler M, Ewing AD, Katzman S, Paten B, Salama SR, and Haussler D (2014). An evolutionary arms race between KRAB zinc-finger genes ZNF91/93 and SVA/L1 retrotransposons. Nature 516, 242–245.
- John LB, and Ward AC (2011). The Ikaros gene family: transcriptional regulators of hematopoiesis and immunity. Molecular immunology 48, 1272–1278.
- Johnson DS, Mortazavi A, Myers RM, and Wold B (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502.
- Johnson PF, and McKnight SL (1989). Eukaryotic transcriptional regulatory proteins. Annu Rev Biochem 58, 799–839.
- Jolma A, and Taipale J (2011). Methods for Analysis of Transcription Factor DNA-Binding Specificity In Vitro. Subcell Biochem 52, 155–173.
- Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, et al. (2013). DNA-binding specificities of human transcription factors. Cell 152, 327–339.
- Jolma A, Yin Y, Nitta KR, Dave K, Popov A, Taipale M, Enge M, Kivioja T, Morgunova E, and Taipale J (2015). DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384–388.
- Kagey MH, Newman JJ, Bilodeau S, Zhan Y, Orlando DA, van Berkum NL, Ebmeier CC, Goossens J, Rahl PB, Levine SS, et al. (2010). Mediator and cohesin connect gene expression and chromatin architecture. Nature 467, 430–435.
- Kelley DR, Snoek J, and Rinn JL (2016). Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 26, 990–999.
- Kellis M, Wold B, Snyder MP, Bernstein BE, Kundaje A, Marinov GK, Ward LD, Birney E, Crawford GE, Dekker J, et al. (2014). Defining functional DNA elements in the human genome. Proc Natl Acad Sci U S A 111, 6131–6138.
- Kim S, Brostromer E, Xing D, Jin J, Chong S, Ge H, Wang S, Gu C, Yang L, Gao YQ, et al. (2013). Probing allostery through DNA. Science 339, 816–819.
- Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, Zhang MQ, Lobanenkov VV, and Ren B (2007). Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 128, 1231–1245.
- Kohler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GC, Brown DL, Brudno M, Campbell J, et al. (2014). The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 42, D966–974.
- Lambert SA, Albu M, Hughes TR, and Najafabadi HS (2016). Motif comparison based on similarity of binding affinity profiles. Bioinformatics 32, 3504–3506.
- Lee TI, and Young RA (2013). Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251.
- Letunic I, Doerks T, and Bork P (2015). SMART: recent updates, new developments and status in 2015. Nucleic Acids Res 43, D257–260.
- Li H, Liefke R, Jiang J, Kurland JV, Tian W, Deng P, Zhang W, He Q, Patel DJ, Bulyk ML, et al. (2017). Polycomb-like proteins link the PRC2 complex to CpG islands. Nature.
- Lovmar L, Ahlford A, Jonsson M, and Syvanen AC (2005). Silhouette scores for assessment of SNP genotype clusters. BMC genomics 6, 35.
- Lynch VJ, and Wagner GP (2008). Resurrecting the role of transcription factor change in developmental evolution. Evolution; international journal of organic evolution 62, 2131–2154.
- Malik S, and Roeder RG (2010). The metazoan Mediator co-activator complex as an integrative hub for transcriptional regulation. Nat Rev Genet 11, 761–772.
- Marcon E, Ni Z, Pu S, Turinsky AL, Trimble SS, Olsen JB, Silverman-Gavrila R, Silverman-Gavrila L, Phanse S, Guo H, et al. (2014). Human-chromatin-related protein interactions identify a demethylase complex required for chromosome segregation. Cell Rep 8, 297–310.
- Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, Shyr C, Tan G, Worsley-Hunt R, et al. (2016). JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 44, D110–115.
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. (2006). TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34, D108–110.
- Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, et al. (2012). Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195.
- Meijsing SH, Pufall MA, So AY, Bates DL, Chen L, and Yamamoto KR (2009). DNA binding site sequence directs glucocorticoid receptor structure and activity. Science 324, 407–410.
- Mesika A, Ben-Dor S, Laviad EL, and Futerman AH (2007). A new functional motif in Hox domain-containing ceramide synthases: identification of a novel region flanking the Hox and TLC domains essential for activity. J Biol Chem 282, 27366–27373.
- Morgunova E, and Taipale J (2017). Structural perspective of cooperative transcription factor binding. Curr Opin Struct Biol 47, 1–8.
- Muller PA, and Vousden KH (2013). p53 mutations in cancer. Nat Cell Biol 15, 2–8.
- Najafabadi HS, Mnaimneh S, Schmitges FW, Garton M, Lam KN, Yang A, Albu M, Weirauch MT, Radovani E, Kim PM, et al. (2015). C2H2 zinc finger proteins greatly expand the human regulatory lexicon. Nat Biotechnol.
- Nardini M, Gnesutta N, Donati G, Gatta R, Forni C, Fossati A, Vonrhein C, Moras D, Romier C, Bolognesi M, et al. (2013). Sequence-specific transcription factor NF-Y displays histone-like DNA binding and H2B-like ubiquitination. Cell 152, 132–143.
- Nitta KR, Jolma A, Yin Y, Morgunova E, Kivioja T, Akhtar J, Hens K, Toivonen J, Deplancke B, Furlong EE, et al. (2015). Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife 4.
- Ou J, Wolfe SA, Brodsky MH, and Zhu LJ (2018). motifStack for the analysis of transcription factor binding site evolution. Nat Methods 15, 8–9.
- Panne D (2008). The enhanceosome. Current opinion in structural biology 18, 236–242.
- Plevin MJ, Mills MM, and Ikura M (2005). The LxxLL motif: a multifunctional binding sequence in transcriptional regulation. Trends in biochemical sciences 30, 66–69.
- Ptashne M (2011). Principles of a switch. Nature chemical biology 7, 484–487.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81, 559–575.
- Reece-Hoyes JS, and Walhout AJ (2012). Yeast one-hybrid assays: a historical and technical perspective. Methods 57, 441–447.
- Reiter F, Wienerroither S, and Stark A (2017). Combinatorial function of transcription factors and cofactors. Curr Opin Genet Dev 43, 73–81.
- Rhee HS, and Pugh BF (2011). Comprehensive Genome-wide Protein-DNA Interactions Detected at Single-Nucleotide Resolution. Cell 147, 1408–1419.
- Rippe K, Schrader A, Riede P, Strohner R, Lehmann E, and Langst G (2007). DNA sequence- and conformation-directed positioning of nucleosomes by chromatin-remodeling complexes. Proc Natl Acad Sci U S A 104, 15635–15640.
- Rohs R, West SM, Sosinsky A, Liu P, Mann RS, and Honig B (2009). The role of DNA shape in protein-DNA recognition. Nature 461, 1248–1253.
- Rosenfeld MG, Lunyak VV, and Glass CK (2006). Sensors and signals: a coactivator/corepressor/epigenetic code for integrating signal-dependent programs of transcriptional response. Genes & development 20, 1405–1428.
- Rowe HM, Jakobsson J, Mesnard D, Rougemont J, Reynard S, Aktas T, Maillard PV, Layard-Liesching H, Verp S, Marquis J, et al. (2010). KAP1 controls endogenous retroviruses in embryonic stem cells. Nature 463, 237–240.
- Schmitges FW, Radovani E, Najafabadi HS, Barazandeh M, Campitelli LF, Yin Y, Jolma A, Zhong G, Guo H, Kanagalingam T, et al. (2016). Multiparameter functional diversity of human C2H2 zinc finger proteins. Genome Res 26, 1742–1752.
- Schneider TD, and Stephens RM (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18, 6097–6100.
- Singh H, Khan AA, and Dinner AR (2014). Gene regulatory networks in the immune system. Trends Immunol 35, 211–218.
- Sizemore GM, Pitarresi JR, Balakrishnan S, and Ostrowski MC (2017). The ETS family of oncogenic transcription factors in solid tumours. Nature reviews Cancer 17, 337–351.
- Slattery M, Riley T, Liu P, Abe N, Gomez-Alcala P, Dror I, Zhou T, Rohs R, Honig B, Bussemaker HJ, et al. (2011). Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147, 1270–1282.
- Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordan R, and Rohs R (2014). Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 39, 381–399.
- Smale ST (2014). Transcriptional regulation in the immune system: a status report. Trends Immunol 35, 190–194.
- Soufi A, Garcia MF, Jaroszewicz A, Osman N, Pellegrini M, and Zaret KS (2015). Pioneer transcription factors target partial DNA motifs on nucleosomes to initiate reprogramming. Cell 161, 555–568.
- Spitz F, and Furlong EE (2012). Transcription factors: from enhancer binding to developmental control. Nat Rev Genet 13, 613–626.
- Stormo GD, and Zhao Y (2010). Determining the specificity of protein-DNA interactions. Nat Rev Genet 11, 751–760.
- Stubbs L, Sun Y, and Caetano-Anolles D (2011). Function and Evolution of C2H2 Zinc Finger Arrays. Sub-cellular biochemistry 52, 75–94.
- Sur I, and Taipale J (2016). The role of enhancers in cancer. Nature reviews Cancer 16, 483–493.
- Swinstead EE, Miranda TB, Paakinaho V, Baek S, Goldstein I, Hawkins M, Karpova TS, Ball D, Mazza D, Lavis LD, et al. (2016a). Steroid Receptors Reprogram FoxA1 Occupancy through Dynamic Chromatin Transitions. Cell 165, 593–605.
- Swinstead EE, Paakinaho V, Presman DM, and Hager GL (2016b). Pioneer factors and ATP-dependent chromatin remodeling factors interact dynamically: A new perspective: Multiple transcription factors can effect chromatin pioneer functions through dynamic interactions with ATP-dependent chromatin remodeling factors. Bioessays 38, 1150–1157.
- Tacheny A, Dieu M, Arnould T, and Renard P (2013). Mass spectrometry-based identification of proteins interacting with nucleic acids. Journal of proteomics 94, 89–109.
- Takahashi K, and Yamanaka S (2016). A decade of transcription factor-mediated reprogramming to pluripotency. Nature reviews Molecular cell biology 17, 183–193.
- Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson A, Kampf C, Sjostedt E, Asplund A, et al. (2015). Proteomics. Tissue-based map of the human proteome. Science 347, 1260419.
- Vaquerizas JM, Kummerfeld SK, Teichmann SA, and Luscombe NM (2009). A census of human transcription factors: function, expression and evolution. Nat Rev Genet 10, 252–263.
- Virbasius CA, Virbasius JV, and Scarpulla RC (1993). NRF-1, an activator involved in nuclear-mitochondrial interactions, utilizes a new DNA-binding domain conserved in a family of developmental regulators. Genes & development 7, 2431–2445.
- Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, et al. (2009). ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854–858.
- Wang J, Zhuang J, Iyer S, Lin XY, Greven MC, Kim BH, Moore J, Pierce BG, Dong X, Virgil D, et al. (2013). Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res 41, D171–176.
- Wasserman WW, and Sandelin A (2004). Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5, 276–287.
- Wei GH, Badis G, Berger MF, Kivioja T, Palin K, Enge M, Bonke M, Jolma A, Varjosalo M, Gehrke AR, et al. (2010). Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo. EMBO J.
- Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, Saez-Rodriguez J, Cokelaer T, Vedenko A, Talukder S, et al. (2013). Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol 31, 126–134.
- Weirauch MT, and Hughes TR (2010). Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same. Trends Genet 26, 66–74.
- Weirauch MT, and Hughes TR (2011). A catalogue of eukaryotic transcription factor types, their evolutionary origin, and species distribution. Subcell Biochem 52, 25–73.
- Weirauch MT, Yang A, Albu M, Cote A, Montenegro-Montero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, et al. (2014). Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443.
- Wingender E, Schoeps T, Haubrock M, and Donitz J (2015). TFClass: a classification of human transcription factors and their rodent orthologs. Nucleic Acids Res 43, D97–102.
- Wong KH, and Struhl K (2011). The Cyc8-Tup1 complex inhibits transcription primarily by masking the activation domain of the recruiting protein. Genes Dev 25, 2525–2539.
- Worsley Hunt R, and Wasserman WW (2014). Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biol 15, 412.
- Wunderlich Z, and Mirny LA (2009). Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet 25, 434–440.
- Yin Y, Morgunova E, Jolma A, Kaasinen E, Sahu B, Khund-Sayeed S, Das PK, Kivioja T, Dave K, Zhong F, et al. (2017). Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356.
- Zabidi MA, and Stark A (2016). Regulatory Enhancer-Core-Promoter Communication via Transcription Factors and Cofactors. Trends in genetics: TIG 32, 801–814.
Supplementary Materials
Figure S1. Identification of the human transcription factors. A. Overview of the strategy for identifying the human TF repertoire. B. Comparison between TFs reported in this study and a previous study (Vaquerizas et al., 2009). The numbers of TFs identified in the previous and current studies are depicted as bars. Disagreements between the two studies are shown in greater detail in the insets. “Database issues” comprises problems relating to ID mapping, re-categorization of a gene as a pseudogene, etc.
Table S1. Related to Figure 1B. Human TF annotations and curator comments.
Table S2. Related to Figure 2B. Motif diversity within each family – collection of TF motif clusters.
Table S3. Related to Figure 4C. GWAS variants included in TF enrichment analyses.
Table S4. Related to Figure 4C and Table S3. Sources of GWAS variants included in TF enrichment analyses.
