Abstract
Transcription factors (TFs) regulate gene expression by interacting with DNA in a sequence-specific manner. High-throughput in vitro technologies, such as PBMs 1–6 and HT-SELEX 7,8, have revealed the DNA binding specificities of hundreds of TFs. However, they have limited ability to reliably identify lower affinity DNA binding sites, which are increasingly recognized as important for precise spatiotemporal control of gene expression 9–19. To address this limitation, we developed a novel technology to measure protein affinity to DNA by in vitro transcription and RNA sequencing (PADIT-seq). Using PADIT-seq, we comprehensively assayed the binding preferences of six TFs to all possible 10-bp DNA sequences, detecting hundreds of novel, lower affinity binding sites. The expanded repertoire of lower affinity binding sites unexpectedly revealed that nucleotides flanking high affinity DNA binding sites create overlapping lower affinity sites that together modulate TF genomic occupancy in vivo. We propose a model where TF binding is not determined by individual binding sites, but rather by the sum of multiple, overlapping binding sites. The overlapping binding model explains how competition between paralogous TFs for shared high-affinity binding sites is determined by flanking nucleotides that create differential numbers of overlapping, lower affinity binding sites. Critically, the model transforms our understanding of noncoding variant effects, revealing how single nucleotide changes simultaneously alter multiple overlapping sites to additively influence gene expression and human traits, including diseases.
Regulation of gene expression in response to diverse intracellular and extracellular signals depends on sequence-specific binding by transcription factors (TFs) to cis-regulatory elements. Hundreds of TFs have been profiled by high-throughput in vitro DNA binding assays, such as universal protein binding microarrays (uPBMs) 1–6, HT-SELEX 7,8, affinity purification-based approaches 20, Spec-seq 21, microfluidics-based approaches 22,23, and bacterial one-hybrid (B1H) assays 24. While in vitro data on TF DNA binding specificities have been powerful in helping to understand mechanisms underlying in vivo TF occupancy, nucleotides flanking motif matches in the genome have been found to unexpectedly alter TF occupancy 25–34. We hypothesized that gaps in the ability of in vitro DNA binding data to fully explain in vivo TF binding may be due to an inability to reliably detect lower affinity TF-DNA interactions.
Low affinity DNA binding sites are widespread and important for precise spatiotemporal control of gene expression during development 9–15. Homotypic clustering of multiple, non-overlapping, low affinity binding sites near each other can serve regulatory roles by increasing local TF concentration, either cooperatively 16 or non-cooperatively 17, to achieve phenotypic robustness 12,16. Multiple low affinity binding sites in repetitive elements have also been shown to increase local TF concentrations 18. Several high-throughput in vitro binding assays have been developed to quantify TF-DNA interaction strength and identify lower affinity binding sequences 35–37. However, it remains challenging to determine whether the lower affinity interactions are functional and how they contribute to overall TF occupancy. To address this gap and characterize mechanisms by which low affinity binding sites might play important roles in gene regulation, we have developed protein affinity to DNA by in vitro transcription and RNA sequencing (PADIT-seq), a high-throughput technology to measure TF-DNA binding preferences at far greater sensitivity than prior high-throughput methods.
PADIT-seq measures relative DNA affinity
PADIT-seq is based on a synthetic genetic circuit whose output is directly proportional to TF-DNA binding affinity and specificity (Figure 1a). Following in vitro transcription and translation (IVTT), the TF DNA-binding domain (DBD) binds to candidate DNA binding sites and recruits T7 RNA Polymerase via an ALFA-nbALFA interaction 38. In the absence of DNA binding, a minimal “D1” T7 promoter drives a low level of reporter gene expression 39. When bound, the TF increases downstream reporter gene expression in direct proportion to the strength of the TF-DNA interaction. This direct coupling between TF binding and transcriptional output ensures that detected binding events, even at lower affinities, represent functionally relevant TF-DNA interactions.
Figure 1: PADIT-seq detects hundreds of lower affinity interactions missed by uPBM and HT-SELEX.

(a) The PADIT-seq synthetic genetic circuit is shown with a reporter library containing all possible 10-bp DNA sequences as candidate TFBS. (b) EGR1 PADIT-seq activities are compared to MITOMI-derived dissociation constants (Kd). Active k-mers, colored red, are those that significantly increased reporter gene expression upon TF binding (FDR 5%). Inactive k-mers are colored black. (c) Comparison of binding preferences measured by uPBM E-scores (x-axis) and PADIT-seq (y-axis) for 4 human TFs, HOXD13, NKX2.5, TBX5, and EGR1, and 2 yeast TFs, Pho4 and Cbf1. Active k-mers are colored red, and the ratio of active/inactive k-mers is shown for each TF. (d) ROC curves comparing the ability of uPBM E-scores (orange) and HT-SELEX cycles 1–4 (blue) to discriminate between PADIT-seq active and inactive k-mers. (inset) The area under each ROC curve (AUROC) indicates the overall performance of each method, with values closer to 1 indicating better discrimination. HT-SELEX data were not available for the yeast TFs Pho4 and Cbf1. (e) (left) Same data as in panel c, with blue points indicating k-mers that were significantly enriched in the specified HT-SELEX cycle. (right) Box plots comparing PADIT-seq activities between active k-mers that were enriched (blue) or not enriched (red) in HT-SELEX. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. Statistical significance was determined by two-sided Wilcoxon rank sum tests.
To assay TFs by PADIT-seq, we first constructed a reporter library with all possible 10-bp DNA sequences as candidate TF binding sites (TFBS; n = 1,048,576). TFBS were randomly associated with barcodes (BCs) during PADIT-seq reporter library construction, and TFBS-BC combinations were determined by Illumina sequencing. The PADIT-seq reporter library was then mixed with either a ‘no DBD’ control or a constitutive T7 promoter driving expression of the DBD of interest. To benchmark PADIT-seq, we initially assayed two well-characterized human TFs: EGR1 and HOXD13, representing the largest (C2H2-type zinc finger) and second largest (homeodomain) TF families in metazoans, respectively. Following IVTT and Illumina sequencing of reporter RNAs, BC counts per TFBS for HOXD13, EGR1 and the ‘no DBD’ control were found to be highly reproducible across triplicate assays (Extended Data Figure 1a,b). To quantify the TF-DNA interactions, we performed differential gene expression analysis of TFBS counts against the ‘no DBD’ control using DESeq2 40. We defined the resulting log2 (DBD / ‘no-DBD’) values as ‘PADIT-seq activity’, and ‘active’ TFBS as those that significantly increased reporter gene expression upon TF binding.
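The quantity defined above can be illustrated with a minimal sketch. It is not the DESeq2 analysis used in the study: it assumes simple library-size scaling and a pseudocount of 1, and the function name and toy counts are hypothetical.

```python
import math

# Size of a library containing every possible 10-bp sequence.
N_TFBS = 4 ** 10  # 1,048,576

def log2_activity(dbd_counts, ctrl_counts, pseudocount=1.0):
    """Toy log2(DBD / no-DBD) per TFBS.

    Simplified stand-in for DESeq2: DBD counts are rescaled to the
    sequencing depth of the control, and a pseudocount (an assumption)
    avoids division by zero.
    """
    scale = sum(ctrl_counts.values()) / sum(dbd_counts.values())
    return {
        tfbs: math.log2(
            (dbd_counts[tfbs] * scale + pseudocount)
            / (ctrl_counts.get(tfbs, 0) + pseudocount)
        )
        for tfbs in dbd_counts
    }

# Hypothetical barcode counts summed per TFBS (not real data).
toy_dbd = {"GCGTGGGCGT": 79, "AAAAAAAAAA": 9, "ACGTACGTAC": 12}
toy_ctrl = {"GCGTGGGCGT": 9, "AAAAAAAAAA": 9, "ACGTACGTAC": 82}
toy_activity = log2_activity(toy_dbd, toy_ctrl)
```

In this toy example the first sequence gains reporter signal in the presence of the DBD and receives a positive activity, whereas the second is unchanged.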
At a false discovery rate (FDR) of 5%, we identified 46,279 and 6,596 active 10-mers for HOXD13 and EGR1, respectively. To test the robustness of PADIT-seq in identifying active sequences, we constructed a smaller-scale PADIT-seq reporter library comprising 896 candidate 9-bp binding sites for HOXD13 or EGR1. For both TFs, all of the PADIT-seq active k-mers identified in the ‘all-10-mers’ library were also active in this smaller-scale library (Extended Data Figure 1d,e). PADIT-seq activities were highly correlated between the ‘all-10-mers’ and smaller-scale 9-mers library for both TFs (Extended Data Figure 1d,e), supporting the reliability of PADIT-seq results. To assess the accuracy of PADIT-seq in quantifying TF-DNA interaction strengths, we compared EGR1 PADIT-seq activities to equilibrium dissociation constants (Kd) derived from an orthogonal MITOMI assay 41. PADIT-seq activities for EGR1 from both libraries were highly correlated with MITOMI-derived Kd values (Figure 1b and Extended Data Figure 1f). Importantly, EGR1 binding to low affinity sites (Kd ~ 0.1 μM) was detected among the active 9-mers in both libraries (Figure 1b and Extended Data Figure 1f).
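To make the 5% FDR threshold concrete, the sketch below applies the Benjamini-Hochberg adjustment, which is the default multiple-testing correction in DESeq2. Whether exactly this adjustment was used here is an assumption, and the P-values and fold changes are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-TFBS test results (not real data).
pvals = [0.01, 0.04, 0.03, 0.5]
log2fc = [2.1, 0.8, -0.4, 0.1]

padj = benjamini_hochberg(pvals)
# 'Active' here means significant at FDR 5% AND increased expression.
active = [q < 0.05 and lfc > 0 for q, lfc in zip(padj, log2fc)]
```

Requiring a positive fold change mirrors the definition of active TFBS as those that significantly increased reporter expression.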
TFs bind hundreds of low affinity sites
After initial benchmarking using HOXD13 and EGR1, we applied PADIT-seq to four additional TFs (Extended Data Figure 1c) spanning diverse structural classes: two human TFs, NKX2.5 (homeodomain) and TBX5 (T-box), which play critical roles in cardiac development 42,43, and two paralogous Saccharomyces cerevisiae TFs, Pho4 and Cbf1 (bHLH), which compete for binding to shared DNA target sites 44 to regulate phosphate homeostasis 45 and chromosome segregation 46, respectively.
To evaluate how PADIT-seq performs against established methods, we first compared activities from the ‘all-10-mers’ PADIT-seq library to uPBM data for all six TFs. Since uPBMs allow derivation of a TF binding score to all possible 8-mers and 9-mers, we first derived median k-mer PADIT-seq activities from the all-10-mers data, where k=8 for HOXD13, NKX2.5, TBX5, Pho4, and Cbf1, and k=9 for EGR1 based on its known, longer DNA binding site preference. PADIT-seq and uPBM data showed strong concordance across all 6 assayed TFs (Figure 1c). In addition to the highest scoring uPBM k-mers, PADIT-seq active k-mers extended to include lower affinity binding sites, detecting binding to sites with E-scores as low as 0.3, that traditional uPBM analysis would miss (Figure 1c). We quantified how well uPBM scores predict PADIT-seq active k-mers using area under the receiver operating characteristic curve (AUROC) analysis. uPBM E-scores and Z-scores both showed excellent discriminatory power across all TFs (AUROC > 0.97; Figure 1d and Extended Data Figure 1g). However, both of these PBM scores required different thresholds for different TFs, suggesting that using a fixed uPBM threshold across TFs, as has typically been done in prior studies 47,48, does not reliably distinguish bound sites.
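The collapse from 10-mer to k-mer activities described above can be sketched as follows. This is a minimal illustration that assumes each 10-mer contributes its activity once to every distinct k-mer it contains and that strands are not merged; the function name and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_kmer_activity(tenmer_activity, k=8):
    """Median activity of all 10-mers that contain each k-mer."""
    pooled = defaultdict(list)
    for seq, act in tenmer_activity.items():
        # A set avoids counting a 10-mer twice for a repeated k-mer.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            pooled[kmer].append(act)
    return {kmer: median(vals) for kmer, vals in pooled.items()}

# Hypothetical 10-mer activities (not real data).
toy_tenmers = {
    "AACCGGTTAA": 1.0,
    "TAACCGGTTA": 3.0,
    "TTAACCGGTT": 2.0,
}
toy_8mers = median_kmer_activity(toy_tenmers, k=8)
```

In a complete all-10-mers library, each 8-mer would be covered by up to 48 different 10-mers (three registers, each with 16 possible flanking dinucleotide combinations), which is what makes a median a reasonable summary.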
The strong concordance with uPBM data suggests that PADIT-seq active k-mers represent genuine TF binding sites. However, uPBM measurements may be confounded by variable flanking sequences surrounding each k-mer. Therefore, to obtain independent validation of PADIT-seq active k-mers, we designed custom PBMs in which each k-mer was embedded within constant flanking sequences and tested independently (Extended Data Figure 2a). PADIT-seq activities showed strong correlation with custom PBM signal intensities across different protein concentrations for both HOXD13 (Pearson r > 0.90; Extended Data Figure 2b) and EGR1 (Pearson r > 0.86; Extended Data Figure 2c). Importantly, the correlation remained robust even for k-mers with lower PADIT-seq activities (0.01 < FDR < 0.10), demonstrating that these sequences represent genuine TF binding sites rather than technical artifacts of the PADIT-seq assay.
HT-SELEX, like uPBMs, has been used to assay the DNA binding preferences of hundreds of TFs 3,5,8. Consequently, we evaluated its ability to distinguish PADIT-seq active k-mers. While uPBM E-scores showed strong predictive power across all TFs (AUROC > 0.97), HT-SELEX enrichment scores demonstrated substantially lower performance, irrespective of the HT-SELEX cycle analyzed (Figure 1d). Further analysis revealed that HT-SELEX was biased for detecting high affinity TFBS and missed lower affinity interactions identified by PADIT-seq (Figure 1e). PWMs derived from HT-SELEX data 49,50 were similarly biased for detecting high affinity interactions (Extended Data Figure 2d,e). Therefore, PADIT-seq significantly outperforms existing high-throughput technologies in identification of lower affinity TF DNA binding sites.
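For readers less familiar with the AUROC comparisons above, the statistic can be computed directly from its rank interpretation: the probability that a randomly chosen active k-mer scores higher than a randomly chosen inactive one. The sketch below is a small quadratic-time illustration with hypothetical scores, not the analysis code used in the study.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half. Quadratic in the number of k-mers, so suitable
    only for small illustrations.
    """
    pos = [s for s, is_active in zip(scores, labels) if is_active]
    neg = [s for s, is_active in zip(scores, labels) if not is_active]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical uPBM E-scores and PADIT-seq activity calls (not real data).
escores = [0.49, 0.45, 0.30, 0.10, 0.35]
is_active = [True, True, True, False, False]
toy_auroc = auroc(escores, is_active)
```

Here one active k-mer (E-score 0.30) scores below one inactive k-mer (0.35), so five of the six pairs are ordered correctly.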
TFs bind overlapping sites in the genome
To investigate the role of lower affinity DNA binding sites in TF occupancy in vivo, we used PADIT-seq data to inspect genomic regions where the six TFs were bound according to ChIP-seq or ChIP-nexus data: HOXD13 in the mouse forelimb bud 51, EGR1 in the mouse frontal cortex 52, NKX2.5 and TBX5 in human cardiomyocytes 53,54, and Pho4 and Cbf1 in S. cerevisiae 55. We first asked whether incorporating lower affinity TF-DNA interactions increases the ability to discriminate ChIP ‘bound’ from background genomic regions. Although lower affinity binding sites did not improve discrimination of bound versus unbound regions (Extended Data Figure 3a), we hypothesized they might have a quantitative effect on binding levels. Indeed, we found that the sum of PADIT-seq activity levels of all active k-mers within individual peaks was significantly correlated, albeit modestly, with normalized ChIP-seq and ChIP-nexus read counts (Pearson r = 0.29–0.50; Extended Data Figure 3b). Importantly, considering only the highest affinity k-mers resulted in lower correlation with ChIP-seq and ChIP-nexus signal than when lower affinity sites were included (Extended Data Figure 3c).
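The peak-level analysis above can be sketched in two steps: sum the activities of all active k-mers tiled across a peak, then correlate those sums with ChIP signal. This is a minimal sketch, assuming the activity lookup holds only active k-mers in the scanned orientation; the function names and toy inputs are hypothetical.

```python
import math

def peak_score(seq, activity, k=8):
    """Sum of activities over every active k-mer tiled across a peak."""
    return sum(activity.get(seq[i:i + k], 0.0)
               for i in range(len(seq) - k + 1))

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 4-mer activities for a short toy sequence (not real data).
toy_activity_4 = {"ACGT": 1.0, "CGTA": 0.5}
toy_peak_score = peak_score("ACGTA", toy_activity_4, k=4)
```

Restricting the lookup to only the highest affinity k-mers, versus all active k-mers, is the comparison the text describes.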
To investigate how lower affinity binding sites may increase TF genomic occupancy, we performed a sliding window analysis in which we scored every k-mer, moving in 1-bp steps, across ChIP-seq or ChIP-nexus peaks. Surprisingly, we found multiple, consecutive active k-mers within peaks for all 6 TFs. For example, at the HOXD13 ChIP-seq peak near Cadps, there were 6 consecutive, active HOXD13 8-mers, each of which was assayed independently with PADIT-seq (Figure 2a). This phenomenon, where multiple TFBS overlap each other, is distinct from homotypic clustering, in which multiple, non-overlapping binding sites for a particular TF are located, typically tens to hundreds of bp apart, within a cis-regulatory element 56–58. To our knowledge, it has not been reported previously.
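The sliding window analysis can be illustrated with a short sketch that reports the longest run of consecutive active k-mers in a sequence. It assumes the set of active k-mers already includes both orientations where relevant; the function name and toy sequence are hypothetical.

```python
def longest_active_run(sequence, active_kmers, k=8):
    """Longest run of consecutive active k-mers, tiling in 1-bp steps."""
    best = run = 0
    for start in range(len(sequence) - k + 1):
        if sequence[start:start + k] in active_kmers:
            run += 1
            best = max(best, run)
        else:
            run = 0  # an inactive window breaks the run
    return best

# Hypothetical peak sequence and active 8-mers (not real data).
toy_peak = "CCTTTTACGACC"
toy_active = {"CTTTTACG", "TTTTACGA", "TTTACGAC"}
toy_run = longest_active_run(toy_peak, toy_active, k=8)
```

A run of n consecutive active k-mers spans k + n - 1 bp, so the six consecutive 8-mers in the Cadps example would cover 13 bp.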
Figure 2: Nucleotides flanking high affinity binding sites create overlapping lower affinity binding sites.

(a) Example of a HOXD13 ChIP-seq peak where PADIT-seq activities for overlapping 8-mers are tiled in 1-bp steps across the genomic region. (inset) Schematic showing that the six consecutive active 8-mers (red) were assayed independently using PADIT-seq. (b) Distribution of consecutive active k-mers within ChIP-seq and ChIP-nexus peaks (red) compared to random, length-matched genomic regions (black) for each TF. Statistical significance was determined by two-sided Wilcoxon rank sum tests. For EGR1, the large effect size resulted in a P-value below computational precision. (c) Analysis of overlapping binding sites in ChIP-seq and ChIP-nexus peaks. (Left) Heatmaps showing PADIT-seq activity centered on consecutive active k-mers, with 4 flanking inactive k-mers on each side. Each row represents one peak. (Right) Average PADIT-seq activity across all peaks in the corresponding heatmap. (d) Conservation analysis using PhastCons scores for genomic regions containing multiple consecutive active k-mers (red) and their flanking sequences (blue). The number of ChIP-seq peaks for each analysis is indicated in the corresponding row in panel c. * Adjusted P < 0.05, paired two-sided Wilcoxon tests. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. Conservation analysis was not performed for Pho4 and Cbf1 ChIP-nexus peaks due to insufficient peak numbers.
Across all 6 TFs, we found that ChIP-seq and ChIP-nexus peaks were significantly enriched for having a larger number of consecutive, active k-mers (Figure 2b). This enrichment remained significant irrespective of how background genomic sequences were defined (Extended Data Figure 4a) and regardless of the FDR cutoff chosen to define PADIT-seq active k-mers (Extended Data Figure 4b and 5). High affinity PADIT-seq active k-mers tended to be central and were flanked by nucleotides that created additional overlapping, lower affinity binding sites (Figure 2c). To investigate whether these consecutive active k-mers may be functionally important, we inspected the evolutionary conservation of the corresponding genomic regions. We found that genomic regions containing consecutive active k-mers were significantly more conserved than flanking sequences (Figure 2d) across varying numbers of consecutive, active k-mers (Supplementary Figure 1).
Since some TFs prefer binding sites with particular DNA shape features 59, we investigated whether the consecutive active k-mers found within ChIP-seq peaks were enriched for any DNA shape features. Using a deep learning based tool, Deep DNAShape 60, we found that minor groove width (Extended Data Figure 6a) and propeller twist (Extended Data Figure 6b) at the extended recognition sequences bound by TFs were distinct from flanking genomic regions. This is consistent with a previous study that found that distinct DNA shapes at nucleotides flanking motif matches influenced TF occupancy 34. However, further studies are required to better understand the relationship between TF DNA shape preferences and lower affinity DNA binding sites.
TFs bind each site independently in vivo
ChIP-nexus combines ChIP-based specificity with high-resolution 5’–3’ exonuclease trimming to precisely map individual TF-DNA binding events 61. Therefore, to test whether TFs interact independently with each overlapping binding site in vivo, we examined ChIP-nexus data for the S. cerevisiae TFs Pho4 and Cbf1, stratifying ChIP-nexus peaks by the number of consecutive, overlapping binding sites in the genomic sequence (schema in Figure 3). Peaks with only 1 or 2 overlapping binding sites showed no significant cuts above background (not shown), consistent with weaker binding interactions at these regions. Our analysis consequently focused on comparing peaks containing 3 versus 4 versus 5 overlapping sites (Figure 3), allowing us to make three key observations:
Figure 3: ChIP-nexus footprinting provides direct evidence for independent TF binding to consecutive, overlapping binding sites.

(Left) Schema illustrating how the footprinting analysis was performed. (Center and right) Strand-specific distribution of ChIP-nexus 5’ ends (red: positive strand; black: negative strand) averaged across peaks containing exactly 3 (top), 4 (center) or 5 (bottom) consecutive overlapping active 8-mers for Cbf1 (center) or Pho4 (right). Dashed blue lines indicate boundaries of the consecutive overlapping active 8-mers. Letters indicate cuts occurring significantly above background in the flanking 15-bp genomic region (one-sided permutation test adjusted P < 0.05). Boxes at the bottom show calculations for the footprint size, measured as the distance between furthest cuts on positive and negative strands (green dashed lines in each plot). The numbers of ChIP-nexus peaks with exactly 3, 4 or 5 overlapping binding sites for Pho4 or Cbf1 are indicated in Figure 2b.
There was a progressive increase in the magnitude of cuts as the number of overlapping binding sites increased, from 3 to 5, for both Pho4 and Cbf1 (height of lettered cuts in Figure 3). This suggests that adding an overlapping binding site increases overall TF occupancy by contributing additively to overall affinity.
With each additional overlapping binding site, the number of cuts significantly above background (permutation test adjusted P < 0.05) expanded in 1-bp increments on both the positive and negative strands (Cbf1: 2→3→4 cuts; Pho4: 3→4 cuts; width of lettered cuts in Figure 3). This cannot be explained by models where TFs recognize a single site with extended, flanking sequence preferences.
The expansion of cuts occurred primarily inward toward the center of the binding region. Consequently, the overall footprint size, i.e., the distance between the furthest cuts on the positive and negative strands, increased incrementally with each additional overlapping site (Cbf1: 28→29→30 bp; Pho4: 27→28 bp; boxes in Figure 3). This progressive expansion of the protected region directly corresponds to additional TF-DNA contacts forming at each consecutive, overlapping binding site.
Each additional, overlapping binding site produces precisely one additional cut on each strand, leading to 1-bp increments in total footprint size. This corresponds precisely to the predicted molecular signature of independent TF binding at each overlapping site, providing strong in vivo support for the overlapping binding sites model.
This pattern could potentially arise from an artifact if the core CACGTG E-box motif were to occur at more positions within regions containing higher numbers of overlapping binding sites. However, for Pho4, we found that ~96% of ChIP-nexus peaks with 5 overlapping binding sites had CACGTG positioned at the 4th register (Extended Data Figure 7a). Consequently, the increased number of significant cuts and expanded footprint size observed for 5 overlapping binding sites cannot be attributed to CACGTG occurring at more registers. In contrast, for Cbf1, CACGTG positioning was indeed more variable in peaks with 5 overlapping binding sites. Therefore, we restricted our analysis to peaks where CACGTG occurred at the same register. Even after this positional constraint, the footprint expansion persisted: 28 bp for 3 overlapping binding sites versus 30 bp for 5 overlapping binding sites (Extended Data Figure 7b). This controlled analysis conclusively demonstrates that the ChIP-nexus footprint expansion reflects authentic, additional TF binding events, rather than positional artifacts.
Each site additively increases affinity
We performed three independent analyses to validate that consecutive, overlapping active k-mers additively increase TF occupancy. First, we tiled HT-SELEX sequencing reads with PADIT-seq k-mers because each cycle enriches for sequences with higher TF affinity. Consistent with our model, sequences containing more overlapping active k-mers became progressively more abundant across successive rounds of selection (Extended Data Figure 8a). Second, uPBM signal intensities correlated linearly with the number of consecutive, overlapping active k-mers across ~40,000 60-bp probes for all 6 TFs (Extended Data Figure 8b). Third, our model predicts that TF binding to 10-mers should depend on the combined strength of all 3 overlapping 8-mers, rather than just the strongest 8-mer. For example, for all 10-mers with a central ACTTTACT 8-mer, the mean HOXD13 uPBM E-scores of the two overlapping 8-mers were highly correlated with the 10-mer PADIT-seq activities (Extended Data Figure 8c). This relationship held broadly. The mean uPBM E-scores of overlapping k-mers were positively correlated with the 10-mer PADIT-seq activities for a significant majority of active k-mers (Extended Data Figure 8d; Binomial test P < 2.2 × 10⁻¹⁶). Altogether, the three independent analyses confirmed that consecutive active k-mers increase TF affinity additively.
Mechanism of TF paralog competition
Given that TF binding is determined by multiple overlapping sites rather than individual high affinity sites, we next sought to test whether this model could explain a fundamental and long-standing question in transcriptional regulation: how do paralogous TFs achieve binding specificity despite recognizing similar motifs? To answer this question, we focused on Pho4 and Cbf1, which compete for binding to the same CACGTG E-box motif under low phosphate conditions when Pho4 is translocated to the nucleus. Using PADIT-seq, we identified 311 active 8-mers across both TFs, with 34 high affinity sites bound by both factors, while lower affinity sites showed specificity for either Cbf1 (n=58) or Pho4 (n=219) (Figure 4a). Since preferences for nucleotides flanking core E-box motifs have been shown to differ among bHLH TFs 30,62,63, we first analyzed published BET-seq data 30 measuring relative binding affinities of Pho4 and Cbf1 to the central CACGTG E-box with all possible 5-bp flanking nucleotides on either side. The distribution of binding preferences showed clear bimodality (Figure 4b), indicating that flanking sequences create differential binding specificity. Remarkably, the difference in the number of consecutive active 8-mers for each TF strongly predicted which TF would dominate binding (Pearson r = 0.796; Figure 4c,d). Incorporating the relative binding strengths, i.e., the k-mer’s PADIT-seq activities, yielded an even stronger correlation (Pearson r = 0.948; r² = 0.898 ± 0.0004; Figure 4e). The overlapping binding sites model explains ~50% of the remaining variance that PWM models fail to capture (r² = 0.795 ± 0.0007), providing substantial improvement in explaining the precise molecular mechanism of TF paralog competition.
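One way to read the ~50% figure, assuming it denotes the share of the variance left unexplained by the PWM model that the overlapping binding sites model recovers, is the following arithmetic on the two reported coefficients of determination (the subscript labels are ours):

```latex
\[
\frac{r^2_{\mathrm{overlap}} - r^2_{\mathrm{PWM}}}{1 - r^2_{\mathrm{PWM}}}
= \frac{0.898 - 0.795}{1 - 0.795}
= \frac{0.103}{0.205} \approx 0.50
\]
```

Under this reading, the PWM model leaves about 20.5% of the variance unexplained, and the overlapping binding sites model accounts for roughly half of that remainder.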
Figure 4: Competition between the paralogous TFs Pho4 and Cbf1 is determined by differential numbers of overlapping binding sites.

(a) PADIT-seq activities for all 8-mers, showing binding by both TFs (red), Cbf1-only (blue), Pho4-only (yellow), or neither (purple). (b) Distribution of differential binding between Cbf1 and Pho4 measured by BET-seq (ΔΔΔG) to all possible 10-bp DNA sequences (n = 1,048,576) flanking a central CACGTG E-box (‘NNNNNCACGTGNNNNN’). (c) Box plots showing the relationship between differential binding measured by BET-seq (ΔΔΔG; y-axis) and the differential number of Pho4 and Cbf1 active 8-mers overlapping ‘NNNNNCACGTGNNNNN’ (n = 1,048,576). The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. Statistical significance was determined by two-sided Wilcoxon tests, and P-values were adjusted for multiple testing by the Holm procedure. All pairwise comparisons have an adjusted P-value below computational precision (P ~ 0), except for the indicated pairs. (d) PADIT-seq activities of Pho4 (red) and Cbf1 (blue) across two representative 16-bp ‘NNNNNCACGTGNNNNN’ sequences. (e) Scatter plot showing the relationship between differential binding measured by BET-seq (ΔΔΔG; y-axis) and differential PADIT-seq activity between Cbf1 and Pho4 at the nine 8-mers overlapping all possible ‘NNNNNCACGTGNNNNN’ DNA sequences. (f) Binding resilience when concentrations of the competing TF are increased in vitro (left: Cbf1 binding when Pho4 concentration is increased; middle: Pho4 binding when Cbf1 concentration is increased) or in vivo (right: Pho4 binding when Cbf1 concentration is increased). The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. Statistical significance was determined by two-sided Wilcoxon tests, and P-values were adjusted for multiple testing by the Holm procedure. Adjusted P-values > 0.05 are indicated as n.s.
Figure 5: Noncoding variants alter TF binding by perturbing multiple overlapping binding sites.

(a) Difference in PADIT-seq activity between k-mers tiled across the reference and alternate alleles (x-axis) is plotted against the SNP-SELEX preferential binding scores (PBS; y-axis) for HOXD13 (left) and EGR1 (right). Pearson correlation for all SNPs except those colored black is indicated. (inset) Euler diagram of SNPs with differential PADIT-seq activity and significant SNP-SELEX preferential binding scores. (b) Schematic illustration of the custom PBM design to test differential TF binding to noncoding SNPs. (c) ROC curves comparing the performance of PADIT-seq, SNP-SELEX PBS, and MotifBreakR against custom PBM validation data (schema shown in panel b) for predicting variant effects on HOXD13 (left) and EGR1 (right) binding. (d-f) (Left) PADIT-seq k-mers tiled across the reference and alternate alleles. Red points represent k-mers that are active in both alleles, and green points represent k-mers that are uniquely active in the allele with higher TF binding. (Right) Box plots show custom PBM signals from 16 independent probes for the reference and alternate alleles. P-values from two-sided t-tests were adjusted for multiple testing by the Benjamini-Hochberg procedure. (g) SNP-SELEX preferential binding scores plotted against the number of overlapping PADIT-seq active k-mers altered by SNP alleles. P-values from two-sided Wilcoxon tests were adjusted for multiple testing by the Holm procedure. Adjusted P-values > 0.05 are indicated as n.s. (d-g) The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. (h) The significantly expanded repertoire of lower affinity binding sites uncovered by PADIT-seq reveals that high affinity sites are flanked by nucleotides that create additional, overlapping lower affinity binding sites that together influence genomic TF occupancy.
To further determine whether these differential overlapping sites influence TF competition, we next analyzed genomic context PBM data 64. Consistent with our model, DNA sequences with a larger number of Pho4-specific consecutive active 8-mers showed reduced Cbf1 binding when Pho4 was present, and vice versa (Figure 4f, left and middle panels, respectively). We also observed this effect in vivo, where ChIP-seq peaks containing more Pho4-specific consecutive active 8-mers showed higher Pho4 occupancy in the presence of Cbf1 (Figure 4f, right panel).
Altogether, these results provide compelling evidence that nucleotides flanking high affinity binding sites create differential numbers of overlapping, lower affinity binding sites to control competition between paralogous TFs.
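The two predictors used in this section, the differential number of active 8-mers and the differential PADIT-seq activity across the nine 8-mers tiling a 16-bp ‘NNNNNCACGTGNNNNN’ sequence, can be sketched as below. The sign convention (Pho4 minus Cbf1), the function names, and the activity values are assumptions for illustration only.

```python
def tile_kmers(seq, k=8):
    """All k-mers of seq in 1-bp steps (nine 8-mers for a 16-bp site)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def differential_count(seq, active_a, active_b, k=8):
    """Number of active k-mers for TF A minus the number for TF B."""
    kmers = tile_kmers(seq, k)
    return sum(km in active_a for km in kmers) - sum(km in active_b for km in kmers)

def differential_activity(seq, active_a, active_b, k=8):
    """Summed activity for TF A minus TF B; inactive k-mers count as 0."""
    return sum(active_a.get(km, 0.0) - active_b.get(km, 0.0)
               for km in tile_kmers(seq, k))

# Hypothetical active 8-mers and activities (not real data).
toy_pho4 = {"AACACGTG": 2.0, "ACACGTGT": 1.5, "CACGTGTT": 1.0}
toy_cbf1 = {"AACACGTG": 2.0, "ACACGTGT": 0.5}
toy_site = "AAAAACACGTGTTTTT"

toy_dcount = differential_count(toy_site, toy_pho4, toy_cbf1)
toy_dact = differential_activity(toy_site, toy_pho4, toy_cbf1)
```

In this toy case both TFs bind the shared high affinity 8-mer equally, and the predicted preference for Pho4 arises entirely from the flanking, lower affinity 8-mers.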
Mechanism of SNP effects on TF binding
Current models based on single binding sites do not accurately capture the effects of noncoding variants on TF binding. Therefore, we next investigated whether our overlapping binding sites model could provide a more predictive framework. We first analyzed 5,748 and 4,136 variants that were tested by SNP-SELEX for differential binding to HOXD13 and EGR1, respectively 65. PADIT-seq accurately identified 92.9% (39/42) and 96.4% (81/84) of variants found by SNP-SELEX to alter HOXD13 and EGR1 binding, respectively (Figure 5a). While these variants primarily had large effects on TF binding (red points, Figure 5a), PADIT-seq detected significantly more (>5-fold) variants with subtler effects (yellow points, Figure 5a). Although these variants were not identified as differentially bound by SNP-SELEX, the effects of these variants on PADIT-seq activity were nonetheless highly correlated with SNP-SELEX preferential binding scores (yellow, Figure 5a). We hypothesized that these variants are likely true positives that were below the sensitivity of SNP-SELEX. To systematically evaluate these predictions, we performed custom PBM experiments to test differential binding by HOXD13 and EGR1 to a representative subset (~280 variants each) of these variants (Figure 5b). PADIT-seq outperformed existing approaches in identifying differential TF binding, achieving AUROC values of 0.943 for HOXD13 and 0.962 for EGR1 (Figure 5c). MotifBreakR, a popular PWM-based algorithm that scores the effects of variants on TF binding 66, showed notably lower performance (AUROC = 0.790 for HOXD13, 0.872 for EGR1). In Figure 5d and Extended Data Figure 9a,b, we highlight some examples of how the overlapping binding sites model captures variant effects on TF binding that are missed by MotifBreakR.
A sliding window analysis of PADIT-seq active k-mers across the reference and alternate alleles revealed that the magnitude of differential binding scaled with the number of overlapping binding sites altered. Variants altering multiple overlapping k-mers, such as rs606231230 and rs79228650, exhibited large effects on TF binding in custom PBM experiments (Figure 5e). Notably, rs606231230 is pathogenic for preaxial polydactyly and occurs in a limb-specific enhancer bound by HOXD13 in the developing mouse limb bud 51,67,68. The risk allele creates multiple overlapping HOXD13 binding sites. Importantly, we also identified variants like rs1104802 and rs73414426 that altered single active k-mers while preserving adjacent overlapping sites. These variants showed modest but statistically significant effects on TF binding in custom PBM experiments (Figure 5f), providing further evidence that each overlapping site contributes independently to overall binding. We quantified this relationship by a systematic comparison between SNP-SELEX preferential binding scores and the number of consecutive active k-mers altered between alleles, which showed high correlation (Figure 5g). Together, these results provide strong evidence that noncoding variants modulate TF binding through cumulative effects on multiple, overlapping binding sites.
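The variant analysis described above can be sketched by tiling the k-mers that overlap a single-nucleotide variant in each allele and comparing them. This is a minimal sketch, assuming single-base alleles, a lookup containing only active k-mers in the scanned orientation, and a simple sum as the allele score; the function names, sequences, and activities are hypothetical.

```python
def windows_over_snp(flank5, allele, flank3, k=8):
    """The k windows of length k that overlap a single-base allele."""
    seq = flank5[-(k - 1):] + allele + flank3[:k - 1]
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def snp_effect(flank5, ref, alt, flank3, activity, k=8):
    """Return (number of windows whose active status changes,
    summed activity of alt minus ref)."""
    ref_w = windows_over_snp(flank5, ref, flank3, k)
    alt_w = windows_over_snp(flank5, alt, flank3, k)
    n_altered = sum((r in activity) != (a in activity)
                    for r, a in zip(ref_w, alt_w))
    delta = (sum(activity.get(a, 0.0) for a in alt_w)
             - sum(activity.get(r, 0.0) for r in ref_w))
    return n_altered, delta

# Hypothetical flanks and active 8-mers (not real data).
toy_snp_activity = {"GGGGGGGT": 1.0, "GGGGGGTC": 0.5, "GGGGGGGA": 0.25}
toy_n_altered, toy_delta = snp_effect(
    "GGGGGGG", "A", "T", "CCCCCCC", toy_snp_activity, k=8
)
```

Because every one of the k windows contains the variant position, a single nucleotide change can alter several overlapping sites at once, which is the effect the text attributes to variants such as rs606231230.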
To demonstrate in vivo relevance, we validated PADIT-seq predictions using allele-specific ChIP-seq data 69. PADIT-seq identified the preferred allele with 91% concordance, substantially outperforming MotifBreakR (Extended Data Figure 9c). Critically, analysis of MPRA data 70–73 for EGR1, which is broadly expressed across different cell types, confirmed that variants predicted by PADIT-seq to alter binding also significantly impact gene expression (Extended Data Figure 9d,e,f). These results demonstrate that noncoding variants modulate TF binding and target gene expression in vivo by perturbing multiple, overlapping binding sites.
TF binding sites are innately ‘weavable’
The overlapping binding sites model successfully explains both TF paralog competition and SNP effects on TF binding. This raises a critical question – is the model a general principle of TF-DNA recognition, or limited to the 6 TFs included in this study? To answer this question, we first asked what enables PADIT-seq active k-mers to form extended recognition sequences. We constructed a network graph representation in which nodes correspond to PADIT-seq active k-mers and edges indicate a (k-1)-bp overlap between the connected k-mers (Extended Data Figure 10a). For HOXD13, analysis of the 1,780 active 8-mers and their reverse complements revealed that 3,446 / 3,536 (~97.5%) nodes were part of the largest, single connected component, in which any two nodes can be connected through a continuous path of edges (Extended Data Figure 10b). In stark contrast, a random 8-mer network showed minimal connectivity, with only 7 / 3,536 (~0.2%) nodes forming the largest, single connected component (Extended Data Figure 10c). This suggests that HOXD13 binding sites possess an inherent capacity to weave together into extended recognition sequences, which is true for the remaining 5 TFs as well (Extended Data Figure 10d,e,f,g,h). Moreover, for all 6 TFs, we found that the binding affinity of a node correlated positively with its total number of edges (Extended Data Figure 10b,d,e,f,g,h), suggesting that high affinity k-mers are inherently more likely to be flanked by nucleotides that create overlapping lower affinity binding sites.
To examine whether ‘weavability’ represents a general property of TFBS, we analyzed uPBM data spanning 200 human and mouse TFs across 9 major DBD classes from the UniPROBE database (Extended Data Figure 10i) 74,75. For each TF, we constructed directed network graphs comprising its top 500 8-mers according to their uPBM E-score. Comparison against networks built from 500 randomly sampled 8-mers (performed 1,000 times) revealed that TF binding sites consistently formed more extensive networks (Extended Data Figure 10i). For 199 / 200 TFs, the largest connected component contained >80% of nodes (P < 0.001, permutation test). This suggests that ‘weavability’ is an intrinsic and general property of TFBS across diverse structural families in eukaryotes. The overlapping binding sites model, therefore, likely represents a fundamental feature of eukaryotic TF-DNA interactions.
Discussion
The expanded repertoire of bound sequences uncovered by PADIT-seq provides key insights into the role of lower affinity sites in modulating TF occupancy in vivo. We propose a model in which nucleotides adjacent to high affinity sites create overlapping, lower affinity sites that together influence binding by the same TF (Figure 5h). Formation of such extended recognition sequences stems from an inherent property of eukaryotic TF binding sites to interweave with one another. Future studies are needed to determine whether the intrinsic ‘weavability’ of binding sites extends to prokaryotic TFs as well and how this property may have evolved.
Recent independent work showed that partition function models that sum binding contributions from short tandem repeats (STRs) flanking core motifs are predictive of overall TF binding 18. Additionally, using iMITOMI, homotypic clusters of low affinity binding sites have been shown to achieve substantial TF occupancy, comparable to an individual high affinity binding site 76. Our model differs from these studies in a fundamental way: while the STR and iMITOMI studies examined scenarios where multiple TF molecules simultaneously bind to distinct sites on a DNA fragment, our study focuses on how a single TF molecule interacts with multiple, overlapping DNA binding sites.
The overlapping binding sites model could be alternatively interpreted as an artifact of partial motif recognition of shared nucleotides instead of independent binding events. However, multiple lines of experimental evidence demonstrate that this is not the case. First, lower affinity PADIT-seq active k-mers were bound independently in custom PBM experiments, confirming that these represent genuine TF binding sites. Second, noncoding variants altering a single overlapping binding site, while preserving adjacent binding sites, significantly alter TF binding in the predicted direction. Finally, and most compellingly, ChIP-nexus footprinting analysis provided direct in vivo evidence for our model. With each additional, overlapping binding site, we observed precisely one additional cut on each strand, leading to 1-bp increments in total footprint size. This is a unique molecular signature that conclusively demonstrates independent TF occupancy at consecutive overlapping binding sites.
Altogether, our results revealed unanticipated simplicity in the rules governing TF-DNA interactions. The intrinsic property of high affinity sites to have overlapping, lower affinity sites may provide an energetic ‘sink’ for TF binding that is highly tunable. Remarkably, the overlapping binding sites model provides a single unifying mechanism that explains two seemingly different problems in the field. First, it elucidates how paralogous TFs achieve binding specificity despite recognizing highly similar core motifs. Evolution of flanking sequences that create differential numbers of overlapping sites may be a general mechanism controlling the relative binding of paralogous TFs, many of which are co-expressed in the same cells. Second, the model explains how noncoding variants flanking motifs can influence TF binding by modulating multiple, overlapping sites simultaneously rather than altering single binding sites in isolation. The expanded repertoire of lower affinity sites identified by PADIT-seq thus provides a new framework for interpreting how noncoding variation influences transcriptional regulation and human disease.
Limitations of PADIT-seq:
Detection of PADIT-seq active k-mers depends critically on: (1) the concentrations of TF and nbALFA-T7 RNA polymerase – suboptimal levels reduce the sensitivity of PADIT-seq for detecting lower affinity TF-DNA interactions (data not shown); (2) sequencing depth – inadequate coverage reduces statistical power for detecting lower affinity binding sites; and (3) FDR threshold selection – more stringent cutoffs improve confidence but exclude genuine lower affinity binding sites. Therefore, it is possible that despite enhanced sensitivity over existing methods, PADIT-seq does not capture the complete spectrum of biologically relevant lower affinity TF-DNA interactions.
To mitigate potentially confounding impacts when interrogating all possible 10-bp DNA sequences as candidate TFBS, the flanking nucleotides were intentionally selected to preclude all adjoining 8-mers with HOXD13 and EGR1 PBM E-scores >0.25. This stringent E-score threshold for flanking 8-mers proved prudent given the detected influence of adjacent lower affinity sites. However, this also means that surveying binding landscapes for other TFs using PADIT-seq necessitates consideration of potential impacts of sequences flanking candidate TFBS.
Methods
Cloning DBD and ALFA tag fusions into the pET28 vector:
The low-copy pET28b bacterial expression vector (Millipore Sigma) was used as a template for two high-fidelity Q5 polymerase PCR reactions (New England Biolabs; NEB). Primers ‘pET28_pcr2_FWD’ and ‘pT7LACI_REV’ were used to generate a 2823-bp fragment, and primers ‘T7_Terminator_FWD’ and ‘pET28_pcr1_REV’ were used to generate a 2260-bp fragment (see Tables S1 and S2). The two pET28 amplicons along with DNA sequences coding for DBD and ALFA tag fusions, obtained as an IDT gBlock gene fragment, were combined to construct the plasmid using 3-fragment Gibson assembly (NEB). The reaction mixture was transformed into TOP10 E. coli (Thermo Fisher Scientific). Multiple colonies were Sanger sequenced to identify clones with correct assembly and sequence. The region coding for DBD and ALFA tag fusions, along with the T7 promoter and Terminator, was PCR amplified using NEB Q5 polymerase and primers ‘T7_Promoter_Fwd’ and ‘T7_Terminator_REV’. This ‘pT7-DBD-T7Term’ amplicon was purified with QiaQuick PCR purification kit (Qiagen) and used as a template in an E. coli-based PURExpress in vitro transcription and translation (IVTT) reaction (NEB). Expression of the DBD and ALFA tag fusion protein was validated by SDS-PAGE and western blotting using an anti-FLAG antibody (Millipore Sigma; F3165).
Cloning nbALFA-T7-RNA-Polymerase into the pET28 vector:
DNA coding for nbALFA-T7 RNA Polymerase was ordered as two gBlock gene fragments (IDT), ‘pT7_6His_nbALFA_T7rnaPol’ (747-bp) and ‘T7rnaPol_T7Term’ (2600-bp) (see Tables S1 and S2). These two gBlocks, along with the two pET28 amplicons of lengths 2823-bp and 2260-bp previously generated as described above, were combined to construct the plasmid using 4-fragment Gibson assembly (NEB). Multiple colonies were Sanger sequenced to identify clones with correct assembly and sequence.
Protein purification of nbALFA-T7-RNA-Polymerase from E. coli:
Rosetta 2 (DE3) E. coli cells (Millipore Sigma) were transformed with a sequence-verified pET28b vector encoding N-terminal 6xHis-tagged nbALFA-T7-RNA-Polymerase. Inducing cultures overnight at 18°C with 1 mM IPTG maximized the yield of soluble nbALFA-T7-RNA-Polymerase. Rosetta 2 (DE3) E. coli cells from a 500 mL culture were harvested, resuspended in binding buffer (10 mM sodium phosphate, pH 7.4, 150 mM NaCl, 0.5 mM beta-mercaptoethanol, Triton-X 0.025%) supplemented with a cOmplete, Mini, EDTA-free protease inhibitor cocktail (Millipore Sigma), and lysed by sonication. The clarified lysate was incubated with Ni-NTA agarose resin (Protein Ark) pre-equilibrated in binding buffer for 1 hour at 4°C to allow binding of 6xHis-nbALFA-T7-RNA-Polymerase. The resin was washed 3 times with 20–25 mL binding buffer supplemented with 20 mM imidazole before eluting nbALFA-T7-RNA-Polymerase with binding buffer supplemented with 500 mM imidazole. The eluate was analyzed on SDS-PAGE gels to determine purity. After concentrating the eluate with Amicon Ultra centrifugal filters, purified nbALFA-T7-RNA-Polymerase was aliquoted and stored at −80°C in binding buffer with 50% glycerol. To determine RNA polymerase activity, purified nbALFA-T7-RNA-Polymerase was incubated with rNTPs and a T7 promoter-driven DNA template at 37°C for 4 hrs. Synthesis of the expected RNA product was robust and comparable to that obtained with commercially available T7 RNA polymerase (NEB).
Construction of small-scale PADIT-seq reporter library:
We designed an oligo pool (n = 14), where each oligo contained a 3-bp randomized region, resulting in 64 unique DNA sequences per oligo that served as candidate transcription factor binding sites (TFBS). The oligo pool (IDT) was combined with an Ultramer (‘20bpBC_Bottom’) containing all possible 20-bp DNA sequences as barcodes (BCs) and then double-stranded using one cycle of KAPA HiFi polymerase (Roche). The pGL4.23 plasmid vector backbone (Promega) was PCR amplified in 2 steps with Q5 High-Fidelity 2X Master Mix (NEB). First, the backbone was amplified with primers ‘pGL4.23_FWD’ and ‘pGL4.23_REV’ to exclude the luciferase open reading frame. The resulting amplicon (2359-bp) was then further amplified with primers ‘T7Term_pGL4.23_FWD’ and ‘pGL4.23_REV’ to add a 48-bp T7-terminator DNA sequence as an overlapping region for Gibson assembly, which was performed with the resulting amplicon (2407-bp) and the double-stranded oligo pool mixed in equimolar ratios. Following chemical transformation of OneShot TOP10 cells (n = 8) (Thermo Fisher Scientific), ~150,150 colonies were obtained, equivalent to ~167 barcodes per TFBS. After recovery and 7.5 hours of growth, the cells were maxi-prepped to obtain the small-scale PADIT-seq reporter plasmid library. Correct library assembly was validated by diagnostic PCR and Sanger sequencing of colonies.
Obtaining TFBS-BC pairings in the small-scale PADIT-seq reporter library:
To obtain TFBS-BC pairings, the library was PCR amplified using KAPA HiFi polymerase (Roche) with primers ‘AmpEZ_pGL4.2_Rev_RC’ and ‘#19_MPRA_cDNA_3.0’, which also added partial Illumina adaptors to the amplicon. Final Illumina adaptors and sample barcodes were added by Azenta, where paired-end Illumina sequencing was performed using their ‘AmpliconEZ’ service. The sequencing data were processed using custom scripts to extract TFBS-BC pairings by matching constant flanking regions. Barcodes associated with multiple TFBS were filtered out to obtain one-to-one TFBS-BC pairings.
Construction of the all-10mers PADIT-seq reporter library:
We designed and ordered two Ultramers (IDT), one containing all possible 10-bp DNA sequences as candidate TFBS (‘All10mersTFBS_Top’) and another containing all possible 25-bp DNA sequences to serve as barcodes (‘25bpsBC_Bottom’). The two Ultramers were mixed in an equimolar ratio and double-stranded in a single PCR cycle using KAPA HiFi polymerase (Roche). The pGL4.23 plasmid vector backbone (Promega) was again PCR amplified in two steps with Q5 High-Fidelity 2X Master Mix (NEB). First, the backbone was amplified with primers ‘pGL4.23_FWD’ and ‘pGL4.23_REV’ to exclude the luciferase open reading frame. The resulting amplicon (2359-bp) was then further amplified with primers ‘T7Term_pGL4.23_F_2.0’ and ‘pGL4.23_REV’ to add a 56-bp DNA sequence as an overlapping region for Gibson assembly (NEB), which was performed with the resulting amplicon (2415-bp) and the double-stranded oligo pool mixed in equimolar ratios. Following desalting with a mixed cellulose esters (MCE) hydrophilic membrane (0.025 μm) (Millipore Sigma), the assembled reporter library plasmid was electroporated into E. cloni 10G Supreme Electrocompetent Cells (n = 13 transformations) (Biosearch Technologies). Based on plating experiments, the total number of transformants obtained was estimated to be 110 million, providing an average of ~100 barcodes per TFBS. The transformed cells were recovered, grown for 6.5 hours, and maxi-prepped (Thermo Fisher Scientific) to obtain the complete all-10mers PADIT-seq reporter plasmid library containing over 100 million clones. Correct library assembly was validated by diagnostic PCR and Sanger sequencing of 10 colonies.
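As a quick check on the library complexities quoted for the small-scale and all-10mers libraries, the average number of barcodes per candidate TFBS follows directly from the number of transformants divided by the number of distinct TFBS. The short Python sketch below is illustrative arithmetic only: the function and variable names are hypothetical, and it assumes every TFBS is represented uniformly, which a real library will only approximate.

```python
# Illustrative arithmetic only; names are hypothetical and the inputs are
# the numbers quoted in the text. Uniform TFBS representation is assumed.

def barcodes_per_tfbs(n_transformants, n_tfbs):
    """Average number of barcodes per candidate TFBS."""
    return n_transformants / n_tfbs

# Small-scale library: 14 oligos, each with a 3-bp randomized region.
SMALL_SCALE_TFBS = 14 * 4 ** 3   # 896 candidate TFBS
# All-10mers library: every possible 10-bp sequence.
ALL_10MER_TFBS = 4 ** 10         # 1,048,576 candidate TFBS

small_scale_depth = barcodes_per_tfbs(150_150, SMALL_SCALE_TFBS)     # about 167.6
all_10mers_depth = barcodes_per_tfbs(110_000_000, ALL_10MER_TFBS)    # about 104.9
```

Both values agree with the approximate per-TFBS barcode counts reported for the two libraries.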
Obtaining TFBS-BC pairing in the all-10mers PADIT-seq reporter library:
To obtain TFBS-BC pairings, the all-10mers PADIT-seq reporter library was PCR amplified using KAPA HiFi polymerase. Four forward primers were designed with partial Illumina adapters, 6N randomized bases, and 2-bp staggers (‘All_10mers_LibSeq_F1–4’). These were used individually with a single reverse primer (‘All_10mers_LibSeq_R’) to generate four PCR-1 products of expected sizes 213–219-bp (9 cycles). The four PCR-1 products were then used as template in PCR-2 (5 cycles) with TruSeq indexed primers to attach Illumina sample indexes. This generated four PCR-2 products of expected size 272–278-bp. After PCR amplification, the four products were SPRI cleaned (Beckman Coulter) and analyzed on an Agilent TapeStation to confirm expected sizes. The four indexed libraries were sequenced separately on a NovaSeq6000 (2×150 bp reads). The sequencing data from the four indexed libraries (F1-F4) were combined and processed using custom scripts to extract unique TFBS-BC combinations along with their counts by matching the constant flanking regions. Barcodes unambiguously associated with only one TFBS across all four libraries were classified as ‘single TFBS barcodes’ and retained.
The all-10mers PADIT-seq reporter plasmid library was amplified in four separate PCR reactions (F1-F4) with different TruSeq indexes so that potential PCR-mediated recombination artifacts could be identified. For barcodes associated with multiple TFBS, an initial filter retained only TFBS observed independently in at least two of the four libraries, the rationale being that a TFBS-BC pairing observed in multiple independent libraries is more likely to be genuine than an artifact of PCR-mediated recombination. After this first multi-library filtering step, any barcodes still associated with multiple TFBS were removed entirely to eliminate ambiguities. As an additional filter, barcodes for which the top TFBS had fewer reads than the sum of the discarded TFBS were removed. The ‘single TFBS barcodes’ and the vetted multiple-TFBS barcodes were combined to obtain high-confidence 1:1 TFBS-BC pairs for downstream analysis. This multi-step filtering process thus leveraged the independently prepared sequencing libraries to retain barcode-TFBS pairs reproducibly identified across multiple libraries while discarding incorrect and ambiguous pairings arising from PCR-mediated recombination and other errors.
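The barcode filtering logic described above can be sketched for a single barcode as follows. This Python fragment is a minimal illustration and not the custom scripts used in this study; the function name and the input layout (a mapping from each TFBS to its read counts in the libraries where it was observed) are assumptions made for the example.

```python
# Minimal sketch of the barcode filter; names and data layout are hypothetical.

def resolve_barcode(observations, min_libraries=2):
    """Return the single TFBS paired with one barcode, or None if ambiguous.

    observations: {tfbs: {library_name: read_count}}, listing only the
    libraries (e.g. "F1".."F4") in which that TFBS-BC pair was seen.
    """
    # 'Single TFBS barcodes' are retained as they are.
    if len(observations) == 1:
        return next(iter(observations))
    # Keep only TFBS observed independently in at least `min_libraries` libraries.
    kept = [tfbs for tfbs, libs in observations.items()
            if len(libs) >= min_libraries]
    # Barcodes still tied to several TFBS (or to none) are removed entirely.
    if len(kept) != 1:
        return None
    top = kept[0]
    top_reads = sum(observations[top].values())
    discarded_reads = sum(sum(libs.values())
                          for tfbs, libs in observations.items() if tfbs != top)
    # Remove barcodes whose top TFBS has fewer reads than the discarded TFBS combined.
    if top_reads < discarded_reads:
        return None
    return top
```

For example, a barcode seen with one TFBS in libraries F1 and F2 and with a second TFBS only once in F3 would be assigned to the first TFBS, whereas a barcode seen with two TFBS that each appear in two libraries would be discarded.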
PADIT-seq experiments:
To remove any supercoiling, PADIT-seq reporter libraries were first linearized with DrdI (NEB), which cuts a 12-bp DNA sequence (GACNNNN/NNGTC) only once in the pGL4.23 vector. For every DBD tested, the following 30-μl PURExpress IVTT reactions (NEB) were assembled: 10 μl Solution A, 7.5 μl Solution B, 1 μl murine RNase inhibitor (NEB), 3 μl 100 mM rNTPs, 0.45 μl 1 M magnesium acetate, 3 μl previously purified nbALFA-T7-RNA-Polymerase, ~300 ng linearized PADIT-seq reporter plasmid library, and the ‘pT7-DBD-T7Term’ amplicon. The linearized PADIT-seq reporter plasmid library was mixed with ‘pT7-DBD-T7Term’ amplicons in a 2:1 molar ratio.
For PADIT-seq experiments with HOXD13 and EGR1, the 30-μl PURExpress IVTT reactions were split into three wells, and all subsequent steps were performed separately (3 biological replicates). For PADIT-seq experiments with NKX2.5, TBX5, Pho4 and Cbf1, the PURExpress IVTT reactions were scaled to 50 μl, split into five wells, and all subsequent steps were performed separately (5 biological replicates). We performed a total of 7 control ‘no DBD’ reactions (10 μl each), 3 for the first experiment with HOXD13 and EGR1, and 4 for the second experiment with NKX2.5, TBX5, Pho4 and Cbf1.
cDNA synthesis of PADIT-seq reporter RNAs:
After 4 hours at 37°C, the 10-μl reactions were purified with RNAClean XP (Beckman Coulter) according to manufacturer’s instructions, eluting in 35 μl nuclease-free water. 2 μl barcoded cDNA synthesis primers (each at 0.1 μM final concentration) were added to 18 μl purified RNA, incubated at 75°C for 3 mins, then placed on ice. cDNA was synthesized by adding 10 μl 2X Multiscribe reaction mix (Thermo Fisher Scientific), and incubating at 25°C for 20 minutes, followed by 37°C for 120 minutes. Minus reverse transcriptase controls were performed in parallel. Excess primers were removed from the cDNA:RNA duplexes by adding 5 μl exonuclease I (NEB) and incubating at 37°C for 60 mins, followed by heat inactivation at 80°C for 20 mins. Quantitative PCR was performed to verify degradation of all excess primers and to determine the threshold cycle of sample cDNAs.
PADIT-seq library preparation for Illumina sequencing:
For PADIT-seq experiments with HOXD13 and EGR1 using the small-scale PADIT-seq library, barcoded cDNAs synthesized from reporter RNAs were pooled prior to PCR amplification. The pooled cDNA was amplified in a single PCR reaction using KAPA HiFi polymerase with primers ‘MPRA_AmpliconEZ_FWD’ and ‘MPRA_AmpEZ_REV2.0’. This generated a PCR-1 product that was then used as template for a second PCR with primers ‘#34_MPRA’ and ‘169_TruSeq_Multiplex_220_2’ to attach Illumina adapters and sample barcodes.
For PADIT-seq experiments with HOXD13 and EGR1 using the all-10-mers library, barcoded cDNAs were kept separate and amplified in individual PCR reactions rather than pooled. For PADIT-seq experiments with NKX2.5, TBX5, Pho4 and Cbf1 using the all-10-mers library, barcoded cDNAs from replicates were pooled for each TF. The cDNA from PADIT-seq reporter RNAs was amplified using KAPA HiFi HotStart Polymerase (Roche) with primers ‘MPRA_AmpliconEZ_FWD’ and ‘MPRA_AmpEZ_REV2.0’. This generated PCR-1 products that were then used as template for a second PCR with primers ‘#34_MPRA’ and indexed TruSeq primers to attach Illumina adapters and sample barcodes. This was followed by Illumina sequencing, aiming for >50X coverage for each replicate (sample sequencing statistics in Table S4).
Defining PADIT-seq activity and calling active k-mers:
Barcodes from sequencing libraries were mapped to the associated TFBS based on previously obtained TFBS-BC pairings. Barcode counts per TFBS were obtained for each library and merged into a single data frame. Quality control was performed by generating Pearson correlation heatmaps and principal component analysis (PCA) plots to assess reproducibility between replicates and overall structure of the data. For differential activity analysis, read counts for the DBD of interest and a ‘no DBD’ control, across 3–5 replicates each, were analyzed using DESeq2 40. TFBS significantly bound by the DBD of interest (active k-mers) were identified by applying a false discovery rate (FDR) threshold of 5%.
Extracting 8-mer and 9-mer PADIT-seq activities from the all-10mers PADIT-seq library:
To extract 8-mer PADIT-seq log2(fold change) values, all possible occurrences of each 8-mer in the all-10-mers PADIT-seq library were analyzed. This was done separately for 8-mers at the 3 different positions within the 10-mers, where median log2(fold change) values were retained for each position. Both forward and reverse orientations were analyzed and the orientation with the higher median log2(fold change) was retained for each position within the 10-mers.
To extract EGR1 9-mer PADIT-seq log2(fold change) values, all possible occurrences of each 9-mer in the all-10-mers PADIT-seq library were analyzed. This was done separately for 9-mers at the 2 different positions within the 10-mers, where median log2(fold change) values were retained for each position. Both forward and reverse orientations were analyzed and the orientation with the higher median log2(fold change) was retained for each position within the 10-mers.
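The extraction of shorter k-mer activities from the all-10mers measurements can be sketched as follows. This Python fragment is an illustration under stated assumptions and not the custom code used in this study: the function names are hypothetical, and `log2fc` is assumed to be a dictionary mapping each full-length site (as written on one strand) to its log2(fold change). An 8-mer yields three per-position values and a 9-mer yields two, as described above.

```python
# Minimal sketch; function names and the `log2fc` layout are assumptions.
from itertools import product
from statistics import median

def revcomp(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmer_activity_by_position(kmer, log2fc, full_len=10):
    """Per-position activity of `kmer` embedded in full-length sites.

    For each offset, the median log2(fold change) over all flanking
    nucleotide combinations is computed for both orientations, and the
    higher of the two medians is kept. Returns one value per offset
    (None if no matching site was measured).
    """
    n_flank = full_len - len(kmer)
    result = []
    for offset in range(n_flank + 1):
        medians = []
        for oriented in (kmer, revcomp(kmer)):
            values = []
            for flank in product("ACGT", repeat=n_flank):
                flank = "".join(flank)
                site = flank[:offset] + oriented + flank[offset:]
                if site in log2fc:
                    values.append(log2fc[site])
            if values:
                medians.append(median(values))
        result.append(max(medians) if medians else None)
    return result
```

With `full_len=10`, each 8-mer position aggregates 16 flanking combinations per orientation and each 9-mer position aggregates 4, so a 9-mer is represented by 8 occurrences per orientation in total.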
Comparing PADIT-seq activities between the all-10mers and small-scale libraries:
For each 9-bp TFBS tested in the smaller focused PADIT-seq library (n = 896), the median log2(fold change) value was extracted from the 8 possible occurrences of the 9-bp TFBS in the all-10mers PADIT-seq library.
EGR1 9-mer E-scores:
EGR1 PBM data from UniPROBE were re-analyzed using the Universal PBM Analysis Suite modified to obtain 9-mer instead of 8-mer E-scores 3,4.
Analysis of HT-SELEX data:
Publicly available HOXD13, NKX2.5, TBX5 and EGR1 HT-SELEX sequencing data 8,77 were downloaded and analyzed with default parameters using the R SELEX package 78,79. K-mers were considered enriched if the observed-to-expected counts ratio was >3.
Calculation of ProBound relative affinities:
MotifCentral PWM model numbers 17019, 17845, 16238 and 12718 were used for HOXD13, NKX2.5, TBX5 and EGR1, respectively 50. When PWM models were longer than PADIT-seq active k-mers, we padded the k-mers with N’s to match the lengths of the PWM models. ProBound relative affinities for all possible permutations of N padding around a k-mer were calculated and the highest affinity values were retained for all 8-mers (HOXD13, NKX2.5 and TBX5) and 9-mers (EGR1).
Analysis of ChIP-seq data:
HOXD13, NKX2.5, TBX5 and EGR1 ChIP-seq data were processed using the ReMap database pipeline 80. Briefly, adapters were removed using Trim Galore, trimming reads up to 30-bp. Bowtie2 with options --end-to-end and --sensitive was then used to align all reads on the mouse genome (mm10; HOXD13 and EGR1) or the human genome (hg38; NKX2.5 and TBX5) 81. ChIP-seq peak calls were downloaded from the ReMap database. For every ChIP-seq peak, only the longest consecutive active k-mers were included when comparing the distribution of consecutive active k-mers in ChIP-seq peaks and background genomic regions. Random length-matched genomic intervals were generated using the ‘shuffle’ command in Bedtools 82.
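Identifying the longest stretch of consecutive active k-mers within a peak sequence amounts to a sliding-window scan. The Python sketch below illustrates one way to do this; it is not the code used in this study, the function name is hypothetical, and `active_kmers` is assumed to be a set that already contains both orientations of every active k-mer.

```python
# Minimal sketch; assumes `active_kmers` lists both orientations of each active k-mer.

def longest_active_run(seq, active_kmers, k):
    """Length of the longest run of consecutive (1-bp shifted) active k-mers in seq."""
    best = current = 0
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in active_kmers:
            current += 1
            best = max(best, current)
        else:
            current = 0  # the run is broken by an inactive window
    return best
```

A run of n consecutive active k-mers corresponds to an extended site of k + n - 1 bp, so three overlapping active 8-mers span 10 bp.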
Evolutionary conservation analysis:
PhastCons scores 83 were downloaded in BigWig format from the UCSC genome browser (‘mm10.60way.phastCons60wayGlire.bw’ and ‘hg38.phastCons30way.bw’). For every genomic region of interest, PhastCons scores were mapped to each nucleotide within the region using the Bedtools map command 82. Genomic regions with missing values were filtered out and paired two-sided Wilcoxon tests were performed to statistically compare PhastCons scores of consecutive nucleotides.
Analysis of Pho4 and Cbf1 ChIP-nexus data:
Processed bigwig files were downloaded from the GEO database (GSE207001) 55. Briefly, ChIP-nexus libraries had been sequenced on an Illumina NextSeq 500, followed by adapter trimming and alignment to the Saccharomyces cerevisiae (sacCer3) genome. PCR duplicates were removed via a barcode-aware deduplication strategy, and strand-specific 5’ read end locations were extracted to create strand-separated bigwig files.
For our analysis, we stratified ChIP-nexus peaks by the number of consecutive, overlapping PADIT-seq active 8-mers they contained and analyzed the distribution of 5’ read ends around these sites by combining cuts from 2 biological replicates. For each configuration (TF and number of overlapping sites), we computed the mean number of cuts (5’ read ends) at each nucleotide position relative to the boundaries of the extended sites of occupancy (indicated by blue dashed lines in Figure 3). While λ-exonuclease has been shown to exhibit sequence-dependent biases for G-quadruplex (G4) structures that can impede enzymatic progression 84, this is unlikely to affect our results because only ~1% of the ChIP-nexus peaks contained any G4 sequence identified using default options in the R package ‘pqsfinder’ 85. To eliminate any potential confounding effects, however, we nevertheless filtered out peaks containing predicted G-quadruplexes from our analysis. Statistical significance was assessed using a permutation test in which the ChIP-nexus 5’ cuts from both strands were combined and randomly shuffled 100,000 times to generate a null distribution. 30-bp genomic sequences flanking consecutive overlapping binding sites on either side were tested for significance. P-values were calculated as the proportion of permutations where the shuffled mean count at each position exceeded the observed mean, followed by Benjamini-Hochberg correction for multiple hypothesis testing. Positions with an adjusted P < 0.05 were considered significant cuts above background.
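One possible reading of the permutation test described above is sketched below in Python. It is an illustration only and not the code used in this study. The unit of shuffling is an assumption: here the per-position mean cut profile (both strands combined) is shuffled across positions, and the empirical P-value at each position is the fraction of shuffles in which the shuffled value exceeds the observed one. The function name, the default number of permutations, and the fixed seed are likewise hypothetical choices; multiple-testing correction would be applied to the returned values afterwards.

```python
# Minimal sketch of a position-wise permutation test; the shuffling scheme
# (permuting the mean cut profile across positions) is an assumption.
import random

def permutation_pvalues(mean_cuts, n_perm=10000, seed=0):
    """Empirical one-sided P-value per position of a mean cut profile."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(mean_cuts)
    exceed = [0] * len(mean_cuts)
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # one random permutation of the profile
        for i, observed in enumerate(mean_cuts):
            if shuffled[i] > observed:
                exceed[i] += 1
    return [count / n_perm for count in exceed]
```

Under this scheme a position holding the single largest value in the profile always receives an empirical P-value of 0, so in practice the resolution is bounded by the number of permutations performed.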
Network analysis:
Custom R scripts (see code availability) were written to obtain adjacency matrices from k-mer DNA sequences using the logic described in the main text. The fraction of k-mers within the largest, single connected component was obtained using the ‘igraph’ R package. For network analysis using uPBM data, only TFs for which all top 500 8-mers had PBM E-score > 0.35 were analyzed.
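The graph construction can be illustrated with the following Python sketch, which is not the custom R/igraph code used in this study. It builds the overlap graph from a list of k-mers plus their reverse complements, links two nodes when the (k-1)-bp suffix of one equals the (k-1)-bp prefix of the other, and reports the fraction of nodes in the largest connected component. Treating edges as undirected when computing connectivity is an assumption of this sketch, and the function names are hypothetical.

```python
# Minimal sketch of the k-mer overlap network; edges are treated as undirected.
from collections import defaultdict

def revcomp(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def largest_component_fraction(kmers):
    """Fraction of nodes (k-mers plus reverse complements) in the largest component."""
    nodes = set(kmers) | {revcomp(k) for k in kmers}
    # Index nodes by their (k-1)-bp prefix so overlaps can be looked up directly.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    neighbors = defaultdict(set)
    for node in nodes:
        for other in by_prefix[node[1:]]:   # suffix of node == prefix of other
            if other != node:
                neighbors[node].add(other)
                neighbors[other].add(node)
    # Depth-first search to measure each connected component.
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            current = stack.pop()
            size += 1
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best / len(nodes)
```

Applying the same function to an equally sized set of randomly sampled k-mers gives the kind of null comparison described in the main text.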
Differential PADIT-seq activity between variant alleles:
Variants were predicted to alter TF binding if two conditions were met: 1) the total number of active overlapping k-mers differed between the two alleles, and 2) the absolute difference in the sum of PADIT-seq activities between the alleles was >1. The second condition imposes a minimum effect size: because PADIT-seq activities are log2(fold change) values, it requires that the variant produce a >2-fold change in reporter gene expression relative to the reference allele.
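The two-condition rule can be sketched in Python as follows. This is an illustration and not the code used in this study: the function names are hypothetical, and `activity` is assumed to map every active k-mer (in both orientations) to its PADIT-seq log2(fold change), with inactive k-mers simply absent from the dictionary.

```python
# Minimal sketch of the variant-effect rule; `activity` holds active k-mers only.

def active_windows(seq, activity, k):
    """(position, activity) for every k-bp window of seq that is an active k-mer."""
    hits = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in activity:
            hits.append((i, activity[kmer]))
    return hits

def predicts_differential_binding(ref, alt, activity, k, min_delta=1.0):
    """True if the alleles differ in active k-mer count and summed activity by > min_delta."""
    ref_hits = active_windows(ref, activity, k)
    alt_hits = active_windows(alt, activity, k)
    count_differs = len(ref_hits) != len(alt_hits)
    delta = sum(a for _, a in alt_hits) - sum(a for _, a in ref_hits)
    return count_differs and abs(delta) > min_delta
```

Here `ref` and `alt` are the reference and alternate allele sequences with enough flanking sequence (at least k - 1 bp on each side of the variant) for every k-mer window overlapping the variant to be scored.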
Design of the custom PBM array:
To validate novel binding sequences identified through PADIT-seq, we designed a custom PBM incorporating active k-mers (HOXD13: 8-mers; EGR1: 9-mers) flanked by the same constant sequence contexts used in PADIT-seq (5’ flank: TGGCCTCGGC; 3’ flank: GGAACCTCTA). HOXD13 8-mers were extended with a flanking G nucleotide at both ends, while EGR1 9-mers were preceded by a T nucleotide. The same array was used to validate differential TF binding predictions for selected SNPs (HOXD13: n=280; EGR1: n=287), including all variants (HOXD13: n=42; EGR1: n=84) with significant SNP-SELEX PBS scores (P < 0.01), 50 randomly selected SNPs predicted by MotifBreakR alone to alter TF binding, 50 control SNPs predicted to have no effect on TF binding by any method, and the remaining SNPs predicted by PADIT-seq only to alter TF binding (HOXD13: n=138; EGR1: n=103). All sequences were synthesized as 36-bp oligonucleotide probes with a common 24-bp priming sequence (GTCTGTGTTCCGTTGTCCGTGCTG) appended for use in double-stranding the arrays. Each probe was tested in both orientations with 8 technical replicates, each randomly distributed across the array.
Custom PBM experiment:
Custom PBMs were performed following established protocols 4 with minor modifications. Double-stranded arrays were generated via primer extension using Thermo Sequenase DNA Polymerase (final concentration = 0.109 U μl−1) from Cytiva Life Sciences (Catalog # E79000Y) and Cy3-dUTP for quality control. Arrays were blocked with 2% milk in PBS, then incubated with purified FLAG-tagged HOXD13 or EGR1 (300 nM, 500 nM, and 800 nM concentrations) in binding buffer containing BSA (0.2 mg/ml) and salmon testes DNA (0.3 μg/ml) as non-specific competitors. After washing, bound proteins were detected using Alexa Fluor 488-conjugated anti-FLAG antibody (1:40 dilution; Thermo Fisher Catalog # MA1–142-A488).
For probes designed to test PADIT-seq active k-mers, the custom PBM signal was defined to be the median signal intensity from the 8 technical replicates for the orientation (i.e., forward vs. reverse complement) with higher median signal intensity.
To detect differential binding of HOXD13 and EGR1 to SNPs, each allele was tested in both orientations, with each orientation present in 8 technical replicates per chamber. Both HOXD13 and EGR1 were profiled at 500 nM protein concentration in duplicate, providing a total of 16 measurements per binding site allele to ensure robust statistical power for detecting binding differences. For each variant, we determined differential binding using the orientation that gave us the lower P-value. Differential binding was assessed using a two-sided t-test, followed by Benjamini-Hochberg correction for multiple hypothesis testing.
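For reference, the Benjamini-Hochberg adjustment applied to the per-variant P-values can be written in a few lines. The Python sketch below implements the standard step-up procedure; it is a generic illustration with a hypothetical function name, not the code used in this study.

```python
# Standard Benjamini-Hochberg step-up adjustment (generic illustration).

def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity and a cap of 1.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Variants whose adjusted P-value falls below the chosen threshold would then be called differentially bound.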
Extended Data
Extended Data Figure 1: PADIT-seq demonstrates high reproducibility and correlates with orthogonal binding assays.

(a-b) First experiment with HOXD13, EGR1, and ‘NoDBD’ controls (R1–3). (a) Heatmap showing pairwise Pearson correlations between replicates with unsupervised row and column clustering. (b) PCA plot explaining ~93% of variation. (c) Second experiment with Pho4, Cbf1, TBX5, NKX2.5, and additional NoDBD controls (R4–7). Previous ‘NoDBD’ controls (R1–3) included for comparison. PCA plot explaining ~71% of variation. (d-e) PADIT-seq activities for HOXD13 (d) and EGR1 (e) from the all-10mers library and the small-scale library are compared. Red TFBS are active in both libraries. Black TFBS are not active in either library. Blue TFBS are active only in the small-scale library. (f) PADIT-seq activity from the small-scale library and MITOMI-derived dissociation constants (Kd) for EGR1 are compared. Red TFBS are active, whereas black TFBS are inactive. (g) Comparison of binding preferences measured by uPBM Z-scores (x-axis) and PADIT-seq (y-axis) for 4 human TFs, HOXD13, NKX2.5, TBX5, and EGR1, and 2 S. cerevisiae yeast TFs, Pho4 and Cbf1. Active k-mers are colored red, and inactive k-mers are colored black. (inset) AUROC comparing the ability of uPBM E-scores (orange) and Z-scores (purple) to discriminate between PADIT-seq active and inactive k-mers.
Extended Data Figure 2: Custom PBM confirms PADIT-seq active k-mers represent genuine TF binding sites.

(a) Schematic of custom PBM design showing PADIT-seq active k-mers (8-mers for HOXD13 and 9-mers for EGR1) embedded within constant flanking sequences. FLAG-tagged HOXD13 or EGR1 binding was detected using Alexa Fluor 488 conjugated anti-FLAG antibodies. (b) Scatter plots comparing PADIT-seq activity (y-axis) and custom PBM signal (x-axis) for HOXD13 at three protein concentrations (300 nM, 500 nM, and 800 nM). Points are colored by PADIT-seq false discovery rate (FDR): red (FDR < 0.01), cyan (0.01 ≤ FDR < 0.05), and orange (0.05 ≤ FDR < 0.10). (c) Corresponding analysis for EGR1 at the same three protein concentrations. (d) ROC curves comparing the predictive performance of uPBM E-scores (orange) and PWM scores from FIMO (blue) in distinguishing between PADIT-seq active and inactive k-mers. In cases where PWM models were longer than PADIT-seq active k-mers, we scanned all possible relative positions (registers) of the PWM model against each k-mer and retained the highest affinity score for our analysis. (e) ROC curves comparing the predictive performance of uPBM E-scores (orange) and ProBound-predicted affinities (blue) in distinguishing between PADIT-seq active and inactive k-mers. (insets) Due to the large imbalance between active and inactive k-mers, even seemingly small false positive rates translate to substantial numbers of false predictions. For example, a 5% FPR corresponds to 1,556 false positives for HOXD13 and even a 1% FPR corresponds to 1,305 false positives for EGR1.
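As an illustration of the ROC comparisons in panels (d-e), the sketch below computes an AUROC from the rank-sum statistic and converts a false positive rate into an expected count of false positives. It is a generic calculation, not the authors' analysis code. The function names and all example numbers are hypothetical.

```python
from typing import List


def auroc(scores_active: List[float], scores_inactive: List[float]) -> float:
    """Probability that a random active k-mer outscores a random inactive one.

    Ties contribute 0.5, which is equivalent to the Mann-Whitney U statistic
    divided by the number of active/inactive pairs.
    """
    combined = [(s, 1) for s in scores_active] + [(s, 0) for s in scores_inactive]
    combined.sort(key=lambda t: t[0])
    rank_sum_active = 0.0
    i, n = 0, len(combined)
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_active += avg_rank * sum(label for _, label in combined[i:j])
        i = j
    n_pos, n_neg = len(scores_active), len(scores_inactive)
    u_stat = rank_sum_active - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def expected_false_positives(fpr: float, n_inactive: int) -> int:
    """Number of inactive k-mers called active at a given false positive rate."""
    return round(fpr * n_inactive)


# Hypothetical predictor scores for active and inactive k-mers.
example_auroc = auroc([0.9, 0.8, 0.4], [0.1, 0.2, 0.5])
# Hypothetical class size: 1% FPR over 100,000 inactive k-mers.
example_fp = expected_false_positives(0.01, 100_000)
```

The second helper makes the point of the insets concrete: when inactive k-mers vastly outnumber active ones, a small false positive rate still yields many false calls.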
Extended Data Figure 3: Lower affinity binding sites increase TF genomic occupancy at ChIP-seq and ChIP-nexus peaks.

(a) ROC curves comparing the performance of PADIT-seq, PWM FIMO, and ProBound in distinguishing ChIP-seq peaks from background genomic regions for HOXD13, NKX2.5, TBX5 and EGR1. For HOXD13, in addition to random, length-matched background genomic intervals, false positives were also determined using embryonic forelimb bud ATAC-seq peaks not overlapping HOXD13 ChIP-seq peaks as background sequences. For Pho4 and Cbf1, ROC curves compare the performance of PADIT-seq and PWM FIMO in distinguishing ChIP-nexus peaks from background genomic regions (ProBound motifs were not available). The numbers of foreground and background genomic intervals were equal for all 6 TFs. (b) The sum of PADIT-seq activities of all the active k-mers in ChIP-seq and ChIP-nexus peaks is plotted against the corresponding read counts normalized to peak length for each TF. Pearson correlation coefficients and significance values are shown. (c) Pearson correlation coefficient between normalized ChIP-seq and ChIP-nexus read counts and PADIT-seq predictions when varying the number of top active k-mers included. Red horizontal line indicates the maximum correlation achieved using PADIT-seq. Blue horizontal line shows the correlation achieved using PWM FIMO log-likelihood scores summed across peaks, which yielded higher correlation coefficients than using maximum PWM scores alone.
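The per-peak quantities plotted in panel (b) can be sketched as follows. This is a minimal illustration, assuming a lookup table that maps each active k-mer (in both orientations) to its PADIT-seq activity. The function names, the toy sequence, and the activity values are hypothetical.

```python
from typing import Dict


def summed_activity(sequence: str, active_kmers: Dict[str, float], k: int) -> float:
    """Sum the activities of all active k-mers found by sliding a window of size k."""
    total = 0.0
    for start in range(len(sequence) - k + 1):
        # Inactive k-mers are absent from the table and contribute zero.
        total += active_kmers.get(sequence[start:start + k], 0.0)
    return total


def length_normalized_reads(read_count: int, peak_length: int) -> float:
    """Read count divided by peak length, as used for the y-axis comparison."""
    return read_count / peak_length


# Hypothetical 4-mer activity table and peak sequence.
toy_table = {"ACGT": 1.5, "CGTA": 0.5}
toy_score = summed_activity("ACGTACGT", toy_table, k=4)
```

Because every window is scored, overlapping active k-mers each add to the total, which is the property the overlapping binding model relies on.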
Extended Data Figure 4: ChIP-seq and ChIP-nexus peaks have significantly more consecutive active k-mers, irrespective of how background genomic sequences were defined.

(a) Histograms showing the distribution of consecutive active k-mers in peaks (red) for six TFs: HOXD13, NKX2.5, TBX5, and EGR1 (ChIP-seq), and Pho4 and Cbf1 (ChIP-nexus). Background regions were generated by selecting genomic sequences flanking each ChIP peak. Statistical significance was determined by two-sided Wilcoxon rank sum tests. For EGR1, the large effect size resulted in a P-value below computational precision. (b) The vast majority of binding sites are detected with high statistical confidence. Across all six TFs, 56–76% of active k-mers are found at FDR < 0.01.
Extended Data Figure 5: ChIP-seq and ChIP-nexus peaks are significantly enriched for consecutive active k-mers irrespective of FDR threshold.

Distribution of consecutive active k-mers within ChIP-seq and ChIP-nexus peaks (red) compared to random, length-matched genomic regions (black) for each TF at three different FDR thresholds, 1% (left panels), 5% (middle panels), and 10% (right panels). Statistical significance was determined by two-sided Wilcoxon rank sum tests.
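One way to count consecutive active k-mers in a genomic interval is sketched below. It assumes that "consecutive" means active k-mers at adjacent 1-bp offsets and that the active set was already defined at a chosen FDR threshold. The function name and the toy inputs are hypothetical.

```python
from typing import Set


def longest_active_run(sequence: str, active: Set[str], k: int) -> int:
    """Length of the longest run of active k-mers at adjacent 1-bp offsets."""
    best = current = 0
    for start in range(len(sequence) - k + 1):
        if sequence[start:start + k] in active:
            current += 1
            best = max(best, current)
        else:
            current = 0  # an inactive k-mer breaks the run
    return best


# Hypothetical active 4-mers; in practice the set would change with the FDR cutoff.
toy_active = {"ACGT", "CGTA", "GTAC"}
toy_run = longest_active_run("ACGTACGT", toy_active, k=4)
```

Repeating the count with active sets defined at 1%, 5%, and 10% FDR would give the three distributions compared in this figure.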
Extended Data Figure 6: Minor groove width (MGW) and propeller twist (ProT) at the extended recognition sequences bound by TFs are distinct from flanking genomic regions.

(a-b) Predicted MGW (a) and ProT (b) are shown for genomic regions containing consecutive active k-mers within ChIP-seq peaks (red) and their 4-bp flanking regions (blue): HOXD13 (13–15 bp with consecutive active 8-mers), NKX2.5 (11–13 bp with consecutive active 8-mers), TBX5 (10–12 bp with consecutive active 8-mers), and EGR1 (11–13 bp with consecutive active 9-mers). Adjusted P-values < 0.05 from paired two-sided Wilcoxon rank sum tests are indicated by *.
Extended Data Figure 7: ChIP-nexus footprint expansion persists after controlling for core motif positioning.

(a) For Pho4, the increased number of significant cuts and expanded footprint size observed for 5 overlapping binding sites is not because CACGTG occurs at more variable registers. Strand-specific distribution of ChIP-nexus 5’ends (red: positive strand; black: negative strand) averaged across peaks containing exactly 5 consecutive overlapping active 8-mers for Pho4. (b) For Cbf1, an expanded footprint size was observed even after constraining the position of CACGTG at the 3rd register. Strand-specific distribution of ChIP-nexus 5’ends (red: positive strand; black: negative strand) averaged across peaks containing exactly 3 (top) or 5 (bottom) consecutive overlapping active 8-mers with CACGTG constrained to be exclusively at the 3rd register. Peaks with 4 consecutive overlapping binding sites are not included because no significant cuts above background were observed due to low statistical power, which makes it difficult to objectively determine the size of footprints. (a-b) Dashed blue lines indicate the boundaries of the consecutive overlapping binding sites. Letters indicate cuts occurring significantly above background in the flanking 15-bp genomic regions (permutation test adjusted P < 0.05). Sequence logos above each plot show the relative frequency at which the 4 nucleotides occur at each position in the genomic sequences containing the indicated category of ChIP-nexus peaks. For these sequence logos, the y-axis represents information content ranging from a minimum of 0 to a maximum of 2.
Extended Data Figure 8: Overlapping binding sites additively increase TF occupancy in vitro.

(a) The fraction of HT-SELEX reads (y-axis) with consecutive overlapping PADIT-seq active k-mers after 0–4 rounds of selection (x-axis). (b) Box plots showing uPBM signal intensities for 60-bp probes (n ≈ 42,000) categorized by the number of consecutive overlapping active k-mers. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. Two-sided Wilcoxon tests for all pairwise comparisons have an adjusted P < 0.05 (not indicated). (c) HOXD13, NKX2.5, and TBX5 10-mer PADIT-seq activity versus mean uPBM E-scores of constituent 8-mers, centered around a fixed 8-mer in the center (green). EGR1 PADIT-seq activity of 10-mers containing ‘GCGTGGGTG’ (green) versus uPBM E-scores of constituent 9-mers. (d) For all the PADIT-seq active HOXD13, NKX2.5, and TBX5 8-mers: distribution of Pearson correlation coefficients between 10-mer PADIT-seq activities and mean uPBM E-scores of constituent 8-mers. For all the PADIT-seq active EGR1 9-mers: correlations between 10-mer PADIT-seq activity and uPBM E-scores of constituent 9-mers.
Extended Data Figure 9: PADIT-seq outperforms MotifBreakR in predicting variant effects on TF binding and gene expression.

(a-b) PADIT-seq outperforms MotifBreakR in predicting SNP effects on TF binding. Representative variants, rs62523478 (a) and rs2914146 (b), whose effects on HOXD13 and EGR1 binding, respectively, cannot be explained by PWM models. (Right) Box plots show custom PBM signals from 16 probes for the reference and alternate alleles. (Left) PADIT-seq k-mers tiled across the reference and alternate alleles. Red points represent k-mers that are active in both alleles; green points represent k-mers that are uniquely active in the allele with higher TF binding. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. (c) Comparison of PADIT-seq and MotifBreakR predictions with ADASTRA EGR1 ChIP-seq allelic skew measurements. (d-f) Noncoding variants alter multiple overlapping EGR1 binding sites to influence gene expression. (d) Allelic skew in MPRA activity of 149 SNPs with differential EGR1 PADIT-seq activity. Shapes represent the different studies from which allelic skew in MPRA activity was obtained. Red shapes correspond to concordant directions of effect between differential EGR1 PADIT-seq activity and MPRA allelic skew. (e) Euler diagram of variants with MPRA allelic skew predicted by MotifBreakR to alter EGR1 binding (green) and with differential PADIT-seq activity (blue). (Bottom) Box plots comparing MotifBreakR (left) and PADIT-seq (right) effect sizes. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. * indicates two-sided Wilcoxon rank sum test P-value < 0.05. (f) Number of active overlapping 9-mers altered by the 149 variants with differential PADIT-seq activity. Brown variants are predicted by MotifBreakR to alter EGR1 binding, whereas blue variants are not.
Extended Data Figure 10: ‘Weavability’ of binding sites is an inherent property of TFs from different DBD classes.

(a) Schema to demonstrate the logic of network construction. All incoming and outgoing edges to and from the HOXD13 active 8-mer ‘ACTTTACT’ are shown. Active 8-mers are colored red, inactive 8-mers are colored black. Edges occur between active k-mers only. (b) (left) Network representation of HOXD13 active 8-mers and reverse complements (n = 3,536), connected by directed edges (arrows not shown). 3,446 out of 3,536 nodes (97.5%) form the largest, single connected component. (right) HOXD13 PADIT-seq activity of active 8-mers is plotted against the total number of incoming and outgoing edges per node. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. * Adjusted P < 0.05, two-sided Wilcoxon tests. (c) (left) Network representation of randomly selected 8-mers and reverse complements (n = 3,536) connected by directed edges (arrows not shown). Only 7 out of 3,536 nodes (0.2%) form the largest, single connected component. (right) HOXD13 PADIT-seq activity of the 3,536 random 8-mers is plotted against the total number of incoming and outgoing edges per node. (d-h) Network representations and activity distributions for NKX2.5, TBX5, EGR1, Pho4, and Cbf1. For each TF, the fraction of active k-mers in the largest, single connected component is indicated. (i) Among the top 500 uPBM 8-mers and reverse complements, the fraction of nodes in the largest, single connected component is plotted for 200 TFs from 9 different families of DBDs. 1,000 random samplings of 500 8-mers and reverse complements are also shown, and were used to perform the permutation test.
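The network analysis can be illustrated with a minimal sketch. The edge rule used here is an assumption made for illustration, since the legend does not define it: a directed edge runs from one active k-mer to another when the last k−1 bases of the first equal the first k−1 bases of the second. The function names and the toy 5-mers are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(kmer: str) -> str:
    """Reverse complement of a DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def build_network(active_kmers: Iterable[str]) -> Dict[str, List[str]]:
    """Directed edges u -> v when u shifted by 1 bp overlaps v (assumed rule)."""
    kmers = list(active_kmers)
    nodes: Set[str] = set(kmers) | {revcomp(k) for k in kmers}
    by_prefix: Dict[str, List[str]] = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    # Self-loops (e.g. homopolymers) are excluded.
    return {u: [v for v in by_prefix.get(u[1:], []) if v != u] for u in nodes}


def largest_component_fraction(edges: Dict[str, List[str]]) -> float:
    """Fraction of nodes in the largest weakly connected component."""
    neighbors: Dict[str, Set[str]] = {u: set() for u in edges}
    for u, targets in edges.items():
        for v in targets:
            neighbors[u].add(v)
            neighbors[v].add(u)  # ignore direction for connectivity
    seen: Set[str] = set()
    largest = 0
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        largest = max(largest, size)
    return largest / len(neighbors) if neighbors else 0.0


# Hypothetical active 5-mers; reverse complements are added as nodes.
toy_edges = build_network(["ACGTA", "CGTAC", "TTTTT"])
toy_fraction = largest_component_fraction(toy_edges)
```

In this toy case four of the six nodes chain together through 1-bp shifts, while the two homopolymers remain isolated.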
Acknowledgments:
We thank Steve Gisselbrecht for help with calculating PBM 9-mer E-scores for EGR1. We thank Kevin Struhl, Luca Mariani and Xiao Katrina Liu for critical reading of the manuscript, and members of the Bulyk lab for helpful discussions. This work was supported by grants from the National Institutes of Health (R21 HG010200 and R01 HG010501 to M.L.B.; K99 HG013675 to S.K.) and the American Heart Association (24POST1192017 to S.K.).
Footnotes
Competing interests: The authors declare that they have no competing interests.
Additional information: Correspondence should be addressed to Martha L. Bulyk.
Data availability:
PADIT-seq data have been deposited in the GEO database under accession number GSE250601. MITOMI Kd data for EGR1 were downloaded from Supporting Information of reference 41. The UniPROBE database (http://thebrain.bwh.harvard.edu/pbms/UniPROBE_staging/browse.php) was used to access uPBM data 74,75. HT-SELEX data were downloaded using the ENA accession code ERP001826. ChIP-seq and ChIP-nexus data analyzed in the paper were downloaded using accession codes GSE81356, GSE89457, GSE85628, GSE67482, and GSE207001. BET-seq data were downloaded using the accession code GSE111936. Competition gcPBM data were downloaded from Supplementary Materials of reference 64. Processed SNP-SELEX data were downloaded from Supplementary Data of reference 65. Allele-specific EGR1 ChIP-seq data were downloaded from the ADASTRA database: https://adastra.autosome.org/mabel. Processed MPRA data were downloaded from Supplementary Tables of references 70–73.
Code availability:
Code and processed data for generating the figures are available at https://github.com/BulykLab/PADIT-seq.
References
- 1. Bulyk ML, Gentalen E, Lockhart DJ & Church GM Quantifying DNA-protein interactions by double-stranded DNA arrays. Nat Biotechnol 17, 573–577 (1999).
- 2. Mukherjee S et al. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat Genet 36, 1331–1339 (2004).
- 3. Berger MF et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat Biotechnol 24, 1429–1435 (2006).
- 4. Berger MF & Bulyk ML Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat Protoc 4, 393–411 (2009).
- 5. Badis G et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).
- 6. Weirauch MT et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
- 7. Jolma A et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res 20, 861–873 (2010).
- 8. Jolma A et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
- 9. Driever W, Thoma G & Nüsslein-Volhard C Determination of spatial domains of zygotic gene expression in the Drosophila embryo by the affinity of binding sites for the bicoid morphogen. Nature 340, 363–367 (1989).
- 10. Gaudet J & Mango SE Regulation of organogenesis by the Caenorhabditis elegans FoxA protein PHA-4. Science 295, 821–825 (2002).
- 11. Rowan S et al. Precise temporal control of the eye regulatory gene Pax6 via enhancer-binding site affinity. Genes Dev 24, 980–985 (2010).
- 12. Crocker J et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160, 191–203 (2015).
- 13. Farley EK et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
- 14. Zandvakili A, Campbell I, Gutzwiller LM, Weirauch MT & Gebelein B Degenerate Pax2 and Senseless binding motifs improve detection of low-affinity sites required for enhancer specificity. PLoS Genet 14, e1007289 (2018).
- 15. Tanay A Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res 16, 962–972 (2006).
- 16. Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U & Gaul U Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451, 535–540 (2008).
- 17. Giorgetti L et al. Noncooperative interactions between transcription factors and clustered DNA binding sites enable graded transcriptional responses to environmental inputs. Mol Cell 37, 418–428 (2010).
- 18. Horton CA et al. Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science 381, eadd1250 (2023).
- 19. Lim F et al. Affinity-optimizing enhancer variants disrupt development. Nature 626, 151–159 (2024).
- 20. Bartlett A et al. Mapping genome-wide transcription-factor binding sites using DAP-seq. Nat Protoc 12, 1659–1672 (2017).
- 21. Stormo GD, Zuo Z & Chang YK Spec-seq: determining protein-DNA-binding specificity by sequencing. Briefings in Functional Genomics 14, 30–38 (2015).
- 22. Fordyce PM et al. De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nat Biotechnol 28, 970–975 (2010).
- 23. Isakova A et al. SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nat Methods 14, 316–322 (2017).
- 24. Meng X, Brodsky MH & Wolfe SA A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nat Biotechnol 23, 988–994 (2005).
- 25. Stringham JL, Brown AS, Drewell RA & Dresch JM Flanking sequence context-dependent transcription factor binding in early Drosophila development. BMC Bioinformatics 14, 298 (2013).
- 26. Levo M et al. Unraveling determinants of transcription factor binding outside the core binding site. Genome Res 25, 1018–1029 (2015).
- 27. Dror I, Golan T, Levy C, Rohs R & Mandel-Gutfreund Y A widespread role of the motif environment in transcription factor binding across diverse protein families. Genome Res 25, 1268–1280 (2015).
- 28. Chaudhari HG & Cohen BA Local sequence features that influence AP-1 cis-regulatory activity. Genome Res 28, 171–181 (2018).
- 29. Cohen DM, Lim H-W, Won K-J & Steger DJ Shared nucleotide flanks confer transcriptional competency to bZip core motifs. Nucleic Acids Res 46, 8371–8384 (2018).
- 30. Le DD et al. Comprehensive, high-resolution binding energy landscapes reveal context dependencies of transcription factor binding. Proc Natl Acad Sci U S A 115, E3702–E3711 (2018).
- 31. Yang MG, Ling E, Cowley CJ, Greenberg ME & Vierbuchen T Characterization of sequence determinants of enhancer function using natural genetic variation. Elife 11, e76500 (2022).
- 32. Reiter F, de Almeida BP & Stark A Enhancers display constrained sequence flexibility and context-specific modulation of motif function. Genome Res 33, 346–358 (2023).
- 33. Rudnizky S et al. Single-molecule DNA unzipping reveals asymmetric modulation of a transcription factor by its binding site sequence and context. Nucleic Acids Res 46, 1513–1524 (2018).
- 34. Gordân R et al. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep 3, 1093–1104 (2013).
- 35. Aditham AK, Shimko TC & Fordyce PM BET-seq: Binding energy topographies revealed by microfluidics and high-throughput sequencing. Methods Cell Biol 148, 229–250 (2018).
- 36. Jung C et al. True equilibrium measurement of transcription factor-DNA binding affinities using automated polarization microscopy. Nat Commun 9, 1605 (2018).
- 37. Aditham AK, Markin CJ, Mokhtari DA, DelRosso N & Fordyce PM High-Throughput Affinity Measurements of Transcription Factor and DNA Mutations Reveal Affinity and Specificity Determinants. Cell Syst 12, 112–127.e11 (2021).
- 38. Götzke H et al. The ALFA-tag is a highly versatile tool for nanobody-based bioscience applications. Nat Commun 10, 4403 (2019).
- 39. Hussey BJ & McMillen DR Programmable T7-based synthetic transcription factors. Nucleic Acids Res 46, 9842–9854 (2018).
- 40. Love MI, Huber W & Anders S Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014).
- 41. Geertz M, Shore D & Maerkl SJ Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proc Natl Acad Sci U S A 109, 16540–16545 (2012).
- 42. Toko H et al. Csx/Nkx2–5 is required for homeostasis and survival of cardiac myocytes in the adult heart. J Biol Chem 277, 24735–24743 (2002).
- 43. Moskowitz IPG et al. The T-Box transcription factor Tbx5 is required for the patterning and maturation of the murine cardiac conduction system. Development 131, 4107–4116 (2004).
- 44. Zhou X & O’Shea EK Integrated approaches reveal determinants of genome-wide binding and function of the transcription factor Pho4. Mol Cell 42, 826–836 (2011).
- 45. Ogawa N & Oshima Y Functional domains of a positive regulatory protein, PHO4, for transcriptional control of the phosphatase regulon in Saccharomyces cerevisiae. Mol Cell Biol 10, 2224–2236 (1990).
- 46. Cai M & Davis RW Yeast centromere binding protein CBF1, of the helix-loop-helix protein family, is required for chromosome stability and methionine prototrophy. Cell 61, 437–446 (1990).
- 47. Payne JL & Wagner A The robustness and evolvability of transcription factor binding sites. Science 343, 875–877 (2014).
- 48. Jaeger SA et al. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 95, 185–195 (2010).
- 49. Grant CE, Bailey TL & Noble WS FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
- 50. Rube HT et al. Prediction of protein-ligand binding affinity from sequencing data with interpretable machine learning. Nat Biotechnol 40, 1520–1527 (2022).
- 51. Sheth R et al. Distal Limb Patterning Requires Modulation of cis-Regulatory Activities by HOX13. Cell Rep 17, 2913–2926 (2016).
- 52. Sun Z et al. EGR1 recruits TET1 to shape the brain methylome during development and upon neuronal activity. Nat Commun 10, 3892 (2019).
- 53. Anderson DJ et al. NKX2–5 regulates human cardiomyogenesis via a HEY2 dependent transcriptional network. Nat Commun 9, 1373 (2018).
- 54. Ang Y-S et al. Disease Model of GATA4 Mutation Reveals Transcription Factor Cooperativity in Human Cardiogenesis. Cell 167, 1734–1749.e22 (2016).
- 55. Alexandari AM et al. De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding. bioRxiv 2023.05.11.540401 (2023) doi: 10.1101/2023.05.11.540401.
- 56. Markstein M, Markstein P, Markstein V & Levine MS Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A 99, 763–768 (2002).
- 57. Lifanov AP, Makeev VJ, Nazina AG & Papatsenko DA Homotypic regulatory clusters in Drosophila. Genome Res 13, 579–588 (2003).
- 58. Gotea V et al. Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res 20, 565–577 (2010).
- 59. Rohs R et al. The role of DNA shape in protein-DNA recognition. Nature 461, 1248–1253 (2009).
- 60. Li J, Chiu T-P & Rohs R Predicting DNA structure using a deep learning method. Nat Commun 15, 1243 (2024).
- 61. He Q, Johnston J & Zeitlinger J ChIP-nexus enables improved detection of in vivo transcription factor binding footprints. Nat Biotechnol 33, 395–401 (2015).
- 62. De Masi F et al. Using a structural and logics systems approach to infer bHLH-DNA binding specificity determinants. Nucleic Acids Res 39, 4553–4563 (2011).
- 63. Grove CA et al. A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell 138, 314–327 (2009).
- 64. Zhang Y, Ho TD, Buchler NE & Gordân R Competition for DNA binding between paralogous transcription factors determines their genomic occupancy and regulatory functions. Genome Res 31, 1216–1229 (2021).
- 65. Yan J et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 591, 147–151 (2021).
- 66. Coetzee SG, Coetzee GA & Hazelett DJ motifbreakR: an R/Bioconductor package for predicting variant effects at transcription factor binding sites. Bioinformatics 31, 3847–3849 (2015).
- 67. Landrum MJ et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42, D980–985 (2014).
- 68. Lettice LA et al. Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly. Proc Natl Acad Sci U S A 99, 7548–7553 (2002).
- 69. Abramov S et al. Landscape of allele-specific transcription factor binding in the human genome. Nat Commun 12, 2751 (2021).
- 70. Tewhey R et al. Direct Identification of Hundreds of Expression-Modulating Variants using a Multiplexed Reporter Assay. Cell 165, 1519–1529 (2016).
- 71. Khetan S et al. Functional characterization of T2D-associated SNP effects on baseline and ER stress-responsive β cell transcriptional activation. Nat Commun 12, 5242 (2021).
- 72. Abell NS et al. Multiple causal variants underlie genetic associations in humans. Science 375, 1247–1254 (2022).
- 73. McAfee JC et al. Systematic investigation of allelic regulatory activity of schizophrenia-associated common variants. Cell Genom 3, 100404 (2023).
- 74. Newburger DE & Bulyk ML UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res 37, D77–82 (2009).
- 75. Hume MA, Barrera LA, Gisselbrecht SS & Bulyk ML UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res 43, D117–122 (2015).
- 76. Shahein A et al. Systematic analysis of low-affinity transcription factor binding site clusters in vitro and in vivo establishes their functional relevance. Nat Commun 13, 5273 (2022).
Additional references associated with Methods
- 77. Yin Y et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
- 78. Slattery M et al. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147, 1270–1282 (2011).
- 79. Riley TR et al. SELEX-seq: a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. Methods Mol Biol 1196, 255–278 (2014).
- 80. Hammal F, de Langen P, Bergon A, Lopez F & Ballester B ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res 50, D316–D325 (2022).
- 81. Langmead B & Salzberg SL Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012).
- 82. Quinlan AR & Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
- 83. Siepel A et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050 (2005).
- 84. Foulk MS, Urban JM, Casella C & Gerbi SA Characterizing and controlling intrinsic biases of lambda exonuclease in nascent strand sequencing reveals phasing between nucleosomes and G-quadruplex motifs around a subset of human replication origins. Genome Res 25, 725–735 (2015).
- 85. Hon J, Martínek T, Zendulka J & Lexa M pqsfinder: an exhaustive and imperfection-tolerant search tool for potential quadruplex-forming sequences in R. Bioinformatics 33, 3373–3379 (2017).
