Prediction of protein–ligand binding affinity from sequencing data with interpretable machine learning

H Tomas Rube; Chaitanya Rastogi; Siqian Feng; Judith F Kribelbauer; Allyson Li; Basheer Becerra; Lucas A N Melo; Bach Viet Do; Xiaoting Li; Hammaad H Adam; Neel H Shah; Richard S Mann; Harmen J Bussemaker

doi:10.1038/s41587-022-01307-0

2022 May 23;40(10):1520–1527. doi: 10.1038/s41587-022-01307-0

PMCID: PMC9546773 PMID: 35606422

Abstract

Protein–ligand interactions are increasingly profiled at high throughput using affinity selection and massively parallel sequencing. However, these assays do not provide the biophysical parameters that most rigorously quantify molecular interactions. Here we describe a flexible machine learning method, called ProBound, that accurately defines sequence recognition in terms of equilibrium binding constants or kinetic rates. This is achieved using a multi-layered maximum-likelihood framework that models both the molecular interactions and the data generation process. We show that ProBound quantifies transcription factor (TF) behavior with models that predict binding affinity over a range exceeding that of previous resources; captures the impact of DNA modifications and conformational flexibility of multi-TF complexes; and infers specificity directly from in vivo data such as ChIP-seq without peak calling. When coupled with an assay called K_D-seq, it determines the absolute affinity of protein–ligand interactions. We also apply ProBound to profile the kinetics of kinase–substrate interactions. ProBound opens new avenues for decoding biological networks and rationally engineering protein–ligand interactions.

Subject terms: Kinases, High-throughput screening, Machine learning, DNA methylation, Transcriptional regulatory elements

Protein–ligand binding affinity is predicted quantitatively from sequencing data.

Main

Critical cellular processes, such as gene regulation and signal transduction, rely on sequence-specific molecular recognition to guide constituent proteins to preferentially interact with specific nucleic acid or polypeptide ligands. The strength and specificity of such ‘sequence recognition’ often span orders of magnitude, and even weak ligands can be functional^1–3. Thus, it is essential to comprehensively and quantitatively profile sequence recognition to decode these molecular networks.

Massively parallel sequencing has substantially increased the speed with which sequence recognition can be profiled. In particular, high-throughput methods that couple sequencing with in vitro selection on random ligand pools have emerged as powerful tools for the unbiased profiling of molecular interactions. These include SELEX methods for TFs^4–14 and RNA-binding proteins^15,16 as well as protein display methods for proteases¹⁷ and T cell receptors¹⁸. As the randomized ligand pools used in these assays are extremely complex (and most sequences are observed rarely, if ever), machine learning methods have become essential for synthesizing sequencing data into ‘recognition models’ that encode how any sequence is recognized.

In recent years, several methods (using deep learning^19–21, probabilistic mixture models²² or high-dimensional embedding²³) have been developed to analyze TF:DNA binding data. However, although protein interactions are most rigorously quantified in terms of biophysical parameters such as dissociation constants (K_D), most of these methods either classify sequences as bound or free or assign non-biophysical binding scores. Although some biophysical methods have been developed^24,25, they are limited to estimating relative K_D values for TFs and cannot systematically model SELEX enrichment over multiple rounds. Furthermore, although new assays have been developed to profile in vivo effects beyond direct sequence recognition^9,12,13,26, no current computational method can synthesize such complementary experiments into a unified binding model that captures the impact of co-factors and DNA methylation.

In this study, we solve these problems with a flexible machine learning framework, called ProBound, which is capable of learning biophysically interpretable models by synthesizing a wide range of sequencing data. Although we set out to analyze multi-round SELEX data, we soon realized that ProBound enabled the development of sequencing assays that probe previously inaccessible biophysical parameters. To illustrate this, we introduce K_D-seq (which measures absolute K_D values using the input, bound and unbound SELEX fractions) and Kinase-seq (which profiles kinase substrate specificity using a multi-time-point protein display assay). More broadly, our results illustrate how classical biochemical assays, which often use multiple fractions, time points or concentrations, can be upgraded with sequencing and principled machine learning to conduct biophysical measurements at unprecedented scale.

ProBound framework

ProBound uses three layers to systematically model multi-library sequencing data (Fig. 1 and Methods): a binding layer predicts the binding free energy or enzymatic efficiency from sequence using a sequence recognition model; an assay layer encodes the selection steps that generated the libraries and predicts frequencies of all ligands; and a sequencing layer models the stochastic sampling of the libraries during sequencing. These layers are combined in a likelihood function, which is optimized to infer the recognition model. Although many ligands have noisy counts or are entirely missing due to the complexity of randomized libraries, the final recognition model is robust because it has to optimally explain the full sequencing dataset. Each layer is easily extensible; for example, the binding layer, which, by default, corresponds to a position-specific affinity matrix²⁷, can be extended to include base–base interactions or cooperative binding by multiple TFs. Flexibility in the assay layer enables the modeling of alternative processes, such as enzymatic modification. Finally, multiple assays can be analyzed jointly to profile more complex phenomena (for example, methylation sensitivity).
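The three-layer structure described above can be illustrated with a toy example. The following is a minimal sketch, not ProBound's implementation: it assumes energies in units of RT, a forward-strand-only sliding window, a single selection round in the low-saturation limit, and a multinomial sequencing model. All function names, the 2-bp energy matrix and the read counts are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as an (L, 4) one-hot array."""
    idx = [BASES.index(b) for b in seq]
    return np.eye(4)[idx]

def binding_layer(seq, ddg):
    """Relative affinity: sum of exp(-ddG/RT) over all forward-strand windows."""
    x = one_hot(seq)
    w = ddg.shape[0]
    energies = [np.sum(x[i:i + w] * ddg) for i in range(len(seq) - w + 1)]
    return float(np.sum(np.exp(-np.array(energies))))

def assay_layer(affinities, input_freq):
    """One selection round without saturation: f_bound is proportional to f_input * affinity."""
    unnorm = input_freq * affinities
    return unnorm / unnorm.sum()

def sequencing_layer(counts, pred_freq):
    """Multinomial log-likelihood of the observed counts (up to an additive constant)."""
    return float(np.sum(counts * np.log(pred_freq)))

# Hypothetical 2-bp energy matrix (rows: positions, columns: A, C, G, T) favoring "AC".
ddg = np.array([[0.0, 1.0, 1.0, 1.0],
                [1.0, 0.0, 1.0, 1.0]])
probes = ["ACG", "AAG", "GTT"]
input_freq = np.full(len(probes), 1.0 / len(probes))
counts = np.array([55, 33, 12])  # made-up read counts for the selected library

affinities = np.array([binding_layer(s, ddg) for s in probes])
pred_freq = assay_layer(affinities, input_freq)
loglik_model = sequencing_layer(counts, pred_freq)
loglik_uniform = sequencing_layer(counts, input_freq)
```

In a full analysis, the energy matrix would be the quantity optimized to maximize the likelihood; here it is fixed so that the flow from sequence to predicted frequency to likelihood is easy to follow.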

A compendium of accurate TF binding models

Our initial objective was to analyze thousands of published SELEX datasets^7,8,10,12,13,28–30 and produce high-quality TF binding models that capture low-affinity binding, an important yet difficult-to-detect gene regulatory phenomenon^1–3,25. This required us to quantify TF sequence recognition over a wide affinity range rather than merely classify sequences as ‘bound’ or ‘unbound’. We, therefore, assembled a training database of published SELEX experiments, which we analyzed with a uniform computational pipeline, yielding 1,632 binding models (Fig. 2a, Supplementary Table 1 and Methods). To assess the generalization performance of our models, we linked each TF to published protein-binding microarray (PBM), chromatin immunoprecipitation with sequencing (ChIP-seq) and non-training SELEX data. We computed three complementary performance metrics: meaningful affinity fold range (MAFR), a metric that provides a conservative bound on the ability of a model to detect low-affinity binding; R², the fraction of signal variance explained by the model; and area under the precision-recall curve (AUPRC), a common metric^19,20,25,31 for quantifying how well a model classifies genomic regions as bound or unbound as determined by ChIP-seq peaks³². We used these metrics to benchmark our models against those in major resources and surveys, linking all JASPAR³³, DeepBind¹⁹, HOCOMOCO³⁴, Jolma et al.²⁸ and recently published DeepSELEX²⁰ models by TF. On average, ProBound outperformed these resources across all metrics (Fig. 2b), with the PBM and SELEX metrics displaying the largest improvement. Two comparisons (HOCOMOCO ChIP-seq AUPRC and DeepBind SELEX R²) showed no significant difference. The less notable improvement in AUPRC is likely due to bias toward high-affinity sequences in ChIP-seq peaks, for which accurate low-affinity predictions are less relevant²⁵. Below, we introduce an alternative method for analyzing ChIP-seq data that eliminates the need for ChIP-seq peak discovery.

Fig. 2 — a, Breakdown of the training dataset used to build binding models by originating study and TF family (pie charts) and by availability of testing data used to evaluate them (Venn diagram). Representative SELEX (top) and PBM (middle) comparisons of observed and model-predicted binding signals used to quantify generalization performance. Each point in the scatterplots corresponds to either 500 SELEX probes or ten PBM probes; green indicates where the model predicts binding above an estimated baseline (Methods), whereas darker points indicate the MAFR of observed binding signal over which, at most, 5% of predicted binding was below the baseline. Representative precision-recall curve (bottom) for the ChIP-seq peak classification task used to quantify model performance in terms of AUPRC (1/3 corresponds to a random classifier). b, Performance comparison of ProBound models versus popular existing resources. For each ProBound and resource model pair (points), the average score was computed for all matching testing datasets. Horizontal bars indicate median performance. Significance was computed using the two-sided Wilcoxon signed-rank test (*** indicates P < 10⁻³).
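The random-classifier baseline quoted in the legend follows from a general property of AUPRC: for uninformative scores, it approaches the fraction of positive examples. A baseline of 1/3 would therefore correspond to one bound region per two unbound regions; that ratio is an inference made here for illustration, not a detail stated in the text. The sketch below uses the step-wise average-precision estimate and simulated labels; the function name and sample sizes are hypothetical.

```python
import numpy as np

def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    order = np.argsort(-np.asarray(scores))
    ranked = np.asarray(labels)[order]
    true_pos = np.cumsum(ranked)
    precision = true_pos / np.arange(1, len(ranked) + 1)
    # Average the precision at the rank of each positive example.
    return float(np.sum(precision * ranked) / ranked.sum())

rng = np.random.default_rng(0)
n_pos, n_neg = 10_000, 20_000  # assumed 1:2 ratio of bound to unbound regions
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

random_ap = average_precision(labels, rng.random(labels.size))  # uninformative scores
perfect_ap = average_precision(labels, labels)                  # scores equal to labels
```

With these settings, `random_ap` lands close to 1/3 and `perfect_ap` equals 1, which brackets the range in which the AUPRC values in Fig. 2b should be read.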

Over the years, several TFs have been assayed many times by different research groups and SELEX platforms. We reasoned that jointly analyzing such data would produce a ‘consensus’ model focused on the true binding signal rather than platform-specific biases (Extended Data Fig. 1a). Such consensus models displayed significantly improved performance when compared to traditional single-experiment models (Extended Data Fig. 1b), indicating that multi-experiment analysis can improve binding predictions.

Extended Data Fig. 1 — (a) Schematic contrasting ProBound’s multi-experiment learning strategy that builds a consensus model for a TF by simultaneously training on all relevant SELEX data for the TF with the traditional approach that builds independent models for every individual dataset. (b) Generalization performance of consensus binding models (y-axis) and single-experiment models (x-axis) on three different metrics (scatterplots). Points correspond to models trained on individual experiments and lines connect experiments used to build the corresponding consensus model. Points above the diagonal correspond to instances where the consensus model outperforms single-experiment models.

To facilitate adoption by other researchers, we have made a curated version of our models, comparative analyses and computational tools readily available through a comprehensive resource at motifcentral.org.

Quantifying TF binding cooperativity

Variables beyond sequence, such as co-factor interactions and DNA methylation, substantially influence TF behavior in vivo, and, therefore, TF binding models must account for them to improve binding predictions. We first focused on co-factors, which modulate TF binding in a cell-type-specific manner. Despite the growing number of SELEX assays characterizing TF complexes^7,9,26, it remains a challenge to quantify sequence recognition in a way that clearly separates the contributions from many potential TF complexes and their various internal structural configurations—a problem that grows exponentially with the number of factors assayed. In an approach that builds upon our multi-experiment framework, we measure subunit binding specificity and cooperativity by explicitly modeling the allowed complexes in multiple SELEX datasets that probe different TF combinations.

We first applied this method on the complex formed by three highly conserved Drosophila homeodomain proteins: Homothorax (Hth), Extradenticle (Exd) and Ultrabithorax (Ubx). Previous studies showed that Ubx and Exd form fixed-spacer heterodimers^8,25 and that Hth uses multiple relative spacings to bind cooperatively with similar heterodimers²⁶. To characterize Hth:Exd:Ubx, we first performed SELEX-seq with all three factors and then analyzed these data in conjunction with our previous monomer and heterodimer data (Fig. 3a and Extended Data Fig. 2a). We modeled the ternary complex with two subunits representing Hth and Exd:Ubx; the total binding energy was the sum of their independent binding specificities and of a cooperativity term that depended on their relative position and orientation.

Extended Data Fig. 2 — (a) Schematic table describing the combinations of TFs assayed in five experiments (top) that were jointly analyzed to produce binding models of the different monomers and their complexes (bottom) by explicitly defining which models can form in each experiment (+ sign). (b) Distribution of probes (top) and the predicted relative contribution of every recognition mode (bottom) as a function of predicted binding selection strength (x-axis) in the first round of selection from SELEX-seq data assaying Hth, Exd, and UbxIV. (c) Integrative modeling of HT-SELEX and CAP-SELEX data for MEIS1 and DLX3 (schematic table) yields binding models for the monomers (energy logos) and configuration-dependent binding cooperativity for the MEIS1:DLX3 complex (same circle plot representation as in Fig. 3b). The bottom right logo shows the specificity of MEIS1:DLX3 for the most stable configuration (connecting arrow), aligned to a sequence previously crystallized with MEIS1:DLX3⁹. (d) Table showing the availability of CAP-SELEX data for different TF-TF combinations. The 10 TFs with the most identified co-factors are included, and numbers indicate replicate count. (e) Distribution plot comparing the binding cooperativity inferred by ProBound at the configurations that were identified as cooperative in the original CAP-SELEX study (red line) and at all other configurations (gray line). The models were trained on the CAP-SELEX data tabulated in (d) and are shown in Extended Data Figure 3.

The resulting model revealed substantial cooperativity (ΔΔG_config ≈ 2RT) when Hth binds 8–13 base pairs (bp) upstream of Exd:Ubx (Fig. 3b), which, along with our monomer and heterodimer models, mirrored previous results^25,26. Although a larger spacing is tolerated when Hth is reversed, cooperativity is lost when Hth binds far away from the Exd:Ubx half-site, regardless of orientation. As expected, selection in the Hth-Exd-Ubx experiment was driven by multiple subcomplexes (Extended Data Fig. 2b), underscoring the need to simultaneously model all preferences.
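The ternary-complex model described above (independent subunit energies plus a configuration-dependent cooperativity term) can be sketched as a Boltzmann sum over subunit placements. This is an illustrative simplification under stated assumptions: energies are in units of RT, only one orientation is considered, stabilizing cooperativity is encoded as a negative energy, and spacings without an entry contribute no cooperativity. The function name and all numerical values are hypothetical.

```python
import numpy as np

def complex_affinity(e_a, e_b, coop):
    """Sum Boltzmann weights over all placements of subunits A and B.

    e_a[i], e_b[j]: binding energies (in RT) of each subunit at each offset.
    coop: dict mapping the spacing (j - i) to a cooperativity energy in RT
          (negative values stabilize the complex; missing spacings contribute 0).
    """
    total = 0.0
    for i, energy_a in enumerate(e_a):
        for j, energy_b in enumerate(e_b):
            omega = coop.get(j - i, 0.0)
            total += np.exp(-(energy_a + energy_b + omega))
    return float(total)

# Toy probe: subunit A binds well at offset 2, subunit B at offset 12 (spacing 10).
e_a = np.full(20, 3.0)
e_a[2] = 0.0
e_b = np.full(20, 3.0)
e_b[12] = 0.0

# Assumed stabilization of 2 RT for spacings of 8 to 13 bp.
coop = {spacing: -2.0 for spacing in range(8, 14)}

affinity_coop = complex_affinity(e_a, e_b, coop)
affinity_indep = complex_affinity(e_a, e_b, {})
single_site = complex_affinity([0.0], [0.0], {0: -2.0})  # equals exp(2)
```

A stabilization of 2 RT multiplies the weight of a configuration by exp(2), roughly 7.4-fold, which is why a handful of favorable spacings can dominate the selection.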

To further validate our approach, we reanalyzed published data⁹ for the human TF heterodimer MEIS1:DLX3 and found strong cooperativity at the exact same configuration (i.e., relative spacing and orientation) previously confirmed⁹ using X-ray crystallography (Extended Data Fig. 2c). Subsequent systematic analysis of data for all pairwise combinations of the top ten most interacting TFs from the same study (Extended Data Fig. 2d) produced binding models with significant cooperativity for previously reported⁹ configurations (Extended Data Fig. 2e; P = 1.5 × 10⁻³⁰, Mann–Whitney test) and provided evidence of cooperativity for many other ones as well (Extended Data Fig. 3).

Extended Data Fig. 3 — Models are displayed as in Extended Data Figure 2c. Red and blue arrows indicate the configurations identified as cooperative in the original analysis of each dataset. These configurations (which correspond to the red line in Extended Data Figure 2e) were identified by aligning the inferred monomer binding modes to the position-probability matrices reported in the original study and selecting the configuration that minimizes the KL divergence.

Learning methylation-aware TF binding models

Next, we focused on another variable affecting in vivo binding: DNA methylation. Chemical modifications to DNA, such as fully methylated CpG dinucleotides (meCpG), are common epigenetic marks that can alter TF binding and, thus, gene regulation^35–38. Unlike existing methods that compare methylated and normal SELEX libraries to detect TF ‘methylation readout’ at the level of enriched subsequences^12,14,39, we used ProBound with an extended alphabet (Extended Data Fig. 4a and Methods) and our multi-experiment framework to learn methylation-aware binding models that resolve the position-specific impact of methylation (ΔΔG_CpG→meCpG), enabling binding predictions for any (un)methylated sequence.
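One way to picture the extended-alphabet idea is to add symbols for the two base pairs of a fully methylated CpG and give each its own column in the energy matrix. The sketch below is an assumption-laden illustration: the symbols `M` (methylated C paired with G) and `W` (G paired with a methylated C) are hypothetical stand-ins for the alphabet in Extended Data Fig. 4a, and the energy values are made up.

```python
import numpy as np

# Four standard letters plus two hypothetical symbols for a methylated CpG:
# M = 5-methylcytosine (paired with G), W = G paired with a methylated C.
ALPHABET = "ACGTMW"

def methylate_cpg(seq):
    """Rewrite every CpG dinucleotide as its fully methylated counterpart."""
    return seq.replace("CG", "MW")

def energy(seq, matrix):
    """Sum position-specific energies (in RT) for a sequence matching the matrix length."""
    return float(sum(matrix[pos, ALPHABET.index(base)] for pos, base in enumerate(seq)))

# Hypothetical 4-bp energy matrix; only the methylation-specific entries are nonzero.
matrix = np.zeros((4, len(ALPHABET)))
matrix[1, ALPHABET.index("M")] = 0.8   # methylated C at position 2 is destabilizing
matrix[2, ALPHABET.index("W")] = -0.3  # its partner base pair is mildly stabilizing

site = "ACGT"
ddg_methyl = energy(methylate_cpg(site), matrix) - energy(site, matrix)
```

Because the methylated letters are ordinary columns of the matrix, the position-specific effect of methylation is read off as a difference between columns, and any methylated or unmethylated sequence can be scored with the same model.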

Extended Data Fig. 4 — (a) Alphabet used to represent normal and methylated base pairs. (b) Same as Extended Data Figure 2a, but showing the combinations of ATF4, CEBPγ, and normal and methylated DNA that were included in each experiment and the resulting complexes that were modeled. (c) K-mer enrichment analysis for the observed ATF4 EpiSELEX-seq read counts (left), the counts predicted by a mononucleotide-only model (middle), and the counts predicted by a mono- and di-nucleotide model (right). Each scatterplot compares the 8-mer enrichment observed in the normal (x-axis) and methylated (y-axis) libraries. Every point represents an 8-mer and is colored according to the legend; color is assigned based on a 6-bp matching substring between the 8-mer and the IUPAC code.

We tested this approach by analyzing the effect of meCpG on the ATF4:CEBPγ heterodimer while controlling for the confounding influence of the respective homodimers. Using data for all combinations of ATF4/CEBPγ and normal/methylated DNA (Extended Data Fig. 4b), we simultaneously learned methylation-aware binding models for all three dimers (Fig. 3c and Methods). These predict methylation-induced stabilization/destabilization patterns (Fig. 3c and Extended Data Fig. 4c) consistent with previous analyses of the ATF4 homodimer¹³ and similar to those of the related CEBPβ homodimer¹³ and ATF4:CEBPβ heterodimer³⁹. Strikingly, ATF4 overrides CEBPγ to retain its methylation readout at the central position of the heterodimer complex. We used ChIP-seq data to estimate the impact of these position-specific methylation sensitivities in vivo and found that methylation significantly affected binding in the direction predicted by our models (Fig. 3d and Methods).

Other DNA modifications, such as N⁶-methyladenine (6mA) and 5-hydroxymethylcytosine (5hmC), can also be functional^40–45. To characterize their impact on TF binding, we extended the EpiSELEX-seq protocol to assay multiple sub-libraries simultaneously: unmethylated, meCpG, 5hmC and 6mA (Fig. 3e and Extended Data Fig. 5a). Not only is this simpler than assaying each methylation mark separately, it also reduces experimental error. Repeating the binding assay for CEBPγ and jointly analyzing all four libraries revealed substantial and distinct stabilization/destabilization patterns for both 5hmC and 6mA (Fig. 3e and Extended Data Fig. 5b). Notably, the inferred meCpG methylation sensitivity is identical to what we found above. These results illustrate both the versatility of our approach and the fact that 5hmC and 6mA can have a substantial impact on binding.

Extended Data Fig. 5 — (a) Schematic table describing the factors, library and binding model used in analyzing the extended EpiSELEX-seq assay (cf. Extended Data Figure 4b). (b) K-mer enrichment analysis comparing normal and modified EpiSELEX-seq libraries, computed and displayed as in Extended Data Figure 4c.

Measuring absolute binding constants using SELEX

Although we have focused on quantifying binding specificity in terms of relative affinities, knowledge of absolute affinities is necessary for predicting equilibrium occupancy and for comparing different TFs on a common scale. Fundamentally, SELEX assays probe relative ligand frequencies and, so far, have only been used to estimate relative affinities. To overcome this limitation, we developed an assay called K_D-seq. It uses ProBound to jointly analyze the input, bound and free probes from a selection round to produce both a specificity model and an estimate of the absolute dissociation constant (K_D) for a reference sequence. Intuitively, K_D-seq uses a sum rule that relates the relative ligand frequencies of the three libraries to infer absolute binding probabilities, which are then converted to K_D estimates in a way that corrects for binding saturation (Fig. 4a and Methods).
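The sum rule can be made concrete with a noise-free simulation. If P_b is the overall fraction of DNA bound, the input frequency of each ligand satisfies f_input = P_b · f_bound + (1 − P_b) · f_free, so P_b, and from it the per-ligand binding probability and K_D, can be recovered from the three sets of relative frequencies. The sketch below is an illustration under simplifying assumptions (equilibrium 1:1 binding, exact frequencies, a least-squares estimate of P_b), not ProBound's likelihood-based procedure; all variable names and concentrations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ligands = 1000
kd_true = 10 ** rng.uniform(0, 3, n_ligands)  # hypothetical K_D values, 1 to 1,000 nM
f_input = rng.dirichlet(np.ones(n_ligands))   # relative frequencies in the input library

# Simulate equilibrium binding at a chosen free TF concentration.
tf_free_true = 50.0  # nM
dna_total = 20.0     # nM
p_bind = tf_free_true / (tf_free_true + kd_true)
frac_bound = np.sum(f_input * p_bind)             # overall fraction of DNA bound (P_b)
tf_total = tf_free_true + frac_bound * dna_total  # total TF implied by mass balance

f_bound = f_input * p_bind / frac_bound
f_free = f_input * (1 - p_bind) / (1 - frac_bound)

# Sum rule: f_input = P_b * f_bound + (1 - P_b) * f_free; solve for P_b by least squares.
diff = f_bound - f_free
frac_bound_hat = np.sum((f_input - f_free) * diff) / np.sum(diff * diff)

# Absolute binding probability per ligand, then K_D with the saturation correction:
# the free TF concentration is the total minus the TF sequestered by bound DNA.
p_hat = frac_bound_hat * f_bound / f_input
tf_free_hat = tf_total - frac_bound_hat * dna_total
kd_hat = tf_free_hat * (1 - p_hat) / p_hat
```

With real read counts, the frequencies are noisy and most ligands are unobserved, which is why the recovery has to go through a sequence-based model rather than per-ligand ratios as in this sketch.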

Fig. 4 — a, Schematic overview of the K_D-seq method. After a TF is incubated with a randomized DNA library, the bound, free and input probes are sequenced, measuring the relative probe frequencies in each fraction. This can be used to estimate the absolute binding probabilities (and, hence, K_D) with a sum rule that relates the three frequencies. b, K_D model for Dll consisting of a specificity model with an energy logo (top) and an interaction matrix (middle), which together predict the relative binding affinity, and the absolute K_D for a reference sequence (bottom). The interaction plot shows stabilizing (red) and destabilizing (blue) corrections to the energy logo for each pair of positions (boxes) and bases (pixels) in the logo. Gray indicates prohibited corrections. Model generated from data where [Dll] = 100 nM and [DNA] = 20 nM. c, Comparison of the predicted K_D⁻¹ (x axis) and observed probe fractions (y axis) in the bound (top) and free (bottom) libraries. Points represent the average observed fraction for 500 probes binned by predicted K_D. The dashed line indicates expected value assuming equilibrium binding model. d, Comparison between EMSA-measured (y axis) and model-predicted (x axis) K_D values for four probes. The dashed line indicates perfect agreement. e, K_D of the sequence TTTAATTGGT as estimated by K_D-seq for different Dll and DNA concentrations.

We initially tested K_D-seq using the Drosophila homeodomain protein Distal-less (Dll) at low TF and DNA concentrations (100 nM and 20 nM, respectively) to achieve strong enrichment and avoid excessive binding saturation. The resulting model (Fig. 4b) accurately predicted enrichment in the bound and free libraries over three orders of magnitude in K_D (Fig. 4c). For validation, we measured the K_D values of the optimal model-predicted binding site and three suboptimal sequences using standard electrophoretic mobility shift assays (EMSAs) and found excellent quantitative agreement (Fig. 4d and Extended Data Fig. 6). We then confirmed the robustness of K_D-seq affinity measurements by repeating the assay at different TF and DNA concentrations (Extended Data Fig. 7a). The resulting specificity models were virtually identical (pairwise r² for ΔΔG ranging from 0.974 to 0.998), with the fraction of TF and DNA bound changing as expected (Extended Data Fig. 7b). Although the K_D estimate for the highest-affinity sequence was similar across several conditions, it shifted when the TF concentration was extremely high compared to the K_D or when the DNA concentration was much higher than that of the TF (Fig. 4e; see ‘Practical guidelines’ in the Methods).

Extended Data Fig. 6 — (a) EMSA experiments for Dll and four DNA probes. (b) Fraction bound DNA probes predicted by the equilibrium binding model (lines, computed using indicated K_D values and equation (45)) and estimated based on EMSA band intensities (dots).

Extended Data Fig. 7 — (a) Comparison between EMSA-measured (dashed line) and different model-predicted (points) K_D values for four binding probes. Various model training strategies (x-axis) used different sequencing libraries: the input/bound/free libraries from a single experiment (left); the input/bound/free libraries from multiple experiments at different TF concentrations (center); or the input/bound libraries from multiple experiments at different TF concentrations (right). (b) Fraction of DNA bound (top) and fraction of TF bound (bottom) as inferred by ProBound when learning binding models from individual K_D-seq experiments (cf. left points in (a)). (c) Example K_D model (left) and observed and predicted probe enrichments (right; cf. Fig. 4c) for a model from the central points in (a). (d) Same as (c), but for a model from the right points in (a). (e) Same as (c), but only using the bound/free libraries (analogous to Spec-seq). This model can only predict relative K_D, as the bound/free ratio is proportional to K_D⁻¹ for all TF concentrations. In addition, the model predicts enrichment in the data up to a global rescaling factor. (f) Same as (d), but for a model derived from RNA Bind-n-Seq data for RBFOX2.

To test the theoretical validity of K_D-seq, we used the binding model of Fig. 4b as the ‘ground truth’ and simulated data for a range of Dll and DNA concentrations. In all cases, ProBound accurately recovered the K_D model (Extended Data Fig. 8a–e). In simulations at various incubation times, ProBound inferred correct K_D values at times exceeding ~10% of the equilibration time of the slowest probe in the library (Extended Data Fig. 8f,g). Taken together, this shows that K_D-seq is theoretically valid and robust.

Extended Data Fig. 8 — (a) Plot showing bound fraction vs. binding affinity in simulation of equilibrium binding. ‘Ground truth’ binding affinities were computed using the binding model in Fig. 4b (K_D = 3.9 nM). Lines correspond to simulations at different total TF concentrations. (b) Distributions of binding affinities in the input, bound and free libraries. Vertical lines indicate the median affinity in each library. (c) Comparison of the bound TF fraction in the simulation (‘truth’) vs. the fraction inferred by ProBound after analyzing the resulting synthetic reads. Each dot corresponds to a simulation with a unique [Dll]/[DNA] combination, colored by the DNA concentration. (d) Same as (c) but showing the net bound DNA fraction colored by TF concentration. (e) K_D value for the highest-affinity sequence inferred from the synthetic data. (f) Same as (a) but showing the fraction of DNA bound in kinetic simulations using different incubation times t. k_off,min is the off-rate for the highest-affinity probe. (g) K_D value for the highest-affinity sequence inferred using synthetic data from the kinetic simulations.

ProBound can also learn K_D models by jointly analyzing the bound and input libraries of multiple SELEX experiments at different TF concentrations. Intuitively, this approach uses saturation effects to determine the absolute affinity scale. For Dll, the K_D models from the two approaches are very similar (Extended Data Fig. 7a,c,d). When applied to multi-concentration RNA Bind-n-Seq¹⁶ data for RBFOX2, the resulting K_D model correctly captured the observed transition from linear to saturated selection in the experiments (Extended Data Fig. 7f). Finally, we note that ProBound can estimate relative affinities using only the free and bound libraries, as in the Spec-seq⁴⁶ assay (Extended Data Fig. 7e).
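The role of saturation in setting the absolute scale can be seen from how the enrichment of a strong ligand relative to a weak one changes with TF concentration. The sketch below assumes simple 1:1 equilibrium binding, in which the bound/input enrichment of a ligand is proportional to [TF]_free / ([TF]_free + K_D); the function name and the K_D values are hypothetical.

```python
def enrichment_ratio(kd_strong, kd_weak, tf_free):
    """Ratio of bound/input enrichment for two ligands under 1:1 equilibrium binding."""
    p_strong = tf_free / (tf_free + kd_strong)
    p_weak = tf_free / (tf_free + kd_weak)
    return p_strong / p_weak

kd_strong, kd_weak = 1.0, 100.0  # nM, made-up values

ratio_low_tf = enrichment_ratio(kd_strong, kd_weak, tf_free=0.01)     # linear regime
ratio_high_tf = enrichment_ratio(kd_strong, kd_weak, tf_free=1000.0)  # saturated regime
```

At low TF concentration the ratio approaches K_D,weak / K_D,strong and carries no information about the absolute scale; as the TF concentration rises past the K_D of the strong ligand, the ratio collapses toward 1. The concentration at which this transition occurs is what anchors absolute K_D values in a multi-concentration analysis.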

Peak-free motif discovery from ChIP-seq data

Although the preceding analyses have focused on quantifying the impact of co-factors and TF concentration on in vitro binding, we also wanted to learn their in vivo impact directly from ChIP-seq data. Standard motif discovery algorithms aim to discover overrepresented sequences within discrete genomic regions—identified by ‘peak callers’—that harbor a statistically significant enrichment of ChIP-seq reads. Peak calling is useful for identifying the most prominent genomic binding sites, but it ignores information about cis-regulatory logic contained within more weakly bound regions. We hypothesized that ProBound can extract such logic by directly modeling how the input and ChIP libraries relate to each other.

To test this approach, we used ProBound to discover the factors driving the selection in glucocorticoid receptor (GR) ChIP-seq data from the IMR90 cell line⁴⁷ (Methods). It found four binding models: one consistent with the GR consensus sequence^48,49 and three others consistent with known GR co-factors AP-1, FOXA1 and TEAD^47,50 (Fig. 5a). These models were qualitatively consistent with those discovered using well-established peak-based methods (Extended Data Fig. 9). Inspired by our multi-concentration analysis above, we next set out to quantify the impact that the nuclear concentration of a TF can have on its binding. We did so by jointly analyzing multiple ChIP-seq datasets that probe GR binding in the murine hippocampus after treatment with varying levels of corticosterone (CORT)⁵¹, an agonist that increases the nuclear concentration of GR (Fig. 5b). The resulting model captured sample-specific activity parameters reflective of GR nuclear concentration that were proportional to CORT concentration (Fig. 5b).

Fig. 5 — a, Binding models for GR and three co-factors (left) learned from GR ChIP-seq data from the IMR90 cell line⁴⁷ and for GR from a SELEX dataset (center). The scatterplot compares the energy coefficients learned from ChIP-seq (y axis) and SELEX (x axis) data⁷. b, Combined specificity (top) and sample-specific TF binding activity (bottom) model learned by jointly analyzing three GR ChIP-seq datasets after treatment with 30 μg kg⁻¹, 300 μg kg⁻¹ or 3,000 μg kg⁻¹ of CORT⁵¹. The scatterplot (left) compares the energy coefficients as in a.

Extended Data Fig. 9 — Top: Binding models inferred by peak-based methods (MEME-ChIP and HOMER) and peak-free methods (ProBound and NoPeak) from the GR ChIP-seq data published in Starick et al. (2015). For MEME-ChIP, the reverse-complement symmetry setting was activated. Bottom: Comparison of ChIP-based and SELEX-based binding models for GR, displayed as in Fig. 5a. Because the binding models generated by MEME-ChIP and HOMER contain base probabilities p, the negative logarithm of these values was compared to the ΔΔG/RT values from the SELEX model. None of the binding models found by NoPeak matched the GR consensus sequence.

It should be noted that the multi-concentration model was constructed on data where each library was intentionally downsampled to 10⁵ reads or 0.03 reads per kilobase (kb) of genomic sequence on average. Thus, even at extremely low coverage, ChIP-seq data clearly contain sufficient information to reliably infer TF binding models and quantify biologically meaningful cell state parameters. The free-energy parameters of both GR binding models showed good agreement with those from a model trained on in vitro data⁷ (r² = 0.97 and r² = 0.92, respectively; Fig. 5a,b), suggesting that in vitro and in vivo observations of binding specificity can, in fact, be highly concordant.

Profiling tyrosine kinase kinetics using Kinase-seq

Biological processes that employ sequence-specific protein–protein interactions are increasingly being studied with display assays using diverse DNA-templated protein libraries^17,18,52. Although these methods are profiling such interactions more comprehensively than ever before, interpreting the data remains challenging for many of the same reasons as above. Furthermore, current analytical methods tend to focus on detecting enriched sequence features rather than explicitly estimating binding constants or enzymatic parameters. Given the similarities with SELEX assays, we were motivated to use ProBound to characterize protein sequence recognition.

As a proof of concept, we focused on a process critical to many signal transduction pathways in the cell: the phosphorylation of tyrosine residues on proteins. Recently, the substrate sequence preferences of several tyrosine kinases were surveyed with a bacterial display library containing thousands of known kinase substrates⁵³. To comprehensively profile the preferences for one of these kinases, c-Src, in an unbiased way, we repeated the assay with a new library design that randomizes ten amino acid residues around a fixed central tyrosine and exposed this library to c-Src for varying durations (Fig. 6a and Methods). After sequencing (Extended Data Fig. 10), we jointly analyzed all time points to learn a model that predicts the sequence-specific catalytic efficiency k_eff, a simple metric that is often used to compare substrates for the same enzyme. Visualizing the inferred efficiency model as a sequence logo (Fig. 6b) revealed a position-specific pattern of favorable residues consistent with the earlier study⁵³. The model also accurately captures the observed fraction of phosphorylated peptides over a 100-fold range in k_eff for all three time points (Fig. 6c).
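The link between a catalytic efficiency and the phosphorylated fraction at several time points can be sketched with a pseudo-first-order rate law, in which the fraction phosphorylated after time t is 1 − exp(−k_eff · t). This functional form is an assumption made here for illustration (with enzyme concentration absorbed into k_eff), as are the function names and the rate value; the three time points match those of the assay.

```python
import numpy as np

times = np.array([5.0, 20.0, 60.0])  # minutes of exposure to the kinase

def phosphorylated_fraction(k_eff, t):
    """Fraction of a peptide phosphorylated after time t under pseudo-first-order kinetics."""
    return 1.0 - np.exp(-k_eff * t)

def fit_k_eff(fractions, t):
    """Least-squares estimate of k_eff from the linearized relation -ln(1 - f) = k_eff * t."""
    y = -np.log1p(-np.asarray(fractions))
    return float(np.sum(t * y) / np.sum(t * t))

k_true = 0.03  # per minute, made-up value for one peptide
observed = phosphorylated_fraction(k_true, times)
k_hat = fit_k_eff(observed, times)
```

Slow substrates are best resolved at the longest time point and fast substrates at the shortest, before they saturate, which is one reason to analyze all time points jointly.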

Fig. 6 — a, Schematic overview of the Kinase-seq assay used to profile the sequence specificity of the tyrosine kinase c-Src. b, k_eff model for c-Src with an energy logo (top) and an interaction matrix (bottom) trained on data from 5 minutes, 20 minutes and 60 minutes of exposure. The central position of the model was fixed to recognize tyrosine (gray). c, Comparison of the predicted k_eff (x axis) and phosphorylated fraction (y axis) for 5 minutes (blue), 20 minutes (purple) and 60 minutes (red) of exposure to c-Src. Points represent the average observed phosphorylated fraction for 500 probes binned by predicted k_eff. Dashed lines indicate expected value according to the model. d, Comparison of the HPLC-measured normalized initial phosphorylation rate v₀ (y axis, n = 3 technical replicates) and the model-predicted k_eff (x axis) for five disease-associated WT/MUT SNP pairs (arrows) and a peptide predicted to have high activity (Supplementary Table 2). The concentration of c-Src was 500 nM and that of the substrate peptide was 100 μM. Error bars indicate the s.e.m., and P values were computed using a two-sided t-test (*** indicates P < 10⁻³).

Extended Data Fig. 10 — (a) Bar chart showing the number of reads and unique sequences in each sequencing library. (b) Sequence logos showing the amino acid frequencies (left) and enrichments (right) at each position in each library.

To validate the model, we used high-performance liquid chromatography (HPLC) to measure the phosphorylation rates for 11 peptides. As genetic variants can impact phosphorylation rates⁵⁴, we used the PTMVars database⁵⁵ to find four disease-associated single-nucleotide polymorphisms (SNPs) that were predicted by our ProBound model to have a large allelic difference. Measurements of their normalized initial phosphorylation rate differed significantly in the direction predicted by the model (Fig. 6d). In addition, there was no measurable difference for a SNP predicted to cause only a small allelic difference for the F8 protein, and a model-defined high-efficiency peptide (Src-high) was indeed the highest. Predictions tracked HPLC measurements over three orders of magnitude in k_eff.

Discussion

A major goal of this study was to rigorously estimate biophysical parameters from massively parallel sequencing data using machine learning. Although biochemists have measured such parameters for decades, these measurements are generally low-throughput. By contrast, high-throughput sequencing-based analysis tends to focus on the detection of enrichment patterns that only indirectly reflect these quantities. Moreover, modern machine learning methods, such as deep neural networks, tend to yield highly overparametrized black box models whose parameters have no direct biophysical meaning. Here, we showed that, by explicitly modeling the assay process, we can use machine learning to turn DNA sequencers into virtual measurement devices that accurately quantify biophysical parameters. Molecular biologists and computer scientists often address the same question using very different language; for instance, classifier performance and binding free energies are both used to quantify sequence recognition. We hope that approaches such as ours help keep the literature more coherent and inspire direct experimental validation of algorithm performance.

Central to our approach is the observation that some quantities cannot be estimated through pairwise enrichment analysis but only through more structured integration of complementary data. One example is our combinatorial approach to the separation of different TF complexes, which we also extended to methylation-aware binding models. Another is how analyzing the bound, free and input fractions jointly—not pairwise—allows absolute affinities to be measured. Our approach is reminiscent of more traditional biochemical assays, which collect data across different time points, concentrations or fractions and use curve fitting to estimate constants. As we study increasingly complex aspects of sequence recognition—such as the combined impact of sequence, co-factors, DNA methylation and TF concentrations or the integration of in vitro and in vivo perspectives—we foresee that rigorous integration of complementary data along the lines that we have sketched here will become increasingly important. More generally, we anticipate that the accurate and unbiased profiling of sequence recognition that ProBound enables will have many applications in areas of biotechnology where the rational engineering of ligands or substrates is critical.

Methods

Overview of the algorithm

For each experiment, the data consist of a count table enumerating the probes in each SELEX round. The core of the algorithm is a statistical model of the experiment that defines the likelihood of a set of model parameters given the count table. On a high level, this likelihood is computed by first defining the probability that each probe is bound in terms of its sequence, then predicting the probe frequencies in each library using a cumulative selection function and, finally, modeling the stochastic sampling of sequencing. The model parameters are estimated from the data through numerical maximization of the likelihood.

Probabilistic motivation of the binding model

The binding model defines the probability that a probe is bound:

P_{bound} = \frac{Z_{bound}}{1 + Z_{bound}} .

Here, Z_bound is the partition function, which can be thought of as a weighted sum over microscopic states. Assuming that, at most, two protein molecules are bound to the probe, the partition function is given by

Z_{bound} = \sum_{a} \sum_{x} \frac{[P_{a}]}{K_{D, a} (S_{x})} + \sum_{a, b} \sum_{x_{1}, x_{2}} \frac{[P_{a}] [P_{b}]}{K_{D, a} (S_{x_{1}}) K_{D, b} (S_{x_{2}})} ω_{a : b} (x_{1}, x_{2}),

where a is a “binding mode” index that denotes protein type; [P_a] is the concentration of protein a; S_x is a probe subsequence of length L_a starting at an offset and strand denoted by x; K_D,a(S_x) is the dissociation constant for protein a binding S_x; and ω_a:b(x₁, x₂) quantifies the cooperativity between factors a and b binding at positions x₁ and x₂, respectively. Note that ω_a:b(x₁, x₂) equals 1 if a and b bind independently from each other, equals 0 for prohibited conformations and is greater than 1 if the factors bind cooperatively.
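As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates the partition function from explicit lists of single-protein and two-protein configurations. The function names and the tuple layout are hypothetical choices made for this example and are not part of ProBound; the caller is assumed to enumerate the configurations (and hence decide how pairs are counted).

```python
def z_bound(single_terms, pair_terms):
    """Partition function for a probe with at most two bound proteins.

    single_terms: list of (conc, kd) tuples, one per (binding mode, position).
    pair_terms:   list of (conc_a, kd_a, conc_b, kd_b, omega) tuples, one per
                  two-protein configuration; omega is the cooperativity factor.
    (Both layouts are assumptions made for this sketch.)
    """
    z = sum(conc / kd for conc, kd in single_terms)
    z += sum(ca * cb * omega / (ka * kb) for ca, ka, cb, kb, omega in pair_terms)
    return z


def p_bound(z):
    """Probability that the probe is bound, P = Z / (1 + Z)."""
    return z / (1.0 + z)


# Two equivalent sites with K_D = 10 (arbitrary units) at protein concentration 1,
# plus one cooperative two-protein configuration with omega = 5.
z = z_bound([(1.0, 10.0), (1.0, 10.0)], [(1.0, 10.0, 1.0, 10.0, 5.0)])
print(z, p_bound(z))
```

Setting omega to 0 for a configuration removes it from the sum, which corresponds to the prohibited conformations described above.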

It is convenient to express K_D in terms of its value for a reference sequence S₀ and a modifying factor quantifying the relative binding strength²⁷:

K_{D, a}^{r e l} (S_{x}) = \frac{K_{D, a} (S_{x})}{K_{D, a} (S_{0})} = \exp (\frac{Δ Δ G_{a} (S_{x})}{R T}) .

Here, ΔΔG_a(S) ≡ ΔG(S) − ΔG(S₀) is the difference in free-energy penalty ΔG of binding between S and S₀; R denotes the ideal gas constant; and T is the absolute temperature.

A central goal of our algorithm is to learn how ΔΔG_a(S) depends on the sequence. ProBound models this as a sum of additive contributions associated with sequence features ϕ:

- \frac{Δ Δ G_{a} (S_{x})}{R T} = \sum_{ϕ \in Φ} β_{a, ϕ} X_{ϕ} (S_{x}) \equiv {\vec{β}}_{a} \cdot \vec{X} (S_{x})

Here, Φ is the set of sequence features; β_ϕ is the energetic impact of ϕ; and X_ϕ(S_x) is a binary indicator of whether sequence S_x contains ϕ. By default, Φ is simply the letter sequence along S_x. In this case $\vec{β}$ encodes a position-specific affinity matrix (PSAM)^24,27,56 with size matching the length of S_x. ProBound can also include letter pairs as features, both adjacent (giving dinucleotide interactions for DNA as in, for example, NRLB²⁵) and non-adjacent.
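The additive model can be sketched for the default case in which the features are single letters. The example below assumes a hypothetical representation of the PSAM as a list of per-position dictionaries; it computes the score β·X(S) and the corresponding relative dissociation constant exp(ΔΔG/RT) = exp(−β·X).

```python
import math


def psam_score(beta, site):
    """Sum of the per-position coefficients, i.e. beta . X(S) = -ddG / RT.

    beta: list (one entry per position) of dicts mapping letter -> coefficient
          (a hypothetical layout chosen for this sketch).
    site: string of the same length as beta.
    """
    return sum(beta[pos][letter] for pos, letter in enumerate(site))


def relative_kd(beta, site):
    """Relative dissociation constant K_D(S) / K_D(S0) = exp(-beta . X(S))."""
    return math.exp(-psam_score(beta, site))


# Toy two-position model whose reference sequence "AC" has coefficient 0 at
# both positions; every mismatch carries a negative coefficient.
beta = [
    {"A": 0.0, "C": -1.0, "G": -2.0, "T": -3.0},
    {"A": -1.0, "C": 0.0, "G": -1.0, "T": -1.0},
]
print(relative_kd(beta, "AC"))  # reference sequence
print(relative_kd(beta, "CC"))  # one mismatch, weaker binding (larger K_D)
```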

Finally, although ProBound is similar to MODER²² in that both methods model monomeric and dimeric binding, these methods have several differences: (1) ProBound predicts the quantitative equilibrium binding probability in terms of the biophysically interpretable partition function Z_bound, whereas MODER uses a mixture model and the expectation–maximization algorithm to perform motif discovery; (2) ProBound jointly analyzes all available SELEX rounds, whereas MODER analyzes a single set of bound sequences; (3) MODER allows dimeric interactions to modify the combined position weight matrix for two closely spaced or clashing motifs; and (4) ProBound has broad applicability beyond discovery of dimeric motifs.

Implementation of binding layer

Although the above derivation provides a motivation for the binding model, it has to be adapted for SELEX experiments. First, it is clear from Eq. (2) that the protein concentration [P_a] and binding constant K_D,a(S₀) for a given factor a cannot be separately estimated from the data; only their ratio α_a = [P_a] / K_D,a(S₀), a quantity that we call the binding mode activity, can be determined. We similarly define the binding mode interaction activities as α_a:b = [P_a][P_b] / (K_D,a(S₀) K_D,b(S₀)). Second, because the free protein concentration can vary between SELEX rounds r, the activities can take independent values in each round. Third, most experiments are performed in a low-protein-concentration regime where Z_bound ≪ 1 and P_bound ∝ Z_bound. Because the data only provide information about the relative rate at which probes are selected, only the relative values of α_a and α_a:b are meaningful in this limit. Fourth, although PSAM models can be accurate for close-to-consensus sequences, they severely underestimate the affinity of far-from-consensus sequences, for which non-specific binding is dominant⁵⁷. This can be addressed by including a non-specific binding term α_N.S. in Z_bound. Finally, it is sometimes important to include a factor ω_a(x) that models biases in binding along the probe. Putting all of this together, the partition function in selection round r is given by:

\begin{matrix} Z_{bound, r} = α_{N.S., r} + \sum_{a} α_{a, r} \sum_{x} ω_{a} (x) e^{{\vec{β}}_{a} \cdot \vec{X} (S_{x})} \\ + \sum_{a, b} α_{a : b, r} \sum_{x_{1}, x_{2}} e^{{\vec{β}}_{a} \cdot \vec{X} (S_{x_{1}}) + {\vec{β}}_{b} \cdot \vec{X} (S_{x_{2}})} ω_{a : b} (x_{1}, x_{2}) \end{matrix}

The binding probes typically feature a variable region flanked by constant sequences. The sliding window sum over subsequences S_x can be configured to include f_a letters from the flanking sequences. By default, the sum runs over both strands, but it can be restricted to only one strand (which is useful for modeling RNA and peptides).
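The sliding-window sum can be sketched for the simplest configuration: a single binding mode, no position bias (ω_a(x) = 1), no interactions and no flanking letters. These simplifications, the function names and the dictionary-based PSAM layout are assumptions made for illustration only.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def z_bound_round(probe, beta, alpha, alpha_ns=0.0, both_strands=True):
    """Z = alpha_NS + alpha * sum over windows x of exp(beta . X(S_x)).

    beta is a list of per-position dicts (letter -> coefficient). Setting
    both_strands=False restricts the sum to the forward strand, as would be
    appropriate for RNA or peptide probes.
    """
    width = len(beta)
    strands = [probe, reverse_complement(probe)] if both_strands else [probe]
    total = 0.0
    for strand in strands:
        for start in range(len(strand) - width + 1):
            window = strand[start:start + width]
            total += math.exp(sum(beta[p][c] for p, c in enumerate(window)))
    return alpha_ns + alpha * total


# With all coefficients equal to zero every window contributes exp(0) = 1,
# so the sum simply counts the windows on the scanned strands.
flat = [{b: 0.0 for b in "ACGT"}]
print(z_bound_round("ACGT", flat, alpha=0.5, alpha_ns=1.0))
```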

Assay layer

The selection model predicts the relative concentrations f_i,r of each binding probe i in each selection round r. By default, the concentrations in two subsequent rounds are related through an enrichment factor proportional to the binding. It is convenient to express this as

f_{i, r} = f_{i, r - 1} {(Z_{bound, i, r})}^{ρ} {(1 + Z_{bound, i, r})}^{γ}

where Z_bound,i,r is the partition function evaluated for probe i in round r. Experiments conducted in the low-protein-concentration limit are modeled by setting (ρ, γ) = (1, 0). Binding saturation can be accounted for by setting (ρ, γ) = (1, −1). Previous methods have modeled enrichment between a pair of SELEX libraries (such as the linear selection model used by NRLB²⁵ and the saturated binding model used by BEESEM to optimally explain the k-mer enrichment in HT-SELEX data²⁴), and the recent DeepSELEX method analyzes multiple SELEX rounds using a multi-layer neural network, although in a way that models neither the thermodynamics of binding nor the cumulative effect of repeated enrichment²⁰. However, no other method rigorously models how a full SELEX library evolves across multiple selection rounds.
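The cumulative nature of this selection model can be made concrete with a short sketch that applies the enrichment factor round by round for a single probe. The function name and argument layout are hypothetical; the frequencies are relative and are left unnormalized here.

```python
def predict_frequencies(f0, z_by_round, rho=1.0, gamma=0.0):
    """Relative probe frequency after each round of repeated enrichment.

    f0:          relative frequency in the input library (round 0).
    z_by_round:  partition function of the probe in rounds 1, 2, ...
    rho, gamma:  exponents of the selection model; (1, 0) is the
                 low-concentration limit and (1, -1) models saturation.
    Returns [f_0, f_1, ..., f_R].
    """
    freqs = [f0]
    for z in z_by_round:
        freqs.append(freqs[-1] * z ** rho * (1.0 + z) ** gamma)
    return freqs


# Low-concentration limit: the frequency is multiplied by Z in every round.
print(predict_frequencies(1.0, [0.5, 0.5]))
# Saturating binding: the per-round factor is Z / (1 + Z).
print(predict_frequencies(1.0, [1.0, 1.0], rho=1.0, gamma=-1.0))
```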

Some experiments (such as K_D-seq; see below) do not use repeated binding enrichment but, rather, derive multiple libraries directly from the input. Such experiments are better modeled using

f_{i, r} = f_{i, 0} {(Z_{bound, i, r})}^{ρ_{r}} {(1 + Z_{bound, i, r})}^{γ_{r}}

Finally, kinetic experiments that enrich and sequence modified or unmodified probes can be modeled using the constant-rate-enrichment model:

f_{i, r} = f_{i, r - 1} (\frac{1}{1 + e^{- δ}} e^{- Z_{bound, i, r}} + \frac{1}{1 + e^{δ}} (1 - e^{- Z_{bound, i, r}}))

Here, δ→∞ and δ→−∞ correspond to the unmodified and modified fractions, respectively.
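A minimal sketch of the per-round factor in the constant-rate-enrichment model shows how δ interpolates between the two fractions. The function name is hypothetical, and z stands for Z_bound,i,r.

```python
import math


def kinetic_enrichment(z, delta):
    """Per-round enrichment factor of the constant-rate-enrichment model.

    exp(-z) is the fraction of probes left unmodified and 1 - exp(-z) the
    modified fraction; delta sets the logistic weights of the two fractions.
    """
    w_unmodified = 1.0 / (1.0 + math.exp(-delta))
    w_modified = 1.0 / (1.0 + math.exp(delta))
    return w_unmodified * math.exp(-z) + w_modified * (1.0 - math.exp(-z))


z = 0.7
print(kinetic_enrichment(z, 50.0), math.exp(-z))         # close to the unmodified fraction
print(kinetic_enrichment(z, -50.0), 1.0 - math.exp(-z))  # close to the modified fraction
```

At δ = 0 the two fractions are weighted equally, so the factor equals 0.5 for every probe and carries no sequence information.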

Sequencing layer

The sequencing model computes the likelihood of the observed count tables k_i,r given the relative concentrations f_i,r predicted by the selection model. The counts are assumed to follow a Poisson distribution with expectation value

E [k_{i, r}] = η_{r} f_{i, r}

Here, the parameter η_r normalizes the relative probe concentration and adjusts to the correct sequencing depth. The (rescaled) likelihood is then

\log L = \sum_{r, i} [k_{i, r} \log (η_{r} f_{i, r}) - η_{r} f_{i, r}] / k_{total} + const.

where k_total is the total number of reads and where the last term is independent of model parameters and can be ignored for the purpose of optimization. Because f_i,r is proportional to f_i,0, the latter parameter can be optimized analytically and substituted back into Eq. (10), giving

\log L = \sum_{r, i} (k_{i, r} \log p_{r; i}) / k_{total} + const.

where $p_{r; i} = η_{r} f_{i, r} / \sum_{r^{'}} η_{r^{'}} f_{i, r^{'}}$ . Note that Eq. (11) also can be derived by assuming that the counts for each probe follow the multinomial distribution across columns with probability p_r;i. Also note that, because all unobserved probes have k_i,r = 0 and do not contribute to the likelihood, the sum over i only runs over the observed probes. This is a major advantage compared to NRLB²⁵, where the sum is over all 4^L probes, with L as the number of variable positions. This sum can only be evaluated using dynamic programming, and this restricts NRLB to data from only a single round of affinity-based enrichment in the absence of saturation.

A second advantage of this approach is that it seeks to predict the quantitative count of all observed sequences and give the appropriate weight to both (the relatively rare) high-count sequences and (the much more numerous) low-count sequences. This differs substantially from DeepSELEX²⁰ (which builds a multi-library sequence classifier using the top 15,000 sequences and then disregards the sequencing count), DeepBind¹⁹ (which truncates the sequencing counts of a selected SELEX library into present or absent, generates a synthetic input library and then builds a binary classifier of selected versus input), MODER²² (which performs motif discovery within one set of sequences without counts) and BEESEM²⁴ (which minimizes the error in the predicted library-wide k-mer frequencies).

Finally, note that Eq. (11) is independent of the initial probe frequencies f_i,0, meaning that the initial library need not be random but can consist of genomic DNA fragments or custom-designed sequences.
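The rescaled multinomial likelihood of Eq. (11) can be sketched as follows, up to the additive constant. The nested-list layout of the count table and the function name are assumptions made for this example. Rescaling all predicted frequencies of one probe by a common factor leaves the result unchanged, which illustrates the independence from f_i,0 noted above.

```python
import math


def multinomial_log_likelihood(counts, eta, f):
    """Rescaled log-likelihood sum_{r,i} k_ir * log(p_ri) / k_total.

    counts[i][r]: observed count of probe i in round r (observed probes only).
    f[i][r]:      predicted relative concentration of probe i in round r.
    eta[r]:       sequencing-depth parameter of round r.
    """
    k_total = sum(sum(row) for row in counts)
    total = 0.0
    for k_row, f_row in zip(counts, f):
        weights = [e * x for e, x in zip(eta, f_row)]
        norm = sum(weights)
        for k, w in zip(k_row, weights):
            if k > 0:  # zero counts do not contribute
                total += k * math.log(w / norm)
    return total / k_total


counts = [[3, 1], [0, 2]]
eta = [1.0, 1.0]
f = [[0.75, 0.25], [0.5, 0.5]]
f_rescaled = [[7.5, 2.5], [0.5, 0.5]]  # first probe multiplied by 10
print(multinomial_log_likelihood(counts, eta, f))
print(multinomial_log_likelihood(counts, eta, f_rescaled))
```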

Multi-experiment learning

ProBound simultaneously models multiple experiments by computing the likelihood $L_{e}$ of each experiment e and then optimizing the combined likelihood

\begin{matrix} \log L & = & \sum_{e} \log L_{e} \end{matrix}

The precise way in which the likelihood $L_{e}$ is evaluated can be tailored to the details of each experimental design:

A different configuration of binding modes and their interactions can be chosen for each experiment when computing Z_bound when desired.
The binding mode (and interaction) activities can either take independent values α_a,e in each experiment or be constrained to $α_{a, e} = {[P_{a}]}_{e} α_{a}$ , where α_a is the global activity of binding mode a and [P_a]_e is a set parameter. The latter is useful when integrating experiments conducted at different protein concentrations or in kinetic assays where [P_a]_e is set to the treatment time.
Chemical modifications are encoded by expanding the alphabet and transliterating letters in the appropriate experiments. For example, meCpG modifications can be encoded using the alphabet ACcGgT and the complementarity rules A ↔ T, C ↔ G and c ↔ g, expanding the feature set Φ of the binding model to include the additional letters and performing the transliteration CG → cg for methylated probes.

To our knowledge, no other methods have similar functionality for jointly analyzing multiple complementary SELEX datasets.
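The alphabet expansion described above for meCpG can be sketched in a few lines. The helper names are hypothetical; the complementarity table follows the rules A ↔ T, C ↔ G and c ↔ g stated in the text.

```python
# Complementarity rules for the expanded alphabet ACcGgT.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "c": "g", "g": "c"}


def methylate(seq):
    """Transliterate every CpG dinucleotide of a methylated probe: CG -> cg."""
    return seq.replace("CG", "cg")


def reverse_complement(seq):
    """Reverse complement that respects the expanded alphabet."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


probe = methylate("TACGTA")
print(probe)                      # methylated CpG written in lower case
print(reverse_complement(probe))  # the palindromic probe maps onto itself
```

Because c pairs with g, a methylated CpG remains a methylated CpG when read on the opposite strand, so the same binding model can be scanned over both strands.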

Regularization

Three regularization terms were included to avoid overfitting and to improve the stability of the numerical optimization. The first was an L₂ regularization term for the parameter vector

\begin{matrix} \vec{θ} = \{β_{ϕ}, \log α_{a}, \log α_{a : b}, \log ω_{a} (x), \log ω_{a : b} (x_{1}, x_{2}), \log η_{r}\} \end{matrix}

with weight λ. The second term was inspired by the Dirichlet distribution, which commonly is used as a prior for probability parameters. Thus, for each feature ϕ, we identified all features Φ^c(ϕ) that are of the same class c (monomer, or dimer with the same spacing) and located at the same position within the binding site, and then we defined a feature probability

\begin{matrix} p (ϕ) & = & e^{β_{ϕ}} {(\sum_{ϕ^{'} \in Φ^{c} (ϕ)} e^{β_{ϕ^{'}}})}^{- 1} \end{matrix}

The regularization term is then computed as the rescaled log-PDF of p(ϕ) in the Dirichlet distribution

\frac{k_{Dirichlet}}{k_{total}} \sum_{ϕ} \log p (ϕ)

where k_Dirichlet is analogous to a pseudocount. The final regularization term in the likelihood is defined as

\sum_{i} (e^{θ_{i} - θ_{\max}} + e^{- θ_{i} - θ_{\max}})

and introduces an exponential barrier (by default $θ_{\max} = 40$ ) that prevents the optimizer from failing or getting trapped in regions with large numerical errors.
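The three terms can be sketched individually as below. This is an illustration under stated assumptions rather than the ProBound implementation: the L₂ term is taken to be λ times the sum of squared parameters (the exact normalization is not given in the text), the features are assumed to be pre-grouped by class and position, and the sign with which each term enters the objective is left to the caller.

```python
import math


def l2_term(theta, lam):
    """Assumed form of the L2 term: lambda * sum_i theta_i^2."""
    return lam * sum(t * t for t in theta)


def dirichlet_term(beta_groups, k_dirichlet, k_total):
    """(k_Dirichlet / k_total) * sum over features of log p(phi).

    beta_groups: list of lists; each inner list holds the coefficients of the
    features that share a class and position, so that
    p(phi) = exp(beta_phi) / sum over the group of exp(beta).
    """
    total = 0.0
    for group in beta_groups:
        log_norm = math.log(sum(math.exp(b) for b in group))
        total += sum(b - log_norm for b in group)
    return k_dirichlet / k_total * total


def barrier_term(theta, theta_max=40.0):
    """Exponential barrier that grows rapidly once |theta_i| approaches theta_max."""
    return sum(math.exp(t - theta_max) + math.exp(-t - theta_max) for t in theta)


print(dirichlet_term([[0.0, 0.0, 0.0, 0.0]], k_dirichlet=20, k_total=100))
print(barrier_term([0.0, 39.0]))
```

The Dirichlet term is largest when all coefficients in a group are equal, which is why increasing k_Dirichlet pulls the coefficients within each group toward each other.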

Procedure for setting k_Dirichlet

The importance of the Dirichlet regularizer in Eq. (15) is set by k_Dirichlet. For fits with all-by-all interactions, the inferred coefficients tended to be unstable for small values of k_Dirichlet. Although increasing k_Dirichlet stabilizes the coefficients, they shrink toward 0 when k_Dirichlet is excessively large. We, thus, developed a procedure for setting k_Dirichlet and applied it uniformly in all analyses that included dinucleotide or all-by-all interactions. In this procedure, we ran ProBound using a wide range of Dirichlet weights (k_Dirichlet ∈ {0, 10, 20, 50, 100, 200, 500, 1,000, 2,000}), fixed the monomer coefficients ${\vec{β}}_{mono}$ and dimer coefficients ${\vec{β}}_{di}$ in each resulting model using the mismatch gauge (see below) and computed the pairwise Pearson correlation r² between the inferred ${\vec{β}}_{di}$ for different values of k_Dirichlet. The resulting matrix r²(k₁, k₂), where k₁ and k₂ are values of k_Dirichlet, had a block-like structure where ${\vec{β}}_{di}$ was highly correlated for large values of k₁ and k₂ but only weakly correlated when k₁ or k₂ was small. We considered the coefficients to have stabilized when r² > 0.8 between a model and the model with the next-smaller value of k_Dirichlet. Using this procedure, we fixed k_Dirichlet to be 0 for the Hth-Exd-Ubx analysis (Fig. 3b), 0 for the ATF4/CEBPγ EpiSELEX-seq analysis (Fig. 3c), 0 for the CEBPγ:CEBPγ multi-EpiSELEX-seq analysis (Fig. 3e), 200 for the RBFOX2 analysis (Extended Data Fig. 7f), 200 for the single-experiment Dll analyses (Fig. 4b), 1,000 for the multi-experiment Dll analyses (Extended Data Fig. 7c–e) and 50 for the Src analysis (Fig. 6b). k_Dirichlet was set to 20 in all analyses that lacked interactions—namely, the SELEX benchmarking (Fig. 2), the CAP-SELEX analyses (Extended Data Figs. 2c and 3) and the ChIP-seq analysis (Fig. 5).

Model optimization scheme

To estimate the model parameters, ProBound uses the quasi-Newton optimization method L-BFGS to minimize the loss function. As gradient-based methods cannot guarantee convergence to the global minimum, we developed a heuristic method that escapes common local minima. Specifically, given an optimal binding model, closely related but suboptimal models can be generated by (1) shifting the motif to the left or right, (2) extending or shrinking the motif to the left or right and (3) increasing or decreasing the flank length²⁵. Thus, once L-BFGS has converged to a minimum, our method explores the above transformations to find the model with the optimal footprint.

More precisely, ProBound optimizes the loss function by first restricting it to include only the first binding mode (and non-specific binding) and optimizing this model and then sequentially including and optimizing additional binding modes (and interactions as they become possible). As each new binding mode a (or interaction a:b) is included and optimized, the algorithm takes seven substeps: (1) heuristic adjustment of α_a (or α_a:b) so that it is expected to contribute 5% to Z_bound; (2) freezing the values of all model parameters; (3) unfreezing and optimizing η to avoid shocks from incorrectly predicted sequencing depth; (4) unfreezing and optimizing the monomer features in ${\vec{β}}_{a}$ to give an initial binding model (ω_a:b (x₁,x₂) is unfrozen and optimized for interactions); (5) greedy exploration of alternative binding models with different frame shift (shifting the recognized sequence features to the left or right), footprint (expanding the region of feature recognition to the left and/or right) or flank length (including subsequences located further into the fixed flanking regions when computing Z_bound); (6) sequential unfreezing and optimization of dimer features and ω_a(x) if applicable; and (7) unfreezing of all model parameters. At each substep, L-BFGS is used to optimize the unfrozen parameters. By default, the parameters are seeded with small random numbers, but the binding modes can also optionally be seeded using International Union of Pure and Applied Chemistry (IUPAC) codes. Additional constraints can be imposed on the parameters to implement reverse-complement symmetric binding modes or translationally symmetric interactions.

Gauge fixing

Models with pairwise letter interactions are over-parametrized, meaning that an infinite set of parameter values $\vec{β}$ encode the same sequence specificity. Specifically, for any binding site sequence S, $\vec{β} \cdot \vec{X} (S)$ is invariant under transformations of the form

β_{ϕ} \to β_{ϕ} + A \forall ϕ \in Φ_{mono} (x_{1})

β_{ϕ} \to β_{ϕ} - A \forall ϕ \in Φ_{di} (x_{1}, x_{2}, n)

where Φ_mono(x₁) is the set of monomer features at position x₁; Φ_di(x₁, x₂, n) is the set of dimer features connecting positions x₁ and x₂ and with n at x₂; and A is the transformation coordinate. For visualization and model comparison purposes, it is convenient to select one representative model for each sequence specificity (analogous to gauge fixing in physics). Here, we use a convention that we call the ‘mismatch gauge’. In this convention, the coefficients are such that, first, only one monomer coefficient contributes for single-edit variations of reference sequence S₀, and, second, at most one of the dimer coefficients contributes for each double-edit variation of S₀. After imposing the mismatch gauge, the resulting PSAMs were visualized using standard energy logos²⁷, and the interaction coefficients were displayed using heat maps.

Benchmarking ProBound

Model training

To benchmark ProBound, we first curated a training database of published TF SELEX datasets^{7,8,10,12,13,28–30}. Although this database contained 2,272 datasets, Yang et al.³⁰ contained re-sequenced libraries from Jolma et al.²⁸, and, thus, the database contained 1,767 unique experiments. Datasets with low sequencing depth or low enrichment were filtered out as described below, giving 2,116 datasets (1,632 experiments).

We next developed a uniform computational pipeline to analyze each dataset. This was complicated by experimental differences between the SELEX platforms, including the number of selection rounds, selection strength and sequencing depth. Furthermore, several artifacts are known to impact HT-SELEX datasets, including contamination between wells, inconsistent selection between rounds and sequence biases^{6,19,23,25,28}. Although such challenges can be overcome using manual inspection^19,28, we instead chose to develop a fully automated system. This system first uses ProBound to analyze each dataset (subsampled to 100,000 reads per sequencing library) using three different settings (that differ in the number of binding modes and in how non-specific binding is modeled; see Extended Data Methods) and then prunes each fit to retain only the most relevant binding mode and, finally, selects the setting that produced the best-performing binding model (based only on the training data).

Model pruning

For each fit generated by ProBound, one binding mode typically captured the TF sequence specificity, and the other typically had small values or encoded platform-specific artifacts, such as sequence bias or contamination. Although identifying the biophysically relevant binding mode manually is straightforward in most cases, we wanted to automate this process and, therefore, developed a quality score that ranks and selects the most relevant binding mode:

r_{mode}^{2} + \log I_{mono}

Here, $r_{mode}^{2}$ is the Pearson correlation (across the SELEX probes in the training dataset) of the log-transformed binding affinity predicted by the mode (plus an optimized non-specific term) and the log-transformed binding predicted by the full fit, and I_mono is the information content of the mononucleotide coefficients after imposing the mismatch gauge. This score favors the binding mode that contributes the most to the final prediction and has the highest specificity. Conversely, it disfavors binding modes corresponding to sequence bias (which can affect many probes but typically have low information content) and contamination (which typically impacts few probes but can give rise to highly specific binding modes). We, thus, selected the binding mode with the highest quality score for downstream analysis.

Model selection

We next compared the binding models learned using the three settings. Although very similar in most cases, poor models were occasionally observed having suboptimal motif shifts or encoding the aforementioned artifacts. To automatically select the best model, we developed the quality score S_training, which measures model performance in predicting the training data. As the heterogeneity of the training data made it difficult to quantify this performance using a single measure, S_training was defined to be the average of six sub-scores that quantify different aspects of model performance:

\begin{matrix} S_{training} = & mean (\{F_{logit} (r_{fit,8mer}^{2}; 0.5), F_{logit} (R_{fit,affinity}^{2}; 0.95), F_{\log} (f_{fit,affinity}; 5.0), \\ & F_{logit} (R_{scoring,training}^{2}; 0.95), F_{\log} (M A F R_{scoring,training}; 5.0), \\ & F_{\log} (I_{scoring,mono}; 3.0)\}) \end{matrix}

where the functions $F_{logit} (x; x_{0}) = expit (logit (x) - logit (x_{0}))$ and $F_{\log} (x; x_{0}) = expit (\log (x) - \log (x_{0}))$ map the metric x to the unit interval such that the threshold x₀ maps to 0.5. Here,

$r_{fit,8mer}^{2}$ was computed by first using the full ProBound model to predict the training count table, then counting the number of occurrences $n_{8mer}^{obs/pred} (k, r)$ of each 8mer k in each round r of the observed and predicted count tables and then computing the observed and predicted 8mer enrichment between the first and last round using
$f_{8mer}^{obs/pred} (k) = \frac{1}{r_{last} - r_{first}} \log (\frac{1 + n_{8mer}^{obs/pred} (k, r_{last})}{1 + n_{8mer}^{obs/pred} (k, r_{first})})$ (21)
and, finally, computing the Pearson correlation between $f_{8mer}^{obs}$ and $f_{8mer}^{pred}$ .
$R_{fit,affinity}^{2}$ and f_fit,affinity were computed by first using the full ProBound model to predict the training count table. Then, for each pair of subsequent rounds r and next(r) (ignoring rounds with fewer than 10,000 reads), the probes were sorted (conjointly in the observed and predicted tables) by the predicted enrichment between the rounds. The probes were then divided into bins i associated with the observed and predicted probe counts $n_{bin}^{obs/pred} (i, r)$ such that $n_{bin}^{obs} (i, r) + n_{bin}^{obs} (i, next (r)) = 1000$ in each bin. After computing the observed and predicted enrichment using
$f_{bin}^{obs/pred} (i; r) = \frac{1}{next (r) - r} \log (\frac{1 + n_{bin}^{obs/pred} (i, next (r))}{1 + n_{bin}^{obs/pred} (i, r)})$ (22)
we finally computed the metrics
$R_{fit,affinity}^{2} = \max_{r} R_{i}^{2} (f_{bin}^{obs} (i; r), f_{bin}^{pred} (i; r))$ (23)

$f_{fit,affinity} = \max_{r} (\frac{\max_{i} f_{bin}^{obs} (i; r)}{\min_{i} f_{bin}^{obs} (i; r)})$ (24)
where $R_{i}^{2}$ denotes the coefficient of determination evaluated across bins i.
$R_{scoring,training}^{2}$ and MAFR_{scoring,training} were computed using the same method that was used to quantify generalization performance in predicting testing SELEX data (see below) but, instead, predicting the training data.
I_scoring,mono is the information content of the scoring model, computed using the monomer coefficients after imposing the mismatch gauge.
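The mapping of the six metrics onto the unit interval and their averaging can be sketched as follows. The function and argument names are hypothetical; the thresholds are those given in the definition of S_training above.

```python
import math


def expit(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of expit; defined only for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def f_logit(x, x0):
    """Maps a metric in (0, 1) to (0, 1) such that x = x0 gives 0.5."""
    return expit(logit(x) - logit(x0))


def f_log(x, x0):
    """Maps a positive metric to (0, 1) such that x = x0 gives 0.5."""
    return expit(math.log(x) - math.log(x0))


def s_training(r2_8mer, r2_affinity, f_affinity, r2_scoring, mafr_scoring, info_mono):
    """Average of the six sub-scores with the thresholds stated in the text."""
    scores = [
        f_logit(r2_8mer, 0.5),
        f_logit(r2_affinity, 0.95),
        f_log(f_affinity, 5.0),
        f_logit(r2_scoring, 0.95),
        f_log(mafr_scoring, 5.0),
        f_log(info_mono, 3.0),
    ]
    return sum(scores) / len(scores)


# A model sitting exactly at every threshold scores 0.5.
print(s_training(0.5, 0.95, 5.0, 0.95, 5.0, 3.0))
```

Note that F_log(x; x₀) simplifies to x / (x + x₀), so a metric three times its threshold maps to 0.75.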

Finally, as each of the re-sequenced experiments had two associated fits (based on data from Jolma et al.²⁸ and Yang et al.³⁰, respectively), we selected the fit with the best training performance S_training for benchmarking purposes.

Evaluation of model performance

To benchmark the resulting binding models, we curated a testing database of published SELEX (same as training database), PBM^58–60 and ENCODE ChIP-seq³² datasets. We then quantified the ability of the above binding models to predict the testing data. Binding models and testing data were matched by TF and species; if no match was found, the matching criteria were expanded to consider orthologous human and mouse TFs. For comparison, we also downloaded binding models from the JASPAR, DeepBind and HOCOMOCO databases, from the original HT-SELEX TF binding survey and from the recently published DeepSELEX method^{19,20,28,33,34}, and we repeated all analyses using these models. For the SELEX dataset predictions, comparisons were skipped if either the ProBound model or the downloaded model was known to be trained on the testing dataset in question (or other datasets from the same laboratory).

For the SELEX and PBM experiments, we used the binding models to predict the total affinity (denoted x_i) for each probe i and quantified how well these predictions agree with the measured binding y_i. For the SELEX experiments, the signal consisted of the probe count enrichment k_i,r+1 / k_i,r between subsequent SELEX rounds (with maximum normalized to 1). For the PBM experiments, the background-subtracted and minimum–maximum normalized binding signal was used. For both platforms, we encountered two challenges. First, the measurements for individual probes were too noisy to quantify model performance accurately (for SELEX, typical sequences were observed just once; for PBM, the signal depends strongly on the position of the binding site in the probe, which varies). Inspired by earlier PBM analyses that removed position bias by considering the 8mer-binned median signal^31,56, we sorted and binned the probes using x_i (with bin size 500 for SELEX and 10 for PBM) and then computed the binned signal y_i (using the bin-averaged enrichment, with pseudocount 1, for SELEX, and the median signal for PBM). Second, binding signals can be distorted by experimental artifacts, such as binding saturation, background and non-specific binding not captured by the model. To correct for such distortions, x_i was transformed using the binding saturation function:

{\hat{y}}_{i} = \frac{β_{0}}{1 + {(β_{C} (x_{i} + β_{NSB}))}^{- 1}}

Here, β₀ sets the scale, β_C > 0 sets the concentration and β_NSB sets the non-specific binding. These parameters were estimated by minimizing $\sum_{i} {[\log (y_{i} / {\hat{y}}_{i})]}^{2}$ for SELEX (with β₀ > 0 and β_NSB > 0) and $\sum_{i} {(y_{i} - {\hat{y}}_{i})}^{2}$ for PBM (for which y_i can be negative). Model quality was then quantified using the coefficient of determination R² of y_i and ${\hat{y}}_{i}$ (on a logarithmic scale for SELEX) and the MAFR, which is defined as $(\max_{i} y_{i}) / y_{bg}$ where y_bg is the weakest signal detected by the model. To estimate y_bg, we first defined a set of (binned) probes predicted to be bound as ${\hat{y}}_{i} > 1.25 Q_{1} (\hat{y})$ (where Q₁ is the first quartile) and then defined y_bg to be the smallest value of y_i identifying the bound set at 5% false discovery rate (FDR). For multi-round SELEX experiments, R² and the effective range were computed for all rounds, and the largest values were recorded.
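The saturation transform and the coefficient of determination can be sketched as below. The function names are hypothetical, the standard definition R² = 1 − SS_res / SS_tot is assumed, and the estimation of β₀, β_C and β_NSB by least squares is not shown.

```python
import math


def saturation(x, beta0, beta_c, beta_nsb):
    """y_hat = beta0 / (1 + 1 / (beta_c * (x + beta_nsb)))."""
    return beta0 / (1.0 + 1.0 / (beta_c * (x + beta_nsb)))


def r_squared(y, y_hat):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Toy binned affinities and signals; for SELEX the comparison is made on a
# logarithmic scale, so both vectors are log-transformed first.
x = [0.01, 0.1, 1.0, 10.0]
y_hat = [saturation(v, beta0=1.0, beta_c=1.0, beta_nsb=0.05) for v in x]
y = [0.06, 0.14, 0.5, 0.9]
print(r_squared([math.log(v) for v in y], [math.log(v) for v in y_hat]))
```

For small x the transform is dominated by β_NSB, and for large x it saturates at β₀, which is how the three parameters absorb non-specific binding and saturation, respectively.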

For the ChIP-seq experiments, we quantified model performance using the AUPRC in classifying binding peak versus background sequences. To get the peak sequences, we downloaded narrowPeak files from the ENCODE portal (see below) and extracted the genome sequence from the 500 peaks with the strongest enrichment. To generate the background set, we shifted the peak interval one peak length to the left and right and extracted the genome sequences.

Filtering of SELEX training datasets

We first curated a database of published SELEX experiments and downloaded the associated raw sequencing data^{7,8,10,12,13,28–30}. Methylated SELEX experiments were not considered. For each experiment, we downsampled the sequencing libraries to contain, at most, 100,000 reads and tabulated the probe counts in each SELEX round. We then filtered out low-quality experiments using three criteria. First, low-coverage experiments were removed by requiring at least two rounds to have at least 10,000 reads. Second, experiments were discarded if no sequencing library before round three had 10,000 or more reads. Third, experiments with low enrichment were discarded. The enrichment was quantified by first tabulating the frequencies p(k, r) (using pseudocount 5) of all 5mers k in each SELEX round r and then, for each pair of rounds r_1 and r_2 (r_2 > r_1) with 10,000 or more reads, computing the rescaled Kullback–Leibler (KL) divergence

\begin{matrix} D_{KL} (r_{2}, r_{1}) & = & \frac{1}{r_{2} - r_{1}} \sum_{k} p (k, r_{2}) \log_{2} \frac{p (k, r_{2})}{p (k, r_{1})} \end{matrix}

Only experiments with rescaled KL divergence exceeding 0.01 for at least one combination of rounds were retained.
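The enrichment filter can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions not stated in the text: the pseudocount is added to every one of the 4^5 possible 5mers, and k-mers containing characters other than A, C, G and T are ignored.

```python
import math
from collections import Counter
from itertools import product

def kmer_frequencies(reads, k=5, pseudocount=5):
    """Frequencies p(k, r) of all k-mers in one round, with a pseudocount."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}

def rescaled_kl(freq_late, freq_early, r_late, r_early):
    """KL divergence (in bits) of the later round from the earlier one,
    divided by the number of rounds separating them."""
    kl = sum(p * math.log2(p / freq_early[m]) for m, p in freq_late.items())
    return kl / (r_late - r_early)
```

An experiment would then be retained if `rescaled_kl` exceeds 0.01 for at least one eligible pair of rounds.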

Scoring of binding probes

In quantifying generalization performance, we predicted the occupancy of DNA sequences using both the ProBound binding models and previously published models. For DeepBind, we exponentiated the scores returned from the deepbind scoring tool, which are proportional to binding affinity. For JASPAR and the original HT-SELEX TF survey, the binding models were position–frequency matrices (containing counts). These were first converted to position probability matrices (PPMs, using a pseudocount of 1), which were then used to compute the binding probability at each offset in the sequence. The occupancy was then defined to be the sum of the binding probabilities. For HOCOMOCO, the binding models were PPMs, and the occupancies were computed as described above. For DeepSELEX, which outputs the difficult-to-interpret quantity $A = \max (\vec{p} (R_{4})) + \max (\vec{p} (R_{3})) - \max (\vec{p} (R_{0})) \in [- 1, 2]$ (where $\vec{p} (R_{k})$ is a vector containing the predicted probability for SELEX round k along the scored sequence), the values were transformed using the linear map (A + 1) / 3 to occupy [0, 1].
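The matrix-based occupancy score can be sketched as below. The function names are hypothetical. The sketch assumes that the pseudocount is added to every matrix cell and that only the forward strand is scanned; the text does not specify either detail.

```python
def pfm_to_ppm(pfm, pseudocount=1.0):
    """Convert a position-frequency matrix (list of per-position count
    dicts keyed by A/C/G/T) into a position probability matrix."""
    ppm = []
    for column in pfm:
        total = sum(column[b] + pseudocount for b in "ACGT")
        ppm.append({b: (column[b] + pseudocount) / total for b in "ACGT"})
    return ppm

def occupancy(seq, ppm):
    """Sum over all offsets of the product of per-position probabilities."""
    width = len(ppm)
    total = 0.0
    for offset in range(len(seq) - width + 1):
        prob = 1.0
        for j in range(width):
            prob *= ppm[j][seq[offset + j]]
        total += prob
    return total

def rescale_deepselex(a):
    """Linear map (A + 1) / 3 taking the DeepSELEX score from [-1, 2] to [0, 1]."""
    return (a + 1.0) / 3.0
```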

ENCODE ChIP-seq datasets

ENCODE datasets were downloaded in December 2018 using this query string.

Binding by multi-protein complexes

ProBound analysis

ProBound was configured to jointly analyze SELEX experiments performed with different combinations of TFs, as described in the Extended Data Methods. In the case of Hth-Exd-Ubx, we analyzed published SELEX-seq data for Exd-Ubx, Hth, Exd and Ubx. In addition, we performed a SELEX-seq assay for Hth-Exd-Ubx (see below). CAP-SELEX data for human TF pairs were analyzed jointly with matched single-TF HT-SELEX data as described in the Extended Data Methods and Supplementary Table 3.

Experimental protocol

The Hth-Exd-Ubx SELEX experiment was carried out following previously published methods^8,61. In brief, after expressing and purifying the wild-type homeodomain proteins, a final concentration of 50 nM was assembled, incubated with excess DNA (10–20 fold) for 30 minutes and loaded onto an EMSA gel. A DNA library with 30 randomized bases was used. The TF-bound fraction was isolated from the gel and amplified and either subjected to another round of enrichment or prepared for sequencing. Three rounds of enrichment were performed. After each selection round, the DNA was extracted from the gel and amplified by using Illumina’s small RNA primer sets. Sequencing barcodes were added in a five-cycle PCR step, and the final library was gel-purified using a native TBE gel before sequencing. Libraries were sequenced at the New York Genome Center using separate lanes on an Illumina HiSeq 2000 sequencing machine.

Effect of DNA methylation

ProBound analysis

ProBound learns methylation-aware binding models by jointly analyzing normal and methylated SELEX libraries after encoding the methylation state of each base pair using an extended alphabet (Extended Data Fig. 4a and configuration in Extended Data Methods). Encoding methylation status in this manner allows us to infer the position-specific free-energy impact of such chemical modifications. For the ATF4/CEBPγ homodimers and heterodimers, we jointly analyzed two published EpiSELEX-seq experiments for ATF4 and CEBPγ and a new EpiSELEX-seq experiment that included both ATF4 and CEBPγ. We also generated EpiSELEX-seq data for CEBPγ in combination with the chemical modifications meCpG, 5hmC and 6mA.

Experimental protocol

ATF4 protein purification and EpiSELEX-seq experiments were performed as described previously¹³. Purified CEBPγ protein was kindly donated by the Lomvardas laboratory at the Zuckerman Institute at Columbia University. To generate randomized 5hmC or 6mA libraries, single-stranded oligos with a 16-bp randomized region were ordered from TriLink Biotechnologies, substituting (1) deoxycytidine triphosphate (dCTP) with deoxy-(5hm)-cytidine triphosphate (d5hmCTP) or (2) deoxyadenosine triphosphate (dATP) with deoxy-(6m)-adenosine triphosphate (d6ATP) during the synthesis step. For double-stranding, a standard mix of deoxy-nucleotides was used, resulting in hemi-modified libraries. meCpG libraries were generated by enzymatic treatment with M.SssI (NEB) as described previously¹³. The library sequences consisted of left and right constant adapters (GGTAGTGGAGG- and -CCAGGGAGGTGGAGTAGG, respectively) flanking a library-specific barcode and a 16-bp randomized sequence:

no modification: -TGGG-CCTGG-N16-
meCpG: -GCAC-CCTGG-N16-
5hmC-Library: -CAGT-CCTGG-N16- (5hmC instead of C in 16N)
6mA-Library: -AGTG-CCTGG-N16- (6mA instead of A in 16N)

GLM analysis of ATF4 and CEBPγ ChIP data

To estimate the effect of DNA methylation on in vivo ATF4 and CEBPγ binding, we first scanned the genome for close-to-consensus motif matches i with CG at positions predicted by the model to have strong methylation readout: TGACGTCA and TGACGTCG for ATF4:ATF4; TTGCGCAA for CEBPγ:CEBPγ; and TTGCGTCA and TTGCATCG for CEBPγ:ATF4. We next downloaded aligned ATF4 and CEBPγ ChIP-seq reads and matched input from ENCODE (ENCFF872NFM, ENCFF801LQC and ENCFF713PVH), extended the alignments to 125 bp and computed the genome coverages (k_ATF4,i, k_CEBPγ,i, k_Input,i) at each motif match. The DNase-seq coverage (k_DNase,i, ENCFF971AHO) and bisulfite sequencing methylation status (f_meCpG,i, ENCSR765JPC, binarized using 20% and 80% thresholds and keeping matches with at least ten reads) were also recorded. We finally modeled the ATF4 and CEBPγ ChIP-seq coverage at the relevant motif matches (excluding CEBPγ:CEBPγ matches for ATF4 and ATF4:ATF4 matches for CEBPγ) using two separate binomial generalized linear models:

k_{ChIP,i} ~ Binomial (k_{ChIP, i} + k_{Input, i}, \frac{e^{η_{i}}}{1 + e^{η_{i}}})

η_{i} = β_{0, a} + k_{DNase, i} β_{DNase} + f_{meCpG, i} β_{meCpG, a}

In this model, β_0,a encodes the relative affinity of motif a; β_DNase encodes the impact of DNA accessibility; and β_meCpG,a encodes the impact of DNA methylation for motif a and is the sought-after variable. The significance of the methylation readout was assessed using an F-test (Supplementary Table 4). For TGACGTCG, we assumed that the methylation readouts of the two CGs contribute independently and that the readout of the central CG can be estimated using the sequence TGACGTCA.
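To make the structure of the binomial model concrete, the following sketch evaluates the logistic link and the log-likelihood contribution of a single motif match. The function names are hypothetical, and the fitting step itself (which would normally be delegated to a statistics package) is omitted.

```python
import math

def chip_fraction(beta0_a, beta_dnase, beta_mecpg_a, k_dnase, f_mecpg):
    """Expected ChIP share of the ChIP + Input coverage, e^eta / (1 + e^eta)."""
    eta = beta0_a + k_dnase * beta_dnase + f_mecpg * beta_mecpg_a
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_log_likelihood(k_chip, k_input, p):
    """Log-probability of k_chip successes out of k_chip + k_input trials."""
    n = k_chip + k_input
    log_binom = math.lgamma(n + 1) - math.lgamma(k_chip + 1) - math.lgamma(k_input + 1)
    return log_binom + k_chip * math.log(p) + k_input * math.log(1.0 - p)
```

A negative value of the methylation coefficient lowers `chip_fraction` at methylated matches (f_meCpG,i = 1) relative to unmethylated matches of the same motif.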

Inferring absolute K_Ds

The K_D-seq assay incubates a TF (or other protein) with a library of DNA probes (or RNA or peptide probes), separates the bound and free probes and sequences the input (I), bound (B) and free (F) fractions. In equilibrium, the probability that probe i is bound or free is given by

\begin{matrix} p (B ∣ i) & = & \frac{{[{DNA}_{i}]}_{B}}{{[{DNA}_{i}]}_{I}} = \frac{{[P]}_{F}}{{[P]}_{F} + K_{D, i}} \\ p (F ∣ i) & = & \frac{{[{DNA}_{i}]}_{F}}{{[{DNA}_{i}]}_{I}} = \frac{K_{D, i}}{{[P]}_{F} + K_{D, i}} \end{matrix}

where ${[{DNA}_{i}]}_{I}$ , ${[{DNA}_{i}]}_{B}$ and ${[{DNA}_{i}]}_{F}$ are the probe concentrations in the input, bound and free libraries; [P]_F is the free protein concentration; and K_D,i is the dissociation constant that we want to measure. The sequencer does not measure p(B∣i) or p(F∣i) directly but, rather, gives the probe counts k_i,I, k_i,B and k_i,F. The expectation values of these counts are given by

\begin{matrix} \frac{E [k_{i, I}]}{k_{I}} & = & \frac{{[{DNA}_{i}]}_{I}}{{[DNA]}_{I}} = p (i) \\ \frac{E [k_{i, B}]}{k_{B}} & = & \frac{{[{DNA}_{i}]}_{B}}{{[DNA]}_{B}} = p (i ∣ B) \\ \frac{E [k_{i, F}]}{k_{F}} & = & \frac{{[{DNA}_{i}]}_{F}}{{[DNA]}_{F}} = p (i ∣ F) \end{matrix}

where [DNA]_I, [DNA]_B and [DNA]_F are the DNA concentrations in the respective fractions and k_I, k_B and k_F are the sequencing depths of the libraries, which are treated as fixed experimental settings. To estimate the dissociation constants, note that

\frac{K_{D, i}}{{[P]}_{F}} = \frac{p (F ∣ i)}{p (B ∣ i)} = \frac{p (i ∣ F) p (F)}{p (i ∣ B) p (B)}

where p(B) and p(F) are the net fractions of DNA that are bound and free. Intuitively, these fractions can be estimated from the data by finding the values that make the observed probabilities in Eq. (30) satisfy the sum rule:

p (i) = p (i, F) + p (i, B) = p (i ∣ F) p (F) + p (i ∣ B) p (B)
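The intuition behind the sum rule can be illustrated with a small sketch. Note that this is not how ProBound works internally (it maximizes the likelihood of the full model); the least-squares estimate below is a simplified stand-in, and the function names are hypothetical.

```python
def estimate_bound_fraction(p_input, p_bound, p_free):
    """Least-squares p(B) satisfying p(i) = p(i|F) (1 - p(B)) + p(i|B) p(B).

    Rearranged: p(i) - p(i|F) = p(B) * (p(i|B) - p(i|F)) for every probe i.
    """
    num = sum((pi - pf) * (pb - pf)
              for pi, pb, pf in zip(p_input, p_bound, p_free))
    den = sum((pb - pf) ** 2 for pb, pf in zip(p_bound, p_free))
    return num / den

def kd_over_free_protein(p_bound, p_free, frac_bound):
    """K_D,i / [P]_F = p(i|F) p(F) / (p(i|B) p(B)) for every probe i."""
    return [pf * (1.0 - frac_bound) / (pb * frac_bound)
            for pb, pf in zip(p_bound, p_free)]
```

If the free and input frequencies are identical, the numerator vanishes and the estimate is p(B) = 0, which is why detectable depletion in the free library matters.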

ProBound can be configured to learn a K_D model by analyzing the probe frequencies in the input, bound and free libraries (r = {I, B, F}). Specifically, configuring ProBound to use the non-cumulative enrichment model (Eq. (7)) with ρ_r = {0, 1, 0} and γ_r = {0, − 1, − 1} and restricting the activities to be constant across columns implements the binding probabilities in Eq. (29). With these settings, the dissociation constant is

K_{D, i} = {[P]}_{F} / Z_{bound, i}

Here, the free-protein concentration can be computed using

{[P]}_{F} = {[P]}_{T} - {[DNA]}_{I} p (B)

where [P]_T is the total protein concentration. In most cases, [P]_F is close to the more readily measured [P]_T due to the low average affinity of randomized ligand libraries. However, here, p(B) is implicitly estimated by ProBound and can be computed by equating the expected counts in ProBound

E [k_{i, I}] = η_{I} f_{i, I}

E [k_{i, B}] = η_{B} f_{i, I} p (B ∣ i)

E [k_{i, F}] = η_{F} f_{i, I} p (F ∣ i)

with the corresponding expectation values in Eq. (30), computing the bound-to-input ratio, and using Bayes’ theorem to simplify, giving

p (B) = \frac{k_{B}}{k_{I}} \frac{η_{I}}{η_{B}}
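For readers who want the intermediate steps, the result above can be reconstructed as follows. This is our reading of the derivation, using only the expected counts defined above and Bayes' theorem in the form p(i∣B) = p(B∣i) p(i) / p(B).

```latex
\begin{aligned}
\frac{E[k_{i,B}]}{E[k_{i,I}]} &= \frac{\eta_{B}}{\eta_{I}}\, p(B \mid i)
  && \text{(ProBound expected counts)} \\
\frac{E[k_{i,B}]}{E[k_{i,I}]} &= \frac{k_{B}}{k_{I}}\, \frac{p(i \mid B)}{p(i)}
  = \frac{k_{B}}{k_{I}}\, \frac{p(B \mid i)}{p(B)}
  && \text{(Eq. (30) and Bayes' theorem)} \\
\Rightarrow \quad p(B) &= \frac{k_{B}}{k_{I}}\, \frac{\eta_{I}}{\eta_{B}}
\end{aligned}
```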

To test the modeling assumptions (Fig. 4c), the probes were binned by the predicted K_D,i, and, for each bin, the observed and predicted binding probabilities were computed using

p (B ∣ i) = \frac{E [k_{i, B}]}{E [k_{i, I}]} \frac{η_{I}}{η_{B}}

Here, E[k_i,B] and E[k_i,I] were evaluated using the observed and predicted read counts in each bin.

Simulations

To test the theoretical consistency of the K_D-seq, we developed simulations of the assay and analyzed the resulting reads with ProBound to see if the ‘ground truth’ parameters used in the simulations were recovered. In a first set of simulations, we computed the binding equilibrium for different TF and DNA library concentrations to test the theoretical consistency and robustness of our approach. A major goal of these simulations was to see if K_D-seq suffers from being in the ‘titration regime’⁶². For single-ligand binding experiments, the titration regime occurs when the concentration of the constant fraction (for example, the DNA probes) greatly exceeds the dissociation constant of the interaction; in this regime, most of the varied fraction (for example, the TF) will be bound until the total concentration of the varied fraction exceeds that of the constant fraction. The resulting quick change in the (unobserved) free concentration makes extraction of accurate K_D values challenging. We, thus, wondered if this phenomenon impacts K_D-seq, which uses a library of randomized (mostly low-affinity) DNA probes.

To simulate this, we first enumerated all 10-bp DNA probe sequences and computed the K_D values of these using the binding model for Dll shown in Fig. 4b as the ground truth. To model the coupled binding equilibrium, we first estimated the initial probe frequencies ${[{DNA}_{i}]}_{I}$ by matching the base frequencies to those observed in the input library (28.8% A, 26.5% C, 14.4% G and 30.3% T) and then used the secant method to find the root of

{[P]}_{F} = {[P]}_{T} - \sum_{i} {[{DNA}_{i}]}_{I} \frac{{[P]}_{F}}{{[P]}_{F} + K_{D, i}},

and finally used the resulting value of [P]_F combined with equations (29) and (30) to compute the relative concentrations of all probes in the input, bound and free libraries. Then, 10⁶ sequences were sampled for each library using the multinomial distribution, and ProBound was finally used to learn a K_D-model. This procedure was repeated for all combinations of [P]_T and [DNA]_I used in Fig. 4e. As expected, the fraction of bound TF molecules increased with DNA concentration (ranging between 0.2–1.1%, 1.0–5.5% and 4.8–24% in the simulations with 20 nM, 100 nM and 500 nM DNA, respectively (Extended Data Fig. 8c)). Thus, although both the TF and total DNA concentrations exceed the K_D for the strongest sequence, the concentration of such probes is very low (because a large majority of probes have low affinity; Extended Data Fig. 8b), and the titration regime can generally be avoided (also see ‘Practical guidelines’ below). Finally, the inferred K_D values were very close to those predicted by the ground truth model (Fig. 8e), demonstrating the theoretical consistency of our approach.
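The root-finding step for the coupled equilibrium can be sketched with a plain secant iteration. The function name, the starting guesses ([P]_T/2 and [P]_T) and the stopping rule are assumptions made for this illustration, not details taken from the original simulations.

```python
def free_protein(p_total, dna_conc, kd, tol=1e-12, max_iter=100):
    """Solve [P]_F = [P]_T - sum_i [DNA_i]_I [P]_F / ([P]_F + K_D,i)
    for [P]_F using the secant method."""
    def residual(p_free):
        bound = sum(d * p_free / (p_free + k) for d, k in zip(dna_conc, kd))
        return p_total - bound - p_free

    x0, x1 = 0.5 * p_total, p_total
    f0, f1 = residual(x0), residual(x1)
    for _ in range(max_iter):
        if f1 == f0:  # flat secant: no further progress is possible
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        x0, f0 = x1, f1
        x1, f1 = x2, residual(x2)
        if abs(f1) < tol * p_total:
            break
    return x1
```

With a single probe species the result can be checked against the closed-form quadratic solution, which is the same expression that appears in the EMSA binding curve.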

In a second set of simulations, we investigated how slow binding kinetics of high-affinity probes might impact the final K_D model. To this end, we modeled the binding kinetics of the library using

\partial_{t} {[{DNA}_{i}]}_{B} = k_{on, i} {[P]}_{F} {[{DNA}_{i}]}_{F} - k_{off, i} {[{DNA}_{i}]}_{B}

where k_on,i and k_off,i are the on-rates and off-rates for probe i. Because most protein is free even at equilibrium (see the equilibrium simulation above), we solved this differential equation under the assumption [P]_F = [P]_T, giving

p (B ∣ i, t) \equiv \frac{{[{DNA}_{i}]}_{B} (t)}{{[{DNA}_{i}]}_{I}} = \frac{{[P]}_{T}}{{[P]}_{T} + K_{D, i}} (1 - e^{- t (k_{off, i} + {[P]}_{T} k_{on, i})})

To simulate the scenario where high-affinity probes have the slowest kinetics, we assumed that k_on is diffusion limited (and, thus, sequence independent) and that the sequence specificity is driven by variation in k_off. After expressing k_off,i in terms of the value for the highest-affinity sequence,

k_{off, i} = k_{off, \min} \frac{K_{D, i}}{K_{D, \min}},

the binding probability becomes:

p (B ∣ i, t) = \frac{{[P]}_{T}}{{[P]}_{T} + K_{D, i}} (1 - e^{- k_{off, \min} t (K_{D, i} + {[P]}_{T}) / K_{D, \min}})

Note that this probability only depends on k_on and k_off through K_D, which is known, and $k_{off, \min}$ . To test how robust K_D-seq is to the value of the latter, we simulated experiments with $k_{off, \min} t \in \{0.001, 0.01, 0.1\}$ (Extended Data Fig. 8f), analyzed the resulting reads using ProBound and compared the final K_D model to the ground truth parameters used in the simulation (Extended Data Fig. 8g). This showed that the true model was recovered for $t \geq 0.1 k_{off, \min}^{- 1}$ , with even shorter incubation times being acceptable at high protein concentrations.
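The time-dependent binding probability above is simple to evaluate directly. The sketch below uses a hypothetical function name and takes the dimensionless product k_off,min · t as a single argument, mirroring how the simulations were parameterized.

```python
import math

def bound_probability(kd, p_total, kd_min, koff_min_t):
    """p(B | i, t) when k_on is sequence independent and
    k_off,i = k_off,min * K_D,i / K_D,min."""
    equilibrium = p_total / (p_total + kd)
    rate_t = koff_min_t * (kd + p_total) / kd_min
    return equilibrium * (1.0 - math.exp(-rate_t))
```

Because the exponent scales with (K_D,i + [P]_T) / K_D,min, low-affinity probes and high protein concentrations both push the probability towards its equilibrium value sooner.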

Experimental protocol

6×His tagged Drosophila Dll protein lacking amino acids N terminal to its homeodomain (DllΔN) was purified by standard procedures. Next, 0.05% Tween 20 was included in the lysis buffer and in the elution buffer to prevent the target protein from sticking to plasticware. The purified protein was quantified by Bradford assay, using BSA as the standard. The 10mer R0 library was generated by annealing the library oligo (GTTCAGAGTTCTACAGTCCGACCTGG-10N-CCAGGACTCGGACCTGGACTAGG) and the SELEX-R primer (CCTAGTCCAGGTCCGAGT), followed by a Klenow-mediated primer extension reaction. The library DNA was purified using Qiagen minElute columns and was quantified using NanoDrop. The SELEX procedure was largely the same as previously described⁸, except that a Cy5-labeled DNA probe, instead of a P32-labeled probe, was used as the marker to indicate where the bound and unbound fractions were. The Cy5-labeled DNA probe was generated by annealing a Cy5-labeled primer to a DNA probe with the desired DNA sequence, followed by Klenow reaction. EDTA was used to stop the reaction. The probe was directly used in the binding reaction, without further purification.

For each SELEX condition, 15 μl of protein solution (at 2× final concentration) in dialysis buffer (20 mM HEPES pH 8.0, 200 mM NaCl, 10% glycerol, 2 mM MgCl₂, 0.05% Tween 20) was made. The library mixture was made by adding the desired amount of the R0 library to 6 μl of 5× binding buffer (50 mM Tris-HCl pH 7.5, 250 mM NaCl, 5 mM MgCl₂, 20% glycerol, 2.5 mM DTT, 2.5 mM EDTA, 125 ng μl⁻¹ of polydIdC, 100 ng μl⁻¹ of BSA, 0.125% Tween 20) and filling to 15 μl with water. The protein and DNA parts were mixed and incubated at room temperature for 30–40 minutes before loading the gel. For Cy5-labeled markers, 15 μl of 200 nM DllΔN in dialysis buffer was mixed with 15 μl of DNA mixture (6 μl of 5× binding buffer, 8 μl of water and 1 μl of 200 nM probe) and incubated at room temperature for 30–40 minutes.

After running the gel, gel slices corresponding to the bound and unbound fractions were cut from the gel and were each placed in a 500-μl tube with several needle-poked holes at the bottom. The 500-μl tubes were each placed within a 2-ml tube and spun at maximum speed at room temperature to smash the gel. Then, 650 μl of DNA extraction buffer (10 mM Tris-HCl, pH 7.5, 150 mM NaCl, 1 mM MgCl₂, 0.5 mM EDTA, pH 8.0) and 50 μl of 20% SDS were added to each smashed gel sample, and the tubes were rotated at room temperature for 2–4 hours. The tubes were then spun at maximum speed at room temperature for 2 minutes. Then, 650 μl of sample was transferred to a Spin-X filter column and spun at room temperature at the maximum speed for 2 minutes. The DNA in flow-through was purified by phenol chloroform extraction, followed by isopropanol precipitation. Then, 20 μg of glycogen was used to facilitate precipitation, and the DNA pellet was dissolved in 20 μl of Qiagen EB buffer.

Each purified SELEX DNA was properly diluted such that the following PCR program gave good library yield for all samples. The one-step library preparation was done in a 50-μl reaction, which contains 5 μl of properly diluted SELEX DNA, 10 nM of one of the eight SELEX-for primers, 10 nM of the common SELEX-rev primer, 1 μM of NEB universal primer for Illumina and 1 μM of selected NEB index primer for Illumina. PCR was done with the Phusion DNA polymerase (NEB), using the following program: one cycle of 98 °C for 30 seconds; five cycles of 98 °C for 10 seconds, 60 °C for 30 seconds and 72 °C for 15 seconds; ten cycles of 98 °C for 10 seconds and 65 °C for 75 seconds; one cycle of 65 °C for 5 minutes; and hold at 4 °C. Amplified libraries were purified using 1.5 volume (75 μl) of AMPure beads and eluted with 15 μl of Qiagen EB buffer. The libraries were pooled and sequenced using Illumina NextSeq 550, following standard procedures. The forward primers consisted of left and right constant sequences (ACACTCTTTCCCTACACGACGCTCTTCCGATCT- and -GTTCAGAGTTCTACAGTCCGA, respectively), flanking a library-specific barcode: 1) --, 2) -AGAC-, 3) -TCAGAC-, 4) -CAGAC-, 5) -C-, 6) -GAC-, 7) -AC- and 8) -TTCAGAC-. In addition, we used the reverse primer GACTGGAGTTCAGACGTGTGCTCTTCCGATCT-CCTAGTCCAGGTCCGAGT, the NEB universal primer AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA-CACGACGCTCTTCCGATCT and the NEB index primer CAAGCAGAAGACGGCATACGAGAT-[6bp index]-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT.

EMSA validation

The same batch of the DllΔN protein that was used in the SELEX experiments was also used in the measurement of the absolute K_D values of DllΔN to selected DNA sequences. The EMSA experiments were performed following a standard protocol. In brief, the protein was diluted with dialysis buffer to 2× of the desired final concentration in a total volume of 15 μl. The DNA mixture was made by mixing 6 μl of 5× binding buffer, 8 μl of water and 1 μl of 200 nM Cy5-labeled DNA probe. The DNA probes had the same flanks as the 10mer SELEX library and the indicated middle 10 bp. The protein part and the DNA part were mixed well (giving a final DNA probe concentration of 6.7 nM) and incubated at room temperature for 30–40 minutes before loading the 0.5× native TBE gel.

After running the gel, an image was taken using the Typhoon imager, and the band intensity was quantified using Fiji version 1.52n (Supplementary Table 5). In brief, each band was selected using the rectangle selection tool, and the selected regions were converted to histograms. A straight line was drawn at the bottom of each histogram, and the areas of the enclosed peak regions were quantified and used as band intensity.

For each probe, K_D was finally estimated by fitting the binding probability

\begin{matrix} p (B; {[P]}_{T, a}) \\ = {(1 + \frac{2 K_{D}}{{[P]}_{T, a} - {[DNA]}_{T} - K_{D} + \sqrt{{({[P]}_{T, a} - {[DNA]}_{T} - K_{D})}^{2} + 4 K_{D} {[P]}_{T, a}}})}^{- 1}, \end{matrix}

where [P]_T,a is the total TF concentration in band a, and [DNA]_T is the total DNA concentration, to the quantitated intensities y_B,a and y_F,a of the bound and free bands, respectively (Supplementary Table 5). Specifically, after introducing the band-specific intensity scaling factors α_B and α_F, we found the parameters that minimized the loss function

\begin{matrix} \mathcal{L} (K_{D}, α_{B}, α_{F}) \\ = \sum_{a} [{(p (B; {[P]}_{T, a}) - α_{B} y_{B, a})}^{2} + {((1 - p (B; {[P]}_{T, a})) - α_{F} y_{F, a})}^{2}] . \end{matrix}
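The EMSA binding curve and loss function can be written compactly as below. The function names are hypothetical and the minimization over (K_D, α_B, α_F) is left to an optimizer of choice; the sketch only evaluates the two expressions and requires a strictly positive total protein concentration.

```python
import math

def emsa_bound_probability(p_total, dna_total, kd):
    """Fraction of probe bound for single-site binding with ligand depletion."""
    delta = p_total - dna_total - kd
    # Free protein from the quadratic mass-balance equation.
    p_free = 0.5 * (delta + math.sqrt(delta ** 2 + 4.0 * kd * p_total))
    return 1.0 / (1.0 + kd / p_free)

def emsa_loss(kd, alpha_b, alpha_f, p_totals, dna_total, y_bound, y_free):
    """Squared error between the predicted bound/free fractions and the
    rescaled band intensities, summed over all protein concentrations."""
    loss = 0.0
    for p_t, y_b, y_f in zip(p_totals, y_bound, y_free):
        p = emsa_bound_probability(p_t, dna_total, kd)
        loss += (p - alpha_b * y_b) ** 2 + ((1.0 - p) - alpha_f * y_f) ** 2
    return loss
```

The term 2 K_D divided by the bracketed sum in the equation equals K_D / [P]_F, so the curve reduces to the familiar [P]_F / ([P]_F + K_D) once the free protein concentration is known.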

Practical guidelines

As with any assay, K_D-seq can produce inaccurate measurements given unsuitable experimental conditions. One strength of K_D-seq is that many such conditions can be diagnosed computationally. Below are practical guidelines for designing successful K_D-seq experiments and for detecting problems, should they occur.

Robust probe depletion in the free library. For a K_D-seq experiment to be successful, ProBound needs to estimate the net fraction of bound DNA p(B). Intuitively, ProBound accomplishes this by separately computing the relative probe frequencies in the input, bound and free libraries and then finding the value of p(B) that makes the relative frequencies satisfy the sum rule in equation (32) (technically, ProBound maximizes the likelihood of the full model, as detailed above). For this estimate to be robust, it is important that some high-affinity probes have detectable depletion in the free library; otherwise, the input and free libraries are identical, and the sum rule is satisfied for p(B) = 0. This estimate becomes less robust in two experimental regimes. First, no probe will be depleted if the TF concentration is well below the K_D of the strongest probe. Second, the depletion signal in the free library is reduced when [DNA]_I ≫ [P]_T because, at most, a small fraction of the library can be bound in this regime. An example of the latter is the experiment with 500 nM DNA and 100 nM TF, where only 2% of the library was bound. Computationally, low depletion in the free library is most easily detected using the enrichment plots in Fig. 4c.

Robust estimate of relative binding affinities. ProBound estimates relative K_D values using both probe enrichment in the bound library and probe depletion in the free library. Thus, although saturation compresses the relative selection for high-affinity probes in the bound library (because all saturated probes have P(B∣i) ≈ 1), relative K_D values can still be estimated because the saturated probes differ in depletion in the free library. However, because the number of reads corresponding to high-affinity probes decreases as these probes become increasingly saturated, excessive saturation (that is, ${[P]}_{T} ≫ \min_{i} K_{D, i}$ ) tends to make the K_D estimates for the highest-affinity probes less robust. Examples of this include the experiments with 3,300 nM Dll in Fig. 4e. Excessive saturation is most easily detected using the enrichment plots in Fig. 4c.

Avoiding the titration regime. As discussed above, single-ligand K_D measurements can be compromised when conducted in the ‘titration regime’⁶²; if K_D is much smaller than the ligand concentration (assuming this is the constant fraction), K_D no longer corresponds to the protein concentration at which 50% of ligands are bound but must, rather, be estimated through non-linear curve fitting that models titration to estimate the free protein concentration. However, such curve fitting becomes increasingly error-prone as the ligand concentration increases. This regime should generally be avoided.

However, K_D-seq has two advantages compared to single-ligand experiments: First, the vast majority of ligands have low affinity (see simulation above), and the concentration of high-affinity ligands is, therefore, much lower than the total ligand concentration. Thus, titration can be avoided even when the total library concentration substantially exceeds the smallest K_D in the library. Second, ProBound estimates the fraction of ligands bound, which, in turn, can be used to estimate the fraction of protein bound (Equation (34)). This provides an internal measure to monitor titration effects. If more than 5–10% of the TF molecules are estimated to be bound (for example, experiment with 500 nM library and 100 nM Dll in Fig. 4e), the assay should be repeated with decreased library concentration.

Binding equilibrium. For K_D measurements to be accurate, it is important that the binding reaction has reached equilibrium⁶². In particular, high-affinity probes can have a low off-rate and, thus, take longer time to reach equilibrium. However, our simulations above indicated that K_D-seq produces stable binding models after 10% of the naively expected equilibrium time (based on the off-rate for the highest-affinity probe). To understand this, note that Equation (42) can be used to express the equilibration time t_eq,i for probe i as

t_{eq, i} = k_{off, i}^{- 1} \frac{1}{1 + {[P]}_{T} / K_{D, i}}

We, thus, see that saturated probes, which have [P]_T/K_D,i > 1, reach binding equilibrium faster than naively expected given $k_{off, i}^{- 1}$ . This observation, combined with the experimental constraint that high-affinity probes should be at least moderately saturated (see above), explains the relative robustness of K_D-seq with regard to incubation time. Nonetheless, when working with systems for which the off-rates are unknown, it is advisable to repeat the assay for multiple incubation times to validate that equilibrium has been reached.
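The expression for the equilibration time follows from the relaxation rate in the exponent of the kinetic solution above. The short derivation below is our reconstruction, and the factor of 9 in the worked example is chosen purely for illustration.

```latex
\begin{aligned}
t_{eq,i}^{-1} &= k_{off,i} + [P]_{T}\, k_{on,i}
  = k_{off,i} \left( 1 + \frac{[P]_{T}}{K_{D,i}} \right),
  \qquad K_{D,i} = \frac{k_{off,i}}{k_{on,i}} \\
[P]_{T} = 9\, K_{D,i} \;&\Rightarrow\;
  t_{eq,i} = \frac{k_{off,i}^{-1}}{1 + 9} = 0.1\, k_{off,i}^{-1}
\end{aligned}
```

A probe that is 90% saturated at equilibrium therefore equilibrates ten times faster than its off-rate alone would suggest.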

Validating the binding curve. Although ProBound can estimate K_D values using binding data for a single protein concentration, the method assumes that the binding probability follows Equation (29). However, deviations from this binding curve can occur—for example, due to cooperative binding at high protein concentrations. When characterizing a new protein, it can, therefore, be prudent to validate the binding curve by repeating the assay for multiple protein concentrations.

Multi-concentration input-versus-bound experiments. ProBound can learn a K_D model by jointly analyzing the input and bound libraries of SELEX experiments conducted at different protein concentrations (Extended Data Fig. 7d). Intuitively, this approach uses low-concentration libraries (which ideally have a linear affinity-versus-binding relationship) to learn the relative binding affinities and high-concentration libraries (which should have saturated high-affinity probes) to determine the affinity scale. Although limited saturation of high-affinity probes in the lowest-concentration library can be acceptable as long as the relative-affinity model (which then mainly is constrained by the non-saturated lower-affinity probes) generalizes to the highest-affinity probes, such saturation should be avoided if possible. This effect may explain the slightly lower dissociation constant estimated in Extended Data Fig. 7d (which uses input/bound) compared to Extended Data Fig. 7c (which also uses the free library).

Peak-free motif discovery from ChIP-seq data

ProBound analysis

To analyze the GR ChIP-seq data from the IMR90 cell line⁴⁷, we first aligned the (single-end) Input and ChIP reads to the genome and extracted a sufficiently long (200-bp) sequence downstream of the $5^{'}$ -end genomic position of the mapped read. Next, we randomly sampled 10⁶ reads from each library and constructed a count table containing the Input and ChIP read counts in the first and second columns, respectively. ProBound was then configured to model this table as a single-round SELEX experiment. Because GR binds DNA as a homodimer, we configured ProBound to impose reverse-complement symmetry while fitting free-energy parameters for the primary motif. We then iteratively added three additional binding modes to the model to capture the influence of potential co-factors. To analyze the GR ChIP-seq data from the murine hippocampus⁵¹, we followed a similar procedure and constructed one count table for each of the three CORT concentrations (sampling 10⁵ sequences per library) and then configured ProBound to jointly model all count tables using a single reverse-complement-symmetric binding mode.

Other methods

Raw FASTQ files corresponding to the IMR90 GR ChIP and Input sequences from Starick et al.⁴⁷ were downloaded from the European Nucleotide Archive using accession number PRJEB7372. SAM files of the input and ChIP sequences were created by aligning to the hg19 genome using bowtie2 (version 2.4.4) with default settings.

HOMER: HOMER (version 4.11.1)⁶³ with default settings was used to analyze the SAM files; ‘tag directories’ for both the ChIP and Input sequences were first created using makeTagDirectory. Next, the command analyzeChIP-Seq.pl Tagged_GR_ChIP/ hg19 -i Tagged_GR_Input/ was executed to infer binding motifs.

MEME-ChIP: MACS2 (version 2.2.7.1)⁶⁴ with default settings was used to discover enriched peak regions. Then, 500-bp genomic regions—250 bp upstream and downstream of the discovered peak centers—were extracted from the resulting BED files using bedtools. The MEME-ChIP webserver was used to analyze these sequences with default settings and the ‘Look for palindromes only’ option selected.

NoPeak: The NoPeak repository⁶⁵ was downloaded from GitHub, and the SAM files were converted to BED files following the example in the repository: samtools view -bS GR_chip.sam | bedtools bamtobed | sort -k1,1 -k2n > GR_chip.bed.

These BED files were analyzed using NoPeak with default settings (kmer length = 8). This required 128 GB of RAM to complete; other kmer lengths were tried (>8) but failed as NoPeak ran out of memory.

Kinase-seq

ProBound analysis

In this assay, a library of peptide substrates S_i is treated with an enzyme E, and the concentrations of the products P_i are quantified using high-throughput sequencing (see below). This reaction can be modeled using Michaelis–Menten kinetics generalized to multiple substrates:

E + S_{i} ⇌_{k_{off, i}}^{k_{on, i}} E : S_{i} \underset{k_{cat, i}}{\to} E + P_{i}

In the limit of low enzyme concentration, the reaction quickly reaches a quasi-steady state with

[E : S_{i}] = [E] [S_{i}] / K_{M, i}

where K_M,i = (k_off,i + k_cat,i) / k_on,i is the Michaelis constant for substrate i. In this limit, the change in substrate concentration is given by

\partial_{t} [S_{i}] = - k_{eff, i} [S_{i}] [E]

where k_eff,i = k_cat,i / K_M,i is the catalytic efficiency. Integrating this equation yields

[S_{i}] (t) = [S_{i}] (0) e^{- k_{eff, i} \int_{0}^{t} [E] (t^{'}) d t^{'}}

where [S_i](0) is the substrate concentration right after the quasi-steady state is reached. The concentrations in the product library can then be expressed as

[P_{i}] (t) = {[S_{i}]}_{total} (1 - \frac{1 + [E] (t) / K_{M, i}}{1 + [E] (0) / K_{M, i}} e^{- k_{eff, i} \bar{E} (t) t})

where ${[S_{i}]}_{total} = [S_{i}] + [E : S_{i}] + [P_{i}]$ is the concentration in the initial library, and $\bar{E}(t) = t^{-1} \int_{0}^{t} [E](t^{'}) d t^{'}$ is the time-averaged enzyme concentration. This can be simplified further by noting that only a small fraction of each substrate is bound in the limit of low enzyme concentration

[E : S_{i}] / [S_{i}] = [E] / K_{M, i} ≪ 1

and, thus,

[P_{i}] (t) = {[S_{i}]}_{total} (1 - e^{- k_{eff, i} \bar{E} (t) t})

Note that the selection only differs between probes through k_eff,i. ProBound can, thus, model the assay using Eq. (8) with δ→−∞ and

Z_{bound, i, P} = k_{eff, i} \bar{E} (t) t

Here, $\bar{E}(t)$ depends on both K_D,i and [S_i] throughout the reaction and is generally unknown. We assume that most enzyme is free so that $\bar{E}(t) = {[E]}_{total}$; a lower (free) enzyme concentration would lead to a global rescaling of k_eff,i but would not affect the relative efficiency or its sequence dependence.
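As a numerical illustration of the simplified product expression, the Python sketch below evaluates 1 − exp(−k_eff,i Ē t) and compares it with a direct forward-Euler integration of ∂_t[S_i] = −k_eff,i [S_i][E] at constant free enzyme. The function names and the k_eff value of 10³ M⁻¹ s⁻¹ are illustrative assumptions, not measured quantities; the enzyme concentration and time points echo those quoted for the screen.

```python
import math


def product_fraction(k_eff, e_mean, t):
    """Fraction of a substrate converted to product: 1 - exp(-k_eff * E * t)."""
    return 1.0 - math.exp(-k_eff * e_mean * t)


def simulate_substrate(k_eff, e_free, t, steps=100_000):
    """Forward-Euler integration of d[S]/dt = -k_eff * [E] * [S], with [S](0) = 1.

    The free enzyme concentration is held constant, which corresponds to
    the assumption that most enzyme is free.
    """
    s = 1.0
    dt = t / steps
    for _ in range(steps):
        s -= k_eff * e_free * s * dt
    return s


if __name__ == "__main__":
    k_eff = 1.0e3       # M^-1 s^-1, hypothetical catalytic efficiency
    e_total = 500e-9    # M, enzyme concentration used in the screen
    for minutes in (5, 20, 60):
        t = 60.0 * minutes
        closed_form = product_fraction(k_eff, e_total, t)
        numeric = 1.0 - simulate_substrate(k_eff, e_total, t)
        print(f"{minutes:>2} min  closed form {closed_form:.4f}  Euler {numeric:.4f}")
```

Because k_eff,i and Ē enter only through their product, halving the assumed enzyme concentration while doubling every k_eff,i leaves all predicted product fractions unchanged, which is the global rescaling noted above.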

Preparation of degenerate peptide library to profile tyrosine kinase specificity

The degenerate peptide library contained 11-residue sequences with five randomized amino acids flanking either side of a fixed central tyrosine residue. These sequences were fused to the eCPX bacterial surface display scaffold⁶⁶. To clone this library, we first amplified the eCPX-coding sequence with a 3′ SfiI restriction site. This was fused to the random library in another PCR step using the following degenerate oligonucleotide: GCTGGCCAGTCTGGCCAG-NNSNNSNNSNNSNNStatNNSNNSNNSNNSNNS-GGAGGGCAGTCTGGGCAGTCTG, which contains a 5′ SfiI site. The resulting amplified product was digested with SfiI restriction endonuclease, purified and ligated into the SfiI-digested pBAD33-eCPX plasmid, as described previously⁵³. The ligation reaction was concentrated and desalted and then used to transform DH5α cells by electroporation. Transformed cells were grown overnight in liquid culture, and then the plasmid DNA library was extracted and purified using a commercial Midiprep kit.
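To make the NNS randomization scheme concrete, the following Python sketch enumerates the NNS codons (N = A, C, G or T; S = C or G) and translates them with the standard genetic code. It is an illustration only; the names `CODON_TABLE`, `nns_codons` and `stop_free_fraction` are introduced here and are not part of the published pipeline.

```python
from itertools import product

# Standard genetic code, listed in TCAG order with the third base varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nns_codons():
    """All codons matching NNS: any base at positions 1 and 2, C or G at position 3."""
    return [a + b + c for a in "ACGT" for b in "ACGT" for c in "CG"]


def stop_free_fraction(n_positions=10):
    """Probability that n independent, uniformly drawn NNS codons contain no stop codon."""
    codons = nns_codons()
    n_stop = sum(CODON_TABLE[c] == "*" for c in codons)
    return (1 - n_stop / len(codons)) ** n_positions


if __name__ == "__main__":
    codons = nns_codons()
    encoded = sorted({CODON_TABLE[c] for c in codons} - {"*"})
    stops = [c for c in codons if CODON_TABLE[c] == "*"]
    print("NNS codons:", len(codons))
    print("amino acids encoded:", len(encoded), "".join(encoded))
    print("stop codons within NNS:", stops)
    print("fixed central codon TAT encodes:", CODON_TABLE["TAT"])
    print("stop-free fraction over 10 positions:", round(stop_free_fraction(10), 3))
```

Restricting the third base to C or G keeps all 20 amino acids available while leaving TAG as the only stop codon that can occur, which is why reads containing stop codons are later filtered rather than prevented by design.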

Preparation of biotinylated antibody

The phosphotyrosine monoclonal antibody (pY20, conjugated to the fluorophore PerCP-eFluor 710; Invitrogen, cat. no. 46-5001-42) was desthiobiotinylated before use in the specificity screen. The antibody was first purified away from BSA and gelatin by anion exchange using a salt gradient of 0 M NaCl to 1 M NaCl in 0.1 M potassium phosphate buffer. The fractions that eluted after 0.2 M NaCl were pooled and then buffer-exchanged into 0.1 M potassium phosphate by dilution and centrifugal filtration. The antibody was then labeled in a 200-μl small-scale reaction using the DSB-X labeling kit (Molecular Probes) according to the manufacturer’s instructions. The concentration of the antibody was monitored by its absorbance at 490 nm to determine the percentage yield. The average final concentration of the antibody was around 0.2 mg ml⁻¹. The specificity of the antibody was validated using cells expressing displayed peptides. Cells treated with a tyrosine kinase without ATP showed no background antibody staining. By contrast, cells expressing displayed peptides, treated with tyrosine kinase and 1 mM ATP, showed increasing antibody staining as a function of phosphorylation time.

High-throughput specificity screen

The catalytic domain of the human tyrosine kinase c-Src was screened against the degenerate peptide library as described previously⁵³, with one main difference: magnetic beads, rather than fluorescence-activated cell sorting, were used to isolate phosphorylated cells. In short, Escherichia coli MC1061 cells transformed with the library were grown to an optical density of 0.5 at 600 nm. Expression of the surface-displayed peptides was induced with 0.4% arabinose for 4 hours at 25 °C. After expression, the cell pellets were collected and washed in PBS. Phosphorylation reactions of the library were conducted with 500 nM of purified c-Src and 1 mM ATP in a buffer containing 50 mM Tris, pH 7.5, 150 mM NaCl, 5 mM MgCl₂, 1 mM TCEP and 2 mM sodium orthovanadate. Time points were taken at 5 minutes, 20 minutes and 60 minutes. Kinase activity was quenched with 25 mM EDTA, and the cells were washed with PBS. Kinase-treated cells were labeled with roughly 0.05 mg ml⁻¹ of the biotinylated pY20 antibody for 1 hour and then washed again with PBS containing 0.2% BSA.

The phosphorylated cells were isolated with Dynabeads FlowComp Flexi (Invitrogen) following the manufacturer’s protocol. In total, two populations were collected for each time point: cells that did not bind to the magnetic beads and eluted after each wash (unbound) and cells that bound to the magnetic beads and eluted after the addition of the release buffer (bound). After isolation of these two populations, each cell pellet was collected, resuspended in water and then lysed by boiling at 100 °C for 10 minutes. The supernatant from this lysate was then used as a template in a 50-μl PCR reaction to amplify the peptide-coding DNA sequence using the same forward and reverse TruSeq-eCPX primers as described previously⁵³. The product of this PCR reaction was then used as a template for a second PCR reaction to append unique 5′ and 3′ indices. The resulting PCR products were purified by gel extraction, and the concentration of each sample was determined using the QuantiFluor dsDNA System (Promega). The samples were pooled at equal molarity and sequenced by paired-end Illumina sequencing on a MiSeq instrument. The deep sequencing data were processed as described previously⁵³,⁶⁷. The paired-end reads were merged using FLASH (version FLASH2-2.2.00)⁶⁸, and the adapter sequences were trimmed using the software Cutadapt (version 3.5)⁶⁹. The remaining sequences were translated into amino acid codes, and sequences containing stop codons were removed.

Validation measurement of phosphorylation rates

To validate predictions made by ProBound, phosphorylation rates were determined in vitro using purified c-Src and 11 synthetic peptides (purchased from Synpeptide). The phosphorylation reactions were carried out at 37 °C using 500 nM purified c-Src and 100 μM peptide in a buffer containing 50 mM Tris, pH 7.5, 150 mM NaCl, 5 mM MgCl₂, 1 mM TCEP and 2 mM sodium orthovanadate. Reactions were initiated by the addition of 1 mM ATP, and, at various time points, 100 μl of the solution was quenched with 25 mM EDTA (every 10 seconds for the faster reactions, every 2–10 minutes for the slower reactions). Each reaction was carried out in triplicate.

The concentration of the substrate and the phosphorylated product at each time point was determined by reversed-phase HPLC with UV detection at 214 nm (Agilent 1260 Infinity II). A 40-μl volume of the quenched reaction was injected onto a C18 column (ZORBAX 300SB-C18, 5 μm, 4.6 × 150 mm). A gradient system was used with solvent A (water and 0.1% TFA) and solvent B (acetonitrile and 0.1% TFA). Elution of the peptides was performed at a flow rate of 1 ml min⁻¹ using the following gradient: 0–2 minutes: 5% B; 2–12 minutes: 5–95% B; 12–13 minutes: 95% B; 13–14 minutes: 95–5% B; and 14–17 minutes: 5% B. The peak areas of the substrate and product were calculated using Agilent OpenLAB ChemStation software (version C.01.09). The initial rate for each peptide was obtained by fitting a straight line to a graph of peak area as a function of time in the linear regime of the reaction progress curve and calculating the slope of the line.
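The initial-rate estimate described above amounts to an ordinary least-squares slope of peak area against time, averaged over replicates. The Python sketch below shows one way to compute it; the function names and the toy numbers are illustrative assumptions, and choosing which time points lie in the linear regime is left to the analyst.

```python
def initial_rate(times, areas):
    """Ordinary least-squares slope of peak area versus time.

    Only time points from the linear regime of the progress curve
    should be passed in.
    """
    n = len(times)
    if n != len(areas) or n < 2:
        raise ValueError("need at least two matched (time, area) points")
    mean_t = sum(times) / n
    mean_a = sum(areas) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (a - mean_a) for t, a in zip(times, areas))
    return sxy / sxx


def mean_initial_rate(replicates):
    """Average the fitted slope over replicate (times, areas) series."""
    rates = [initial_rate(times, areas) for times, areas in replicates]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical product peak areas (arbitrary units) sampled every 10 s.
    triplicate = [
        ([0, 10, 20, 30], [1.0, 3.1, 4.9, 7.0]),
        ([0, 10, 20, 30], [0.9, 3.0, 5.2, 6.8]),
        ([0, 10, 20, 30], [1.1, 2.9, 5.0, 7.1]),
    ]
    print("mean initial rate (area per second):", mean_initial_rate(triplicate))
```

The slope is in units of peak area per unit time; converting it to a concentration rate requires a calibration between peak area and peptide concentration, which is not covered by this sketch.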

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41587-022-01307-0.

Supplementary information

Supplementary Information (PDF, 2.2 MB): Software Manual and Description of Configuration Files

Reporting Summary (PDF, 1.1 MB)

Supplementary Tables (XLSX, 208 KB): Training Database, HPLC Validation, SELEX Experiments, ChIP+DNA+DNAse GLM and EMSA Validation

Acknowledgements

Research reported in this publication was supported by NIMH award R01MH106842 and NHGRI award R01HG003008 to H.J.B. and NIGMS award R35GM118336 to R.S.M. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We are grateful to J. Hunt for valuable discussions about experimental methods for measuring dissociation constants.

Author contributions

H.T.R. and H.J.B. developed the methodology, with important contributions from C.R. H.T.R. implemented ProBound, with contributions from C.R., B.V.D. and H.H.A. S.F. performed the K_D-seq experiments and validation measurements under the supervision of R.S.M. J.F.K. performed the SELEX-seq and EpiSELEX-seq experiments and developed the GLM analysis under the supervision of R.S.M. and H.J.B. A.L. performed the Src sequencing and validation experiments under the supervision of N.H.S. B.B. developed the web portal under the supervision of H.J.B., H.T.R. and C.R. L.A.N.M. and H.T.R. performed ChIP-seq analyses. X.L. performed model validation analyses. H.T.R., C.R. and H.J.B. wrote the manuscript, with input from all authors.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Data availability

The sequencing data generated during the current study have been deposited in the Gene Expression Omnibus under accession number GSE175942. Source data for Figs. 4d and 6d are provided in Supplementary Tables 2 and 5.

Code availability

TF binding models and software for using them can be accessed at motifcentral.org. The ProBound software and a dedicated compute server for running ProBound are available at probound.bussemakerlab.org.

Competing interests

H.J.B., C.R. and H.T.R. have filed a patent application describing the design, composition and function of ProBound. The remaining authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Siqian Feng, Judith F. Kribelbauer, Allyson Li.

Extended data is available for this paper at 10.1038/s41587-022-01307-0.

Supplementary information

The online version contains supplementary material available at 10.1038/s41587-022-01307-0.

References

1. Crocker J, et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell. 2015;160:191–203. doi: 10.1016/j.cell.2014.11.041.
2. Farley EK, et al. Suboptimization of developmental enhancers. Science. 2015;350:325–328. doi: 10.1126/science.aac6948.
3. Tanay A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 2006;16:962–972. doi: 10.1101/gr.5113606.
4. Zykovich A, Korf I, Segal DJ. Bind-n-Seq: high-throughput analysis of in vitro protein–DNA interactions using massively parallel sequencing. Nucleic Acids Res. 2009;37:e151. doi: 10.1093/nar/gkp802.
5. Zhao Y, Granas D, Stormo GD. Inferring binding energies from selected binding sites. PLoS Comput. Biol. 2009;5:e1000590.
6. Jolma A, et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 2010;20:861–873. doi: 10.1101/gr.100552.109.
7. Isakova A, et al. SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nat. Methods. 2017;14:316–322. doi: 10.1038/nmeth.4143.
8. Slattery M, et al. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell. 2011;147:1270–1282. doi: 10.1016/j.cell.2011.10.053.
9. Jolma A, et al. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature. 2015;527:384–388. doi: 10.1038/nature15518.
10. Rodriguez-Martinez JA, Reinke AW, Bhimsaria D, Keating AE, Ansari AZ. Combinatorial bZIP dimers display complex DNA-binding specificity landscapes. eLife. 2017;6:e19272. doi: 10.7554/eLife.19272.
11. Zhu F, et al. The interaction landscape between transcription factors and the nucleosome. Nature. 2018;562:76–81. doi: 10.1038/s41586-018-0549-5.
12. Yin Y, et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science. 2017;356:eaaj2239. doi: 10.1126/science.aaj2239.
13. Kribelbauer JF, et al. Quantitative analysis of the DNA methylation sensitivity of transcription factor complexes. Cell Rep. 2017;19:2383–2395. doi: 10.1016/j.celrep.2017.05.069.
14. Zuo Z, Roy B, Chang YK, Granas D, Stormo GD. Measuring quantitative effects of methylation on transcription factor–DNA binding affinity. Sci. Adv. 2017;3:eaao1799. doi: 10.1126/sciadv.aao1799.
15. Lambert N, et al. RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol. Cell. 2014;54:887–900. doi: 10.1016/j.molcel.2014.04.016.
16. Dominguez D, et al. Sequence, structure, and context preferences of human RNA binding proteins. Mol. Cell. 2018;70:854–867. doi: 10.1016/j.molcel.2018.05.001.
17. Zhou J, et al. Deep profiling of protease substrate specificity enabled by dual random and scanned human proteome substrate phage libraries. Proc. Natl Acad. Sci. USA. 2020;117:25464–25475. doi: 10.1073/pnas.2009279117.
18. Gee MH, et al. Antigen identification for orphan T cell receptors expressed on tumor-infiltrating lymphocytes. Cell. 2018;172:549–563. doi: 10.1016/j.cell.2017.11.043.
19. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015;33:831–838. doi: 10.1038/nbt.3300.
20. Asif M, Orenstein Y. DeepSELEX: inferring DNA-binding preferences from HT-SELEX data using multi-class CNNs. Bioinformatics. 2020;36:i634–i642. doi: 10.1093/bioinformatics/btaa789.
21. Ben-Bassat I, Chor B, Orenstein Y. A deep neural network approach for learning intrinsic protein–RNA binding preferences. Bioinformatics. 2018;34:i638–i646. doi: 10.1093/bioinformatics/bty600.
22. Toivonen J, et al. Modular discovery of monomeric and dimeric transcription factor binding motifs for large data sets. Nucleic Acids Res. 2018;46:e44. doi: 10.1093/nar/gky027.
23. Yuan H, Kshirsagar M, Zamparo L, Lu Y, Leslie CS. BindSpace decodes transcription factor binding signals by large-scale sequence embedding. Nat. Methods. 2019;16:858–861. doi: 10.1038/s41592-019-0511-y.
24. Ruan S, Swamidass SJ, Stormo GD. BEESEM: estimation of binding energy models using HT-SELEX data. Bioinformatics. 2017;33:2288–2295. doi: 10.1093/bioinformatics/btx191.
25. Rastogi C, et al. Accurate and sensitive quantification of protein–DNA binding affinity. Proc. Natl Acad. Sci. USA. 2018;115:E3692–E3701. doi: 10.1073/pnas.1714376115.
26. Kribelbauer JF, et al. Context-dependent gene regulation by Homeodomain transcription factor complexes revealed by shape-readout deficient proteins. Mol. Cell. 2020;78:152–167.
27. Foat BC, Morozov AV, Bussemaker HJ. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006;22:e141–e149. doi: 10.1093/bioinformatics/btl223.
28. Jolma A, et al. DNA-binding specificities of human transcription factors. Cell. 2013;152:327–339. doi: 10.1016/j.cell.2012.12.009.
29. Nitta KR, et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife. 2015;4:e04837. doi: 10.7554/eLife.04837.
30. Yang L, et al. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol. Syst. Biol. 2017;13:910. doi: 10.15252/msb.20167238.
31. Weirauch MT, et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 2013;31:126–134. doi: 10.1038/nbt.2486.
32. Davis CA, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 2018;46:D794–D801. doi: 10.1093/nar/gkx1081.
33. Khan A, et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2018;46:D260–D266. doi: 10.1093/nar/gkx1126.
34. Kulakovskiy IV, et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 2018;46:D252–D259. doi: 10.1093/nar/gkx1106.
35. Weber M, et al. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat. Genet. 2007;39:457–466. doi: 10.1038/ng1990.
36. Dantas Machado AC, et al. Evolving insights on how cytosine methylation affects protein–DNA binding. Brief. Funct. Genomics. 2015;14:61–73. doi: 10.1093/bfgp/elu040.
37. Zhu H, Wang G, Qian J. Transcription factors as readers and effectors of DNA methylation. Nat. Rev. Genet. 2016;17:551–565. doi: 10.1038/nrg.2016.83.
38. Kribelbauer JF, Lu X-J, Rohs R, Mann RS, Bussemaker HJ. Towards a mechanistic understanding of DNA methylation readout by transcription factors. J. Mol. Biol. 2019. doi: 10.1016/j.jmb.2019.10.021.
39. Mann IK, et al. CG methylated microarrays identify a novel methylated sequence bound by the CEBPB|ATF4 heterodimer that is active in vivo. Genome Res. 2013;23:988–997. doi: 10.1101/gr.146654.112.
40. Kumar S, Chinnusamy V, Mohapatra T. Epigenetics of modified DNA bases: 5-methylcytosine and beyond. Front. Genet. 2018;9:640. doi: 10.3389/fgene.2018.00640.
41. Fu Y, et al. N6-methyldeoxyadenosine marks active transcription start sites in Chlamydomonas. Cell. 2015;161:879–892. doi: 10.1016/j.cell.2015.04.010.
42. Xiao C-L, et al. N6-methyladenine DNA modification in the human genome. Mol. Cell. 2018;71:306–318. doi: 10.1016/j.molcel.2018.06.015.
43. Wu TP, et al. DNA methylation on N6-adenine in mammalian embryonic stem cells. Nature. 2016;532:329–333. doi: 10.1038/nature17640.
44. Kriaucionis S, Heintz N. The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science. 2009;324:929–930. doi: 10.1126/science.1169786.
45. Münzel M, et al. Quantification of the sixth DNA base hydroxymethylcytosine in the brain. Angew. Chem. Int. Ed. Engl. 2010;49:5375–5377. doi: 10.1002/anie.201002033.
46. Zuo Z, Stormo GD. High-resolution specificity from DNA sequencing highlights alternative modes of Lac repressor binding. Genetics. 2014;198:1329–1343. doi: 10.1534/genetics.114.170100.
47. Starick SR, et al. ChIP-exo signal associated with DNA-binding motifs provides insight into the genomic binding of the glucocorticoid receptor and cooperating transcription factors. Genome Res. 2015;25:825–835. doi: 10.1101/gr.185157.114.
48. Luisi BF, et al. Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature. 1991;352:497–505. doi: 10.1038/352497a0.
49. Glass CK. Differential recognition of target genes by nuclear receptor monomers, dimers, and heterodimers. Endocr. Rev. 1994;15:391–407. doi: 10.1210/edrv-15-3-391.
50. Biddie SC, et al. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol. Cell. 2011;43:145–155. doi: 10.1016/j.molcel.2011.06.016.
51. Polman JAE, de Kloet ER, Datson NA. Two populations of glucocorticoid receptor-binding sites in the male rat hippocampal genome. Endocrinology. 2013;154:1832–1844. doi: 10.1210/en.2012-2187.
52. Liu G, et al. Antibody complementarity determining region design using high-capacity machine learning. Bioinformatics. 2020;36:2126–2133. doi: 10.1093/bioinformatics/btz895.
53. Shah NH, Löbel M, Weiss A, Kuriyan J. Fine-tuning of substrate preferences of the Src-family kinase Lck revealed through a high-throughput specificity screen. eLife. 2018;7:e35190. doi: 10.7554/eLife.35190.
54. Ryu G-M, et al. Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases. Nucleic Acids Res. 2009;37:1297–1307. doi: 10.1093/nar/gkn1008.
55. Hornbeck PV, et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 2015;43:D512–D520. doi: 10.1093/nar/gku1267.
56. Zhao Y, Stormo GD. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 2011;29:480–483. doi: 10.1038/nbt.1893.
57. Maerkl SJ, Quake SR. A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007;315:233–237. doi: 10.1126/science.1131007.
58. Badis G, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–1723. doi: 10.1126/science.1162327.
59. Berger MF, et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008;133:1266–1276. doi: 10.1016/j.cell.2008.05.024.
60. Weirauch MT, et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158:1431–1443. doi: 10.1016/j.cell.2014.08.009.
61. Riley TR, et al. SELEX-seq: a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. In: Hox Genes. Springer; 2014:255–278.
62. Jarmoskaite I, AlSadhan I, Vaidyanathan PP, Herschlag D. How to measure and evaluate binding affinities. eLife. 2020;9:e57264. doi: 10.7554/eLife.57264.
63. Heinz S, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell. 2010;38:576–589. doi: 10.1016/j.molcel.2010.05.004.
64. Bailey TL, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37:W202–W208. doi: 10.1093/nar/gkp335.
65. Menzel M, Hurka S, Glasenhardt S, Gogol-Döring A. NoPeak: k-mer-based motif discovery in ChIP-Seq data without peak calling. Bioinformatics. 2021;37:596–602. doi: 10.1093/bioinformatics/btaa845.
66. Rice JJ, Daugherty PS. Directed evolution of a biterminal bacterial display scaffold enhances the display of diverse peptides. Protein Eng. Des. Sel. 2008;21:435–442. doi: 10.1093/protein/gzn020.
67. Shah NH, et al. An electrostatic selection mechanism controls sequential kinase signaling downstream of the T cell receptor. eLife. 2016;5:e20105. doi: 10.7554/eLife.20105.
68. Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27:2957–2963. doi: 10.1093/bioinformatics/btr507.
69. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011;17. https://journal.embnet.org/index.php/embnetjournal/article/view/2000

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information^{(2.2MB, pdf)}

Software Manual and Description of Configuration Files

Reporting Summary^{(1.1MB, pdf)}

Supplementary Tables^{(208KB, xlsx)}

Training Database, HPLC Validation, SELEX Experiments, ChIP+DNA+DNAse GLM and EMSA Validation

Data Availability Statement

[CR1] 1.Crocker J, et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell. 2015;160:191–203. doi: 10.1016/j.cell.2014.11.041. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Farley EK, et al. Suboptimization of developmental enhancers. Science. 2015;350:325–328. doi: 10.1126/science.aac6948. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Tanay A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 2006;16:962–972. doi: 10.1101/gr.5113606. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Zykovich A, Korf I, Segal DJ. Bind-n-Seq: high-throughput analysis of in vitro protein–DNA interactions using massively parallel sequencing. Nucleic Acids Res. 2009;37:e151. doi: 10.1093/nar/gkp802. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Zhao, Y., Granas, D. & Stormo, G. D. Inferring binding energies from selected binding sites. PLoS Comput. Biol.5, e1000590 (2009). [DOI] [PMC free article] [PubMed]

[CR6] 6.Jolma A, et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 2010;20:861–873. doi: 10.1101/gr.100552.109. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Isakova A, et al. SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nat. Methods. 2017;14:316–322. doi: 10.1038/nmeth.4143. [DOI] [PubMed] [Google Scholar]

[CR8] 8.Slattery M, et al. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell. 2011;147:1270–1282. doi: 10.1016/j.cell.2011.10.053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Jolma A, et al. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature. 2015;527:384–388. doi: 10.1038/nature15518. [DOI] [PubMed] [Google Scholar]

[CR10] 10.Rodriguez-Martinez JA, Reinke AW, Bhimsaria D, Keating AE, Ansari AZ. Combinatorial bZIP dimers display complex DNA-binding specificity landscapes. eLife. 2017;6:e19272. doi: 10.7554/eLife.19272. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Zhu F, et al. The interaction landscape between transcription factors and the nucleosome. Nature. 2018;562:76–81. doi: 10.1038/s41586-018-0549-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Yin Y, et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science. 2017;356:eaaj2239. doi: 10.1126/science.aaj2239. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Kribelbauer JF, et al. Quantitative analysis of the DNA methylation sensitivity of transcription factor complexes. Cell Rep. 2017;19:2383–2395. doi: 10.1016/j.celrep.2017.05.069. [DOI] [PMC free article] [PubMed] [Google Scholar]

14. Zuo Z, Roy B, Chang YK, Granas D, Stormo GD. Measuring quantitative effects of methylation on transcription factor–DNA binding affinity. Sci. Adv. 2017;3:eaao1799. doi: 10.1126/sciadv.aao1799.

15. Lambert N, et al. RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol. Cell. 2014;54:887–900. doi: 10.1016/j.molcel.2014.04.016.

16. Dominguez D, et al. Sequence, structure, and context preferences of human RNA binding proteins. Mol. Cell. 2018;70:854–867. doi: 10.1016/j.molcel.2018.05.001.

17. Zhou J, et al. Deep profiling of protease substrate specificity enabled by dual random and scanned human proteome substrate phage libraries. Proc. Natl Acad. Sci. USA. 2020;117:25464–25475. doi: 10.1073/pnas.2009279117.

18. Gee MH, et al. Antigen identification for orphan T cell receptors expressed on tumor-infiltrating lymphocytes. Cell. 2018;172:549–563. doi: 10.1016/j.cell.2017.11.043.

19. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015;33:831–838. doi: 10.1038/nbt.3300.

20. Asif M, Orenstein Y. DeepSELEX: inferring DNA-binding preferences from HT-SELEX data using multi-class CNNs. Bioinformatics. 2020;36:i634–i642. doi: 10.1093/bioinformatics/btaa789.

21. Ben-Bassat I, Chor B, Orenstein Y. A deep neural network approach for learning intrinsic protein–RNA binding preferences. Bioinformatics. 2018;34:i638–i646. doi: 10.1093/bioinformatics/bty600.

22. Toivonen J, et al. Modular discovery of monomeric and dimeric transcription factor binding motifs for large data sets. Nucleic Acids Res. 2018;46:e44. doi: 10.1093/nar/gky027.

23. Yuan H, Kshirsagar M, Zamparo L, Lu Y, Leslie CS. BindSpace decodes transcription factor binding signals by large-scale sequence embedding. Nat. Methods. 2019;16:858–861. doi: 10.1038/s41592-019-0511-y.

24. Ruan S, Swamidass SJ, Stormo GD. BEESEM: estimation of binding energy models using HT-SELEX data. Bioinformatics. 2017;33:2288–2295. doi: 10.1093/bioinformatics/btx191.

25. Rastogi C, et al. Accurate and sensitive quantification of protein–DNA binding affinity. Proc. Natl Acad. Sci. USA. 2018;115:E3692–E3701. doi: 10.1073/pnas.1714376115.

26. Kribelbauer JF, et al. Context-dependent gene regulation by Homeodomain transcription factor complexes revealed by shape-readout deficient proteins. Mol. Cell. 2020;78:152–167.

27. Foat BC, Morozov AV, Bussemaker HJ. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006;22:e141–e149. doi: 10.1093/bioinformatics/btl223.

28. Jolma A, et al. DNA-binding specificities of human transcription factors. Cell. 2013;152:327–339. doi: 10.1016/j.cell.2012.12.009.

29. Nitta KR, et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. eLife. 2015;4:e04837. doi: 10.7554/eLife.04837.

30. Yang L, et al. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol. Syst. Biol. 2017;13:910. doi: 10.15252/msb.20167238.

31. Weirauch MT, et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 2013;31:126–134. doi: 10.1038/nbt.2486.

32. Davis CA, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 2018;46:D794–D801. doi: 10.1093/nar/gkx1081.

33. Khan A, et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2018;46:D260–D266. doi: 10.1093/nar/gkx1126.

34. Kulakovskiy IV, et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 2018;46:D252–D259. doi: 10.1093/nar/gkx1106.

35. Weber M, et al. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat. Genet. 2007;39:457–466. doi: 10.1038/ng1990.

36. Dantas Machado AC, et al. Evolving insights on how cytosine methylation affects protein–DNA binding. Brief. Funct. Genomics. 2015;14:61–73. doi: 10.1093/bfgp/elu040.

37. Zhu H, Wang G, Qian J. Transcription factors as readers and effectors of DNA methylation. Nat. Rev. Genet. 2016;17:551–565. doi: 10.1038/nrg.2016.83.

38. Kribelbauer JF, Lu X-J, Rohs R, Mann RS, Bussemaker HJ. Towards a mechanistic understanding of DNA methylation readout by transcription factors. J. Mol. Biol. 2019. doi: 10.1016/j.jmb.2019.10.021.

39. Mann IK, et al. CG methylated microarrays identify a novel methylated sequence bound by the CEBPB|ATF4 heterodimer that is active in vivo. Genome Res. 2013;23:988–997. doi: 10.1101/gr.146654.112.

40. Kumar S, Chinnusamy V, Mohapatra T. Epigenetics of modified DNA bases: 5-methylcytosine and beyond. Front. Genet. 2018;9:640. doi: 10.3389/fgene.2018.00640.

41. Fu Y, et al. N6-methyldeoxyadenosine marks active transcription start sites in Chlamydomonas. Cell. 2015;161:879–892. doi: 10.1016/j.cell.2015.04.010.

42. Xiao C-L, et al. N6-methyladenine DNA modification in the human genome. Mol. Cell. 2018;71:306–318. doi: 10.1016/j.molcel.2018.06.015.

43. Wu TP, et al. DNA methylation on N6-adenine in mammalian embryonic stem cells. Nature. 2016;532:329–333. doi: 10.1038/nature17640.

44. Kriaucionis S, Heintz N. The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science. 2009;324:929–930. doi: 10.1126/science.1169786.

45. Münzel M, et al. Quantification of the sixth DNA base hydroxymethylcytosine in the brain. Angew. Chem. Int. Ed. Engl. 2010;49:5375–5377. doi: 10.1002/anie.201002033.

46. Zuo Z, Stormo GD. High-resolution specificity from DNA sequencing highlights alternative modes of Lac repressor binding. Genetics. 2014;198:1329–1343. doi: 10.1534/genetics.114.170100.

47. Starick SR, et al. ChIP-exo signal associated with DNA-binding motifs provides insight into the genomic binding of the glucocorticoid receptor and cooperating transcription factors. Genome Res. 2015;25:825–835. doi: 10.1101/gr.185157.114.

48. Luisi BF, et al. Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature. 1991;352:497–505. doi: 10.1038/352497a0.

49. Glass CK. Differential recognition of target genes by nuclear receptor monomers, dimers, and heterodimers. Endocr. Rev. 1994;15:391–407. doi: 10.1210/edrv-15-3-391.

50. Biddie SC, et al. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol. Cell. 2011;43:145–155. doi: 10.1016/j.molcel.2011.06.016.

51. Polman JAE, de Kloet ER, Datson NA. Two populations of glucocorticoid receptor-binding sites in the male rat hippocampal genome. Endocrinology. 2013;154:1832–1844. doi: 10.1210/en.2012-2187.

52. Liu G, et al. Antibody complementarity determining region design using high-capacity machine learning. Bioinformatics. 2020;36:2126–2133. doi: 10.1093/bioinformatics/btz895.

53. Shah NH, Löbel M, Weiss A, Kuriyan J. Fine-tuning of substrate preferences of the Src-family kinase Lck revealed through a high-throughput specificity screen. eLife. 2018;7:e35190. doi: 10.7554/eLife.35190.

54. Ryu G-M, et al. Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases. Nucleic Acids Res. 2009;37:1297–1307. doi: 10.1093/nar/gkn1008.

55. Hornbeck PV, et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 2015;43:D512–D520. doi: 10.1093/nar/gku1267.

56. Zhao Y, Stormo GD. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 2011;29:480–483. doi: 10.1038/nbt.1893.

57. Maerkl SJ, Quake SR. A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007;315:233–237. doi: 10.1126/science.1131007.

58. Badis G, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–1723. doi: 10.1126/science.1162327.

59. Berger MF, et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008;133:1266–1276. doi: 10.1016/j.cell.2008.05.024.

60. Weirauch MT, et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158:1431–1443. doi: 10.1016/j.cell.2014.08.009.

61. Riley TR, et al. SELEX-seq: a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. In: Hox Genes. Springer; 2014. p. 255–278.

62. Jarmoskaite I, AlSadhan I, Vaidyanathan PP, Herschlag D. How to measure and evaluate binding affinities. eLife. 2020;9:e57264. doi: 10.7554/eLife.57264.

63. Heinz S, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell. 2010;38:576–589. doi: 10.1016/j.molcel.2010.05.004.

64. Bailey TL, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37:W202–W208. doi: 10.1093/nar/gkp335.

65. Menzel M, Hurka S, Glasenhardt S, Gogol-Döring A. NoPeak: k-mer-based motif discovery in ChIP-Seq data without peak calling. Bioinformatics. 2021;37:596–602. doi: 10.1093/bioinformatics/btaa845.

66. Rice JJ, Daugherty PS. Directed evolution of a biterminal bacterial display scaffold enhances the display of diverse peptides. Protein Eng. Des. Sel. 2008;21:435–442. doi: 10.1093/protein/gzn020.

67. Shah NH, et al. An electrostatic selection mechanism controls sequential kinase signaling downstream of the T cell receptor. eLife. 2016;5:e20105. doi: 10.7554/eLife.20105.

68. Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27:2957–2963. doi: 10.1093/bioinformatics/btr507.

69. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011;17. https://journal.embnet.org/index.php/embnetjournal/article/view/2000
