Direct inference of protein–DNA interactions using compressed sensing methods

Mohammed AlQuraishi; Harley H McAdams

doi:10.1073/pnas.1106460108

2011 Aug 8;108(36):14819-14824.

PMCID: PMC3169146 PMID: 21825146

Abstract

Compressed sensing has revolutionized signal acquisition, by enabling complex signals to be measured with remarkable fidelity using a small number of so-called incoherent sensors. We show that molecular interactions, e.g., protein–DNA interactions, can be analyzed in a directly analogous manner and with similarly remarkable results. Specifically, mesoscopic molecular interactions act as incoherent sensors that measure the energies of microscopic interactions between atoms. We combine concepts from compressed sensing and statistical mechanics to determine the interatomic interaction energies of a molecular system exclusively from experimental measurements, resulting in a “de novo” energy potential. In contrast, conventional methods for estimating energy potentials are based on theoretical models premised on a priori assumptions and extensive domain knowledge. We determine the de novo energy potential for pairwise interactions between protein and DNA atoms from (i) experimental measurements of the binding affinity of protein–DNA complexes and (ii) crystal structures of the complexes. We show that the de novo energy potential can be used to predict the binding specificity of proteins to DNA with approximately 90% accuracy, compared to approximately 60% for the best performing alternative computational methods applied to this fundamental problem. This de novo potential method is directly extendable to other biomolecule interaction domains (enzymes and signaling molecule interactions) and to other classes of molecular interactions.

Keywords: DNA motifs, structural biology, machine learning, protein–DNA binding, DNA binding sites

The foundation of molecular analyses of chemical and biological phenomena is the energy potential, a mathematical description of the energy of every possible interaction in a molecular system (Fig. 1B). The accuracy of computational and laboratory studies of phenomena ranging from pharmaceutical drug interactions and protein folding to material phase transitions and thin film growth is often limited by the accuracy of these energy potentials. Currently, potentials are inferred using a mixture of theoretical modeling and experimental data (Fig. 1A). “Physical potentials” rely on theoretical models to specify the potential’s mathematical form and use experimental data to fit few model parameters (1). In contrast, “statistical potentials” fit many parameters to experimental data and use theoretical models for the expected statistics of interactions under randomness to infer a potential (2). In both approaches, theoretical models shape and constrain the inferred potential, resulting in a so-called parametric model. There are several drawbacks to this: (i) The a priori assumptions underlying the inferred potentials may be inaccurate. (ii) Substantial domain knowledge is required (often exceeding what is known). (iii) Potential modeling is lengthy and technically difficult. The theoretical development of some potentials has taken decades (3). To overcome these problems, potentials could in principle be determined strictly from experimental data without recourse to theoretical modeling by experimentally measuring the energies of all distinct interactions. In practice, direct measurement of interatomic potentials has been possible only for the simplest systems, due to a combinatorial explosion in the number of possible interactions that renders experiment-based inference intractable. 
We have developed a general method for the inference of “de novo” potentials that circumvents the experimental intractability barrier by exploiting recent discoveries in information theory known as compressed sensing. This approach results in a nonparametric potential that does not require an a priori assumption of a theoretical model, overcoming a fundamental limitation of both physical and statistical potentials.

Fig. 1. — Types of energy potentials. (A) A potential V(r,i) mathematically specifies the energies of all microscopic interactions in a molecular system in terms of distance r and interaction type i. Conventional physical and statistical potentials are parametric mathematical models similar to the examples shown. Our de novo potentials are nonparametric; i.e., they do not assume a mathematical model. (B) A potential can be visualized as a heat map where the interaction energy of every atom pair as a function of the atoms’ separation distance is represented by a color (pink: high potential energy, repulsive region; blue: low potential energy, attractive region).

Below, we demonstrate our method by applying it to the prediction of sequence-specific protein–DNA-binding interactions, a classic problem in molecular biology. Sequence-specific protein–DNA binding is a central phenomenon underlying transcriptional regulation of the cell in all organisms. Here, we describe a de novo potential for interatomic protein–DNA interactions and use this potential to computationally predict the DNA-binding sites of proteins with near experimental accuracy. A schematic overview summarizing the key steps in our approach is shown in Fig. S1.

The crux of our method is a unique mathematical formulation that recasts the determination of potentials as a signal acquisition problem using compressed sensing techniques (4, 5). By exploiting a key property of compressed sensing (discussed below), we circumvent the experimental intractability of determining energy potentials. This formulation rests on three key observations: (i) In certain systems, e.g., biomolecular interactions, only a few types of interatomic interactions are energetically important, such as hydrogen bonds. (ii) Mesoscopic interactions, e.g., protein–DNA binding, can be viewed as sensors of the underlying microscopic potential. (iii) There is a correspondence between the mathematical formulation of logistic regression and the statistical mechanical concept of the canonical ensemble. Using these ideas, we formulate the determination of energy potentials as a tractable signal acquisition problem as described in Methods. Central to this approach is the distinction between microscopic and mesoscopic interactions. The term “mesoscopic interactions” refers to interactions between molecules or molecular complexes, whereas “microscopic interactions” refers to interactions at the atomic scale. This distinction is ultimately application-specific and depends on what interaction energies are to be determined and what experimental data are available. An essential requirement is that mesoscopic interactions be comprised of microscopic interactions, such that the energy of a mesoscopic interaction is the sum of the energies of its constituent microscopic interactions. The identities of the microscopic interactions constituting each mesoscopic interaction must also be known. When these requirements are met, the method can determine the energies of the microscopic interactions using observed energies or probabilities of the mesoscopic interactions as the experimental data. 
For protein–DNA interactions, the mesoscopic interactions are protein–nucleotide binding events represented by protein–DNA crystal structures with known binding energies or relative binding probabilities, and the microscopic interactions are pairwise, distance-dependent, contacts between protein atoms and nucleotide atoms [standard atom type categories are used (6), and distances are segmented into discrete bins whose widths are treated as model parameters as described in SI Text S1, Section 2]. Thus, each protein–nucleotide crystal structure characterizes a mesoscopic interaction whose constituent microscopic interactions are readily identified from the structure. Intraprotein and intra-DNA energies are ignored as protein–DNA interactions are our focus. We use these data together with the known energies or probabilities of protein–nucleotide binding events to infer the interatomic protein–DNA potentials, and we show that these de novo potentials can be used to predict protein–DNA-binding motifs with unprecedented accuracy.
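As an illustration of how a crystal structure might be turned into a sensor, the following minimal Python sketch bins protein–DNA atom contacts by distance and counts each distinct (protein atom type, DNA atom type, distance bin) combination. The 1.3-Å bin width and 5.9-Å cutoff are the baseline values reported in Results; the atom type names, the contact list, and the uniform binning rule are hypothetical stand-ins, and the actual scheme is the one described in SI Text S1, Section 2.

```python
from collections import Counter

BIN_WIDTH = 1.3  # Å, baseline bin width reported in Results
CUTOFF = 5.9     # Å, baseline cutoff distance reported in Results


def interaction_key(protein_atom_type, dna_atom_type, distance):
    """Map one atom-pair contact to a microscopic interaction label.

    Returns None when the pair lies at or beyond the cutoff. Uniform bins
    starting at zero are an assumption made for illustration.
    """
    if distance >= CUTOFF:
        return None
    return (protein_atom_type, dna_atom_type, int(distance // BIN_WIDTH))


def sensor_counts(contacts):
    """Count how often each microscopic interaction occurs in one structure."""
    counts = Counter()
    for protein_atom_type, dna_atom_type, distance in contacts:
        key = interaction_key(protein_atom_type, dna_atom_type, distance)
        if key is not None:
            counts[key] += 1
    return counts


# Hypothetical contacts: (protein atom type, DNA atom type, distance in Å).
contacts = [
    ("N_amide", "O_phosphate", 2.9),
    ("N_amide", "O_phosphate", 3.1),
    ("C_aliphatic", "C_methyl", 4.0),
    ("N_amide", "N_base", 7.2),  # beyond the cutoff, ignored
]
counts = sensor_counts(contacts)
```

Each such count table corresponds to one row of the sensor matrix, with one column per distinct interaction label observed across the dataset.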

Model

Overview.

We outline our signal-based formulation of the potential inference problem here and derive it formally in SI Text S1, Section 1. Signal acquisition comprises three parts: the signal (conventionally an image), the sensor (camera photographing the image), and the sensor measurements (light intensities of the image) (Fig. 2A). A signal is represented by a vector whose elements are the signal intensities at different locations (e.g., different image pixels). Conventional sensing theory (Nyquist–Shannon) stipulates that signal acquisition requires as many measurements as the length of this vector for complete recovery of the signal. However, the compressed sensing (CS) framework has shown that under certain conditions, far fewer measurements are necessary when the signal is inferred using ℓ₁ minimization (5). We exploit this property to circumvent the combinatorial explosion noted earlier that causes experimental intractability. The CS technique requires two conditions for applicability (5): (i) The signal must be nearly sparse; i.e., most vector elements must have negligible intensity. (ii) The sensors must be incoherent; i.e., they measure the integrated intensity of multiple signal vector elements (Fig. 2B), and the set of vector elements sensed must be highly variable (ideally, random) between sensors (SI Text S1, Section 1). Also, the identity of the vector elements sensed by each sensor must be known. We reformulate potential determination as a CS problem by treating the interatomic potential as the signal we wish to acquire, with mesoscopic interactions as the sensors and mesoscopic interaction energies as the measurements (Fig. 2C). The signal’s vector consists of the energies of all distinct microscopic interactions, with different vector elements corresponding to different microscopic interactions and signal intensity corresponding to interaction energy.
In the protein–DNA application, we treat distinct combinations of protein atoms, nucleotide atoms, and distance bins as distinct interactions (SI Text S1, Section 2). This leads to a combinatorial explosion in the number of possible interactions, causing the signal’s vector to be extremely long (up to approximately 50,000 elements) (SI Text S1, Section 2). However, the vector will be nearly sparse because most interactions are energetically negligible (7). In our crystal structure dataset, we found that only 9% of interaction energies were nonnegligible (Results). This satisfies the first condition. Regarding the second condition, the energies of mesoscopic interactions are incoherent measurements, because (i) they are the summed energies of the microscopic interactions and thus integrate the intensity of multiple vector elements, and (ii) the set of microscopic interactions present in each mesoscopic interaction is highly variable as discussed below. Because the vector elements sensed by each measurement must be known, the microscopic interactions comprising each mesoscopic interaction must be known. Protein–DNA crystal structures provide a dataset of mesoscopic interactions whose constituent microscopic interactions are identified from the positions of the protein and nucleotide atoms in the contact regions of the structure. We used a set of 63 such nonredundant structures (Dataset S1), combined with their measured binding affinities, as the dataset for the de novo potential determination described below. The nonredundancy of these structures ensures that each mesoscopic interaction samples a different set of microscopic interactions, because the intrinsic variability of the structures due to their varying spatial conformations and different amino acid compositions results in high variability in the microscopic interactions that constitute each mesoscopic interaction. (The degree of incoherence is quantified later in Results). 
Now, because we have recast potential determination as a CS problem, only a small number of incoherent measurements, i.e., experimentally characterized protein–nucleotide binding events, are needed. This circumvents the experimental combinatorial explosion problem cited earlier.
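The energy-based formulation can be sketched in code under simplifying assumptions. The toy problem below has 40 hypothetical microscopic interactions, 3 of which carry nonzero energy, and 15 sensors whose rows record which interactions each one contains. It estimates the potential by minimizing 0.5·‖Φv − y‖² + λ‖v‖₁ with iterative soft thresholding (ISTA). ISTA is used here only because it fits in a few lines of plain Python; it is not the solver used in this work, and the problem sizes, the sensor matrix `phi`, and the penalty `lam` are all illustrative choices.

```python
import random


def soft_threshold(x, t):
    """Proximal operator of t*|x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def residual(phi, y, v):
    """Predicted minus measured energy for every sensor."""
    return [sum(p * vj for p, vj in zip(row, v)) - yi for row, yi in zip(phi, y)]


def lasso_objective(phi, y, v, lam):
    """0.5 * ||phi v - y||^2 + lam * ||v||_1."""
    r = residual(phi, y, v)
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(vj) for vj in v)


def ista(phi, y, lam, n_iter=500):
    """Iterative soft thresholding; returns the estimate and objective history."""
    m, n = len(phi), len(phi[0])
    # The squared Frobenius norm bounds the largest eigenvalue of phi^T phi,
    # so 1/lipschitz is a conservative step size that never increases the objective.
    lipschitz = sum(p * p for row in phi for p in row) or 1.0
    v = [0.0] * n
    history = [lasso_objective(phi, y, v, lam)]
    for _ in range(n_iter):
        r = residual(phi, y, v)
        grad = [sum(phi[i][j] * r[i] for i in range(m)) for j in range(n)]
        v = [soft_threshold(vj - g / lipschitz, lam / lipschitz)
             for vj, g in zip(v, grad)]
        history.append(lasso_objective(phi, y, v, lam))
    return v, history


# Hypothetical toy data: a sparse "potential" sensed by random 0/1 sensors.
random.seed(0)
n_interactions, n_sensors = 40, 15
true_v = [0.0] * n_interactions
true_v[3], true_v[17], true_v[31] = -2.0, 1.5, -1.0
phi = [[float(random.random() < 0.3) for _ in range(n_interactions)]
       for _ in range(n_sensors)]
y = [sum(p * t for p, t in zip(row, true_v)) for row in phi]  # mesoscopic energies

v_hat, history = ista(phi, y, lam=0.05)
```

How closely `v_hat` matches `true_v` depends on the sparsity of the signal, the incoherence of the rows of `phi`, and the number of sensors, which are the conditions discussed above.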

Fig. 2. — Comparison of conventional sensing, compressed sensing, and de novo potential determination. (A) In conventional image sensing the intensities of all pixels are acquired directly. (B) In compressed sensing, the image is inferred using ℓ₁ minimization from a relatively small number of measurements that sum the signal intensity of multiple image pixels. (C) Potential determination as an application of compressed sensing. The potential is represented by a heat map of microscopic interaction energies ranging from repulsive (dark pink) to attractive (dark blue). The interatomic protein–DNA potential can be inferred by ℓ₁ minimization from measurements provided by a small number of sensors (protein–DNA structures + binding energies). See text for additional details and Fig. S2 for a more mathematical treatment.

Mathematical Formulations.

In SI Text S1, Section 1 (see also Fig. S2), we show that ℓ₁-regularized linear regression (8) can be used to infer potentials from mesoscopic interaction energies. We also derive a probability-based formulation that uses the relative probability of a mesoscopic interaction within a collection of possible alternative interactions (e.g., alternative DNA sites where the protein binds) as experimental data. This collection must form a canonical ensemble, i.e., a set of physical states in which the energy may vary, but the volume, temperature, and number of particles are fixed. Multiple distinct canonical ensembles can be used to infer a single potential (e.g., multiple protein–DNA complexes can be used to infer a single protein–DNA potential). We derive this formulation using a constrained version of ℓ₁-regularized multinomial logistic regression (9) (SI Text S1, Section 1). In our protein–DNA application, a collection of protein–nucleotide complexes in which the protein is fixed and individual nucleotides are varied forms a canonical ensemble, and the protein’s relative binding probabilities to different nucleotides are the experimental data. These probabilities are obtained from experimentally determined position weight matrices (PWMs) of protein binding sites or from consensus binding sequences (by assuming that consensus nucleotides bind with 100% probability).
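The correspondence between the canonical ensemble and multinomial logistic regression can be made concrete with a short sketch. In a canonical ensemble the probability of state i is exp(−E_i/kT) divided by the sum of that factor over all states; when each E_i is a linear function of the potential, this is the softmax form fitted by multinomial logistic regression. The code below assumes a hypothetical three-interaction potential and four alternative nucleotides, sets kT to 1, and omits the constraints of the full derivation in SI Text S1, Section 1.

```python
import math


def state_energies(count_vectors, potential):
    """Energy of each state: sum of its microscopic interaction energies."""
    return [sum(c * v for c, v in zip(counts, potential)) for counts in count_vectors]


def boltzmann(energies, kT=1.0):
    """Canonical-ensemble probabilities, computed with a max shift for stability."""
    scaled = [-e / kT for e in energies]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]


def penalized_loss(count_vectors, observed_probs, potential, lam, kT=1.0):
    """Cross-entropy between observed and model probabilities plus an l1 penalty."""
    model = boltzmann(state_energies(count_vectors, potential), kT)
    cross_entropy = -sum(p * math.log(q)
                         for p, q in zip(observed_probs, model) if p > 0)
    return cross_entropy + lam * sum(abs(v) for v in potential)


# Hypothetical ensemble: one protein position with nucleotides A, C, G, T.
potential = [-1.0, 0.0, 0.5]           # energies of three interaction types
count_vectors = [[2, 0, 0],            # A
                 [1, 1, 0],            # C
                 [0, 1, 1],            # G
                 [0, 0, 2]]            # T
energies = state_energies(count_vectors, potential)
probs = boltzmann(energies)

# Consensus data: the consensus nucleotide (here A) is assigned probability 1.
consensus_target = [1.0, 0.0, 0.0, 0.0]
loss = penalized_loss(count_vectors, consensus_target, potential, lam=0.1)
```

Minimizing such a loss over the potential, summed across all ensembles in the dataset, is the kind of optimization the probability-based formulation poses; PWM-derived probabilities would simply replace `consensus_target`.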

Results

Application to Prediction of Protein Binding Sites.

We have used the probability-based formulation to determine the protein–DNA potential of helix-turn-helix (HTH) proteins and predict their consensus binding sequences and PWM motifs. We focus on HTH proteins as they are the most widely distributed family of DNA-binding proteins, occurring in all biological kingdoms, with a large number of structures in the Protein Data Bank (10). HTH proteins include virtually all bacterial transcription factors and about 25% of human transcription factors (11). For the prediction of consensus binding sequences, the potential is inferred using probabilities derived from the consensus sequences of protein–DNA structures in a dataset reserved for training the algorithm (SI Text S1, Section 3). A separate set of protein–DNA structures is used to test predictions made with the inferred potential. For each protein–DNA structure in the test set, every DNA sequence position is mutated in silico to every possible pair of nucleotides, and the relative binding affinities of the mutated structures are computed (SI Text S1, Section 3). In silico mutagenesis was carried out using the 3DNA software package (12, 13), which maintains the backbone atoms of the DNA molecule, but replaces the base pair atoms in a way that is consistent with the backbone orientation in the crystal. We assume independence of DNA positions and repeat this process for every position. The most probable nucleotides at all positions are predicted with 12.9% error, compared to 42.1% error by the leading alternative method (Table 1, Baseline model). For the more complex problem of predicting quantitative PWMs, we determine the potential using probabilities derived from published experimentally determined PWMs of the 63 protein–DNA structures in the dataset (Dataset S2 and SI Text S1, Section 3). 
Compared to leading physical and statistical potentials (6, 14–16), our de novo potential method produces the best PWM score (Table 1) on the symmetric Kullback–Leibler divergence (SKLD) metric (SI Text S1, Section 3). Note that the second and third best performing potentials require consensus binding sequences as input. Providing that input significantly simplifies the problem, whereas our method infers the consensus binding sequences.
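The prediction step can be sketched as follows, assuming the energies of the in silico mutated structures have already been computed with an inferred potential. Each DNA position is treated independently: the four energies at a position are converted to relative binding probabilities (one PWM column), and the consensus sequence is read off as the most probable nucleotide per position. The energy values and kT = 1 are hypothetical, and single-letter labels stand in for base pairs.

```python
import math

BASES = "ACGT"


def pwm_column(energies, kT=1.0):
    """Convert per-nucleotide energies at one position into probabilities."""
    scaled = {b: -energies[b] / kT for b in BASES}
    top = max(scaled.values())
    weights = {b: math.exp(scaled[b] - top) for b in BASES}
    z = sum(weights.values())
    return {b: weights[b] / z for b in BASES}


def predict_pwm(energies_by_position, kT=1.0):
    """Build a PWM one column at a time, assuming independent positions."""
    return [pwm_column(energies, kT) for energies in energies_by_position]


def consensus(pwm):
    """Most probable nucleotide at every position."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in pwm)


# Hypothetical energies of the mutated structures at three DNA positions.
energies_by_position = [
    {"A": -1.0, "C": 0.2, "G": 0.0, "T": 0.5},
    {"A": 0.3, "C": 0.1, "G": 0.4, "T": -2.0},
    {"A": 0.0, "C": -0.8, "G": 0.1, "T": 0.0},
]
pwm = predict_pwm(energies_by_position)
predicted_consensus = consensus(pwm)
```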

Table 1. Performance of de novo potential and other leading potentials

Potential	Type	Intra-DNA interactions	Prediction quality: consensus sequence error	Prediction quality: PWM symmetric KL divergence
Random model	N/A	N/A	75%	3.335
Methods requiring consensus sequences
Rosetta (14)	physical	yes	N/A	2.632
Cumulative contacts (15)	statistical	no	N/A	2.033
DNAPROT (16)	statistical	yes	N/A	1.991
Methods not requiring consensus sequences
DNAPROT* (16)	statistical	no	60.2%	3.279
Rosetta* (14)	physical	no	50.8%	2.719
Quasichemical (6)	statistical	no	42.1%	2.248
Our methods (do not require consensus sequences)
Baseline	de novo	no	12.9%	1.960
Transformed inputs	de novo	no	10.2%	1.861
Region specific	de novo	no	13.7%	1.792
Both generalizations	de novo	no	10.1%	1.699

Performance is assessed based on predictions of consensus sequences and PWMs averaged over the 63 structures in the dataset. For consensus sequence prediction, error is measured by the percentage of incorrectly predicted bases. For PWM predictions, the average SKLD over all DNA positions is reported (lower is better). A random model in which all DNA base pairs are assumed to be equally likely is also shown for reference.

^*Only the direct readout components of potentials are used in those tests because they do not require consensus sequences as input.
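The two metrics in Table 1 can be illustrated with a minimal sketch. The exact SKLD definition is given in SI Text S1, Section 3; the version below assumes the sum of the two directed Kullback–Leibler divergences, natural logarithms, and a small pseudocount to avoid division by zero, so its values are not expected to reproduce the table. The example sequences and PWM columns are hypothetical.

```python
import math

BASES = "ACGT"


def smooth(column, pseudocount=1e-3):
    """Add a pseudocount to every base and renormalize (assumed regularization)."""
    total = sum(column[b] + pseudocount for b in BASES)
    return {b: (column[b] + pseudocount) / total for b in BASES}


def kl_divergence(p, q):
    """Directed KL divergence between two smoothed PWM columns."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in BASES)


def skld(p, q, pseudocount=1e-3):
    """Symmetric KL divergence: KL(p||q) + KL(q||p)."""
    p, q = smooth(p, pseudocount), smooth(q, pseudocount)
    return kl_divergence(p, q) + kl_divergence(q, p)


def mean_skld(pwm_a, pwm_b):
    """Average SKLD over all DNA positions of two equal-length PWMs."""
    return sum(skld(a, b) for a, b in zip(pwm_a, pwm_b)) / len(pwm_a)


def consensus_error(predicted, experimental):
    """Fraction of positions where the predicted base is wrong."""
    if len(predicted) != len(experimental):
        raise ValueError("sequences must have equal length")
    wrong = sum(1 for p, e in zip(predicted, experimental) if p != e)
    return wrong / len(experimental)


uniform = {b: 0.25 for b in BASES}
peaked = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
divergence = skld(peaked, uniform)
error = consensus_error("TAATTG", "TAATCG")  # one mismatch in six bases
```

A model that guesses uniformly among the four bases is wrong three times out of four in expectation, which matches the 75% consensus error of the random model.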

Generalizations.

We also consider two generalizations that relax physical constraints, yielding pseudopotentials that perform better in practice. First, we mathematically transform the regression inputs to improve their statistical and numerical properties (SI Text S1, Section 2). Second, we infer distinct potentials for interactions occurring in different regions of the HTH–DNA-binding interface, motivated by the observation that binding affinity is strongest in the core region of the binding interface and gets progressively weaker away from the core region (17) (SI Text S1, Section 2 and Fig. S3). We tested these generalizations individually and in combination (see Table 1). The consensus sequence predictions are slightly improved (10.1% vs. 12.9% error), but the improvement in PWM prediction is dramatic (SKLD of 1.699 vs. 1.960), larger than the gain obtained in going from the Quasichemical (6) to the DNAPROT (16) algorithm (2.248 vs. 1.991), which accounts for intra-DNA interactions and requires consensus sequences. The positive impact of this second generalization on PWM prediction, but not on consensus sequence prediction, results because consensus sequences do not capture the relative binding strength of protein–DNA interactions for alternative DNA sequences. Fig. 3 shows a bar chart of the accuracy of all 63 predictions made using our de novo potential with both generalizations, along with representative best, average, and worst case predictions of consensus sequences and PWMs. Fig. 4 compares predictions for the proteins where the de novo algorithm exhibited the greatest improvements relative to other methods.

Fig. 3. — Representative performance of de novo potential in predicting DNA-binding sites of 63 proteins. (A) Bar chart of the errors (fraction of incorrect bases) in consensus sequences predicted using de novo potential method. Each bar represents a single prediction made by the algorithm, with shorter bars corresponding to better predictions. Highlighted examples (pink bars) represent best, average, and worst cases, with insets comparing experimentally determined consensus sequences (*Top*) to predictions (*Bottom*). (B) SKLD scores (lower is better) for PWM predictions, with insets comparing experimentally determined PWMs (*Top*) to predictions (*Bottom*).

Fig. 4. — Examples highlighting significant improvement in prediction quality between de novo potential and other leading potentials. (A) Experimental and predicted consensus sequences for the *Drosophila melanogaster* Ultrabithorax Hox protein (*Left*) and the *Saccharomyces cerevisiae* MATα2 (*Right*) protein are shown. (*Top* to *Bottom*) Experimental, de novo potential, Rosetta (direct readout only), Quasichemical, and DNAPROT (direct readout only). (B) Experimental and predicted PWMs for the *Homo sapiens* Pax6 Paired domain (*Left*) and *D. melanogaster* Engrailed homeodomain (*Right*) are shown. (*Top* to *Bottom*) Experimental, de novo potential, Rosetta, Cumulative Contacts, Quasichemical, and DNAPROT.

Characterization of Best Performing Model.

As discussed earlier, a collection of sensors must be incoherent to ensure high-quality reconstruction of the underlying potential with compressed sensing methods, and the potential must be sufficiently sparse relative to the number of available measurements (SI Text S1, Section 1). To determine the degree to which these requirements are satisfied by the potentials derived from the protein–DNA complexes in our dataset, we consider the best performing baseline model, applied without the two generalizations discussed in the previous section. This model produced an energy potential with a total of 2,997 unique microscopic interactions using 1.3-Å wide distance bins and a 5.9-Å cutoff distance (SI Text S1, Section 2). The total number of sensors in the dataset is 592; each protein–DNA crystal structure yields multiple sensors because we make the common assumption of independence between DNA base pair positions. As previously noted, accurate inference is still possible despite having fewer sensors than unique microscopic interactions, if the potential is sufficiently sparse. Of the 2,997 unique interactions, only 270 (about 9%) have nonzero energy, suggesting that our dataset will yield accurate potentials. In fact, it is likely that the best performing choices of binning width and cutoff distance used for the baseline model represent the optimal trade-off between spatial resolution and statistical power.

An additional way to assess the suitability of the dataset for potential inference is to consider all the pairwise angles between the sensor vectors in the dataset. The distribution of the absolute values of the cosines of these pairwise angles (Fig. S4) characterizes the incoherence of the sensors (18). The mean and median values of this distribution are 0.081 and 0.041, respectively. These values are far below 1, indicating that the set of sensors formed by the protein–DNA crystal structures in our dataset has good incoherence properties (18, 19).
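The incoherence measure described above can be computed directly from the sensor vectors. The sketch below takes each sensor as a plain list of interaction counts, computes the absolute cosine of the angle between every pair, and summarizes the distribution by its mean and median. The three sensor vectors are hypothetical and far smaller than the 592 sensors in the dataset.

```python
import math
from itertools import combinations
from statistics import mean, median


def abs_cosine(u, v):
    """Absolute cosine of the angle between two sensor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty sensor shares nothing with any other sensor
    return abs(dot) / (norm_u * norm_v)


def pairwise_abs_cosines(sensors):
    """Absolute cosines for all unordered pairs of sensors."""
    return [abs_cosine(u, v) for u, v in combinations(sensors, 2)]


# Hypothetical sensors over four microscopic interactions.
sensors = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
cosines = pairwise_abs_cosines(sensors)
mean_coherence = mean(cosines)
median_coherence = median(cosines)
```

Values near 0 indicate sensors that sample nearly disjoint sets of interactions, and values near 1 indicate nearly redundant sensors.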

Discussion

Potential for Improving Performance.

The protein–DNA-binding site predictions we report are an application of our de novo potential inference method, and they exhibit a dramatic improvement over the leading alternative methods, with predictions within the experimental error of the PWMs for at least half of the cases studied. Although the accuracy depicted in Fig. 3 is substantially better than that achieved with alternative methods, the predictions for the proteins in the lower quarter of Fig. 3 need to be improved. The 63 protein–DNA complexes in our database may provide biased or insufficient coverage of some of the microscopic interactions. Alternatively, there could be significant variance in the quality of the crystal structures determined for the different complexes in the database that we curated from the Protein Data Bank (10). Additionally, prediction errors might reflect the effects of other mechanisms that affect the shape and accessibility of the DNA in vivo so that the structure of the crystallized complex differs from the in vivo structure. Also, some transcription factors have been observed to bind DNA with two or more distinct motifs (20), and the alternate motifs would be missing from our dataset.

Principled Selection of Crystallization Targets.

As in other CS application domains, the accuracy of the inferred potentials depends on the characteristics of the sensor matrix: in this case, the collection of protein–DNA structures available. As discussed above, quantitative measures such as coherence can be used to assess the suitability of a sensor matrix for compressive sensing (5, 18, 19). These measures can provide a principled framework for selecting additional crystallization targets that will maximally enhance the sensing performance of a protein–DNA structural dataset. We showed above that our current dataset has good incoherence properties, yet we expect that the addition of more protein–DNA crystal structures, specifically chosen to yield a sensor matrix with even lower coherence, will yield more accurate energy potentials and better binding site predictions.
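One way such a selection could work is sketched below. No specific selection algorithm is given here, so the greedy rule is an assumption made for illustration: among candidate complexes whose expected sensor vectors can be estimated (for example, from homology models), choose the one whose largest absolute cosine with the existing sensors is smallest. The candidate names and vectors are hypothetical.

```python
import math


def abs_cosine(u, v):
    """Absolute cosine of the angle between two sensor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return abs(dot) / (norm_u * norm_v)


def worst_case_coherence(candidate, sensors):
    """Largest absolute cosine between a candidate and any existing sensor."""
    return max(abs_cosine(candidate, s) for s in sensors)


def pick_least_coherent(candidates, sensors):
    """Name of the candidate that overlaps least with the current dataset."""
    return min(candidates,
               key=lambda name: worst_case_coherence(candidates[name], sensors))


# Hypothetical existing sensors and candidate crystallization targets.
existing = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
candidates = {
    "complex_A": [1, 1, 0, 0],  # duplicates an existing sensor
    "complex_B": [0, 0, 0, 1],  # samples an interaction not yet covered
    "complex_C": [0, 0, 1, 1],  # partial overlap
}
best = pick_least_coherent(candidates, existing)
```

A practical criterion would also need to weigh experimental feasibility and the uncertainty in the estimated sensor vectors, which this sketch ignores.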

Advantages over Statistical Potentials.

Although statistical potentials and our de novo potentials both use experimental datasets to derive the final energy potential, de novo potentials are nonparametric; i.e., they do not assume an underlying mathematical form. In contrast, statistical potentials rely on experimental datasets to fit a parametric model with a fixed mathematical form. De novo potentials overcome additional limitations specific to statistical potentials. First, although statistical potentials utilize only atomic data such as crystal structures to fit their parameters, de novo potentials combine structural information and experimental binding data into a single formulation for inference. In the field of protein–DNA-binding site prediction, combining these types of data has been a long-standing objective (21–24). Second, statistical potentials implicitly assume that all structures come from the same canonical ensemble. This oversimplification ignores the chain connectivity and amino acid composition of proteins, and it is thought to be the cause of common anomalies observed in statistical potentials (2). In de novo potentials, this assumption is eliminated in the energy-based formulation, and it is relaxed significantly in the probability-based formulation, so that only subsets of the data are assumed to form canonical ensembles (SI Text S1, Section 1). Third, by assigning equal weight to microscopic interactions observed in different structures, statistical potentials implicitly assume that different structures have the same binding or formation energy. This is not the case, as different protein–DNA complexes are known to have different binding affinities. From a statistical mechanical standpoint, high affinity complexes correspond to more frequent mesoscopic interactions than low affinity complexes, requiring that the underlying microscopic interactions be given proportionally greater weight. 
De novo potentials take this into account, as the binding energies of mesoscopic interactions are an explicit part of the formulation. Fourth, there is no theoretical assurance that as more data are added for the statistical potential fitting process, the inferred energies will ultimately converge to the true underlying interaction energies. The same observation holds for physical potentials. In contrast, for de novo potentials, if the formal requirements of compressed sensing are satisfied, then additional sensors in the dataset will lead to ever closer estimates of the interaction energies.

De Novo Potentials in Other Molecular Interaction Domains.

The important insight that underlies our de novo methodology is that experimental datasets relating to mesoscopic interactions in a wide range of fields can be cast as incoherent measurements of microscopic interactions in the compressed sensing framework. In these cases, powerful compressed sensing methods can be used to determine de novo potentials. In the current protein–DNA application, microscopic interactions were defined to always involve one protein atom and one DNA atom, thus neglecting intra-DNA interactions. If microscopic interactions are defined to include noncovalent contacts between two DNA atoms, then the indirect readout component of protein–DNA interactions can also be modeled, capturing intra-DNA interactions.

Similarly, to infer a potential for protein–protein interactions or for protein folding, noncovalent contacts between protein atoms would be treated as the microscopic interactions. However, interactions need not be restricted to those involving pairs of atoms. Interactions involving multiple atoms, as well as coarse-grained potentials in which the “atoms” of the systems are residues for example, could be used. For mesoscopic measurements of protein–protein interactions, biochemical data on protein–protein binding kinetics can be used. The Protein–Protein Interaction Thermodynamic Database (PINT) is a database of such measurements (25). For mesoscopic measurements of protein folding, the kinetics or mean folding times of proteins are necessary. Although such measurements are difficult to obtain, significant progress in experimental techniques has been made recently (26–28). The advent of these recent experimental techniques promises to make such measurements more readily available in the future.

Acknowledgments.

R. Altman, G. Bejerano, J. Boyd Kozdon, A. Deacon, S. Hong, V. Pande, K. Sachs, and L. Shapiro provided helpful comments. We thank E. Candes, T. Hastie, and M. Levitt for insightful discussions, A. Morozov for helpful advice on the Rosetta protein–DNA module, and K. Arya and G. Cooperman for customizing the DMTCP checkpointing software for our purposes. Wolfram Research provided the Mathematica software environment necessary for the analyses performed. This work was supported by Department of Energy Office of Science Grant DE-FG02-05ER64136 (to H.H.M.). We used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under Contract DE-AC02-05CH11231. M.A. was supported by the Stanford Genome Training Program (Grant T32 HG00044 from the National Human Genome Research Institute).

Footnotes

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

See Commentary on page 14713.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1106460108/-/DCSupplemental.

References

1. Jorgensen WL, Tirado-Rives J. Potential energy functions for atomic-level simulations of water and organic and biomolecular systems. Proc Natl Acad Sci USA. 2005;102:6665–6670. doi: 10.1073/pnas.0408037102.
2. Zhou Y, Zhou HY, Zhang C, Liu S. What is a desirable statistical energy function for proteins and how can it be obtained? Cell Biochem Biophys. 2006;46:165–174. doi: 10.1385/cbb:46:2:165.
3. Ponder JW, Case DA. Force fields for protein simulations. Adv Protein Chem. 2003;66:27–85. doi: 10.1016/s0065-3233(03)66002-x.
4. Chartrand R, Baraniuk RG, Eldar YC, Figueiredo MAT, Tanner J. Introduction to the issue on compressive sensing. IEEE J Sel Top Signal Process. 2010;4:241–243.
5. Candes EJ, Wakin MB. An introduction to compressive sampling. IEEE Signal Process Mag. 2008;25:21–30.
6. Donald JE, Chen WW, Shakhnovich EI. Energetics of protein-DNA interactions. Nucleic Acids Res. 2007;35:1039–1047. doi: 10.1093/nar/gkl1103.
7. Brändén C-I, Tooze J. Introduction to Protein Structure. 2nd Ed. New York: Garland; 1999. p. xiv.
8. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Series B Stat Methodol. 1996;58:267–288.
9. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
10. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
11. Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM. A census of human transcription factors: Function, expression and evolution. Nat Rev Genet. 2009;10:252–263. doi: 10.1038/nrg2538.
12. Lu XJ, Olson WK. 3DNA: A software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res. 2003;31:5108–5121. doi: 10.1093/nar/gkg680.
13. Lu XJ, Olson WK. 3DNA: A versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat Protoc. 2008;3:1213–1227. doi: 10.1038/nprot.2008.104.
14. Morozov AV, Havranek JJ, Baker D, Siggia ED. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005;33:5781–5798. doi: 10.1093/nar/gki875.
15. Morozov AV, Siggia ED. Connecting protein structure with predictions of regulatory sites. Proc Natl Acad Sci USA. 2007;104:7068–7073. doi: 10.1073/pnas.0701356104.
16. Angarica VE, Perez AG, Vasconcelos AT, Collado-Vides J, Contreras-Moreira B. Prediction of TF target sites based on atomistic models of protein-DNA complexes. BMC Bioinformatics. 2008;9:436. doi: 10.1186/1471-2105-9-436.
17. Wintjens R, Rooman M. Structural classification of HTH DNA-binding domains and protein-DNA interaction modes. J Mol Biol. 1996;262:294–313. doi: 10.1006/jmbi.1996.0514.
18. Tropp JA. On the conditioning of random subdictionaries. Appl Comput Harmon Anal. 2008;25:1–24.
19. Candes EJ, Romberg J. Quantitative robust uncertainty principles and optimally sparse decompositions. Found Comput Math. 2006;6:227–254.
20. Badis G, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–1723. doi: 10.1126/science.1162327.
21. Mirny LA, Gelfand MS. Structural analysis of conserved base pairs in protein-DNA complexes. Nucleic Acids Res. 2002;30:1704–1711. doi: 10.1093/nar/30.7.1704.
22. Hoglund A, Kohlbacher O. From sequence to structure and back again: Approaches for predicting protein-DNA binding. Proteome Sci. 2004;2:3. doi: 10.1186/1477-5956-2-3.
23. Eisen M. All motifs are NOT created equal: Structural properties of transcription factor-DNA interactions and the inference of sequence specificity. Genome Biol. 2005;6:P7.
24. Moroni E, Caselle M, Fogolari F. Identification of DNA-binding protein target sequences by physical effective energy functions: Free energy analysis of lambda repressor-DNA complexes. BMC Struct Biol. 2007;7:61. doi: 10.1186/1472-6807-7-61.
25. Kumar MD, Gromiha MM. PINT: Protein-Protein Interactions Thermodynamic Database. Nucleic Acids Res. 2006;34:D195–198. doi: 10.1093/nar/gkj017.
26. Vendruscolo M, Paci E. Protein folding: Bringing theory and experiment closer together. Curr Opin Struct Biol. 2003;13:82–87. doi: 10.1016/s0959-440x(03)00007-1.
27. Oliveberg M, Wolynes PG. The experimental survey of protein-folding energy landscapes. Q Rev Biophys. 2005;38:245–288. doi: 10.1017/S0033583506004185.
28. Mello CC, Barrick D. An experimentally determined protein folding energy landscape. Proc Natl Acad Sci USA. 2004;101:14102–14107. doi: 10.1073/pnas.0403386101.

Supplementary Materials

1106460108_pnas.1106460108_SI.pdf (743.5KB, pdf)

1106460108_SD01.xls (24.5KB, xls)

1106460108_SD02.xls (57.5KB, xls)
