Proc Natl Acad Sci USA. 2008 Feb 27;105(9):3209–3214. doi: 10.1073/pnas.0712329105

Centroid estimation in discrete high-dimensional spaces with applications in biology

Luis E Carvalho 1, Charles E Lawrence 1,*
PMCID: PMC2265131  PMID: 18305160

Abstract

Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet, the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As is well known, statistical decision theory shows that maximum likelihood and related estimators use data only to identify the single most probable solution. Accordingly, unless this one solution so dominates the immense ensemble of all solutions that its probability is near one, there is no principled reason to expect such an estimator to be representative of the posterior-weighted ensemble of solutions, and thus to represent the inferences drawn from the data. We employ statistical decision theory to find more representative estimators, centroid estimators, in a general high-dimensional discrete setting by using a family of loss functions with penalties that increase with the number of differences in components. We show that centroid estimates are obtained by maximizing the marginal probabilities of the solution components for unconstrained ensembles and for an important class of problems, including sequence alignment and the prediction of RNA secondary structure, whose ensembles contain exclusivity constraints. Three genomics examples are described that show that these estimators substantially improve predictions of ground-truth reference sets.

Keywords: prediction, statistical inference, computational biology, discrete decoding


In the past decade, high-throughput data-acquisition technologies have rendered datasets with sizes unimaginable to our predecessors, including the sequence of the human genome (1) and the products of numerous high-throughput technologies of the post-genome era (2), data warehouses of commercial and internet transactions (3), and surveys of the objects of the universe (4). Although the emergence of such large datasets seems to imply more precise parameter estimation, paradoxically just the opposite is becoming increasingly common. This paradox emerged because these technologies simultaneously opened opportunities to draw inferences on previously unanswerable high-dimensional questions.

Estimation and prediction have long been dominated by procedures that identify the most probable point, including maximum likelihood estimation (5), maximum a posteriori (MAP) estimates such as Viterbi decoding of hidden Markov models, and minimum "free-energy" structure predictions (6, 7). These types of estimators are referred to as ML estimators (maximum likelihood-family estimators) in the remainder of this article. In addition, many algorithms that optimize scoring functions to produce estimates or predictions correspond to equivalent maximum likelihood estimation procedures (8, 9), and thus also yield ML estimators.

Historically, there have been good reasons for this dominance. ML estimates are intuitively appealing because they identify the point in the space of the unknowns for which the data have highest probability. In the prediction of molecular structures, if the energy of one structure is sufficiently lower than all of the others, its probability will be near one and thus it will dominate the ensemble. More importantly, the long dominance of ML estimators rests on a principled foundation showing that they possess a number of very desirable properties, at least in the historic setting in which they were developed and have been very successfully applied: low-dimensional continuous spaces. Specifically, ML estimators have three key advantageous properties, as reported by Wald and Cramér (10, 11): consistency, asymptotic normality, and asymptotic efficiency. However, these properties only hold asymptotically as the data increase relative to the number of unknowns, and only properly for continuous variables. Such conditions are not attained when interest is focused on the inference of high-dimensional (high-D) discrete unknowns. Thus, the principled foundation supporting ML estimators is absent in this new setting.

Furthermore, evidence has emerged indicating that, in practice, estimators that gather information from the full ensemble of solutions predict ground-truth reference sets better than ML estimators. Specifically, Miyazawa (9) described reliable alignments that outperformed maximum similarity alignment procedures in the prediction of protein structure. More recently, Ding et al. (12) derived centroid estimators for predicting RNA secondary structure, and showed that they outperform well established ML estimators. Thus, there is now evidence in principle and in practice suggesting the need for alternative estimation procedures.

Bayesian inference provides a very useful alternative that enjoys a number of advantages in continuous settings. However, posterior means are not applicable here because, in general, they do not provide discrete solutions. Also, when interest is not in the overall solution but in its individual components, maximization of the marginal probabilities has been proposed (13). In sequence alignment, Miyazawa developed reliable alignments that maximize marginal probabilities and showed that these estimates meet the problem's main constraints (9). For the special case of predicting RNA secondary structure, one of us and colleagues developed centroid predictions and showed that these meet this problem's constraints (12).

Here, we use statistical decision theory to broadly generalize and extend the results of Ding et al. (12), to formally develop an alternative class of centroid estimators, and to prove some related theorems.

Background

A common high-dimensional inference problem concerns the estimation of n correlated binary variables, θ, living on a subset of {0,1}^n. For example, in network identification one seeks to predict whether pairs of nodes are connected. In these applications the binary variables θ are not observed directly; rather, they must be inferred from available data, S. For example, to predict gene networks, binary variables indicating interactions between pairs of genes are inferred from intensity data from microarray assays of the expression levels of thousands of individual genes (14).

In maximum likelihood estimation, one finds the values of the unknowns that maximize the probability of the observed data, argmax_θ P(S|θ). In Bayesian statistics, the other major statistical inference paradigm, Bayes's theorem gives the probability of the allowable realizations of the set of unknown random variables θ after seeing the data S, which is thus called the posterior probability:

$$P(\theta \mid S, \Lambda) = \frac{P(S \mid \theta, \Lambda)\, P(\theta \mid \Lambda)}{\sum_{\theta' \in \Theta} P(S \mid \theta', \Lambda)\, P(\theta' \mid \Lambda)} \tag{1}$$

where P(θ|Λ) is the prior probability of θ conditional on a set of parameters Λ and P(S|θ,Λ) is the likelihood of the data conditional on θ and Λ. The denominator in Eq. 1, which is analogous to the partition function in statistical physics, is often called the marginal likelihood of the data. The common Bayesian ML estimator is the posterior probability maximizer, the maximum a posteriori (MAP) estimator:

$$\hat{\theta}_{\mathrm{MAP}} = \operatorname*{argmax}_{\theta \in \Theta} P(\theta \mid S, \Lambda) \tag{2}$$

where, from now on, we denote by Θ the set of all possible realizations of θ.
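
To make Eqs. 1 and 2 concrete, the following minimal Python sketch enumerates a small solution space, forms the posterior of Eq. 1 from a prior and a likelihood, and takes the MAP estimate of Eq. 2. The space, prior, and likelihood values are hypothetical, chosen only for illustration.

```python
import numpy as np

def posterior(prior, likelihood):
    """Return P(theta | S) for every theta in an enumerated space."""
    joint = prior * likelihood      # numerator of Eq. 1
    return joint / joint.sum()      # normalize by the marginal likelihood

prior = np.full(4, 0.25)                       # uniform prior over 4 solutions
likelihood = np.array([0.1, 0.4, 0.3, 0.2])    # hypothetical P(S | theta)
post = posterior(prior, likelihood)

theta_map = int(np.argmax(post))               # Eq. 2: the MAP estimate
print(post, theta_map)                         # MAP picks the single mode
```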

Consider now the simple case of three correlated binary variables. Suppose that the eight combinations of these variables were equally likely a priori and that, after observing the data, the probabilities of the eight alternative combinations of these variables are those shown in Fig. 1, and p2 < p1 < 2p2.

Fig. 1. Simple probability space and two different estimators: the black point L(1, 1, 1) has probability p1 and is the ML estimate; the gray points, including C(0, 0, 0), each have probability p2 as labeled, with p2 < p1 < 2p2; the white, unshaded points have zero probability.

As Fig. 1 shows, although the point L(1, 1, 1) is the most likely, it does not seem to represent the data well, because the only other points with positive probability are the four points that are about as different from it as possible, and together they are between two and four times more probable than point L. Point M is the mean, with coordinates (p1 + p2, p1 + p2, p1 + p2), and because, as discussed more formally below, p1 + p2 < 1/2, it is always closer to C(0, 0, 0) than to L.

For the general case of n binary variables, where p2 < p1 < (n − 1)p2, we could still have a cluster of n + 1 points that differ from some point C by no more than one component, each with probability p2. The point L at the opposite end of the hypercube has probability p1, all its neighbors have zero probability, and it differs from every other point with positive probability in at least n − 1 of its n components. Since p1 + (n + 1)p2 = 1, which implies that 1/(2n) < p2 < 1/(n + 2), if we choose p2 = 1/(n + 3), then p1 = 2/(n + 3). The ratio of the posterior mass of the cluster around point C to that around L is (n + 1)p2/p1 = (n + 1)/2. Thus, as the dimensionality of the problem increases, the proportion of posterior mass around point C becomes arbitrarily large compared with L, and so does the distance between C and L. Nevertheless, since p2 < p1, point L is the ML estimate.
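
This construction is easy to check numerically. The sketch below (the helper toy_ensemble is ours, purely illustrative) builds the ensemble just described and confirms that the cluster around C carries (n + 1)/2 times the posterior mass of L, even though L is the ML estimate.

```python
import numpy as np

def toy_ensemble(n):
    """Hypothetical ensemble from the text: C = (0,...,0) and its n one-flip
    neighbors each carry p2 = 1/(n+3); L = (1,...,1) carries p1 = 2/(n+3)."""
    p2, p1 = 1.0 / (n + 3), 2.0 / (n + 3)
    points = [np.zeros(n, dtype=int)]                   # point C
    points += [row for row in np.eye(n, dtype=int)]     # one-flip neighbors
    points.append(np.ones(n, dtype=int))                # point L
    return np.array(points), np.array([p2] * (n + 1) + [p1])

points, probs = toy_ensemble(10)
ml_estimate = points[np.argmax(probs)]     # L: the single most probable point
cluster_mass = probs[:-1].sum()            # total mass of the cluster at C
print(ml_estimate)                         # -> all ones
print(cluster_mass / probs[-1])            # -> (n + 1)/2 = 5.5 for n = 10
```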

Although by design this is an extreme case, many other examples present similar scenarios that raise concerns about the utility of ML estimators in high-dimensional discrete spaces. To examine these issues in less extreme and more realistic cases, we consider four questions: (i) Is there a principled basis that prevents ML estimators from being isolated from the region of space strongly recommended by the data? (ii) What alternative estimators better represent the data? (iii) Do these alternative estimators offer improved representation of the data in practice? (iv) Do these alternative estimators predict known reference cases better?

Centroid Estimation

Is There a Principled Basis Preventing ML Estimators from Being Isolated from the Region of Space Strongly Recommended by the Data?

In most high-dimensional circumstances, ML estimators are likely to have very small probability because the single term in the numerator of Eq. 1 is soon swamped by the large number of terms in the denominator. We are not the first to recognize this fact, and many who have recognized it hoped that the ML estimator would be surrounded by a large number of similar solutions that together would contain a high proportion of the posterior probability mass (15). To find an estimator that captures this hope, we employ statistical decision theory.

Statistical decision theory provides a principled means to optimally identify estimators with desirable properties through the use of loss functions. These functions assign losses to differences between the unknown value one seeks to infer and its estimate. Specifically, in the theory's jargon, we consider the loss L(θ̂, θ) associated with making the prediction θ̂ when the actual value of the unknown solution is θ, under loss function L(·) (11). To address the uncertainty inherent in estimation and prediction, we seek an estimator that minimizes the expected loss, the risk. For example, to find estimators with minimum variance, the expected squared error loss is minimized. Specifically, the posterior risk is defined as the expected loss of an estimator given the data S (16, 17),

$$R(\hat{\theta}, S) = E\left[L(\hat{\theta}, \theta) \mid S\right] = \sum_{\theta \in \Theta} L(\hat{\theta}, \theta)\, P(\theta \mid S) \tag{3}$$

It is well known that ML estimators in this discrete setting are guaranteed to minimize posterior risk only under a zero–one loss function (13). Under that loss, any other point in the space that differs from the estimator incurs a unit penalty to the risk, regardless of how different it is.
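
A minimal numerical illustration of this point, with a hypothetical four-point posterior:

```python
import numpy as np

# Under zero-one loss, the posterior risk of an estimate is 1 - P(estimate | S):
# every differing point adds the same unit penalty, however different it is.
# Minimizing this risk therefore simply selects the posterior mode (MAP/ML).
post = np.array([0.30, 0.25, 0.25, 0.20])    # hypothetical posterior
risk = 1.0 - post                            # zero-one posterior risk
print(np.argmin(risk) == np.argmax(post))    # -> True
```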

Also, because these estimators ignore all other configurations but the most probable one, there is no principled reason for them not to differ greatly from all other solutions in the space, regardless of how strongly other points are supported by the data. Thus, because ML estimators in this setting only represent themselves and frequently have small posterior probabilities, these estimators are unlikely to be good representatives of the information contained in the observed data.

What Alternative Estimators Better Represent the Data?

Although the hope of finding a high proportion of the posterior-weighted mass concentrated around an ML estimator has not been realized, the concept of finding an estimator that does concentrate mass bears further consideration. Thus, to answer this question, we employ loss functions that incur higher losses as the number of differing components increases. The motivation is to better capture the character of the distribution of posterior probability mass in the ensemble by seeking an estimator that represents collections of similar solutions, like point C in Fig. 1.

First, consider the Hamming loss,

$$L_H(z, y) = \sum_{i=1}^{n} I(z_i \neq y_i) \tag{4}$$

where I is the indicator function, that is, I(a) = 1 if a is true and I(a) = 0 otherwise, and n is the dimension of both z and y. The Hamming loss function simply measures how many components differ between two members of a discrete solution space. Its posterior risk for some estimator θ̂ is

$$E\left[L_H(\hat{\theta}, \theta) \mid S\right] = \sum_{i=1}^{n} P(\theta_i \neq \hat{\theta}_i \mid S) = n - \sum_{i=1}^{n} P(\theta_i = \hat{\theta}_i \mid S) \tag{5}$$

and so, it is immediate that, to minimize the risk, we can simply choose

$$\hat{\theta}_C = \operatorname*{argmax}_{\hat{\theta} \in \Theta} \sum_{i=1}^{n} P(\theta_i = \hat{\theta}_i \mid S) \tag{6}$$

that is, the maximizer of the sum of posterior marginals. We call θ̂C the centroid estimator; the derivation above proves the following equivalent characterization:

Theorem 1.

θ̂C is the posterior Hamming loss risk minimizer.

In the special case when Θ = A_1 × A_2 × ⋯ × A_n, where A_i is the set of possible values for the ith entry of θ, we can take the marginal posterior maximizer for each position. That is, we form an estimate by choosing, for each component, the value that is most probable, yielding the consensus estimator:

$$\hat{\theta}^{*}_{C,i} = \operatorname*{argmax}_{a \in A_i} P(\theta_i = a \mid S), \qquad i = 1, \ldots, n \tag{7}$$
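
As an unconstrained illustration of Eq. 7 for binary components, the consensus estimate can be read directly off the marginals; the posterior masses below are hypothetical.

```python
import numpy as np
from itertools import product

def consensus(space, probs):
    """Eq. 7 for binary components: pick each component's most probable value.

    space: (m, n) array of the m solutions; probs: their posterior masses.
    """
    marginals = probs @ space            # P(theta_i = 1 | S) for each i
    return (marginals > 0.5).astype(int)

# Unconstrained example: Theta = {0,1}^3 with a hypothetical posterior.
space = np.array(list(product([0, 1], repeat=3)))
probs = np.array([0.04, 0.10, 0.12, 0.04, 0.20, 0.15, 0.15, 0.20])
print(consensus(space, probs))           # componentwise marginal maximizer
```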

Constrained Centroid Estimation

Centroid estimators also have drawbacks. For instance, it might not be straightforward to derive a centroid estimator when the solution space is shaped by complex constraints. A naïve approach would be to employ the consensus estimator but, since the inference is driven by the marginals, the resulting estimate may be infeasible, that is, it may not belong to the solution space of the original problem. A simple example occurs when the space comprises the three points (1, 0, 0), (0, 1, 0), and (0, 0, 1) with probabilities p1, p2, and p3, each less than 0.5; the consensus estimator would then be (0, 0, 0), which does not belong to the space. In general, taking the maximizers of the marginals does not yield a feasible solution in such constrained problems, but under appropriate conditions it can.

For many important applications, like RNA secondary structure or sequence alignment, the discrete unknowns are binary and the constraints have the characteristic form Σ_{i∈J} θ_i ≤ 1, where J ⊂ {1, …, n}. Because at most one position in J can be matched, we say that J is restricted by an exclusivity constraint. This implies that the marginal probabilities P(θ_i = 1 | S), i ∈ J, sum to at most one, so at most one of them can exceed one-half. Therefore, we can reduce the problem to selecting the alternative whose marginal probability exceeds one-half or, if none exists, assigning zero to all alternatives of the constraint. This always yields a feasible centroid estimate. Formally, a more general result is available:

Theorem 2.

If Θ ⊂ {0,1}^n is such that every θ ∈ Θ satisfies a set of K conditions C_1, …, C_K of the form C_k: Σ_{i∈J_k} θ_i ≤ 1, where each J_k ⊂ {1, …, n} is an index set, 1 ≤ k ≤ K, then θ̂*C also satisfies each condition C_k, 1 ≤ k ≤ K; that is, θ̂*C = θ̂C. [See supporting information (SI) Appendix for the proof.]

Theorem 2 shows that, for problems in this class, consensus estimates satisfy the original problem's constraint set even if the constraints overlap, and thus are centroid estimators. For example, in the sequence alignment problem there are two essential sets of constraints. First, if we view each solution as an array of binary variables θ_ij, 1 ≤ i ≤ n and 1 ≤ j ≤ m, for sequences of lengths n and m, then each character in the first sequence should match at most one character in the second sequence, and vice versa: Σ_{j=1}^{m} θ_ij ≤ 1 for 1 ≤ i ≤ n, and Σ_{i=1}^{n} θ_ij ≤ 1 for 1 ≤ j ≤ m. The second set of constraints are collinearity constraints that prohibit the crossing of aligned character pairs: θ_ij + θ_kl ≤ 1 for 1 ≤ i < k ≤ n and 1 ≤ l < j ≤ m. Because these are all exclusivity constraints, Theorem 2 applies.
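
A short sketch of this reduction: threshold the marginals at one-half, which Theorem 2 guarantees to be feasible under exclusivity constraints. The marginal values and constraint set below are hypothetical.

```python
import numpy as np

def constrained_centroid(marginals, constraint_sets):
    """Centroid under exclusivity constraints sum_{i in J} theta_i <= 1.

    The marginals over each J sum to at most one, so at most one of them can
    exceed 1/2; thresholding at 1/2 therefore always yields a feasible point.
    """
    theta = (np.asarray(marginals) > 0.5).astype(int)
    for J in constraint_sets:               # feasibility check (never fires)
        assert theta[sorted(J)].sum() <= 1
    return theta

# Hypothetical marginals with one exclusivity constraint over positions 0-2.
print(constrained_centroid([0.6, 0.3, 0.1, 0.7], [{0, 1, 2}]))  # -> [1 0 0 1]
```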

Consider next pth power loss functions. These cover the broad class of losses whose risks are pth-order centered moments, including the important special case of the expected second centered moment. For categorical variables we can adopt a suitable binary representation to obtain the following result:

Theorem 3.

θ̂C is the posterior pth power loss risk minimizer. (See SI Appendix for the proof.)

The estimator θ̂C minimizes the expected second moment centered around itself and so is analogous to a multidimensional mean. In fact, under the same representation as before, it is the point in the space closest to the mean.

Theorem 4.

θ̂C minimizes the squared distance to the posterior mean. (See SI Appendix for the proof.)

Because this estimator is nearest to the center of mass of the posterior space, we call θ̂C the centroid estimator. Moreover, because θ̂C depends on the distance to other points and their probabilities, its behavior is quite different from θ̂MAP: it seeks to find a point that minimizes the posterior-weighted distance to all points in the ensemble instead of choosing the single highest peak in the space. Thus, it pools the data's evidence from all points in the solution space.
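
Theorem 4 is easy to verify numerically in the unconstrained binary case, where the consensus/centroid estimate obtained by rounding the marginals is exactly the point of {0,1}^n nearest to the posterior mean. A minimal check with a randomly generated posterior:

```python
import numpy as np
from itertools import product

# Numerical check of Theorem 4 on an unconstrained binary space: rounding the
# posterior mean (the consensus/centroid estimate) gives the point of {0,1}^n
# with the smallest squared distance to the mean. Illustrative only.
rng = np.random.default_rng(0)
n = 4
space = np.array(list(product([0, 1], repeat=n)))
probs = rng.random(len(space))
probs /= probs.sum()                        # a random posterior over {0,1}^n

mean = probs @ space                        # posterior mean, a point in [0,1]^n
centroid = (mean > 0.5).astype(int)         # consensus/centroid estimate
nearest = space[np.argmin(((space - mean) ** 2).sum(axis=1))]
print(np.array_equal(centroid, nearest))    # -> True
```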

Do These Alternative Estimators Offer Improved Representation of the Data in Practice?

Ding et al. (18) also applied centroid estimators to the characterization of messenger RNAs. In that study they showed that the variance about the ML estimate, the minimum free-energy (MFE) structure, was on average 66% greater than the variance about the centroid estimate, indicating that the posterior space is often asymmetric and that the most likely structure is often far from its center of mass. Although not as extreme as the illustrative example in Fig. 1, Fig. 2 shows an example in which the most likely structure, the MFE structure, lies on the periphery of the posterior space. Ding et al. (18) also showed that the posterior space often contains multiple clusters, and that the most likely structure was not in the largest cluster for 55 of the 100 mRNAs in their study.

Fig. 2. Multidimensional scaled distribution (A) and histogram of distances to the cluster 2 centroid (B), derived from 1,000 representative samples from Sfold for the secondary structure of Dermocarpa sp. ribonuclease P RNA.

Do These Alternative Estimators Predict Known Reference Cases Better?

We know of three applications of centroid estimators that have compared their predictions of known references with those of ML estimators. Ding et al. (12) made such a comparison by using an energy model identical to that used by the most popular RNA structure-prediction web server, mfold (6), which predicts the most likely structure; thus, their comparison contrasts the two estimators directly. They found that, on average, the predicted base pairs of centroids have 30% fewer prediction errors (a positive-predictive-value improvement) than those in the most likely structure, while also correctly predicting 3.5% more base pairs (a sensitivity improvement). By using a different set of free-energy parameters (19), Mathews also showed that consensus estimators, and thus centroid estimators, of RNA secondary structure improve positive predictive values by, on average, eight percentage points compared with the MFE structure (20).

In an article on the reliable alignment of protein sequences, Miyazawa (9) used a probabilistic model to obtain the marginal probabilities of pairs of aligned protein residues and used a consensus, centroid estimator to estimate an alignment. He compared centroid alignments with optimal alignments, which correspond to the most probable alignments, in their ability to predict the x-ray crystal structure of one protein in each of 109 protein pairs from that of the other. He found that the most probable alignments predicted the gold-standard reference crystal structures better than centroid alignments by at least 0.25 Å root mean square deviation (rmsd) in only 4 of the 109 protein pairs. For these four, the most probable alignment was, on average, 0.41 Å rmsd closer to the reference structure than the centroid alignment. However, he found 29 pairs for which the centroid alignment predicted the reference structure better than the most probable alignment by at least 0.25 Å, with an average improvement of 0.81 Å rmsd, thus demonstrating the centroid estimator's ability to improve the prediction of protein structure.

The computational identification of the locations of the regulatory sites of genes is another important area of study in genomics. Algorithms for this purpose are commonly known as motif-finding algorithms. Recently, Newberg et al. (21) developed a Gibbs sampling algorithm that seeks to identify regulatory sites by using sequences from multiple related species. In this study they showed that centroid solutions consistently outperformed ML estimators. For example, in their simulation study of 1,000-bp-long sequences from five yeast species, they found that the centroid estimator made 11% to 35% fewer prediction errors than the ML estimators, with equal or better sensitivity. The larger differences occurred when sites were more difficult to identify.

Conclusions

ML estimators have dominated prediction and estimation for years. Our results indicate that this paradigm has serious theoretical and practical limitations and that there are better alternatives. Specifically, by using statistical decision theory with loss functions whose penalties increase with the number of differing components, we develop centroid estimators. These estimators center themselves in the posterior-weighted ensemble by balancing the pull of its members according to their posterior probabilities and their componentwise distances from the estimator. Given the findings in three computational biology applications that centroid estimators substantially improve prediction of ground-truth reference sets without modification of the underlying probabilistic model, and, perhaps more importantly, given their principled foundation, centroid estimators offer a promising avenue for improved estimation and prediction in the discrete high-D inference problems that are becoming increasingly common in the 21st century.

Additional reports also show the utility of exploring the full ensemble of solutions. For example, Bradley et al. (7) in studies of protein structure prediction show that it is useful to sample the ensemble of solutions to identify probable energy wells. In CONTRAfold (22), conditional log-linear models are used to specify a probability distribution for RNA secondary structures conditional on RNA sequence. RNA structure estimation is then defined by the maximization of the expected accuracy, where accuracy weights correctly paired positions by a sensitivity/specificity trade-off parameter γ; when γ = 1 the estimator is the centroid estimator. Hartemink et al. (14) found posterior model averaging useful after the application of simulated annealing to visit high-scoring regions in inference of genetic regulatory networks, although our experience differs somewhat from theirs. In our experience any use of preliminary optimization such as simulated annealing is detrimental (21). Also, Zhang and Liu (23) found that the incorporation of a side-chain entropy term in a simple free-energy function significantly improved the discrimination of native protein structures from decoys.

Centroid estimation has many other potential applications outside computational molecular biology. A common area of application of high-D discrete inference is variable selection, in which discrete choices are made about the inclusion of variables in a model. For example, Casella and Moreno (24) treat variable selection in normal regression models through the use of intrinsic priors and select the model with highest posterior probability. Smith and Fahrmeir (25) consider model selection in functional magnetic resonance imaging analysis with indicator variables for the inclusion of regressors defined on a lattice, and use marginal probabilities for variable selection. Tadesse et al. (26) formulate a clustering problem by using a multivariate normal mixture model. Observations are allocated to classes according to the mode of a marginal posterior distribution, and variables are selected if their marginal posterior probability is above a threshold a; when a = 1/2, theirs is a centroid estimator.

Several caveats are appropriate. Since at this stage only a few applications have been examined, the assessment of how well these estimators will predict reference results in other settings awaits further study. The estimators developed here are appropriate in the important set of problems involving categorical variables. However, when discrete spaces involve ordinal or interval variables, estimators based on other loss functions that still achieve the goal of centering estimates in the posterior space may be more appropriate.

Ding et al. (18) showed evidence of multimodal posterior ensembles. When the clusters within these spaces are well separated, no single solution, including the centroid, is likely to represent the posterior space well. In such cases, multiple centroids, one for each cluster, may be required (12, 18). An alternative explanation for the results of Ding et al. (12) and Mathews (19), showing that there are better predictors of RNA structure than the minimum free-energy structure, is that the energy model of these two works is incomplete. Thus, if evolution selects sequences that adopt minimum-energy states as their native states, the minimum of the incomplete (secondary structure) energy models of these two studies may not correspond to this overall energy minimum, and thus yields an incorrect prediction.

Although these estimators represent the posterior space in a defined manner, left unaddressed is the question of how representative a proposed estimator is of the specific ensemble from which it is drawn, leaving for future development the need to report how far the "true" state of nature may be from the proposed estimate. Whereas our findings on feasibility cover several important cases, other cases may require further steps to ensure feasibility. Dynamic programming offers a promising avenue for obtaining these estimates while satisfying the constraints of the underlying problem (15, 21). Moreover, whereas for most constraints maintaining feasibility is important, it may not always be desirable to maintain all constraints. For example, it may be desirable to relax constraints imposed to achieve computability, such as the no-pseudoknot constraints of RNA secondary structure algorithms. Also, centroid estimators focus on making reliable predictions. In the extreme case in which the posterior space is so widely dispersed that the centroid estimate is empty, for example, with no marginal above 0.5, the data provide no support for any prediction. An estimation procedure that forces a result in this circumstance is available (15).

As we have shown in several important cases, centroid estimates can be derived from the marginal probabilities of the solution components. However, obtaining marginal distributions is often a hard problem. In such cases, approximation methods like variational Bayes (27, 28) and Markov chain Monte Carlo (MCMC) (16) can be applied. Many problems likewise present a sufficiently complicated joint probability structure to render ML estimation intractable by direct optimization, and it is common to apply sampling methods, such as MCMC, there as well. Given such a sample, estimating the marginal distributions can usually be completed with complexity linear in the number of solution components, whereas obtaining the global maximum for ML estimation usually requires a more computationally intensive sampling approach, such as simulated annealing (29).
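
As a sketch of this linear-time step: given draws from the posterior, however obtained, the marginals are just columnwise frequencies, and the centroid follows by thresholding. The sampling probabilities below are hypothetical.

```python
import numpy as np

def marginals_from_samples(samples):
    """Estimate P(theta_i = 1 | S) from posterior draws (e.g., MCMC output).

    samples: (T, n) binary array; cost is linear in the number of components.
    """
    return samples.mean(axis=0)

# Hypothetical: 1,000 independent draws over n = 5 binary components.
rng = np.random.default_rng(1)
true_marginals = np.array([0.9, 0.2, 0.6, 0.4, 0.8])
samples = (rng.random((1000, 5)) < true_marginals).astype(int)

marg = marginals_from_samples(samples)
print(marg.round(2))                       # close to the true marginals
print((marg > 0.5).astype(int))            # centroid from estimated marginals
```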

Centroid and ML estimators diverge more as the probability space grows more complex: more multimodal, structured, and correlated. The converse is easily seen when the consensus estimator is the centroid estimator: for probability spaces in which each dimension is independent of the others, maximizing the joint distribution is equivalent to maximizing the marginals, and so the two estimators always coincide.

Rapid improvements in data-acquisition technologies promise to continue to dramatically increase the pool of data in many fields. Although these data will be of great benefit, they have also opened a new universe of high-D inference and prediction problems that will likely pose major data-analytic challenges in the coming decades. Among these is the development of point estimators in discrete spaces, the focus of the centroid estimators developed here. But the more general point-estimation challenge is to find one or a small number of feasible solutions among the many in the ensemble that are, by some appropriate measure, representative of the full ensemble and suited to the structural features of the solution space. These new high-D data and unknowns will also almost certainly force a reexamination of extant approaches to interval estimation, hypothesis tests, and predictive inference. In important ways these new challenges hark back to the early days of statistical physics in the age of Newtonian mechanics, for here again we are confronted with large ensembles and entropic effects arising from their sheer size. But here it is often insufficient to deliver only distributions and averages for low-dimensional features; rather, specific high-D results are often demanded. Thus, centroid point estimates are only a small step into the challenges being driven by the rapid advance of data-acquisition technology.

Supplementary Material

Supporting Information

ACKNOWLEDGMENTS.

We thank Profs. Don McClure, Dan Weinreich, Richard Stratt, Ben Raphael, and Stuart Geman from Brown University and Drs. Lee Newberg, Ye Ding, and Clarence Chan of the Wadsworth Center (Albany, NY) for useful discussions and suggestions. This work was supported by Department of Energy Grant DE-FG02-04ER63942 and National Institutes of Health Grant R01-HG01257 (to C.E.L.) and by the Center for Computational Molecular Biology at Brown University.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0712329105/DC1.

References

1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2006;34(Database issue):D16–D20. doi: 10.1093/nar/gkj157.
2. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306(5696):636–640. doi: 10.1126/science.1105136.
3. Metwally A, Agrawal D, Abbadi AE. Using association rules for fraud detection in web advertising networks. Trondheim, Norway; 2005. pp. 169–180.
4. York DG, et al. The Sloan Digital Sky Survey: Technical summary. Astron J. 2000;120:1579–1587.
5. Fisher RA. On the mathematical foundations of theoretical statistics. Philos Trans R Soc Lond Ser A. 1921;222:309–368.
6. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31:3406–3415. doi: 10.1093/nar/gkg595.
7. Bradley P, Misura KMS, Baker D. Toward high-resolution de novo structure prediction for small proteins. Science. 2005;309:1868–1871. doi: 10.1126/science.1113801.
8. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990;87:2264–2268. doi: 10.1073/pnas.87.6.2264.
9. Miyazawa S. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 1994;8:999–1009. doi: 10.1093/protein/8.10.999.
10. Wald A. Statistical Decision Functions. New York: Wiley; 1950.
11. Lehmann EL, Casella G. Theory of Point Estimation. New York: Springer; 2003.
12. Ding Y, Chan CY, Lawrence CE. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA. 2005;11:1157–1166. doi: 10.1261/rna.2500605.
13. Besag J. On the statistical analysis of dirty pictures. J R Stat Soc Ser B. 1986;48:259–302.
14. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA. Combining location and expression data for principled discovery of genetic regulatory network models. Pac Symp Biocomput. 2002;7:437–449.
15. Durbin R, Eddy SR, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge Univ Press; 1999.
16. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd Ed. New York: Chapman and Hall/CRC; 2003.
17. Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. 2nd Ed. New York: Chapman and Hall/CRC; 2000.
18. Ding Y, Chan CY, Lawrence CE. Clustering of RNA secondary structures with application to messenger RNAs. J Mol Biol. 2006;359:554–571. doi: 10.1016/j.jmb.2006.01.056.
19. Mathews DH, et al. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA. 2004;101:7287–7292. doi: 10.1073/pnas.0401799101.
20. Mathews DH. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA. 2004;10:1178–1190. doi: 10.1261/rna.7650904.
21. Newberg LA, et al. A phylogenetic Gibbs sampler that yields centroid solutions for cis-regulatory site prediction. Bioinformatics. 2007. doi: 10.1093/bioinformatics/btm241.
22. Do CB, Woods DA, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 2006;22:e90–e98. doi: 10.1093/bioinformatics/btl246.
23. Zhang J, Liu JS. On side-chain conformational entropy of proteins. PLoS Comput Biol. 2006;2:e168. doi: 10.1371/journal.pcbi.0020168.
24. Casella G, Moreno E. Objective Bayesian variable selection. J Am Stat Assoc. 2006;101:157–167.
25. Smith M, Fahrmeir L. Spatial Bayesian variable selection with application to functional magnetic resonance imaging. J Am Stat Assoc. 2007;102:417–431.
26. Tadesse MG, Sha N, Vannucci M. Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc. 2005;100:602–617.
27. Attias H. A variational Bayesian framework for graphical models. Adv Neural Inf Process Syst. 2000;12:209–215.
28. Beal MJ, Ghahramani Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Stat 7. 2003:453–464.
29. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. 1983;220:671–680. doi: 10.1126/science.220.4598.671.
