Proceedings of the National Academy of Sciences of the United States of America. 2025 Jan 2;122(1):e2416301121. doi: 10.1073/pnas.2416301121

PhiSiCal-Checkup: A Bayesian framework to validate amino acid conformations within experimental protein structures

Piyumi R Amarasinghe a, Lloyd Allison a, Craig J Morton b, Peter J Stuckey a, Maria Garcia de la Banda a, Arthur M Lesk c, Arun S Konagurthu a,1
PMCID: PMC11725904  PMID: 39746043

Significance

Structural biology, biochemistry, evolutionary biology, medicine, and drug development all require high-quality protein structures. Reliable tools for assessing structures remain essential to ensure and maintain quality. We present a Bayesian method for analysis and validation of amino-acid conformations within protein structures: “PhiSiCal (ϕψχal) Checkup.” This method overcomes major, long-standing shortcomings and provides significant improvements over the current state of the art. By introducing more-reliable information-theoretic measures of “favorability” of amino acid conformations, PhiSiCal-Checkup provides ways to analyze protein structures that were previously out of reach for experimentalists and structural biologists.

Keywords: protein structure validation, amino acid conformation, conformation favorability, conformation outlier, Bayesian statistics

Abstract

As structural biology and drug discovery depend on high-quality protein structures, assessment tools are essential. We describe a new method for validating amino-acid conformations: “PhiSiCal (ϕψχal) Checkup.” Twenty new joint probability distributions in the form of statistical mixture models explain the empirical distributions of dihedral angles ω,ϕ,ψ,χ1,χ2,… of canonical amino acids in experimental protein structures. Marginal and conditional probability distributions for subsets of dihedral angles are derived from these joint mixture models. Together, these distributions are employed to measure rapidly the information-theoretic “favorability” of any proposed experimental protein structure. The inferred statistical models and measures overcome several shortcomings and afford improvements over the current state of the art in amino-acid conformation verification. Experimental comparisons are made against current protein conformation verification software. In a number of examples, we pick up outliers that are invisible to current methods. We also calculate, as part of verification, the sensitivity of favorability to small changes in a proposed structure accounting for the precision of coordinates. In some cases a near neighbor of a proposed amino-acid conformation may be either less or more favorable. This raises the question, is the current reliance on fixed “thresholds” for validation a good thing? PhiSiCal-Checkup is freely available for online and offline (open-source) use from https://lcb.infotech.monash.edu.au/phisical/checkup.


Experimental methods for protein structure determination produce raw measurements from which atomic coordinates are modeled as solutions to protein three-dimensional structures. The Worldwide Protein Data Bank (wwPDB) (1) administers the public archive (PDB) of protein coordinates (2) and scrutinizes them across a number of validation criteria, based on the guidelines of its validation task force (VTF) (3).

Key among the validation criteria is the knowledge-based assessment of amino-acid conformations. The current practice is to quantify the “favorability” of each amino acid’s main chain conformation (described by ϕ,ψ dihedral angles) and, separately, side chain conformation (described by χ1,χ2,… dihedral angles) using their observed distributions within carefully curated datasets of protein structures (4). This quantification allows defining and detecting “outliers” from the distributions of conformational features. (Note that although outliers deserve further examination, they are not necessarily errors. Indeed, outliers that are correct often point to features of structural or functional interest. Conversely, non-outliers are not necessarily correct.) Popular programs to validate ϕ,ψ and χ1,χ2,… dihedral angles developed over the past three decades include PROCHECK (5), WHAT_CHECK (6), O (7), and MolProbity (8).

MolProbity represents the current state of the art, recommended by VTF (9). However, it has several limitations, some of which are acknowledged by its authors:

  • “What we do expect will improve in the future is a redefinition of conformational validation...Rather than doing separate Ramachandran and rotamer evaluation, we should move toward analyzing all backbone and side-chain torsional dimensions together, including allowance for the influence of secondary structure and local motifs” (10).

(Others are discussed in Results, summarized in SI Appendix, Table ST1).

To overcome these limitations, we introduce “PhiSiCal (ϕψχal) Checkup,” a comprehensive Bayesian and information-theoretic framework for validation of amino-acid conformations. Supporting PhiSiCal-Checkup are twenty new joint probability distributions, one for each of the canonical amino acids; each distribution treats all mainchain and sidechain dihedral angles together. These are inferred as statistical mixture models using our recently published inference methodology, PhiSiCal (11), based on the unsupervised model selection criterion of Minimum Message Length (12, 13). [Although PhiSiCal provides the core methodology for inference of joint mixture models, specializing these models for the conformation-validation implemented by PhiSiCal-Checkup required several new methodological advances previously unexplored (Materials and Methods)].

In PhiSiCal-Checkup, each joint mixture model can be factorized into mutually consistent probability distributions to analyze any possible combination of dihedral angle terms. Specifically, for any amino-acid type containing d dihedral angle terms (ω,ϕ,ψ,χ1,χ2,…), the corresponding joint mixture model (defining a distribution in a d-dimensional toroidal space, 𝕋^d) can be transformed “on-the-fly” using Bayesian axioms of probability to derive 2^d − 2 marginal and 3^d − 2^(d+1) + 1 conditional probability distributions. The formal relationship between joint, marginal, and conditional distributions is governed by Bayes’ theorem (14), making all amino acid–specific distributions mutually consistent.
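The counts quoted above can be checked by brute-force enumeration. The sketch below is only an illustration of the counting argument (the function name is hypothetical and not part of PhiSiCal-Checkup): a marginal is any non-empty proper subset of the d terms, and a conditional is any non-empty target subset paired with a non-empty, disjoint conditioning subset.

```python
from itertools import combinations

def count_distributions(d: int) -> tuple[int, int, int]:
    """Return (joint, marginal, conditional) counts for d dihedral terms."""
    terms = range(d)
    marginals = 0
    conditionals = 0
    for k in range(1, d + 1):
        for target in combinations(terms, k):
            if k < d:
                marginals += 1  # non-empty proper subset of the terms
            n_rest = d - len(target)
            # any non-empty subset of the remaining terms may be conditioned on
            conditionals += 2 ** n_rest - 1
    return 1, marginals, conditionals

# Arginine has d = 8 terms: omega, phi, psi, chi1..chi5
joint, marg, cond = count_distributions(8)
print(joint, marg, cond, joint + marg + cond)
```

For d = 8 the enumeration agrees with the closed forms 2^d − 2 and 3^d − 2^(d+1) + 1, and the three counts sum to the 6,305 combinations mentioned for Arginine in the Results.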

Marginal probability distributions allow the unconstrained evaluation of any combination of the d dihedral angle terms, of which the assessment of main chain ϕ,ψ and side chain χ1,χ2,… conformations are but two possibilities. Conditional distributions allow the evaluation of any subset of dihedral angles subject to observing specific values for others, providing meticulous new ways of analyzing amino-acid conformations. For example, the conditional probability distribution of the χ1,χ2,… terms given some specific observed values for ϕ,ψ (shorthand, χ1,χ2,…|ϕ,ψ) can be used to evaluate the favorability of the observed χ1,χ2,… dihedral angles conditioned on the ϕ,ψ observation. When the (continuous) ϕ,ψ values correspond to, say, a (discrete) secondary structural state, the χ1,χ2,…|ϕ,ψ probability distribution will automatically embody the secondary-structural constraint on the χ1,χ2,… angles. By extension, the conditional distribution χ1,χ2,…|ω,ϕ,ψ constrains the evaluation of χ1,χ2,… additionally on the cis/trans/twisted peptide state informed by the continuous value of ω, beyond the state informed by ϕ,ψ. As another example, the conditional distribution of ϕ,ψ|ω allows evaluating ϕ,ψ given any observed continuous value for ω, thereby automatically accounting for ω being in cis/trans/twisted states. (Data in ref. 15 show animations of amino acid–specific ϕ,ψ|ω distributions for varying ω.)

This work also introduces new statistical measures to validate amino-acid conformations. PhiSiCal-Checkup’s validation of dihedral angles (in any combination of terms) is driven by the information-theoretic measure of lossless compression, quantified in bits. Favorability (or unfavorability) of any conformation can objectively be assessed based on the amount of compression gained (or lost) with respect to the raw bit-length of encoding the observed dihedral angles. Statistically, this measure of compression is equivalent to quantifying the log-likelihood-odds of any observation arising from its corresponding (amino acid–specific joint/marginal/conditional) mixture model compared to the raw (uniform) distribution. Further, since all statistical mixture models are continuous and differentiable at every point in the respective distributions’ support, a gradient vector in the probability distribution can be computed for any observed dihedral angles. Using this gradient, sensitivity of the reported compression statistic to perturbation of dihedral angles can be analyzed, accounting for the (im)precision of the stated atomic coordinates. This overcomes another limitation in the current state of the art: the reliance on hard (non-overlapping) membership assignments of observations to one of three categories (“outlier,” “allowed,” and “favored”) that can yield misleading assignments, especially near the boundaries of those categories.

A configurable web server and standalone software implementing PhiSiCal-Checkup are available for immediate online and offline use: https://lcb.infotech.monash.edu.au/phisical/checkup. Validation of dihedral terms is fully customizable and supported by instructive visualizations.

Results and Discussion

Statistical Mixture Models for Validation of Dihedral Angles.

Twenty new joint statistical mixture models are inferred in this work, one for each of the canonical amino acids, using the reference protein structural data set of Williams et al. (1), curated specifically for the task of structure validation (Materials and Methods and SI Appendix, Table ST2). Each joint mixture model describes a continuous amino acid–specific probability distribution of its corresponding dihedral angle terms, ω,ϕ,ψ,χ1,χ2,…. Amino acid–specific marginal and conditional probability distributions are derived (on demand) from these inferred joint mixture models. Importantly, these distributions are also statistical mixture models, characterized in a reduced number of dimensions defined by the subset they explain (Materials and Methods).
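To make concrete why marginals and conditionals of a mixture are again mixtures, here is a minimal sketch. It assumes, purely for illustration, that each mixture component is a product of independent von Mises densities (one per dihedral term); the two-component model over (ϕ, χ1), its parameter values, and all function names are hypothetical and are not the inferred PhiSiCal models.

```python
import math

def bessel_i0(kappa, terms=60):
    """Modified Bessel function I0 via its power series (fine for moderate kappa)."""
    total, term = 1.0, 1.0
    for m in range(1, terms):
        term *= (kappa / (2.0 * m)) ** 2
        total += term
    return total

def von_mises(x_deg, mu_deg, kappa):
    """von Mises density per degree on the circle."""
    delta = math.radians(x_deg - mu_deg)
    return math.exp(kappa * math.cos(delta)) / (360.0 * bessel_i0(kappa))

# Hypothetical toy model over (phi, chi1): (weight, [(mean_deg, kappa), ...])
TOY_MODEL = [
    (0.7, [(-65.0, 8.0), (-60.0, 6.0)]),
    (0.3, [(-120.0, 4.0), (180.0, 5.0)]),
]

def mixture_density(model, angles, dims):
    """Density over the terms listed in `dims`; all other terms are integrated out."""
    return sum(
        w * math.prod(von_mises(angles[i], *params[i]) for i in dims)
        for w, params in model
    )

def conditional_density(model, angles, target, given):
    """p(target | given) = p(target, given) / p(given)."""
    return (mixture_density(model, angles, target + given)
            / mixture_density(model, angles, given))

def conditional_weights(model, angles, given):
    """Component weights of the conditional mixture: w_k * f_k(given), renormalized."""
    raw = [w * math.prod(von_mises(angles[i], *params[i]) for i in given)
           for w, params in model]
    total = sum(raw)
    return [r / total for r in raw]

obs = [-65.0, -60.0]  # (phi, chi1) in degrees
print(conditional_density(TOY_MODEL, obs, target=[1], given=[0]))
print(conditional_weights(TOY_MODEL, obs, given=[0]))
```

Under this assumption, dropping a term from every component gives the marginal, and conditioning only reweights the components by how well each explains the observed conditioning angles, so every derived distribution stays a mixture in fewer dimensions.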

The amino acid–specific joint mixture model, combined with the derived marginal and conditional mixture models, provides the “basis set” of probability distributions for PhiSiCal-Checkup to interrogate dihedral angle conformations in any possible combination of terms. The computation of joint/marginal/conditional probabilities and other derived statistics remains extremely efficient, taking between 10 and 60 microseconds per computation on a standard (single-thread) computer.

Compressibility of Observed Dihedral Angles.

PhiSiCal-Checkup uses the measures of Shannon information content (16) and compression to quantify the surprise of observing any amino-acid conformation. These measures are derived from the corresponding joint/marginal/conditional mixture model, depending on the combination of dihedral angle terms being observed. Compression (measured in bits) is statistically equivalent to computing the log-odds of an observation arising from its corresponding mixture model, compared to the uniform distribution. Compression (gain) of +n bits yields 2^n:1 odds in favor of the mixture model, whereas compression (loss) of −n bits yields 1:2^n odds against it. Importantly, PhiSiCal-Checkup also assesses the sensitivity of the computed compression by locally perturbing the observation in the direction of its gradient vector in the probability distribution (Materials and Methods).
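The relationship between a model density, compression in bits, and sensitivity can be sketched in one dimension. The code below is an illustration under stated assumptions, not the PhiSiCal-Checkup implementation: the uniform reference is taken as 1/360 per degree per dihedral term, the gradient is approximated by a central finite difference, and `toy_density` is a hypothetical stand-in for an inferred marginal.

```python
import math

def compression_bits(density_per_deg, n_terms):
    """log2 odds of the model density against a uniform density of 1/360 per degree per term."""
    return math.log2(density_per_deg * 360.0 ** n_terms)

def sensitivity_1d(density, x_deg, step_deg=2.5, h=1e-3):
    """Compression at x, and its change when x moves +/- step_deg along the gradient."""
    grad = (density(x_deg + h) - density(x_deg - h)) / (2.0 * h)
    direction = 1.0 if grad >= 0 else -1.0
    base = compression_bits(density(x_deg), 1)
    uphill = compression_bits(density(x_deg + direction * step_deg), 1)
    downhill = compression_bits(density(x_deg - direction * step_deg), 1)
    return base, uphill - base, downhill - base

def toy_density(x_deg, mu_deg=180.0):
    """Hypothetical smooth circular density (per degree) peaked at mu_deg; integrates to 1."""
    return (1.0 + 0.9 * math.cos(math.radians(x_deg - mu_deg))) / 360.0

base, plus, minus = sensitivity_1d(toy_density, 19.15)
print(f"{base:.2f} bits (+{plus:.2f}/{minus:.2f})")
```

A density equal to the uniform reference gives exactly 0 bits, densities above it give a gain, and densities below it give a loss. Reporting the change in both directions along the gradient yields an asymmetric (+/−) range around the compression value.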

To understand compression, its sensitivity, and statistical odds, consider the case study of Chain A Arginine 368 (Arg-368) in the PDB coordinates of arabinoxylan arabinofuranohydrolase (AXAH) from Bacillus subtilis (3C7F). The conformation of Arg-368 in 3C7F (Chain A) yields the following dihedral angle observations: ω=19.15°, ϕ=130.68°, ψ=120.11°, χ1=67.61°, χ2=178.90°, χ3=174.16°, χ4=171.50°, χ5=179.13°. Note that the ω=19.15° discussed here is the rotation around the peptide bond connecting the preceding Glycine (Gly-367) and the current Arg-368. This suggests a strained, energetically unfavorable cis-conformation for Arg-368. However, this is not an error: checking the 1.55 Å crystal structure coordinates of 3C7F, the carbonyl preceding Arg-368 is seen facilitating the binding of the Sodium (Na+) metal ion (17).

Table 1 shows the compression and sensitivity statistics of observing Arg-368 in 16 combinations of dihedral angle terms (of the total 6,305 possible for Arginine). The individual observation of ω=19.15° (ignoring all other dihedral angles) results in a loss of compression (−11.53 bits) using its corresponding Arginine-specific marginal mixture model. This gives 1:2^11.53 ≈ 1:3,000 odds against that mixture model. The compression’s sensitivity is derived by perturbing ω=19.15°±2.5° in the direction of its gradient, causing compression to vary between [−12.2,−10.9] bits (and odds between ≈[1:2,000, 1:4,700]) in the local neighborhood of that observation. Similarly, the observed values for ω,ϕ,ψ explained using its marginal mixture model result in a loss of compression of −9.22 bits (odds of ≈1:600) with a sensitivity of ±0.7 bits. The evaluation of all ω,ϕ,ψ,χ1,χ2,… terms using the joint mixture model results in a loss of compression of −6.77 bits (≈1:100 odds). In contrast, observing only Arg-368’s side chain (χ1,χ2,…) terms leads to a gain in compression of +11.24 bits (≈2,400:1 odds). All other dihedral angles (in combinations not involving the strained ω) lead to a gain in compression. Finally, Arg-368’s side chain conformation conditioned on the pair of Ramachandran angles taking the values ϕ,ψ=130.68°,120.11° (but ignoring the strained cis-ω=19.15°) leads to a gain in compression of +11.43 bits. However, when ω=19.15° is also considered, the compression gain drops sharply to +2.5 bits. This is equivalent to the statistical odds dropping from ≈2,800:1 to ≈6:1, thus quantifying the extent of surprise of observing that ω as part of Arg-368’s overall conformation.
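The odds quoted for Arg-368 follow directly from the 2^n rule. As a small arithmetic check (the helper name is hypothetical), the compression values reported in the paragraph above can be converted as follows:

```python
def odds_from_bits(bits):
    """Return (in favor, against) odds implied by a compression value in bits."""
    ratio = 2.0 ** abs(bits)
    return (ratio, 1.0) if bits >= 0 else (1.0, ratio)

# Compression values for Arg-368 as reported in the text
reported = {
    "omega (marginal)": -11.53,
    "omega,phi,psi (marginal)": -9.22,
    "all terms (joint)": -6.77,
    "chi1..chi5 (marginal)": 11.24,
    "chi1..chi5 | phi,psi": 11.43,
    "chi1..chi5 | omega,phi,psi": 2.5,
}

for label, bits in reported.items():
    favor, against = odds_from_bits(bits)
    print(f"{label:28s} {bits:+6.2f} bits  odds {favor:,.0f}:{against:,.0f}")
```

The printed ratios round to the approximate odds given in the text.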

Table 1.

Compression and sensitivity of dihedral angles (in varying combinations) of Arg-368 in the protein coordinates of 3C7F Chain A

Observation (mixture model) | Compression (+/− sensitivity) | Observation (mixture model) | Compression (+/− sensitivity)
ω (marginal) | −11.5 (+0.6/−0.7) bits | ω,ϕ,ψ (marginal) | −9.2 (+0.7/−0.7) bits
ϕ (marginal) | 1.1 (+0.0/−0.0) bits | ϕ,ψ (marginal) | 2.5 (+0.3/−0.3) bits
ψ (marginal) | 0.7 (+0.3/−0.3) bits | χ1,χ2,χ3,χ4,χ5 (marginal) | 11.2 (+0.0/−1.4) bits
χ1 (marginal) | 3.4 (+0.0/−0.1) bits | ϕ,ψ|ω (conditional) | 2.3 (+0.1/−0.1) bits
χ2 (marginal) | 3.2 (+0.0/−0.0) bits | χ1|ϕ,ψ (conditional) | 3.0 (+0.3/−0.5) bits
χ3 (marginal) | 2.6 (+0.2/−0.3) bits | χ1,χ2,χ3,χ4,χ5|ϕ,ψ (conditional) | 11.4 (+0.0/−1.4) bits
χ4 (marginal) | 1.5 (+0.1/−0.1) bits | χ1,χ2,χ3,χ4,χ5|ω,ϕ,ψ (conditional) | 2.5 (+0.0/−0.0) bits
χ5 (marginal) | 2.3 (+0.0/−1.5) bits | ω,ϕ,ψ,χ1,…,χ5 (joint) | −6.8 (+0.7/−0.7) bits

We note that, in the current state of the art (MolProbity), the surprise of the ω value at the peptide bond between Gly-367 and Arg-368 (that gives Arg-368 a cis conformation) is overlooked as it can only validate the ϕ,ψ and χ1,χ2,… terms (independently). In sum, such detailed analysis of observed amino-acid conformations, at varying granularity and constraints, is unique to PhiSiCal-Checkup.

Quantifying Surprise of Observed Dihedral Angles.

From the relationship between statistical odds and compression, it follows that compression at zero bits yields the surprisal-odds of 1:1 (fifty–fifty), thus demarcating an objective boundary for the joint/marginal/conditional distributions: observations below zero become exponentially surprising and those above exponentially favorable.

Here, we explore how compression correlates with the empirical frequencies of dihedral angles. To achieve this we analyzed the percentile-rank distribution of compression across the 1,720,588 amino-acid conformations found in the reference dataset (4).

Fig. 1A tracks the bits of compression at varying percentile levels ({min,0.05,0.3,2,5,10,25,50,75,max}) for ϕ,ψ and χ1,χ2,… dihedral angle observations of each amino-acid type. (SI Appendix, Table ST3 provides a tabular view of the figure in raw numbers. The last row of this table summarizes the mean and SD of the compression values at each of those percentile levels.) Broadly, the compression values at each percentile mark for ϕ,ψ show relatively lower dispersion about the mean compression values, compared to those of χ1,χ2,…. This is expected because the differences in the amino-acid types arise due to their side chain groups that show varying conformational mobility and energetics (18). In the lower half of the distribution below the median, the compression statistics of Proline ϕ,ψ and χ1,χ2,… observations diverge the most, followed by Glycine for ϕ,ψ and Arginine for χ1,χ2,…. This likely arises due to Proline’s cyclic-pyrrolidine side chain, Glycine’s absence of β-carbon, and Arginine’s side chain length.

Fig. 1.


(A) Bits of compression at varying percentile levels for amino acid–specific Ramachandran ϕ,ψ (Left) and side chain χ1,χ2,… (Right) dihedral angle terms, using 1,720,588 amino-acid conformations observed in the reference dataset (4). Insets show a zoomed-in view of the left-tail of the distributions. (B) Amino acid–specific percentile ranks with <0 bits of compression are shown. Note: Amino Acids Alanine (ALA) and Glycine (GLY) do not have side chain dihedral angle terms.

Focusing on the distributions’ (infrequent) tails, Fig. 1B shows the proportions of ϕ,ψ and χ1,χ2,… observations that lose compression (<0 bits). For ϕ,ψ, this accounts for ∼5±1% of observations for most amino acids, except for Glycine with 9.1%, Valine and Isoleucine with 2.8%, and Proline with 1.4%, which deviate from this trend. For χ1,χ2,…, the proportions are more spread out, ranging from 0.1% for Proline to 8.2% for Cysteine. These results quantitatively highlight the differences in the distributions of dihedral angles across the 20 amino-acid types.

These differences were further explored by qualitatively analyzing the compression statistics for each amino acid using all-pairs compression contour maps derived from their corresponding (marginal) mixture models. Specifically, for each amino-acid type with d dihedral terms ω,ϕ,ψ,χ1,χ2,…, d-choose-2 contour plots were generated, one for each possible pair of dihedral angle terms (19). Each 2D plot shows 1) the empirical distribution for that pair, 2) compression contour lines at 1-bit intervals between [−5,+5] bits, and 3) gradient vectors showing the rate of change of probability (and hence compression) at 5°×5° intervals. Visual examination of these plots again highlights the significant differences in the amino acid–specific distributions of dihedral angles.

These plots also illustrate the close fit between the mixture models and the underlying empirical distributions. As an example, Fig. 2 shows two such plots for Proline. Specifically, Fig. 2A displays the contour plot for Proline ϕ,ψ terms. The contours correlate closely with the empirical frequencies of observed ϕ,ψ angles (encoded in light-to-dark shades of blue). The +5-bit compression lines (innermost black contours) encompass two regions of high probability (20) that peak at ϕ,ψ ≈ −60°,+145° and −60°,−30°, where the norm of the gradient vectors approaches 0. At −5 bits of compression (outermost, red line), the contour encompasses infrequent/low-probability regions in the valleys of Proline’s ϕ,ψ distribution. Importantly, points/observations with the same compression values can have significantly different gradients (and hence sensitivities to local perturbations), as illustrated by points labeled P1 and P2 in the figure. Locally perturbing P1 in the direction of its gradient improves compression (and statistical odds) significantly more than it does at P2.

Fig. 2.


Proline’s empirical observations (shown as scatter points in light-to-dark shades of blue according to their 1°×1° grid-frequencies), compression contours (multicolored lines) at 1-bit intervals between [−5,+5] bits, and gradient vectors (black arrows) at 5°×5° intervals are shown for (A) ϕ,ψ and (B) χ2,χ3 dihedral angle terms. The ϕ,ψ plot also shows two points P1 and P2 on the −5 bit (outermost, red) compression line. The Insets around these two points highlight the differences in gradients (and hence sensitivities under local perturbation). For the χ2,χ3 plot, the axes are truncated to [−60°,+60°], as there are no empirical observations beyond that range in the reference dataset. Separately, a set of anomalous observations is highlighted (red points); refer to the main text for discussion.

Similarly, Fig. 2B shows the compression contours and gradient vectors for Proline’s side chain dihedral angle pair χ2,χ3. The contour lines again correlate closely with the empirical frequencies of that pair within the observed Proline conformations. This plot additionally overlays a set of 29 anomalous Proline observations, all from the same protein structure (3H8G), 1.5 Å Bestatin complex structure of leucine aminopeptidase from Pseudomonas putida (21). Surprisingly, in all 29 observations, χ2 is nearly 0° (SI Appendix, Table ST4). Examining the PDB validation report of 3H8G, none of these Prolines were earmarked as outliers by MolProbity. (Note, MolProbity ignores χ2 and χ3 dihedral angles in their evaluation for Proline and considers only χ1.) This illustrates how the contour-plots enabled by PhiSiCal-Checkup can be used to identify and examine surprising conformations that deviate from the observed distributions.

Altogether, the above analyses reveal quantitative and qualitative differences in the empirical distributions of dihedral angles across amino acids. These differences highlight the importance of using amino acid–specific distributions to validate dihedral angle conformations. Further, the analyses also reveal the variation of the compression statistic at any fixed percentile threshold across amino-acid types. This throws into question the prevalent use of fixed percentile-thresholds to flag ϕ,ψ and χ1,χ2,… outliers in the current validation protocols (see discussion below).

Comparison with MolProbity.

We compare PhiSiCal-Checkup with MolProbity. SI Appendix, Table ST1 summarizes the key differences between the two systems. Among the differences is the statistical test they rely on to validate amino-acid conformations. Central to MolProbity’s method is the use of fixed percentile rank thresholds that do not change with amino-acid type. Specifically, MolProbity employs a 3-way classification of ϕ,ψ and χ1,χ2,… observations. Each observation is assigned a hard membership to one of the {outlier, allowed, favored} categories based on the percentile rank of the observation’s score (derived from their normalized functions) within a set of scores precomputed for their reference data. MolProbity sets the outlier thresholds at 0.05% (significance-level 0.0005) for ϕ,ψ and 0.3% (significance-level 0.003) for χ1,χ2,… observations. For the allowed category, the percentile-threshold is set at 2% (significance-level 0.02) for both. Consequently, observations that fall above the 2nd percentile as per MolProbity’s scoring are earmarked as favored.

In statistical parlance, MolProbity is relying on a Z-test for a hard 3-way clustering of ϕ,ψ and χ1,χ2,… observations. However, a Z-test is effective only if the test-statistic is normally distributed: only then can the significance-levels of 0.0005, 0.003, 0.02 correspond to Z-score thresholds of ±3.5σ, ±3σ, and ±2.3σ from the respective mean values of scores in a two-tailed test. Importantly, our analysis finds no evidence of normality of the underlying probabilities (which MolProbity’s function-scores are approximating). SI Appendix, Figs. SF1 and SF2 show the distribution of probabilities for the ϕ,ψ and χ1,χ2,… conformations observed in the reference dataset, derived using PhiSiCal-Checkup’s mixture models. This is quantitatively supported by the observed variance in the compression statistic across amino-acid types at the 0.05, 0.3, and 2 percentile levels (SI Appendix, Table ST3). Therefore, PhiSiCal-Checkup avoids using percentile-rank thresholds in its method of validation.
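The correspondence between those significance levels and Z-score thresholds holds only under a normality assumption, and can be reproduced with the standard normal quantile function. A minimal sketch using Python's standard library (the helper name is hypothetical):

```python
from statistics import NormalDist

def two_tailed_z(alpha):
    """Z-score whose two-tailed exceedance probability under a standard normal is alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for alpha in (0.0005, 0.003, 0.02):
    print(f"alpha = {alpha:<7} ->  +/-{two_tailed_z(alpha):.2f} sigma")
```

The three values round to the ±3.5σ, ±3σ, and ±2.3σ thresholds quoted above. If the scores are not normally distributed, the same percentile cut-offs no longer carry this interpretation.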

Instead, PhiSiCal-Checkup relies on the information-theoretic measure of compression to quantify surprisal-odds of any observation. Compression at 0 bits defines an objective threshold below which observations grow exponentially surprising, and above which, exponentially favorable. Thus, PhiSiCal-Checkup uses compression of 0 bits to earmark favored observations. For comparison with MolProbity, another threshold to flag outlier conformations needs to be defined, while noting that such a choice remains fully subjective and cannot be formally defended. A subjective outlier threshold at −4 bits of compression (or 1:16 odds) is thus chosen here as a default setting in PhiSiCal-Checkup. (Note, fractional odds (e.g., 1:16) should not be confused with frequentist percentile-based probabilities. In MolProbity, a score ranked at the 0.05th percentile gives a 1 in 2,000 chance of observing that score in their precomputed set of scores. At the same percentile rank (refer to SI Appendix, Table ST3), PhiSiCal-Checkup, on average, yields −6.4 bits of compression for ϕ,ψ observations, or ∼1:85 surprisal-odds.)

This results in a 3-way categorization for PhiSiCal-Checkup based on compression: outlier (< −4 bits), allowed ([−4, 0) bits), and favored (≥ 0 bits). A key difference compared with MolProbity’s categorization is that PhiSiCal-Checkup permits overlapping membership-assignment based on the observation’s gradient information (discussed below). The gradient accounts for the uncertainty (imprecision) of stated atomic coordinates and alerts users to observations that are close to the boundaries of PhiSiCal-Checkup’s 3-way classification, and whose membership status is sensitive to minor perturbations of observed dihedral angles.
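The thresholds and the idea of overlapping membership can be sketched as follows. This is an assumption about how the rule might be expressed, not the PhiSiCal-Checkup code: the function names are hypothetical, and the overlap is taken to be every category touched by the compression range [bits − minus, bits + plus] obtained from the sensitivity analysis.

```python
CATEGORIES = ["outlier", "allowed", "favored"]

def category(bits, outlier_below=-4.0, favored_from=0.0):
    """Hard 3-way assignment: outlier < -4 bits, allowed in [-4, 0), favored >= 0."""
    if bits < outlier_below:
        return "outlier"
    if bits < favored_from:
        return "allowed"
    return "favored"

def memberships(bits, plus, minus):
    """All categories reachable within the sensitivity range of an observation."""
    low = CATEGORIES.index(category(bits - minus))
    high = CATEGORIES.index(category(bits + plus))
    return CATEGORIES[low:high + 1]

# Arg-368 omega (marginal) from Table 1: -11.5 (+0.6/-0.7) bits
print(memberships(-11.5, 0.6, 0.7))
# A hypothetical observation sitting near the -4 bit boundary
print(memberships(-4.2, 0.5, 0.3))
```

An observation whose range stays inside one band keeps a single label, while one whose range straddles −4 or 0 bits is reported with both adjacent labels rather than being forced into one.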

Table 2 summarizes the agreement/disagreement between the two systems. This is based on 3,624,568 ϕ,ψ (Table 2A) and 3,027,146 χ1,χ2,… (Table 2B) observations derived from 9,419 filtered-PDB structures (23). The corresponding details at the level of individual amino-acid types are available from refs. 24 and 25.

Table 2.

(A and B) 3×3 tables (confusion matrices) displaying the extent of agreement (main diagonal cells) and disagreement (off-diagonal cells) between PhiSiCal-Checkup (compression-based) and MolProbity (percentile-rank based) systems, performing 3-way ({outlier, allowed, and favored}) classification of 3,624,568 ϕ,ψ observations and 3,027,146 χ1,χ2,… observations


Rows and columns represent MolProbity’s and PhiSiCal-Checkup’s respective 3-way assignments. The agreement/disagreement percentages with respect to MolProbity’s membership assignments are shown in parentheses in each cell—row percentages add up to 100. (C and D) Difference matrices (corresponding to the confusion matrices shown above) quantifying the number of observations that change their membership in PhiSiCal-Checkup upon a minor perturbation of each observation in the direction of the observation’s gradient in their probability distributions. Negative numbers indicate the reduction and positive indicate accretion in the corresponding column category for each row (sum of these differences in each row has to add up to 0).

Broadly, ∼97% of all ϕ,ψ observations and ∼95% of all χ1,χ2,… observations are in agreement between the two systems. However, this is dominated by the overrepresentation of observations assigned to the favored category by both systems. This arises because 1) the measure-space of this category (≥2% for MolProbity and ≥0 bits for PhiSiCal-Checkup) is disproportionately larger than that of the other two, and 2) the observations come from verified PDB structures which already embody this representational imbalance, with favored conformations forming the very basis for admission into PDB. With this drastic imbalance, any two systems are likely to display overwhelming agreement. We emphasize here that flagging outliers is a business fully in the tails of the distributions, so it becomes necessary to examine more carefully similarities and differences in the ∼3%/∼5% tails of the ϕ,ψ/χ1,χ2,… observations.

Tables 2 A and B show the raw counts of agreement/disagreement between the two systems in the 3×3=9 possible combinations of assignments between the two systems: MolProbity (rows) and PhiSiCal-Checkup (columns). Each cell also shows the agreement/disagreement-percentage with respect to MolProbity’s classification, with the sum of percentages in each row adding up to 100%. Analyzing the differences (off-diagonal cells), we observe that nearly all of them (99.7% for ϕ,ψ and 98.8% for χ1,χ2,…) arise in cells that are ±1 distance from the main diagonal. This highlights the disagreement arising from observations being assigned to adjacent categories by the two systems: outlier⟷allowed or allowed⟷favored.

Examining the compression statistics for observations that fall in ±1 off-diagonal cells, we observe that a significant proportion of them are close to the compression-based membership-boundaries defined by PhiSiCal-Checkup. This can be qualitatively visualized in the amino acid–specific contour plots for ϕ,ψ shown in ref. 26. More quantitatively, Table 2 C and D demonstrate the effect of perturbation (in the direction of the gradient) on memberships (Materials and Methods). Each cell tracks the difference in the number of observations after perturbation compared to the raw counts shown in Tables 2 A and B. For ϕ,ψ observations, ∼38% of the observations previously assigned to the outlier category by PhiSiCal-Checkup (first column) change membership and move into the allowed category (second column). Next, ∼40% of those previously classified as allowed by PhiSiCal-Checkup (second column) change membership and move into the favored category (third column). A similar trend is observed for χ1,χ2,… observations, with ∼42% moving from outlier to allowed, and ∼44% moving from allowed to favored. This demonstrates the inherent limitation of using hard (non-overlapping) memberships, especially considering the uncertainty implicit in any statement of coordinates. To overcome this, PhiSiCal-Checkup alerts users to conformations with overlapping memberships using its gradient information.

Further, examining the sources of differences at the level of individual amino-acid types, other limitations in the state of the art come to the fore, enumerated below (also refer to refs. 24–26):

  1. For ϕ,ψ observations, MolProbity employs the same “general” model for 16 (non-{Glycine, Valine, Isoleucine, Proline}) amino-acid types. This ignores noticeable variations in the empirical distributions of ϕ,ψ for these 16 amino acids, clearly observable from their contour plots (27). For this reason, across all evaluations, PhiSiCal-Checkup employs amino acid–specific probability distributions which model the empirical variations more accurately.

  2. Specifically for Proline ϕ,ψ observations, MolProbity infers two additional normalized-functions after grouping the Proline data in the reference dataset into two coarse bins, based on their observed ω value: ω ∈ [−30°,+30°] and ω ∈ ([−180°,−150°) ∪ (+150°,+180°)) (9). In contrast, PhiSiCal-Checkup employs formal and continuous ϕ,ψ|ω conditional probability distributions that are more accurate than the discretized (cis-only and trans-only) models of MolProbity.

  3. For χ1,χ2,… observations, no clear pattern emerges, with the differences spreading across all amino-acid types. As a proportion of each amino acid’s number of observations, the differences between PhiSiCal-Checkup and MolProbity vary from ∼2 to 3% on the lower side (Tyrosine, Phenylalanine, Leucine, Isoleucine, and Proline) to ∼7 to 8% on the higher (Glutamic acid, Histidine, Asparagine, Lysine, Serine, Cysteine, Methionine). The remaining amino acids (Aspartic acid, Glutamine, Tryptophan, Valine, Threonine, Arginine) differ between [4,6]%. We observe that PhiSiCal-Checkup’s mixture models fit the empirical distributions accurately (19). On the other hand, as the number of dihedral angles in the side chain increases, MolProbity uses increasingly coarse grid sizes and variable smoothing parameters to fit side chain-specific functions, contributing to the observed differences.
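The coarse ω binning described in item 2 of the list above can be written out to show what a discretized treatment does with intermediate values. This sketch only encodes the two bin ranges quoted there; the function name is hypothetical and it does not reproduce MolProbity's code.

```python
def omega_bin(omega_deg):
    """Assign omega to the coarse 'cis' or 'trans' bin, or None if it falls in neither."""
    w = ((omega_deg + 180.0) % 360.0) - 180.0  # wrap into [-180, 180)
    if -30.0 <= w <= 30.0:
        return "cis"
    if w < -150.0 or w > 150.0:
        return "trans"
    return None  # twisted peptide: outside both bins

for omega in (19.15, 178.0, -165.0, 90.0):
    print(omega, omega_bin(omega))
```

Every ω inside a bin is treated identically and values between the bins have no model at all, whereas a continuous ϕ,ψ|ω conditional distribution varies smoothly with the observed ω.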

Concluding Remarks and Future Direction.

The results presented above demonstrate the advances PhiSiCal-Checkup achieves to enable a comprehensive, consistent, and accurate evaluation of amino acid conformations of experimental protein coordinates. Where previously only ϕ,ψ and (independently) χ1,χ2,… observations could be validated, to varying consistency and accuracy, the current Bayesian method of PhiSiCal-Checkup supports amino acid–specific validation of dihedral angles in any combination of terms (conditional or otherwise) using strictly formal and mutually consistent statistical models, far outstripping the scope of investigations currently possible. Further, the information-theoretic measure of “favorability,” together with the use of mathematical gradients to enable sensitivity analyses, allows experimentalists to quantify reliably the degree of surprise of any amino acid conformation while accounting for the uncertainty implicit in the protein coordinates.

Beyond these features, a significant effort has been directed toward engineering a configurable software that implements PhiSiCal-Checkup for immediate practical use by experimentalists. This is downloadable both as an open-source program written in C++ (for offline use), and as a web-server (for online use): https://lcb.infotech.monash.edu.au/phisical/checkup.

Several extensions to PhiSiCal-Checkup are planned as future work. Extending the current statistical models that analyze and assess individual amino-acid conformations, more generalized amino acid–specific statistical models are being constructed to permit assessment of short oligopeptide (k ≥ 1-mer) conformations and their spatially interacting k-mer ensembles. The most basic in this line of planned extensions comes in the form of statistical models for pairs of interacting amino acids (i.e., pairs of interacting 1-mers). These extended models will permit ways to analyze and validate covalent (e.g., disulfide bridge) and non-covalent (e.g., hydrogen (−H) bonds) interactions that currently remain overlooked despite underpinning the protein 3D structure. Another study earmarked for immediate future work is to analyze the distribution of conformational angles of protein structures not determined by experimental methods but predicted using programs such as AlphaFold 3 (22). These predicted structures are increasingly being used in research, and it would be interesting to compare their distributions with those derived from experimental coordinates.

Materials and Methods

Statistical Mixture Models.

A statistical mixture model describes a probability distribution expressed as a convex combination (i.e., a mixture) of component probability density functions. Formally, a parametric mixture model $M$ composed of a mixture containing $|M|$ component probability density functions is characterized as $M(x)=\sum_{i=1}^{|M|} w_i f_i(x|\Theta_i)$. Here, $x$ denotes any observation, $f_i(x|\Theta_i)$ denotes the $i$-th probability density function in the mixture with parameters $\Theta_i$, and $w_i$ denotes the component weight such that $\sum_{i=1}^{|M|} w_i=1$. In unsupervised inference, all mixture parameters $\{|M|,\{w_i\}_{1\le i\le |M|},\{\Theta_i\}_{1\le i\le |M|}\}$ have to be inferred automatically from the observed data (i.e., from a set of observations of the form $X=\{x_1,x_2,\ldots,x_n\}$).

We recently described an unsupervised method to infer amino acid–specific joint mixture models from any given source collection of protein coordinates (11). The inference method relies on the Bayesian and information-theoretic criterion of Minimum Message Length (MML) (12, 13). Each inferred amino acid–specific mixture model describes a continuous joint probability distribution over its (vector of) dihedral angle terms $(\omega,\phi,\psi,\chi_1,\chi_2,\ldots)$. Each term in the vector is a continuous random variable in the range $(-180^\circ,+180^\circ]$.

For an amino acid type $aa$ with $d$ dihedral angle terms, its corresponding joint mixture model $M_{(aa)}^{(\text{joint})}(\omega,\phi,\psi,\chi_1,\chi_2,\ldots)$ defines a probability distribution on a wrapped multidimensional $d$-Torus ($\mathbb{T}^d$). In our work, each component of $M_{(aa)}^{(\text{joint})}$ is a product of von Mises distributions, one for each dihedral angle term, thus defining a proper continuous probability density function in $\mathbb{T}^d$ space. For a set $X$ of observed vectors of dihedral angles of an amino acid type $aa$, $M_{(aa)}^{(\text{joint})}(\omega,\phi,\psi,\chi_1,\chi_2,\ldots)$ is inferred using the Bayesian criterion of MML as the mixture model that best explains $X$. (Refer to Amarasinghe et al. (11) for details of the inference method.)
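As an illustration of the density described above, the following minimal Python sketch evaluates a mixture whose components are products of von Mises terms. It is not the PhiSiCal-Checkup implementation (which is written in C++); the function names, the series-based Bessel routine, and the two-component parameter values are all hypothetical and chosen only to make the example self-contained.

```python
import math

def bessel_i0(kappa, max_terms=500):
    """Modified Bessel function I0(kappa), summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, max_terms):
        term *= (kappa / (2.0 * k)) ** 2
        total += term
        if term < 1e-17 * total:
            break
    return total

def von_mises_pdf(x, mu, kappa):
    """von Mises density on the circle; x and mu are in radians."""
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_pdf(x, weights, components):
    """Density M(x) = sum_i w_i * prod_j vonMises(x_j | mu_ij, kappa_ij).

    x          -- sequence of d dihedral angles (radians)
    weights    -- component weights w_i, summing to 1
    components -- components[i][j] = (mu_ij, kappa_ij) for dihedral term j
    """
    total = 0.0
    for w, comp in zip(weights, components):
        density = w
        for xj, (mu, kappa) in zip(x, comp):
            density *= von_mises_pdf(xj, mu, kappa)
        total += density
    return total

# Hypothetical two-component model over (phi, psi); values are illustrative only
# and are NOT parameters inferred by PhiSiCal-Checkup.
weights = [0.6, 0.4]
components = [
    [(math.radians(-63.0), 8.0), (math.radians(-43.0), 6.0)],
    [(math.radians(-120.0), 5.0), (math.radians(135.0), 4.0)],
]
density = mixture_pdf([math.radians(-60.0), math.radians(-45.0)], weights, components)
```

Because each factor depends on the angle only through a cosine, the density is automatically periodic, which is what makes it a proper distribution on the torus rather than on a flat interval.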

Deriving Marginal Probability Distributions.

Using the axioms of probability, amino acid–specific marginal probability distributions of any proper subset of dihedral angle terms can be derived from its corresponding joint mixture model.

Formally, let $A=\{\omega,\phi,\psi,\chi_1,\chi_2,\ldots\}$ denote the set of $d$ dihedral angle terms (random variables) for the amino acid type $aa$. Let $B$ define any non-empty, proper subset of terms in $A$. Then, the marginal probability distribution of the subset $B\subset A$ can be derived from the joint probability distribution of $A$ by (contour) integrating (out of the joint distribution) all dihedral angle terms $\{z_1,\ldots,z_m\}$ in $\bar{B}=A\setminus B$. (Note, if $|A|=d$ and $|B|=d'$ with $0<d'<d$, then $m=d-d'$.)

$$M_{(aa)}^{(\text{marg.})}(B\subset A)=\int_{z_i\in\bar{B}} M_{(aa)}^{(\text{joint})}(A)\,dz_1\cdots dz_m$$

From the computational side, we note that the marginal distribution for any subset $B$ of dihedral angle terms is also a mixture model. It defines a continuous probability distribution in the subspace $\mathbb{T}^{d'}\subset\mathbb{T}^{d}$. Because each component of the joint distribution is a product of von Mises distributions (over dihedral angle terms), a spatial projection of the joint mixture model $M_{(aa)}^{(\text{joint})}(A)$ from $\mathbb{T}^{d}$ space to $\mathbb{T}^{d'<d}$ space results in the corresponding $M_{(aa)}^{(\text{marg.})}(B)$. This allows us to efficiently compute any marginal mixture model on-the-fly (in real time), taking tens of microseconds on modern standalone computers (Results).
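The projection argument can be sketched in a few lines of Python. Under the product-of-von-Mises assumption stated above, each integrated-out factor integrates to 1, so marginalization reduces to dropping the corresponding (mean, concentration) pairs while keeping the weights. The helper names and the hypothetical (ω, ϕ, ψ) parameters below are illustrative assumptions, not values or code from PhiSiCal-Checkup.

```python
import math

def bessel_i0(kappa, max_terms=500):
    """Modified Bessel function I0(kappa), summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, max_terms):
        term *= (kappa / (2.0 * k)) ** 2
        total += term
        if term < 1e-17 * total:
            break
    return total

def von_mises_pdf(x, mu, kappa):
    """von Mises density on the circle; x and mu are in radians."""
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_pdf(x, weights, components):
    """Mixture density with product-of-von-Mises components."""
    total = 0.0
    for w, comp in zip(weights, components):
        density = w
        for xj, (mu, kappa) in zip(x, comp):
            density *= von_mises_pdf(xj, mu, kappa)
        total += density
    return total

def marginalize(weights, components, keep):
    """Project the mixture onto the dihedral terms whose indices are in `keep`.

    Every dropped von Mises factor integrates to 1 over the circle, so the
    weights are unchanged and only the retained (mu, kappa) pairs survive.
    """
    return list(weights), [[comp[j] for j in keep] for comp in components]

# Hypothetical joint model over (omega, phi, psi); illustrative values only.
weights = [0.7, 0.3]
components = [
    [(math.radians(180.0), 20.0), (math.radians(-63.0), 8.0), (math.radians(-43.0), 6.0)],
    [(math.radians(0.0), 15.0), (math.radians(-75.0), 5.0), (math.radians(150.0), 4.0)],
]
# Marginal over (phi, psi): omega (index 0) is integrated out.
marg_w, marg_c = marginalize(weights, components, keep=[1, 2])
```

No numerical integration is needed at query time, which is consistent with the microsecond-scale cost reported above; a numerical integral over ω is only useful as a sanity check of the projection.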

Deriving Conditional Probability Distributions.

Amino acid–specific conditional probability distributions of any proper subset of the joint dihedral angle terms, upon observing specific values of another subset of dihedral angle terms, can be derived using Bayes’ theorem.

As introduced above, let $B\subset A$ define any non-empty proper subset of $A$ containing $0<d'<d$ dihedral angle terms. Further, let $C\subseteq\bar{B}\subset A$ denote another subset containing $0<d''\le d-d'$ terms. Assume the terms (i.e., random variables) in $C$ are observed to take specific dihedral angle values $c=\{c_1,c_2,\ldots,c_{d''}\}$. Then, the conditional probability distribution for the subset $B$ after observing some specific values $c$ for the terms (random variables) in $C$ can be derived as:

$$M_{(aa)}^{(\text{cond.})}(B|C=c)=\frac{M_{(aa)}^{(\text{marg.})}(B\cup C=c)}{\Pr(C=c)}.$$

In the above equation, $M_{(aa)}^{(\text{marg.})}(B\cup C=c)$ denotes a mixture model over the dihedral angle terms (random variables) in $B$ after assigning the dihedral angle terms $C=c$ in $M_{(aa)}^{(\text{marg.})}(B\cup C)$. Further, $\Pr(C=c)=M_{(aa)}^{(\text{marg.})}(C=c)$ is the marginal probability of observing the dihedral angle terms (random variables) $C=c$.

As can be seen from the equation above, a conditional mixture model is deduced from the corresponding marginal mixture models, each of which is derived as a projection from the amino acid–specific joint mixture model. Hence, as before, the computation of any conditional mixture model in any combination of terms (i.e., subsets B and C of set A) can be performed highly efficiently on standalone computers in real time (Results).
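A minimal Python sketch of this conditioning step follows, assuming the product-of-von-Mises component form described earlier. Under that assumption, fixing C = c leaves the factors over B untouched and only rescales each component weight by that component's density at c, normalized by Pr(C = c). The function `condition`, its dictionary-based interface, and the (ω, ϕ, ψ) parameter values are hypothetical and serve only as an illustration.

```python
import math

def bessel_i0(kappa, max_terms=500):
    """Modified Bessel function I0(kappa), summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, max_terms):
        term *= (kappa / (2.0 * k)) ** 2
        total += term
        if term < 1e-17 * total:
            break
    return total

def von_mises_pdf(x, mu, kappa):
    """von Mises density on the circle; x and mu are in radians."""
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_pdf(x, weights, components):
    """Mixture density with product-of-von-Mises components."""
    total = 0.0
    for w, comp in zip(weights, components):
        density = w
        for xj, (mu, kappa) in zip(x, comp):
            density *= von_mises_pdf(xj, mu, kappa)
        total += density
    return total

def condition(weights, components, given):
    """Condition a mixture over (B union C) on observed values for C.

    given -- dict mapping the index of each term in C to its observed value
             (radians). Returns (new_weights, new_components, pr_c), where
             pr_c is the marginal density of C = c.
    """
    evidence = []
    for w, comp in zip(weights, components):
        e = w
        for j, c in given.items():
            mu, kappa = comp[j]
            e *= von_mises_pdf(c, mu, kappa)
        evidence.append(e)
    pr_c = sum(evidence)
    new_weights = [e / pr_c for e in evidence]
    new_components = [
        [params for j, params in enumerate(comp) if j not in given]
        for comp in components
    ]
    return new_weights, new_components, pr_c

# Hypothetical model over (omega, phi, psi); illustrative values only.
weights = [0.7, 0.3]
components = [
    [(math.radians(180.0), 20.0), (math.radians(-63.0), 8.0), (math.radians(-43.0), 6.0)],
    [(math.radians(0.0), 15.0), (math.radians(-75.0), 5.0), (math.radians(150.0), 4.0)],
]
# Distribution of (phi, psi) given omega = 0 degrees (a cis-like peptide bond).
cond_w, cond_c, pr_c = condition(weights, components, {0: 0.0})
```

In this toy model, conditioning on ω = 0° shifts almost all of the weight to the component centered at ω = 0°, which illustrates how a continuous ϕ,ψ|ω model can interpolate smoothly where a two-bin (cis-only and trans-only) scheme cannot.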

Measures of Shannon Information and Compression.

From the mathematical theory of communication (16), the Shannon information content of any observation $O$ is given by the relationship $I(O)=-\log_2(\Pr(O))$ bits, where $\Pr(O)$ is the probability of that observation drawn from some source probability distribution. In this work, an observation $O$ involves observing specific values for the dihedral angle terms (in any combination of them). In PhiSiCal-Checkup, the information content in any observation $O$ (denoted here by $I_{\text{mixture}}(O)$) is evaluated by computing its probability $\Pr_{\text{mixture}}(O)$ under its corresponding amino acid–specific joint/marginal/conditional mixture model. The precise mixture model that has to be used to compute $I_{\text{mixture}}(O)$ is fully determined by the amino acid type of the observation and the combination of dihedral terms being evaluated.

Next, the measure of compression for the observation $O$ can be derived by comparing the Shannon information content using the mixture model against its uncompressed raw/null bit content: $\text{Compression}(O)=I_{\text{null}}(O)-I_{\text{mixture}}(O)$ bits. Note, the uncompressed bit content is the same as measuring the Shannon information content of $O$ with the uniform distribution as its source distribution.

Applying the negative-logarithm relationship between Shannon information content and source probabilities, compression is equivalent to computing the log-likelihood ratio (or log-odds) of the observation $O$ arising from each of the two competing distributions as its source (mixture vs. uniform null): $\text{Compression}(O)=\log_2\left(\frac{\Pr_{\text{mixture}}(O)}{\Pr_{\text{null}}(O)}\right)$ bits. Thus, if an observation $O$ results in $+n$ bits of compression (i.e., gain of $n$ bits), the odds are $2^n{:}1$ in favor of $O$ arising from the mixture model. Conversely, if it results in $-n$ bits of compression (i.e., loss of $n$ bits), the odds are $1{:}2^n$ against the observation $O$ arising from the mixture model.
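The arithmetic of compression can be illustrated with a short Python sketch. It assumes that the mixture and the uniform null are evaluated at the same angular precision, so that the precision term cancels in the ratio and densities can stand in for probabilities; the function names and the choice of radians for the null density are assumptions made for this example only.

```python
import math

def information_bits(prob):
    """Shannon information content I(O) = -log2(Pr(O)), in bits."""
    return -math.log2(prob)

def compression_bits(pr_mixture, pr_null):
    """Compression(O) = I_null(O) - I_mixture(O) = log2(Pr_mixture / Pr_null)."""
    return information_bits(pr_null) - information_bits(pr_mixture)

def uniform_null_density(d):
    """Density of the uniform distribution on the d-torus (angles in radians)."""
    return 1.0 / (2.0 * math.pi) ** d

def odds_in_favor(bits):
    """Odds (as a ratio) that O arose from the mixture rather than the null."""
    return 2.0 ** bits

# Hypothetical observation over two dihedral terms whose mixture density is
# eight times the uniform density: a gain of 3 bits, i.e., odds of 8:1.
null = uniform_null_density(2)
gain = compression_bits(8.0 * null, null)
odds = odds_in_favor(gain)
```

A negative result has the mirror-image reading: an observation whose mixture density is a quarter of the uniform density loses 2 bits, corresponding to odds of 1:4 against the mixture model.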

Gradient.

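All joint, marginal, and conditional amino acid–specific mixture models are continuous distributions that are differentiable at every point in their respective toroidal support. For example, the gradient vector for $M_{(aa)}^{(\text{joint})}(A)$ at any point $a=\{a_1,a_2,\ldots,a_d\}\in\mathbb{T}^d$ is denoted by

$$\nabla M_{(aa)}^{(\text{joint})}(A=a)=\left\langle \frac{\partial M_{(aa)}^{(\text{joint})}(A=a)}{\partial a_1},\frac{\partial M_{(aa)}^{(\text{joint})}(A=a)}{\partial a_2},\ldots,\frac{\partial M_{(aa)}^{(\text{joint})}(A=a)}{\partial a_d}\right\rangle$$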

For the mixture models involving the product of von Mises distributions, the gradient has a closed mathematical form that can be computed on the fly. We note that the gradients for marginal and conditional mixture models yield similar expressions and characteristics since they also manifest as mixture models (except they are represented in the reduced dimensions $d'$ of the subset $B\subset A$).

Thus, for any (multidimensional) point $p$, its gradient vector ($v=\nabla M(p)$) gives the magnitude and direction of steepest ascent at $p$ in its corresponding amino acid–specific probability distribution (joint/marginal/conditional mixture model $M$). Using this gradient, $p$ can be perturbed to a near-neighboring point $\tilde{p}=p\pm\lambda\hat{v}$, where $\hat{v}$ gives the direction cosines of the gradient vector $v$. In practice, PhiSiCal-Checkup chooses $\lambda=5$ for any $p$ (for dihedral terms stated in degrees). Because $\hat{v}$ is a unit vector, this constrains the norm of the projection of $\tilde{p}-p$ in any (dihedral angle) dimension to at most 5°.
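The perturbation step can be sketched in Python as follows, assuming product-of-von-Mises components so that the partial derivative of each component with respect to angle j is the component density times −κ·sin(x_j − μ_j). The function names, the wrapping helper, and the two-component (ϕ, ψ) parameters are hypothetical; this is an illustration of the idea rather than the PhiSiCal-Checkup code.

```python
import math

def bessel_i0(kappa, max_terms=500):
    """Modified Bessel function I0(kappa), summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, max_terms):
        term *= (kappa / (2.0 * k)) ** 2
        total += term
        if term < 1e-17 * total:
            break
    return total

def von_mises_pdf(x, mu, kappa):
    """von Mises density on the circle; x and mu are in radians."""
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_pdf(x, weights, components):
    """Mixture density with product-of-von-Mises components."""
    total = 0.0
    for w, comp in zip(weights, components):
        density = w
        for xj, (mu, kappa) in zip(x, comp):
            density *= von_mises_pdf(xj, mu, kappa)
        total += density
    return total

def mixture_gradient(x, weights, components):
    """Closed-form gradient of the mixture density at x (radians)."""
    grad = [0.0] * len(x)
    for w, comp in zip(weights, components):
        density = w
        for xj, (mu, kappa) in zip(x, comp):
            density *= von_mises_pdf(xj, mu, kappa)
        for j, (xj, (mu, kappa)) in enumerate(zip(x, comp)):
            grad[j] += density * (-kappa * math.sin(xj - mu))
    return grad

def wrap_degrees(angle):
    """Wrap an angle in degrees into the range (-180, +180]."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped

def perturb(p_deg, weights, components, lam=5.0, sign=+1):
    """Move p (degrees) by lam degrees along (+1) or against (-1) the gradient."""
    x = [math.radians(a) for a in p_deg]
    v = mixture_gradient(x, weights, components)
    norm = math.sqrt(sum(g * g for g in v))
    if norm == 0.0:  # flat point: no preferred direction
        return list(p_deg)
    return [wrap_degrees(a + sign * lam * g / norm) for a, g in zip(p_deg, v)]

# Hypothetical (phi, psi) model and observation; illustrative values only.
weights = [0.6, 0.4]
components = [
    [(math.radians(-63.0), 8.0), (math.radians(-43.0), 6.0)],
    [(math.radians(-120.0), 5.0), (math.radians(135.0), 4.0)],
]
p = [-70.0, -30.0]
p_up = perturb(p, weights, components, sign=+1)
p_down = perturb(p, weights, components, sign=-1)
```

Evaluating the favorability at `p`, `p_up`, and `p_down` gives a simple sensitivity check: a large difference between the three values signals that the verdict on a conformation depends strongly on small coordinate errors.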

Difference between PhiSiCal-Checkup and PhiSiCal.

The inference of joint mixture models to support validation of amino acid conformations in PhiSiCal-Checkup is derived using the previously described methodology, PhiSiCal (11). Although PhiSiCal-Checkup builds on the inference-methodology of PhiSiCal, realizing a comprehensive framework specifically for dihedral angle validation required several novel additions and extensions.

The new set of joint mixture models inferred for PhiSiCal-Checkup includes the modeling of the ω dihedral angle, along with all other (main chain and side chain) dihedral angles for each amino acid type; the previous work, PhiSiCal, ignored ω in its joint probability distributions. Further, the mixture models of PhiSiCal were inferred on the PDB50 and PDB50HighRes structural datasets, which lacked residue-level filtering. Instead, PhiSiCal-Checkup uses the “Top2018” dataset curated by Williams et al. (4). This dataset comes with residue-level filtering to ensure only the “best parts” of high-quality protein residues are considered with “good electron density support for a physically acceptable model conformation” (4). Furthermore, the factorization of joint mixture models to derive marginal and conditional mixture models, along with the use of compression and gradient information (all described above) to support accurate validation of protein coordinates, are unique to PhiSiCal-Checkup.

Supplementary Material

Appendix 01 (PDF)

pnas.2416301121.sapp.pdf (535.2KB, pdf)

Acknowledgments

We thank Monash eResearch Centre and eServices for special job allocations on Monash high-performance computing clusters that facilitated this work.

Author contributions

A.S.K. designed research; P.R.A. and A.S.K. performed research; P.R.A., L.A., A.M.L., and A.S.K. contributed new reagents/analytic tools; P.R.A., L.A., C.J.M., P.J.S., M.G.d.l.B., A.M.L., and A.S.K. analyzed data; C.J.M. introduced the validation problem; and P.R.A., A.M.L., and A.S.K. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

Data, Materials, and Software Availability

Supplementary Figures and Tables are included in SI Appendix. Supplementary Data has been deposited in FigShare (15, 19, 23, 24, 25, 26, 27). Software for online and offline use is available from https://lcb.infotech.monash.edu.au/phisical/checkup. All other data are included in the manuscript and/or SI Appendix.

References

  • 1. Berman H., Henrick K., Nakamura H., Announcing the worldwide Protein Data Bank. Nat. Struct. Mol. Biol. 10, 980 (2003).
  • 2. Waman V. P., Orengo C., Kleywegt G. J., Lesk A. M., Three-dimensional structure databases of biological macromolecules. Methods Mol. Biol. 2449, 43–91 (2022).
  • 3. Gore S., et al., Validation of structures in the Protein Data Bank. Structure 25, 1916–1927 (2017).
  • 4. Williams C. J., Richardson D. C., Richardson J. S., The importance of residue-level filtering and the top2018 best-parts dataset of high-quality protein residues. Protein Sci. 31, 290–300 (2022).
  • 5. Laskowski R. A., MacArthur M. W., Moss D. S., Thornton J. M., PROCHECK: A program to check the stereochemical quality of protein structures. J. Appl. Crystallogr. 26, 283–291 (1993).
  • 6. Hooft R. W., Vriend G., Sander C., Abola E. E., Errors in protein structures. Nature 381, 272 (1996).
  • 7. Jones T. A., Zou J. Y., Cowan S. W., Kjeldgaard M., Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. Sect. A: Found. Crystallogr. 47, 110–119 (1991).
  • 8. Lovell S. C., et al., Structure validation by Cα geometry: ϕ, ψ and Cβ deviation. Proteins: Struct., Funct., Bioinf. 50, 437–450 (2003).
  • 9. Read R. J., et al., A new generation of crystallographic validation tools for the Protein Data Bank. Structure 19, 1395–1412 (2011).
  • 10. Hintze B. J., Lewis S. M., Richardson J. S., Richardson D. C., Molprobity’s ultimate rotamer-library distributions for model validation. Proteins: Struct., Funct., Bioinf. 84, 1177–1189 (2016).
  • 11. Amarasinghe P. R., et al., Getting ‘ϕψχal’ with proteins: Minimum message length inference of joint distributions of backbone and sidechain dihedral angles. Bioinformatics 39, i357–i367 (2023).
  • 12. Wallace C. S., Statistical and Inductive Inference by Minimum Message Length (Springer, 2005).
  • 13. Allison L., Coding Ockham’s Razor (Springer, 2018).
  • 14. Bayes T., An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR. Philos. Trans. R. Soc. London 53, 370–418 (1763).
  • 15. Amarasinghe P. R., et al., Supplementary Data 1. FigShare. https://figshare.com/s/0b349f998f795c45b109. Deposited 21 October 2024.
  • 16. Shannon C. E., A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).
  • 17. Vandermarliere E., et al., Structural analysis of a glycoside hydrolase family 43 arabinoxylan arabinofuranohydrolase in complex with xylotetraose reveals a different binding mechanism compared with other members of the same family. Biochem. J. 418, 39–47 (2009).
  • 18. Carugo O., Argos P., Correlation between side chain mobility and conformation in protein structures. Protein Eng. 10, 777–787 (1997).
  • 19. Amarasinghe P. R., et al., Supplementary Data 2. FigShare. https://figshare.com/s/1ee14426c6d49c7cbd2a. Deposited 21 October 2024.
  • 20. Ganguly H. K., Basu G., Conformational landscape of substituted prolines. Biophys. Rev. 12, 25–39 (2020).
  • 21. Kale A., Pijning T., Sonke T., Dijkstra B. W., Thunnissen A. M. W., Crystal structure of the leucine aminopeptidase from Pseudomonas putida reveals the molecular basis for its enantioselectivity and broad substrate specificity. J. Mol. Biol. 398, 703–714 (2010).
  • 22. Abramson J., et al., Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
  • 23. Amarasinghe P. R., et al., Supplementary Data 3. FigShare. https://figshare.com/s/bda9c52606eb07780686. Deposited 21 October 2024.
  • 24. Amarasinghe P. R., et al., Supplementary Data 4. FigShare. https://figshare.com/s/d65b9522eff486a5a5d4. Deposited 21 October 2024.
  • 25. Amarasinghe P. R., et al., Supplementary Data 5. FigShare. https://figshare.com/s/b7bd8664f226dbb3d005. Deposited 21 October 2024.
  • 26. Amarasinghe P. R., et al., Supplementary Data 6. FigShare. https://figshare.com/s/5e3ee64c43d69b21c88f. Deposited 21 October 2024.
  • 27. Amarasinghe P. R., et al., Supplementary Data 7. FigShare. https://figshare.com/s/aebee230df911e48cb34. Deposited 21 October 2024.


Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
