Abstract
High-throughput structure determination based on solution Nuclear Magnetic Resonance (NMR) spectroscopy plays an important role in structural genomics. One of the main bottlenecks in NMR structure determination is the interpretation of NMR data to obtain a sufficient number of accurate distance restraints by assigning nuclear Overhauser effect (NOE) spectral peaks to pairs of protons. The difficulty in automated NOE assignment mainly lies in the ambiguities arising both from the resonance degeneracy of chemical shifts and from the uncertainty due to experimental errors in NOE peak positions. In this paper we present a novel NOE assignment algorithm, called HAusdorff-based NOE Assignment (HANA), that starts with a high-resolution protein backbone computed using only two residual dipolar couplings (RDCs) per residue37, 39, employs a Hausdorff-based pattern matching technique to deduce similarity between experimental and back-computed NOE spectra for each rotamer from a statistically diverse library, and drives the selection of optimal position-specific rotamers for filtering ambiguous NOE assignments. Our algorithm runs in time $O(tn^3 + tn \log t)$, where $t$ is the maximum number of rotamers per residue and $n$ is the size of the protein. Application of our algorithm to biological NMR data for three proteins, namely, human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol η) and the human Set2-Rpb1 interacting domain (hSRI), demonstrates that our algorithm overcomes spectral noise to achieve more than 90% assignment accuracy. Additionally, the final structures calculated using our automated NOE assignments have backbone RMSD < 1.7 Å and all-heavy-atom RMSD < 2.5 Å from reference structures that were determined either by X-ray crystallography or traditional NMR approaches. These results show that our NOE assignment algorithm can be successfully applied to protein NMR spectra to obtain high-quality structures.
1. INTRODUCTION
High-throughput structure determination based on X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy is a key step towards the era of structural genomics. Unfortunately, structure determination by either approach is generally time-consuming. In X-ray crystallography, growing a good quality crystal is in general a difficult task, while in NMR structure determination, the bottleneck lies in the processing and analysis of NMR data, and in interpreting a sufficient number of accurate distance restraints from experimental Nuclear Overhauser Enhancement Spectroscopy (NOESY) spectra, which exploit the dipolar interaction of nuclear spins, called nuclear Overhauser effect (NOE), for through-space correlation of protons. The intensity (or volume) of an NOE peak in a NOESY spectrum is converted into a distance restraint by calibrating the intensity (or volume) vs. distance curve or classifying all NOESY peaks into different bins.12, 16, 38 Traditional NMR structure determination approaches use NOE distance restraints as the main source of information to compute the structure of a protein, a problem known to be strongly NP-hard,30 essentially due to the local nature of the restraints. Rigorous approaches to solve this problem using NOE data, such as the distance geometry method,10 require exponential time in the worst case (see discussion in Ref. 39). While substantial progress has been made to design practical algorithms for structure determination,3, 12–14, 24, 28, 31 most algorithms still rely on heuristic techniques such as molecular dynamics (MD) and simulated annealing (SA), which use NOE data plus other NMR data to compute a protein structure. The NOE distances used by these distance-based structure determination protocols must be obtained by assigning NOE data, i.e., for every NOE, we must determine the associated pair of interacting protons in the primary sequence. This is called the NOE assignment problem.
While much progress has been made in automated NOE assignment,12, 14, 16, 21, 24, 27, 28 most NOE assignment algorithms have a SA/MD-based or a distance geometry-based structure determination protocol sitting in a tight inner loop, which is invoked many times to filter ambiguous assignments. Since distance geometry methods have exponential worst-case time complexity, and SA/MD-based structure determination protocols lack combinatorial precision and have no guarantees on solution quality or running time, these NOE assignment algorithms suffer from the same drawbacks, in addition to the inherent difficulties in the interpretation of NOESY spectra. Therefore, it is natural to ask if there exists a provably polynomial-time algorithm for the NOE assignment problem, which can guarantee solution quality—this will pave new ways for better understanding and interpretation of experimental data, and for developing robust protocols with both theoretical guarantees and good practical performance.
In Ref. 39, a new linear time algorithm was developed, based on Refs. 37 and 36, to determine protein backbone structure accurately using a minimum amount of residual dipolar coupling (RDC) data. RDCs provide global orientational restraints on internuclear vectors, for example, backbone NH and CH bond vectors with respect to a global frame of reference. The algorithm in Refs. 37, 36, and 39 computes the backbone conformation by solving, in closed form, systems of low-degree polynomial equations formulated using the RDC restraints. The algorithm is combinatorially-precise and employs a systematic search strategy to compute the backbone structure in polynomial time. The accurately-computed backbone conformations enable us to propose a new strategy for NOE assignment. In Ref. 38, for example, an NOE assignment algorithm was proposed to filter ambiguous NOE assignments based on an ensemble of distance intervals computed using intra-residue vectors mined from a rotamer database, and inter-residue vectors from the backbone structure determined from Refs. 37, 36, and 39. The algorithm in Ref. 38 uses a triangle-like inequality between the intra-residue and inter-residue vectors to prune incorrect assignment for side-chain NOEs. However, the algorithm in Ref. 38 has the following deficiencies: (a) it does not exploit the diversity of the rotamers in the library, (b) uncertainty in NOE peak position, and other inherent difficulties in interpreting NOESY spectra suggest a probabilistic model with provable properties which Ref. 38 does not capture, and (c) it does not exploit rotamer pattern structure in NOESY spectra.
To address the shortcomings in Ref. 38 and other previous work, our algorithm, HAusdorff-based NOE Assignment (HANA), uses a novel pattern-directed framework for NOE assignment, that combines a combinatorially-precise, algebraic geometry-based approach for computing high-resolution protein backbones from residual dipolar coupling data, with a framework that uses a statistically diverse library of rotamers and the Hausdorff distance to measure similarity between experimental and back-computed NOE spectra, and drives the selection of optimal position-specific rotamers to prune ambiguous NOE assignments. Our Hausdorff-based framework views the NOE assignment problem as a pattern-recognition problem, where the objective is to establish a match by choosing the correct rotamers between the experimental NOESY spectrum and the back-computed NOE pattern. By explicitly modeling the uncertainty in NOE peak positions and the probability of mismatches between NOE patterns, we provide a rigorous means of analyzing and evaluating the algorithmic benefits and the quality of assignments.
We first compute a high-resolution protein backbone from RDC data using the algorithms in Refs. 37, 36, and 39. Using this backbone structure, an assigned resonance list, and a library of rotamers25, the NOE pattern for each rotamer can be back-computed (Figure 1B). By measuring the match of the back-computed NOE patterns with experimental NOESY spectrum, we choose an ensemble of top rotamers according to the match scores for each residue. Then, we construct an initial low-resolution protein structure by combining the high-resolution backbone and the chosen approximate rotamers together. The low-resolution structure is then used to filter ambiguous NOE assignments. Finally, our NOE assignments are fed to a structure calculation program, e.g., XPLOR/CNS 3 which outputs the final ensemble of structures. The experimental results, based on our NMR data for three proteins, viz., human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol η) and the human Set2-Rpb1 interacting domain (hSRI) show that HANA achieves an assignment accuracy of more than 90%. In summary, our main contributions in this paper are:
Development of a novel framework that combines a combinatorially-precise, algebraic geometry-based linear time algorithm for high-resolution backbone structure determination with the Hausdorff distance measure, and exploits the statistical diversity of a rotamer library to infer accurate NOE assignments for both backbone and side-chain NOEs from 2D and 3D NOESY spectra.
Introduction of Hausdorff distance-based pattern matching technique to measure the similarity between experimental NOE spectra and back-computed NOE spectra, and modeling uncertainties arising both from false random matches and from experimental deviations in NOE peak positions.
A fully-automated $O(tn^3 + tn \log t)$ time NOE assignment algorithm, where $t$ is the maximum number of rotamers in a residue and $n$ is the number of residues in the protein.
Derivation of provable properties, viz. soundness in rotamer selection.
Application of our algorithm on three real biological NMR data sets to demonstrate high assignment accuracy (> 90%), and fast running times (< 2 minutes).
2. PRELIMINARIES AND PROBLEM DEFINITION
In NMR spectra, each proton or atom is identified by its chemical shift (or resonance), which is obtained by mapping atom names in the known primary sequence of the protein to the corresponding frequencies from triple-resonance or other NMR spectra; this process is referred to as resonance assignment. Substantial progress has been made in designing efficient algorithms1, 20, 22, 26 for automatic resonance assignment. Given the chemical shift of each proton, the NOE assignment problem in two dimensions is to assign each NOESY peak to a pair of protons that are correlated through a dipole-dipole NOE interaction.
Formally, let $\{a_1, \ldots, a_q\}$ denote the set of proton names (e.g., Hα of Arg56), where $q = \Theta(n)$ is the total number of protons and $n$ is the number of residues in a protein. Let $\omega(a_i)$ denote the chemical shift for proton $a_i$ determined from resonance assignment, $1 \le i \le q$. An NOE peak (a.k.a. cross-peak) with respective frequencies $x$ and $y$ for a pair of protons is denoted by the point $(x, y)$ on the plane of the NOESY spectrum. Given a set of known chemical shifts $L = \{\omega(a_1), \ldots, \omega(a_q)\}$ for all protons $\{a_1, \ldots, a_q\}$ and a list of NOESY peaks (i.e., a set of points on the plane of the NOESY spectrum), the NOE assignment problem is to map each NOE cross-peak $(x, y)$ to an interacting proton pair $(a_i, a_j)$ such that $\|\omega(a_i) - x\| \le \delta_x$ and $\|\omega(a_j) - y\| \le \delta_y$, where $\delta_x$ and $\delta_y$ encode the uncertainty in the peak position due to experimental errors.
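The tolerance-window condition in this definition can be illustrated with a minimal Python sketch. The proton names, chemical shift values, and the function name `candidate_assignments` are hypothetical and chosen only for illustration; excluding self-pairs is also an assumption of the sketch, not a statement about HANA.

```python
from itertools import product

def candidate_assignments(peak, shifts, delta_x, delta_y):
    """List every proton pair (a_i, a_j) whose shifts fall inside the
    tolerance window around a 2D NOESY peak (x, y)."""
    x, y = peak
    # Protons whose shift matches the first dimension within delta_x.
    first = [name for name, w in shifts.items() if abs(w - x) <= delta_x]
    # Protons whose shift matches the second dimension within delta_y.
    second = [name for name, w in shifts.items() if abs(w - y) <= delta_y]
    # Assumption of this sketch: a proton is not paired with itself.
    return [(a, b) for a, b in product(first, second) if a != b]

# Hypothetical resonance list (ppm); the near-identical shifts mimic
# chemical shift degeneracy.
shifts = {"HA_R56": 4.32, "HB2_L8": 1.71, "HA_K27": 4.30, "HG_L43": 1.70}
peak = (4.31, 1.705)
pairs = candidate_assignments(peak, shifts, delta_x=0.02, delta_y=0.02)
```

With these made-up values the single peak admits four candidate pairs, which is the kind of ambiguity the rest of the algorithm is meant to resolve.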
In a hypothetical ideal case without any experimental error and noise, this would be an easy problem. However, for most proteins, two pairs of interacting protons can produce overlapping NOE peaks in a NOESY spectrum. The chemical shift differences of different protons are often too small to resolve experimentally, a phenomenon often referred to as chemical shift degeneracy. Also, due to experimental noise, artifact NOE peaks might occur from either manual or automated peak picking. These factors lead to more than one possible NOE assignment for a peak in a 2D NOESY spectrum; such assignments are called ambiguous NOE assignments.12, 21 Hence, one or more additional dimensions are generally introduced to relieve the congestion of NOE peaks. In a 3D NMR experiment, for example, each NOE peak is labeled with chemical shifts of a triple of atoms, viz., dipole-dipole interacting protons plus the heavy atom nucleus such as 15N or 13C bonded to the second proton. Even for 3D spectra, the interpretation and assignment of NOESY cross-peaks remains hard, and obtaining a unique NOE assignment poses a difficult computational challenge. Manual assignment of NOESY peaks takes months of time on average, requires significant expertise, and is prone to human errors. In structure determination, even a few incorrect NOE assignments can result in incorrect structures.5 Hence, it is critical to develop highly efficient and fully automated NOE assignment algorithms to aid high-throughput NMR structure determination.
3. PREVIOUS WORK
Protein structure determination using NOE distance restraints is strongly NP-hard,30 essentially due to the sparsity of the experimental data and the local nature of the constraints. Rigorous approaches to solve this problem using distance intervals from NOE data, such as the distance geometry method,10 require exponential time in the worst case, while heuristic approaches such as SA/MD, although they provide practical ways of solving this problem, lack combinatorial precision and have no guarantees on running time or solution quality. Previous approaches for NOE assignment12, 14, 16, 21, 24, 27, 28 follow an iterative strategy, in which an initial set of relatively unambiguous NOEs is used to generate an ensemble of structures, which are then used to filter ambiguous and inconsistent NOE assignments. This iterative assignment process is repeated until no further improvements in NOE assignments or structures can be obtained. Such approaches lose guarantees on the running time and assignment accuracy because of their tight coupling with a heuristic structure determination protocol, which sits in a tight inner loop of the assignment algorithm.
NOAH,27, 12 for example, uses the structure determination package DYANA,14 and follows the previously mentioned iterative strategy starting with an initial set of NOE assignments with supposedly one or two possible assignments. ARIA 28, 24 and CANDID14 improved on NOAH by incorporating better modeling of ambiguous distance constraints. In AUTO-STRUCTURE16 more experimental data such as dihedral angle restraints from TALOS 8 and slow H-D exchange data are used to improve assignment accuracy. In PASD 21 several strategies were proposed to reduce the chance of invoking the structure calculation into a biased path due to the incorrect initial global fold. Since all these iterative NOE assignment programs invoke SA/MD-based structure determination protocols such as XPLOR/CNS3, they may converge to a local, but not a global minimum to obtain a best-fit of the data; therefore, the NOE assignments might not be correct.
An alternative approach for automated NOE assignment proposed by Wang and Donald in Ref. 38, based on Refs. 37, 36, and 39, uses a rotamer ensemble and residual dipolar couplings, and is the first polynomial-time algorithm for automated NOE assignment. However, Ref. 38 does not exploit the pattern structure of NOESY spectrum to model the uncertainty in peak positions probabilistically using a library of rotamers; therefore, assignment accuracy is reduced while processing NOESY spectra with many noisy peaks.
Our algorithm HANA retains the paradigm of Ref. 38, and develops a novel framework using the algebraic geometry-based linear time algorithm developed in Ref. 39 to compute high-resolution protein backbones from residual dipolar couplings, and then uses this backbone and a library of rotamers to do NOE assignments. Viewing the NOE assignment problem as a pattern-recognition problem, our algorithm uses an extended Hausdorff distance-based probabilistic framework to model the uncertainties in NOE peak positions and the probability of mismatches between NOE patterns. In contrast to previous heuristic algorithms12, 14, 16, 21, 24, 27, 28 for NOE assignment, HANA has the advantages of being combinatorially precise with a running time of $O(tn^3 + tn \log t)$, where $t$ is the maximum number of rotamers per residue and $n$ is the size of the protein, and runs extremely fast in practice to compute high quality NOE assignments (> 90% assignment accuracy).
4. NOE ASSIGNMENT BASED ON ROTAMER PATTERNS
4.1. Overview of our approach
Our goal is to assign pairs of proton names to cross-peaks in NOESY data. Figure 1 illustrates the basic idea of our algorithm. The NOE assignment process can be divided into three phases, viz. initial NOE assignment (phase 1), rotamer selection (phase 2), and filtration of ambiguous NOE assignments (phase 3). In the initial NOE assignment (phase 1), every pair of protons is assigned to a NOESY cross peak as an ambiguous NOE if the resonances of the corresponding atoms fall within a tolerance window around the peak. In the rotamer selection phase, we first compute the backbone structure from RDCs (see Section 4.2), and then place all the rotamers at each residue onto the backbone and compute all expected NOEs within the upper-bound limit of NOE distance (Figure 1A). Based on the set of all expected NOEs and the resonance assignment list, we back-compute the expected NOE peak pattern for each rotamer (Figure 1B). By matching the back-computed NOE pattern with the experimental NOESY spectrum using an extended model of the Hausdorff distance,17, 19 we measure how well a rotamer fits the real side-chain conformation when interpreted in terms of the NOESY data. We then select the top k rotamers with the highest fitness scores at each residue, and obtain a “low-resolution” structure by combining the high-resolution backbone structure and the approximate ensemble of side-chain conformations at each residue. The low-resolution structure is then used (in phase 3) to filter ambiguous NOE assignments. The details of filtering ambiguous NOE assignments using the low-resolution structure are provided in Supplementary Material (SM) Section 4 available online in Ref. 40.
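The per-residue selection of the top k rotamers in phase 2 can be sketched as follows. This is a minimal illustration, assuming fitness scores have already been computed; the residue labels, rotamer identifiers, scores, and function names are hypothetical.

```python
import heapq

def select_top_rotamers(scores, k):
    """Return the k rotamer ids with the highest fitness scores,
    ordered from best to worst. `scores` maps rotamer id -> score."""
    return heapq.nlargest(k, scores, key=scores.get)

def build_rotamer_ensemble(scores_by_residue, k):
    """Apply the top-k selection independently at every residue."""
    return {res: select_top_rotamers(s, k)
            for res, s in scores_by_residue.items()}

# Hypothetical fitness scores for two residues.
scores_by_residue = {
    "L8":  {"r1": 0.42, "r2": 0.87, "r3": 0.65, "r4": 0.10},
    "K27": {"r1": 0.91, "r2": 0.15, "r3": 0.33},
}
ensemble = build_rotamer_ensemble(scores_by_residue, k=2)
```

Keeping an ensemble rather than a single best rotamer reflects the text's notion of an approximate, low-resolution side-chain model.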
4.2. Protein backbone structure determination from residual dipolar couplings
Residual dipolar coupling33, 34 data provide global orientational restraints on the internuclear bond vectors, such as backbone NH and CH bond vectors, with respect to a global coordinate frame. In solution NMR, RDCs can be recorded with high precision, and assigned much faster than NOEs. In Refs. 39 and 37, the authors proposed the first polynomial-time de novo algorithm, which we henceforth refer to as RDC-EXACT, to compute high-resolution protein backbone structures from RDC data. RDC-EXACT takes as input (a) two RDCs per residue (e.g., assigned NH RDCs in two media or NH and CH RDCs in a single medium), (b) delimited α-helices and β-sheets with known hydrogen bond information between paired strands, and (c) a few unambiguous NOEs (used to pack the helices and strands). Note that this sparse set of NOEs used by RDC-EXACT can usually be assigned using chemical shift information alone37, 39 without requiring any sophisticated NOE assignment algorithm. Our algorithm HANA uses the high-resolution backbones computed by RDC-EXACT. Loops with missing RDCs are computed using an enhanced version of the robotics-based cyclic coordinate descent (CCD) algorithm.4, 32 The details of RDC-EXACT and modeling of loops (in case of missing RDCs) are provided in SM40 Section 1.
4.3. NOE pattern matching based on the Hausdorff distance measure
Given two finite sets of points $B = \{b_1, \ldots, b_m\}$ and $Y = \{y_1, \ldots, y_n\}$ in Euclidean space, the Hausdorff distance between $B$ and $Y$ is defined as $H(B, Y) = \max\{h(B, Y), h(Y, B)\}$, where $h(B, Y) = \max_{b \in B} \min_{y \in Y} \|b - y\|$, and $\|b - y\|$ measures the normed distance (e.g., L2-norm) between points $b$ and $y$. Intuitively, the Hausdorff distance $H(B, Y)$ finds the point in one set that is farthest from any point in the other set, and thus measures the degree of mismatch between the two point sets $B$ and $Y$. The Hausdorff distance has been widely used in image processing and computer vision problems, such as visual correspondence,17 pattern recognition,19 and shape matching.18 Unlike many other pattern-recognition algorithms, Hausdorff-based algorithms are combinatorially precise, and provide a robust method for measuring the similarity between two point sets or image patterns18, 19 in the presence of noise and positional uncertainties.
In the NOE assignment problem, let $B$ denote a back-computed NOE pattern, i.e., the set of back-computed NOE peaks, and let $Y$ denote the set of experimental NOESY peaks. Generally, the size of a back-computed NOE pattern is much smaller than the total number of experimental NOESY peaks. Therefore, we only consider the directed Hausdorff distance from $B$ to $Y$, namely, $h(B, Y) = \max_{b \in B} \min_{y \in Y} \|b - y\|$. We apply an extended model of Hausdorff distance18, 19, 17 to measure the match between the back-computed NOE pattern and experimental NOESY spectrum. Below, we assume 3D NOESY spectra without loss of generality.
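The directed and symmetric Hausdorff distances defined above can be sketched in a few lines of Python. This is a brute-force illustration on toy 2D points, not the authors' implementation; the function names and the example point sets are hypothetical.

```python
import math

def directed_hausdorff(B, Y):
    """h(B, Y): the largest distance from a point of B to its
    nearest neighbor in Y (L2 norm)."""
    return max(min(math.dist(b, y) for y in Y) for b in B)

def hausdorff(B, Y):
    """H(B, Y) = max{h(B, Y), h(Y, B)}."""
    return max(directed_hausdorff(B, Y), directed_hausdorff(Y, B))

# Toy example: B is a small pattern fully contained in the larger set Y.
B = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
h_BY = directed_hausdorff(B, Y)   # every point of B has an exact match in Y
H_BY = hausdorff(B, Y)            # dominated by the unmatched point (5, 0) of Y
```

In this toy case the directed distance is zero while the symmetric distance is large, which shows why only the direction from the small back-computed pattern to the much larger experimental peak set is informative.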
Given the back-computed NOE pattern $B$ with $m$ peaks, and the set of NOESY peaks $Y$ with $w$ peaks, the $\tau$-th Hausdorff distance from $B$ to $Y$ is defined as
$$h_\tau(B, Y) = \tau^{\mathrm{th}}_{b \in B} \min_{y \in Y} \|b - y\|,$$
where $\tau^{\mathrm{th}}$ is the $\tau$-th largest of $m$ values. We call $f = \tau/m$ the similarity score between the back-computed NOE pattern $B$ and the experimental peak set $Y$, after fixing the Hausdorff distance $h_\tau(B, Y) = \delta$, which is the error tolerance in the NOESY spectra. The similarity score for a rotamer given $\delta$ can be computed using a scheme similar to Ref. 17:
$$s = \frac{|B \cap Y_\delta|}{|B|}, \qquad (1)$$
where Yδ denotes the union of all balls obtained by replacing each point in Y with a ball of radius δ, B ∩Yδ denotes the intersection of sets B and Yδ, and |·| denotes the size of a set.
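The similarity score described here, i.e., the fraction of back-computed peaks that lie within distance δ of at least one experimental peak, can be sketched as below. This is a brute-force illustration that loops over all pairs of back-computed and experimental peaks. The peak coordinates are made up, and using an unscaled Euclidean distance across all three spectral dimensions is an assumption of the sketch.

```python
import math

def similarity_score(B, Y, delta):
    """Fraction of points in B that fall inside Y_delta, the union of
    balls of radius delta centered on the points of Y."""
    if not B:
        raise ValueError("back-computed pattern B must not be empty")
    matched = sum(
        1 for b in B
        if min(math.dist(b, y) for y in Y) <= delta
    )
    return matched / len(B)

# Hypothetical 3D peaks (arbitrary units).
B = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)]
Y = [(0.0, 0.0, 0.05), (1.0, 1.0, 1.2), (9.0, 9.0, 9.0)]
score_tight = similarity_score(B, Y, delta=0.1)
score_loose = similarity_score(B, Y, delta=0.25)
```

Widening the tolerance from 0.1 to 0.25 lets a second back-computed peak match, so the score rises from one third to two thirds.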
We incorporate two types of uncertainty in the calculation of the similarity score in Equation (1) for the match between the back-computed NOE pattern and experimental NOESY spectrum: (a) possibility of a false random match17 in the NOESY spectra; (b) uncertainty of NOE peak positions due to experimental noise.
(a) Possibility of a false random match 17
A false random match between the back-computed NOE pattern and the experimental NOESY spectrum is defined as a match when hτ (B, Y ) ≤ δ occurs at random. We calculate the probability of a false random match and use it as a weighting factor for the similarity score in Equation (1). Let p be the probability for a back-computed NOE peak to randomly match to an experimental peak in Yδ. Let θ be the probability of a false random match, which can be estimated using the following asymptotic approximation from Ref. 17:
where , and Φ(·) is the Gauss error function.
(b) Uncertainty from the NOE peak positions
Let bi = (ω (a1), ω (a2), ω (a3)) denote the back-computed NOE peak for an NOE (a1, a2, a3) in a 3D NOESY spectrum. The likelihood for a back-computed peak bi = (ω (a1), ω (a2), ω (a3)) in the NOE pattern B to match an experimental NOESY peak within the distance δ in Yδ can be defined as
where (p1, p2, p3) is the experimental NOESY peak matched to (ω (a1), ω (a2), ω (a3)) according to the Hausdorff distance measure, and (|x − μ|, σ) is the probability of observing the difference |x − μ| in a normal distribution with mean μ and standard deviation σ. Here we assume that the noise distribution of peak positions at each dimension is independent of each other. We note that the normal distribution and other similar distribution families have been widely and efficiently used to approximate the noise in the NMR data, e.g., see Refs. 29 and 22.
Then the expected number of peaks in B∩Yδ can be bounded by . Thus, we have the following equation for the similarity score:
(2) |
After considering both possibility from a false random match and uncertainty from the NOE peak positions, we obtain the following fitness score for a rotamer
(3) |
For each rotamer, its similarity score s′ can be computed in O(mw) time, where m is the number of back-computed NOE peaks, and w is the total number of cross peaks in the experimental NOESY spectrum. The detailed pseudocode for computing the similarity score and for HANA is provided in SM Sections 3–4 available in Ref. 40.
5. ANALYSIS
5.1. Analysis of rotamer selection based on NOE patterns
Given a back-computed NOE peak bi = (ωi1, ωi2, ωi3) in the NOE pattern of a rotamer, suppose that it finds a matched experimental peak in Yδ with probability g(ωi1, ωi2, ωi3, Yδ). Finding such a matched experimental NOESY peak for bi can be regarded as a Poisson trial with success probability g(ωi1, ωi2, ωi3, Yδ). We present the following result about the expected number of matched peaks for the back-computed NOE pattern of a rotamer.
Lemma 5.1
Let $X_i$ be an indicator random variable which is equal to 1 if the back-computed NOE peak $b_i$ of a rotamer $r$ finds a matched experimental peak, and 0 otherwise. Let $X = \sum_{i=1}^{m} X_i$, where $m$ is the total number of back-computed NOE cross-peaks for the rotamer $r$. Then the expected number of back-computed NOE peaks that find matched experimental peaks is given by
$$E(X) = \sum_{i=1}^{m} g(\omega_{i1}, \omega_{i2}, \omega_{i3}, Y_\delta).$$
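The statement of Lemma 5.1 follows from linearity of expectation applied to the indicator variables. A short derivation is sketched below, where $g_i$ is shorthand introduced here (not notation from the paper) for the success probability $g(\omega_{i1}, \omega_{i2}, \omega_{i3}, Y_\delta)$ of the $i$-th Poisson trial:

```latex
\begin{align*}
E(X) &= E\Big(\sum_{i=1}^{m} X_i\Big)
      = \sum_{i=1}^{m} E(X_i) \\
     &= \sum_{i=1}^{m} \big(1 \cdot \Pr[X_i = 1] + 0 \cdot \Pr[X_i = 0]\big)
      = \sum_{i=1}^{m} g_i .
\end{align*}
```

No independence between the trials is needed for this step, since linearity of expectation holds for arbitrary random variables.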
Let $r_t$ denote the rotamer closest to the real side-chain conformation for a residue, and let $r_f$ denote another rotamer in the library for the same residue. We call $r_t$ the true rotamer, and $r_f$ the false rotamer. Let $X_i$ and $Y_i$ be indicator random variables as defined in Lemma 5.1 for each back-computed NOE peak in the true rotamer $r_t$ and the false rotamer $r_f$ respectively. Let $m_t$ and $m_f$ denote the numbers of back-computed NOE peaks for the true rotamer $r_t$ and the false rotamer $r_f$. Let $X = \sum_{i=1}^{m_t} X_i$ and $Y = \sum_{i=1}^{m_f} Y_i$ denote the number of back-computed NOE peaks that find matched experimental peaks for rotamers $r_t$ and $r_f$ respectively. Let $\mu_t = E(X)$ and $\mu_f = E(Y)$ denote the expectations of $X$ and $Y$. For simplicity of our theoretical analysis, we use Equation (1) to measure the fitness between the back-computed NOE pattern of a rotamer and the experimental spectrum in our theoretical model.
To measure the accuracy of the rotamer chosen based on our scoring function, we calculate the probability that the algorithm chooses the wrong rotamer $r_f$ rather than the true rotamer $r_t$, and show how it is bounded by a certain threshold. The following theorem formally states this result. The proof of this theorem can be found in SM40 Section 5.
Theorem 5.1
Suppose that . Then with probability at least , our algorithm chooses the true rotamer rt rather than the false rotamer rf.
Theorem 5.1 indicates that if the difference between the expected numbers of matched NOE peaks for two rotamers is larger than a certain threshold, we are able to distinguish these two rotamers based on the Hausdorff distance measure with a certain probability bound. By Theorem 5.1, we have the following result on the bound of the probability of picking the correct rotamer from the library based on the Hausdorff distance measure, if we select the top k rotamers with the highest similarity scores.
Theorem 5.2
Let t denote the maximum number of rotamers for a residue. Suppose that and mt > t − k hold for the true rotamer rt and every false rotamer rf. Then with probability at least , our algorithm chooses the correct rotamer.
Proof
Since the total number of rotamers in a residue is $t$, by Theorem 5.1 the probability that the similarity score of the true rotamer is larger than that of at least $t - k$ rotamers is at least . According to the fact that $(1 + x)^a \ge 1 + ax$ for $x > -1$ and $a \ge 1$, we have . Thus, the probability for the algorithm to choose the right rotamer is at least .
Theorem 5.2 shows that if the discrepancy of the expected number of matched NOE peaks between the true rotamer and every other rotamer, and the number of back-computed NOE peaks are sufficiently large, the ensemble of top k rotamers with highest similarity scores will contain the true rotamer.
5.2. Time complexity analysis
The following theorem states that HANA runs in polynomial time.
Theorem 5.3
HANA runs in $O(tn^3 + tn \log t)$ time, where $t$ is the maximum number of rotamers at a residue and $n$ is the total number of residues in the protein sequence.
The detailed derivation of the time complexity can be found in SM40 Section 6. We note that in practice, our NOE assignment algorithm HANA runs in 1–2 minutes on a 3 GHz single-processor Linux workstation.
6. RESULTS
HANA takes as input (a) the protein sequence, (b) a 3D NOESY-HSQC or 2D NOESY peak list, (c) an assigned resonance list, (d) the backbone computed by the RDC-EXACT algorithm37, 39 (Section 4.2), and (e) the Xtalview rotamer library.25 HANA was tested on experimental NMR data for human ubiquitin,35, 9 the zinc finger domain of the human DNA Y-polymerase Eta (pol η)2 and the human Set2-Rpb1 interacting domain (hSRI).23 The high-resolution structures of these three proteins have been solved either by X-ray crystallography35 or by traditional NMR approaches using both distance restraints from NOE data and orientational restraints from scalar and dipolar couplings.9, 2, 23 We used these solved structures, which are also in the Protein Data Bank (PDB), as the reference structures to compare and check the quality of NMR structures determined from our NOE assignment tables. The NMR data for hSRI and pol η were recorded using Varian 600 and 800 MHz spectrometers at Duke University. Ubiquitin NMR data were obtained from Ref. 15 and from the PDB (ID: 1D3Z).
6.1. Robustness of Hausdorff distance and NOE assignment accuracy
To check the robustness of the Hausdorff distance measure for NOE pattern matching, we first computed a low-resolution structure of ubiquitin by combining the backbone determined from RDC-EXACT,37, 36, 39 and rotamers selected based on the Hausdorff distance measure using patterns for backbone-sidechain NOEs. This low-resolution NMR structure is not the final structure, but is used to filter ambiguous NOE assignments (including backbone-backbone, backbone-sidechain and sidechain-sidechain NOE assignments). Our result shows that the low-resolution structure of ubiquitin obtained from our algorithm has a backbone RMSD of 1.58 Å and an all-heavy-atom RMSD of 2.85 Å from the corresponding X-ray structure (PDB ID: 1UBQ). Using this low-resolution structure, HANA was able to resolve the NOE assignment ambiguity caused by chemical shift degeneracy, and prune a sufficient number of ambiguous NOE assignments, as we will discuss next.
To measure the assignment accuracy of HANA, we define a compatible NOE assignment as one in which the distance between the assigned pair of NOE protons in the reference structure is within the NOE distance bound of 6.0 Å. Otherwise, we call it an incompatible NOE assignment. The number of compatible NOE assignments can be larger than the total number of NOESY peaks, since multiple compatible NOEs can be assigned to a single NOESY cross peak. The assignment accuracy is then defined as the fraction of compatible assignments in the final assignment table output by HANA.
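The accuracy measure defined above can be written as a short Python sketch. It assumes the proton-proton distances of all assigned pairs have already been measured in the reference structure; the function name and the example distances are hypothetical.

```python
def assignment_accuracy(distances, bound=6.0):
    """Fraction of assignments whose proton-proton distance in the
    reference structure (in Angstrom) is within the NOE bound."""
    if not distances:
        raise ValueError("assignment table must not be empty")
    compatible = sum(1 for d in distances if d <= bound)
    return compatible / len(distances)

# Hypothetical distances for four assigned proton pairs.
example_distances = [2.5, 4.8, 6.0, 7.3]
accuracy = assignment_accuracy(example_distances)
```

Here three of the four assignments are compatible, giving an accuracy of 0.75; the single pair at 7.3 Å counts as incompatible.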
As summarized in Table 1, our NOE assignment algorithm achieved above 90% assignment accuracy for all three proteins. We note that the fraction of assigned peaks of hSRI is less than the other two proteins. This is because we only used backbones in the secondary structure regions (residues 15–34, 51–72, 82–97) for pruning ambiguous NOE assignments for hSRI. Presently we are developing new algorithms to solve long loops. We believe that with more accurate loop backbone structures, we will be able to improve the accuracy of our NOE assignment algorithm, while assigning more NOE peaks. We note that the ubiquitin 13C NOESY data from Ref. 15 are quite degenerate, thus we carefully picked a subset of NOESY peaks for assigning NOEs. Presently we are re-collecting a completely new set of ubiquitin NMR data including four-dimensional NOESY spectra for further testing of our algorithm.
Table 1.
Proteins | # of residues | # of NOESY peaks§ | # of compatible assignments† | # of incompatible assignments† | Assignment accuracy |
---|---|---|---|---|---|
ubiquitin* | 76 | 1580 | 901 | 93 | 90.6% |
pol η** | 39 | 1386 | 590 | 65 | 90.1% |
hSRI*** | 112 | 5916 | 1429 | 119 | 92.3% |
\* The ubiquitin backbone calculated from the RDC data using RDC-EXACT has RMSD 1.58 Å from the X-ray reference structure (PDB ID: 1UBQ) (residues 2–71).
\*\* The pol η backbone calculated from the RDC data using RDC-EXACT has RMSD 1.28 Å for the secondary structure regions and RMSD 2.71 Å for both secondary structure and loop regions (residues 8–36) from the NMR reference structure (PDB ID: 2I5O).
\*\*\* The hSRI backbone calculated from the RDC data using RDC-EXACT has RMSD 1.62 Å from the NMR reference structure (PDB ID: 2A7O) for the secondary structure regions (residues 15–34, 51–72, 82–97).
§ The NOESY peak list contains diagonal and symmetric cross peaks.
† Redundant symmetric NOE restraints have been removed from the final NOE assignment table.
Because long-range NOEs, in which the interacting protons are at least four residues apart, play an important role in structure determination, we also checked the fraction of incompatible long-range NOE assignments produced by our algorithm. We found that fewer than 3% of the total assignments in our computed tables were incompatible long-range NOEs. As we will discuss next, such a small fraction of incompatible long-range NOE assignments can be easily resolved after one iteration of structure calculation.
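The long-range check can be expressed as follows. This is a sketch with a hypothetical record layout (residue indices of the two protons plus a precomputed compatibility flag); the separation threshold of four residues is taken from the text, and the fraction is computed over all assignments, as in the sentence above.

```python
def is_long_range(res_i, res_j, min_separation=4):
    """True if the two residues are at least `min_separation` apart in sequence."""
    return abs(res_i - res_j) >= min_separation


def incompatible_long_range_fraction(assignments):
    """Fraction of ALL assignments that are both long-range and incompatible.

    `assignments` is a list of (res_i, res_j, compatible) tuples (hypothetical layout).
    """
    if not assignments:
        raise ValueError("assignment table is empty")
    n_bad = sum(
        1 for res_i, res_j, compatible in assignments
        if is_long_range(res_i, res_j) and not compatible
    )
    return n_bad / len(assignments)


# Illustrative records only, not data from the paper.
records = [
    (1, 2, True),    # sequential, compatible
    (1, 5, False),   # long-range, incompatible
    (3, 10, True),   # long-range, compatible
    (4, 6, False),   # medium-range, incompatible (not counted)
]
fraction = incompatible_long_range_fraction(records)
```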
6.2. Evaluation of structures from our NOE assignment tables
To test the quality of our NOE assignment results for structure determination, we fed the NOE assignment tables into the standard structure calculation program XPLOR.3 The input files for the structure calculation include the protein sequence, the NOE assignment table, and dihedral restraints. In contrast to Refs. 2 and 23, in which RDCs are incorporated along with NOE restraints into the final structure calculation, here we only used RDCs to compute the initial backbone fold. From an algorithmic point of view, our structure determination using only NOEs can be considered a good “control” test of the quality of our NOE assignment. The structure calculation was performed in two rounds. After the first round of structure calculation, NOE restraints with violations larger than 0.5 Å among the 10 lowest-energy structures were removed from the NOE assignment table. The refined NOE table was then fed into XPLOR for the second round of structure calculation.
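The refinement step between the two rounds can be sketched as a simple filter. This is not XPLOR code and not the authors' script: the data layout and names are hypothetical, and the text does not say in how many of the 10 lowest-energy structures a restraint must be violated before it is removed, so that count is exposed as the assumed parameter `min_violating` (default 1).

```python
def refine_noe_table(restraints, structures, top_n=10, threshold=0.5, min_violating=1):
    """Drop restraints violated by more than `threshold` Å in the best structures.

    restraints: dict mapping restraint id -> upper distance bound (Å)
    structures: list of (energy, {restraint id: measured distance in Å})
    min_violating: assumed criterion; a restraint is removed if it is violated
        in at least this many of the `top_n` lowest-energy structures.
    """
    best = sorted(structures, key=lambda s: s[0])[:top_n]
    kept = {}
    for rid, upper in restraints.items():
        n_violations = sum(1 for _, dist in best if dist[rid] - upper > threshold)
        if n_violations < min_violating:
            kept[rid] = upper
    return kept


# Illustrative first-round output: (energy, measured distances per restraint).
restraints = {"r1": 6.0, "r2": 6.0}
first_round = [
    (-100.0, {"r1": 5.0, "r2": 6.8}),
    (-90.0, {"r1": 6.4, "r2": 6.2}),
    (50.0, {"r1": 9.0, "r2": 5.0}),
]
refined = refine_noe_table(restraints, first_round, top_n=2)
```

In this toy example `r2` exceeds its bound by 0.8 Å in the lowest-energy structure and is dropped, while `r1` exceeds its bound by at most 0.4 Å among the two best structures and is kept for the second round.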
Figure 2 illustrates the final NMR structures of ubiquitin, pol η and hSRI calculated by XPLOR using our NOE restraint tables. For all three proteins, only a small number of NOE violations larger than 0.5 Å (18–60, i.e., 1–4% of the total number of NOE assignments) occurred after the first round of structure calculation. All final structures converged to an ensemble of low-energy structures with small RMSDs from the reference structure solved either by X-ray crystallography or by traditional NMR approaches. For all three test cases, the mean structure of the final 10 lowest-energy structures had a backbone RMSD less than 1.7 Å and an all-heavy-atom RMSD less than 2.5 Å from the reference structure. This implies that our NOE assignment algorithm has provided a sufficient number of accurate distance restraints for protein structure determination. In particular, we examined the structure quality in secondary structure and loop regions. We found that the secondary structure regions have better RMSD from the reference structure than the loop regions. In the final structures calculated by XPLOR using the NOE assignment tables output by HANA, the RMSD of secondary structure regions in pol η is 0.81 Å for backbone atoms and 1.74 Å for all heavy atoms, and the RMSD of secondary structure regions in ubiquitin is 0.93 Å for backbone atoms and 1.59 Å for all heavy atoms. These results show that the initial fold of secondary structure regions solved using the RDC-EXACT algorithm is accurate enough to combine with rotamers chosen from NOE patterns to resolve the NOE assignment ambiguities. In addition, we found that the short loop regions of the final structures can achieve about the same RMSD from the reference structure as the secondary structure regions. This indicates that the CCD algorithm with filtering of loops based on RDC fit can provide accurate short loops for our NOE assignment algorithm.
Our structure calculation protocol only requires one iteration, while other traditional NMR approaches in general take 7–10 iterations between NOE assignment and structure calculation. In addition, our NOE assignment algorithm only takes 1–2 minutes, versus hours to weeks for other methods. This efficiency is consistent with the proofs of correctness and time complexity of our algorithm. Therefore, the structure calculation framework based on our NOE assignment algorithm is more efficient than all other previous approaches in both theory and practice.
7. CONCLUSION
We have described a novel automated NOE assignment algorithm, HANA, that is combinatorially precise and runs in polynomial time. To our knowledge, HANA is the first NOE assignment algorithm that simultaneously exploits the accurate algebraic geometry-based high-resolution backbone computation from RDC data,37, 39 the statistical diversity of rotamers from a rotamer library,25 and the robust Hausdorff measure17, 19 for comparing the back-computed NOE patterns with the experimental NOE spectra and choosing accurate rotamers, to finally compute the NOE assignments with high accuracy. Owing to its simplicity, HANA runs extremely fast in practice. Furthermore, when applied to real biological NMR spectra for three proteins, our algorithm yields high assignment accuracy (> 90%) in each case, suggesting its ability to play a role in high-throughput structure determination.
Although our current implementation of HANA uses 2D and 3D NOESY spectra, HANA is general and can be easily extended to use higher-dimensional (e.g., 4D) NOESY data.6, 7 In addition, it would be interesting to extend the current version of HANA for NOE assignment with missing resonances. In general, acquisition of complete resonance assignment can require selective labeling of proteins, and is time-consuming. On the other hand, selection of correct rotamers can help the resonance assignment for side-chains. In principle, HANA can be extended to accommodate the NOE assignment with a partially assigned resonance list, as long as the back-computed NOE patterns with missing peaks are sufficient to identify accurate rotamers. Finally, it would be interesting to explore the use of side-chain rotamer packing algorithms11 to choose rotamers that fit the data.
Supplementary Material
Acknowledgments
We thank Dr. L. Wang, Mr. A. Yan, Dr. S. Apaydin, Mr. J. Boyles, Prof. J. Richardson, Prof. D. Richardson, and all members of the Donald and Zhou Labs for helpful discussions and comments. We are grateful to Ms. M. Bomar for helping us with pol η NMR data.
Footnotes
Abbreviations used: NMR, Nuclear Magnetic Resonance; ppm, parts per million; RMSD, root mean square deviation; NOESY, Nuclear Overhauser Enhancement SpectroscopY; HSQC, Heteronuclear Single Quantum Coherence spectroscopy; NOE, Nuclear Overhauser Effect; RDC, Residual Dipolar Coupling; PDB, Protein Data Bank; pol η, zinc finger domain of the human DNA Y-polymerase Eta; hSRI, human Set2-Rpb1 interacting domain; POF, Principal Order Frame; CCD, Cyclic Coordinate Descent; SA, Simulated Annealing; MD, Molecular Dynamics; □, Q.E.D.; SM, Supplementary Material.
The problem for 3D and 4D cases can be defined in an analogous manner. Here the 2D case is explained for clarity. Our NOE assignment algorithm has been tested on both 2D and 3D spectra, and extends easily to handle 4D NOESY spectra.
We will use terms proton name and proton interchangeably in this paper.
The “low resolution” structure generally has approximately 2.0–3.0 Å (all heavy atom) RMSD from the reference structures solved by X-ray or traditional NMR approaches.
Contributor Information
Jianyang (Michael) Zeng, Department of Computer Science, Duke University, Durham, NC 27708, USA.
Chittaranjan Tripathy, Department of Computer Science, Duke University, Durham, NC 27708, USA.
Pei Zhou, Department of Biochemistry, Duke University Medical Center, Durham, NC 27708, USA.
Bruce R. Donald, Department of Computer Science, Duke University, Department of Biochemistry, Duke University Medical Center, Durham, NC 27708, USA.
References
- 1. Bailey-Kellogg C, Chainraj S, Pandurangan G. A random graph approach to nmr sequential assignment. Journal of Computational Biology. 2005;12(6):569–583. doi: 10.1089/cmb.2005.12.569.
- 2. Bomar MG, Pai M, Tzeng S, Li S, Zhou P. Structure of the ubiquitin-binding zinc finger domain of human DNA Y-polymerase η. EMBO reports. 2007;8:247–251. doi: 10.1038/sj.embor.7400901.
- 3. Brünger AT. X-PLOR, Version 3.1: a system for X-ray crystallography and NMR. Journal of the American Chemical Society. 1992.
- 4. Canutescu AA, Dunbrack RL., Jr Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Science. 2003;12:963–972. doi: 10.1110/ps.0242703.
- 5. Clore GM, Omichinski JG, Sakaguchi K, Zambrano N, Sakamoto H, Appella E, Gronenborn AM. Interhelical angles in the solution structure of the oligomerization domain of the tumour suppressor p53. Science. 1995;267:1515–1516. doi: 10.1126/science.7878474.
- 6. Coggins BE, Venters RA, Zhou P. Filtered Back-projection for the Reconstruction of a High-Resolution (4,2)D CH3-NH NOESY Spectrum on a 29 kDa Protein. J Am Chem Soc. 2005;127:11562–11563. doi: 10.1021/ja053110k.
- 7. Coggins BE, Zhou P. PR-CALC: A Program for the Reconstruction of NMR Spectra from Projections. J Biomol NMR. 2006;34:179–95. doi: 10.1007/s10858-006-0020-z.
- 8. Cornilescu G, Delaglio F, Bax A. Protein backbone angle restraints from searching a database for chemical shift and sequence homology. Journal of Biomolecular NMR. 1999;13:289–302. doi: 10.1023/a:1008392405740.
- 9. Cornilescu G, Marquardt JL, Ottiger M, Bax A. Validation of Protein Structure from Anisotropic Carbonyl Chemical Shifts in a Dilute Liquid Crystalline Phase. Journal of the American Chemical Society. 1998;120:6836–6837.
- 10. Crippen GM, Havel TF. Distance Geometry and Molecular Conformations. Wiley; New York: 1988. pp. 635–642.
- 11. Georgiev I, Lilien RH, Donald BR. The minimized dead-end elimination criterion and its application to protein redesign in a hybrid scoring and search algorithm for computing partition functions over molecular ensembles. Journal of Computational Chemistry. 2008 Feb 21; doi: 10.1002/jcc.20909. [Epub ahead of print] PMID: 1829329.
- 12. Güntert P. Automated NMR Protein Structure Determination. Progress in Nuclear Magnetic Resonance Spectroscopy. 2003;43:105–125.
- 13. Güntert P. Automated NMR protein structure calculation with CYANA. Meth Mol Biol. 2004;278:353–378. doi: 10.1385/1-59259-809-9:353.
- 14. Herrmann T, Güntert P, Wüthrich K. Protein NMR Structure Determination with Automated NOE Assignment Using the New Software CANDID and the Torsion Angle Dynamics Algorithm DYANA. Journal of Molecular Biology. 2002;319(1):209–227. doi: 10.1016/s0022-2836(02)00241-3.
- 15. Harris R. The ubiquitin NMR resource page, BBSRC Bloomsbury Center for Structural Biology. 2007 Jun; http://www.biochem.ucl.ac.uk/bsm/nmr/ubq/
- 16. Huang YJ, Tejero R, Powers R, Montelione GT. A topology-constrained distance network algorithm for protein structure determination from NOESY data. Proteins: Structure Function and Bioinformatics. 2006;62(3):587–603. doi: 10.1002/prot.20820.
- 17. Huttenlocher DP, Jaquith EW. Computing visual correspondence: Incorporating the probability of a false match. Proceedings of the Fifth International Conference on Computer Vision (ICCV 95); 1995. pp. 515–522.
- 18. Huttenlocher DP, Kedem K. Distance Metrics for Comparing Shapes in the Plane. In: Donald BR, Kapur D, Mundy J, editors. Symbolic and Numerical Computation for Artificial Intelligence. Academic Press; 1992. pp. 201–219.
- 19. Huttenlocher DP, Klanderman GA, Rucklidge W. Comparing Images Using the Hausdorff Distance. IEEE Trans Pattern Anal Mach Intell. 1993;15(9):850–863.
- 20. Kamisetty H, Bailey-Kellogg C, Pandurangan G. An efficient randomized algorithm for contact-based nmr backbone resonance assignment. Bioinformatics. 2006;22(2):172–180. doi: 10.1093/bioinformatics/bti786.
- 21. Kuszewski J, Schwieters CD, Garrett DS, Byrd RA, Tjandra N, Clore GM. Completely automated, highly error-tolerant macromolecular structure determination from multidimensional nuclear overhauser enhancement spectra and chemical shift assignments. J Am Chem Soc. 2004;126(20):6258–6273. doi: 10.1021/ja049786h.
- 22. Langmead CJ, Yan AK, Lilien RH, Wang L, Donald BR. A polynomial-time nuclear vector replacement algorithm for automated nmr resonance assignments; Proceedings of the seventh annual international conference on Research in computational molecular biology; 2003. pp. 176–187.
- 23. Li M, Phatnani HP, Guan Z, Sage H, Greenleaf AL, Zhou P. Solution structure of the Set2-Rpb1 interacting domain of human Set2 and its interaction with the hyperphosphorylated C-terminal domain of Rpb1. Proceedings of the National Academy of Sciences. 2005;102:17636–17641. doi: 10.1073/pnas.0506350102.
- 24. Linge JP, Habeck M, Rieping W, Nilges M. ARIA: Automated NOE assignment and NMR structure calculation. Bioinformatics. 2003;19(2):315–316. doi: 10.1093/bioinformatics/19.2.315.
- 25. Lovell SC, Word JM, Richardson JS, Richardson DC. The Penultimate Rotamer Library. Proteins: Structure Function and Genetics. 2000;40:389–408.
- 26. Montelione GT, Moseley HNB. Automated analysis of NMR assignments and structures for proteins. Curr Opin Struct Biol. 1999;9:635–642. doi: 10.1016/s0959-440x(99)00019-6.
- 27. Mumenthaler C, Güntert P, Braun W, Wüthrich K. Automated combined assignment of NOESY spectra and three-dimensional protein structure determination. J Biomol NMR. 1997;10(4):351–362. doi: 10.1023/a:1018383106236.
- 28. Nilges M, Macias MJ, O’Donoghue SI, Oschkinat H. Automated NOESY interpretation with ambiguous distance restraints: the refined NMR solution structure of the pleckstrin homology domain from β-spectrin. Journal of Molecular Biology. 1997;269(3):408–422. doi: 10.1006/jmbi.1997.1044.
- 29. Rieping W, Habeck M, Nilges M. Inferential Structure Determination. Science. 2005;309:303–306. doi: 10.1126/science.1110428.
- 30. Saxe JB. Embeddability of weighted graphs in k-space is strongly NP-hard. Proc. 17th Allerton Conf. Commun. Control Comput; 1979. pp. 480–489.
- 31. Schwieters CD, Kuszewski JJ, Tjandra N, Clore GM. The Xplor-NIH NMR molecular structure determination package. J Magn Reson. 2003;160:65–73. doi: 10.1016/s1090-7807(02)00014-9.
- 32. Shehu A, Clementi C, Kavraki LE. Modeling protein conformational ensembles: from missing loops to equilibrium fluctuations. Proteins: Structure, Function, and Bioinformatics. 2006;65(1):164–79. doi: 10.1002/prot.21060.
- 33. Tjandra N, Bax A. Direct measurement of distances and angles in biomolecules by NMR in a dilute liquid crystalline medium. Science. 1997;278:1111–1114. doi: 10.1126/science.278.5340.1111.
- 34. Tolman JR, Flanagan JM, Kennedy MA, Prestegard JH. Nuclear magnetic dipole interactions in field-oriented proteins: Information for structure determination in solution. Proc Natl Acad Sci USA. 1995;92:9279–9283. doi: 10.1073/pnas.92.20.9279.
- 35. Vijay-Kumar S, Bugg CE, Cook WJ. Structure of ubiquitin refined at 1.8 Å resolution. Journal of Molecular Biology. 1987;194:531–44. doi: 10.1016/0022-2836(87)90679-6.
- 36. Wang L, Donald BR. Analysis of a Systematic Search-Based Algorithm for Determining Protein Backbone Structure from a Minimal Number of Residual Dipolar Couplings. Proceedings of The IEEE Computational Systems Bioinformatics Conference (CSB); Stanford CA. August, 2004; 2004.
- 37. Wang L, Donald BR. Exact solutions for internuclear vectors and backbone dihedral angles from NH residual dipolar couplings in two media, and their application in a systematic search algorithm for determining protein backbone structure. Jour Biomolecular NMR. 2004;29(3):223–242. doi: 10.1023/B:JNMR.0000032552.69386.ea.
- 38. Wang L, Donald BR. An Efficient and Accurate Algorithm for Assigning Nuclear Overhauser Effect Restraints Using a Rotamer Library Ensemble and Residual Dipolar Couplings. The IEEE Computational Systems Bioinformatics Conference (CSB); Stanford CA. August, 2005; 2005. pp. 189–202.
- 39. Wang L, Mettu R, Donald BR. A Polynomial-Time Algorithm for De Novo Protein Backbone Structure Determination from NMR Data. Journal of Computational Biology. 2006;13(7):1267–1288. doi: 10.1089/cmb.2006.13.1267.
- 40. Zeng J, Tripathy C, Zhou P, Donald BR. Department of Computer Science, Duke University; May, 2008. A Hausdorff-Based NOE Assignment Algorithm Using Protein Backbone Determined from Residual Dipolar Couplings and Rotamer Patterns – Supplementary Material. [online]. Available: http://www.cs.duke.edu/donaldlab/Supplementary/csb08/