Abstract
As methods for analysis of biomolecular structure and dynamics using nuclear magnetic resonance spectroscopy (NMR) continue to advance, the resulting 3D structures, chemical shifts, and other NMR data are broadly impacting biology, chemistry, and medicine. Structure model assessment is a critical area of NMR methods development, and is an essential component of the process of making these structures accessible and useful to the wider scientific community. For these reasons, the Worldwide Protein Data Bank (wwPDB) has convened an NMR Validation Task Force (NMR-VTF) to work with the wwPDB partners in developing metrics and policies for biomolecular NMR data harvesting, structure representation, and structure quality assessment. This paper summarizes the recommendations of the NMR-VTF, and lays the groundwork for future work in developing standards and metrics for biomolecular NMR structure quality assessment.
Introduction
The Worldwide Protein Data Bank (wwPDB) (Berman et al., 2003) has convened several task forces to recommend metrics, standards and software for macromolecular structure quality assessment. These include task forces providing recommendations for validating biomacromolecular structures determined by X-ray crystallography (Read et al., 2011), cryo-electron microscopy (cryoEM) (Henderson et al., 2012), NMR, and small angle scattering (SAXS/SANS) (Trewhella et al., 2013). The deliberations of these task forces are also important for defining critical research areas in the field of biomacromolecular structure analysis, and for guiding efforts of researchers developing their own structure validation platforms. Here we present the initial recommendations of the NMR Validation Task Force (NMR-VTF). These recommendations supplement those published by an earlier commission addressing related problems of NMR structure representation and interpretation (Markley et al., 1998).
NMR is now routinely used for determining 3D structures of small (< 20 kDa) proteins to high accuracy, often using largely automated methods (Mao et al., 2011; Rosato et al., 2012; Serrano et al., 2012). In favorable cases, structures of proteins as large as 50 kDa or larger can be determined with good accuracy (Lange et al., 2012; Raman et al., 2010). It is critical to the further development of the field to establish metrics and standards for assessing the reliability and accuracy of NMR-derived structures. A wide variety of different data types and methods are used by different groups to generate NMR structures. For this reason, it is important to define unifying and standardized approaches for defining the precision and accuracy of NMR-derived biomolecular structures. This information is essential for appropriate use of these structures in biological research.
Several software packages have been developed that integrate various tools for NMR structure quality assessment (Bhattacharya et al., 2007; Doreleijers et al., 2012; Laskowski et al., 1996; Nabuurs et al., 2003; Spronk et al., 2003; Vriend, 1990; Tejero et al., 2013; Vuister et al., 2013). These analyses can provide useful information to the experimentalists for improving the interpretation of the NMR data (e.g., allowing more accurate identification of NOESY peaks), and to users of the data by indicating those parts of the structure that can be used to address specific questions about structure and function. Although such software suites are very useful, the field has not yet adopted uniformly accepted metrics and standards for NMR structure quality assessment.
The NMR-VTF recommends that certain well-developed methods and software packages can form a basis for a standardized platform for protein NMR structure assessment. In particular, the NMR-VTF recommends that existing tools developed for validation of X-ray diffraction-derived structures of biomacromolecules and their complexes (Read et al., 2011; Gore et al., 2012) are also appropriate for NMR structures. NMR-specific software tools and metrics that are already broadly adopted by the NMR community can be used together with these knowledge-based geometric validation methods to provide a first-stage NMR structure validation platform.
However, further research and development is required to address additional important issues of NMR structure validation that are needed for a comprehensive platform. These include validation of 3D structures against chemical shift, residual dipolar coupling, chemical shift anisotropy, paramagnetic relaxation enhancement, small angle X-ray scattering, and NOESY peak list data, as well as assessing the impact of internal dynamics and ensemble-averaging effects on the interpretation of these NMR data. It is recognized that more extensive metrics than those presented here will be required to capture the full breadth of all these effects. Further work by the NMR-VTF in consultation with the community is required in order to standardize these additional structure quality metrics so that they can be broadly adopted by the wwPDB on behalf of the biological NMR community. It is very important that the broader scientific community has a voice in the evolution of standards and conventions for validating biomolecular NMR structures. For this reason, we have established an “NMR VTF Community Wiki site”, at http://nmr-community.wwpdb.org.
Principal recommendations
The NMR-VTF recommends that depositors of biomolecular NMR structures be encouraged to provide atomic coordinates for all atoms in residues for which backbone and/or sidechain resonance assignments have been determined. This would include large internal loops, interdomain linkers, N- and C-terminal regions, and purification tags, even where the structure in these regions is “ill defined” (i.e. not well converged or not well defined in terms of a unique conformation) by the ensemble. The concept is that every residue for which some experimental data are available should be represented in the atomic coordinates. However, as explained below, the ill-defined regions should be specifically identified by the wwPDB through the use of software agreed upon by the community, and these regions should be handled distinctly in the structure quality assessment process. Notwithstanding these recommendations, coordinates for regions of the structure that the depositor feels are not reliable may be excluded from the deposition at the discretion of the depositor.
The NMR-VTF further recommends that wwPDB implement a standardized structure validation pipeline in three phases.
Phase 1. Validation metrics that can be implemented easily and immediately by the wwPDB by using existing software.
Phase 2. Validation metrics for which software / methods are available but that need more assessment before standards and conventions can be defined for the wwPDB.
Phase 3. Validation methods requiring further research over the coming years.
The NMR-VTF has focused its initial efforts on validation of protein structures determined primarily from NMR data. Further work by the NMR-VTF is needed in order to establish recommendations for validation of nucleic acids, carbohydrates, and other biological structures determined primarily from NMR data.
Phase 1. Validation metrics that can be implemented easily by the wwPDB using existing software
Existing software packages can be used to generate validation reports for all submitted protein NMR structures. Software that is freely accessible to the scientific community, and in general use by the biomolecular NMR community, should be used for generating these validation reports. These wwPDB NMR Structure Validation Reports should include four components: (i) a report validating the completeness and global referencing of chemical shift data, independent of 3D structure; (ii) analysis of “well-defined” vs. “ill-defined” regions; (iii) a knowledge-based model validation report; and (iv) a restraint-based model-vs-data validation report, comparing each member of the ensemble of NMR models to the available NMR restraints.
These validation reports also should be generated for all NMR structures already in the PDB and distributed by wwPDB member sites. It is recognized that as many of these structures do not have chemical shift data and/or complete restraint data, these validation reports for many of the archived NMR structures will be incomplete. It is also recognized that some structures will have poor validation analyses, often reflecting the early vintage of some of the NMR structures in the archive. These validation reports on the archived NMR structures are nonetheless valuable to the scientific community. Care should also be taken to first identify and exclude from this analysis “averaged atomic coordinates”, which do not correspond to physically-reasonable models. In the initial phase, these reports will be provided primarily for protein structures, until similar recommendations have been developed for nucleic acids and other biomolecular NMR structures.
In certain cases, spectra may indicate the presence of two or more structures in slow exchange on the chemical shift timescale (e.g. a mixture of reduced and oxidized forms of a protein, or conformations distinguished by slow proline cis/trans isomerization). In the event that two or more non-trivially distinct NMR structures are generated for the biomolecule from distinct restraint data, the two (or more) atomic coordinate sets should be deposited and validated separately. An example of two structures that should be deposited and validated separately would be one protein for which different coordinates have been determined for surface loops containing different proline (cis vs trans) peptide bond conformations.
1.1. Chemical shift data validation report
All new NMR structures that are deposited in the PDB are now required to include the chemical shift data used to determine the structure. The NMR-VTF recommends that a Phase 1 NMR Structure Validation Report include an analysis of the completeness of these chemical shift data for all assigned atoms, global reference corrections, and a list of chemical shift outliers. Completeness of assignments refers to the percentage of resonances (e.g. 1H, 13C, and 15N) for which assignments are reported, relative to the number of potentially assignable atoms in the full-length protein construct, excluding highly exchangeable protons (e.g., N-terminal and Lys amino and Arg guanido groups, hydroxyl hydrogens of Ser, Thr, Tyr, and carboxyl hydrogens of Asp and Glu, or the equivalent hydrogen atoms of nucleic acids) and non-protonated nitrogens and carbons (e.g. Pro N and aromatic Cγ). Backbone carbonyl carbons of peptide bonds shall generally be included in the number of assignable atoms. For the purpose of calculating assignment completeness percentages, the three hydrogens of a methyl group are counted as one atom. If a single chemical shift has been assigned for a diastereotopic pair, this same shift should be reported for both hydrogens or methyl groups, unless it has been experimentally established that it originates from only one of the two diastereotopic partners.
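The counting rules above can be illustrated with a minimal Python sketch. It is not the AVS implementation; the function names, the `methyls` mapping, and the toy alanine atom list are hypothetical, and a real tool would take the assignable and excluded atoms from a complete residue library.

```python
def pseudo_atom(atom_name, methyl_prefixes):
    """Collapse the three hydrogens of a methyl group (e.g. HB1/HB2/HB3) to one name."""
    for prefix in methyl_prefixes:
        if atom_name.startswith(prefix) and atom_name[len(prefix):] in ("1", "2", "3"):
            return prefix + "*"
    return atom_name


def assignment_completeness(assignable, assigned, methyls=None, excluded=frozenset()):
    """Percentage of countable atoms that have a reported chemical shift.

    assignable: iterable of (residue_id, atom_name) for the full-length construct
    assigned:   iterable of (residue_id, atom_name) with a reported shift
    methyls:    hypothetical mapping residue_id -> methyl hydrogen prefixes
    excluded:   atoms left out of the count (exchangeable protons, etc.)
    """
    methyls = methyls or {}
    expected = set()
    for res_id, atom in assignable:
        if (res_id, atom) in excluded:
            continue
        expected.add((res_id, pseudo_atom(atom, methyls.get(res_id, ()))))
    if not expected:
        raise ValueError("no assignable atoms left after exclusions")
    observed = set()
    for res_id, atom in assigned:
        key = (res_id, pseudo_atom(atom, methyls.get(res_id, ())))
        if key in expected:
            observed.add(key)
    return 100.0 * len(observed) / len(expected)


# Toy example: one alanine (residue 1) whose HB methyl counts as a single atom,
# so 7 atoms are countable and 5 of them are assigned.
ala_atoms = ["H", "N", "CA", "HA", "CB", "HB1", "HB2", "HB3", "C"]
assignable = [(1, a) for a in ala_atoms]
assigned = {(1, "H"), (1, "N"), (1, "CA"), (1, "HA"), (1, "HB1")}
pct = assignment_completeness(assignable, assigned, methyls={1: ("HB",)})
print(f"{pct:.1f}% complete")
```

Passing the excluded atoms explicitly keeps the policy (which atoms count as assignable) separate from the arithmetic, so the same function could serve backbone, side-chain, or aromatic subsets.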
Standardized chemical shift completeness and global referencing reports can be generated by using the same tools that are used by the BioMagResBank (Ulrich et al., 2008), including the Assignment Validation Software Suite (AVS) (Moseley et al., 2004), LACS (Wang et al., 2005; Wang and Markley, 2009), and SPARTA+ (Shen and Bax, 2010). Additional tools that could be useful for validation of the integrity and accuracy of the chemical shift data, and to identify unusual or outlier chemical shifts, include PANAV (Wang et al., 2010), CheckShift (Ginzinger et al., 2007), ShiftX2 (Han et al., 2011), and VASCO (Rieping and Vranken, 2010). The output of an appropriate subset of these tools could be combined into a single consistent chemical shift data validation report. As discussed below, it is premature to include in this Phase 1 report a validation of 3D structure models on the basis of chemical shift data.
1.2. Well-defined vs. ill-defined atoms or residue ranges
NMR structure validation methods are generally applicable only to the “well-defined” regions of the macromolecular structure. Atoms that are not “well defined” in their atomic positions by the experimental NMR data should not be included as part of the global NMR structure validation. However, such “ill-defined” regions of the structure may still be useful for expert applications, and models for these regions generally will also be included in the atomic coordinate file. Users of protein NMR structure models need to be made aware of which atoms in the PDB coordinate file are “well-defined” in the NMR structure. For these reasons, it is important that NMR structure coordinates are flagged in a way that identifies the “well-defined” and “ill-defined” residues and atoms.
Solution NMR structures typically are represented as “ensembles” of coordinate sets. Each member of the ensemble represents a single model that is consistent with the experimental data. The distribution of models across the ensemble provides insight into how “well-defined” the structure is in different regions. “Well-defined” regions are those that are precisely (though not necessarily accurately) modeled across the ensemble. The “ill-defined” parts of the structure may correspond to regions of a molecule undergoing conformational dynamics, or may simply reflect incompleteness of the restraining data.
The ensemble representation of molecular models is sometimes confusing to biologists attempting to use an NMR structure. Although each model in the ensemble is considered to be a valid representation of the structure, the uncertainty in these atomic coordinates is commonly assessed by statistical analysis across the ensemble. However, for various reasons, the ensemble representation does not provide a statistically sound estimate of the precision of the atomic coordinates given uncertainties of the experimental data (Andrec et al., 2007; Clore et al., 1993; Snyder et al., 2005; Snyder and Montelione, 2005; Spronk et al., 2003), nor does it provide a true estimate of conformational dynamics. Nonetheless, the ensemble representation is the current convention of the field for distinguishing those regions of a structure that are “well-defined” by the experimental data, from those that are “ill-defined”. This distinction is critical for appropriate quality assessment of NMR structures. Accordingly, it is important that the ensemble information is conveyed in a simple way to users of NMR structures.
Chemical shifts, RDCs, relaxation rates, and other NMR parameters that provide structural information may be associated with the “ill-defined” regions. Ill-defined regions may also include transient structural information that is functionally important. For these reasons, it is recommended that atomic coordinates are provided for all residues for which chemical shift data are available, even though these coordinates may be imprecisely defined by the experimental data.
It is recognized that such ill-defined regions, which may be flexibly disordered, are often biologically and/or biophysically important, particularly in determining the biochemical functions of macromolecules (Dyson and Wright, 2005; Dunker et al., 2008). NMR can provide unique information about amplitudes and timescales of dynamic fluctuations, which often contribute to biomolecular function. These are important considerations for the NMR VTF in phases 2 and 3, as outlined below. Although the convention of designating residues or atoms as “ill defined” is helpful for users of biomolecular structures, the terminology “ill defined” should in no way be interpreted as devaluing the significance of these regions of the structure. Considering that flexibly disordered regions of biomolecular structures are important structural and functional features, and that methods for interpreting ensemble-averaged information in these regions of the structure are still under development, the NMR VTF recommends that the standard validation report include plots of backbone and sidechain circular variance vs. residue number for all residues for which atomic coordinates are provided.
Depositor specification of “well-defined” and “ill-defined” regions
The NMR-VTF recommends that the wwPDB allow depositors to specify regions of the biomolecular structure that are “well-defined” across the ensemble of NMR structures, and those that are “ill-defined”. Tags for such designators have already been developed by the wwPDB and are ready to be implemented as part of PDB depositions.
Automated analysis of “well-defined” and “ill-defined” regions
Although the initial designation of “ill-defined” regions can be provided by the depositors, for the purposes of uniform structure quality assessment, the wwPDB should adopt an automated method for defining those parts of the NMR structure that are “well-defined” and “ill-defined”. Several algorithms and software packages are available to make these assessments automatically. These include methods based on (i) the locations of elements of secondary structure, (ii) backbone dihedral angle circular variance (Hyberts et al., 1992) and (iii) variance matrix analysis (Brunger et al., 1993; Kelley et al., 1996, 1997; Kirchner and Güntert, 2011; Snyder and Montelione, 2005), including methods that use maximum likelihood superimposition based on principal components analysis (Theobald and Wuttke, 2006, 2008). Definitions based on locations of elements of secondary structure exclude irregular structures in proteins that may, in fact, be well-defined. Methods based on backbone dihedral angle circular variance (Hyberts et al., 1992) are very popular in the protein NMR community, but do not provide information about long-range order, i.e., they cannot assess how well defined subdomains of the structure are with respect to one another.
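As an illustration of the dihedral-angle approach, the sketch below computes the angular order parameter S for one dihedral across the ensemble (the mean resultant length of the unit vectors) and the corresponding circular variance, 1 − S. The residue selection rule S(phi) + S(psi) > 1.8 is the criterion quoted in the footnote to Table 1; the function names and the input layout (a dictionary of per-residue angle lists in degrees) are assumptions made for this example, not the interface of any of the cited packages.

```python
import math


def order_parameter(angles_deg):
    """Angular order parameter S in [0, 1] for one dihedral across all models."""
    n = len(angles_deg)
    if n == 0:
        raise ValueError("need at least one angle")
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    return math.hypot(c, s) / n


def circular_variance(angles_deg):
    """Circular variance, defined here as 1 - S (0 = identical angles in all models)."""
    return 1.0 - order_parameter(angles_deg)


def well_defined_residues(phi, psi, cutoff=1.8):
    """Residues with S(phi) + S(psi) above the cutoff.

    phi, psi: dicts mapping residue number -> list of angles (degrees), one per model.
    """
    return sorted(
        res for res in phi
        if res in psi and order_parameter(phi[res]) + order_parameter(psi[res]) > cutoff
    )


# Residue 1 is tightly clustered; residue 2 is scattered around the circle.
phi = {1: [-60.0, -62.0, -58.0], 2: [-60.0, 60.0, 180.0]}
psi = {1: [-45.0, -44.0, -46.0], 2: [0.0, 120.0, -120.0]}
print(well_defined_residues(phi, psi))
```

Working with unit vectors rather than raw angle values avoids the wrap-around problem at ±180°. As the text notes, a per-residue measure of this kind says nothing about long-range order between subdomains.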
The NMR-VTF recommends that the wwPDB adopt one of the several software packages for discriminating between “well-defined” and “ill-defined” regions of the protein structure. The method adopted should include the ability to distinguish multiple “well-defined regions” or “domains” that are not well defined with respect to one another. Examples of these would include two domains of well-defined atoms connected by an ill-defined linker, or a well-defined domain and independent well-defined helix, connected by an ill-defined linker. In these cases, each of the corresponding subdomains can generally be identified by distance variance matrix methods, and should be assessed separately. The software package recommended for this analysis is CYRANGE (Kirchner and Güntert, 2011); other similar software tools have also been described in the literature (Brunger et al., 1993; Kelley et al., 1996, 1997; Snyder and Montelione, 2005) and may also be suitable for this purpose.
Representative NMR structure
It is recognized that the user community requires the designation of one NMR model from the calculated ensemble, or derived from the ensemble, that is a single representative of the solution structure. The NMR-VTF recommends that the PDB identify the medoid model (Struyf et al., 1997; Snyder et al., 2005; Tejero et al., 2013) that is most similar to all the other conformers [i.e. the model in the ensemble with smallest average rmsd between it and all (other) models of the ensemble], and designate it as the single representative NMR structure. The medoid model should be identified using only the well-defined residue range(s). It can be computed using the algorithm described by Tejero et al. (2013). For NMR structures containing multiple domains that are ill defined with respect to one another, the representative model should be chosen using this approach for the largest domain. If the domains are of identical size, the representative multidomain structure should be selected as the one containing the domain resulting in the smallest rmsd. In addition, the depositor may identify a “depositor-designated representative structure”, as part of the deposition process, based on alternative criteria to be provided at the time of deposition. The PDB might annotate the “medoid representative structure” and “depositor-designated representative structure” in order to facilitate their use.
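The medoid definition given above can be sketched in a few lines of Python. This is an illustration of the definition only, not the algorithm of Tejero et al. (2013). It assumes that a symmetric matrix of pairwise rmsd values, computed after superposition on the well-defined residues, is already available. The function name and the tie-breaking rule (lowest model index wins) are assumptions.

```python
def medoid_index(rmsd):
    """Index of the model with the smallest mean rmsd to all other models.

    rmsd: square, symmetric list of lists; rmsd[i][j] is the pairwise rmsd
          between models i and j over the well-defined residue range(s).
    """
    n = len(rmsd)
    if n == 0:
        raise ValueError("empty ensemble")
    if n == 1:
        return 0
    best_index, best_mean = 0, float("inf")
    for i in range(n):
        mean_i = sum(rmsd[i][j] for j in range(n) if j != i) / (n - 1)
        if mean_i < best_mean:  # strict '<' keeps the first model on ties
            best_index, best_mean = i, mean_i
    return best_index


# Three-model toy ensemble: model 1 sits between models 0 and 2.
pairwise = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
print(medoid_index(pairwise))
```

Unlike an averaged coordinate set, the medoid is always one of the deposited models, so it remains a physically reasonable structure.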
The representative model should also be annotated to indicate which residues and/or atoms are “well-defined” and which are “ill-defined” in the NMR ensemble, either on the basis of the depositor-defined or the automatically generated designations as outlined in the previous sections. Specifically, the information about atoms or residues being well or ill-defined should go into the PDB file of the structure, and be distributed by wwPDB so as to be readily available to users and external software. These annotations may be used by visualization programs to color-code or exclude ill-defined regions when displaying the representative model(s). Such annotations will be valuable to users of NMR structure coordinates, particularly users who are not familiar with interpreting the traditional ensemble representations.
1.3. Knowledge-based protein structure validation
It is the consensus of the NMR-VTF that knowledge-based model validation of protein NMR structures, including covalent geometry, dihedral angle conformations, and core packing, should utilize the same methods, software, and standards as those recommended for the model validation of protein structures determined by X-ray crystallography (Read et al., 2011). In particular, the MolProbity software (Chen et al., 2010) should be used for analysis of overpacking (e.g. all-atom steric clashes), and the Rosetta Holes (Sheffler and Baker, 2009) software for analysis of underpacking. Ramachandran backbone dihedral analysis should utilize recently updated parameters (Arendall et al., 2005; Read et al., 2011).
Knowledge-based validation should be carried out on either the automated or depositor-specified “well-defined” regions of the structure, outlined above. Global Z-scores or percentiles, which may be plotted as bar graphs as proposed for X-ray crystal structures (Read et al., 2011; Gore et al., 2012), should be reported only for “well-defined” regions of the structure, and should be graded using the same set of structures used in grading X-ray crystal structures (Gore et al., 2012). In particular, users should be able to compare structures determined by X-ray crystallography and by NMR using metrics and scales that are common to X-ray and NMR structures.
Specifically, the NMR-VTF recommends that knowledge-based validation scores are reported on two scales: (i) relative to the entire protein crystal structure archive of the PDB (i.e. the same reference structures used for assessment of X-ray crystal structures), and (ii) relative to the NMR structure archive of the PDB. However, implementers are encouraged to consider alternate basis sets of structures to use in determining such assessment statistics. Scores should be reported as first quartile, mean, and third quartile.
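A minimal sketch of this kind of reporting is given below, using only the Python standard library. It takes one possible reading of the text, in which the quartiles and mean are computed across the per-model scores of an ensemble, and each value is then graded against two reference sets. The function names, the toy numbers, the use of the sample standard deviation, and the sign convention of the Z-score are all assumptions. For scores where lower is better, an implementation might flip the sign.

```python
import statistics


def ensemble_summary(scores):
    """Return (first quartile, mean, third quartile) of per-model scores."""
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q1, statistics.fmean(scores), q3


def z_score(value, reference):
    """Z-score of a value relative to a reference set of scores."""
    return (value - statistics.fmean(reference)) / statistics.stdev(reference)


def percentile_rank(value, reference):
    """Percentage of reference scores less than or equal to the value."""
    return 100.0 * sum(1 for r in reference if r <= value) / len(reference)


# Toy numbers only: per-model scores and two made-up reference sets.
model_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
xray_reference = [1.0, 2.0, 3.0, 4.0, 5.0]
nmr_reference = [2.0, 4.0, 6.0, 8.0]

q1, mean, q3 = ensemble_summary(model_scores)
for label, ref in (("X-ray archive", xray_reference), ("NMR archive", nmr_reference)):
    print(label, round(z_score(mean, ref), 2), percentile_rank(mean, ref))
```

Passing the reference set as an argument matches the suggestion that implementers may consider alternate basis sets of structures.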
In addition, knowledge-based validations should be reported for each residue of the structure. For such local model structure validation, it is recommended to consider residues in both the “well-defined” and “ill-defined” regions of the structure. By analogy with X-ray structures, for which local structural information is graded by the wwPDB by comparison with crystal structures refined to similar diffraction resolutions, for NMR structures this local structural information should be graded based on the entire database of NMR structures. Although “ill-defined” regions of the structure may or may not have energetically reasonable conformations, depositors should be encouraged to model these regions with plausible conformations. However, the final decision regarding how to model regions of the structure that are underconstrained by the experimental data should be left to the experimentalists who have determined the NMR structure.
1.4. Validation of the consistency between experimental restraints and structural models
NMR structures also should be validated against distance and dihedral restraint data that are submitted as part of a PDB deposition. In Phase 1, the NMR-VTF recommends a simple model-vs-data validation of the structure against only the submitted experimental restraints. These should include (i) distance restraints, (ii) hydrogen-bond restraints, (iii) dihedral angle restraints, and (iv) any additional distance restraints provided with the PDB deposition. These restraint data should be compared with the coordinates of each model to determine restraint violations by each model. NOE-based distance-restraint violations should be interpreted with the assumption of r⁻⁶ summation for ambiguous restraints (Nilges, 1995). The numbers of intra-residue (i = j), sequential (|i−j| = 1), medium range (1 < |i − j| < 5), long range (|i − j| ≥ 5), and inter-chain restraints should be summarized, together with the number of restraints in each category (NOE-based, hydrogen-bond, dihedral angle, etc.). The number of scalar coupling, residual dipolar coupling, chemical shift anisotropy, paramagnetic relaxation enhancement and other restraint data should also be summarized. The numbers of restraint violations, in each class, should be reported in bins (e.g., 0 – 0.2 Å, 0.2 – 0.5 Å, > 0.5 Å), along with the values of the largest restraint violations in each restraint class. If appropriate, such NMR-specific metrics could be graded by comparison against the corresponding values observed in all NMR structures in the PDB for which such restraint data are available. These data should be summarized for all the models in the ensemble in a concise format, and also for the individual models.
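The bookkeeping described in this section can be sketched as follows. The range classes and the r⁻⁶ summation follow the text; everything else is an assumption for illustration: the function names, the treatment of only upper bounds, the exclusion of satisfied restraints (zero violation) from the bins, and the choice to make the upper edge of each bin inclusive.

```python
def classify_range(i, j, same_chain=True):
    """Assign a restraint between residues i and j to a sequence-separation class."""
    if not same_chain:
        return "inter-chain"
    sep = abs(i - j)
    if sep == 0:
        return "intra-residue"
    if sep == 1:
        return "sequential"
    if sep < 5:
        return "medium-range"
    return "long-range"


def r6_sum_distance(distances):
    """Effective distance of an ambiguous restraint: (sum of d**-6) ** (-1/6)."""
    return sum(d ** -6 for d in distances) ** (-1.0 / 6.0)


def upper_bound_violation(distances, upper):
    """Violation (in Angstrom) of an upper-bound restraint in one model; 0 if satisfied."""
    return max(0.0, r6_sum_distance(distances) - upper)


def bin_violations(violations, edges=(0.2, 0.5)):
    """Count non-zero violations in the bins (0, 0.2], (0.2, 0.5], > 0.5 (Angstrom)."""
    counts = [0] * (len(edges) + 1)
    for v in violations:
        if v <= 0.0:
            continue  # restraint satisfied, not a violation
        for k, edge in enumerate(edges):
            if v <= edge:
                counts[k] += 1
                break
        else:
            counts[-1] += 1
    return counts


# An ambiguous restraint with two candidate proton pairs at 3.0 and 6.0 Angstrom.
viol = upper_bound_violation([3.0, 6.0], upper=2.8)
print(classify_range(12, 15), round(viol, 3))
print(bin_violations([0.0, 0.05, 0.2, 0.3, 0.6]), max([0.0, 0.05, 0.2, 0.3, 0.6]))
```

Because of the r⁻⁶ weighting, the effective distance is dominated by the shortest contributing distance and is never longer than it.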
1.5. Standardized NMR structure validation report
A standard wwPDB NMR structure validation report should be developed. The committee recommends that initially only a core set of standardized validation metrics be adopted. The report should include a summary of the completeness of chemical shift data, including a summary of unusual chemical shift values, along with a validation of the NMR structure models using knowledge-based and restraint violation statistics. The report should include a version number, along with raw scores generated by the underlying knowledge-based structure validation software. It should also include machine readable output. These would be expanded over time, as the NMR-VTF assesses and recommends more sophisticated model-vs-data metrics.
Useful models of such reports are provided by the Protein Structure Validation Suite software (Bhattacharya et al., 2007), CING software package (Doreleijers et al., 2012), and the PDBStat software (Tejero et al., 2013). A recently published survey of NMR structure validation software (Vuister et al., 2013) also provides useful guidance for the development of NMR structure quality assessment reports. An example of an NMR Structure Validation Report for Phase 1, including chemical shift completeness statistics, restraint violation summaries and statistics, and knowledge-based structure validation statistics, taken from a recent paper (Aramini et al., 2012), is presented in Table 1. This example is provided only as a guide to the kind of concise summary that wwPDB might include in their validation reports. Additional information, such as chemical shift validation statistics (Moseley et al., 2004), could also be provided. Appropriate criteria will need to be developed for structures refined from NOESY-derived distance restraints that do not specify upper or lower bounds (Nilges, 1995). The X-ray crystal structure validation reports described by Gore et al. (2012) also provide useful examples to guide the design of a concise wwPDB NMR validation report. In addition, more extensive NMR structure validation data and graphical assessment tools, similar to those provided for X-ray Crystal Structure Validation Reports (Read et al., 2011), should be provided.
Table 1.
Example of a table providing a summary of structural statistics, developed based on these recommendations of the NMR-VTF, for bacterial protein Alr2454
| Alr2454 [a] | | |
|---|---|---|
| Completeness of resonance assignments [b]: | | |
| Backbone (%) | 99.4 | |
| Side chain (%) | 98.3 | |
| Aromatic (%) | 96.6 | |
| Stereospecific methyl (%) | 100 | |
| Conformationally-restricting restraints [c]: | | |
| Distance restraints | | |
| Total | 2478 | |
| intra-residue (i = j) | 688 | |
| sequential (\|i − j\| = 1) | 619 | |
| medium range (1 < \|i − j\| < 5) | 462 | |
| long range (\|i − j\| ≥ 5) | 709 | |
| Dihedral angle restraints | 162 | |
| Hydrogen bond restraints | 0 | |
| No. of restraints per residue | 25.5 | |
| No. of long range restraints per residue | 6.8 | |
| Residual restraint violations [c]: | | |
| Average no. of distance violations per structure: | | |
| 0.1 – 0.2 Å | 8.75 | |
| 0.2 – 0.5 Å | 1.85 | |
| > 0.5 Å | 0 | |
| Average no. of dihedral angle violations per structure: | | |
| 1 – 10° | 8.75 | |
| > 10° | 0 | |
| Model Quality [c]: | | |
| RMSD backbone atoms (Å) [d] | 0.6 | |
| RMSD heavy atoms (Å) [d] | 0.9 | |
| RMSD bond lengths (Å) | 0.018 | |
| RMSD bond angles (°) | 1.1 | |
| MolProbity Ramachandran statistics [c,d] | | |
| most favored regions (%) | 96.8 | |
| allowed regions (%) | 3.1 | |
| disallowed regions (%) | 0.1 | |
| Global quality scores (Raw / Z-score) [c] | | |
| Verify3D | 0.40 | −0.96 |
| ProsaII | 0.66 | 0.04 |
| ProCheck (phi-psi) [d] | −0.15 | −0.28 |
| ProCheck (all) [d] | −0.03 | −0.18 |
| MolProbity clash score | 12.51 | −0.62 |
| Model Contents: | | |
| Ordered residue ranges [d] | 1–100 | |
| Total no. of residues | 108 | |
| BMRB accession number: | 17965 | |
| PDB ID: | 2LJW [a] | |
[a] Structural statistics computed for the ensemble of 20 deposited structures.
[b] Computed using AVS software (Moseley et al., 2004) from the expected number of resonances, excluding: highly exchangeable protons (N-terminal, Lys, and Arg amino groups, hydroxyls of Ser, Thr, Tyr), carboxyls of Asp and Glu, non-protonated aromatic carbons, and the C-terminal His6 tag.
[c] Calculated using PSVS ver. 1.4 (Bhattacharya et al., 2007). Average distance violations were calculated using the sum over r⁻⁶.
[d] Based on ordered residue ranges [S(phi) + S(psi) > 1.8].
Phase 2. Methods and software exist, but require additional assessment before adopting standard validation conventions
A critical task for the NMR-VTF is to continue to assess model-vs-data validation metrics that can be used to validate the degree to which 3D NMR structures fit the underlying experimental data; i.e. “NMR R factors”. These model-vs-data metrics will include assessment of scalar coupling, residual dipolar coupling (RDC), chemical shift anisotropy (CSA), unassigned NOESY peak list, paramagnetic relaxation enhancement (PRE), paramagnetic pseudo-contact shift, solid-state dipolar coupling, and small angle X-ray or neutron scattering (SAXS or SANS) data. Several tools for validating structures against these data are available, including methods for validation of protein structures against RDC data (Bryson et al., 2008; Clore et al., 1993; Valafar and Prestegard, 2004), CSA data (Cornilescu et al., 1998), NOESY peak lists (Bagaria et al., 2012; Huang et al., 2005; Huang et al., 2012; Nilges, 1995), and SAXS data (Grishaev et al., 2005).
While these methods are very powerful and generally robust, they have not yet been uniformly adopted across the biomolecular NMR community. Metrics based on these data require clear definitions and further assessment, as well as a process for harvesting these data by wwPDB in an appropriate format for validation. For these reasons, the NMR-VTF does not recommend including these model-vs-data metrics in standard wwPDB validation reports in Phase 1.
During Phase 2, the NMR-VTF will assess and then recommend the software packages most suited to model-vs-data validation. In order to provide an expanded NMR Structure Validation Report in Phase 2, with additional model-vs-data assessments, depositors of biomolecular NMR structures are encouraged to archive (where available) in the BioMagResBank (Ulrich et al., 2008) NOESY peak lists, RDC, PRE, and SAXS or SANS experimental data, as well as unprocessed free induction decay (FID) data, for biomolecular structures deposited in the PDB.
The NMR-VTF also recognizes that biomolecular structures which have been deposited in the PDB and subsequently found to include some inaccuracies are valuable test data sets for the development of structure validation methods. Coordinates that have been designated by depositors as “obsolete”, and that remain archived in the PDB, are also valuable for testing and developing structure validation tools. The VTF recommends that a set of such “inaccurate NMR structure coordinates” be collected and provided to the community for methods development.
Phase 3. Areas requiring additional research
The NMR-VTF identified the validation of NMR structures of polynucleic acids, including DNA and RNA, and polysaccharides as critical areas that require additional research. Although it is likely that some of the same tools used for validating NMR structures of proteins and X-ray crystal structures of nucleic acids will be appropriate for NMR structures of nucleic acids and polysaccharides, the NMR-VTF agreed to make standardization of metrics for nucleic acid NMR structure validation a future priority of the committee.
A key metric requiring further research is the validation of structures in terms of the experimental information content of the data on which they are based. This “information content measure” would be analogous to the “resolution” measure so central for X-ray crystal structures. For NMR, there could potentially be both a global and a local version of such a measure. The NMR-VTF generally agreed that the metric of “restraints per residue”, while in the spirit of such an “information content measure”, is not satisfactory because different restraints have different “information content”. In particular, the “restraints per residue” metric does not correlate well with structural accuracy. This is an important area of research.
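The limitation of “restraints per residue” can be illustrated with a toy example. The sketch below is hypothetical and is not a proposed information content measure. It assumes each distance restraint is represented simply as a pair of residue numbers, and it uses a conventional classification by sequence separation (intraresidue, sequential, medium-range, long-range) with cutoffs chosen here for illustration. Two restraint sets with identical restraints-per-residue values differ completely in the kind of restraints they contain.

```python
from collections import Counter


def restraints_per_residue(restraints, n_residues):
    """Total number of restraints divided by the number of residues."""
    return len(restraints) / n_residues


def classify_by_range(restraints):
    """Count restraints by sequence separation |i - j| (assumed cutoffs)."""
    counts = Counter()
    for res_i, res_j in restraints:
        separation = abs(res_i - res_j)
        if separation == 0:
            counts["intraresidue"] += 1
        elif separation == 1:
            counts["sequential"] += 1
        elif separation < 5:
            counts["medium-range"] += 1
        else:
            counts["long-range"] += 1
    return counts


# Two hypothetical restraint sets for a 10-residue peptide.
set_a = [(1, 2), (2, 3), (3, 4), (4, 5)]    # only sequential contacts
set_b = [(1, 10), (2, 9), (3, 8), (4, 9)]   # only long-range contacts

for name, restraints in (("A", set_a), ("B", set_b)):
    density = restraints_per_residue(restraints, 10)
    print(name, density, dict(classify_by_range(restraints)))
```

Both sets give the same restraint density, yet only the long-range restraints in set B constrain the overall fold, which is why a simple count cannot serve as an NMR analogue of resolution.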
Chemical shift data can also be used for validation of 3D biomolecular structures (Han et al., 2011; Rieping and Vranken, 2010; Shen and Bax, 2010). This is a significant motivation for capturing chemical shift data for all protein and nucleic acid structures deposited in the PDB. However, chemical shifts are dominated by local effects and hence need to be combined with other data sensitive to longer-range structural features as part of a comprehensive model-vs-data quality assessment. Although advances have been made in this high-impact area of computational NMR, additional research is needed before standardized methods for validating structures directly against chemical shifts can be recommended for inclusion in the wwPDB NMR validation pipeline.
The NMR-VTF also recognized that biomolecular NMR data generally are ensemble averages, with Boltzmann-weighted contributions from the various conformers present in the sample. Accordingly, the NMR data may not be best fit by a single conformer. For this reason, it is critical to develop tools that can be used to assess to what degree the lack of precision in defining atomic coordinates is due to such underlying internal dynamics, as can be assessed experimentally by nuclear relaxation, chemical shift, dipolar coupling, and/or residual dipolar coupling data. Methods for generating ensembles of conformers that best satisfy the experimental data (e.g., Clore and Schwieters, 2004; Lindorff-Larsen et al., 2005), particularly in highly dynamic regions of a structure, and validation of these ensembles of conformers against the ensemble-averaged data, are also an important area for future research.
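The consequence of ensemble averaging can be made concrete with a small numerical sketch. It assumes the commonly used approximation that an NOE-derived distance reflects a population-weighted average of r^-6 over conformers; the appropriate averaging in practice depends on the timescale of the motion. The function name, distances, and populations below are hypothetical.

```python
def effective_noe_distance(distances, weights):
    """Effective distance from r^-6 averaging over conformers.

    distances: interproton distance in each conformer (angstroms).
    weights:   relative populations; they are normalized here.
    """
    if len(distances) != len(weights) or not distances:
        raise ValueError("need one weight per distance")
    if any(r <= 0 for r in distances):
        raise ValueError("distances must be positive")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean_inverse_sixth = sum(
        (w / total) * r ** -6 for r, w in zip(distances, weights)
    )
    return mean_inverse_sixth ** (-1.0 / 6.0)


# Hypothetical two-state exchange: a proton pair is 3 A apart in one
# conformer and 8 A apart in the other, with equal populations.
r_eff = effective_noe_distance([3.0, 8.0], [0.5, 0.5])
print(f"effective distance: {r_eff:.2f} A")
```

Under these assumptions the effective distance lies close to the shorter of the two distances and matches neither conformer, so a single model forced to satisfy the averaged restraint may represent a conformation that is not actually populated.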
Summary
There is no a priori reason to believe that biomolecular structures determined by NMR in solution or the solid state are fundamentally different from those determined by X-ray crystallography, even though intermolecular packing effects in the crystal lattice may stabilize local conformations that are not predominant in solution. For this reason, the knowledge-based validation of NMR structures should be done using the same metrics and standards, and scaled against the same or comparable structural datasets, as has been recommended for X-ray crystal structures (Read et al., 2011). As there is no generally accepted “information content measure” in NMR similar to a resolution, these knowledge-based statistics should be reported relative to (i) all crystal structures in the archive and (ii) all NMR structures in the archive.
Model-vs-data validation of NMR structures is critical for the maturation of the field of biomolecular NMR. However, the recommendation of consensus statistics for model-vs-data validation (i.e., “NMR R-factors”) is complicated by the fact that NMR structures are often derived from many different types of NMR data. Quality assessment is simplified in these initial recommendations by focusing on restraint violation analyses. However, the restraints are interpreted data, which may not capture all of the information present in NOESY spectra and other NMR data sets. While methods are available to assess models against all these kinds of experimental data, more work is needed to define standards and metrics for a comprehensive model-vs-data assessment before these metrics are incorporated into a wwPDB NMR validation pipeline.
Considering these caveats, software is available today to generate a useful and extensive Phase 1 wwPDB NMR Structure Validation Report. This report will include chemical shift data validation (completeness and outliers), assessment of ‘well-defined’ and ‘ill-defined’ regions of the structure, knowledge-based validation of ‘well-defined’ regions, and a comprehensive validation of the structure against restraint data. Such reports will provide valuable information on the precision and accuracy of NMR structures useful for guiding biological research.
Acknowledgements
We thank the following people for participation in the discussions of the wwPDB NMR-VTF and for their comments on these recommendations: J. Aramini, J. Block, R. A. Byrd, A. Gutmanas, N. Kobayashi, P. M. S. Hendrickx, Y.P. Huang, C. Lawson, H. Nakamura, R.J. Read, A. Rosato, D. Snyder, R. Tejero, E.L. Ulrich, and J. Westbrook. Support for this work was provided by members of the Worldwide PDB: RCSB PDB (NSF DBI 0829586), PDBe (Wellcome Trust 075968 and 088944; BBSRC BB/E007511/1), PDBj (NBDC-JST), BMRB (NLM NIH P41 LM05799); The Pasteur Institute, and NIGMS Protein Structure Initiative grant U54 GM094597 (to G.T.M.). Funding was also provided by NIH Intramural Research Programs of NIDDK and CIT (to A.B. and C.S.), BBSRC grants BB/J007471/1 and BB/J007897/1 (to G.J.K. and G.W.V., respectively), NIH grant R01-GM073930 (to J.R.), and the Brussels Institute for Research and Innovation (Innoviris) grant BB2B 2010-1-12 (to W.F.V.).
References
- Andrec M, Snyder DA, Zhou Z, Young J, Montelione GT, Levy RM. A large data set comparison of protein structures determined by crystallography and NMR: statistical test for structural differences and the effect of crystal packing. Proteins. 2007;69:449–465. doi: 10.1002/prot.21507.
- Aramini JM, Petrey D, Lee DY, Janjua H, Xiao R, Acton TB, Everett JK, Montelione GT. Solution NMR structure of Alr2454 from Nostoc sp. PCC 7120, the first structural representative of Pfam domain family PF11267. J Struct Funct Genomics. 2012;13:171–176. doi: 10.1007/s10969-012-9135-5.
- Arendall WB, 3rd, Tempel W, Richardson JS, Zhou W, Wang S, Davis IW, Liu ZJ, Rose JP, Carson WM, Luo M, et al. A test of enhancing model accuracy in high-throughput crystallography. J Struct Funct Genomics. 2005;6:1–11. doi: 10.1007/s10969-005-3138-4.
- Bagaria A, Jaravine V, Huang YJ, Montelione GT, Güntert P. Protein structure validation by generalized linear model root-mean-square deviation prediction. Protein Sci. 2012;21:229–238. doi: 10.1002/pro.2007.
- Bhattacharya A, Tejero R, Montelione GT. Evaluating protein structures determined by structural genomics consortia. Proteins. 2007;66:778–795. doi: 10.1002/prot.21165.
- Brunger AT, Clore GM, Gronenborn AM, Saffrich R, Nilges M. Assessing the quality of solution nuclear magnetic resonance structures by complete cross-validation. Science. 1993;261:328–331. doi: 10.1126/science.8332897.
- Bryson M, Tian F, Prestegard JH, Valafar H. REDCRAFT: a tool for simultaneous characterization of protein backbone structure and motion from RDC data. J Magn Reson. 2008;191:322–334. doi: 10.1016/j.jmr.2008.01.007.
- Chen VB, Arendall WB, 3rd, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr. 2010;66:12–21. doi: 10.1107/S0907444909042073.
- Clore GM, Robien MA, Gronenborn AM. Exploring the limits of precision and accuracy of protein structures determined by nuclear magnetic resonance spectroscopy. J Mol Biol. 1993;231:82–102. doi: 10.1006/jmbi.1993.1259.
- Clore GM, Schwieters CD. How much backbone motion in ubiquitin is required to account for dipolar coupling data measured in multiple alignment media as assessed by independent cross-validation? J Am Chem Soc. 2004;126:2923–2938. doi: 10.1021/ja0386804.
- Cornilescu G, Marquardt JL, Ottiger M, Bax A. Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. J Am Chem Soc. 1998;120:6836–6837.
- Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6:197–208. doi: 10.1038/nrm1589.
- Doreleijers JF, Sousa da Silva AW, Krieger E, Nabuurs SB, Spronk CA, Stevens TJ, Vranken WF, Vriend G, Vuister GW. CING: an integrated residue-based structure validation program suite. J Biomol NMR. 2012;54:267–283. doi: 10.1007/s10858-012-9669-7.
- Dunker AK, Silman I, Uversky VN, Sussman JL. Function and structure of inherently disordered proteins. Curr Opin Struct Biol. 2008;18:756–764. doi: 10.1016/j.sbi.2008.10.002.
- Ginzinger SW, Gerick F, Coles M, Heun V. CheckShift: automatic correction of inconsistent chemical shift referencing. J Biomol NMR. 2007;39:223–227. doi: 10.1007/s10858-007-9191-5.
- Grishaev A, Wu J, Trewhella J, Bax A. Refinement of multidomain protein structures by combination of solution small-angle X-ray scattering and NMR data. J Am Chem Soc. 2005;127:16621–16628. doi: 10.1021/ja054342m.
- Gore S, Velankar S, Kleywegt GJ. Implementing an X-ray validation pipeline for the Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 2012;68:478–483. doi: 10.1107/S0907444911050359.
- Han B, Liu Y, Ginzinger SW, Wishart DS. SHIFTX2: significantly improved protein chemical shift prediction. J Biomol NMR. 2011;50:43–57. doi: 10.1007/s10858-011-9478-4.
- Henderson R, Sali A, Baker ML, Carragher B, Devkota B, Downing KH, Egelman EH, Feng Z, Frank J, Grigorieff N, et al. Outcome of the first electron microscopy validation task force meeting. Structure. 2012;20:205–214. doi: 10.1016/j.str.2011.12.014.
- Huang YJ, Powers R, Montelione GT. Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics. J Am Chem Soc. 2005;127:1665–1674. doi: 10.1021/ja047109h.
- Huang YJ, Rosato A, Singh G, Montelione GT. RPF: a quality assessment tool for protein NMR structures. Nucleic Acids Res. 2012;40:W542–W546. doi: 10.1093/nar/gks373.
- Hyberts SG, Goldberg MS, Havel TF, Wagner G. The solution structure of eglin c based on measurements of many NOEs and coupling constants and its comparison with X-ray structures. Protein Sci. 1992;1:736–751. doi: 10.1002/pro.5560010606.
- Kelley LA, Gardner SP, Sutcliffe MJ. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally related subfamilies. Protein Eng. 1996;9:1063–1065. doi: 10.1093/protein/9.11.1063.
- Kelley LA, Gardner SP, Sutcliffe MJ. An automated approach for defining core atoms and domains in an ensemble of NMR-derived protein structures. Protein Eng. 1997;10:737–741. doi: 10.1093/protein/10.6.737.
- Kirchner DK, Güntert P. Objective identification of residue ranges for the superposition of protein structures. BMC Bioinformatics. 2011;12:170. doi: 10.1186/1471-2105-12-170.
- Lange OF, Rossi P, Sgourakis NG, Song Y, Lee HW, Aramini JM, Ertekin A, Xiao R, Acton TB, Montelione GT, et al. Determination of solution structures of proteins up to 40 kDa using CS-Rosetta with sparse NMR data from deuterated samples. Proc Natl Acad Sci U S A. 2012;109:10873–10878. doi: 10.1073/pnas.1203013109.
- Laskowski RA, Rullmann JA, MacArthur MW, Kaptein R, Thornton JM. AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J Biomol NMR. 1996;8:477–486. doi: 10.1007/BF00228148.
- Lindorff-Larsen K, Best RB, Depristo MA, Dobson CM, Vendruscolo M. Simultaneous determination of protein structure and dynamics. Nature. 2005;433:128–132. doi: 10.1038/nature03199.
- Markley JL, Bax A, Arata Y, Hilbers CW, Kaptein R, Sykes BD, Wright PE, Wüthrich K. Recommendations for the presentation of NMR structures of proteins and nucleic acids. Pure Appl Chem. 1998;70:117–142. doi: 10.1023/a:1008290618449. Reprinted: (1998) J Biomol NMR 12, 1–23; (1998) J Mol Biol 280, 933–952.
- Mao B, Guan R, Montelione GT. Improved technologies now routinely provide protein NMR structures useful for molecular replacement. Structure. 2011;19:757–766. doi: 10.1016/j.str.2011.04.005.
- Moseley HN, Sahota G, Montelione GT. Assignment validation software suite for the evaluation and presentation of protein resonance assignment data. J Biomol NMR. 2004;28:341–355. doi: 10.1023/B:JNMR.0000015420.44364.06.
- Nabuurs SB, Spronk CA, Krieger E, Maassen H, Vriend G, Vuister GW. Quantitative evaluation of experimental NMR restraints. J Am Chem Soc. 2003;125:12026–12034. doi: 10.1021/ja035440f.
- Nilges M. Calculation of protein structures with ambiguous distance restraints. Automated assignment of ambiguous NOE crosspeaks and disulphide connectivities. J Mol Biol. 1995;245:645–660. doi: 10.1006/jmbi.1994.0053.
- Raman S, Lange OF, Rossi P, Tyka M, Wang X, Aramini J, Liu G, Ramelot TA, Eletsky A, Szyperski T, et al. NMR structure determination for larger proteins using backbone-only data. Science. 2010;327:1014–1018. doi: 10.1126/science.1183649.
- Read RJ, Adams PD, Arendall WB, 3rd, Brunger AT, Emsley P, Joosten RP, Kleywegt GJ, Krissinel EB, Lutteke T, Otwinowski Z, et al. A new generation of crystallographic validation tools for the protein data bank. Structure. 2011;19:1395–1412. doi: 10.1016/j.str.2011.08.006.
- Rieping W, Vranken WF. Validation of archived chemical shifts through atomic coordinates. Proteins. 2010;78:2482–2489. doi: 10.1002/prot.22756.
- Rosato A, Aramini JM, Arrowsmith C, Bagaria A, Baker D, Cavalli A, Doreleijers JF, Eletsky A, Giachetti A, Guerry P, et al. Blind testing of routine, fully automated determination of protein structures from NMR data. Structure. 2012;20:227–236. doi: 10.1016/j.str.2012.01.002.
- Serrano P, Pedrini B, Mohanty B, Geralt M, Herrmann T, Wüthrich K. The J-UNIO protocol for automated protein structure determination by NMR in solution. J Biomol NMR. 2012;53:341–354. doi: 10.1007/s10858-012-9645-2.
- Sheffler W, Baker D. RosettaHoles: rapid assessment of protein core packing for structure prediction, refinement, design, and validation. Protein Sci. 2009;18:229–239. doi: 10.1002/pro.8.
- Shen Y, Bax A. SPARTA+: a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. J Biomol NMR. 2010;48:13–22. doi: 10.1007/s10858-010-9433-9.
- Snyder DA, Bhattacharya A, Huang YJ, Montelione GT. Assessing precision and accuracy of protein structures derived from NMR data. Proteins. 2005;59:655–661. doi: 10.1002/prot.20499.
- Snyder DA, Montelione GT. Clustering algorithms for identifying core atom sets and for assessing the precision of protein structure ensembles. Proteins. 2005;59:673–686. doi: 10.1002/prot.20402.
- Spronk CA, Nabuurs SB, Bonvin AM, Krieger E, Vuister GW, Vriend G. The precision of NMR structure ensembles revisited. J Biomol NMR. 2003;25:225–234. doi: 10.1023/a:1022819716110.
- Struyf A, Hubert M, Rousseeuw P. Clustering in an object-oriented environment. J Stat Softw. 1997;1:1–30.
- Tejero R, Mao B, Aramini JM, Montelione GT. PDBStat: A universal restraint converter and restraint analysis software package for protein NMR. J Biomol NMR. 2013 (in press). doi: 10.1007/s10858-013-9753-7.
- Trewhella J, Hendrickson WA, Kleywegt GJ, Sali A, Sato M, Schwede T, Svergun DI, Tainer JA, Westbrook J, Berman HM. Report of the wwPDB Small-Angle Scattering Task Force: data requirements for biomolecular modeling and the PDB. Structure. 2013;21:875–881. doi: 10.1016/j.str.2013.04.020.
- Theobald DL, Wuttke DS. THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics. 2006;22:2171–2172. doi: 10.1093/bioinformatics/btl332.
- Theobald DL, Wuttke DS. Accurate structural correlations from maximum likelihood superpositions. PLoS Comput Biol. 2008;4:e43. doi: 10.1371/journal.pcbi.0040043.
- Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, et al. BioMagResBank. Nucleic Acids Res. 2008;36:D402–D408. doi: 10.1093/nar/gkm957.
- Valafar H, Prestegard JH. REDCAT: a residual dipolar coupling analysis tool. J Magn Reson. 2004;167:228–241. doi: 10.1016/j.jmr.2003.12.012.
- Vriend G. WHAT IF: a molecular modeling and drug design program. J Mol Graph. 1990;8:52–56. 29. doi: 10.1016/0263-7855(90)80070-v.
- Vuister GW, Fogh RH, Hendrickx PMS, Doreleijers JF, Gutmanas A. An overview of tools for the validation of protein NMR structures. J Biomol NMR. 2013 (in press). doi: 10.1007/s10858-013-9750-x.
- Wang B, Wang Y, Wishart DS. A probabilistic approach for validating protein NMR chemical shift assignments. J Biomol NMR. 2010;47:85–99. doi: 10.1007/s10858-010-9407-y.
- Wang L, Eghbalnia HR, Bahrami A, Markley JL. Linear analysis of carbon-13 chemical shift differences and its application to the detection and correction of errors in referencing and spin system identifications. J Biomol NMR. 2005;32:13–22. doi: 10.1007/s10858-005-1717-0.
- Wang L, Markley JL. Empirical correlation between protein backbone 15N and 13C secondary chemical shifts and its application to nitrogen chemical shift re-referencing. J Biomol NMR. 2009;44:95–99. doi: 10.1007/s10858-009-9324-0.
