Cryogenic electron microscopy (cryo-EM) has become a widely used technique in structural biology, and an increasing number of biomolecular structures have been determined and stored in the public databases PDB1 and EMDB2. Despite the technique's popularity and increased rigor, errors in building models from cryo-EM maps occur more frequently than one might think. Thus, establishing quality assessment methods has become a crucial and urgent task for biomolecular structure determination with cryo-EM3. To address this need, we recently developed a quality assessment method, the Deep-learning-based Amino acid-wise model Quality (DAQ) score, which detects map-model compatibility outliers using deep learning4.
Here, we report the DAQ-Score Database (https://daqdb.kiharalab.org/), which provides precomputed quality assessment results for protein models deposited in PDB and the corresponding cryo-EM maps in EMDB. The DAQ score quantifies incompatibilities between residue assignments in protein structure models and local EM map features, which are detected by a deep neural network (Figure 1). The neural network detects specific map features for protein amino acid residue types, Cα atoms, and secondary structures, and computes the likelihood that each residue assignment is correct. DAQ(AA) detects potential errors in amino acid sequence (residue type) assignment, including assignments to an otherwise correct main-chain trace. DAQ(ATOM) detects local regions where the main-chain is likely traced incorrectly. DAQ(SS) is aimed at detecting incompatibility of modeled secondary structure.
Fig. 1: Overview of data stored in DAQ-Score Database.

a, the protocol for computing the DAQ score of a protein structure model from a cryo-EM map. The DAQ score quantifies how well the model agrees with the position-wise probabilities computed from the EM map by a deep neural network. The network scans the EM map with an 11 × 11 × 11 Å³ box and outputs the probabilities of amino acid type, Cα position, and secondary structure type for the center position of the box. Then, DAQ(AA), DAQ(ATOM), and DAQ(SS) are calculated across the entire model. Computed scores for the model are stored in the database. b, on each entry page of the DAQ-Score Database, the DAQ score along the protein structure model is visualized with a color code from red (low) to blue (high) in the model viewer5. Users can select a local region in the sequence panel to highlight that region in magenta in the structure viewer. The three DAQ scores are also shown along the sequence in an interactive graph. The first version of the model for PDB ID: 7JSN Chain B (DAQ-Score DB ID: 22458_7jsn_B_v1-1) was used for illustration. c, changes in the number of low-scoring residues in protein chain models that have two versions in PDB. Out of 9,306 protein chain structures that have two versions in PDB, we analyzed the 714 protein chains that have at least one residue with a DAQ score change between the two versions. Low-scoring residues, defined as those with DAQ(AA) < −0.5 or DAQ(ATOM) < −0.5, were counted for each model. The plot shows the difference (number of low-scoring residues in the initial model) − (number in the updated model). The change is depicted by color: fewer residues with scores below the −0.5 threshold for both DAQ(AA) and DAQ(ATOM) in blue (104 chains); more residues with poor scores for both DAQ(AA) and DAQ(ATOM) in red (23 chains); more residues with poor DAQ(AA) and fewer with poor DAQ(ATOM) in light brown (40 chains); fewer residues with poor DAQ(AA) and more with poor DAQ(ATOM) in cyan (41 chains); and no change in the number of residues below the threshold in white (506 chains). d, bubble plot showing the breakdown of the protein chains that have 0, 1–9, and more than 10 low-scoring residues, defined as DAQ(AA) < −0.5 and DAQ(ATOM) < −0.5. If a chain has multiple versions of models, the latest model was used. Thus, the total number of the latest chain models used in this plot is 122,978. A bubble's area is proportional to the number of chain models.
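The box scan described in panel a can be illustrated with a minimal Python sketch. It assumes the map has been resampled onto a 1 Å grid, so that an 11-voxel cube spans 11 Å; the function name `iter_boxes`, the `stride` parameter, and the toy random map are hypothetical, and the neural network that would consume each box is not shown.

```python
import numpy as np


def iter_boxes(density, box=11, stride=1):
    """Yield (center_index, sub_volume) for every full cubic box in the map.

    Assumption: `density` is a 3D array sampled at 1 Angstrom per voxel,
    so a box of 11 voxels corresponds to 11 Angstrom on each side.
    """
    half = box // 2
    nz, ny, nx = density.shape
    for z in range(half, nz - half, stride):
        for y in range(half, ny - half, stride):
            for x in range(half, nx - half, stride):
                sub = density[
                    z - half : z + half + 1,
                    y - half : y + half + 1,
                    x - half : x + half + 1,
                ]
                yield (z, y, x), sub


# Toy 13 x 13 x 13 map filled with random values (not real density).
toy_map = np.random.default_rng(0).random((13, 13, 13))
boxes = list(iter_boxes(toy_map))
first_center, first_box = boxes[0]
```

Each yielded sub-volume would be passed to a predictor that returns probabilities for the box center; only positions where a full box fits inside the map are visited in this sketch.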
The DAQ-Score Database currently contains 132,492 protein chain models from 8,705 PDB entries that were derived from cryo-EM maps. We plan to update the database once a month. We targeted protein models from maps at a resolution between 2.5 Å and 5.0 Å, inclusive, since this is the range where the neural network can effectively detect local structural features from a map. We also limited the analysis to protein models that have sufficient overlap with their corresponding maps (Supplementary Information 1; Supplementary Fig. S1). The lists of maps included in and excluded from the analysis are provided on the database website. In addition to the current version of each PDB entry, we have also computed DAQ scores for previous major versions of models if they exist.
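As a small illustration of the inclusive resolution criterion, the following Python sketch filters a set of maps by reported resolution. The map identifiers and resolution values are made up for the example, and the map-model overlap criterion (defined in Supplementary Information 1) is not reproduced here.

```python
def in_target_resolution(resolution_angstrom, low=2.5, high=5.0):
    """Return True if the resolution lies in the inclusive range [low, high]."""
    return low <= resolution_angstrom <= high


# Hypothetical map identifiers with hypothetical resolutions in Angstrom.
maps = {"map_a": 2.4, "map_b": 2.5, "map_c": 3.8, "map_d": 5.0, "map_e": 6.2}

selected = [name for name, res in maps.items() if in_target_resolution(res)]
```

Both boundary values, 2.5 Å and 5.0 Å, are kept because the range is stated as inclusive.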
Entries can be searched by PDB ID, EMDB ID, or keywords. As shown in Figure 1b, the computed DAQ scores are visualized with a color code on a protein structure model in an interactive structure viewer, coupled with a graph that shows the DAQ score along the residue position. The graph dynamically interacts with the model structure in the viewer.
In the original DAQ score work, we showed that residues with a DAQ score less than −0.5 are highly likely to have been incorrectly placed in the map4 (Supplementary Information 2). As an illustration, in Figure 1c, we analyzed protein chain structures in PDB that have two versions and compared DAQ(AA) and DAQ(ATOM) of the two models. We counted the number of residues that have DAQ(AA) or DAQ(ATOM) scores less than −0.5 and plotted the improvement between the two models for chains that have at least one residue with a DAQ score change. In the plot, the decrease in the count of such low-scoring residues is plotted for each of the two scores; thus, a positive number indicates an improvement in terms of that score. An updated structure often shows a large decrease in the number of low-scoring residues, as seen in the first quadrant of the plot, which suggests an improvement in model quality. The same plot with a −1.0 cutoff is shown in Supplementary Figure S2.
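The version comparison can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to produce Figure 1c: the function names, the dictionary layout of per-residue scores, and the example values are hypothetical, and the four-way summary label is a simplification that does not reproduce the figure's exact color assignment for cases where one of the two counts is unchanged.

```python
THRESHOLD = -0.5  # cutoff below which a residue is counted as low-scoring


def count_low(scores, threshold=THRESHOLD):
    """Count residues whose score is strictly below the threshold."""
    return sum(1 for s in scores if s < threshold)


def summarize_change(d_aa, d_atom):
    """Label a pair of decreases in low-scoring counts (simplified scheme)."""
    if d_aa == 0 and d_atom == 0:
        return "unchanged"
    if d_aa >= 0 and d_atom >= 0:
        return "improved"
    if d_aa <= 0 and d_atom <= 0:
        return "worsened"
    return "mixed"


def compare_versions(initial, updated):
    """Return (decrease in low DAQ(AA), decrease in low DAQ(ATOM), label).

    A positive decrease means the updated model has fewer low-scoring residues.
    """
    d_aa = count_low(initial["AA"]) - count_low(updated["AA"])
    d_atom = count_low(initial["ATOM"]) - count_low(updated["ATOM"])
    return d_aa, d_atom, summarize_change(d_aa, d_atom)


# Made-up per-residue scores for two versions of one four-residue chain.
initial = {"AA": [-0.8, -0.6, 0.2, 0.5], "ATOM": [-0.7, 0.1, 0.3, -0.2]}
updated = {"AA": [0.1, -0.6, 0.3, 0.5], "ATOM": [0.2, 0.1, 0.3, -0.2]}

result = compare_versions(initial, updated)
```

In this toy case the updated model has one fewer low-scoring residue for each score, which would place it in the first quadrant of a plot like Figure 1c.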
Figure 1d shows a breakdown of all entries in the database by the number of low-scoring residues in any given model. A total of 132,492 chain model entries are classified by the number of low-scoring residues with DAQ(AA) and DAQ(ATOM) scores less than −0.5. Of these, 7.8% have 1 to 9 low-scoring residues. Moreover, 8.7% of chain models have over 10 low-scoring residues. These models require attention to resolve possible modeling errors because residues with DAQ < −0.5 are highly likely to be incorrect. While 83.5% of the chain models have no residues with a substantially low DAQ score, this does not guarantee that these models are truly error-free, as incorrect residues can have a DAQ score higher than −0.5 (see discussion in Supplementary Information 2). In particular, when the map resolution is low, DAQ tends to have a small absolute value close to 0, because it is difficult for the neural network to make a decisive call. In Supplementary Fig. S3, we provide the same plot with a score cutoff of −1.0.
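The grouping behind a breakdown like Figure 1d can be sketched in a few lines of Python. The chain names and counts are invented for illustration, and placing a chain with exactly 10 low-scoring residues in the top bin is an assumption, since the text describes that bin only as "over 10".

```python
from collections import Counter


def bin_label(n_low):
    """Assign a chain to a bin by its number of low-scoring residues.

    Assumption: exactly 10 low-scoring residues falls into the top bin.
    """
    if n_low == 0:
        return "0"
    if n_low <= 9:
        return "1-9"
    return "10+"


# Hypothetical per-chain counts of low-scoring residues.
low_counts = {"chain_a": 0, "chain_b": 3, "chain_c": 12, "chain_d": 0, "chain_e": 10}

breakdown = Counter(bin_label(n) for n in low_counts.values())
fractions = {label: count / len(low_counts) for label, count in breakdown.items()}
```

The resulting fractions play the role of the percentages quoted in the text, computed here over the toy set of five chains.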
In summary, the DAQ-Score Database provides precomputed quality assessments of protein models from cryo-EM maps that are stored in PDB. The DAQ score can also be computed for a protein model on the Google Colab site6 or by downloading the software7 and running it locally. We believe this resource will be valuable for the structural biology community.
Supplementary Material
Acknowledgments
The authors are grateful to C. Christoffer for proofreading the manuscript. This database is currently financially supported by grants from the National Institutes of Health (R01GM133840, 3R01GM133840-02S1). D.K. also acknowledges support from the National Science Foundation (CMMI1825941, MCB1925643, DBI2146026, IIS2211598, DMS2151678, and DBI2003635). X.W. is a recipient of the MolSSI graduate fellowship.
Footnotes
Code availability
Source code for computing DAQ score is available at https://github.com/kiharalab/DAQ.
Competing interests
The authors declare no competing interests.
Data availability
All precomputed DAQ score data is available at https://daqdb.kiharalab.org/. The DAQ-Score Database will be updated and maintained by the Kihara lab, with possible future assistance from system managers and other structural biology researchers at Purdue.
References
- 1. Berman HM et al. The Protein Data Bank. Nucleic Acids Research 28, 235–242 (2000).
- 2. Lawson CL et al. EMDataBank unified data resource for 3DEM. Nucleic Acids Research 44, D396–403, doi: 10.1093/nar/gkv1126 (2016).
- 3. Lawson CL et al. Cryo-EM model validation recommendations based on outcomes of the 2019 EMDataResource challenge. Nature Methods 18, 156–164, doi: 10.1038/s41592-020-01051-w (2021).
- 4. Terashi G, Wang X, Maddhuri Venkata Subramaniya SR, Tesmer JJG & Kihara D. Residue-wise local quality estimation for protein models from cryo-EM maps. Nature Methods 19, 1116–1125, doi: 10.1038/s41592-022-01574-4 (2022).
- 5. Sehnal D et al. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Research 49, W431–W437, doi: 10.1093/nar/gkab314 (2021).
- 6. DAQ Score Google Colab: https://bit.ly/daq-score
- 7. GitHub repository of DAQ Score software: https://github.com/kiharalab/DAQ
