3dDNAscoreA: A scoring function for evaluation of DNA 3D structures

Yi Zhang; Chenxi Yang; Yiduo Xiong; Yi Xiao

doi:10.1016/j.bpj.2024.02.018

2024 Feb 26;123(17):2696–2704. doi: 10.1016/j.bpj.2024.02.018

PMCID: PMC11393702 PMID: 38409781

Abstract

DNA molecules are vital macromolecules that play a fundamental role in many cellular processes and have broad applications in medicine. For example, DNA aptamers have been rapidly developed for diagnosis, biosensors, and clinical therapy. Recently, we proposed a computational method of predicting DNA 3D structures, called 3dDNA. However, it lacks a scoring function to evaluate the predicted DNA 3D structures, and so they are not ranked for users. Here, we report a scoring function, 3dDNAscoreA, for evaluation of DNA 3D structures based on a deep learning model ARES for RNA 3D structure evaluation but using a new strategy for training. 3dDNAscoreA is benchmarked on two test sets to show its ability to rank DNA 3D structures and select the native and near-native structures.

Significance

To evaluate the performance of prediction methods of all-atom 3D structures of DNA, accurate all-atom scoring functions for assessing the accuracy of DNA structure models are needed. However, no such functions are available currently. To address this gap, we developed a deep learning model, 3dDNAscoreA.

Introduction

The determination of the double helical structure of DNA in 1953 (1) remains the landmark event in the development of modern biological and biomedical science. Since then, a variety of 3D structures of DNAs have been found, and they play important roles in many biological processes and have broad applications in medicine; e.g., DNA aptamers have been rapidly developed for diagnosis, biosensors, and clinical therapy (2,3,4,5,6,7,8). So far, DNA 3D structures have been mainly obtained by experimental approaches, including nuclear magnetic resonance, x-ray crystallography, and cryogenic electron microscopy (9,10,11), but the number of the determined DNA 3D structures is still limited. To bridge the gap between the numbers of sequences and structures, it is essential to develop computational methods of predicting DNA 3D structures. In particular, this is also very helpful for the selection of DNA aptamers, which typically takes 2 to 4 months to identify the target aptamer from a candidate pool by using the SELEX technique (12).

Until now, only a few methods have been proposed to predict the 3D structures of DNAs from sequences. These methods fall into three categories. The first comprises molecular-dynamics-based methods such as HiRe-RNA (13), NARES-2P (14), and DNA_CG (15), which can only predict coarse-grained structures of small DNA molecules. The second is the indirect method (16,17), which first uses RNA 3D structure prediction methods like Assemble (18) or RNAComposer (19) to model the corresponding RNA structures and then mutates the base “U” to “T” to derive DNA 3D structures. The third is the direct method, which uses a template-based approach to predict 3D structures of DNAs from their sequences and secondary structures. 3dDNA (20), proposed by us, is a fully automated method of predicting all-atom 3D structures of DNAs directly. This method is an extension of our RNA 3D structure prediction method 3dRNA (21,22). However, 3dDNA lacks an all-atom scoring function to rank and evaluate predicted DNA 3D structures. Therefore, we need to develop a scoring function to select the proper ones from the predicted candidates of DNA 3D structures.

In the last decade, a number of scoring functions have been proposed for RNA 3D structure evaluation using different metrics such as RMSD (23), INF (24), DI (24), MCQ (25), etc. They include knowledge-based statistical potentials (RASP (26), RNA KB potentials (27), 3dRNAscore (28), DFIRE-RNA (29), rsRNASP (30), RNAssess (31), and RNAlyzer (32)) and machine learning-based scoring functions (RNA3DCNN (33) and ARES (34)). Among them, the machine learning model ARES was shown to learn efficiently from only 18 RNAs with 1000 decoys each. Since the number of determined DNA 3D structures is very limited, the machine learning framework of ARES is therefore a suitable choice for constructing a scoring function for the evaluation of DNA 3D structures.

In this work, we report a scoring function, 3dDNAscoreA, for evaluation of DNA 3D structures. It is based on the deep learning model ARES for RNA 3D structure evaluation but incorporates a new training strategy that uses the DNA training sets within root mean-square deviation (RMSD) thresholds for model training. Benchmarks show that 3dDNAscoreA has considerable ability to identify the native and near-native structures.

Materials and methods

The pipeline of 3dDNAscoreA is shown in Fig. 1. Our method consists of four parts, which are data decoy preparation, training set preparation, network architecture, and model training.

Figure 1. The architecture of 3dDNAscoreA for DNA 3D structure assessment. To see this figure in color, go online.

Data decoy preparation

First, all 9132 DNA structures deposited in the PDB (35) as of January 20, 2023, were collected. Second, we removed the chains whose structures are complexed with DNAs, RNAs, or proteins and those with an x-ray resolution >4 Å. Third, CD-HIT-EST (36) with a similarity threshold of 85% and RNA-align (37) with TM-score >0.5 were used to reduce sequence and structure redundancies. Finally, the remaining structures were filtered to eliminate those unable to form basepairs as calculated by the 3DNA program (38). After applying these filters, the final data set contained 148 DNA chains, which were used to construct the training and test sets. In practice, the training samples were generated in two stages: molecular dynamics (MD) simulations and structural clustering. A detailed description follows.

In the MD phase, for each of the 148 DNAs, a 100-ns simulated annealing MD simulation was run with the software Amber (18), with the temperature gradually rising from 200 to 500 K. The MD trajectory is first divided into 10 parts based on the RMSD of each frame relative to the native state, in intervals of 1.0 Å up to 10.0 Å. To pick 500 distinct structures from this set, a two-step clustering approach was employed in the subsequent clustering phase. First, we employed hierarchical clustering with bins at 1-Å intervals to cluster the structures in each of the 10 parts. This step reduces structural redundancy within each part and also greatly improves the efficiency of the subsequent calculations. Second, we employed k-means clustering with a specified list of cluster numbers, (5, 15, 35, 50, 60, 70, 80, 75, 60, 50), one for each part. The k-means clustering algorithm ensures the selection of the most representative structures for each interval according to the specified number of clusters. In the end, for each DNA, we obtained a collection of 500 structures that are distributed over the whole 0–10 Å RMSD range and exhibit maximum differentiation.
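
The bookkeeping of this selection step can be illustrated with a minimal sketch. It only shows the RMSD binning and the per-bin quotas taken from the text. The function names are hypothetical, and the clustering itself is replaced by a simple evenly spaced pick, so this is not the procedure actually used to build the data set.

```python
# Hypothetical sketch: bin MD frames by RMSD to the native structure and
# pick a fixed number of frames from each 1-Å bin.
QUOTAS = [5, 15, 35, 50, 60, 70, 80, 75, 60, 50]  # cluster numbers from the text


def bin_by_rmsd(rmsds, width=1.0, n_bins=10):
    """Return n_bins lists of frame indices; bin i holds i*width <= RMSD < (i+1)*width."""
    bins = [[] for _ in range(n_bins)]
    for idx, r in enumerate(rmsds):
        b = int(r // width)
        if 0 <= b < n_bins:  # frames beyond the last bin are discarded
            bins[b].append(idx)
    return bins


def pick_evenly(indices, k):
    """Stand-in for clustering (assumption): take up to k evenly spaced members."""
    if len(indices) <= k:
        return list(indices)
    step = len(indices) / k
    return [indices[int(i * step)] for i in range(k)]


def select_decoys(rmsds, quotas=QUOTAS):
    """Return one list of selected frame indices per RMSD bin."""
    bins = bin_by_rmsd(rmsds, n_bins=len(quotas))
    return [pick_evenly(members, k) for members, k in zip(bins, quotas)]
```

The quotas sum to 500, which matches the number of decoys kept per DNA. In a real pipeline, `pick_evenly` would be replaced by the hierarchical and k-means clustering described above.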

Training sets

The 148 DNAs are randomly divided into two sets: 113 DNAs form the training set, and 35 DNAs form the test set.

Usually, scoring functions based on machine learning use the RMSD relative to the native structure as a label to evaluate the difference between two structures (see below); i.e., structures with similar RMSDs relative to their native structure are considered to have the same score. However, the structural differences between two structures with similar RMSDs relative to the native structure may be large, and these differences become more pronounced as the RMSDs increase, as clearly illustrated in Fig. 2. Fig. 2a presents the distribution of the RMSDs between candidate structures with similar RMSDs relative to the native structures in the 148 DNAs. Here, the RMSDs of the candidate structures relative to their native structures are divided into five bins from 0 to 10 Å in intervals of 2 Å, and the structures in the same bin are considered to have similar RMSDs. To minimize the dissimilarity between structures with similar RMSDs, we trained the network model with a new training strategy that uses subsets of the training set in which the RMSDs of the structures relative to their native structures are smaller than a threshold. Because neural network training needs an adequate number of samples, we trained the neural network on training subsets with RMSD thresholds of 4 Å, 5 Å, 6 Å, 7 Å, and 8 Å as well as on the total training set. This allows us to evaluate the impact of different training sets on the model’s performance. Afterward, we select the most appropriate network model for evaluating the test set.
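
As an illustration of this strategy, the following minimal sketch builds the six training subsets from a list of decoys. The data layout (pairs of decoy name and RMSD) and the function names are assumptions made for the example, and the inclusive cutoff follows the description of Model_4 (RMSD ≤ 4 Å) given in the Results.

```python
# Hypothetical sketch: build RMSD-thresholded training subsets.
# None stands for the total training set (no threshold).
THRESHOLDS = [4.0, 5.0, 6.0, 7.0, 8.0, None]


def training_subset(decoys, threshold):
    """decoys: list of (name, rmsd_to_native) pairs; keep RMSD <= threshold."""
    if threshold is None:
        return list(decoys)
    return [d for d in decoys if d[1] <= threshold]


def build_subsets(decoys, thresholds=THRESHOLDS):
    """Map a subset label (4, 5, ..., 8, or 'all') to its list of decoys."""
    return {
        ("all" if t is None else int(t)): training_subset(decoys, t)
        for t in thresholds
    }
```

Each subset is nested in the next larger one, so a smaller threshold always means fewer training samples.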

Figure 2. Dissimilarity of structures with similar RMSD. (a) The structural differences (RMSD) between two candidate structures with similar RMSDs relative to their native structure. Here, the RMSDs of the candidate structures relative to their native structures are divided into five bins from 0 to 10 Å in intervals of 2 Å, and the structures in the same bin are considered to have similar RMSDs. (b) An example shows two candidate structures 6VUG_362 (*blue*) and 6VUG_268 (*yellow*) that have similar RMSDs (6–8 Å) relative to the native structure, but the structural differences between them are very large (RMSD = 13.02 Å). To see this figure in color, go online.

Test sets

We tested 3dDNAscoreA on two benchmark data sets. Test set I consists of 35 DNA molecules randomly selected from the 148 DNA structures. For each DNA, there is a native structure and 500 nonnative decoys whose RMSDs are distributed within the range of 0–10 Å. Test set II comprises 31 DNA molecules from our previous work (20). For each DNA, one assembled structure and 10 optimized structures were generated using 3dDNA (21). It is worth noting that this test set does not contain native structures, so we do not need to remove redundancy between this test set and the training set.

Loss function

During the 3dDNAscoreA training, the Huber loss (39) was used to evaluate the difference between predicted RMSD (score) and the true RMSD, which was defined as follows:

$$\mathrm{Loss} = \begin{cases} 0.5\,(R - S)^{2}, & |R - S| < 1 \\ |R - S| - 0.5, & \text{otherwise} \end{cases}$$

Here, R is the true RMSD between the candidate and native structures, and S is the predicted RMSD. The loss function defines the optimization goal of the neural network, which tells 3dDNAscoreA how to adjust the weights and parameters during the training process to reduce the loss. However, considering that the structural differences between candidate structures corresponding to the same or similar RMSD are still very large, it may be difficult for 3dDNAscoreA to learn the correct structural relationship and map different structures to the same RMSD value, which may lead to unstable training or failure to converge. To address this issue, we train the neural network through a new training strategy that minimizes the differences between candidate structures with similar RMSDs (see above).
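
For concreteness, the loss for a single structure can be written as a few lines of Python. This is a minimal sketch of the Huber loss with the transition point fixed at 1, as in the definition above; the function and parameter names are illustrative and not taken from the released code.

```python
def huber_loss(true_rmsd, predicted_rmsd, delta=1.0):
    """Huber loss between the true RMSD (R) and the predicted RMSD (S).

    Quadratic for |R - S| < delta, linear otherwise. With delta = 1 this
    reduces to 0.5*(R - S)**2 and |R - S| - 0.5, respectively.
    """
    diff = abs(true_rmsd - predicted_rmsd)
    if diff < delta:
        return 0.5 * diff ** 2
    return delta * (diff - 0.5 * delta)
```

For example, a prediction that is off by 0.5 Å contributes 0.125 to the loss, whereas one that is off by 3 Å contributes 2.5, so large errors are penalized linearly and not quadratically.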

Model training

Depending on the RMSD distribution of the training samples, six neural networks were trained based on six sets of training samples. During model training, in each epoch, we divide the constructed data into training and validation sets in a ratio of 9:1. All training is performed on an A100 GPU, and we set the batch size to 16. We choose Huber loss as the training loss function and use the Adam optimizer to update the model parameters. The initial learning rate is set to 0.01. If the loss on the validation set does not decrease within five consecutive epochs, we reduce the learning rate by 0.1. The model with the smallest loss on the validation set will be selected as the final model for this training.
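
The learning-rate rule and the model selection criterion can be sketched as follows. This is a framework-independent illustration and not the training code: the class name is hypothetical, and reading "reduce the learning rate by 0.1" as multiplication by a factor of 0.1 is an assumption.

```python
class PlateauSchedule:
    """Track the validation loss and shrink the learning rate on a plateau."""

    def __init__(self, lr=0.01, factor=0.1, patience=5):
        self.lr = lr                # initial learning rate from the text
        self.factor = factor        # assumed multiplicative reduction
        self.patience = patience    # epochs without improvement before reducing
        self.best = float("inf")    # smallest validation loss seen so far
        self.stale_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the current lr."""
        if val_loss < self.best:
            self.best = val_loss    # this epoch's weights would be kept as the final model
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr
```

Calling `step` after every epoch keeps track of the smallest validation loss, which is also the criterion used to choose the final model for each training set.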

Results and discussion

Evaluation metrics

In general, the effectiveness of a scoring function is measured by its ability to identify native and near-native structures from a pool of decoy structures and to appropriately rank these decoys. To evaluate the performance of 3dDNAscoreA, we employed a set of evaluation metrics for the test sets. Among the most commonly used indicators are the number of native conformations ranked with the smallest score (here, the predicted RMSD) and the number of native structures found within the top five or top 10 structures with the smallest predicted RMSDs when native structures are mixed with decoy structures. In real-world molecular modeling, a more pertinent task is to distinguish near-native models from nonnative ones. Consequently, the precise ranking of near-native structures offers a more practical test for assessing energy scoring methods. The metrics used for this evaluation include the enrichment score (ES) and the Pearson correlation coefficient (PCC). The ES is defined as follows:

$$ES = \frac{|E_{\mathrm{top10\%}} \cap R_{\mathrm{top10\%}}|}{0.01 \times N_{\mathrm{decoys}}},$$

where $|E_{\mathrm{top10\%}} \cap R_{\mathrm{top10\%}}|$ is the number of structures in the intersection of the top 10% of the structures with the lowest energy and the top 10% of the near-native decoy structures (those with the lowest RMSD). $N_{\mathrm{decoys}}$ represents the total count of decoy structures for each test DNA. The ES value ranges from 0 to 10, with 10 indicating optimal scoring performance and 1 corresponding to a random ranking. The PCC is defined as follows:

$$PCC = \frac{\sum_{n = 1}^{N_{\mathrm{decoys}}} (E_{n} - \bar{E}) (R_{n} - \bar{R})}{\sqrt{\sum_{n = 1}^{N_{\mathrm{decoys}}} {(E_{n} - \bar{E})}^{2}} \sqrt{\sum_{n = 1}^{N_{\mathrm{decoys}}} {(R_{n} - \bar{R})}^{2}}}$$

In the above equation, $E_{n}$ and $R_{n}$ are the energy and RMSD of the nth DNA structure, respectively, and $\bar{E}$ and $\bar{R}$ are the average energy and RMSD of the decoys of each DNA. In general, PCC ranges from −1 to 1. A perfectly linear relationship between energies and RMSDs (PCC equals 1) would indicate flawless performance.
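
Both metrics are straightforward to compute for one DNA from the lists of scores and RMSDs of its decoys. The sketch below is a plain-Python illustration under two assumptions: the ES numerator counts the structures shared by the two top-10% sets, following the wording "number of intersection structures", and ties are broken by input order. The function names are hypothetical.

```python
import math


def enrichment_score(scores, rmsds):
    """ES: overlap of the 10% lowest-score and 10% lowest-RMSD decoys over 0.01*N."""
    n = len(scores)
    k = max(1, round(0.1 * n))  # size of each top-10% set
    by_score = sorted(range(n), key=lambda i: scores[i])[:k]
    by_rmsd = sorted(range(n), key=lambda i: rmsds[i])[:k]
    overlap = len(set(by_score) & set(by_rmsd))
    return overlap / (0.01 * n)


def pearson(scores, rmsds):
    """Pearson correlation coefficient between scores (E_n) and RMSDs (R_n)."""
    n = len(scores)
    mean_e = sum(scores) / n
    mean_r = sum(rmsds) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(scores, rmsds))
    var_e = sum((e - mean_e) ** 2 for e in scores)
    var_r = sum((r - mean_r) ** 2 for r in rmsds)
    return cov / math.sqrt(var_e * var_r)
```

With 500 decoys, each top-10% set holds 50 structures and the denominator is 5, so a perfect overlap gives ES = 10.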

Note that in test set I, each DNA has 500 decoys, making it straightforward to assess the performance of 3dDNAscoreA using the aforementioned evaluation metrics. In test set II, however, the decoys for each DNA are too few to yield meaningful ES and PCC values.

Performance of 3dDNAscoreA on test set I

Six models of scoring functions were trained, and the main difference between them was the training set used. These trained models were named as follows: Model_4, Model_5, Model_6, Model_7, Model_8, and Model_all, where Model_i was trained by using the training subset with the RMSD threshold i Å and Model_all by using the total training set. For example, Model_4 was trained by using those DNA structures with RMSD $\leq$ 4 Å relative to their native structures in the total training set. Test set I consists of 35 DNAs with 500 decoy structures generated by MD, and the RMSDs of decoy structures are distributed in a range of 0–10 Å. The specific scoring results in the test set are provided in Table S1.

First, Fig. 3 shows the ability of each trained model to select native structures in the first 50 epochs of network training. The number of native structures ranked top one by Model_4 to Model_8 is 19, 24, 25, 13, and 11, respectively, and that by Model_all is 6. As expected, Model_all gives the worst performance among these models. However, a smaller RMSD threshold for the training samples does not necessarily give better performance in selecting native structures. For example, Model_4 ranks third in its ability to pick native structures, whereas Model_6 has the highest success rate among all models. One reason is that for many DNAs, some decoys have an RMSD of less than 1 Å, making them nearly indistinguishable from the native structure. When these near-native structures are also counted as native structures, the top-one success rate of each model increases greatly, as illustrated in Fig. 3: from 54%, 69%, 71%, 37%, 31%, and 17% to 89%, 94%, 97%, 71%, 54%, and 46% for Model_4 to Model_8 and Model_all, respectively. Another reason may be that a smaller RMSD threshold yields a smaller training set (Fig. S1).

Figure 3. The number of native structures detected from decoys by six models of scoring functions (Model_4 to Model_8 and Model_all). Here, top1, top5, and top10 mean the number of native structures recognized within the top one, the top five, and the top 10 of the lowest energies for all models. The Model_i_r1 denotes the results when the near-native structures with their RMSDs less than 1 Å relative to their native structures are also considered as the native structures. To see this figure in color, go online.

Fig. 3 also shows the number of native structures identified within the top five and the top 10 of the lowest energies for all models. Model_all_r1 is still the worst at identifying native structures, selecting 30 and 31 native structures within the top five and top 10 structures, respectively. In contrast, Model_4_r1, Model_5_r1, and Model_6_r1 perform equally well, with a 100% success rate in selecting native structures within both the top five and the top 10 structures. These results show that the models trained on a training set with a smaller RMSD threshold can perform better in selecting native structures.
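
The top-k counts discussed above can be reproduced for one DNA with a short sketch. It assumes that a lower score is better and that the native structure is stored together with the decoys; the function names and the data layout are hypothetical.

```python
def native_in_top_k(scores, native_index, k):
    """True if the native structure is among the k lowest-score structures."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return native_index in ranked[:k]


def near_native_in_top_k(scores, rmsds, k, cutoff=1.0):
    """Relaxed criterion (the "_r1" variant): any structure with RMSD < cutoff counts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return any(rmsds[i] < cutoff for i in ranked[:k])


def success_rate(hits, total):
    """Success rate as a rounded percentage, e.g., 19 hits out of 35 DNAs."""
    return round(100 * hits / total)
```

Summing `native_in_top_k` over the 35 DNAs of test set I gives the top1, top5, and top10 counts, and `success_rate(19, 35)` gives the 54% reported for Model_4.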

Scoring functions face a greater challenge in detecting near-native structures in real-world DNA structure prediction scenarios, primarily because native structures are typically unknown. Therefore, the average ES and PCC values for test set I were calculated. As shown in Fig. 4, the mean ES values of the 35 DNAs are 5.50, 5.81, 6.70, 6.64, 6.56, and 6.47 for Model_4 to Model_8 and Model_all, respectively, and the mean PCC values are 0.71, 0.80, 0.81, 0.81, 0.83, and 0.85, respectively. The ES results show that Model_6 has the best average performance, in line with the results for identifying native structures. However, Model_all does not perform as poorly here as it does in identifying native structures, and it falls only slightly below Model_6. Additionally, in terms of PCC, which assesses the correlation between RMSD and energy values for each DNA, Model_all has the best performance, and the average PCC never decreases from Model_4 to Model_all (Model_6 and Model_7 are tied). These results seem to show that the models trained on data sets with larger RMSD thresholds perform better than those with smaller ones. However, careful inspection of the funnel plots in Fig. 5 (and Figs. S2–S13 for the complete test set I) indicates that, for near-native structures, the models trained on data sets with smaller RMSD thresholds perform much better than those with larger ones. For example, Model_4 can distinguish the near-native structures with RMSD $\leq$ 4 Å, whereas Model_all cannot. In practical scoring scenarios, our primary interest lies in identifying near-native structures, so the models trained on data sets with smaller RMSD thresholds are well suited for this purpose. Fig. 6 shows the ability of 3dDNAscoreA to select near-native structures within the top one, top five, and top 10 by using RMSD metrics. It again shows that Model_6 has the best performance, especially for top-one ranking.

Figure 4. The average values of (a) ES and (b) PCC by six scoring models in the first 50 epochs and the first 100 epochs. To see this figure in color, go online.

Figure 5. The score-RMSD scatterplots for DNA 2M8Z by six scoring functions. The values of ES and PCC in each plot are annotated in black. The black dashed line is used to distinguish test candidate structures based on the RMSD of the model training set. To see this figure in color, go online.

Figure 6. The ability of 3dDNAscoreA to select near-native structures within top one, top five, and top 10 by using RMSD metrics for test set I. The “rmsd_lowest” denotes the average RMSD of the lowest-RMSD structures among the decoys of each DNA in test set I. To see this figure in color, go online.

Performance of 3dDNAscoreA on test set II

3dDNA (20) is a method proposed recently by us to predict the 3D structures of DNAs. For a target DNA, 3dDNA can give assembled and optimized structures. The assembled structure is built from the 3D templates for each smallest secondary element of the target DNA and minimized with Amber to avoid atomic clashes. It can be further optimized by the residue-level simulated annealing Monte Carlo method and a residue-level energy function to give optimized structures (40). However, 3dDNA lacks a scoring function to evaluate the predicted structures. Therefore, test set II mainly focuses on the performance of 3dDNAscoreA in ranking the structures predicted by 3dDNA in practice. This test set includes 31 DNA molecules. In real-world prediction scenarios, the native structures of these DNAs are unknown, so each DNA in the set consists of one assembled structure and 10 optimized structures generated by 3dDNA. This ensures that there is no overlap between this test set and the training set.

Fig. 7 shows the ability of 3dDNAscoreA to select near-native structures within the top one and top five by using RMSD metrics. The average RMSD of the top-one structures and that of the lowest-RMSD structures among the top five structures ranked by 3dDNAscoreA are 3.90 Å and 3.17 Å, respectively, and that of the structures with the lowest RMSDs among the 11 decoys is 2.83 Å. The former two values are close to the latter, especially that of the top five. The relatively higher RMSD for “top one” is due only to the larger RMSDs for a few DNAs. For most DNA molecules, the ability of 3dDNAscoreA to select the near-native structures is strong, as depicted in Fig. 7. Therefore, 3dDNAscoreA can effectively guide 3dDNA toward accurate DNA 3D structure prediction.
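
The quantities compared here can be computed per DNA with a small helper. The following is a minimal sketch that assumes a lower score means a better structure; the function names are illustrative.

```python
def best_rmsd_in_top_k(scores, rmsds, k):
    """Lowest RMSD among the k structures with the lowest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return min(rmsds[i] for i in ranked[:k])


def mean(values):
    """Arithmetic mean, used to average per-DNA results over a test set."""
    return sum(values) / len(values)
```

With `k=1` the helper returns the RMSD of the top-ranked structure, with `k=5` the best RMSD within the top five, and with `k=len(scores)` the lowest RMSD among all candidates. Averaging each of these over the DNAs of a test set with `mean` gives the three values compared above.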

Figure 7. The ability of 3dDNAscoreA to select near-native structures within top one and top five from the predicted structures by 3dDNA (test set II) by using RMSD metrics. The “lowest” denotes the average RMSD of the lowest-RMSD structures among the decoys of each DNA in test set II. To see this figure in color, go online.

Fig. 8 shows examples of four DNA molecules. 3dDNAscoreA can pick out the near-native structures for 4KB1, 1OMH, and 5HTO, whereas for 2A6O it can only pick the near-native structure among the top five structures, since a structure with a moderately higher RMSD has a lower score than the one with the lowest RMSD. This may be because the structure with the lowest score differs from the native structure in the loop region, whereas the structure with the lowest RMSD differs in the stem region. The stem region is related to the base pairing of two chains, so it may have a larger effect on the score, since the score is also measured in RMSD metrics.

Figure 8. Comparison of the lowest-score structures and the score-RMSD scatterplots given by 3dDNAscoreA. (a) Comparison of lowest-score structure (*purple*), lowest RMSD structure (*green*), and native structure (*gray*). (*b-d*) Comparison of lowest-score structure (*purple*) and native structure (*gray*). To see this figure in color, go online.

Conclusion

In this work, we have developed a machine learning-based scoring function, 3dDNAscoreA, which was trained by using a list of 113 nonredundant experimental DNA structures. Despite the limited number and size of currently known experimental DNA structures, our scoring function was able to discriminate between near-native and nonnative DNA structures.

3dDNAscoreA has some special features. First, it inherits the neural networks and key parameters that have proven to be successful in the related field of RNA structure assessment and prediction (34). Second, it explores the performance of the scoring function when trained on training sets with different RMSD thresholds. The results on test set I show that when our network simply uses RMSD as a label to evaluate the difference between the candidate structures and the native structure, the structures whose RMSD is too large relative to their native structure are not suitable for training neural networks for near-native structure selection. It is expected that as more nonredundant DNA structures become available, the performance of 3dDNAscoreA will further improve, and this can assist the accurate evaluation of DNA structure models predicted by computer-based techniques.

Despite the important features described above, the current version of 3dDNAscoreA has limitations that need further improvement. For example, constructing a reasonable training set using MD and clustering methods is a time-consuming and highly uncertain process. Additionally, compared with using only the structures generated by MD as a training set, also using 3dDNA to sample a large number of candidate structures for training the network would be more consistent with real DNA structure prediction scenarios, as discussed in the RNA3DCNN paper (33).

Data and code availability

We make our data and code publicly available on https://github.com/zylgtao/3ddnascoreA/tree/master.

Author contributions

Y. Xiao and Y.Z. conceived and designed the research, Y.Z. developed the tools, and Y.Z., C.Y., and Y. Xiong performed the experiments and analyzed the data. Y. Xiao and Y.Z. wrote the manuscript.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (grant number 32071247).

Declaration of interests

The authors declare no competing interests.

Editor: Filip Lankas.

Footnotes

Supporting material can be found online at https://doi.org/10.1016/j.bpj.2024.02.018.

Supporting material

Document S1. Figures S1–S13 and Table S1

Document S2. Article plus supporting material

References

1. Watson J.D., Crick F.H.C. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature. 2007;462:3–5. doi: 10.1097/BLO.0b013e31814b9304.
2. Čech P., Kukal J., et al. Svozil D. Automatic workflow for the classification of local DNA conformations. BMC Bioinf. 2013;14:205. doi: 10.1186/1471-2105-14-205.
3. Adendorff M.R., Tang G.Q., et al. Bricker W.P. Computational investigation of the impact of core sequence on immobile DNA four-way junction structure and dynamics. Nucleic Acids Res. 2022;50:717–730. doi: 10.1093/nar/gkab1246.
4. Erie D.A., Suri A.K., et al. Olson W.K. Theoretical predictions of DNA hairpin loop conformations: correlations with thermodynamic and spectroscopic data. Biochemistry. 1993;32:436–454. doi: 10.1021/bi00053a008.
5. Neidle S. Beyond the double helix: DNA structural diversity and the PDB. J. Biol. Chem. 2021;296. doi: 10.1016/j.jbc.2021.100553.
6. Nakanishi C., Seimiya H. G-quadruplex in cancer biology and drug discovery. Biochem. Biophys. Res. Commun. 2020;531:45–50. doi: 10.1016/j.bbrc.2020.03.178.
7. Masai H., Tanaka T. G-quadruplex DNA and RNA: Their roles in regulation of DNA replication and other biological functions. Biochem. Biophys. Res. Commun. 2020;531:25–38. doi: 10.1016/j.bbrc.2020.05.132.
8. Varshney D., Spiegel J., et al. Balasubramanian S. The regulation and functions of DNA and RNA G-quadruplexes. Nat. Rev. Mol. Cell Biol. 2020;21:459–474. doi: 10.1038/s41580-020-0236-x.
9. Markley J.L., Bax A., et al. Wüthrich K. Recommendations for the presentation of NMR structures of proteins and nucleic acids. J. Mol. Biol. 1998;280:933–952. doi: 10.1006/jmbi.1998.1852.
10. Biochemistry R.J., Education M.B. Vol. 35. 2010. pp. 387–388. (Crystallography made crystal clear: a guide for users of macromolecular models).
11. Cheng Y. Single-Particle Cryo-EM at Crystallographic Resolution. Cell. 2015;161:450–457. doi: 10.1016/j.cell.2015.03.049.
12. Lyu C., Khan I.M., Wang Z. Capture-SELEX for aptamer selection: A short review. Talanta. 2021;229. doi: 10.1016/j.talanta.2021.122274.
13. Cragnolini T., Derreumaux P., Pasquali S. Coarse-grained simulations of RNA and DNA duplexes. J. Phys. Chem. B. 2013;117:8047–8060. doi: 10.1021/jp400786b.
14. Maciejczyk M., Spasic A., et al. Scheraga H.A. DNA Duplex Formation with a Coarse-Grained Model. J. Chem. Theor. Comput. 2014;10:5020–5035. doi: 10.1021/ct4006689.
15. Mu Z.C., Tan Y.L., et al. Shi Y.Z. Ab initio predictions for 3D structure and stability of single- and double-stranded DNAs in ion solutions. PLoS Comput. Biol. 2022;18. doi: 10.1371/journal.pcbi.1010501.
16. Jeddi I., Saiz L. Three-dimensional modeling of single stranded DNA hairpins for aptamer-based biosensors. Sci. Rep. 2017;7:1178. doi: 10.1038/s41598-017-01348-5.
17. Sabri M.Z., Hamid A.A.A., et al. Rahim M.Z.A. The assessment of three dimensional modelling design for single strand DNA aptamers for computational chemistry application. Biophys. Chem. 2020;267. doi: 10.1016/j.bpc.2020.106492.
18. Jossinet F., Ludwig T.E., Westhof E. Assemble: an interactive graphical tool to analyze and build RNA architectures at the 2D and 3D levels. Bioinformatics. 2010;26:2057–2059. doi: 10.1093/bioinformatics/btq321.
19. Popenda M., Szachniuk M., et al. Adamiak R.W. Automated 3D structure composition for large RNAs. Nucleic Acids Res. 2012;40:e112. doi: 10.1093/nar/gks339.
20. Zhang Y., Xiong Y., Xiao Y. 3dDNA: A Computational Method of Building DNA 3D Structures. Molecules. 2022;27:5936. doi: 10.3390/molecules27185936.
21. Wang J., Wang J., et al. Xiao Y. 3dRNA v2.0: An Updated Web Server for RNA 3D Structure Prediction. Int. J. Mol. Sci. 2019;20:4116. doi: 10.3390/ijms20174116.
22. Zhao Y., Huang Y., et al. Xiao Y. Automated and fast building of three-dimensional RNA structures. Sci. Rep. 2012;2:734. doi: 10.1038/srep00734.
23. Kabsch W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr., Sect. A. 1976;32:922–923.
24. Parisien M., Cruz J.A., et al. Major F. New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA. 2009;15:1875–1885. doi: 10.1261/rna.1700409.
25. Zok T., Popenda M., Szachniuk M. MCQ4Structures to compute similarity of molecule structures. Cent. Eur. J. Oper. Res. 2013;22:457–473.
26. Capriotti E., Norambuena T., et al. Melo F. All-atom knowledge-based potential for RNA structure prediction and assessment. Bioinformatics. 2011;27:1086–1093. doi: 10.1093/bioinformatics/btr093.
27. Bernauer J., Huang X., et al. Levitt M. Fully differentiable coarse-grained and all-atom knowledge-based potentials for RNA structure evaluation. RNA. 2011;17:1066–1075. doi: 10.1261/rna.2543711.
28. Wang J., Zhao Y., et al. Xiao Y. 3dRNAscore: a distance and torsion angle dependent evaluation function of 3D RNA structures. Nucleic Acids Res. 2015;43:e63. doi: 10.1093/nar/gkv141.
29. Zhang T., Hu G., et al. Zhou Y. All-Atom Knowledge-Based Potential for RNA Structure Discrimination Based on the Distance-Scaled Finite Ideal-Gas Reference State. J. Comput. Biol. 2020;27:856–867. doi: 10.1089/cmb.2019.0251.
30. Tan Y.L., Wang X., et al. Tan Z.J. rsRNASP: A residue-separation-based statistical potential for RNA 3D structure evaluation. Biophys. J. 2022;121:142–156. doi: 10.1016/j.bpj.2021.11.016.
31. Lukasiak P., Antczak M., et al. Blazewicz J. RNAssess--a web server for quality assessment of RNA 3D structures. Nucleic Acids Res. 2015;43:W502–W506. doi: 10.1093/nar/gkv557.
32. Lukasiak P., Antczak M., et al. Adamiak R.W. RNAlyzer--novel approach for quality analysis of RNA structural models. Nucleic Acids Res. 2013;41:5978–5990. doi: 10.1093/nar/gkt318.
33. Li J., Zhu W., et al. Wang W. RNA3DCNN: Local and global quality assessments of RNA 3D structures using 3D deep convolutional neural networks. PLoS Comput. Biol. 2018;14. doi: 10.1371/journal.pcbi.1006514.
34. Townshend R.J.L., Eismann S., et al. Dror R.O. Geometric deep learning of RNA structure. Science. 2021;373:1047–1051. doi: 10.1126/science.abe5650.
35. Rose P.W., Prlić A., et al. Burley S.K. The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res. 2017;45:D271–D281. doi: 10.1093/nar/gkw1000.
36. Fu L., Niu B., et al. Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565.
37. Gong S., Zhang C., Zhang Y. RNA-align: quick and accurate alignment of RNA 3D structures based on size-independent TM-scoreRNA. Bioinformatics. 2019;35:4459–4461. doi: 10.1093/bioinformatics/btz282.
38.Lu X.J., Bussemaker H.J., Olson W.K. DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res. 2015;43:e142. doi: 10.1093/nar/gkv716. [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Huber P.J., Peter M.S. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964;35:73–101. [Google Scholar]
40.Wang J., Mao K., et al. Xiao Y. Optimization of RNA 3D structure prediction using evolutionary restraints of nucleotide-nucleotide interactions from direct coupling analysis. Nucleic Acids Res. 2017;45:6299–6309. doi: 10.1093/nar/gkx386. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Document S1. Figures S1–S13 and Table S1

mmc1.pdf (7.1 MB, PDF)

Document S2. Article plus supporting material

mmc2.pdf (10.8 MB, PDF)

Data Availability Statement

Our data and code are publicly available at https://github.com/zylgtao/3ddnascoreA/tree/master.
