Abstract
Protein–RNA and protein–DNA complexes play critical roles in biology. Despite considerable recent advances in protein structure prediction, the prediction of the structures of protein–nucleic acid complexes without homology to known complexes is a largely unsolved problem. Here we extend the RoseTTAFold machine learning protein-structure-prediction approach to additionally predict nucleic acid and protein–nucleic acid complexes. We develop a single trained network, RoseTTAFoldNA, that rapidly produces three-dimensional structure models with confidence estimates for protein–DNA and protein–RNA complexes. Here we show that confident predictions have considerably higher accuracy than current state-of-the-art methods. RoseTTAFoldNA should be broadly useful for modeling the structure of naturally occurring protein–nucleic acid complexes, and for designing sequence-specific RNA and DNA-binding proteins.
Subject terms: Machine learning, DNA-binding proteins, RNA-binding proteins
RoseTTAFoldNA extends the RoseTTAFold2 platform to predict the structures of protein–DNA and protein–RNA complexes.
Main
Current approaches for protein–nucleic acid complex structure prediction involve building models of the protein and nucleic acid (NA) components separately and then building up complexes using computational docking calculations1–3. For predicting protein components, machine learning-guided approaches like RoseTTAFold4 and AlphaFold5 are highly accurate, while RNA structure prediction has used a combination of Monte Carlo sampling approaches6–9 as well as deep learning methods10,11. Despite this progress in predicting individual components, the prediction of the structure of protein–nucleic acid complexes has lagged considerably behind the prediction of protein structures or RNA structures alone.
AlphaFold and RoseTTAFold take as input one or more aligned protein sequences, and successively transform this information in parallel one-dimensional (1D), two-dimensional (2D) and—in the case of RoseTTAFold—three-dimensional (3D) tracks, ultimately outputting three-dimensional protein structures. The 10 s to 100 s of millions of free parameters in these deep networks are learned by training on large sets of proteins of known structures from the Protein Data Bank (PDB). Both AlphaFold and RoseTTAFold can generate accurate models of not only protein monomers but also protein complexes, modeling folding and binding by successive transformations over hundreds of iterations. Given the overall similarities between protein folding and RNA folding, and between protein-protein binding and protein–nucleic acid binding, we reasoned that the concepts and techniques underlying AlphaFold and RoseTTAFold could be extended to the prediction of the structures of nucleic acids and protein–nucleic acid complexes from sequence information alone. We set out to generalize RoseTTAFold to model nucleic acids in addition to proteins, and to learn the many new parameters required for general protein–nucleic acid systems by training on the structures in the PDB. A major question at the outset was whether there were sufficient nucleic acid and protein–nucleic acid structures in the PDB to train an accurate and general model; key to the success of AlphaFold are the hundreds of thousands of protein structures in the PDB, but there are an order of magnitude fewer nucleic acid structures and complexes. The flexibility of nucleic acids relative to proteins could also make the prediction of the former more difficult.
Our new model, RoseTTAFoldNA, was trained using the same data as RoseTTAFold, augmented with all RNA, protein–RNA and protein–DNA complexes in the PDB. Using nucleic acid complexes published more recently than any training-set examples, we evaluate its ability to predict structures of protein–nucleic acid complexes without homologs. We also assess the model’s self-assessments of model accuracy, and compare our predictions to a combination of AlphaFold and computational protein–DNA docking.
Results
The architecture of RoseTTAFoldNA (RFNA) is illustrated in Fig. 1. It is based on the three-track architecture of RoseTTAFold4, which simultaneously refines three representations of a biomolecular system: sequence (1D), residue-pair distances (2D) and cartesian coordinates (3D). In addition to several modifications to improve performance12, we extended all three tracks of the network to support nucleic acids in addition to proteins. The 1D track in RoseTTAFold has 22 tokens, corresponding to the 20 amino acids, a 21st ‘unknown’ amino acid or gap token and a 22nd mask token that enables protein design; to these, we added 10 additional tokens, corresponding to the four DNA nucleotides, the four RNA nucleotides, unknown DNA and unknown RNA. The 2D track in RoseTTAFold builds up a representation of the interactions between all pairs of amino acids in a protein or protein assembly; we generalized the 2D track to model interactions between nucleic acid bases and between bases and amino acids. The 3D track in RoseTTAFold represents the position and orientation of each amino acid in a frame defined by three backbone atoms (N, CA and C), and up to four chi angles to build up the sidechain. For RoseTTAFoldNA, we extended this to include representations of each nucleotide using a coordinate frame describing the position and orientation of the phosphate group (P, OP1 and OP2), and 10 torsion angles which enable the building up of all the atoms in the nucleotide. RoseTTAFoldNA consists of 36 of these three-track layers, followed by four additional structure refinement layers, with a total of 67 million parameters.
We trained this end-to-end protein–NA structure prediction network using a combination of protein monomers, protein complexes, RNA monomers, RNA dimers, protein–RNA complexes and protein–DNA complexes, with a 60/40 ratio of protein-only and NA-containing structures (Methods). Multichain assemblies other than the DNA double helix were broken into pairs of interacting chains. For each input structure or complex, sequence similarity searches were used to generate multiple sequence alignments (MSAs) of related protein and nucleic acid molecules. Network parameters were optimized by minimization of a loss function consisting of a generalization of the all atom Frame Aligned Point Error (FAPE) loss5 defined over all protein and nucleic acid atoms (Methods) together with additional contributions assessing the recovery of masked sequence segments, residue-residue (both amino acids and nucleotides) interaction geometry and error prediction accuracy. To try to compensate for the far smaller number of nucleic-acid-containing structures in the PDB (following sequence-similarity-based cluster to reduce redundancy, there are 1,632 RNA clusters and 1,556 protein–nucleic acid complex clusters compared to 26,128 all protein clusters), we also incorporated physical information in the form of Lennard-Jones and hydrogen-bonding energies13 as input features to the final refinement layers, and as part of the loss function during fine-tuning. During training, 10% of the clusters were withheld for model validation.
We trained the model using structures determined prior to May 2020, and used RNA and protein–NA structures solved since then as an additional independent validation set. For the validation set, complexes were not broken into interacting pairs and were processed entirely as full complexes. Paired MSAs were generated for complexes with multiple protein chains as described previously14. Due to GPU memory limitations, for the validation set only, we excluded complexes with more than 1,000 total amino acids and nucleotides, which resulted in a validation set containing 520 cases (98 clusters) with a single RNA chain, 224 complexes (116 clusters) with one protein molecule plus a single RNA chain (62/28 clusters) or DNA duplex (162/88 clusters), and 161 cases with more than one protein chain or more than a single RNA chain or DNA duplex.
Predicting protein–NA complexes
RoseTTAFoldNA results on 224 monomeric protein–NA complexes are summarized in Fig. 2, shown as 116 clusters. The predictions are reasonably accurate, with an average Local Distance Difference Test (lDDT) of 0.73 and 29% of models with lDDT > 0.8 (19% of clusters, Fig. 2a), and about 45% of models contain greater than half of the native contacts between protein and NA (fraction of native contacts, FNAT > 0.5, 35% of clusters, Fig. 2c). RoseTTAFoldNA, like RoseTTAFold and AlphaFold, outputs not only a predicted structure but also a predicted model confidence, and as expected the method correctly identifies which structure models are accurate. Although only 38% of the complexes (28% of clusters) are predicted with high confidence (mean interface predicted aligned error, PAE < 10), of those, 81% (78% of clusters) correctly model the protein–NA interface (‘acceptable’ or better by CAPRI metrics15). Over the 33 clusters with no detectable sequence similarity to training protein–NA structures, the accuracy is similar (average lDDT = 0.68 with 24% of models > 0.8 lDDT and 42% with FNAT > 0.5), and the model is still able to correctly identify accurate predictions—24% of predictions in this subset are predicted with high confidence, of which all eight have acceptable interfaces according to CAPRI metrics. Four predictions of structures with no sequence homologs in the training set are shown in Fig. 2d–g. These include the endonuclease BpuJ1, tumor antigen p53, SmpB bound to a tRNA-like RNA domain, and components of a telomerase reverse transcriptase. Inaccuracies in these predictions can be found in flexible terminal regions (Fig. 2e,g), a slight tilt of the DNA double helix relative to the interface (Fig. 2e) and slight deviations in RNA tertiary structure (Fig. 2f,g), but the interfaces are clearly correct.
In cases where RoseTTAFoldNA fails to produce an accurate prediction, the most common cause is poor prediction of individual subunits, typically large multidomain proteins, large RNAs (>100 nt) and small single-stranded nucleic acids. When the subunit predictions are accurate, the most common failure mode is for the model to identify either the correct binding orientation or the correct interface residues, but not both. The remaining cases with completely incorrect interfaces often involve only glancing contacts or heavily distorted DNAs. It is possible that a different training schedule could reduce these errors, but more likely it is due to limited training data in these regimes. Extended Data Fig. 1 illustrates some examples.
RoseTTAFoldNA prediction is not limited to complexes with only a single protein subunit. Figure 3 summarizes the performance of RoseTTAFoldNA on 161 multisubunit protein–NA complexes, most of which are homodimeric proteins bound to nucleic acid duplexes. The performance is similar to that for monomeric protein–nucleic acid complexes, with an average lDDT = 0.72 with 30% of cases >0.8 lDDT, and good agreement between confidence and accuracy (Fig. 3a). Three examples are illustrated in Fig. 3b–d, showing the ability of the model to predict complex structure as well as the ‘bending’ of DNA induced by protein binding (Fig. 3e). Figure 3f,g shows another example where the relative positioning of protein domains is only made by copredicting these complexes. Such effects would not be possible to predict by approaches that first generate models of the independent components and then rigidly dock them.
Predicting RNA complexes
Finally, RoseTTAFoldNA performance on RNA structures alone are summarized in Extended Data Fig. 2. Most predictions are reasonably accurate: the average lDDT is 0.73, with 48% of models (but only 14% of clusters) predicted with lDDT > 0.8 (Extended Data Fig. 2a). 62% of cases (30% of clusters) are predicted with very high confidence (predicted lDDT, plDDT > 0.9), for which the average lDDT is 0.81 and 77% of models (45% of clusters) have lDDT > 0.8. Even for cases with no homologs of known structure or small numbers of sequence relatives (shallow MSAs), confidently predicted models are generally quite accurate (colourbar, Extended Data Fig. 2b,c) and the network is capable of predicting structures without detectable homologs in the training dataset (Extended Data Fig. 2d–g).
Discussion
At the outset of this work, it was not clear that there were enough protein–nucleic acid structures in the PDB to enable robust training of a deep learning-based predictor with atomic accuracy—the training data used for nucleic acid prediction is only one tenth the size of the dataset used for protein structure prediction. Our results show, however, that this data is sufficient in many cases for de novo structure modeling, with accurate modeling of protein–NA interfaces without shared MSA information or homologs of known structure in about 31% of cases. Prospective and blind tests will be important for further critical evaluation of the method. Along these lines, we made predictions for CASP15 RNA targets during CASP with an earlier version of RoseTTAFoldNA.
Comparison of RoseTTAFoldNA to current state-of-the-art methods is more difficult than the case for the deep learning methods AlphaFold and RoseTTAFold which focused on the much more well studied protein structure prediction problem. There has been recent work on RNA structure prediction; Extended Data Fig. 3 shows the performance of this network compared to the traditional sampling-based FARFAR2 method4 and the deep learning-based DeepFoldRNA method15. FARFAR2 and DeepFoldRNA top-ranked models have average lDDTs of 0.44 and 0.64, respectively, compared to 0.62 for RoseTTAFoldNA. On the CASP15 RNA targets, we perform worse than the leading machine learning methods DeepFoldRNA and AIchemy—but most of the targets are quite large and several are synthetic RNA origamis with no MSAs16. For protein structure prediction, we see performance in-line with AlphaFold, with an average TM-score of 0.87 for RFNA versus 0.88 for AlphaFold (comparing AlphaFold ‘model 1’ and using the same MSA for both AlphaFold and RFNA). While the performance of individual modalities is not an advancement over state-of-the-art, the strength of RoseTTAFoldNA is in the prediction of protein–nucleic acid complexes. Here, comparisons are more difficult, as there are no equivalent deep learning-based methods, and even sampling-based methods have focused more on bespoke solutions to a specific problem rather than general methods. While automated methods are available for predicting individual protein, RNA, and DNA components and for energy-based docking of macromolecules, we find that this alternative workflow has very poor accuracy, finding the correct complex within the top three models in only 1 of 14 test cases (see Methods for details on our workflow and Extended Data Fig. 4 for detailed results). Hence, while the accuracy of RoseTTAFoldNA on protein–nucleic acid complexes is considerably lower than that of AlphaFold on protein structures, it represents a notable improvement in the state-of-the-art.
Further increases in accuracy might come from a larger, more expressive network; we used a smaller network than that of RoseTTAFold, with ∼67 M parameters and 36 total layers. Use of high-confidence predicted structures as additional training examples (made more difficult by subsampling MSAs) should further increase model accuracy10; for this purpose there are databases of structured RNAs17,18 and DNA-binding profiles for thousands of proteins19,20, and the latter should be useful for training a model fine-tuned for DNA specificity as well (see Methods and Extended Data Fig. 5 for RoseTTAFoldNA performance on DNA-binding specificity prediction). Deep learning-guided structure prediction of proteins has opened up new avenues of research; we hope that RoseTTAFoldNA does the same for protein–NA interactions and complexes. To this end, we have made the method freely available.
Methods
Training and validation data processing
The protein and protein complex data used in training was identical to that used in training RoseTTAFold2. Additional data from RNA and protein–nucleic acid complexes was added to this. To construct this dataset, all PDBs solved by nuclear magnetic resonance, crystallography or cryo-electron microscopy at better than 4.5 Å resolution were collected. A dataset was constructed considering all PDB structures published at or before 30 April 2020, and collecting:
All RNA single chains and all RNA duplexes. A duplex was defined by looking for pairs of RNA chains making at least 10 hydrogen bonds.
All interacting protein–nucleic acid pairs. Interacting pairs were defined by counting the number of 7 Å contacts between protein Cαs and any (non-hydrogen) nucleic acid atom; if there were more than 16 such contacts, the pair was considered interacting. Nucleic acid duplexes were included if the DNA or RNA chains made at least 10 hydrogen bonds.
For modeling, the full-length sequence was used. All non-standard bases/amino acids were converted into a backbone-only ‘unknown’ residue type. The dataset size was 7,396 RNA chains and 23,583 complexes. These were then clustered using a 1 × 10−3 hhblits29 E-value for proteins and 80% sequence identity for RNA molecules, yielding 1,632 non-redundant RNA clusters and 1,556 non-redundant protein–NA clusters. These clusters were then split into training and validation sets, with clusters chosen for the training set; an example which contained any member (NA or protein) of a validation set cluster was assigned to the validation set. This led to 199 protein–NA clusters and 116 RNA clusters in the validation set.
Multiple sequence alignments (MSAs) were then created for all protein and RNA sequences in the training and validation set. Protein MSAs were generated in the same way as RoseTTAFold12, using hhblits at successive E-value cutoffs (1 × 10−30, 1 × 10−10, 1 × 10−6 and 1 × 10−3), stopping when the MSA contains more than 10,000 unique sequences with >50% coverage. RNA MSAs were generated using a pared-down version of rMSA (https://github.com/pylelab/rMSA) that removes secondary structure predictions: sequences were searched using blastn30 over three databases (RNAcentral17, rfam18 and nt) to first identify hits, then using nhmmer31 to rerank hits. We again use successive E-value cutoffs (1 × 10−8, 1 × 10−7, 1 × 10−6, 1 × 10−3, 1 × 10−2 and 1 × 10−1), stopping when the MSA contains more than 10,000 unique sequences with >50% coverage.
Finally, to improve generalizability of protein–DNA interactions we added a few ways of ‘randomizing’ inputs during training. As many crystal structures of protein–DNA complexes involve short DNA chains with the binding motif in the middle, initial versions of the model had a strong preference to binding in the middle of any provided sequence. To deal with this, we added a random padding of 0–6 nucleotides to both ends of all native structures: (1) containing double-stranded DNA and (2) making at least three base-specific contacts (using a cutoff distance of 3.4 Å). This yielded 580 protein–DNA complexes. These added residues were not included in loss calculations, but were present in the predicted structures. Additionally, we also performed negative training for these same 580 complexes; all DNA bases forming base-specific contacts to the bound protein were randomly mutated (maintaining Watson–Crick base pairing), and the model was trained to move the protein and DNA far apart (by favouring the 6-dimensional ‘distogram’ loss to place all its probability mass in the final bin).
Test set data processing
For an independent test set, we took all structures published to the PDB 1 May 2020 or later. Selection criteria and preprocessing was the same as for the training and validation data with two exceptions: (1) only complexes fewer than 1,000 residues plus nucleotides in length were considered and (2) for complexes containing more than one unique protein chains, paired MSAs were created by merging sequences from the same organism into a single combined sequence (following prior work14). This gave us 91 complexes with one protein molecule plus a single RNA chain or DNA duplex, 43 cases with a single RNA chain and 106 cases with more than one protein chain or more than a single RNA chain or DNA duplex.
All atom generation for nucleotides
Following AlphaFold’s treatment of amino acids, when predicting structure, the model represents each nucleotide as a rigid frame (with a rotation and translation) and a set of internal torsion angles. For nucleic acids this frame corresponds to the orientation of the phosphate group (O–P–O), in the same way that N–Cα–C is used as an amino acid frame. A set of ten torsions describe the placement of all sidechain atoms, representing the rotatable bonds in the nucleotide: six backbone (α, β, γ, δ, ϵ and ζ), one sidechain (χ) and three additional angles controlling ribose ‘pucker’ (ν0, ν1 and ν2). When all atom models are generated as part of the loss calculation, they are kinematically folded outward from the phosphate group following the chain of torsions connecting them.
Loss functions
The model was trained using a loss function similar to RoseTTAFold, where we take the weighted sum:
Above, seq is the masked amino acid recovery loss (no masking is applied to nucleotide sequences); 6D is the six-dimensional ‘distogram’ loss32; str is the structure loss, consisting of the average backbone FAPE loss5 over all 40 structure layers of the network plus the all atom FAPE loss for the final model; tors is the torsion prediction loss averaged over the 40 structure layers; err is the loss in pLDDT prediction; and the w terms are the weights on individual components in the loss function.
FAPE loss is extended to nucleic acids in a straightforward manner from how it is implemented for amino acids. For backbone FAPE loss, the phosphate group (O–P–O) in the nucleic acid backbone is treated as the nucleotides ‘frame.’ For nucleic acid all atom FAPE loss, three-atom frames are constructed corresponding to each of the ten ‘rotatable torsions’ (see above), where the frame consists of the two bonded atoms defining the torsion plus an additional bonded atom, closer to the phosphate group in the bond graph. The cross product of these ten frames with all atoms is used to calculate FAPE loss.
Following training with the above loss function, an additional ‘fine-tuning’ phase is carried out, where additional energy terms are added to the loss function enforcing reasonable model geometry:
Above, LJ and hbond are the Lennard-Jones and hydrogen bond energies of the final structure (normalized by the number of atoms), using a reimplementation of the corresponding Rosetta energy terms13; geom is a term that enforces ideal bond lengths and bond angles around the peptide or phosphodiester bond connecting residues/nucleotides; and pairerr is a predicted residue-pair error5. The functional form of the geom term is identical to that of RoseTTAFold2, a linear penalty with a ‘flat bottom’ ±3°/0.02 Å from the ideal values.
Model training
The network was trained in two stages, an initial training period, and a fine-tuning period. In both, input structures were divided into five pools: (1) protein structures, (2) ‘distilled’ protein structures (consisting of high-confidence AlphaFold predictions), (3) protein complexes, (4) protein–NA complexes and (5) RNA structures. Training sampled from each of these pools with equal probability (though later in training protein–NA frequency was increased to 25% and RNA frequency lowered to 15%). For both pools containing ‘complexes,’ an equal number of positive and negative examples were used in training. Negative examples consist of nonbinding proteins or protein–NA pairs; the structure loss only penalizes each component individually, and the 6D loss favors placing negative binding examples far apart.
Examples larger than 256 residues/nucleotides in length were ‘cropped’ to 256 residues in length. For protein-only data these crops were continuous sequences; for nucleic acids and nucleic acid–protein complexes the cropping was a bit more complex. A graph was constructed where sequential residues/nucleotides had edges with weight 1, Watson–Crick base-paired nucleotides had weight 0 and protein–NA bases closer than 12 Å (Cα to P) had a weight of 0. In negative cases, a single random protein–NA edge was given weight 0. Then minimum-weight graph traversal starting from a randomly chosen protein–NA edge was used to crop the model down to 256 residues/nucleotides. For RNA-only models the same strategy was used, though the starting point was a random nucleotide.
Training was carried out in parallel on 64 GPUs. A batch size of 64 was used throughout training with a learning rate of 0.001, decaying every 5,000 steps. The following weights were used: wseq = 3.0, w6d = 1.0, wstr = 10.0, wtors = 10.0 and werr = 0.1. The Adam optimizer was used, with L2 regularization (coeff = 0.01).
Following ∼1 × 105 optimization steps, fine-tuning training was carried out. Here we increase crop size to 384 and effective batch size to 128, and reduce learning rate to 5× 10−4. We used additional loss terms with weights wgeom = 0.1, wLJ = 0.02, whbond = 0.05 and wpairerr = 0.1, and optimized for an additional 30,000 minimization steps. All told, training took approximately 4 weeks.
Protein–nucleic acid docking
From the protein–nucleic acid complexes with no homologs in RFNA’s training set, we selected eight protein–DNA complexes and six protein–RNA complexes to use as test cases for docking. Protein monomer structures were predicted with AlphaFold5, using the same MSAs generated for RFNA predictions and choosing the prediction with the highest average predicted lDDT from models 1–5. RNA components were predicted using DeepFoldRNA following the default instructions. DNA duplexes were generated as B-form helices using x3DNA33. Docking was performed using the Hdock web server34, using only template-free docking to avoid fitting directly to the original deposited model. Structure and interface accuracy of the top three docks were evaluated as for RFNA. We acknowledge that a more careful DNA modeling and docking workflow could produce more accurate models, but similar could be said for RFNA.
Binding and nonbinding DNA sequence dataset
We obtained experimental data of transcription factors’ DNA-binding profiles from the Cis-BP database19. We used 1,509 proteins for which the protein sequences of the experimental constructs and DNA 8mer E-scores were available. From the 8mer E-scores for each protein, we chose the top three most enriched DNA sequences as ‘binding’ and three random negatively enriched DNA sequences as ‘nonbinding’. We predicted the proteins and DNAs together using RFNA and evaluated the model based on the average PAE across the interface.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41592-023-02086-5.
Supplementary information
Source data
Acknowledgements
This work was supported by Microsoft (M.B. and D.B., and generous gifts of Azure computing time), the Audacious Project at the Institute for Protein Design (R.M., F.D. and D.B.), National Science Foundation Grant CHE 2226466 (F.D. and D.B.), AI-Bio Research Grant through Seoul National University (M.B.), Croucher Fellowship (H.J.) and the Howard Hughes Medical Institute (D.B.).
Extended data
Author contributions
F.D. and M.B. conceived the presented model. I.A., F.D. and R.M. carried out the training-set curation. F.D. and M.B. carried out the model training. R.M and F.D. computed and analyzed the results. H.J. and R.M. computed and analyzed the binding specificity data. F.D., M.B, R.M and D.B. wrote the initial manuscript. R.M. and F.D. generated the initial figures. All authors discussed the results and contributed to the final manuscript.
Peer review
Peer review information
Nature Methods thanks Hashim Al-Hashimi, Remo Rohs, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.
Data availability
Source code and a link to the training weights have been made available at https://github.com/uw-ipd/RoseTTAFold2NA. Updated CASP15 RNA predictions have been made available at 10.5281/zenodo.7555957. All data used for training and evaluation is publicly available through the PDB (https://www.rcsb.org/). The data used for analyzing sequence specificity is publicly available through Cis-BP (http://cisbp.ccbr.utoronto.ca/). Source data are provided with this paper.
Code availability
The code for this model is available at https://github.com/uw-ipd/RoseTTAFold2NA. This repository includes preprocessing and inference scripts, and a link to the model weights.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
is available for this paper at 10.1038/s41592-023-02086-5.
Supplementary information
The online version contains supplementary material available at 10.1038/s41592-023-02086-5.
References
- 1.Honorato RV, Roel-Touris J, Bonvin AMJJ. MARTINI-based protein-DNA coarse-grained HADDOCKing. Front. Mol. Biosci. 2019;6:102. doi: 10.3389/fmolb.2019.00102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Tuszynska I, Magnus M, Jonak K, Dawson W, Bujnicki JM. NPDock: a web server for protein-nucleic acid docking. Nucleic Acids Res. 2015;43:W425–W430. doi: 10.1093/nar/gkv493. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Banitt I, Wolfson HJ. ParaDock: a flexible non-specific DNA-rigid protein docking algorithm. Nucleic Acids Res. 2011;39:e135. doi: 10.1093/nar/gkr620. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Watkins AM, Rangan R, Das R. FARFAR2: Improved de novo Rosetta prediction of complex global RNA folds. Structure. 2020;28:963–976. doi: 10.1016/j.str.2020.05.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Krokhotin A, Houlihan K, Dokholyan NV. iFoldRNA v2: folding RNA with constraints. Bioinformatics. 2015;31:2891–2893. doi: 10.1093/bioinformatics/btv221. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Zhao C, Xu X, Chen S-J. Predicting RNA structure with Vfold. Methods Mol. Biol. 2017;1654:3–15. doi: 10.1007/978-1-4939-7231-9_1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Wang J, et al. Optimization of RNA 3D structure prediction using evolutionary restraints of nucleotide–nucleotide interactions from direct coupling analysis. Nucleic Acids Res. 2017;45:6299–6309. doi: 10.1093/nar/gkx386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Sato K, Akiyama M, Sakakibara Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat. Commun. 2021;12:941. doi: 10.1038/s41467-021-21194-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Townshend RJL, et al. Geometric deep learning of RNA structure. Science. 2021;373:1047–1051. doi: 10.1126/science.abe5650. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Baek M. et al. Efficient and accurate prediction of protein structure using RoseTTAFold2. Preprint at bioRxiv10.1101/2023.05.24.542179 (2023).
- 13.Alford RF, et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 2017;13:3031–3048. doi: 10.1021/acs.jctc.7b00125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Humphreys IR, et al. Computed structures of core eukaryotic protein complexes. Science. 2021;374:eabm4805. doi: 10.1126/science.abm4805. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Lensink MF, Wodak SJ. Docking, scoring, and affinity prediction in CAPRI. Proteins. 2013;81:2082–2095. doi: 10.1002/prot.24428. [DOI] [PubMed] [Google Scholar]
- 16.Das R. et al. Assessment of three-dimensional RNA structure prediction in CASP15. Preprint at bioRxiv10.1101/2023.04.25.538330 (2023). [DOI] [PMC free article] [PubMed]
- 17.RNAcentral Consortium RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res. 2021;49:D212–D220. doi: 10.1093/nar/gkaa921. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Kalvari I, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 2021;49:D192–D200. doi: 10.1093/nar/gkaa1047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Weirauch MT, et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158:1431–1443. doi: 10.1016/j.cell.2014.08.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Gerstein MB, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489:91–100. doi: 10.1038/nature11245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Probst M, et al. Structural insight into DNA-assembled oligochromophores: crystallographic analysis of pyrene- and phenanthrene-modified DNA in complex with BpuJI endonuclease. Nucleic Acids Res. 2016;44:7079–7089. doi: 10.1093/nar/gkw644. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Petty TJ, et al. An induced fit mechanism regulates p53 DNA binding kinetics to confer sequence specificity. EMBO J. 2011;30:2167–2176. doi: 10.1038/emboj.2011.127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Gutmann S, et al. Crystal structure of the transfer-RNA domain of transfer-messenger RNA in complex with SmpB. Nature. 2003;424:699–703. doi: 10.1038/nature01831. [DOI] [PubMed] [Google Scholar]
- 24.Huang J, et al. Structural basis for protein-RNA recognition in telomerase. Nat. Struct. Mol. Biol. 2014;21:507–512. doi: 10.1038/nsmb.2819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Hellert J, et al. The 3D structure of Kaposi sarcoma herpesvirus LANA C-terminal domain bound to DNA. Proc. Natl Acad. Sci. USA. 2015;112:6694–6699. doi: 10.1073/pnas.1421804112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Grenha R, et al. Structural basis for the activation mechanism of the PlcR virulence regulator by the quorum-sensing signal peptide PapR. Proc. Natl Acad. Sci. USA. 2013;110:1047–1052. doi: 10.1073/pnas.1213770110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Shevtsov MB, et al. Structural analysis of DNA binding by C.Csp231I, a member of a novel class of R-M controller proteins regulating gene expression. Acta Crystallogr. D Biol. Crystallogr. 2015;71:398–407. doi: 10.1107/S139900471402690X. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Šoltysová M, et al. Structural insight into DNA recognition by bacterial transcriptional regulators of the SorC/DeoR family. Acta Crystallogr D Struct. Biol. 2021;77:1411–1424. doi: 10.1107/S2059798321009633. [DOI] [PubMed] [Google Scholar]
- 29.Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods9, 173–175 (2012). [DOI] [PubMed]
- 30.Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Wheeler TJ, Eddy SR. nhmmer: DNA homology search with profile HMMs. Bioinformatics. 2013;29:2487–2489. doi: 10.1093/bioinformatics/btt403. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Yang J, et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA. 2020;117:1496–1503. doi: 10.1073/pnas.1914677117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Zheng G, Lu X-J, Olson WK. Web 3DNA–a web server for the analysis, reconstruction, and visualization of three-dimensional nucleic-acid structures. Nucleic Acids Res. 2009;37:W240–W246. doi: 10.1093/nar/gkp358. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Yan Y, Zhang D, Zhou P, Li B, Huang S-Y. HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res. 2017;45:W365–W373. doi: 10.1093/nar/gkx407. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Source code and a link to the training weights have been made available at https://github.com/uw-ipd/RoseTTAFold2NA. Updated CASP15 RNA predictions have been made available at 10.5281/zenodo.7555957. All data used for training and evaluation is publicly available through the PDB (https://www.rcsb.org/). The data used for analyzing sequence specificity is publicly available through Cis-BP (http://cisbp.ccbr.utoronto.ca/). Source data are provided with this paper.
The code for this model is available at https://github.com/uw-ipd/RoseTTAFold2NA. This repository includes preprocessing and inference scripts, and a link to the model weights.