Bioinformatics. 2024 Jun 28;40(Suppl 1):i347–i356. doi: 10.1093/bioinformatics/btae259

RiboDiffusion: tertiary structure-based RNA inverse folding with generative diffusion models

Han Huang 1,2,b, Ziqian Lin 3,4,b, Dongchen He 5, Liang Hong 6, Yu Li 7,
PMCID: PMC11211841  PMID: 38940178

Abstract

Motivation

RNA design shows growing applications in synthetic biology and therapeutics, driven by the crucial role of RNA in various biological processes. A fundamental challenge is to find functional RNA sequences that satisfy given structural constraints, known as the inverse folding problem. Computational approaches have emerged to address this problem based on secondary structures. However, designing RNA sequences directly from 3D structures is still challenging, due to the scarcity of data, the nonunique structure-sequence mapping, and the flexibility of RNA conformation.

Results

In this study, we propose RiboDiffusion, a generative diffusion model for RNA inverse folding that can learn the conditional distribution of RNA sequences given 3D backbone structures. Our model consists of a graph neural network-based structure module and a Transformer-based sequence module, which iteratively transforms random sequences into desired sequences. By tuning the sampling weight, our model allows for a trade-off between sequence recovery and diversity to explore more candidates. We split test sets based on RNA clustering with different cut-offs for sequence or structure similarity. Our model outperforms baselines in sequence recovery, with an average relative improvement of 11% for sequence similarity splits and 16% for structure similarity splits. Moreover, RiboDiffusion performs consistently well across various RNA length categories and RNA types. We also apply in silico folding to validate whether the generated sequences can fold into the given 3D RNA backbones. Our method could be a powerful tool for RNA design that explores the vast sequence space and finds novel solutions to 3D structural constraints.

Availability and implementation

The source code is available at https://github.com/ml4bio/RiboDiffusion.

1 Introduction

The design of RNA molecules is an emerging tool in synthetic biology (Chappell et al. 2015, McKeague et al. 2016) and therapeutics (Zhu et al. 2022), enabling the engineering of specific functions in various biological processes. There have been various explorations into RNA-based biotechnology, such as translational RNA regulators for gene expression (Laganà et al. 2015, Chappell et al. 2017), aptamers for diagnostic or therapeutic applications (Espah Borujeni et al. 2016, Findeiß et al. 2017), and catalysis by ribozymes (Dotu et al. 2014, Park et al. 2019). While the tertiary structure determines how RNA molecules function, one fundamental challenge in RNA design is to create functional RNA sequences that can fold into the desired structure, also known as the inverse RNA folding problem (Hofacker et al. 1994).

Most early computational methods for inverse RNA folding focus on folding into RNA secondary structures (Churkin et al. 2018). Some programs use efficient local search strategies to optimize a single seed sequence for the desired folding properties, guided by the energy function (Hofacker et al. 1994, Andronescu et al. 2004, Busch and Backofen 2006, Garcia-Martin et al. 2013). Others attempt to solve the problem globally by modeling the sequence distribution or directly manipulating diverse candidates (Taneda 2011, Kleinkauf et al. 2015, Yang et al. 2017, Runge et al. 2019). However, without considering 3D structures of RNA, these methods cannot meet accurate functional structure constraints, since RNA secondary structures only partially determine their tertiary structures (Vicens and Kieft 2022). The pioneering work (Yesselman and Das 2015) applies a physically based approach to optimize RNA sequences and match the fixed backbones, but it is still constrained by the local design strategy and computational efficiency.

Recent advances in deep learning and the accumulation of biomolecular structural data have enabled computational methods to model mapping between sequences and 3D structures with extraordinary performance, as demonstrated by remarkable results in protein 3D structure prediction (Jumper et al. 2021, Lin et al. 2023) and inverse design (Dauparas et al. 2022). Inspired by this, the development of geometric learning methods on RNA structures has received increasing research interest. On the one hand, many studies have explored RNA tertiary structure prediction using machine learning models with limited data (Baek et al. 2022, Shen et al. 2022, Li et al. 2023). On the other hand, although deep learning has a promising potential to narrow down the immense sequence space for inverse folding, developing an appropriate model for RNA inverse folding remains an open problem, as it requires capturing the geometric features of flexible RNA conformations, handling the nonunique mappings between structures and sequences, and providing alternative options for different design preferences.

In this study, we introduce RiboDiffusion, a generative diffusion model for RNA inverse folding based on tertiary structures. We formulate the RNA inverse folding problem as learning the sequence distribution conditioned on fixed backbone structures, using a generative diffusion model (Yang et al. 2022). Unlike previous methods that predict the most probable sequence for a given backbone (Ingraham et al. 2019, Jing et al. 2021, Gao et al. 2023, Joshi et al. 2023), our method captures multiple mappings from 3D structures to sequences through distribution learning. With a generative denoising process for sampling, our model iteratively transforms random initial RNA sequences into desired candidates under tertiary structure conditioning. This global iterative generation distinguishes our model from autoregressive models and local updating methods, enabling it to better search for sequences that satisfy global geometric constraints. We parameterize the diffusion model with a cascade of a structure module and a sequence module, to capture the mutual dependencies between sequence and structure. The structure module, based on graph neural networks, extracts SE(3)-invariant geometric features from fixed 3D RNA backbones, while the sequence module, based on Transformer-like layers, captures the internal correlations of RNA primary structures. To train the model, we randomly drop the structural condition so that the model learns both the conditional and the unconditional RNA sequence distribution. We also mix the conditional and unconditional distributions in the sampling procedure to balance sequence recovery and diversity, which yields more candidates.

We use RNA tertiary structures from the PDB database (Bank 1971) to construct the benchmark dataset and augment it with structures predicted by an RNA structure prediction model (Shen et al. 2022). We split test sets based on RNA clustering using different sequence or structure similarity cutoffs. Our model achieves an 11% higher recovery rate than the machine learning baselines for benchmarks based on sequence similarity, and 16% higher for benchmarks based on structure similarity. RiboDiffusion also performs consistently well across different RNA lengths and types. Further analysis shows strong performance in cross-family evaluation and in silico folding validation. Our method could be a powerful tool for RNA design, exploring a wide sequence space and finding novel solutions to 3D structural constraints.

2 Methodology

This section will explain RiboDiffusion in detail—a deep generative model for RNA inverse folding based on fixed 3D backbones. The overview is shown in Fig. 1. We will first introduce the preliminaries of diffusion models and our formulations of the RNA inverse folding problem. We will then describe the design of neural networks to parameterize the diffusion model and explain the sequence sampling procedures.

Figure 1.


Overview of RiboDiffusion for tertiary structure-based RNA inverse folding. We construct a dataset with experimentally determined RNA structures from PDB, supplemented with additional structures predicted by an RNA structure prediction model. We cluster RNA with different cut-offs for sequence or structure similarity and make cross-splits to evaluate models. RiboDiffusion trains a neural network with a structure module and a sequence module to recover the original sequence from a noisy sequence and a coarse-grained RNA backbone extracted from the tertiary structure. RiboDiffusion then uses the trained network to iteratively refine random initial sequences until they match the target structure. We present a comprehensive evaluation and analysis of the proposed method.

2.1 Preliminary and formulation

2.1.1 Diffusion model

As a powerful genre of generative models, diffusion models (Sohl-Dickstein et al. 2015) have been successfully applied to the distribution learning of diverse data, including images (Ho et al. 2020, Song et al. 2021), graphs (Huang et al. 2022, 2023a), and molecular geometry (Huang et al. 2023b, Watson et al. 2023). As the first step of setting up the diffusion model, a forward diffusion process is constructed to perturb data with a sequence of noise. This converts the data distribution to a known prior distribution. With random variables $x_0 \in \mathbb{R}^d$ and a forward process $\{x_t\}_{t \in [0,T]}$, a Gaussian transition kernel is set as:

$$q_{0t}(x_t \mid x_0) = \mathcal{N}(x_t \mid \alpha_t x_0, \sigma_t^2 I), \tag{1}$$

where $\alpha_t, \sigma_t \in \mathbb{R}^+$ are time-dependent differentiable functions that are usually chosen to ensure a strictly decreasing signal-to-noise ratio (SNR) $\alpha_t^2/\sigma_t^2$ and the final distribution $q_T(x_T) \approx \mathcal{N}(0, I)$ (Kingma et al. 2021). Diffusion models can generate new samples starting from the prior distribution, after learning to reverse the forward process. Such a reverse-time denoising process from time $T$ to time 0 can be described by a stochastic differential equation (SDE) (Yang et al. 2022) as:

$$dx_t = \left[ f(t) x_t - g^2(t) \nabla_x \log p_t(x_t) \right] dt + g(t)\, d\bar{w}_t, \tag{2}$$

where $\nabla_x \log p_t(x_t)$ is the so-called score function and $\bar{w}_t$ is the standard reverse-time Wiener process. Here $f(t) = \frac{d \log \alpha_t}{dt}$ is the drift coefficient of the SDE and $g(t)$ is the diffusion coefficient, with $g^2(t) = \frac{d\sigma_t^2}{dt} - 2 \frac{d \log \alpha_t}{dt} \sigma_t^2$ (Kingma et al. 2021). Deep neural networks are used to parameterize the score function variants in two similar forms, i.e. the noise prediction model $\epsilon_\theta(x_t, t)$ and the data prediction model $d_\theta(x_t, t)$. In this study, we focus on the parameterization of the widely used data prediction model to directly predict the original data $x_0$ from $x_t$.
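The relationship between the noise schedule and the SDE coefficients can be checked numerically. The following is a minimal sketch, assuming a cosine schedule $\alpha_t = \cos(\pi t/2)$, $\sigma_t = \sin(\pi t/2)$ that is chosen here only for illustration (the text requires only a strictly decreasing SNR); the function names are hypothetical.

```python
import numpy as np

def alpha(t):
    # Assumed schedule, for illustration only.
    return np.cos(0.5 * np.pi * t)

def sigma(t):
    return np.sin(0.5 * np.pi * t)

def drift_and_diffusion(t, h=1e-5):
    """Return f(t) = d log(alpha_t)/dt and
    g^2(t) = d sigma_t^2/dt - 2 f(t) sigma_t^2 via central finite differences."""
    f = (np.log(alpha(t + h)) - np.log(alpha(t - h))) / (2 * h)
    dsigma2 = (sigma(t + h) ** 2 - sigma(t - h) ** 2) / (2 * h)
    g2 = dsigma2 - 2 * f * sigma(t) ** 2
    return f, g2

ts = np.linspace(0.1, 0.9, 9)
snr = alpha(ts) ** 2 / sigma(ts) ** 2      # signal-to-noise ratio at each time
f_vals, g2_vals = drift_and_diffusion(ts)
```

For this particular schedule the coefficients have the closed forms $f(t) = -\frac{\pi}{2}\tan(\pi t/2)$ and $g^2(t) = \pi \tan(\pi t/2)$, so the SNR decreases monotonically and $g^2(t)$ stays positive on $(0, 1)$.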

2.1.2 RNA inverse folding

Inverse folding aims to explore sequences that can fold into a predefined structure, which is specified here as the fixed sugar-phosphate backbone of an RNA tertiary structure. For an RNA molecule with $N$ nucleotides consisting of four different types A (Adenine), U (Uracil), C (Cytosine), and G (Guanine), its sequence can be defined as $S \in \{A, U, C, G\}^N$. Among the backbone atoms, we choose one three-atom coarse-grained representation including the atom coordinates of C4′, C1′, N1 (pyrimidine) or N9 (purine) for every nucleotide. The simplified backbone structure can be denoted as $X \in \mathbb{R}^{3N \times 3}$. Note that there are various alternative schemes for coarse-graining RNA 3D backbones, including using more atoms to obtain precise representations (Dawson et al. 2016). We explore a concise representation with regular structural patterns (Shen et al. 2022).

Formally, we consider the RNA inverse folding problem as modeling the conditional distribution $p(S \mid X)$, i.e. the sequence distribution conditioned on RNA backbone structures. We establish a diffusion model to learn the conditional sequence distribution. To take advantage of the convenience of defining diffusion models in continuous data spaces (Dieleman et al. 2022, Chen et al. 2023), discrete nucleotide types in the sequence are represented by one-hot encoding and relaxed to continuous values in the real number space as $S \in \mathbb{R}^{4N}$. The continuous-time forward diffusion process in the sequence space $\mathbb{R}^{4N}$ can be described by the forward SDE with $t \in [0, T]$ as $dS_t = f(t) S_t\, dt + g(t)\, dw$. Under this forward SDE, the original sequence at time $t = 0$ is gradually corrupted by adding Gaussian noise. With the linear Gaussian transition kernel of Equation (1) derived from the forward SDE (Yang et al. 2022), we can conveniently sample $S_t = \alpha_t S_0 + \sigma_t \epsilon_S$ at any time $t$ for training, where $\epsilon_S$ is Gaussian noise in the sequence space. For the generative denoising process, the corresponding reverse-time SDE from time $T$ to 0 can be derived from Equation (2) as:

$$dS_t = \left[ f(t) S_t - g^2(t) \nabla_S \log p_t(S_t \mid X) \right] dt + g(t)\, d\bar{w}_t, \tag{3}$$

where $p_t(S_t \mid X)$ is the marginal distribution of sequences given $X$, and the score function $\nabla_S \log p_t(S_t \mid X)$ represents the gradient field of the logarithmic marginal distribution.
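As an illustration of the data representation and the forward corruption described above, here is a small sketch. The nucleotide ordering, the placeholder backbone, and the two $(\alpha_t, \sigma_t)$ pairs are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

NUCLEOTIDES = "AUCG"  # column order is an arbitrary choice for this sketch

def one_hot(seq):
    """Encode an RNA sequence as an (N, 4) one-hot array."""
    idx = [NUCLEOTIDES.index(c) for c in seq]
    return np.eye(4)[idx]

def decode(s):
    """Map an (N, 4) array back to letters by taking the argmax per nucleotide."""
    return "".join(NUCLEOTIDES[i] for i in s.argmax(axis=1))

def noise_sequence(s0, alpha_t, sigma_t, rng):
    """Sample S_t = alpha_t * S_0 + sigma_t * eps_S."""
    return alpha_t * s0 + sigma_t * rng.standard_normal(s0.shape)

rng = np.random.default_rng(0)
seq = "GGAUCC"
S0 = one_hot(seq)                                # (N, 4), a point in R^{4N}
X = np.zeros((3 * len(seq), 3))                  # placeholder backbone, shape (3N, 3)
S_light = noise_sequence(S0, 0.995, 0.1, rng)    # early time: mild corruption
S_heavy = noise_sequence(S0, 0.1, 0.995, rng)    # late time: mostly noise
```

At low noise levels the argmax of `S_light` still decodes to the original sequence, whereas `S_heavy` carries little information about it.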

Once the score function is parameterized, we can numerically solve this reverse SDE to convert random samples from the prior distribution $\mathcal{N}(0, I)$ into the desired sequences. We establish a data prediction model to achieve the score function parameterization, learning to reverse the forward diffusion process. Specifically, we feed the noised sequence data $S_t$, the log signal-to-noise ratio $\lambda_t = \log(\alpha_t^2/\sigma_t^2)$, and the conditioning RNA backbone structures $X$ to the data prediction model $d_\theta(S_t, \lambda_t, X)$. We optimize the data prediction model with a simple weighted squared error objective function:

$$\min_\theta \mathbb{E}_t \left\{ \frac{\alpha_t}{\sigma_t} \mathbb{E}_{S_0, X} \mathbb{E}_{S_t \mid S_0} \left\| d_\theta(S_t, \lambda_t, X) - S_0 \right\|_2^2 \right\}, \tag{4}$$

which can be considered as optimizing a weighted variational lower bound on the data log-likelihood or a form of denoising score matching (Ho et al. 2020, Kingma et al. 2021, Song et al. 2021).
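The per-sample term inside Equation (4) can be written in a few lines. This sketch assumes the prediction and target are both $(N, 4)$ arrays and uses hand-picked values of $\alpha_t$ and $\sigma_t$; the function names are hypothetical.

```python
import numpy as np

def log_snr(alpha_t, sigma_t):
    """lambda_t = log(alpha_t^2 / sigma_t^2)."""
    return np.log(alpha_t ** 2 / sigma_t ** 2)

def weighted_denoising_loss(pred_s0, s0, alpha_t, sigma_t):
    """(alpha_t / sigma_t) * ||pred_s0 - s0||_2^2 for a single sequence."""
    return (alpha_t / sigma_t) * np.sum((pred_s0 - s0) ** 2)

s0 = np.eye(4)[[3, 3, 0, 1]]               # one-hot target with N = 4
perfect = s0.copy()                        # exact recovery
uniform = np.full_like(s0, 0.25)           # uninformative prediction
alpha_t, sigma_t = 0.8, 0.6
lam = log_snr(alpha_t, sigma_t)
loss_perfect = weighted_denoising_loss(perfect, s0, alpha_t, sigma_t)
loss_uniform = weighted_denoising_loss(uniform, s0, alpha_t, sigma_t)
```

Because the weight $\alpha_t/\sigma_t$ grows as the noise level falls, errors made on lightly corrupted inputs are penalized more heavily than errors made near the prior.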

2.2 Model architecture

The architecture design of the data prediction model largely determines the learning quality of the diffusion model. We propose a two-module model to predict the original nucleotide types: a structure module to capture geometric features and a sequence module to capture intrasequential correlation.

2.2.1 Structure module

Geometric deep learning models aim to extract equivariant or invariant features from 3D data and achieve impressive performance in the protein inverse folding task (Ingraham et al. 2019, Jing et al. 2021, Gao et al. 2023). Our structure module is constructed based on the GVP-GNN architecture (Jing et al. 2021) and adapted for RNA backbone structures.

The fixed RNA backbone is first represented as a geometric graph $G = (V, E)$ where each node $v_i \in V$ corresponds to a nucleotide and connects to its top-$k$ nearest neighbors according to the distance of C1′ atoms. The scalar and vector features are extracted from 3D coordinates as node and edge attributes in graphs, which describe the local geometry of nucleotides and their relative geometry. Specifically, the scalar node features in nucleotides are obtained from dihedral angles, while the vector node features consist of forward and reverse vectors of sequential C1′ atoms, as well as the local orientation vectors of C1′ to C4′ and N1/N9. The initial embedding of each edge consists of its connected C1′ atom’s direction vector, Gaussian radial basis encoding for their Euclidean distance, and sinusoidal position encoding (Vaswani et al. 2017) of the relative distance in the sequence. In addition to geometry information, we also append the corrupted one-hot encoding of nucleotide types $S_t$ as the node scalar features. Furthermore, inspired by the widely used self-conditioning technique in diffusion models (Chen et al. 2023, Huang et al. 2023b, Watson et al. 2023), the previously predicted sequence output, denoted as $\tilde{S}_0$, is also considered as node embeddings to enhance the utilization of model capacity. To update the node embeddings, the nucleotide graph employs a standard message-passing technique (Gilmer et al. 2017). This involves combining the neighboring nodes and edges through GVP layers, where scalar and vector features interact via gating to create messages. The resulting messages are then transmitted across the graph to update scalar and vector node representations.
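The graph construction step can be sketched as follows. This is a simplified illustration and not the authors' implementation: it builds only the top-k neighbor list from C1′ coordinates and a Gaussian radial basis encoding of edge lengths. The value of k, the distance range, the number of basis functions, and the synthetic coordinates are all assumptions.

```python
import numpy as np

def knn_edges(c1_coords, k):
    """Connect each nucleotide to its k nearest neighbours by C1' distance.
    Returns (edge_index, distances); edge_index has shape (2, N*k) with rows
    (source neighbour j, target node i)."""
    n = len(c1_coords)
    diff = c1_coords[:, None, :] - c1_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)               # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]       # (N, k), nearest first
    targets = np.repeat(np.arange(n), k)
    sources = nbrs.reshape(-1)
    return np.stack([sources, targets]), dist[targets, sources]

def rbf_encode(d, d_min=0.0, d_max=20.0, num=16):
    """Gaussian radial basis encoding of distances (range and count are assumed)."""
    centers = np.linspace(d_min, d_max, num)
    width = (d_max - d_min) / num
    return np.exp(-((d[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
c1 = rng.normal(scale=10.0, size=(12, 3))        # synthetic C1' coordinates
edge_index, edge_dist = knn_edges(c1, k=4)
edge_rbf = rbf_encode(edge_dist)                 # (N*k, 16) edge scalar features
```

The remaining features named in the text (dihedral angles, direction vectors, sinusoidal encodings of sequence separation) would be concatenated onto these node and edge attributes before message passing.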

2.2.2 Sequence module

The sequential correlation in RNA primary structures is crucial for inverse folding and for obtaining high-quality RNA sequences even with imprecise 3D coordinates. This concept is also applicable in the inverse folding of proteins (Hsu et al. 2022, Zheng et al. 2023). The sequence module takes in $f$-dimensional nucleotide-level embeddings $h_0 \in \mathbb{R}^{N \times f}$ as tokens, which consist of SE(3)-invariant scalar node representations from the structure module and corrupted sequence data. During training, we randomly add self-conditioning sequence data similar to those of the structure module and drop structural features to model both the conditional and the unconditional sequence distributions for further application.

Our sequence module architecture is modified from the Transformer block (Vaswani et al. 2017) to inject diffusion context, log-SNR $\lambda$, or other potential conditional features (e.g. RNA types) (Dhariwal and Nichol 2021, Peebles and Xie 2023). The context input $C$ affects sequence tokens in the form of adaptive normalization and activation layers, which are denoted as adaLN and act functions:

$$\mathrm{adaLN}(h, C) = (1 + \mathrm{MLP}_1(C)) \cdot \mathrm{LN}(h) + \mathrm{MLP}_2(C), \qquad \mathrm{act}(h, C) = \mathrm{MLP}_3(C) \cdot h, \tag{5}$$

where $\mathrm{LN}(\cdot)$ is the layer normalization and $\mathrm{MLP}(\cdot)$ is a multilayer perceptron to learn shift and scale parameters. The $l$-th Transformer block is defined as follows:

$$m_l = \mathrm{MHA}(\mathrm{adaLN}(h_l, \lambda_t)), \quad h_{l+1} = \mathrm{act}(m_l, \lambda_t) + h_l, \quad h_{l+1} = \mathrm{act}(\mathrm{FFN}(\mathrm{adaLN}(h_{l+1}, \lambda_t)), \lambda_t) + h_{l+1}, \tag{6}$$

where $\mathrm{MHA}(\cdot)$ is the multi-head attention layer and $\mathrm{FFN}(\cdot)$ is the feedforward neural network (Vaswani et al. 2017). Finally, the sequence module output $h_L$ is projected to nucleotide one-hot encodings via an extra MLP. The detailed training procedure is given in Algorithm 1.
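The adaptive normalization and gating of Equation (5) can be illustrated with a small sketch. Each MLP is replaced by a single linear map for brevity, and the dimensions and zero-valued weights are assumptions chosen only to make the behaviour easy to check; they do not describe the trained model.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalise each token embedding to zero mean and unit variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def linear(c, w, b):
    """Stand-in for MLP(C): a single linear map."""
    return c @ w + b

def ada_ln(h, c, params):
    """adaLN(h, C) = (1 + MLP1(C)) * LN(h) + MLP2(C)."""
    scale = linear(c, *params["mlp1"])
    shift = linear(c, *params["mlp2"])
    return (1.0 + scale) * layer_norm(h) + shift

def act(h, c, params):
    """act(h, C) = MLP3(C) * h, a context-dependent gate."""
    return linear(c, *params["mlp3"]) * h

rng = np.random.default_rng(0)
n, f, c_dim = 5, 8, 4
h = rng.standard_normal((n, f))
context = rng.standard_normal((1, c_dim))     # e.g. an embedding of the log-SNR
# With zero weights, adaLN reduces to plain LN and the gate outputs zeros.
zeros = {k: (np.zeros((c_dim, f)), np.zeros(f)) for k in ("mlp1", "mlp2", "mlp3")}
out_ln = ada_ln(h, context, zeros)
out_gate = act(h, context, zeros)
```

The context vector is broadcast over all tokens, so every nucleotide embedding is scaled, shifted, and gated by the same noise-level-dependent parameters.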

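Algorithm 1.

RiboDiffusion Training.

1: $t \sim \mathcal{U}(0, 1]$, $(S_0, X) \sim \mathrm{TrainingSet}$

2: $S_t \sim \mathcal{N}(S_t \mid \alpha_t S_0, \sigma_t^2 I)$, $\lambda_t = \log(\alpha_t^2/\sigma_t^2)$, $\tilde{S}_0 \leftarrow 0$

3: if $\mathrm{Uniform}(0, 1.0) < 0.5$ then ▹ Self Conditioning

4:   $\tilde{S}_0 \leftarrow d_\theta([S_t, \tilde{S}_0], \lambda_t, X)$

5:   $\tilde{S}_0 \leftarrow \mathrm{StopGradient}(\tilde{S}_0)$

6: end if

7: if $\mathrm{Uniform}(0, 1.0) < 0.4$ then ▹ Drop Structure Condition

8:   $X \leftarrow 0$

9: end if

10: Minimize $\frac{\alpha_t}{\sigma_t} \left\| d_\theta([S_t, \tilde{S}_0], \lambda_t, X) - S_0 \right\|_2^2$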
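The data flow of Algorithm 1 can be sketched in NumPy as below. This is a simplified illustration: it evaluates the loss for one sample without an optimizer step, the cosine noise schedule is an assumption, and `toy_model` is a hypothetical stand-in for $d_\theta$ rather than the RiboDiffusion network.

```python
import numpy as np

def schedule(t):
    # Assumed cosine schedule; the paper defers schedule details to Kingma et al. (2021).
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def training_loss(model, s0, x, rng, p_self_cond=0.5, p_drop=0.4):
    """One stochastic evaluation of the Algorithm 1 objective."""
    t = 1.0 - rng.uniform(0.0, 1.0)                   # t in (0, 1]
    alpha_t, sigma_t = schedule(t)
    s_t = alpha_t * s0 + sigma_t * rng.standard_normal(s0.shape)
    lam = np.log(alpha_t ** 2 / sigma_t ** 2)
    s0_tilde = np.zeros_like(s0)
    if rng.uniform() < p_self_cond:                   # self-conditioning
        # In an autodiff framework this call would sit under stop-gradient.
        s0_tilde = model(np.concatenate([s_t, s0_tilde], axis=-1), lam, x)
    if rng.uniform() < p_drop:                        # drop structure condition
        x = np.zeros_like(x)
    pred = model(np.concatenate([s_t, s0_tilde], axis=-1), lam, x)
    return (alpha_t / sigma_t) * np.sum((pred - s0) ** 2)

def toy_model(inp, lam, x):
    """Hypothetical denoiser: softmax over the noisy one-hot channels."""
    logits = inp[:, :4]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
s0 = np.eye(4)[[0, 1, 2, 3, 0, 1]]                    # one-hot sequence, N = 6
x = rng.standard_normal((18, 3))                      # placeholder backbone (3N, 3)
losses = [training_loss(toy_model, s0, x, rng) for _ in range(100)]
```

Zeroing the structure input in a fraction of steps is what later allows the same network to produce the unconditional prediction needed for weighted sampling.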

2.3 Sequence sampling

To generate RNA sequences that are likely to fold into the given backbone, we construct a generative denoising process based on the parameterized reverse-time SDE with the optimized data prediction model $d_\theta$, as described in Equation (3). Various numerical solvers for the SDE can be employed for sampling, such as ancestral sampling, the Euler-Maruyama method, etc. We apply convenient ancestral sampling combined with the data prediction model and self-conditioning to generate sequences. Algorithm 2 outlines the specific sampling procedure. For more details on the noise schedule parameters, including $\alpha_t$ and $\sigma_t$, refer to Kingma et al. (2021). We intuitively explain the denoising process as follows: we start by sampling noisy data from a Gaussian distribution that represents a random nucleotide sequence, and we iteratively transform this data toward the desired candidates under the condition of the given RNA 3D backbones.

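Algorithm 2.

RNA inverse folding via RiboDiffusion.

Require: time schedule $\{t_i\}_{i=0}^{M}$, RNA backbone coordinates $X$

1: $\tilde{S}_0 \leftarrow 0$

2: $S_{t_0} = S_T \sim \mathcal{N}(0, I)$

3: for $i \leftarrow 1$ to $M$ do

4:   $t \leftarrow t_{i-1}$, $s \leftarrow t_i$, $\lambda_t \leftarrow \log(\alpha_t^2/\sigma_t^2)$

5:   $\alpha_{t|s} \leftarrow \alpha_t/\alpha_s$, $\sigma_{t|s}^2 \leftarrow \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2$

6:   $\tilde{S}_0 \leftarrow d_\theta([S_t, \tilde{S}_0], \lambda_t, X)$

7:   $\bar{S}_s \leftarrow \frac{\alpha_{t|s} \sigma_s^2}{\sigma_t^2} S_t + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2} \tilde{S}_0$

8:   $S_\epsilon \sim \mathcal{N}(0, I)$

9:   $S_s \leftarrow \bar{S}_s + \frac{\sigma_{t|s} \sigma_s}{\sigma_t} S_\epsilon$

10: end for

11: return $\bar{S}_{t_M}$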
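Algorithm 2 translates almost line for line into code. The sketch below assumes a cosine noise schedule and replaces the trained network with a hypothetical oracle that always predicts a fixed target, which makes it possible to check that the sampler converges to the predicted sequence.

```python
import numpy as np

def schedule(t):
    # Assumed cosine schedule, for illustration only.
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ancestral_sample(model, x, n, times, rng):
    """Run Algorithm 2 over a decreasing time grid `times` (from T towards 0)."""
    s0_tilde = np.zeros((n, 4))
    s_t = rng.standard_normal((n, 4))                  # S_T ~ N(0, I)
    s_bar = s_t
    for t, s in zip(times[:-1], times[1:]):
        alpha_t, sigma_t = schedule(t)
        alpha_s, sigma_s = schedule(s)
        lam = np.log(alpha_t ** 2 / sigma_t ** 2)
        alpha_ts = alpha_t / alpha_s
        sigma_ts2 = sigma_t ** 2 - alpha_ts ** 2 * sigma_s ** 2
        # Self-conditioning: feed the previous prediction back into the model.
        s0_tilde = model(np.concatenate([s_t, s0_tilde], axis=-1), lam, x)
        s_bar = ((alpha_ts * sigma_s ** 2 / sigma_t ** 2) * s_t
                 + (alpha_s * sigma_ts2 / sigma_t ** 2) * s0_tilde)
        noise = rng.standard_normal((n, 4))
        s_t = s_bar + (np.sqrt(sigma_ts2) * sigma_s / sigma_t) * noise
    return s_bar                                       # mean at the final time

target = np.eye(4)[[3, 3, 0, 1, 2, 2]]                 # hypothetical "native" sequence

def oracle_model(inp, lam, x):
    """Stand-in for a trained d_theta that always predicts `target`."""
    return target

rng = np.random.default_rng(0)
times = np.linspace(0.999, 0.001, 51)                  # t_0 = T ... t_M with M = 50
x = np.zeros((18, 3))                                  # placeholder backbone
sample = ancestral_sample(oracle_model, x, n=6, times=times, rng=rng)
```

A discrete sequence is then read off by taking the argmax over the four channels of each row of `sample`.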

Exploring novel RNA sequences that fold into well-defined 3D conformations distinct from the natural sequence is also an essential goal for RNA design, as it has the potential to introduce new functional sequences. This task not only requires the model to generate sequences that satisfy folding constraints but also to increase diversity for subsequent screening. During the generative denoising process, our model can balance the proportion of unconditional and conditional sequence distributions by adjusting the output of the data prediction model. Let w be the conditional scaling weight, and the data prediction model can be modified as:

$$\tilde{d}_\theta(S_t, \lambda_t, X) = w\, d_\theta(S_t, \lambda_t, X) + (1 - w)\, d_\theta(S_t, \lambda_t, 0). \tag{7}$$

Setting $w = 1$ recovers the original conditional data prediction model, while decreasing $w$ below 1 weakens the effect of conditional information and increases sequence diversity. In this way, we achieve a trade-off between recovering the original sequence and ensuring diversity. The distribution weighting technique is also used in diffusion models for text-to-image generation (Ho and Salimans 2022, Saharia et al. 2022).
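The weighting in Equation (7) amounts to two forward passes per step. The following sketch uses a hypothetical toy denoiser that gives a sharp prediction only when the structure input is nonzero, so the effect of $w$ is easy to see.

```python
import numpy as np

def guided_prediction(model, s_t, lam, x, w):
    """Equation (7): w * d(S_t, lam, X) + (1 - w) * d(S_t, lam, 0)."""
    cond = model(s_t, lam, x)
    uncond = model(s_t, lam, np.zeros_like(x))
    return w * cond + (1.0 - w) * uncond

def toy_model(s_t, lam, x):
    """Hypothetical denoiser: confident only when structure is provided."""
    n = s_t.shape[0]
    if np.any(x):
        return np.eye(4)[np.arange(n) % 4]
    return np.full((n, 4), 0.25)

s_t = np.zeros((4, 4))
x = np.ones((12, 3))
full = guided_prediction(toy_model, s_t, 0.0, x, w=1.0)   # purely conditional
half = guided_prediction(toy_model, s_t, 0.0, x, w=0.5)   # softened prediction
```

With $w = 0.5$ the predicted distribution over nucleotides is flatter, which under sampling noise leads to more varied sequences at the cost of recovery.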

3 Results

We comprehensively evaluate and analyze RiboDiffusion for tertiary structure-based RNA inverse folding. Additional results can be found in Supplementary Material. The source code is provided at https://github.com/ml4bio/RiboDiffusion.

3.1 Dataset construction

We gather a dataset of RNA tertiary structures from the PDB database for RNA inverse folding. The dataset contains individual RNA structures and single-stranded RNA structures extracted from complexes. After filtering based on sequence lengths ranging from 20 to 280, there is a total of 7322 RNA tertiary structures and 2527 unique sequences. In addition to experimentally determined data, we construct augmented training data by predicting structures with RhoFold (Shen et al. 2022). The structures predicted from RNAcentral sequences (Sweeney et al. 2019) are filtered by pLDDT to keep only high-quality predictions, resulting in 17 000 structures.

To comprehensively evaluate models, we divide the structures determined by experiments into training, validation, and test sets based on sequence similarity and structure similarity with different clustering thresholds. We use PSI-CD-HIT (Fu et al. 2012) to cluster sequences based on nucleotide similarity. We set the threshold at 0.8/0.6/0.4 and obtain 1252/1157/1114 clusters, respectively. For structure similarity clustering, we calculate the TM-score matrix using US-align (Zhang et al. 2022) and apply the agglomerative clustering algorithm from scipy (Virtanen et al. 2020) on the similarity matrix. We obtain 2036/1659/1302 clusters with TM-score thresholds of 0.6/0.5/0.4. We randomly split the clusters into three groups: 15% for testing, 10% for validation, and the remainder for training. We perform four random splits with nonoverlapping testing and validation sets for each split strategy to evaluate models. The augmented training data are also filtered strictly based on the similarity threshold with the validation and testing sets for each split.
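The cluster-level split can be sketched as follows, assuming cluster labels have already been computed by an upstream tool. The function name, the synthetic identifiers, and the rounding rule are hypothetical; the sketch produces a single split and does not enforce the nonoverlap across the four repeated splits.

```python
import random
from collections import defaultdict

def split_by_cluster(cluster_of, test_frac=0.15, val_frac=0.10, seed=0):
    """Assign whole clusters to train/val/test so no cluster straddles two sets.
    `cluster_of` maps a structure id to its cluster label."""
    clusters = defaultdict(list)
    for item, label in cluster_of.items():
        clusters[label].append(item)
    labels = sorted(clusters)
    random.Random(seed).shuffle(labels)
    n_test = round(test_frac * len(labels))
    n_val = round(val_frac * len(labels))
    parts = {
        "test": labels[:n_test],
        "val": labels[n_test:n_test + n_val],
        "train": labels[n_test + n_val:],
    }
    return {name: [i for lab in labs for i in clusters[lab]]
            for name, labs in parts.items()}

# Synthetic example: 200 structures spread evenly over 40 clusters.
cluster_of = {f"rna_{i}": i % 40 for i in range(200)}
splits = split_by_cluster(cluster_of)
```

Splitting at the cluster level rather than the structure level is what keeps near-duplicate sequences or folds from leaking between training and test sets.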

3.2 RNA inverse folding benchmarking

3.2.1 Baselines

We compare our model with four machine learning baselines with tertiary structure input: gRNAde (Joshi et al. 2023), PiFold (Gao et al. 2023), StructGNN (Ingraham et al. 2019), and GVP-GNN (Jing et al. 2021). While gRNAde is a concurrent graph-based RNA inverse folding method, PiFold, StructGNN, and GVP-GNN are representative deep-learning methods of protein inverse folding, which are modified here to be compatible with RNA. Implementation details of these model modifications are in Supplementary Material. These methods use the same three-atom RNA backbone representation. We also introduce RNA inverse folding methods with secondary structures as input for comparison. RNAinverse (Hofacker et al. 1994) is an energy-based local searching algorithm for secondary structure constraints. MCTS-RNA (Yang et al. 2017) searches candidates based on Monte Carlo tree search. LEARNA and MetaLEARNA are deep reinforcement learning approaches (Runge et al. 2019) to design RNA that folds into the given secondary structures. Each method generates a sequence for every RNA backbone for benchmarking.

3.2.2 Metrics

The recovery rate is a commonly used metric in inverse folding that shows how much of the sequence generated by the model matches the original native sequence. While similar sequences have a higher chance of achieving the correct fold, the recovery rate is not a direct measure of structural fitness. We further evaluate with two metrics: the F1 Score, which assesses the alignment between the generated sequence’s predicted secondary structure (via RNAfold; Gruber et al. 2008) and the secondary structure extracted from the input’s tertiary structure (using DSSR; Lu et al. 2015), and the success rate determined by Rfam’s covariance model (Kalvari et al. 2021), which evaluates the preservation of family-specific information in the generated sequences, indicating conserved structures and functions. Average success rates across families are reported.
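Two of these metrics are simple enough to sketch directly. The recovery rate below follows the definition in the text. The F1 score is computed over base pairs parsed from dot-bracket strings, which is one common convention and is assumed here, since the exact computation is not spelled out in this section. The example strings are made up.

```python
def recovery_rate(designed, native):
    """Fraction of positions where the designed nucleotide equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    return sum(a == b for a, b in zip(designed, native)) / len(native)

def base_pairs(dot_bracket):
    """Parse a pseudoknot-free dot-bracket string into a set of (i, j) pairs."""
    stack, pairs = [], set()
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.add((stack.pop(), j))
    return pairs

def pair_f1(predicted, reference):
    """F1 score over base pairs of two secondary structures."""
    pred, ref = base_pairs(predicted), base_pairs(reference)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

rec = recovery_rate("GGAUCC", "GGAACC")     # 5 of 6 positions match
f1 = pair_f1("((..))", "(....)")            # one shared pair: precision 1/2, recall 1
```

In the benchmark, the predicted structure would come from a folding tool such as RNAfold and the reference from the annotated tertiary structure; both are represented here as plain strings.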

3.2.3 Performance Comparison

We present recovery rate results in Table 1, which contains the average and SD of four nonoverlapping test sets for each model in different cluster settings. Our model outperforms the second best method by 11% on average for sequence similarity splits and 16% for structure similarity splits. RiboDiffusion consistently achieves better recovery rates in RNA with varying degrees of sequence or structural differences from training data. Methods based on tertiary structures outperform those based on secondary structures, as the latter contain less structural information. Results for the F1 score and the success rate are shown in Table 2. It is worth noting that the tools used in these two metrics may contain errors. Our proposed method outperforms or matches the baseline methods in secondary structure alignments and more effectively retains family information from the input RNA.

Table 1.

Recovery rate (%) comparison across six different settings (a).

Columns 2 to 6: Seq 0.8 (b). Columns 7 to 11: Struct. 0.6 (c).
Methods Mean Median Short Medium Long Mean Median Short Medium Long
RNAinverse 25.92 ± 1.1 25.37 ± 1.0 25.99 ± 2.0 24.98 ± 0.8 27.54 ± 1.4 24.94 ± 0.6 24.24 ± 0.5 24.68 ± 0.7 24.98 ± 1.0 26.15 ± 0.9
MCTS-RNA 25.75 ± 0.3 25.61 ± 0.1 25.37 ± 0.4 26.15 ± 0.5 25.86 ± 0.2 25.81 ± 0.5 25.55 ± 0.6 25.38 ± 0.5 26.19 ± 0.6 25.86 ± 0.9
LEARNA 24.80 ± 0.2 24.55 ± 0.3 24.81 ± 0.4 24.86 ± 0.2 24.41 ± 1.0 24.96 ± 0.2 24.43 ± 0.4 24.88 ± 0.5 25.15 ± 0.5 24.36 ± 0.6
MetaLEARNA 29.10 ± 0.6 29.09 ± 0.5 27.43 ± 1.5 29.46 ± 0.7 32.40 ± 0.9 27.83 ± 2.8 27.95 ± 2.5 25.53 ± 1.8 29.51 ± 0.6 30.75 ± 4.5
gRNAde 42.67 ± 5.3 43.03 ± 6.0 36.25 ± 2.0 44.86 ± 4.9 46.06 ± 6.1 43.46 ± 2.2 43.37 ± 2.7 38.01 ± 1.4 49.82 ± 2.7 41.24 ± 3.1
PiFold 50.03 ± 4.7 50.32 ± 6.0 41.34 ± 3.3 53.20 ± 3.7 54.75 ± 5.9 47.89 ± 5.4 48.76 ± 6.6 40.13 ± 1.0 54.95 ± 5.3 45.62 ± 7.7
StructGNN 51.29 ± 5.9 52.40 ± 8.0 42.74 ± 2.5 54.45 ± 7.1 54.44 ± 7.2 55.20 ± 6.9 54.94 ± 8.6 46.36 ± 1.0 63.86 ± 8.5 48.48 ± 11.3
GVP-GNN 51.66 ± 4.9 53.48 ± 6.4 42.70 ± 2.4 56.20 ± 5.7 53.30 ± 5.7 53.76 ± 5.4 54.02 ± 5.9 45.80 ± 0.7 62.28 ± 7.5 47.39 ± 9.0
RiboDiffusion 57.32 ± 4.1 58.79 ± 4.9 52.01 ± 3.1 59.95 ± 3.4 58.91 ± 5.7 66.50 ± 5.3 66.72 ± 5.8 61.51 ± 1.4 73.89 ± 8.4 57.98 ± 7.8
Columns 2 to 6: Seq 0.6. Columns 7 to 11: Struct. 0.5.
Methods Mean Median Short Medium Long Mean Median Short Medium Long
RNAinverse 25.35 ± 0.5 24.30 ± 0.6 25.66 ± 0.6 25.48 ± 1.8 27.76 ± 2.1 25.82 ± 0.6 24.79 ± 0.9 25.38 ± 1.1 25.39 ± 1.0 27.69 ± 1.6
MCTS-RNA 25.81 ± 0.2 25.67 ± 0.2 25.29 ± 0.7 26.22 ± 0.5 26.29 ± 0.6 25.93 ± 0.4 25.47 ± 0.4 25.49 ± 0.5 26.28 ± 0.7 26.06 ± 0.6
LEARNA 24.93 ± 0.1 24.78 ± 0.1 24.92 ± 0.2 25.04 ± 0.6 24.34 ± 1.0 25.00 ± 0.2 24.42 ± 0.6 25.23 ± 0.3 24.64 ± 0.5 24.02 ± 1.2
MetaLEARNA 29.07 ± 3.2 29.89 ± 3.0 25.99 ± 2.4 29.81 ± 0.4 33.89 ± 3.3 28.13 ± 3.5 28.18 ± 3.5 25.81 ± 2.0 29.54 ± 0.9 30.87 ± 3.5
gRNAde 47.28 ± 4.3 49.59 ± 5.4 37.60 ± 1.7 48.66 ± 8.9 47.34 ± 3.5 43.36 ± 6.4 43.61 ± 7.6 36.82 ± 1.5 47.06 ± 6.5 41.74 ± 9.0
PiFold 46.74 ± 2.9 48.54 ± 3.9 37.11 ± 1.6 47.35 ± 4.4 51.32 ± 5.0 49.22 ± 3.0 50.06 ± 3.8 42.48 ± 3.0 53.51 ± 3.6 46.90 ± 5.3
StructGNN 54.23 ± 4.6 57.97 ± 7.0 41.49 ± 1.7 56.09 ± 6.9 53.32 ± 11.0 52.99 ± 8.6 51.81 ± 10.7 44.56 ± 2.4 59.33 ± 8.3 45.06 ± 14.3
GVP-GNN 54.27 ± 3.9 57.60 ± 5.6 42.54 ± 1.9 56.17 ± 5.8 54.20 ± 9.2 50.91 ± 5.7 50.37 ± 6.9 44.74 ± 2.0 56.51 ± 7.1 44.21 ± 9.5
RiboDiffusion 59.06 ± 2.8 61.84 ± 4.2 50.68 ± 2.1 59.66 ± 4.0 59.79 ± 7.9 60.48 ± 6.6 59.31 ± 7.9 55.40 ± 3.8 65.69 ± 9.0 51.14 ± 10.5
Columns 2 to 6: Seq 0.4. Columns 7 to 11: Struct. 0.4.
Methods Mean Median Short Medium Long Mean Median Short Medium Long
RNAinverse 25.53 ± 0.7 24.79 ± 1.0 25.29 ± 0.4 26.18 ± 1.7 27.27 ± 1.8 25.54 ± 0.5 24.47 ± 0.6 25.36 ± 1.0 24.94 ± 0.7 27.08 ± 1.9
MCTS-RNA 25.97 ± 0.0 25.86 ± 0.2 25.44 ± 0.3 26.48 ± 0.3 26.34 ± 0.7 25.81 ± 0.4 25.30 ± 0.3 25.27 ± 0.5 26.17 ± 0.6 25.86 ± 1.0
LEARNA 25.03 ± 0.1 24.55 ± 0.3 25.16 ± 0.1 24.84 ± 0.4 25.01 ± 1.9 25.05 ± 0.1 24.62 ± 0.5 25.21 ± 0.2 24.70 ± 0.6 24.02 ± 1.2
MetaLEARNA 28.94 ± 1.1 29.54 ± 2.3 25.83 ± 2.2 29.94 ± 0.5 35.36 ± 2.8 28.14 ± 3.3 28.31 ± 3.2 25.84 ± 1.9 29.45 ± 0.6 30.01 ± 4.2
gRNAde 43.58 ± 7.6 45.41 ± 10.0 36.02 ± 2.6 43.91 ± 2.0 46.84 ± 12.5 44.00 ± 5.7 44.10 ± 7.1 37.24 ± 1.3 48.01 ± 5.6 41.74 ± 8.9
PiFold 47.41 ± 5.0 49.00 ± 6.7 37.64 ± 1.8 50.38 ± 4.7 52.11 ± 9.8 49.84 ± 2.7 50.61 ± 3.8 42.39 ± 2.9 54.33 ± 3.5 45.92 ± 6.3
StructGNN 50.40 ± 6.7 52.57 ± 10.8 41.03 ± 1.5 51.98 ± 4.6 53.33 ± 13.9 54.65 ± 7.8 53.98 ± 9.7 45.39 ± 2.5 61.35 ± 7.0 44.62 ± 14.4
GVP-GNN 50.55 ± 4.7 52.59 ± 7.0 41.77 ± 0.9 53.73 ± 5.6 51.48 ± 9.3 52.29 ± 5.1 51.84 ± 6.6 45.26 ± 1.9 58.26 ± 5.9 44.04 ± 9.5
RiboDiffusion 57.24 ± 5.0 59.94 ± 7.7 50.06 ± 2.4 58.33 ± 4.5 58.85 ± 11.4 62.13 ± 6.0 61.09 ± 7.6 56.48 ± 3.9 67.94 ± 7.7 50.36 ± 10.9
(a) The average and SD values of model performance on four random-split nonoverlapping test sets are reported. Mean recovery rates are reported for short (L ≤ 50 nt), medium (50 nt < L ≤ 100 nt), and long (L > 100 nt) RNA. (b) Seq 0.8, sequence similarity-based split with 0.8 cluster threshold. (c) Struct. 0.6, structure similarity-based split with 0.6 cluster threshold.
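The headline improvements can be reproduced from the "Mean" columns of Table 1. This is a small arithmetic check and assumes that the reported figure is the average, over the three splits of each kind, of the relative gain over the best baseline (GVP-GNN on the sequence splits and StructGNN on the structure splits, according to the table).

```python
# Mean recovery rates (%) copied from Table 1: (RiboDiffusion, best baseline).
seq_splits = {
    "Seq 0.8": (57.32, 51.66),
    "Seq 0.6": (59.06, 54.27),
    "Seq 0.4": (57.24, 50.55),
}
struct_splits = {
    "Struct. 0.6": (66.50, 55.20),
    "Struct. 0.5": (60.48, 52.99),
    "Struct. 0.4": (62.13, 54.65),
}

def mean_relative_improvement(splits):
    """Average of (ours - baseline) / baseline over splits, in percent."""
    gains = [(ours - base) / base for ours, base in splits.values()]
    return 100.0 * sum(gains) / len(gains)

seq_gain = mean_relative_improvement(seq_splits)         # about 11%
struct_gain = mean_relative_improvement(struct_splits)   # about 16%
```

Under this reading the numbers agree with the 11% and 16% quoted in the text, which indicates that these are relative gains and not differences in percentage points.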

Table 2.

Comparison of secondary structure similarity and success rate of family preservation.

Split Metric gRNAde PiFold StructGNN GVP-GNN RiboDiffusion
Seq 0.8 F1 0.564 0.408 0.761 0.765 0.744
Seq 0.8 Suc. 0.035 0.100 0.266 0.268 0.370
Seq 0.6 F1 0.142 0.336 0.709 0.740 0.749
Seq 0.6 Suc. 0.018 0.031 0.217 0.186 0.316
Seq 0.4 F1 0.424 0.388 0.777 0.802 0.785
Seq 0.4 Suc. 0.033 0.033 0.164 0.138 0.224
Str 0.6 F1 0.571 0.434 0.774 0.785 0.856
Str 0.6 Suc. 0.036 0.023 0.206 0.163 0.305
Str 0.5 F1 0.731 0.440 0.763 0.766 0.786
Str 0.5 Suc. 0.064 0.028 0.140 0.150 0.195
Str 0.4 F1 0.738 0.428 0.744 0.761 0.790
Str 0.4 Suc. 0.060 0.031 0.128 0.134 0.177

F1, F1 score; Suc., success rate of family preservation.

We further classify the RNA in the test set based on its length and type to compare model performance more thoroughly. First, we divide RNA into three categories based on the number of nucleotides (nt), i.e. Short (50 nt or less), Medium (more than 50 nt but <100 nt), and Long (100 nt or more). It can be observed in Table 1 that RiboDiffusion maintains performance advantages across different lengths of RNA. Short RNAs present a challenge for the model to recover the original sequence due to their flexible conformation, causing a relatively low recovery rate when compared to medium-length RNAs. A more detailed correlation of RiboDiffusion performance with RNA length is shown in Supplementary Material. Each split shows similar patterns: RiboDiffusion has higher variance in short RNA inverse folding, and the model’s performance becomes limited as RNA length increases. Moreover, Fig. 2 shows the recovery rate distribution of different RNA types with over 10 structures in test sets, including rRNA, tRNA, sRNA, ribozymes, etc. The RNA type information is collected from RNAcentral (Sweeney et al. 2019). Compared to the baselines, RiboDiffusion still has a better recovery rate distribution across RNA types. Through comprehensive benchmarking, we have observed remarkable performance improvement in tertiary structure-based RNA inverse folding achieved by RiboDiffusion.

Figure 2.


Violin plots for the recovery rate distribution of methods for different types of RNA, including tRNA, rRNA, sRNA, ribozyme, snRNA, SRP RNA, hammerhead ribozyme, and pre miRNA.

3.3 Analysis of RiboDiffusion

We dive into a more comprehensive analysis of RiboDiffusion.

3.3.1 Cross-family performance

We repartition the dataset under a cross-family setting to further verify the generalization of our model. We obtain the RNA family corresponding to each tertiary structure from Rfam (Kalvari et al. 2021), then randomly select four families for testing and use the others for training. The experimental results of four nonoverlapping splits are shown in Fig. 3. The average recovery rate of RiboDiffusion in each family generally ranges between 0.4 and 0.6. In particular, our model performs well on RF02540, whose sequence length far exceeds that of the sequences in the training set. Although the performance is slightly worse than on the splits in Table 1, these results still illustrate that our model can handle RNA families that do not appear in the training data, considering that the cross-family setting is inherently more difficult.
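A cross-family split of this kind can be sketched as follows. The helper name, the `family_of` mapping, and the fixed seed are assumptions made for illustration; the text only states that four families are chosen at random for testing.

```python
import random


def cross_family_split(family_of, n_test_families=4, seed=0):
    """Hold out whole families so that no test family appears in training.

    family_of: dict mapping a structure ID to its family label (hypothetical input).
    Returns (train_ids, test_ids, test_families).
    """
    families = sorted(set(family_of.values()))
    if n_test_families > len(families):
        raise ValueError("not enough families to hold out")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    test_families = set(rng.sample(families, n_test_families))
    train_ids = [s for s, fam in family_of.items() if fam not in test_families]
    test_ids = [s for s, fam in family_of.items() if fam in test_families]
    return train_ids, test_ids, test_families
```

Splitting on the family label rather than on individual structures is what makes the setting harder than the similarity-based splits: every test structure belongs to a family the model has never seen.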

Figure 3.

Performance of RiboDiffusion on different RNA families under the cross-family setting. The average length and number of tertiary structures for each family are marked above violin plots.

3.3.2 In silico tertiary structure folding validation

To verify whether sequences generated by RiboDiffusion can fold into a given RNA 3D backbone, we use computational RNA structure prediction methods, namely RhoFold (Shen et al. 2022) and DRFold (Li et al. 2023), to obtain their tertiary structures. Structure prediction models with a single sequence input are used because it is difficult to find homologous sequences for generated sequences and to perform multiple sequence alignment. We take the TM-score of C1′ backbone atoms to measure the similarity between the predicted RNA structure of generated sequences and the given fixed backbones. Note that in silico folding validation contains two sources of error. One is the structure prediction error of the folding method itself, and the other is the quality of the sequences generated by RiboDiffusion. Therefore, we also predict the structure from the original native sequence using the same folding method and compare it to the given RNA backbone as a reference for error and uncertainty.
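For readers unfamiliar with the metric, the core of the TM-score can be sketched as below. This is a simplified illustration, not the evaluation code used here: it scores a single, already-superposed set of aligned C1′ pairs, whereas alignment tools search over superpositions to maximize the score. The distance scale `d0` depends on the sequence length and is passed in by the caller because its exact form for RNA is not stated in the text.

```python
def tm_score(distances, target_length, d0):
    """TM-score for one fixed superposition.

    distances: distances in angstroms between aligned C1' atom pairs.
    target_length: number of nucleotides in the reference backbone.
    d0: length-dependent distance scale (assumed to be supplied by the caller).
    """
    if target_length <= 0 or d0 <= 0:
        raise ValueError("target_length and d0 must be positive")
    # Each aligned pair contributes a value in (0, 1]; unaligned residues add 0.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

Because the sum is normalized by the reference length, residues left unaligned lower the score, which is why the native-sequence prediction serves as a useful upper reference for each folding method.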

As depicted in Fig. 4a, sequences generated by RiboDiffusion exhibit promising folding results in the fixed backbone for medium-length and long RNAs. However, the performance for short RNAs is relatively poor, which is affected by the unsatisfactory recovery rate of our model and the limitations of RhoFold itself. We also show the folding performance using DRFold in Fig. 4b, where RiboDiffusion exhibits distribution shapes similar to those obtained with RhoFold. Here, due to the limited inference speed of DRFold, we only test on the representative sequence of each cluster instead of the entire test set. We further present in silico folding (with RhoFold) case studies of rRNA, tRNA, and riboswitch in Fig. 4e. RiboDiffusion generates new sequences that are different but still tend to fold into similar geometries. To alleviate concerns about the independence of structure prediction and inverse folding models, we provide results from alternative tools and evaluations of structures independent of current datasets as an extra reference in Supplementary Material.

Figure 4.

Analysis of RiboDiffusion. (a, b) In silico folding validation results that show the TM-score between structures predicted by RhoFold or DRFold and the given fixed RNA backbones (on Seq. 0.4 split). Native represents structures predicted from original sequences of given backbones as references, while Generated represents structures predicted from generated sequences. (c, d) Trade-offs between the diversity of generated sequences and recovery rate, as well as refolding F1-score (including models with and without augmented data). (e) Visualization of input RNA structures (pink) and predicted structures (green) of generated sequences. The generated sequences and the corresponding native sequences are shown below the structure visualization, where different nucleotide types are marked in red.

3.3.3 Trade-off between sequence recovery and diversity

Exploring novel RNA sequences that differ from native sequences yet have the potential to fold into a fixed backbone is a realistic demand in RNA design. However, there is a trade-off between the diversity and the recovery rate of the generated sequences. RiboDiffusion can balance the two by controlling the conditional scaling weight. For the representative input backbone of each cluster, we generate eight sequences in total to report diversity. The diversity within the generated set of sequences $G$ is defined as $\mathrm{IntDiv}(G) = 1 - \frac{1}{|G|^2} \sum_{S_1, S_2 \in G} \mathrm{Sim}(S_1, S_2)$ (Benhenda 2017). The function $\mathrm{Sim}$ compares two sequences by calculating the ratio of the length of the aligned subsequence to the length of the shorter sequence. Fig. 4c shows that the mean diversity of generated sequences in the test sets begins to increase when the conditional scaling weight is set to 0.5, while the recovery rate and the F1 score decrease to some extent. Therefore, we recommend using a value between 0.35 and 0.5 to adjust the sequence diversity.
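The internal diversity measure can be sketched as follows. The alignment procedure behind Sim is not specified in this section, so the sketch substitutes the longest common subsequence as a stand-in for the "aligned subsequence"; that choice and all function names are assumptions for illustration.

```python
from itertools import product


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (stand-in for the aligned subsequence)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim(a: str, b: str) -> float:
    """Ratio of the aligned length to the length of the shorter sequence."""
    return lcs_length(a, b) / min(len(a), len(b))


def int_div(sequences) -> float:
    """IntDiv(G) = 1 - (1 / |G|^2) * sum of Sim over all ordered pairs, self-pairs included."""
    n = len(sequences)
    if n == 0:
        raise ValueError("need at least one sequence")
    total = sum(sim(s1, s2) for s1, s2 in product(sequences, repeat=2))
    return 1.0 - total / n**2
```

Since the double sum includes each sequence paired with itself, the value is bounded above by 1 − 1/|G|, which is 0.875 for the eight sequences generated per backbone.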

3.3.4 Training data augmentation analysis

Training data augmentation is primarily motivated by the scarcity and limited diversity of RNA structures available in the PDB. Table 3 indicates that incorporating additional RhoFold predictions improves the overall quality of generated sequences. This augmentation also enhances the ability of RiboDiffusion to adjust sequence diversity, as shown in Fig. 4d, where the sequence diversity of the model without the augmented data remains relatively low. Notably, the noisy nature of augmented data requires appropriate preprocessing and filtering for quality assurance.

Table 3.

Ablation study on data augmentation.

Model            Rec. mean   Rec. median   F1 score   Rfam success
RiboDiffusion    57.24%      59.94%        0.785      0.224
w/o Augment      55.26%      57.01%        0.768      0.221

Rec., recovery rate.

4 Conclusion

We propose RiboDiffusion, a generative diffusion model for RNA inverse folding based on tertiary structures. By benchmarking methods on sequence and structure similarity splits, comparing performance across RNA length and type, and validating with in silico folding, we demonstrate the effectiveness of our model. Our model can also trade off recovery against diversity and handle cross-family inverse folding. In future work, we aim to expand the scope of RiboDiffusion by exploring RNA sequences that span a wider range of sizes and by integrating contact information from complexes into the model. Our ultimate objective is to use the model to design functional RNAs such as ribozymes, riboswitches, and aptamers, and to verify its effectiveness in wet lab experiments.

Supplementary Material

btae259_Supplementary_Data

Contributor Information

Han Huang, Department of Computer Science and Engineering, CUHK, Hong Kong SAR, 999077, China; School of Computer Science and Engineering, Beihang University, Beijing, 100191, China.

Ziqian Lin, Department of Computer Science and Engineering, CUHK, Hong Kong SAR, 999077, China; School of Artificial Intelligence, Nanjing University, Nanjing, 210023, China.

Dongchen He, Department of Computer Science and Engineering, CUHK, Hong Kong SAR, 999077, China.

Liang Hong, Department of Computer Science and Engineering, CUHK, Hong Kong SAR, 999077, China.

Yu Li, Department of Computer Science and Engineering, CUHK, Hong Kong SAR, 999077, China.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

The work was supported by the Chinese University of Hong Kong with the award numbers 4937025, 4937026, 5501517, and 5501517 and partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK 24204023) and a grant from the Innovation and Technology Commission of the Hong Kong Special Administrative Region, China (Project No. GHP/065/21SZ). This research is also funded by RMGS in CUHK with the award numbers 8601603 and 8601663.

References

  1. Andronescu M, Fejes AP, Hutter F. et al. A new algorithm for RNA secondary structure design. J Mol Biol 2004;336:607–24.
  2. Baek M, McHugh R, Anishchenko I. et al. Accurate prediction of protein-nucleic acid complexes using RoseTTAFoldNA. Nat Methods 2024;21:117–21.
  3. Bank PD. Protein data bank. Nature New Biol 1971;233:223.
  4. Benhenda M. ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity? arXiv, arXiv:1708.08227, 2017, preprint: not peer reviewed.
  5. Busch A, Backofen R. Info-RNA – a fast approach to inverse RNA folding. Bioinformatics 2006;22:1823–31.
  6. Chappell J, Watters KE, Takahashi MK. et al. A renaissance in RNA synthetic biology: new mechanisms, applications and tools for the future. Curr Opin Chem Biol 2015;28:47–56.
  7. Chappell J, Westbrook A, Verosloff M. et al. Computational design of small transcription activating RNAs for versatile and dynamic gene regulation. Nat Commun 2017;8:1051.
  8. Chen T, Zhang R, Hinton G. Analog bits: generating discrete data using diffusion models with self-conditioning. International Conference on Learning Representations 2023. https://openreview.net/forum?id=3itjR9QxFw.
  9. Churkin A, Retwitzer MD, Reinharz V. et al. Design of RNAs: comparing programs for inverse RNA folding. Brief Bioinformatics 2018;19:350–8.
  10. Dauparas J, Anishchenko I, Bennett N. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378:49–56.
  11. Dawson WK, Maciejczyk M, Jankowska EJ. et al. Coarse-grained modeling of RNA 3D structure. Methods 2016;103:138–56.
  12. Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. NeurIPS 2021;34:8780–94.
  13. Dieleman S, Sartran L, Roshannai A. et al. Continuous diffusion for categorical data. arXiv, arXiv:2211.15089, 2022, preprint: not peer reviewed.
  14. Dotu I, Garcia-Martin JA, Slinger BL. et al. Complete RNA inverse folding: computational design of functional hammerhead ribozymes. Nucleic Acids Res 2014;42:11752–62.
  15. Espah Borujeni A, Mishler DM, Wang J. et al. Automated physics-based design of synthetic riboswitches from diverse RNA aptamers. Nucleic Acids Res 2016;44:1–13.
  16. Findeiß S, Etzel M, Will S. et al. Design of artificial riboswitches as biosensors. Sensors 2017;17:1990.
  17. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.
  18. Gao Z, Tan C, Li SZ. PiFold: toward effective and efficient protein inverse folding. International Conference on Learning Representations 2023. https://openreview.net/forum?id=oMsN9TYwJ0j.
  19. Garcia-Martin JA, Clote P, Dotu I. RNAiFold: a constraint programming algorithm for RNA inverse folding and molecular design. J Bioinform Comput Biol 2013;11:1350001.
  20. Gilmer J, Schoenholz SS, Riley PF. et al. Neural message passing for quantum chemistry. In: ICML, p. 1263–72. PMLR, 2017.
  21. Gruber AR, Lorenz R, Bernhart SH. et al. The Vienna RNA websuite. Nucleic Acids Res 2008;36:W70–4.
  22. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. NeurIPS 2020;33:6840–51.
  23. Ho J, Salimans T. Classifier-free diffusion guidance. arXiv, arXiv:2207.12598, 2022, preprint: not peer reviewed.
  24. Hofacker IL, Fontana W, Stadler PF. et al. Fast folding and comparison of RNA secondary structures. Monatsh Chem 1994;125:167–88.
  25. Hsu C, Verkuil R, Liu J. et al. Learning inverse folding from millions of predicted structures. In: ICML, p. 8946–70. PMLR, 2022.
  26. Huang H, Sun L, Du B. et al. GraphGDP: generative diffusion processes for permutation invariant graph generation. In: ICDM, p. 201–10. IEEE, 2022.
  27. Huang H, Sun L, Du B. et al. Conditional diffusion based on discrete graph structures for molecular graph generation. AAAI 2023a;37:4302–11.
  28. Huang H, Sun L, Du B. et al. Learning joint 2D & 3D diffusion models for complete molecule generation. arXiv, arXiv:2305.12347, 2023b, preprint: not peer reviewed.
  29. Ingraham J, Garg V, Barzilay R. et al. Generative models for graph-based protein design. NIPS 2019;32:15820–31.
  30. Jing B, Eismann S, Suriana P. et al. Learning from protein structure with geometric vector perceptrons. International Conference on Learning Representations 2021. https://openreview.net/forum?id=xMrYGO-rfuh.
  31. Joshi CK, Jamasb AR, Viñas R. et al. Multi-state RNA design with geometric multi-graph neural networks. arXiv, arXiv:2305.14749, 2023, preprint: not peer reviewed.
  32. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
  33. Kalvari I, Nawrocki EP, Ontiveros-Palacios N. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 2021;49:D192–200.
  34. Kingma D, Salimans T, Poole B. et al. Variational diffusion models. NeurIPS 2021;34:21696–707.
  35. Kleinkauf R, Houwaart T, Backofen R. et al. antaRNA–multi-objective inverse folding of pseudoknot RNA using ant-colony optimization. BMC Bioinformatics 2015;16:389–7.
  36. Laganà A, Veneziano D, Russo F. et al. Computational design of artificial RNA molecules for gene regulation. RNA Bioinformatics 2015;1269:393–412.
  37. Li Y, Zhang C, Feng C. et al. Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction. Nat Commun 2023;14:5745.
  38. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30.
  39. Lu X-J, Bussemaker HJ, Olson WK. DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res 2015;43:e142.
  40. McKeague M, Wong RS, Smolke CD. Opportunities in the design and application of RNA for gene expression control. Nucleic Acids Res 2016;44:2987–99.
  41. Park SV, Yang J-S, Jo H. et al. Catalytic RNA, ribozyme, and its applications in synthetic biology. Biotechnol Adv 2019;37:107452.
  42. Peebles W, Xie S. Scalable diffusion models with transformers. In: ICCV, p. 4195–205. IEEE, 2023.
  43. Runge F, Stoll D, Falkner S. et al. Learning to design RNA. International Conference on Learning Representations 2019. https://openreview.net/forum?id=ByfyHh05tQ.
  44. Saharia C, Chan W, Saxena S. et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS 2022;35:36479–94.
  45. Shen T, Hu Z, Peng Z. et al. E2Efold-3D: end-to-end deep learning method for accurate de novo RNA 3D structure prediction. arXiv, arXiv:2207.01586, 2022, preprint: not peer reviewed.
  46. Sohl-Dickstein J, Weiss E, Maheswaranathan N. et al. Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML, p. 2256–65. PMLR, 2015.
  47. Song Y, Sohl-Dickstein J, Kingma DP. et al. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations 2021. https://openreview.net/forum?id=PxTIG12RRHS.
  48. Sweeney BA, Petrov AI, Burkov B. et al. RNAcentral: a hub of information for non-coding RNA sequences. Nucleic Acids Res 2019;47:D221–9.
  49. Taneda A. MODENA: a multi-objective RNA inverse folding. Adv Appl Bioinform Chem 2011;4:1–12.
  50. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. In: NIPS 2017;30:6000–10.
  51. Vicens Q, Kieft JS. Thoughts on how to think (and talk) about RNA structure. Proc Natl Acad Sci USA 2022;119:e2112677119.
  52. Virtanen P, Gommers R, Oliphant TE, SciPy 1.0 Contributors et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 2020;17:261–72.
  53. Watson JL, Juergens D, Bennett NR. et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620:1089–100.
  54. Yang L, Zhang Z, Song Y. et al. Diffusion models: a comprehensive survey of methods and applications. arXiv, arXiv:2209.00796, 2022, preprint: not peer reviewed.
  55. Yang X, Yoshizoe K, Taneda A. et al. RNA inverse folding using Monte Carlo tree search. BMC Bioinformatics 2017;18:468–12.
  56. Yesselman JD, Das R. RNA-Redesign: a web server for fixed-backbone 3D design of RNA. Nucleic Acids Res 2015;43:W498–501.
  57. Zhang C, Shine M, Pyle AM. et al. US-Align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods 2022;19:1109–15.
  58. Zheng Z, Deng Y, Xue D. et al. Structure-informed language models are protein designers. bioRxiv, 2023, preprint: not peer reviewed.
  59. Zhu Y, Zhu L, Wang X. et al. RNA-based therapeutics: an overview and prospectus. Cell Death Dis 2022;13:644.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
