Briefings in Bioinformatics. 2025 Aug 4;26(4):bbaf386. doi: 10.1093/bib/bbaf386

mRNA folding algorithms for structure and codon optimization

Max Ward 1, Mary Richardson 2, Mihir Metkar 3
PMCID: PMC12319323  PMID: 40755283

Abstract

mRNA technology has revolutionized vaccine development, protein replacement therapies, and cancer immunotherapies, offering rapid production and precise control over sequence and efficacy. However, the inherent instability of mRNA poses significant challenges for drug storage and distribution, particularly in resource-limited regions. Co-optimizing RNA structure and codon choice has emerged as a promising strategy to enhance mRNA stability while preserving efficacy. Given the vast sequence and structure design space, specialized algorithms are essential to achieve these qualities. Recently, several effective algorithms have been developed to tackle this challenge that all use similar underlying principles. We call these specialized methods mRNA folding algorithms as they generalize classical RNA folding algorithms. Initial laboratory testing of mRNA folding optimized mRNA vaccines, such as those encoding SARS-CoV-2 spike and VZV gE, has shown promising improvements in both in-solution stability and immunogenicity. While these biological properties are beginning to be evaluated experimentally, a comprehensive in silico analysis of the underlying principles, performance, and limitations of these design algorithms is equally essential. Thus, this review aims to provide an in-depth understanding of these algorithms, identify opportunities for improvement, and benchmark existing software implementations in terms of scalability, correctness, and feature support.

Keywords: mRNA folding, mRNA optimization, algorithms, RNA, dynamic programming, RNA secondary structure, mRNA therapeutics, algorithm benchmarking

Introduction

The development of mRNA vaccines against Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has unequivocally demonstrated the potential of mRNA therapeutics to combat and control infectious diseases [1, 2]. Beyond vaccines, mRNA therapeutics are showing significant promise in early-phase clinical trials for cancer neoantigen vaccines, enzyme replacement therapies, and as a delivery platform for gene-editing enzymes [3], paving the way for treatments targeting diverse medical conditions. As informational molecules, mRNAs encode the desired therapeutic protein directly within their sequence, offering unparalleled flexibility in design and production [4]. This adaptability positions mRNA as a versatile platform for addressing numerous therapeutic challenges.

Despite its advantages, the inherent instability of mRNA remains a significant barrier to its widespread use. mRNAs are highly prone to hydrolytic degradation, necessitating ultracold storage and specialized supply chains to preserve in-vial stability [5]. These logistical hurdles disproportionately affect resource-poor regions, restricting access to mRNA-based medicines. Overcoming mRNA storage and transport instability is therefore crucial to improving its global distribution, scalability, and equitable access.

One promising strategy to enhance mRNA stability is the co-optimization of RNA secondary structure and codon usage. RNA structure plays a pivotal role in determining susceptibility to hydrolytic degradation [6], while codon usage affects translational efficiency and protein expression levels [4, 7]. However, for any given protein sequence, the number of possible mRNA sequences and their associated structures is astronomically large. For instance, the SARS-CoV-2 spike protein has Inline graphic possible nucleotide sequences, each having Inline graphic possible secondary structures [4].

As a result, a growing number of algorithms have recently been developed to address this multi-objective optimization problem [6, 8–11]. Among these, a class we refer to as “mRNA folding” algorithms has garnered particular attention in recent years [8, 12]. Although popular recently, their foundational concepts have appeared in earlier work [13, 14].

mRNA folding algorithms extend standard RNA folding algorithms [15–18] by operating under codon constraints (selecting only from synonymous codons that encode a given protein) and incorporating mRNA-specific properties, such as codon usage bias [7, 8]. This review introduces the fundamentals of mRNA folding algorithms, highlights research gaps, and proposes opportunities for future improvements. Additionally, we provide a comprehensive comparison and benchmark of existing software packages that implement mRNA folding algorithms. By offering a foundational overview of this rapidly evolving subfield of mRNA therapeutics, we aim to guide researchers in selecting and improving algorithms for the rational design of next-generation mRNA therapeutics.

mRNA folding algorithms are related to both RNA folding algorithms and mRNA design algorithms (Fig. 1A). While they share some underlying principles, there are key differences in their goals and constraints. Understanding these distinctions is important for conceptual clarity before diving into the specifics of mRNA folding algorithms.

Figure 1.

Panel (A) shows a Venn diagram with mRNA Design methods on the left, RNA folding methods on the right, and mRNA Folding methods in the intersection. (B) shows the Pareto optimal frontier of stability versus codon optimality; as codon optimality increases stability decreases.

The algorithmic landscape around mRNA folding algorithms. (A) mRNA folding algorithms are at the intersection of mRNA design algorithms (blue) and RNA folding algorithms (red). (B) Modern mRNA folding algorithms balance two objectives: MFE and CAI. In general, there is a tradeoff between decreasing the MFE and increasing the Codon Adaptation Index (CAI). We depict the Pareto front schematically here. CDSfold and the Cohen and Skiena method do not incorporate CAI, so they generate the mRNA sequence with the minimum MFE. LinearDesign and DERNA attempt to balance these two objectives with a $\lambda$ parameter, which weights the CAI objective. $\lambda = 0$ is equivalent to CDSfold and Cohen and Skiena, while $\lambda = \infty$ is equivalent to standard CAI-based design tools. DERNA generates a set of Pareto-optimal solutions, equivalent to sampling the range of $\lambda$ values.

Comparison to mRNA design algorithms

Within the broader landscape of “mRNA design” algorithms, mRNA folding algorithms represent a specialized class focused on structure-aware mRNA optimization. In general, mRNA design algorithms generate an mRNA sequence that optimizes some set of sequence and/or structural properties [6, 8, 9, 19]. mRNA folding algorithms differ from other approaches, many of which take a general optimization technique and adapt it to mRNA design. For example, RiboTree [6] adapts a Monte Carlo tree search, which is a general optimization method used across many problems. Similarly, mRNAid [10] optimizes sequences using random mutations, another general strategy used by several methods [20, 21]. This paradigm holds for many other mRNA design algorithms that use off-the-shelf optimization strategies, including evolutionary algorithms [22], sliding windows [23], and greedily picking frequently used codons [24]. In contrast, mRNA folding algorithms build directly on RNA secondary structure prediction methods, adapting them to account for coding constraints and mRNA-specific design goals. An mRNA folding algorithm could not be used on an optimization problem other than structure-aware mRNA design. The relative effectiveness of mRNA folding algorithms is probably due to the “No Free Lunch” theorem in optimization [25], which says (among other things) that specialized algorithms typically beat general-purpose algorithms on their specific task.

Recent deep learning methods do not cleanly fit into this “No Free Lunch” dichotomy as they are trained specifically on the mRNA design problem [11, 26, 27]. They differ from mRNA folding methods in that they require extensive training data and are “black boxes” whereas mRNA folding algorithms are relatively easy to interpret as they optimize well-known quantities (codon optimality and RNA structure). However, deep learning methods have the significant benefit of scaling with more training data and so have the capacity to become the dominant mRNA design paradigm eventually.

Comparison to RNA design algorithms

RNA design encompasses a wide range of problems with mRNA design being just one category. These include CRISPR guide RNA design [28], inverse RNA folding (also called RNA structure design) [29, 30], RNA (and DNA) origami [31], and more. Risking some oversimplification, broadly dividing RNA design problems into coding RNA (mRNA design) and non-coding RNA (those mentioned in this section) is informative. In general terms, non-coding RNA design focuses on finding sequences with specific sequence and/or structural features, whereas coding RNA design focuses on finding valid codon sequences. Importantly, though stability (and thus structure) matters for coding RNA design, typically, there is no specific target structure. Thus, this review focuses on algorithms for designing coding RNAs (mRNAs), and does not address methods for non-coding RNA sequence design.

For the sake of clarity, the reader should be aware that some software packages contain algorithms for both RNA folding and RNA structure design. For instance, ViennaRNA [18], RNAstructure [17], and NUPACK [32] include RNA folding algorithms as well as RNA design tools.

Comparison to RNA folding algorithms

At their core, mRNA folding algorithms extend classical RNA folding algorithms used for RNA secondary structure prediction. The dynamic programming method for RNA folding was first introduced by Zuker and Stiegler in 1981 [15]. Their algorithm forms the core of modern RNA structure prediction tools such as ViennaRNA [18], RNAstructure [17], Mfold [16], and NUPACK [32]. Extending the nomenclature, we define algorithms that leverage RNA folding principles for mRNA design as “mRNA folding” algorithms, since they are essentially built on top of the same dynamic programming recursions that RNA folding algorithms use.

There is a clear difference between RNA folding and mRNA folding. RNA folding algorithms predict the secondary structure of an RNA sequence, whereas mRNA folding algorithms find an optimized coding sequence for a given protein. While these two tasks are fundamentally different, both algorithm types rely on a similar dynamic programming formulation. We explore the details of this in Section 2 and Section 3.

An overview of mRNA folding algorithms

Currently, four mRNA folding algorithms are described in the literature, which we summarize in Table 1. The first, published by Cohen and Skiena [13], maximized mRNA structure by minimizing Minimum Free Energy (MFE). This was followed by CDSfold [14], which improved the algorithmic efficiency significantly and added several additional capabilities. Both methods have proven prescient as they predate the recent surge of interest in mRNA design. These algorithms modify the Zuker–Stiegler dynamic programming recursions to minimize free energy under codon constraints. Formally, let $\mathrm{MFE}(x)$ calculate the MFE of a sequence $x$ (e.g. by using the Zuker–Stiegler algorithm), and let $\mathrm{CDS}(p)$ be the set of valid protein-coding sequences for a protein $p$; then they calculate $\min_{x \in \mathrm{CDS}(p)} \mathrm{MFE}(x)$. The intuition is that a lower MFE implies higher stability [6, 9, 33]. The Cohen–Skiena method achieves this by adding codon conditions to the Zuker–Stiegler recursions. CDSfold instead uses a graph-based representation of valid codon sequences, enabling a significantly faster algorithm. Both methods share a notable limitation: they cannot simultaneously optimize stability (measured by MFE) and translation efficiency (measured by Codon Adaptation Index (CAI) [7]).

Table 1.

Comparison of algorithmic features across mRNA folding packages. Entries highlighted in green are the most desirable quality for the corresponding column. The “Exact” column refers to whether the algorithm is guaranteed to return exactly correct results, or employs a heuristic instead.

Algorithm | Year | Method | MFE | CAI | Pareto Optimal | Exact | Beam Search | Publicly Available
LinearDesign [8] | 2023 | Codon Graph | | | | | |
DERNA [12] | 2024 | Codon-Constrained | | | | | |
CDSfold [14] | 2016 | Codon Graph | | | | | |
Cohen and Skiena [13] | 2003 | Codon-Constrained | | | | | |

The next method was LinearDesign [8], which mitigated this limitation by co-optimizing for MFE and CAI. It further improved upon CDSfold by incorporating a beam search heuristic, substantially increasing algorithmic speed at the moderate cost of a potentially suboptimal mRNA. LinearDesign gained significant attention within the broader mRNA design community, evidenced by its publication in Nature [8] and the biological validation of designed sequences (see Section 1.5).

LinearDesign balances the MFE and CAI weights using a mixing factor $\lambda$, defining the sequence-structure score as $\mathrm{MFE} - \lambda\,|p|\,\log \mathrm{CAI}$, where $|p|$ is the protein length (Fig. 1B). However, this approach presents two challenges: first, users who want to target a specific CAI need to search for the right $\lambda$; second, users unsure about the target CAI need to make an arbitrary choice. Another recent alternative, DERNA [12], addressed this limitation by finding all Pareto optimal solutions for CAI and MFE, thus allowing users to find the best MFE for every possible CAI. However, DERNA has some drawbacks compared to LinearDesign and CDSfold. It is slower, even when not computing the Pareto optimal frontier. Gu et al. reported a 6-hour maximum run time for DERNA on their benchmarks versus 19 minutes for LinearDesign [12]. This is because DERNA extends the older codon-condition-based approach from the Cohen–Skiena algorithm, rather than the faster graph-based approach introduced by CDSfold and extended by LinearDesign.
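The notion of a Pareto optimal solution can be illustrated with a minimal sketch. The candidate designs and their (MFE, CAI) values below are hypothetical, and `pareto_front` is an illustrative helper, not part of DERNA or LinearDesign. A design is kept if no other design has both an MFE at least as low and a CAI at least as high, with one of the two strictly better.

```python
def pareto_front(designs):
    """Return the designs that are not dominated, sorted by MFE.

    Each design is (name, mfe, cai). Lower MFE and higher CAI are better.
    """
    front = []
    for name, mfe, cai in designs:
        dominated = any(
            (m <= mfe and c >= cai) and (m < mfe or c > cai)
            for _, m, c in designs
        )
        if not dominated:
            front.append((name, mfe, cai))
    return sorted(front, key=lambda d: d[1])


# Hypothetical candidates: (name, MFE in kcal/mol, CAI).
DESIGNS = [
    ("A", -12.0, 0.60),
    ("B", -10.0, 0.85),
    ("C", -9.0, 0.80),   # dominated by B: higher MFE and lower CAI
    ("D", -6.0, 1.00),
]

FRONT = pareto_front(DESIGNS)
```

This quadratic filter only works on an explicit list of candidates. An mRNA folding algorithm must find the front without enumerating the full design space, which is what makes the problem hard.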

From a practitioner's standpoint, the choice is among LinearDesign, CDSfold, and DERNA. Cohen and Skiena’s method is superseded by the newer approaches and lacks publicly available source code. The publicly available version of LinearDesign supports CAI as well as MFE optimization and offers speed at the cost of using a heuristic (beam search). CDSfold, while more efficient than DERNA, supports only MFE optimization. DERNA supports Pareto optimization over CAI and MFE, but uses a slower algorithm similar to the older Cohen–Skiena method.

The implementation details of mRNA folding algorithms are either only partially available or entirely absent from the literature. The CDSfold algorithm is fully explained [14], but it does not incorporate CAI. LinearDesign [8] is explained only in a simplified version that omits the full algorithm. DERNA [12] and the Cohen–Skiena algorithm [13] are described in full, but use inefficient algorithms.

To address this gap, Section 3 provides a comprehensive, step-by-step explanation of how mRNA folding algorithms are constructed, including full algorithmic details. We simplify and unify existing approaches by introducing a “codon graph” framework. Finally, Section 4 presents performance and correctness benchmarks for existing mRNA folding software packages.

Biological validation of mRNA folding algorithms

Sequences generated by LinearDesign have been experimentally tested in both vaccine applications [8, 19] and reporter assays [6, 9], evaluating in-solution stability and protein expression.

Zhang et al. [8] selected several mRNA sequences along the Pareto front of CAI and MFE to evaluate biological performance in two vaccine contexts, SARS-CoV-2 spike protein and varicella-zoster virus (VZV) gE protein. Under accelerated degradation conditions (10 mM Mg$^{2+}$ at 37°C), spike mRNA half-lives ranged from 3.9 hours for high-CAI sequences (comparable to those used in current Moderna and BioNTech/Pfizer vaccines) to 20 hours for sequences with the lowest MFE (i.e. highest structure), demonstrating a ~5-fold improvement in in-solution stability. Highly structured mRNAs demonstrated up to ~3-fold greater protein expression and up to 128-fold higher IgG titers against spike, compared to sequences optimized solely for CAI. Notably, these results were obtained using unmodified uridine chemistry. In contrast, the VZV gE protein showed more modest gains: a selected “sweet spot” LinearDesign sequence achieved a ~2-fold increase in half-life, a 4.7-fold increase in protein expression, and a 7-fold increase in IgG titers. Similar results using LinearDesign-optimized sequences for VZV have been reported in other work [19].

In another example [9], the authors tested a LinearDesign-optimized sequence encoding NanoLuciferase (NLuc). The sequence demonstrated approximately double the in-solution half-life compared to both a reference NLuc and GC/CAI-optimized variants. Protein expression, measured via luciferase assays at 6 and 24 hours, showed no or only modest increases relative to GC/CAI-optimized mRNAs, depending on the untranslated regions (UTRs) used in combination with the LinearDesign-optimized ORF. Notably, these results were consistent across both pseudouridine-modified sequences and unmodified sequences.

These studies demonstrate that mRNA folding algorithms can improve both in-solution stability and protein expression of therapeutic mRNAs across diverse contexts, including vaccine antigens and reporter proteins. While the magnitude of improvement varies depending on the target and UTR context, the results consistently support the utility of structure-aware sequence design.

The mRNA folding problem

We begin our description of mRNA folding algorithms by explaining foundational definitions and concepts. An mRNA coding sequence (CDS) encodes a protein. A protein is defined by a sequence of amino acids $p = p_1 p_2 \ldots p_{|p|}$. Each amino acid is encoded by one or more synonymous codons [7]. A codon is considered valid for a given amino acid if it belongs to the set of synonymous codons for that amino acid. A valid CDS for $p$ is a sequence of codons where each codon is valid for the corresponding amino acid.
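To make the definition of a valid CDS concrete, the following minimal sketch counts, enumerates, and checks valid CDSs for a short peptide. The codon table covers only three amino acids of the standard genetic code, and the function names are illustrative rather than taken from any of the packages discussed here.

```python
from itertools import product
from math import prod

# Synonymous codons (standard genetic code) for three amino acids only.
SYNONYMOUS = {
    "M": ["AUG"],
    "A": ["GCU", "GCC", "GCA", "GCG"],
    "L": ["CUU", "CUC", "CUA", "CUG", "UUA", "UUG"],
}


def count_valid_cds(protein):
    """Number of valid CDSs: the product of the synonymous codon set sizes."""
    return prod(len(SYNONYMOUS[aa]) for aa in protein)


def enumerate_valid_cds(protein):
    """Yield every valid CDS for the protein as a nucleotide string."""
    for codons in product(*(SYNONYMOUS[aa] for aa in protein)):
        yield "".join(codons)


def is_valid_cds(cds, protein):
    """True if each codon of cds is valid for the corresponding amino acid."""
    if len(cds) != 3 * len(protein):
        return False
    return all(
        cds[3 * k:3 * k + 3] in SYNONYMOUS[aa] for k, aa in enumerate(protein)
    )
```

Because the count is a product over positions, it grows exponentially with protein length, so explicit enumeration is only feasible for very short peptides.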

Preliminary definitions

We start with fundamental definitions useful in describing RNA and mRNA folding algorithms.

Given an RNA sequence $x$ we can define a set $\mathcal{S}(x)$ of valid structures. In the Zuker–Stiegler algorithm, we define a valid structure as a properly nested secondary structure (Fig. 2A and B).

Figure 2.

Panel (A) shows an RNA secondary structure using nested arcs along with the corresponding dot-bracket notation. (B) shows the same structure using a standard 2D layout for RNA structures. (C) shows the key loop types in RNA (one-loops, two-loops, and multiloops) using the same structure as an example.

RNA secondary structure elements. (A) An example nested secondary structure is visualized using an RNA arc diagram, with the corresponding dot-bracket notation for the structure shown below. The arc colors correspond to the colors of the loop type they close. For example, the purple arc closes a multiloop comprising three helices. (B) The same nested RNA structure is represented as an RNA secondary structure diagram, with base pairs colored according to the arcs in (A). (C) This structure contains examples of the three loop types considered in RNA secondary structure prediction: one-loops (red), two-loops (blue), and multiloops (purple). Colors and labels in (A) and (B) correspond to the loop types in (C).

Formally, let $x$ represent an RNA sequence. An RNA is a sequence of nucleotides denoted by “A”, “U”, “G”, and “C”: $x \in \{\mathrm{A}, \mathrm{U}, \mathrm{G}, \mathrm{C}\}^{n}$. A valid structure $s \in \mathcal{S}(x)$ is a set of pairs $(i, j)$ with $i < j$ representing bonds between nucleotides. Only three nucleotide combinations can pair: AU, GC, GU. Note that these can pair in either orientation, e.g. AU and UA are both valid pairs.

A single nucleotide can be in at most one pair in a valid structure: for any two distinct pairs $(i, j), (k, l) \in s$, $\{i, j\} \cap \{k, l\} = \emptyset$. A valid structure contains no crossing pairs. Two pairs $(i, j)$ and $(k, l)$ cross iff $i < k < j < l$ or $k < i < l < j$.
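These validity conditions can be checked directly. The sketch below is illustrative: it uses 0-based indexes, and the helper names are assumptions rather than functions from any RNA folding package. Dot-bracket notation can only express nested structures, so the crossing check is exercised with explicit pair sets.

```python
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def dot_bracket_to_pairs(dot_bracket):
    """Convert dot-bracket notation to a set of 0-based (i, j) pairs, i < j."""
    stack, pairs = [], set()
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % j)
            pairs.add((stack.pop(), j))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def is_valid_structure(seq, pairs):
    """Check pairing rules, one pair per nucleotide, and absence of crossings."""
    used = set()
    for i, j in pairs:
        if not 0 <= i < j < len(seq):
            return False
        if (seq[i], seq[j]) not in CAN_PAIR:
            return False
        if i in used or j in used:  # nucleotide already in another pair
            return False
        used.update((i, j))
    # Iterating over ordered pairs of pairs covers both crossing orientations.
    return not any(
        i < k < j < l for (i, j) in pairs for (k, l) in pairs
    )
```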

The objectives of mRNA folding

Early mRNA folding methods aimed to identify the coding sequence (CDS) that minimizes Minimum Free Energy (MFE), producing the most stable structure among all valid CDSs for a target protein [13, 14].

The RNA MFE structure can be found using RNA folding algorithms, such as the Zuker–Stiegler algorithm. RNA folding algorithms require an energy function $\Delta G(x, s)$ that gives the free energy change for the sequence $x$ folding into the structure $s$. The goal of these algorithms is to compute the structure with the minimum free energy under $\Delta G$, with ties broken arbitrarily:

$$s^{*} = \operatorname*{arg\,min}_{s \in \mathcal{S}(x)} \Delta G(x, s), \qquad \mathrm{MFE}(x) = \Delta G(x, s^{*}) \tag{1}$$

RNA folding algorithms predict the structure of a single RNA sequence. mRNA folding algorithms extend RNA folding to consider all coding sequences that could possibly encode the target protein. Early mRNA folding algorithms directly identify the sequence $x^{*}$ with the lowest MFE structure from the set of all valid coding sequences $\mathrm{CDS}(p)$ [13, 14]:

$$x^{*} = \operatorname*{arg\,min}_{x \in \mathrm{CDS}(p)} \mathrm{MFE}(x) \tag{2}$$

However, the goal of mRNA folding is to find the optimal sequence-structure pair for the CDS, rather than just the optimal structure as in RNA folding. In addition to structural stability, codon usage is an important factor [4]. The optimality of a codon sequence is often calculated using a metric called Codon Adaptation Index (CAI), which measures adaptation of a sequence to the host organism [7]:

$$\mathrm{CAI}(x) = \left( \prod_{i=1}^{|p|} \frac{f_i}{f_i^{\max}} \right)^{1/|p|} \tag{3}$$

The CAI is the geometric mean of codon scores derived from a set of highly expressed genes in the target host. CAI values range from 0 to 1, where 1 indicates perfect adaptation to the host and 0 signifies entirely non-optimal codon usage.

Let $|p|$ represent the length of the protein $p$, $f_i$ the frequency of the codon chosen for the $i$-th amino acid, and $f_i^{\max}$ the maximum frequency over all synonymous codons for that amino acid. Codon frequencies are typically calculated using reference mRNA transcript data for a particular organism. It is convenient to represent CAI in logarithmic form:

$$\log \mathrm{CAI}(x) = \frac{1}{|p|} \sum_{i=1}^{|p|} \log \frac{f_i}{f_i^{\max}} \tag{4}$$
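As a small worked example, the sketch below computes the CAI as the geometric mean of per-codon ratios (chosen codon frequency divided by the highest frequency among its synonyms), via the logarithmic form. The frequency table is hypothetical and only meant to make the arithmetic visible. Real tables are derived from highly expressed genes of the host organism.

```python
import math

# Hypothetical codon frequencies per amino acid (illustrative, not real data).
CODON_FREQ = {
    "M": {"AUG": 1.00},
    "A": {"GCU": 0.26, "GCC": 0.40, "GCA": 0.23, "GCG": 0.11},
    "L": {"CUU": 0.13, "CUC": 0.20, "CUA": 0.07,
          "CUG": 0.40, "UUA": 0.08, "UUG": 0.12},
}


def codon_weight(aa, codon):
    """Frequency of the chosen codon relative to the best synonymous codon."""
    freqs = CODON_FREQ[aa]
    return freqs[codon] / max(freqs.values())


def log_cai(protein, codons):
    """Mean of the log codon weights (the logarithmic form of the CAI)."""
    weights = [codon_weight(aa, c) for aa, c in zip(protein, codons)]
    return sum(math.log(w) for w in weights) / len(protein)


def cai(protein, codons):
    """Geometric mean of the codon weights, a value in (0, 1]."""
    return math.exp(log_cai(protein, codons))
```

For the peptide MAL, choosing the most frequent codon everywhere (`AUG`, `GCC`, `CUG`) gives a CAI of 1, while swapping alanine to `GCU` gives the cube root of 0.26/0.40.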

To find the optimal CDS that balances MFE and codon usage, newer mRNA folding algorithms further extend RNA folding to incorporate CAI [8, 12]. We adopt a similar notation to LinearDesign [8] and combine $\log \mathrm{CAI}$ and MFE into a single objective score:

$$\mathrm{CAIMFE}_{\lambda}(x) = \mathrm{MFE}(x) - \lambda \sum_{i=1}^{|p|} \log \frac{f_i}{f_i^{\max}} \tag{5}$$

Note that it is convenient to drop the $1/|p|$ term from Equation (4) when computing CAIMFE. Since MFE grows linearly with sequence length, it is natural to scale CAI by $|p|$. Observe that $|p|$ cancels, so that $\mathrm{CAIMFE}_{\lambda}(x) = \mathrm{MFE}(x) - \lambda\,|p|\,\log \mathrm{CAI}(x)$.

Now we can define a combined MFE and CAI mRNA folding problem:

$$x^{*} = \operatorname*{arg\,min}_{x \in \mathrm{CDS}(p)} \mathrm{CAIMFE}_{\lambda}(x) \tag{6}$$
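The effect of the mixing factor can be seen on a toy example. The sketch below assumes the combined score takes the form "MFE minus λ times the summed log codon weights", where a codon weight is the ratio of the chosen codon frequency to the highest synonymous frequency. The three candidates and their values are hypothetical, and a real mRNA folding algorithm would search the full design space rather than a fixed list.

```python
import math


def caimfe(mfe, weights, lam):
    """Combined score: MFE minus lam times the summed log codon weights.

    Lower is better. Log weights are at most 0, so the second term penalizes
    non-optimal codons more strongly as lam grows.
    """
    return mfe - lam * sum(math.log(w) for w in weights)


# Hypothetical candidates: (name, MFE in kcal/mol, per-codon weights).
CANDIDATES = [
    ("structured", -12.0, [1.0, 0.3, 0.2]),
    ("balanced", -10.0, [1.0, 0.65, 1.0]),
    ("codon_opt", -6.0, [1.0, 1.0, 1.0]),
]


def best_candidate(lam):
    """Name of the candidate with the lowest combined score for this lam."""
    return min(CANDIDATES, key=lambda c: caimfe(c[1], c[2], lam))[0]
```

With these values, `best_candidate(0)` picks the most structured design, `best_candidate(1)` picks the balanced one, and a large value such as `best_candidate(100)` picks the codon-optimal one.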

RNA folding with dynamic programming

RNA folding algorithms are based on the dynamic programming recursions of Zuker and Stiegler [15], while mRNA folding algorithms adapt these recursions with additional constraints. We begin by briefly describing the Zuker–Stiegler recursions and then outline the modifications applied in mRNA folding methods.

The energy functions used in RNA and mRNA folding algorithms are typically based on the Nearest Neighbor (NN) model [34–37]. This thermodynamics-based model, derived from extensive optical melting experiments, has been in use since the 1970s [38] and is still under active development [37, 39]. All the mRNA folding algorithms described in this review utilize the Nearest Neighbor model for their energy calculations.

The Zuker–Stiegler recursions, following the NN model, break up the energy calculation for an RNA structure into three kinds of “loop”: loops that are closed by a single base pair, termed “one loops”; loops that are closed by two base pairs, termed “two loops”; and loops closed by more than two pairs, termed “multiloops” (see Fig. 2C) [34–37].

We denote the energy contribution of a one loop by $\mathrm{OneLoop}(i, j)$, where $(i, j)$ is the closing pair. Similarly, $\mathrm{TwoLoop}(i, j, k, l)$ denotes the energy contribution of a two loop with $(i, j)$ and $(k, l)$ as the closing pairs. It is assumed that $i < k < l < j$. Multiloops are treated differently. The energy contribution of a multiloop is given by $a + b \cdot u + c \cdot h$, where $u$ is the number of unpaired nucleotides enclosed by the loop, and $h$ is the number of pairs closing the loop. So, $a$ is an initiation constant, $b$ is the cost of an unpaired nucleotide, and $c$ is the cost of a closing pair for the loop. See [40] for a history of and justification for this multiloop model.

We have omitted several details of the modern NN model as they add complexity and obscure core ideas. These include helix end penalties, coaxial stacking, dangling ends, and terminal mismatches. Once the core ideas are understood, we think their addition should not be difficult. However, the reader should be aware that we do not cover them. We refer the reader to the Nearest Neighbor Database for a full description [34, 37].

The following dynamic programming recursions compute the MFE based on Zuker and Stiegler’s approach. Figure 3 provides a graphical representation of this approach.

Figure 3.

A diagrammatic representation of the Zuker–Stiegler recursions. Each case is shown using a Feynman-style arc diagram and a simplified RNA 2D structure layout.

RNA folding recursions with example structures. Each case of the Zuker–Stiegler recursions is depicted with a Feynman-style diagram. A solid arc represents a base-pair between two positions in the sequence. For example, $P(i, j)$ is represented as a solid arc between positions $i$ and $j$. A dashed arc indicates that two positions may or may not be paired. For example, $M(i, j)$ is represented as a dashed arc between positions $i$ and $j$ since these positions are not necessarily paired with each other. A solid black line segment at the base of the arc indicates that the intervening positions are unpaired. In all other cases, the structure of the intervening positions is not yet determined. An example RNA substructure corresponding to each case of the recursions is included above the Feynman-style diagram. Colors correspond to the loop type for each case: one-loop (red), two-loop (blue), or multiloop (purple). Cases that do not correspond to a specific loop type are colored grey.

$$P(i, j) = \min \begin{cases} \mathrm{OneLoop}(i, j) \\ \min_{i < k < l < j} \left[ \mathrm{TwoLoop}(i, j, k, l) + P(k, l) \right] \\ \min_{i < k < j - 1} \left[ M(i + 1, k) + M(k + 1, j - 1) \right] + a + c \end{cases} \tag{7}$$

The paired function, $P(i, j)$, is the MFE over all substructures between $i$ and $j$ given that $i$ and $j$ are assumed to be paired. Here $a$, $b$, and $c$ are the multiloop initiation, unpaired-nucleotide, and closing-pair constants.

There is a base case to consider. $P(i, j) = \infty$ if the nucleotides at $(i, j)$ cannot form a valid pair or if $i \geq j$.

$$M(i, j) = \min \begin{cases} M(i + 1, j) + b \\ M(i, j - 1) + b \\ P(i, j) + c \\ \min_{i \leq k < j} \left[ M(i, k) + M(k + 1, j) \right] \end{cases} \tag{8}$$

The multiloop function, $M(i, j)$, is the MFE over all substructures between $i$ and $j$ given that there is at least one base pair in the substructure. In contrast to $P(i, j)$, that pair is not required to be $(i, j)$. Also, this function computes the MFE for a part of a multiloop only, so it is also not assumed that $(i, j)$ is a pair that closes a multiloop. For the base case, $M(i, j) = \infty$ when there could not be any pairs in the subsequence between $i$ and $j$ (i.e. $i \geq j$).

$$E(i) = \min \begin{cases} E(i + 1) \\ \min_{i < j \leq n} \left[ P(i, j) + E(j + 1) \right] \end{cases} \tag{9}$$

The external loop function, $E(i)$, is the MFE over all substructures for the suffix of nucleotides from $i$ to $n$ (where $n$ is the RNA length). The nucleotide $i$ is assumed to be in the external loop, which is the region not contained inside any base pair. Note that the external loop does not have an associated energy function in the Nearest Neighbor model, unlike one loops, two loops, and multiloops. The base case is $E(i) = 0$ when $i > n$, where $n$ is the sequence length.

The $E$ function is used to extract the solution to the RNA folding problem, i.e. the MFE value. $E(1)$ is the MFE over all possible structures, assuming nucleotides are indexed from 1 to $n$.
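The three functions can be sketched as memoized recursions. This is a toy illustration under loud assumptions: it uses one common simplified form of the recursions (the names `P`, `M`, and `E` for the paired, multiloop, and external loop functions are our labels), the energy values are made up and are not Nearest Neighbor parameters, hairpins need at least three unpaired nucleotides, and no limit is placed on two loop size, so the run time is quartic. It is suitable only for short sequences.

```python
from functools import lru_cache
import math

INF = math.inf
CAN_PAIR = {"AU", "UA", "GC", "CG", "GU", "UG"}
# Toy parameters in kcal/mol (illustrative only, NOT Nearest Neighbor values).
ML_A, ML_B, ML_C = 3.0, 0.0, 0.5  # multiloop: a + b * unpaired + c * pairs
MIN_HAIRPIN = 3                   # minimum unpaired nucleotides in a one loop


def fold_mfe(seq):
    """Return the toy-model MFE of seq using 1-based indexes internally."""
    n = len(seq)
    s = " " + seq  # pad so that s[1] is the first nucleotide

    def one_loop(i, j):
        unpaired = j - i - 1
        return INF if unpaired < MIN_HAIRPIN else 3.0 + 0.5 * unpaired

    def two_loop(i, j, k, l):
        unpaired = (k - i - 1) + (j - l - 1)
        return -3.0 if unpaired == 0 else 1.0 + 0.5 * unpaired  # stack or not

    @lru_cache(maxsize=None)
    def P(i, j):  # MFE between i and j given that (i, j) is paired
        if i >= j or s[i] + s[j] not in CAN_PAIR:
            return INF
        best = one_loop(i, j)
        for k in range(i + 1, j - 1):
            for l in range(k + 1, j):
                best = min(best, two_loop(i, j, k, l) + P(k, l))
        for k in range(i + 1, j - 1):  # (i, j) closes a multiloop
            best = min(best, ML_A + ML_C + M(i + 1, k) + M(k + 1, j - 1))
        return best

    @lru_cache(maxsize=None)
    def M(i, j):  # multiloop fragment containing at least one pair
        if i >= j:
            return INF
        best = min(M(i + 1, j) + ML_B, M(i, j - 1) + ML_B, P(i, j) + ML_C)
        for k in range(i, j):
            best = min(best, M(i, k) + M(k + 1, j))
        return best

    @lru_cache(maxsize=None)
    def E(i):  # external loop over the suffix i..n
        if i > n:
            return 0.0
        best = E(i + 1)  # nucleotide i is unpaired
        for j in range(i + 1, n + 1):
            best = min(best, P(i, j) + E(j + 1))
        return best

    return E(1)
```

Under these toy energies, `fold_mfe("GGGAAACCC")` is the energy of a three-pair hairpin (two stacks at -3.0 each plus a hairpin loop at 4.5), and a sequence with no possible pairs, such as `"AAAAAA"`, has an MFE of 0. Production tools fill tables iteratively instead of recursing and bound the two loop size to reach cubic time.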

Key ideas

  • “RNA Folding” means finding a minimum free energy (MFE) structure for a given RNA sequence

  • Dynamic programming can compute the MFE by recursively computing it for subsequences

  • The recurrence comprises three functions: $P$, $M$, and $E$ (see Equation (7), Equation (8), and Equation (9)) that operate on the sequence indexes of the given RNA

A worked example of the recursive calculations for $P$, $M$, and $E$ is included in Supplementary Figures S1 and S2.

mRNA folding with dynamic programming

mRNA folding algorithms in the literature can be divided into two types based on their approach to incorporating codon constraints into the folding process. The first type, introduced by Cohen and Skiena [13] and later used by DERNA [12], can be described as “codon-constrained” dynamic programming. The second type, introduced by CDSfold [14] and refined by LinearDesign [8], we call “codon graph” dynamic programming.

Codon-constrained methods add codon constraints to the Zuker–Stiegler recursions. For example, $P(i, j)$ becomes $P(i, j, c_i, c_j)$. The semantics are similar, but incorporate assumptions about the codons that the $i$-th and $j$-th nucleotides are in. First, let $\mathrm{aa}(i)$ represent the amino acid index that the $i$-th nucleotide corresponds to. Now, define $P(i, j, c_i, c_j)$ as the MFE over all substructures between $i$ and $j$ given that $i$ and $j$ are paired and where the codon at $\mathrm{aa}(i)$ is $c_i$ and the codon at $\mathrm{aa}(j)$ is $c_j$. The other dynamic programming functions are generalized similarly.

“Codon graph” methods use pointers into a graph instead of sequence indexes, enabling substantially more efficient mRNA folding algorithms. As these methods offer significant improvements over earlier codon-constrained approaches in terms of efficiency, simplicity, and extensibility, our focus will primarily be on them.

Codon graph algorithms

The first codon graph mRNA folding method was CDSfold [14]. Conceptually, CDSfold works by computing the same tables as the Zuker–Stiegler algorithm, but $i$ and $j$ denote pointers into a graph instead of indexes into a sequence. This is an important conceptual shift and is the main idea that enables more efficient mRNA folding algorithms.

It should be stated that CDSfold does not explicitly use a codon graph [14]. Instead, a nucleotide-constrained version of the recursions (similar to codon-constrained described above) is used. Then, “extended nucleotides” are introduced to deal with non-adjacent dependencies between nucleotides inside a codon. As Terai, Kamegai, and Asai point out, this is conceptually a graph [14]. A contribution of this work is to formalize this notion by introducing the codon graph as an elegant way to describe the CDSfold and LinearDesign algorithms and unify them using the same underlying algorithmic framework.

In standard RNA folding, the sequence is fixed, so indexes into the sequence are sufficient to know which base identities are involved. This is important since the energy functions (e.g. $\mathrm{OneLoop}$ and $\mathrm{TwoLoop}$) depend on the base identities involved. In contrast, for mRNA folding, there is no fixed sequence. Instead, we are folding over all valid sequences. The solution employed by CDSfold is to construct a graph such that there is a one-to-one mapping between valid sequences and paths in the graph. Then, instead of an index, a pointer to a node in the graph can be used (Fig. 4A).

Figure 4.

Panel (A) shows how $\mathrm{in}(n_i)$, $\mathrm{out}(n_i)$, $\mathrm{atpos}(i)$, $b_{n_j}$, and $R(n_i, n_j)$ correspond to elements in a codon graph. (B) shows the codon graph used by CDSfold for an alanine followed by a leucine. (C) shows the graph used by LinearDesign for the same protein.

Codon graphs. (A) In RNA folding, sequence indexes Inline graphic and Inline graphic are sufficient to determine the nucleotide at those positions. In mRNA folding, sequence indexes are replaced with pointers to codon graph nodes Inline graphic and Inline graphic. (B) CDSfold uses an “extended nucleotide” codon graph. The codon subgraph for alanine is on the left (blue, columns 1 to 3) with leucine on the right (red, columns 4 to 6). A nucleotide (A, U, G, C) is associated with each vertex. The set of blue paths from left to right corresponds to the set of valid codons for alanine, while the red paths correspond to the valid codons for leucine. The two subgraphs are concatenated by the black edges. (C) LinearDesign uses a modified codon graph with edge weights. A weight Inline graphic is associated with each of the rightmost edges in the alanine and leucine subgraphs, which corresponds to the weight of the corresponding codon. Since there is only a single codon path passing through each weighted edge, the corresponding codon is unambiguous.

The CDSfold graph is constructed from “extended nucleotide” subgraphs for each amino acid in the protein. The “extended nucleotide” terminology refers to two nodes encoding the same nucleotide identity at the same sequence position, which captures dependencies between the first and last nucleotide in a codon. This only occurs for some amino acids, such as serine, arginine, and leucine in the standard codon table.

The amino acid subgraph is constructed so that each path corresponds to a valid codon. Consider leucine, which has six valid codons: CUC, CUU, CUA, CUG, UUA, UUG. We can construct each of these six codons by following a different path through the leucine subgraph (see the red subgraph in Fig. 4B). We can then construct the protein graph by concatenating individual amino acid subgraphs. Figure 4B shows how the CDSfold construction concatenates the alanine and leucine subgraphs by adding every possible edge from the end of alanine to the start of leucine as depicted by the black edges.
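The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the CDSfold or LinearDesign implementation: it assumes a prefix-based layout (one node per distinct one- and two-nucleotide codon prefix, with third-position nodes shared), and the names `CODON_TABLE`, `build_codon_graph`, and `enumerate_sequences` are hypothetical. Only alanine and leucine are included in the table.

```python
# Partial standard codon table (RNA alphabet), just enough for the example.
CODON_TABLE = {
    "A": ["GCU", "GCC", "GCA", "GCG"],
    "L": ["CUU", "CUC", "CUA", "CUG", "UUA", "UUG"],
}


def build_codon_graph(protein, codon_table):
    """Return (base, out): node -> nucleotide and node -> set of successors.

    A node is (position, label). The label is the codon prefix for the first
    two codon positions and the bare nucleotide for the third, so third-position
    nodes are shared between codons (an illustrative layout, see lead-in).
    """
    base, out = {}, {}
    prev_ends = set()
    for k, aa in enumerate(protein):
        starts, ends = set(), set()
        for codon in codon_table[aa]:
            n1 = (3 * k, codon[:1])
            n2 = (3 * k + 1, codon[:2])
            n3 = (3 * k + 2, codon[2])
            for node in (n1, n2, n3):
                base[node] = node[1][-1]
                out.setdefault(node, set())
            out[n1].add(n2)
            out[n2].add(n3)
            starts.add(n1)
            ends.add(n3)
        # Concatenate: every end of the previous subgraph feeds every start.
        for end in prev_ends:
            out[end] |= starts
        prev_ends = ends
    return base, out


def enumerate_sequences(base, out):
    """List every sequence spelled by a full path through the graph."""
    def walk(node):
        if not out[node]:
            yield base[node]
        for nxt in out[node]:
            for tail in walk(nxt):
                yield base[node] + tail

    starts = [node for node in base if node[0] == 0]
    return sorted(seq for s in starts for seq in walk(s))


base, out = build_codon_graph("AL", CODON_TABLE)
sequences = enumerate_sequences(base, out)
```

For the alanine-leucine example, the graph has 14 nodes and its paths spell exactly the 24 valid coding sequences (4 alanine codons times 6 leucine codons), which illustrates the one-to-one mapping between paths and sequences.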

We refer to these graph constructions (in Fig. 4) as codon graphs.

First, we need some definitions for accessing the codon graph. Let $n_i$ refer to a node in the codon graph. Define $\mathrm{out}(n_i)$ as the set containing $n_i$’s neighbors; this corresponds to the outgoing edges from $n_i$ in the codon graph. Similarly, let $\mathrm{in}(n_i)$ be the neighbors in the codon graph where edge directions are reversed; this corresponds to the incoming edges to $n_i$. Let $\mathrm{atpos}(i)$ be the set of nodes that correspond to the $i$-th sequence position. Note that a codon graph can contain several nodes at the same position in an mRNA. In Fig. 4, $\mathrm{atpos}(i)$ corresponds to the set of nodes at the $i$-th column. For example, consider Fig. 4B where $\mathrm{atpos}(2)$ corresponds to the single ‘C’ node for alanine, while $\mathrm{atpos}(6)$ corresponds to the four terminal nucleotides for leucine. Let $b_{n_i}$ denote the base identity (A, U, G, or C) associated with node $n_i$.

Observe that in Fig. 4 some nodes cannot be reached from other nodes. Consider two nodes in the codon graph $n_i$ and $n_j$. We say that $n_i$ can reach $n_j$ if there is a directed path in the codon graph from $n_i$ to $n_j$. We can construct a reachability table $R(n_i, n_j)$, which is true if $n_i$ can reach $n_j$ and false otherwise. $R$ can be constructed efficiently using standard graph algorithms: e.g. a depth-first search from each node.
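As a rough sketch of the depth-first search approach, the following Python builds a reachability lookup for a hand-written leucine subgraph. The node names and the helper names `reachability` and `R` are illustrative, and the sketch assumes that "can reach" means a path of at least one edge.

```python
def reachability(out):
    """Map each node to the set of nodes reachable via at least one edge."""
    reach = {}
    for src in out:
        seen = set()
        stack = list(out[src])
        while stack:                      # iterative depth-first search
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(out[node])
        reach[src] = seen
    return reach


# Leucine subgraph: nucleotide + column; "a"/"b" split the duplicated U.
out = {
    "C1": ["U2a"], "U1": ["U2b"],
    "U2a": ["C3", "U3", "A3", "G3"],   # CUN codons
    "U2b": ["A3", "G3"],               # UUA and UUG
    "C3": [], "U3": [], "A3": [], "G3": [],
}
reach = reachability(out)


def R(a, b):
    """True if there is a directed path from node a to node b."""
    return b in reach[a]
```

Here `R("U1", "C3")` is false because UUC is not a leucine codon, while `R("C1", "C3")` is true.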

Consider the three cases of Equation (7). They can be modified to operate on the codon graph as per Equation (10).

graphic file with name DmEquation10.gif (10)

Equation (10) operates on graph nodes instead of sequence indexes. In particular, Inline graphic and Inline graphic are replaced with Inline graphic and Inline graphic, which represent a codon graph node at RNA indexes Inline graphic and Inline graphic, respectively. In each case, we must try all possible nodes that could be at the sequence indexes in the former recurrence, Equation (7). Note that Inline graphic, Inline graphic, Inline graphic, and similar are used to ensure that these nodes are reachable in the codon graph. The recursions for Inline graphic (Equation (11)) and Inline graphic (Equation (12)) follow similar patterns.

graphic file with name DmEquation11.gif (11)
graphic file with name DmEquation12.gif (12)

We use Inline graphic to denote the mRNA length in Equation (12). This is always Inline graphic where Inline graphic is the input protein length.

The base cases for these recursions are similar to the Zuker–Stiegler versions. Inline graphic when the nucleotides corresponding to nodes Inline graphic and Inline graphic cannot form a valid base pair or if there is no path from Inline graphic to Inline graphic: Inline graphic. Also, Inline graphic if there is no path from Inline graphic to Inline graphic. The base case for Inline graphic needs some extra work, since previously Inline graphic. We introduce a special “end node” Inline graphic to the codon graph at index Inline graphic. There are edges from all nodes at Inline graphic to Inline graphic. We define Inline graphic.

A worked example of the recursive calculations over a simplified codon graph for Inline graphic, Inline graphic, and Inline graphic is included in Supplementary Figure S4.

Equation (10), Equation (11), and Equation (12) are equivalent to the CDSfold recursions [14]. Our presentation simplifies them by introducing the concept of an explicit codon graph, but the underlying ideas are the same. Figure 5 provides a side-by-side visualization of the RNA folding and mRNA folding recursions. Next, we will extend our dynamic programming on a codon graph framework to explain how LinearDesign operates [8].

Figure 5.

Panel (A) shows a Feynman-style diagram of the RNA folding recursions. (B) shows a diagram of the mRNA folding recursions also in the same Feynman-style

RNA versus mRNA folding recursions. (A) The Zuker–Stiegler RNA folding recursions find the MFE structure for a single sequence, referencing each nucleotide by its position (e.g. $i$ or $j$). We use the same Feynman-style diagrams from Figure 3 to represent these recursions. (B) The modified mRNA folding recursions extend the Zuker–Stiegler recursions to work on a codon graph, referencing each nucleotide with a pointer to a node in the graph (e.g. $n_i$ or $n_j$).

Key ideas

  • CDSfold builds on the classical RNA folding algorithm. We describe its approach using a “codon graph”

  • A codon graph encodes each valid coding sequence as a path in the graph

  • The dynamic programming algorithm is similar to RNA folding but sequence indexes (e.g. $i$ and $j$) are replaced with codon graph node pointers (e.g. $n_i$ and $n_j$)

Incorporating CAI

LinearDesign improved on CDSfold to enable simultaneous optimization of both CAI and MFE as in Equation (5) [8]. The LinearDesign algorithm is described in terms of a deterministic finite automaton and context-free grammar parsing. These ideas correspond to the codon graph and dynamic programming in our framework. While different nomenclature is used, the resulting algorithms are equivalent. Our codon graph and dynamic programming framework helps us to put LinearDesign into context with existing algorithms such as the Zuker–Stiegler algorithm [15] and CDSfold [14].

A complete description of the LinearDesign algorithm does not appear in the literature, as [8] provides only a description of the algorithm on a simplified model. Specifically, the Nussinov model [41] is used, which is much simpler than the full NN model. A major contribution of this work is to provide a full description of the algorithm. LinearDesign uses a beam search heuristic adapted from LinearFold [42] to speed up execution at the cost of approximating the solution. We do not include this, as our goal is to provide the foundational mRNA folding algorithms without added heuristics.

LinearDesign incorporated CAI by modifying the codon graph with added edge weights. The graph for each amino acid is modified so that the path for each codon has at least one unique edge (Fig. 4C). This modifies the amino acid graphs from CDSfold. For example, compare the leucine subgraphs in panel B of Fig. 4 to that in panel C. In the LinearDesign construction, there is a unique edge for each of the 6 codons between the middle (U, U) and rightmost (C, U, A, G) columns. In general, it is possible to construct the LinearDesign amino acid graphs by constructing a path for each unique codon prefix (e.g. CU and UU for leucine), then adding edges for all final nucleotides in each codon. Note that we use the standard codon table in our discussion, but in theory this method extends to arbitrary codon tables.

By construction, each rightmost edge in the LinearDesign amino acid graph corresponds to a single codon. CAI is incorporated into the graph by adding weights to these edges equal to the contribution of the corresponding codon to the total weighted log-CAI: Inline graphic from Equation (5). Note that other edges are not assigned a log-CAI weight and are assumed to have a weight of zero. Each path corresponds to a valid CDS and the sum of weights on the path corresponds to Inline graphic for that CDS.
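To make the edge weights concrete, here is a small Python sketch that computes the per-codon log relative adaptiveness used in the standard definition of CAI. The codon frequencies are made-up illustrative numbers, the function names are hypothetical, and the scaling by the tradeoff parameter from Equation (5) is deliberately omitted.

```python
import math


def log_cai_weights(codon_freqs):
    """Per-codon log relative adaptiveness, log(f / f_max), for one amino acid."""
    f_max = max(codon_freqs.values())
    return {codon: math.log(f / f_max) for codon, f in codon_freqs.items()}


def cai(codons, weights):
    """CAI of a coding sequence: geometric mean of relative adaptiveness."""
    return math.exp(sum(weights[c] for c in codons) / len(codons))


# Hypothetical frequencies for alanine codons (illustrative numbers only).
ALA_FREQS = {"GCU": 20.0, "GCC": 40.0, "GCA": 10.0, "GCG": 10.0}
weights = log_cai_weights(ALA_FREQS)
example_cai = cai(["GCC", "GCU"], weights)
```

Because log-CAI is a sum over codons, attaching one such term to the unique edge of each codon makes the total along a path equal to the log-CAI of the whole coding sequence, up to the normalization by length.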

A significant difference from the prior recursions is that a path between nodes can contribute a weight even if there are no paired nucleotides involved. For example, the Inline graphic case in Equation (10) only checked Inline graphic for the stretch of unpaired nucleotides from Inline graphic to Inline graphic. However, since some of the edges in a path from Inline graphic to Inline graphic could be weighted, we must now incorporate the weight.

Define Inline graphic as the sum of log-CAI weights on a minimum-weight path from node Inline graphic to node Inline graphic. Let Inline graphic if there is no path. All values for Inline graphic can be pre-computed and stored in a table. There are several ways to do this including dynamic programming on the graph (since it is directed and acyclic), or using standard shortest path algorithms on the graph. Johnson’s algorithm can compute the all-pairs shortest paths with negative edge weights [43]. All such methods are asymptotically dominated by the cost of the remainder of the algorithm.
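One way to pre-compute such a table is dynamic programming over a topological order of the graph, sketched below in Python. The edge weights are hypothetical non-negative penalties (smaller is better), unweighted edges default to zero, missing paths are represented by infinity as an assumption, and the names `topological_order` and `min_path_table` are illustrative.

```python
import math


def topological_order(out):
    """Kahn's algorithm; every node must appear as a key of `out`."""
    indegree = {u: 0 for u in out}
    for u in out:
        for v in out[u]:
            indegree[v] += 1
    order = [u for u in out if indegree[u] == 0]
    for u in order:                       # the list grows while we iterate
        for v in out[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                order.append(v)
    return order


def min_path_table(out, weight):
    """table[a][b] = minimum total edge weight over paths a -> b (inf if none)."""
    order = topological_order(out)
    table = {}
    for i, src in enumerate(order):
        dist = {u: math.inf for u in out}
        dist[src] = 0.0
        for u in order[i:]:               # descendants always come later
            if dist[u] < math.inf:
                for v in out[u]:
                    dist[v] = min(dist[v], dist[u] + weight.get((u, v), 0.0))
        table[src] = dist
    return table


# Leucine subgraph with a shared end node; weights are hypothetical penalties.
out = {"C1": ["CU"], "U1": ["UU"], "CU": ["C3", "U3", "A3", "G3"],
       "UU": ["A3", "G3"], "C3": ["END"], "U3": ["END"], "A3": ["END"],
       "G3": ["END"], "END": []}
weight = {("CU", "C3"): 0.7, ("CU", "U3"): 1.1, ("CU", "A3"): 1.7,
          ("CU", "G3"): 0.0, ("UU", "A3"): 1.6, ("UU", "G3"): 1.1}
table = min_path_table(out, weight)
```

In this toy graph the cheapest path from `C1` to `END` passes through the CUG edge, while any path starting at `U1` must pay for UUA or UUG.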

Equation (13) updates Inline graphic from Equation (10) to incorporate edge weights. The update rule is that Inline graphic is added when the recursions consider a transition between codon graph nodes that will not be considered in a recursive call. Note that Inline graphic is used even when the path is only a single edge, e.g. Inline graphic in Equation (13).

graphic file with name DmEquation13.gif (13)

The recursions for Inline graphic and Inline graphic are similarly updated in Equation (14) and Equation (15). The base cases for all recursions are unchanged.

graphic file with name DmEquation14.gif (14)
graphic file with name DmEquation15.gif (15)

Key ideas

  • LinearDesign builds on CDSfold by incorporating CAI optimization. We are able to present both in a unified way using codon graphs

  • The codon graph is modified with edge weights corresponding to the CAI contributions

  • The dynamic programming recursions are modified to add the weights of edges

An overview of the key elements comprising mRNA folding is given in Fig. 6.

Figure 6.

Shows the overall flow of how mRNA folding algorithms work. Panel (A) shows a codon graph. Panel (B) shows how dynamic programming is done on that graph. Panel (C) shows the end result: optimized mRNA sequences

mRNA folding overview. mRNA folding takes an input protein sequence and outputs an optimized mRNA sequence. (A) First, a codon graph is generated for the input protein sequence. Each possible path through the codon graph represents a valid coding sequence for the input protein. (B) The optimal mRNA sequence is found via dynamic programming over the codon graph. Early mRNA folding algorithms find the sequence that yields the minimum MFE structure, while more recent algorithms target a balance of MFE and CAI. Base-pairs in the optimized mRNA are depicted by arcs between paired nucleotides. The CAI values are shown below each codon in the graph. The selected codon is highlighted and its corresponding Inline graphic edge weight is labeled below the graph. (C) mRNA folding generates an optimized mRNA. The relative weight of CAI and MFE is determined by the Inline graphic parameter, for algorithms that balance the two objectives. The sequence and structure in (B) correspond to the middle example optimized mRNA in (C).

Traceback

The recursions presented compute the score of the optimal solution but do not construct the solution itself. As is typical for dynamic programming algorithms, the solution can be recovered using a traceback. The traceback is a standard procedure that recovers the solution by recapitulating the steps in the recursions that led to the best score [44]. In the case of the algorithms presented here, the goal is to recover the mRNA. The traceback details are tedious and mechanical, but are provided in Supplementary Algorithm S1 for completeness.

Additional energy model details

Some details were omitted from the prior description of our dynamic programming algorithm for brevity. We assumed that the Inline graphic energy function only needs to know the base identities of the two closing base pairs Inline graphic and Inline graphic. However, for the full energy model, this is not true. In general, the energy function may need to know the identities of the mismatched bases at positions Inline graphic and Inline graphic. The full form of the energy function is Inline graphic.

This does not change the dynamic programming recursions’ structure, but it does complicate them. In particular, we modify Inline graphic by taking the minimum over Inline graphic, Inline graphic, Inline graphic, and Inline graphic. There are special cases when Inline graphic, Inline graphic, Inline graphic, or Inline graphic. These conveniently correspond to specific special cases in the NN model including “stacks”, “bulges”, and “1 by Inline graphic” internal loops (internal loops with 1 unpaired nucleotide on one side and a variable number on the other).

We also assumed that “hairpin loops” can be described by the simple function Inline graphic. The full energy model takes the mismatched nucleotides inside the hairpin into account, so we must use Inline graphic. This requires a similar modification as for two loops. In addition, there are “special” hairpin loops, which are specific sequences that have a unique energy term. Since these are small (3, 4, and 6 unpaired nucleotides in length), they can be incorporated into the algorithm by brute-force enumeration. That is, when computing Inline graphic, if Inline graphic, we enumerate all paths from Inline graphic to Inline graphic. Each of these paths is a possible hairpin sequence, and we take the minimum over all sequences. Sequences corresponding to special hairpins use the special hairpin rule, otherwise Inline graphic is used.
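The brute-force enumeration for small hairpins can be sketched as follows. This is an illustration only: the graph is a toy, the energy values are placeholders rather than parameters from the nearest neighbor model, and the names `paths_between` and `best_hairpin` are hypothetical.

```python
def paths_between(out, base, src, dst):
    """Yield the nucleotide string of every directed path from src to dst."""
    if src == dst:
        yield base[src]
        return
    for nxt in out[src]:
        for tail in paths_between(out, base, nxt, dst):
            yield base[src] + tail


def best_hairpin(out, base, src, dst, special, generic_energy):
    """Minimum over all hairpin sequences, using special-case energies if listed."""
    return min(
        (special.get(seq, generic_energy), seq)
        for seq in paths_between(out, base, src, dst)
    )


# Toy graph: closing pair C...G around four unpaired positions, one variable.
base = {"c0": "C", "g1": "G", "a2": "A", "c2": "C",
        "a3": "A", "a4": "A", "g5": "G"}
out = {"c0": ["g1"], "g1": ["a2", "c2"], "a2": ["a3"], "c2": ["a3"],
       "a3": ["a4"], "a4": ["g5"], "g5": []}
SPECIAL = {"CGAAAG": -1.0}   # placeholder energy, NOT a value from the NN model
result = best_hairpin(out, base, "c0", "g5", SPECIAL, generic_energy=4.0)
```

Since the enumeration is only applied to very short loops, the number of paths stays small even though it grows exponentially with loop length in the worst case.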

The reader is referred to the Nearest Neighbor Database for more details on hairpin loops, internal loops, stacks, and bulges in the energy model [34, 37].

Complexity analysis

In calculating the computational complexity of the algorithms we assume that tables are used to store solutions for Inline graphic, Inline graphic, and Inline graphic and these are filled bottom-up [44]. This is similar to the implementation of existing mRNA folding algorithms [8, 14]. For completeness, a valid bottom-up fill order is to iterate backwards through 5’ sequence indexes Inline graphic and for each Inline graphic iterate forward through 3’ sequence indexes Inline graphic.
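The fill order described above can be written as a short generator. This is a sketch under the assumption that `atpos` is a list giving the candidate nodes at each sequence position; the generator name and the toy node labels are illustrative.

```python
def fill_order(atpos):
    """Yield (i, j, n_i, n_j) so that enclosed spans are always visited first."""
    n = len(atpos)
    for i in range(n - 1, -1, -1):        # 5' index, iterated backwards
        for j in range(i + 1, n):         # 3' index, iterated forwards
            for n_i in atpos[i]:
                for n_j in atpos[j]:
                    yield i, j, n_i, n_j


# Toy positions for an alanine codon: G, C, then any of four nucleotides.
ATPOS = [["G"], ["C"], ["U", "C", "A", "G"]]
entries = list(fill_order(ATPOS))
spans = list(dict.fromkeys((i, j) for i, j, _, _ in entries))
```

Any span strictly inside (i, j) either has a larger 5' index, and so was handled in an earlier outer iteration, or shares the 5' index and has a smaller 3' index, and so was handled earlier in the inner loop.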

Let Inline graphic be the mRNA length. The table size for Inline graphic is Inline graphic. Each mRNA sequence index has at most four nodes (one for each nucleotide) using the construction from Section 3.2 assuming the standard codon table. So, the codon graph contains an upper bound of Inline graphic nodes. The total number of table entries is bounded by Inline graphic combinations of the nodes Inline graphic and Inline graphic. The cost of computing the solution for a table entry is dominated by iterating through all Inline graphic combinations of Inline graphic and Inline graphic. However, in RNA folding algorithms it is typical to limit the size of two loops to at most 30 unpaired nucleotides, as they rapidly become thermodynamically unfavorable [45]. In this case, the calculation is dominated by the cost of considering multiloop splits, which involves enumerating all pairs of nodes Inline graphic and Inline graphic. There are at most Inline graphic such pairs, since there are at most four options for Inline graphic. This gives a time complexity of Inline graphic.

The table for Inline graphic is similarly Inline graphic in size. It is also dominated by calculating splitting pairs Inline graphic and Inline graphic. As such, the total time complexity for filling Inline graphic is Inline graphic.

The table for Inline graphic is Inline graphic in size since it is parameterized by a single node. The worst case cost of calculating an entry is Inline graphic, as it similarly considers all splitting pairs Inline graphic and Inline graphic. As such, the total time complexity for filling Inline graphic is Inline graphic.

The algorithm is dominated by filling the Inline graphic and Inline graphic tables. The worst case time complexity is Inline graphic and the space complexity is Inline graphic. The cost of computing the shortest path table Inline graphic using an efficient algorithm such as Johnson’s algorithm [43] is at most Inline graphic, since an upper bound on the number of nodes in the graph is Inline graphic and an upper bound on the number of edges is Inline graphic. The traceback is similarly dominated, since it will only visit table entries in the optimal solution and its total time cost must be less than the cost of computing the tables.

Pareto optimality

DERNA [12] introduced a unique feature to mRNA folding algorithms to find all Pareto optimal mRNAs. An mRNA Inline graphic is Pareto optimal if there is no sequence Inline graphic that dominates Inline graphic in terms of both CAI and MFE: Inline graphic. In other words, a Pareto optimal set for mRNA folding solutions contains one mRNA for each achievable CAI value and that sequence must have the minimum MFE possible for that CAI.

Finding all Pareto optimal mRNA sequences solves a problem in LinearDesign [8]. The term Inline graphic is used to balance CAI and MFE in Equation (5). Selecting the right Inline graphic can be challenging. For instance, if an mRNA designer wants to find a sequence with Inline graphic, then they must run LinearDesign multiple times to binary search for the lowest Inline graphic that satisfies the condition. The set of Pareto optimal solutions contains a solution for every possible tradeoff between CAI and MFE.
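The repeated-run workflow can be illustrated with a simple bisection. In this sketch the condition is assumed to be a minimum CAI threshold, CAI is assumed to be non-decreasing in the tradeoff parameter, and `toy_design` is a made-up stand-in for one run of a folding tool; it is not an interface of LinearDesign or any other package.

```python
def smallest_lambda(design, min_cai, lo=0.0, hi=100.0, tol=1e-6):
    """Bisect for the smallest lam in [lo, hi] whose design reaches min_cai.

    Assumes CAI is non-decreasing in lam and that design(hi) meets the target.
    """
    if design(hi)[0] < min_cai:
        raise ValueError("target CAI not reached even at the upper bound")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if design(mid)[0] >= min_cai:
            hi = mid                      # mid is good enough, try smaller
        else:
            lo = mid                      # mid is too small
    return hi


def toy_design(lam):
    """Stand-in for one run of a folding tool; returns (CAI, MFE). Not a real API."""
    cai = min(1.0, 0.5 + lam / 10)
    return cai, -100.0 + 5.0 * lam


lam = smallest_lambda(toy_design, min_cai=0.8)
```

Each bisection step costs one full folding run, which is why having the whole Pareto front available up front is attractive.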

The recursions presented in this work can be modified to compute all Pareto optimal solutions. Each of Inline graphic, Inline graphic, and Inline graphic computes the CAIMFE (defined in Equation (5)) of the optimal solution to the corresponding subproblem. Instead, they could compute a set of Pareto optimal solutions for each subproblem. For example, the dynamic programming table for Inline graphic might store a list of all Inline graphic pairs for Pareto optimal solutions. Two lists can be combined by enumerating all pairs of elements (one from each list) and taking only Pareto optimal combinations. This is a straightforward, albeit naive, solution.
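The naive list-combination step can be sketched in Python as below. The sketch assumes each entry is a `(log_cai, energy)` pair where higher log-CAI and lower energy are better, and that both scores are additive when two subproblems are joined; a full implementation would also add the loop energy and edge weight terms from the recursions. The function names are illustrative.

```python
def pareto_filter(points):
    """Keep (log_cai, energy) points not dominated by any other point."""
    front = []
    # Highest log-CAI first; ties broken by lowest energy.
    for point in sorted(set(points), key=lambda p: (-p[0], p[1])):
        if not front or point[1] < front[-1][1]:
            front.append(point)
    return front


def combine(front_a, front_b):
    """Pareto front of all pairwise sums of two sub-solution fronts."""
    sums = [(a[0] + b[0], a[1] + b[1]) for a in front_a for b in front_b]
    return pareto_filter(sums)


left = [(0.0, -1.0), (-1.0, -3.0)]
right = [(0.0, -2.0), (-2.0, -3.0)]
merged = combine(left, right)
```

In this example the combination `(-2.0, -4.0)` is discarded because `(-1.0, -5.0)` has both a higher log-CAI and a lower energy.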

DERNA uses a more sophisticated but less pedagogically clear weighted sum method that exploits the convexity of the CAI-MFE tradeoff—increasing CAI monotonically increases MFE. Both methods could be adapted to the recursions presented here. DERNA extends the less-efficient codon-constrained method for mRNA folding. This makes DERNA slower than both CDSfold and LinearDesign, even when not running in Pareto optimal mode.

Untranslated regions

An mRNA designer usually considers three regions: the 5’ untranslated region (UTR), the CDS, and the 3’ UTR. However, existing mRNA folding algorithms optimize only the CDS. Zhang et al. [8] suggested that an MFE-optimized CDS is less likely to have base pairs that interact with the UTRs, which is important to avoid any disruption in UTR function. This is especially important for the 5’ UTR, since structure near the mRNA 5’ end can substantially impair translation initiation [46]. There is some experimental evidence for this [9], but the algorithms do not guarantee it.

Structural constraints

The goal of MFE optimization in mRNA folding is to increase stability, which generally increases structure. However, in some cases it is desirable to suppress structure. For instance, reduced structure in the 5’ UTR, particularly near the start codon, is associated with increased expression [47–49]. As mentioned in Section 3.7, it may be useful to avoid base pairs between the CDS and the UTRs. Also, it can be useful to avoid long helices, since they can trigger an innate immune response [50]. To these ends, it would be useful to extend mRNA folding algorithms to incorporate structural constraints.

CDSfold included a heuristic to discourage base pairs in a user-specified region [14]. This heuristic penalizes Inline graphic whenever Inline graphic is a suppressed base pair, reducing but not entirely eliminating their occurrence. By modifying the free energy landscape, it increases the free energy of any structure with a suppressed pair. However, the mRNA folding algorithm may still find a sequence that has a low MFE in the modified free energy landscape and an even lower MFE once suppressed pairs are allowed again. To address this, the CDSfold heuristic also employs a second phase inspired by Gaspar et al. [51]. While effective for suppressing base pairing in specific regions, this heuristic does not generalize beyond that function and is not implemented with CAI optimization.

LinearDesign also uses a heuristic to avoid structure around the 5’ UTR [8] by excluding the first three codons and optimizing the remaining CDS. Then, all combinations for the three excluded codons are enumerated and evaluated. This method appears to work for reducing structure around the start codon, but does not scale to large regions (due to brute-force enumeration) or generalize to arbitrary structural constraints. LinearDesign also avoids long helices, but its avoidance heuristics were not specified.

Sequence constraints

Avoidance of certain sequences in an mRNA can be important. Factors like restriction enzyme sequences, repeated subsequences, and the proportion of G and C nucleotides can affect mRNA efficacy and ease of manufacturing [4].

Zhang et al. [8] note that LinearDesign’s deterministic finite automaton (DFA) can be modified to avoid certain motifs, such as the restriction enzyme recognition sequence GGUACC. Since the DFA is equivalent to the codon graph framework, these modifications translate directly. While similar modifications could be made by hand for other excluded sequences, this may become cumbersome if multiple excluded sequences overlap.

Comparison of existing software packages

We conducted a series of experiments to compare existing mRNA folding software packages, including LinearDesign [8], CDSfold [14], and DERNA [12]. These were downloaded from their respective GitHub repositories (commits f0126ca, 06f3ee8, and ac84b6f) and compiled from source on Ubuntu 24.04 using GCC 13.2.0. All experiments were performed on the same Ubuntu system equipped with an AMD 7950X processor. The Homo sapiens codon frequency table from the Kazusa database was used for all experiments [52]. All of our benchmarking results and code are available at https://github.com/maxhwardg/mrna_folding_comparison.

An overview of our findings is summarized in Table 2.

Table 2.

Comparison of mRNA folding software packages. Entries marked with an asterisk (*) are the most desirable quality for the corresponding column.

Software Package  | Speed        | Memory Usage | Bugs
LinearDesign [8]  | Fast*        | High         | Observed
DERNA [12]        | Slow         | High         | Observed
CDSfold [14]      | Intermediate | Low*         | Not observed*

Performance benchmarks

The software packages were benchmarked on proteins ranging from 50 to 1500 amino acids in length, with a stride of 50. Benchmarking of a software package was terminated if it exceeded a time of one hour. The protein sequences were randomly generated with uniformly sampled amino acids. Since CDSfold does not optimize CAI, LinearDesign was run with Inline graphic and DERNA with Inline graphic to emulate the behavior of CDSfold.

Benchmarking on random protein data shows that LinearDesign had the shortest execution time, followed by CDSfold and then DERNA (Fig. 7A). LinearDesign operates as an approximate algorithm, whereas CDSfold and DERNA are exact. During benchmarking, LinearDesign occasionally produced less optimized results (higher MFE) than the other algorithms, though such cases were rare and the differences were minor. An example sequence is available in our GitHub repository.

Figure 7.

Panel (A) shows the execution time graph on random sequences for DERNA, CDSfold, and LinearDesign; each appears cubic with the highest (in terms of execution time) being DERNA, CDSfold, and then LinearDesign in that order. (B) shows memory usage graph on random sequences; each appears quadratic with highest (in terms of memory used) being LinearDesign, DERNA, then CDSfold in that order. (C) shows the execution time graph on the poly-leucine sequences; DERNA has a very steep plot while CDSfold and LinearDesign appear cubic. Initially CDSfold is lower, but it overtakes LinearDesign (in terms of execution time) at length 950. (D) shows the memory usage graph on poly-leucine sequences; all plots appear quadratic with the highest (in terms of memory used) being DERNA, LinearDesign, then CDSfold in that order

Benchmarks of mRNA folding packages. LinearDesign (blue circles), CDSfold (red squares), and DERNA (green triangles) are compared. Execution time (A) and memory usage (B) are shown for randomly generated protein sequences. Protein sequences range from 50 to 1500 amino acids in length, with a stride of 50. Each data point represents the median execution time across three runs, with error bars representing the highest and lowest measurements. Execution time (C) and memory usage (D) are shown for a poly-leucine amino acid sequence (MLLL…). Since leucine has the maximum number of synonymous codons (6), this sequence provided a challenging test scenario for these algorithms.

Memory usage followed a different trend: LinearDesign consumed the most memory, closely followed by DERNA, while CDSfold used minimal resources (Fig. 7B). Notably, LinearDesign required over 60 GB of memory to fold a 1,450 amino acid protein.

The performance of mRNA folding algorithms is sensitive to protein composition. For instance, the “poly-leucine” sequence MLLL… (one methionine followed by a variable number of leucines) has a more challenging codon graph than a randomly created protein, as leucine has the maximum number of synonymous codons (6). To evaluate this effect, we ran a second benchmark using various lengths of the poly-leucine sequence. As expected, all software packages ran slower on this benchmark, with DERNA being particularly affected: it required 75 minutes to fold a 220 amino acid protein, compared to Inline graphic minutes for random sequences of the same length. This meant that only shorter lengths could be tested for DERNA as it exceeded the experimental time limit. CDSfold outperformed LinearDesign in execution time for lengths up to 900 amino acids, but LinearDesign was faster for longer sequences (Fig. 7C). The memory usage was also higher for this test sequence, with DERNA using relatively more memory (Fig. 7D).

Software bugs

During the benchmarking process, several bugs were found in the software packages. Most notably, the CAI values reported by DERNA are often different from those produced by LinearDesign and by our CAI calculator, despite all programs using the same codon frequency table. Additionally, DERNA showed non-deterministic behavior and reported different CAI values for the same output mRNA given the same input protein. It also produced free energy values that do not match the free energy of the sequence-structure pair as calculated by ViennaRNA [18]. We observe that all software packages target parity with the RNAfold program running with the -d0 option.

LinearDesign also exhibited bugs. Some inputs caused LinearDesign to crash: it produced an invalid RNA sequence for the input protein, triggering an assertion error. In some cases, LinearDesign produced an mRNA sequence with a reported MFE value that did not match ViennaRNA’s computed MFE value. For both of these categories of bugs, undefined behavior is the likely cause. To investigate, we recompiled LinearDesign with sanitization (via -fsanitize=address,undefined). This revealed several cases of integer underflow.

No bugs were observed in CDSfold.

Proteins that trigger the bugs mentioned here are compiled in our GitHub repository. The errors can be reproduced using our code that runs the various software packages or by calling the software packages directly using the same codon usage table and program settings.

During the review of this manuscript, DERNA was updated to fix the bugs we reported.

Discussion

The success of LinearDesign and mRNA technology more generally highlights the importance of mRNA folding algorithms. They are fast enough to use for long proteins and provide a high degree of optimization for stability (via MFE) and codon choice (via CAI). However, the lack of flexibility and features is a limitation. In addition, existing software packages are imperfect, with the user needing to use different software packages to access different features and contend with bugs.

We have identified several gaps in the mRNA folding literature and also in the available software. Perhaps the most pressing research gap is to incorporate sequence and structure constraints, as these are widely used in existing mRNA optimization approaches. Existing mRNA folding algorithms can still be used to generate an initial sequence, which may be adjusted by another algorithm to meet sequence and structure constraints. However, a holistic approach that can incorporate some of these constraints into mRNA folding is preferred.

Another pressing gap for mRNA folding algorithms is the lack of high-quality software packages. Existing software packages either have significant bugs (DERNA and LinearDesign), poor performance (DERNA), high memory usage (DERNA and LinearDesign), or lack features (CDSfold and LinearDesign). We also note that no multi-core or GPU-enabled software exists despite the significant computational bottlenecks in mRNA folding algorithms.

There are several specific ideas that we suggest for the next iteration of mRNA folding algorithms.

Inclusion of UTRs. We observe for completeness that it is possible to extend mRNA folding algorithms to be UTR-aware. The UTRs can be incorporated by modifying the codon graph construction without any changes to the recursions. Construct a path for the 5’ UTR and the 3’ UTR. Each path contains the sequence of nucleotides in the UTR. The 5’ UTR path can be prepended to the codon graph and the 3’ UTR can be appended. The edges in the UTR paths should have weight zero so that they do not contribute to CAI. This is sufficient to ensure that the UTRs are included in the calculation of the MFE. To our knowledge, the addition of UTRs has not been implemented in existing mRNA folding software packages.
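A minimal sketch of this graph modification is given below, assuming the codon graph is stored as a node-to-nucleotide map and a node-to-successors map with known start and end nodes. The function names and the toy phenylalanine graph are illustrative, and edge weights are left implicit because UTR edges carry zero weight.

```python
def add_utrs(base, out, starts, ends, utr5, utr3):
    """Return a copy of the graph with linear UTR paths attached at both ends."""
    base = dict(base)
    out = {node: set(nexts) for node, nexts in out.items()}

    def add_path(tag, seq):
        nodes = [(tag, k) for k in range(len(seq))]
        for node, nt in zip(nodes, seq):
            base[node] = nt
            out[node] = set()
        for a, b in zip(nodes, nodes[1:]):
            out[a].add(b)
        return nodes

    five, three = add_path("5utr", utr5), add_path("3utr", utr3)
    if five:
        out[five[-1]] |= set(starts)      # 5' UTR feeds the CDS start nodes
    if three:
        for end in ends:
            out[end].add(three[0])        # CDS end nodes feed the 3' UTR
    return base, out


def sequences_from(base, out, node):
    """Yield every sequence spelled by a path from node to a sink."""
    if not out[node]:
        yield base[node]
    for nxt in out[node]:
        for tail in sequences_from(base, out, nxt):
            yield base[node] + tail


# Toy CDS graph for a single phenylalanine (UUU or UUC).
cds_base = {"f1": "U", "f2": "U", "f3u": "U", "f3c": "C"}
cds_out = {"f1": {"f2"}, "f2": {"f3u", "f3c"}, "f3u": set(), "f3c": set()}
base, out = add_utrs(cds_base, cds_out, ["f1"], ["f3u", "f3c"], "GG", "A")
full = sorted(sequences_from(base, out, ("5utr", 0)))
```

Every path through the extended graph now spells a full transcript with fixed UTRs and a variable CDS, so the unchanged recursions would score base pairs involving UTR nucleotides as well.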

Suboptimal folding. Current mRNA folding algorithms return only a single solution, but it would be more practical to provide the user with a diverse set of candidate sequences. Suboptimal sampling is one of the most important features of modern RNA folding software packages [53–55]. An mRNA folding implementation that returns a diverse set of mRNAs would reduce the impact of unsupported sequence and structure constraints, as a broader set of mRNAs is more likely to contain sequences that satisfy them. In addition, it gives a larger pool of candidate sequences for lab testing. DERNA finds a set of Pareto optimal solutions [12]. This set is equivalent to that obtained by running conventional mRNA folding with all values of the weight that trades off MFE against CAI. It is important to understand how this differs from suboptimal sampling. DERNA finds only a single solution for a given weight, but there may be many near-optimal solutions. Further, there may be ties among Pareto optimal solutions, in which case DERNA will report only one.
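The difference between a weight sweep and suboptimal enumeration can be illustrated on a toy candidate pool. The sketch below is not how DERNA or any other package works internally: the scalarized objective `mfe - lam * cai`, the function names, and every numeric value are assumptions invented for illustration, and a real implementation would search the sequence space by dynamic programming rather than scanning a list.

```python
# Toy candidates as (name, MFE in kcal/mol, CAI). All values are invented.
CANDIDATES = [
    ("a", -30.0, 0.60),
    ("b", -29.8, 0.62),
    ("c", -25.0, 0.80),
    ("d", -20.0, 0.95),
    ("e", -24.8, 0.80),
]


def score(candidate, lam):
    """Assumed objective: lower is better, lam weights CAI against MFE."""
    _, mfe, cai = candidate
    return mfe - lam * cai


def sweep(lams):
    """Return the single best candidate for each weight, as a weight sweep would."""
    return [min(CANDIDATES, key=lambda c: score(c, lam))[0] for lam in lams]


def near_optimal(lam, delta):
    """Return every candidate within delta of the optimum for one weight."""
    best = min(score(c, lam) for c in CANDIDATES)
    return [c[0] for c in CANDIDATES if score(c, lam) <= best + delta]
```

With these toy numbers, `sweep([0, 20, 30, 40])` returns one candidate per weight and never reports `e`, whereas `near_optimal(30, 0.3)` returns both `c` and `e`. Candidate `e` is almost as good as `c` but is invisible to the sweep, which is the kind of sequence a suboptimal folding feature would surface.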

Forbidden sequence avoidance. There is currently no general algorithm for building a codon graph (or DFA) that avoids an arbitrary set of sequence motifs, although Zhang et al. [8] give a bespoke construction for a specific sequence. We hypothesize that this could be achieved by combining the codon graph (or DFA) construction with the Aho–Corasick automaton [56]. In addition, no method has been proposed to avoid repeated sequences or inverted repeats.
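One way the hypothesized combination might look is sketched below. This is an assumption-laden illustration rather than an established method: it ignores secondary structure entirely and only shows how an Aho–Corasick automaton over forbidden motifs can be tracked alongside codon choices, keeping the best log codon weight per automaton state. The function names and the placeholder `CODON_WEIGHTS` table are hypothetical.

```python
import math
from collections import deque

# Placeholder codon weights for illustration only (not real codon usage data).
CODON_WEIGHTS = {
    "M": {"AUG": 1.0},
    "F": {"UUU": 0.5, "UUC": 1.0},
    "K": {"AAA": 0.5, "AAG": 1.0},
}


def build_automaton(motifs, alphabet="ACGU"):
    """Aho-Corasick automaton as a full transition table plus a bad-state flag."""
    goto, bad = [{}], [False]
    for motif in motifs:  # build the trie of forbidden motifs
        s = 0
        for ch in motif:
            if ch not in goto[s]:
                goto.append({})
                bad.append(False)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        bad[s] = True
    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:  # breadth-first: fill failure links and missing transitions
        s = queue.popleft()
        bad[s] = bad[s] or bad[fail[s]]
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                queue.append(t)
            else:
                goto[s][ch] = goto[fail[s]][ch]
    return goto, bad


def best_avoiding(protein, motifs):
    """Best (log weight, mRNA) for protein containing no motif, or None."""
    goto, bad = build_automaton(motifs)
    layer = {0: (0.0, "")}  # automaton state -> best (score, sequence) so far
    for aa in protein:
        nxt = {}
        for state, (score, seq) in layer.items():
            for codon, w in CODON_WEIGHTS[aa].items():
                s, ok = state, True
                for ch in codon:
                    s = goto[s][ch]
                    if bad[s]:  # a forbidden motif would end here
                        ok = False
                        break
                if ok:
                    cand = (score + math.log(w), seq + codon)
                    if s not in nxt or cand[0] > nxt[s][0]:
                        nxt[s] = cand
        layer = nxt
    return max(layer.values(), default=None)
```

For example, `best_avoiding("MFK", ["UUCA"])` cannot use the otherwise preferred `UUC` codon for F, because the following K codons start with `A` and would complete the motif across the codon boundary, so it falls back to `AUGUUUAAG`. In a full mRNA folding setting, the automaton state would have to be carried through the folding recursions as well, which this sketch does not attempt.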

Closing remarks

mRNA technology is at the cutting edge of therapeutics, offering better vaccines, gene editing, and personalized medicines. mRNA sequence optimization is essential to fully realize this potential, and mRNA folding algorithms are among the most powerful tools available.

This review explores mRNA folding algorithms, which, although recently popularized by LinearDesign, have existed since the early 2000s. We provide a comprehensive description of how these algorithms work, addressing the lack of a complete explanation of the core algorithms in the literature. Further, we unify and simplify the description of the algorithms used in CDSfold and LinearDesign with a new codon graph framework. Several key gaps in the literature are highlighted, and we present benchmarks comparing run-time speed, memory usage, correctness, and features of existing software.

Although mRNA folding algorithms have proven to be a powerful tool, especially for enhancing stability while balancing translational efficiency in the vaccine context, they address only one dimension of a much broader design problem. Optimizing therapeutic mRNAs may require incorporating parameters relevant to other stages of the drug’s life cycle, such as delivery, chemically modified nucleotides, or intracellular uptake and trafficking, many of which are still poorly understood and will likely require entirely new algorithms.

Unlike in protein folding, deep learning has yet to reach its full potential in mRNA sequence optimization, largely due to the limited availability of high-quality, task-relevant data. As more experimental datasets become available, we expect deep learning to play a greater role in this space.

Progress so far has been fueled by close collaboration among mRNA biologists, nucleotide chemists, and computational scientists, and the field will continue to benefit from such interdisciplinary partnerships given the many open questions that remain.

Finally, current algorithms rely almost exclusively on 2D RNA structure predictions due to the limited availability of high-resolution 3D structural data. As advances in 3D RNA structure prediction continue, we anticipate that future algorithms will integrate 3D information to further refine sequence design. Together, these developments mark an exciting and rapidly evolving frontier in therapeutic mRNA engineering and will require the continued development of bespoke tools and algorithms tailored to currently known and unknown challenges.

We hope this review provides a strong foundation for the development of next-generation mRNA folding algorithms and contributes to the continued advancement of mRNA technology.

Key Points

  • The first comprehensive review of mRNA folding algorithms including LinearDesign, CDSfold, DERNA, and the Cohen–Skiena algorithm

  • A complete technical description of how mRNA folding algorithms work, unifying details that are currently scattered across the literature or missing altogether

  • Benchmarking of mRNA folding software packages in terms of efficiency, correctness, and supported features

Supplementary Material

supp_(1)_bbaf386
supp_(1)_bbaf386.pdf (1.9MB, pdf)

Acknowledgements

We thank The University of Western Australia and Moderna Therapeutics for providing computational resources and support. We are also grateful to Haining Lin and Wade Davis for their valuable discussions and feedback, which helped shape this work. In particular, we thank Haining Lin for her thorough review and insightful comments. Additionally, we acknowledge the developers of LinearDesign, CDSfold, and DERNA for making their software publicly available, enabling our benchmarking studies.

Contributor Information

Max Ward, School of Physics, Mathematics, and Computing, The University of Western Australia, WA 6009, Australia.

Mary Richardson, Moderna, Inc., Cambridge, MA 02142, USA.

Mihir Metkar, Moderna, Inc., Cambridge, MA 02142, USA.

References

  • 1. Fang E, Liu X, Li M. et al. Advances in COVID-19 mRNA vaccine development. Signal Transduct Target Ther 2022;7:94. doi: 10.1038/s41392-022-00950-y
  • 2. Hogan MJ, Pardi N. mRNA vaccines in the COVID-19 pandemic and beyond. Annu Rev Med 2022;73:17–39. doi: 10.1146/annurev-med-042420-112725
  • 3. Shi Y, Shi M, Wang Y. et al. Progress and prospects of mRNA-based drugs in pre-clinical and clinical applications. Signal Transduct Target Ther 2024;9:322. doi: 10.1038/s41392-024-02002-z
  • 4. Metkar M, Pepin CS, Moore MJ. Tailor made: the art of therapeutic mRNA design. Nat Rev Drug Discov 2024;23:67–83. doi: 10.1038/s41573-023-00827-x
  • 5. Uddin MN, Roni MA. Challenges of storage and stability of mRNA-based COVID-19 vaccines. Vaccines 2021;9:1033.
  • 6. Wayment-Steele HK, Kim DS, Choe CA. et al. Theoretical basis for stabilizing messenger RNA through secondary structure design. Nucleic Acids Res 2021;49:10604–17. doi: 10.1093/nar/gkab764
  • 7. Sharp PM, Li W-H. The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 1987;15:1281–95. doi: 10.1093/nar/15.3.1281
  • 8. Zhang H, Zhang L, Lin A. et al. Algorithm for optimized mRNA design improves stability and immunogenicity. Nature 2023;621:396–403. doi: 10.1038/s41586-023-06127-z
  • 9. Leppek K, Byeon GW, Kladwang W. et al. Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics. Nat Commun 2022;13:1536.
  • 10. Vostrosablin N, Lim S, Gopal P. et al. mRNAid, an open-source platform for therapeutic mRNA design and optimization strategies. NAR Genom Bioinform 2024;6:lqae028.
  • 11. Li S, Moayedpour S, Li R. et al. CodonBERT: large language models for mRNA vaccines. Genome Res 2024;34:1027–35.
  • 12. Xinyu G, Qi Y, El-Kebir M. DERNA enables Pareto optimal RNA design. J Comput Biol 2024;31:179–96.
  • 13. Cohen B, Skiena S. Natural selection and algorithmic design of mRNA. J Comput Biol 2003;10:419–32. doi: 10.1089/10665270360688101
  • 14. Terai G, Kamegai S, Asai K. CDSfold: an algorithm for designing a protein-coding sequence with the most stable secondary structure. Bioinformatics 2016;32:828–34. doi: 10.1093/bioinformatics/btv678
  • 15. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 1981;9:133–48. doi: 10.1093/nar/9.1.133
  • 16. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003;31:3406–15. doi: 10.1093/nar/gkg595
  • 17. Reuter JS, Mathews DH. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinform 2010;11:1–9.
  • 18. Lorenz R, Bernhart SH, Höner C. et al. ViennaRNA Package 2.0. Algorithms Mol Biol 2011;6:1–14. doi: 10.1186/1748-7188-6-26
  • 19. Zhang S, Wang X, Zhao T. et al. Development and evaluation of the immunogenic potential of an unmodified nucleoside mRNA vaccine for herpes zoster. Vaccines 2025;13:68. doi: 10.3390/vaccines13010068
  • 20. Budzinska MA, Leonard TE, Chang RY. et al. mRNArchitect: optimized design of mRNA sequences. bioRxiv. 2024, 2024–12.
  • 21. Yeasmin R, Skiena S. Designing RNA secondary structures in coding regions. In: Bioinformatics Research and Applications: 8th International Symposium, ISBRA 2012, Dallas, TX, USA, May 21–23, 2012. Proceedings 8, pp. 299–314. Springer, 2012.
  • 22. Diez M, Medina-Muñoz SG, Castellano LA. et al. iCodon customizes gene expression based on the codon composition. Sci Rep 2022;12:12126.
  • 23. Raab D, Graf M, Notka F. et al. The GeneOptimizer algorithm: using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization. Syst Synth Biol 2010;4:215–25. doi: 10.1007/s11693-010-9062-3
  • 24. Parvathy ST, Udayasuriyan V, Bhadana V. Codon usage bias. Mol Biol Rep 2022;49:539–65.
  • 25. Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput 1997;1:67–82. doi: 10.1109/4235.585893
  • 26. Li S, Noroozizadeh S, Moayedpour S. et al. mRNA-LM: full-length integrated SLM for mRNA analysis. Nucleic Acids Res 2025;53:gkaf044.
  • 27. Castillo-Hair S, Fedak S, Wang B. et al. Optimizing 5′UTRs for mRNA-delivered gene editing using deep learning. Nat Commun 2024;15:5284. doi: 10.1038/s41467-024-49508-2
  • 28. Liu G, Zhang Y, Zhang T. Computational approaches for effective CRISPR guide RNA design and evaluation. Comput Struct Biotechnol J 2020;18:35–44. doi: 10.1016/j.csbj.2019.11.006
  • 29. Churkin A, Retwitzer MD, Reinharz V. et al. Design of RNAs: comparing programs for inverse RNA folding. Brief Bioinform 2018;19:350–8. doi: 10.1093/bib/bbw120
  • 30. Ward M, Courtney E, Rivas E. Fitness functions for RNA structure design. Nucleic Acids Res 2023;51:e40–0. doi: 10.1093/nar/gkad097
  • 31. Han D, Qi X, Myhrvold C. et al. Single-stranded DNA and RNA origami. Science 2017;358:eaao2648.
  • 32. Zadeh JN, Steenberg CD, Bois JS. et al. NUPACK: analysis and design of nucleic acid systems. J Comput Chem 2011;32:170–3. doi: 10.1002/jcc.21596
  • 33. Zur H, Tuller T. Strong association between mRNA folding strength and protein abundance in S. cerevisiae. EMBO Rep 2012;13:272–7. doi: 10.1038/embor.2011.262
  • 34. Turner DH, Mathews DH. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res 2010;38:D280–2.
  • 35. Mathews DH, Sabina J, Zuker M. et al. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 1999;288:911–40. doi: 10.1006/jmbi.1999.2700
  • 36. Mathews DH, Disney MD, Childs JL. et al. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci 2004;101:7287–92.
  • 37. Mittal A, Turner DH, Mathews DH. NNDB: an expanded database of nearest neighbor parameters for predicting stability of nucleic acid secondary structures. J Mol Biol 2024;436:168549.
  • 38. Studnicka GM, Rahn GM, Cummings IW. et al. Computer method for predicting the secondary structure of single-stranded RNA. Nucleic Acids Res 1978;5:3365–88. doi: 10.1093/nar/5.9.3365
  • 39. Zuber J, Schroeder SJ, Sun H. et al. Nearest neighbor rules for RNA helix folding thermodynamics: improved end effects. Nucleic Acids Res 2022;50:5251–62. doi: 10.1093/nar/gkac261
  • 40. Ward M, Datta A, Wise M. et al. Advanced multi-loop algorithms for RNA secondary structure prediction reveal that the simplest model is best. Nucleic Acids Res 2017;45:8541–50. doi: 10.1093/nar/gkx512
  • 41. Nussinov R, Jacobson AB. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci 1980;77:6309–13.
  • 42. Huang L, Zhang H, Deng D. et al. LinearFold: linear-time approximate RNA folding by 5′-to-3′ dynamic programming and beam search. Bioinformatics 2019;35:i295–304. doi: 10.1093/bioinformatics/btz375
  • 43. Johnson DB. Efficient algorithms for shortest paths in sparse networks. J ACM 1977;24:1–13. doi: 10.1145/321992.321993
  • 44. Eddy SR. What is dynamic programming? Nat Biotechnol 2004;22:909–10. doi: 10.1038/nbt0704-909
  • 45. Lyngsø RB, Zuker M, Pedersen CNS. Internal loops in RNA secondary structure prediction. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology, pp. 260–7, 1999.
  • 46. Babendure JR, Babendure JL, Ding J-H. et al. Control of mammalian translation by mRNA structure near caps. RNA 2006;12:851–61. doi: 10.1261/rna.2309906
  • 47. Hinnebusch AG, Ivanov IP, Sonenberg N. Translational control by 5′-untranslated regions of eukaryotic mRNAs. Science 2016;352:1413–6. doi: 10.1126/science.aad9868
  • 48. Sample PJ, Wang B, Reid DW. et al. Human 5′ UTR design and variant effect prediction from a massively parallel translation assay. Nat Biotechnol 2019;37:803–9. doi: 10.1038/s41587-019-0164-5
  • 49. Ringnér M, Krogh M. Folding free energies of 5′-UTRs impact post-transcriptional regulation on a genomic scale in yeast. PLoS Comput Biol 2005;1:e72. doi: 10.1371/journal.pcbi.0010072
  • 50. Liu L, Botos I, Wang Y. et al. Structural basis of Toll-like receptor 3 signaling with double-stranded RNA. Science 2008;320:379–81. doi: 10.1126/science.1155406
  • 51. Gaspar P, Moura G, Santos MAS. et al. mRNA secondary structure optimization using a correlated stem–loop prediction. Nucleic Acids Res 2013;41:e73–3. doi: 10.1093/nar/gks1473
  • 52. Nakamura Y, Gojobori T, Ikemura T. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res 2000;28:292–2. doi: 10.1093/nar/28.1.292
  • 53. Mathews DH. Revolutions in RNA secondary structure prediction. J Mol Biol 2006;359:526–32. doi: 10.1016/j.jmb.2006.01.067
  • 54. Wuchty S, Fontana W, Hofacker IL. et al. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 1999;49:145–65.
  • 55. Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 2003;31:7280–301. doi: 10.1093/nar/gkg938
  • 56. Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search. Commun ACM 1975;18:333–40. doi: 10.1145/360825.360855

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
