Abstract
Messenger RNA (mRNA) vaccines are being used to combat the spread of COVID-19 (refs. 1–3), but they still exhibit critical limitations caused by mRNA instability and degradation, which are major obstacles for the storage, distribution and efficacy of the vaccine products4. Increasing secondary structure lengthens mRNA half-life, which, together with optimal codons, improves protein expression5. Therefore, a principled mRNA design algorithm must optimize both structural stability and codon usage. However, owing to synonymous codons, the mRNA design space is prohibitively large—for example, there are around 2.4 × 10632 candidate mRNA sequences for the SARS-CoV-2 spike protein. This poses insurmountable computational challenges. Here we provide a simple and unexpected solution using the classical concept of lattice parsing in computational linguistics, where finding the optimal mRNA sequence is analogous to identifying the most likely sentence among similar-sounding alternatives6. Our algorithm LinearDesign finds an optimal mRNA design for the spike protein in just 11 minutes, and can concurrently optimize stability and codon usage. LinearDesign substantially improves mRNA half-life and protein expression, and profoundly increases antibody titre by up to 128 times in mice compared to the codon-optimization benchmark on mRNA vaccines for COVID-19 and varicella-zoster virus. This result reveals the great potential of principled mRNA design and enables the exploration of previously unreachable but highly stable and efficient designs. Our work is a timely tool for vaccines and other mRNA-based medicines encoding therapeutic proteins such as monoclonal antibodies and anti-cancer drugs7,8.
Subject terms: Computational biology and bioinformatics, Computer science, RNA vaccines
An algorithm based on concepts established in computational linguistics enables rapid principled design of mRNA vaccines optimizing both structural stability and codon usage, resulting in improved half-life, protein expression and immune responses.
Main
mRNA vaccines9,10 have been recognized as viable tools to limit the spread of COVID-19 owing to their scalable production, safety and efficacy1–3. However, mRNA molecules are chemically unstable and prone to degrade, which leads to insufficient protein expression5, and, in turn, compromised immunogenicity and druggability. This instability has also become a major obstacle in the storage and distribution of the vaccine, requiring the use of cold-chain technologies that hinders its use in developing countries4. Thus an mRNA molecule with enhanced stability is desirable, which would potentially have greater potency and favourable clinical efficacy.
Although it remains difficult to model chemical stability, previous work has established its correlation with RNA secondary structure, as quantified by the well-studied thermodynamic folding stability. Improving this structural stability, combined with optimal codon usage, leads to increased protein expression5. Therefore, a principled mRNA design algorithm must optimize two factors—structural stability and codon usage—to enhance protein expression.
However, the mRNA design problem (we consider only the coding region in this work) is extremely challenging owing to the exponentially large search space. Each amino acid is encoded by a triplet codon—that is, three adjacent nucleotides—but owing to redundancies in the genetic code, most amino acids have multiple codons; there are 43 (that is, 64) codons for the 20 common naturally occurring amino acids. This results in a prohibitively large number of candidates for any protein sequence. For example, the spike protein of SARS-CoV-2 has 1,273 amino acids and can therefore be encoded by approximately 2.4 × 10632 mRNA sequences (Fig. 1a). This poses an insurmountable computational challenge and rules out enumeration, which would take 10616 billion years for the spike protein (Fig. 1b). Conversely, codon optimization11,12, the conventional approach to mRNA design, optimizes codon usage but barely improves stability, leaving out the huge space of highly stable mRNAs. Optimizing GC content has a similar effect as it correlates with codon usage in vertebrates13. As a result, the vast majority of highly stable designs remains unexplored.
Here we describe LinearDesign, an algorithm that addresses this challenge by adapting the classical concept of lattice parsing6 in computational linguistics (Fig. 1c). We show that finding the optimal mRNA among the vast space of candidates is analogous to finding the most likely sentence among many similar-sounding alternatives. More specifically, we formulate the mRNA design space using a deterministic finite-state automaton (DFA), similar to a word lattice6, which compactly encodes exponentially many mRNA candidates. We then use lattice parsing to find the most stable mRNA in the DFA, or the optimal balance between stability and codon optimality in a weighted DFA. This unexpected connection to natural language enables an efficient algorithm that scales quadratically with the mRNA sequence length in practice. In this sense, our work transforms the enormous design space into an advantage—providing freedom of design—rather than an obstacle.
Compared to the codon-optimized benchmark, our COVID-19 and varicella-zoster virus (VZV) mRNA vaccines substantially improve chemical stability in vitro, protein expression in cells and immunogenicity in vivo. In particular, our COVID-19 vaccines achieved up to 128 times the antibody response of the codon-optimized benchmark in mice. This result reveals the great potential of principled mRNA design, and enables the exploration of these previously unreachable but highly stable and efficient designs. Our work provides a timely and promising tool for the design of mRNA vaccines and other mRNA-based medicines14 encoding therapeutic proteins including monoclonal antibodies7 and anti-cancer drugs8.
Formulations and algorithms
Previous work5 established two main objectives for mRNA design, stability and codon optimality, which synergize to increase protein expression. To optimize for stability, given a protein sequence, we aim to find the mRNA sequence that has the lowest minimum-free-energy change (MFE) among all possible mRNA sequences encoding that protein; that is, for each candidate mRNA sequence, we find its MFE structure among all its possible secondary structures using the standard RNA folding energy model15,16 and then choose the sequence whose MFE energy is the lowest. This is thus a minimization within a minimization (Extended Data Fig. 1a). This method would take billions of years, thus an efficient algorithm without enumeration is needed.
We also aim to jointly optimize mRNA stability and codon optimality. Codon optimality is often measured by the codon adaptation index17 (CAI), which is defined as the geometric mean of the relative adaptiveness of each codon in the mRNA. Because CAI is between 0 and 1 but MFE is generally proportional to the sequence length, we multiply the logarithm of CAI by the number of codons in the mRNA and use the hyper-parameter CAI weight (λ) to balance MFE and CAI (λ = 0 being MFE-only). The combined objective is MFE – λ|p| log CAI, where |p| is the protein length. See Methods, ‘Optimization objectives’ and Extended Data Fig. 1b for details.
We next describe our solution to these two optimization problems with two ideas borrowed from natural language: DFA (lattice) representation and lattice parsing.
Lattice representation for mRNA design space
Inspired by the word lattice representation of ambiguities in computational linguistics (Extended Data Fig. 2a), we represent the choice of codons for each amino acid using a similar lattice—more formally, a DFA, which is a directed graph with nucleotide-labelled edges (Fig. 2a and Extended Data Fig. 1c; see Methods, ‘DFA representations for codons and mRNA candidate sequences’ for formal definitions). After building a codon DFA for each amino acid in the protein sequence, we concatenate them into a single mRNA DFA, where each path between the start and final states represents a possible mRNA sequence encoding that protein (Fig. 2b and Extended Data Fig. 1d).
Lattice parsing
RNA folding is known to be equivalent to natural language parsing, where a stochastic context-free grammar (SCFG) can represent the folding energy model18 (Extended Data Fig. 1e,f). For mRNA design, the hard question is how all the mRNA sequences in the DFA can be folded together. We borrow the idea of lattice parsing6,19, which generalizes single-sequence parsing to handle all sentences in the lattice simultaneously to find the most likely one (Fig. 1c and Extended Data Fig. 2). Similarly, we use lattice parsing to fold all sequences in the mRNA DFA simultaneously to find the most stable one (Fig. 2b and Extended Data Fig. 1g,h). Note that lattice parsing is also an instance of dynamic programming, but over a much larger search space, and single-sequence folding can be viewed as a special case of lattice parsing with a single-chain DFA. This process can also be interpreted as the SCFG–DFA intersection (Extended Data Fig. 1a) where the SCFG scores for stability and the DFA demarcates the set of candidates. The runtime of this algorithm scales cubically with the mRNA sequence length (Methods, ‘SCFG, lattice parsing and intersection’), but for practical applications it scales quadratically (Fig. 3a).
Lattice parsing with weighted DFAs
We now extend DFAs to weighted DFAs to integrate codon optimality on edge weights. Since our joint optimization formulation factors CAI onto the relative adaptiveness w(c) of each individual codon c, we set edge weights in each codon DFA so that a codon c has path cost –log w(c), which can be interpreted as the “amount of deviation” from the optimal codon. Then in a weighted mRNA DFA, the cost of each start-end path is the sum of –log w(c) for each codon c in the corresponding mRNA, which is proportional to its –log CAI (Fig. 2d). Now lattice parsing takes a stochastic grammar (for stability) and a weighted DFA (for codon usage) and solves the joint optimization with optimality guarantee, which can be viewed as the weighted intersection20 between an SCFG and a weighted DFA (Extended Data Fig. 1b and Methods, ‘Weighted DFA for CAI integration’).
Expressiveness of DFAs
Our DFA framework is sufficiently general that it can also represent alternative genetic codes, modified nucleotides and coding constraints. For details, see Methods, ‘DFAs for other genetic codes, coding constraints and modified nucleotides’, Extended Data Fig. 3 and Supplementary Fig. 5.
Linear-time approximation
The exact design algorithm might still be slow for long sequences. Additionally, suboptimal designs may also be worth exploring for wet laboratory experiments, owing to the many other factors involved in mRNA design besides stability and codon usage. We therefore developed an approximate search version that runs in linear time using beam search, keeping only the top b most promising items per step (where b is the beam size), inspired by our previous work LinearFold21.
Related work
Two previous studies also tackled the problem of ‘most stable mRNA design’ (our objective 1) via dynamic programming, but using specialized extensions of the Zuker algorithm22,23 that cannot incorporate codon optimality (objective 2). By contrast, we established the connection between mRNA design and lattice parsing from computational linguistics. This connection enabled a simpler and more generalizable algorithm that can jointly optimize codon usage with a novel objective function that factors CAI onto individual codons. We also verified these algorithmic designs in vivo, showing substantial improvements for two mRNA vaccines (Figs. 4 and 5). See Methods, ‘The LinearDesign algorithm’ and ‘Related work’ for details.
In silico results and analysis
Figure 3a benchmarks the runtime of LinearDesign on UniProt proteins24. LinearDesign was shown in a combination of two optimization objectives: MFE-only (objective 1) versus MFE and CAI (objectives 1 and 2), and two search modes: exact search versus beam search (b = 500). Empirically, LinearDesign scales quadratically with mRNA sequence length n for practical applications (n < 10,000 nt) thanks to the DFA representation and lattice parsing (Supplementary Figs. 7 and 8). Next, our CAI-integrated exact search (λ = 4) had the same empirical complexity, and was only around 15% slower than the MFE-only version thanks to the convenience of our DFA representation for adding CAI. Last, our beam search version (b = 500) further speeds up the design and scales linearly with sequence length, taking only 2.7 min (versus 10.7 min for exact search) on the SARS-CoV-2 spike protein (for MFE-only), with an approximation error (percentage energy gap, defined as (1 – MFEapprox_design/MFEexact_design) × 100%) of 1.2%, where the subscripts indicate approximate or exact design. In fact, as sequences get longer, this percentage stabilizes, suggesting that beam search quality does not degrade with sequence length (Supplementary Fig. 9).
For a GC-favouring codon preference (such as in humans), the conventional codon-optimization method does improve stability, but only slightly, since the codon-optimization direction (pink arrows) are largely orthogonal to the stability optimization direction (blue arrows) (Fig. 3b,c). By contrast, our LinearDesign can directly optimize stability and find the optimally stable mRNAs. On both the SARS-CoV-2 spike protein and the VZV gE protein, the lowest MFEs (λ = 0) are 1.8 times lower than the optimal CAIs (λ = ∞). Also, our optimally stable designs have mostly double-stranded secondary structures (Fig. 3d), which are predicted to be much less prone to degradation5. By varying λ from 0 to ∞, LinearDesign computes the feasibility limit (optimal boundary) of the mRNA design space (the blue curves in Fig. 3b,c; see Extended Data Fig. 4 for λ in (–∞, 0]). Furthermore, when the codon bias prefers AU-rich codons (such as in yeast), codon optimization actually worsens the stability (Extended Data Fig. 4b).
Results for COVID-19 mRNA vaccines
We examined eight mRNA sequences for the SARS-CoV-2 spike protein in this study. Seven sequences (sequences A–G) were designed with the LinearDesign algorithm as suboptimal molecules (with beam search21,25). They were widely distributed in the low-MFE design space (the region where MFE ≤ −1,400 kcal mol−1 in Fig. 4a), which is unreachable with a conventional codon optimizing algorithm. To obtain a better understanding of the biological effects of MFE and CAI, we designed these mRNA sequences to have almost identical values in either MFE (B and C have similar MFEs, while D, E and F have similar MFEs) or CAI (A, C and F have similar CAIs, B and E have similar CAIs and D, G and H have similar CAIs). The eighth mRNA sequence (sequence H) was designed with OptimumGene, a widely used codon-optimization tool, as a benchmark. This benchmark sequence has been used in a COVID-19 mRNA vaccine that elicited high immunogenicity in two animal models26 and entered a phase I clinical trial in China (co-developed with the Chinese Center for Disease Control and Prevention; Chinese Clinical Trial Registry: CTR20210542). All of these mRNA sequences encode the same amino acid sequence of full-length wild-type SARS-CoV-2 spike protein, use natural unmodified nucleotides, and share the same 5′- and 3′-UTRs (see Supplementary Information for sequences).
Considering the potential negative effect on translation efficiency caused by a structured 5′-leader region5, we did not include the first 5 amino acids when running LinearDesign, and instead used a heuristic to select the first 15 nucleotides. It has also been suggested that long helices may elicit unwanted innate immune responses27, so we avoided them in our designs. This also explains why we did not study the lowest-MFE candidates (those closest to the optimal boundary—the blue curve in Fig. 4a), which usually contain long stems. See Methods, ‘Additional design constraints’ for details.
Besides coding region design, UTR structure is also crucial for translation28 and UTR engineering has a profound effect on protein expression3. Although LinearDesign does not address UTR optimization per se, its designed mRNA molecules—as they are more structured than solely codon-optimized ones—form fewer base pairs with and thus interfere less with the structures of widely used UTRs (Extended Data Table 1). This was confirmed by a different pair of UTRs in our VZV mRNA vaccine experiments (Extended Data Table 2) leading to improved protein expression and immune responses (Fig. 5). This evidence suggests that LinearDesign is likely to remain effective independent of the choice of UTRs, which is also consistent with a recent study29 in which LinearDesign-generated sequences with three different UTRs exhibited stronger in vitro protein expression over all benchmark sequences (see figure 4a in ref. 29); see Methods, ‘Related work’ for details.
Extended Data Table 1.
Here we show the numbers of predicted base pairs between UTRs and the coding regions of SARS-CoV-2 Spike protein. We used 5 different UTRs: StemiRNA COVID-19 UTRs used in wet lab experiments, BNT-162b2 (BioNTech) UTRs, mRNA-1273 (Moderna) UTRs, CV2CoV (CureVac) UTRs, and a widely used human β-globin mRNA UTRs. We tested 14 different sequences of the coding region: sequences A–H for wet lab experiments, sequences from three main mRNA vaccine companies, MFE-opt. and CAI-opt. sequences (i.e., sequences with the lowest folding free energy and with CAI = 1, respectively), and the wildtype sequence. Most of the LinearDesign-generated mRNA sequences (sequences A–F and MFE-opt.) form fewer base pairs with UTRs. The folding free energies and structures are predicted by Vienna RNAfold (-d0 mode); MFEs of CDS are calculated without stop codon.
Extended Data Table 2.
Here we used 5 different UTRs: StemiRNA VZV UTRs used in wet lab experiments, BNT-162b2 (BioNTech) UTRs, mRNA-1273 (Moderna) UTRs, CV2CoV (CureVac) UTRs, and human β-globin mRNA UTRs. The 7 coding sequences are gE A–E, gE-Ther and gE-WT, which are used in the wet lab experiments of VZV. The folding free energies and structures are predicted by Vienna RNAfold (-d0 mode); MFEs of CDS are calculated without stop codon.
In-solution structure compactness and chemical stability
We then studied the structural compactness of mRNA molecules, which is hypothesized to be correlated with the folding free-energy change. An mRNA molecule with a lower MFE tends to contain more secondary structures, exhibit a more compact shape and have a smaller hydrodynamic size, resulting in a higher electrophoretic mobility. We loaded mRNA samples onto a non-denaturing agarose gel and found that RNA mobility rates correlated well with the calculated MFEs for sequences A–H (Fig. 4b) despite the sequences having similar molecular weights. Sequence A, with the lowest MFE, exhibited the highest mobility, followed by other sequences in order of their MFEs. Sequence H, which has the highest MFE value, was the least mobile. These data demonstrated the validity of the MFE calculation executed by LinearDesign.
To evaluate the chemical stability of mRNAs, we incubated the mRNAs in buffers containing 10 mM (Fig. 4c) or 20 mM (Extended Data Fig. 5) Mg2+ at 37 °C, and assessed RNA integrity following incubation, similar to previous work29. Sequences A–H showed distinct degradation rates that correlated well with their MFEs (Fig. 4c and Extended Data Fig. 5). Sequence A, which has the lowest MFE, showed the slowest degradation rate, with a half-life (T1/2) of 20.0 and 12.6 h in 10 and 20 mM Mg2+ buffers, respectively (Fig. 4c and Extended Data Fig. 5). By contrast, sequence H, which has the highest MFE, degraded the fastest with T1/2 of 3.9 and 3.3 h in 10 and 20 mM Mg2+ buffers, respectively. These results support the idea that low-MFE designs are more resistant to in-solution degradation, a favourable trait for biological applications.
Cellular protein expression
For vaccines, sufficient antigen expression is a key determinant for eliciting effective immune responses. We thus evaluated protein expression of the designed mRNAs. Sequences A–H were translated efficiently into spike protein following transfection in HEK293 cells. Of note, all seven mRNAs generated by LinearDesign (sequences A–G) showed remarkably higher protein expression levels than benchmark sequence H (Fig. 4d and Supplementary Fig. 12). Sequences D and G (with CAIs almost identical to H, but lower MFEs) expressed 2.3-fold higher protein levels than sequence H, and sequence A, with the lowest MFE, showed 2.9-fold higher expression. Collectively, our results are consistent with those of Mauger et al.5, which show that low MFE and high CAI synergize to improve protein expression; we were able to test this hypothesis using mRNA molecules with much lower MFE values, thanks to the ability of LinearDesign to explore the previously unreachable design space.
In vivo immunogenicity
We further tested whether these designs could endow increased immunogenicity in vivo. We inoculated mRNA sequences A–H into mice using a lipid-based formulation30, and evaluated their humoral and cellular immune responses. For each mRNA sequence, C57BL/6 mice were inoculated intramuscularly with two doses of the vaccines with an interval of two weeks. Levels of anti-spike IgG, neutralizing antibodies and spike-specific interferon-γ (IFNγ)-secreting T cells were assessed. All mRNA molecules from LinearDesign were able to elicit robust antibody responses. By contrast, sequence H mRNA showed very limited ability to induce antibodies (Fig. 4e,f). Similar results were also observed on the antigen-specific T cell response, where a robust T helper 1-biased T cell response was induced only by the LinearDesign mRNAs (Fig. 4g). Sequences A–D, which are closer to the optimal boundary (blue shaded region in Fig. 4a), elicited a 57 to 128× increase in anti-spike IgG antibody titres and a 9 to 20× increase in neutralizing antibody titres over those elicited by the benchmark sequence H.
Since BNT-162b2 from BioNTech-Pfizer is the most widely adopted COVID-19 mRNA vaccine, we compared it with the LinearDesign mRNAs. For this head-to-head comparison, our BNT sequence is almost identical to the sequence of BNT-162b2, but with three changes: (a) the two stabilizing proline mutations31 in BNT-162b2 converted back to the wild-type sequence, (b) BNT uses the same 5′- and 3′-UTRs as in sequences A–H, and (c) it uses natural unmodified nucleotides as in sequences A-H. Four mRNA sequences—A, C, H and BNT—were included in the study. A and C showed a markedly lower in-solution degradation rate and significantly higher protein expression in HEK293 cells than BNT (Extended Data Fig. 6). Note that BNT and H have very similar MFEs, CAIs (Fig. 4a) and half-lives. Moreover, A and C were able to elicit significantly higher levels of anti-spike IgG and neutralizing antibodies than H and BNT (Extended Data Fig. 7). Collectively, these data lead us to speculate that LinearDesign-optimized mRNA molecules are more stable in vivo, which leads to improved protein expression and enhanced immunogenicity.
Results for VZV mRNA vaccines
To further evaluate the generalizability of LinearDesign, we applied the algorithm to the design of a mRNA vaccine for VZV. Vaccination against VZV is considered an effective approach to reduce the risk of shingles32. Using the same strategy as for spike mRNA design (Fig. 4a), we generated five mRNA sequences encoding the full-length VZV gE protein (gE-A to gE-E). These sequences are widely distributed in the previously unexplored high-thermostability region (Fig. 5a). These sequences were benchmarked to the gE-Ther sequence, which we designed with the widely used codon-optimization tool GeneOptimizer33. These mRNAs, including wild-type gE mRNA (gE-WT), shared the same encoded amino acid sequence and 5′ and 3′ UTRs (the sequences are provided in Supplementary Information). In line with the spike mRNA data (Fig. 4b), gE-A mRNA, which has the lowest MFE, showed the greatest mobility in a non-denaturing gel (Fig. 5b) and markedly slower degradation rates with a T1/2 of 66.5 h in 10 mM (Fig. 5c) and 50.7 h in 20 mM (Extended Data Fig. 8a) Mg2+ buffers, indicating a high chemical stability correlated with the compactness of molecules. By contrast, gE-Ther showed a T1/2 of 10.9 h in 10 mM and 5.9 h in 20 mM Mg2+ buffers. We also observed that gE mRNA molecules were more stable than spike mRNAs owing to their shorter length34. In addition, protein expression from most of the LinearDesign-generated mRNAs (gE-B to gE-E) was significantly higher than for gE-Ther and gE-WT in HEK293 cells 48 h (Fig. 5d) and 24 h (Extended Data Fig. 8b) after transfection. However, the best-performing mRNA molecules were gE-B, gE-C and gE-D. They outperformed gE-A, which has the lowest CAI, and gE-E, which has the highest MFE. This emphasizes the importance of jointly optimizing CAI and MFE. The most highly expressed molecules were those whose CAI and MFE were both in the favourable region (light blue shaded area; Fig. 5a). Finally, we evaluated the immune respopnse elicited by VZV mRNA vaccines in C57BL/6 mice. LinearDesign mRNA molecules (gE-B, gE-C and gE-E) induced significantly higher levels of anti-gE IgG than gE-Ther or gE-WT (Fig. 5e).
Discussion
An effective mRNA design strategy is of utmost importance for the development of mRNA vaccines, which have shown great promise against the COVID-19 pandemic. However, this task remains challenging owing to the prohibitively large search space. Here we present a simple solution by reducing the mRNA design problem to the classical problem of lattice parsing used in computational linguistics. This work resulted in an efficient algorithm that can design an optimal mRNA encoding the SARS-CoV-2 spike protein in 11 min. It can also jointly optimize stability and codon usage, which has been shown to be crucial for mRNA design. This approach is one of several recent fruitful exchanges between linguistics and biology35,36.
Here we have comprehensively characterized mRNA sequences generated by LinearDesign and demonstrated their superiority over the commonly used codon-optimization benchmark using two viral antigens across three attributes that are critical for vaccine performance: chemical stability, protein translation and in vivo immunogenicity. In particular, our designs for mRNA encoding the SARS-CoV-2 spike protein showed an increase of up to 128-fold in binding antibody levels over the codon-optimization benchmark. Our VZV gE mRNA designs—using a different UTR pair—also showed substantial improvements over the benchmark. These results indicate the robustness of LinearDesign in optimizing the coding region independently of UTR pairs. Indeed, coding region design and UTR engineering3 are complementary and could be combined in future work. It is worth noting that our designed mRNAs did not use chemical modification which is widely believed to be critical to the recent success of mRNA vaccines1,2,10,37,38, yet still showed high levels of stability, translation efficiency and immunogenicity, with the additional advantage of a lower manufacturing cost. The LinearDesign approach is likely to complement chemical modification and can be easily adapted to modified nucleotides once the corresponding energy model is available. Our work has only considered stability and codon usage but, owing to the generalizability of the lattice representation, could also be adapted to optimize other parameters relevant to mRNA design. By opening up the previously inaccessible region of highly stable and efficient sequences, this approach provides a timely and promising tool for mRNA vaccine development that is likely to have a key role in future pandemics. It is also a principled method for molecule design in the field of mRNA medicines generally, and can be used for all therapeutic proteins including monoclonal antibodies and anti-cancer drugs.
Methods
The LinearDesign algorithm
Optimization objectives
There are two objectives in mRNA design: stability and codon optimality. The optimal-stability mRNA design problem can be formalized as follows. Given a protein sequence where each is an amino acid residue, we find the optimal mRNA sequence that has the lowest MFE among all possible mRNA sequences encoding that protein:
1 |
2 |
where is the set of candidate mRNA sequences, is the set of all possible secondary structures for mRNA sequence , and is the free-energy change of structure for mRNA according to an energy model. This is clearly a double minimization objective involving the per-sequence minimization over all of its possible structures (that is, RNA folding; equation (2)), which has well-known dynamic programming solutions, and the global minimization over all sequences (that is, optimal mRNA design; equation (1)) which we will solve using lattice parsing (see ‘SCFG, lattice parsing and intersection’).
Next, we integrate codon optimality by adding CAI17, defined as the geometric mean of the codon optimality of each codon in the mRNA :
3 |
where is the triplet codon in , and is the relative adaptiveness of codon , defined as the frequency of divided by the frequency of its most frequent synonymous codon . Because CAI is always between 0 and 1 but MFE is generally proportional to the mRNA sequence length, we scale CAI by the number of codons and use a hyper-parameter to balance MFE and CAI ( being purely MFE), and define a novel joint objective:
4 |
which can be simplified by expanding CAI:
5 |
This joint objective is basically MFE plus (a scaled) sum of the negative logarithm of each codon’s relative adaptiveness. Now the joint optimization can be defined as:
6 |
See Fig. 2d for examples of relative adaptiveness calculation.
DFA representations for codons and mRNA candidate sequences
Informally, a DFA is a directed graph with labelled edges and distinct start and end states. For our purpose each edge is labelled by a nucleotide, so that for each codon DFA, each start-to-end path represents a triplet codon; see Fig. 2a and Extended Data Fig. 1c for examples. Formally, a DFA is a 5-tuple , where is the set of states, is the alphabet (here ), is the start state (always in this work), is the set of end states (in this work the end state is unique—that is, ), and is the transition function that takes a state and a symbol and returns the next state —that is, encodes a labelled edge .
After building DFAs for each amino acid, we can concatenate them (concatenation is indicated by below) into a single DFA for a protein sequence , which represents all possible mRNA sequences that translate into that protein
by stitching the end state of each DFA with the start state of the next. See Extended Data Fig. 1d for examples. The new end state of the mRNA DFA is .
We also define to be the set of outgoing edges from state , and to be the set of incoming edges (which will be used in the pseudocode; Supplementary Figs. 2 and 3):
For the mRNA DFA in Extended Data Fig. 1d, out_edges ((3,0)) = and in_edges ((9,0)) = .
SCFG, lattice parsing and intersection
A SCFG is a context-free grammar in which each rule is augmented with a weight. More formally, an SCFG is a 4-tuple where is the set of non-terminals, is the set of terminals (identical to the alphabet in the DFA, in this case ), is the set of weight-associated context-free writing rules, and is the start symbol. Each rule in has the form where is a non-terminal that can be rewritten according to this rule into a sequence of non-terminals and terminals (the star means repeating zero or more times) and is the weight associated with this rule.
SCFGs can be used to represent the RNA folding energy model39. The weight of a derivation (parse tree, or a secondary structure in this case) is the sum of weights of the productions used in that derivation. For example, for a very simple Nussinov–Jacobson-style model40, which simplifies the energy model to individual base pairs, we can define this SCFG as in Extended Data Fig. 1e, where each GC pair gets a score of −3, and each AU pair gets a score of −2. Thus, the standard RNA secondary structure prediction problem can be cast as a parsing problem: given the above SCFG and an input RNA sequence, find the minimum-weight derivation in that can generate the sequence. This can be solved by the classical CKY algorithm from computational linguistics41–43.
The optimal-stability mRNA design problem is now a simple extension of the above single-sequence folding problem to the case of multiple inputs: instead of finding the minimum-free-energy structure (minimum-weight derivation) for a given sequence, we find the minimum-free-energy structure (and its corresponding sequence) among all possible structures for all possible sequences (Extended Data Fig. 1). This can be solved by lattice parsing on the DFA, which is a generalization of CKY from a single sequence to a DFA. Take the bifurcation rule for example. In CKY, if you have derived non-terminal for span , notated , and if you have also derived , you can combine the two spans—that is, —and use the above rule to derive . Similarly, in lattice parsing, if you have derived both (that is, there is a path that can be derived from ) and , you can combine them to a longer path and derive with the above rule. While the runtime for CKY scales where is the grammar constant (the number of rules) and is the RNA sequence length, the runtime for lattice parsing similarly scales where is the number of states in the DFA. For mRNA design with the standard genetic code, because each position has either one or two states ( and ), so its time complexity is also actually identical to single-sequence folding, just with a larger constant. See ‘Left-to-right dynamic programming’ for details of this algorithm and Supplementary Figs. 2 and 3 for the pseudocode.
More formally, in theoretical computer science, lattice parsing with an CFG on a DFA is also known as the intersection between the languages of and (that is, the sets of sequences allowed by and ), notated , which was solved by the Bar-Hillel construction in 1961 (ref. 19). In order to adapt it to mRNA design, we need to extend this concept to the case of weighted (that is, stochastic) grammars and weighted DFAs (the latter is needed for CAI integration; see below). While the language of CFG is the set of sequences generated by , the language of the SCFG for RNA folding free-energy model defines a mapping from each RNA sequence to its MFE—that is, . This can be written as a relation:
And we also extend the language of a DFA to a trivial weighted language (which will facilitate the incorporation of CAI into DFA below):
Next we extend the intersection from two sets to two weighted sets A and B:
Now we can show that optimal-stability mRNA design problem can be solved via weighted intersection between and —that is, we can construct a new ‘intersected’ stochastic grammar that has the same weights (that is, energy model) as the original grammar but only generates sequences in the DFA:
Weighted DFA for CAI integration
As described in the main text and Fig. 2d, our novel joint optimization objective (equation (6)) factors the CAI of each mRNA candidate onto the relative adaptiveness of each of its codons, and thus can be easily incorporated into the DFA as edge weights. To do this we need to extend the definition of DFA to weighted DFA, where the transition function now returns a state and a weight—that is, —which encodes a weighted label edge . Now the set of outgoing and incoming edges are also updated to:
In this case, the weighted DFA defines a mapping from each candidate mRNA sequence to its negative logarithm of CAI scaled by the number of codons—that is, . More formally,
Now the weighted intersection defined above can be extended to incorporate the hyper-parameter and derive the joint objective:
Bottom-up dynamic programming
Next, we describe how to implement the dynamic programming algorithm behind lattice parsing (or equivalently, intersection between the languages of a SCFG and a weighted DFA) to solve the joint optimization problem. For simplicity reasons, here we use bottom-up dynamic programming on a modified Nussinov–Jacobson energy model. Supplementary Fig. 2 gives the pseudocode for this simplified version. We first build up the mRNA DFA for the given protein, and initialize two hash tables, ‘best’ to store the best score of each state, and ‘back’ to store the best backpointer. For the base cases we set for optimal-stability design, and for the joint optimization where
7 |
is the minimum (-scaled) cost of any path in the CAI-integrated DFA. Next, for each state it goes through the pairing rule and bifurcation rules, and updates if a better score is found. After filling out the hash tables bottom-up, we can backtrace the best mRNA sequence stored with the backpointers. See Supplementary Fig. 3 for details of UPDATE and BACKTRACE functions.
Left-to-right dynamic programming
Inspired by our previous work, LinearFold21, we further developed a left-to-right dynamic programming, which is equivalent to the above bottom-up version but explores the search space incrementally from left to right; see Supplementary Fig. 4 for the pseudocode. This left-to-right order also enables beam search44, a classical pruning technique, to significantly narrow down the search space without sacrificing too much search quality. Our real system uses this left-to-right dynamic programming on the Turner nearest-neighbour free-energy model15,16, and our thermodynamic parameters follow LinearFold and Vienna RNAfold45, except for the dangling ends, which do not contribute stability in LinearDesign. Dangling ends refer to stabilizing interactions for multiloops and external loops46, which require knowledge of the nucleotide sequence outside of the state . Though it could be integrated in LinearDesign, the implementation is more involved so we leave it to future work.
DFAs for other genetic codes, coding constraints and modified nucleotides
The DFA framework can also represent less common cases such as alternative genetic codes, modified nucleotides, and coding constraints. First, the DFA can encode non-standard genetic codes, such as alternative nuclear code for some yeast47 and mitochondrial codes48 (Extended Data Fig. 3a). Second, we may want to avoid some unwanted or rare codons (such as the amber stop codon) which is an easy change on the codon DFAs (Extended Data Fig. 3b), or certain adjacent codon pairs that modulate translation efficiency49, which is beyond the scope of single codon DFAs but easy on the mRNA DFA (Extended Data Fig. 3c). Similarly, we may want to disallow certain restriction enzyme recognition sites, which span across multiple codons (Supplementary Fig. 5). Finally, chemically modified nucleotides such as pseudouridine (Ψ) have been widely used in mRNA vaccines38, which can also be incorporated in the DFA (Extended Data Fig. 3d).
Related work
Here we first discuss the advantages of our algorithm over previous work, and then discuss a recent work29 that uses LinearDesign in experimental screening.
Two previous studies22,23 also tackled the problem of optimal-stability mRNA design (that is, our objective 1) via dynamic programming, but their algorithms are complicated, not generalizable and less efficient. By contrast, the stability-only version of our work reduced the mRNA design problem to the classical computational linguistics problem of lattice parsing, resulting in a much simpler and more efficient algorithm that is vastly different from the specifically-designed algorithms such as the one described in Cohen et al.22 and CDSfold23. More importantly, our work further solves the harder and practically more important problem of joint optimization between stability and codon optimality, which subsumes the stability-only objective as a special case. Here we comprehensively compare our work to the previous ones in the following seven aspects.
Lattice representation of the design space
Our work is the first to use automata theory to compactly and conveniently represent the exponentially large mRNA design space. By contrast, Cohen et al. and CDSfold extend the standard Zuker algorithm with the consideration of amino acid constraints, and they do not have any graph-theoretic or formal representation of the design space. To handle the nucleotide dependencies of the first and third positions in the codons of leucine and arginine, CDSfold introduces the ‘extended nucleotides’, which classify the same nucleotide at the second position with different notations regarding the dependency. See Supplementary Fig. 6 for the lattice representation of leucine in our work as an example, and the extended nucleotides of leucine in CDSfold as a comparison. More importantly, our lattice representation is able to integrate (the logarithm) of CAI for a joint optimization of stability and codon optimality, and is general for arbitrary genetic code; see the details in later paragraphs.
Lattice parsing
Based on our DFA representation, we further reduce the mRNA design problem to the classical computational linguistics problem of lattice parsing, which aims to find the most grammatical sentence among exponentially many alternatives. This problem was solved by Bar-Hillel et al. in 1961 (ref. 19). Therefore, instead of inventing a new algorithm, we simply adapt the classical lattice parsing to mRNA design using our algorithm of LinearDesign. Note that the single-sequence folding is a special case of our algorithm where the lattice is a single chain.
Efficiency
More interestingly, our simple adapted algorithm reduces the constant factor of the cubic-time bifurcation rule that dominates the runtime of mRNA design, leading to better efficiency over previous work such as CDSfold. Supplementary Fig. 7b illustrates the space and time complexity under the classical Nussinov energy model.
The single-sequence RNA folding defines a span as an item, where and are indices in the RNA sequence. For a sequence with n nucleotides, during dynamic programming, at most items are generated for the bifurcation rule ; space-wise, at most items are stored.
Extending RNA folding to lattice parsing, our work defines each item as , where and are the nodes in the lattice: and. Since there are at most two nodes at each position, the number of items stored is at most . For the bifurcation rule , items and are combined to form a bigger item , in which at most items are generated (at most two nodes each for , and ). See Supplementary Fig. 7c for the illustration of above analysis; see lines 22–25 in Supplementary Fig. 2 and lines 20–24 in Supplementary Fig. 4 for the pseudocode of the bifurcation case in our work.
By contrast, CDSfold defines each item as , where and are the nucleotides at positions and , respectively. The number of items stored in CDSfold scales , because there are at most 4 nucleotide types for each and . For the bifurcation rule , items , and , are combined to form , in which at most items are generated (at most nucleotide types at and , and nucleotide pairs between and ). See Supplementary Fig. 7d for the analysis illustration of CDSfold.
Compared to CDSfold, our work largely reduces the time complexity constant of the bifurcation rule from 128 to 8. The cubic-time bifurcation rule which dominates the runtime in CDSfold is greatly accelerated in our algorithm. Empirically, our algorithm scales quadratically rather than cubically with mRNA sequence length for practical applications (Fig. 3 and Supplementary Fig. 8).
Joint optimization of stability and codon optimality
Codon optimality is an important factor in mRNA design, which should be jointly optimized with stability5, and our work is the first to solve this joint optimization problem, thanks to the DFA representation and lattice parsing. By contrast, previous work (Cohen et al. and CDSfold) does not perform, and is impossible to be extended to perform, such a joint optimization. First, Cohen et al. only optimize stability without considering codon optimality. CDSfold uses simulated annealing to improve CAI by fine-tuning from the MFE solution, but this is a heuristic with no guarantees. Second, CDSfold’s objective function, , is impossible for dynamic programming due to the difference between MFE and CAI, where MFE is a sum of free energy for each component substructure (additive) but CAI is a geometric mean of the relative codon usages (multiplicative). To reconcile this difference, our formulation defines a novel objective that factors the logarithm of CAI for an mRNA additively onto its individual codons, thus making it decomposable and amenable to dynamic programming (see ‘The LinearDesign algorithm’ for details). By contrast, CDSfold’s objective formulation does not factor into individual codons, and thus cannot be incorporated into global optimization. Last but most importantly, even if CDSfold were to borrow our formulation, its fundamental codon representation still rules out joint optimization. Our framework easily encodes (the logarithm of) CAI in our DFA representation, for example, we can integrate CAI onto a weighted DFA for leucine (Supplementary Fig. 6a). By contrast, CDSfold has to use an extended nucleotides representation for codon choices, which makes it impossible to do joint optimization with CAI (Supplementary Fig. 6b,c).
To summarize, our framework easily incorporates codon optimality into the joint optimization that previous work did not (and could not be extended to) tackle. Our objective integrates (the logarithm of) CAI and MFE together, while the objective of CDSfold is not able to reconcile these two factors. Furthermore, even if using our objective formulation, CDSfold’s representation of codon choices still rules out the possibility of CAI integration.
Generalizability
Our DFA framework is so general that it can also represent arbitrary (non-standard) genetic codes, modified nucleotides, and coding constraints such as adjacent codon pair preference, which previous work could not handle even with major modifications. See ‘DFAs for other genetic codes, coding constraints and modified nucleotides’ for details.
Linear-time version for long sequence and suboptimal candidates
We further develop a faster, linear-time, approximate version which greatly reduces runtime for long sequences with small sacrifices in search quality, which we also use to generate multiple suboptimal candidates with varying folding stability and codon optimality as candidates for experimentation.
Verification of wet laboratory experiments
Extensive experiments confirm that compared to the standard codon-optimization benchmark, our designs are substantially better in chemical stability and protein expression in vitro, and the corresponding mRNA vaccines elicit up to 128 times higher antibody responses in vivo.
Another recent work29 optimized mRNA designs and screened them via an experimental platform. LinearDesign had a central role in their work as the starting point of their optimizations (see figure 4b of their paper), followed by fine-tunings by both human players and a Monte Carlo tree search algorithm. The resulting coding regions are flanked by different UTRs, and then tested on stability and protein expression. LinearDesign-generated sequences showed strong stability and protein expression results with different UTRs (figures 2g and 4a of their paper), independently confirming our in vitro experiments. However, they did not perform any in vivo validations.
Benchmark dataset and machine
To estimate the time complexity of LinearDesign, we collected 114 human protein sequences from UniProt24, with lengths from 78 to 3,333 amino acids (not including the stop codon); see Supplementary Table 1. We benchmarked LinearDesign on a Linux machine with 2 Intel Xeon E5-2660 v3 CPUs (2.60 GHz) and 377 GB memory, and used Clang (11.0.0) to compile. The code only uses a single thread.
Additional design constraints
Some studies have shown that protein expression level drops if the 5′-end leader region has more secondary structure5,50–53. To design sequences with less structures at 5′-end leader region, we take a simple ‘design, enumerate and concatenate’ strategy to avoid structure in the leader region: (1) design the CDS region except for the 5′-end leader region (that is, the first 15 nucleotides); (2) enumerate all possible subsequences in the 5′-end leader region; and (3) concatenate each subsequence with the designed sequence, refold, and choose the one whose 5′-end leader region has the most unpaired nucleotides.
In addition, it has been revealed that long double-stranded regions may induce unwanted innate immune responses by previous studies27,54,55. Considering this, we do not allow long double-stranded regions that include 33 or more base pairs by adding this constraint in the design process.
RNA secondary structure prediction and visualization
Vienna RNAfold from ViennaRNA package (version 2.4.14) is used for predicting and drawing the secondary structure of mRNA sequence, and calculating the MFE of secondary structures.
In vitro and in vivo experiments
Preparation of mRNA and its formulation
mRNA molecules were synthesized in vitro by T7 RNA polymerase using linearized plasmid as DNA template. The open reading frame region is flanked with the 5′ and 3′ UTRs followed by a 70-nt poly-A tail. For all spike protein-coding sequences, the in vitro transcription reaction was conducted at 37 °C for 4 h, followed by digestion with DNase I (Hongene Biotech). mRNA encoding full-length spike protein without proline substitution was then capped using Vaccinia Capping Enzyme (Hongene Biotech) and purified with magnetic Dynabeads (Thermo Fisher). Eluted mRNA was further treated with Antarctic Phosphatase (Hongene Biotech) at 37 °C for 30 min to remove residual 5′-triphosphates. For all VZV gE sequences, mRNA was co-transcriptionally capped using m7(3′OMeG)(5′)ppp(5′)(2′OMeA)pG capping reagent (Hongene Biotech) in a ‘one-pot’ reaction at 30 °C for 16 h, followed by treatment with DNase I. Capped mRNA encoding spike or VZV gE protein was then purified using beads. For the preparation of formulated mRNA vaccines, lipopolyplex (LPP) formulation was used to encapsulate mRNA cargo as described previously56. LPP is a lipid-based mRNA delivery system and has been demonstrated to provide high efficacy and good safety profile26.
Agarose gel electrophoresis and integrity assay of mRNA
To study the electrophoretic mobility profile of mRNA molecules, mRNA samples suspended in Ambion RNA storage buffer (Thermo Fisher) were denatured at 75 °C for 5 min and snap-cooled on ice before being loaded onto 1% non-denaturing agarose gel (130 V for 1 h at room temperature). Gel image was taken by Gel Doc XR+ Gel Documentation System (Bio-Rad).
To assess the in-solution stability of mRNA, samples were incubated in PBS buffer containing 10 mM Mg2+. Sampling was conducted at time points (0, 1, 2, 4, 8, 12, 16, 24, 32, 48 and 60 h). For a faster degradation process, PBS buffer containing 20 mM Mg2+ instead of 10 mM was used. Sampling was done in a relatively shorter time span (0, 1, 2, 4, 8, 12, 15, 18, 21, and 24 h). RNA integrity was analysed by Qsep100 Capillary Electrophoresis System. The integrity was represented as the proportion of full-length mRNA calculated on electropherogram. The data were normalized to time point 0 h. To extrapolate the half-life of each sequence, one-phase decay equation:
was used to fit the curve. The and plateau were set as 100 and 0, respectively. Half-life was computed as , where refers to decay rate constant.
Protein expression assay
HEK293 cells (ATCC) were cultured in Dulbecco’s Modified Eagle’s Medium (DMEM) (Hyclone) containing 10% fetal bovine serum (FBS) (GEMINI) and 1% penicillin-streptomycin (Gibco). All cells were cultured at 37 °C in a 5% CO2 condition.
For the measurement of protein expression, cells were transfected with mRNA using Lipofectamine MessengerMAX (Thermo Scientific). In brief, a mix of 2 µg mRNA and 6 µl of Lipofectamine reagent was prepared following the manual instructions and then incubated with cells for 24 or 48 h. For flow cytometric analysis, cells were collected and stained with Live/Dead cell dye (Fixable Viability Stain 510, BD) for 5 min. After washing, cells were incubated with anti-spike receptor-binding domain (RBD) chimeric monoclonal antibody (1:100 dilution, Sino Biological) for 30 min, followed by washing and incubation with PE-anti-human IgG Fc (1:100 dilution, Biolegend) for 30 min. Samples were analysed on BD Canto II (BD Biosciences). Data were processed using FlowJo V10.1 (Tree Star).
In vivo immunogenicity study
C57BL/6 mice (6 to 8 weeks of age) were intramuscularly immunized twice with 10 µg LPP-formulated mRNA vaccines at a 2-week interval. Sera and spleens were collected 14 days after boost shot.
Surrogate virus neutralization assay
Neutralizing antibody titre was measured using surrogate virus neutralization assay as previously described57, with some modifications. In brief, 96-well plates (Greiner Bio-one) were coated with recombinant with human ACE2 protein (100 ng per well, Genscript) overnight at 4 °C. Plates were washed with 1 × PBS-T and blocked with 2% BSA for 2 h at room temperature. HRP-conjugated RBD (100 ng ml−1) was incubated with serially diluted serum from immunized mice at an equal volume (60 µl each) for 30 min at 37 °C. Sera collected from PBS-treated mice were used as negative control. Then a 100 µl mixture of RBD and serum was added into each well and incubated for 15 min at 37 °C. After washing, TMB substrate (Invitrogen) was used for colour development and the absorbance at 450 nm was recorded using BioTek microplate reader. The IC50 value was calculated using four-parameter logistic non-linear regression.
Enzyme-linked immunosorbent assays
In brief, recombinant SARS- CoV-2 spike ectodomain protein or VZV gE protein (Genscript) diluted in coating buffer (Biolegend) were used to coat 96-well EIA/RIA plates (Greiner Bio-one, 100 ng per well) at 4 °C overnight. The plates were then washed with 1PBS-T (0.05% Tween-20) and blocked with 2% BSA in PBS-T for 2 h at room temperature. Serum samples with serial dilutions were added and incubated for 2 h at room temperature. After washing, HRP-conjugated goat anti-mouse IgG antibody (1:10,000) was added and incubated for 1 h. TMB substrate (Invitrogen) was then used for colour development and the absorbance was read at 450 nm using BioTek microplate reader. End-point titres were calculated as the largest sample dilution factor yielding a signal that exceeds 2.1-fold value of the background58.
ELISpot assay
Frequency of spike (or VZV gE) antigen-specific IFNγ-secreting T cells was evaluated using Mouse IFNγ ELISpotplus Kit (Mabtech) according to the manual. In brief, 3 × 105 mouse splenocytes were added to wells pre-coated with anti-mouse IFNγ capturing antibodies and were incubated with spike protein or VZV gE peptide pool (10 µg ml−1) for 20 h. After washing, plates were incubated with Streptavidin–alkaline phosphatase (1:1,000) for 1 h at room temperature. Spots were developed with BCIP/NBT substrate solution and counted using Immunospot S6 analyzer (CTL). Due to multiple steps and exponential change of antibody and antigen-specific T cells during the immunity induction process, in vivo immunogenicity data usually have high data variations. Inoculation of mRNA vaccine involves extra processes such as tissue transfection and protein translation, and the variations in these process efficiencies together with variable dosing and differences in individual mouse’s immune status usually bring more immunogenicity variations than protein-based vaccines. From our experience, the variations observed in this study are typical for mRNA vaccines.
Ethics statement
All mouse studies were performed in strict accordance with the guidelines set by the Chinese Regulations of Laboratory Animals and Laboratory Animal-Requirements of Environment and Housing Facilities. Animal experiments were carried out with the approval from the Institutional Animal Care and Use Committee (IACUC) of Shanghai Model Organisms Center.
Statistics and reproducibility
Geometric means or arithmetic means are represented by the heights of bars, or symbols, and error bars represent the corresponding s.d. Two-tailed Mann–Whitney U tests were used to compare two experimental groups for the in vivo studies. To compare more than two experimental groups, one-way ANOVA with Dunn’s multiple comparisons tests were applied in the in vitro protein expression experiment. Statistical analyses were performed using Prism v.8 (GraphPad). *P < 0.05, **P < 0.01, ***P < 0.001. The raw P values from the statistical analysis are summarized in the figshare file (10.6084/m9.figshare.22193251). In vitro experiments were independently repeated in triplicate. Animal experiments were completed once. Gel electrophoresis experiment was repeated three times to obtain similar results.
Data reporting
No statistical methods were used to predetermine sample size. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-023-06127-z.
Supplementary information
Source data
Acknowledgements
The authors thank R. Das (Stanford) for introducing the mRNA design problem to us; R. Li (Baidu) for connecting Baidu Research with StemiRNA; J. Li (Baidu Research) for coordinating resources for this project; G. Terai and K. Asai (Univ. of Tokyo) for sending us the CDSfold code; S. Aviran (UC Davis) for spotting a typo in the hyper-parameter λ in our earlier version; A. Solórzano (Pfizer) for the question on LinearDesign’s independence of the choice of UTRs; J. Lin (Fudan) for early discussions; and S. Li (Oregon State Univ.) for proofreading and help with Latex. We acknowledge the assistance from L. Huang and M. Liu (StemiRNA) in LPP formulation of mRNA vaccines and help from other colleagues, including Y. Yi, Q. Wang, W. Wang and Y. Ge with in vivo studies. We thank Sanofi and many other vaccine companies worldwide for licensing and early adoption of LinearDesign. D.H.M. is supported by National Institutes of Health grant R35GM145283. A.L. was an employee of StemiRNA and is currently supported by the Natural Science Foundation of Jiangsu Province (BK20221031), the National Science Foundation of China (32200764, 82061138008) and the Fundamental Research Funds for the Central Universities (2632022YC01). C.X. was sponsored by Shanghai Pujiang Talent Program (22PJ1423100). The funding from StemiRNA Therapeutics was supported by the Science and Technology Commission of Shanghai Municipality, China (20S11909100, 22S11902300); Shanghai Strategic Emerging Industry Development special fund (ZJ640070216) and the project of mRNA Innovation and Translation Center, Shanghai, China.
Extended data figures and tables
Author contributions
L.H. conceived and directed the project. L.H. designed the basic algorithm for the Nussinov model and wrote a Python prototype, and H.Z. and L.Z. extended this algorithm to the Turner model and implemented it in C++, which Z.L. optimized. L.H., H.Z. and L.Z. designed the CAI integration algorithm, which L.Z. and H.Z. implemented. L.Z. implemented the beam search and handled design constraints. K.L. made the web server. B.L. implemented a baseline. Y.Z. and H.L. supervised the in vitro and in vivo experiments. C.C. performed the mRNA synthesis and gel electrophoresis experiments. A.L., C.X., H.J., X.M. and F.Z. performed the protein expression and in vivo assays, and C.X. performed chemical stability and structure compactness assays. D.H.M. discussed the approach and provided guidance for in silico analysis and writing. L.H., H.Z., L.Z., D.H.M., A.L., C.X., H.S., H.L. and Y.Z. wrote the manuscript.
Peer review
Peer review information
Nature thanks Anna Blakney and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Data availability
The UniProt sequences used to estimate the time complexity of LinearDesign are included in Supplementary Table 1 and are deposited at our figshare repository (10.6084/m9.figshare.22193251). The SARS-CoV-2 spike and VZV gE protein-coding sequences and UTR sequences used in the biological experiments are included at the end of the Supplementary Information file and are available at our figshare repository. Source data are provided with this paper.
Code availability
The LinearDesign source code is available to all parties on GitHub (https://github.com/LinearDesignSoftware/LinearDesign) and Zenodo (10.5281/zenodo.7839739), and is free for academic and research use.
Competing interests
Baidu USA filed a patent for the LinearDesign algorithm in 2021 listing H.Z., L.Z., Z.L., K.L., B.L. and L.H. as inventors. The work of H.Z., L.Z., Z.L., K.L., B.L. and L.H. for the development of the LinearDesign algorithm was conducted at Baidu USA. StemiRNA Therapeutics has filed a provisional patent for the VZV mRNA vaccine listing C.X., H.S. and H.L. as inventors. Sanofi entered a non-exclusive licensing agreement with Baidu USA in 2021 to use LinearDesign to develop mRNA vaccines and therapeutics. H.Z., L.Z., Z.L., K.L., B.L. and L.H. are or were employees of Baidu USA. A.L., C.X., X.M., F.Z., H.J., C.C., H.S., H.L. and Y.Z. are or were employees of StemiRNA Therapeutics. L.H. and D.H.M. are also cofounders of Coderna.ai.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: He Zhang, Liang Zhang, Ang Lin, Congcong Xu, Liang Huang
Contributor Information
Hangwen Li, Email: lihangwen@stemirna.com.
David H. Mathews, Email: david_mathews@urmc.rochester.edu
Yujian Zhang, Email: yujianzhang@yahoo.com.
Liang Huang, Email: liang.huang.sh@gmail.com.
Extended data
is available for this paper at 10.1038/s41586-023-06127-z.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-023-06127-z.
References
- 1.Baden LR, et al. Efficacy and safety of the mRNA-1273 SARS-CoV-2 vaccine. N. Engl. J. Med. 2021;384:403–416. doi: 10.1056/NEJMoa2035389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Polack FP, et al. Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. N. Engl. J. Med. 2020;383:2603–2615. doi: 10.1056/NEJMoa2034577. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Gebre MS, et al. Optimization of non-coding regions for a non-modified mRNA COVID-19 vaccine. Nature. 2022;601:410–414. doi: 10.1038/s41586-021-04231-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Crommelin DJ, Anchordoquy TJ, Volkin DB, Jiskoot W, Mastrobattista E. Addressing the cold reality of mRNA vaccine stability. J. Pharm. Sci. 2021;110:997–1001. doi: 10.1016/j.xphs.2020.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Mauger DM, et al. mRNA structure regulates protein expression through changes in functional half-life. Proc. Natl Acad. Sci. USA. 2019;116:24075–24083. doi: 10.1073/pnas.1908052116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Hall, K. B. Best-First Word-Lattice Parsing: Techniques for Integrated Syntactic Language Modeling. PhD thesis, Brown Univ. (2005).
- 7.Schlake T, et al. mRNA: a novel avenue to antibody therapy? Mol. Ther. 2019;27:773–784. doi: 10.1016/j.ymthe.2019.03.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Reinhard K, et al. An RNA vaccine drives expansion and efficacy of claudin-CAR-T cells against solid tumors. Science. 2020;367:446–453. doi: 10.1126/science.aay5967. [DOI] [PubMed] [Google Scholar]
- 9.Wolff JA, et al. Direct gene transfer into mouse muscle in vivo. Science. 1990;247:1465–1468. doi: 10.1126/science.1690918. [DOI] [PubMed] [Google Scholar]
- 10.Pardi N, Hogan MJ, Porter FW, Weissman D. mRNA vaccines—a new era in vaccinology. Nat. Rev. Drug Discov. 2018;17:261–279. doi: 10.1038/nrd.2017.243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Mauro VP, Chappell SA. A critical analysis of codon optimization in human therapeutics. Trends Mol. Med. 2014;20:604–13. doi: 10.1016/j.molmed.2014.09.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Gustafsson C, Govindarajan S, Minshull J. Codon bias and heterologous protein expression. Trends Biotechnol. 2004;22:346–353. doi: 10.1016/j.tibtech.2004.04.006. [DOI] [PubMed] [Google Scholar]
- 13.Nabiyouni M, Prakash A, Fedorov A. Vertebrate codon bias indicates a highly GC-rich ancestral genome. Gene. 2013;519:113–119. doi: 10.1016/j.gene.2013.01.033. [DOI] [PubMed] [Google Scholar]
- 14.Sahin U, Karikó K, Türeci Ö. mRNA-based therapeutics—developing a new class of drugs. Nat. Rev. Drug Discov. 2014;13:759–780. doi: 10.1038/nrd4278. [DOI] [PubMed] [Google Scholar]
- 15.Mathews DH, Sabina J, Zuker M, Turner DH. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940. doi: 10.1006/jmbi.1999.2700. [DOI] [PubMed] [Google Scholar]
- 16.Mathews DH, et al. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA. 2004;101:7287–7292. doi: 10.1073/pnas.0401799101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Sharp PM, Li WH. The codon adaptation index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987;15:1281–1295. doi: 10.1093/nar/15.3.1281. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Durbin, R., Eddy, S. R., Krogh, A. & Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, 1998).
- 19.Bar-Hillel Y, Perles M, Shamir E. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung. 1961;14:143–172. [Google Scholar]
- 20.Nederhof, M. J. & Satta, G. Probabilistic parsing as intersection. In Proc. 8th International Conference on Parsing Technologies 137–148 (2003).
- 21.Huang L, et al. LinearFold: linear-time approximate RNA folding by 5′-to-3′ dynamic programming and beam search. Bioinformatics. 2019;35:i295–i304. doi: 10.1093/bioinformatics/btz375. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Cohen B, Skiena S. Natural selection and algorithmic design of mRNA. J. Comput. Biol. 2003;10:419–432. doi: 10.1089/10665270360688101. [DOI] [PubMed] [Google Scholar]
- 23.Terai G, Kamegai S, Asai K. CDSfold: an algorithm for designing a protein-coding sequence with the most stable secondary structure. Bioinformatics. 2016;32:828–834. doi: 10.1093/bioinformatics/btv678. [DOI] [PubMed] [Google Scholar]
- 24.Consertium U. UniProt: a hub for protein information. Nucleic Acids Res. 2005;42:D204–D12. doi: 10.1093/nar/gku989. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Huang, L. & Sagae, K. Dynamic programming for linear-time incremental parsing. In Proc. 48th Annual Meeting of the Association for Computational Linguistics 1077–1086 (Association for Computational Linguistics, 2010).
- 26.Yang R, et al. A core-shell structured COVID-19 mRNA vaccine with favorable biodistribution pattern and promising immunity. Signal Transduct. Target. Ther. 2021;6:213. doi: 10.1038/s41392-021-00634-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Liu L, et al. Structural basis of Toll-like receptor 3 signaling with double-stranded RNA. Science. 2008;320:379–381. doi: 10.1126/science.1155406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Mignone F, Gissi C, Liuni S, Pesole G. Untranslated regions of mRNAs. Genome Biol. 2002;3:reviews0004.1. doi: 10.1186/gb-2002-3-3-reviews0004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Leppek K, et al. Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics. Nat. Commun. 2022;13:1536. doi: 10.1038/s41467-022-28776-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Rana MM. Polymer-based nano-therapies to combat COVID-19 related respiratory injury: progress, prospects, and challenges. J. Biomater. Sci. Polym. Ed. 2021;32:1219–1249. doi: 10.1080/09205063.2021.1909412. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Wrapp D, et al. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science. 2020;367:1260–1263. doi: 10.1126/science.abb2507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Cunningham AL, et al. Efficacy of the herpes zoster subunit vaccine in adults 70 years of age or older. N. Engl. J. Med. 2016;375:1019–1032. doi: 10.1056/NEJMoa1603800. [DOI] [PubMed] [Google Scholar]
- 33.Raab D, Graf M, Notka F, Schödl T, Wagner R. The GeneOptimizer algorithm: using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization. Syst. Synth. Biol. 2010;4:215–225. doi: 10.1007/s11693-010-9062-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Wayment-Steele HK, et al. Theoretical basis for stabilizing messenger RNA through secondary structure design. Nucleic Acids Res. 2021;49:10604–10617. doi: 10.1093/nar/gkab764. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Hie B, Zhong ED, Berger B, Bryson B. Learning the language of viral evolution and escape. Science. 2021;371:284–288. doi: 10.1126/science.abd7331. [DOI] [PubMed] [Google Scholar]
- 36.Madani A., et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol.10.1038/s41587-022-01618-2 (2023). [DOI] [PMC free article] [PubMed]
- 37.Karikó K, Buckstein M, Ni H, Weissman D. Suppression of RNA recognition by Toll-like receptors: the impact of nucleoside modification and the evolutionary origin of RNA. Immunity. 2005;23:165–175. doi: 10.1016/j.immuni.2005.06.008. [DOI] [PubMed] [Google Scholar]
- 38.Karikó K, et al. Incorporation of pseudouridine into mRNA yields superior nonimmunogenic vector with increased translational capacity and biological stability. Mol. Ther. 2008;16:1833–1840. doi: 10.1038/mt.2008.200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Rivas E. The four ingredients of single-sequence RNA secondary structure prediction. a unifying perspective. RNA Biol. 2013;10:1185–1196. doi: 10.4161/rna.24971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Nussinov R, Jacobson AB. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl Acad. Sci. USA. 1980;77:6309–6313. doi: 10.1073/pnas.77.11.6309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Kasami, T. An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages, Coordinated Science Laboratory Report no. R-257 (Univ. Illinois–Urbana, 1966).
- 42.Younger DH. Recognition and parsing of context-free languages in time n3. Inf. Control. 1967;10:189–208. doi: 10.1016/S0019-9958(67)80007-X. [DOI] [Google Scholar]
- 43.Rivas E, Lang R, Eddy R. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA. 2012;18:193–212. doi: 10.1261/rna.030049.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Huang, L., Fayong, S. & Guo, Y. Structured perceptron with inexact search. In Proc. 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 142–151 (Association for Computational Linguistics, 2012).
- 45.Lorenz R, et al. ViennaRNA package 2.0. Algorithms Mol. Biol. 2011;6:26. doi: 10.1186/1748-7188-6-26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Turner DH, Mathews DH. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 2010;38:D280–D282. doi: 10.1093/nar/gkp892. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Kawaguchi Y, Honda H, Taniguchi-Morimura J, Iwasaki S. The codon CUG is read as serine in an asporogenic yeast Candida cylindracea. Nature. 1989;341:164–166. doi: 10.1038/341164a0. [DOI] [PubMed] [Google Scholar]
- 48.Bonitz SG, et al. Codon recognition rules in yeast mitochondria. Proc. Natl Acad. Sci. USA. 1980;77:3167–3170. doi: 10.1073/pnas.77.6.3167. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Gamble CE, Brule CE, Dean KM, Fields S, Grayhack EJ. Adjacent codons act in concert to modulate translation efficiency in yeast. Cell. 2016;166:679–690. doi: 10.1016/j.cell.2016.05.070. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Ding Y, et al. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature. 2014;505:696–700. doi: 10.1038/nature12756. [DOI] [PubMed] [Google Scholar]
- 51.Wan Y, et al. Landscape and variation of RNA secondary structure across the human transcriptome. Nature. 2014;505:706–709. doi: 10.1038/nature12946. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Shah P, Ding Y, Niemczyk M, Kudla G, Plotkin JB. Rate-limiting steps in yeast protein translation. Cell. 2013;153:1589–601. doi: 10.1016/j.cell.2013.05.049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Tuller T, Zur H. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 2014;43:13–28. doi: 10.1093/nar/gku1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Husain B, Mukerji I, Cole JL. Analysis of high-affinity binding of protein kinase R to double-stranded RNA. Biochemistry. 2012;51:8764–8770. doi: 10.1021/bi301226h. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Hur S. Double-stranded RNA sensors and modulators in innate immunity. Annu. Rev. Immunol. 2019;37:349–375. doi: 10.1146/annurev-immunol-042718-041356. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Persano S, et al. Lipopolyplex potentiates anti-tumor immunity of mRNA-based vaccination. Biomaterials. 2017;125:81–89. doi: 10.1016/j.biomaterials.2017.02.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Tan CW, et al. A SARS-CoV-2 surrogate virus neutralization test based on antibody-mediated blockage of ACE2–spike protein–protein interaction. Nat. Biotechnol. 2020;38:1073–1078. doi: 10.1038/s41587-020-0631-z. [DOI] [PubMed] [Google Scholar]
- 58.McKay PF, et al. Self-amplifying RNA SARS-CoV-2 lipid nanoparticle vaccine candidate induces high neutralizing antibody titers in mice. Nat. Commun. 2020;11:3523. doi: 10.1038/s41467-020-17409-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The UniProt sequences used to estimate the time complexity of LinearDesign are included in Supplementary Table 1 and are deposited at our figshare repository (10.6084/m9.figshare.22193251). The SARS-CoV-2 spike and VZV gE protein-coding sequences and UTR sequences used in the biological experiments are included at the end of the Supplementary Information file and are available at our figshare repository. Source data are provided with this paper.
The LinearDesign source code is available to all parties on GitHub (https://github.com/LinearDesignSoftware/LinearDesign) and Zenodo (10.5281/zenodo.7839739), and is free for academic and research use.