Abstract
Recent COVID-19 vaccines unleashed the potential of mRNA-based therapeutics. A common bottleneck across mRNA-based therapeutic approaches is the rapid design of mRNA sequences that are translationally efficient, long-lived and non-immunogenic. Currently, an accessible software tool to aid in the design of such high-quality mRNA is lacking. Here, we present mRNAid, an open-source platform for therapeutic mRNA optimization, design and visualization that offers a variety of optimization strategies for sequence and structural features, allowing one to customize desired properties into their mRNA sequence. We experimentally demonstrate that transcripts optimized by mRNAid have characteristics comparable with commercially available sequences. To encompass additional aspects of mRNA design, we experimentally show that incorporation of certain uridine analogs and untranslated regions can further enhance stability, boost protein output and mitigate undesired immunogenicity effects. Finally, this study provides a roadmap for rational design of therapeutic mRNA transcripts.
Introduction
mRNA-based therapeutics continue to revolutionize vaccine development (1), immunotherapy (2) and targeted degradation methodologies (3), advancing the battle of modern medicine against infectious diseases, genetic disorders and cancer [4–6, reviewed in (7)]. Irrespective of the various indications and the diverse mechanisms of action of mRNA-based drugs, all derived therapeutics and vaccines share the same underlying principles. First, the mRNA sequence is designed and optimized in silico. Then, the optimized sequence is transcribed in vitro, often with selected chemical modifications. Finally, the synthetic transcript is packaged and delivered to the cytoplasm of host cells, where it is translated into a protein that exerts the desired cellular effect. Therefore, in silico mRNA design is undeniably instrumental to the success of any mRNA-based therapeutic. Transcript design is typically initiated with a decoration of the coding sequence (CDS) with flanking 5′- and 3′-UTRs (untranslated regions) and other signals (e.g. translational ramps, miRNA-binding sites, etc.) that can improve stability and translation efficiency, and enable tissue-specific expression (8,9). Then, rigorous sequence engineering is required to eliminate immunogenic properties (10) and further enhance transcript stability and translation (11–13). During in vitro transcription, chemical modifications such as a 5′-cap and nucleoside analogs can be incorporated to protect against degradation and evade host immune surveillance (14,15). At present, mRNA design is largely dependent on expert knowledge, manual sequence editing, distributed optimization and visualization tools that are often proprietary. There is no freely available tool specifically tailored for therapeutic mRNA design that combines multiple optimization strategies. 
Here, we present mRNAid, an open-source, integrated software that bundles several modified and extended algorithms and tools for constraint propagation, sequence optimization and secondary structure visualization. Via an intuitive and user-friendly interface, mRNAid orchestrates simultaneous optimization of several sequence and structural properties including codon usage, GC content, minimum free energy (MFE), uridine depletion and exclusion of specific motifs and/or rare codons, thereby providing a powerful platform for therapeutic mRNA design. mRNAid is available at https://github.com/MSDLLCpapers/mRNAid and as a web application at https://mrnaid.dichlab.org.
Materials and methods
Tool architecture
The application consists of several containerized parts that can be built and run with the ‘docker-compose’ utility (Supplementary Figure S1). The frontend is served as static files by an Nginx server, which can also be configured as a reverse proxy. The frontend container communicates with the backend through the uwsgi protocol. The backend is a Python Flask application served by the uWSGI server. Optimization tasks are handled by a Celery task queue, with the Redis in-memory database acting as the message broker. A mounted volume is used to keep logs of the backend execution. All individual parts reside in separate containers that communicate with each other inside a Docker network. The user interface is written in React.js and consists of an input form and a results page. The input form allows users to select different optimization strategies, set optimization parameters and submit the optimization job. The results page includes a visualization of the optimized sequences generated via the rgw Forna JavaScript visualization container, combined with an MFE mountain plot and a summary of optimized sequence properties. Users can export the results in PDF or Excel format.
Optimization strategies
The core of the tool is the freely available sequence optimization framework for Python, DNA Chisel (16). DNA Chisel allows the use of built-in specifications to approach some of the common optimization tasks (such as matching target codon usage in the host or ensuring correct translation to protein by using only synonymous codons during the optimization, etc.) and it is very flexible with respect to defining completely new optimization specifications. These specifications can either be hard constraints, which cannot be violated in the final sequence, or they can be considered as soft constraints or objectives, whose score is maximized in the final sequence. Some specifications can be used as both constraint and objective, depending on user requirements. When multiple objectives are defined in the optimization problem, the total weighted score is maximized.
DNA Chisel solves the constraint satisfaction problem using a combination of constraint propagation and local search methods. The optimization algorithm consists of two main steps: the resolution of all hard constraints, followed by maximization of the objectives’ scores subject to those constraints. The solver reduces the optimization problem to a set of local optimization problems, which are resolved individually. Each local problem is solved either by random mutations of the sequence or by exhaustive search through the pre-computed mutation space, depending on the size of the latter. During mRNAid optimization, the tool combines all the specified constraints and objectives into a single optimization problem and calls the DNA Chisel optimization method. This procedure is re-executed in parallel until either a user-specified number of sequences is produced or the maximum number of attempts is reached. After the optimization is completed, mRNAid ranks the optimized sequences based on the scoring function, as described below.
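The two-step logic (first satisfy the hard constraints, then maximize the objective) can be illustrated on a toy problem small enough for exhaustive search. This is a minimal sketch and not DNA Chisel's solver: the partial codon table, the forbidden motif `CUCC`, the GC-count objective and the function names are all illustrative assumptions.

```python
from itertools import product

# Partial standard codon table (RNA alphabet); only the residues used in the demo.
SYNONYMOUS_CODONS = {
    "M": ["AUG"],
    "G": ["GGU", "GGC", "GGA", "GGG"],
    "S": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
    "K": ["AAA", "AAG"],
}


def gc_count(seq):
    """Objective (soft constraint): number of G and C nucleotides."""
    return sum(base in "GC" for base in seq)


def optimize(protein, forbidden_motif):
    """Enumerate every synonymous encoding, drop those that violate the hard
    constraint (motif present) and return the feasible one with the best score.
    Translation is preserved by construction because only synonymous codons
    are ever combined."""
    candidates = (
        "".join(codons)
        for codons in product(*(SYNONYMOUS_CODONS[aa] for aa in protein))
    )
    feasible = [seq for seq in candidates if forbidden_motif not in seq]
    if not feasible:
        return None  # the hard constraints cannot be satisfied
    return max(feasible, key=gc_count)


best = optimize("MGSK", forbidden_motif="CUCC")
print(best, gc_count(best))
```

For realistic sequence lengths the mutation space is far too large to enumerate, which is why the solver described above falls back to random mutations on local sub-problems.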
In our optimization approach, the following built-in specifications were used as hard constraints: AvoidPattern to ensure that certain motifs are excluded; EnforceGCContent to keep GC levels in a certain boundary across the sequence; AvoidRareCodons to not use rare codons with codon frequency below the threshold; and EnforceTranslation to ensure that the optimized sequence is translated back to the same protein as the input. As objectives, built-in MatchTargetCodonUsage was used to match codon usage frequencies in the host and EnforceGCContent to optimize GC content in the sliding window across the sequence. The details of these specifications can be found in the DNA Chisel documentation (16).
Additional custom specifications were implemented and integrated into the tool. The Uridine Depletion hard constraint was implemented to ensure no codons with uridine in the third position are present in the sequence. In addition, three new objectives were implemented: MatchTargetPairUsage to account for dinucleotide usage; MatchTargetCodonPairUsage to optimize for codon pair usage frequencies; and MinimizeMFE to use different algorithms for MFE estimation.
A short description of the specifications is also provided in the mRNAid tooltip widget in the form of a question mark (‘?’) adjacent to a given specification. On the submission page of the mRNAid tool, the built-in specifications appear as ‘Avoid motifs’ (AvoidPattern), ‘Global GC content’ and ‘Window size for local GC content’ (EnforceGCContent), ‘Codon usage frequency threshold’ (AvoidRareCodons), ‘CAI optimization’ (MaximizeCAI) and the custom specifications, ‘Uridine depletion’, ‘Match dinucleotide usage’, ‘Match codon-pair usage’, ‘Use more accurate MFE estimation’ and ‘Entropy window size’ (MFE optimization).
In the following sections, we describe the custom specifications in more detail.
Uridine depletion
This constraint ensures that no codon in the optimized sequence has uridine in the third position. It is implemented on top of DNA Chisel's CodonSpecification class.
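The constraint is always satisfiable for the standard genetic code, because every codon ending in U (NNU) has a synonymous partner ending in C (NNC). The sketch below checks the constraint and applies that naive NNU to NNC swap. It is an illustration only: mRNAid lets the solver choose among all permitted synonymous codons, and the function names here are hypothetical.

```python
def split_codons(cds):
    """Split a coding sequence into codons (length must be a multiple of 3)."""
    if len(cds) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def violates_uridine_depletion(cds):
    """Hard-constraint check: True if any codon has U in the third position."""
    return any(codon[2] == "U" for codon in split_codons(cds))


def deplete_third_position_u(cds):
    """Replace every NNU codon by NNC, which encodes the same amino acid
    in the standard genetic code."""
    return "".join(
        codon[:2] + "C" if codon[2] == "U" else codon
        for codon in split_codons(cds)
    )


example = "AUGGGUUCUAAA"  # Met-Gly-Ser-Lys, two codons end in U
fixed = deplete_third_position_u(example)
print(fixed, violates_uridine_depletion(fixed))
```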
Dinucleotides, codon-pair, CAI and MatchCodonUsage optimizations
Dinucleotides and codon-pair are custom objectives derived from usage tables from the CoCoPUTs database (17). They account for the difference between dinucleotide or codon-pair frequencies in the host organism (Homo sapiens or Mus musculus) and in the current sequence. A score $s_i$ is computed for each pair from the difference between its frequency in the sequence, $f_i$, and the corresponding database frequency, $f_i^{\mathrm{db}}$, and the individual scores are averaged:

$$S = \frac{1}{N}\sum_{i=1}^{N} s_i$$

where $s_i$ is the score for a given nucleotide pair or codon-pair, $S$ is the total score being the mean of all the individual scores, $f_i$ is the frequency for a given pair, $f_i^{\mathrm{db}}$ is a corresponding frequency from the database and $N$ is the total number of pairs across the sequence. The total score is maximized by the DNA Chisel optimization algorithm. Codon Adaptation Index (CAI) optimization is a built-in objective that is used when selected by the user; it is a common optimization strategy introduced in (18). MatchCodonUsage is a built-in objective, set as the default, which minimizes the sum of discrepancies between the codon frequencies in a given sequence and those in the target organism. All codon optimization objectives are considered mutually exclusive, so they cannot be combined in our tool.
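As an illustration of a frequency-matching objective of this kind, the sketch below scores a sequence against a reference dinucleotide table. Several details are assumptions rather than the authors' definitions: pairs are counted with overlap, the per-pair score is taken to be the negative absolute difference between observed and reference frequency, and the reference tables are made-up placeholders rather than CoCoPUTs values.

```python
from collections import Counter


def dinucleotide_score(seq, reference):
    """Mean per-pair score; 0 is a perfect match, more negative is worse."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # overlapping pairs
    if not pairs:
        raise ValueError("sequence must contain at least two nucleotides")
    counts = Counter(pairs)
    n = len(pairs)
    # Assumed per-pair score: -|observed frequency - reference frequency|.
    scores = [-abs(counts[p] / n - reference.get(p, 0.0)) for p in pairs]
    return sum(scores) / n


# Hypothetical reference tables, for demonstration only.
matching_reference = {"AC": 1 / 3, "CG": 1 / 3, "GU": 1 / 3}
uniform_reference = {a + b: 1 / 16 for a in "ACGU" for b in "ACGU"}

print(dinucleotide_score("ACGU", matching_reference))
print(dinucleotide_score("ACGU", uniform_reference))
```

The same structure applies to codon pairs by sliding over codons instead of nucleotides.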
MFE optimization
We aim to maximize the MFE of a specified region starting at the 5′ end of the mRNA molecule. We call this region the entropy window. Maximizing the MFE in this region enforces a more open structure with fewer base pairs, which makes it more accessible to ribosomes. The aim is to bring the MFE of the 5′ end as close to 0 as possible (it is usually negative). The user can choose between two algorithms for MFE estimation. The first is the RNAfold algorithm (19), which is based on dynamic programming and thoroughly explores all possible secondary structures. This can take up to several seconds depending on the length of the sequence of interest and might not be the best option when multiple runs are required (which is exactly the case for mRNAid). The benefit of the longer computation time is the high accuracy of the estimates. The RNAfold package is also used to provide the calculated secondary structures to the frontend for subsequent visualization.
The alternative option is to use a faster MFE estimation algorithm, which is based on the correlated stem–loop prediction approach proposed in (20). In this approach, all possible single stem–loop conformations are considered, and their interaction energies are averaged. This algorithm has quadratic complexity O(n^2), where ‘n’ is the number of nucleotides in sequence, compared with cubic complexity O(n^3) of the RNAfold algorithm. The simplified algorithm is used during the optimization, when mutation space is explored to estimate the score of the mutated sequence. However, when presenting the final value of the best sequence after the optimization is done, its MFE value is estimated with the RNAfold algorithm.
Scoring function
A scoring function which evaluates sequences for different criteria was applied to the list of optimized sequences. The final score used in ranking was:
$$S_{\mathrm{final}} = \sum_{k} w_k S_k, \qquad k \in \{U,\ GC,\ CAI,\ MFE,\ 5'\}$$

where $w_k$ are the individual weights of each score, $S_U$ is the uridine depletion score, $S_{GC}$ is the GC content score, $S_{CAI}$ is the CAI score, $S_{MFE}$ is the total MFE score and $S_{5'}$ is the MFE 5′-end score. The weights were assigned to reflect the order of importance of the different objectives, as defined by the experimental scientists and based on previous reports. These weights were later fine-tuned based on the experimental data. As more data become available, these parameters can be further optimized and changed accordingly.
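A weighted ranking of this kind can be sketched as follows. The weight values, the score values and the dictionary keys are hypothetical and chosen only for illustration; the sign convention (higher total is better) is also an assumption. The uridine depletion weight is set to 0, mirroring the case where the user has not requested uridine depletion.

```python
# Hypothetical weights; the published values are not reproduced here.
WEIGHTS = {
    "u_depletion": 0.0,  # 0 when uridine depletion is not requested
    "gc": 0.2,
    "cai": 0.4,
    "mfe_total": 0.2,
    "mfe_5": 0.2,
}


def total_score(scores, weights=WEIGHTS):
    """Weighted sum of the individual scores."""
    return sum(weight * scores[name] for name, weight in weights.items())


def rank(candidates, weights=WEIGHTS):
    """Sort candidate sequences from best to worst total score."""
    return sorted(
        candidates,
        key=lambda c: total_score(c["scores"], weights),
        reverse=True,
    )


candidates = [
    {"name": "B", "scores": {"u_depletion": 0.0, "gc": 0.8, "cai": 0.7,
                             "mfe_total": 0.2, "mfe_5": 0.5}},
    {"name": "A", "scores": {"u_depletion": 0.0, "gc": 0.5, "cai": 0.9,
                             "mfe_total": 0.3, "mfe_5": 0.4}},
]
ranked = rank(candidates)
print([c["name"] for c in ranked])
```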
Uridine depletion score
Uridine depletion is checked by counting each uridine at the third position in a codon and normalizing to the codon number. Maximum and minimum values are 1 and 0 (all/no codons have uridine at the third position). When uridine depletion is not specified by the user, this is not included in the final scoring function (by setting the weight to 0).
GC score
GC content is calculated for the whole sequence and checked to be within the user-defined range ($GC_{\min}$ to $GC_{\max}$). The score $S_{GC}$ is calculated as an increasing linear function of the GC content to favor sequences with larger GC values:

$$S_{GC} = \frac{GC - GC_{\min}}{GC_{\max} - GC_{\min}}$$

The score is bounded in the range 0 to 1. As GC content influences the properties and expression rates of mRNA, we optimize the sequence to fit the GC content in a specified window.
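A minimal sketch of such a score, assuming the linear form $(GC - GC_{\min})/(GC_{\max} - GC_{\min})$ and assuming that values outside the user-defined range are clipped to 0 or 1. Both assumptions are illustrative, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C nucleotides in the whole sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc_score(seq, gc_min, gc_max):
    """Linear score: 0 at gc_min, 1 at gc_max, clipped outside that range."""
    raw = (gc_fraction(seq) - gc_min) / (gc_max - gc_min)
    return min(1.0, max(0.0, raw))


# GC limits given as fractions (0.5 and 0.7 correspond to 50% and 70%).
print(gc_score("GGCCGCAUAU", gc_min=0.5, gc_max=0.7))
```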
Codon Adaptation Index score
The CAI is a widely used metric of synonymous codon usage bias, which measures the deviation of codon usage from that in a reference set of highly expressed genes (18). The CAI score is equal to the CAI itself and is bounded in the range 0 to 1, with 1 being the optimal value.
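For reference, the CAI is conventionally computed as the geometric mean of the relative adaptiveness of each codon, where relative adaptiveness is a codon's frequency divided by that of the most frequent synonymous codon. The sketch below follows that definition. The usage frequencies are hypothetical placeholders and cover only two amino acids; codons missing from the table are simply skipped, which is a simplification.

```python
import math


def relative_adaptiveness(usage, families):
    """w(codon) = frequency / frequency of the most used synonymous codon."""
    weights = {}
    for family in families:
        top = max(usage[codon] for codon in family)
        for codon in family:
            weights[codon] = usage[codon] / top
    return weights


def cai(cds, weights):
    """Geometric mean of w over the codons of the CDS that are in the table."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs))


# Hypothetical codon usage frequencies, not taken from any real host table.
usage = {"AAA": 0.4, "AAG": 0.6, "GGU": 0.2, "GGC": 0.4}
families = [["AAA", "AAG"], ["GGU", "GGC"]]
w = relative_adaptiveness(usage, families)

print(cai("AAGGGC", w))  # only the most frequent codons
print(cai("AAAGGU", w))  # only the less frequent codons
```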
Total MFE score
It is preferable to have sequences with a lower (more negative) MFE. To enable efficient sorting of the sequences according to this requirement, we use the following score:

$$S_{MFE} = \frac{|\mathrm{MFE}|}{5000}$$

A value of 5000 was chosen because the MFE of many input sequences did not exceed 3500 kcal/mol in magnitude. In that way, the score remains between 0 and 1: it tends to zero as the MFE approaches zero and does not exceed 1 for MFE magnitudes around 3500 kcal/mol.
5′-MFE score (mfe_5_score)
The MFE of the 5′ end is calculated using RNAfold. MFE has a theoretical maximum of 0, but in practice does not reach that value. The score $S_{5'}$ is calculated as an exponential function of the 5′-MFE that decays as the MFE becomes more negative, such that $S_{5'} \to 1$ when $\mathrm{MFE}_{5'} \to 0$ and $S_{5'} \to 0$ when $\mathrm{MFE}_{5'} \to -\infty$. The score is thus bounded in the range 0 to 1, and the aim is to bring the 5′-MFE as close to 0 as possible.
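The two MFE-based scores can be sketched side by side. The divisor 5000 comes from the text; everything else is an assumption made for illustration: the global score is taken as |MFE|/5000, the 5′ score as a plain exponential of the (negative) 5′-MFE, and the `decay` parameter is a hypothetical scaling constant that does not appear in the source.

```python
import math


def total_mfe_score(mfe, scale=5000.0):
    """Global score: grows linearly as the MFE becomes more negative."""
    return abs(mfe) / scale


def mfe_5_score(mfe_5, decay=1.0):
    """5'-end score: 1 at MFE = 0, approaching 0 as MFE -> -infinity.
    `decay` is a hypothetical scaling constant (kcal/mol)."""
    return math.exp(mfe_5 / decay)


print(total_mfe_score(-1451.70))  # MFE of the mRNAid spike CDS in Table 2
print(mfe_5_score(0.0), mfe_5_score(-5.0))
```

The two scores pull in opposite directions by design: a strongly negative global MFE is rewarded, whereas the 5′ window is rewarded for staying close to 0.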
Comparison with COVID-19 vaccines
The sequence for the native spike surface glycoprotein gene was retrieved from the NCBI (NC_045512.2) and then optimized by mRNAid using Strategy 5 (Supplementary Table S1). The resultant optimized sequence was compared with the coding sequences within the putative Pfizer/BioNTech and Moderna vaccines (21,22). Putative Pfizer/BioNTech and Moderna assembled vaccine sequences were downloaded from: https://github.com/NAalytics/Assemblies-of-putative-SARS-CoV2-spike-encoding-mRNA-sequences-for-vaccines-BNT-162b2-and-mRNA-1273. Following CAI optimization, the mean Levenshtein distance between a given mRNAid-optimized sequence and the putative Moderna or Pfizer/BioNTech assembled vaccine sequences was 131 and 350, respectively, reflecting ∼3.5% and 9.1% sequence variation, when 3819 nucleotides of the spike CDS are considered (excluding the stop codon).
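For readers who wish to reproduce this kind of comparison, the Levenshtein (edit) distance can be computed with the standard dynamic programming recurrence. The sketch below is a generic implementation, not the authors' script, and `percent_variation` is a hypothetical helper that expresses a distance relative to the CDS length.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def percent_variation(distance, length):
    """Edit distance expressed as a percentage of the sequence length."""
    return 100.0 * distance / length


print(levenshtein("kitten", "sitting"))
print(percent_variation(levenshtein("AUGGGC", "AUGGGU"), 6))
```

The row-by-row formulation keeps memory linear in the length of the second sequence, which is sufficient for sequences of a few thousand nucleotides.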
Experimental validation
In vitro transcription
mRNAs with ARCA or CleanCap® with or without uridine modification were in vitro transcribed using the mMESSAGE mMACHINE® T7 Ultra transcription kit (Ambion, AMB13455). Linearized plasmid DNA containing the target gene downstream of a T7 RNA polymerase promoter was used as the template, and synthesis reactions were performed according to the manufacturer's protocol. For mRNAs with CleanCap®, T7 2× NTP/ARCA was substituted with 8 mM CleanCap® Reagent AG (TriLink Biotechnologies, N-7113) and 10 mM of each NTP. Modified uridines used included pseudouridine-5′-triphosphate (TriLink Biotechnologies, N-1019), N1-methyl-pseudouridine-5′-triphosphate (TriLink Biotechnologies, N-1081) or 5-methoxyuridine-5′-triphosphate (TriLink Biotechnologies, N-1093). mRNAs were subsequently purified by the MegaClear Transcription Clean-up kit (Ambion, AM1908) and quantified on the NanoDrop spectrophotometer.
Cell culture
All cell lines were obtained from the American Type Culture Collection (ATCC) and grown at 37°C, 5% CO2. MIA PaCa-2 (CRL-1420) cells were maintained in Dulbecco’s modified Eagle’s medium (DMEM) with high glucose and GlutaMAX™ supplement (Gibco), 10% fetal bovine serum (FBS; HyClone) and 2.5% horse serum (Gibco). BJ fibroblasts (CRL-2522) were cultured in minimal essential medium (MEM) with GlutaMAX™ supplement (Gibco) and 10% FBS (HyClone). SJCRH30 (CRL-2061) cells were cultured in RPMI with GlutaMAX™ supplement (Gibco) and 10% FBS (HyClone).
Luminescence assays
Lipofectamine™ MessengerMAX™ (Life Technologies) was diluted in Opti-MEM to the desired working concentration and dispensed onto 384-well white assay plates (Greiner 781080). A source plate (Labcyte LP-0200) containing serial dilutions of the mRNAs was prepared using the Bravo liquid handler (Agilent), and a 10-point 2-fold dose titration of each mRNA was dispensed onto the assay plate using Echo555 (Labcyte). After a 10 min incubation, 4000 MIA PaCa-2 or SJCRH30 cells or 6000 BJ cells were added per well. For kinetic monitoring, 20 μM Endurazine (Promega), an extended time-release live cell substrate, was added to each well. Luminescence was measured continuously at 1 h intervals for 48 h on the Tecan Spark 10M set to 37°C, 5% CO2. For end-point HiBiT protein detection, the NanoGlo HiBiT lytic detection assay (Promega, N3040) was performed as per the manufacturer's instructions. Luminescence signal was determined using the Envision plate reader, and values were normalized to a HiBiT-control protein (Promega, N3010).
Western blot analysis
A total of 0.08 million MIA PaCa-2 cells were seeded per well in a 24-well poly-d-lysine-coated cell culture plate (Greiner) and allowed to attach overnight before mRNA transfection with Lipofectamine™ MessengerMAX™ (Life Technologies) according to the manufacturer's protocol. After 24 h incubation, 100 μl of Bolt™ lithium dodecyl sulfate (LDS) sample buffer supplemented with Bolt™ sample reducing agent was added per well of a 24-well plate. The wells were scraped using wide orifice tips and the lysate was transferred into polymerase chain reaction (PCR)-strip tubes and sonicated for 10 × 10 s in a chilled water bath sonicator (QSonica). A 15 μl aliquot of protein extract was separated on 4–12% Bis-Tris plus gels, transferred onto nitrocellulose membranes using the Trans-Blot® Turbo™ semi-dry system (Bio-rad), and blocked for 1 h at room temperature with Intercept™ (TBS) blocking buffer (Li-Cor). Blots were probed with the appropriate primary antibodies overnight at 4°C in blocking buffer supplemented with 0.1% Tween-20, followed by the secondary antibodies IRDye® 680RD donkey anti-mouse IgG or IRDye® 800CW donkey anti-rabbit IgG (Li-Cor) for 1 h at room temperature. Fluorescent signals were imaged and quantified using Odyssey® CLx. Primary antibodies used were: NanoLuc (Promega, N7000) and glyceraldehyde phosphate dehydrogenase (GAPDH; Cell Signaling Technology, #5174)
IFN-β detection in BJ fibroblasts
BJ fibroblasts were seeded in 96-well poly-d-lysine-coated cell culture plates (Greiner) at 20 000 cells per well and transfected the next day with 50 ng per well of the respective mRNA using Lipofectamine™ MessengerMAX™ (Life Technologies). The supernatant was harvested 48 h post-transfection and interferon-β (IFN-β) levels were determined using the Bio-Plex Pro Human Inflammation Panel 1 (BioRad) as per the manufacturer's protocol. Data were acquired on the Bio-Plex Pro 200 system (BioRad).
Results and discussion
The core backbone for sequence optimization in mRNAid is based on the DNA Chisel framework (16) (Supplementary Figure S1). DNA Chisel permits global and local optimization of hard and soft constraints, which in turn enables the adjustment of the desired sequence properties along the entire transcript. Hard constraints refer to criteria that must be satisfied in the final sequence, whereas soft constraints refer to criteria whose score must be maximized. Furthermore, the ability to flexibly define new constraints makes DNA Chisel an ideal sandbox for probing the effect of a multitude of sequence properties on stability and expression.
mRNAid piggybacks on several hard constraints implemented in DNA Chisel that enforce global GC content and translation and avoid rare codons and specific motifs. mRNAid extends this list with an important uridine depletion constraint that avoids codons with uridine at their third position, which reportedly improves expression and reduces immunogenicity (10).
Codon usage optimization aims to improve expression by systematic replacement of synonymous codons based on the organism's codon frequency table. Numerous proprietary and freely available codon optimization algorithms have been reported to date (11) and many show a strong preference towards the CAI (18) since it highly correlates with gene expression (23). The CAI reflects the deviation of codon frequencies in a sequence from those observed in a reference set of highly expressed genes in the target host organism (18). DNA Chisel includes CAI and Matched Codon Usage optimizations as soft constraints. The latter method ensures that the relative frequencies of the codons in the sequence match the overall codon usage of the target organism (16). While the above methods consider each codon independently, a recent study reported a bias in codon-pair utilization and dinucleotide usage that are inevitably inter-related (17)
The CoCoPUTs database provides pre-computed codon-pair and dinucleotide frequency tables in various organisms that can be used for sequence optimization. We implemented the dinucleotide and codon-pair usage optimizations since they have been shown to affect translation fidelity and efficiency (17,24). There is a significant codon-pair usage bias in all three domains of life, and between proteins expressed at a low and high level within a species (25) that cannot be explained by individual codon bias, pointing towards a distinct mechanism of translation modulation. It has been suggested that codon-pair effects on the translation rate may be mediated by interactions of adjacent aminoacyl-tRNA molecules bound to ribosomes (25). To account for structural properties, we also incorporated the Vienna-RNA MFE optimization (19) and the correlated stem–loop prediction approach (20), given the pivotal role that mRNA secondary structures play in regulating translation efficiency (26). The Vienna-RNA method uses a thermodynamic energy model to compute the MFE of a given RNA sequence to identify the most thermodynamically stable secondary structure (19). To reduce the complexity of the MFE computation, the stem–loop method (20) estimates a pseudo-MFE as the average energy of all possible stem–loop conformations for a given sequence, increasing the computation efficiency of sequence optimization. Multiple studies reported correlation between highly structured features in the CDS and functional mRNA half-life (27,28). In contrast, the region around the translation start site is less structured in highly expressed genes (29), which presumably facilitates ribosome loading and prevents jamming. Thus, different transcript regions possess different structural properties that must be reflected in the optimization strategy. Furthermore, transcripts are often fused to already optimized UTRs, obviating the need for optimization of these regions. 
Given the above, and the fact that MFE optimization is extremely computationally expensive, we provide users with the flexibility to optimize MFE within an adjustable window in the 5′ end of the CDS, while accounting for global MFE in our ranking approach. To this end, we ensure that the combination of multiple constraints generates optimal sequence properties by applying a novel scoring formula that ranks sequences based on weighted scores of uridine depletion, GC content, CAI, local 5′ end and global MFEs.
Next, we selected five distinct optimization strategies in mRNAid that are based on a codon optimization approach coupled with uridine depletion, GC content and MFE optimizations (Strategies 1–5: dinucleotide, matched codon-pair usage, matched codon usage with or without uridine depletion and CAI, respectively) (Table 1). To experimentally validate these strategies, we adopted NanoLuciferase-PEST (Nluc-PEST) as the reporter system since the short half-lives of the individually produced luciferase proteins prevent confounding effects on interrogation of mRNA properties as they relate to translation efficiency or mRNA stability. Indeed, these effects can be conveniently measured through kinetic tracking of the target protein's luminescence. Area under the curve (AUC) and luminescence at 48 h (RLU @ 48 h) are represented as indicators of total protein output and functional mRNA stability, respectively, given that protein sequence and other mRNA features are kept constant. A de-optimized mRNA version of Nluc-PEST encoded by the least frequent codon for each amino acid (Rare, red, Figure 1A–C) was used as the input for mRNAid and the top four ranked mRNA sequences generated under each software setting were selected (Supplementary Tables S1 and S2). These were benchmarked against a proprietary sequence from Promega (Promega, blue, Figure 1A–C) as the codon-optimized control.
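The two readouts can be derived from a kinetic luminescence trace as sketched below. The trapezoidal rule is assumed here as the integration method (the text does not state which method was used), and the trace values are made up for illustration.

```python
def area_under_curve(times, values):
    """Trapezoidal integration of a luminescence trace over time."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    points = list(zip(times, values))
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    )


# Hypothetical hourly readings over 48 h (constant signal, for simplicity).
hours = list(range(49))
rlu = [100.0] * 49

auc = area_under_curve(hours, rlu)  # proxy for total protein output
rlu_at_48h = rlu[-1]                # proxy for functional mRNA stability
print(auc, rlu_at_48h)
```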
Table 1.
Parameters for the experimentally tested optimization strategies
| Optimization strategy | U-depletion | Codon optimization |
|---|---|---|
| 1. Dinucleotide | Yes | Dinucleotides |
| 2. Matched Codon Pair Usage | Yes | Matched codon pair usage |
| 3. Matched Codon Usage U-depletion | Yes | Matched codon usage—default |
| 4. Matched Codon Usage no U-depletion | No | Matched codon usage—default |
| 5. Codon Adaptation Index (CAI) | Yes | CAI |
Other parameters were the same for all strategies: codon usage frequency threshold (10%), avoid motifs (NheI EcoRV NotI BamHI BspEI), minimal GC content (50), maximal GC content (70), window size for local GC content (100) and entropy window size (80).
Figure 1.
Impact of various sequence optimization strategies in mRNAid on NanoLuc-PEST expression. Effects of five different sequence optimization strategies on NanoLuc-PEST expression in MIA PaCa-2 cells are represented as (A) area under the curve (AUC) of luminescence over 48 h and (B) relative luminescence (RLU) at 48 h post-transfection. Rare, red denotes the de-optimized input; and Promega, blue denotes the codon-optimized control. The top four outputs from each strategy were tested at a 6.25 ng dose of the respective mRNA. Scatter plots with bars represent the mean from two independent biological replicates. (C) Western blot analysis of the sequence variants from 24 h post-transfection in MIA PaCa-2 cells. GAPDH was used as a loading control. Band intensities for NanoLuc were normalized to GAPDH and represented as fold change over the Promega control. Two additional outputs for each strategy were evaluated. Data from repeat experiment are included in Supplementary Figure S2. (D) Correlation plots for AUC (top) and RLU @ 48 h (bottom) versus MFE (kcal/mol). Mean values of AUC and RLU @ 48 h were determined for the 6.25 ng mRNA dose in (A and B) and represented as fold change over the respective mean values of Promega control (blue dashed line). The red dot denotes the de-optimized Rare input. Pearson’s r is indicated as determined by GraphPad Prism.
All 20 mRNAid-optimized sequences resulted in significantly higher expression relative to the de-optimized input (Figure 1A–C; Supplementary Figure S2), attesting to the robustness of the tool. Among the five optimization strategies, all four sequences optimized by Strategy-5 (CAI optimization) were consistently better than or comparable with Promega and exhibited the highest GC content, lowest global MFE and lowest uridine content (Figure 1A–C; Supplementary Table S2). Similarly, sequences that were optimized by Strategy-2 (matched codon-pair optimization) exhibited expression comparable with Promega. The importance of codon-pair context for enhanced gene expression was also recognized in previous studies (30). Conversely, all four sequences that were optimized by Strategy-1 (dinucleotide optimization) were expressed less well than the Promega control (Figure 1A–C; Supplementary Table S2; Supplementary Figure S2). Further analyses revealed that both AUC and RLU @ 48 h were strongly correlated with global MFE (r = –0.87 and r = –0.92, respectively, Figure 1D), in line with previous reports on the impact of secondary structures on mRNA half-lives and ultimately final protein output (27). Correlations were also observed with GC (%) and U (%), but not 5′-MFE (Supplementary Figure S3). Sequence optimization using mRNAid has a strong impact on rate of decay but not translation initiation as compared with the de-optimized input (Supplementary Figure S4). These findings highlight the importance of combining a multitude of hard, soft, local and global constraints to achieve balanced sequence and structural properties that cooperatively define total protein expression.
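Pearson's r, reported above as computed in GraphPad Prism, can be reproduced with a few lines of code. The sketch below is a generic implementation of the coefficient; the MFE and fold-change values are invented placeholders and are not the measured data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical values: more negative MFE paired with higher expression.
mfe_values = [-150.0, -200.0, -250.0, -300.0]
auc_fold_change = [0.4, 0.7, 0.9, 1.2]
print(pearson_r(mfe_values, auc_fold_change))
```

A negative coefficient here means that expression rises as the global MFE becomes more negative, which is the direction of the relationship described in the text.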
An in silico optimization of the native SARS-CoV-2 surface glycoprotein gene using mRNAid Strategy-5 (CAI optimization) generated a transcript with 96.5% and 90.9% similarity to the putative assembled Moderna and Pfizer/BioNTech vaccine sequences, respectively (Table 2) (21,22). We also compared the expression of the mRNAid-optimized spike CDS with the putative Moderna and Pfizer/BioNTech sequences in SJCRH30 (muscle, relevant for vaccine testing) and MIA PaCa-2 cells. In both cell lines, mRNAid optimization significantly boosted expression of spike protein as compared with the native input sequence (Figure 2). Although the mRNAid sequence is inferior to the putative Moderna sequence in both cell lines, it yielded expression comparable with the putative Pfizer vaccine sequence in the muscle cell line (Figure 2). These data confirm that mRNAid sequence optimization can be applied to therapeutically relevant proteins whose expression can be modulated based on the sequence optimization parameters applied.
Table 2.
Summary of GC content, MFE and percentage similarity between mRNAid-optimized native SARS-CoV-2 spike sequence and the indicated CDS
| Sequence | GC ratio | MFE (kcal/mol) | % Similarity to mRNAid CDS |
|---|---|---|---|
| Putative Moderna CDS | 0.62 | –1476.40 | 96.5 |
| Putative Pfizer CDS | 0.57 | –1321.30 | 90.9 |
| Native CDS | 0.37 | –1068.60 | 68.9 |
| mRNAid CDS | 0.62 | –1451.70 | – |
GC-rich RNA sequences are more likely to form stable secondary structures, such as stem–loop structures or hairpins, whereas lower MFE values reflect more stable structures.
Figure 2.
Impact of mRNAid sequence optimization on spike protein expression. Luminescence signals (normalized to a HiBiT control protein) of HiBiT-tagged spike CDS optimized by mRNAid (teal) or from the Moderna (black) and Pfizer (pink) vaccines are compared with the native spike CDS (blue) at (A) 24 h and (B) 48 h in SJCRH30 and MIA PaCa-2 cell lines at the 12.5 ng dose of the respective mRNA. Scatter plots, with bars representing the mean from two independent biological replicate experiments. (C) Western blot analysis of the spike sequence variants 24 or 48 h post-transfection in MIA PaCa-2 and SJCRH30 cells. HSP90 was used as a loading control. A representative blot is shown from two independent biological replicates. (D) Band intensities from two biological replicate experiments for HiBiT–spike (full-length, FL; and cleaved S2, S2) were normalized to HSP90 and represented as fold change over the native control. The presence of the expected product and the addition of a poly(A) tail for all mRNA constructs synthesized in this study were confirmed by gel electrophoresis (Supplementary Table S4; Supplementary Figure S5).
We next explored additional features that can be incorporated into mRNAid-optimized sequences to yield optimal therapeutic mRNAs. The discovery that uridine analogs dramatically reduce immune stimulation (10) and increase protein production from synthetic mRNA (14) marked a breakthrough in mRNA-based therapeutics. We evaluated the impact of substituting uridine (U) with pseudouridine (pU), 5-methoxyuridine (5moU) or N1-methylpseudouridine (N1m) on Nluc-PEST protein expression and pro-inflammatory cytokine (IFN-β) release in human BJ fibroblasts. For the de-optimized input (Rare) and codon-optimized control (Promega) sequences, pU and N1m modifications significantly improved protein expression as compared with U (Figure 3A), in line with previous reports. In contrast, U substitution with 5moU reduced protein output from the Promega sequence. Out of 20 mRNAid sequences, we picked the best expressed sequence (Strategy-5:output-2, Figure 1C) and were able to recapitulate its improved expression relative to Rare and Promega in BJ fibroblasts using U (black, Figure 3A). However, protein levels from the Strategy-5:output-2 sequence were not further enhanced by uridine analogs. We postulate that since the Strategy-5:output-2 sequence had the lowest uridine content (14% in Strategy-5:output-2; 21% in Promega; 27% in Rare), the effect of uridine substitution on expression may be minimal. In terms of immunogenicity, unmodified mRNAs (U) caused the highest cytokine release, as expected (comparable with the positive control poly I:C), followed by pU, N1m and 5moU, respectively (Figure 3B). Thus, our findings demonstrate how sequence optimization, in addition to incorporation of uridine analogs, can reduce undesired innate immune responses while maintaining high target protein expression.
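Uridine content figures like those quoted above can be obtained by counting U residues in the transcript. The Python sketch below is a minimal illustration under the assumption that uridine content means the fraction of U (or T, for a DNA template) over the full sequence length. The function name and the toy sequences are hypothetical, and they do not correspond to the Rare, Promega or Strategy-5:output-2 sequences.

```python
def uridine_fraction(seq: str) -> float:
    """Return the fraction of uridine residues in a sequence.

    T is counted as U so that a DNA template and its RNA transcript
    give the same value.
    """
    seq = seq.upper().replace("T", "U")
    if not seq:
        raise ValueError("sequence is empty")
    return seq.count("U") / len(seq)


# Toy example: two synonymous encodings of the peptide M-F-V-F-L.
# Choosing codons that end in C or G lowers the uridine content.
candidates = {
    "u_rich": "AUGUUUGUUUUUCUU",
    "u_depleted": "AUGUUCGUGUUCCUG",
}

# Report candidates from lowest to highest uridine content.
for name, seq in sorted(candidates.items(), key=lambda kv: uridine_fraction(kv[1])):
    print(f"{name}: {uridine_fraction(seq):.1%} uridine")
```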
Figure 3.
Additional ways to engineer mRNA for therapeutic use. (A) Effect of modified nucleotides on NanoLuc-PEST expression in BJ fibroblasts. U, uridine; pU, pseudouridine; 5moU, 5-methoxyuridine; N1m, N1-methyl-pseudouridine. Individual values from two independent biological replicate experiments have been plotted. (B) Effect of modified nucleotides on innate immune activation in BJ fibroblasts. Cytokine release assay 48 h post-transfection with 50 ng of the indicated mRNAs. IFN-β levels were normalized to the Promega sequence with pU incorporated. OOR (out-of-range), where values are below the detection limits of the assay. The dashed line represents the positive control poly I:C. Scatter plot, with bars representing the mean from two biological replicate experiments. (C) Effect of AG versus GG initiator sequence after the T7 promoter on NanoLuc protein expression in MIA PaCa-2 cells 24 h post-mRNA transfection. Individual values from two independent biological replicate experiments have been plotted. (D) Effect of different UTRs on NanoLuc protein expression in MIA PaCa-2 cells 24 h post-mRNA transfection.
mRNAs are capped at the 5′ end to protect against degradation, facilitate ribosome loading and evade innate immune responses (31). Compared with legacy cap analogs such as ARCA, the proprietary co-transcriptional capping reagent CleanCap AG from TriLink was shown to have higher capping efficiency, increased RNA yield and reduced immunogenicity. We modified the conventional GG initiator sequence after the T7 promoter to AG and synthesized mRNAs using CleanCap AG. Indeed, the AG initiator significantly improved NanoLuc expression (Figure 3C) and total RNA yield (data not shown). This highlights how protein output can be further boosted and emphasizes the importance of tailoring template design to the desired capping technique. As noted, the role of 5′- and 3′-UTRs in modulating mRNA stability and translation is well established (28). To this end, we selected four pairs of UTR sequences that have been reported to boost protein expression in human cells (Supplementary Table S3). Indeed, all four UTR pairs increased NanoLuc expression compared with the plasmid in which default sequences flank the CDS (Figure 3D). Together, we present various opportunities to enhance mRNA potency by optimizing key mRNA components.
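The GG-to-AG initiator change described above can be checked or applied programmatically on a DNA template. The Python sketch below is illustrative only: it assumes the commonly cited T7 promoter core `TAATACGACTCACTATA`, with the first two transcribed nucleotides immediately following it. The function name and example template are hypothetical and do not reflect the constructs used in this study.

```python
# Assumption: commonly cited T7 promoter core; transcription is taken to
# start at the nucleotide immediately after this sequence.
T7_CORE = "TAATACGACTCACTATA"


def use_ag_initiator(template: str) -> str:
    """Return the DNA template with an AG initiator after the T7 promoter.

    A conventional GG initiator is replaced by AG; a template that already
    starts with AG is returned unchanged. Any other initiator raises an
    error rather than being edited silently.
    """
    template = template.upper()
    idx = template.find(T7_CORE)
    if idx == -1:
        raise ValueError("T7 promoter core not found in template")
    start = idx + len(T7_CORE)
    initiator = template[start:start + 2]
    if initiator == "AG":
        return template
    if initiator != "GG":
        raise ValueError(f"unexpected initiator sequence: {initiator!r}")
    return template[:start] + "AG" + template[start + 2:]


# Toy template: promoter core followed by a conventional GG start.
conventional = "TAATACGACTCACTATAGGGAGACC"
print(use_ag_initiator(conventional))
```

Raising an error for unexpected initiators is a deliberate choice in this sketch, since a silent edit could alter the first transcribed nucleotides in an unintended way.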
In summary, this study represents a first attempt to create a comprehensive playbook for rational design of therapeutic mRNA transcripts. mRNAid is open-source software that offers advanced sequence and structural optimization strategies that generate transcripts with desired expression properties. We also experimentally demonstrate that incorporation of certain uridine analogs and inclusion of key mRNA components can further enhance stability, boost protein output and mitigate undesired immunogenicity effects.
Despite the encouraging results of mRNAid, it is important to note its limitations. mRNAid does not optimize for MFE along the entire transcript, and the thermodynamic parameters of uridine analogs are not accounted for. Yet the experimental data we presented here clearly indicate that global MFE optimization is worthwhile, albeit computationally expensive. However, the flexible backbone of mRNAid presents an opportunity for the broader scientific community to make additional enhancements to the tool. These improvements may involve extending support to other species, incorporating MFE calculations that account for uridine analogs, implementing new constraints or optimization strategies, or considering any other sequence features that might influence stability, immunogenicity and expression.
Acknowledgements
We thank Jens Christensen, Vincent Antonucci and Carol A. Rohl for supporting this work. We are immensely grateful to David Dzamba for his help with the initial research on transcript stability and expression that eventually was not included in the final manuscript.
Author contributions: SL, PG, BH, AP and DB conceived the study. DB planned and supervised the study. NV designed and orchestrated the implementation of mRNAid. KB and XW contributed to the development of the backend, SP to the development of the frontend, MS to the architecture and open-source release, and PM to code review and the scoring function. AG acted as the product owner and managed the backlog. AM helped with scientific research throughout, particularly in the context of MFE. SL and PG scientifically led and conducted the experimental work with the help of CY, JG and JW. DP created a command line tool and improved the mRNAid user interface. DB wrote the manuscript; SL, PG, JW, JG, AM and NV also contributed to the main text and the Materials and Methods section. All authors read and approved the final version of the manuscript.
Contributor Information
Nikita Vostrosablin, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
Shuhui Lim, Quantitative Biosciences, MSD Singapore, 138665, Singapore.
Pooja Gopal, Quantitative Biosciences, MSD Singapore, 138665, Singapore.
Kveta Brazdilova, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic; Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology, Prague, 160 00, Czech Republic.
Sushmita Parajuli, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
Xiaona Wei, Bioinformatics, MSD Singapore, 138665, Singapore.
Anna Gromek, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
David Prihoda, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
Martin Spale, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
Anja Muzdalo, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
Jamie Greig, Quantitative Biosciences, MSD Singapore, 138665, Singapore.
Constance Yeo, Quantitative Biosciences, MSD Singapore, 138665, Singapore.
Joanna Wardyn, Quantitative Biosciences, MSD Singapore, 138665, Singapore.
Petr Mejzlik, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
Brian Henry, Quantitative Biosciences, MSD Singapore, 138665, Singapore.
Anthony W Partridge, Quantitative Biosciences, MSD Singapore, 138665, Singapore.
Danny A Bitton, Discovery Informatics, MSD Czech Republic s.r.o., Prague, 150 00, Czech Republic.
Data availability
All code for this publication is available in the following GitHub repository: https://github.com/MSDLLCpapers/mRNAid and as a web application at https://mrnaid.dichlab.org. The code is also available in Zenodo: https://zenodo.org/doi/10.5281/zenodo.10693976.
Supplementary data
Supplementary Data are available at NARGAB Online.
Funding
Merck Sharp & Dohme LLC, a subsidiary of Merck & Co., Inc., Rahway, NJ, USA.
Conflict of interest statement. All authors that are/were employees of Merck Sharp & Dohme LLC, a subsidiary of Merck & Co., Inc., Rahway, NJ 07065, USA may hold stocks and/or stock options in Merck & Co., Inc., Rahway, NJ, USA.
References
- 1. Pardi N., Hogan M.J., Porter F.W., Weissman D. mRNA vaccines—a new era in vaccinology. Nat. Rev. Drug Discovery. 2018; 17:261–279.
- 2. Pastor F., Berraondo P., Etxeberria I., Frederick J., Sahin U., Gilboa E., Melero I. An RNA toolbox for cancer immunotherapy. Nat. Rev. Drug Discovery. 2018; 17:751–767.
- 3. Lim S., Khoo R., Juang Y.-C., Gopal P., Zhang H., Yeo C., Peh K.M., Teo J., Ng S., Henry B. et al. Exquisitely specific anti-KRAS biodegraders inform on the cellular prevalence of nucleotide-loaded states. ACS Cent. Sci. 2020; 7:274–291.
- 4. Corbett K.S., Edwards D.K., Leist S.R., Abiona O.M., Boyoglu-Barnum S., Gillespie R.A., Himansu S., Schäfer A., Ziwawo C.T., DiPiazza A.T. et al. SARS-CoV-2 mRNA vaccine design enabled by prototype pathogen preparedness. Nature. 2020; 586:567–571.
- 5. Martini G.V.P., Guey L.T. A new era for rare genetic diseases: messenger RNA therapy. Hum. Gene Ther. 2019; 30:1180–1189.
- 6. Hewitt S.L., Bai A., Bailey D., Ichikawa K., Zielinski J., Karp R., Apte A., Arnold K., Zacharek S.J., Iliou M.S. et al. Durable anticancer immunity from intratumoral administration of IL-23, IL-36γ, and OX40L mRNAs. Sci. Transl. Med. 2019; 11:eaat9143.
- 7. Damase T.R., Sukhovershin R., Boada C., Taraballi F., Pettigrew R.I., Cooke J.P. The limitless future of RNA therapeutics. Front. Bioeng. Biotechnol. 2021; 9:628137.
- 8. Jain R., Frederick J.P., Huang E.Y., Burke K.E., Mauger D.M., Andrianova E.A., Farlow S.J., Siddiqui S., Pimentel J., Cheung-Ong K. et al. MicroRNAs enable mRNA therapeutics to selectively program cancer cells to self-destruct. Nucleic Acid Ther. 2018; 28:285–296.
- 9. Verma M., Choi J., Cottrell K.A., Lavagnino Z., Thomas E.N., Pavlovic-Djuranovic S., Szczesny P., Piston D.W., Zaher H.S., Puglisi J.D. et al. A short translational ramp determines the efficiency of protein synthesis. Nat. Commun. 2019; 10:5774.
- 10. Vaidyanathan S., Azizian K.T., Haque A.K.M.A., Henderson J.M., Hendel A., Shore S., Antony J.S., Hogrefe R.I., Kormann M.S.D., Porteus M.H. et al. Uridine depletion and chemical modification increase Cas9 mRNA activity and reduce immunogenicity without HPLC purification. Mol. Ther. Nucleic Acids. 2018; 12:530–542.
- 11. Gould N., Hendy O., Papamichail D. Computational tools and algorithms for designing customized synthetic genes. Front. Bioeng. Biotechnol. 2014; 2:41.
- 12. Zhang H., Zhang L., Lin A., Xu C., Li Z., Liu K., Liu B., Ma X., Zhao F., Yao W. et al. Algorithm for optimized mRNA design improves stability and immunogenicity. Nature. 2023; 621:396–403.
- 13. Lee J., Kladwang W., Lee M., Cantu D., Azizyan M., Kim H., Limpaecher A., Gaikwad S., Yoon S., Treuille A. et al. RNA design rules from massive open laboratory. Proc. Natl Acad. Sci. USA. 2014; 111:2122–2127.
- 14. Kariko K., Muramatsu H., Welsh F.A., Ludwig J., Kato H., Akira S., Weissman D. Incorporation of pseudouridine into mRNA yields superior nonimmunogenic vector with increased translational capacity and biological stability. Mol. Ther. 2008; 16:1833–1840.
- 15. Kormann M.S.D., Hasenpusch G., Aneja M.K., Nica G., Flemmer A.W., Herber-Jonat S., Huppmann M., Mays L.E., Illenyi M., Schams A. et al. Expression of therapeutic proteins after delivery of chemically modified mRNA in mice. Nat. Biotechnol. 2011; 29:154–159.
- 16. Zulkower V., Rosser S. DNA Chisel, a versatile sequence optimizer. Bioinformatics. 2020; 36:4508–4509.
- 17. Alexaki A., Kames J., Holcomb D.D., Athey J., Santana-Quintero L.v., Lam P.V.N., Hamasaki-Katagiri N., Osipova E., Simonyan V., Bar H. et al. Codon and codon-pair usage tables (CoCoPUTs): facilitating genetic variation analyses and recombinant gene design. J. Mol. Biol. 2019; 431:2434–2441.
- 18. Sharp P.M., Li W.H. The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987; 15:1281–1295.
- 19. Hofacker I.L. Vienna RNA secondary structure server. Nucleic Acids Res. 2003; 31:3429–3431.
- 20. Gaspar P., Moura G., Santos M.A.S., Oliveira J.L. mRNA secondary structure optimization using a correlated stem–loop prediction. Nucleic Acids Res. 2013; 41:5490.
- 21. Jeong D.-E., McCoy M., Artiles K., Ilbay O., Fire A., Nadeau K., Park H., Betts B., Boyd S., Hoh R. et al. Assemblies-of-Putative-SARS-CoV2-Spike-Encoding-mRNA-Sequences-for-Vaccines-BNT-162b2-and-mRNA-1273. (1 March 2022, date last accessed) https://virological.org/t/assemblies-of-putative-sars-cov2-spike-encoding-mrna-sequences-for-vaccines-bnt-162b2-and-mrna-1273/663.
- 22. World Health Organization. Messenger RNA Encoding the Full-length SARS-CoV-2 Spike Glycoprotein. (1 March 2022, date last accessed) https://web.archive.org/web/20210105162941/https://mednet-communities.net/inn/db/media/docs/11889.doc.
- 23. Hale R.S., Thompson G. Codon optimization of the gene encoding a domain from human type 1 neurofibromin protein results in a threefold improvement in expression level in Escherichia coli. Protein Expr. Purif. 1988; 12:185–188.
- 24. Diambra L.A. Differential bicodon usage in lowly and highly abundant proteins. PeerJ. 2017; 5:e3081.
- 25. Gutman G.A., Hatfield G.W. Nonrandom utilization of codon pairs in Escherichia coli. Proc. Natl Acad. Sci. USA. 1989; 86:3699–3703.
- 26. Tuller T., Waldman Y.Y., Kupiec M., Ruppin E. Translation efficiency is determined by both codon bias and folding energy. Proc. Natl Acad. Sci. USA. 2010; 107:3645–3650.
- 27. Mauger D.M., Cabral J., Presnyak V., Su S.V., Reid D.W., Goodman B., Link K., Khatwani N., Reynders J., Moore M.J. et al. mRNA structure regulates protein expression through changes in functional half-life. Proc. Natl Acad. Sci. USA. 2019; 116:24075–24083.
- 28. Leppek K., Byeon G.W., Kladwang W., Wayment-Steele H.K., Kerr C.H., Xu A.F., Kim D.S., Topkar V.V., Choe C., Rothschild D. et al. Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics. Nat. Commun. 2022; 13:1536–1558.
- 29. Mortimer S.A., Kidwell M.A., Doudna J.A. Insights into RNA structure and function from genome-wide studies. Nat. Rev. Genet. 2014; 15:469–479.
- 30. Chung B.K.S., Lee D.Y. Computational codon optimization of synthetic gene for protein expression. BMC Syst. Biol. 2012; 6:134–148.
- 31. Ramanathan A., Robb G.B., Chan S.H. mRNA capping: biological functions and applications. Nucleic Acids Res. 2016; 44:7511–7526.