Abstract
Heterologous expression is the main approach for recombinant protein production ingenetic synthesis, for which codon optimization is necessary. The existing optimization methods are based on biological indexes. In this paper, we propose a novel codon optimization method based on deep learning. First, we introduce the concept of codon boxes, via which DNA sequences can be recoded into codon box sequences while ignoring the order of bases. Then, the problem of codon optimization can be converted to sequence annotation of corresponding amino acids with codon boxes. The codon optimization models for Escherichia Coli were trained by the Bidirectional Long-Short-Term Memory Conditional Random Field. Theoretically, deep learning is a good method to obtain the distribution characteristics of DNA. In addition to the comparison of the codon adaptation index, protein expression experiments for plasmodium falciparum candidate vaccine and polymerase acidic protein were implemented for comparison with the original sequences and the optimized sequences from Genewiz and ThermoFisher. The results show that our method for enhancing protein expression is efficient and competitive.
Subject terms: Bioinformatics, Gene expression analysis, Biotechnology
Introduction
With the rapid development of biotechnology, heterologous expression has been utilized to generate recombinant proteins for use in vaccines and pharmaceuticals1,2. The codon is the basic unit of correspondence between nucleic acids carrying information and proteins carrying information and is also the basic link for information transfer in vivo. Codons that encode the same amino acid are called synonymous codons. While the usage probabilities of synonymous codons are not the same during protein synthesis, a species or a gene typically prefers to use one or several specific synonymous codons called optimal codons, and this phenomenon is known as codon usage bias3. Moreover, the codon usage bias of genes differs significantly among different functions.
Codon usage bias has a complex effect on protein expression levels when recombinant proteins are heterologously expressed4. The frequency of codons in a DNA sequence is positively correlated with the corresponding tRNA in a species, and the tRNA concentration determines the number of amino acids available for protein translation extension, which in turn affects the efficiency of protein synthesis5,6. Thus, the expression levels of proteins are highly correlated with codon usage bias. Rare codons tend to reduce the rate of translation and even cause translation errors7. Furthermore, codon optimization is the most critical determinant of increasing protein expression8. In gene synthesis, codon optimization involves recombination based on different criteria without changing the sequence of the amino acid9 and can promote expression of the recombinant gene in different host organisms9–11. Therefore, codon optimization for microorganisms is an essential part of gene synthesis.
In heterologous expression systems, to maximize protein expression from the DNA sequence of the original species in the host, codon optimization improves the translation efficiency of a target gene12 by converting the DNA sequence of nucleotides of one species to that of another, such as converting human sequences to bacterial or yeast sequences, plant sequences to human sequences, and fungal sequences to yeast sequences. Various codon optimization strategies have been developed by using a range of quantitative methods to generate different mRNA sequences, which can result in different levels of final protein expression. Most optimization strategies use codons with host bias to replace less frequently occurring codons13–16. In addition, a strategy is proposed to adjust the original codon sequence to match the natural distribution of the host codons13,17–19, the goal of which is to preserve the slow translation regions that are important for protein folding9,10,20. This strategy has been recognized as the best way to optimize codons.
In the industry, many biotechnology companies perform codon optimization, such as ThermoFisher (www.thermofisher.com) and Genewiz (www.genewiz.com), whose methods are based on the aforementioned strategies and empirical indexes. As a consequence, their indexes for codon optimization mainly include the codon adaptation index (CAI)21, the frequency of relative synonymous codon usage22, the codon bias index23, optimal codon usage7, and effective codon number24. Among these indexes, the CAI is the primary index used to predict gene expression level because it indicates the extent to which the coding sequence represents the usage of codons in an organism25.
In addition to the strategies considered to eliminate rare codons, there also exist parameters with important impacts on protein expression, such as the GC content26, RNA secondary structure, cleavage sites, restriction endonuclease sites, repeats and certain added or deleted motifs27,28. Many websites and software incorporate codon optimization algorithms with various determinants, such as DNA Works29, Optimizer30, GeMS31, Gene Designer14, Gene Designer Synthetic32, ThermoFisher and Genewiz. To further optimize the DNA sequence, some researchers also perform plasmid-mediated replenishment of tRNAs corresponding to rare codons from the host33. Currently, emerging high-throughput methods for gene synthesis and screening can also increase protein expression levels34.
In fact, the method that directly replaces rare codons with host biased codons is straightforward and can be implemented easily. However, the optimized DNA sequences contain host biased codons, so the transcribed mRNA contains a high percentage of codon subsets, which results in an imbalance of different tRNAs and eventually leads to the depletion of tRNA and termination of translation14. Additionally, the method that coordinates and replaces codons can make the adjusted codon sequence consistent with the natural distribution in the host, but this method is often complicated to implement and lacks flexibility.
The CAI is an important index with which to measure protein expression, but it is not comprehensive. Therefore, we use the deep learning method instead of the index method. The CAI is used only as a reference index in this paper, and it is verified by biological experiments.
Recently, deep learning has shown impressive applicability in a variety of domains, entailing a series of machine learning algorithms. Biological and medical research is replete with big data, but the data are often perplexing. These problems might be more appropriately handled using deep learning techniques35. The original idea stems from applying deep learning techniques to obtain the distribution of codons for feasible codon optimization without any empirical rules.
In this study, the concept of a codon box is introduced as a method to recode DNA sequences. Next, a popular sequence annotation method in deep learning called Bidirectional Long-Short-Term Memory Conditional Random Field (BiLSTM-CRF)36 was adopted to annotate amino acid sequences with codon boxes or codons directly. Finally, biological experiments were conducted to analyze and compare the protein expression in Escherichia coli (E. coli) with that obtained by Genewiz and ThermoFisher.
Results
Codon box
Regardless of the base order of the codons, or equivalently, if the codons contain the same A, T, G, and C bases, then they are taken as a set, which is called a codon box. For example, the codons ATG, TAG, AGT and GAT are taken as a whole codon box {agt}, as shown in Table 1. It is coincidental that the total number ofcodon boxes is 20, which is exactly equal to the number of amino acids in the universal codon table.
Table 1.
Type of codon box | Codon box | Amino acid | Codon |
---|---|---|---|
Type-1 | {aaa} | Lys | AAA |
{ccc} | Pro | CCC | |
{ggg} | Gly | GGG | |
{ttt} | Phe | TTT | |
Type-2 | {aac} | Gln, Asn, Thr | CAA, AAC, ACA |
{aag} | Arg, Glu, Lys | AGA, GAA, AAG | |
{aat} | Ile, Asn | ATA, AAT | |
{acc} | His, Pro, Thr | CAC, CCA, ACC | |
{agg} | Arg, Glu, Gly | AGG, GAG, GGA | |
{att} | Ile, Leu, Tyr | ATT, TTA, TAT | |
{ccg} | Ala, Arg, Pro | GCC, CGC, CCG | |
{cct} | Leu, Pro, Ser | CTC, CCT, TCC | |
{cgg} | Ala, Arg, Gly | GCG, CGG, GGC | |
{ctt} | Leu, Phe, Ser | CTT, TTC, TCT | |
{ggt} | Gly, Trp, Val | GGT, TGG, GTG | |
{gtt} | Cys, Leu, Val | TGT, TTG, GTT | |
Type-3 | {acg} | Ala, Arg, Asp, Gln, Ser, Thr | GCA, CGA, GAC, CAG, AGC, ACG |
{act} | His, Ile, Leu, Ser, Thr, Tyr | CAT, ATC, CTA, TCA, ACT, TAC | |
{agt} | Asp, Met, Ser, Val | GAT, ATG, AGT, GTA | |
{cgt} | Ala, Arg, Cys, Leu, Ser, Val | GCT, CGT, TGC, CTG, TCG, GTC |
According to the codon box concept, 64 codons can be divided into 20 kinds of codon boxes. Furthermore, the codon boxes can be classified into three categories: Type-1 has only one kind of base; Type-2 has two kinds of bases; and Type-3 has three kinds of bases.
Table 1 also shows that the codon encoding a given amino acid can be uniquely determined by a codon box and the amino acid, that is, different codons in the same codon box cannot encode the same amino acid. For example, the codons encoding the amino acid Gly are GGT, GGC, GGA and GGG. The codon box {ggt} contains GGT, TGG and GTG. Therefore, once the amino acid Gly and the codon box {ggt} are given, GGT can be uniquely determined to encode Gly, as shown in Fig. 1. For other codon boxes, the corresponding codons encoding Gly are determined similarly. This critical property has not been identified previously and can play a key role in subsequent codon optimization.The codon box can be regarded as a coding method in machine learning that can simplify deep learning models, and a codon box and an amino acid together can uniquely determine a codon, which has not been reported previously. Furthermore, it was verified that the effect is better after introducing a codon box. The number of codon boxes was consistent with that of conventional amino acids. However, whether the use of codon boxes is directly relevant in biology needs further study.
Codon optimization with deep learning
The choices of synonymous codon pairs are not random in individuals3, and different species are subject to different rules embedded in the distribution of their codons. To accurately capture the codon distribution of host genes, the codon optimization problem can be converted to that of a sequence annotation problem in deep learning, as shown in Fig. 2. BiLSTM-CRF is the most widely used sequence annotation algorithm, and the code for the BiLSTM-CRF annotation method is available at https://github.com/jiesutd/NCRFpp. In this paper, our focus is not the algorithm design for BiLSTM-CRF, as shown in Fig. 2a, which is simply a training tool for the E. coli codon optimization model. Our method, based on the codon box in Table 1, is available at https://github.com/Devil625/Codon_Optimization.git, whose flowchart is shown in Fig. 2b.
It is obvious that the outputs of the two kinds of annotation models were designed according to the codon distribution of host genes. To train the annotation models, the training data including 4906 genes were selected from the DNA sequences of E. coli available from the NCBI, 80% of which were used as the training set, 10% as the validation set, and 10% as the test set. In the models, every amino acid is considered to be a word, and the dimension of the word vector is an important parameter. Considering 20 kinds of amino acids and stop codons, a word vector with 21 dimensions is a proper selection for word-embedding vectors of amino acid sequences. In practical training, the CAI indicates that a word vector with 21 dimensions can yield a better result than one with 50 dimensions or 100 dimensions.
Because gene mutation cannot theoretically be avoided in sequence annotation, that is, the optimized codon sequence may encode a different amino acid, the model with fewer mutations is better. In the case of a mutation, the mutant codon will be replaced with the original codon to ensure that the amino acid sequence remains unchanged. Surprisingly, there was almost no mutation in the training process.
Machine learning approach is a probabilistic model and therefore cannot rule out the possibility of mutation. To date, we have performed many experiments with data from the related protein expression optimization literature, and have not found such mutation. To further study the probability of such mutation, we randomly generated 10,000 genes and found that the mutation probability was 0.00%. Therefore, this operation did not alter the effect of our model.
The other hyperparameters also have significant impacts on performance when training the BiLSTM-CRF model. In this paper, the main hyperparameters of our model were selected as follows: a 4-layer BiLSTM was selected, and the hidden layer dimension was 200; the dropout was 0.5; the batch size was 32; and the learning rate was 0.003. To verify the rationality of the codon box proposed in this paper, BiLSTM-CRF(a) and BiLSTM-CRF(b) were trained in the same environment. The training times for BiLSTM-CRF(a) and BiLSTM-CRF(b) were approximately 40 h and 17 h on 1080 GPU, respectively; the test accuracy and training accuracy of BiLSTM-CRF(a) were 0.52 and 0.76, respectively; and the test accuracy and training accuracy of BiLSTM-CRF(b) were 0.52 and 0.77, respectively. BiLSTM-CRF(a) and BiLSTM-CRF(b) have almost the same model accuracy.
Because our goal in this paper was the optimization of DNA sequences, CAI is used as a main index of model comparison. The CAI is an important index with which to measure protein expression, but it is not comprehensive. Therefore, we use the deep learning method instead of the index method. The CAI is used only as a reference index in this paper, and it is verified by biological experiments. The average CAIs of BiLSTM-CRF(a) and BiLSTM-CRF(b) for the test set were 0.94 and 0.96, respectively. According to the statement on GeneScript's optimization website37, the ideal range for CAI is 0.8–1.0, and the lower the number is, the higher the chance that the gene will be expressed poorly. Therefore, BiLSTM-CRF(b) is better than BiLSTM-CRF(a) in terms of training time and the CAI.
To compare the CAI of the original sequence with those of the Genewiz, ThermoFisher, BiLSTM-CRF(a), and BiLSTM-CRF(b) optimized sequences, six codon sequences (HPDF, PAE, MMPL3, FALVAC-1, PA and PTP4A3) were extracted from six papers on gene optimization and protein expression138–42, as shown in Table 2, from which the 972 bp plasmodium falciparum candidate vaccine (FALVAC-1) and 564 bp polymerase acidic protein (PTP4A3) were randomly selected for biological experiments on protein expression.
Table 2.
DNA | bp | Original | Genewiz | ThermoFisher | BiLSTM-CRF(a) | BiLSTM-CRF(b) |
---|---|---|---|---|---|---|
HPDF | 615 | 0.70 | 0.85 | 0.92 | 0.96 | 0.98 |
PAE | 1839 | 0.76 | 0.81 | 0.92 | 0.96 | 0.98 |
MMPL3 | 2835 | 0.67 | 0.79 | 0.93 | 0.96 | 0.98 |
FALVAC-1 | 972 | 0.67 | 0.85 | 0.93 | 0.95 | 0.96 |
PA | 561 | 0.60 | 0.83 | 0.93 | 0.97 | 0.98 |
PTP4A3 | 564 | 0.70 | 0.83 | 0.93 | 0.96 | 0.98 |
Average | 1231 | 0.68 | 0.83 | 0.93 | 0.96 | 0.98 |
This table shows the CAIs of the sequences optimized by different optimization tools, among which the values for Genewiz and ThermoFisher are provided on their official websites (ThermoFisher: www.thermofisher.com, Genewiz: www.genewiz.com). BiLSTM-CRF(b) has the highest average CAI, showing that it has great potential to enhance protein expression.
As shown in Table 3, the Jaccard index43 was used to measure the similarities of the optimized sequence from BiLSTM-CRF(b) with the original sequence and the sequences from Genewiz, ThermoFisher, and BiLSTM-CRF(a). The average 20%-28% difference between BiLSTM-CRF(b) and ThermoFisher or Genewiz shows that our method is a new approach for discovering underlying features of data, and it is different from BiLSTM-CRF(a).
Table 3.
DNA | Original | Genewiz | ThermoFisher | BiLSTM-CRF(a) |
---|---|---|---|---|
PTP4A3 | 0.68 | 0.74 | 0.80 | 0.85 |
PA | 0.62 | 0.72 | 0.82 | 0.90 |
PAE | 0.70 | 0.70 | 0.79 | 0.90 |
FALVAC-1 | 0.62 | 0.73 | 0.80 | 0.88 |
HPDF | 0.70 | 0.73 | 0.80 | 0.90 |
MMPL3 | 0.65 | 0.69 | 0.76 | 0.89 |
Average | 0.66 | 0.72 | 0.80 | 0.89 |
Jaccard similarity index between the optimized sequences of BiLSTM-CRF(b) and others.
Experimental results for FALVAC-1 and PTP4A3
Because the CAI is simply a factor that affects protein expression, to further validate the rationality of our codon optimization method, the FALVAC-1 protein (FALVAC-1 was constructed as a multivalent plasmodium falciparum vaccine antigen and expressed in E. coli) and PTP4A3 protein were expressed in E. coli, and their expression levels were analyzed by western blot analysis. We compared three groups (Group 1, Group 2 and Group 3) of parallel experiments simultaneously for FALVAC-1 and PTP4A3, and compared the optimization effects of five sequences (namely, Original, Genewiz, Thermo, Opt-b and Opt-a, where Opt-b stands for BiLSTM-CRF(b) and Opt-a stands for BiLSTM-CRF(a)) in each group. The comparison of protein expression levels is shown in Fig. 3.
As shown in Fig. 3a and the corresponding Table 4, according to the optimization ratio, protein expression from the FALVAC-1 gene sequence optimized by Opt-b was significantly better than that obtained with the other methods. Furthermore, the protein expression obtained with Opt-b was better than that obtained with Opt-a, which indicates that the introduction of a codon box is necessary and useful. As shown in Fig. 3b and the corresponding Table 5, when the original sequence is well expressed, although Opt-b is still the best method, the optimization ratio is not very significant, and all methods have approximately equal optimization ratios. The reason for this result is that the codon distribution of PTP4A3 is similar to that of E. coli genes. Hence, the result indicates that the new model is robust and reflects the distribution of host genes. Moreover, the experimental results are also clearly consistent with the theoretical predictions based on the CAI in Table 2.
Table 4.
Original | Genewiz | Thermo | Opt-b | Opt-a | |
---|---|---|---|---|---|
Group 1 | 0.221 | 0.875 | 0.548 | 2.178 | 1.669 |
Group 2 | 0.090 | 0.742 | 0.352 | 2.115 | 1.747 |
Group 3 | 0.245 | 0.901 | 0.331 | 1.935 | 1.762 |
Average value | 0.186 | 0.839 | 0.410 | 2.076 | 1.726 |
Optimization ratio | 1 | 4.511 | 2.204 | 11.129 | 9.462 |
The comparison of grayscale value ratios between FALVAC-1 and GAPDH. The optimization ratio is the ratio of each method's average value to the original average value.
Table 5.
Original | Genewiz | Thermo | Opt-b | Opt-a | |
---|---|---|---|---|---|
Group 1 | 2.448 | 2.863 | 3.006 | 3.033 | 3.017 |
Group 2 | 3.398 | 3.506 | 3.564 | 4.568 | 3.266 |
Group 3 | 1.727 | 0.901 | 3.073 | 3.145 | 3.594 |
Average value | 2.558 | 3.147 | 3.238 | 3.780 | 3.292 |
Optimization ratio | 1 | 1.23 | 1.266 | 1.400 | 1.287 |
The comparison of grayscale value ratios between PTP4A3 and GAPDH. The optimization ratio is the ratio of each method's average value to the original average value.
In this paper, we chose two genes, namely, FALVAC-1 with a low expression level, which proved the effectiveness of our algorithm, and PTP4A3 with a high expression level, which proved the stability of our algorithm. FALVAC-145 was constructed as a multivalent plasmodium falciparum vaccine antigen and expressed in E. coli, and PTP4A346 was used as a negative control to prove that our algorithm will not cause low expression.
Furthermore, according to the method reported in the literature, the activities of each purified protein were detected by more experiments, as shown in Fig. 4. No significant difference in the protein's activity among the five sequences (Original, Genewiz, Thermo, Opt-b and Opt-a) was demonstrated. This result proved that our optimization had no effect on the protein's function.
Discussion
In this paper, we introduce the concept of codon boxes, via which DNA sequences can be recoded into codon box sequences while ignoring the order of bases. Then, the problem of codon optimization can be converted to sequence annotation of corresponding amino acids with codon boxes. Because deep learning is a good method to obtain the distribution characteristics of DNA sequences, it is theoretically more advantageous for tackling codon optimization than existing index optimization methods. According to the results of biological experiments, compared with the codon optimization tools that are widely used in the industry, our method is likely competitive in terms of genetic engineering. Our optimization model was originally designed for E. coli in this paper, while its generality for other species, such as insect cells and yeasts needs further research in the future. Moreover, with the development of deep learning, the optimization method can be further improved to obtain better protein expression.
In fact, codon optimization can also be regarded as an inverse problem of codon sequences coding amino acid sequences. However, the inverse problem is not one-to-one mapping, and whether a learning mechanism for amino acid sequence to DNA sequence conversion exists is not clear. Because the deep neural network is a black box, the underlying biological reasons cannot be adequately explained at the present time. Therefore, it is reasonably expected that other experts will provide further biological insights into the learning mechanism in the future.
Methods
Codon optimization
To obtain the final optimized sequence, we implemented a bidirectional long/short-term memory neural network with a conditional random field layer (BiLSTM-CRF)44 that is able to annotate amino acid sequences with codons or codon boxes. First, the codon sequences can be decoded into the corresponding amino acid sequence. The word-embedding vectors of amino acid sequences are regarded as inputs of BiLSTM-CRF. The model parameters were iteratively optimized on the training set using L2 regularization, and the model with the best performance on the validation set was chosen. BiLSTM-CRF provides each amino acid and its annotated codon or codon box token as the output. Because a codon box and an amino acid can be used to uniquely determine a codon, the optimized codon sequence can be obtained.
CAI
CAI is calculated as per formula (1):
1 |
where L is the number of codons, an wk is calculated as per formula (2):
2 |
where RSCUmax is the highest codon usage frequency for synonymous codons in highly expressed reference gene and RSCUi is the relative frequency of the unified codon of the first codon encoding the corresponding amino acids.
Protein expression
The original gene and optimized codons of PTP4A3 and FALVAC-1 were subcloned into the pET28a(+) vector with a hexahistidine affinity tag fused to the N terminus and transformed into E.coli(BL21(DE3)). All plasmids were ordered from Genewiz (www.genewiz.com/). Upon bacterial growth to an optical density of 0.6–0.8 at 600 nm in lysogeny broth containing 50 µg/ml kanamycin at 37 °C in a shaker at 220 rpm, induction was carried out at 16 °C using 0.2 mM isopropyl-b-D-thiogalactoside (IPTG), and growth was continued at 16 °C for approximately 18 h. The cells were harvested by centrifugation and stored at − 80 °C or used for the subsequent steps.
Harvested cells were resuspended in lysis buffer (PBS). M protease inhibitor (phenylmethanesulfonyl fluoride, PMSF) was added to the cell sample before lysis, and then, the cell lysate was obtained by ultrasonication. The cell lysate was centrifuged at 15,000 rpm for 45 min, after which the supernatant was collected. The protein concentration was determined by the Beyotime BCA Protein Assay Kit.
Then, a western blot analysis was carried out on the extracted samples, with anti-His as the primary antibody. Primary antibody incubation was followed by probing with the corresponding secondary antibody, and the blot was developed using Image Lab Touch Software.
Supplementary information
Acknowledgements
The authors would like to express their sincere gratitude to Philip Hamish Todd of Saltire Software; Shengchuan Wu of the Franz Corporation; Shengyong Yang and Jun Zou of the State Key Laboratory of Biotherapy, West China Hospital, Sichuan University; and Jinao He, Chong Gao, Yiran Wang and other students in our laboratory. The authors also wish to thank many anonymous reviewers for their constructive remarks. This work was partially funded by the National High Technology Research and Development Program of China (no. 2018YFB1005100, 2018YFB1005104), the National Natural Science Foundation of China (nos. 61876034, 61202257, 61650110512), and the China Postdoctoral Science Foundation (no. 2016M602677).
Author contributions
H.F. contributed the ideas and the concept of the codon box. Y.L. implemented the algorithm and performed the statistical analysis for the biological experiments. X.Z. provided the algorithm design. Z.P. designed the biological experiments. L.H. provided algorithm guidance. H.Z. assisted to design the biological experiments. Y.X. collected the related gene sequence data. W.Z. participated in the biological experiments. Z.L. provided theoretical guidance. All authors co-wrote the manuscript.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
is available for this paper at 10.1038/s41598-020-74091-z.
References
- 1.Han JH, et al. Codon optimization enhances protein expression of human peptide deformylase in E. coli. Protein Expr. Purif. 2010;70(2):224–230. doi: 10.1016/j.pep.2009.10.005. [DOI] [PubMed] [Google Scholar]
- 2.Liu B, Kong Q, Zhang D, Yan L. Codon optimization significantly enhanced the expression of human 37-kDa iLRP in Escherichia coli. 3 Biotech. 2018;8(4):210. doi: 10.1007/s13205-018-1234-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Hershberg R, Petrov DA. Selection on codon bias. Annu. Rev. Genet. 2008;42:287–299. doi: 10.1146/annurev.genet.42.110807.091442. [DOI] [PubMed] [Google Scholar]
- 4.Fiers W, Grosjean H. On codon usage. Nature. 1979;277(5694):328. doi: 10.1038/277328a0. [DOI] [PubMed] [Google Scholar]
- 5.Karlin S, Mrázek J, Campbell A, Kaiser D. Characterizations of highly expressed genes of four fast-growing bacteria. J. Bacteriol. 2001;183(17):5025–5040. doi: 10.1128/JB.183.17.5025-5040.2001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Gustafsson C, Govindarajan S, Minshull J. Codon bias and heterologous protein expression. Trends Biotechnol. 2004;22(7):346–353. doi: 10.1016/j.tibtech.2004.04.006. [DOI] [PubMed] [Google Scholar]
- 7.Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 1981;151(3):389–409. doi: 10.1016/0022-2836(81)90003-6. [DOI] [PubMed] [Google Scholar]
- 8.Lithwick G, Margalit H. Hierarchy of sequence-dependent features associated with prokaryotic translation. Genome Res. 2003;13(12):2665–2673. doi: 10.1101/gr.1485203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Gaspar P, Oliveira JL, Frommlet J, Santos MA, Moura G. EuGene: maximizing synthetic gene design for heterologous expression. Bioinformatics. 2012;28(20):683–2684. doi: 10.1093/bioinformatics/bts465. [DOI] [PubMed] [Google Scholar]
- 10.Komar AA, Guillemet E, Reiss C, Cullin C. Enhanced expression of the yeast Ure2 protein in Escherichia coli: the effect of synonymous codon substitutions at a selected place in the gene. Biol. Chem. 1998;379(10):1295–1300. [PubMed] [Google Scholar]
- 11.Inouye S, Sahara-Miura Y, Sato JI, Suzuki T. Codon optimization of genes for efficient protein expression in mammalian cells by selection of only preferred human codons. Protein Expr. Purif. 2015;109:47–54. doi: 10.1016/j.pep.2015.02.002. [DOI] [PubMed] [Google Scholar]
- 12.Al-Hawash AB, Zhang X, Ma F. Strategies of codon optimization for high-level heterologous protein expression in microbial expression systems. Gene Rep. 2017;9:46–53. doi: 10.1016/j.genrep.2017.08.006. [DOI] [Google Scholar]
- 13.Richardson SM, Wheelan SJ, Yarrington RM, Boeke JD. GeneDesign: rapid, automated design of multikilobase synthetic genes. Genome Res. 2006;16(4):550–556. doi: 10.1101/gr.4431306. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Villalobos A, Ness JE, Gustafsson C, Minshull J, Govindarajan S. Gene designer: a synthetic biology tool for constructing artificial DNA segments. BMC Bioinform. 2006;7(1):285. doi: 10.1186/1471-2105-7-285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Feng Z, Zhang L, Han X, Zhang Y. Codon optimization of the calf prochymosin gene and its expression in Kluyveromyces lactis. World J. Microbiol. Biotechnol. 2010;26(5):895–901. doi: 10.1007/s11274-009-0249-2. [DOI] [Google Scholar]
- 16.Marlatt NM, Spratt DE, Shaw GS. Codon optimization for enhanced Escherichia coli expression of human S100A11 and S100A1 proteins. Protein Expr. Purif. 2010;73(1):58–64. doi: 10.1016/j.pep.2010.03.015. [DOI] [PubMed] [Google Scholar]
- 17.Mellitzer A, Weis R, Glieder A, Flicker K. Expression of lignocellulolytic enzymes in Pichia pastoris. Microb. Cell Fact. 2012;11(1):61. doi: 10.1186/1475-2859-11-61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Kodumal SJ, et al. Total synthesis of long DNA sequences: synthesis of a contiguous 32-kb polyketide synthase gene cluster. Proc. Natl. Acad. Sci. USA. 2004;101(44):15573–15578. doi: 10.1073/pnas.0406911101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Welch M, et al. Design parameters to control synthetic gene expression in Escherichia coli. PLoS ONE. 2009;4(9):e7002. doi: 10.1371/journal.pone.0007002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Angov E, Hillier CJ, Kincaid RL, Lyon JA. Heterologous protein expression is enhanced by harmonizing the codon usage frequencies of the target gene with those of the expression host. PLoS ONE. 2008;3(5):e2189. doi: 10.1371/journal.pone.0002189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Sharp PM, Li WH. Balanced codon usage optimizes eukaryotic translational efficapplications. Nucleic Acids Res. 1987;15:1281. doi: 10.1093/nar/15.3.1281. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Sharp PM, Tuohy TM, Mosurski KR. Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res. 1986;14(13):5125–5143. doi: 10.1093/nar/14.13.5125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Bennetzen JL, Hall BD. Codon selection in yeast. J. Biol. Chem. 1982;257(6):3026–3031. [PubMed] [Google Scholar]
- 24.Wright F. The ‘effective number of codons’ used in a gene. Gene. 1990;87(1):23–29. doi: 10.1016/0378-1119(90)90491-9. [DOI] [PubMed] [Google Scholar]
- 25.Raab D, Graf M, Notka F, Schödl T, Wagner R. The GeneOptimizer Algorithm: using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization. Syst. Synth. Biol. 2010;4(3):215–225. doi: 10.1007/s11693-010-9062-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Baca AM, Hol WG. Overcoming codon bias: a method for high-level overexpression of Plasmodium and other AT-rich parasite genes in Escherichia coli. Int. J. Parasitol. 2000;30(2):113–118. doi: 10.1016/S0020-7519(00)00019-9. [DOI] [PubMed] [Google Scholar]
- 27.Pertzev AV, Nicholson AW. Characterization of RNA sequence determinants and antideterminants of processing reactivity for a minimal substrate of Escherichia coli ribonuclease III. Nucleic Acids Res. 2006;34(13):3708–3721. doi: 10.1093/nar/gkl459. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Condon A, Thachuk C. Efficient codon optimization with motif engineering. J. Discret. Algorithms. 2012;16:104–112. doi: 10.1016/j.jda.2012.04.017. [DOI] [Google Scholar]
- 29.Hoover DM, Lubkowski J. DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 2002;30(10):e43–e43. doi: 10.1093/nar/30.10.e43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Puigbo P, Guzman E, Antoni R, Garcia-Vallve S. Optimizer: a web server for optimizing the codon usage of DNA sequences. Nucleic Acids Res. 2007;35:W126–W131. doi: 10.1093/nar/gkm219. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Jayaraj S, Reid R, Santi DV. GeMS: an advanced software package for designing synthetic genes. Nucleic Acids Res. 2005;33(9):3011–3016. doi: 10.1093/nar/gki614. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Wu G, Bashir-Bello N, Freeland SJ. The synthetic gene designer: a flexible web platform to explore sequence manipulation for heterologous expression. Protein Expr. Purif. 2006;47(2):441–445. doi: 10.1016/j.pep.2005.10.020. [DOI] [PubMed] [Google Scholar]
- 33.Brinkmann U, Mattes RE, Buckel P. High-level expression of recombinant genes in Escherichia coli is dependent on the availability of the dnaY gene product. Gene. 1989;85(1):109–114. doi: 10.1016/0378-1119(89)90470-8. [DOI] [PubMed] [Google Scholar]
- 34.Quan J, et al. Parallel on-chip gene synthesis and application to optimization of protein expression. Nat. Biotechnol. 2011;29(5):449. doi: 10.1038/nbt.1847. [DOI] [PubMed] [Google Scholar]
- 35.Ching T, et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface. 2018;15(141):20170387. doi: 10.1098/rsif.2017.0387. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Yang, Jie, and Y. Zhang. NCRF++: An Open-source Neural Sequence Labeling Toolkit. arXiv preprint arXiv:1806.05626 (2018).
- 37.GenScript. GenScript Rare Codon Analysis Tool [EB/OL], 2002. https://www.genscript.com/tools/rare-codon-analysis. Accessed 10 Mar 2020.
- 38.Zhou Z, Schnake P, Xiao L, Lal AA. Enhanced expression of a recombinant malaria candidate vaccine in Escherichia coli by codon optimization. Protein Expr. Purif. 2004;34(1):87–94. doi: 10.1016/j.pep.2003.11.006. [DOI] [PubMed] [Google Scholar]
- 39.Tong Y, et al. Enhancement of soluble expression of codon-optimized Thermomicrobium roseum sarcosine oxidase in Escherichia coli via chaperone co-expression. J. Biotechnol. 2016;218:75–84. doi: 10.1016/j.jbiotec.2015.11.018. [DOI] [PubMed] [Google Scholar]
- 40.Wang X, Li X, Zhang Z, Shen X, Zhong F. Codon optimization enhances secretory expression of Pseudomonas aeruginosa exotoxin A in E coli. Protein Exp. Purif. 2010;72(1):101–106. doi: 10.1016/j.pep.2010.02.011. [DOI] [PubMed] [Google Scholar]
- 41.Zhou Z, Schnake P, Xiao L, et al. Enhanced expression of a recombinant malaria candidate vaccine in Escherichia coli by codon optimization. Protein Expr. Purif. 2004;34(1):1–94. doi: 10.1016/j.pep.2003.11.022. [DOI] [PubMed] [Google Scholar]
- 42.Zhang H, et al. PRL3 phosphatase active site is required for binding the putative magnesium transporter CNNM3. Sci. Rep. 2017;7(1):48. doi: 10.1038/s41598-017-00147-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Real R, Vargas JM. The probabilistic basis of Jaccard's index of similarity. Syst. Biol. 1996;45(3):380–385. doi: 10.1093/sysbio/45.3.380. [DOI] [Google Scholar]
- 44.Huang, Z., Xu, W. & Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
- 45.Zhou Z, et al. Development, characterization and immunogenicity of a multi-stage, multivalent plasmodium falciparum vaccine antigen (FALVAC-1A) expressed in Escherichia coli. Hum. Vaccines. 2006;2(1):14–23. doi: 10.4161/hv.2.1.2437. [DOI] [PubMed] [Google Scholar]
- 46.Matter WF, et al. Role of PRL-3, a human muscle-specific tyrosine phosphatase, in angiotensin-II signaling. Biochem. Biophys. Res. Commun. 2001;283(5):1061–1068. doi: 10.1006/bbrc.2001.4881. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.