Machine Learning Enables Accurate Prediction of Asparagine Deamidation Probability and Rate

Jared A Delmar; Jihong Wang; Seo Woo Choi; Jason A Martins; John P Mikhail

doi:10.1016/j.omtm.2019.09.008

2019 Oct 1;15:264–274


PMCID: PMC6923510 PMID: 31890727

Abstract

The spontaneous conversion of asparagine residues to aspartic acid or iso-aspartic acid, via deamidation, is a major pathway of protein degradation and is often seriously disruptive to biological systems. Deamidation has been shown to negatively affect both in vitro stability and in vivo biological function of diverse classes of proteins. During protein therapeutics development, overlooked deamidation liabilities necessitate expensive and time-consuming remediation strategies, sometimes leading to termination of the project. In this paper, we apply machine learning, using the random decision forest method, to a large (n = 776) liquid chromatography-tandem mass spectrometry (LC-MS/MS) dataset of monoclonal antibody peptides to create computational models for the post-translational modification asparagine deamidation. We show that our categorical model predicts antibody deamidation with nearly 5% higher accuracy and a 0.2 higher Matthews correlation coefficient (MCC) than the best currently available models. Surprisingly, our model also matches or outperforms advanced and conventional models on an independent non-antibody dataset. In addition to deamidation probability, we are able to accurately predict deamidation rate (R² = 0.963 and Q² = 0.822), a capability with no peer in current models. This method should enable significant improvement in protein candidate selection, especially in biopharmaceutical development, and can be applied with similar accuracy to enzymes, monoclonal antibodies, next-generation formats, vaccine component antigens, and gene therapy vectors such as adeno-associated virus.

Keywords: deamidation, machine learning, prediction, developability, stability, drug development, therapeutic protein, antibody, mab, IgG

Introduction

Therapeutic proteins are an important and growing class of drugs that includes peptides, such as insulin; cytokines, like erythropoietin; monoclonal antibodies (mAbs), which are among the most successful cancer therapies; next-generation formats, such as antibody-drug conjugates, bispecific antibodies, and fusion proteins; as well as vaccine components and gene therapy vectors. While small molecules comprise the largest class of new drug approvals, nearly 30% of US Food and Drug Administration (FDA) approvals in 2018 were protein based, up from 26% in 2017. As of the writing of this paper, half of new drugs approved by the FDA in 2019 represent biologics.¹

Therapeutic proteins offer new mechanisms of action, higher target specificity, lower toxicity, and longer-acting pharmacokinetics, compared to small molecule drugs.²–⁴ However, the development of therapeutic proteins poses additional challenges. Not only must the drug be effective, but it must also be “developable,” a concept that encompasses many characteristics including high yield and homogeneity from cell culture, high purity drug substance after purification processing, low viscosity and high stability at the high concentrations necessary for drug product, high stability at in vitro long-term storage conditions and in vivo after administration, high target specificity, and, for antibodies, unimpaired neonatal Fc receptor (FcRn) binding.²,⁵ Nearly all of the factors that make a protein drug developable are derived from the amino acid sequence, including site-specific post-translational modifications (PTMs).⁶

In particular, the spontaneous non-enzymatic conversion of asparagine to aspartic acid or iso-aspartic acid via deamidation is a major pathway of protein degradation and is often seriously disruptive to biological systems.⁷–⁹ Deamidation has been shown to negatively affect both in vitro stability and in vivo biological function of diverse classes of proteins. Deamidation has been reported as a critical quality attribute in many monoclonal antibodies due to its impact on biological activity.¹⁰–¹³ In one humanized monoclonal immunoglobulin G1 (IgG1) antibody drug, an asparagine in the heavy-chain complementarity determining region 2 (CDR2) loop was found to deamidate in vivo, which greatly decreased the drug’s efficacy.¹⁴ In another case, heavy-chain CDR deamidation resulted in an almost complete loss of potency and binding activity of a therapeutic monoclonal antibody.¹⁵ In adeno-associated virus, an emerging vector for gene therapy, extensive capsid deamidation has been observed that impacts transduction and correlates with a loss of vector activity.¹⁶ Deamidation of asparagine residues can also significantly affect immunogenicity and efficacy of protein-based vaccines. Specifically, progress to develop next-generation anthrax vaccines has been halted by vaccine instability resulting from asparagine deamidation in anthrax protective antigen.¹⁷–²¹ Even in the nontherapeutic enzyme glucoamylase, used commercially to produce sweeteners and ethanol, asparagine deamidation causes a decrease in enzyme activity and change in thermodynamic stability.²²–²⁴

Prediction of deamidation liability as early as possible in protein drug development is important because many more candidate drugs are proposed than can be tested. For example, typical antibody generation results in hundreds of candidates, which far exceeds the capacity of a drug development organization.²,²⁵ Development of a therapeutic protein is so costly in both money and time that, after an initial assessment for screening, only a single candidate is moved forward in most cases.¹⁶,²⁶,²⁷ Sequence liabilities that are not dealt with as early as possible necessitate more expensive and time-consuming remediation strategies later in development²⁶ and could lead to termination of the project.

Computational tools already exist to facilitate drug candidate screening by the identification of sequence liabilities.⁶,²⁸–⁴⁵ In the case of asparagine deamidation,³⁶,⁴⁰–⁴⁵ currently available tools suffer from several limitations: they provide only a binary (yes, high risk to deamidate; or no, low risk to deamidate) prediction,³⁶,⁴⁰,⁴²,⁴³ require an experimental crystal structure,⁴²–⁴⁵ or are applicable only to antibody asparagines.³⁶,⁴⁰ All offer either no³⁶,⁴⁰,⁴² or low-accuracy⁴¹,⁴³–⁴⁵ predictions of deamidation rate. Oversimplified models tend to overestimate the number of deamidation sites greatly, which leads to over-engineering and rejecting good drug candidates too early in development. On the other hand, these models may also overlook asparagines for which deamidation is rarely observed (such as NK or NW sites), which can lead to costly and ineffective drug development.

In this paper, we apply machine learning to a large (n = 776) liquid chromatography-tandem mass spectrometry (LC-MS/MS) dataset of monoclonal antibody peptides to create accurate random decision forest models for the PTM asparagine deamidation.⁴⁶ We show that our categorical model predicts antibody deamidation likelihood with nearly 5% higher accuracy and a 0.2 higher Matthews correlation coefficient (MCC) than the best currently available models. Surprisingly, our model also matches or outperforms advanced and conventional models on an independent non-antibody dataset, including enzyme, antigen, and viral capsid deamidation sites. In addition to deamidation probability, we are able to accurately predict deamidation rate (R² = 0.963 and Q² = 0.822), a capability with no peer in current models. We provide evidence that our method can be applied with equal accuracy to predict the likelihood and rate of site-specific asparagine deamidation in any protein of interest.

Results and Discussion

Feature Selection

Spontaneous deamidation of asparagine to aspartic acid or iso-aspartic acid proceeds by one of two reaction mechanisms (Figure 1). At neutral to basic pH, the most common route is a nucleophilic attack on the asparagine side chain by the backbone nitrogen of the following (N+1) residue, forming the cyclic succinimide intermediate. Hydrolysis at one of two carbonyls of the succinimide intermediate results in either aspartic acid or iso-aspartic acid. Below pH 5, direct hydrolysis of the asparagine side chain amide to aspartic acid is the dominant reaction.⁸,⁹

Figure 1. Asparagine Deamidation Reaction

(A and B) Spontaneous degradation of asparagine can occur by (A) direct hydrolysis of the side chain to aspartic acid or (B) via a succinimide intermediate, produced by a nucleophilic attack on the side chain carbonyl by the backbone nitrogen of the following (N+1) residue, producing either iso-aspartic acid or aspartic acid. Residues are rendered as sticks, with carbons colored gray for Asn, green for Asp and iso-Asp, and cyan for succinimide (O, red; N, blue).

Both mechanisms have been shown to rely on both the primary and three-dimensional (3D) structure, with the residue immediately following the asparagine residue (N+1) having the largest effect.⁸,⁹,⁴⁷–⁵¹ The residue on the amino side of the asparagine (N−1) was shown to have little to no effect on deamidation rate.⁵⁰,⁵¹ Steric hindrance, conformational space, and electrostatic effects introduced by the N+1 residue may all affect the ability of the side chain and/or backbone to align and form the cyclic intermediate.⁹ As both reaction mechanisms require hydrolysis to form the final aspartic acid or iso-aspartic acid product, availability of water molecules, or a proton donor, and solvent exposure may directly influence the rate of deamidation.⁹ Finally, hydrogen bonding to the side chain or backbone may stabilize asparagine and prevent degradation to aspartic acid.

Taken together, these observations compiled from the literature informed a total of 12 parameters for asparagine deamidation likelihood (Table 1), which our machine-learning models would use to predict deamidation. The N+1 residue was taken into account both as a categorical variable and as the experimental half-life of a synthetic pentapeptide (pentapeptide half-life, pphl) containing the same N−1 and N+1 sequence, measured by Robinson et al.⁵¹ Half-lives were not reported by Robinson et al. for pentapeptides with asparagine in the N+1 position, likely because it is difficult to distinguish between deamidation in the N and N+1 positions in this case. Thus, when N+1 = N, we used an average pphl of 5.7 days.
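The pphl lookup described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `pphl_days`, the dictionary layout, and the two example half-lives are hypothetical placeholders (real values would come from the pentapeptide measurements of Robinson et al.); only the 5.7-day fallback for N+1 = N is taken from the text.

```python
from typing import Dict, Tuple

# Fallback stated in the text: average pphl used when the N+1 residue is Asn.
ASN_NEXT_FALLBACK_DAYS = 5.7


def pphl_days(n_minus_1: str, n_plus_1: str,
              table: Dict[Tuple[str, str], float]) -> float:
    """Return the pentapeptide half-life (days) for an Asn with the given
    N-1 and N+1 neighbours, using the fallback when N+1 is Asn."""
    if n_plus_1 == "N":
        return ASN_NEXT_FALLBACK_DAYS
    # Raises KeyError if the flanking pair is missing from the lookup table.
    return table[(n_minus_1, n_plus_1)]


# Hypothetical placeholder values for illustration only, NOT measured data.
example_table = {("G", "G"): 1.0, ("G", "S"): 2.0}

print(pphl_days("G", "N", example_table))
print(pphl_days("G", "G", example_table))
```

Keeping the fallback as a named constant makes the special case for N+1 = N explicit rather than hiding it as an extra table row.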

Table 1.

Predictors for Asparagine Deamidation Machine Learning Model

Structural / Chemical Category	Parameter
Primary sequence	pentapeptide deamidation half-life (pphl, days)
Primary sequence	categorical N+1 residue
Backbone orientation	backbone dihedral phi (φ, °)
	backbone dihedral psi (ψ, °)
	nucleophilic C-N attack distance (Å)
Side-chain orientation	side-chain dihedral chi1 (χ₁, °)
Side-chain orientation	side-chain dihedral chi2 (χ₂, °)
Solvent accessibility	percent solvent accessibility (PSA, %)
Solvent accessibility	solvent accessible surface area (SASA, Å²)
Hydrogen bonding	hydrogen bonds to side chain (#)
	Asn local secondary structure (Sheet)
	Asn local secondary structure (Loop)
Machine-learning parameter (for regression model only)	categorical model probability output (%)
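Several of the structural predictors in Table 1 are plain geometry on atomic coordinates. The sketch below shows one way a dihedral angle and an attack distance could be computed from 3D positions. It is an illustration under assumptions, not the authors' extraction pipeline: the function names are hypothetical, and the choice of atoms (for example, the Asn side-chain carbonyl carbon and the N+1 backbone nitrogen for the attack distance) is inferred from the mechanism described in the text.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, e.g. an assumed C-N attack distance in angstroms."""
    d = _sub(a, b)
    return math.sqrt(_dot(d, d))


def dihedral_deg(p0, p1, p2, p3) -> float:
    """Signed dihedral angle (degrees) defined by four points.

    For phi, the four atoms would be C(i-1), N(i), CA(i), C(i).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not from any structure) to exercise the functions.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
print(distance((0, 0, 0), (3, 4, 0)))
```

Using `atan2` on the projected vectors keeps the sign of the angle, which matters because φ, ψ, and χ angles are reported over the full −180° to 180° range.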

Statistic	Categorical Model	NG/NN/NS	Lorenzo et al.⁴¹	Robinson et al.⁴³	Jia et al.⁴²
Accuracy	93.8%	88.8%	93.8%	77.5%	95.0%
MCC	0.686	0.619	0.686	0.459	0.687
Precision	60.0%	43.8%	60.0%	28.0%	71.4%
Recall	85.7%	100.0%	85.7%	100.0%	71.4%
Specificity	94.5%	87.7%	94.5%	75.3%	97.3%
Negative predictive value	98.6%	100.0%	98.6%	100.0%	97.3%
Miss rate	14.3%	0.0%	14.3%	0.0%	28.6%
Fallout	5.5%	12.3%	5.5%	24.7%	2.7%
False discovery rate	40.0%	56.3%	40.0%	72.0%	28.6%
False omission rate	1.4%	0.0%	1.4%	0.0%	2.7%

Statistic	Categorical Model	NG/NN/NS	Yan et al.⁴⁰	Lorenzo et al.⁴¹
Accuracy	95.6%	86.8%	83.8%	91.2%
MCC	0.796	0.651	0.388	0.616
Precision	100.0%	50.0%	41.7%	66.7%
Recall	66.7%	100.0%	55.6%	66.7%
Specificity	100.0%	84.7%	88.1%	94.9%
Negative predictive value	95.2%	100.0%	92.9%	94.9%
Miss rate	33.3%	0.0%	44.4%	33.3%
Fallout	0.0%	15.3%	11.9%	5.1%
False discovery rate	0.0%	50.0%	58.3%	33.3%
False omission rate	4.8%	0.0%	7.1%	5.1%

Statistic	Categorical Model
Accuracy	100.0%
MCC	1.000
Precision	100.0%
Recall	100.0%
Specificity	100.0%
Negative predictive value	100.0%
Miss rate	0.0%
Fallout	0.0%
False discovery rate	0.0%
False omission rate	0.0%

Statistic	Categorical Model
Accuracy	93.3%
MCC	0.691
Precision	81.0%
Recall	65.4%
Specificity	97.6%
Negative predictive value	94.8%
Miss rate	34.6%
Fallout	2.4%
False discovery rate	19.0%
False omission rate	5.2%

Residue	N+1	Avg % Deamidation Giles et al.¹⁶	Categorical Model	NG/NN/NS	Lorenzo et al.⁴¹
N254	N	9%	no	yes	yes
N255	H	ND	no	no	yes
N263	G	99%	yes	yes	yes
N304	N	ND	no	yes	yes
N305	N	8%	no	yes	yes
N337	N	ND	no	yes	yes
N384	N	ND	no	yes	yes
N385	G	88%	yes	yes	yes
N410	N	3%	no	yes	yes
N459	T	7%	no	no	no
N498	N	ND	no	yes	yes
N499	N	17%	yes	yes	yes
N500	S	ND	no	yes	yes
N514	G	84%	yes	yes	yes
N517	S	4%	no	yes	yes
N540	G	79%	yes	yes	yes
N599	S	ND	no	yes	yes
N611	R	ND	no	no	no
N670	S	ND	no	yes	yes
N693	S	ND	no	yes	yes
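In the table above, the NG/NN/NS column flags every asparagine whose N+1 residue is G, N, or S and none whose N+1 residue is H, T, or R, which is consistent with a sequence-only motif rule. A minimal sketch of such a rule is shown below; the function name `motif_hits` and the toy sequence are hypothetical, and this is an assumed reading of the conventional model rather than a published implementation.

```python
from typing import List, Tuple

# N+1 residues treated as high risk by the assumed NG/NN/NS rule.
HIGH_RISK_NEXT = frozenset("GNS")


def motif_hits(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, motif) for each Asn followed by G, N, or S."""
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - 1):
        if seq[i] == "N" and seq[i + 1] in HIGH_RISK_NEXT:
            hits.append((i + 1, seq[i:i + 2]))
    return hits


# Toy sequence, not taken from any protein in the paper.
print(motif_hits("ANGKNNSTNH"))  # [(2, 'NG'), (5, 'NN'), (6, 'NS')]
```

Because the rule looks only at the N+1 residue, it cannot separate sites such as N263 (99% deamidation) from N304 (not detected), which is where structure-based predictors add information.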

Experiment ↓ / Prediction →	Predicted positive	Predicted negative
Experimental positive	137	0
Experimental negative	0	639

Experiment ↓ / Prediction →	Predicted positive	Predicted negative
Experimental positive	17	9
Experimental negative	4	165
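The summary statistics reported for each model follow directly from a 2×2 confusion matrix. The sketch below recomputes them for the matrix above (17 true positives, 9 false negatives, 4 false positives, 165 true negatives) using the standard definitions. The function name `confusion_metrics` and the dictionary keys are hypothetical, and the code assumes both classes are present, so that no denominator is zero.

```python
import math
from typing import Dict


def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard binary-classification statistics as fractions (not percent)."""
    total = tp + fn + fp + tn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "miss_rate": fn / (fn + tp),              # 1 - recall
        "fallout": fp / (fp + tn),                # 1 - specificity
        "false_discovery_rate": fp / (fp + tp),   # 1 - precision
        "false_omission_rate": fn / (fn + tn),    # 1 - npv
    }


m = confusion_metrics(tp=17, fn=9, fp=4, tn=165)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

For this matrix the function gives an accuracy of 93.3%, an MCC of 0.691, precision of 81.0%, recall of 65.4%, specificity of 97.6%, and a negative predictive value of 94.8%.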

Experiment ↓ / Prediction →	Predicted positive	Predicted negative
Experimental positive	5	5
Experimental negative	0	37

Experiment ↓ / Prediction →	Predicted positive	Predicted negative
Experimental positive	9	1
Experimental negative	8	29
