Author manuscript; available in PMC: 2021 Mar 3.
Published in final edited form as: Anal Chem. 2020 Feb 17;92(5):3503–3507. doi: 10.1021/acs.analchem.9b05578

Predicting Electrophoretic Mobility of Proteoforms for Large-Scale Top-Down Proteomics

Daoyang Chen, Rachele A Lubeckyj, Zhichang Yang, Elijah N McCool, Xiaojing Shen, Qianjie Wang, Tian Xu, Liangliang Sun
PMCID: PMC7543059  NIHMSID: NIHMS1634046  PMID: 32043875

Abstract

Large-scale top-down proteomics characterizes proteoforms in cells globally with high confidence and high throughput using reversed-phase liquid chromatography (RPLC)–tandem mass spectrometry (MS/MS) or capillary zone electrophoresis (CZE)–MS/MS. The false discovery rate (FDR) from the target–decoy database search is typically used to filter identified proteoforms to ensure high-confidence identifications (IDs). It has been demonstrated that the FDRs in top-down proteomics can be drastically underestimated. An approach that complements the FDR would therefore be useful for further evaluating the confidence of proteoform IDs after the database search. We argue that accurately predicting the retention/migration time of proteoforms in RPLC/CZE separations and comparing the predicted and experimental separation times could be a useful and practical approach. To our knowledge, there is still no report in the literature on predicting the separation time of proteoforms using large top-down proteomics data sets. In this pilot study, for the first time, we evaluated various semiempirical models for predicting proteoforms’ electrophoretic mobility (μef) using large-scale top-down proteomics data sets from CZE–MS/MS. We achieved a linear correlation between experimental and predicted μef of E. coli proteoforms (R² = 0.98) with a simple semiempirical model that uses the number of charges and the molecular mass of each proteoform as parameters. Our modeling data suggest that the complete unfolding of proteoforms during CZE separation benefits the prediction of their μef. Our results also indicate that N-terminal acetylation and phosphorylation each decrease a proteoform’s charge by roughly one charge unit.



Mass spectrometry (MS)-based top-down proteomics aims to delineate proteoforms in cells comprehensively with high confidence and throughput.1–5 Proteoforms extracted from biological samples are typically separated by reversed-phase liquid chromatography (RPLC) or capillary zone electrophoresis (CZE), followed by electrospray ionization (ESI)–tandem mass spectrometry (MS/MS). A database search is then performed for the identification (ID) of proteoform spectrum matches (PrSMs), proteoforms, and proteins by comparing experimental and theoretical masses of proteoforms and their fragments. To improve the confidence of proteoform ID, the target–decoy database search approach is typically employed,6,7 and the identified PrSMs and proteoforms are filtered by certain false discovery rates (FDRs). Recently, the Kelleher group showed that FDR estimation in top-down proteomics is complicated and that the FDRs can be drastically under-reported.8 High-confidence proteoform and protein IDs are vital. Therefore, after filtering the data with a specific FDR, we need to validate the data further using an approach that complements the FDR.

The retention/migration time of proteoforms in LC/CZE can be useful information for improving the confidence of IDs. Some previous studies have deployed the retention/migration time of proteins and peptides to facilitate their IDs.9–12 We believe that accurate prediction of the retention/migration time of proteoforms will push the use of separation time for ID forward drastically. By comparing the experimentally observed and accurately predicted separation time of proteoforms, we could further boost the confidence of identified proteoforms, determine wrong proteoform IDs, and even provide useful information to correct proteoform IDs.

Some work has been done in predicting migration time (electrophoretic mobility, μef) of peptides from CZE separations.13–21 It has been demonstrated that CZE outperformed RPLC regarding the prediction of migration/retention time of peptides for bottom-up proteomics.21 One major reason is that the size and charge of peptides for CZE can be calculated relatively easily; by contrast, the interaction between peptides and beads for RPLC is complicated.21 Krokhin et al. achieved a linear correlation (R² = 0.995) between predicted and experimental μef of peptides in CZE using a large peptide data set and an optimized semiempirical model,21 which was based on the model reported by Cifuentes et al.,19 eq 1. Note: eq 1 is the modified version of the model in ref 19, and Krokhin et al. started their optimization from this equation for peptides.

$$\mu_{ef} = 900 \times \frac{\ln(1 + 0.35 \times Q)}{M^{0.411}} \qquad (1)$$

In the modified Cifuentes’s model, molecular weight (M) and charge (Q) were used as the parameters. The charge (Q) was equal to the number of positively charged amino acid residues (K, R, H, and N-terminus) in the acidic background electrolyte (BGE) of CZE, for example, 5% (v/v) acetic acid (AA), pH 2.4.21 More recently, we also applied a similar model for predicting the μef of phosphorylated peptides and achieved a high correlation (R² = 0.99) between the predicted and experimental μef for monophosphorylated peptides from the HCT116 cell line.22

Great success has been achieved for predicting μef of peptides, but much more effort needs to be made on proteins/proteoforms. Some initial effort has been made using a handful of standard proteins.17,23,24 However, there is no report about predicting μef of proteins/proteoforms using large-scale proteoform data sets. There are two major reasons for that. First, large-scale top-down proteomics data sets from CZE–MS have been limited. Second, proteins/proteoforms are much larger than peptides, leading to potential difficulties in calculating their size and charge accurately. In the last 5 years, CZE–MS has been recognized as an important approach for large-scale top-down proteomics due to the improvement in CE–MS interfaces, capillary coatings, and online sample stacking techniques.25–32 For instance, we identified nearly 600 proteoforms from an E. coli cell lysate in a single-shot CZE–MS/MS analysis.27 In that study, we employed a commercialized electrokinetically pumped sheath-flow CE–MS interface,33,34 a 1-m-long linear polyacrylamide (LPA)-coated capillary,35 and a dynamic pH junction-based proteoform stacking method36 to boost the sample loading capacity, separation window, and overall sensitivity of the CZE–MS system. In another study, we used a 1.5-m-long LPA-coated capillary for CZE–MS/MS analysis of zebrafish brains and identified thousands of proteoforms in a single analysis with consumption of nanograms of protein material.29 These large-scale proteoform data sets provide us great opportunities to push forward the prediction of μef of proteoforms, which will be useful for improving the confidence of proteoform IDs in top-down proteomics.

Here, we applied previously reported semiempirical mobility models in the prediction of the proteoforms’ μef and evaluated their performance using large proteoform data sets from E. coli cells and zebrafish brains under different CZE conditions. For the zebrafish brain data sets, we used the published data from our group; the detailed experimental conditions are given in ref 29. Briefly, a 1.5-m-long LPA-coated capillary (50/360 μm i.d./o.d.) was used for CZE separation. The BGE was 10% (v/v) AA, pH 2.2. The E. coli data sets were generated for this project. In brief, the E. coli proteins were denatured, reduced, and alkylated, followed by desalting with a C4 trap column according to the procedure in ref 27. The lyophilized protein sample was redissolved in a 50 mM ammonium bicarbonate (NH4HCO3) buffer (pH 8.0) to get a 2 mg/mL protein solution for CZE–MS/MS. A 103-cm-long LPA-coated capillary (50/360 μm i.d./o.d.) was used for CZE. Three different BGEs were tested, including 5% (v/v) AA in water, 20% (v/v) AA in water, and 20% (v/v) AA in water containing 10% (v/v) isopropanol (IPA) and 15% (v/v) dimethylacetamide (DMA). Approximately 400 nL of the sample, equivalent to 800 ng of E. coli proteins, was injected for analysis per CZE–MS/MS run. Technical triplicates were performed for each BGE. The commercialized electrokinetically pumped sheath-flow CE–MS interface from CMP Scientific (Brooklyn, NY) was employed to couple CZE to MS.33,34 For all the experiments, +30 kV was applied at the sample injection end, and +2 kV was applied at the interface for ESI. A Q-Exactive HF mass spectrometer was used. The raw files from E. coli cells were searched against the UniProt database (UP000000625) using TopPIC suite (version 1.2.6).37,38 The identified PrSMs and proteoforms were filtered by a 0.1% FDR and a 0.5% FDR, respectively. The experimental details are described in Supporting Information I.

The migration time (tM) of each identified proteoform was obtained from the database search result. The number of charges (Q) of each proteoform equals the number of positively charged sites within its sequence (K, R, and H residues plus the N-terminus). The molecular mass (M) of each proteoform equals the adjusted mass reported by TopPIC. The length (N) of each proteoform equals the number of amino acid residues within the sequence. Only proteoforms without post-translational modifications (PTMs) were used for calculation of experimental μef and predicted μef. About 500–1100 proteoforms were used for the calculations. The molecular mass of proteoforms ranged from 1.5 kDa to 30 kDa. We also assumed that the electroosmotic flow (EOF) in an LPA-coated capillary with an acidic BGE was extremely low.27 The proteoforms with their experimental and predicted μef are listed in Supporting Information II. The MS raw data have been deposited to the ProteomeXchange Consortium via the PRIDE39 partner repository with the data set identifier PXD017265.
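As a minimal sketch of the bookkeeping described above, the charge Q and length N can be derived from a one-letter amino acid sequence as follows. The function names and the example sequence are hypothetical and are not taken from the study; the molecular mass M is not computed here because the text takes it from the TopPIC output.

```python
def count_charge(sequence: str) -> int:
    """Return Q: the number of K, R, and H residues plus one for the N-terminus.

    Assumes an unmodified proteoform in an acidic BGE, as in the text.
    """
    seq = sequence.upper()
    basic_residues = sum(seq.count(residue) for residue in "KRH")
    return basic_residues + 1  # +1 for the free N-terminus


def sequence_length(sequence: str) -> int:
    """Return N: the number of amino acid residues in the sequence."""
    return len(sequence)


# Hypothetical sequence used only for illustration.
example = "MKRHAAK"
q = count_charge(example)      # 2 K + 1 R + 1 H + N-terminus
n = sequence_length(example)
```

For the hypothetical sequence above, Q is 5 and N is 7.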

First, we calculated the experimental μef using eq 2,

$$\text{experimental } \mu_{ef} = \frac{L}{\dfrac{(30 - 2)}{L} \times t_M} \quad (\text{unit of cm}^2\ \text{kV}^{-1}\ \text{s}^{-1}) \qquad (2)$$

where L is the capillary length in cm, and tM is the migration time in s. The 30 and 2 are the separation voltage and electrospray voltage in kilovolts.
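Eq 2 can be sketched in a few lines of Python. This is an illustrative helper rather than the authors' code: the function and parameter names are assumptions, the default voltages are the +30 kV and +2 kV stated in the text, and the EOF is neglected as assumed in the text.

```python
def experimental_mobility(capillary_length_cm: float,
                          migration_time_s: float,
                          separation_kv: float = 30.0,
                          esi_kv: float = 2.0) -> float:
    """Experimental mobility per eq 2, in cm^2 kV^-1 s^-1.

    Velocity is L / tM and the field strength is (30 - 2) / L in kV/cm,
    so mobility = L / (((30 - 2) / L) * tM).
    """
    if capillary_length_cm <= 0 or migration_time_s <= 0:
        raise ValueError("capillary length and migration time must be positive")
    field_kv_per_cm = (separation_kv - esi_kv) / capillary_length_cm
    return capillary_length_cm / (field_kv_per_cm * migration_time_s)


# Hypothetical migration time of 3000 s on the 103-cm capillary used for E. coli.
mobility = experimental_mobility(103.0, 3000.0)
```

With these illustrative inputs, the result is 103² / (28 × 3000), about 0.126 cm² kV⁻¹ s⁻¹.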

Second, the predicted μef values of proteoforms from the E. coli data sets were calculated using six classical semiempirical models,14–16,18–20 Table 1. For the Cifuentes’s model, we obtained the final eq 3 from eq 1 by omitting the prefactor 900.

$$\mu_{ef} = \frac{\ln(1 + 0.35 \times Q)}{M^{0.411}} \qquad (3)$$

where Q and M are the number of charges and molecular mass of each proteoform.
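A short sketch of eqs 1 and 3 in Python follows. The function name and the `prefactor` argument are illustrative assumptions: a prefactor of 1 corresponds to eq 3 and a prefactor of 900 to eq 1. The natural logarithm is used, as written in the equations.

```python
import math


def cifuentes_mobility(charge: float, mass: float, prefactor: float = 1.0) -> float:
    """Predicted mobility: prefactor * ln(1 + 0.35 * Q) / M**0.411.

    prefactor=1.0 gives eq 3; prefactor=900.0 gives eq 1.
    """
    if mass <= 0:
        raise ValueError("molecular mass must be positive")
    return prefactor * math.log(1 + 0.35 * charge) / mass ** 0.411


# Hypothetical proteoform with Q = 20 and M = 10000 Da, for illustration only.
predicted_eq3 = cifuentes_mobility(20, 10000.0)
predicted_eq1 = cifuentes_mobility(20, 10000.0, prefactor=900.0)
```

Because eq 3 only drops a constant factor, it changes the slope of the predicted versus experimental correlation but not its R².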

Table 1.

Summary of the Linear Correlations between Experimental μef and Predicted μef of E. coli Proteoforms Using Different Semiempirical Models and under Various CZE Conditions^a

BGE 1: 5% (v/v) AA
BGE 2: 20% (v/v) AA
BGE 3: 20% (v/v) AA, 10% (v/v) IPA, 15% (v/v) DMA

| semiempirical model | source | BGE 1 R² | BGE 1 slope | BGE 2 R² | BGE 2 slope | BGE 3 R² | BGE 3 slope |
|---|---|---|---|---|---|---|---|
| ln(1 + 0.35Q)/M^0.411 | Cifuentes and Poppe19,21 | 0.97 | 0.22 | 0.98 | 0.26 | 0.98 | 0.51 |
| ln(1 + Q)/N^0.435 | Grossman et al.18 | 0.76 | 1.72 | 0.82 | 2.1 | 0.82 | 4.4 |
| Q/M^(2/3) | Offord14 | 0.93 | 0.25 | 0.94 | 0.29 | 0.92 | 0.58 |
| Q/M^0.56 | Kim et al.16 | 0.90 | 0.65 | 0.89 | 0.74 | 0.82 | 1.4 |
| Q/M^(1/2) | Tanford15 | 0.86 | 1.1 | 0.84 | 1.2 | 0.74 | 2.3 |
| Q/M^(1/3) | Reynolds et al.20 | 0.72 | 4.6 | 0.69 | 5.2 | 0.52 | 9.0 |

^a Only proteoforms without PTMs were used. The R² and slope values were from the mean of the triplicate CZE–MS/MS runs, and the standard deviations of the R² values from the triplicate analyses were about 0.01.
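The six models in Table 1 and the correlation statistics can be sketched as follows. This is an assumption-laden illustration, not the authors' analysis script: the dictionary and function names are hypothetical, and an ordinary least-squares fit with an intercept is assumed because the text does not state how the slope and R² were computed.

```python
import math

# Mobility models from Table 1, without any prefactor.
# q: number of charges, m: molecular mass, n: number of residues.
MODELS = {
    "Cifuentes and Poppe": lambda q, m, n: math.log(1 + 0.35 * q) / m ** 0.411,
    "Grossman et al.": lambda q, m, n: math.log(1 + q) / n ** 0.435,
    "Offord": lambda q, m, n: q / m ** (2 / 3),
    "Kim et al.": lambda q, m, n: q / m ** 0.56,
    "Tanford": lambda q, m, n: q / m ** 0.5,
    "Reynolds et al.": lambda q, m, n: q / m ** (1 / 3),
}


def linear_fit(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r_squared)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    count = len(x)
    mean_x = sum(x) / count
    mean_y = sum(y) / count
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared
```

Given lists of experimental and predicted μef values for one model and one BGE, `linear_fit(experimental, predicted)` returns the slope and R². Which variable goes on which axis is an assumption here; R² is unaffected by the choice, but the slope is not.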

The Cifuentes’s model produced the best linear correlation between the predicted and experimental μef of proteoforms across the three CZE conditions (R² = 0.97–0.98), followed by the Offord’s model (R² = 0.92–0.94) and Kim’s model (R² = 0.82–0.90). The Reynolds’s model generated the lowest R² values (0.52–0.72). The Cifuentes’s model gave a drastically higher R² than the Grossman’s model (0.97 vs 0.76 for the 5% AA BGE). The two models differ in two respects: M^0.411 vs N^0.435 and 0.35 × Q vs Q. A more detailed study using the 5% AA BGE data showed that the R² of the Grossman’s model could be boosted from 0.76 to 0.94 simply by changing Q to 0.35 × Q, whereas changing N^0.435 to M^0.411 had only a minor effect on R². We note that the slopes of the linear correlation curves from the two best models (the Cifuentes’s model and the Offord’s model) are comparable for the different CZE conditions, e.g., 0.22 vs 0.25 for the 5% AA BGE, and are clearly smaller than those from the other models, suggesting that the predicted μef values from these two models are much smaller than those from the other four models and significantly smaller than the experimental μef. A CZE condition-dependent prefactor can be added to the Cifuentes’s model to match the predicted and experimental μef.
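One possible way to obtain such a condition-dependent prefactor is a least-squares fit through the origin, sketched here. The text does not specify a fitting procedure, so this choice and the function name are assumptions made only for illustration.

```python
def fit_prefactor(predicted, experimental):
    """Least-squares scale factor k that minimizes sum((experimental - k * predicted)**2).

    Closed form: k = sum(p * e) / sum(p * p). One k would be fitted per CZE condition.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    denominator = sum(p * p for p in predicted)
    if denominator == 0:
        raise ValueError("predicted values must not all be zero")
    numerator = sum(p * e for p, e in zip(predicted, experimental))
    return numerator / denominator
```

Multiplying the eq 3 predictions for one BGE by the fitted factor rescales them toward the experimental values without changing R².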

The data here represent the first attempt at predicting the μef of proteoforms using large-scale top-down proteomics data sets. The strong correlation between experimental μef and predicted μef from the simple Cifuentes’s model further implies that the μef of proteoforms in CZE can be predicted easily. The predicted μef values of proteoforms discussed in the following parts were obtained from the Cifuentes’s model.

We evaluated how the BGE of CZE influenced the μef of proteoforms, Figure 1A. When the AA concentration in BGE increased from 5% to 20% and when 10% (v/v) IPA and 15% (v/v) DMA were added into the BGE, the experimental μef of proteoforms decreased. Two possible reasons exist for that phenomenon. First, the lower pH of 20% (v/v) AA and the organic solvents unfold the proteoforms more completely, enlarging the size of proteoforms and reducing their mobility. It has been reported recently that in CZE protein size can increase significantly due to unfolding when the pH of BGE decreases.40 Second, the lower pH of 20% (v/v) AA and the organic solvents further eliminate the residual EOF in the capillary. In addition, when 20% (v/v) AA with or without 10% (v/v) IPA and 15% (v/v) DMA was used as the BGE, a better linear correlation was observed compared to the 5% (v/v) AA (0.98 vs 0.96). For the BGE containing 20% (v/v) AA, 10% (v/v) IPA, and 15% (v/v) DMA, the absolute value of predicted μef is much closer to that of experimental μef compared to the other two BGEs, indicated by the much larger slope of the linear correlation curve (0.51 vs 0.20–0.25). The number of outliers from the BGE containing IPA and DMA is also much smaller compared to the other BGEs. The results suggest that adding some organic solvents to the BGE of CZE could benefit the prediction of μef of proteoforms. There is also some evidence in the literature. For instance, in 2000, Katayama et al. demonstrated that the use of methanol in BGE could improve the correlation between predicted μef and experimental μef of peptides.41 We speculate that the organic solvents (IPA and DMA) in the BGE facilitate the complete unfolding of proteoforms, leading to better prediction of their μef. It has been reported that certain types of polar solvents such as dimethyl sulfoxide (DMSO), dimethylformamide (DMF), and formamide have the ability to unfold proteins.42,43

Figure 1.


Linear correlations between predicted μef and experimental μef of proteoforms from E. coli cells under various CZE conditions (A) and proteoforms from zebrafish optic tectum (TEO) (B, C). For part A, only nonmodified proteoforms were used, and the data were from a single CZE–MS/MS run. For parts B and C, nonmodified, N-terminal acetylated, and monophosphorylated proteoforms were employed. In part B, the charge of proteoforms in the BGE (Q) was calculated by counting the positively charged sites (K, R, H, and N-terminus) regardless of the PTMs. In part C, the charge of proteoforms (Q) was corrected based on their PTMs: each N-terminal acetylation or phosphorylation corresponded to a reduction of one charge.

We then tested the Cifuentes’s model on our published zebrafish brain (optic tectum (Teo)) data and evaluated the performance of the model for predicting μef of proteoforms with certain PTMs (i.e., N-terminal acetylation and phosphorylation). When we only used nonmodified proteoforms, the predicted μef and experimental μef showed a reasonably good linear correlation (R² = 0.96). We then further included the proteoforms with N-terminal acetylation and/or phosphorylation in the analysis. The zebrafish Teo data from one CZE–MS/MS run were used, which included 1163 nonmodified proteoforms, 92 proteoforms with only N-terminal acetylation, 3 proteoforms with one phosphorylation site, and 2 proteoforms with both N-terminal acetylation and one phosphorylation site. In theory, N-terminal acetylation and phosphorylation can each reduce a proteoform’s charge by one charge unit. Figure 1B shows the linear correlation between the experimental and predicted μef for these post-translationally modified proteoforms (97 in total) when Q is calculated regardless of the PTMs. First, the linear correlation is poor (R² = 0.76). Second, it is clear that the addition of one acetyl group or one phosphoryl group to a proteoform can decrease its mobility significantly. After considering the effect of these PTMs on the proteoforms’ charge, we corrected the charge (Q) in the Cifuentes’s model. We achieved a much better linear correlation for the 97 proteoforms with PTMs (R² = 0.92) after we adjusted Q by −1, −1, and −2 for proteoforms with N-terminal acetylation, proteoforms with one phosphorylation site, and proteoforms with both N-terminal acetylation and phosphorylation, respectively, Figure 1C. The results show that the proteoforms’ charge shifts are very close to the theoretical contributions of N-terminal acetylation and phosphorylation. Additionally, the results suggest that the μef of proteoforms with N-terminal acetylation and phosphorylation could be predicted with accuracy approaching that for nonmodified proteoforms (R² = 0.92 vs 0.96). We note that some outliers exist in Figure 1C, for two possible reasons. First, the experimental μef values of these outliers are larger than the predicted values, most likely due to the incomplete unfolding of these proteoforms in the BGE used in the experiment (10% (v/v) AA, pH 2.2). Second, since the proteoform IDs were filtered by a 0.5% FDR, some of the outliers could simply be wrong proteoform IDs.
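The charge correction described above can be sketched as a small helper. The function name and arguments are hypothetical. Applying −1 per phosphorylation site for more than one site is an extrapolation beyond the data in this study, which only included proteoforms with a single phosphorylation site.

```python
def corrected_charge(sequence: str,
                     n_terminal_acetylation: bool = False,
                     phosphorylation_sites: int = 0) -> int:
    """Return Q corrected for PTMs.

    Base Q counts K, R, and H residues plus the N-terminus. Q is then reduced
    by 1 for N-terminal acetylation and by 1 per phosphorylation site
    (the study itself only covers a single phosphorylation site).
    """
    if phosphorylation_sites < 0:
        raise ValueError("phosphorylation_sites must be non-negative")
    seq = sequence.upper()
    base_charge = sum(seq.count(residue) for residue in "KRH") + 1
    adjustment = (1 if n_terminal_acetylation else 0) + phosphorylation_sites
    return base_charge - adjustment


# Hypothetical sequence, for illustration only.
unmodified = corrected_charge("MKRHAAK")
acetylated = corrected_charge("MKRHAAK", n_terminal_acetylation=True)
both = corrected_charge("MKRHAAK", n_terminal_acetylation=True, phosphorylation_sites=1)
```

For the hypothetical sequence, Q goes from 5 (unmodified) to 4 (acetylated) to 3 (acetylated and monophosphorylated), mirroring the −1, −1, and −2 adjustments used for Figure 1C.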

In summary, in this work, for the first time, we evaluated various semiempirical models for predicting proteoforms’ μef using large-scale top-down proteomics data sets. Using a simple semiempirical model, we achieved a linear correlation between experimental μef and predicted μef of E. coli proteoforms (R² = 0.98). We note that some effort has been made on predicting the retention time of proteins in RPLC using simple protein mixtures based on complicated models, producing reasonable correlations between predicted and experimental retention time (R² = 0.86–0.90).11,44,45 We also note that our current study still has some limitations. First, the proteoforms used in this study have masses lower than 30 kDa. Top-down proteomics data sets of large proteoforms from CZE–MS/MS are required to extend the model to a wider mass range of proteoforms. Second, the number of proteoforms with PTMs (i.e., acetylation and phosphorylation) used here is small, fewer than 100. Larger numbers of proteoforms with PTMs are needed to improve the model for post-translationally modified proteoforms.

Supplementary Material

Supporting Information I
Supporting Information II

ACKNOWLEDGMENTS

We thank Prof. Heedeok Hong’s group at the Department of Chemistry of Michigan State University for kindly providing the E. coli cells for this project. We acknowledge support from the National Science Foundation (CAREER Award, Grant DBI-1846913) and the National Institutes of Health (Grant R01GM125991).

Footnotes

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.analchem.9b05578.

Supporting Information I, experimental procedures (PDF)

Supporting Information II, lists of proteoforms used in the study from E. coli or zebrafish brain under different CZE conditions with experimental and predicted electrophoretic mobilities (XLSX)

Complete contact information is available at: https://pubs.acs.org/10.1021/acs.analchem.9b05578

The authors declare no competing financial interest.

Contributor Information

Daoyang Chen, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

Rachele A. Lubeckyj, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

Zhichang Yang, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

Elijah N. McCool, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

Xiaojing Shen, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

Qianjie Wang, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

Tian Xu, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

Liangliang Sun, Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States.

REFERENCES
