Nucleic Acids Research. 2021 Apr 27; 49(W1): W285–W292. doi: 10.1093/nar/gkab295

IPC 2.0: prediction of isoelectric point and pKa dissociation constants

Lukasz Pawel Kozlowski
PMCID: PMC8262712  PMID: 33905510

Abstract

The isoelectric point is the pH at which a particular molecule is electrically neutral due to the equilibrium of positive and negative charges. In proteins and peptides, this depends on the dissociation constants (pKa) of the charged groups of seven amino acids and of the NH3+ and COO− groups at the polypeptide termini. Information regarding the isoelectric point and pKa is extensively used in two-dimensional gel electrophoresis (2D-PAGE), capillary isoelectric focusing (cIEF), crystallisation, and mass spectrometry. Therefore, there is a strong need for the in silico prediction of isoelectric point and pKa values. In this paper, I present Isoelectric Point Calculator 2.0 (IPC 2.0), a web server for the prediction of isoelectric points and pKa values using a mixture of deep learning and support vector regression models. The prediction accuracy (RMSD) of IPC 2.0 for proteins and peptides outperforms previous algorithms: 0.848 versus 0.868 and 0.222 versus 0.405, respectively. Moreover, the IPC 2.0 prediction of pKa using sequence information alone was better than the prediction from structure-based methods (0.576 versus 0.826) and severalfold faster. The IPC 2.0 web server is freely available at www.ipc2-isoelectric-point.org.

Graphical Abstract

Isoelectric Point Calculator 2.0 - prediction of IEP with deep learning.

INTRODUCTION

The isoelectric point (pI or IEP) is an important physicochemical parameter of many compounds, including peptides and proteins, and can be used to estimate the surface charge of molecules under various pH conditions. This physicochemical property has been extensively used in many molecular techniques, such as two-dimensional gel electrophoresis (2D-PAGE) (1,2), capillary isoelectric focusing (3,4), crystallisation (5), and mass spectrometry (MS) (6,7). It should be stressed that for polypeptides, the isoelectric point depends mostly on the acid dissociation constants (pKa) of the ionisable groups of seven charged amino acids: glutamate (γ-carboxyl group), cysteine (thiol group), aspartate (β-carboxyl group), tyrosine (phenol group), lysine (ε-ammonium group), histidine (imidazole side chain), and arginine (guanidinium group). Furthermore, other charged groups can be important, such as the amino- and carboxyl-terminal groups of the polypeptide chain and post-translational modifications (PTMs) that carry charged groups (e.g. phosphorylation and N-terminal acetylation). Moreover, the difference between the theoretical pI and the experimental pI can be related to the ionisation state of individual residues. Some residues are buried inside the protein structure and, therefore, their contribution to the net charge of the whole molecule is marginal. Additionally, the charge of an exposed residue can be neutralised if it is used to form interactions with other residues, such as in non-covalent salt bridges, in which a proton migrates from a carboxylic acid group to a primary amine or to the guanidine group in Arg (in proteins, Lys or Arg are used as the bases and Asp or Glu as the acids; 8–10). However, most of the in silico methods that are currently used for pI estimation are based on simply counting the numbers of charged residues and applying the Henderson-Hasselbalch equation with customised pKa values (11,12). Nevertheless, some attempts to build more sophisticated methods should be acknowledged, such as those using genetic algorithms (13), artificial neural networks (14) and support vector machines (15).
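
To make this baseline concrete, the sketch below illustrates the counting approach in Python: ionisable groups are tallied, their Henderson-Hasselbalch charge contributions are summed, and the pH of zero net charge is found by bisection. The pKa values and the example sequence are illustrative placeholders, not the set used by IPC.

```python
# Minimal sketch of the counting approach: sum the Henderson-Hasselbalch
# charge contributions of all ionisable groups and bisect for the pH at
# which the net charge is zero. The pKa values below are illustrative
# placeholders, not the IPC set.

PKA_ACIDIC = {'D': 3.65, 'E': 4.25, 'C': 8.30, 'Y': 10.07, 'Cterm': 3.55}
PKA_BASIC = {'K': 10.53, 'R': 12.48, 'H': 6.00, 'Nterm': 7.50}

def net_charge(seq, ph):
    """Net charge of a polypeptide at the given pH."""
    counts = {aa: seq.count(aa) for aa in 'DECYKRH'}
    counts['Cterm'] = counts['Nterm'] = 1
    negative = sum(counts[g] / (1.0 + 10 ** (pka - ph)) for g, pka in PKA_ACIDIC.items())
    positive = sum(counts[g] / (1.0 + 10 ** (ph - pka)) for g, pka in PKA_BASIC.items())
    return positive - negative

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection: find the pH at which the net charge crosses zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid        # still positively charged, so the pI is higher
        else:
            hi = mid
    return round((lo + hi) / 2.0, 3)

print(isoelectric_point("MKWVTFISLLFLFSSAYSRGVFRR"))
```

Bisection is sufficient here because the net charge decreases monotonically with pH, so there is exactly one zero crossing.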

While the estimation of the isoelectric point can be considered a challenging task, the prediction of pKa values for individual residues is even more difficult. The average pKa values used for pI prediction have been measured under simplified conditions, such as alanine pentapeptides with a charged residue in the centre (16). This has been done to minimise the contribution from neighbouring residues, but such an approach is of no use for pKa estimation in real proteins, where the influence of surrounding residues must be considered. An additional problem involved in building a reliable pKa prediction algorithm is the scarcity of data (approximately 1000 pKa values in proteins have been measured experimentally (17)). To date, only programs based on protein structure have been available for pKa prediction, such as MCCE (18), H++ (19), Propka (20) and Rosetta pKa (21).

In this work, I present a major update of the original IPC algorithm (available at http://isoelectric.org) (12) that significantly extends its capabilities (Figure 1). The IPC 2.0 web server (available at http://www.ipc2-isoelectric-point.org and mirrored at http://ipc2.mimuw.edu.pl) incorporates two major feature upgrades:

Figure 1.

Overview of the IPC 2.0 architecture. The input (an amino acid sequence in plain format or multiple sequences in FASTA format) is processed by individual machine learning models; separate models are used depending on the prediction task. Isoelectric point prediction for peptides is based on a separable convolution model (four channels representing the one-hot-encoded sequence, AAindex features, amino acid counts, and predictions from IPC 1.0). The protein pI and pKa prediction models use ensembles of low-level models integrated with a support vector regressor. For more details, see Supplementary Figure S1 and ‘Machine Learning Details’ in the Supplementary Material.

  • Prediction of the isoelectric point using state-of-the-art machine learning instead of the relatively simple pKa optimisation used in IPC 1.0

  • Prediction of individual pKa values based solely on sequence features

The features were implemented to address the major drawbacks of IPC according to users’ feedback and to improve prediction accuracy. Additionally, a new, larger dataset for peptides was used. The input to IPC 2.0 is the peptide or protein sequence(s) in one-letter amino acid code (for multiple sequences, the FASTA format is used). For each input sequence, IPC 2.0 runs the machine learning models for the isoelectric point and predicts the pKa dissociation constant for each charged residue and the terminal groups of the polypeptide chain. Additionally, a scatter plot of the predicted isoelectric points versus the molecular weight is presented for all proteins (in total, output from 21 isoelectric point prediction methods). All the prediction results can be downloaded in CSV format for further analysis.

MATERIALS AND METHODS

Datasets

To develop and test the IPC 2.0 server, multiple benchmark datasets were used (Table 1). They can be divided into the following three main sets:

Table 1.

Detailed statistics for the datasets used in IPC 2.0.

IPC2_protein: 2324 entries, split into IPC2_protein_25 (25% test set, 581 entries) and IPC2_protein_75 (75% training set, 1743 entries). The dataset consists of proteins derived from two databases: PIP-DB and SWISS-2DPAGE (22,23). Outliers are defined at a 0.5 pH unit difference threshold between the predicted and experimental isoelectric point. The same protein dataset is used in IPC and IPC 2.0. Average protein size: 387 aa.

IPC2_peptide: 119 092 entries, split into IPC2_peptide_25 (25% test set, 29 774 entries) and IPC2_peptide_75 (75% training set, 89 318 entries). The dataset consists of peptides from HiRIEF high-resolution isoelectric focusing experiments from Branca et al. 2014 (6) and Johansson et al. 2019 (24), merged from seven independent experiments: pH 3.7–4.9 (8713 peptides), 3.7–4.9 (7361 peptides), 3.7–4.9 (35 595 peptides), 3–10 (23 975 peptides), 3–10 (15 000 peptides), 6–11 (36 827 peptides) and 6–9 (38 057 peptides). Average peptide size: 14.6 aa.

IPC2_pKa: 1337 entries, split into IPC2_pKa_25 (test set, 260 entries) and IPC2_pKa_75 (training set, 1079 entries). pKa values from the PKAD database (157 proteins). Due to the small number of samples, the test and training sets were built as follows: 260 pKa values from 34 proteins used in the Rosetta pKa method (21) were selected as the test set; the remaining samples from the PKAD database were used as the training set.

The full datasets were never used directly. First, the sequences were clustered (to remove duplicates and to average the isoelectric point when multiple experimental measurements existed) and then split randomly into 25% and 75% sets (test and training sets, respectively). The training sets were used for training and (hyper)parameter optimisation. The test sets were used only once, to assess the final performance of the models. For the sequences and experimental isoelectric points of the individual datasets, see Supplementary Data 1.

  • Protein dataset: contains 2324 non-redundant proteins with experimentally measured isoelectric points (merged data from the SWISS-2DPAGE and PIP-DB databases (22,23)). This dataset was also used in IPC 1.0. For more details, see (12).

  • Peptide dataset: the datasets from previous studies were merged to form a total of 119 092 non-redundant peptides (6,24,25). Note that this is a different peptide dataset from the one used for IPC 1.0. For more details, see (12).

  • pKa dataset: 1337 pKa values from 157 proteins were obtained from the PKAD database (17).

All datasets were clustered to avoid duplicates and, if needed, to merge and average experimental measurements. Next, the datasets were randomly split into 25% testing sets (used only for the final benchmarks) and 75% training sets (used for machine learning and hyperparameter optimisation). All benchmarks presented for the individual datasets were calculated with 10-fold cross-validation.
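
A minimal sketch of this splitting and validation scheme, using scikit-learn, is shown below; the feature vectors, target values and random seed are placeholders, and the sequence-clustering step that precedes the split is assumed to have been done already.

```python
# Sketch of the 75/25 split and 10-fold cross-validation described above.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.random.rand(1000, 25)              # placeholder feature vectors
y = np.random.uniform(3, 12, size=1000)   # placeholder experimental pI values

# 75% training / 25% held-out test set; the test set is used only once
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 10-fold cross-validation on the training portion only
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for tr_idx, va_idx in kfold.split(X_train):
    X_tr, X_va = X_train[tr_idx], X_train[va_idx]
    y_tr, y_va = y_train[tr_idx], y_train[va_idx]
    # ... fit a model on (X_tr, y_tr) and validate on (X_va, y_va)
```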

Feature generation

The most important features for isoelectric point prediction are the sequence itself; the number of charged residues; the amino acid type at the C- and N-termini; and the isoelectric point predicted by simple methods using the Henderson–Hasselbalch equation and pKa value sets, such as IPC (12), Bjellqvist (26) and DTASelect (27). To engineer additional features, the AAindex (with 566 matrices for the 20 standard amino acids) was scanned (28). To select the most informative features, univariate feature selection with regression (f_regression) and mutual information (mutual_info_regression) was used (up to 10 highest-scoring features were selected using the SelectKBest function from Scikit-learn) (29). For the pKa models, the AAindex scores were calculated using kmers of different sizes centred on a charged amino acid (for a pentamer, e.g. xxRxx, this may correspond to ALRWI, GIRAA, WRRIL, etc.; for more details, see the ‘Machine Learning Details’ section in the Supplementary Material). It is important to stress that local protein features, such as secondary structure and solvent accessibility, are valid only for protein sequences. They are irrelevant for short peptides, where, for instance, the use of a mass spectrometer disrupts any higher-order structure of the molecule. Therefore, pKa predictions should be considered valid only for proteins (>50 amino acids).
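
The following sketch illustrates the univariate selection step with scikit-learn's SelectKBest; the candidate feature matrix is a random placeholder standing in for AAindex-derived columns (one column per AAindex scale aggregated over the sequence, or over a kmer for pKa).

```python
# Sketch of univariate feature selection: score candidate AAindex-derived
# features against the target (pI or pKa) with f_regression and
# mutual_info_regression, and keep up to the 10 highest-scoring ones.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

X = np.random.rand(500, 566)          # 566 AAindex-derived candidate features (placeholder)
y = np.random.uniform(3, 12, 500)     # placeholder target values

for score_func in (f_regression, mutual_info_regression):
    selector = SelectKBest(score_func=score_func, k=10).fit(X, y)
    top = np.argsort(selector.scores_)[::-1][:10]
    print(score_func.__name__, "top-scoring AAindex columns:", top)
```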

Performance evaluation metrics

The prediction of pKa and pI values is a regression problem. Therefore, metrics such as the root mean square deviation (RMSD), the mean absolute error (MAE), the squared Pearson correlation coefficient (R²), and the number of outliers were used. While the first three metrics are commonly used, the last one requires explanation. The outliers were defined at 0.5 and 0.25 pH unit difference thresholds between the predicted and experimental pI for proteins and peptides, respectively. Thus, if a prediction disagrees with the experimental pI by more than the given threshold, the case is counted as an outlier. The total number of outliers for the individual datasets was used to identify methods that return predictions within a reasonable error range.
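
A short sketch of these metrics (assuming NumPy and treating R² as the squared Pearson correlation) is given below; the outlier threshold is passed per dataset (0.5 pH units for proteins, 0.25 for peptides).

```python
# Sketch of the evaluation metrics: RMSD, MAE, R² and the outlier count at a
# dataset-specific threshold.
import numpy as np

def evaluate(pred, true, outlier_threshold):
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    err = pred - true
    rmsd = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = np.corrcoef(pred, true)[0, 1] ** 2       # squared Pearson correlation
    outliers = int(np.sum(np.abs(err) > outlier_threshold))
    return {"RMSD": rmsd, "MAE": mae, "R2": r2, "outliers": outliers}

print(evaluate([6.1, 4.9, 9.3], [6.0, 5.6, 9.2], outlier_threshold=0.5))
```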

Machine learning

In this work, I try to solve three independent problems: the prediction of the pI of proteins, the pI of peptides, and the pKa of charged residues. Therefore, it is justified to design (at least) three separate models. Any machine learning process begins with the conversion of the input data (in this case, polypeptide chains) into a format that can be used by machine learning packages (here, SciPy (30), sklearn (29), TensorFlow (31) and Keras (32)). The simplest approach is to use one-hot encoding, but it should be stressed that this produces a sparse matrix or vector (in this case, L × 22, where L is the polypeptide length and 22 corresponds to the 20 standard amino acid letters plus one for an unknown amino acid and one for padding). If the input varies in length, some padding (peptides; up to 60 residues) or truncation (proteins; down to 1000 residues) is inevitable. With regard to pI prediction, the number and type of charged groups are most significant; thus, this information (even alone) can be used as the initial vector. Additionally, if possible, the introduction of hand-crafted features is recommended (this is important when the data size is limited). This last step can frequently be omitted because, given a sufficiently large dataset, a deep learning model can learn the features by itself (e.g. the convolution filters for images). Unfortunately, if data are scarce, the prediction accuracy is hampered, and adding hand-crafted features to enrich the input vectors with expert knowledge can be unavoidable.
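
As an illustration, the sketch below one-hot encodes a peptide into the 60 × 22 matrix described above; the alphabet ordering and the padding symbol ('-') are arbitrary choices made for this example.

```python
# Sketch of one-hot sequence encoding: a 60 x 22 matrix for peptides
# (20 amino acids + 'X' for unknown + a padding symbol).
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX-"      # 20 aa + unknown + padding = 22 symbols
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(seq, max_len=60):
    seq = seq[:max_len].upper()
    padded = seq + "-" * (max_len - len(seq))          # pad short sequences
    mat = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(padded):
        mat[pos, INDEX.get(aa, INDEX["X"])] = 1.0      # unknown letters map to 'X'
    return mat

print(one_hot("ACDKRH").shape)    # (60, 22)
```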

In the case considered in this paper, apart from the sequence alone, I used features derived from the sequence (such as charge, length, molecular weight, hydrophobicity, the number of charged amino acids, and the pI predicted by other methods). The input differs according to the problem to be solved and the machine learning technique used but, in general, it consists of two major parts: sequence-related and feature-related. Several machine learning approaches were tested. First, I used optimisation techniques to find the optimal set of seven pKa values for charged residues. In the first version of IPC, basin-hopping with a truncated Newton algorithm (33) was used; here, I used a differential evolution algorithm (34), as it performed significantly better. The population size was set to 50 and all remaining parameters were left at their defaults (SciPy version (30)). Next, having the initial predictions of pI from IPC 1.0, I designed a very simple approach based on support vector regression (SVR) with an RBF kernel and GridSearchCV parameter optimisation. The input vector in this case consisted of 19 predicted isoelectric points. Finally, I progressed to more advanced machine learning techniques, namely deep learning. It is possible to start from simple dense networks (Multi-Layer Perceptron; MLP) with different numbers of dense layers and neurons, interconnected with dropout and with different activation layers (preferably selu and elu). The final architecture for peptide isoelectric point prediction benefits from all the information mentioned above and is based on stacked separable convolution layers. The input is reshaped in the following way. For peptides, the maximal length is 60 amino acids; thus, this defines the main size of the 2D matrix obtained after one-hot encoding (all sequences were padded up to 60, if needed). This results in a matrix of 60 × 22 (20 standard amino acids, X for unknown, and 0 for padding). By analogy with the image processing from which convolution has been adopted, this 60 × 22 matrix can be considered the main image size. Similarly, additional information can be stored in separate channels (for instance, an RGB image has three channels for the values of red, green, and blue). Here, I used four channels. The first channel was, as stated, the one-hot encoded sequence. The second channel stored information about the most informative features from AAindex. In the third and fourth channels, I encoded 1D features: the counts of charged residues and the isoelectric point prediction from IPC 1.0. The input was processed by two separable convolution layers, interconnected with two average pooling layers. The initial kernel size of the filter was set to 22 × 5 to slide across the whole amino acid frame with a window of five amino acids. The feature maps were then flattened and sent to a standard MLP unit: three dense layers (Supplementary Figure S1).
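
The sketch below shows a Keras model loosely following this description (a 60 × 22 × 4 input, two separable convolutions interleaved with average pooling, and a three-layer dense head). Filter counts, padding, pooling sizes and dropout rate are assumptions made for illustration and do not reproduce the published architecture exactly.

```python
# Minimal Keras sketch of a separable-convolution peptide pI model.
# Layer sizes and filter counts are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(60, 22, 4))                    # sequence x alphabet x 4 channels
x = layers.SeparableConv2D(32, kernel_size=(5, 22), padding="same",
                           activation="elu")(inputs)       # 5-residue window over the full alphabet
x = layers.AveragePooling2D(pool_size=(2, 1))(x)
x = layers.SeparableConv2D(64, kernel_size=(5, 1), padding="same",
                           activation="elu")(x)
x = layers.AveragePooling2D(pool_size=(2, 1))(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="selu")(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(64, activation="selu")(x)
outputs = layers.Dense(1, activation="linear")(x)          # predicted isoelectric point

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mean_squared_error")
model.summary()
```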

The machine learning architecture for the prediction of pKa is very different, as here the focus is on a single charged amino acid (and its neighbourhood). Therefore, the input is very limited. I decided to use the information related to kmers of different sizes. With increasing kmer size (from three to 15), the sequence was one-hot encoded together with the amino acid scores for the most informative features from AAindex. This information was used as input for the MLP unit (three dense layers separated by dropout layers). Next, to boost the performance, an ensemble of nine models was used to build the final support vector regression model.
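
The following sketch extracts such centred kmers for every charged residue; the padding of chain ends with '-' and the exact set of window sizes are assumptions made for illustration, and the subsequent one-hot/AAindex encoding is omitted.

```python
# Sketch of the kmer-centred input used for pKa prediction: for every charged
# residue, extract windows of increasing size (3 to 15 residues) centred on it.
CHARGED = set("DECYKRH")

def centred_kmers(seq, sizes=(3, 5, 7, 9, 11, 13, 15)):
    windows = {}
    for pos, aa in enumerate(seq):
        if aa not in CHARGED:
            continue
        kmers = []
        for k in sizes:
            half = k // 2
            start, end = pos - half, pos + half + 1
            kmer = ("-" * max(0, -start)                    # pad the N-terminal end
                    + seq[max(0, start):min(len(seq), end)]
                    + "-" * max(0, end - len(seq)))         # pad the C-terminal end
            kmers.append(kmer)
        windows[(pos, aa)] = kmers
    return windows

for key, kmers in centred_kmers("MKWVTFISLLFLFSSAYSRGVFRRDAH").items():
    print(key, kmers[0], kmers[-1])
```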

In all deep learning models, the ADAM optimiser (35) and hyper-parameter optimisation with RandomizedSearchCV were used. As the optimisation criterion, the mean_squared_error loss function was used. Apart from dropout, 10-fold cross-validation and early stopping were used to estimate the robustness of the predictions and to avoid overfitting. During the fitting process, the training set was randomly split (validation_split = 0.2). Finally, the performance was estimated on the 25% of cases that had been held out. For more details, see ‘Machine Learning Details’ in the Supplementary Material.
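
A minimal Keras training sketch with this setup (Adam, mean squared error, validation_split = 0.2 and early stopping) is shown below; the stand-in model, patience value and placeholder data are assumptions, and the RandomizedSearchCV hyper-parameter search is omitted for brevity.

```python
# Sketch of the common training setup: Adam optimiser, MSE loss, an internal
# 20% validation split and early stopping on the validation loss.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 60, 22, 4).astype("float32")   # placeholder encoded peptides
y = np.random.uniform(3, 12, 1000).astype("float32")    # placeholder target pI values

# Tiny stand-in regression model (any compiled Keras regressor would do here).
model = keras.Sequential([
    keras.Input(shape=(60, 22, 4)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="selu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1)])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="mean_squared_error")
history = model.fit(
    X, y,
    epochs=200,
    batch_size=64,
    validation_split=0.2,        # random 20% of the training data for validation
    callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                             restore_best_weights=True)],
    verbose=0)
print("best validation MSE:", min(history.history["val_loss"]))
```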

Other methods

To benchmark IPC 2.0, multiple other methods were compared. The simplest methods of isoelectric point prediction are based on different pKa sets and the Henderson–Hasselbalch equation (Patrickios (36), Solomon (37), Lehninger (38), EMBOSS (39), Dawson (40), Wikipedia (pKa values as presented on the Wikipedia page in 2005), Toseland (41), Sillero (42), Thurlkill (16), Rodwell (43), DTASelect (27), Nozaki (44), Grimsley (45), Bjellqvist (26), whose method was implemented as the ExPASy ‘Compute pI/Mw Tool’ (46), ProMoST (9) and finally IPC 1.0 (12)). Additionally, machine learning methods, such as PredpI (plain, TMT6 and iTRAQ8 variants) (6) and pIR (15), were also used. Furthermore, IPC 2.0 pKa predictions were compared to those of Rosetta pKa (four variants) (21).

Implementation

The pre-processing, training, and testing of the machine learning models were done in the Python programming language. Among the libraries used, the most important were SciPy (30), sklearn (29), TensorFlow (31), and Keras (32). For the web-server implementation, the Apache server and the PHP programming language were used. In addition, the HTML front-end benefitted from the Twitter Bootstrap and CanvasJS libraries.

RESULTS

Isoelectric point prediction

The isoelectric point prediction of the IPC 2.0 method is based on two separate datasets consisting of proteins and peptides, respectively. The datasets differ in size and in the difficulty of the prediction task. Peptides are much shorter and contain only a small number of charged groups. In contrast, in proteins, multiple additional factors, such as PTMs or solvent accessibility, need to be taken into account; thus, the estimation of the isoelectric point is much more difficult. The results presented in Table 2 show that the IPC 2.0 models performed the best. Moreover, the more information and the more advanced the machine learning technique used, the better the results. The optimisation models (denoted IPC2_protein and IPC2_peptide) use pKa sets that are optimal for calculating pI with the Henderson-Hasselbalch equation (Supplementary Table S1). They perform better than any other method (RMSD of 0.860 by IPC2_protein versus 0.911 by ProMoST and 0.248 by IPC2_peptide versus 0.405 by Bjellqvist). However, from a machine learning point of view, they are very simple (but at the same time very fast).
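
For illustration, the sketch below mimics this optimisation step with SciPy's differential evolution: a set of pKa values is searched so that Henderson-Hasselbalch predictions minimise the RMSD against experimental pI values. The bounds, the small maxiter and the toy training pairs are placeholders, not the published settings.

```python
# Sketch of a pKa-set optimisation in the spirit of IPC2_protein / IPC2_peptide:
# differential evolution searches for the pKa values (side chains plus both
# termini) that minimise the RMSD between Henderson-Hasselbalch predictions
# and experimental pI values.
import numpy as np
from scipy.optimize import differential_evolution

GROUPS = "D E C Y Cterm K R H Nterm".split()     # acidic groups first, then basic

def predicted_pi(seq, pka):
    counts = [seq.count(g) if len(g) == 1 else 1 for g in GROUPS]
    lo, hi = 0.0, 14.0
    while hi - lo > 1e-3:                        # bisection on the net charge
        ph = (lo + hi) / 2.0
        neg = sum(c / (1 + 10 ** (p - ph)) for c, p in zip(counts[:5], pka[:5]))
        pos = sum(c / (1 + 10 ** (ph - p)) for c, p in zip(counts[5:], pka[5:]))
        lo, hi = (ph, hi) if pos - neg > 0 else (lo, ph)
    return (lo + hi) / 2.0

train = [("MKDEEDCKRH", 6.8), ("KKKRRDE", 10.1)]     # toy (sequence, experimental pI) pairs

def rmsd(pka):
    return np.sqrt(np.mean([(predicted_pi(s, pka) - pi) ** 2 for s, pi in train]))

bounds = [(2.0, 13.0)] * len(GROUPS)
result = differential_evolution(rmsd, bounds, popsize=50, maxiter=30, seed=1)
print(dict(zip(GROUPS, np.round(result.x, 2))), "RMSD:", round(result.fun, 3))
```

In practice, the fitted pKa set generalises only as well as the training data allow, which is why the larger, clustered datasets described above matter for this step.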

Table 2.

Isoelectric point prediction accuracy on the held-out 25% datasets

Protein dataseta (left columns): Method, RMSE, MAE, R², Outliersc. Peptide datasetb (right columns): Method, RMSE, MAE, R², Outliersc.
IPC2.protein.svr.19 0.8479 0.5906 0.5934 247 IPC2.peptide.Conv2D 0.2216 0.1216 0.9761 2691
IPC2_protein 0.8608 0.6052 0.5748 251 IPC2.peptide.svr.19 0.2299 0.1155 0.9743 2490
IPC_protein 0.8677 0.6109 0.5760 250 IPC2_peptide 0.2482 0.1394 0.9700 3179
ProMoST 0.9113 0.6444 0.5183 263 Bjellqvist 0.4051 0.2836 0.9204 11639
Toseland 0.9278 0.6537 0.5095 250 Nozaki 0.4083 0.2673 0.9191 9837
Dawson 0.9365 0.6586 0.4977 263 DTASelect 0.4235 0.2796 0.9130 10606
Bjellqvist 0.9369 0.6536 0.5005 260 Thurlkill 0.4466 0.2535 0.9033 7182
Wikipedia 0.9484 0.6795 0.4860 262 Sillero 0.4747 0.2696 0.8907 7607
Rodwell 0.9579 0.6762 0.4706 262 Dawson 0.4910 0.2642 0.8831 6698
Grimsley 0.9588 0.6953 0.4779 265 Wikipedia 0.5178 0.2974 0.8700 8326
Lehninger 0.9617 0.6783 0.4607 266 Grimsley 0.5264 0.3796 0.8656 15956
Solomon 0.9631 0.6746 0.4606 272 Rodwell 0.5855 0.3429 0.8337 9857
pIR 1.0148 0.7556 0.4161 315 Toseland 0.5860 0.3896 0.8335 13152
Nozaki 1.0164 0.7219 0.3980 288 EMBOSS 0.5971 0.3557 0.8271 11022
Thurlkill 1.0250 0.7573 0.3948 302 PredpI-iTRAQ8 0.6302 0.3503 0.8027 12059
DTASelect 1.0278 0.7798 0.3947 319 PredpI-TMT6 0.6365 0.3518 0.7988 12135
EMBOSS 1.0498 0.7757 0.3734 308 PredpI-plain 0.6480 0.3710 0.7913 12813
Sillero 1.0519 0.7694 0.3461 308 IPC_peptide 0.7459 0.4860 0.7302 13599
Patrickios 2.3764 1.8414 <0 517 Solomon 0.7518 0.4929 0.7259 13777
PredpI-TMT6 NA NA NA NA Lehninger 0.7697 0.5209 0.7127 15200
PredpI-plain NA NA NA NA pIR 0.8529 0.7303 0.6387 27158
PredpI-iTRAQ8 NA NA NA NA ProMoST 1.1026 0.7562 0.4104 18513
Patrickios 2.0172 1.3927 <0 22818

aProtein dataset consisting of 581 proteins (25% randomly chosen proteins, not used for the training or optimization).

bPeptide dataset consisting of 29 774 peptides (25% randomly chosen peptides, not used for the training or optimization).

cThe outliers were defined at 0.5 and 0.25 pH unit difference thresholds between the predicted and experimental pI for the protein and peptide datasets, respectively.

NA: The PredpI program was designed for peptides only within the 3.7–4.9 pH range; thus, for proteins, it returned 0 and could not be evaluated on the protein dataset.

New machine learning models developed in this study are in bold. First version of IPC (12) is underscored. Scores calculated after 10-fold cross-validation. Table is sorted by RMSD. For individual methods’ predictions, see Supplementary Data 2. For more details about the datasets, see Table 1.

The next machine learning approach I used was support vector regression (SVR), in which the isoelectric points predicted by other methods were used as input, in a so-called ensemble averaging technique (47,48). The main advantage of SVR is that it has only two parameters that need to be optimised (C and gamma for the RBF kernel). The main disadvantage is that the input features must already be well designed and lie in a similar space (for instance, enriching the input with more heterogeneous features, such as protein length or molecular weight, prevents the algorithm from converging). Not surprisingly, the SVR models performed better than the optimisation versions (RMSD of 0.848 and 0.230 for proteins and peptides, respectively). Another key feature of ensemble averaging is that it can be used to reduce the variance of the predictions. This can be seen in this study in the significant reduction of outliers: the IPC2.peptide.svr.19 model produced only 8.3% outliers, while the input methods produced on average ∼32% outliers, with only one of them reaching 10.6%.
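
A minimal sketch of this ensemble-averaging SVR is given below: the 19 pI predictions form the feature vector and GridSearchCV tunes C and gamma of an RBF-kernel SVR. The parameter grid and the synthetic data are illustrative assumptions.

```python
# Sketch of the ensemble-averaging SVR: 19 isoelectric points predicted by the
# individual Henderson-Hasselbalch-based methods are the input vector, and an
# RBF-kernel SVR with grid-searched C and gamma gives the consensus prediction.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X = np.random.uniform(3, 12, size=(500, 19))          # placeholder: 19 pI predictions per sequence
y = X.mean(axis=1) + np.random.normal(0, 0.2, 500)    # placeholder experimental pI values

grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]},
    scoring="neg_root_mean_squared_error",
    cv=10)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("cross-validated RMSD:", -grid.best_score_)
```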

Finally, it is possible to progress to deep learning techniques in which the amino acid sequence can be used directly by one-hot encoding. Additionally, all remaining hand-crafted features can be added easily (e.g. the most informative features from AAindex; see Supplementary Tables S2–S4). Together with the plethora of ready-to-adapt deep learning architectures, this provides another level of improvement (and training complexity). The final architecture used for peptide pI prediction in IPC 2.0 is based on separable convolution kernels scanning multichannel input. The model obtained an RMSD of 0.222. It should be stressed that IPC 2.0 is also a robust method, as it produces the fewest outliers (247 and 2490 for proteins and peptides, respectively). Moreover, due to the strict methodology for clustering and 10-fold cross-validation, the method does not exhibit significant signs of overfitting (compare Table 2 and Supplementary Table S5). In this study, multiple machine learning models were tested (for details, see Supplementary Table S6).

pKa dissociation constant prediction

The pKa prediction of individual residues is a separate challenge with its own problems that need to be addressed. First, the data are very limited. Moreover, currently available methods (e.g. MCCE, H++ and Rosetta pKa) require a protein structure and are relatively slow (for instance, the Rosetta pKa protocol used for benchmarking in this work requires several hours for a single protein). In contrast, IPC 2.0 uses only sequence-based features and returns results almost instantly with similar accuracy (Table 3 and Supplementary Table S7). The prediction of pKa values is based on an SVR ensemble of nine MLP models that use information derived from kmers of different sizes centred on the charged residue. This approach made it possible to capture sequence fingerprints located in direct proximity to the charged residue. The overall accuracy of the IPC 2.0 pKa prediction was better than that of the Rosetta pKa protocol (0.576 versus 0.826), although some pKa dissociation constants were clearly predicted worse by IPC 2.0. The main sources of misprediction were His and Tyr residues; for Tyr residues at least, this can be explained by the small number of training points.

Table 3.

pKa prediction accuracy on the Rosetta pKa dataset.

Rosetta pKa dataseta (left and right halves of each row): Method, RMSE, MAE, Outliersb.
D (74; 3.45 ± 0.80) Y (17; 10.89 ± 0.82)
IPC2_pKa 0.3883 0.2238 6 Rosetta (Site repack) 0.7750 0.6177 7
Rosetta (Site repack) 0.8193 0.5824 27 Rosetta (Neighbor repack) 0.8370 0.6647 9
Rosetta (Ensemble average) 0.8413 0.5460 25 Rosetta (Standard) 0.9579 0.8000 9
Rosetta (Neighbor repack) 0.8676 0.6378 34 IPC2_pKa 0.9766 0.8261 10
Rosetta (Standard) 1.0651 0.8554 46 Rosetta (Ensemble average) 1.1892 0.9529 13
H (76; 6.58 ± 0.98) K (22; 10.66 ± 0.52)
Rosetta (Site repack) 0.8247 0.6408 31 IPC2_pKa 0.2933 0.1909 2
IPC2_pKa 0.8523 0.5105 27 Rosetta (Neighbor repack) 0.6216 0.5091 7
Rosetta (Neighbor repack) 0.8559 0.6487 32 Rosetta (Standard) 0.6498 0.5046 8
Rosetta (Ensemble average) 1.0244 0.7566 39 Rosetta (Site repack) 0.6705 0.5227 7
Rosetta (Standard) 1.2303 0.9961 50 Rosetta (Ensemble average) 0.7135 0.5364 6
E (71; 4.16 ± 0.80) All (260*)
IPC2_pKa 0.3625 0.1951 7 IPC2_pKa 0.5762 0.3364 54
Rosetta (Neighbor repack) 0.8744 0.5887 29 Rosetta (Site repack) 0.8262 0.6165 102
Rosetta (Standard) 0.8880 0.7324 38 Rosetta (Neighbor repack) 0.8332 0.6185 111
Rosetta (Site repack) 0.9303 0.6549 30 Rosetta (Ensemble average) 0.9207 0.6746 114
Rosetta (Ensemble average) 0.9317 0.6972 34 Rosetta (Standard) 1.0300 0.8296 151

aFor the validation of pKa, the dataset from Kilambi and Gray (2012) was used (260* residues from 34 proteins). The numbers next to the residue type indicate the number of cases and the average pKa value with standard deviation.

bThe outliers are defined at 0.5 pH unit difference between the predicted and experimental pKa threshold.

*The dataset consists of 260 instead of 264 residues due to parsing problems (four missing residues could not be mapped to the protein sequence because of an incorrect residue register). Scores were calculated after 10-fold cross-validation.

IPC 2.0 web server

The IPC 2.0 web server (http://www.ipc2-isoelectric-point.org) takes protein(s) and peptide(s) as input (a single sequence or multiple sequences in FASTA format, up to 10 000 residues in total) and returns predictions of pI and of pKa values for individual charged residues. In addition, the outputs are complemented by virtual 2D-PAGE plots. As the server is capable of handling multiple sequences, a convenient output feature is the CSV format, which can be used for further large-scale analyses (e.g. on the proteome scale (49–52)). Additionally, a standalone version of IPC 2.0 is available on the web server site and as Supplementary Data.

DISCUSSION

In this work, I have presented IPC 2.0, a new web server for isoelectric point and pKa dissociation constant prediction based on sequence information only. It uses state-of-the-art machine learning methodology and represents an improvement on previous methods. Additionally, IPC 2.0 is the first fast pKa prediction method that can estimate pKa values using sequence information alone. It does so within seconds, with a performance level similar to that of more time-consuming, structure-based methods. To boost the performance of IPC 2.0 and to bypass the limited size of the datasets for protein pI and pKa prediction, hand-crafted features were used (Supplementary Tables S6 and S7).

However, users should be aware of some of the IPC 2.0 web server's limitations. First, the isoelectric point and pKa predictions can be significantly distorted when PTMs are present (e.g. phosphorylation). This should be kept in mind when analysing proteins from eukaryotes, which are rich in PTMs (see Supplementary Table S1 in Kozlowski, 2016 [12]). Due to very limited experimental data, it was not possible to develop machine learning models dedicated to PTMs. Another shortcoming of IPC 2.0 is that it uses only sequence information. It is expected that protein structures, which contain more information about a charged residue's neighbourhood, should improve the prediction performance. Such an approach could help to increase the modest performance of the pKa predictions.

The high performance of the IPC 2.0 web server (and standalone version) makes it suitable for large-scale analyses related to the charge of proteins and peptides. I believe that biologists will benefit from this web server with its user-friendly interface.

DATA AVAILABILITY

IPC 2.0 (http://www.ipc2-isoelectric-point.org), including the web service, datasets, standalone scripts, and documentation, has been donated to the public domain. Therefore, it can be freely used for any legal purpose. Note, however, that the machine learning libraries used by some models (sklearn, TensorFlow and Keras) are distributed under the BSD, Apache and MIT licences, respectively. The web server will be available at the given web address for at least 10 years.

Supplementary Material

gkab295_Supplemental_Files

ACKNOWLEDGEMENTS

Lukasz Pawel Kozlowski acknowledges all authors of the previous works related to isoelectric point and pKa sets measurements that were used in this study, with special regards to Rui Mamede Branca and Janne Lehtiö for providing additional peptide datasets.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Centre, Poland [2018/29/B/NZ2/01403]. Funding for open access charge: National Science Centre, Poland [2018/29/B/NZ2/01403].

Conflict of interest statement. None declared.

REFERENCES

  • 1. Klose J. Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A novel approach to testing for induced point mutations in mammals. Humangenetik. 1975; 26:231–243. [DOI] [PubMed] [Google Scholar]
  • 2. O’Farrell P.H. High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 1975; 250:4007–4021. [PMC free article] [PubMed] [Google Scholar]
  • 3. Pernemalm M., Lehtiö J.. A novel prefractionation method combining protein and peptide isoelectric focusing in immobilized pH gradient strips. J. Proteome Res. 2013; 12:1014–1019. [DOI] [PubMed] [Google Scholar]
  • 4. Zhu M., Rodriguez R., Wehr T.. Optimizing separation parameters in capillary isoelectric focusing. J. Chromatogr. A. 1991; 559:479–488. [Google Scholar]
  • 5. Kirkwood J., Hargreaves D., O’Keefe S., Wilson J.. Using isoelectric point to determine the pH for initial protein crystallization trials. Bioinformatics. 2015; 31:1444–1451. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Branca R.M.M., Orre L.M., Johansson H.J., Granholm V., Huss M., Pérez-Bercoff Å., Forshed J., Käll L., Lehtiö J.. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nat. Methods. 2014; 11:59–62. [DOI] [PubMed] [Google Scholar]
  • 7. Cologna S.M., Russell W.K., Lim P.J., Vigh G., Russell D.H.. Combining isoelectric point-based fractionation, liquid chromatography and mass spectrometry to improve peptide detection and protein identification. J. Am. Soc. Mass Spectrom. 2010; 21:1612–1619. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Gauci S., van Breukelen B., Lemeer S.M., Krijgsveld J., Heck A.J.R.. A versatile peptide pI calculator for phosphorylated and N-terminal acetylated peptides experimentally tested using peptide isoelectric focusing. Proteomics. 2008; 8:4898–4906. [DOI] [PubMed] [Google Scholar]
  • 9. Halligan B.D., Ruotti V., Jin W., Laffoon S., Twigger S.N., Dratz E.A.. ProMoST (Protein Modification Screening Tool): a web-based tool for mapping protein modifications on two-dimensional gels. Nucleic Acids Res. 2004; 32:W638–W644. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Pace C.N., Grimsley G.R., Scholtz J.M.. Protein ionizable groups: pK values and their contribution to protein stability and solubility. J. Biol. Chem. 2009; 284:13285–13289. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Po H.N., Senozan N.M.. The Henderson-Hasselbalch equation: its history and limitations. J. Chem. Educ. 2001; 78:1499. [Google Scholar]
  • 12. Kozlowski L.P. IPC - isoelectric point calculator. Biol. Direct. 2016; 11:55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Cargile B.J., Sevinsky J.R., Essader A.S., Eu J.P., Stephenson J.L.. Calculation of the isoelectric point of tryptic peptides in the pH 3.5-4.5 range based on adjacent amino acid effects. Electrophoresis. 2008; 29:2768–2778. [DOI] [PubMed] [Google Scholar]
  • 14. Skvortsov V.S., Alekseychuk N.N., Khudyakov D.V., Romero Reyes I.V.. pIPredict: a computer tool for predicting isoelectric points of peptides and proteins. Biomed. Khim. 2015; 61:83–91. [DOI] [PubMed] [Google Scholar]
  • 15. Perez-Riverol Y., Audain E., Millan A., Ramos Y., Sanchez A., Vizcaíno J.A., Wang R., Müller M., Machado Y.J., Betancourt L.H.et al.. Isoelectric point optimization using peptide descriptors and support vector machines. J. Proteomics. 2012; 75:2269–2274. [DOI] [PubMed] [Google Scholar]
  • 16. Thurlkill R.L., Grimsley G.R., Scholtz J.M., Pace C.N.. pK values of the ionizable groups of proteins. Protein Sci. 2006; 15:1214–1218. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Pahari S., Sun L., Alexov E.. PKAD: a database of experimentally measured pKa values of ionizable groups in proteins. Database (Oxford). 2019; 2019:baz024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Song Y., Mao J., Gunner M.R.. MCCE2: improving protein pKa calculations with extensive side chain rotamer sampling. J. Comput. Chem. 2009; 30:2231–2247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Anandakrishnan R., Aguilar B., Onufriev A.V.. H++ 3.0: automating pK prediction and the preparation of biomolecular structures for atomistic molecular modeling and simulations. Nucleic Acids Res. 2012; 40:W537–W541. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Rostkowski M., Olsson M.H.M., Søndergaard C.R., Jensen J.H.. Graphical analysis of pH-dependent properties of proteins predicted using PROPKA. BMC Struct. Biol. 2011; 11:6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Kilambi K.P., Gray J.J.. Rapid calculation of protein pKa values using Rosetta. Biophys. J. 2012; 103:587–595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Hoogland C., Mostaguir K., Sanchez J.-C., Hochstrasser D.F., Appel R.D.. SWISS-2DPAGE, ten years later. Proteomics. 2004; 4:2352–2356. [DOI] [PubMed] [Google Scholar]
  • 23. Bunkute E., Cummins C., Crofts F.J., Bunce G., Nabney I.T., Flower D.R.. PIP-DB: the protein isoelectric point database. Bioinformatics. 2015; 31:295–296. [DOI] [PubMed] [Google Scholar]
  • 24. Johansson H.J., Socciarelli F., Vacanti N.M., Haugen M.H., Zhu Y., Siavelis I., Fernandez-Woodbridge A., Aure M.R., Sennblad B., Vesterlund M.et al.. Breast cancer quantitative proteome and proteogenomic landscape. Nat. Commun. 2019; 10:1600. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Heller M., Ye M., Michel P.E., Morier P., Stalder D., Jünger M.A., Aebersold R., Reymond F., Rossier J.S.. Added value for tandem mass spectrometry shotgun proteomics data validation through isoelectric focusing of peptides. J. Proteome Res. 2005; 4:2273–2282. [DOI] [PubMed] [Google Scholar]
  • 26. Bjellqvist B., Basse B., Olsen E., Celis J.E.. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis. 1994; 15:529–539. [DOI] [PubMed] [Google Scholar]
  • 27. Tabb D.L., McDonald W.H., Yates J.R.. DTASelect and contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 2002; 1:21–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Kawashima S., Pokarowski P., Pokarowska M., Kolinski A., Katayama T., Kanehisa M.. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008; 36:D202–D205. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V.et al.. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011; 12:2825–2830. [Google Scholar]
  • 30. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J.et al.. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020; 17:261–272. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G.S., Davis A., Dean J., Devin M.. TensorFlow: large-scale machine learning on heterogeneous distributed systems. 2016; arXiv preprint, 16 March 2016, not peer reviewed. https://arxiv.org/abs/1603.04467v2.
  • 32. Chollet F. et al. Keras. 2015; https://github.com/fchollet/keras.
  • 33. Wales D.J., Doye J.P.K.. Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms. J. Phys. Chem. A. 1997; 101:5111–5116. [Google Scholar]
  • 34. Storn R., Price K.. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 1997; 11:341–359. [Google Scholar]
  • 35. Kingma D.P., Ba J.. Adam: a method for stochastic optimization. 2017; arXiv preprint, 30 June 2017, not peer reviewed. https://arxiv.org/abs/1412.6980v3.
  • 36. Patrickios C.S., Yamasaki E.N.. Polypeptide amino acid composition and isoelectric point. II. Comparison between experiment and theory. Anal. Biochem. 1995; 231:82–91. [DOI] [PubMed] [Google Scholar]
  • 37. Graham Solomons T.W., Fryhle C.B., Snyder S.A.. Solomons’ Organic Chemistry. 2017; 12th edn, global edition. Wiley. [Google Scholar]
  • 38. Nelson D.L., Cox M.M.. Lehninger Principles of Biochemistry. 2017; 7th edn. Macmillan Learning. [Google Scholar]
  • 39. Rice P., Longden I., Bleasby A.. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000; 16:276–277. [DOI] [PubMed] [Google Scholar]
  • 40. Dawson R.M.C., Elliott D.C., Elliott W.H., Jones K.M.E.J.. Data for Biochemical Research. 1987; 3rd edn. Oxford: Oxford Science Publications, OUP; 580. [Google Scholar]
  • 41. Toseland C.P., McSparron H., Davies M.N., Flower D.R.. PPD v1.0–an integrated, web-accessible database of experimentally determined protein pKa values. Nucleic Acids Res. 2006; 34:D199–D203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Sillero A., Ribeiro J.M.. Isoelectric points of proteins: theoretical determination. Anal. Biochem. 1989; 179:319–325. [DOI] [PubMed] [Google Scholar]
  • 43. Rodwell J.D. Heterogeneity of component bands in isoelectric focusing patterns. Anal. Biochem. 1982; 119:440–449. [DOI] [PubMed] [Google Scholar]
  • 44. Nozaki Y., Tanford C.. The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxane solutions. Establishment of a hydrophobicity scale. J. Biol. Chem. 1971; 246:2211–2217. [PubMed] [Google Scholar]
  • 45. Grimsley G.R., Scholtz J.M., Pace C.N.. A summary of the measured pK values of the ionizable groups in folded proteins. Protein Sci. 2009; 18:247–251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Wilkins M.R., Gasteiger E., Bairoch A., Sanchez J.C., Williams K.L., Appel R.D., Hochstrasser D.F.. Protein identification and analysis tools in the ExPASy server. Methods Mol. Biol. 1999; 112:531–552. [DOI] [PubMed] [Google Scholar]
  • 47. Qiu X., Zhang L., Ren Y., Suganthan P.N., Amaratunga G.. Ensemble deep learning for regression and time series forecasting. 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL). 2014; 1–6. [Google Scholar]
  • 48. Xiao Y., Wu J., Lin Z., Zhao X.. A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. 2018; 153:1–9. [DOI] [PubMed] [Google Scholar]
  • 49. Schavemaker P.E., Śmigiel W.M., Poolman B.. Ribosome surface properties may impose limits on the nature of the cytoplasmic proteome. Elife. 2017; 6:e30084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Beneke T., Demay F., Hookway E., Ashman N., Jeffery H., Smith J., Valli J., Becvar T., Myskova J., Lestinova T.et al.. Genetic dissection of a Leishmania flagellar proteome demonstrates requirement for directional motility in sand fly infections. PLoS Pathog. 2019; 15:e1007828. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Kozlowski L.P. Proteome-pI: proteome isoelectric point database. Nucleic Acids Res. 2017; 45:D1112–D1116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Méheust R., Watson A.K., Lapointe F.-J., Papke R.T., Lopez P., Bapteste E.. Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution. Genome Biol. 2018; 19:75. [DOI] [PMC free article] [PubMed] [Google Scholar]
