Identifying Heat Shock Protein Families from Imbalanced Data by Using Combined Features

Xiao-Yang Jing; Feng-Min Li

2020 Sep 23;2020:8894478. doi: 10.1155/2020/8894478

PMCID: PMC7530508 PMID: 33029195

Abstract

Heat shock proteins (HSPs) are ubiquitous in living organisms. HSPs are essential for cell growth and survival; their main function is to control the folding and unfolding of proteins. According to molecular function and mass, HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100. In this paper, an improved method for HSP prediction is proposed: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected to predict the HSPs with a support vector machine (SVM). To overcome the imbalanced data classification problem, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. The overall accuracy was 99.72% with the balanced dataset in the jackknife test using the optimized combination feature SAAC+DC+CTF+PseACS, which was 4.81 percentage points higher than that obtained on the imbalanced dataset with the same combination feature. The Sn, Sp, Acc, and MCC of HSP families in our predictive model were higher than those in existing methods. This improved method may be helpful for protein function prediction.

1. Introduction

Heat shock proteins (HSPs) are ubiquitous in living organisms. They act as molecular chaperones by facilitating and maintaining proper protein structure and function [1–4]; in addition, they are involved in various cellular processes such as protein assembly, secretion, transportation, and degradation [5, 6]. HSPs are rapidly expressed when cells are exposed to physiological and environmental stresses such as elevated temperature, infection, and inflammation [7, 8]. Since their discovery in 1962 by Ritossa [9], HSPs have been widely studied, including their involvement in cardiovascular disease, diabetes, and cancer [10–14]. According to molecular function and mass, HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100 [15]. These families have different functions. The HSP20 family consists of ATP-independent molecular chaperones that efficiently prevent irreversible aggregation by binding denatured proteins [16]. The HSP70 family is the most highly conserved among the HSP families; it is an ATP-dependent molecular chaperone that is involved in protein folding and remodeling [17]. HSP40 is the cochaperone of HSP70 and participates in DNA binding, protein degradation, intracellular signal transduction, exocytosis, endocytosis, viral infection, apoptosis, and heat shock sensing [18]. HSP90 is another ATP-dependent chaperone that controls protein function and activity by facilitating protein folding, the binding of ligands to their receptors or targets, or the assembly of multiprotein complexes [19]. The HSP100 proteins improve tolerance to temperature and promote the proteolysis of specific cellular substrates and the regulation of transcription [20]. Experimental determination of HSPs is time-consuming and laborious, so an effective computational method for predicting HSPs is needed.
Recently, some computational methods for predicting HSPs have been proposed in the literature. Feng et al. developed a predictor called “iHSP-PseRAAAC” that used the reduced amino acid alphabet (RAAA) as a feature vector; the overall predictive accuracy was 87.42% with the jackknife test [21]. Ahmad et al. used the split amino acid composition (SAAC), the dipeptide composition (DC), and PseAAC [22, 23] to identify HSPs; the highest overall predictive accuracy was 90.7% with the jackknife test [24]. Kumar et al. predicted HSPs and non-HSPs, and the best prediction accuracy was 72.98% by using the dipeptide composition (DC) with a 5-fold cross-validation test [25]. Meher et al. used the G-Spaced Amino Acid Pair Composition (GPC) to predict HSPs; a better result was obtained with the jackknife test [26]. Chen et al. summarized the recent advances in machine learning methods for predicting HSPs [27]. Feature selection is generally essential in classification, and an appropriate integrated feature model generally offers higher accuracy [28]. Hence, hybrid features have been successfully used in recent studies for constructing classifiers [29, 30], and we used them here to enhance performance. In this paper, the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were used to predict the HSPs with the same dataset as investigated by Feng et al. Data imbalance is a recurring problem in developing efficient and reliable prediction systems: with an imbalanced dataset, the classifier tends to favor the majority class. Here, the synthetic minority oversampling technique (SMOTE) was used to address the imbalance. The overall accuracy was 99.72% with the balanced dataset in the jackknife test using the optimized combination feature SAAC+DC+CTF+PseACS, which was 4.81 percentage points higher than that obtained on the imbalanced dataset with the same combination feature.

2. Material and Methods

2.1. Dataset

The benchmark dataset was generated by Feng et al. [21]; the dataset was originally taken from the HSPIR database. In order to reduce homologous bias and redundancy, the program CD-HIT [31] was used to remove those sequences that have ≥40% pairwise sequence identity. A total of 2225 sequences were obtained from the different HSP families: the subset S₁ contains 357 sequences, the subset S₂ contains 1279 sequences, the subset S₃ contains 163 sequences, the subset S₄ contains 283 sequences, the subset S₅ contains 58 sequences, and the subset S₆ contains 85 sequences (see Table 1). The dataset can be freely downloaded from http://lin-group.cn/server/iHSP-PseRAAAC. Two independent datasets were also used: the HGNC dataset and the RICE dataset (see Table 2). The HGNC dataset [32] has 96 human HSPs, and the RICE dataset has 55 rice HSPs, comprising 31 HSPs from Wang et al. [33] and 24 HSPs of a single family (HSP70) from Sarkar et al. [34]. The independent datasets can be freely downloaded from http://cabgrid.res.in:8080/ir-hsp.

Table 1.

The number of sequences in HSP families.

Dataset	Family	Number of HSP samples
S ₁	HSP20	357
S ₂	HSP40	1279
S ₃	HSP60	163
S ₄	HSP70	283
S ₅	HSP90	58
S ₆	HSP100	85
S	Overall	2225


Table 2.

The number of sequences in the independent dataset.

Families	HGNC dataset	RICE dataset (Wang et al.)	RICE dataset (Sarkar et al.)
HSP20	11	14	—
HSP40	49	—	—
HSP60	15	4	—
HSP70	17	7	24
HSP90	4	3	—
HSP100	—	3	—
Total	96	31	24


2.2. The Prediction Model Construction Overview

The prediction model process is illustrated in Figure 1. First, feature parameters were extracted from the HSP sequences. Tests with various feature parameters showed that better prediction results could be obtained by combining the following four: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS). In SAAC, the protein sequence was split into the N-terminus segment and the C-terminus segment according to the golden ratio. Among the four feature parameters, SAAC, DC, and CTF are based on the protein sequence, while PseACS is related to the protein secondary structure. Therefore, the feature parameters cover both sequence and structure information. The four feature parameters were combined, and the synthetic minority oversampling technique (SMOTE) was used to address the imbalance of the dataset. The overall accuracy (OA) was 99.72% with the balanced dataset, and the result demonstrates that the proposed method is superior to the existing methods.

Figure 1. The flowchart of the proposed method. SAAC: split amino acid composition; DC: dipeptide composition; CTF: conjoint triad feature; PseACS: pseudoaverage chemical shift; SMOTE: synthetic minority oversampling technique.

2.3. Feature Extraction Techniques

In order to predict the HSPs, it is very important to choose a classifier and a set of reasonable parameters. In this paper, the split amino acid composition (SAAC), the dipeptide composition (DC) [35], the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were used to predict the HSPs.

2.3.1. Split Amino Acid Composition (SAAC)

Split amino acid composition (SAAC) is a feature extraction method based on AAC. In SAAC, the protein sequence is split into various segments; then, the composition of each segment is counted separately [36–39]. It is well known that the golden ratio is ubiquitous in nature. According to the golden ratio, the protein sequence is divided into the N-terminus segment and the C-terminus segment; the ratio of the N-terminus segment to the C-terminus segment is the golden ratio [40]. This method can be represented as follows:

\begin{matrix} SAAC_{Gr}^{1} = ({AAC}^{N}, {AAC}^{C}), \\ {AAC}^{N} = [x_{1}^{N}, x_{2}^{N}, \dots, x_{i}^{N}, \dots, x_{20}^{N}], \\ {AAC}^{C} = [x_{1}^{C}, x_{2}^{C}, \dots, x_{i}^{C}, \dots, x_{20}^{C}], \\ x_{i}^{N} = \frac{W_{i}}{L_{N}}, \\ x_{i}^{C} = \frac{W_{i}}{L_{C}}, \\ (i = 1, 2, \dots, 20), \end{matrix}

(1)

where Gr¹ is the 1-step segmentation using the golden ratio, N represents the N-terminus, C represents the C-terminus, W_i is the number of occurrences of amino acid i in the corresponding segment, L_N is the length of the N-terminus segment, and L_C is the length of the C-terminus segment.

Applying the same split again to each resulting segment gives SAAC_Gr², SAAC_Gr³, and so on:

\begin{matrix} SAAC_{Gr}^{2} = ({AAC}_{N}^{N}, {AAC}_{N}^{C}, {AAC}_{C}^{N}, {AAC}_{C}^{C}), \\ SAAC_{Gr}^{3} = ({AAC}_{NN}^{N}, {AAC}_{NN}^{C}, {AAC}_{NC}^{N}, {AAC}_{NC}^{C}, {AAC}_{CN}^{N}, {AAC}_{CN}^{C}, {AAC}_{CC}^{N}, {AAC}_{CC}^{C}) . \end{matrix}

(2)
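The golden-ratio segmentation in Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and it assumes that the N-terminus segment receives round(0.618 × L) residues, since the text does not state which segment is the longer one or how the cut point is rounded.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GOLDEN = (5 ** 0.5 - 1) / 2  # about 0.618


def aac(segment):
    """20-dimensional amino acid composition of one segment (W_i / segment length)."""
    if not segment:
        return [0.0] * 20
    # Residues outside the 20 standard letters add to the length but to no count.
    return [segment.count(a) / len(segment) for a in AMINO_ACIDS]


def golden_split(sequence):
    """Split into (N-terminus, C-terminus) segments.

    Assumption: the N-terminus segment gets round(0.618 * L) residues.
    """
    cut = round(len(sequence) * GOLDEN)
    return sequence[:cut], sequence[cut:]


def saac_gr(sequence, steps=1):
    """SAAC after `steps` rounds of golden-ratio splitting (20 * 2**steps values)."""
    segments = [sequence]
    for _ in range(steps):
        # Each existing segment is replaced by its N part followed by its C part,
        # which reproduces the ordering used in Eq. (2).
        segments = [part for seg in segments for part in golden_split(seg)]
    features = []
    for seg in segments:
        features.extend(aac(seg))
    return features
```

With this layout, one step gives a 40-dimensional vector, two steps give 80 dimensions, and three steps give 160.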

2.3.2. Dipeptide Composition (DC)

Dipeptide composition (DC) is a discrete method using sequence neighbor information [27, 41, 42]. The occurrence frequency of each pair of adjacent amino acid residues is computed; the advantage of DC is that it captures some sequence-order information. It can be calculated as follows:

\begin{matrix} P = [f_{1}, f_{2}, f_{3}, \dots, f_{i}, \dots, f_{400}], \\ f_{i} = \frac{m_{i}}{L - 1}, \end{matrix}

(3)

where m_i is the number of occurrences of the ith dipeptide in the protein sequence and L is the length of the protein sequence.
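As a minimal sketch of Eq. (3), the following Python function counts overlapping dipeptides and divides by L − 1. The function name and the alphabetical ordering of the 400 dipeptides are assumptions made for illustration; the paper does not specify an ordering.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical ordering: AA, AC, AD, ..., YY (400 entries).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional vector f_i = m_i / (L - 1) of Eq. (3)."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    total = len(sequence) - 1
    return [counts[d] / total for d in DIPEPTIDES]
```

For the toy sequence "AAAC" there are three overlapping dipeptides (AA, AA, AC), so the AA entry is 2/3 and the AC entry is 1/3.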

2.3.3. Conjoint Triad Feature (CTF)

The conjoint triad feature (CTF) representation was used by Shen et al. [43]. In this method, the properties of one amino acid and its vicinal amino acids were considered. Three continuous amino acids were regarded as a unit. The 20 amino acids are classified into 7 groups based on dipole moments and the volume of the side chains: {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E}, and {C}. Thus, each protein sequence is represented by a 343- (7 × 7 × 7) dimensional vector, where each element of the vector corresponds to the frequency of the corresponding conjoint triad in the protein sequence. The conjoint triad feature (CTF) has successfully predicted enzyme function [44], protein-protein interactions [45], RNA-protein interactions [46], and nuclear receptors [47]. The features of CTF can be formulated as follows:

\begin{matrix} CTF = [x_{1}, x_{2}, x_{3}, \dots, x_{i}, \dots, x_{343}], \\ x_{i} = \frac{n_{i}}{L - 2}, \end{matrix}

(4)

where n_i is the number of occurrences of the ith triad type in the protein sequence and L is the length of the protein sequence.
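The seven-group encoding and Eq. (4) can be sketched as follows. The grouping is taken from the text above; the function name and the way a triad of group indices is mapped to a position in the 343-dimensional vector are assumptions for illustration only.

```python
# The seven classes listed in the text, in the same order.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def conjoint_triad(sequence):
    """Return the 343-dimensional vector x_i = n_i / (L - 2) of Eq. (4)."""
    if len(sequence) < 3:
        raise ValueError("sequence must contain at least three residues")
    counts = [0] * 343
    for i in range(len(sequence) - 2):
        triad = sequence[i:i + 3]
        try:
            a, b, c = (GROUP_OF[residue] for residue in triad)
        except KeyError:
            continue  # triads containing a non-standard residue are skipped
        # Assumed index layout: first group * 49 + second group * 7 + third group.
        counts[a * 49 + b * 7 + c] += 1
    total = len(sequence) - 2
    return [n / total for n in counts]
```

For example, "AGVC" contains the triads AGV (groups 0, 0, 0) and GVC (groups 0, 0, 6), so positions 0 and 6 each receive 0.5.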

2.3.4. Pseudoaverage Chemical Shift (PseACS)

Nuclear magnetic resonance (NMR) plays a unique role in studying the structure of proteins because it provides information on the dynamics of the internal motion of proteins on multiple time scales [48]. Protons are sensitive to their chemical environment. Protons in different chemical environments experience slightly different magnetic fields and therefore absorb at different frequencies; the resonant frequencies of the various protons relative to a standard are called the chemical shift [49]. As an important parameter measured by NMR spectroscopy, the chemical shift has been used as a powerful indicator of protein structure. Several researchers have shown that the averaged chemical shift (ACS) of a particular nucleus in the protein backbone correlates well empirically with its secondary structure [50]. The PseACS web server is accessible at http://202.207.14.87:8032/bioinformation/acACS/index.asp.

For a protein P, each amino acid in the sequence is substituted by its averaged chemical shift, and P can be expressed as follows:

\begin{matrix} P = [A_{1}^{i}, A_{2}^{i}, A_{3}^{i}, \dots, A_{L}^{i}], (i = {}^{15}N, {{}^{13}C}_{α}, {{}^{1}H}_{α}, {{}^{1}H}_{N}), \end{matrix}

(5)

where ¹⁵N stands for nitrogen, ¹³C_α for alpha carbon, ¹H_α for alpha hydrogen, and ¹H_N for hydrogen linked with nitrogen.

After selecting λ = 54 and i = ¹⁵N, ¹³C_α, ¹H_α, ¹H_N, the PseACS can be expressed as follows:

\begin{matrix} ϕ_{i}^{λ} = \frac{1}{L - λ} \sum_{k = 1}^{L - λ} {[A_{k}^{i} - A_{k + λ}^{i}]}^{2}, (i = {}^{15}N, {{}^{13}C}_{α}, {{}^{1}H}_{α}, {{}^{1}H}_{N}; λ < L), \\ PseACS = [ϕ_{i}^{0}, ϕ_{i}^{1}, ϕ_{i}^{2}, \dots, ϕ_{i}^{λ}], (i = {}^{15}N, {{}^{13}C}_{α}, {{}^{1}H}_{α}, {{}^{1}H}_{N}) . \end{matrix}

(6)
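A minimal sketch of Eq. (6) is given below. It assumes that the per-residue averaged chemical shifts are already available as lists of numbers, one list per nucleus; the dictionary keys and function names are hypothetical, and no real chemical shift table is included.

```python
NUCLEI = ("15N", "13CA", "1HA", "1HN")  # hypothetical keys for the four nuclei


def pse_acs_component(shifts, lag):
    """phi^lag for one nucleus: mean squared difference between residues `lag` apart."""
    length = len(shifts)
    if not 0 <= lag < length:
        raise ValueError("lag must satisfy 0 <= lag < L")
    squared = ((shifts[k] - shifts[k + lag]) ** 2 for k in range(length - lag))
    return sum(squared) / (length - lag)


def pse_acs(shifts_by_nucleus, max_lag):
    """Concatenate phi^0 .. phi^max_lag for each of the four nuclei."""
    features = []
    for nucleus in NUCLEI:
        shifts = shifts_by_nucleus[nucleus]
        # As Eq. (6) is written, the lag-0 term is always zero; it is kept here
        # only so that the vector layout mirrors the equation.
        features.extend(pse_acs_component(shifts, lag) for lag in range(max_lag + 1))
    return features
```

Under this layout a maximum lag of 54 would give 4 × 55 = 220 values per protein, and every sequence must be longer than the chosen lag.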

2.4. Synthetic Minority Oversampling Technique (SMOTE)

As shown in Table 1, the number of HSP40 sequences is about 4, 8, 5, 22, and 15 times that of HSP20, HSP60, HSP70, HSP90, and HSP100, respectively. This leads to an imbalanced data classification problem. To overcome it, we used SMOTE, an oversampling approach in which the minority class is oversampled by taking each selected minority sample and creating new synthetic samples along the line segments connecting it to any or all of its K-Nearest Neighbors within the same class [51, 52]. In this paper, SMOTE was used to equalize the number of proteins in the six families. The algorithm is implemented in the Weka software. The SMOTE filter is applied when the data are loaded, using the default parameters; the families are processed in order from the smallest to the largest, and each of the five smaller families is oversampled in turn until it reaches the size of HSP40, the largest of the HSP families.
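The interpolation step of SMOTE can be sketched in plain Python as follows. This is an illustrative reimplementation of the general technique, not the Weka filter used in the paper; the function name, the choice k = 5, and the fixed random seed are assumptions. The family sizes are those of Table 1.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class (Euclidean).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Synthetic samples each family needs to reach the size of HSP40 (Table 1).
FAMILY_SIZES = {"HSP20": 357, "HSP40": 1279, "HSP60": 163,
                "HSP70": 283, "HSP90": 58, "HSP100": 85}
target = max(FAMILY_SIZES.values())
needed = {name: target - size for name, size in FAMILY_SIZES.items()}
```

Here `needed` shows how strongly the smallest families depend on synthetic data: HSP90 would require 1221 synthetic samples to match the 1279 sequences of HSP40.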

2.5. Support Vector Machine (SVM)

The support vector machine (SVM) is a machine learning algorithm based on statistical learning theory. The basic idea of SVM is to map the input data into a high-dimensional Hilbert space and then determine the optimal separating hyperplane [53, 54]. The radial basis function (RBF) kernel was used to obtain the classification hyperplane because of its effectiveness and speed in the training process. The regularization parameter C and the kernel width parameter γ were determined via the grid search method. To handle a multiclass problem, the “one-versus-one (OVO)” and “one-versus-rest (OVR)” methods are generally applied to extend the traditional binary SVM. In this study, the OVO strategy was used; it constructs k × (k − 1)/2 classifiers, each trained with the data from two different classes. SVM has been successfully applied in the field of computational biology and bioinformatics [55–64]. In this paper, the LibSVM package, which can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm, was used to predict HSPs.
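For the six HSP families, the OVO strategy gives 6 × 5/2 = 15 pairwise classifiers. The short sketch below enumerates them, builds a (C, γ) grid, and shows majority voting over pairwise decisions. The grid values are an illustrative power-of-two grid and are not the values searched in the paper; the function name is hypothetical, and no SVM training is performed.

```python
from collections import Counter
from itertools import combinations, product

FAMILIES = ["HSP20", "HSP40", "HSP60", "HSP70", "HSP90", "HSP100"]

# One binary classifier per unordered pair of families: k * (k - 1) / 2 pairs.
pairs = list(combinations(FAMILIES, 2))

# Illustrative grid (assumption): C = 2^-5 .. 2^15 and gamma = 2^-15 .. 2^3.
C_GRID = [2.0 ** e for e in range(-5, 16, 2)]
GAMMA_GRID = [2.0 ** e for e in range(-15, 4, 2)]
grid = list(product(C_GRID, GAMMA_GRID))


def ovo_predict(pairwise_winners):
    """Majority vote over the family chosen by each pairwise classifier."""
    return Counter(pairwise_winners).most_common(1)[0][0]
```

In a grid search, each (C, γ) pair in `grid` would be scored by cross-validation and the best-scoring pair kept.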

2.6. Performance Evaluation

In statistical prediction, three cross-validation tests are commonly used to examine a predictor for its effectiveness in practical application: the k-fold cross-validation (subsampling test), the independent dataset test, and the jackknife test. Among the three methods, the jackknife test is deemed the most objective and rigorous one. In the jackknife test, each sample in the training dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated from the remaining samples, excluding the one being identified. Hence, the jackknife test was used to evaluate performance in this paper. To evaluate the predictive capability and reliability of our model, the following measures were used: sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC), and overall accuracy (OA) [65–75]. They are defined as follows:

\begin{matrix} Sn = \frac{TP}{TP + FN}, \\ Sp = \frac{TN}{TN + FP}, \\ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TN + FN) \times (TP + FN) \times (TN + FP)}}, \\ Acc = \frac{TP + TN}{TP + TN + FP + FN}, \\ OA = \sum_{i = 1}^{m} {TP}_{i} / N, \end{matrix}

(7)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively; m = 6 is the number of subsets, and N is the total number of sequences of HSP families.
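The measures in Eq. (7) can be computed from a multiclass confusion matrix by treating one family at a time as the positive class. The sketch below assumes rows are true classes and columns are predicted classes; the function names and the toy matrix in the usage note are hypothetical and are not results from the paper.

```python
from math import sqrt


def per_class_metrics(confusion, index):
    """Sn, Sp, Acc and MCC for class `index`, one-versus-rest, as in Eq. (7)."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[index][index]
    fn = sum(confusion[index]) - tp                 # true class, predicted elsewhere
    fp = sum(row[index] for row in confusion) - tp  # other classes, predicted here
    tn = total - tp - fn - fp
    denom = sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / total,
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def overall_accuracy(confusion):
    """OA: correctly predicted samples of all classes divided by N."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total
```

For a toy three-class matrix `[[8, 1, 1], [2, 7, 1], [0, 0, 10]]`, class 0 has TP = 8, FN = 2, FP = 2, and TN = 18, which gives Sn = 0.8, Sp = 0.9, and MCC = 0.7; the overall accuracy is 25/30.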

3. Results and Discussion

3.1. The Predictive Performance of HSPs

In order to investigate the effectiveness of the predictive model, many characteristic parameters were selected to predict the HSPs [76, 77]. Then, the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected to predict the HSPs. Table 3 lists the predictive performance of HSPs using individual features with the SVM classification algorithm without SMOTE; the highest overall accuracy (OA) of an individual parameter is 91.38% with the jackknife test by using PseACS. Individual features identify the families of HSPs with an overall accuracy (OA) ranging from 80.92% to 91.38%.

Table 3.

The predictive results of individual features with the jackknife test by using SVM for HSP families.

Features	Metric	HSP20	HSP40	HSP60	HSP70	HSP90	HSP100	OA (%)
CTF	Sn (%)	74.86	90.92	54.72	67.27	53.85	67.9	80.92
	Sp (%)	95.07	76.19	98.71	96.48	99.86	99.52
	MCC	0.7	0.68	0.63	0.66	0.69	0.75
	Acc (%)	91.79	84.68	95.5	92.75	98.76	98.35
SAAC	Sn (%)	81.07	97.53	58.49	75.9	57.69	74.07	87.25
	Sp (%)	97.7	81.06	99.36	98.26	100	99.48
	MCC	0.81	0.81	0.7	0.78	0.76	0.78
	Acc (%)	95	90.55	96.38	95.41	98.99	98.53
DC	Sn (%)	90.96	96.66	68.55	84.89	63.46	77.78	90.69
	Sp (%)	96.66	90.69	99.11	98.16	100	99.86
	MCC	0.85	0.88	0.75	0.84	0.79	0.86
	Acc (%)	95.73	94.13	96.88	96.47	99.13	99.04
PseACS	Sn (%)	92.37	95.46	75.47	87.41	67.31	83.95	91.38
	Sp (%)	99.01	89.94	98.71	98.16	99.91	99.33
	MCC	0.92	0.86	0.77	0.86	0.79	0.83
	Acc (%)	97.94	93.12	97.02	96.79	99.13	98.76


Figure 2 shows the predictive results of different combined features of HSPs with SVM without SMOTE. The results show that the combined feature of SAAC+DC+CTF+PseACS was better than the other parameters. The overall accuracy (OA) of the combined feature of SAAC+DC+CTF+PseACS was 94.91% with the jackknife test. This result indicated that the combined feature was powerful in predicting HSPs.

Figure 2. Prediction results of different combined features. Numbers denote features: 1 for DC, 2 for CTF, 3 for PseACS, and 4 for SAAC.

Table 4 lists the predictive performance of HSP families using the optimized combination feature SAAC+DC+CTF+PseACS with and without SMOTE. In the models with SMOTE, the Sn, Sp, Acc, and MCC of HSP families improved remarkably. For example, for HSP20 with SMOTE, Sn = 100%, Sp = 99.92%, MCC = 1, and Acc = 99.93%, which are 5.65 percentage points, 1.34 percentage points, 0.08, and 2.04 percentage points higher, respectively, than those without SMOTE. In addition, OA = 99.72% with SMOTE, which is 4.81 percentage points higher than the OA without SMOTE. The results indicate that the combined parameter SAAC+DC+CTF+PseACS with SMOTE was helpful in enhancing predictive performance.

Table 4.

The predictive results of HSPs by using the combined feature of SAAC+DC+CTF+PseACS with and without SMOTE.

Features	SMOTE (Y/N)	Metric	HSP20	HSP40	HSP60	HSP70	HSP90	HSP100	OA (%)
PseACS+DC+SAAC+CTF	Y	Sn (%)	100	98.33	100	100	100	100	99.72
		Sp (%)	99.92	100	99.92	99.82	100	100
		MCC	1	0.99	1	0.99	1	1
		Acc (%)	99.93	99.72	99.93	99.85	100	100
PseACS+DC+SAAC+CTF	N	Sn (%)	94.35	98.89	81.13	90.29	75	91.36	94.91
		Sp (%)	98.58	94.26	99.6	98.84	100	99.9
		MCC	0.92	0.94	0.87	0.90	0.86	0.94
		Acc (%)	97.89	96.93	98.26	97.75	99.4	99.59


3.2. Comparison with Other Algorithms

The predictive performance of our predictive model (SVM), Random Forest (RF) [78], Naive Bayes (NB), and K-Nearest Neighbors (KNN) [79] is shown in Figures 3 and 4. Figure 3 shows clear differences among the algorithms in the Sn, Sp, MCC, and Acc of the HSP families. The Sn values of HSP60, HSP70, HSP90, and HSP100 using SVM and KNN were all 100%. The Sp values of HSP20 using KNN and SVM were similar, and the Sp of HSP40 using SVM and KNN was 100%. The MCC values of HSP20 and HSP90 using SVM and KNN were all 1. The Acc values of HSP20 using KNN and SVM were similar. In addition, Figure 4 shows that the OA with SVM was 99.72%, which was 4.39, 7.07, and 18.99 percentage points higher than that of RF, KNN, and NB, respectively. SVM also obtained the highest values of the other measures. Therefore, the experimental results show that SVM achieved the best performance.

Figure 3. The predictive sensitivity, specificity, MCC, and accuracy of HSPs by using four algorithms.

Figure 4. The predictive overall accuracy of HSPs by using four algorithms.

Figure 5 shows the predictive performance of HSP families using independent datasets. In the HGNC independent dataset, the OA of our predictive model was 98.96%, which was 11.60 and 11.46 percentage points higher than that of PredHSP and ir-HSP, respectively. In the RICE independent dataset, the OA of our predictive model reached 99.31%, which was 4.76 and 2.95 percentage points higher than that of PredHSP and ir-HSP, respectively. This comparison indicates that our prediction model improves both the applicability and the accuracy of HSP prediction.

Figure 5. A comparison of the proposed method for independent datasets.

3.3. Comparison with Existing Methods

In order to evaluate the performance of our predictive model, we made comparisons with existing methods. The method developed by Ahmad et al. did not provide any family-wise accuracy of HSPs, so we compared our model with iHSP-PseRAAAC, PredHSP, and ir-HSP. The results of the comparisons are shown in Table 5. The Sn, Sp, Acc, and MCC of HSP families in our predictive model were higher than those of PredHSP, iHSP-PseRAAAC, and ir-HSP. For example, for HSP20, our predictive model gave Sn = 100%, Sp = 99.92%, MCC = 1, and Acc = 99.93%, which exceeded the values of ir-HSP, PredHSP, and iHSP-PseRAAAC. In addition, in our predictive model, Sn = 100% for all HSP families except HSP40, for which Sn = 98.33%. Furthermore, the overall accuracy was 99.72% in our predictive model. These results indicate that our predictive model was superior to existing methods.

Table 5.

The comparison of the predictive results between this paper and existing methods.

Method	Metric	HSP20	HSP40	HSP60	HSP70	HSP90	HSP100
iHSP-PseRAAAC^a	Sn (%)	87.68	95.31	66.87	79.15	51.72	69.41
	Sp (%)	96.36	84.87	98.93	86.54	99.89	99.84
	MCC	0.82	0.99	0.69	0.54	0.3	0.83
	Acc (%)	—	—	—	—	—	—
PredHSP^b	Sn (%)	92.16	96.09	79.75	91.17	72.41	82.35
	Sp (%)	97.16	86.26	97.24	91.97	99.12	98.08
	MCC	0.87	0.83	0.72	0.71	0.7	0.71
	Acc (%)	96.36	91.91	95.96	91.87	98.43	97.48
ir-HSP^c	Sn (%)	94.63	97.45	67.92	88.49	75	88.89
	Sp (%)	96.61	95.13	98.86	98.84	99.76	99.57
	MCC	0.8718	0.9276	0.7307	0.8871	0.8112	0.8846
	Acc (%)	96.28	96.47	96.61	97.52	99.17	99.17
Our predictive model	Sn (%)	100	98.33	100	100	100	100
	Sp (%)	99.92	100	99.92	99.82	100	100
	MCC	1	0.99	1	0.99	1	1
	Acc (%)	99.93	99.72	99.93	99.85	100	100


^aFeng et al. [21]. ^bKumar et al. [25]. ^cMeher et al. [26].

4. Conclusion

In this work, an optimized classifier for HSP family identification was developed. This model was derived from the SVM machine learning algorithm, and SMOTE was used for the imbalanced data classification problems. The overall accuracy was 99.72% with the balanced dataset and the jackknife test by using the optimized combination feature SAAC+DC+CTF+PseACS. High overall accuracy results indicate that our predictive model is a reliable tool for HSP family prediction. It is known that HSP expression is associated with human diseases, and these families of HSPs have different functions. Therefore, our predictive model will benefit researchers by quickly and effectively identifying HSP families and enabling researchers to design new drugs to achieve the goal of treating diseases.

Acknowledgments

This work was supported by the Natural Science Foundation of Inner Mongolia Autonomous Region of China (2019MS03015) and the National Natural Science Foundation of China (31360206).

Data Availability

The data used to support the findings of this study are available from the supplementary materials.

Conflicts of Interest

The authors declare that there is no conflict of interest.

Authors' Contributions

FM Li conceived the selection of feature parameters. XY Jing carried out the computation and wrote the manuscript. FM Li performed the results analysis. Both authors reviewed the manuscript.

Supplementary Materials

Supplementary 1

The sequence names of HSP families.


Supplementary 2

The sequence names of the independent datasets.


References

1. Liu T., Daniels C. K., Cao S. Comprehensive review on the HSC70 functions, interactions with related molecules and involvement in clinical diseases and therapeutic potential. Pharmacology & Therapeutics. 2012;136(3):354–374. doi: 10.1016/j.pharmthera.2012.08.014.
2. Wu J. M., Liu T. E., Rios Z., Mei Q. B., Lin X. K., Cao S. S. Heat shock proteins and cancer. Trends in Pharmacological Sciences. 2017;38(3):226–256. doi: 10.1016/j.tips.2016.11.009.
3. Feder M. E., Hofmann G. E. Heat-shock proteins, molecular chaperones, and the stress response: evolutionary and ecological physiology. Annual Review of Physiology. 1999;61(1):243–282. doi: 10.1146/annurev.physiol.61.1.243.
4. Qazi S. R., Ul Haq N., Ahmad S., Shakeel S. N. HSEAT: a tool for plant heat shock element analysis, motif identification and analysis. Current Bioinformatics. 2020;15(3):196–203. doi: 10.2174/1574893614666190102151956.
5. Chatterjee S., Burns T. F. Targeting heat shock proteins in cancer: a promising therapeutic approach. International Journal of Molecular Sciences. 2017;18(9):p. 1978. doi: 10.3390/ijms18091978.
6. Jolly C., Morimoto R. I. Role of the heat shock response and molecular chaperones in oncogenesis and cell death. Journal of the National Cancer Institute. 2000;92(19):1564–1572. doi: 10.1093/jnci/92.19.1564.
7. Khadir A., Kavalakatt S., Cherian P., et al. Physical exercise enhanced heat shock protein 60 expression and attenuated inflammation in the adipose tissue of human diabetic obese. Frontiers in Endocrinology. 2018;9:p. 16. doi: 10.3389/fendo.2018.00016.
8. Ikwegbue P. C., Masamba P., Mbatha L. S., Oyinloye B. E., Kappo A. P. Interplay between heat shock proteins, inflammation and cancer: a potential cancer therapeutic target. American Journal of Cancer Research. 2019;9(2):242–249.
9. Ritossa F. A new puffing pattern induced by temperature shock and DNP in Drosophila. Experientia. 1962;18(12):571–573. doi: 10.1007/BF02172188.
10. Rodríguez-Iturbe B., Johnson R. J. Heat shock proteins and cardiovascular disease. Physiology International. 2018;105(1):19–37. doi: 10.1556/2060.105.2018.1.4.
11. Zilaee M., Shirali S. Heat shock proteins and diabetes. Canadian Journal of Diabetes. 2016;40(6):594–602. doi: 10.1016/j.jcjd.2016.05.016.
12. Lianos G. D., Alexiou G. A., Mangano A., et al. The role of heat shock proteins in cancer. Cancer Letters. 2015;360(2):114–118. doi: 10.1016/j.canlet.2015.02.026.
13. Zhao T., Hu Y., Peng J., Cheng L. DeepLGP: a novel deep learning method for prioritizing lncRNA target genes. Bioinformatics. 2020. doi: 10.1093/bioinformatics/btaa428.
14. Liang C., Changlu Q., He Z., Tongze F., Xue Z. gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Research. 2019;48(D1):D554–D560. doi: 10.1093/nar/gkz843.
15. Nagarajan N. S., Arunraj S. P., Sinha D., Rajan V. B., Esthaki V. K., D’Silva P. HSPIR: a manually annotated heat shock protein information resource. Bioinformatics. 2012;28(21):2853–2855. doi: 10.1093/bioinformatics/bts520.
16. Mahmood T., Safdar W., Abbasi B. H., Naqvi S. S. An overview on the small heat shock proteins. African Journal of Biotechnology. 2010;9(7):927–939.
17. Genest O., Wickner S., Doyle S. M. Hsp 90 and Hsp 70 chaperones: collaborators in protein remodeling. The Journal of Biological Chemistry. 2019;294(6):2109–2120. doi: 10.1074/jbc.REV118.002806.
18. Chen T., Lin T. H., Li H. M., et al. Heat shock protein 40 (HSP40) in pacific white shrimp (Litopenaeus vannamei): molecular cloning, tissue distribution and ontogeny, response to temperature, acidity/alkalinity and salinity stresses, and potential role in ovarian development. Frontiers in Physiology. 2018;9:p. 1784. doi: 10.3389/fphys.2018.01784.
19. Schopf F. H., Biebl M. M., Buchner J. The HSP90 chaperone machinery. Nature Reviews. Molecular Cell Biology. 2017;18(6):345–360. doi: 10.1038/nrm.2017.20.
20. Schirmer E. C., Glover J. R., Singer M. A., Lindquist S. HSP100/Clp proteins: a common mechanism explains diverse functions. Trends in Biochemical Sciences. 1996;21(8):289–296. doi: 10.1016/S0968-0004(96)10038-4.
21. Feng P. M., Chen W., Lin H., Chou K. C. iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Analytical Biochemistry. 2013;442(1):118–125. doi: 10.1016/j.ab.2013.05.024.
22. Du P. F., Zhao W., Miao Y. Y., Wei L. Y., Wang L. UltraPse: a universal and extensible software platform for representing biological sequences. International Journal of Molecular Sciences. 2017;18(11):p. 2400. doi: 10.3390/ijms18112400.
23. Wang J., Du P. F., Xue X. Y., et al. VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences. Bioinformatics. 2019;36(4):1277–1278. doi: 10.1093/bioinformatics/btz689.
24. Ahmad S., Kabir M., Hayat M. Identification of heat shock protein families and J-protein types by incorporating dipeptide composition into Chou’s general PseAAC. Computer Methods and Programs in Biomedicine. 2015;122(2):165–174. doi: 10.1016/j.cmpb.2015.07.005.
25. Kumar R., Kumari B., Kumar M. PredHSP: sequence based proteome-wide heat shock protein prediction and classification tool to unlock the stress biology. PLoS One. 2016;11(5):p. e0155872. doi: 10.1371/journal.pone.0155872.
26. Meher P. K., Sahu T. K., Gahoi S. ir-HSP: improved recognition of heat shock proteins, their families and sub-types based on g-spaced di-peptide features and support vector machine. Frontiers in Genetics. 2018;8:p. 235. doi: 10.3389/fgene.2017.00235.
27. Chen W., Feng P., Liu T., Jin D. Recent advances in machine learning methods for predicting heat shock proteins. Current Drug Metabolism. 2019;20(3):224–228. doi: 10.2174/1389200219666181031105916.
28. Li L. Q., Yu S. J., Xiao W. D., et al. Prediction of bacterial protein subcellular localization by incorporating various features into Chou’s PseAAC and a backward feature selection approach. Biochimie. 2014;104:100–107. doi: 10.1016/j.biochi.2014.06.001.
29. Zhang L. N., Zhang C. J. JPPRED: prediction of types of J-proteins from imbalanced data using an ensemble learning method. BioMed Research International. 2015;2015:12. doi: 10.1155/2015/705156.
30. Li F. M., Wang X. Q. Identifying anticancer peptides by using improved hybrid compositions. Scientific Reports. 2016;6(1):p. 33910. doi: 10.1038/srep33910.
31. Li W. Z., Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. doi: 10.1093/bioinformatics/btl158.
32. Kampinga H. H., Hageman J., Vos M. J., et al. Guidelines for the nomenclature of the human heat shock proteins. Cell Stress & Chaperones. 2009;14(1):105–111. doi: 10.1007/s12192-008-0068-7.
33.Wang Y., Lin S., Song Q., et al. Genome-wide identification of heat shock proteins (Hsps) and Hsp interactors in rice: Hsp70s as a case study. BMC Genomics. 2014;15(1):p. 344. doi: 10.1186/1471-2164-15-344.
34.Sarkar N. K., Kundnani P., Grover A. Functional analysis of Hsp70 superfamily proteins of rice (Oryza sativa). Cell Stress & Chaperones. 2013;18(4):427–437. doi: 10.1007/s12192-012-0395-6.
35.Zhu X. J., Feng C. Q., Lai H. Y., Chen W., Lin H. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowledge-Based Systems. 2019;163:787–793. doi: 10.1016/j.knosys.2018.10.007.
36.Ahmad K., Waris M., Hayat M. Prediction of protein submitochondrial locations by incorporating dipeptide composition into Chou’s general pseudo amino acid composition. The Journal of Membrane Biology. 2016;249(3):293–304. doi: 10.1007/s00232-015-9868-8.
37.Arif M., Hayat M., Jan Z. iMem-2LSAAC: a two-level model for discrimination of membrane proteins and their types by extending the notion of SAAC into Chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2018;442:11–21. doi: 10.1016/j.jtbi.2018.01.008.
38.Tahir M., Hayat M. iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou’s PseAAC. Molecular BioSystems. 2016;12(8):2587–2593. doi: 10.1039/C6MB00221H.
39.Saravanan V., Lakshmi P. T. V. Dualpred: a webserver for predicting plant proteins dual-targeted to chloroplast and mitochondria using split protein-relatedness-measure feature. Current Bioinformatics. 2015;10(3):323–331. doi: 10.2174/1574893609666140226000041.
40.Dai Q., Ma S., Hai Y. B., Yao Y. H., Liu X. Q. A segmentation based model for subcellular location prediction of apoptosis protein. Chemometrics and Intelligent Laboratory Systems. 2016;158:146–154. doi: 10.1016/j.chemolab.2016.09.005.
41.Yang W., Zhu X. J., Huang J., Ding H., Lin H. A brief survey of machine learning methods in protein sub-Golgi localization. Current Bioinformatics. 2019;14(3):234–240. doi: 10.2174/1574893613666181113131415.
42.Tan J. X., Li S. H., Zhang Z. M., et al. Identification of hormone binding proteins based on machine learning methods. Mathematical Biosciences and Engineering. 2019;16(4):2466–2480. doi: 10.3934/mbe.2019123.
43.Shen J. W., Zhang J., Luo X. M., et al. Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences. 2007;104(11):4337–4341. doi: 10.1073/pnas.0607879104.
44.Wang Y. C., Wang Y., Yang Z. X., Deng N. Y. Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context. BMC Systems Biology. 2011;5(S1):p. S6. doi: 10.1186/1752-0509-5-S1-S6.
45.Wang J., Zhang L., Jia L., Ren Y., Yu G. Protein-protein interactions prediction using a novel local conjoint triad descriptor of amino acid sequences. International Journal of Molecular Sciences. 2017;18(11):p. 2373. doi: 10.3390/ijms18112373.
46.Wang H. C., Wu P. F. Prediction of RNA-protein interactions using conjoint triad feature and chaos game representation. Bioengineered. 2018;9(1):242–251. doi: 10.1080/21655979.2018.1470721.
47.Wang H. C., Hu X. H. Accurate prediction of nuclear receptors with conjoint triad feature. BMC Bioinformatics. 2015;16(1):p. 402. doi: 10.1186/s12859-015-0828-1.
48.Calligari P., Abergel D. Multiple scale dynamics in proteins probed at multiple time scales through fluctuations of NMR chemical shifts. The Journal of Physical Chemistry. B. 2014;118(14):3823–3831. doi: 10.1021/jp412125d.
49.Fan G. L., Li Q. Z. Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2012;304:88–95. doi: 10.1016/j.jtbi.2012.03.017.
50.Sibley A. B., Cosman M., Krishnan V. V. An empirical correlation between secondary structure content and averaged chemical shifts in proteins. Biophysical Journal. 2003;84(2):1223–1227. doi: 10.1016/S0006-3495(03)74937-6.
51.Yang R. T., Zhang C., Gao R., Zhang L. N. A novel feature extraction method with feature selection to identify Golgi-resident protein types from imbalanced data. International Journal of Molecular Sciences. 2016;17(2):p. 218. doi: 10.3390/ijms17020218.
52.Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357. doi: 10.1613/jair.953.
53.Cheng L. Computational and biological methods for gene therapy. Current Gene Therapy. 2019;19(4):210. doi: 10.2174/156652321904191022113307.
54.Cheng L., Hu Y. Human disease system biology. Current Gene Therapy. 2018;18(5):255–256. doi: 10.2174/1566523218666181010101114.
55.Su W. X., Li Q. Z., Zhang L. Q., et al. Gene expression classification using epigenetic features and DNA sequence composition in the human embryonic stem cell line H1. Gene. 2016;592(1):227–234. doi: 10.1016/j.gene.2016.07.059.
56.Manavalan B., Shin T. H., Lee G. PVP-SVM: sequence-based prediction of phage virion proteins using a support vector machine. Frontiers in Microbiology. 2018;9:p. 476. doi: 10.3389/fmicb.2018.00476.
57.Lai H. Y., Zhang Z. Y., Su Z. D., et al. iProEP: a computational predictor for predicting promoter. Molecular Therapy-Nucleic Acids. 2019;17:337–346. doi: 10.1016/j.omtn.2019.05.028.
58.Li X., Tang Q., Tang H., Chen W. Identifying antioxidant proteins by combining multiple methods. Frontiers in Bioengineering and Biotechnology. 2020;8:p. 858. doi: 10.3389/fbioe.2020.00858.
59.Zhao W., Li G. P., Wang J., Zhou Y. K., Gao Y., Du P. F. Predicting protein sub-Golgi locations by combining functional domain enrichment scores with pseudo-amino acid compositions. Journal of Theoretical Biology. 2019;473:38–43. doi: 10.1016/j.jtbi.2019.04.025.
60.Yang Y. H., Ma C., Wang J. S., et al. Prediction of N7-methylguanosine sites in human RNA based on optimal sequence features. Genomics. 2020;112(6):4342–4347. doi: 10.1016/j.ygeno.2020.07.035.
61.Liu M. L., Su W., Guan Z. X., et al. An overview on predicting protein subchloroplast localization by using machine learning methods. Current Protein & Peptide Science. 2020;21. doi: 10.2174/1389203721666200117153412.
62.Tang Q., Kang J., Yuan J., et al. DNA4mC-LIP: a linear integration method to identify N4-methylcytosine site in multiple species. Bioinformatics. 2020;36(11):3327–3335. doi: 10.1093/bioinformatics/btaa143.
63.Chen J., Zhao J., Yang S., Chen Z., Zhang Z. Prediction of protein ubiquitination sites in Arabidopsis thaliana. Current Bioinformatics. 2019;14(7):614–620. doi: 10.2174/1574893614666190311141647.
64.Kuo J.-H., Chang C.-C., Chen C.-W., Liang H.-H., Chang C.-Y., Chu Y.-W. Sequence-based structural B-cell epitope prediction by using two layer SVM model and association rule features. Current Bioinformatics. 2020;15(3):246–252. doi: 10.2174/1574893614666181123155831.
65.Jiao Y., Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quantitative Biology. 2016;4(4):320–330. doi: 10.1007/s40484-016-0081-2.
66.Li F. M., Gao X. W. Predicting gram-positive bacterial protein subcellular location by using combined features. BioMed Research International. 2020;2020:8. doi: 10.1155/2020/9701734.
67.Cheng L., Zhuang H., Ju H., et al. Exposing the causal effect of body mass index on the risk of type 2 diabetes mellitus: a Mendelian randomization study. Frontiers in Genetics. 2019;10. doi: 10.3389/fgene.2019.00094.
68.Cheng L., Zhao H., Wang P., et al. Computational methods for identifying similar diseases. Molecular Therapy-Nucleic Acids. 2019;18:590–604. doi: 10.1016/j.omtn.2019.09.019.
69.Cheng L., Zhuang H., Yang S., Jiang H., Wang S., Zhang J. Exposing the causal effect of C-reactive protein on the risk of type 2 diabetes mellitus: a Mendelian randomization study. Frontiers in Genetics. 2018;9:p. 657. doi: 10.3389/fgene.2018.00657.
70.Zhang Z. Y., Yang Y. H., Ding H., Wang D., Chen W., Lin H. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Briefings in Bioinformatics. 2020.
71.Dao F. Y., Lv H., Zulfiqar H., et al. A computational platform to identify origins of replication sites in eukaryotes. Briefings in Bioinformatics. 2020.
72.Dao F. Y., Lv H., Yang Y. H., Zulfiqar H., Gao H., Lin H. Computational identification of N6-methyladenosine sites in multiple tissues of mammals. Computational and Structural Biotechnology Journal. 2020;18:1084–1091. doi: 10.1016/j.csbj.2020.04.015.
73.Yang H., Yang W., Dao F. Y., et al. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Briefings in Bioinformatics. 2019.
74.Liu K., Chen W. iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics. 2020;36(11):3336–3342. doi: 10.1093/bioinformatics/btaa155.
75.Li B.-Q., Zhang Y.-H., Jin M.-L., Huang T., Cai Y.-D. Prediction of protein-peptide interactions with a nearest neighbor algorithm. Current Bioinformatics. 2018;13(1):14–24. doi: 10.2174/1574893611666160711162006.
76.Chen Z., Zhao P., Li F., et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018;34(14):2499–2502. doi: 10.1093/bioinformatics/bty140.
77.Muhammod R., Ahmed S., Farid D. M., Shatabda S., Sharma A., Dehzangi A. PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences. Bioinformatics. 2019;35(19):3831–3833. doi: 10.1093/bioinformatics/btz165.
78.Ao C., Zhou W., Gao L., Dong B., Yu L. Prediction of antioxidant proteins using hybrid feature representation method and random forest. Genomics. 2020;112(6):4666–4674. doi: 10.1016/j.ygeno.2020.08.016.
79.Kwon E., Cho M., Kim H., Son H. S. A study on host tropism determinants of influenza virus using machine learning. Current Bioinformatics. 2020;15(2):121–134. doi: 10.2174/1574893614666191104160927.

Associated Data

Supplementary Materials

Supplementary 1

The sequence names of HSP families (additional data file: docx, 31.9 KB).

Supplementary 2

The sequence names of the independent datasets (additional data file: docx, 17.5 KB).

Data Availability Statement

The data used to support the findings of this study are available from the supplementary materials.

[B1] 1.Liu T., Daniels C. K., Cao S. Comprehensive review on the HSC70 functions, interactions with related molecules and involvement in clinical diseases and therapeutic potential. Pharmacology & Therapeutics. 2012;136(3):354–374. doi: 10.1016/j.pharmthera.2012.08.014. [DOI] [PubMed] [Google Scholar]

[B2] 2.Wu J. M., Liu T. E., Rios Z., Mei Q. B., Lin X. K., Cao S. S. Heat shock proteins and cancer. Trends in Pharmacological Sciences. 2017;38(3):226–256. doi: 10.1016/j.tips.2016.11.009. [DOI] [PubMed] [Google Scholar]

[B3] 3.Feder M. E., Hofmann G. E. Heat-shock proteins, molecular chaperones, and the stress response: evolutionary and ecological physiology. Annual Review of Physiology. 1999;61(1):243–282. doi: 10.1146/annurev.physiol.61.1.243. [DOI] [PubMed] [Google Scholar]

[B4] 4.Qazi S. R., Ul Haq N., Ahmad S., Shakeel S. N. HSEAT: a tool for plant heat shock element analysis, motif identification and analysis. Current Bioinformatics. 2020;15(3):196–203. doi: 10.2174/1574893614666190102151956. [DOI] [Google Scholar]

[B5] 5.Chatterjee S., Burns T. F. Targeting heat shock proteins in cancer: a promising therapeutic approach. International Journal of Molecular Sciences. 2017;18(9):p. 1978. doi: 10.3390/ijms18091978. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] 6.Jolly C., Morimoto R. I. Role of the heat shock response and molecular chaperones in oncogenesis and cell death. Journal of the National Cancer Institute. 2000;92(19):1564–1572. doi: 10.1093/jnci/92.19.1564. [DOI] [PubMed] [Google Scholar]

[B7] 7.Khadir A., Kavalakatt S., Cherian P., et al. Physical exercise enhanced heat shock protein 60 expression and attenuated inflammation in the adipose tissue of human diabetic obese. Frontiers in Endocrinology. 2018;9:p. 16. doi: 10.3389/fendo.2018.00016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] 8.Ikwegbue P. C., Masamba P., Mbatha L. S., Oyinloye B. E., Kappo A. P. Interplay between heat shock proteins, inflammation and cancer: a potential cancer therapeutic target. American Journal of Cancer Research. 2019;9(2):242–249. [PMC free article] [PubMed] [Google Scholar]

[B9] 9.Ritossa F. A new puffing pattern induced by temperature shock and DNP in Drosophila. Experientia. 1962;18(12):571–573. doi: 10.1007/BF02172188. [DOI] [Google Scholar]

[B10] 10.Rodríguez-Iturbe B., Johnson R. J. Heat shock proteins and cardiovascular disease. Physiology international. 2018;105(1):19–37. doi: 10.1556/2060.105.2018.1.4. [DOI] [PubMed] [Google Scholar]

[B11] 11.Zilaee M., Shirali S. Heat shock proteins and diabetes. Canadian Journal of Diabetes. 2016;40(6):594–602. doi: 10.1016/j.jcjd.2016.05.016. [DOI] [PubMed] [Google Scholar]

[B12] 12.Lianos G. D., Alexiou G. A., Mangano A., et al. The role of heat shock proteins in cancer. Cancer Letters. 2015;360(2):114–118. doi: 10.1016/j.canlet.2015.02.026. [DOI] [PubMed] [Google Scholar]

[B13] 13.Zhao T., Hu Y., Peng J., Cheng L. DeepLGP: a novel deep learning method for prioritizing lncRNA target genes. Bioinformatics. 2020 doi: 10.1093/bioinformatics/btaa428. [DOI] [PubMed] [Google Scholar]

[B14] 14.Liang C., Changlu Q., He Z., Tongze F., Xue Z. gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Research. 2019;48(D1):D554–D560. doi: 10.1093/nar/gkz843. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Nagarajan N. S., Arunraj S. P., Sinha D., Rajan V. B., Esthaki V. K., D’Silva P. HSPIR: a manually annotated heat shock protein information resource. Bioinformatics. 2012;28(21):2853–2855. doi: 10.1093/bioinformatics/bts520. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] 16.Mahmood T., Safdar W., Abbasi B. H., Naqvi S. S. An overview on the small heat shock proteins. African Journal of Biotechnology. 2010;9(7):927–939. [Google Scholar]

[B17] 17.Genest O., Wickner S., Doyle S. M. Hsp 90 and Hsp 70 chaperones: collaborators in protein remodeling. The Journal of Biological Chemistry. 2019;294(6):2109–2120. doi: 10.1074/jbc.REV118.002806. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B18] 18.Chen T., Lin T. H., Li H. M., et al. Heat shock protein 40 (HSP40) in pacific white shrimp (Litopenaeus vannamei): molecular cloning, tissue distribution and ontogeny, response to temperature, acidity/alkalinity and salinity stresses, and potential role in ovarian development. Frontiers in Physiology. 2018;9:p. 1784. doi: 10.3389/fphys.2018.01784. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] 19.Schopf F. H., Biebl M. M., Buchner J. The HSP90 chaperone machinery. Nature Reviews. Molecular Cell Biology. 2017;18(6):345–360. doi: 10.1038/nrm.2017.20. [DOI] [PubMed] [Google Scholar]

[B20] 20.Schirmer E. C., Glover J. R., Singer M. A., Lindquist S. HSP100/Clp proteins: a common mechanism explains diverse functions. Trends in Biochemical Sciences. 1996;21(8):289–296. doi: 10.1016/S0968-0004(96)10038-4. [DOI] [PubMed] [Google Scholar]

[B21] 21.Feng P. M., Chen W., Lin H., Chou K. C. iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Analytical Biochemistry. 2013;442(1):118–125. doi: 10.1016/j.ab.2013.05.024. [DOI] [PubMed] [Google Scholar]

[B22] 22.Du P. F., Zhao W., Miao Y. Y., Wei L. Y., Wang L. UltraPse: a universal and extensible software platform for representing biological sequences. International Journal of Molecular Sciences. 2017;18(11):p. 2400. doi: 10.3390/ijms18112400. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] 23.Wang J., Du P. F., Xue X. Y., et al. VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences. Bioinformatics. 2019;36(4):1277–1278. doi: 10.1093/bioinformatics/btz689. [DOI] [PubMed] [Google Scholar]

[B24] 24.Ahmad S., Kabir M., Hayat M. Identification of heat shock protein families and J-protein types by incorporating dipeptide composition into Chou’s general PseAAC. Computer Methods and Programs in Biomedicine. 2015;122(2):165–174. doi: 10.1016/j.cmpb.2015.07.005. [DOI] [PubMed] [Google Scholar]

[B25] 25.Kumar R., Kumari B., Kumar M. PredHSP: sequence based proteome-wide heat shock protein prediction and classification tool to unlock the stress biology. PLoS One. 2016;11(5):p. e0155872. doi: 10.1371/journal.pone.0155872. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] 26.Meher P. K., Sahu T. K., Gahoi S. ir-HSP: improved recognition of heat shock proteins, their families and sub-types based on g-spaced di-peptide features and support vector machine. Frontiers in Genetics. 2018;8:p. 235. doi: 10.3389/fgene.2017.00235. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B27] 27.Chen W., Feng P., Liu T., Jin D. Recent advances in machine learning methods for predicting heat shock proteins. Current Drug Metabolism. 2019;20(3):224–228. doi: 10.2174/1389200219666181031105916. [DOI] [PubMed] [Google Scholar]

[B28] 28.Li L. Q., Yu S. J., Xiao W. D., et al. Prediction of bacterial protein subcellular localization by incorporating various features into Chou’s PseAAC and a backward feature selection approach. Biochimie. 2014;104:100–107. doi: 10.1016/j.biochi.2014.06.001. [DOI] [PubMed] [Google Scholar]

[B29] 29.Zhang L. N., Zhang C. J. JPPRED: prediction of types of J-proteins from imbalanced data using an ensemble learning method. BioMed Research International. 2015;2015:12. doi: 10.1155/2015/705156. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B30] 30.Li F. M., Wang X. Q. Identifying anticancer peptides by using improved hybrid compositions. Scientific Reports. 2016;6(1):p. 33910. doi: 10.1038/srep33910. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B31] 31.Li W. Z., Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. doi: 10.1093/bioinformatics/btl158. [DOI] [PubMed] [Google Scholar]

[B32] 32.Kampinga H. H., Hageman J., Vos M. J., et al. Guidelines for the nomenclature of the human heat shock proteins. Cell Stress & Chaperones. 2009;14(1):105–111. doi: 10.1007/s12192-008-0068-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B33] 33.Wang Y., Lin S., Song Q., et al. Genome-wide identification of heat shock proteins (Hsps) and Hsp interactors in rice: Hsp70s as a case study. BMC Genomics. 2014;15(1):344–344. doi: 10.1186/1471-2164-15-344. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B34] 34.Sarkar N. K., Kundnani P., Grover A. Functional analysis of Hsp70 superfamily proteins of rice (Oryza sativa) Cell Stress & Chaperones. 2013;18(4):427–437. doi: 10.1007/s12192-012-0395-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B35] 35.Zhu X. J., Feng C. Q., Lai H. Y., Chen W., Lin H. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowledge-Based Systems. 2019;163:787–793. doi: 10.1016/j.knosys.2018.10.007. [DOI] [Google Scholar]

[B36] 36.Ahmad K., Waris M., Hayat M. Prediction of protein submitochondrial locations by incorporating dipeptide composition into Chou’s general pseudo amino acid composition. The Journal of Membrane Biology. 2016;249(3):293–304. doi: 10.1007/s00232-015-9868-8. [DOI] [PubMed] [Google Scholar]

[B37] 37.Arif M., Hayat M., Jan Z. iMem-2LSAAC: a two-level model for discrimination of membrane proteins and their types by extending the notion of SAAC into Chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2018;442:11–21. doi: 10.1016/j.jtbi.2018.01.008. [DOI] [PubMed] [Google Scholar]

[B38] 38.Tahir M., Hayat M. iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou’s PseAAC. Molecular BioSystems. 2016;12(8):2587–2593. doi: 10.1039/C6MB00221H. [DOI] [PubMed] [Google Scholar]

[B39] 39.Saravanan V., Lakshmi P. T. V. Dualpred: a webserver for predicting plant proteins dual-targeted to chloroplast and mitochondria using split protein-relatedness-measure feature. Current Bioinformatics. 2015;10(3):323–331. doi: 10.2174/1574893609666140226000041. [DOI] [Google Scholar]

[B40] 40.Dai Q., Ma S., Hai Y. B., Yao Y. H., Liu X. Q. A segmentation based model for subcellular location prediction of apoptosis protein. Chemometrics and Intelligent Laboratory Systems. 2016;158:146–154. doi: 10.1016/j.chemolab.2016.09.005. [DOI] [Google Scholar]

[B41] 41.Yang W., Zhu X. J., Huang J., Ding H., Lin H. A brief survey of machine learning methods in protein sub-Golgi localization. Current Bioinformatics. 2019;14(3):234–240. doi: 10.2174/1574893613666181113131415. [DOI] [Google Scholar]

[B42] 42.Tan J. X., Li S. H., Zhang Z. M., et al. Identification of hormone binding proteins based on machine learning methods. Mathematical Biosciences and Engineering. 2019;16(4):2466–2480. doi: 10.3934/mbe.2019123. [DOI] [PubMed] [Google Scholar]

[B43] 43.Shen J. W., Zhang J., Luo X. M., et al. Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences. 2007;104(11):4337–4341. doi: 10.1073/pnas.0607879104. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B44] 44.Wang Y. C., Wang Y., Yang Z. X., Deng N. Y. Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context. BMC Systems Biology. 2011;5(S1):p. S6. doi: 10.1186/1752-0509-5-S1-S6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B45] 45.Wang J., Zhang L., Jia L., Ren Y., Yu G. Protein-protein interactions prediction using a novel local conjoint triad descriptor of amino acid sequences. International Journal of Molecular Sciences. 2017;18(11):p. 2373. doi: 10.3390/ijms18112373. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B46] 46.Wang H. C., Wu P. F. Prediction of RNA-protein interactions using conjoint triad feature and chaos game representation. Bioengineered. 2018;9(1):242–251. doi: 10.1080/21655979.2018.1470721. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B47] 47.Wang H. C., Hu X. H. Accurate prediction of nuclear receptors with conjoint triad feature. BMC Bioinformatics. 2015;16(1):p. 402. doi: 10.1186/s12859-015-0828-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B48] 48.Calligari P., Abergel D. Multiple scale dynamics in proteins probed at multiple time scales through fluctuations of NMR chemical shifts. The Journal of Physical Chemistry. B. 2014;118(14):3823–3831. doi: 10.1021/jp412125d. [DOI] [PubMed] [Google Scholar]

[B49] 49.Fan G. L., Li Q. Z. Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2012;304:88–95. doi: 10.1016/j.jtbi.2012.03.017. [DOI] [PubMed] [Google Scholar]

[B50] 50.Sibley A. B., Cosman M., Krishnan V. V. An empirical correlation between secondary structure content and averaged chemical shifts in proteins. Biophysical Journal. 2003;84(2):1223–1227. doi: 10.1016/S0006-3495(03)74937-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B51] 51.Yang R. T., Zhang C., Gao R., Zhang L. N. A novel feature extraction method with feature selection to identify Golgi-resident protein types from imbalanced data. International Journal of Molecular Sciences. 2016;17(2):p. 218. doi: 10.3390/ijms17020218. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B52] 52.Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357. doi: 10.1613/jair.953. [DOI] [Google Scholar]

[B53] 53.Cheng L. Computational and biological methods for gene therapy. Current Gene Therapy. 2019;19(4):210–210. doi: 10.2174/156652321904191022113307. [DOI] [PubMed] [Google Scholar]

[B54] 54.Cheng L., Hu Y. Human disease system biology. Current Gene Therapy. 2018;18(5):255–256. doi: 10.2174/1566523218666181010101114. [DOI] [PubMed] [Google Scholar]

[B55] 55.Su W. X., Li Q. Z., Zhang L. Q., et al. Gene expression classification using epigenetic features and DNA sequence composition in the human embryonic stem cell line H1. Gene. 2016;592(1):227–234. doi: 10.1016/j.gene.2016.07.059. [DOI] [PubMed] [Google Scholar]

[B56] 56.Manavalan B., Shin T. H., Lee G. PVP-SVM: sequence-based prediction of phage virion proteins using a support vector machine. Frontiers in Microbiology. 2018;9:p. 476. doi: 10.3389/fmicb.2018.00476. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B57] 57.Lai H. Y., Zhang Z. Y., Su Z. D., et al. iProEP: a computational predictor for predicting promoter. Molecular Therapy-Nucleic Acids. 2019;17:337–346. doi: 10.1016/j.omtn.2019.05.028. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B58] 58.Li X., Tang Q., Tang H., Chen W. Identifying antioxidant proteins by combining multiple methods. Frontiers in Bioengineering and Biotechnology. 2020;8:p. 858. doi: 10.3389/fbioe.2020.00858. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B59] 59.Zhao W., Li G. P., Wang J., Zhou Y. K., Gao Y., Du P. F. Predicting protein sub-Golgi locations by combining functional domain enrichment scores with pseudo-amino acid compositions. Journal of Theoretical Biology. 2019;473:38–43. doi: 10.1016/j.jtbi.2019.04.025. [DOI] [PubMed] [Google Scholar]

[B60] 60.Yang Y. H., Ma C., Wang J. S., et al. Prediction of N7-methylguanosine sites in human RNA based on optimal sequence features. Genomics. 2020;112(6):4342–4347. doi: 10.1016/j.ygeno.2020.07.035. [DOI] [PubMed] [Google Scholar]

[B61] 61.Liu M. L., Su W., Guan Z. X., et al. An overview on predicting protein subchloroplast localization by using machine learning methods. Current Protein & Peptide Science. 2020;21 doi: 10.2174/1389203721666200117153412. [DOI] [PubMed] [Google Scholar]

[B62] 62.Tang Q., Kang J., Yuan J., et al. DNA4mC-LIP: a linear integration method to identify N4-methylcytosine site in multiple species. Bioinformatics. 2020;36(11):3327–3335. doi: 10.1093/bioinformatics/btaa143. [DOI] [PubMed] [Google Scholar]

[B63] 63.Chen J., Zhao J., Yang S., Chen Z., Zhang Z. Prediction of protein ubiquitination sites in Arabidopsis thaliana. Current Bioinformatics. 2019;14(7):614–620. doi: 10.2174/1574893614666190311141647. [DOI] [Google Scholar]

[B64] 64.Kuo J.-H., Chang C.-C., Chen C.-W., Liang H.-H., Chang C.-Y., Chu Y.-W. Sequence-based structural B-cell epitope prediction by using two layer SVM model and association rule features. Current Bioinformatics. 2020;15(3):246–252. doi: 10.2174/1574893614666181123155831. [DOI] [Google Scholar]

[B65] 65.Jiao Y., Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quantitative Biology. 2016;4(4):320–330. doi: 10.1007/s40484-016-0081-2. [DOI] [Google Scholar]

[B66] 66.Li F. M., Gao X. W. Predicting gram-positive bacterial protein subcellular location by using combined features. BioMed Research International. 2020;2020:8. doi: 10.1155/2020/9701734.9701734 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B67] 67.Cheng L., Zhuang H., Ju H., et al. Exposing the causal effect of body mass index on the risk of type 2 diabetes mellitus: a Mendelian randomization study. Frontiers in Genetics. 2019;10 doi: 10.3389/fgene.2019.00094. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B68] 68.Cheng L., Zhao H., Wang P., et al. Computational methods for identifying similar diseases. Molecular Therapy-Nucleic Acids. 2019;18:590–604. doi: 10.1016/j.omtn.2019.09.019. [DOI] [PMC free article] [PubMed] [Google Scholar]

69. Cheng L., Zhuang H., Yang S., Jiang H., Wang S., Zhang J. Exposing the causal effect of C-reactive protein on the risk of type 2 diabetes mellitus: a Mendelian randomization study. Frontiers in Genetics. 2018;9:657. doi: 10.3389/fgene.2018.00657.

70. Zhang Z. Y., Yang Y. H., Ding H., Wang D., Chen W., Lin H. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Briefings in Bioinformatics. 2020.

71. Dao F. Y., Lv H., Zulfiqar H., et al. A computational platform to identify origins of replication sites in eukaryotes. Briefings in Bioinformatics. 2020.

72. Dao F. Y., Lv H., Yang Y. H., Zulfiqar H., Gao H., Lin H. Computational identification of N6-methyladenosine sites in multiple tissues of mammals. Computational and Structural Biotechnology Journal. 2020;18:1084–1091. doi: 10.1016/j.csbj.2020.04.015.

73. Yang H., Yang W., Dao F. Y., et al. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Briefings in Bioinformatics. 2019.

74. Liu K., Chen W. iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics. 2020;36(11):3336–3342. doi: 10.1093/bioinformatics/btaa155.

75. Li B.-Q., Zhang Y.-H., Jin M.-L., Huang T., Cai Y.-D. Prediction of protein-peptide interactions with a nearest neighbor algorithm. Current Bioinformatics. 2018;13(1):14–24. doi: 10.2174/1574893611666160711162006.

76. Chen Z., Zhao P., Li F., et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018;34(14):2499–2502. doi: 10.1093/bioinformatics/bty140.

77. Muhammod R., Ahmed S., Farid D. M., Shatabda S., Sharma A., Dehzangi A. PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences. Bioinformatics. 2019;35(19):3831–3833. doi: 10.1093/bioinformatics/btz165.

78. Ao C., Zhou W., Gao L., Dong B., Yu L. Prediction of antioxidant proteins using hybrid feature representation method and random forest. Genomics. 2020;112(6):4666–4674. doi: 10.1016/j.ygeno.2020.08.016.

79. Kwon E., Cho M., Kim H., Son H. S. A study on host tropism determinants of influenza virus using machine learning. Current Bioinformatics. 2020;15(2):121–134. doi: 10.2174/1574893614666191104160927.
