Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction

Zifan Guo; Pingping Wang; Zhendong Liu; Yuming Zhao

doi:10.3389/fbioe.2020.584807

2020 Oct 22;8:584807. doi: 10.3389/fbioe.2020.584807


PMCID: PMC7642589 PMID: 33195148

Abstract

Thermophilicity is a very important property of proteins, as it sometimes determines denaturation and cell death. Thus, methods for predicting thermophilic proteins and non-thermophilic proteins are of interest and can contribute to the design and engineering of proteins. In this article, we describe the use of feature dimension reduction technology and LIBSVM to identify thermophilic proteins. The highest accuracy obtained by cross-validation was 96.02% with 119 parameters. When using only 16 features, we obtained an accuracy of 93.33%. We discuss the importance of the different characteristics in identification and report a comparison of the performance of support vector machine to that of other methods.

Keywords: support vector machine, thermophilic proteins, feature dimension reduction, amino acid, feature selection

Introduction

Temperature is a critical condition for life. Proteins are less stable than other macromolecules, and temperature changes can easily cause protein denaturation, which in turn can lead to cell death (Kumar et al., 2000). Thus, it is important to develop a highly efficient method for predicting protein thermophilicity, which will contribute to the design of stable proteins. The properties of many proteins are related to their thermal stability. Studies have shown that the thermal stability of proteins is influenced by ion number, salt bridge presence, amino acid composition (AAC), dipeptide composition (DPC), and other factors (Sadeghi et al., 2006; Wang H. et al., 2018; Yin et al., 2020). Zhang and Fang (2006), Li et al. (2018), and Wang Y. et al. (2020) found significant differences in the presence of some dipeptides between thermophilic and mesophilic proteins. In addition, Gromiha et al. (1999) found that protein stability was associated with the balance between packing and solubility.

Many studies have sought to distinguish thermophilic proteins from mesophilic proteins based on protein properties. Liang et al. (2005) proposed an amino acid coupling model with strong statistical ability to distinguish between thermophilic proteins and mesophilic proteins. Zhang and Fang (2007) used a LogitBoost classifier with 20 features to distinguish thermophilic proteins and achieved an overall classification accuracy of 88.9%. Montanucci et al. (2008) applied a support vector machine (SVM) to investigate the impact of mutations on the thermal stability of proteins and, with jackknife cross-validation, achieved a prediction accuracy of 88%. More recently, Lin and Chen (2011) used a feature selection technique and an SVM with 30 parameters to predict thermophilic proteins, reaching an overall accuracy of 93.27%. These methods achieve good accuracy, but there remains room for improvement in both the number of features used and the prediction performance.

In this work, we used the data set of Lin and Chen (2011), after eliminating redundancy, to distinguish between thermophilic proteins and non-thermophilic proteins. After feature extraction, MRMD2.0 was applied for feature selection and dimension reduction, and LIBSVM was used to obtain the optimal parameters of the model and establish the prediction model. In cross-validation, both the number of features and the prediction accuracy improved on previous work: the overall prediction accuracy with only 16 AAC features was 93.33%, and the highest overall accuracy, attained with 119 parameters, reached 96.02%. In addition, we analyzed the importance of features and demonstrated the strong performance of SVM by comparing this method with other methods.

Materials and Methods

Data Sets

We conducted prediction experiments using two groups of data: a set of thermophilic proteins and a set of non-thermophilic proteins. The data sets were collected by Lin and Chen (2011). Thermophilic and non-thermophilic proteins generally derive from thermophilic and non-thermophilic organisms, respectively, and the optimum growth temperature of the source organism is the key feature used to distinguish them. Therefore, 60°C was used as the minimum optimum growth temperature for thermophilic proteins and 30°C as the maximum optimum growth temperature for non-thermophilic proteins, to avoid the problem of protein denaturation. As a result, 136 prokaryotic genomes meeting these criteria were selected, and their protein sequences were obtained from the Universal Protein Resource.

Next, we screened the protein sequences to increase the quality of the data sets. The filtering process employed the following criteria: (1) the sequence must have manual annotation and evaluation; (2) the sequence cannot include ambiguous residues; (3) the sequence cannot be a fragment of another protein; and (4) the sequence cannot be deduced from prediction or homology. After this screening process, we obtained a total of 1,250 non-thermophilic proteins and 1,329 thermophilic proteins. Highly similar sequences were then removed with the CD-HIT program, resulting in 793 non-thermophilic proteins and 915 thermophilic proteins.

Feature Extraction

Before protein prediction, the features of the protein sequences were extracted to construct the feature vectors (Figure 1). For this purpose, we used iFeature, a Python-based toolkit that generates a variety of numerical feature representation schemes for protein sequences (Chen et al., 2018). With iFeature, users can also combine various feature clustering, feature selection, and dimension reduction algorithms to support the analysis of feature importance and model training. iFeature has been widely tested, which supports the validity of the feature calculations used in this work.

We used iFeature to extract the features of the protein sequences from our data set, including AAC (Bhasin and Raghava, 2004; Pan et al., 2018; Chen et al., 2019b; Liu et al., 2019; Shen et al., 2019b; Tang et al., 2019; Li Y. H. et al., 2020), C/T/D composition (CTDC), C/T/D transition (CTDT), conjoint triad (CTriad), dipeptide deviation from the expected mean (DDE) (Saravanan and Gautham, 2015), DPC (Saravanan and Gautham, 2015; Chen et al., 2019a), tripeptide composition (TPC), composition of k-spaced amino acid pairs (CKSAAP), grouped dipeptide composition (GDPC), and grouped tripeptide composition (GTPC). The following is a concise explanation of the feature extraction protocol. In all of the following formulas, n denotes the length of the protein sequence.

AAC

AAC refers to the frequency of each amino acid in a protein or peptide sequence. There are 20 kinds of naturally occurring amino acids, namely, ACDEFGHIKLMNPQRSTVWY, and their frequencies in a sequence can be calculated by the following formula:

\begin{array}{l} f(i) = \frac{n(i)}{n}, \quad i \in \{A, C, D, E, F, \dots, W, Y\} \end{array}

where n(i) refers to the number of occurrences of amino acid i.
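The AAC formula can be illustrated with a minimal Python sketch. This is not the iFeature implementation; the function name `aac` and the toy sequence are assumptions chosen for illustration only.

```python
from collections import Counter

# The 20 standard amino acids, in the order given in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return f(i) = n(i) / n for each of the 20 standard amino acids."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence)
    # Counter returns 0 for residues that do not occur in the sequence.
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


# Toy sequence (hypothetical, not taken from the data set).
aac_features = aac("MKKLDE")
```

Each sequence is mapped to a 20-dimensional vector whose entries sum to 1 when the sequence contains only standard residues.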

DPC

DPC refers to the frequency of dipeptide combinations in a protein or peptide sequence, which yields 400 descriptors (Cheng J. H. et al., 2018; Tang et al., 2018). It is defined by the following formula:

\begin{array}{l} f(x, y) = \frac{n_{xy}}{n - 1}, \quad x, y \in \{A, C, D, E, F, \dots, W, Y\} \end{array}

where n_xy refers to the number of dipeptides denoted by amino acids x and y.
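As a hedged illustration of the DPC formula, the sketch below slides a window of width 2 over the sequence and divides each count by n - 1. The function name `dpc` and the toy sequence are assumptions; iFeature's own code may differ in detail.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dpc(sequence):
    """Return f(x, y) = n_xy / (n - 1) for all 400 dipeptides."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip windows containing non-standard residues
            counts[pair] += 1
    return {pair: count / (n - 1) for pair, count in counts.items()}


# Toy sequence (hypothetical): overlapping windows are MK, KK, KL, LK.
dpc_features = dpc("MKKLK")
```

The same pattern extends to tripeptides by using a window of width 3 and a denominator of n - 2.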

TPC

TPC refers to the frequency of tripeptide combinations in a protein or peptide sequence, which yields 8,000 descriptors (Tan et al., 2019; Zhu et al., 2019). It is defined by the following formula:

\begin{array}{l} f(x, y, z) = \frac{n_{xyz}}{n - 2}, \quad x, y, z \in \{A, C, D, E, F, \dots, W, Y\} \end{array}

where n_xyz refers to the number of tripeptides denoted by amino acid combination x, y, and z.

DDE

The DDE eigenvector is constructed by calculating three parameters: dipeptide composition (D_c), theoretical mean value (T_m), and theoretical variance (T_v). These three parameters and DDE are calculated as follows:

\begin{array}{l} D_{c}(x, y) = \frac{n_{xy}}{n - 1}, \quad x, y \in \{A, C, D, E, F, \dots, W, Y\} \end{array}

where n_xy refers to the number of occurrences of the dipeptide formed by amino acids x and y.

\begin{array}{l} T_{m}(x, y) = \frac{C_{x}}{C_{n}} \times \frac{C_{y}}{C_{n}}, \quad x, y \in \{A, C, D, E, F, \dots, W, Y\} \end{array}

where C_x and C_y are the numbers of codons encoding the first and second amino acids, respectively, in dipeptide “x, y,” and C_n is the total number of possible codons remaining after removing the 3 stop codons (64 - 3 = 61).

\begin{array}{l} T_{v}(x, y) = \frac{T_{m}(x, y) \left(1 - T_{m}(x, y)\right)}{n - 1}, \quad x, y \in \{A, C, D, E, F, \dots, W, Y\} \\ DDE(x, y) = \frac{D_{c}(x, y) - T_{m}(x, y)}{\sqrt{T_{v}(x, y)}} \end{array}
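The three DDE parameters can be combined in a short Python sketch. The codon counts come from the standard genetic code (61 sense codons); the function name `dde` and the toy sequence are assumptions for illustration, and this is not the iFeature source code.

```python
import math

# Number of codons encoding each amino acid in the standard genetic code.
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}
C_N = 61  # 64 codons minus the 3 stop codons


def dde(sequence):
    """Return DDE(x, y) = (D_c - T_m) / sqrt(T_v) for all 400 dipeptides."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {}
    for i in range(n - 1):
        pair = sequence[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    features = {}
    for x in CODONS:
        for y in CODONS:
            d_c = counts.get(x + y, 0) / (n - 1)          # observed composition
            t_m = CODONS[x] / C_N * CODONS[y] / C_N        # theoretical mean
            t_v = t_m * (1 - t_m) / (n - 1)                # theoretical variance
            features[x + y] = (d_c - t_m) / math.sqrt(t_v)
    return features


# Toy sequence (hypothetical).
dde_features = dde("MKKLK")
```

A dipeptide that never occurs in the sequence receives a negative value, because its observed composition (zero) lies below the theoretical mean.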

GDPC

The GDPC encoding is a variant of the DPC descriptor in which the amino acids are first mapped to five groups (g1 to g5); it comprises a total of 25 descriptors, defined as follows:

\begin{array}{l} f(x, y) = \frac{n_{xy}}{n - 1}, \quad x, y \in \{g1, g2, g3, g4, g5\} \end{array}

where n_xy refers to the number of dipeptides denoted by amino acid groups x and y.
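The grouped encoding can be sketched as follows. The text does not define the five groups, so the grouping below (aliphatic, aromatic, positively charged, negatively charged, uncharged) is an assumption based on the physicochemical classes commonly used with this encoding; the function name `gdpc` is likewise hypothetical.

```python
# ASSUMPTION: group membership is not given in the text; this is a commonly
# used five-class partition of the 20 standard amino acids.
GROUPS = {
    "g1": "GAVLMI",   # aliphatic
    "g2": "FYW",      # aromatic
    "g3": "KRH",      # positively charged
    "g4": "DE",       # negatively charged
    "g5": "STCPNQ",   # uncharged
}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def gdpc(sequence):
    """Return f(x, y) = n_xy / (n - 1) over the 25 ordered group pairs."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    # Raises KeyError if the sequence contains a non-standard residue.
    encoded = [RESIDUE_TO_GROUP[aa] for aa in sequence]
    counts = {(x, y): 0 for x in GROUPS for y in GROUPS}
    for pair in zip(encoded, encoded[1:]):
        counts[pair] += 1
    return {pair: count / (n - 1) for pair, count in counts.items()}


# Toy sequence (hypothetical): encodes to g1, g3, g3, g1, g3.
gdpc_features = gdpc("MKKLK")
```

Replacing the pair window with a triple window and the denominator with n - 2 gives the 125-dimensional grouped tripeptide analogue.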

GTPC

The GTPC is the corresponding variant of the TPC descriptor; it consists of a total of 125 descriptors and is defined as follows:

\begin{array}{l} f(x, y, z) = \frac{n_{xyz}}{n - 2}, \quad x, y, z \in \{g1, g2, g3, g4, g5\} \end{array}

where n_xyz refers to the number of tripeptides denoted by amino acid groups x, y, and z.

CTD

CTD features represent the structural or physicochemical distribution patterns of amino acids in protein or peptide sequences (Dubchak et al., 1999; Tang et al., 2020). Thirteen types of physicochemical properties were used to calculate these characteristics, including hydrophobicity, standardized van der Waals volume, solvent accessibility, polarity, secondary structure, polarizability, and charge. These descriptors were computed by the following procedures: (1) the amino acid sequences were changed into residues with certain structural or physicochemical properties; (2) according to the main cluster of Tomii and Kanehisa (1996) amino acid index, the 20 amino acids were divided into 3 groups according to 7 physicochemical properties.

CTDC

After all 20 amino acids are divided into three groups, the composition descriptor is composed of 3 values, which are the total percentages of group 1, group 2, and group 3 of the protein sequences. The descriptor is calculated as follows:

\begin{array}{l} C(x) = \frac{n(x)}{n}, \quad x \in \{\text{group 1}, \text{group 2}, \text{group 3}\} \end{array}

where n(x) refers to the number of residues belonging to group x in the encoded sequence.

CTDT

The transition descriptor T also contains three values. The transition between group 1 and group 2 is the percentage frequency with which a residue from group 1 is followed by a residue from group 2 or a residue from group 2 is followed by a residue from group 1. Transitions between group 2 and group 3 and between group 3 and group 1 are defined in a similar manner. The transition descriptor can be calculated as follows:

\begin{array}{l} T(x, y) = \frac{n(x, y) + n(y, x)}{n - 1}, \\ (x, y) \in \{(\text{group 1}, \text{group 2}), (\text{group 2}, \text{group 3}), (\text{group 3}, \text{group 1})\} \end{array}

where n(x, y) and n(y, x) refer to the numbers of dipeptides denoted by “x, y” and “y, x,” respectively, in the protein sequence.
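The composition and transition descriptors can be sketched for a single physicochemical property. The three hydrophobicity classes below (polar, neutral, hydrophobic) are an assumption taken from the grouping commonly used with CTD descriptors, since the text does not list group membership; the helper names are hypothetical.

```python
# ASSUMPTION: a commonly used three-class hydrophobicity partition.
HYDROPHOBICITY = {
    "1": "RKEDQN",    # polar
    "2": "GASTPHY",   # neutral
    "3": "CLVIMFW",   # hydrophobic
}


def encode(sequence, groups):
    """Replace every residue by the label of the group it belongs to."""
    lookup = {aa: label for label, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence)


def ctdc(sequence, groups):
    """Composition: C(x) = n(x) / n for each of the three groups."""
    encoded = encode(sequence, groups)
    return {label: encoded.count(label) / len(encoded) for label in groups}


def ctdt(sequence, groups):
    """Transition: T(x, y) = (n(x, y) + n(y, x)) / (n - 1)."""
    encoded = encode(sequence, groups)
    n = len(encoded)
    pairs = [encoded[i:i + 2] for i in range(n - 1)]
    return {
        (x, y): (pairs.count(x + y) + pairs.count(y + x)) / (n - 1)
        for x, y in (("1", "2"), ("2", "3"), ("3", "1"))
    }


# Toy sequence (hypothetical): encodes to "311311".
composition = ctdc("MKKLDE", HYDROPHOBICITY)
transition = ctdt("MKKLDE", HYDROPHOBICITY)
```

Repeating the calculation for each physicochemical property and concatenating the results gives the full composition and transition feature vectors.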

Feature Selection

Feature selection is an important step in the process of protein classification (Figure 1) (Feng et al., 2017; Cheng, 2019; Liu, 2019; Yang W. et al., 2019; Zheng et al., 2019; Wang M. et al., 2020; Yang et al., 2020b; Zhao et al., 2020). MRMD2.0 is an ensemble feature selection method that builds on the concept of the PageRank algorithm and combines it with methods such as analysis of variance (Scheffe, 1960), minimal redundancy and maximal relevance (Ding and Peng, 2005), maximal information coefficient, and least absolute shrinkage and selection operator (Xu et al., 2017). In total, MRMD2.0 integrates seven different feature ranking algorithms through PageRank and determines the optimal dimensionality with a forward adding strategy. PageRank was originally used to assign a weight to each web page: pages with large weights are displayed first, whereas pages with small weights are displayed later. Similarly, MRMD2.0 uses PageRank together with the individual feature ranking algorithms to generate a weight for each feature, which yields a ranking of the importance of all features.

In this study, MRMD2.0 was used to select features and reduce the dimension of the obtained features to improve the feature prediction ability. By treating each group of features in the previous step with MRMD2.0, we obtained the combination of features with the highest classification accuracy and the importance ranking of each group of features. Generally, the combination of features with the highest classification accuracy has fewer dimensions, so we refer to this process as feature dimension reduction. Based on the classification performance, we ranked the group of features. After combining the features with good classification performance, we applied MRMD2.0 to select them again. Finally, after comparing the results, we obtained the combination of features with the best classification ability.

In addition, we applied MRMD2.0 to obtain the importance ranking of features. On the rank list, higher-ranked features are more predictive; accordingly, we identified the most important features for the classification of thermophilic proteins and non-thermophilic proteins. The resulting information enhances our knowledge of the properties of proteins and can aid the construction of stable proteins in protein engineering.
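The forward adding strategy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MRMD2.0's implementation: the ranking is taken as given (the PageRank-based aggregation is not reproduced), and `forward_select`, `toy_score`, and the feature names are hypothetical. In practice the scoring function would be a cross-validated classification accuracy.

```python
def forward_select(ranked_features, score):
    """Add features in rank order and keep the best-scoring prefix.

    ranked_features: feature names, most important first.
    score: callable taking a list of feature names and returning a number
           (for example, cross-validated accuracy of a classifier).
    """
    best_subset, best_score = [], float("-inf")
    current = []
    for feature in ranked_features:
        current.append(feature)
        value = score(current)
        if value > best_score:
            best_score, best_subset = value, list(current)
    return best_subset, best_score


def toy_score(subset):
    """Stand-in for a real evaluation: rewards three 'useful' features and
    penalizes every added dimension slightly."""
    useful = {"K", "D", "LK"}
    return sum(f in useful for f in subset) - 0.1 * len(subset)


selected, selected_score = forward_select(
    ["K", "D", "LK", "noise1", "noise2"], toy_score
)
```

With this toy score, the prefix that stops before the uninformative features wins, which mirrors how dimension reduction falls out of choosing the best-scoring prefix.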

LIBSVM

In this study, LIBSVM was used to construct models and make predictions (Figure 1). LIBSVM is an effective software package for SVM pattern recognition and regression designed by Chih-Jen Lin, a professor at National Taiwan University, and has been applied in many fields (Lin et al., 2012; Liu et al., 2012, 2017; Ding et al., 2017; Zeng et al., 2017; Wei et al., 2018, 2019; Xu et al., 2018b,c; Cheng et al., 2019b; Deng et al., 2019; Liang et al., 2019; Shen et al., 2019b,a; Su et al., 2019; Yang H. et al., 2019; Li F. et al., 2020; Wang H. et al., 2020; Yang et al., 2020a; Zhang et al., 2020). Before an SVM is trained on a problem, its parameters must be specified (Jiang et al., 2013; Zhao et al., 2015, 2017). We selected the best values of the penalty parameter C and the kernel parameter g with a simple tool provided by LIBSVM for evaluating a grid of parameters. LIBSVM reports the cross-validation accuracy for each parameter setting, so the setting with the highest cross-validation accuracy can be identified. Next, we trained the model on the whole data set with the best parameters C and g to obtain the prediction model. Finally, we tested and predicted our data set with the obtained model.
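The grid evaluation of C and g can be sketched independently of any SVM library. In the sketch below, `cv_accuracy` is a hypothetical callback that would train an SVM with the given parameters and return its cross-validation accuracy; the default exponent ranges are an assumption modelled on the power-of-two grid commonly used with LIBSVM's grid tool, and the toy accuracy surface exists only to make the example runnable.

```python
import math
from itertools import product


def grid_search(cv_accuracy, log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Evaluate every (C, g) pair on a log2 grid and return the best one.

    cv_accuracy: callable (C, g) -> cross-validation accuracy.
    Returns a tuple (best_accuracy, best_C, best_g).
    """
    best = None
    for exp_c, exp_g in product(log2c, log2g):
        c, g = 2.0 ** exp_c, 2.0 ** exp_g
        accuracy = cv_accuracy(c, g)
        if best is None or accuracy > best[0]:
            best = (accuracy, c, g)
    return best


def toy_cv_accuracy(c, g):
    """Hypothetical accuracy surface peaking at C = 2^3, g = 2^-5."""
    return 100 - (math.log2(c) - 3) ** 2 - (math.log2(g) + 5) ** 2


best_acc, best_c, best_g = grid_search(toy_cv_accuracy)
```

The selected pair would then be used to retrain the model on the whole data set, as described above.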

Performance Measurement

We used three commonly used indicators to evaluate model performance: sensitivity (SE), specificity (SP), and accuracy (ACC) (Figure 1) (Wang et al., 2010; Wei et al., 2017a,b; Zhang et al., 2018; Cheng et al., 2019a; Ding et al., 2019a; Junwei et al., 2019; Liang et al., 2019; Liu and Li, 2019; Tian et al., 2019; Jia et al., 2020; Liu and Chen, 2020; Li J. et al., 2020; Lv et al., 2020; Wang Z. et al., 2020). They are described as follows:

\begin{array}{l} SE = \frac{TP}{TP + FN} \\ SP = \frac{TN}{TN + FP} \\ ACC = \frac{TP + TN}{TP + FN + TN + FP} \end{array}

where TP and TN refer to the numbers of correctly predicted thermophilic proteins and correctly predicted non-thermophilic proteins, respectively; FN refers to the number of thermophilic proteins incorrectly predicted as non-thermophilic; and FP refers to the number of non-thermophilic proteins incorrectly predicted as thermophilic. SE and SP measure the predictive ability of a model on positive and negative samples, respectively, and ACC is used to evaluate the overall performance of a prediction model (Wang et al., 2008; Zou et al., 2017a,b; Cheng L. et al., 2018; Wang G. et al., 2018; Xue et al., 2018; Xu et al., 2018a, 2019; Ding et al., 2019b; Shen et al., 2019b; Yang, 2019; Zeng et al., 2019; Fu et al., 2020; Hong et al., 2020).
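The three indicators follow directly from the four confusion-matrix counts, with thermophilic proteins treated as the positive class. The counts in this minimal sketch are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (SE, SP, ACC) computed from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    return se, sp, acc


# Hypothetical counts, for illustration only.
se, sp, acc = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```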

Results and Discussion

Identification of Protein Thermostability

The results of feature selection with MRMD2.0 are shown in Table 1. The features with good classification performance include AAC, DPC, CTDC, and DDE. However, although the classification ACC of DDE after dimension reduction reached 85.6%, it still had 365 dimensions. Considering this high dimensionality and unexceptional performance, only AAC, DPC, and CTDC were subsequently combined for classification.

Table 1.

The results of feature selection by using MRMD2.0.

Feature	Dimensions	Accuracy (%)
AAC	16/20	87.94
DPC	103/400	87.00
DDE	365/400	85.60
CTDC	33/39	85.01
CTDT	39/39	80.50
CTriad	338/343	79.80
CKSAAP	143/150	79.04
GTPC	107/125	78.63
GDPC	13/25	78.57
TPC	1,008/1023	77.11

The two numbers in the second column of the table are the number after dimension reduction and the number before dimension reduction.

Next, based on LIBSVM and grid parameter optimization, we used various combinations of these three features to construct models and perform cross-validation on our data sets. The results are shown in Table 2. The overall ACC of all three schemes is higher than that of Lin and Chen (2011) (93.27%).

Table 2.

The results of classification using SVM and various feature combinations.

Feature combination	SE (%)	SP (%)	Accuracy (%)
The method of Lin and Chen (2011)	93.77	92.69	93.27
AAC (16)	93.44	93.19	93.33
AAC (16) + CTDC (33)	93.77	92.81	93.33
AAC (16) + DPC (103)	95.85	96.22	96.02

The numbers in parentheses in the first column of the table give the number of dimensions retained for the feature preceding the parentheses.
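As a quick consistency check on the table above, the overall accuracy should equal the class-size-weighted mean of sensitivity and specificity. The sketch below assumes that the first numeric column is computed over the 915 thermophilic proteins, that the second numeric column is the specificity computed over the 793 non-thermophilic proteins, and that every row was evaluated on this same data set; these are assumptions, not statements from the text.

```python
N_THERMOPHILIC, N_NON_THERMOPHILIC = 915, 793


def implied_accuracy(se, sp):
    """Accuracy implied by SE and SP given the two class sizes (all in %)."""
    total = N_THERMOPHILIC + N_NON_THERMOPHILIC
    return (se * N_THERMOPHILIC + sp * N_NON_THERMOPHILIC) / total


# (SE, SP, reported accuracy) for each row of the table.
rows = {
    "Lin and Chen (2011)": (93.77, 92.69, 93.27),
    "AAC (16)": (93.44, 93.19, 93.33),
    "AAC (16) + CTDC (33)": (93.77, 92.81, 93.33),
    "AAC (16) + DPC (103)": (95.85, 96.22, 96.02),
}

# Difference between implied and reported accuracy, in percentage points.
differences = {
    name: implied_accuracy(se, sp) - reported
    for name, (se, sp, reported) in rows.items()
}
```

Under these assumptions, the implied and reported accuracies agree to within about 0.01 percentage points for every row, which is compatible with rounding of the reported values.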

Initially, we used AAC with 16 dimensions alone to build a prediction model for the data set, achieving an overall ACC of 93.33% through cross-validation, which is slightly higher than that of Lin and Chen (2011). For comparison, Zhang and Fang (2006) and Gromiha and Suresh (2010) used the composition of all 20 amino acids to predict the thermostability of proteins, and their overall ACC was 90.5 and 89%, respectively. Furthermore, Wang and Li (2014) raised the ACC to 95% by selecting 9 AAC and 38 DPC features using a genetic algorithm. In contrast, our scheme used only 16 parameters, fewer than in these previous studies, yet the ACC reached 93.33%. The results show that AAC plays an important role in the identification of thermophilic proteins.

The top two features in Table 1 were AAC and DPC. The model constructed with 16 parameters of AAC and 103 parameters of DPC achieved the highest overall ACC of 96.02%. The SE and SP of this method were 95.85 and 96.22%, respectively, which indicates that the predictive ability of this model in both positive and negative situations is excellent.

Table 3.

The results of classification accuracy using LIBSVM and various combinations of important features.

Dimension	Feature	Accuracy (%)
1	K	76.41
2	K + D	77.50
3	K + D + LK	78.29

A plus sign in the second column of the table indicates the use of these characteristics for model training and classification. For example, “K + D” indicates the modeling and classification of the data sets with the two-dimension characteristics K and D.

In addition, we used the combination of AAC with 16 dimensions and CTDC with 33 dimensions to build a prediction model and obtained the same overall ACC as the first model. However, this second model had higher SE and lower SP than the first model, indicating that it was slightly inferior to the model built with 16 dimensions of AAC.

Feature Importance

We aimed to identify the most important features of the method with 119 parameters that can achieve the highest ACC and analyze them. To assess feature importance, first, we used MRMD2.0 to rank all 119 features by importance. We found that the top three features were K, D, and LK (Feature K is the percentage of lysine in the amino acid sequence, feature D is the percentage of aspartic acid in the amino acid sequence, and feature LK is the percentage content of the dipeptide consisting of leucine and lysine in the amino acid sequence). These three features are arguably the most predictive among the 119 features for the classification of thermophilic proteins.

Next, to obtain the classification performance of the above features, we used one-dimensional (K), two-dimensional (K and D), and three-dimensional (K, D, and LK) features to classify our data set based on LIBSVM. The results are shown in Table 3.

As seen from Table 3, the classification ACC of the K feature alone reached 76.41%, whereas the ACC achieved with K combined with D and LK was only slightly greater. To better analyze the classification ability of these three important features, we constructed a violin diagram, scatter diagram, and 3D scatter diagram for the 1-, 2-, and 3-dimension features. The results are shown in Figure 2.

Figure 2. Visualization of the ability of important features to classify thermophilic and non-thermophilic proteins. (A) Violin diagram of the K feature. (B) Scatter diagram of the K feature and D feature. (C) 3D scatter diagram of the K, D, and LK features. K is the percentage of lysine in the amino acid sequence, D is the percentage of aspartic acid in the amino acid sequence, and LK is the percentage content of the dipeptide consisting of leucine and lysine in the amino acid sequence.

As seen from Figure 2A, the K value of the thermophilic proteins is concentrated around 0.08, whereas the K value of the non-thermophilic proteins is concentrated around 0.03. These results indicate that the K feature distinguishes thermophilic proteins from non-thermophilic proteins well, a finding of great significance for the identification of the thermophilic properties of proteins. All three panels reveal obvious differences in the distribution pattern between the two data sets, which indicates that these features have strong recognition ability and good performance in distinguishing thermophilic proteins from non-thermophilic proteins, as shown in Table 3.

Comparison With Other Classification Methods

To reveal the advantage of our method, we applied six other classification methods to train our data sets based on the Waikato environment for knowledge analysis (Weka) tool (Witten and Frank, 2002): logistic, random forest, BayesNet, logistic model trees (LMTs), J48, and reduced error pruning tree (REPTree).

We used the combination with the highest overall ACC in this article (16 features in AAC and 103 features in DPC) as the input and applied each of the above classifiers to the data set to obtain its SE, SP, and ACC. To ensure a robust comparison, each classifier was also evaluated by cross-validation. The prediction results of each method are shown in Table 4.

Table 4.

The performance of different classification methods in the prediction of the data sets.

Classification method	SE (%)	SP (%)	Accuracy (%)
SVM (this article)	95.85	96.22	96.02
LMT	92.35	90.29	91.40
Logistic	91.15	88.90	90.11
Random Forest	91.69	87.51	89.75
BayesNet	88.08	86.25	87.24
REPTree	83.60	84.62	84.07
J48	83.50	80.33	82.03

It can be seen from Table 4 that the SVM we used in this study achieved the best performance; the SE, SP, and ACC of the other methods were all lower than those of the SVM method of this article. To visualize the data, we constructed a cluster histogram of the performance of the different methods, shown in Figure 3.

Figure 3. The performance of the method described in this article and six other classifiers when the input is 16 parameters of amino acid composition and 103 parameters of dipeptide composition. The performance metrics are sensitivity (SE), specificity (SP), and accuracy (ACC).

The advantage of using SVM to predict data sets is apparent from the histogram.

Conclusion

In this article, we distinguished 915 thermophilic proteins and 793 non-thermophilic proteins. We applied iFeature to extract the features of the protein sequences. MRMD2.0 was used to reduce the dimensions of features and select the ones that performed the best. LIBSVM was used to optimize the parameters and establish the prediction model. As a result, the overall ACC was improved, which reached 96.02% under cross-validation. Furthermore, we constructed a prediction model by LIBSVM with 16 parameters, and the ACC determined by cross-validation was 93.33%. In addition, we found that the K feature played a significant role in the identification. Finally, we demonstrated the advantage of SVM by comparing its performance with that of other methods. We aim to analyze information, such as the family of misclassified proteins, to optimize our method in the future.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: doi: 10.1016/j.mimet.2010.10.013.

Author Contributions

ZG designed the study and the overall experimental approach, carried out the comparative experiments, and analyzed the experiments. PW analyzed the experimental data. ZL and YZ analyzed the experimental results and improved the manuscript. All authors contributed to the article and approved the submitted version.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

Funding. This work was supported by the Natural Science Foundation of China (Nos. 61971119 and 61672328).

References

Bhasin M., Raghava G. P. S. (2004). Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem. 279, 23262–23266. 10.1074/jbc.M401932200 [DOI] [PubMed] [Google Scholar]
Chen W., Feng P., Liu T., Jin D. (2019b). Recent advances in machine learning methods for predicting heat shock proteins. Curr. Drug Metab. 20, 224–228. 10.2174/1389200219666181031105916 [DOI] [PubMed] [Google Scholar]
Chen W., Feng P., Nie F. (2019a). iATP: a sequence based method for identifying anti-tubercular peptides. Med. Chem. 16, 620–625. 10.2174/1573406415666191002152441 [DOI] [PubMed] [Google Scholar]
Chen Z., Zhao P., Li F., Leier A., Marquezlago T. T., Wang Y., et al. (2018). iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34, 2499–2502. 10.1093/bioinformatics/bty140 [DOI] [PMC free article] [PubMed] [Google Scholar]
Cheng J. H., Yang H., Liu M. L., Su W., Feng P. M., Ding H., et al. (2018). Prediction of bacteriophage proteins located in the host cell using hybrid features. Chemometr. Intell. Lab. 180, 64–69. 10.1016/j.chemolab.2018.07.006 [DOI] [Google Scholar]
Cheng L. (2019). Computational and biological methods for gene therapy. Curr. Gene Ther. 19, 210–210. 10.2174/156652321904191022113307 [DOI] [PubMed] [Google Scholar]
Cheng L., Jiang Y., Ju H., Sun J., Peng J., Zhou M., et al. (2018). InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics 19(Suppl. 1):919. 10.1186/s12864-017-4338-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
Cheng L., Wang P., Tian R., Wang S., Guo Q., Luo M., et al. (2019a). LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Res. 47, D140–D144. 10.1093/nar/gky1051 [DOI] [PMC free article] [PubMed] [Google Scholar]
Cheng L., Zhuang H., Ju H., Yang S., Han J., Tan R., et al. (2019b). Exposing the causal effect of body mass index on the risk of type 2 Diabetes mellitus: a mendelian randomization study. Front. Genet. 10:94. 10.3389/fgene.2019.00094 [DOI] [PMC free article] [PubMed] [Google Scholar]
Deng L., Wang J., Zhang J. (2019). Predicting gene ontology function of human MicroRNAs by integrating multiple networks. Front. Genet. 10:3. 10.3389/fgene.2019.00003 [DOI] [PMC free article] [PubMed] [Google Scholar]
Ding C., Peng H. (2005). Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 3, 185–205. 10.1142/S0219720005001004 [DOI] [PubMed] [Google Scholar]
Ding Y., Tang J., Guo F. (2017). Identification of drug-target interactions via multiple information integration. Inf. Sci. 418–419, 546–560. 10.1016/j.ins.2017.08.04527294413 [DOI] [Google Scholar]
Ding Y., Tang J., Guo F. (2019a). Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing 325, 211–224. 10.1016/j.neucom.2018.10.02830507518 [DOI] [Google Scholar]
Ding Y., Tang J., Guo F. (2019b). Identification of drug-side effect association via semi-supervised model and multiple kernel learning. IEEE J. Biomed. Health Inform. 23, 2619–2632. 10.1109/JBHI.2018.2883834 [DOI] [PubMed] [Google Scholar]
Dubchak I., Muchnik I., Mayor C., Dralyuk I., Kim S. (1999). Recognition of a protein fold in the context of the SCOP classification. Proteins 35, 401–407. [PubMed] [Google Scholar]
Feng P., Ding H., Lin H., Chen W. (2017). AOD: the antioxidant protein database. Sci. Rep. 7:7449. 10.1038/s41598-017-08115-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
Fu X., Cai L., Zeng X., Zou Q. (2020). StackCPPred: a stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency. Bioinformatics 36, 3028–3034. 10.1093/bioinformatics/btaa131 [DOI] [PubMed] [Google Scholar]
Gromiha M. M., Oobatake M., Sarai A. (1999). Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophys. Chem. 82, 51–67. 10.1016/S0301-4622(99)00103-9 [DOI] [PubMed] [Google Scholar]
Gromiha M. M., Suresh M. X. (2010). Discrimination of mesophilic and thermophilic proteins using machine learning algorithms. Proteins 70, 1274–1279. 10.1002/prot.21616 [DOI] [PubMed] [Google Scholar]
Hong Z., Zeng X., Wei L., Liu X. (2020). Identifying enhancer–promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics 36, 1037–1043. doi: 10.1093/bioinformatics/btz694
Jia C., Bi Y., Chen J., Leier A., Li F., Song J. (2020). PASSION: an ensemble neural network approach for identifying the binding sites of RBPs on circRNAs. Bioinformatics 36, 4276–4282. doi: 10.1093/bioinformatics/btaa522
Jiang Q. H., Wang G. H., Jin S. L., Li Y., Wang Y. D. (2013). Predicting human microRNA-disease associations based on support vector machine. Int. J. Data Min. Bioinform. 8, 282–293. doi: 10.1504/IJDMB.2013.056078
Junwei H., Xudong H., Qingfei K., Liang C. (2019). psSubpathway: a software package for flexible identification of phenotype-specific subpathways in cancer progression. Bioinformatics 36, 2303–2305. doi: 10.1093/bioinformatics/btz894
Kumar S., Tsai C., Nussinov R. (2000). Factors enhancing protein thermostability. Protein Eng. 13, 179–191. doi: 10.1093/protein/13.3.179
Li F., Zhou Y., Zhang X., Tang J., Yang Q., Zhang Y., et al. (2020). SSizer: determining the sample sufficiency for comparative biological study. J. Mol. Biol. 432, 3411–3421. doi: 10.1016/j.jmb.2020.01.027
Li J., Pu Y., Tang J., Zou Q., Guo F. (2020). DeepAVP: a dual-channel deep neural network for identifying variable-length antiviral peptides. IEEE J. Biomed. Health Inform. doi: 10.1109/JBHI.2020.2977091. [Epub ahead of print].
Li Y. H., Li X. X., Hong J. J., Wang Y. X., Fu J. B., Yang H., et al. (2020). Clinical trials, progression-speed differentiating features and swiftness rule of the innovative targets of first-in-class drugs. Brief. Bioinform. 21, 649–662. doi: 10.1093/bib/bby130
Li Y. H., Yu C. Y., Li X. X., Zhang P., Tang J., Yang Q., et al. (2018). Therapeutic target database update 2018: enriched resource for facilitating bench-to-clinic research of targeted therapeutics. Nucleic Acids Res. 46, D1121–D1127. doi: 10.1093/nar/gkx1076
Liang C., Changlu Q., He Z., Tongze F., Xue Z. (2019). gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Res. 48, D554–D560. doi: 10.1093/nar/gkz843
Liang H., Huang C., Ko M., Hwang J. (2005). Amino acid coupling patterns in thermophilic proteins. Proteins 59, 58–63. doi: 10.1002/prot.20386
Lin H., Chen W. (2011). Prediction of thermophilic proteins using feature selection technique. J. Microbiol. Methods 84, 67–70. doi: 10.1016/j.mimet.2010.10.013
Lin H., Ding C., Song Q., Yang P., Ding H., Deng K. J., et al. (2012). The prediction of protein structural class using averaged chemical shifts. J. Biomol. Struct. Dyn. 29, 643–649. doi: 10.1080/07391102.2011.672628
Liu B. (2019). BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches. Brief. Bioinform. 20, 1280–1294. doi: 10.1093/bib/bbx165
Liu B., Gao X., Zhang H. (2019). BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 47:e127. doi: 10.1093/nar/gkz740
Liu B., Li C., Yan K. (2020). DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks. Brief. Bioinform. 21, 1733–1741. doi: 10.1093/bib/bbz098
Liu B., Li K. (2019). iPromoter-2L2.0: identifying promoters and their types by combining smoothing cutting window algorithm and sequence-based features. Mol. Ther. Nucleic Acids 18, 80–87. doi: 10.1016/j.omtn.2019.08.008
Liu K., Chen W. (2020). iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics 36, 3336–3342. doi: 10.1093/bioinformatics/btaa155
Liu Y., Zeng X., He Z., Zou Q. (2017). Inferring MicroRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 905–915. doi: 10.1109/TCBB.2016.2550432
Lv H., Dao F. Y., Zhang D., Guan Z. X., Yang H., Su W., et al. (2020). iDNA-MS: an integrated computational tool for detecting DNA modification sites in multiple genomes. iScience 23:100991. doi: 10.1016/j.isci.2020.100991
Montanucci L., Fariselli P., Martelli P. L., Casadio R. (2008). Predicting protein thermostability changes from sequence upon multiple mutations. Bioinformatics 24, i190–i195. doi: 10.1093/bioinformatics/btn166
Pan Y., Wang Z., Zhan W., Deng L. (2018). Computational identification of binding energy hot spots in protein-RNA complexes using an ensemble approach. Bioinformatics 34, 1473–1480. doi: 10.1093/bioinformatics/btx822
Sadeghi M., Naderimanesh H., Zarrabi M., Ranjbar B. (2006). Effective factors in thermostability of thermophilic proteins. Biophys. Chem. 119, 256–270. doi: 10.1016/j.bpc.2005.09.018
Saravanan V., Gautham N. (2015). Harnessing computational biology for exact linear B-cell epitope prediction: a novel amino acid composition-based feature descriptor. OMICS 19, 648–658. doi: 10.1089/omi.2015.0095
Scheffe H. (1960). The analysis of variance. Soil Sci. 89:360. doi: 10.1097/00010694-196006000-00016
Shen C., Jiang L., Ding Y., Tang J., Guo F. (2019b). LPI-KTASLP: prediction of lncRNA-protein interaction by semi-supervised link learning with multivariate information. IEEE Access 7, 13486–13496. doi: 10.1109/ACCESS.2019.2894225
Shen Y., Ding Y., Tang J., Zou Q., Guo F. (2019a). Critical evaluation of web-based prediction tools for human protein subcellular localization. Brief. Bioinform. 21, 1628–1640. doi: 10.1093/bib/bbz106
Shen Y., Tang J., Guo F. (2019b). Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC. J. Theor. Biol. 462, 230–239. doi: 10.1016/j.jtbi.2018.11.012
Su R., Wu H., Xu B., Liu X., Wei L. (2019). Developing a multi-dose computational model for drug-induced hepatotoxicity prediction based on toxicogenomics data. IEEE/ACM Trans. Comp. Biol. Bioinform. 16, 1231–1239. doi: 10.1109/TCBB.2018.2858756
Tan J. X., Li S. H., Zhang Z. M., Chen C. X., Chen W., Tang H., et al. (2019). Identification of hormone binding proteins based on machine learning methods. Math. Biosci. Eng. 16, 2466–2480. doi: 10.3934/mbe.2019123
Tang H., Zhao Y. W., Zou P., Zhang C. M., Chen R., Huang P., et al. (2018). HBPred: a tool to identify growth hormone-binding proteins. Int. J. Biol. Sci. 14, 957–964. doi: 10.7150/ijbs.24174
Tang J., Fu J., Wang Y., Li B., Li Y., Yang Q., et al. (2020). ANPELA: analysis and performance assessment of the label-free quantification workflow for metaproteomic studies. Brief. Bioinform. 21, 621–636. doi: 10.1093/bib/bby127
Tang J., Fu J., Wang Y., Luo Y., Yang Q., Li B., et al. (2019). Simultaneous improvement in the precision, accuracy, and robustness of label-free proteome quantification by optimizing data manipulation chains. Mol. Cell. Proteomics 18, 1683–1699. doi: 10.1074/mcp.RA118.001169
Tian B., Wu X., Chen C., Qiu W., Ma Q., Yu B. (2019). Predicting protein–protein interactions by fusing various Chou's pseudo components and using wavelet denoising approach. J. Theor. Biol. 462, 329–346. doi: 10.1016/j.jtbi.2018.11.011
Tomii K., Kanehisa M. (1996). Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27–36. doi: 10.1093/protein/9.1.27
Wang G., Luo X., Wang J., Wan J., Xia S., Zhu H., et al. (2018). MeDReaders: a database for transcription factors that bind to methylated DNA. Nucleic Acids Res. 46, D146–D151. doi: 10.1093/nar/gkx1096
Wang G., Wang Y., Feng W., Wang X., Yang J. Y., Zhao Y., et al. (2008). Transcription factor and microRNA regulation in androgen-dependent and -independent prostate cancer cells. BMC Genomics 9(Suppl. 2):S22. doi: 10.1186/1471-2164-9-S2-S22
Wang G., Wang Y., Teng M., Zhang D., Li L., Liu Y. (2010). Signal transducers and activators of transcription-1 (STAT1) regulates microRNA transcription in interferon gamma-stimulated HeLa cells. PLoS ONE 5:e11794. doi: 10.1371/journal.pone.0011794
Wang H., Ding Y., Tang J., Guo F. (2020). Identification of membrane protein types via multivariate information fusion with Hilbert-Schmidt independence criterion. Neurocomputing 383, 257–269. doi: 10.1016/j.neucom.2019.11.103
Wang H., Liu C., Deng L. (2018). Enhanced prediction of hot spots at protein-protein interfaces using extreme gradient boosting. Sci. Rep. 8:14285. doi: 10.1038/s41598-018-32511-1
Wang L. Q., Li C. F. (2014). Optimal subset selection of primary sequence features using the genetic algorithm for thermophilic proteins identification. Biotechnol. Lett. 36, 1963–1969. doi: 10.1007/s10529-014-1577-3
Wang M., Yue L., Cui X., Chen C., Zhou H., Ma Q., et al. (2020). Prediction of extracellular matrix proteins by fusing multiple feature information, elastic net, and random forest algorithm. Mathematics 8:169. doi: 10.3390/math8020169
Wang Y., Zhang S., Li F., Zhou Y., Zhang Y., Wang Z., et al. (2020). Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics. Nucleic Acids Res. 48, D1031–D1041. doi: 10.1093/nar/gkz981
Wang Z., He W., Tang J., Guo F. (2020). Identification of highest-affinity binding sites of yeast transcription factor families. J. Chem. Inf. Model. 60, 1876–1883. doi: 10.1021/acs.jcim.9b01012
Wei L., Wan S., Guo J., Wong K. K. (2017a). A novel hierarchical selective ensemble classifier with bioinformatics application. Artif. Intell. Med. 83, 82–90. doi: 10.1016/j.artmed.2017.02.005
Wei L., Xing P., Shi G., Ji Z., Zou Q. (2019). Fast prediction of methylation sites using sequence-based feature selection technique. IEEE/ACM Trans. Comp. Biol. Bioinform. 16, 1264–1273. doi: 10.1109/TCBB.2017.2670558
Wei L., Xing P., Zeng J., Chen J., Su R., Guo F. (2017b). Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med. 83, 67–74. doi: 10.1016/j.artmed.2017.03.001
Wei L., Zhou C., Chen H., Song J., Su R. (2018). ACPred-FL: a sequence-based predictor based on effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 34, 4007–4016. doi: 10.1093/bioinformatics/bty451
Witten I. H., Frank E. (2002). Data mining: practical machine learning tools and techniques with Java implementations. SIGMOD Rec. 31, 76–77. doi: 10.1145/507338.507355
Xu L., Liang G., Liao C., Chen G.-D., Chang C.-C. (2019). k-skip-n-gram-RF: a random forest based method for Alzheimer's disease protein identification. Front. Genet. 10:33. doi: 10.3389/fgene.2019.00033
Xu L., Liang G., Liao C., Chen G.-D., Chang C.-C. (2018a). An efficient classifier for Alzheimer's disease genes identification. Molecules 23:3140. doi: 10.3390/molecules23123140
Xu L., Liang G., Shi S., Liao C. (2018b). SeqSVM: a sequence-based support vector machine method for identifying antioxidant proteins. Int. J. Mol. Sci. 19:1773. doi: 10.3390/ijms19061773
Xu L., Liang G., Wang L., Liao C. (2018c). A novel hybrid sequence-based model for identifying anticancer peptides. Genes 9:158. doi: 10.3390/genes9030158
Xu Y., Wang Y., Luo J., Zhao W., Zhou X. (2017). Deep learning of the splicing (epi) genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision. Nucleic Acids Res. 45, 12100–12112. doi: 10.1093/nar/gkx870
Xue W., Yang F., Wang P., Zheng G., Chen Y., Yao X., et al. (2018). What contributes to serotonin-norepinephrine reuptake inhibitors' dual-targeting mechanism? The key role of transmembrane domain 6 in human serotonin and norepinephrine transporters revealed by molecular dynamics simulation. ACS Chem. Neurosci. 9, 1128–1140. doi: 10.1021/acschemneuro.7b00490
Yang C. (2019). Interaction of cell and gene therapy with the immune system. Curr. Gene Ther. 19, 69–70. doi: 10.2174/156652321902190722112944
Yang H., Yang W., Dao F. Y., Lv H., Ding H., Chen W., et al. (2019). A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief. Bioinform. 21, 1568–1580. doi: 10.1093/bib/bbz123
Yang Q., Li B., Tang J., Cui X., Wang Y., Li X., et al. (2020a). Consistent gene signature of schizophrenia identified by a novel feature selection strategy from comprehensive sets of transcriptomic data. Brief. Bioinform. 21, 1058–1068. doi: 10.1093/bib/bbz049
Yang Q., Wang Y., Zhang Y., Li F., Xia W., Zhou Y., et al. (2020b). NOREVA: enhanced normalization and evaluation of time-course and multi-class metabolomic data. Nucleic Acids Res. 48, W436–W448. doi: 10.1093/nar/gkaa258
Yang W., Zhu X. J., Huang J., Ding H., Lin H. (2019). A brief survey of machine learning methods in protein sub-Golgi localization. Curr. Bioinform. 14, 234–240. doi: 10.2174/1574893613666181113131415
Yin J., Sun W., Li F., Hong J., Li X., Zhou Y., et al. (2020). VARIDT 1.0: variability of drug transporter database. Nucleic Acids Res. 48, D1042–D1050. doi: 10.1093/nar/gkz779
Zeng X., Liao Y., Liu Y., Zou Q. (2017). Prediction and validation of disease genes using hetesim scores. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 687–695. doi: 10.1109/TCBB.2016.2520947
Zeng X., Zhu S., Liu X., Zhou Y., Nussinov R., Cheng F. (2019). deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics 35, 5191–5198. doi: 10.1093/bioinformatics/btz418
Zhang F., Ma A., Wang Z., Ma Q., Liu B., Huang L., et al. (2018). A central edge selection based overlapping community detection algorithm for the detection of overlapping structures in protein–protein interaction networks. Molecules 23:2633. doi: 10.3390/molecules23102633
Zhang G., Fang B. (2006). Discrimination of thermophilic and mesophilic proteins via pattern recognition methods. Process Biochem. 41, 552–556. doi: 10.1016/j.procbio.2005.09.003
Zhang G., Fang B. (2007). LogitBoost classifier for discriminating thermophilic and mesophilic proteins. J. Biotechnol. 127, 417–424. doi: 10.1016/j.jbiotec.2006.07.020
Zhang Z. Y., Yang Y. H., Ding H., Wang D., Chen W., Lin H. (2020). Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief. Bioinform. doi: 10.1093/bib/bbz177. [Epub ahead of print].
Zhao X., Jiao Q., Li H., Wu Y., Wang H., Huang S., et al. (2020). ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles. BMC Bioinform. 21:43. doi: 10.1186/s12859-020-3388-y
Zhao Y., Wang F., Chen S., Wan J., Wang G. (2017). Methods of MicroRNA promoter prediction and transcription factor mediated regulatory network. Biomed. Res. Int. 2017:7049406. doi: 10.1155/2017/7049406
Zhao Y., Wang F., Juan L. (2015). MicroRNA promoter identification in Arabidopsis using multiple histone markers. Biomed. Res. Int. 2015:861402. doi: 10.1155/2015/861402
Zheng N., Wang K., Zhan W., Deng L. (2019). Targeting virus-host protein interactions: feature extraction and machine learning approaches. Curr. Drug Metab. 20, 177–184. doi: 10.2174/1389200219666180829121038
Zhu X. J., Feng C. Q., Lai H. Y., Chen W., Lin H. (2019). Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl-Based Syst. 163, 787–793. doi: 10.1016/j.knosys.2018.10.007
Zou Q., Chen L., Huang T., Zhang Z., Xu Y. (2017a). Machine learning and graph analytics in computational biomedicine. Artificial intelligence in medicine. Artif. Intell. Med. 83:1. doi: 10.1016/j.artmed.2017.09.003
Zou Q., Mrozek D., Ma Q., Xu Y. (2017b). Scalable data mining algorithms in computational biology and biomedicine. Biomed Res. Int. 2017:5652041. doi: 10.1155/2017/5652041

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: doi: 10.1016/j.mimet.2010.10.013.

[B1] Bhasin M., Raghava G. P. S. (2004). Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem. 279, 23262–23266. 10.1074/jbc.M401932200 [DOI] [PubMed] [Google Scholar]

[B2] Chen W., Feng P., Liu T., Jin D. (2019b). Recent advances in machine learning methods for predicting heat shock proteins. Curr. Drug Metab. 20, 224–228. 10.2174/1389200219666181031105916 [DOI] [PubMed] [Google Scholar]

[B3] Chen W., Feng P., Nie F. (2019a). iATP: a sequence based method for identifying anti-tubercular peptides. Med. Chem. 16, 620–625. 10.2174/1573406415666191002152441 [DOI] [PubMed] [Google Scholar]

[B4] Chen Z., Zhao P., Li F., Leier A., Marquezlago T. T., Wang Y., et al. (2018). iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34, 2499–2502. 10.1093/bioinformatics/bty140 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] Cheng J. H., Yang H., Liu M. L., Su W., Feng P. M., Ding H., et al. (2018). Prediction of bacteriophage proteins located in the host cell using hybrid features. Chemometr. Intell. Lab. 180, 64–69. 10.1016/j.chemolab.2018.07.006 [DOI] [Google Scholar]

[B6] Cheng L. (2019). Computational and biological methods for gene therapy. Curr. Gene Ther. 19, 210–210. 10.2174/156652321904191022113307 [DOI] [PubMed] [Google Scholar]

[B7] Cheng L., Jiang Y., Ju H., Sun J., Peng J., Zhou M., et al. (2018). InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics 19(Suppl. 1):919. 10.1186/s12864-017-4338-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] Cheng L., Wang P., Tian R., Wang S., Guo Q., Luo M., et al. (2019a). LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Res. 47, D140–D144. 10.1093/nar/gky1051 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] Cheng L., Zhuang H., Ju H., Yang S., Han J., Tan R., et al. (2019b). Exposing the causal effect of body mass index on the risk of type 2 Diabetes mellitus: a mendelian randomization study. Front. Genet. 10:94. 10.3389/fgene.2019.00094 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] Deng L., Wang J., Zhang J. (2019). Predicting gene ontology function of human MicroRNAs by integrating multiple networks. Front. Genet. 10:3. 10.3389/fgene.2019.00003 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] Ding C., Peng H. (2005). Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 3, 185–205. 10.1142/S0219720005001004 [DOI] [PubMed] [Google Scholar]

[B12] Ding Y., Tang J., Guo F. (2017). Identification of drug-target interactions via multiple information integration. Inf. Sci. 418–419, 546–560. 10.1016/j.ins.2017.08.04527294413 [DOI] [Google Scholar]

[B13] Ding Y., Tang J., Guo F. (2019a). Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing 325, 211–224. 10.1016/j.neucom.2018.10.02830507518 [DOI] [Google Scholar]

[B14] Ding Y., Tang J., Guo F. (2019b). Identification of drug-side effect association via semi-supervised model and multiple kernel learning. IEEE J. Biomed. Health Inform. 23, 2619–2632. 10.1109/JBHI.2018.2883834 [DOI] [PubMed] [Google Scholar]

[B15] Dubchak I., Muchnik I., Mayor C., Dralyuk I., Kim S. (1999). Recognition of a protein fold in the context of the SCOP classification. Proteins 35, 401–407. [PubMed] [Google Scholar]

[B16] Feng P., Ding H., Lin H., Chen W. (2017). AOD: the antioxidant protein database. Sci. Rep. 7:7449. 10.1038/s41598-017-08115-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] Fu X., Cai L., Zeng X., Zou Q. (2020). StackCPPred: a stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency. Bioinformatics 36, 3028–3034. 10.1093/bioinformatics/btaa131 [DOI] [PubMed] [Google Scholar]

[B18] Gromiha M. M., Oobatake M., Sarai A. (1999). Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophys. Chem. 82, 51–67. 10.1016/S0301-4622(99)00103-9 [DOI] [PubMed] [Google Scholar]

[B19] Gromiha M. M., Suresh M. X. (2010). Discrimination of mesophilic and thermophilic proteins using machine learning algorithms. Proteins 70, 1274–1279. 10.1002/prot.21616 [DOI] [PubMed] [Google Scholar]

[B20] Hong Z., Zeng X., Wei L., Liu X. (2020). Identifying enhancer–promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics 36, 1037–1043. 10.1093/bioinformatics/btz694 [DOI] [PubMed] [Google Scholar]

[B21] Jia C., Bi Y., Chen J., Leier A., Li F., Song J. (2020). PASSION: an ensemble neural network approach for identifying the binding sites of RBPs on circRNAs. Bioinformatics 36, 4276–4282. 10.1093/bioinformatics/btaa522 [DOI] [PubMed] [Google Scholar]

[B22] Jiang Q. H., Wang G. H., Jin S. L., Li Y., Wang Y. D. (2013). Predicting human microRNA-disease associations based on support vector machine. Int. J. Data Min. Bioinform. 8, 282–293. 10.1504/IJDMB.2013.056078 [DOI] [PubMed] [Google Scholar]

[B23] Junwei H., Xudong H., Qingfei K., Liang C. (2019). psSubpathway: a software package for flexible identification of phenotype-specific subpathways in cancer progression. Bioinformatics 36, 2303–2305. 10.1093/bioinformatics/btz894 [DOI] [PubMed] [Google Scholar]

[B24] Kumar S., Tsai C., Nussinov R. (2000). Factors enhancing protein thermostability. Protein Eng. 13, 179–191. 10.1093/protein/13.3.179 [DOI] [PubMed] [Google Scholar]

[B25] Li F., Zhou Y., Zhang X., Tang J., Yang Q., Zhang Y., et al. (2020). SSizer: determining the sample sufficiency for comparative biological study. J. Mol. Biol. 432, 3411–3421. 10.1016/j.jmb.2020.01.027 [DOI] [PubMed] [Google Scholar]

[B26] Li J., Pu Y., Tang J., Zou Q., Guo F. (2020). DeepAVP: a dual-channel deep neural network for identifying variable-length antiviral peptides. IEEE J. Biomed. Health Inform. 10.1109/JBHI.2020.2977091. [Epub ahead of print]. [DOI] [PubMed] [Google Scholar]

[B27] Li Y. H., Li X. X., Hong J. J., Wang Y. X., Fu J. B., Yang H., et al. (2020). Clinical trials, progression-speed differentiating features and swiftness rule of the innovative targets of first-in-class drugs. Brief. Bioinform. 21, 649–662. 10.1093/bib/bby130 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B28] Li Y. H., Yu C. Y., Li X. X., Zhang P., Tang J., Yang Q., et al. (2018). Therapeutic target database update 2018: enriched resource for facilitating bench-to-clinic research of targeted therapeutics. Nucleic Acids Res. 46, D1121–D1127. 10.1093/nar/gkx1076 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B29] Liang C., Changlu Q., He Z., Tongze F., Xue Z. (2019). gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Res. 48, D554–D560. 10.1093/nar/gkz843 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B30] Liang H., Huang C., Ko M., Hwang J. (2005). Amino acid coupling patterns in thermophilic proteins. Proteins 59, 58–63. 10.1002/prot.20386 [DOI] [PubMed] [Google Scholar]

[B31] Lin H., Chen W. (2011). Prediction of thermophilic proteins using feature selection technique. J. Microbiol. Methods 84, 67–70. 10.1016/j.mimet.2010.10.013 [DOI] [PubMed] [Google Scholar]

[B32] Lin H., Ding C., Song Q., Yang P., Ding H., Deng K. J., et al. (2012). The prediction of protein structural class using averaged chemical shifts. J. Biomol. Struct. Dyn. 29, 643–649. 10.1080/07391102.2011.672628 [DOI] [PubMed] [Google Scholar]

[B33] Liu B. (2019). BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches. Brief. Bioinform. 20, 1280–1294. 10.1093/bib/bbx165 [DOI] [PubMed] [Google Scholar]

[B34] Liu B., Gao X., Zhang H. (2019). BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 47:e127. 10.1093/nar/gkz740 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B35] Liu B., Li C., Yan K. (2012). DeepSVM-fold: protein fold recognition by combining support vector Machines and pairwise sequence similarity scores generated by deep learning networks. Brief. Bioinform. 21, 1733–1741. 10.1093/bib/bbz098 [DOI] [PubMed] [Google Scholar]

[B36] Liu B., Li K. (2019). iPromoter-2L2.0: identifying promoters and their types by combining smoothing cutting window algorithm and sequence-based features. Mol. Ther. Nucleic Acids. 18, 80–87. 10.1016/j.omtn.2019.08.008 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B37] Liu K., Chen W. (2020). iMRM:a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics 36, 3336–3342. 10.1093/bioinformatics/btaa155 [DOI] [PubMed] [Google Scholar]

[B38] Liu Y., Zeng X., He Z., Zou Q. (2017). Inferring MicroRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 905–915. 10.1109/TCBB.2016.2550432 [DOI] [PubMed] [Google Scholar]

[B39] Lv H., Dao F. Y., Zhang D., Guan Z. X., Yang H., Su W., et al. (2020). iDNA-MS. an integrated computational tool for detecting DNA modification sites in multiple genomes. iScience 23:100991. 10.1016/j.isci.2020.100991 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B40] Montanucci L., Fariselli P., Martelli P. L., Casadio R. (2008). Predicting protein thermostability changes from sequence upon multiple mutations. Bioinformatics 2008, 190–195. 10.1093/bioinformatics/btn166 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B41] Pan Y., Wang Z., Zhan W., Deng L. (2018). Computational identification of binding energy hot spots in protein-RNA complexes using an ensemble approach. Bioinformatics 34, 1473–1480. 10.1093/bioinformatics/btx822 [DOI] [PubMed] [Google Scholar]

[B42] Sadeghi M., Naderimanesh H., Zarrabi M., Ranjbar B. (2006). Effective factors in thermostability of thermophilic proteins. Biophys. Chem. 119, 256–270. 10.1016/j.bpc.2005.09.018 [DOI] [PubMed] [Google Scholar]

[B43] Saravanan V., Gautham N. (2015). Harnessing computational biology for exact linear B-cell epitope prediction: a novel amino acid composition-based feature descriptor. OMICS 19, 648–658. 10.1089/omi.2015.0095 [DOI] [PubMed] [Google Scholar]

[B44] Scheffe H. (1960). The analysis of variance. Soil Sci. 89:360 10.1097/00010694-196006000-00016 [DOI] [Google Scholar]

[B45] Shen C., Jiang L., Ding Y., Tang J., Guo F. (2019b). LPI-KTASLP. prediction of lncRNA-protein interaction by semi-supervised link learning with multivariate information. IEEE Access 7, 13486–13496. 10.1109/ACCESS.2019.289422530697228 [DOI] [Google Scholar]

[B46] Shen Y., Ding Y., Tang J., Zou Q., Guo F. (2019a). Critical evaluation of web-based prediction tools for human protein subcellular localization. Brief. Bioinform. 21, 1628–1640. 10.1093/bib/bbz106 [DOI] [PubMed] [Google Scholar]

[B47] Shen Y., Tang J., Guo F. (2019b). Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC. J. Theor. Biol. 462, 230–239. 10.1016/j.jtbi.2018.11.012 [DOI] [PubMed] [Google Scholar]

[B48] Su R., Wu H., Xu B., Liu X., Wei L. (2019). Developing a multi-dose computational model for drug-induced hepatotoxicity prediction based on toxicogenomics data. IEEE/ACM Trans. Comp. Biol. Bioinform. 16, 1231–1239. 10.1109/TCBB.2018.2858756 [DOI] [PubMed] [Google Scholar]

[B49] Tan J. X., Li S. H., Zhang Z. M., Chen C. X., Chen W., Tang H., et al. (2019). Identification of hormone binding proteins based on machine learning methods. Math. Biosci. Eng. 16, 2466–2480. 10.3934/mbe.2019123 [DOI] [PubMed] [Google Scholar]

[B50] Tang H., Zhao Y. W., Zou P., Zhang C. M., Chen R., Huang P., et al. (2018). HBPred: a tool to identify growth hormone-binding proteins. Int. J. Biol. Sci. 14, 957–964. 10.7150/ijbs.24174 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B51] Tang J., Fu J., Wang Y., Li B., Li Y., Yang Q., et al. (2020). ANPELA: analysis and performance assessment of the label-free quantification workflow for metaproteomic studies. Brief. Bioinform. 21, 621–636. 10.1093/bib/bby127 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B52] Tang J., Fu J., Wang Y., Luo Y., Yang Q., Li B., et al. (2019). Simultaneous improvement in the precision, accuracy, and robustness of label-free proteome quantification by optimizing data manipulation chains. Mol. Cell. Proteomics 18, 1683–1699. 10.1074/mcp.RA118.001169 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B53] Tian B., Wu X., Chen C., Qiu W., Ma Q., Yu B. (2019). Predicting protein–protein interactions by fusing various Chou's pseudo components and using wavelet denoising approach. J. Theor. Biol. 462, 329–346. 10.1016/j.jtbi.2018.11.011 [DOI] [PubMed] [Google Scholar]

[B54] Tomii K., Kanehisa M. (1996). Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27–36. 10.1093/protein/9.1.27 [DOI] [PubMed] [Google Scholar]

[B55] Wang G., Luo X., Wang J., Wan J., Xia S., Zhu H., et al. (2018). MeDReaders: a database for transcription factors that bind to methylated DNA. Nucleic Acids Res. 46, D146–D151. 10.1093/nar/gkx1096 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B56] Wang G., Wang Y., Feng W., Wang X., Yang J. Y., Zhao Y., et al. (2008). Transcription factor and microRNA regulation in androgen-dependent and -independent prostate cancer cells. BMC Genomics. 9(Suppl. 2):S22. 10.1186/1471-2164-9-S2-S22 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B57] Wang G., Wang Y., Teng M., Zhang D., Li L., Liu Y. (2010). Signal transducers and activators of transcription-1 (STAT1) regulates microRNA transcription in interferon gamma-stimulated HeLa cells. PLoS ONE 5:e11794. 10.1371/journal.pone.0011794 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B58] Wang H., Ding Y., Tang J., Guo F. (2020). Identification of membrane protein types via multivariate information fusion with Hilbert-schmidt independence criterion. Neurocomputing 383, 257–269. 10.1016/j.neucom.2019.11.103 [DOI] [Google Scholar]

[B59] Wang H., Liu C., Deng L. (2018). Enhanced prediction of hot spots at protein-protein interfaces using extreme gradient boosting. Sci. Rep. 8:14285. 10.1038/s41598-018-32511-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B60] Wang L. Q., Li C. F. (2014). Optimal subset selection of primary sequence features using the genetic algorithm for thermophilic proteins identification. Biotechnol. Lett. 36, 1963–1969. 10.1007/s10529-014-1577-3 [DOI] [PubMed] [Google Scholar]

[B61] Wang M., Yue L., Cui X., Chen C., Zhou H., Ma Q., et al. (2020). Prediction of extracellular matrix proteins by fusing multiple feature information, elastic net, and random forest algorithm. Mathematics 8:169. 10.3390/math8020169 [DOI] [Google Scholar]

[B62] Wang Y., Zhang S., Li F., Zhou Y., Zhang Y., Wang Z., et al. (2020). Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics. Nucleic Acids Res. 48, D1031–D1041. 10.1093/nar/gkz981 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B63] Wang Z., He W., Tang J., Guo F. (2020). Identification of highest-affinity binding sites of yeast transcription factor families. J. Chem. Inf. Model. 60, 1876–1883. 10.1021/acs.jcim.9b01012 [DOI] [PubMed] [Google Scholar]

[B64] Wei L., Wan S., Guo J., Wong K. K. (2017a). A novel hierarchical selective ensemble classifier with bioinformatics application. Artif. Intell. Med. 83, 82–90. 10.1016/j.artmed.2017.02.005 [DOI] [PubMed] [Google Scholar]

[B65] Wei L., Xing P., Shi G., Ji Z., Zou Q. (2019). Fast prediction of methylation sites using sequence-based feature selection technique. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 1264–1273. 10.1109/TCBB.2017.2670558 [DOI] [PubMed] [Google Scholar]

[B66] Wei L., Xing P., Zeng J., Chen J., Su R., Guo F. (2017b). Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med. 83, 67–74. 10.1016/j.artmed.2017.03.001 [DOI] [PubMed] [Google Scholar]

[B67] Wei L., Zhou C., Chen H., Song J., Su R. (2018). ACPred-FL: a sequence-based predictor based on effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 34, 4007–4016. 10.1093/bioinformatics/bty451 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B68] Witten I. H., Frank E. (2002). Data mining: practical machine learning tools and techniques with Java implementations. SIGMOD Rec. 31, 76–77. 10.1145/507338.507355 [DOI] [Google Scholar]

[B70] Xu L., Liang G., Liao C., Chen G.-D., Chang C.-C. (2019). k-skip-n-gram-RF: a random forest based method for Alzheimer's disease protein identification. Front. Genet. 10:33. 10.3389/fgene.2019.00033 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B71] Xu L., Liang G., Liao C., Chen G.-D., Chang C.-C. (2018a). An efficient classifier for Alzheimer's disease genes identification. Molecules 23:3140. 10.3390/molecules23123140 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B72] Xu L., Liang G., Shi S., Liao C. (2018b). SeqSVM: a sequence-based support vector machine method for identifying antioxidant proteins. Int. J. Mol. Sci. 19:1773. 10.3390/ijms19061773 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B73] Xu L., Liang G., Wang L., Liao C. (2018c). A novel hybrid sequence-based model for identifying anticancer peptides. Genes 9:158. 10.3390/genes9030158 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B74] Xu Y., Wang Y., Luo J., Zhao W., Zhou X. (2017). Deep learning of the splicing (epi) genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision. Nucleic Acids Res. 45, 12100–12112. 10.1093/nar/gkx870 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B75] Xue W., Yang F., Wang P., Zheng G., Chen Y., Yao X., et al. (2018). What contributes to serotonin-norepinephrine reuptake inhibitors' dual-targeting mechanism? The key role of transmembrane domain 6 in human serotonin and norepinephrine transporters revealed by molecular dynamics simulation. ACS Chem. Neurosci. 9, 1128–1140. 10.1021/acschemneuro.7b00490 [DOI] [PubMed] [Google Scholar]

[B76] Yang C. (2019). Interaction of cell and gene therapy with the immune system. Curr. Gene Ther. 19, 69–70. 10.2174/156652321902190722112944 [DOI] [PubMed] [Google Scholar]

[B77] Yang H., Yang W., Dao F. Y., Lv H., Ding H., Chen W., et al. (2019). A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief. Bioinform. 21, 1568–1580. 10.1093/bib/bbz123 [DOI] [PubMed] [Google Scholar]

[B78] Yang Q., Li B., Tang J., Cui X., Wang Y., Li X., et al. (2020a). Consistent gene signature of schizophrenia identified by a novel feature selection strategy from comprehensive sets of transcriptomic data. Brief. Bioinform. 21, 1058–1068. 10.1093/bib/bbz049 [DOI] [PubMed] [Google Scholar]

[B79] Yang Q., Wang Y., Zhang Y., Li F., Xia W., Zhou Y., et al. (2020b). NOREVA: enhanced normalization and evaluation of time-course and multi-class metabolomic data. Nucleic Acids Res. 48, W436–W448. 10.1093/nar/gkaa258 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B80] Yang W., Zhu X. J., Huang J., Ding H., Lin H. (2019). A brief survey of machine learning methods in protein sub-Golgi localization. Curr. Bioinform. 14, 234–240. 10.2174/1574893613666181113131415 [DOI] [Google Scholar]

[B81] Yin J., Sun W., Li F., Hong J., Li X., Zhou Y., et al. (2020). VARIDT 1.0: variability of drug transporter database. Nucleic Acids Res. 48, D1042–D1050. 10.1093/nar/gkz779 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B82] Zeng X., Liao Y., Liu Y., Zou Q. (2017). Prediction and validation of disease genes using hetesim scores. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 687–695. 10.1109/TCBB.2016.2520947 [DOI] [PubMed] [Google Scholar]

[B83] Zeng X., Zhu S., Liu X., Zhou Y., Nussinov R., Cheng F. (2019). deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics 35, 5191–5198. 10.1093/bioinformatics/btz418 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B84] Zhang F., Ma A., Wang Z., Ma Q., Liu B., Huang L., et al. (2018). A central edge selection based overlapping community detection algorithm for the detection of overlapping structures in protein–protein interaction networks. Molecules 23:2633. 10.3390/molecules23102633 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B85] Zhang G., Fang B. (2006). Discrimination of thermophilic and mesophilic proteins via pattern recognition methods. Process Biochem. 41, 552–556. 10.1016/j.procbio.2005.09.003 [DOI] [Google Scholar]

[B86] Zhang G., Fang B. (2007). LogitBoost classifier for discriminating thermophilic and mesophilic proteins. J. Biotechnol. 127, 417–424. 10.1016/j.jbiotec.2006.07.020 [DOI] [PubMed] [Google Scholar]

[B87] Zhang Z. Y., Yang Y. H., Ding H., Wang D., Chen W., Lin H. (2020). Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief. Bioinform. 10.1093/bib/bbz177. [Epub ahead of print]. [DOI] [PubMed] [Google Scholar]

[B88] Zhao X., Jiao Q., Li H., Wu Y., Wang H., Huang S., et al. (2020). ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles. BMC Bioinform. 21:43. 10.1186/s12859-020-3388-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[B89] Zhao Y., Wang F., Chen S., Wan J., Wang G. (2017). Methods of MicroRNA promoter prediction and transcription factor mediated regulatory network. Biomed. Res. Int. 2017:7049406. 10.1155/2017/7049406 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B90] Zhao Y., Wang F., Juan L. (2015). MicroRNA promoter identification in Arabidopsis using multiple histone markers. Biomed. Res. Int. 2015:861402. 10.1155/2015/861402 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B91] Zheng N., Wang K., Zhan W., Deng L. (2019). Targeting virus-host protein interactions: feature extraction and machine learning approaches. Curr. Drug Metab. 20, 177–184. 10.2174/1389200219666180829121038 [DOI] [PubMed] [Google Scholar]

[B92] Zhu X. J., Feng C. Q., Lai H. Y., Chen W., Lin H. (2019). Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl-Based Syst. 163, 787–793. 10.1016/j.knosys.2018.10.007 [DOI] [Google Scholar]

[B93] Zou Q., Chen L., Huang T., Zhang Z., Xu Y. (2017a). Machine learning and graph analytics in computational biomedicine. Artif. Intell. Med. 83:1. 10.1016/j.artmed.2017.09.003 [DOI] [PubMed] [Google Scholar]

[B94] Zou Q., Mrozek D., Ma Q., Xu Y. (2017b). Scalable data mining algorithms in computational biology and biomedicine. Biomed. Res. Int. 2017:5652041. 10.1155/2017/5652041 [DOI] [PMC free article] [PubMed] [Google Scholar]
