Prediction of Gene Phenotypes Based on GO and KEGG Pathway Enrichment Scores

Tao Zhang; Min Jiang; Lei Chen; Bing Niu; Yudong Cai

doi:10.1155/2013/870795

2013 Nov 7;2013:870795. doi: 10.1155/2013/870795

PMCID: PMC3838811 PMID: 24312912

Abstract

Observing the phenotype caused by the overexpression or knockdown of a gene is the basic method of investigating gene function. Many advanced biotechnologies, such as RNAi, have been developed to study gene phenotypes, but they still have many limitations. Besides the time and cost involved, the knockdown of some genes may be lethal, which makes the observation of other phenotypes impossible. For ethical and technological reasons, the knockdown of genes in complex species, such as mammals, is extremely difficult. We therefore propose a new sequence-based computational method, called the kNNA-based method, for gene phenotype prediction. Unlike traditional sequence-based computational methods, our method treats the multiple phenotypes as a whole network, ranks the possible phenotypes associated with the query protein, and gives a more comprehensive view of the protein's biological effects. From the prediction results on yeast, we also identify features, including GO and KEGG information, that contribute more to identifying protein phenotypes. This method can be applied to gene phenotype prediction in other species.

1. Introduction

Recognition of the gene phenotypes of proteins is a central challenge of modern genetics for modulating protein functions and biological processes, and many well-known diseases, such as HIV [1–4], cancers [5–8], chronic liver diseases [9], and Gaucher disease [10], are all closely related to protein phenotypes. Hence, determining a protein's phenotypes is fundamental in systems biology and proteomics. Besides phenotypes, proteins have many other multilabel attributes, such as subcellular locations [11–13] and the multiple functional types of antimicrobial peptides. Multilabel molecular biosystems are very common.

During the past decades, numerous efforts have been made to predict the gene phenotypes of yeast proteins, using both experimental and computational methods. On the experimental side, high-throughput phenotype assays [14, 15] combined with gene perturbation technology [16, 17] allow fast identification of the genes active in a response [18]; for example, yeast mutant strain collections have been used to identify phenotypes [19]. However, owing to the high complexity of phenotypes, determining protein phenotypes by experiment is both costly and time-consuming, and the results derived from experiments can have high false rates [20]. Computational methods provide important complementary tools for this problem. Many studies based on sequence-based and network-based methods have addressed the identification of gene phenotypes [21–23]. In this research, we present a new sequence-based method, called the kNNA-based method, to predict gene phenotypes.

2. Materials and Methods

2.1. Benchmark Dataset

In this study, 6,732 yeast proteins were taken from CYGD (the MIPS Comprehensive Yeast Genome Database) [24], which collects information on the molecular structure and functional network of the budding yeast. After removing those without sequences, information, or phenotype annotations, the remaining 1,462 proteins composed the benchmark dataset S. According to their phenotypes, these proteins were classified into the following 11 categories: (I) conditional phenotypes, (II) cell cycle defects, (III) mating and sporulation defects, (IV) auxotrophies, carbon and nitrogen utilization defects, (V) cell morphology and organelle mutants, (VI) stress response defects, (VII) carbohydrate and lipid biosynthesis, (VIII) nucleic acid metabolism defects, (IX) sensitivity to amino acid analogs and other drugs, (X) sensitivity to antibiotics, and (XI) sensitivity to immunosuppressants. Let T ₁, T ₂,…, T ₁₁ represent the tags of the 11 phenotypic categories, where T ₁ denotes “conditional phenotypes,” T ₂ denotes “cell cycle defects,” and so forth (see columns 1 and 2 of Table 1 for the correspondence between tags and phenotypic categories). Thus, the benchmark dataset S can be formulated as

$S = S_{1} \cup S_{2} \cup \dots \cup S_{11},$ (1)

where S _i represents the set of proteins with tag T _i. The IDs of the proteins in each S _i are available online in the Supplementary Material at http://dx.doi.org/10.1155/2013/870795. Table 1 shows that the category sizes sum to 2,400, which is much larger than the 1,462 proteins investigated in this study; this means that some proteins are associated with multiple phenotypes. As in other studies dealing with proteins or compounds with multiple attributes [25–29], the proposed method can predict multiple phenotypes for a protein.

Table 1.

Breakdown of 1462 budding yeast proteins according to their 11 phenotypes.

Tag	Phenotype category	Number of proteins
T ₁	Conditional phenotypes	536
T ₂	Cell cycle defects	272
T ₃	Mating and sporulation defects	198
T ₄	Auxotrophies, carbon, and nitrogen utilization defects	266
T ₅	Cell morphology and organelle mutants	535
T ₆	Stress response defects	147
T ₇	Carbohydrate and lipid biosynthesis	46
T ₈	Nucleic acid metabolism defects	219
T ₉	Sensitivity to amino acid analogs and other drugs	124
T ₁₀	Sensitivity to antibiotics	43
T ₁₁	Sensitivity to immunosuppressants	14

Total	—	2,400
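As a quick arithmetic check of the multilabel nature of the dataset, the category sizes in Table 1 can be summed and divided by the number of proteins. The sketch below uses only the numbers reported in Table 1; the variable names are illustrative.

```python
# Category sizes copied from Table 1 (tags T1..T11).
category_sizes = [536, 272, 198, 266, 535, 147, 46, 219, 124, 43, 14]
n_proteins = 1462  # proteins in the benchmark dataset S

# Total number of (protein, phenotype) annotations.
total_annotations = sum(category_sizes)

# Average number of phenotype tags carried by one protein.
avg_labels_per_protein = total_annotations / n_proteins

print(total_annotations)                 # 2400
print(round(avg_labels_per_protein, 2))  # 1.64
```

On average each protein therefore carries about 1.64 phenotype tags, which is why a ranking over all 11 categories is predicted rather than a single class.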

2.2. Feature Construction

The first important step in building an efficient prediction model is to encode each sample as a numeric vector. Here, to capture the information related to protein phenotype, Gene Ontology (GO) and KEGG enrichment scores, which have been used in several biological problems [30, 31], were employed to represent each protein. Their detailed definitions can be found in [30, 31].

2.3. Protein Representation and Feature Reduction

Each protein was represented by 4,682 features: 4,583 GO enrichment scores and 99 KEGG enrichment scores. However, some of these 4,682 features had little relationship to the target and could bring noise into the prediction model, so they should be removed. Before removing the irrelevant features, the following formula was used to bring all features to a standard scale:

$U_{ij} = \frac{u_{ij} - u_{j}}{T_{j}},$ (2)

where T _j and u _j are the standard deviation and mean value of the jth feature, while u _ij and U _ij are the original value and standardized value of the ith sample on the jth feature.
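A minimal sketch of the standardization in Equation (2) is given below. It assumes the population standard deviation and maps constant features to zero; neither detail is stated in the text, and the function name `standardize_columns` is illustrative.

```python
import math

def standardize_columns(rows):
    """Return a z-scored copy of rows (a list of equal-length feature lists)."""
    n_samples = len(rows)
    n_features = len(rows[0])
    out = [[0.0] * n_features for _ in range(n_samples)]
    for j in range(n_features):
        column = [row[j] for row in rows]
        mean_j = sum(column) / n_samples
        # Population standard deviation (assumption: the variant is not specified).
        std_j = math.sqrt(sum((v - mean_j) ** 2 for v in column) / n_samples)
        for i in range(n_samples):
            # Assumption: a constant feature is mapped to 0 to avoid dividing by zero.
            out[i][j] = (column[i] - mean_j) / std_j if std_j > 0 else 0.0
    return out

# Toy data: the first feature varies, the second is constant.
toy = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = standardize_columns(toy)
print(scaled)
```

After this step every non-constant feature has zero mean and unit standard deviation, so features measured on different scales become comparable.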

After the transformation, the correlation coefficient between each feature and the target vector was computed, and features with a correlation coefficient less than 0.1 were discarded. Finally, 989 features remained: 947 Gene Ontology (GO) enrichment scores and 42 KEGG enrichment scores. Thus, each protein P _z was finally represented by a 989-D vector.
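The filtering step can be sketched as follows. The text does not say how the target vector was built for the 11 categories or whether the absolute value of the coefficient was used, so this sketch makes two labeled assumptions: each category is encoded as a 0/1 column, and a feature is kept when its largest absolute Pearson correlation with any category column reaches the threshold. The function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def filter_features(features, labels, threshold=0.1):
    """Indices of features whose best |r| against any label column reaches threshold.

    Assumption: labels is a 0/1 matrix with one column per category.
    """
    n_features = len(features[0])
    n_labels = len(labels[0])
    kept = []
    for j in range(n_features):
        col = [row[j] for row in features]
        best = max(abs(pearson(col, [lab[c] for lab in labels]))
                   for c in range(n_labels))
        if best >= threshold:
            kept.append(j)
    return kept

# Toy data: feature 0 tracks label 0, feature 1 is constant and is dropped.
features = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
labels = [[0, 1], [0, 0], [1, 1], [1, 0]]
kept = filter_features(features, labels)
print(kept)  # [0]
```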

2.4. mRMR Method

Minimum Redundancy Maximum Relevance (mRMR), first proposed by Peng et al. [32], is an effective algorithm to identify discriminative features. The detailed algorithm of mRMR can be found at [32] and its program can be downloaded from http://penglab.janelia.org/proj/mRMR/.

mRMR has been widely used in the areas of bioinformatics [25, 33–36].
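To make the idea of mRMR concrete, the following is a simplified greedy sketch, not the program of Peng et al. It assumes discrete (or already discretized) features, estimates mutual information from raw counts, and uses the difference form of the criterion (relevance minus mean redundancy with the features already selected). All names are illustrative.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log(p(a,b) / (p(a) * p(b))), written with raw counts.
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi

def mrmr_rank(columns, target):
    """Greedy mRMR ordering of feature columns (relevance minus mean redundancy)."""
    relevance = [mutual_information(col, target) for col in columns]
    remaining = list(range(len(columns)))
    selected = []
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]  # first pick: maximum relevance only
            redundancy = sum(mutual_information(columns[j], columns[s])
                             for s in selected) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: the target is f0 AND f2; column 1 is an exact copy of f0.
f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f2 = [0, 0, 1, 1, 0, 0, 1, 1]
target = [a & b for a, b in zip(f0, f2)]
order = mrmr_rank([f0, list(f0), f2], target)
print(order)
```

In this toy case the duplicated column is ranked last: it is as relevant as the original, but fully redundant once the original has been selected, which is the behavior that distinguishes the mRMR ranking from a pure MaxRel ranking.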

2.5. Prediction Model

2.5.1. kNNA-Based Method

The nearest neighbor algorithm is simple and effective for solving classification and optimization problems in bioinformatics. It is adopted here to construct the multilabel prediction classifier.

Within the kNNA-based method, the cosine of the angle between two vectors is used to measure their similarity:

$\cos \langle p_{x}, p_{y} \rangle = \frac{\vec{p_{x}} \cdot \vec{p_{y}}}{\| \vec{p_{x}} \| \cdot \| \vec{p_{y}} \|},$ (3)

where $\vec{p_{x}} \cdot \vec{p_{y}}$ is the inner product of the n-dimensional vectors of proteins p _x and p _y, and ||p|| is the modulus of the vector p.
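Equation (3) can be written in a few lines of code. In this sketch the treatment of a zero vector (similarity 0) is an assumption, since the text does not cover that case, and the function name is illustrative.

```python
import math

def cosine_similarity(px, py):
    """Cosine of the angle between two equal-length numeric vectors (Equation (3))."""
    dot = sum(a * b for a, b in zip(px, py))
    norm_x = math.sqrt(sum(a * a for a in px))
    norm_y = math.sqrt(sum(b * b for b in py))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_x * norm_y)

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```

Because the measure depends only on the angle, two proteins whose feature vectors differ by a positive scale factor are considered identical.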

For a query protein, k proteins in the training set which are closest to the query protein are first identified and are denoted by p ₁, p ₂,…, p _k. Then, the categories of the query protein can be inferred from the categories of the k nearest proteins identified. The procedure of the methodology is described in detail as follows.

(a) Identify the k nearest neighbors of the query protein, denoted by p ₁, p ₂,…, p _k, together with their k cosine similarities to the query, denoted by w ₁, w ₂,…, w _k.
(b) Use the following formula:
$S(P \Rightarrow j) = \sum_{i = 1}^{k} w_{i} \cdot t_{p_{i}, j} \quad (j = 1, 2, \dots, 11)$ (4)
to calculate the score, used as the probability, that the query protein P belongs to the jth category, where t _{p_i,j} is the jth item of the category vector t _{p_i} of protein p _i (taken to be 1 if p _i belongs to the jth category and 0 otherwise). The scores of the 11 categories calculated above are sorted in descending order for each query protein P _z as
$D^{\downarrow} \{ S(P_{z} \Rightarrow j) \mid j = 1, 2, \dots, 11 \} = V = \begin{bmatrix} \mu_{1} \\ \mu_{2} \\ \vdots \\ \mu_{j} \\ \vdots \\ \mu_{10} \\ \mu_{11} \end{bmatrix}.$ (5)
(c) Denote the category labels corresponding to the sorted scores as
$P^{D^{\downarrow}} = [P^{\mu_{1}}, P^{\mu_{2}}, \dots, P^{\mu_{i}}, \dots, P^{\mu_{11}}] \quad (i = 1, 2, \dots, 11),$ (6)
where P ^μ_i is the category whose score ranks ith in D ^↓.
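Steps (a) to (c) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the category vector of each training protein is assumed to be a 0/1 list, ties are broken by category index, and the names `cosine` and `knna_rank` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of Equation (3); returns 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def knna_rank(query, train_vectors, train_labels, k):
    """Rank category indices for one query protein, best first."""
    # Step (a): the k training proteins most similar to the query.
    sims = [(cosine(query, vec), idx) for idx, vec in enumerate(train_vectors)]
    sims.sort(key=lambda s: s[0], reverse=True)
    neighbours = sims[:k]
    # Step (b), Equation (4): similarity-weighted vote for every category.
    n_categories = len(train_labels[0])
    scores = [0.0] * n_categories
    for w, idx in neighbours:
        for j in range(n_categories):
            scores[j] += w * train_labels[idx][j]
    # Equations (5) and (6): categories sorted by descending score.
    ranking = sorted(range(n_categories), key=lambda j: scores[j], reverse=True)
    return ranking, scores

# Toy example with 3 categories instead of 11.
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
ranking, scores = knna_rank([1.0, 0.05], train_vectors, train_labels, k=2)
print(ranking)  # [0, 1, 2]
```

The first entry of `ranking` corresponds to the first-order prediction, the second entry to the second-order prediction, and so on.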

2.5.2. Comparison with RPC-Based Method

In the ranking by pairwise comparison (RPC) method, a sample is allocated to a pair of labels if it belongs to one and only one of the two labels (not both). Given q category labels, there are C _q ² = q · (q − 1)/2 possible pairwise combinations of the labels, so q · (q − 1)/2 data subsets are generated, one for discriminating each pair of labels. Each subset contains those examples of the dataset D that are annotated with at least one of the two corresponding labels, but not both, and a binary classifier that learns to discriminate between the two labels is trained on it.

Given a new instance, all binary classifiers are invoked, and the ranking of the labels is obtained by counting the votes received by each label: whenever a classifier assigns the instance to a label, that label receives one vote.
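The RPC procedure can be sketched with any binary classifier. The sketch below swaps in a nearest-centroid rule purely as a placeholder; it is not one of the classifiers used in this study, and the function names are illustrative.

```python
from itertools import combinations

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_rpc(vectors, label_sets, n_labels):
    """One placeholder nearest-centroid model per label pair (a, b)."""
    models = {}
    for a, b in combinations(range(n_labels), 2):
        # Keep only samples carrying exactly one of the two labels.
        only_a = [v for v, ls in zip(vectors, label_sets) if a in ls and b not in ls]
        only_b = [v for v, ls in zip(vectors, label_sets) if b in ls and a not in ls]
        if only_a and only_b:
            models[(a, b)] = (centroid(only_a), centroid(only_b))
    return models

def rank_rpc(models, query, n_labels):
    """Rank labels by the number of pairwise votes they receive."""
    votes = [0] * n_labels
    for (a, b), (ca, cb) in models.items():
        winner = a if sq_dist(query, ca) <= sq_dist(query, cb) else b
        votes[winner] += 1
    ranking = sorted(range(n_labels), key=lambda j: votes[j], reverse=True)
    return ranking, votes

# Toy example with 3 labels; samples with both labels of a pair are skipped for that pair.
vectors = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.2, 0.9], [0.5, 0.5]]
label_sets = [{0}, {0, 1}, {2}, {1, 2}, {1}]
models = train_rpc(vectors, label_sets, n_labels=3)
ranking, votes = rank_rpc(models, [1.0, 0.1], n_labels=3)
print(ranking, votes)  # [0, 1, 2] [2, 1, 0]
```

With q = 11 labels the loop over `combinations(range(11), 2)` visits 55 label pairs, so up to 55 binary models are trained.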

2.6. Evaluation

(a) Jackknife Testing. Three methods are often used to evaluate a prediction model: (1) the independent dataset test, (2) the subsampling (K-fold) test, and (3) the jackknife test. The first method uses unseen data for testing, which requires a large quantity of data. The second method partitions the dataset into K portions and takes each portion in turn as the test data and the other K − 1 portions as the training data. The third, also known as the leave-one-out method, leaves each sample out in turn as the test data and uses the others as the training data. To maximize the quantity of training data, the jackknife test is used to evaluate the predictor developed in this paper; that is, each protein is in turn singled out as the query protein, and the remaining ones serve as the training data of the kNNA-based method.
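The jackknife (leave-one-out) procedure amounts to a simple loop over the samples. A minimal sketch, with an illustrative function name:

```python
def jackknife_splits(n_samples):
    """Yield (query_index, training_indices) with each sample held out exactly once."""
    for q in range(n_samples):
        yield q, [i for i in range(n_samples) if i != q]

# With 4 samples there are 4 splits, each training on the other 3.
splits = list(jackknife_splits(4))
for query, training in splits:
    print(query, training)
```

For the benchmark dataset this gives 1,462 splits, each using the other 1,461 proteins as training data.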

(b) Metric. Let t _{z,P_z^μ_i} = 1 if the ith-ranked category P _z ^μ_i predicted for protein P _z is one of its true categories; otherwise, t _{z,P_z^μ_i} = 0.

The ith-order prediction accuracy A ⁱ (the accuracy of the ith-ranked predictions in P ^{D^↓}) is calculated as follows:

$A^{i} = \frac{\sum_{j = 1}^{m} t_{j, P_{j}^{\mu_{i}}}}{m},$ (7)

where m is the number of proteins tested, which in the jackknife test equals the number of proteins in the dataset.
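Equation (7) can be evaluated for all orders at once. The sketch below assumes that each prediction is a full ranking of category indices and that the true annotations are given as sets; the function name and toy data are illustrative.

```python
def order_accuracies(rankings, true_label_sets):
    """A^i for i = 1..q: fraction of samples whose i-th ranked category is a true one."""
    m = len(rankings)
    q = len(rankings[0])
    hits = [0] * q
    for ranking, truth in zip(rankings, true_label_sets):
        for i, category in enumerate(ranking):
            if category in truth:
                hits[i] += 1
    return [h / m for h in hits]

# Toy example: 4 samples, 3 categories, 6 true annotations in total.
rankings = [[0, 1, 2], [2, 0, 1], [1, 2, 0], [0, 2, 1]]
truth = [{0}, {0, 2}, {2}, {0, 1}]
acc = order_accuracies(rankings, truth)
print(acc)  # [0.75, 0.5, 0.25]
```

Because every ranking lists all categories, each true annotation is counted at exactly one order, so the accuracies sum to the average number of labels per sample (here 6/4 = 1.5).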

2.7. Incremental Feature Selection

Incremental feature selection (IFS) is often used to search for the feature subset that performs best. Specifically, features in the ranked feature list are added one by one from higher to lower rank; each time a feature is added, a new feature subset is constructed. Thus, given N features, N feature subsets are constructed, where the ith-order feature subset is

$S_{i} = \{ f_{1}, f_{2}, \dots, f_{i} \} \quad (1 \leq i \leq 989),$ (8)

in which f _i represents the ith feature taken from the mRMR ranking.

Each feature subset is used to make prediction and the feature subset (first n features) that performs best is deemed as the optimal feature subset.
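The IFS loop can be sketched as below. The `evaluate` callback stands for whatever performance measure is used (here it would be the first-order jackknife accuracy of the kNNA-based method for a given k); the toy evaluator, the tie-breaking rule, and all names are assumptions made for illustration.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return (best_n, best_score, curve) over the prefixes S_1, ..., S_N."""
    best_n, best_score, curve = 0, float("-inf"), []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]          # S_n = {f_1, ..., f_n}
        score = evaluate(subset)
        curve.append(score)
        if score > best_score:                # assumption: ties keep the smaller subset
            best_n, best_score = n, score
    return best_n, best_score, curve

def toy_evaluate(subset):
    """Hypothetical score: useful features help, the others add noise."""
    useful = {"f1", "f2", "f3"}
    gain = sum(1 for f in subset if f in useful)
    noise = sum(1 for f in subset if f not in useful)
    return gain - 0.5 * noise

ranked = ["f1", "f2", "x1", "f3", "x2"]
best_n, best_score, curve = incremental_feature_selection(ranked, toy_evaluate)
print(best_n, best_score)  # 4 2.5
```

Plotting `curve` against n gives an IFS curve; repeating the loop for several values of k gives one curve per k.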

3. Results and Discussion

3.1. Results

3.1.1. mRMR Results

We applied the mRMR method to the dataset and obtained two feature tables (see Supplementary Material). One, the MaxRel feature table, ranks the features by their relevance to the class of the samples; the other, the mRMR feature table, ranks the features by maximum relevance to the class and minimum redundancy among the features. The mRMR ranking was used in the following IFS procedure to select the optimal feature set.

3.1.2. Performance of kNNA-Based Method

The first-order prediction accuracy of the jackknife test is 62.38%, obtained with k = 17 (number of nearest neighbors) and n = 651 (number of optimal features). The 11 order prediction accuracies of the kNNA-based method are listed in Table 2 and shown in Figure 1. The IFS curves of the kNNA-based method are shown in Figure 2, which contains 30 curves corresponding to different values of k; the detailed accuracy (ACC) values can be found in the Supplementary Material. The peak area of these curves is highlighted in Figure 3 to locate the optimal k.

Table 2.

The 11 order prediction accuracies by kNNA-based method.

Method / order	1	2	3	4	5	6	7	8	9	10	11
kNNA-based method (ACC, %)	62.38	30.44	22.16	14.09	9.03	6.43	5.75	2.8	3.08	3.49	4.51

Figure 1. The curve showing the trend of the 11 order prediction accuracies.

Figure 2. 30 IFS curves of kNNA-based method corresponding to different values of k.

Figure 3. The peak and its coordinate of these IFS curves.

3.1.3. Performance of RPC-Based Method

First, the 11 labels were combined into 55 (C ₁₁ ²) label pairs. For each pair, the samples that belong to one and only one of the two labels (not both) were selected, giving 55 binary subsets. Three well-known binary classification algorithms, RandomForest, SMO, and Dagging, were applied to build the prediction models. The prediction results are summarized in Table 3.

Table 3.

The 11 order prediction accuracies by RPC-based methods (Dagging, RandomForest, SMO).

Method / order (ACC, %)	1	2	3	4	5	6	7	8	9	10	11
Dagging	60.05	33.58	21.96	13.75	10.53	8.28	6.57	3.56	2.6	1.85	1.44
RandomForest	58.62	34.2	22.3	14.7	9.92	7.66	5.95	5.2	3.28	1.5	0.82
SMO	56.16	34.68	21.55	14.84	10.88	7.8	6.36	4.65	3.21	2.26	1.78

3.1.4. Comparison with RPC-Based Method

We compared the first-order prediction accuracy of our method with that of the RPC-based method. The first-order prediction accuracies of the RPC-based method using Dagging (60.05%), RandomForest (58.62%), and SMO (56.16%) are all lower than that of our kNNA-based method (62.38%).
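The reported accuracies also allow a simple consistency check. Every method ranks all 11 categories for each protein, so each true annotation is hit at exactly one order, and the 11 order accuracies of a method should sum to the average number of annotations per protein, 2,400/1,462, expressed as a percentage. The sketch below uses only the values from Tables 1 to 3; the variable names are illustrative.

```python
# Order accuracies (%) copied from Table 2 (kNNA) and Table 3 (RPC variants).
accuracies = {
    "kNNA":         [62.38, 30.44, 22.16, 14.09, 9.03, 6.43, 5.75, 2.8, 3.08, 3.49, 4.51],
    "Dagging":      [60.05, 33.58, 21.96, 13.75, 10.53, 8.28, 6.57, 3.56, 2.6, 1.85, 1.44],
    "RandomForest": [58.62, 34.2, 22.3, 14.7, 9.92, 7.66, 5.95, 5.2, 3.28, 1.5, 0.82],
    "SMO":          [56.16, 34.68, 21.55, 14.84, 10.88, 7.8, 6.36, 4.65, 3.21, 2.26, 1.78],
}

# Average number of annotations per protein (Table 1), as a percentage.
expected_sum = 100 * 2400 / 1462

# Method with the highest first-order accuracy.
best_first_order = max(accuracies, key=lambda name: accuracies[name][0])

for name, acc in accuracies.items():
    print(f"{name:13s} first-order = {acc[0]:.2f}  sum = {sum(acc):.2f}")
print("expected sum:", round(expected_sum, 2))
print("best first-order method:", best_first_order)
```

All four sums agree with the expected value of about 164.16 to within rounding, so the methods differ only in how the same total of correct hits is distributed over the orders; the kNNA-based method places the largest share at the first order.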

3.2. Discussion

To illustrate the biological meaning of the selected optimal feature subset, we first classified the GO terms into three kinds: biological process, cellular component, and molecular function. The 622 GO terms in the mRMR feature list were mapped to the children of these three root GO terms. The figures show the frequency of each child GO term in the feature subset and the ratio of that frequency to the number of its own children terms.

3.2.1. Biological Process GO Terms

In terms of BP frequency, the top five GO biological process terms are GO:0009987: cellular process (399), GO:0008152: metabolic process (316), GO:0019740: nitrogen utilization (216), GO:0065007: biological regulation (136), and GO:0050789: regulation of biological process (131). In terms of BP percentage, the top five are GO:0019740: nitrogen utilization (4.20%), GO:0071840: cellular component organization or biogenesis (3.57%), GO:0000003: reproduction (2.94%), GO:0022414: reproductive process (2.88%), and GO:0009987: cellular process (2.04%). Nitrogen utilization (GO:0019740) and cellular process (GO:0009987) appear within the top five in both the frequency and the percentage analysis. This indicates that proteins assigned to these two GO terms may strongly affect protein phenotype determination, which is consistent with the common knowledge that the specific cellular activities of proteins confer specific phenotypes. Granek and Magwene also reported that two key signaling networks, the filamentous growth MAP kinase cascade and the Ras-cAMP-PKA pathway, can regulate the yeast colony morphology response [37]. Additionally, the yeast cell wall integrity pathway is involved in the resistance of the yeast Saccharomyces cerevisiae to the biocide polyhexamethylene biguanide [38].

The prominence of nitrogen utilization (GO:0019740) suggests that nitrogen utilization, which is essential for survival and development, may have a more definite effect on protein phenotype. Nutrient stresses trigger a variety of developmental switches in the budding yeast Saccharomyces cerevisiae, and low levels of carbon combined with abundant nitrogen have been shown to trigger complex colony formation in yeast [37].

3.2.2. Cellular Component GO Terms

In terms of CC frequency, the top six GO cellular component terms are GO:0005623: cell (171), GO:0044464: cell part (169), GO:0043226: organelle (135), GO:0044422: organelle part (103), GO:0032991: macromolecular complex (84), and GO:0031974: membrane-enclosed lumen (39). In terms of CC percentage, the top six are GO:0031974: membrane-enclosed lumen (12.4%), GO:0044422: organelle part (8.42%), GO:0043226: organelle (8.4%), GO:0032991: macromolecular complex (5.20%), GO:0044464: cell part (4.77%), and GO:0005623: cell (4.20%). Organelle (GO:0043226) and organelle part (GO:0044422) rank highly in both the frequency and the percentage analysis. It may be concluded that proteins located in cellular organelles deserve particular attention. This suggests that organelles, which have specific structural and functional attributes, may be associated with more definite protein phenotypes related to their specific functions, and it implies that proteins assigned to these GO terms could contribute relatively more to overall protein phenotype determination. For example, the communication between mitochondrial and nuclear loci (i.e., COX1-MSY1 and Q0182-RSM7) showed significant reductions in the absence of mitochondrial encoded reverse transcriptase machinery [39]. The inclusion of macromolecular complex (GO:0032991) suggests that proteins expressing some phenotypes need to interact with each other to function together, so that macromolecular complexes help determine the phenotype of proteins. The inclusion of membrane-enclosed lumen (GO:0031974) likewise suggests that proteins assigned to this cellular component could contribute greatly to protein phenotype, because most cellular organelles, such as mitochondria and the nucleus, are enclosed by membranes.

3.2.3. Molecular Function GO Terms

In terms of MF frequency, the top six GO molecular function terms are GO:0003824: catalytic activity (79), GO:0005488: binding (69), GO:0001071: nucleic acid binding transcription factor activity (40), GO:0000988: protein binding transcription factor activity (14), GO:0065009: regulation of molecular function (8), and GO:0005215: transporter activity (7). Proteins assigned to these three GO terms require binding or interaction to carry out their structural or functional activities. This suggests that proteins assigned to these six GO terms contribute profoundly to the protein phenotype. In terms of MF percentage, the top-ranked GO molecular function terms are GO:0009055: electron carrier activity (25%), GO:0016530: metallochaperone activity (25%), GO:0045182: translation regulator activity (14.3%), GO:0005198: structural molecule activity (11.8%), GO:0001071: nucleic acid binding transcription factor activity (9.0%), GO:0005488: binding (3.99%), and GO:0016209: antioxidant activity (3.85%). The relatively small base numbers of these terms make the GO terms influencing protein phenotype relatively more enriched among them, especially in electron carrier activity (GO:0009055) and metallochaperone activity (GO:0016530). The prominence of electron carrier activity (GO:0009055) may be attributed to the relatively limited and definite function of these proteins. It was reported that an oncology drug can interact with the electron transport chain (ETC) to generate high levels of ROS within the organelle and consequently lead to cell death [40]. The prominence of metallochaperone activity (GO:0016530) may be ascribed to metalloproteins requiring metallochaperones and metal ions to express their specific functions. In all bacteria, a panel of metalloregulatory proteins controls the expression of genes encoding membrane transporters and metal trafficking proteins [41]. Because of the large base numbers of the top six GO terms in MF frequency, they have relatively lower enrichment among the top-ranked GO terms in MF percentage.

Supplementary Material

Supplementary Material 1: The IDs of the 1,462 yeast proteins with phenotype annotation.

Supplementary Material 2: The MaxRel and mRMR feature tables obtained with the mRMR method.

Supplementary Material 3: The 11 order prediction accuracies based on the mRMR feature list using the kNNA-based method (k = 1, 2, ..., 31).

Authors' Contribution

Tao Zhang and Min Jiang contributed equally to this research.

Acknowledgments

This work was supported by grants from the National Basic Research Program of China (2011CB510101, 2011CB510102), the National Natural Science Foundation of China (31371335), the Innovation Program of Shanghai Municipal Education Commission (12ZZ087), the Leading Academic Discipline Project of Shanghai Municipal Education Commission “Molecular Physiology,” the grant of “The First-class Discipline of Universities in Shanghai,” and the Foundation for The Excellent Youth (SHU10022).

References

1. van Houtte M, Picchio G, van der Borght K, Pattery T, Lecocq P, Bacheler LT. A comparison of HIV-1 drug susceptibility as provided by conventional phenotyping and by a phenotype prediction tool based on viral genotype. Journal of Medical Virology. 2009;81(10):1702–1709. doi: 10.1002/jmv.21585.
2. Vasil’ev AV, Kazennova EV, Bobkova MR. Prediction of phenotype R5/X4 of HIV-1 variants circulating in Russia, by using computer methods. Voprosy Virusologii. 2009;54(3):17–21.
3. Xu S, Huang X, Xu H, Zhang C. Improved prediction of coreceptor usage and phenotype of HIV-1 based on combined features of V3 loop sequence using random forest. Journal of Microbiology. 2007;45(5):441–446.
4. Vermeiren H, Van Craenenbroeck E, Alen P, Bacheler L, Picchio G, Lecocq P. Prediction of HIV-1 drug susceptibility phenotype from the viral genotype using linear regression modeling. Journal of Virological Methods. 2007;145(1):47–55. doi: 10.1016/j.jviromet.2007.05.009.
5. Lin T-Y, Chang JT-C, Wang H-M, et al. Proteomics of the radioresistant phenotype in head-and-neck cancer: GP96 as a novel prediction marker and sensitizing target for radiotherapy. International Journal of Radiation Oncology Biology Physics. 2010;78(1):246–256. doi: 10.1016/j.ijrobp.2010.03.002.
6. Bathen TF, Jensen LR, Sitter B, et al. MR-determined metabolic phenotype of breast cancer in prediction of lymphatic spread, grade, and hormone status. Breast Cancer Research and Treatment. 2007;104(2):181–189. doi: 10.1007/s10549-006-9400-z.
7. Lakhani SR, Reis-Filho JS, Fulford L, et al. Prediction of BRCA1 status in patients with breast cancer using estrogen receptor and basal phenotype. Clinical Cancer Research. 2005;11(14):5175–5180. doi: 10.1158/1078-0432.CCR-04-2424.
8. Dwyer T, Stankovich JM, Blizzard L, et al. Does the addition of information on genotype improve prediction of the risk of melanoma and nonmelanoma skin cancer beyond that obtained from skin phenotype? American Journal of Epidemiology. 2004;159(9):826–833. doi: 10.1093/aje/kwh120.
9. Piruzyan LA, Korshunov IB, Morozova NV, Pyn’ko NE, Radkevich LA. Prediction of chronic liver diseases on the basis of the N-acetyltransferase 2 phenotype. Doklady Biochemistry and Biophysics. 2004;395(1–6):84–87. doi: 10.1023/b:dobi.0000025552.40172.db.
10. Whitfield PD, Nelson P, Sharp PC, et al. Correlation among genotype, phenotype, and biochemical markers in Gaucher disease: implications for the prediction of disease severity. Molecular Genetics and Metabolism. 2002;75(1):46–55. doi: 10.1006/mgme.2001.3269.
11. Li G-Z, Wang X, Hu X, Liu J-M, Zhao R-W. Multilabel learning for protein subcellular location prediction. IEEE Transactions on NanoBioscience. 2012;11(3):237–243. doi: 10.1109/TNB.2012.2212249.
12. Wang X, Li G-Z. A multi-label predictor for identifying the subcellular locations of singleplex and multiplex eukaryotic proteins. PLoS ONE. 2012;7(5):e36317. doi: 10.1371/journal.pone.0036317.
13. Wang X, Li G-Z, Lu W-C. Virus-ECC-mPLoc: a multi-label predictor for predicting the subcellular localization of virus proteins with both single and multiple sites based on a general form of Chou's pseudo amino acid composition. Protein and Peptide Letters. 2013;20(3):309–317. doi: 10.2174/0929866511320030009.
14. Drees BL, Thorsson V, Carter GW, et al. Derivation of genetic interaction networks from quantitative phenotype data. Genome Biology. 2005;6(4):R38. doi: 10.1186/gb-2005-6-4-r38.
15. Dudley AM, Janse DM, Tanay A, Shamir R, Church GM. A global view of pleiotropy and phenotypically derived gene function in yeast. Molecular Systems Biology. 2005;1:2005.0001. doi: 10.1038/msb4100004.
16. Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature. 1998;391(6669):806–811. doi: 10.1038/35888.
17. Winzeler EA, Liang H, Shoemaker DD, Davis RW. Functional analysis of the yeast genome by precise deletion and parallel phenotypic characterization. Novartis Foundation Symposium. 2000;229:105–109; discussion 109–111. doi: 10.1002/047084664x.ch14.
18. Carter GW, Prinz S, Neou C, et al. Prediction of phenotype and gene expression for combinations of mutations. Molecular Systems Biology. 2007;3:96. doi: 10.1038/msb4100137.
19. Scherens B, Goffeau A. The uses of genome-wide yeast mutant collections. Genome Biology. 2004;5(7):229. doi: 10.1186/gb-2004-5-7-229.
20. McGary KL, Lee I, Marcotte EM. Broad network-based predictability of Saccharomyces cerevisiae gene loss-of-function phenotypes. Genome Biology. 2007;8(12):R258. doi: 10.1186/gb-2007-8-12-r258.
21. Cedano J, Aloy P, Perez-Pons JA, Querol E. Relation between amino acid composition and cellular location of proteins. Journal of Molecular Biology. 1997;266(3):594–600. doi: 10.1006/jmbi.1996.0804.
22. Resch W, Hoffman N, Swanstrom R. Improved success of phenotype prediction of the human immunodeficiency virus type 1 from envelope variable loop 3 sequence using neural networks. Virology. 2001;288(1):51–62. doi: 10.1006/viro.2001.1087.
23. Pillai S, Good B, Richman D, Corbeil J. A new perspective on V3 phenotype prediction. AIDS Research and Human Retroviruses. 2003;19(2):145–149. doi: 10.1089/088922203762688658.
24. Güldener U, Münsterkötter M, Kastenmüller G, et al. CYGD: the comprehensive yeast genome database. Nucleic Acids Research. 2005;33:D364–D368. doi: 10.1093/nar/gki053.
25. Chen L, Zeng W-M, Cai Y-D, Feng K-Y, Chou K-C. Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities. PLoS ONE. 2012;7(4):e35254. doi: 10.1371/journal.pone.0035254.
26. Gao P, Wang QP, Chen L, Huang T. Prediction of human genes' regulatory functions based on protein-protein interaction network. Protein and Peptide Letters. 2012;19(9):910–916. doi: 10.2174/092986612802084528.
27. Hu L, Huang T, Shi X, Lu W-C, Cai Y-D, Chou K-C. Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties. PLoS ONE. 2011;6(1):e14556. doi: 10.1371/journal.pone.0014556.
28. Hu L-L, Huang T, Cai Y-D, Chou K-C. Prediction of body fluids where proteins are secreted into based on protein interaction network. PLoS ONE. 2011;6(7):e22989. doi: 10.1371/journal.pone.0022989.
29. Du P, Li T, Wang X. Recent progress in predicting protein sub-subcellular locations. Expert Review of Proteomics. 2011;8(3):391–404. doi: 10.1586/epr.11.20.
30. Huang T, Chen L, Cai Y-D, Chou K-C. Classification and analysis of regulatory pathways using graph property, biochemical and physicochemical property, and functional property. PLoS ONE. 2011;6(9):e25297. doi: 10.1371/journal.pone.0025297.
31. Huang T, Zhang J, Xu Z-P, et al. Deciphering the effects of gene deletion on yeast longevity using network and machine learning approaches. Biochimie. 2012;94(4):1017–1025. doi: 10.1016/j.biochi.2011.12.024.
32. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;27(8):1226–1238. doi: 10.1109/TPAMI.2005.159.
33. Cai Y, Huang T, Hu L, Shi X, Xie L, Li Y. Prediction of lysine ubiquitination with mRMR feature selection and analysis. Amino Acids. 2011:1–9. doi: 10.1007/s00726-011-0835-0.
34. Hu L, Cui W, He Z, et al. Cooperativity among short amyloid stretches in long amyloidogenic sequences. PLoS ONE. 2012;7(6):e39369. doi: 10.1371/journal.pone.0039369.
35. Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC. Prediction of protein domain with mRMR feature selection and analysis. PLoS ONE. 2012;7(6):e39308. doi: 10.1371/journal.pone.0039308.
36. Li B-Q, Huang T, Liu L, Cai Y-D, Chou K-C. Identification of colorectal cancer related genes with mRMR and shortest path in protein-protein interaction network. PLoS ONE. 2012;7(4):e33393. doi: 10.1371/journal.pone.0033393.
37. Granek JA, Magwene PM. Environmental and genetic determinants of colony morphology in yeast. PLoS Genetics. 2010;6(1):e1000823. doi: 10.1371/journal.pgen.1000823.
38. Elsztein C, de Lucena RM, de Morais MA Jr. The resistance of the yeast Saccharomyces cerevisiae to the biocide polyhexamethylene biguanide: involvement of cell wall integrity pathway and emerging role for YAP1. BMC Molecular Biology. 2011;12:38. doi: 10.1186/1471-2199-12-38.
39. Rodley CDM, Grand RS, Gehlen LR, Greyling G, Jones MB, O’Sullivan JM. Mitochondrial-nuclear DNA interactions contribute to the regulation of nuclear transcript levels as part of the inter-organelle communication system. PLoS ONE. 2012;7(1):e30943. doi: 10.1371/journal.pone.0030943.
40. Blackman RK, Cheung-Ong K, Gebbia M, et al. Mitochondrial electron transport is the cellular target of the oncology drug Elesclomol. PLoS ONE. 2012;7(1):e29798. doi: 10.1371/journal.pone.0029798.
41. Reyes-Caballero H, Campanello GC, Giedroc DP. Metalloregulatory proteins: metal selectivity and allosteric switching. Biophysical Chemistry. 2011;156(2-3):103–114. doi: 10.1016/j.bpc.2011.03.010.

Associated Data

Supplementary Materials

Supplementary Material 1: The IDs of the 1462 yeast proteins with phenotype annotation.

Supplementary Material 2: The MaxRel and mRMR feature tables obtained using the mRMR method.

Supplementary Material 3: The 11 order prediction accuracies based on the mRMR feature list, obtained with the kNNA-based method (k = 1, 2, ..., 31).
