Meta analysis of Chronic Fatigue Syndrome through integration of clinical, gene expression, SNP and proteomic data

Vasyl Pihur; Somnath Datta; Susmita Datta

Bioinformation. 2011 Apr 22;6(3):120–124. doi: 10.6026/97320630006120

PMCID: PMC3089886 PMID: 21584188

Abstract

We start by constructing gene-gene association networks based on about 300 genes whose expression values vary between the groups of CFS patients (plus control). Connected components (modules) from these networks are further inspected for their predictive ability for symptom severity, genotypes of two single nucleotide polymorphisms (SNP) known to be associated with symptom severity, and intensity of the ten most discriminative protein features. We use two different network construction methods and choose the common genes identified in both for added validation. Our analysis identified eleven genes which may play important roles in certain aspects of CFS or related symptoms. In particular, the gene WASF3 (aka WAVE3) possibly regulates brain cytokines involved in the mechanism of fatigue through the p38 MAPK regulatory pathway.

Keywords: CFS, gene-gene interactions, microarray, proteomics, SNP

Background

Chronic Fatigue Syndrome (CFS) is a relatively rare, poorly understood, complex disorder characterized by severe and chronic physical and mental fatigue not attributable to other causes (diseases). It is sometimes accompanied by other symptoms such as weak immune response, digestive problems and depression. In recent years the Chronic Fatigue Syndrome Group at the CDC has put a great deal of effort into collecting clinical, gene expression, genotypic and proteomic data in an attempt to find a genetic basis of CFS. These data have been analyzed by numerous researchers (and research teams) in the last two years, resulting in a special issue of the journal Pharmacogenomics [1], and were also part of the Critical Assessment of Microarray Data Analysis (CAMDA) conference in 2006, yet success has been mixed and limited. Since genes do not act alone, especially in a complex disorder such as CFS, we take a systems biology approach to analyzing these data and study groups of genes (called modules) obtained from gene-gene association networks. Our approach is thus similar to that of [2], although our network construction methods and statistical analyses differ from theirs. In the end, we identify eleven “interesting” genes which may play important roles in certain aspects of CFS or related symptoms. In particular, the gene WASF3 (aka WAVE3) possibly regulates brain cytokines involved in the mechanism of fatigue through the p38 MAPK regulatory pathway. A preliminary version of this work was presented at the CAMDA 2007 conference [3].

Methodology

The CDC Chronic Fatigue Syndrome Research Group provided challenge datasets consisting of clinical, microarray, proteomics, and SNP data that were used for both the CAMDA 2006 and CAMDA 2007 competitions. A total of 227 subjects filled out self-administered questionnaires and had their blood drawn for lab analysis. For many of them, microarray (163) and proteomics (63) data were also collected for the purpose of discovering a biological (genetic) basis of CFS. In this work, we integrate the clinical, microarray, SNP and proteomics data in our analysis.

Microarray data

The CAMDA 2006 microarray data consist of 177 arrays, 9 of which were repeated twice at different times during the study. We discarded these 9 microarrays for multiplicity reasons, and an additional 5 arrays were excluded from the analysis because clinical information on the subjects was absent. Thus, we started our analysis with 163 arrays. The Subtracted ARM (artifact-removed) density column, which is already adjusted for the background density, was log-transformed to stabilize the variance.

Clinical data

The clinical data contain extensive information on 227 subjects and can be linked to the microarray and SNP data via the ABTID subject ID. We made use of two clinical variables: the Intake Classific variable, which classifies patients into 5 categories, and the Cluster variable, which provides information on the severity of the symptoms (“Worst”, “Middle”, “Least”) for some patients.

SNP data

Forty-two single nucleotide polymorphisms (SNPs) in 10 different genes were genotyped. For the purposes of this analysis, we selected two SNPs, hCV245410 (on gene TPH2) and hCV7911132 (on gene SLC6A4), which were previously identified [2] as being associated with CFS severity.

Proteomic data

Protein spectra are available for 63 subjects in the study. Serum was originally separated into 6 fractions, of which we use the last four, and then applied to three different SELDI surfaces, giving a total of 12 different settings. Experiments were repeated twice and we averaged the two spectra for each subject. We removed the first 4000 m/z values from our analysis, which roughly correspond to m/z values smaller than 1700 Da. We then divided the spectrum into bins of size 10 and took the maximum intensity value in each bin. This reduced the data by a factor of 10, leaving 2650 m/z values for further analysis. To de-noise the data, we estimated the standard deviation for each m/z bin and took the median of these as a measure of the noise standard deviation σ. Intensity values smaller than 3σ were considered to be pure noise. If this happened in all samples, the m/z value was removed from the analysis. Finally, the data were log-transformed.
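The binning and noise-filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `floor` parameter (used only to keep the logarithm defined for very small intensities) and the choice to compute each bin's standard deviation across subjects are ours and are not specified in the text.

```python
import math
import statistics


def bin_max(intensities, bin_size=10):
    """Collapse consecutive m/z points into bins, keeping the maximum of each bin."""
    return [max(intensities[i:i + bin_size])
            for i in range(0, len(intensities), bin_size)]


def denoise_and_log(spectra, n_skip=4000, bin_size=10, k=3.0, floor=1.0):
    """spectra: list of per-subject intensity lists, all of the same length.

    Returns (kept bin indices, estimated noise sigma, log-transformed matrix).
    """
    # Drop the first n_skip m/z points, then bin each spectrum by maximum.
    binned = [bin_max(s[n_skip:], bin_size) for s in spectra]
    n_bins = len(binned[0])

    # Assumption: the per-bin standard deviation is taken across subjects.
    sds = [statistics.stdev([b[j] for b in binned]) for j in range(n_bins)]
    sigma = statistics.median(sds)

    # Keep a bin unless its intensity is below k * sigma in every sample.
    keep = [j for j in range(n_bins) if any(b[j] >= k * sigma for b in binned)]

    # Assumption: `floor` is a hypothetical guard against log(0) or log(<0).
    logged = [[math.log(max(b[j], floor)) for j in keep] for b in binned]
    return keep, sigma, logged
```

With the settings reported in the text (`n_skip=4000`, `bin_size=10`, `k=3.0`), a spectrum of 30,500 points would yield the 2650 bins mentioned above before noise filtering; that spectrum length is inferred from the reported numbers rather than stated in the article.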

Statistical analysis

The first step of the statistical analysis was to identify a set of genes differentially expressed between groups of subjects. The disease status of subjects came from the clinical portion of the CFS data (Intake Classific variable). All subjects included in the microarray study were classified into 5 groups: Ever CFS (45 subjects who had ever experienced CFS), Non-fatigued (34 controls who had never experienced CFS), Ever ISF (45 subjects who are fatigued but cannot be classified as CFS because of insufficient symptoms), Ever ISF-MDDm (20 subjects experiencing ISF with melancholic depression) and Ever CFS-MDDm (19 subjects experiencing CFS along with melancholic depression). An ANOVA F-test was carried out for each probe to determine differentially expressed genes across the five groups. In total, 286 probes were identified as differentially expressed (p-values < 0.01). Since we are not interested in determining the differentially expressed genes per se, and use this step only to reduce the data, multiplicity correction was not applied. The reduced microarray data, consisting of 286 probes and 163 samples (subjects), were used for the further statistical analysis discussed below.
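As a minimal sketch of the per-probe screening step, the one-way ANOVA F statistic can be computed as below. The function name and input layout are illustrative assumptions; the text does not say which software was used. With 163 subjects in 5 groups, the statistic has 4 and 158 degrees of freedom.

```python
def anova_f(groups):
    """One-way ANOVA F statistic.

    groups: list of lists, one list of expression values per subject group.
    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The p-value is the upper tail of the F distribution with these degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_stat, df_between, df_within)` gives it. A probe would be retained when that p-value falls below 0.01.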

Network construction and identification of associated gene sets

To better understand the relationships among the selected 286 probes in terms of interactions/associations, we employ two computational network inference techniques. The first method is based on Partial Least Squares regression (PLS) [4], while the second is based on Partial Correlations (PC) [5]. The two approaches share a number of characteristics: both compute association scores whose magnitude reflects the strength of the interaction between genes, and both use a local false discovery rate (local fdr) Empirical Bayes procedure for multiplicity adjustment in testing multiple hypotheses. The results of applying the PLS and PC network reconstruction techniques to the reduced microarray data are summarized in the first three columns of Table 1 (for PLS) and Table 2 (for PC). The networks themselves are shown in Figure 1 and Figure 2, respectively. Tables 1 and 2 have the same structure. The first column shows the number of genes in each distinct gene association module (connected component) within the network. Gene association modules were defined to be clusters of 4 or more connected genes such that genes in two distinct components are not connected by an edge. This differs from the definition used in [2]. The tables are sorted by the second column, which displays each module's average association score as a percentage of the largest average association score (that of the first module in each table). The exact definition of the association score depends on the method used. For example, for the PC method, the association score of an edge is the partial correlation between the connected gene pair. Finally, the third column lists all the genes belonging to each module. Genes shown in red are those that appear in both tables.
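The module definition above (connected components with at least 4 genes, ranked by relative average association score) can be illustrated with a short sketch. It assumes the network has already been reduced to a list of significant edges; the function names and the use of absolute scores when averaging are our assumptions, not details given in the text.

```python
from collections import deque


def find_modules(edges, min_size=4):
    """edges: list of (gene_a, gene_b, score). Returns sorted gene lists."""
    adjacency = {}
    for a, b, _ in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, found = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            gene = queue.popleft()
            component.append(gene)
            for neighbour in adjacency[gene] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        if len(component) >= min_size:
            found.append(sorted(component))
    return found


def rank_modules(edges, min_size=4):
    """Return (module size, percent of top average score, genes), best first."""
    rows = []
    for genes in find_modules(edges, min_size):
        members = set(genes)
        # Assumption: average the absolute association scores within a module.
        scores = [abs(s) for a, b, s in edges if a in members and b in members]
        rows.append((genes, sum(scores) / len(scores)))
    if not rows:
        return []
    top = max(avg for _, avg in rows)
    table = [(len(genes), 100.0 * avg / top, genes) for genes, avg in rows]
    return sorted(table, key=lambda row: row[1], reverse=True)
```

Each returned row mirrors the first three columns described for Tables 1 and 2: module size, relative average score, and member genes.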

Figure 1. Gene-gene association network constructed using the PLS method

Figure 2. Gene-gene association network constructed using the PC method

Prediction of symptom severity

After identifying clusters of associated/interacting genes, we investigate the ability of each module to predict the CFS severity level. For that purpose, we fit a log-linear model for each gene module, regressing the clinical variable Cluster on the expression profiles of the genes included in the module. The overall ability of a given module to predict CFS severity can be judged by the likelihood ratio test, which compares the full model (all genes in the module included as covariates) with the null model, which includes no covariates. The p-values obtained from these tests are shown in the fourth column of Tables 1 and 2. A small p-value indicates that the gene association module is effective in predicting the symptom severity categories.
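For reference, the likelihood ratio comparison described above takes the following standard form. The notation is ours and assumes the usual large-sample chi-square approximation; the text does not state how the model for the three-level Cluster variable is parameterized, so the degrees of freedom d are left general.

```latex
\Lambda = 2\left[\ell\big(\hat{\beta}_{\text{full}}\big) - \ell\big(\hat{\beta}_{\text{null}}\big)\right],
\qquad
p = \Pr\!\left(\chi^{2}_{d} \ge \Lambda\right)
```

Here \ell denotes the maximized log-likelihood of each model and d is the number of parameters that the gene covariates add to the null model.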

SNP association

Using an analysis similar to that of the previous section, we study how effectively each gene cluster (module) can predict the genotypes of the two SNPs, hCV245410 and hCV7911132, which were identified in [2] as being associated with symptom severity. Again, we fit multiple log-linear models and compute the p-values of the likelihood ratio tests. The p-values for the two SNPs are shown in columns 5 and 6 of Tables 1 and 2.

Integration of proteomic data

We ran a number of well-regarded classifiers (Random Forest, LDA, and others) on the class information in the hope of identifying the features with the greatest classification ability; however, this approach was abandoned because none of the classifiers produced acceptable classification error rates under cross-validation. As an alternative, we performed a t-test for each m/z value comparing case and control samples, and identified the discriminating features by the magnitude of the p-values. We then fitted regression models to predict the intensity values of the ten most discriminating features from the expression values of the genes in the two modules (from PLS and PC, respectively) identified by our analysis of associated gene sets. The genes have good predictive ability, as can be seen from Table 3.
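The feature-ranking step can be sketched as follows. The text does not say which variant of the t-test was used; this sketch assumes the pooled-variance (Student) two-sample test with no missing values, in which case every feature has the same degrees of freedom and ranking by |t| gives the same order as ranking by p-value. Function names are illustrative.

```python
import math


def pooled_t(case, control):
    """Two-sample Student t statistic with pooled variance.

    Assumes each group has at least two values and nonzero pooled variance.
    """
    n1, n2 = len(case), len(control)
    m1, m2 = sum(case) / n1, sum(control) / n2
    ss1 = sum((x - m1) ** 2 for x in case)
    ss2 = sum((x - m2) ** 2 for x in control)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def top_features(cases, controls, n_top=10):
    """cases, controls: lists of per-subject feature lists (same feature order).

    Returns the indices of the n_top features with the largest |t|.
    """
    n_features = len(cases[0])
    ranked = []
    for j in range(n_features):
        t = pooled_t([s[j] for s in cases], [s[j] for s in controls])
        ranked.append((abs(t), j))
    ranked.sort(reverse=True)
    return [j for _, j in ranked[:n_top]]
```

The indices returned by `top_features` would then serve as the response variables for the regression models on module gene expression described above.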

Discussion

Two gene association modules (indicated by asterisks) are of interest based on their ability to predict symptom severity, at least one of the SNP genotypes, and the intensity of the identified proteomic features. The first cluster comes from the PLS reconstructed network and the other from the PC reconstructed network. Table 4 lists the eleven genes that these two gene modules have in common. The GO annotations listed in the table were mined from the BioGrid online repository [6], and the pathway analysis was conducted using the DAVID web tools [7] in addition to mining the existing literature. It is plausible that these genes are responsible for certain aspects of CFS or its symptoms. For example, the first gene on the list, WASF3 (aka WAVE3), is thought to take part in the p38 MAPK regulatory pathway [8]. Recent animal model studies [9], in turn, have demonstrated that regulation of brain cytokines through the p38 MAPK pathway is involved in the central mechanisms of fatigue and may therefore play a role in the pathogenesis of CFS. The list also includes the autoimmune response gene NUP98 and genes related to tumor activities (PRUNE, TNK2, HOXA1). Expression of HDAC7A was shown to be correlated with unexplained fatigue in a past study [10]. A role for the gene GPR41 in autoimmune disorders including CFS has been hypothesized in [11].

Conclusion

It is possible and perhaps desirable to integrate information from various experimental platforms in order to understand complex disorders. The findings in this study are based on data mining approaches using clinical, gene expression, SNP and proteomic data. The predictive models obtained here may explain certain aspects of CFS and may pave the way for further experimental validation.

Supplementary material

Data 1

97320630006120S1.pdf (87.4 KB, PDF)

Footnotes

Citation: Pihur et al., Bioinformation 6(3): 120-124 (2011)

References

1. SD Vernon, WC Reeves. Pharmacogenomics. 2006;7(3):345. doi: 10.2217/14622416.7.3.345.
2. A Presson, et al. Integration of genetic and genomic approaches for the analysis of chronic fatigue syndrome implicates forkhead box n1 (CAMDA 2006 Conference Paper). 2006.
3. V Pihur, et al. Understanding Chronic Fatigue Syndrome (CFS) from CAMDA data: A systems biology approach (CAMDA 2007 Conference Paper). 2007.
4. V Pihur, et al. Bioinformatics. 2008;24:561.
5. J Schäfer, K Strimmer. Bioinformatics. 2005;21(6):754. doi: 10.1093/bioinformatics/bti062.
6. C Stark, et al. Nucleic Acids Res. 2006;34:D535. doi: 10.1093/nar/gkj109.
7. J Dennis, et al. Genome Biol. 2003;4(5):P3.
8. K Sossey-Alaoui, et al. Exp Cell Res. 2005;308(1):135. doi: 10.1016/j.yexcr.2005.04.011.
9. T Katafuchi, et al. Ann N Y Acad Sci. 2006;1088:230. doi: 10.1196/annals.1366.020.
10. T Whistler, et al. Pharmacogenomics. 2006;7(3):395. doi: 10.2217/14622416.7.3.395.
11. D Staines. Med Hypotheses. 2005;65(1):29. doi: 10.1016/j.mehy.2005.02.013.

