Abstract
A major contributor to the scientific reproducibility crisis has been that the results from homogeneous, single-center studies do not generalize to heterogeneous, real world populations. Multi-cohort gene expression analysis has helped to increase reproducibility by aggregating data from diverse populations into a single analysis. To make the multi-cohort analysis process more feasible, we have assembled an analysis pipeline which implements rigorously studied meta-analysis best practices. We have compiled and made publicly available the results of our own multi-cohort gene expression analysis of 103 diseases, spanning 615 studies and 36,915 samples, through a novel and interactive web application. As a result, we have made both the process of and the results from multi-cohort gene expression analysis more approachable for non-technical users.
Keywords: Multi-cohort Analysis, Meta-Analysis, Gene Expression, Reproducibility, Web Application, Software
1. Introduction
Prior to translation of the results of a biological experiment into clinical practice, they must be replicated and validated in multiple independent cohorts. However, the majority of findings fail to validate, leading to a ‘reproducibility crisis’ in science.1,2 One of the factors in this lack of reproducibility is that traditional, single cohort studies do not represent the heterogeneity observed in the real world patient population.3 As a result, observed and reported effects are often specific to a population subset instead of generalizable across the population.
More than two million publicly available gene expression microarrays present novel opportunities to incorporate the real-world heterogeneity observed in patient populations into analysis.4,5 However, the biological (tissue, treatment, demographics) and technical (experimental protocol, microarray) heterogeneity present in such data poses a daunting challenge for their integration and reuse. Consequently, many tools, which allow reuse of these data, are unable to combine evidence across multiple data sets and place that burden on the end user, leading to under-utilization of these datasets.6,7
Previously, we have described a novel multi-cohort analysis framework for integrating multiple heterogeneous datasets to identify robust and reproducible signatures by leveraging the biological and technical heterogeneity in these datasets. We have repeatedly demonstrated the utility of our framework for identifying novel diagnostic and prognostic biomarkers, drug targets, and repurposing FDA-approved drugs in diverse diseases, including organ transplantation, cancer, infection, and neurodegenerative diseases.8–16 In each of these analyses, we analyzed more than a thousand human samples from more than 10 independent cohorts to generate and validate data-driven hypotheses. Many of these results also been further validated in experimental settings.8,11,16 These results have further demonstrated the ability of our framework to create “Big Data” by combining multiple smaller studies that are collectively representative of the real word patient population heterogeneity.
We recently published a systematic comparison of gene expression meta-analysis to evaluate existing tools, including GeneMeta, MAMA, MetaDE, ExAtlas, rmeta, and metafor,17–21 and described best practice guidelines for gene expression meta-analysis.22 While these existing packages perform both generic and gene expression meta-analysis, none provide coverage of the entire gene expression meta-analysis workflow: downloading data from public repositories, rigorously implementing gene expression meta-analysis best practices, and providing visualizations of the final results.
2. Multi-Cohort Gene Expression Analysis with MetaIntegrator
Despite its demonstrated utility in identifying robust, reproducible, and biologically as well as clinically relevant disease signatures, our multi-cohort analysis framework has previously required manual dataset download, pipeline set up, and visualization generation. To lower this barrier to entry, we have developed MetaIntegrator, an R package that automates most of the multi-cohort analysis framework. Our package guides the user from data download to execution of statistical analysis to evaluation of the results [Figure 1].
2.1. Data Processing
The first step in the multi-cohort analysis is downloading the requisite experimental information, notably the class labels (case or control), the gene expression data, and any interesting phenotypic information about the samples. Since we have found that most users will download data from the NCBI's Gene Expression Omnibus (GEO), we have integrated an automatic downloading and processing of GEO data into our analysis pipeline. MetaIntegrator will automatically download the expression data and all available annotations, perform sanity checks that the data have been appropriately normalized, and compile the data into the MetaIntegrator object format.
2.2. Multi-cohort Analysis
2.2.1. Combining effect sizes
Our meta-analysis approach computes an Hedges g effect size for each gene in each dataset defined as:
(1) |
where X̄1 and X̄0 are the average expression for cases and controls, S1 and S0 are the standard deviations for cases and controls, and n1 and n0 are the number of cases and controls.8,23 J is the Hedges' g correction factor, which is computed as:
(2) |
where df are the degrees of freedom.
To pool these effect sizes across datasets, the summary effect size gs is computed using a random effect model as:
(3) |
where n is the number of studies, gi is the Hedges' g of that gene within dataset i, Wi is a weight equal to 1/(Vi + T2), Vi is the variance of that gene within a given dataset i, and T2 is the inter-dataset variation as estimated by the DerSimonian-Laird method.23,24 The standard error for the summary effect size is . Given gs and SEgs, we calculate a p-value based on a standard normal distribution and perform a Benjamini-Hochberg FDR correction for multiple hypothesis testing across all genes.25
2.2.2. Heterogeneity of effect size
We calculate Cochrane's Q value for evaluating heterogeneity of effect size estimates between studies:
(4) |
where Wi, gi, and gs are the same as above.23 The p-value of Cochrane's Q is calculated against a chi-squared distribution and adjusted for multiple hypothesis testing using the Benjamini-Hochberg FDR method across all genes.25 A statistically significant Cochrane's Q indicates significant heterogeneity of effect sizes between studies.
2.2.3. Combining p-values
We use Fisher's method for combining p-values across studies.26 We calculate the log sum of p-values that each gene is up-regulated as:
(5) |
where n is the number of studies and pi is the t-test p-value that the gene of interest is up-regulated in study i. Similarly, we calculate Fdown as the log-sum of p-values that each gene is down-regulated.
For each gene, we calculate the p-value of Fup and Fdown under a chi-squared distribution and perform a Benjamini-Hochberg FDR correction across all genes.25
2.3. Signature Selection
Once meta-analysis is performed, a subset of genes must be identified as the disease signature. MetaIntegrator allows the user to identify these genes by varying the filtering parameters based on gene effect size, effect size false discovery rate, Fisher's method false discovery rate, heterogeneity of effect size, and the number of studies in which the gene was present. In order to avoid disproportionate influence of a single study, MetaIntegrator allows the user only include genes which were similarly significant across all leave-one-dataset-out analyses. By varying these criterion, the user may control whether they identify a larger set of genes, which may be ideal for understanding molecular pathogenesis and identifying drug targets, or a smaller set of genes, which may be optimal developing a parsimonious clinical diagnostic.
For users that are particularly interested in developing a powerful diagnostic, we have integrated forward and backward search, which reduce gene set size to optimize the area under the receiver operating characteristic curve on the training data.10
2.4. Score Calculation
For a set of signature genes, a signature score can be computed for every sample, i, as:
(6) |
where pos and neg are the sets of positive and negative genes, respectively, and xi(gene) is the expression of any particular gene in sample i (a positive score indicates an association with cases and a negative score with controls). This score Si is normalized to a z-score to center the samples for each study around zero.
2.5. Visualization
With scores calculated for each sample, we are able to visualize comparisons of cases vs. controls, regression of continuous variables against the score, and consistency of gene expression across datasets. Some of the built in visualizations, in counter-clockwise order from Figure 1:
Forest plots. Examine the effect sizes and standard errors for a single gene across studies, including the summary effect size.
Regression plots. Evaluate the relationship of the signature score with continuous variables like clinical severity and time.
Heatmap plots. Observe consistency of differential expression for all signature genes across studies.
Violin plots. Compare signature scores across categorical variables like disease subtypes, treatment protocols, and demographic groups.
ROC plots. Evaluate classification performance for signature score on a single dataset in terms of specificity and sensitivity.
3. Data-Driven Biological Hypotheses with MetaSignature
We have created MetaSignature (http://metasignature.stanford.edu), a web application which empowers researchers to generate data-driven hypotheses by enabling access to the results of our multi-cohort gene expression analysis framework. We focused on enabling intuitive data access for researchers with specific interest in either a disease, a gene, or several genes, while requiring little or no analytic background.
3.1. Data
Thus far, we have aggregated 615 gene expression studies composed of more than 35,000 human samples with approximately 1.5 billion data points from 103 diseases, a number which we will continue to grow. For each disease, we applied our multi-cohort analysis approach to compute the gene expression differences between the manually curated cases and controls. To perform these multi-cohort analyses, we searched for relevant studies in GEO, identified cases and controls in every study, and calculated disease effect sizes using the MetaIntegrator R package. We stored the multi-cohort analysis results in a MySQL database for rapid retrieval. As more studies are incorporated into our database, we recalculate the disease summary effect sizes.
3.2. Gene-centric Analysis
For researchers that are interested in the expression of a particular gene, MetaSignature provides visualizations that allow researchers to quickly identify the diseases in which specified gene is most differentially expressed [Figure 2a], study-level data of the gene expression in particular diseases [Figure 2b], and cell type-specific gene expression patterns [Figure 2c].
For instance, consider a researcher who has developed a drug, such as atorvastatin, that effectively reduces plasma levels of CXCL10, and seeks to identify the most promising clinical applications. Using MetaSignature, she determines CXCL10 is significantly up-regulated in transplant rejection [Figure 2a]. A drilldown further identifies eight separate studies that have measured CXCL10 in transplant rejection, indicating a highly positive effect size in all except one of these studies [Figure 2b]. The researcher further observes that CXCL10 is up-regulated in monocytes, compared to other immune cell types. [Figure 2c]. Taken together, these findings would motivate a clinical investigation of the use of a CXCL10 inhibitor, such as atorvastatin, in monocytes of patients at risk for transplant rejection. We have already verified this data-driven hypothesis in mouse models and patient electronic health records, where, in both cases, atorvastatin increases graft survival.8
Beyond single gene analysis, MetaSignature empowers users to examine gene sets in terms of correlation of those genes based on their disease effect sizes [Figure 2e] and correlation of diseases based on expression of that set of genes [similar to Figure 2f]. These visualizations enable dissection of positively- and negatively-correlated members of gene families.
3.3. Disease-centric Analysis
If a researcher is more interested in a particular disease, then MetaSignature enables identification of genes that are most up- or down-regulated in that disease [Figure 2d] and exploration of that disease's relationship to other diseases based on gene expression [Figure 2f]. When we compute disease-disease correlation based on gene expression data, we observe clustering patterns that map to established disease categories.
To follow our example from the gene-centric analysis, consider a researcher who is interested in improving transplant rejection outcomes. To gain a global understanding of transplant rejection, the researcher observes that transplant rejection falls into a cluster of inflammatory diseases, including discoid lupus, Crohn's disease, and interstitial cystitis [Figure 2f]. By examining the transplant rejection expression data in MetaSignature, he or she would recognize that CXCL10, a chemokine important in inflammatory response, is one of the most up-regulated genes in transplant rejection [Figure 2d].27 After verifying that this observation is consistent across studies [Figure 2b], the researcher identifies that CXCL10 is a reasonable target for therapeutic inhibition in transplant rejection. Looking at other genes which are up- and down-regulated in transplant rejection, he or she recognizes that CXCL10 expression is in a positively correlated with several other genes, including TRAF2 and CD38 [Figure 2e]. Collectively from these observations, the researcher has learned that transplant rejection is related to inflammatory diseases, which is consistent with the observed up-regulation of CXCL10, an inflammatory chemokine. As noted in the gene-centric analysis above, we have observed increased graft survival through administration of atorvastatin.8
4. Discussion
The reproducibility crisis in biomedical research has led to erroneous conclusions and wasted resources. Here, we present a vertically integrated platform that can both assist with gene expression multi-cohort analysis (MetaIntegrator) and provide aggregated results for users who wish to rapidly test hypotheses (MetaSignature). By leveraging the growing public data available for study, this new resource can drastically reduce the time and effort for biological hypothesis testing across numerous studies and diseases. While many software packages exist for similar analyses,17–21 ours offers simple, custom software for plotting and analysis, automated downloading of data from GEO, and integration to the MetaSignature database.
Our package is complementary to the recently published OMiCC platform, which enables curation and meta-analysis of GEO studies.28 OMiCC relies on the RankProd package for performing meta-analysis using rank-based statistics for identifying differentially expressed genes.29 While others have provided thorough comparisons of the different meta-analysis methods, the most notable difference between RankProd and MetaIntegrator is that rank-based statistics fail to produce a summary effect size across multiple studies.30,31 By leveraging our MetaIntegrator package, OMiCC could produce differential gene expression profiles across multiple studies instead of internal to single studies.
Although MetaIntegrator is currently applicable to microarray gene expression data, we plan to expand the MetaIntegrator package to handle the count data which is generated by RNAseq experiments. Additionally, we intend to enable download from additional data repositories including ArrayExpress and, once RNAseq processing is implemented, Sequence Read Archive.?,5
Our work promises to increase reproducibility of research for both data analysts and wet lab researchers. For data analysts, we have made multi-cohort gene expression analysis publicly available through a straightforward R package. By performing integrative, multi-cohort analyses, these analysts will generate more reproducible results. For wet lab researchers, we are empowering data-driven hypotheses prior to experimentation. Rather than performing broad assays to identify disease related genes, researchers can focus on performing targeted experiments on genes which are reproducible across cohorts.
5. Package and Source Code Distribution
The MetaIntegrator R package, including an introductory vignette, may be installed using the following command in R:
install.packages(“MetaIntegrator”)
The source code for MetaIntegrator is available at: https://cran.rstudio.com/web/packages/MetaIntegrator/
MetaSignature was developed using R and Shiny and is hosted at: http://metasignature.stanford.edu/
Acknowledgments
We thank Alex Schrenchuk for computer support. WAH is funded by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747. FV is funded by the National Institute of Health K12 Career Award 5K12HL120001-02. MAT is funded by La Caixa Foundation. EB is funded by Gabilan Fellowship. PK is funded by the National Institute of Allergy and Infectious Diseases grants 1U19AI109662, U19AI057229, U54I117925, and U01AI089859.
References
- 1.Ioannidis JPA. PLoS medicine. 2005 Aug;2:e124. doi: 10.1371/journal.pmed.0020124. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Baker M. Nature. 2016:452. [Google Scholar]
- 3.Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG. Nature Genetics. 2001 Nov;29:306. doi: 10.1038/ng749. [DOI] [PubMed] [Google Scholar]
- 4.Edgar R. Nucleic Acids Research. 2002 Jan;30:207. doi: 10.1093/nar/30.1.207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA. Nucleic Acids Research. 2003 Jan;31:68. doi: 10.1093/nar/gkg091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Engreitz JM, Chen R, Morgan AA, Dudley JT, Mallelwar R, Butte AJ. Bioinformatics. 2011 Dec;27:3317. doi: 10.1093/bioinformatics/btr548. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Petryszak R, Burdett T, Fiorelli B, Fonseca NA, Gonzalez-Porta M, Hastings E, Huber W, Jupp S, Keays M, Kryvych N, McMurry J, Marioni JC, Malone J, Megy K, Rustici G, Tang AY, Taubert J, Williams E, Mannion O, Parkinson HE, Brazma A. Nucleic Acids Research. 2014 Jan;42:D926. doi: 10.1093/nar/gkt1270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Khatri P, Roedder S, Kimura N, De Vusser K, Morgan AA, Gong Y, Fischbein MP, Robbins RC, Naesens M, Butte AJ, Sarwal MM. The Journal of experimental medicine. 2013 Oct;210:2205. doi: 10.1084/jem.20122709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Chen R, Khatri P, Mazur PK, Polin M, Zheng Y, Vaka D, Hoang CD, Shrager J, Xu Y, Vicent S, Butte AJ, Sweet-Cordero EA. Cancer Research. 2014 May;74:2892. doi: 10.1158/0008-5472.CAN-13-2775. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Sweeney TE, Shidham A, Wong HR, Khatri P. Science Translational Medicine. 2015 May;7:287ra71. doi: 10.1126/scitranslmed.aaa5993. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Andres-Terre M, McGuire HM, Pouliot Y, Bongen E, Sweeney TE, Tato CM, Khatri P. Immunity. 2015 Dec;43:1199. doi: 10.1016/j.immuni.2015.11.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Li MD, Burns TC, Morgan AA, Khatri P. Acta neuropathologica communications. 2014 Jan;2:93. doi: 10.1186/s40478-014-0093-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Mazur PK, Reynoird N, Khatri P, Jansen PWTC, Wilkinson AW, Liu S, Barbash O, Van Aller GS, Huddleston M, Dhanak D, Tummino PJ, Kruger RG, Garcia BA, Butte AJ, Vermeulen M, Sage J, Gozani O. Nature. 2014 May; doi: 10.1038/nature13320. advance on. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Mazur PK, Herner A, Mello SS, Wirth M, Hausmann S, S´anchez-Rivera FJ, Lofgren SM, Kuschma T, Hahn SA, Vangala D, Trajkovic-Arsic M, Gupta A, Heid I, No¨el PB, Braren R, Erkan M, Kleeff J, Sipos B, Sayles LC, Heikenwalder M, Heβmann E, Ellenrieder V, Esposito I, Jacks T, Bradner JE, Khatri P, Sweet-Cordero EA, Attardi LD, Schmid RM, Schneider G, Sage J, Siveke JT. Nature Medicine. 2015 Sep;21:1163. doi: 10.1038/nm.3952. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Sweeney TE, Braviak L, Tato CM, Khatri P. The Lancet Respiratory Medicine. 2016;4:213. doi: 10.1016/S2213-2600(16)00048-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Sweeney TE, Wong HR, Khatri P. Science translational medicine. 2016 Jul;8:346ra91. doi: 10.1126/scitranslmed.aaf7165. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Lusa L, Gentleman R, Ruschhaupt M. GeneMeta: MetaAnalysis for High Throughput Experiments [Google Scholar]
- 18.Ihnatova I. MAMA: Meta-Analysis of MicroArray. 2013 [Google Scholar]
- 19.Lumley T. rmeta: Meta-analysis. 2012 [Google Scholar]
- 20.Wang X, Kang DD, Shen K, Song C, Lu S, Chang LC, Liao SG, Huo Z, Tang S, Ding Y, Kaminski N, Sibille E, Lin Y, Li J, Tseng GC. Bioinformatics. 2012;28:2534. doi: 10.1093/bioinformatics/bts485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Sharov A, Schlessinger D, Ko MSH. Journal of Bioinformatics and Computational Biology. 2015;13:1550019. doi: 10.1142/S0219720015500195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Sweeney TE, Haynes WA, Vallania F, Ioannidis JP, Khatri P. Nucleic acids research. 2016 Sep;:gkw797. doi: 10.1093/nar/gkw797. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-Analysis. 2009 [Google Scholar]
- 24.DerSimonian R, Kacker R. Contemporary Clinical Trials. 2007;28:105. doi: 10.1016/j.cct.2006.04.004. [DOI] [PubMed] [Google Scholar]
- 25.Benjamini Y, Hochberg Y. Journal of the Royal Statistical Society Series B (Methodological) 1995;57:289. [Google Scholar]
- 26.Fisher R. Statistical methods for research workers. 1925 [Google Scholar]
- 27.Neville LF, Mathiak G, Bagasra O. Cytokine & Growth Factor Reviews. 1997 Sep;8:207. doi: 10.1016/s1359-6101(97)00015-4. [DOI] [PubMed] [Google Scholar]
- 28.Shah N, Guo Y, Wendelsdorf KV, Lu Y, Sparks R, Tsang JS. Nature Biotechnology. 2016 Jun; doi: 10.1038/nbt.3603. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. Bioinformatics (Oxford, England) 2006 Nov;22:2825. doi: 10.1093/bioinformatics/btl476. [DOI] [PubMed] [Google Scholar]
- 30.Chang LC, Lin HM, Sibille E, Tseng GC. BMC bioinformatics. 2013;14:368. doi: 10.1186/1471-2105-14-368. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Ramasamy A, Mondry A, Holmes CC, Altman DG. PLoS medicine. 2008 Sep;5:e184. doi: 10.1371/journal.pmed.0050184. [DOI] [PMC free article] [PubMed] [Google Scholar]