There has been considerable progress in our understanding of the genetic architecture of susceptibility to inflammatory diseases in recent years: several hundred susceptibility loci have been discovered in genome-wide association studies (GWAS) of human populations. This success has created an important challenge in identifying the functional consequences of these risk-associated variants and in elucidating how the repercussions of individual susceptibility loci integrate to yield dysregulation of immune pathways and, ultimately, syndromic clinical phenotypes. The integration of GWAS association signals with high-resolution transcriptome and other genomic data that capture the dynamics of cellular state and function in the context of an individual's collection of susceptibility alleles has proven to be a successful avenue of investigation. The rapid pace of methodological development in this area has been coupled with an accumulation of experimental data that makes the elucidation of complex biological networks underlying susceptibility to these common inflammatory diseases a reasonable goal in the near future.
Advances in genotyping technology, together with the discovery that genetic variation in the human genome is structured in such a way that common nucleotide variants do not segregate entirely independently of each other, ushered in an era of genome-wide association studies (GWAS). A GWAS is now a common study design for discovering genetic variation that contributes to complex traits and, to be successful, typically requires the evaluation of hundreds of thousands to millions of genetic variants for correlation to a given phenotype in several thousand individuals. To date, approximately 200 loci harboring commonly occurring genetic variants (minor allele frequency > 0.05) have been convincingly associated with inflammatory disease in humans, and the National Human Genome Research Institute at the U.S. National Institutes of Health updates a catalog of published GWAS results (http://www.genome.gov/gwastudies/) on a weekly basis. Yet, despite our success in discovering susceptibility loci, the identification of the precise variant(s) contributing to a given trait and the mechanism by which such variants exert their effect on disease remain elusive. One challenge is the sheer number of susceptibility loci, as the detailed dissection of a single locus represents a substantial investment in effort and resources. Typically, many loci are associated with trait variance, and each locus contributes only a small effect to syndromic traits such as susceptibility to an inflammatory disease. Thus, a priori, there is often no clear order in which to proceed.
Another challenge is that a locus, the segment of chromosomal DNA containing the trait-associated variant(s), may contain multiple genes: mapped trait-associated variants localize to gene-rich as well as gene-poor regions. Importantly, most of the associated variants within a given locus are surrogate markers that are in linkage disequilibrium (LD) not only with the causal variant(s) but also many other variants. Thus, the causal variants are not readily identified at the end of a genome scan. Nonetheless, this set of correlated variants is useful in that it defines the boundaries of the locus that contains the causal variant(s). The gene(s) present in this chromosomal segment are the ones that are most likely to be affected by the disease-associated variant (although long distance regulatory effects are also possible), but, if there is more than one gene in the region, often one cannot statistically differentiate which one may be more likely to be affected. Since only a small number of trait-associated variants are coding variants that affect protein sequence non-synonymously and have been proven to have an effect on gene function, most loci require fine mapping of an association and further characterization of a locus to understand their role in the trait of interest.
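To make the notion of a surrogate marker concrete, the following minimal sketch computes r² between pairs of SNPs as the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of an allele). This genotype-based quantity is an assumption made for illustration and only approximates haplotype-based LD; the function names and the toy genotypes are hypothetical and do not come from any study discussed here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a monomorphic SNP has no defined correlation")
    return sxy / sqrt(sxx * syy)


def genotype_r2(snp_a, snp_b):
    """Squared correlation of genotype dosages (an approximation of LD r^2)."""
    return pearson_r(snp_a, snp_b) ** 2


# Hypothetical dosages for eight individuals (illustrative only).
index_snp = [0, 1, 2, 1, 0, 2, 1, 0]     # trait-associated index SNP
proxy_snp = [0, 1, 2, 1, 0, 2, 1, 0]     # perfect surrogate marker
unlinked_snp = [1, 0, 1, 2, 1, 1, 0, 2]  # weakly correlated SNP

r2_proxy = genotype_r2(index_snp, proxy_snp)
r2_unlinked = genotype_r2(index_snp, unlinked_snp)
print(f"r2 with proxy: {r2_proxy:.3f}; r2 with unlinked SNP: {r2_unlinked:.3f}")
```

A SNP with r² close to 1 relative to the index SNP carries nearly the same association signal, which is why association alone cannot separate the causal variant from its surrogates.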
Since a causal chain links a risk factor such as a genetic variant to immune dysfunction and eventually a syndromic phenotype such as susceptibility to an inflammatory disease, identifying the effect of variants in a susceptibility locus on pertinent intermediate phenotypes has proven to be a fruitful strategy with which to explore the functional consequences of a susceptibility locus and to refine the identity of the causal allele. Gene expression is one such intermediate trait that has been successfully leveraged in a number of disease studies [e.g., 1–7]. In this article, we review the current state of the art for integrating gene regulatory genomics with GWAS results in a systematic manner, using studies of inflammatory disease variants to illustrate the different strategies that have been successfully deployed.
The genetic basis of gene regulatory variation
While DNA sequence variants, such as null alleles, can result in extremes of gene expression that may be deleterious, population-based studies of healthy individuals have reported high levels of heritable inter-individual variation in gene expression levels, and studies have mapped genetic variation contributing to gene expression levels in a number of different cell types [8–14]. Collectively, studies that map genetic variation contributing to transcriptional variation are referred to as expression quantitative trait locus (eQTL) mapping studies. Their general design consists of genotyping subjects genome-wide and capturing a transcriptome-wide mRNA profile using microarrays or, more recently, high-throughput RNA sequencing. As in gene discovery studies related to a given disease, imputation using a reference map of human genome variation is used to interrogate the role of variants that are not directly genotyped but are in LD with a genotyped marker. Thus, almost every common marker (e.g., those alleles with a minor allele frequency > 5%) evaluated in GWAS will have been evaluated, directly or indirectly, in an eQTL study. An eQTL analysis itself consists of applying regression-based or non-parametric models to test millions of genetic variants for regulatory effects on the expression of nearby and distant genes. In a whole-genome eQTL analysis, many millions of tests are performed (N = number of genetic variants × number of genes or transcripts), requiring strict statistical thresholds for significance. While necessary in a genome-wide analysis, such strict thresholds can obscure many biologically meaningful effects of genetic variation on mRNA expression.
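The scale of the multiple-testing burden can be illustrated with a short calculation. The sketch below assumes a Bonferroni correction, which is only one of several possible choices and is not prescribed by the studies reviewed here; the variant and gene counts are hypothetical round numbers.

```python
def bonferroni_threshold(n_variants, n_genes, alpha=0.05):
    """Return (number of tests, per-test p-value threshold).

    Assumes every variant is tested against every gene and that a
    Bonferroni correction is applied (an assumption for illustration).
    """
    n_tests = n_variants * n_genes
    return n_tests, alpha / n_tests


# Hypothetical whole-genome analysis: 1 million variants x 20,000 genes.
n_tests, threshold = bonferroni_threshold(1_000_000, 20_000)
print(f"{n_tests:.1e} tests; per-test threshold {threshold:.1e}")
```

Under these assumed numbers, an association would need a p-value on the order of 10^-12 to be declared significant, which illustrates how modest but real regulatory effects can be missed.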
Thus, approaches to appropriately constrain such analyses have evolved: based on our understanding of the architecture of mammalian genes, sequences involved in the regulation of gene expression of a given gene are most likely to be found near that gene and harbor genetic variation influencing gene expression. Such 'cis-eQTL' analyses are focused on assessing the role of genetic variants on the expression of genes in their vicinity and, empirically, have been demonstrated to be well-powered to detect regulatory effects that are replicated [15]. For a given tissue and cohort, these analyses provide a list of genetic variants associated with a given gene or transcript's expression levels, the allelic direction of the association, and the magnitude of the effect, often quantified as the mean change in expression between individuals homozygous for either of the two alternate alleles. As such, they provide annotation of functional regulatory variation in the human genome.
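A single cis-eQTL test can be sketched as a simple linear regression of expression on genotype dosage, restricted to variants near the gene. The sketch below is illustrative only: the 1 Mb window, the additive model, the function names, and the toy data are all assumptions, and real analyses typically add covariates and a significance test that are omitted here.

```python
def in_cis_window(snp_pos, gene_tss, window=1_000_000):
    """True if the SNP lies within `window` base pairs of the gene start.

    The 1 Mb default is an assumed, commonly used choice.
    """
    return abs(snp_pos - gene_tss) <= window


def additive_effect(dosages, expression):
    """Least-squares slope and intercept of expression on allele dosage."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sgg = sum((g - mean_g) ** 2 for g in dosages)
    if sgg == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sge = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sge / sgg
    return slope, mean_e - slope * mean_g


# Hypothetical data: dosage of one allele and expression of a nearby gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 5.9, 6.1, 7.0, 6.8]

slope, intercept = additive_effect(dosages, expression)
# Under an additive model, the two homozygous groups differ by two allele copies.
homozygote_difference = 2 * slope
print(f"per-allele effect {slope:.2f}; homozygote difference {homozygote_difference:.2f}")
```

The sign of the slope gives the allelic direction of the association, and twice the slope corresponds to the difference between the two homozygous groups described above.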
When GWAS variants and eQTL variants co-localize to the same genomic region, this generates a testable hypothesis that a given genetic variant influences trait variance through effects on expression of a given gene. An example of an eQTL that co-localizes with Multiple Sclerosis (MS)-associated variants is shown in Figure 1. The figure shows a region of chromosome 20 where a genetic variant contributing to MS susceptibility has been localized (lower panel). This susceptibility locus encompasses multiple genes; however, the MS-associated genetic variant and surrogate markers that are in LD with it are also associated with altered expression of one of the genes in the region, specifically the CD40 gene, in peripheral blood mononuclear cells of multiple sclerosis patients. This result suggests that MS susceptibility may be due, in part, to altered CD40 gene expression levels that are influenced by genetic variants near the gene. The result is by no means conclusive, but it does suggest testable hypotheses to pursue in future studies.
Figure 1. Colocalization of cis-expression quantitative trait loci and Multiple Sclerosis GWAS signals in the CD40 locus.
The top panel reports p-values for cis-regulatory associations of SNPs with CD40 RNA expression levels in peripheral blood mononuclear cells of multiple sclerosis patients; -log10(p-value) is reported on the y-axis. The lower panel reports the Multiple Sclerosis (MS) GWAS signal (-log10(p-value)) for each SNP over the same chromosomal segment (1 Mb total) [58]. The RefSeq genes in the region are shown at the bottom of the figure. The LD (as quantified by r²) for each SNP with the index MS-associated CD40 SNP (rs6074022) is illustrated with the use of colors, as indicated in the top right of the figure. The figure was generated using LocusZoom [60].
To date, the majority of eQTL studies have been performed in healthy subjects and reveal a substantial amount of functional regulatory variation in the human genome. The degree to which the detected associations are population- and cell-type specific, or observable only under certain conditions, is an important consideration for integrating with GWAS associations. Clearly, some genes exhibit highly cell type-, tissue-, and context-specific expression patterns [16–17], and the extent to which the eQTL patterns are shared across cell types or tissues is still being quantified. Most eQTL studies to date have been of modest size, limiting the assessment of tissue overlap because of poor statistical power to support a negative result. Published studies have estimated that only 0.4%–0.5% of genes have a significant cis-eQTL in two or more tissues, and approximately 70–80% of eQTLs may be cell-type specific [8–10,18–21]. While these modestly sized studies may overestimate the true proportion of cell-type specific eQTLs, it is clear that, in many situations, the cell or tissue type in which mRNA is profiled will have an important effect on the presence and magnitude of a variant's effect on gene expression. This is important as it determines the relevance of existing eQTL datasets to disease studies and informs the design of new studies that have to balance the competing needs of accessing a specific cell type of interest (which may be technically and practically challenging) or using a suitable surrogate cell population or cell mixture.
To date, the majority of human eQTL studies have been conducted using cell lines, e.g., lymphoblastic cell lines (LCLs) that are derived from B lymphocytes transformed using the Epstein-Barr virus [13–14,22–24], but increasingly studies are being conducted in primary cells or tissues, including blood [3,6,19,25–26], monocytes [9,12], primary B-cells [9], liver [11,27–29], skin [10,18], adipose tissues [3,30], skeletal muscle [31], brain [32–34], T-cells [8,35], and fibroblasts [8]. Recognizing the utility of eQTL analysis and the tissue specificity of eQTL associations, the U.S. National Institutes of Health has funded an unprecedented multi-tissue eQTL study of human gene expression (http://commonfund.nih.gov/GTEx/). The Genotype-Tissue Expression (GTEx) program will publish summarized eQTL analysis results from many human tissues and will catalyze methods development to best utilize this wealth of data.
Population ancestry also contributes to context specificity of eQTLs. A cis-eQTL analysis in LCLs of eight cohorts representing different human ancestries estimates that up to 31% of well-annotated genes have a significant cis-eQTL relationship in at least one population. Of these genes with cis-eQTLs, more than 50% exhibit that cis-eQTL in at least two independent populations, and 6% of genes contain an eQTL effect in all eight populations [13]. This pattern likely reflects the modest sample size of each population, effects of the transformation process, possible effects of environmental variables on the source tissue, and some clear examples of population-specific associations. This suggests that in some cases, population ancestry may be an important consideration for integration of eQTL observations with GWAS results, particularly as GWAS is applied to populations that are not of European ancestry. Interestingly, while variation in eQTL relationships among populations is expected because of differences in LD structure among human populations, current studies suggest that allele direction and the magnitude of the effect on expression levels vary little for those associations that are shared across human populations [13]. Thus, in a minority of loci, there may be little in the way of population-specific modifying effects, and further work is needed to understand the role of these eQTLs that may relate to fundamental aspects of human biology. It is important to note that nearly all eQTL investigations conducted to date have focused on common genetic variants (minor allele frequency > 5%); thus, our understanding of the levels and patterns of regulatory variation and of effect sizes is limited to this set.
However, with recent developments in genome sequencing and rare variant genotyping and imputation, investigators are now beginning to identify and characterize regulatory effects of low-frequency variants [36], which may help in interpreting rare variant trait associations that are identified by GWAS or whole-genome sequencing projects.
Studies have demonstrated that, as a group, trait-associated variants identified through GWAS are enriched for eQTLs [37–38], with inflammatory disease risk variants showing a similar degree of enrichment for eQTLs in relevant cell types [6,25,39–41]. Publicly available eQTL genome browsers and databases (Table 1) provide useful resources for investigating potential eQTL associations for SNPs of interest. However, because most report only significant findings, and because analyses use different statistical models and criteria for data inclusion, negative results are difficult to interpret. Even in the case where a trait-associated variant is also determined to be an eQTL, one has to exercise caution in interpretation. Due to patterns of LD in the human genome and the large number of disease-associated and eQTL-annotated genetic variants, a simple overlap of significant eQTL variants and GWAS variants can occur even when the causal variants underlying the two signals are different. Statistical frameworks have now been developed to provide an estimate of the degree to which these signals of association overlap [41–42], and further methods development continues in this area.
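As a rough illustration of why a single overlapping SNP is weak evidence, one can ask whether the two association signals rise and fall together across all SNPs in a locus. The sketch below computes a Spearman rank correlation of -log10(p-value) between a GWAS and an eQTL dataset over their shared SNPs. This is a deliberately crude heuristic assumed for illustration; it is not one of the published statistical frameworks cited above, and the SNP identifiers and p-values are hypothetical.

```python
from math import log10, sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when one signal is constant")
    return sxy / sqrt(sxx * syy)


def signal_rank_correlation(gwas_p, eqtl_p):
    """Spearman correlation of -log10(p) over SNPs present in both datasets."""
    shared = sorted(set(gwas_p) & set(eqtl_p))
    if len(shared) < 3:
        raise ValueError("need at least three shared SNPs")
    gwas_strength = [-log10(gwas_p[s]) for s in shared]
    eqtl_strength = [-log10(eqtl_p[s]) for s in shared]
    rho = pearson_r(average_ranks(gwas_strength), average_ranks(eqtl_strength))
    return rho, len(shared)


# Hypothetical p-values keyed by made-up SNP identifiers.
gwas_p = {"snp1": 1e-8, "snp2": 1e-6, "snp3": 1e-3, "snp4": 0.2, "snp5": 0.5}
eqtl_p = {"snp1": 1e-12, "snp2": 1e-9, "snp3": 1e-4, "snp4": 0.1,
          "snp5": 0.7, "snp6": 0.3}

rho, n_shared = signal_rank_correlation(gwas_p, eqtl_p)
print(f"rank correlation {rho:.2f} over {n_shared} shared SNPs")
```

A high rank correlation is compatible with a shared causal variant but does not establish it, since LD alone can produce similar patterns; formal methods model this explicitly.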
Table 1.
Database or browser name | Web link | Reference |
eqtl.uchicago.edu browser | http://eqtl.uchicago.edu/cgi-bin/gbrowse/eqtl/ | Veyrieras et al 2008 |
SNP Express | http://computel.lsrc.duke.edu/softwares/SNPExpress/1_database.php | Heinzen et al 2008 |
mRNA by SNP browser | http://www.sph.umich.edu/csg/liang/asthma/ | Dixon et al 2007, Moffatt et al 2007 |
SCAN: SNP and CNV Annotation Database | http://www.scandb.org | Gamazon et al 2010 |
Genevar: GENe Expression VARiation | http://www.sanger.ac.uk/software/analysis/genevar/ | Yang, T.P. et al 2010 |
seeQTL | http://www.bios.unc.edu/research/genomic_software/seeQTL/ | Xia, K. et al 2012 |
GTEx (Genotype-Tissue Expression) eQTL Browser | http://www.ncbi.nlm.nih.gov/gtex/test/GTEX2/gtex.cgi | |
WebQTL | http://webqtl.org | Wang et al 2003 |
Functional experiments to elucidate mechanisms underlying disease associations need to be performed in relevant cell types due to the context specificity of many cellular processes. Until recently, identifying the most relevant cell type for a trait of interest has relied on educated guesses based on interpretation of current literature. For inflammatory diseases, this poses a particular challenge: while a specific cell type may have a prominent role based on our current understanding of disease pathophysiology and treatment mechanism, inflammatory diseases are systemic diseases and multiple different immune cells are likely to contribute to disease susceptibility and to be the target of susceptibility variants. Studies quantifying enrichment of eQTLs among inflammatory disease-associated variants have identified specific cell types where the enrichment is greatest and have suggested that these cell types may be the ones that are more relevant in terms of genetic susceptibility to an inflammatory disease [9,38]. A different approach has been proposed by Raychaudhuri and colleagues [43] who devised a systematic assessment of cell types that starts with a set of disease-associated variants, identifies genes within each locus, and calculates a probabilistic model that designates the tissue or cell type(s) in which the expression of these potential susceptibility genes is enriched. Applying this method to three inflammatory diseases and using RNA expression profiles from a large number of purified murine immune cells from the Immunological Genome Consortium [44], they designate cell types that are most relevant to systemic lupus erythematosus, Crohn's disease, and rheumatoid arthritis. While limited by the completeness of the list of susceptibility alleles and the use of gene sets defined in murine cells, these results implicate different combinations of cell types for each disease that are consistent with our current understanding of pathophysiology. 
Similar approaches using genome-wide maps outlining the state of chromatin in different human cell types have also successfully identified cell types relevant to specific diseases and will help to design studies that seek to explore disease-related eQTLs (Raychaudhuri, personal communication).
Systems biology to elucidate causal networks
While eQTL studies provide excellent annotation for the detailed characterization of a given locus, this becomes cumbersome when one explores the coordinated functional consequences of multiple different susceptibility loci. With each inflammatory disease having many dozens of susceptibility loci, the need for systematic and semi-automated evaluations of variant function has become acute and has led to the development of several different approaches. These approaches rely on the assumption that a set of disease-associated loci is likely to reflect a limited number of underlying mechanisms that may be detectable through enrichment analysis within biologically relevant gene sets defined using the correlation structure observed in RNA data (Kyoto Encyclopedia of Genes and Genomes, KEGG [45]; Ingenuity Pathway Analysis (Ingenuity Systems); Gene Ontology, GO [46]; Protein ANalysis THrough Evolutionary Relationships, PANTHER [47]; Biocarta (http://www.biocarta.com); Reactome [48]). Enrichment analyses need to be performed carefully, taking into account parameters such as LD, gene size, pathway size, and pathway complexity that can easily skew results and return spurious associations that are driven by the properties of a gene set and not by a true connection with disease susceptibility. Currently, INRICH (Interval-based Enrichment Analysis Tool for Genome Wide Association Studies [49]), MAGENTA (Meta-Analysis Gene-set Enrichment of variaNT Associations [50]), and DAVID (Database for Annotation, Visualization and Integrated Discovery [51,52]) are user-friendly analysis tools that conduct pathway analysis on several types of genomic variation, including but not limited to SNPs, copy number variants (CNVs), genes, and their combinations, taking into account several potential confounders.
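The basic logic of a gene-set enrichment test can be sketched with a one-sided hypergeometric test: given a universe of genes, a pathway, and a list of genes drawn from susceptibility loci, how surprising is the observed overlap? The sketch below is a naive baseline assumed for illustration. It ignores LD, gene size, and the other confounders discussed above, and it does not reproduce the procedures of the tools named in the text; the function name and the counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided p-value P(X >= n_overlap) for a hypergeometric draw.

    n_universe: total genes considered
    n_in_set:   genes annotated to the pathway or gene set
    n_selected: genes implicated by susceptibility loci
    n_overlap:  implicated genes that fall in the gene set
    """
    if not 0 <= n_overlap <= min(n_in_set, n_selected):
        raise ValueError("overlap cannot exceed either list size")
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, min(n_in_set, n_selected) + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes, a 100-gene pathway, 50 locus genes,
# 5 of which belong to the pathway.
p_value = hypergeom_enrichment_p(20_000, 100, 50, 5)
print(f"naive enrichment p-value: {p_value:.2e}")
```

Because neighboring genes in a locus are not independent draws, a p-value from this naive test is likely to be anti-conservative, which is one reason that dedicated tools with explicit handling of confounders are preferred.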
A different approach that uses annotations of protein-protein interactions as well as gene expression, DAPPLE [53], begins to integrate additional levels of information with RNA data and, using known susceptibility genes as a seed, has demonstrated an ability to enrich its results for genes that are later validated to have disease-associated variants [53].
Clearly, the path forward is to increase, in a meaningful way, the number and types of data considered in such analyses. The convergence of data from several different cell types identified as pertinent for a given disease may, for example, offer a better perspective on the functional repercussions of groups of variants on the interactions of different immune cells that ultimately lead to immune dysregulation and disease. Certain groups of susceptibility variants may function in different cell types and have functional consequences at the cellular level that interact with those of another group of variants in a different cell type. This type of approach gradually constrains the complexity of the analysis to build a hierarchical model outlining a network of cell-specific networks, each of which may be driven by a different subset of susceptibility variants. Such models provide an excellent substrate for the design of human immunologic studies that can validate the model and use it to investigate novel questions and perhaps develop algorithms that are useful in a clinical setting to support a diagnostic work-up.
Challenges and future directions
Integration of eQTLs with disease-associated variants is merely one tool available to try to elucidate perturbed biology and mechanisms. It is the first step in testing a very specific hypothesis, i.e., that associated variants exert their effects through gene regulatory mechanisms. If a GWAS variant is an eQTL variant in a relevant tissue or cell-type and the association signals from both data types are highly correlated, additional experiments are warranted to prove the links between genetic variation, mechanism, gene, and trait. This approach is particularly well suited to inflammatory disease where peripheral blood cell populations are relevant to the disease pathophysiology and can be sampled to rapidly confirm an eQTL observation and to develop new experiments that are based on the observation of differential gene expression relative to the susceptibility variant: the eQTL therefore provides a critical first observation that enables the rapid development of testable hypotheses. An important challenge of such functional characterization is access to healthy subjects with the common susceptibility variants of interest that can be recalled based on genotype to interrogate the role of given variants without the confounding effects of treatment and fluctuations in disease activity that are seen in subjects with an inflammatory disease. However, several resources – such as the PhenoGenetic Project at Brigham & Women's Hospital and the Cambridge BioResource – have emerged to successfully fulfill this need [54–55].
While eQTL analysis is a powerful tool, it can be enriched by adding other forms of information that also capture genetic variation that influences cellular function, such as signatures of natural selection: we and others have leveraged both types of data to gain insights into the consequences of disease-associated variants [56]. This strategy may be particularly well-suited to inflammatory disease variants since variation in immune response has clearly been under natural selection by pathogens at various times over the course of human history. Ultimately, multi-dimensional “omics” data (transcriptomics, epigenomics, lipidomics, proteomics, glycomics, etc.) from different cell types of well-characterized cohorts, together with analytic tools to integrate varied types of data, will help to elucidate disease pathways and the manner in which genetic variants contribute to traits such as susceptibility to an inflammatory disease.
Can we already translate some of these findings into the clinical sphere? The past five years have seen a tremendous growth in our understanding of the genetic architecture of inflammatory diseases and, particularly, in the extent of their shared architecture [57]. Further, we see that two diseases, such as MS and celiac disease, may share susceptibility loci but that the direction of the effect (in terms of the risk allele) is the same in only 50% of the loci. In the other 50%, the allele associated with risk in one disease is protective in the other [58]. Analyzing the functional consequences of these loci in different combinations that recognize an underlying molecular pathway architecture may help us to understand how the same variants can lead to very different diseases. In addition, such analyses will uncover key nodes downstream of several different variants that (1) may be excellent targets for drug development and (2) will need to be considered in drug development, as treatment for one disease may push immune responses in a direction that precipitates a second inflammatory disease while treating the first one. Thyroiditis as an adverse effect of alemtuzumab treatment in MS is one such example [59]. Overall, the immune system is highly integrated, and even tissue-specific inflammatory diseases have immune alterations that are seen in the peripheral circulation and that affect most immune cells directly or indirectly. Thus, it is premature to focus analyses on intermediate traits of a specific cell type for a given disease, and communities of investigators should instead focus on developing large-scale approaches that perform integrated analyses of multiple cell types that will inform the study of multiple different diseases.
This is a realistic goal for the near future and will enable the integration of molecular signals across cell types, ideally in different activation contexts, that will yield higher order perspectives of immune pathways for more global approaches to immune modulation in inflammatory diseases.
Genome-wide association studies have identified many genetic variants associated with inflammatory disease risk.
Gene expression may help to elucidate causal genes and functional mechanisms underlying disease susceptibility.
Inflammatory disease variants are enriched for gene regulatory function.
Because of cell-type and tissue specificity of gene regulation, functional studies need to be performed in relevant cell types.
Analyses are evolving toward a more “systems genomic” approach in which genomic data are integrated to identify causal mechanisms and pathways for disease.
We thank Towfique Raj and Manik Kuchroo for creating Figure 1.
