Abstract
Mucus obstruction is a central feature in the cystic fibrosis (CF) airways. A genome-wide association study (GWAS) of lung disease by the CF Gene Modifier Consortium (CFGMC) identified a significant locus containing two mucin genes, MUC20 and MUC4. Expression quantitative trait locus (eQTL) analysis using human nasal epithelia (HNE) from 94 CF-affected Canadians in the CFGMC demonstrated MUC4 eQTLs that mirrored the lung association pattern in the region, suggesting that MUC4 expression may mediate CF lung disease. Complications arose, however, with colocalization testing using existing methods: the locus is complex and the associated SNPs span a 0.2 Mb region with high linkage disequilibrium (LD) and evidence of allelic heterogeneity. We previously developed the Simple Sum (SS), a powerful colocalization test in regions with allelic heterogeneity, but SS assumed eQTLs to be present to achieve type I error control. Here we propose a two-stage SS (SS2) colocalization test that avoids a priori eQTL assumptions, accounts for multiple hypothesis testing and the composite null hypothesis, and enables meta-analysis. We compare SS2 to published approaches through simulation and demonstrate type I error control for all settings with the greatest power in the presence of high LD and allelic heterogeneity. Applying SS2 to the MUC20/MUC4 CF lung disease locus with eQTLs from CF HNE revealed significant colocalization with MUC4 ($p = 1.31 \times 10^{-5}$) rather than with MUC20. The SS2 is a powerful method to inform the responsible gene(s) at a locus and guide future functional studies. SS2 has been implemented in the application LocusFocus.
Keywords: SS2, colocalization, mucin genes, lung disease, two-stage test, cystic fibrosis, composite null hypothesis, multiple hypothesis testing, meta-analysis, dependent samples, heterogeneity
Introduction
Cystic fibrosis (CF [MIM: 219700]) is a life-limiting genetic disease caused by mutations in the CF transmembrane conductance regulator (CFTR [MIM: 602421]). Multiple organs are affected in CF with variation in disease severity influenced by CFTR genotype, environmental factors, and modifier genes.1 The majority of morbidity and mortality in CF results from lung disease which is heritable beyond the contributions of CFTR.2 Mucus pathology is a hallmark of CF airway disease, and thus the mucin family of genes have been hypothesized to contribute to lung disease severity in CF.3 A genome-wide association study (GWAS; n = 6,365) of CF lung disease from the International CF Gene Modifier Consortium (ICFGMC) identified an associated locus on chromosome 3 in an intergenic region between two mucin genes—mucin 4, tracheobronchial (MUC4 [MIM: 158372]) and mucin 20, cell surface-associated (MUC20 [MIM: 610360])—providing support for the mucin hypothesis but leading to uncertainty around the responsible gene(s) at the locus. Given the associated variants are not tagging protein coding variation, the assumption is that the associated locus is marking gene regulation.
Colocalization analysis using GWAS and gene expression quantitative trait locus (eQTL) summary statistics in a CF airway model can test this hypothesis and provide statistical support for the most probable gene(s) at the locus. However, complications arise when we try to formally test colocalization at this locus using published tools.4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 First, the null hypothesis of no colocalization is a composite null hypothesis that consists of several different null scenarios, e.g., the null scenario where there are significant GWAS and eQTL SNPs at a locus, but the two do not colocalize, or that there is a GWAS significant SNP but no eQTL at the locus (see Material and methods for the four different null scenarios). Type I error rate control under all possible null scenarios can be challenging for any method, especially in the presence of multiple hypothesis testing across multiple genes or tissues. Second, the associated SNPs at the MUC4/MUC20 locus span a 0.2 Mb region with evidence of allelic heterogeneity and high linkage disequilibrium (LD); these are factors that can substantially reduce the power of existing methods.4,5 Third, in our study, the participants in the eQTL analysis overlap with a subset of the participants in the GWAS; this induces correlation between the eQTL and GWAS summary statistics that could lead to an increased false positive rate of colocalization depending on the degree of overlap.16 Lastly, our GWAS summary statistics are obtained from a meta-analysis with related individuals within sub-studies, further complicating LD adjustment in colocalization inference.
Several statistical methods have been developed to test for colocalization between GWAS and eQTL summary statistics, but none of the existing methods accommodate all the aforementioned complications, and we were surprised to see that the other published colocalization methods4,11,12,14 did not conclude strong colocalization for MUC4 eQTLs and the GWAS summary statistics. Bayesian colocalization approaches that are amenable to the use of summary statistics include eCAVIAR,5 COLOC,4 ENLOC,9 GWAS-PW,6 and COLOC2.14 These methods aim to identify whether there is a shared causal genetic variant that contributes to both the disease outcome and gene expression variation. Both eCAVIAR and ENLOC compute SNP-level posterior probabilities for being causal and a regional colocalization probability (RCP) by summing up the SNP-level probabilities. The RCP is computed for a locus that has been identified by GWASs, where the evidence under the composite null hypothesis is not explicitly calculated so it is unknown whether this method will have reliable operating characteristics more generally. COLOC, in contrast, computes the posterior probability incorporating subjective priors under five scenarios: the four null scenarios and a specific alternative when one single causal variant is shared. Colocalization is concluded using a recommended value for the posterior colocalization probability (e.g., 0.8).14 Although the threshold for the posterior probability can be modified to account for multiple hypothesis testing by adjusting the prior probability and other factors in the calculation of false positive report probability (FPRP),17,18 explicit recommendations to control the false positive rate when testing multiple genes and/or tissues at a locus for these Bayesian colocalization methods have not been detailed. 
GWAS-PW extends COLOC by empirically estimating priors from the genome-wide data for the five scenarios to investigate genetic variants that influence a pair of traits and provides software functions to account for overlapping participants in the two studies.6 Similarly, COLOC2 incorporates features implemented in GWAS-PW and provides an updated version of COLOC.14 However, COLOC, GWAS-PW, and COLOC2 all assume one single causal variant, and it has been shown that COLOC can have substantial loss in power when there are multiple, independent GWAS or eQTL signals at a locus (allelic heterogeneity).8 To gain additional power for a locus with multiple independent eQTL signals, Dobbyn et al.14 proposed a forward stepwise conditional analysis before conducting the COLOC2 analysis. For each identified eQTL signal, COLOC2 integrates the GWAS result with eQTL evidence conditional on all the other eQTL signals. However, the conditional analysis requires individual-level data and can be computationally intensive. An alternative approach adopts the SuSiE19 framework to distinguish multiple signals for a given trait, and then conducts COLOC analysis on all possible pairs of signals between the traits.20 This method has been shown to provide more accurate inference than COLOC based on conditional analysis; however, the identification of different signals relies on the power of SuSiE.20
There are several frequentist-based methods that also calculate colocalization evidence, for example the gene expression imputation approaches such as TWAS13 and PrediXcan.21 These methods first use reference expression databases to define a set of genetic variants predictive of gene expression levels, then they impute gene expression levels in a study sample and test for association with a disease phenotype. The reliability of these methods depends on their prediction accuracy which can be limited.22, 23, 24 Extensions that enable the use of external summary statistics such as S-PrediXcan,7 S-TWAS,13 and S-MultiXcan10 and pre-computed parameters are available. However, in our case we already have gene expression data from individuals with CF in a relevant airway model. Integration methods based on Mendelian randomization also indirectly address the question of colocalization. Two such approaches, SMR11 and Multi-SNP-based SMR (SMR-multi),12 derive their test statistics by assuming independence between GWASs and eQTL studies and use SNPs with eQTL p values < $5 \times 10^{-8}$ as instrumental variables to test whether gene expression differences cause the disease phenotype. SMR and SMR-multi restrict their analyses to regions with significant eQTLs, and they can accommodate meta-analysis but their robustness to related or overlapping samples is unknown. Lastly, JLIM15 evaluates whether there is a shared causal variant between eQTL and GWASs by developing a test statistic that contrasts the joint likelihood assuming one shared causal variant with that assuming distinct causal variants between eQTL and GWAS. To calculate accurate colocalization p values, JLIM requires individual-level gene expression data for permutation.
We previously developed an alternative frequentist and summary statistics-based colocalization method called Simple Sum (SS).8 It too does not address all of the four factors necessary for reliable colocalization analysis at our chr3q29 locus. SS performs well in the presence of linkage disequilibrium and allelic heterogeneity, but at a given GWAS region with significant association but no eQTLs, the SS could have type I error inflation under this null scenario of no colocalization. Thus, we recommended an ad hoc approach that restricted SS analyses to regions with significant eQTL evidence, similar to those imposed by other methods (e.g., SMR and SMR-multi). Here we extend the SS and propose a principled two-stage SS colocalization method (SS2) that controls the type I error rate for the composite null hypothesis, even in the presence of multiple hypothesis testing across multiple genes and/or tissues, and remains powerful in the presence of LD and heterogeneity. The SS2 can also accommodate summary statistics calculated from meta-analysis with related samples within sub-studies, and it is robust to the presence of a subset of overlapping samples between the two sets of summary statistics. The SS2 has been implemented in the application LocusFocus.25
For method comparison, we choose COLOC and COLOC2 from the class of Bayesian methods, and SMR and SMR-multi from the class of frequentist methods. Among the three Bayesian approaches that consider the issue of the composite null hypothesis (COLOC, GWAS-PW, and COLOC2), COLOC has been shown to have better performance than GWAS-PW when a single causal variant is shared between the two studies,15 while GWAS-PW can have better performance when eQTL and GWAS causal variants are distinct. We thus implement COLOC2 that incorporates features of both GWAS-PW and COLOC. Among the frequentist colocalization approaches that are applicable to summary statistics, S-PrediXcan, S-TWAS, and S-MultiXcan use prediction weights inferred from a publicly available source (see PredictDB Data Repository in web resources) to estimate the association between the predicted transcriptome and phenotypes, which can lead to biased or anti-conservative results.22, 23, 24 In our application, we already have expression from CF tissue (human nasal epithelia). Therefore, we implement SMR and SMR-multi which can be directly applied to summary statistics from any eQTL study without the need for training on a new sample to predict weights.
We first conduct extensive simulation studies to compare the proposed SS2 method with COLOC, COLOC2, SMR, and SMR-multi. We then apply these colocalization methods to the GWAS summary statistics from the MUC4/MUC20 (chr3q29) CF lung function locus26 and the eQTL summary statistics, respectively, for MUC20 and MUC4 in CF primary human nasal epithelia (HNE). Finally, for completeness we extend the colocalization analyses to the genes within 1 Mb of the locus using CF primary HNE, as well as other CF-related tissues from the genotype tissue expression project (GTEx).27 We demonstrate statistical support for MUC4 as the responsible gene at the locus. The MUC4 impact appears to be relevant in the lungs, both from GTEx and CF HNE, even after adjustment for multiple hypothesis testing of all 564 gene-by-tissue pairs evaluated at this locus.
Material and methods
Notation and model
For a locus of interest (e.g., chr3q29), here we assume that available information includes summary statistics from a phenotype-SNP association analysis and an expression-SNP eQTL analysis for a gene and tissue of interest (e.g., MUC20 in human nasal epithelia), although this is generalizable to any SNP-level summary data. Assume there are $j = 1, \ldots, m$ SNPs at the locus, and let $Z = (Z_1, \ldots, Z_m)'$ be the vector containing the GWAS summary statistics and $T = (T_1, \ldots, T_m)'$ the vector containing the eQTL summary statistics for a specific gene-by-tissue pair. $T$ can be obtained from, for example, GTEx27 or one’s own expression study, although the latter requires additional care if the GWAS and eQTL study samples are related or overlap, which we discuss later.
Our alternative of interest is that the GWAS signal and the eQTL signal coincide. As the complement to this alternative, the null hypothesis, H0, that there is no colocalization is composite, including four different scenarios:4
H01: no SNP-phenotype association and no eQTL,
H02: no SNP-phenotype association but eQTL present,
H03: SNP-phenotype association present but no eQTL,
H04: both SNP-phenotype association and eQTL present but occurring at two independent variants.
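That is, $H_0 = H_{01} \cup H_{02} \cup H_{03} \cup H_{04}$.
To test the composite H0, the Simple Sum test statistic is defined as
$$SS = Z'AZ = \sum_{j=1}^{m} a_j Z_j^2 \qquad \text{(Equation 1)}$$
where $Z = (Z_1, \ldots, Z_m)'$ is the vector of GWAS summary statistics, $A = \mathrm{diag}(a_1, \ldots, a_m)$, $a_j = (\eta_j - \bar{\eta}) / \left(\sum_{k=1}^{m} \eta_k^2 - m\bar{\eta}^2\right)$ with $\bar{\eta} = \frac{1}{m}\sum_{k=1}^{m} \eta_k$, and $\eta_j$ is a continuous eQTL evidence measure that can be defined as the squared eQTL summary statistic $T_j^2$ or $-\log_{10}$(eQTL p).8 In practice, $T_j^2$ and $-\log_{10}$(eQTL p) produce SS p values that provide the same colocalization conclusion.
Under the null hypothesis scenarios, $H_{01} \cup H_{02}$, that there is no SNP-phenotype association, $Z \sim N(0, \Sigma)$, where $\Sigma$ captures the LD structure of the region of interest (e.g., chr3q29).28, 29, 30, 31, 32 Thus, SS is distributed as $\sum_{j=1}^{m} \lambda_j \chi^2_{1,j}$ under $H_{01} \cup H_{02}$, where the $\chi^2_{1,j}$ are independent chi-square variables with one degree of freedom and the $\lambda_j$’s are the eigenvalues of $A\Sigma$. Implementing a one-sided test based on SS will then control the type I error not only under H01 or H02, but also under H04, as SS tends to be negative when the SNP-phenotype association and eQTL signals occur at two independent SNPs. However, this test could have type I error inflation under H03, and therefore we previously used caution to interpret colocalization findings when calculated from a small observed eQTL signal.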
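As a rough illustration of the Simple Sum test described above, the sketch below computes the statistic and a one-sided Monte Carlo p value under the no-GWAS-association null. It is not the LocusFocus implementation. The weight form (centred, scaled eQTL evidence), the function names, and the use of simulation in place of an exact mixture-of-chi-squares calculation are all assumptions made for illustration.

```python
import math
import random


def simple_sum(z, eta):
    """SS = sum_j a_j * z_j^2, with a_j ASSUMED to be centred, scaled eQTL evidence.

    z   -- GWAS Z scores for the m SNPs at the locus
    eta -- eQTL evidence per SNP (e.g., squared eQTL Z or -log10 eQTL p);
           must not be constant, otherwise the denominator is zero
    """
    m = len(z)
    eta_bar = sum(eta) / m
    denom = sum(e * e for e in eta) - m * eta_bar ** 2
    return sum((e - eta_bar) / denom * zj * zj for zj, e in zip(z, eta))


def cholesky(sigma):
    """Lower-triangular L with L L' = sigma (sigma must be positive definite)."""
    n = len(sigma)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low


def ss_pvalue_mc(ss_obs, eta, sigma, n_draws=10000, seed=1):
    """One-sided Monte Carlo p value: P(SS >= ss_obs) when Z ~ N(0, sigma)."""
    rng = random.Random(seed)
    low = cholesky(sigma)
    m = len(eta)
    hits = 0
    for _ in range(n_draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Correlate the independent draws using the LD matrix factor.
        z = [sum(low[i][k] * g[k] for k in range(i + 1)) for i in range(m)]
        if simple_sum(z, eta) >= ss_obs:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)
```

Only the upper tail is used because, as noted above, SS tends to be negative when the GWAS and eQTL signals sit at independent SNPs. Simulation is chosen here only to keep the sketch dependency-free.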
SS2 colocalization test and type I error control
Here we develop SS2 to test the composite null hypothesis that there is no colocalization controlling the type I error rate under all four null scenarios. SS2 is a two-stage testing procedure that formally evaluates the eQTL evidence at the region of interest prior to conducting the colocalization analysis. We further extend the method for use with summary statistics obtained from meta-analysis and related individuals within sub-studies, and we investigate the robustness of SS2 to GWAS and eQTL summary statistics calculated using overlapping samples.
If we let the matrix A in Equation 1 be the identity matrix, the SS test statistic is simplified to $Z'Z = \sum_{j=1}^{m} Z_j^2$, which has been used as a gene-based association test statistic.33, 34, 35 Here we replace the $Z_j$’s with the eQTL summary statistics $T_j$’s to evaluate eQTL evidence at the locus, as stage 1 of the SS2 test
$$T'T = \sum_{j=1}^{m} T_j^2 \qquad \text{(Equation 2)}$$
for a given gene in a tissue of interest. Alternative gene-based eQTL tests (e.g., maximum of $T_j^2$)36,37 can be implemented, depending on factors such as the genetic architecture at the locus of interest.
Recall that under the null scenario of H03 where there is GWAS association but no eQTL, the original SS test has inflated type I error rate because the assumption of $Z \sim N(0, \Sigma)$ is violated. However, in this case, $T \sim N(0, \Sigma)$, so $T'T$ is distributed as $\sum_{j=1}^{m} \lambda_j \chi^2_{1,j}$, where the $\lambda_j$’s are the eigenvalues of $\Sigma$.35 Thus, the SS2 stage 1 test can control the type I error rate of SS2 under H03: if stage 1 of the SS2 test is not significant, we conclude that there is no evidence of an eQTL and thus there is no evidence of colocalization; if stage 1 of the SS2 test is significant, we then implement the SS test statistic of Equation 1 in stage 2.
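The two-stage logic just described can be sketched as a small decision function. This is a minimal illustration, not the LocusFocus code. The function and key names are hypothetical, and the p values are assumed to come from whatever null-distribution calculation is used for each stage.

```python
def stage1_statistic(t):
    """Stage 1 eQTL evidence: sum of squared eQTL summary statistics at the locus."""
    return sum(tj * tj for tj in t)


def ss2_decision(p_stage1, stage2_pvalue, alpha=0.05):
    """Two-stage decision for a single gene-by-tissue pair.

    p_stage1      -- p value of the stage 1 eQTL test
    stage2_pvalue -- zero-argument callable returning the stage 2 SS p value;
                     it is only evaluated when stage 1 is significant
    """
    if p_stage1 >= alpha:
        # No eQTL evidence, so no evidence of colocalization; stage 2 is skipped.
        return {"eqtl": False, "colocalization": False, "p_stage2": None}
    p2 = stage2_pvalue()
    return {"eqtl": True, "colocalization": p2 < alpha, "p_stage2": p2}
```

Passing stage 2 as a callable makes it explicit that the colocalization test is never computed for a gene that fails the eQTL screen.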
In terms of the overall type I error control of the proposed two-stage SS2 test under the composite null, $H_0 = H_{01} \cup H_{02} \cup H_{03} \cup H_{04}$, intuitively, under the null scenarios of H01 or H03 when there are no eQTLs, stage 1 already controls the false positives. Under the null scenarios of H02 or H04, even if the power of detecting the eQTL evidence is 100% in stage 1, stage 2 then provides the control of false positives. In the supplemental material and methods, we show the independence between the stage 1 and stage 2 tests when there are no overlapping or related samples between the GWAS and eQTL study and demonstrate analytically the type I error rate control of SS2.
To ensure type I error rate control under the composite null hypothesis across all four different null scenarios, SS2 is conservative for certain null scenarios. For example, under H01 where there is no SNP-phenotype association or eQTL evidence, both stage 1 and stage 2 tests control the false positive rate at the nominal level $\alpha$; by the independence of the two tests, the overall type I error rate is $\alpha^2$ (Table 1). This conservativeness is necessary so that the type I error rate under other null scenarios (e.g., H03) is not inflated. Furthermore, this conservativeness is necessary for conducting multiple colocalization hypothesis tests where the family of tests could consist of different null scenarios, which we discuss in the section on Colocalization testing in the presence of multiple hypothesis testing.
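The conservativeness under H01 follows directly from the independence of the two stages. A short worked version is given below, writing $p_1$ and $p_2$ for the stage 1 and stage 2 p values (notation introduced here only for illustration) and taking a nominal level of 0.05:

```latex
\begin{aligned}
P(\text{SS2 rejects} \mid H_{01})
  &= P(p_1 < \alpha,\; p_2 < \alpha) \\
  &= P(p_1 < \alpha)\,P(p_2 < \alpha) \\
  &= \alpha^2 = 0.05^2 = 0.0025 .
\end{aligned}
```

This agrees with the empirical SS2 rates under H01 reported in Table 1 (0.0021 and 0.0025).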
Table 1.
Locus | Null scenario | Type I error of SS2 | Type I error of SMR | Type I error of SMR-multi | False positive rate of COLOC | False positive rate of COLOC2 |
---|---|---|---|---|---|---|
MUC20/MUC4 | H01 | 0.0021 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | 0.0011 |
MUC20/MUC4 | H02 | 0.0450 | 0.0377 | 0.0332 | $1.00 \times 10^{-4}$ | 0.0245 |
MUC20/MUC4 | H03 | 0.0229 | $<10^{-4}$ | $<10^{-4}$ | $4.00 \times 10^{-4}$ | 0.0858 |
MUC20/MUC4 | H04 | 0.0262 | 0.0389 | 0.0367 | $<10^{-4}$ | $1.00 \times 10^{-4}$ |
SLC6A14 | H01 | 0.0025 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | 0.0022 |
SLC6A14 | H02 | 0.0507 | 0.0372 | 0.0342 | $5 \times 10^{-4}$ | 0.0468 |
SLC6A14 | H03 | 0.0099 | $<10^{-4}$ | $<10^{-4}$ | $9 \times 10^{-4}$ | 0.088 |
SLC6A14 | H04 | 0.0133 | 0.0368 | 0.0336 | $<10^{-4}$ | 0.0012 |
The LD pattern at the simulated region follows that at the MUC20/MUC4 and SLC6A14 loci, respectively. Each row corresponds to a specific null scenario when there is no colocalization. H01 represents the scenario when there are no SNP-phenotype associations and no eQTL; H02 represents the scenario when there are no SNP-phenotype associations but eQTLs are present; H03 represents the scenario where SNP-phenotype associations are present but no eQTL; H04 represents the scenario where both SNP-phenotype association and eQTLs are present, but occurring at two independent SNPs. For SS2, SMR,11 and SMR-multi,12 the nominal type I error was set at $\alpha = 0.05$. SMR and Multi-SNP-based SMR test (SMR-multi) are conducted under the default setting such that a SNP is picked only if the eQTL p value is less than $5 \times 10^{-8}$. For COLOC214 and COLOC,4 the false positive rates are calculated by applying the 0.8 threshold (as recommended by Dobbyn et al.14) for the colocalization posterior probability. In total, $10^4$ replications are simulated for each null scenario.
Meta-analyses with related individuals within sub-studies
Many GWASs, such as our CF lung function GWAS, use a fixed or random effects meta-analysis to combine association evidence across multiple studies,26,38 and these multiple studies may contain related individuals within the sub-studies. To implement a colocalization analysis using meta GWAS Z-scores, COLOC4 assumes unrelated individuals, while COLOC214 incorporates features from GWAS-PW6 that consider sample overlap between studies but not sample relatedness within a study. SMR11 is derived based on the assumption that samples are independent between two studies, and its operating characteristics are unknown when there are related samples in a component study. Other methods such as eCAVIAR,5 ENLOC,9 SMR-multi,12 and JLIM15 compute their statistics under the assumption that GWAS Z-scores follow a multivariate normal distribution with covariance matrix $\Sigma$, where $\Sigma$ is estimated assuming an independent sample using either one’s own data or external data such as that from the 1000 Genomes Project.39 However, accounting for the sample relatedness in the LD matrix, $\Sigma$, is important for valid colocalization analysis, which we address below.
Assume the GWAS meta-analysis consists of C studies with sample sizes $n_1, \ldots, n_C$. Let $\mathbf{Z}_j = (Z_{j1}, \ldots, Z_{jC})'$ and $\hat{\boldsymbol{\beta}}_j = (\hat{\beta}_{j1}, \ldots, \hat{\beta}_{jC})'$ denote, respectively, the vector of Z scores and the vector of estimated effect sizes for SNP $j$ from the C studies. Let $Z_j^{meta}$ denote the Z score from the meta-analysis for SNP j, then
$$Z_j^{meta} = \frac{\sum_{c=1}^{C} w_{jc}\hat{\beta}_{jc}}{\sqrt{\sum_{c=1}^{C} w_{jc}}} = \frac{\sum_{c=1}^{C} \sqrt{w_{jc}}\, Z_{jc}}{\sqrt{\sum_{c=1}^{C} w_{jc}}} \qquad \text{(Equation 3)}$$
where $w_{jc}$ represents the weight for study c, which can take different forms.38 For the traditional fixed-effect approach, $w_{jc}$ is the inverse variance of $\hat{\beta}_{jc}$ or estimated by $2 n_c \hat{f}_{jc}(1 - \hat{f}_{jc})$, where $\hat{f}_{jc}$ represents the estimated minor allele frequency for SNP $j$ in study $c$. For the random-effect approach, the estimated between-study variance is incorporated into $w_{jc}$ to account for heterogeneity between studies.38 Note that Equation 3 is often applied for the scenario with no related or overlapping samples between sub-studies. The meta-analysis in the presence of related or overlapping samples between sub-studies can be conducted by alternative approaches such as the method proposed in Zhu et al.,40 which is not the scenario in the CF lung GWAS and thus is not applied here.
Based on Equation 3, we can show that
$$\mathrm{Cov}(Z_j^{meta}, Z_k^{meta}) = \frac{\sum_{c=1}^{C} \sqrt{w_{jc} w_{kc}}\; \mathrm{Cov}(Z_{jc}, Z_{kc})}{\sqrt{\sum_{c=1}^{C} w_{jc} \sum_{c=1}^{C} w_{kc}}} \qquad \text{(Equation 4)}$$
for SNPs $j$ and $k$. When individuals in study c are independent of each other and a simple linear model is used for GWAS, $\mathrm{Cov}(Z_{jc}, Z_{kc}) = r_{jk,c}$, where $r_{jk,c}$ is the standard Pearson correlation coefficient that represents the LD between SNPs $j$ and $k$ in study c.5 In this case, the covariance of meta-Z scores between the two SNPs is a weighted sum of study-specific LD measures.
In the presence of related individuals within sub-studies, and assuming that the GWAS is conducted using a linear mixed effect model, the Z scores and their asymptotic covariance matrix can be written in a closed form that is equivalent to using generalized least-squares (supplemental material and methods). For a GWAS that only contains the genotypes as predictors, $\mathrm{Cov}(Z_{jc}, Z_{kc}) = r^{*}_{jk,c}$, where $r^{*}_{jk,c}$ can be viewed as the pairwise Pearson correlation coefficient derived from the Cholesky-transformed genotype matrix. The supplemental material and methods provides a general form for $\mathrm{Cov}(Z_{jc}, Z_{kc})$ in the presence of additional covariates (e.g., age and sex) and alternative ways to approximate it using (for example) the R package nlme or GMMAT.41
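To make the meta-analysis adjustment concrete, the following sketch computes a meta-analysis Z score and the covariance between two meta Z scores as a weighted sum of study-specific LD values. It assumes independent sub-studies, inverse-variance style weights, and a standardized phenotype. The specific weight formula and all function names are illustrative assumptions rather than the authors' code.

```python
import math


def fixed_effect_weight(n, maf):
    """Approximate inverse variance of an effect estimate: 2 * n * maf * (1 - maf).

    ASSUMES a standardized phenotype (unit residual variance).
    """
    return 2.0 * n * maf * (1.0 - maf)


def meta_z(z_scores, weights):
    """Meta Z score for one SNP: sum_c sqrt(w_c) * Z_c / sqrt(sum_c w_c)."""
    num = sum(math.sqrt(w) * z for z, w in zip(z_scores, weights))
    return num / math.sqrt(sum(weights))


def meta_z_covariance(weights_j, weights_k, ld_by_study):
    """Covariance of meta Z scores for SNPs j and k.

    weights_j, weights_k -- per-study weights for SNP j and SNP k
    ld_by_study          -- per-study Cov(Z_jc, Z_kc): the Pearson LD r for
                            unrelated samples, or a relatedness-adjusted value
    """
    num = sum(
        math.sqrt(wj * wk) * r
        for wj, wk, r in zip(weights_j, weights_k, ld_by_study)
    )
    return num / math.sqrt(sum(weights_j) * sum(weights_k))
```

When the same LD value holds in every study and the two SNPs share weights, the covariance reduces to that LD value, which is a quick sanity check on any implementation.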
The eQTL summary statistics are typically obtained from a single study. If the eQTL summary statistics are also obtained from a meta-analysis, the covariance adjustment is also needed when conducting the stage 1 eQTL testing, following the principle in Equation 4. This adjustment, however, does not influence the covariance adjustment in the stage 2 colocalization testing, where the inference is conditional on the observed eQTL evidence.
Overlapping or related samples between GWASs and eQTL studies
The presence of overlapping or related samples can induce correlation between summary statistics even when there are no shared genetic effects, which may bias the model in favor of the alternative hypothesis.6,42 Let $Z_j$ and $T_j$ denote the summary statistics for a variant $j$ from a GWAS of sample size $n_{GWAS}$ and an eQTL study for a given gene-by-tissue pair of sample size $n_{eQTL}$, respectively. In particular, when there is no relatedness between the non-overlapping samples,
$$\mathrm{Cor}(Z_j, T_j) = \frac{n_0\, \rho}{\sqrt{n_{GWAS}\, n_{eQTL}}} \qquad \text{(Equation 5)}$$
where $n_0$ represents the number of overlapping samples; $\rho$ represents the correlation between phenotypes for the overlapping samples due to (for example) shared environmental factors, and $\mathrm{Cor}(Z_j, T_j)$ equals 0 if there is no sample overlap.6 In the presence of related individuals between the GWAS and eQTL study, the correlation between summary statistics could be more complicated than Equation 5,40 but could be calculated by using methods proposed in Zhu et al.40 or Province and Borecki.43 Several published approaches have addressed the influence of overlapping or related samples in the context of two GWASs, and they have implemented decorrelation approaches that we repurpose for our SS2 colocalization test in Colocalization analysis at the MUC4/MUC20 CF lung disease modifier locus.16,40,43,44
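The overlap-induced correlation can be evaluated with a one-line function. The sketch below assumes the standard form in which the correlation scales with the number of shared samples and the phenotypic correlation among them. The argument names are hypothetical.

```python
import math


def overlap_correlation(n_gwas, n_eqtl, n_overlap, rho):
    """Null correlation between GWAS and eQTL statistics induced by sample overlap.

    n_gwas, n_eqtl -- sample sizes of the two studies
    n_overlap      -- number of individuals present in both studies
    rho            -- phenotypic correlation among the overlapping individuals
    """
    if n_overlap > min(n_gwas, n_eqtl):
        raise ValueError("overlap cannot exceed either sample size")
    return n_overlap * rho / math.sqrt(n_gwas * n_eqtl)
```

As a purely hypothetical example, if all 94 eQTL participants were also among the 6,365 GWAS participants, the induced correlation could be at most about 0.12 (reached only when the phenotypic correlation is 1).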
Colocalization testing in the presence of multiple hypothesis testing
So far we have investigated the properties of SS2 when testing colocalization for a single gene in a specified tissue. In the present study, we are interested in determining whether the SNP-phenotype association evidence is colocalizing with gene expression of MUC4 or MUC20 in HNE, an established CF airway model.3 In fact, there are 50 genes annotated to a 1 Mb region surrounding the top GWAS signal. Moreover, even though the MUC4/MUC20 locus was identified as associated with CF lung disease, it would be of interest to investigate colocalization evidence in other tissues that may be affected in CF and for which the GTEx consortium provides eQTL summary statistics (GTEx V8).
To evaluate type I error rate control of the proposed SS2 colocalization test in the presence of multiple hypothesis testing, we consider the family-wise error rate (FWER). To maintain FWER control of SS2 under the composite null hypothesis for testing multiple genes and tissues, we implement stage 1 of the SS2 test of Equation 2 for all the genes in each tissue and adjust the significance level $\alpha$ for the total number of tests, $K$, by Bonferroni correction, $\alpha/K$. We then implement stage 2 of the SS2 test of Equation 1 only for the $K_1$ gene-by-tissue pairs with significant stage 1 eQTL tests and adjust for the corresponding multiple hypothesis testing by $\alpha/K_1$.
The two-stage Bonferroni correction, $\alpha/K$ followed by $\alpha/K_1$, intuitively should control FWER at or below the nominal level. Indeed, we can show (supplemental material and methods) that when there is no GWAS association, the upper bound of the FWER is $\alpha$ when the tests are a mixture of H01 and H02, or all under one of the null scenarios (H01 or H02). However, complications arise when there is GWAS association at the locus but the tests are a mixture of H03 and H04; some genes/tissues do not have eQTLs (H03) while the remaining ones have eQTLs (H04) but do not overlap with the GWAS signal. To see this, first, when all tests are under H03, even though the colocalization test in stage 2 may have inflated type I error, say as large as 1, the FWER is controlled at $\alpha$ via stage 1, as each eQTL test in stage 1 is controlled at $\alpha/K$. Second, when all tests are under H04, even though all tests pass stage 1 due to strong eQTLs, the corresponding colocalization tests in stage 2 have properly controlled type I error. However, when there is a mixture of H03 and H04, the eQTL test in stage 1 is no longer controlled at $\alpha/K$ due to the presence of H04. This, combined with inflated false positives in stage 2 for H03, can lead to increased FWER, which has not been investigated before by us or by others.
In the supplemental material and methods, we provide a specific example and show that a crude upper bound for FWER is $2\alpha$. However, this bound assumes the empirical type I error rate of the colocalization test for H03 in stage 2 is 1, which is unrealistic. Our empirical studies below show that we did not observe a single iteration with empirical FWER greater than the specified significance level under H03 and H04 (Tables 2 and S8), and the proposed two-stage SS2 testing procedure tends to be conservative under H01 and H02.
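The two-stage Bonferroni procedure can be sketched as follows. This is a minimal illustration under the assumption that stage 1 is corrected for all tests and stage 2 only for the tests that pass stage 1. The function name and the gene labels in the usage note are hypothetical.

```python
def ss2_two_stage_bonferroni(stage1_p, stage2_p, alpha=0.05):
    """Apply stage 1 and stage 2 Bonferroni thresholds across gene-by-tissue pairs.

    stage1_p -- dict: pair label -> stage 1 (eQTL) p value, for all pairs tested
    stage2_p -- dict: pair label -> stage 2 (colocalization) p value; only the
                pairs that pass stage 1 are looked up
    Returns (pairs passing stage 1, pairs declared colocalized).
    """
    k_total = len(stage1_p)
    passed = [pair for pair, p in stage1_p.items() if p < alpha / k_total]
    if not passed:
        return [], []
    # Stage 2 is corrected only for the tests that reached it.
    threshold2 = alpha / len(passed)
    colocalized = [pair for pair in passed if stage2_p[pair] < threshold2]
    return passed, colocalized
```

For example, with four hypothetical pairs and stage 1 p values of 1e-6, 0.5, 0.001, and 0.02, the stage 1 threshold is 0.0125, so two pairs proceed and each is then tested for colocalization at 0.025.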
Table 2.
Locus | Proportion of genes with eQTL association that do not colocalize | FWER of SS2 | FWER of SMR | FWER of SMR-multi | False positive rate of COLOC | False positive rate of COLOC2 |
---|---|---|---|---|---|---|
MUC20/MUC4 | 0% | 0.0377 | 0.0010 | 0.0010 | 0.1307 | 0.0088 |
MUC20/MUC4 | 20% | 0.0294 | 0.0003 | 0.0002 | 0.1089 | 0.0023 |
MUC20/MUC4 | 40% | 0.0297 | 0.0002 | 0.0001 | 0.0854 | 0.0027 |
MUC20/MUC4 | 60% | 0.0300 | $9.00 \times 10^{-5}$ | $5.00 \times 10^{-5}$ | 0.0606 | 0.0034 |
MUC20/MUC4 | 80% | 0.0317 | $8.00 \times 10^{-5}$ | $4.00 \times 10^{-5}$ | 0.0351 | 0.0040 |
MUC20/MUC4 | 100% | 0.0346 | $6.00 \times 10^{-5}$ | $5.00 \times 10^{-5}$ | 0.0043 | 0.0044 |
SLC6A14 | 0% | 0.0054 | 0.0008 | 0.0008 | 0.1626 | 0.0003 |
SLC6A14 | 20% | 0.0016 | 0.0008 | 0.0004 | 0.1360 | 0.0049 |
SLC6A14 | 40% | 0.0011 | 0.0004 | 0.0002 | 0.1103 | 0.0056 |
SLC6A14 | 60% | 0.0010 | 0.0003 | 0.0002 | 0.0831 | 0.0070 |
SLC6A14 | 80% | 0.0008 | 0.0002 | 0.0001 | 0.0542 | 0.0080 |
SLC6A14 | 100% | 0.0007 | 0.0001 | $9.00 \times 10^{-5}$ | 0.0234 | 0.0089 |
The height of the GWAS peak is set at 5.06 on the $-\log_{10} p$ scale such that 10% power is achieved to detect the GWAS association at a significance level of $10^{-8}$. In total, 600 genes are simulated based on the LD pattern at the MUC20/MUC4 locus or the SLC6A14 locus, respectively. Each row corresponds to a different proportion of genes that have eQTL association (0%, 20%, 40%, 60%, 80%, and 100%). The eQTL peaks are randomly generated from six different intervals (50%–60%, 60%–70%, 70%–80%, 80%–90%, 90%–95%, or 95%–100% power is achieved to detect the eQTL association at the significance level of $10^{-8}$) with probabilities according to the proportion of the $-\log_{10}$(maximum eQTL p value) within each interval observed at the corresponding locus. None of the eQTL peaks colocalizes with the GWAS peak for FWER evaluation. SMR and Multi-SNP-based SMR test (SMR-multi) are conducted under the default setting such that a SNP is picked only if the eQTL p value is less than $5 \times 10^{-8}$. COLOC2 is conducted by using the algorithm implemented in GWAS-PW, where the posterior probability is calculated based on the likelihood of all gene-by-tissue pairs. In total, $10^5$ replications are simulated to evaluate FWER of 0.05 and the false positive rates by applying the 0.8 threshold (as recommended by Dobbyn et al.14) for the colocalization posterior probability. The empirical FWER (or false positive rate for COLOC and COLOC2) is calculated as the proportion of the $10^5$ replications in which at least one gene has a false colocalization claim.
Simulations
To evaluate the performance of the proposed SS2 colocalization test, we conduct extensive simulation studies using the LD pattern observed at the MUC4/MUC20 locus and for completeness also LD at SLC6A14, a locus previously investigated.8 For method comparison, we choose four alternative colocalization procedures, SMR, SMR-multi, COLOC, and COLOC2. For SS2, SMR,11 and SMR-multi,12 the nominal significance level is set at $\alpha = 0.05$. For COLOC214 and COLOC,4 colocalization is concluded for each gene and/or tissue if the colocalization posterior probability > 0.8 (as recommended by Dobbyn et al.14), although there is no theoretical reason to expect that this threshold will correspond to a 0.05 false positive rate. We outline the simulation study design here and provide additional simulation details in the supplemental material and methods.
Simulation for a single gene-by-tissue pair
We first consider a simulation study assessing colocalization at a locus between SNP-phenotype association (GWAS) p values and SNP-expression (eQTL) p values for a single gene in a given tissue type. Following the simulation procedure in Hormozdiari et al.5 and Gong et al.,8 we focus on SNPs 0.1 Mb on either side of the top-associated SNP from the CF lung GWAS26 at the MUC4/MUC20 locus and at the SLC6A14 locus, respectively, to generate data for the two loci with different LD distributions. For type I error evaluation, we generate GWAS and eQTL summary statistics from a multivariate normal distribution based on the LD pattern of SNPs at each of the two loci and simulate null scenarios from H01 to H04 under the composite null hypothesis. Simulation details are provided in the supplemental material and methods and their corresponding parameter values are provided in Table S6.
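The null-scenario simulation can be sketched with a small generator for correlated summary statistics. The sketch assumes the common model in which the mean of the Z-score vector is the LD matrix times a vector of non-centrality values at the causal SNPs, and it draws the GWAS and eQTL statistics independently (no sample overlap). The function names, the single shared non-centrality value, and the scenario labels as strings are illustrative choices, not the parameter settings of Table S6.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular L with L L' = sigma (sigma must be positive definite)."""
    n = len(sigma)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low


def mean_vector(sigma, effects):
    """ASSUMED mean model: E[Z] = sigma * lambda, lambda nonzero at causal SNPs.

    effects -- dict: causal SNP index -> non-centrality value
    """
    m = len(sigma)
    return [sum(sigma[i][j] * lam for j, lam in effects.items()) for i in range(m)]


def simulate_statistics(sigma, effects, rng):
    """Draw one vector of summary statistics from N(sigma * lambda, sigma)."""
    low = cholesky(sigma)
    m = len(sigma)
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    mu = mean_vector(sigma, effects)
    return [mu[i] + sum(low[i][k] * g[k] for k in range(i + 1)) for i in range(m)]


def simulate_null_scenario(sigma, scenario, gwas_snp, eqtl_snp, ncp, rng):
    """Simulate (GWAS, eQTL) statistics under one of H01, H02, H03, H04.

    For H04 the caller should pick gwas_snp and eqtl_snp in low LD so that
    the two signals are distinct.
    """
    gwas_effects = {gwas_snp: ncp} if scenario in ("H03", "H04") else {}
    eqtl_effects = {eqtl_snp: ncp} if scenario in ("H02", "H04") else {}
    z = simulate_statistics(sigma, gwas_effects, rng)
    t = simulate_statistics(sigma, eqtl_effects, rng)
    return z, t
```

Passing an explicit `random.Random` instance keeps replications reproducible, which matters when empirical error rates are compared across methods.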
We then assess the power of different methods by considering six alternative colocalization scenarios (Figures 1, 2, and S1), which go beyond the simple alternative when only one causal variant is shared by the GWAS and eQTL studies. Simulation details are provided in the supplemental material and methods, and their corresponding parameter values are provided in Table S7.
Simulation for multiple gene-by-tissue pairs
We next evaluate the performance of the methods when studying colocalization evidence across many gene-by-tissue pairs. To be consistent with the presence of a GWAS signal and 564 gene-by-tissue pairs in the CF application, we simulate a locus, based on the LD pattern at the MUC20/MUC4 locus, with one GWAS signal and 600 sets of independent eQTL summary statistics. We also consider 100, 200, 300, 400, or 500 colocalization tests simultaneously, varying the composition of the different alternative and null scenarios. Finally, we repeat the analysis for SLC6A14, a locus we studied previously8 with an LD pattern different from MUC20/MUC4. COLOC2 is conducted by using the algorithm implemented in GWAS-PW, where the posterior probability is calculated based on the likelihood of all gene-by-tissue pairs.
To evaluate the empirical FWER control, the simulated locus of interest has GWAS association evidence as in the CF application, while the eQTL summary statistics are simulated either under the null scenario of no eQTL (H03) or with an eQTL that does not colocalize with the GWAS signal (H04). When none (0%) of the 600 genes have eQTL evidence, all 600 tests are under H03. We also consider five different proportions of genes (20%, 40%, 60%, 80%, and 100%) with eQTL summary statistics but under H04. When these proportions are varied, the 600 tests are a mixture of H03 and H04, which is challenging in terms of type I error control as discussed earlier.
To evaluate power across different amounts of eQTL evidence while the locus has the same SNP-phenotype association evidence as in practice, we vary the colocalized eQTL evidence among the genes analyzed. In 5% of the 600 genes, we simulate the eQTL summary statistics that colocalize with the SNP-phenotype association evidence (i.e., simulated under the alternative). For the remaining 95% of genes, we simulate a proportion of the eQTL summary statistics to have no eQTL signal (under H03) and the remainder to have eQTL signals distinct from the SNP-phenotype association signal (under H04). We calculate power (or true positive rates for COLOC and COLOC2) by determining the proportion of 10^5 simulated replications where at least one gene from the alternative is correctly identified.
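The counting rule used for both the empirical FWER and the power can be sketched as follows: each replication contributes a vector of per-gene decisions, and the summary is the proportion of replications with at least one flagged gene. The null generator here is a deliberately simplified assumption (independent uniform p values with a Bonferroni threshold) and is not the LD-based simulation described above; the function names are hypothetical.

```python
import random


def at_least_one_rate(replications):
    """Proportion of replications in which at least one gene is flagged."""
    return sum(any(rep) for rep in replications) / len(replications)


def simulate_null_rejections(n_genes, n_reps, alpha, rng):
    """Toy null (assumption): independent uniform p values, Bonferroni per gene."""
    threshold = alpha / n_genes
    return [[rng.random() < threshold for _ in range(n_genes)] for _ in range(n_reps)]


rng = random.Random(7)
null_reps = simulate_null_rejections(n_genes=600, n_reps=2000, alpha=0.05, rng=rng)
empirical_fwer = at_least_one_rate(null_reps)
```

For power, the same function would be applied to the decisions for the genes simulated under the alternative only, so that a replication counts when at least one truly colocalized gene is identified.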
Simulation for overlapping samples
In the presence of overlapping samples, the summary statistics for a variant from the GWAS and eQTL studies are correlated, which can lead to increased false positives in theory.6 To evaluate the practical impact of sample overlap on SS2, we consider the scenario where half or all of the participants whose genotypes were used to compute the eQTL summary statistics are also included in the GWAS. Mimicking the scenario in the CF application, we simulate 100 participants in the eQTL study and to be conservative we include an additional 1,900 participants in the GWAS (i.e., the GWAS sample size is 2,000 of which 100 overlap with the eQTL study). We evaluate the empirical type I error rate of SS2 under the composite null hypothesis from H01 to H04, with four different phenotypic correlations, 0.3, 0.5, 0.7, and 0.9. For comparison, we also demonstrate the type I error rate when there are no samples overlapping using the same simulation procedure. In addition, considering a fixed level of phenotypic correlation (0.5 or 0.9), we also demonstrate the empirical type I error rate control of SS2 as we vary sample size for the eQTL study, from 100, 200, 300, 400, to 500. Lastly, we simulate the scenario where all GWAS samples are overlapping with eQTL samples (both with 2,000 individuals) and demonstrate the type I error rate of SS2 under the composite null hypothesis.
eQTL analysis
Informed consent
The Canadian Gene Modifier Study (CGMS) was approved by the Research Ethics Board of the Hospital for Sick Children (# 0020020214 from 2012 to 2019 and #1000065760 from 2019 to the present) and all participating sub-sites. Written informed consent was obtained from all participants or parents/guardians/substitute decision makers prior to inclusion in the study. The CGMS is approved by the Research Ethics Board of the Hospital for Sick Children for the usage of public and external data.
Sample source and collection
We conducted RNA sequencing of HNE cells (n = 94) collected as part of the CGMS and the CF Canada Sick Kids Program in Individual CF Therapy (CFIT).45 The HNE samples were collected using a 3-mm diameter sterile cytology brush (MP Corporation) or Rhino-probe curette on either inferior turbinate. Sequencing was performed in two rounds with Illumina HiSeq 2000 and HiSeq 2500 platforms (Illumina Inc.), respectively. The HiSeq 2000 round was sequenced with 25 million paired-end reads (49 base pairs in length) per sample, and the samples processed using the HiSeq 2500 platform had an average library size of 35 million paired-end reads (124 base pairs in length).
RNA-seq data processing and analysis
Quality of sequencing reads was assessed using FastQC (v.0.11.5; web resources) before and after trimming by Trim Galore (v.0.4.4; web resources). Processed reads were aligned to human reference genome hg38 with GENCODE comprehensive gene annotation (release 29) using STAR (v.2.5.4b).46 The reference genome included alternative haplotype contigs to account for the sequence diversity in the mucin gene locus. Expression quantification was performed by RNA-SeQC (v.2.0.0), which generated both read counts and normalized transcripts per million (TPM) measures.47 Normalized trimmed mean of M values (TMM) measures were obtained for a sub-sample of genes with ≥0.1 TPM and ≥6 read counts in more than 20% of the sample.48
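The gene-level expression filter applied before TMM normalization can be sketched as below. This is a minimal illustration with made-up values, not the pipeline code used in the study. It assumes that the TPM and read-count thresholds are checked separately, each needing to hold in more than 20% of samples; the function name and defaults are hypothetical.

```python
def passes_expression_filter(tpm, counts, min_tpm=0.1, min_count=6, min_fraction=0.2):
    """True if a gene clears both thresholds in more than min_fraction of samples.

    Assumption: the TPM and read-count criteria are evaluated separately,
    each over the full set of samples.
    """
    n = len(tpm)
    frac_tpm = sum(v >= min_tpm for v in tpm) / n
    frac_count = sum(c >= min_count for c in counts) / n
    return frac_tpm > min_fraction and frac_count > min_fraction


# Toy gene measured in 10 samples (made-up values, not study data).
expressed_tpm = [0.5, 0.2, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
expressed_counts = [12, 8, 0, 1, 9, 0, 0, 2, 0, 0]
keep_expressed = passes_expression_filter(expressed_tpm, expressed_counts)

# Only 2 of 10 samples clear the thresholds, which is not "more than 20%".
low_tpm = [0.5, 0.2] + [0.0] * 8
low_counts = [12, 8] + [0] * 8
keep_low = passes_expression_filter(low_tpm, low_counts)
```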
Expression quantitative trait loci were analyzed by conducting differential gene expression analysis of the effect of SNP genotypes on TMM-normalized expression level. eQTL analysis was carried out using FastQTL (v.2.0)49 with RNA-sequencing (RNA integrity number 7) of HNE from 94 Canadians with CF enrolled in the ICFGMC; these 94 individuals were also included in the CF lung GWAS. Additional covariates adjusted for in the model include the top 3 principal components, 15 probabilistic estimates of expression residuals (PEER) factors, study sites, sex, genotyping platform, RNA integrity number (RIN), and PTPRC/CD45 gene expression (immune cell content adjustment). R packages GENESIS (v.2.14.3) and peer (v.1.0) were used to generate genotype principal components and PEER factors, respectively.50, 51, 52
Analysis of colocalization at the MUC4/MUC20 locus
The CGMS participants were included in a genome-wide association study of CF lung disease by the International Cystic Fibrosis Gene Modifier Consortium, comprised of 6,365 individuals (including 1,443 sib-pairs) with CF.26 The lung disease severity was measured as a percentile from a CF reference population of forced expiratory volume in 1 s that is survival adjusted.53 Individuals on CFTR-modulator treatment were not included. The GWAS summary statistics are publicly available (see data and code availability) and we use them here in the implementation of the SS2. The summary statistics were constructed from a meta-analysis of 13 studies, including siblings and individuals who ranged in age from 6 to 63.3 years. A detailed description of the studies and participants included can be found in Corvol et al.26 Among other loci, 24 genome-wide significant SNPs were identified and annotated between two mucin genes: MUC4 and MUC20 on chromosome 3.
To determine whether gene expression variation at the chr3q29 locus could influence lung disease severity and which genes the associated SNPs impact, we first focused on colocalization analysis for MUC4 and MUC20 given their biologic plausibility. For completeness, we subsequently expanded our analysis to investigate eQTLs of 50 genes in a 1 Mb region on either side of the peak and in 14 CF-relevant tissues from GTEx V8. We focus on SNPs in a 0.1 Mb region on either side of the lead GWAS SNP for colocalization analysis of all 564 gene-by-tissue pairs for which there is gene expression data.
To conduct the SS2 stage 1 eQTL test using Equation 2, we computed the LD matrix for the eQTL analysis in HNE by calculating the Pearson correlation coefficient using the 94 independent CF samples, while for the eQTL studies in other CF-related tissues we use the GTEx resource. To apply the SS2 stage 2 colocalization test to the GWAS meta-summary statistics while accounting for sample relatedness, we estimated the LD matrix using Equation 4, where the covariance of summary statistics for the sub-studies was calculated from Cholesky-transformed genotype data. The sample size of the CF lung GWAS is 6,365 while the sample size of the eQTL study is 94. Among the 94 CF samples with eQTL data, there are 85 participants included in both the GWAS and eQTL studies, but otherwise the participants across the two studies were unrelated. To account for the overlapping samples in the analysis, we note that the number of overlapping samples divided by the square root of the product of the two sample sizes, 85/√(6,365 × 94), is approximately 0.11, and so even under the extreme case that the phenotypic correlation is 1, the correlation induced between the GWAS and eQTL summary statistics is approximately 0.11, which would have a negligible impact on our inference. We demonstrate this empirically through a comprehensive simulation study (Tables S1–S5) and apply the decorrelation approach16 in our CF SS2 implementation assuming that the lung function distribution in the 85 overlapping samples is representative of the full CF sample included in the GWAS.
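The 0.11 figure can be checked with a short calculation. The sketch assumes the commonly used approximation that the correlation between two studies' summary statistics induced by shared participants is the phenotypic correlation multiplied by the number of overlapping samples over the square root of the product of the two sample sizes. The function and argument names are hypothetical.

```python
import math


def overlap_correlation(n_overlap, n_gwas, n_eqtl, trait_corr=1.0):
    """Approximate correlation between two studies' z-scores due to shared samples.

    Assumption: trait_corr * n_overlap / sqrt(n_gwas * n_eqtl).
    """
    return trait_corr * n_overlap / math.sqrt(n_gwas * n_eqtl)


# CF application: 85 shared participants, GWAS n = 6,365, eQTL n = 94,
# with the phenotypic correlation set to its extreme value of 1.
worst_case = overlap_correlation(n_overlap=85, n_gwas=6365, n_eqtl=94)
print(round(worst_case, 2))  # 0.11
```

Because the GWAS is much larger than the eQTL study, the ratio stays small even though nearly all eQTL participants are also in the GWAS.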
Results
Simulation results
Simulation results for a single gene-by-tissue pair
Table 1 demonstrates the empirical type I error rates of SS2, SMR,11 and SMR-multi12 and the false positive rates of COLOC4 and COLOC214 for each of the four null scenarios of the composite null hypothesis, based on the LD pattern observed at the MUC20/MUC4 locus or the SLC6A14 locus. The SS2, SMR, and SMR-multi all control the type I error rate at or below the nominal 0.05 significance level under all four null scenarios. When there are no eQTLs, under H01 or H03, SMR and SMR-multi are extremely conservative (the empirical type I error rate is < 10−4). This is due to the recommended pre-screening step whereby colocalization analysis can only be conducted when there is an eQTL p value less than 5 × 10−8 at the locus under investigation. Other eQTL p value thresholds such as 0.01 or 0.05 could be adopted when conducting the SMR or SMR-multi; however, there is type I error inflation under H03 when one chooses to do so (Table S18).
Among all the methods, COLOC shows the most conservative results under the different null scenarios. The false positive rate for COLOC2 is controlled except for the H03 scenario. When there is SNP-phenotype association but no eQTLs, the empirical false positive rate is 0.09 for the nominal level of 0.05. The empirical false positive rate of COLOC2 decreases as one increases the sample size of the eQTL study (Table S17). For example, with 2,000 participants in the GWAS, the empirical false positive rate of COLOC2 is below 0.05 when the sample size for the eQTL study is larger than 1,000. However, in practice, the sample size of an eQTL study (e.g., GTEx)27 is usually much smaller than 1,000, and in our CF study we only have 94 participants included in the eQTL analysis in HNE. We observed qualitatively similar results with the different LD patterns between the MUC20 and SLC6A14 loci.
The effect of either half or all of the participants in the eQTL study (100 individuals) overlapping with the GWAS study (2,000 individuals) on the type I error rate of the SS2 is demonstrated in Tables S2–S5. The type I error rate of the SS2 remains controlled with increasing phenotypic correlation and increasing eQTL sample sizes, which demonstrates that the overlapping samples have minimal impact on the inference in practice.
We evaluate the power of SS2, SMR, and SMR-multi at the α = 0.05 significance level and the true positive rate of COLOC2 and COLOC using a posterior probability of colocalization cut-off of 0.8 (Figure 2). Figure 2A illustrates the power under the simplest alternative scenario 1, where one single SNP-phenotype association signal colocalizes with one single eQTL signal. In that case, COLOC2 has the highest true positive rate across different levels of eQTL evidence, but its false positive rate is inflated under H03 as demonstrated in Table 1. Among the methods that control the false positive rates across all null scenarios, namely SS2, SMR, SMR-multi, and COLOC, the proposed SS2 method is the most powerful.
When there are two GWAS peaks under the alternative scenarios 2 and 3 (Figures 2B and 2C, respectively) and the eQTL evidence is weak (below 5.21 in Figure 2B and below 5.21 in Figure 2C; i.e., less than 30% power to detect the eQTL evidence at 0.05), SS2 has less power than COLOC2; this represents a trade-off to achieve type I error control under H03. However, SS2 is more powerful than the other three methods that do not have inflated false positive rates. SMR and SMR-multi are very conservative in this case because there are few SNPs with eQTL p values less than 5 × 10−8. As the eQTL evidence gets stronger (above 5.21 in Figure 2B and above 5.21 in Figure 2C; eQTL power greater than 30%), the power of SS2 could reach a similar level or even exceed the true positive rate of COLOC2. The power of SMR and that of SMR-multi are similar when there is only one eQTL SNP in the region (Figures 2A–2C) and increase rapidly as the size of the eQTL signal increases. Under the alternative scenario 3, when the eQTL evidence is strong (above 6.25; eQTL power greater than 70%) and colocalizes with the second SNP-phenotype association peak (Figure 2C), SMR and SMR-multi are more powerful than SS2.
Figures 2D and 2E demonstrate the alternative scenarios 4 and 5, respectively, when there are two independent eQTL SNPs but only one eQTL SNP colocalizes with the SNP-phenotype association. In this case, the SS2 is more powerful than all of the other methods across all levels of eQTL evidence considered. The power advantage of SS2 over other methods is especially notable when, between the two eQTL SNPs, the one with weaker signal colocalizes with the SNP-phenotype associated variant. Finally, under the alternative scenario 6 when two independent GWAS SNPs colocalize with two independent eQTL SNPs, methods SS2, SMR, and SMR-multi show equally high power, while COLOC and COLOC2 have reduced power due to the allelic heterogeneity. Qualitatively similar results based on the LD pattern at the SLC6A14 locus are evident in Figure S1.
Simulation result for multiple gene-tissue pairs
Table 2 demonstrates the impact of multiple hypothesis testing on the family-wise error rate, where a locus with 600 genes and LD modeled after the MUC20/MUC4 and SLC6A14 loci was evaluated. Similar to the single hypothesis testing result in Table 1, SS2, SMR, and SMR-multi control the FWER with the SMR-based tests being the most conservative. However, COLOC now has inflated false positive rates (>0.08) for multiple scenarios, for example when 60% of genes at the locus have no eQTLs (H03) while the remaining 40% of genes have eQTLs but these are distinct from the SNP-phenotype association signal (H04). As the proportion of H04 genes decreases from 40% to 0% (or the proportion of H03 genes increases from 60% to 100%), the empirical false positive rate of COLOC increases from 0.08 to 0.13. Although COLOC2 shows an inflated false positive rate under H03 for our single hypothesis test investigation (Table 1), the false positive rate of COLOC2 is controlled after applying the algorithm implemented in GWAS-PW where the posterior probability is calculated based on the likelihood of all gene-by-tissue pairs (Table 2).
We also investigated type I error rate control of the five methods when the number of genes tested at the locus was varied from 100 to 500 (Tables S8–S12). Overall, SS2, SMR, and SMR-multi show conservative FWER (<0.05; Tables S8, S9, and S10, respectively). As the number of genes increases, the FWER of SS2 decreases when the LD is modeled after the SLC6A14 locus and moderately increases when the LD is modeled after the MUC20 locus. The increase in FWER tapers off as the number of genes tested at the locus increases. This is because, with more genes passing the first stage test, the stage 2 colocalization test requires a more stringent significance level, resulting in the overall two-stage test being conservative enough to control the FWER. The false positive rate of COLOC increases as the number of genes evaluated increases, with inflation observed when the number of genes tested exceeds 200 (Table S11). This is due to the subjective choice of priors and cut-off (i.e., 0.8) for the colocalization posterior probability without explicit adjustment for multiple genes. In contrast, COLOC2 is conservative and stays conservative as one increases the total number of genes evaluated at a locus from 100 to 500 (Table S12).
Given the inflation in false positives for COLOC (Table S11), we compare power only between COLOC2, SMR, SMR-multi, and SS2. Keeping the eQTL evidence constant and with 600 genes, SS2 demonstrates the greatest power among the four methods; SMR has more power than SMR-multi (Table 3). The power of the four methods for testing 100 to 500 genes is provided in Tables S13–S16. SS2 shows consistently higher power compared to SMR when the number of genes is greater than 200, and SS2 shows higher power than SMR-multi, COLOC, and COLOC2 across all gene numbers investigated. Interestingly, however, we observe different power trends with increasing numbers of genes analyzed at the locus. The power of SS2, SMR, and SMR-multi decreases as the number of genes increases due to the Bonferroni correction, while the true positive rate of COLOC2 increases as the number of genes increases.
Table 3.
Locus | eQTL height for the 5% of genes with colocalization | Power of SS2 | Power of SMR | Power of SMR-multi | True positive rate of COLOC2 |
---|---|---|---|---|---|
MUC20/MUC4 | 5.48–5.73 | 0.8181 | 0.6826 | 0.6059 | 0.6218 |
MUC20/MUC4 | 5.73–5.98 | 0.8178 | 0.6935 | 0.6119 | 0.6355 |
MUC20/MUC4 | 5.98–6.25 | 0.8171 | 0.7033 | 0.6169 | 0.6447 |
MUC20/MUC4 | 6.25–6.57 | 0.8161 | 0.7144 | 0.6216 | 0.6507 |
MUC20/MUC4 | 6.57–7.01 | 0.8143 | 0.7270 | 0.6254 | 0.6536 |
SLC6A14 | 5.48–5.73 | 0.6855 | 0.6323 | 0.5531 | 0.6605 |
SLC6A14 | 5.73–5.98 | 0.6992 | 0.6462 | 0.5614 | 0.6684 |
SLC6A14 | 5.98–6.25 | 0.7076 | 0.6588 | 0.5679 | 0.6741 |
SLC6A14 | 6.25–6.57 | 0.7138 | 0.6706 | 0.5745 | 0.6786 |
SLC6A14 | 6.57–7.01 | 0.7184 | 0.6831 | 0.5817 | 0.6821 |
The LD pattern at the simulated region follows that at the MUC20/MUC4 and SLC6A14 loci, respectively. The height of the GWAS peak is set at 5.06 on the −log10p scale such that 10% power is achieved to detect the GWAS association at significance level of 10−8. Each row corresponds to a different range of the eQTL height for the 5% of genes that have colocalization ([5.48, 5.73], [5.73, 5.98], [5.98, 6.25], [6.25, 6.57], and [6.57, 7.01]). The eQTL peaks are set with 5 different intervals such that 40%–50%, 50%–60%, 60%–70%, 70%–80%, 80%–90% power is achieved to detect the eQTL association at the significance level of 10−8. For the remaining 95% of genes, there is eQTL evidence with a mixture of null cases under H03 and H04; details are provided in the supplemental material and methods. SMR and Multi-SNP-based SMR test (SMR-multi) are conducted under the default setting such that a SNP is picked only if the eQTL p value is less than 5 × 10−8. In total, 10^5 replications are simulated to evaluate power at the 0.05 significance level and the true positive rates by applying the 0.8 threshold (as recommended by Dobbyn et al.14) for the colocalization posterior probability. The power (or true positive rate for COLOC and COLOC2) is calculated by counting the proportion of 10^5 replications where at least one gene is correctly identified with colocalization.
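The link between the GWAS peak height and the stated 10% power can be checked with a short calculation. The sketch assumes a two-sided z-test in which the peak statistic is normally distributed with unit variance around the z-score corresponding to the peak p value; the function names are hypothetical, and only the GWAS peak (stated on the −log10 p scale) is checked here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_neg_log10_p(height):
    """Two-sided z-score whose p value equals 10**(-height)."""
    return -STD_NORMAL.inv_cdf(10 ** (-height) / 2)


def power_at_level(expected_z, alpha):
    """Power of a two-sided z-test when the statistic is N(expected_z, 1)."""
    z_crit = -STD_NORMAL.inv_cdf(alpha / 2)
    upper = 1 - STD_NORMAL.cdf(z_crit - expected_z)
    lower = STD_NORMAL.cdf(-z_crit - expected_z)
    return upper + lower


gwas_peak_z = z_from_neg_log10_p(5.06)
gwas_power = power_at_level(gwas_peak_z, alpha=1e-8)
print(round(gwas_power, 2))  # 0.1
```

Under these assumptions a peak of 5.06 on the −log10 p scale corresponds to a z-score of about 4.45 and to roughly 10% power at the 10−8 level, in line with the caption.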
Colocalization analysis at the MUC4/MUC20 CF lung disease modifier locus
Figure 3 shows that there are multiple SNPs with similar GWAS p values as the lead SNP, suggesting the presence of strong LD in the locus. Given the same LD structure, the eQTL peak is much wider than the GWAS peak, suggesting allelic heterogeneity for the eQTL summary statistics. There is a clear GWAS signal around the lead GWAS SNP visually coinciding with one of the eQTL signals, suggesting that MUC4 expression may mediate CF lung disease. These characteristics suggest this locus is similar to scenario 4 in Figure 1.
We first assess the colocalization evidence for MUC20 and MUC4 using the CF lung GWAS meta-analysis summary statistics26 with eQTLs calculated from the HNE gene expression and genotype data of 94 individuals with CF (Table 4). For MUC20 in the HNE, the stage 1 test does not provide evidence of an eQTL at the 5% level (uncorrected p value = 0.083), suggesting that the eQTL evidence is not strong enough to move on to stage 2 colocalization analysis. For the MUC4 eQTLs, both stage 1 and stage 2 tests are significant with p value = 3.16 × 10−7 and 1.31 × 10−5, respectively, providing statistical evidence of colocalization consistent with the visualization (Figure 3). We applied the decorrelation approach demonstrated in LeBlanc et al.16 to account for the 85 individuals included in both the GWAS and eQTL studies. The resulting stage 1 and stage 2 p values were 2.62 × 10−7 and 1.19 × 10−5, respectively, assuming the worst case scenario that the phenotypic correlation is 1. As expected, these p values are very similar to the p values without sample overlap correction.
Table 4.
Gene and tissue | SS2 stage 1 p value | SS2 stage 1 adjusted p value | SS2 stage 2 p value | SS2 stage 2 adjusted p value | SMR p value | SMR adjusted p value | SMR-multi p value | SMR-multi adjusted p value | COLOC CLPP | COLOC2 CLPP1 (a) | COLOC2 CLPP2 |
---|---|---|---|---|---|---|---|---|---|---|---|
MUC20 and HNE | 0.083 | 1 | N/A | N/A (b) | N/A | N/A | N/A | N/A | 0.1057 | 0.0666 | 0.0005 |
MUC4 and HNE | 3.16 × 10−7 | 1.78 × 10−4 | 1.31 × 10−5 | 1.49 × 10−3 | 2.65 × 10−3 | 0.379 | 2.07 × 10−4 | 0.0296 | 0.7573 | 0.9482 (c) | 0.0007 |
Colocalization analyses are conducted for all genes within a 1 Mb region on either side of the peak lung GWAS-associated variant and 14 CF-related tissues. In total, there are 564 gene-by-tissue pairs. Raw p values and adjusted p values by 564 gene-by-tissue pairs are both demonstrated for the SS2, SMR, and SMR-multi. The eQTL evidence for conducting the SS2 is the eQTL p value based on the −log10(eQTL p) scale for a specified gene and tissue. SMR and Multi-SNP-based SMR test (SMR-multi) are conducted under the default setting such that a SNP is picked only if the eQTL p value is less than 5 × 10−8. N/As are listed for MUC20 since no SNP has eQTL p value less than 5 × 10−8. For COLOC and COLOC2, the colocalization posterior probability (CLPP) is calculated, and a high posterior probability (>0.8) suggests strong colocalization evidence. For COLOC2, we show both the CLPP calculated based on the likelihood from the single gene and tissue (CLPP1) and the CLPP calculated based on the likelihood from 564 gene-by-tissue pairs (CLPP2).
(a) This approach showed inflated false positive rates in Table 1. CLPP1 > 0.8 suggests colocalization.
(b) The stage 1 SS2 test p value (0.083) for MUC20 and HNE does not pass the significance threshold (0.05); therefore, the stage 2 SS2 test p value is not applicable (N/A).
(c) This approach showed inflated false positive rates in Table 1.
To be comprehensive, we apply the SS2 to all genes annotated by GENCODE v.26 for hg38 GTEx V8 to the 1 Mb region encompassing the peak CF lung-associated variant at the MUC4/MUC20 GWAS locus. The colocalization evidence for each gene is calculated for the set of SNPs within 0.1 Mb of the peak GWAS variant. Using the cross-tissue eQTLs from GTEx,27 we select the tissues that are relevant to CF and remove genes with low or no expression in a given tissue; this results in 564 gene-by-tissue pairs available for colocalization analysis. We apply the stage 1 set-based test on all the gene-by-tissue pairs, using a Bonferroni corrected significance level of 8.87 × 10−5. This results in 114 gene-by-tissue pairs providing evidence of significant eQTLs to move to the stage 2 test for colocalization. Stage 2 requires a significance level of 0.05/114 = 0.00044 for each gene-by-tissue pair to conclude colocalization; 39 colocalization tests exceed this threshold. We present the SS2 cross-tissue and gene colocalization results in heatmaps (Figures 4A and S2). For MUC20 and MUC4 we provide the stage 1 p value adjusted by 564 gene-by-tissue pairs and the stage 2 p value adjusted by 114 significant gene-by-tissue pairs in Table 4. MUC4 remains significant after correction for the multiple tests at this locus, with stage 1 adjusted p value of 1.78 × 10−4 and stage 2 adjusted p value of 1.49 × 10−3, respectively. Interestingly, no other gene shows significant evidence of colocalization in HNE with the SS2. MUC4 does show evidence of colocalization across several tissues, although these tissues are likely not relevant to lung disease. Several genes in the region also appear regulated by the GWAS-associated SNPs such as the pseudogene SDHAP1, but not in lung-relevant tissues (Figure S2).
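The two-stage multiplicity correction described in this paragraph can be sketched as follows: stage 1 is tested at the nominal level divided by the number of gene-by-tissue pairs, and stage 2 at the nominal level divided by the number of pairs that passed stage 1. This is a minimal sketch of the thresholding logic only, not the LocusFocus implementation; the function names are hypothetical and the strict inequality is an assumption.

```python
def two_stage_bonferroni(stage1_p, stage2_p, alpha=0.05):
    """Apply Bonferroni thresholds at each stage.

    stage1_p: dict mapping pair -> stage 1 (eQTL) p value for every pair tested.
    stage2_p: dict mapping pair -> stage 2 (colocalization) p value; only pairs
              that pass stage 1 are looked up.
    Returns (pairs passing stage 1, pairs declared colocalized).
    """
    level1 = alpha / len(stage1_p)
    passed = [pair for pair, p in stage1_p.items() if p < level1]
    if not passed:
        return passed, []
    level2 = alpha / len(passed)
    colocalized = [pair for pair in passed if stage2_p[pair] < level2]
    return passed, colocalized


def bonferroni_adjust(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)


# Thresholds and adjusted p values reported in the text.
stage1_level = 0.05 / 564                           # about 8.87e-05
stage2_level = 0.05 / 114                           # about 4.4e-04
muc4_stage1_adj = bonferroni_adjust(3.16e-7, 564)   # about 1.78e-04
muc4_stage2_adj = bonferroni_adjust(1.31e-5, 114)   # about 1.49e-03
muc20_stage1_adj = bonferroni_adjust(0.083, 564)    # capped at 1
```

Because the stage 2 threshold depends only on the number of pairs that pass stage 1, pairs without eQTL evidence do not add to the stage 2 multiple testing burden.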
For comparison, we also implement SMR, SMR-multi, COLOC, and COLOC2. Among the 564 gene-by-tissue pairs, SMR and SMR-multi are calculated on 143 gene-by-tissue pairs with top eQTL p value less than 5 × 10−8, followed by applying a Bonferroni corrected significance level of 0.05/143 = 0.00035. We provide both raw p values and multiple testing adjusted p values for SMR and SMR-multi analyses in Table 4. For MUC20, there are no SNPs with eQTL p value smaller than 5 × 10−8 and therefore the SMR and SMR-multi tests would not be applied. For MUC4, SMR provides a raw colocalization p value of 2.65 × 10−3, but the multiple testing adjusted p value = 0.379. SMR-multi demonstrates an association between gene-expression and the lung GWAS statistics with multiple testing adjusted colocalization p value of 0.0296. Overall, the SMR and SMR-multi tests identified, respectively, 16 and 34 significant genes and tissues with colocalization evidence (Figures 4B, 4C, and S2).
The colocalization posterior probabilities of COLOC2 for MUC20 and MUC4 are both small (0.000472 and 0.000712, respectively). In this case, the empirical estimation of the prior for colocalization is low, which drags down the colocalization evidence for this locus. At this locus, there are no gene-by-tissue pairs with COLOC2 posterior probability higher than 0.8. In contrast, if we apply COLOC2 with the GWAS-PW algorithm ignoring the multiple hypothesis testing and focusing only on the likelihood from the single gene MUC4, the colocalization posterior probability is high (0.9482). However, the colocalization posterior probability for MUC20 is low (0.0666; Table 4). A COLOC2 analysis based on the likelihood from each single gene-by-tissue pair at this locus provides 39 gene-by-tissue pairs with posterior probabilities higher than 0.8 (Figure S3).
The COLOC2 results at this locus are consistent with our simulation study, where COLOC2 has inflation of false positives when the GWAS-PW algorithm is implemented based on the likelihood from a single gene (Table 1). Yet, the test becomes over-conservative when priors are empirically estimated from the likelihood of multiple gene-by-tissue pairs (Table 2).
COLOC does not implement an approach to adjust for multiple hypothesis testing, but the empirical posterior probabilities of colocalization are <0.8 (Table 4) for both MUC20 (0.1057) and MUC4 (0.7573). The results of COLOC applied to all 564 gene-by-tissue pairs are shown in Figures 4D and S2.
Discussion
The majority of associated genetic variants identified through GWASs fall in non-coding regions of the genome, and thus the underlying mechanism by which the associated variants contribute to disease remains unclear but may point to gene regulation. The associations identified in the largest GWAS of CF lung disease to date26 are no exception, with none of the five genome-wide significant loci tagging protein-coding variation. One locus at chr3q29 is especially noteworthy as it encompasses MUC4 and MUC20, members of a gene family that encode membrane-spanning “tethered” mucins.54, 55, 56 These mucins prevent mucus penetration into the periciliary space and are present in the airway mucus, possibly contributing to mucociliary host defense. Mucus pathology is a defining characteristic of CF, with mucus hyperproduction and plugging, most notably in the CF airways.3 It has been presumed that mucus pathology is a downstream consequence of CFTR dysfunction,8 but GWAS identification at this locus suggests the possibility that polymorphisms impacting gene regulation of mucins may, themselves, impact the severity of CF lung disease.
The scenario of the chr3q29 locus in CF highlights the challenges we commonly face while conducting colocalization analysis, where both high LD and allelic heterogeneity are present at a locus with significant SNP-phenotype association as in scenario 4 in Figure 1. The purpose of colocalization analysis is to inform the causal mechanism and guide future functional investigations. Since GWAS identifies loci and many genes could be annotated to a GWAS locus and the tissue of action may not always be obvious, colocalization analysis informs the gene and cell type for future study.
Application of the SS2 to the CF lung disease associated locus at chr3q29 with eQTLs from CF HNE supports the conclusion that the associated lung disease variants colocalize with eQTLs for MUC4, prioritizing MUC4 at the locus for further functional investigation. However, it should be noted that MUC4 and MUC20 are localized to a highly polymorphic region26 with several tandem repeats including a 48 bp repeat region ranging from 7 to 19 kb. The GWAS array data suggest a high frequency of large copy number variants around the clustered mucin region, although these are highly variable across individuals. This complex genomic context requires further consideration when differentiating between the two mucin genes at the locus.
There are several published colocalization methods, but the chr3q29 locus in a region of high LD with evidence of allelic heterogeneity poses challenges for existing procedures.4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 Furthermore, the CF lung disease GWAS summary statistics are derived from a meta-analysis that includes sub-studies with related individuals; current methods cannot accommodate this scenario. We therefore developed the frequentist two-stage colocalization test, SS2. The SS2 integrates GWAS summary statistics with eQTL summary statistics across any number of gene-by-tissue pairs, is applicable when there are overlapping participants in the two studies and can be applied to GWAS summary statistics computed through meta-analysis, even with related individuals. Through simulation we demonstrate that the SS2 controls the type I error rate under the composite null hypotheses and is powerful in regions with high LD and allelic heterogeneity.
Bayesian colocalization approaches aim to identify a shared causal variant between two studies and to differentiate between distinct causal variants in LD. Similarly, Zhu et al. implemented the HEIDI test11 after the SMR test to further differentiate distinct causal variants in LD if the SMR test suggests significant association between two sets of summary statistics. Previous studies show that the statistical power of COLOC and HEIDI for differentiating distinct causal variants decreases as the LD between causal variants increases.11 For a fair comparison, we calculated the FWER of SMR and SMR-multi tests without taking into account the results from the HEIDI test since the SS2 test does not try to differentiate between distinct causal variants if they are in LD. In contrast, the SS2 aims to identify the association between two studies by leveraging the LD in the region and making inference based on the pattern similarity between summary statistics. Therefore, the SS2 can provide reliable inference even when the causal variant is not contained in the analysis set, as long as the LD pattern with the missing variant is retained.
The SS2 is a two-stage framework designed to achieve type I error control over the complex, composite null hypothesis. Although we implemented the gene/set-based test as the first-stage test of the SS2, in practice the method does not require use of one gene-based test over another. The alternative gene/set-based tests include versions with weighted sums of summary statistics, known as gene set analysis (GSA) tests or burden tests for rare variants.57, 58, 59 Summary statistics can also be decorrelated before being summed together, which is powerful under heterogeneity of effect sizes and variation between pairwise LD patterns.57 The stage 1 test implemented here has the same functional form as that used in VEGAS34 and fastBAT.35 This set-based test can be more powerful than the max-of-chi-square approach (implemented in GATES36 and Pascal-Max37) when there are multiple independent association signals, but can be less powerful when there is a single causal variant present at the locus.35 In contrast, SMR and SMR-multi use a p value threshold (i.e., 5 × 10−8) to screen regions for analysis, presumably to ensure they are not using a weak instrument (this is similar to the approach implemented when the SS colocalization statistic was first defined).8 This stringent screening step results in power loss compared to SS2 when the eQTL association is only moderate, which can be a function of several factors, including sample size (Figure 2).
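The contrast between a sum-of-chi-squares set-based statistic and a max-of-chi-square statistic can be illustrated with a small Monte Carlo sketch. This is not the SS2 stage 1 implementation or the VEGAS/fastBAT code; it only assumes that, under the null, the SNP z-scores follow a multivariate normal distribution with the LD correlation matrix as covariance. The function names, the toy LD matrix, and the z-score vectors are hypothetical.

```python
import math
import random

def cholesky(R):
    """Return lower-triangular L with L L^T = R (R must be positive definite)."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L

def set_based_pvalues(z, R, n_sim=20000, seed=1):
    """Monte Carlo p values for the sum and the max of squared z-scores,
    simulating null z-scores as N(0, R), where R is the LD correlation matrix."""
    rng = random.Random(seed)
    L = cholesky(R)
    m = len(z)
    obs_sum = sum(v * v for v in z)
    obs_max = max(v * v for v in z)
    hits_sum = hits_max = 0
    for _ in range(n_sim):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Correlated null z-scores: L times a vector of independent normals.
        sq = [sum(L[i][k] * e[k] for k in range(i + 1)) ** 2 for i in range(m)]
        hits_sum += sum(sq) >= obs_sum
        hits_max += max(sq) >= obs_max
    # Add-one correction keeps the empirical p values strictly positive.
    return (hits_sum + 1) / (n_sim + 1), (hits_max + 1) / (n_sim + 1)

# Hypothetical region: four SNPs in weak pairwise LD (r = 0.2).
R = [[1.0 if i == j else 0.2 for j in range(4)] for i in range(4)]
p_sum_multi, p_max_multi = set_based_pvalues([2.5, 2.5, 2.5, 2.5], R)    # several moderate signals
p_sum_single, p_max_single = set_based_pvalues([3.5, 0.1, 0.1, 0.1], R)  # one strong signal
```

In this toy setting the sum statistic gives the smaller p value when several SNPs carry moderate signal, and the max statistic gives the smaller p value when a single SNP dominates, which matches the qualitative trade-off described above.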
We modeled our simulation studies after the CF application and demonstrated that the SS2 has type I error rate control when 85 samples are included in both the GWAS and HNE eQTL studies. For other applications where there is a higher proportion of overlapping samples, the SS2 could have type I error inflation (Table S19) due to the correlation induced by the sample overlap. To address the effect of overlapping samples on statistical inference, several methods propose ways to estimate the correlation using summary statistics.6, 16, 40, 43 We implemented one such approach16 to decorrelate the summary statistics before applying the SS2 framework, although it was not necessary for our CF application as we demonstrated through simulation and application.
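The general idea of decorrelating two sets of summary statistics can be sketched as a simple whitening step. This is a generic illustration and not necessarily the exact procedure of the cited approach;16 it assumes that the z-scores at the SNPs used for estimation are null in both studies (mean 0, variance 1), so that the average product of paired z-scores estimates the overlap-induced correlation. All names and the simulated data are hypothetical.

```python
import math
import random

def estimate_overlap_correlation(z1, z2):
    """Estimate corr(z1, z2) as the mean product of paired z-scores,
    assuming the SNPs are null in both studies (mean 0, variance 1)."""
    return sum(a * b for a, b in zip(z1, z2)) / len(z1)

def decorrelate(z1, z2, rho):
    """Whiten the pair: keep z1 unchanged and replace z2 by its
    standardized residual after removing the part explained by z1."""
    scale = math.sqrt(1.0 - rho * rho)
    return list(z1), [(b - rho * a) / scale for a, b in zip(z1, z2)]

# Toy null data with an assumed overlap-induced correlation of 0.3.
rng = random.Random(7)
true_rho = 0.3
z_gwas, z_eqtl = [], []
for _ in range(20000):
    a, e = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    z_gwas.append(a)
    z_eqtl.append(true_rho * a + math.sqrt(1.0 - true_rho ** 2) * e)

rho_hat = estimate_overlap_correlation(z_gwas, z_eqtl)
_, z_eqtl_adj = decorrelate(z_gwas, z_eqtl, rho_hat)
residual_rho = estimate_overlap_correlation(z_gwas, z_eqtl_adj)
```

Transforming only the second study is one of several possible whitening choices; a symmetric transformation based on the inverse square root of the 2 × 2 correlation matrix would treat both studies alike.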
When the eQTL summary statistics are replaced by GWAS evidence from a second phenotype, the SS2 framework enables the study of genetic overlap of the two traits. Similarly, the SS2 could assess colocalization using any SNP-level data including DNA methylation (meQTLs), protein QTLs (pQTLs), or metabolites (metQTLs). The SS2 framework as delineated here does not integrate summary statistics from more than two studies, although there would be value in colocalizing GWAS summary data with multiple molecular phenotypes and multiple GWAS traits simultaneously, as proposed in Giambartolomei et al.60 This will be addressed in future work. SS2 is implemented in a web-based colocalization tool, LocusFocus,25 which enables integration of GWAS summary statistics with any secondary SNP-level dataset by using p values and LD for the region of interest. The eQTL summary statistics from GTEx are made available for selection within the web server to test colocalization with tissues and genes from GTEx. All code and sample datasets are publicly available via GitHub under the MIT license.
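Because only p values and LD are required as input, a common preprocessing step is to convert two-sided p values back to 1-df chi-square statistics. The sketch below shows that standard conversion; it is an illustration of the general idea and is not taken from the LocusFocus code base. The function name is hypothetical.

```python
from statistics import NormalDist

def p_to_chisq(p):
    """Convert a two-sided p value to a 1-df chi-square statistic (z squared)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Use the lower-tail quantile of p/2: computing 1 - p/2 would round to 1
    # for very small p values and lose the signal.
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z

# Statistics for a nominal and a genome-wide significant p value.
chisq_nominal = p_to_chisq(0.05)
chisq_gws = p_to_chisq(5e-8)
```

The sign of the effect is lost in this conversion, which is acceptable for statistics built from squared z-scores but not for methods that need effect directions.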
Data and Code Availability
Summary statistics from the GWAS are available at https://strug.research.sickkids.ca/GWAS_Summary_Public/gwas.public.txt.gz. The calculated covariance matrix for summary statistics from the GWAS is available at https://strug.research.sickkids.ca/GWAS_Summary_Public/ld_matrix_chr3_195560_195935Kbp.tar.gz. R scripts to enable the extensions here can be found at https://github.com/FanWang0216/SimpleSum2Colocalization and https://github.com/naim-panjwani/LocusFocus. Access to the RNA-sequencing data from the nasal epithelia is available through the CF Canada-SickKids Program for Individualized Therapy Biobank (https://lab.research.sickkids.ca/cfit/).
Acknowledgments
We thank the CF participants, care providers, and clinic coordinators at CF Centers throughout Canada for their contributions to the CF Patient Registry and Canadian CF Gene Modifier Study. We would like to thank the CF Canada-SickKids Program for Individualized Therapy (CFIT) for generating gene expression data. Funding was provided by the Cystic Fibrosis Foundation (STRUG17PO), the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03742, RGPIN-04934, RGPAS-522594), the Canadian Institutes of Health Research (FRN-167282, FRN-310732), Cystic Fibrosis Canada (2626), and the Government of Canada through Genome Canada (OGI-148), with support from a grant from the Government of Ontario. The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000424.v8.p2. The funders of the study played no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. F.W. is a trainee of the CANSSI-Ontario STAGE training program at the University of Toronto.
Declaration of interests
The authors declare no competing interests.
Published: January 21, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2021.12.012.
Contributor Information
Lei Sun, Email: sun@utstat.toronto.edu.
Lisa J. Strug, Email: lisa.strug@utoronto.ca.
Web resources
FastQC (ver. 0.11.5), https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
GTEx Portal (release v.8), https://www.gtexportal.org
LocusFocus, https://locusfocus.research.sickkids.ca
Online Mendelian Inheritance in Man, https://www.omim.org/
PredictDB Data Repository, https://predictdb.org
Trim Galore (ver. 0.4.4), https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
References
- 1. Cutting G.R. Modifier genes in Mendelian disorders: the example of cystic fibrosis. Ann. N Y Acad. Sci. 2010;1214:57–69. doi: 10.1111/j.1749-6632.2010.05879.x.
- 2. Vanscoy L.L., Blackman S.M., Collaco J.M., Bowers A., Lai T., Naughton K., Algire M., McWilliams R., Beck S., Hoover-Fong J., et al. Heritability of lung disease severity in cystic fibrosis. Am. J. Respir. Crit. Care Med. 2007;175:1036–1043. doi: 10.1164/rccm.200608-1164OC.
- 3. Kreda S.M., Davis C.W., Rose M.C. CFTR, mucins, and mucus obstruction in cystic fibrosis. Cold Spring Harb. Perspect. Med. 2012;2:a009589. doi: 10.1101/cshperspect.a009589.
- 4. Giambartolomei C., Vukcevic D., Schadt E.E., Franke L., Hingorani A.D., Wallace C., Plagnol V. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 2014;10:e1004383. doi: 10.1371/journal.pgen.1004383.
- 5. Hormozdiari F., van de Bunt M., Segrè A.V., Li X., Joo J.W.J., Bilow M., Sul J.H., Sankararaman S., Pasaniuc B., Eskin E. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am. J. Hum. Genet. 2016;99:1245–1260. doi: 10.1016/j.ajhg.2016.10.003.
- 6. Pickrell J.K., Berisa T., Liu J.Z., Ségurel L., Tung J.Y., Hinds D.A. Detection and interpretation of shared genetic influences on 42 human traits. Nat. Genet. 2016;48:709–717. doi: 10.1038/ng.3570.
- 7. Barbeira A.N., Dickinson S.P., Bonazzola R., Zheng J., Wheeler H.E., Torres J.M., Torstenson E.S., Shah K.P., Garcia T., Edwards T.L., et al.; GTEx Consortium. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 2018;9:1825. doi: 10.1038/s41467-018-03621-1.
- 8. Gong J., Wang F., Xiao B., Panjwani N., Lin F., Keenan K., Avolio J., Esmaeili M., Zhang L., He G., et al. Genetic association and transcriptome integration identify contributing genes and tissues at cystic fibrosis modifier loci. PLoS Genet. 2019;15:e1008007. doi: 10.1371/journal.pgen.1008007.
- 9. Wen X., Pique-Regi R., Luca F. Integrating molecular QTL data into genome-wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLoS Genet. 2017;13:e1006646. doi: 10.1371/journal.pgen.1006646.
- 10. Barbeira A.N., Pividori M., Zheng J., Wheeler H.E., Nicolae D.L., Im H.K. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet. 2019;15:e1007889. doi: 10.1371/journal.pgen.1007889.
- 11. Zhu Z., Zhang F., Hu H., Bakshi A., Robinson M.R., Powell J.E., Montgomery G.W., Goddard M.E., Wray N.R., Visscher P.M., Yang J. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016;48:481–487. doi: 10.1038/ng.3538.
- 12. Wu Y., Zeng J., Zhang F., Zhu Z., Qi T., Zheng Z., Lloyd-Jones L.R., Marioni R.E., Martin N.G., Montgomery G.W., et al. Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits. Nat. Commun. 2018;9:918. doi: 10.1038/s41467-018-03371-0.
- 13. Gusev A., Ko A., Shi H., Bhatia G., Chung W., Penninx B.W.J.H., Jansen R., de Geus E.J.C., Boomsma D.I., Wright F.A., et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 2016;48:245–252. doi: 10.1038/ng.3506.
- 14. Dobbyn A., Huckins L.M., Boocock J., Sloofman L.G., Glicksberg B.S., Giambartolomei C., Hoffman G.E., Perumal T.M., Girdhar K., Jiang Y., et al.; CommonMind Consortium. Landscape of Conditional eQTL in Dorsolateral Prefrontal Cortex and Co-localization with Schizophrenia GWAS. Am. J. Hum. Genet. 2018;102:1169–1184. doi: 10.1016/j.ajhg.2018.04.011.
- 15. Chun S., Casparino A., Patsopoulos N.A., Croteau-Chonka D.C., Raby B.A., De Jager P.L., Sunyaev S.R., Cotsapas C. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 2017;49:600–605. doi: 10.1038/ng.3795.
- 16. LeBlanc M., Zuber V., Thompson W.K., Andreassen O.A., Frigessi A., Andreassen B.K., Schizophrenia and Bipolar Disorder Working Groups of the Psychiatric Genomics Consortium. A correction for sample overlap in genome-wide association studies in a polygenic pleiotropy-informed framework. BMC Genomics. 2018;19:494. doi: 10.1186/s12864-018-4859-7.
- 17. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
- 18. Wacholder S., Chanock S., Garcia-Closas M., El Ghormli L., Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J. Natl. Cancer Inst. 2004;96:434–442. doi: 10.1093/jnci/djh075.
- 19. Wang G., Sarkar A., Carbonetto P., Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 2020;82:1273–1300. doi: 10.1111/rssb.12388.
- 20. Wallace C. A more accurate method for colocalisation analysis allowing for multiple causal variants. PLoS Genet. 2021;17:e1009440. doi: 10.1371/journal.pgen.1009440.
- 21. Gamazon E.R., Wheeler H.E., Shah K.P., Mozaffari S.V., Aquino-Michaels K., Carroll R.J., Eyler A.E., Denny J.C., Nicolae D.L., Cox N.J., Im H.K., GTEx Consortium. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367.
- 22. Fryett J.J., Morris A.P., Cordell H.J. Investigation of prediction accuracy and the impact of sample size, ancestry, and tissue in transcriptome-wide association studies. Genet. Epidemiol. 2020;44:425–441. doi: 10.1002/gepi.22290.
- 23. Wainberg M., Sinnott-Armstrong N., Mancuso N., Barbeira A.N., Knowles D.A., Golan D., Ermel R., Ruusalepp A., Quertermous T., Hao K., et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet. 2019;51:592–599. doi: 10.1038/s41588-019-0385-z.
- 24. Wang S., McCormick T.H., Leek J.T. Post-prediction inference. bioRxiv. 2020 doi: 10.1101/2020.01.21.914002.
- 25. Panjwani N., Wang F., Wang C., He G., Mastromatteo S., Bao A., Gong J., Rommens J.M., Sun L., Strug L.J. LocusFocus: A web-based colocalization tool for the annotation and functional follow-up of GWAS. bioRxiv. 2020 doi: 10.1101/2020.01.02.891291.
- 26. Corvol H., Blackman S.M., Boëlle P.-Y., Gallins P.J., Pace R.G., Stonebraker J.R., Accurso F.J., Clement A., Collaco J.M., Dang H., et al. Genome-wide association meta-analysis identifies five modifier loci of lung disease severity in cystic fibrosis. Nat. Commun. 2015;6:8382. doi: 10.1038/ncomms9382.
- 27. Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., Hasz R., Walters G., Garcia F., Young N., et al.; GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 2013;45:580–585. doi: 10.1038/ng.2653.
- 28. He X., Fuller C.K., Song Y., Meng Q., Zhang B., Yang X., Li H. Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am. J. Hum. Genet. 2013;92:667–680. doi: 10.1016/j.ajhg.2013.03.022.
- 29. Nica A.C., Montgomery S.B., Dimas A.S., Stranger B.E., Beazley C., Barroso I., Dermitzakis E.T. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet. 2010;6:e1000895. doi: 10.1371/journal.pgen.1000895.
- 30. Yanai I., Benjamin H., Shmoish M., Chalifa-Caspi V., Shklar M., Ophir R., Bar-Even A., Horn-Saban S., Safran M., Domany E., et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005;21:650–659. doi: 10.1093/bioinformatics/bti042.
- 31. Kryuchkova-Mostacci N., Robinson-Rechavi M. A benchmark of gene expression tissue-specificity metrics. Brief. Bioinform. 2017;18:205–214. doi: 10.1093/bib/bbw008.
- 32. Howie B., Fuchsberger C., Stephens M., Marchini J., Abecasis G.R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 2012;44:955–959. doi: 10.1038/ng.2354.
- 33. Sun L., Rommens J.M., Corvol H., Li W., Li X., Chiang T.A., Lin F., Dorfman R., Busson P.F., Parekh R.V., et al. Multiple apical plasma membrane constituents are associated with susceptibility to meconium ileus in individuals with cystic fibrosis. Nat. Genet. 2012;44:562–569. doi: 10.1038/ng.2221.
- 34. Liu J.Z., McRae A.F., Nyholt D.R., Medland S.E., Wray N.R., Brown K.M., Hayward N.K., Montgomery G.W., Visscher P.M., Martin N.G., Macgregor S., AMFS Investigators. A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet. 2010;87:139–145. doi: 10.1016/j.ajhg.2010.06.009.
- 35. Bakshi A., Zhu Z., Vinkhuyzen A.A.E., Hill W.D., McRae A.F., Visscher P.M., Yang J. Fast set-based association analysis using summary data from GWAS identifies novel gene loci for human complex traits. Sci. Rep. 2016;6:32894. doi: 10.1038/srep32894.
- 36. Li M.X., Gui H.S., Kwan J.S., Sham P.C. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am. J. Hum. Genet. 2011;88:283–293. doi: 10.1016/j.ajhg.2011.01.019.
- 37. Lamparter D., Marbach D., Rueedi R., Kutalik Z., Bergmann S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput. Biol. 2016;12:e1004714. doi: 10.1371/journal.pcbi.1004714.
- 38. Han B., Eskin E. Random-effects model aimed at discovering associations in meta-analysis of genome-wide association studies. Am. J. Hum. Genet. 2011;88:586–598. doi: 10.1016/j.ajhg.2011.04.014.
- 39. Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R., 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 40. Zhu X., Feng T., Tayo B.O., Liang J., Young J.H., Franceschini N., Smith J.A., Yanek L.R., Sun Y.V., Edwards T.L., et al.; COGENT BP Consortium. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am. J. Hum. Genet. 2015;96:21–36. doi: 10.1016/j.ajhg.2014.11.011.
- 41. Chen H., Conomos M.P., Chen M.H. 2019. Package ‘GMMAT’.
- 42. Park H., Li X., Song Y.E., He K.Y., Zhu X. Multivariate analysis of anthropometric traits using summary statistics of genome-wide association studies from GIANT Consortium. PLoS ONE. 2016;11:e0163912. doi: 10.1371/journal.pone.0163912.
- 43. Province M.A., Borecki I.B. A correlated meta-analysis strategy for data mining “OMIC” scans. Pac. Symp. Biocomput. 2013;2013:236–246.
- 44. Lin D.-Y., Sullivan P.F. Meta-analysis of genome-wide association studies with overlapping subjects. Am. J. Hum. Genet. 2009;85:862–872. doi: 10.1016/j.ajhg.2009.11.001.
- 45. Eckford P.D.W., McCormack J., Munsie L., He G., Stanojevic S., Pereira S.L., Ho K., Avolio J., Bartlett C., Yang J.Y., et al. The CF Canada-Sick Kids Program in individual CF therapy: A resource for the advancement of personalized medicine in CF. J. Cyst. Fibros. 2019;18:35–43. doi: 10.1016/j.jcf.2018.03.013.
- 46. Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
- 47. DeLuca D.S., Levin J.Z., Sivachenko A., Fennell T., Nazaire M.-D., Williams C., Reich M., Winckler W., Getz G. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012;28:1530–1532. doi: 10.1093/bioinformatics/bts196.
- 48. Robinson M.D., Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25. doi: 10.1186/gb-2010-11-3-r25.
- 49. Ongen H., Buil A., Brown A.A., Dermitzakis E.T., Delaneau O. Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics. 2016;32:1479–1485. doi: 10.1093/bioinformatics/btv722.
- 50. Gogarten S.M., Sofer T., Chen H., Yu C., Brody J.A., Thornton T.A., Rice K.M., Conomos M.P. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics. 2019;35:5346–5348. doi: 10.1093/bioinformatics/btz567.
- 51. Conomos M.P., Miller M.B., Thornton T.A. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet. Epidemiol. 2015;39:276–293. doi: 10.1002/gepi.21896.
- 52. Stegle O., Parts L., Piipari M., Winn J., Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 2012;7:500–507. doi: 10.1038/nprot.2011.457.
- 53. Taylor C., Commander C.W., Collaco J.M., Strug L.J., Li W., Wright F.A., Webel A.D., Pace R.G., Stonebraker J.R., Naughton K., et al. A novel lung disease phenotype adjusted for mortality attrition for cystic fibrosis genetic modifier studies. Pediatr. Pulmonol. 2011;46:857–869. doi: 10.1002/ppul.21456.
- 54. Kesimer M., Ehre C., Burns K.A., Davis C.W., Sheehan J.K., Pickles R.J. Molecular organization of the mucins and glycocalyx underlying mucus transport over mucosal surfaces of the airways. Mucosal Immunol. 2013;6:379–392. doi: 10.1038/mi.2012.81.
- 55. Ali M., Lillehoj E.P., Park Y., Kyo Y., Kim K.C. Analysis of the proteome of human airway epithelial secretions. Proteome Sci. 2011;9:4. doi: 10.1186/1477-5956-9-4.
- 56. Reid C.J., Gould S., Harris A. Developmental expression of mucin genes in the human respiratory tract. Am. J. Respir. Cell Mol. Biol. 1997;17:592–598. doi: 10.1165/ajrcmb.17.5.2798.
- 57. Vsevolozhskaya O.A., Shi M., Hu F., Zaykin D.V. DOT: Gene-set analysis by combining decorrelated association statistics. PLoS Comput. Biol. 2020;16:e1007819. doi: 10.1371/journal.pcbi.1007819.
- 58. Zhao Y., Sun L. On set-based association tests: Insights from a regression using summary statistics. Can. J. Stat. 2019;49:754–770.
- 59. Derkach A., Lawless J.F., Sun L. Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results. Stat. Sci. 2014;29:302–321. doi: 10.1214/13-STS456.
- 60. Giambartolomei C., Zhenli Liu J., Zhang W., Hauberg M., Shi H., Boocock J., Pickrell J., Jaffe A.E., Pasaniuc B., Roussos P., CommonMind Consortium. A Bayesian framework for multiple trait colocalization from summary association statistics. Bioinformatics. 2018;34:2538–2545. doi: 10.1093/bioinformatics/bty147.