Abstract
Genetic association studies often collect information on secondary phenotypes related to the primary disease status. In many situations, the secondary phenotypes are only measured in subjects with the disease condition. It would be advantageous to model the primary trait and the secondary phenotype together if they share a certain level of genetic heritability. We propose a family of multi-locus testing procedures to detect the composite association between a set of genetic markers and two traits (the primary trait and a secondary phenotype), in order to identify genes influencing both traits. The proposed test is derived from a random effect model with two variance components, each representing the genetic effect on one trait, and incorporates a model selection procedure for seeking the optimal model to represent the two sources of genetic effects. We conduct simulation studies to evaluate the performance of the proposed procedure and apply the method to a genome-wide association study of prostate cancer with the Gleason score as the secondary phenotype.
Keywords: Secondary phenotype, multi-locus test, variance component, genome-wide association study, multiple testing, prostate cancer
1. Introduction
Population-based genetic association studies have been widely used for uncovering the genetic basis underlying complex diseases. Although they are typically designed to study one primary trait, information on other secondary phenotypes is often collected and is potentially valuable for the study of the primary trait. For example, besides knowing the disease status of each subject in a genetic association study of breast cancer, we might also have additional information measured on breast cancer tumor tissues, which provides more details on pathologic and molecular characteristics of the disease. Those secondary phenotypes could be helpful in identifying the disease susceptibility loci if they share a certain level of genetic heritability with the primary trait.
Recently, Wu et al.1 proposed a single-marker testing framework to assess the association between a genetic marker and two traits simultaneously in situations where the secondary phenotype is quantitative and is only measured on subjects in a particular primary trait-dependent stratum. For example, the secondary phenotype might be available only on the subjects with the disease condition. The data can be collected prospectively or retrospectively with the primary trait being the disease status. Their method aims at detecting genetic markers associated with both traits while maintaining robust power even if the marker is associated with only one of the traits.
Although the single-marker test has been the most commonly used approach in detecting genetic susceptibility loci, increasing evidence has suggested that multiple correlated markers within a gene could jointly influence complex diseases.2 A multi-locus test that aggregates association evidence across multiple genetic markers in a considered gene or a genomic region may be more powerful than a single-marker test.3–6 Here, we focus on a setting similar to that considered by Wu et al.1 and derive a class of multi-locus tests for the association between a set of genetic markers within a considered gene and two traits. The proposed test extends the sequence kernel association test (SKAT)7 to a random effect model with two variance components, each representing a genetic effect on one trait, and incorporates a model selection procedure for seeking the optimal model to represent the two sources of genetic effects.
Many existing multi-locus tests require complete observations on the set of genetic markers. Under the current single-nucleotide polymorphism (SNP) genotyping technology, the genotype missing rate at a given SNP is very low. However, the proportion of subjects with at least one missing genotype within a considered genomic region could still be high if the number of SNPs in the region is relatively large. Excluding those subjects with at least one missing genotype can reduce the sample size substantially and thus diminish the power of the multi-locus test. Using statistical imputation algorithms8–10 to impute the missing genotypes is a commonly used strategy to retain the sample, but it has several limitations. For example, it requires knowledge of the haplotype distribution in the study population. Furthermore, it is known that the imputed genotype on a SNP with a relatively low minor allele frequency (MAF) is not very accurate.11 To make the proposed test more flexible in practice, we generalize the test so that it can handle missing genotypes without resorting to imputation or removing samples.
2. Model
For a study with n samples, let {Di, Zi, Xi, Gi} be the observed data on the ith sample, with Di being the primary dichotomous trait (e.g., disease status), Zi being the secondary quantitative trait, Xi being the set of covariates to be adjusted, and Gi being the vector of genotypes on m genetic markers in a given gene or region. We assume that the genotype is coded as 0, 1, and 2, representing the number of minor alleles at a given marker. Other coding schemes can be dealt with similarly. We will describe our method for data sampled from a prospective cohort study, where the secondary phenotype Zi is only available on subjects with Di = 1. We then extend the application of our method to a retrospective case–control study, wherein Zi is collected in cases. We refer to our method as MAPS, i.e., a Multi-locus Association test for a dichotomous Primary trait and a quantitative Secondary phenotype.
2.1. A random effect model for a prospective cohort study
In a prospective cohort study, we assume that the dichotomous trait Di can be modeled by the following logistic regression model given the covariates and genotypes of multiple markers within a gene
$$\operatorname{logit}\,\Pr(D_i = 1 \mid X_i, G_i) = X_i^{\top}\alpha + G_i^{\top}\beta, \qquad (1)$$
with the intercept term absorbed in covariates by adding a column of 1s in Xi. We also assume that Zi on a subject with Di = 1 follows the normal distribution . The likelihood of observing the data {Di, Zi, Xi, Gi: i = 1,...,n} can be represented as .
| (2) |
where φ(·) is the density function of the standard normal distribution. Let
and
therefore, the log-likelihood . Denote , , , and . It can be shown through a second-order Taylor expansion that the likelihood can be approximated around (β,θ) = (0,0)as
where 0 is a zero vector of length m, and
To derive a variance component test for the null hypothesis H0 : β = θ = 0, we further assume that the genetic effects (β,θ) are random effects with Eβ = Eθ = 0, and
$$\operatorname{Var}\begin{pmatrix} \beta \\ \theta \end{pmatrix} = \tau \begin{pmatrix} \kappa I & \rho\sqrt{\kappa(1-\kappa)}\, I \\ \rho\sqrt{\kappa(1-\kappa)}\, I & (1-\kappa)\, I \end{pmatrix},$$
where I is the m × m identity matrix and the scalars τ ≥ 0, ρ ∈ [−1, 1], κ ∈ [0, 1]. Here, the variance–covariance matrix is configured by three parameters under the following assumptions. First, genetic effects β on the primary trait are independent and identically distributed (i.i.d.) random variables with variance τκ. Second, genetic effects θ on the secondary phenotype are i.i.d. random variables with variance τ(1 − κ). Third, the genetic effects from one marker on the two traits are correlated with correlation coefficient ρ. Fourth, the genetic effects from two markers on either the same or different traits are uncorrelated. One can see that testing the joint genetic effects on the two traits is equivalent to testing H0 : τ = 0. Similar to Lin,12 we can obtain the profile log-likelihood in terms of the variance component parameters (τ,ρ,κ) by integrating out β and θ
| (3) |
Let be the maximum likelihood estimates of under the null. Define
| (4) |
where (or ) is the score SD (or SZ) evaluated at . The asymptotic null distributions of and are multivariate normal distributions, with means 0 and estimated variance–covariance matrices and , respectively. The score for τ at τ = 0 is
| (5) |
Because the second term in equation (5) converges in probability to a constant for any given (ρ,κ), we can conduct a family of variance component tests based on Qρ,κ only.
For any given (ρ,κ), denote by pρ,κ the p-value of Qρ,κ evaluated at its observed value.
Since (ρ,κ) are unknown, we propose to define the statistic for testing H0 : τ = 0 as
$$T = \min_{\rho,\kappa} p_{\rho,\kappa},$$
which measures the strongest evidence of association with (ρ,κ) tuned over proper regions. The final p-value adjusted for multiple comparisons is computed from the null distribution of T.
In the following sections, we will introduce different versions of variance component tests based on T with possible choices of the tuning parameters ρ and κ. Numerical algorithms for computing the final p-value are also discussed.
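The role of (ρ,κ) in the test statistic can be illustrated with a short sketch. It assumes that Qρ,κ is the quadratic form of the two score vectors under the covariance structure described above with τ factored out, i.e. κ·SD'SD + 2ρ√(κ(1−κ))·SD'SZ + (1−κ)·SZ'SZ. That explicit form, and the function name `q_statistic`, are assumptions made for illustration rather than expressions taken from the text.

```python
import math


def q_statistic(s_d, s_z, rho, kappa):
    """Quadratic form of the two score vectors for a given (rho, kappa).

    s_d, s_z: per-marker scores for the primary and secondary trait
    (hypothetical inputs, assumed already evaluated under the null).
    """
    if len(s_d) != len(s_z):
        raise ValueError("score vectors must have the same length")
    if not (-1.0 <= rho <= 1.0 and 0.0 <= kappa <= 1.0):
        raise ValueError("rho must lie in [-1, 1] and kappa in [0, 1]")
    dd = sum(a * a for a in s_d)               # S_D' S_D
    zz = sum(b * b for b in s_z)               # S_Z' S_Z
    dz = sum(a * b for a, b in zip(s_d, s_z))  # S_D' S_Z
    cross = rho * math.sqrt(kappa * (1.0 - kappa))
    return kappa * dd + 2.0 * cross * dz + (1.0 - kappa) * zz


# Toy scores for two markers.
s_d, s_z = [1.0, 2.0], [3.0, -1.0]
q_equal = q_statistic(s_d, s_z, rho=0.0, kappa=0.5)
q_primary_only = q_statistic(s_d, s_z, rho=0.0, kappa=1.0)
```

Setting κ = 1 or κ = 0 reduces the statistic to the sum of squared scores for a single trait, which is why a selected κ near the boundary points to a one-trait association.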
2.2. The variance component test with ρ = 0 and κ = 1/2
One simple choice of the tuning parameters in T is to set ρ = 0 and κ = 1/2, which essentially assumes that the genetic effects are distributed equally between the two traits for each marker. The statistical significance can be evaluated from the distribution of Q0,1/2, which follows a mixture of chi-square distributions under the null. Several existing algorithms are available for computing the distribution function of Q0,1/2, so the p-value can be calculated accurately.13–15 This test is referred to as MAPS0,1/2.
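To make the mixture-of-chi-squares null distribution concrete, the following sketch estimates a tail probability by plain Monte Carlo. This is a stand-in technique chosen for brevity, not one of the cited algorithms,13–15 which are what should be used for accurate small p-values. The weights would be the eigenvalues associated with the quadratic form; here they are hypothetical inputs, as is the function name.

```python
import random


def mixture_chisq_pvalue(q_obs, weights, n_sim=20000, seed=0):
    """Monte Carlo estimate of P(sum_k w_k * chi2_1 >= q_obs).

    weights: non-negative mixture weights (assumed given).
    Returns an estimate with the usual +1 correction so it is never zero.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        # Each chi-square(1) variable is the square of a standard normal draw.
        q = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if q >= q_obs:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


# With a single unit weight the mixture is a chi-square(1) variable,
# so the estimate at 3.841 should be close to 0.05.
p_single = mixture_chisq_pvalue(3.841, [1.0])
```

Monte Carlo resolution is limited to roughly 1/n_sim, which is why exact or saddlepoint methods are preferred at genome-wide significance levels.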
2.3. The variance component test with ρ = 0
A more flexible approach is to fix ρ = 0 while allowing κ to vary in [0, 1]. The test statistic becomes T = minκ p0,κ. In practice, we can choose κ on the grid {k/20 : k = 0, …, 20}. For any given κ
$$Q_{0,\kappa} = \kappa\, S_D^{\top} S_D + (1-\kappa)\, S_Z^{\top} S_Z,$$
with the scores evaluated under the null. Because SD is asymptotically independent of SZ, the final p-value of T can be computed explicitly by a one-dimensional numerical integration algorithm. The details are given in Appendix 1. This test is referred to as MAPS0.
2.4. The variance component test with variable ρ and κ
In real applications, we usually do not have any prior knowledge of the values of ρ and κ. A robust approach is to maximize the association evidence over (ρ,κ); we define the statistic as T = minρ,κ pρ,κ. In practice, we can choose (ρ,κ) on the grid {j/10 : j = −10, …, 10} × {k/20 : k = 0, …, 20}. To assess the significance of T, we can generate the scores SD and SZ under the null via the direct simulation approach.16 The final p-value of T is then estimated through the computationally efficient minP algorithm.17 We refer to this optimal test as MAPSopt. As a special case, the MAPSopt test with κ fixed at 1/2 and ρ tuned in [−1, 1] is referred to as MAPScor. The p-value of MAPScor can be computed in the same way as that of MAPSopt.
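The grid search and the min-p adjustment can be sketched as follows. This is a simplified illustration under several assumptions: the null score draws are passed in as ready-made inputs (in the method they would come from the direct simulation with the estimated variance–covariance matrices), the per-grid-point p-values are estimated empirically from those same draws, and Q takes the quadratic form implied by the covariance structure of Section 2.1. The names `q_stat` and `min_p_value` are hypothetical.

```python
import bisect
import math
import random


def q_stat(s_d, s_z, rho, kappa):
    """Assumed quadratic form of the score vectors for a given (rho, kappa)."""
    dd = sum(a * a for a in s_d)
    zz = sum(b * b for b in s_z)
    dz = sum(a * b for a, b in zip(s_d, s_z))
    return kappa * dd + 2.0 * rho * math.sqrt(kappa * (1.0 - kappa)) * dz + (1.0 - kappa) * zz


def min_p_value(observed, null_draws, grid):
    """Adjusted p-value of T = min over the grid of per-point p-values."""
    n_null = len(null_draws)
    t_obs = 1.0
    t_null = [1.0] * n_null
    for rho, kappa in grid:
        q_null = [q_stat(s_d, s_z, rho, kappa) for s_d, s_z in null_draws]
        q_sorted = sorted(q_null)

        def tail(q):
            # Fraction of null statistics at this grid point that are >= q.
            return (n_null - bisect.bisect_left(q_sorted, q)) / n_null

        t_obs = min(t_obs, tail(q_stat(observed[0], observed[1], rho, kappa)))
        for b, q in enumerate(q_null):
            t_null[b] = min(t_null[b], tail(q))
    # Compare the observed minimum p-value with its null distribution.
    return (1 + sum(t <= t_obs for t in t_null)) / (n_null + 1)


# Grid from the text: rho in {j/10}, kappa in {k/20}.
GRID = [(j / 10, k / 20) for j in range(-10, 11) for k in range(21)]

# Toy null draws: independent standard normal scores for m = 3 markers.
rng = random.Random(1)
m = 3
null_draws = [([rng.gauss(0, 1) for _ in range(m)],
               [rng.gauss(0, 1) for _ in range(m)]) for _ in range(200)]

p_strong = min_p_value(([10.0] * m, [10.0] * m), null_draws, GRID)
p_none = min_p_value(([0.0] * m, [0.0] * m), null_draws, GRID)
```

Reusing one set of null draws for both the per-point p-values and the distribution of their minimum is what keeps the adjustment inexpensive: no nested resampling is needed.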
2.5. Existing approaches
There are several alternative approaches that are applicable to the setting considered in this paper. The SKAT has been successfully applied in identifying genetic regions associated with complex diseases.7 In the following discussion, the SKAT tests applied to the dichotomous and the quantitative trait are referred to as SKATD and SKATZ, respectively. In addition, the standard likelihood ratio test (LRT), which compares the additive model consisting of all the genetic markers with the null model, can be applied to each trait separately, leading to two tests, LRTD and LRTZ, respectively. These two tests may lose power because of their large degrees of freedom. Finally, we generalize the single-marker test in Wu et al.1 to a multi-locus score test. This generalized score test follows a chi-square distribution under the null.
2.6. A random effect model for a retrospective case–control study
In a case–control study, we assume the quantitative trait Z is only observed in cases. Then the likelihood of observed data {Di, Zi, Xi, Gi : i = 1, …, n} can be written as
| (6) |
According to Qin and Zhang,18 the joint distribution of Xi and Gi satisfies
if the risk model is assumed as the logistic regression model in equation (1). Ignoring a constant, the profile likelihood of equation (6) is equivalent to the likelihood equation (2) in a cohort study.18 Therefore, all the tests discussed in previous subsections can be applied to case–control studies.
2.7. Missing data in multi-locus test
In the above, we have described the method assuming no missing genotypes at any considered genetic markers. In real applications, we might have a substantial proportion of individuals who have at least one missing genotype in the considered region, especially when the region consists of a large number of markers. Removing those subjects can result in a substantial loss of power. To make full use of observed genotypes, we propose to use the following modified score statistics defined on observed genotypes.
Without loss of generality, we consider the generalized linear model and assume that the covariates X are observed in the full dataset S with sample size n. Other nuisance parameters (e.g., variance parameter σ2 for quantitative trait Z), if any, are denoted as ψ. The nj individuals without missing genotypes on the jth marker are indexed as Sj, j = 1, 2, …, m, where m is the number of markers within the considered region or gene. Denote the log-likelihood as and let ,, , , and . A superscript j on these defined terms means only individuals in Sj are used. For example, is the score of βj defined on Sj. In contrast, the score of α can be defined on either S (i.e., ℓα) or Sj (i.e., ). Similarly, superscript jk means individuals in Sj ⋂ Sk are used. Let be the maximum likelihood estimates of (α,ψ) using Sj under the null. Statistics denoted with an accent are assessed at (e.g., ). We show in Appendix 1 that, under the assumption that genotypes are missing completely at random, the modified score asymptotically follows a multivariate normal distribution with mean 0. The covariance between and can be consistently estimated by , where njk is the sample size of Sj ⋂ Sk. Replacing and in equation (4) by the modified score allows the proposed method to handle data with missing genotypes, where is used as the modified variance–covariance matrix in the direct simulation for the evaluation of the p-value.
In this procedure, the score uses all the observed genotypes at the jth SNP. The covariance between the score and is estimated using information on njk subjects, which is very close to the original total sample size if the proportion of missing genotypes at each marker is low. Thus, this strategy is much more efficient than the one that requires the removal of subjects who have at least one missing genotype on the set of considered markers.
This procedure is very general for handling missing genotypes in various multi-locus tests, as long as the test is based on score statistics derived from the generalized linear model. We therefore integrated this procedure into SKAT, LRT, and Wu’s method, so that they can be applied to the real data application described below.
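The efficiency argument rests on how many subjects each quantity uses: nj for the score at marker j, njk for the covariance between markers j and k, versus the complete-case count. The sketch below only does this bookkeeping on a toy genotype matrix; it does not compute the modified scores themselves. Missing genotypes are encoded as `None`, and the function name is hypothetical.

```python
def pairwise_sample_sizes(genotypes):
    """Count subjects available per marker, per marker pair, and complete cases.

    genotypes: list of per-subject lists of 0/1/2, with None for missing.
    Returns (n_j, n_jk, n_complete).
    """
    m = len(genotypes[0])
    # S_j: indices of subjects with an observed genotype at marker j.
    observed = [{i for i, row in enumerate(genotypes) if row[j] is not None}
                for j in range(m)]
    n_j = [len(s) for s in observed]
    # n_jk: size of the intersection of S_j and S_k.
    n_jk = [[len(observed[j] & observed[k]) for k in range(m)] for j in range(m)]
    n_complete = len(set.intersection(*observed))
    return n_j, n_jk, n_complete


toy = [
    [0, 1, 2],
    [None, 1, 0],
    [2, None, 1],
    [1, 1, None],
    [0, 2, 2],
]
n_j, n_jk, n_complete = pairwise_sample_sizes(toy)
```

In this toy example each marker retains four of five subjects and each pair retains three, while a complete-case analysis would keep only two.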
3. Simulation studies
We evaluated the performance of the proposed variance component tests through simulation studies with genes generated under various linkage disequilibrium (LD) structures. Similar to Wang and Elston,19 we considered a gene consisting of 20 SNPs and a study with 500 cases and 500 controls. To generate genotypes on the 20 SNPs, we first simulated continuous random variables R = (R1,...,R20) from a multivariate normal distribution with mean zero and a variance–covariance matrix , where . By properly choosing cut-points, we then discretized Ri into a three-level genotype with levels 0, 1, and 2, so that the corresponding SNP had a MAF of 0.4. The LD within this gene was controlled by the parameter r, which was chosen as either 0 or 0.6 in our simulation. We used this algorithm to generate genotypes in controls. We assumed the risk model for the primary trait (case–control status) has the following form
with the 10th and 11th SNPs conferring the risk of disease. In order to simplify the simulation, we assumed the equality of the two odds ratios. Under the given risk model, we used the weighted sampling procedure18 to generate genotypes in cases from the following distribution
Within the stratum of D = 1, the secondary quantitative trait was simulated from
with the same risk SNPs as in the risk model of the dichotomous trait.
We also investigated the robustness of our method by additional simulations, in which the risk SNPs of the primary trait and the secondary trait were different. More specifically, logit , and .
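The discretization step in the genotype simulation can be illustrated with a short sketch. It assumes that each Ri is marginally standard normal and that genotype frequencies follow Hardy–Weinberg proportions, so that a MAF of 0.4 gives probabilities 0.36, 0.48, and 0.16 for genotypes 0, 1, and 2. Both assumptions, and the function names, are illustrative; the text only says the cut-points are chosen "properly".

```python
from statistics import NormalDist


def genotype_cutpoints(maf):
    """Cut-points on a standard normal scale giving HWE genotype frequencies."""
    std_normal = NormalDist()
    p0 = (1.0 - maf) ** 2   # P(genotype = 0)
    p2 = maf ** 2           # P(genotype = 2)
    return std_normal.inv_cdf(p0), std_normal.inv_cdf(1.0 - p2)


def discretize(r, cuts):
    """Map a continuous value to a genotype coded as 0, 1, or 2."""
    low, high = cuts
    if r <= low:
        return 0
    return 1 if r <= high else 2


cuts = genotype_cutpoints(0.4)
genotypes = [discretize(r, cuts) for r in (-1.0, 0.0, 2.0)]
```

Correlation among the Ri then carries over to LD among the resulting SNPs, which is how the parameter r controls the LD structure.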
3.1. Type I error
We evaluated the type I errors of all tests proposed in this paper, as well as the existing approaches discussed in Section 2.5. Only the results of the scenario with r = 0.6 are presented here. A total of 100,000 datasets were generated under the null by setting β = θ = 0, each with 500 cases and 500 controls. The p-values of MAPSopt and MAPScor were first estimated with 100,000 resampling steps. Then, for those datasets with initial p-value estimates below 5 × 10⁻⁴, more accurate estimates were obtained with 1,000,000 resampling steps. Table 1 shows that all tests can properly control the type I errors at nominal levels of 0.01, 0.001, and 0.0001.
Table 1.
Type I error based on 100,000 datasets generated from the null H0 : β = θ =0.
| Level | MAPSopt | MAPScor | MAPS0 | MAPS0,1/2 | Wu | SKATD | SKATZ | LRTD | LRTZ |
|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.00960 | 0.00933 | 0.00902 | 0.01039 | 0.00944 | 0.00999 | 0.00967 | 0.01169 | 0.01231 |
| 0.001 | 0.00099 | 0.00094 | 0.00093 | 0.00099 | 0.00083 | 0.00093 | 0.00098 | 0.00126 | 0.00134 |
| 0.0001 | 0.00010 | 0.00008 | 0.00012 | 0.00005 | 0.00009 | 0.00010 | 0.00008 | 0.00010 | 0.00024 |
3.2. Empirical power
To compare the empirical powers of various tests under different LD structures configured by r, we chose β and θ so that the empirical powers of LRTD and LRTZ were close to specified powers (pD, pZ) with type I error controlled at the level of 0.01. We set (pD, pZ) as (0.2, 0.4), (0.3, 0.3), (0.4, 0.2), (0.6, 0.01), or (0.01, 0.6) to represent different scenarios. For example, when (pD, pZ) = (0.2, 0.4), (0.3, 0.3), or (0.4, 0.2), the gene has moderate effects on both traits. When (pD, pZ) = (0.6, 0.01) or (0.01, 0.6), the gene influences only one trait. We used 10,000 resampling steps to evaluate p-values for MAPSopt and MAPScor.
In Table 2, we summarize the empirical powers, each of which is based on 1000 simulated datasets. We can see from the table that, when the same SNPs in a causal gene influence both traits, MAPSopt and MAPScor have the best performance among all considered tests. When the gene is associated with only one trait, MAPS0 and MAPSopt are the most robust of the variance component tests, with MAPS0 being slightly more powerful than MAPSopt because of its smaller model selection penalty. MAPScor is very sensitive to the underlying risk model. For example, its power is less than 1/3 of that of MAPSopt when the gene is associated only with the primary trait and the SNPs are in linkage equilibrium (0.145 versus 0.478 at r = 0). When the two traits are influenced by different causal SNPs, the method extended from Wu et al.1 is more powerful than the other methods if all SNPs are in linkage equilibrium (r = 0). When SNPs are moderately correlated (r = 0.6), MAPSopt is the most robust test in discovering composite gene association.
Table 2.
Empirical power comparison when causal SNPs are observed.
| (r,β,θ) | MAPSopt | MAPScor | MAPS0 | MAPS0,1/2 | Wu | SKATD | SKATZ | LRTD | LRTZ |
|---|---|---|---|---|---|---|---|---|---|
| Two risk models share the same causal SNPs | |||||||||
| (0.0, 0.21, 0.18) | 0.686 | 0.702 | 0.531 | 0.584 | 0.561 | 0.232 | 0.421 | 0.209 | 0.404 |
| (0.6, 0.17, 0.15) | 0.853 | 0.897 | 0.767 | 0.809 | 0.542 | 0.404 | 0.646 | 0.193 | 0.397 |
| (0.0, 0.23, 0.17) | 0.672 | 0.688 | 0.524 | 0.527 | 0.559 | 0.318 | 0.320 | 0.304 | 0.298 |
| (0.6, 0.19, 0.14) | 0.857 | 0.889 | 0.770 | 0.789 | 0.582 | 0.523 | 0.552 | 0.302 | 0.315 |
| (0.0, 0.26, 0.15) | 0.656 | 0.626 | 0.515 | 0.447 | 0.537 | 0.433 | 0.200 | 0.409 | 0.194 |
| (0.6, 0.21, 0.12) | 0.836 | 0.853 | 0.776 | 0.747 | 0.584 | 0.651 | 0.407 | 0.409 | 0.215 |
| Gene associates with one trait | |||||||||
| (0.0, 0.30, 0.00) | 0.478 | 0.145 | 0.512 | 0.151 | 0.392 | 0.619 | 0.010 | 0.593 | 0.008 |
| (0.6, 0.24, 0.00) | 0.726 | 0.350 | 0.746 | 0.376 | 0.418 | 0.820 | 0.007 | 0.607 | 0.011 |
| (0.0, 0.00, 0.22) | 0.458 | 0.475 | 0.496 | 0.539 | 0.391 | 0.004 | 0.601 | 0.004 | 0.600 |
| (0.6, 0.00, 0.18) | 0.726 | 0.760 | 0.752 | 0.794 | 0.391 | 0.001 | 0.821 | 0.007 | 0.599 |
| Two risk models contain different causal SNPs | |||||||||
| (0.0, 0.21, 0.18) | 0.492 | 0.522 | 0.533 | 0.576 | 0.580 | 0.246 | 0.409 | 0.227 | 0.395 |
| (0.6, 0.17, 0.15) | 0.774 | 0.804 | 0.758 | 0.803 | 0.555 | 0.384 | 0.651 | 0.187 | 0.404 |
| (0.0, 0.23, 0.17) | 0.479 | 0.464 | 0.517 | 0.515 | 0.570 | 0.317 | 0.307 | 0.301 | 0.302 |
| (0.6, 0.19, 0.14) | 0.779 | 0.778 | 0.758 | 0.762 | 0.569 | 0.524 | 0.547 | 0.293 | 0.307 |
| (0.0, 0.26, 0.15) | 0.480 | 0.398 | 0.515 | 0.453 | 0.561 | 0.445 | 0.190 | 0.409 | 0.191 |
| (0.6, 0.21, 0.12) | 0.766 | 0.738 | 0.760 | 0.727 | 0.564 | 0.638 | 0.381 | 0.411 | 0.217 |
In Figure 1, we show the optimal (ρ,κ) corresponding to minρ,κ pρ,κ for each simulated dataset, in which SNPs in a gene are moderately correlated (r = 0.6), and genotypes at causal SNPs shared by the two risk models are directly observed. When MAPSopt detects a significant association in practice (e.g., p < 0.01), a selected κ that is very close to 0 or 1 suggests that the gene under study is likely associated with only one trait.
Figure 1.
ρ and κ selected by MAPSopt in simulation studies, in which two risk models share the causal SNPs that are directly observed. The SNPs in a gene are moderately correlated (r = 0.6). Solid points: p-values of MAPSopt ≤ 0.01. Circles: p-values of MAPSopt > 0.01.
We also compared the empirical powers among tests when the causal SNPs are not directly observed. The parameter r controlling the LD structure was set at 0.6. All other settings were similar to those used in previous simulation with full observations. The results are summarized in Table 3. MAPSopt again appears to have the most robust performance among all considered tests, especially when the gene is associated with both traits.
Table 3.
Empirical power comparison when causal SNPs are not directly observed (r = 0.6).
| (β,θ) | MAPSopt | MAPScor | MAPS0 | MAPS0,1/2 | Wu | SKATD | SKATZ | LRTD | LRTZ |
|---|---|---|---|---|---|---|---|---|---|
| Two risk models share the same causal SNPs | |||||||||
| (0.27, 0.25) | 0.780 | 0.803 | 0.674 | 0.724 | 0.558 | 0.295 | 0.566 | 0.205 | 0.401 |
| (0.31, 0.23) | 0.781 | 0.781 | 0.675 | 0.689 | 0.568 | 0.440 | 0.453 | 0.300 | 0.293 |
| (0.34, 0.20) | 0.791 | 0.765 | 0.679 | 0.639 | 0.551 | 0.572 | 0.325 | 0.402 | 0.201 |
| Gene associates with one trait | |||||||||
| (0.39, 0.00) | 0.604 | 0.243 | 0.650 | 0.270 | 0.372 | 0.755 | 0.009 | 0.599 | 0.014 |
| (0.00, 0.30) | 0.613 | 0.626 | 0.635 | 0.686 | 0.397 | 0.003 | 0.738 | 0.011 | 0.598 |
| Two risk models contain different causal SNPs | |||||||||
| (0.34, 0.32) | 0.593 | 0.630 | 0.573 | 0.639 | 0.542 | 0.252 | 0.464 | 0.199 | 0.389 |
| (0.38, 0.30) | 0.604 | 0.616 | 0.581 | 0.617 | 0.560 | 0.360 | 0.373 | 0.291 | 0.309 |
| (0.42, 0.26) | 0.620 | 0.588 | 0.594 | 0.578 | 0.566 | 0.473 | 0.273 | 0.398 | 0.213 |
4. Application to a genome-wide association study of prostate cancer
We demonstrated the application of MAPS as multi-locus tests by applying them to a genome-wide association study (GWAS) of prostate cancer. We focused on 2841 controls and 4544 cases of European ancestry.20 For each prostate cancer case, we used the Gleason score (2–10), which indicates how likely it is that a tumor will spread, as a quantitative trait. We hypothesized that there are genes influencing the mechanism underlying the development of prostate cancer, as well as how fast the tumor cells spread. By looking at the two traits jointly (i.e., prostate cancer status and Gleason score), we intend to increase our chance of detecting genes of that type.
Of the SNPs genotyped using the Illumina HumanOmni2.5 BeadChip, 1,531,807 passed standard quality control criteria.20 We extracted SNPs within 20 kb upstream and 20 kb downstream of a gene or an annotated region. SNPs with a missing rate > 2% or a MAF < 2% were excluded from the analyses. For two SNPs with LD coefficient r² > 0.95, the one with the smaller MAF was discarded. Both traits were adjusted for center, age, and two eigenvectors.
We will provide a more detailed report on the analysis of over 20,000 genes/regions elsewhere. Here, we are interested in the 69 genes with both p-values of SKATZ and SKATD less than 0.05, as tests that analyze the two traits jointly are most likely to be beneficial for those genes. In Table 4, we show results for the nine genes for which at least one of the considered two-trait joint tests gave a gene-level p-value less than 0.001. Among those, KLK3 and CLDN11 are known risk genes associated with prostate cancer in populations of European ancestry.21,22 IRX4 has only been identified as associated with the risk of prostate cancer in the Japanese population.23,24 Although the other six genes have not been reported in GWAS as prostate cancer susceptibility genes, overexpression of PIAS3 is known to induce apoptosis in prostate cancer cells.25 The forest plots in Figures 2 and 3 illustrate the marginal effects of each SNP in genes LOC643201 and PIAS3 on the prostate cancer risk and the Gleason score. They show that several SNPs in either gene are associated with both traits. This is the main reason why the joint test approach appears to be more advantageous than the single-trait test approaches. We can consider the genes in Table 4 as promising candidates underlying the development of prostate cancer, although further replications are needed.
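The per-SNP exclusion rule on missing rate and MAF can be expressed as a small filter. The sketch below is an illustration only: the function name and the encoding of missing genotypes as `None` are assumptions, and the LD-based pruning at r² > 0.95 is not covered.

```python
def snp_passes_qc(genotypes, max_missing=0.02, min_maf=0.02):
    """Return True if a SNP is kept under the missing-rate and MAF thresholds.

    genotypes: list of 0/1/2 allele counts for one SNP, None for missing.
    A SNP is excluded if its missing rate exceeds max_missing or its
    minor allele frequency is below min_maf.
    """
    n_total = len(genotypes)
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return False
    missing_rate = (n_total - len(observed)) / n_total
    allele_freq = sum(observed) / (2 * len(observed))
    maf = min(allele_freq, 1.0 - allele_freq)
    return missing_rate <= max_missing and maf >= min_maf


common_snp = [0] * 90 + [1] * 10                 # MAF = 0.05, no missing
monomorphic_snp = [0] * 100                      # MAF = 0
sparse_snp = [1] * 10 + [0] * 80 + [None] * 10   # 10% missing
kept = [snp_passes_qc(s) for s in (common_snp, monomorphic_snp, sparse_snp)]
```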
Table 4.
The suggestive genes with at least one gene-level p-value < 10⁻³ and with the p-values of SKATD and SKATZ both < 0.05.
| Gene | MAPSopt | MAPScor | MAPS0 | MAPS0,1/2 | Wu | SKATD | SKATZ | LRTD | LRTZ |
|---|---|---|---|---|---|---|---|---|---|
| SENP6 | 3.0E–5 | 2.3E–4 | 3.7E–5 | 1.3E–3 | 2.9E–2 | 3.7E–5 | 4.8E–2 | 1.8E–2 | 2.8E–1 |
| LOC643201 | 8.1E–5 | 5.0E–5 | 1.8E–4 | 2.6E–4 | 1.5E–3 | 5.4E–3 | 1.0E–3 | 2.1E–3 | 8.7E–2 |
| KLK3 | 8.6E–5 | 3.8E–5 | 8.1E–5 | 4.5E–5 | 1.6E–4 | 3.7E–2 | 1.0E–4 | 2.0E–2 | 1.0E–3 |
| PIAS3 | 9.6E–5 | 7.7E–5 | 2.8E–4 | 4.8E–4 | 6.3E–3 | 3.8E–3 | 2.2E–3 | 2.4E–2 | 4.8E–2 |
| IRX4 | 3.9E–4 | 4.9E–4 | 9.9E–4 | 2.7E–3 | 2.0E–1 | 1.9E–3 | 2.1E–2 | 5.0E–2 | 7.0E–1 |
| MRPS31 | 4.1E–4 | 2.1E–4 | 7.1E–4 | 4.4E–4 | 4.3E–2 | 4.0E–2 | 8.4E–4 | 4.0E–1 | 1.8E–2 |
| CLDN11 | 5.3E–4 | 1.3E–3 | 5.6E–4 | 3.1E–3 | 2.7E–4 | 7.4E–4 | 3.5E–2 | 4.8E–4 | 5.8E–2 |
| ZNF526 | 2.0E–3 | 1.7E–3 | 8.8E–4 | 7.3E–4 | 6.9E–3 | 1.5E–2 | 2.3E–3 | 3.4E–2 | 3.3E–2 |
| AGXT2L1 | 6.0E–3 | 4.4E–3 | 8.4E–3 | 6.9E–3 | 1.7E–4 | 1.7E–2 | 2.0E–2 | 1.1E–3 | 1.9E–2 |
The values marked in bold are known susceptibility genes of the risk of prostate cancer.
Figure 2.
The forest plot of 16 SNPs within gene LOC643201.
Figure 3.
The forest plot of 12 SNPs within gene PIAS3.
5. Discussion
Although genetic association studies are typically designed to study one primary trait, valuable information on other secondary phenotypes is often collected. There is a growing interest in studying secondary phenotypes using already measured genotypes. Several approaches have been developed to identify genetic markers associated with secondary phenotypes, taking into account the design of the original study.26–30 The proposed method has a different goal. It analyzes the primary trait and a secondary phenotype jointly, aiming at detecting genes influencing both traits. The family of proposed tests is derived from a random effect model with two variance components, each representing the genetic effect on one trait. Among its various versions, we found that the one that uses observed data to adaptively model the variance–covariance matrix of genetic effects has the most robust performance. We demonstrated the application of the new method by applying it to a GWAS of prostate cancer and identified several promising novel regions that appeared to influence the risk and progression of prostate cancer. An R package of the proposed test is available at https://github.com/zhangh12/MAPS
It has been shown that multi-locus association tests can be a valuable alternative to the commonly used single-marker test. There are many multi-locus approaches for genetic association studies,6,7,31 most of which assume that there are no missing genotypes. As a result, genotype imputation is usually needed before using the multi-locus test. However, it is not a trivial task to impute the missing genotypes, and the imputation accuracy depends on the reference genome.11 The strategy for dealing with missing genotypes proposed with our test is more flexible and easier to use, as it does not require imputing missing genotypes. Also, it uses all observed genotypes, and thus is more efficient than the strategy that requires the removal of subjects with at least one missing genotype on the set of considered markers. This strategy is especially helpful for GWAS, where the genotype missing rate at a given SNP is very low. With some modifications, this strategy can be adapted to other multi-locus tests defined by score statistics.7,16,31
In our method, we use three parameters to model the variance–covariance matrix for the genetic effects on a primary trait and a secondary phenotype. To extend the method to study more than two secondary phenotypes, it is important to find an appropriate model for the variance–covariance matrix. Using too many parameters would increase the penalty for model selection and reduce efficiency of the multi-locus test. On the other hand, an over-simplified model can introduce bias into the testing procedure. Further investigations are needed to extend the proposed method to study more secondary phenotypes.
Acknowledgements
The authors would like to thank Professor Hua Liang at The George Washington University for his helpful comments. This study utilized the high-performance computational capabilities of compute cluster at the Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland.
Appendix 1
p-value of variance component test with ρ = 0
We derive the formula for the p-value of MAPS0. Let qκ be the upper 100 × p0,κ % quantile of Q0,κ, where Q0,κ follows a null distribution as a central mixture chi-square with non-zero weights. Under the null, and follow central mixture chi-square distributions with corresponding weights.31 Therefore
where FD(·) is the cumulative distribution function of and fz(·) is the probability density function of .
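One way to write the expression that the definitions of FD and fZ point to is sketched below. It is offered as a hedged reconstruction, not as the authors' exact formula. It assumes that Q0,κ = κA + (1 − κ)B, where A and B are the sums of squared scores for the primary and secondary trait (asymptotically independent under the null), that FD is the distribution function of A and fZ the density of B, and that qκ is the upper quantile of Q0,κ at the observed value of T. The symbols A, B, and δ are introduced here for illustration.

```latex
\begin{aligned}
p &= 1 - \Pr\bigl(\kappa A + (1-\kappa) B < q_\kappa \ \text{for all } \kappa \text{ on the grid}\bigr) \\
  &= 1 - \int_0^{q_0} F_D\bigl(\delta(z)\bigr)\, f_Z(z)\, dz,
  \qquad
  \delta(z) = \min_{\kappa \in (0,1]} \frac{q_\kappa - (1-\kappa)\, z}{\kappa}.
\end{aligned}
```

The upper limit q0 comes from the κ = 0 constraint B < q0, and FD is taken to be zero at negative arguments. The integral is one-dimensional, which is consistent with the numerical integration described in Section 2.3.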
Asymptotic distribution of modified score allowing missing data
We derive the modified score to deal with missing genotypes. By Taylor’s theorem and the law of large numbers
where I is the Fisher information matrix. Thus
The information , , Iαα and can be consistently estimated by , , and , respectively. Therefore, we can estimate the covariance between scores of two genetic effects by . Note that here we use the most informative estimates of (ᶛ,β,ᶲ) by using as many samples as possible.
Footnotes
Declaration of conflicting interests
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
- 1. Wu CO, Zheng G and Kwak M. A joint regression analysis for genetic association studies with outcome stratified samples. Biometrics 2013; 69: 417–426.
- 2. Yang J, Ferreira T, Morris AP, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 2012; 44: 369–375.
- 3. Liu JZ, Mcrae AF, Nyholt DR, et al. A versatile gene-based test for genome-wide association studies. Am J Hum Genet 2010; 87: 139–145.
- 4. Huang H, Chanda P, Alonso A, et al. Gene-based tests of association. PLoS Genet 2011; 7: e1002177.
- 5. Li MX, Gui HS, Kwan JS, et al. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet 2011; 88: 283–293.
- 6. Zhang H, Wheeler W, Wang Z, et al. A fast and powerful tree-based association test for detecting complex joint effects in case-control studies. Bioinformatics 2014; 30: 2171–2178.
- 7. Wu MC, Kraft P, Epstein MP, et al. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 2010; 86: 929–942.
- 8. Browning BL and Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 2009; 84: 210–223.
- 9. Li Y, Willer CJ, Ding J, et al. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 2010; 34: 816–834.
- 10. Howie B, Fuchsberger C, Stephens M, et al. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 2012; 44: 955–959.
- 11. Wang Z, Jacobs KB, Yeager M, et al. Improved imputation of common and uncommon SNPs with a new reference set. Nat Genet 2012; 44: 6–7.
- 12. Lin X. Variance component testing in generalised linear models with random effects. Biometrika 1997; 84: 309–326.
- 13. Davies RB. The distribution of a linear combination of χ2 random variables. J R Stat Soc C 1980; 29: 323–333.
- 14. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 1999; 86: 929–935.
- 15. Liu H, Tang Y and Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput Stat Data Anal 2009; 53: 853–856.
- 16. Zhang H, Shi J, Liang F, et al. A fast multilocus test with adaptive SNP selection for large-scale genetic-association studies. Eur J Hum Genet 2014; 22: 696–702.
- 17. Ge Y, Dudoit S and Speed TP. Resampling-based multiple testing for microarray data analysis. Test 2003; 12: 1–77.
- 18. Qin J and Zhang B. A goodness-of-fit test for logistic regression models based on case-control data. Biometrika 1997; 84: 609–618.
- 19. Wang T and Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet 2007; 80: 353–360.
- 20. Berndt SI, Wang Z, Yeager M, et al. Two susceptibility loci identified for prostate cancer aggressiveness. Nat Commun 2015; 6: 6889.
- 21. Eeles RA, Kote-Jarai Z, Giles GG, et al. Multiple newly identified loci associated with prostate cancer susceptibility. Nat Genet 2008; 40: 316–321.
- 22. Kote-Jarai Z, Olama AAA, Giles GG, et al. Seven novel prostate cancer susceptibility loci identified by a multi-stage genome-wide association study. Nat Genet 2011; 43: 785–791.
- 23. Takata R, Akamatsu S, Kubo M, et al. Genome-wide association study identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat Genet 2010; 42: 751–754.
- 24. Nakagawa H, Akamatsu S, Takata R, et al. Prostate cancer genomics, biology, and risk assessment through genome-wide association studies. Cancer Sci 2012; 103: 607–613.
- 25. Wible BA, Wang L, Kuryshev YA, et al. Increased K+ efflux and apoptosis induced by the potassium channel modulatory protein KChAP/PIAS3β in prostate cancer cells. J Biol Chem 2002; 277: 17852–17862.
- 26. Lin D and Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol 2009; 33: 256–265.
- 27. Monsees GM, Tamimi RM and Kraft P. Genome-wide association scans for secondary traits using case-control samples. Genet Epidemiol 2009; 33: 717–728.
- 28. He J, Li H, Edmondson AC, et al. A Gaussian copula approach for the analysis of secondary phenotypes in case-control genetic association studies. Biostatistics 2012; 13: 497–508.
- 29. Li H and Gail MH. Efficient adaptively weighted analysis of secondary phenotypes in case-control genome-wide association studies. Hum Hered 2012; 73: 159–173.
- 30. Schifano ED, Li L, Christiani DC, et al. Genome-wide association analysis for multiple continuous secondary phenotypes. Am J Hum Genet 2013; 92: 744–759.
- 31. Lee S, Wu MC and Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 2012; 13: 762–775.



