Author manuscript; available in PMC: 2019 Apr 19.
Published in final edited form as: Stat Methods Med Res. 2016 Aug 8;27(5):1464–1475. doi: 10.1177/0962280216662071

A multi-locus genetic association test for a dichotomous trait and its secondary phenotype

Han Zhang 1, Colin O Wu 2, Yifan Yang 3, Sonja I Berndt 1, Stephen J Chanock 1, Kai Yu 1
PMCID: PMC6474783  NIHMSID: NIHMS1014776  PMID: 27507288

Abstract

Genetic association studies often collect information on secondary phenotypes related to the primary disease status. In many situations, the secondary phenotypes are measured only in subjects with the disease condition. It would be advantageous to model the primary trait and the secondary phenotype together if they share a certain level of genetic heritability. We propose a family of multi-locus testing procedures to detect the composite association between a set of genetic markers and two traits (the primary trait and a secondary phenotype), in order to identify genes influencing both traits. The proposed test is derived from a random effect model with two variance components, each representing the genetic effect on one trait, and incorporates a model selection procedure for seeking the optimal model to represent the two sources of genetic effects. We conduct simulation studies to evaluate the performance of the proposed procedure and apply the method to a genome-wide association study of prostate cancer with the Gleason score as the secondary phenotype.

Keywords: Secondary phenotype, multi-locus test, variance component, genome-wide association study, multiple testing, prostate cancer

1. Introduction

Population-based genetic association studies have been widely used for uncovering the genetic basis underlying complex diseases. Although they are typically designed to study one primary trait, information on other secondary phenotypes is often collected and is potentially valuable for the study of the primary trait. For example, besides knowing the disease status of each subject in a genetic association study of breast cancer, we might also have additional information measured on breast cancer tumor tissues, which provides more details on the pathologic and molecular characteristics of the disease. Those secondary phenotypes could be helpful in identifying the disease susceptibility loci if they share a certain level of genetic heritability with the primary trait.

Recently, Wu et al.1 proposed a single-marker testing framework to assess the association between a genetic marker and two traits simultaneously in situations where the secondary phenotype is quantitative and is measured only on subjects in a particular primary trait-dependent stratum. For example, the secondary phenotype might be available only on the subjects with the disease condition. The data can be collected prospectively or retrospectively with the primary trait being the disease status. Their method aims at detecting genetic markers associated with both traits while maintaining robust power even if the marker is associated with only one of the traits.

Although the single-marker test has been the most commonly used approach in detecting genetic susceptibility loci, increasing evidence has suggested that multiple correlated markers within a gene could jointly influence complex diseases.2 A multi-locus test that aggregates association evidence across multiple genetic markers in a considered gene or genomic region may be more powerful than a single-marker test.3–6 Here, we focus on a setting similar to that considered by Wu et al.1 and derive a class of multi-locus tests for the association between a set of genetic markers within a considered gene and two traits. The proposed test extends the sequence kernel association test (SKAT)7 to a random effect model with two variance components, each representing a genetic effect on one trait, and incorporates a model selection procedure for seeking the optimal model to represent the two sources of genetic effects.

Many existing multi-locus tests require complete observations on the set of genetic markers. Under the current single-nucleotide polymorphism (SNP) genotyping technology, the genotype missing rate at a given SNP is very low. However, the proportion of subjects with at least one missing genotype within a considered genomic region could still be high if the number of SNPs in the region is relatively large. Excluding those subjects can reduce the sample size substantially and thus diminish the power of the multi-locus test. Using statistical imputation algorithms8–10 to impute the missing genotypes is a commonly used strategy to retain the sample, but it has several limitations. For example, it requires knowledge of the haplotype distribution in the study population. Furthermore, it is known that the imputed genotype on a SNP with a relatively low minor allele frequency (MAF) is not very accurate.11 To make the proposed test more flexible in practice, we generalize the test so that it can handle missing genotypes without resorting to imputation or removing samples.

2. Model

For a study with n samples, let {Di, Zi, Xi, Gi} be the observed data on the ith sample, with Di being the primary dichotomous trait (e.g., disease status), Zi being the secondary quantitative trait, Xi being the set of covariates to be adjusted for, and Gi being the vector of genotypes on m genetic markers in a given gene or region. We assume that the genotype is coded as 0, 1, or 2, representing the number of minor alleles at a given marker. Other coding schemes can be dealt with similarly. We will describe our method for data sampled from a prospective cohort study, where the secondary phenotype Zi is available only on subjects with Di = 1. We then extend the application of our method to a retrospective case–control study, wherein Zi is collected in cases. We refer to our method as MAPS, i.e., a Multi-locus Association test for a dichotomous Primary trait and a quantitative Secondary phenotype.

2.1. A random effect model for a prospective cohort study

In a prospective cohort study, we assume that the dichotomous trait Di can be modeled by the following logistic regression model given the covariates and genotypes of multiple markers within a gene

$$\operatorname{logit}\Pr(D_i = 1 \mid X_i, G_i) = X_i^T\alpha + G_i^T\beta, \quad i = 1, \ldots, n \tag{1}$$

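with the intercept term absorbed into the covariates by adding a column of 1s to $X_i$. We also assume that $Z_i$ on a subject with $D_i = 1$ follows the normal distribution $N(X_i^T\gamma + G_i^T\theta, \sigma^2)$. The likelihood of observing the data $\{D_i, Z_i, X_i, G_i : i = 1, \ldots, n\}$ can be represented as

$$L(\beta,\theta,\alpha,\gamma,\sigma^2) = \prod_{i=1}^{n} \left\{ \frac{\exp(X_i^T\alpha + G_i^T\beta)}{1 + \exp(X_i^T\alpha + G_i^T\beta)} \, \phi\!\left( \frac{Z_i - X_i^T\gamma - G_i^T\theta}{\sigma} \right) \right\}^{D_i} \left\{ \frac{1}{1 + \exp(X_i^T\alpha + G_i^T\beta)} \right\}^{1 - D_i} \tag{2}$$

where $\phi(\cdot)$ is the density function of the standard normal distribution. Let

$$\ell_D(\beta,\alpha) = \sum_{i=1}^{n} \left[ D_i \left( X_i^T\alpha + G_i^T\beta \right) - \log\left\{ 1 + \exp\left( X_i^T\alpha + G_i^T\beta \right) \right\} \right]$$

and

$$\ell_Z(\theta,\gamma,\sigma^2) = \sum_{i=1}^{n} D_i \log \phi\!\left( \frac{Z_i - X_i^T\gamma - G_i^T\theta}{\sigma} \right)$$

so that the log-likelihood is $\ell(\beta,\theta,\alpha,\gamma,\sigma^2) = \ell_D(\beta,\alpha) + \ell_Z(\theta,\gamma,\sigma^2)$. Denote $S_D = \partial\ell_D/\partial\beta$, $S_Z = \partial\ell_Z/\partial\theta$, $S_{DD} = \partial^2\ell_D/\partial\beta\,\partial\beta^T$, and $S_{ZZ} = \partial^2\ell_Z/\partial\theta\,\partial\theta^T$. It can be shown through a second-order Taylor expansion that the likelihood can be approximated around $(\beta,\theta) = (0,0)$ as

$$\exp\{\ell(\beta,\theta,\alpha,\gamma,\sigma^2)\} \approx \exp\{\ell_D(0,\alpha) + \ell_Z(0,\gamma,\sigma^2)\} \left\{ 1 + S_D^T\beta + S_Z^T\theta + \frac{1}{2} (\beta^T, \theta^T)\, W \begin{pmatrix} \beta \\ \theta \end{pmatrix} \right\}$$

where 0 is a zero vector of length m, and

$$W = \begin{pmatrix} S_D S_D^T & S_D S_Z^T \\ S_Z S_D^T & S_Z S_Z^T \end{pmatrix} + \begin{pmatrix} S_{DD} & 0 \\ 0 & S_{ZZ} \end{pmatrix}$$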

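To derive a variance component test for the null hypothesis $H_0: \beta = \theta = 0$, we further assume that the genetic effects $(\beta,\theta)$ are random effects with $E\beta = E\theta = 0$ and

$$\operatorname{Cov}\begin{pmatrix} \beta \\ \theta \end{pmatrix} = \tau \times \begin{pmatrix} \kappa I & \rho\sqrt{\kappa(1-\kappa)}\, I \\ \rho\sqrt{\kappa(1-\kappa)}\, I & (1-\kappa) I \end{pmatrix}$$

where $I$ is the $m \times m$ identity matrix and the scalars satisfy $\tau \ge 0$, $\rho \in [-1, 1]$, and $\kappa \in [0, 1]$. Here, the variance–covariance matrix is configured by three parameters under the following assumptions. First, the genetic effects $\beta$ on the primary trait are independent and identically distributed (i.i.d.) random variables with variance $\tau\kappa$. Second, the genetic effects $\theta$ on the secondary phenotype are i.i.d. random variables with variance $\tau(1-\kappa)$. Third, the genetic effects from one marker on the two traits are correlated with correlation coefficient $\rho$. Fourth, the genetic effects from two different markers on either the same or different traits are uncorrelated. One can see that testing the joint genetic effects on the two traits is equivalent to testing $H_0: \tau = 0$. Similar to Lin,12 we can obtain the profile log-likelihood in terms of the variance component parameters $(\tau,\rho,\kappa)$ by integrating out $\beta$ and $\theta$

$$\tilde{\ell}(\tau,\rho,\kappa,\alpha,\gamma,\sigma^2) = \log E_{\beta,\theta} \exp\{\ell(\beta,\theta,\alpha,\gamma,\sigma^2)\} \approx \ell_D(0,\alpha) + \ell_Z(0,\gamma,\sigma^2) + \frac{1}{2} \operatorname{tr}\left\{ W \operatorname{Cov}\begin{pmatrix} \beta \\ \theta \end{pmatrix} \right\} \tag{3}$$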
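The four assumptions on the random effects translate directly into a block-structured matrix. The following Python sketch writes that matrix out for a small number of markers; the function name `effect_covariance` is hypothetical and is used here only for illustration.

```python
import math

def effect_covariance(m, tau, rho, kappa):
    """Return the 2m x 2m covariance matrix of (beta, theta) as nested lists.

    Rows/columns 0..m-1 index beta (primary trait); m..2m-1 index theta
    (secondary phenotype).
    """
    if tau < 0 or not -1 <= rho <= 1 or not 0 <= kappa <= 1:
        raise ValueError("need tau >= 0, rho in [-1, 1], kappa in [0, 1]")
    off = rho * math.sqrt(kappa * (1 - kappa))
    cov = [[0.0] * (2 * m) for _ in range(2 * m)]
    for j in range(m):
        cov[j][j] = tau * kappa                    # Var(beta_j)
        cov[m + j][m + j] = tau * (1 - kappa)      # Var(theta_j)
        cov[j][m + j] = cov[m + j][j] = tau * off  # same marker, two traits
    return cov

# Two markers, equal variance split, correlation 0.5 between traits.
cov = effect_covariance(m=2, tau=1.0, rho=0.5, kappa=0.5)
```

Entries linking two different markers stay at zero, which is the fourth assumption above; setting `tau=0` collapses the whole matrix to zero, which corresponds to the null hypothesis.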

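Let $(\hat\alpha,\hat\gamma,\hat\sigma^2)$ be the maximum likelihood estimates of $(\alpha,\gamma,\sigma^2)$ obtained by maximizing $\tilde{\ell}$ under the null. Define

$$Q_{\rho,\kappa} = \kappa\, \hat S_D^T \hat S_D + (1-\kappa)\, \hat S_Z^T \hat S_Z + 2\rho\sqrt{\kappa(1-\kappa)}\, \hat S_D^T \hat S_Z \tag{4}$$

where $\hat S_D$ (or $\hat S_Z$) is the score $S_D$ (or $S_Z$) evaluated at $(\hat\alpha,\hat\gamma,\hat\sigma^2)$. The asymptotic null distributions of $\hat S_D$ and $\hat S_Z$ are multivariate normal distributions, with means 0 and estimated variance–covariance matrices $\hat S_{DD}$ and $\hat S_{ZZ}$, respectively. The score for $\tau$ at $\tau = 0$ is

$$\left. \frac{d\tilde{\ell}}{d\tau} \right|_{\tau=0} = \frac{1}{2} Q_{\rho,\kappa} + \frac{1}{2} \operatorname{tr}\left\{ \kappa \hat S_{DD} + (1-\kappa) \hat S_{ZZ} \right\} \tag{5}$$

Note that the second term in equation (5) converges in probability to a constant for given $(\rho,\kappa)$; we can thus conduct a family of variance component tests based on $Q_{\rho,\kappa}$ only.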
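As a minimal sketch of how the null scores and the statistic $Q_{\rho,\kappa}$ in equation (4) could be computed, the Python snippet below assumes an intercept-only null model (no covariates other than the column of 1s), so that the null fits reduce to the case proportion and to the mean and variance of Z among cases. With real covariates, a logistic and a linear regression fit would take their place. The function names `null_scores` and `q_stat` are hypothetical and are not taken from the authors' R package.

```python
import math

def null_scores(D, Z, G):
    """Scores S_D and S_Z at beta = theta = 0 under an intercept-only null.

    D: list of 0/1 disease indicators.
    Z: secondary trait values (entries for subjects with D == 0 are ignored).
    G: list of per-subject genotype lists, each of length m.
    """
    n, m = len(D), len(G[0])
    cases = [i for i in range(n) if D[i] == 1]
    p_hat = len(cases) / n                                   # fitted Pr(D = 1)
    z_bar = sum(Z[i] for i in cases) / len(cases)            # fitted mean of Z
    sigma2 = sum((Z[i] - z_bar) ** 2 for i in cases) / len(cases)
    s_d = [sum(G[i][j] * (D[i] - p_hat) for i in range(n)) for j in range(m)]
    s_z = [sum(G[i][j] * (Z[i] - z_bar) for i in cases) / sigma2
           for j in range(m)]
    return s_d, s_z

def q_stat(s_d, s_z, rho, kappa):
    """Q_{rho,kappa} as a weighted combination of score inner products."""
    dd = sum(a * a for a in s_d)
    zz = sum(b * b for b in s_z)
    dz = sum(a * b for a, b in zip(s_d, s_z))
    return kappa * dd + (1 - kappa) * zz + 2 * rho * math.sqrt(kappa * (1 - kappa)) * dz

# Toy data: four subjects, one marker; Z is observed only for the two cases.
s_d, s_z = null_scores([1, 0, 1, 0], [1.0, None, 3.0, None], [[2], [0], [1], [0]])
q = q_stat(s_d, s_z, rho=0.0, kappa=0.5)
```

Setting `kappa` to 1 or 0 recovers a statistic that uses only the primary-trait or only the secondary-trait scores, which is why the choice of `kappa` controls how the evidence is split between the two traits.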

For any given $(\rho,\kappa)$, denote $p_{\rho,\kappa} = \Pr(Q_{\rho,\kappa} \ge \tilde{Q}_{\rho,\kappa})$ as the p-value of $Q_{\rho,\kappa}$ evaluated at its observed value $\tilde{Q}_{\rho,\kappa}$.

Since $(\rho,\kappa)$ are unknown, we propose to define the statistic for testing $H_0: \tau = 0$ as

$$T = \min_{\rho,\kappa} p_{\rho,\kappa},$$

which measures the strongest evidence of association with $(\rho,\kappa)$ tuned over proper regions. The final p-value adjusted for multiple comparisons is computed from the null distribution of $T$.

In the following sections, we will introduce different versions of variance component tests based on T with possible choices of the tuning parameters ρ and κ. Numerical algorithms for computing the final p-value are also discussed.

2.2. The variance component test with ρ = 0 and κ = 1/2

One simple choice of the tuning parameters in $T$ is to set $\rho = 0$ and $\kappa = 1/2$, which essentially assumes that the genetic effects of each marker are distributed equally between the two traits. The statistical significance can be evaluated from the distribution of $Q_{0,1/2} \propto \hat S_D^T \hat S_D + \hat S_Z^T \hat S_Z$, which follows a mixture of chi-square distributions under the null. Several existing algorithms are available for computing the distribution function of $Q_{0,1/2}$, so the p-value can be calculated accurately.13–15 This test is referred to as MAPS0,1/2.
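The exact algorithms cited above are not reproduced here. As a rough stand-in, the tail probability of a mixture of chi-square distributions can be approximated by plain Monte Carlo, which is slower and less accurate in the far tail but makes the structure of the null distribution explicit. In the Python sketch below, `weights` stands for the mixture weights, which are assumed to have been computed already (for a quadratic form $S^T S$ with $S \sim N(0, \Sigma)$ they are the eigenvalues of $\Sigma$); the function name is hypothetical.

```python
import random

def mixture_chisq_pvalue(weights, q_obs, n_sim=100_000, seed=1):
    """Monte Carlo estimate of Pr(sum_j w_j * chi2_1 >= q_obs)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # Each squared standard normal is one independent chi-square(1) draw.
        q = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if q >= q_obs:
            hits += 1
    return hits / n_sim

# Hypothetical weights and observed statistic, for illustration only.
p_value = mixture_chisq_pvalue([1.5, 1.0, 0.5], q_obs=9.0)
```

With `n_sim` draws the smallest non-zero estimate is `1 / n_sim`, so this approach is unsuitable for the very small p-values needed at genome-wide significance levels, where the cited exact methods are the appropriate tool.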

2.3. The variance component test with ρ = 0

A more flexible approach is to fix $\rho = 0$ while allowing $\kappa$ to vary in $[0, 1]$. The test statistic becomes $T = \min_{\kappa \in [0,1]} p_{0,\kappa}$. In practice, we can choose $\kappa$ on the grid $\{k/20 : k = 0, \ldots, 20\}$. For any given $\kappa$

$$Q_{0,\kappa} = \kappa\, \hat S_D^T \hat S_D + (1-\kappa)\, \hat S_Z^T \hat S_Z.$$

Because $S_D$ is asymptotically independent of $S_Z$ (since $\partial^2\ell/\partial\beta\,\partial\theta^T = 0$), the final p-value of $T$ can be computed explicitly by a one-dimensional numerical integration. The details are given in Appendix 1. This test is referred to as MAPS0.

2.4. The variance component test with variable ρ and κ

In real applications, we usually do not have any prior knowledge of the values of $\rho$ and $\kappa$. A robust approach is to maximize the association evidence over $(\rho,\kappa) \in [-1,1] \times [0,1]$, so we define the statistic as $T = \min_{(\rho,\kappa) \in [-1,1] \times [0,1]} p_{\rho,\kappa}$. In practice, we can choose $(\rho,\kappa)$ on the grid $\{j/10 : j = -10, \ldots, 10\} \times \{k/20 : k = 0, \ldots, 20\}$. To assess the significance of $T$, we can generate the scores $\hat S_D$ and $\hat S_Z$ under the null via the direct simulation approach.16 The final p-value of $T$ is then estimated through the computationally efficient minP algorithm.17 We refer to this optimal test as MAPSopt. As a special case, the MAPSopt test with $\kappa$ fixed at 1/2 and $\rho$ tuned in $[-1, 1]$ is referred to as MAPScor. The p-value of MAPScor can be computed in the same way as that of MAPSopt.
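The resampling logic behind a minimum-p-value statistic can be sketched in a few lines of Python. This is an illustration in the spirit of the procedure described above, not the cited minP algorithm itself. As a simplifying assumption, the null scores are drawn with independent components and user-supplied standard deviations (`sd_d`, `sd_z`), whereas the method in the text simulates them from their estimated variance–covariance matrices. All function and parameter names are hypothetical.

```python
import math
import random
from bisect import bisect_left

def q_stat(s_d, s_z, rho, kappa):
    dd = sum(a * a for a in s_d)
    zz = sum(b * b for b in s_z)
    dz = sum(a * b for a, b in zip(s_d, s_z))
    return kappa * dd + (1 - kappa) * zz + 2 * rho * math.sqrt(kappa * (1 - kappa)) * dz

def upper_tail(sorted_null, q):
    """Fraction of null draws that are >= q."""
    return (len(sorted_null) - bisect_left(sorted_null, q)) / len(sorted_null)

def minp_pvalue(s_d, s_z, sd_d, sd_z, grid, n_sim=2000, seed=1):
    """Resampling p-value of T = min over the grid of p_{rho,kappa}."""
    rng = random.Random(seed)
    null_q = []
    for _ in range(n_sim):
        d = [rng.gauss(0.0, s) for s in sd_d]   # simulated null S_D
        z = [rng.gauss(0.0, s) for s in sd_z]   # simulated null S_Z
        null_q.append([q_stat(d, z, r, k) for r, k in grid])
    # One sorted null sample per grid point, reused for every p-value lookup.
    cols = [sorted(row[g] for row in null_q) for g in range(len(grid))]
    t_obs = min(upper_tail(cols[g], q_stat(s_d, s_z, r, k))
                for g, (r, k) in enumerate(grid))
    t_null = [min(upper_tail(cols[g], row[g]) for g in range(len(grid)))
              for row in null_q]
    exceed = sum(t <= t_obs for t in t_null)
    return (1 + exceed) / (1 + n_sim)           # never exactly zero

# The grid described in the text: rho in {-1, -0.9, ..., 1}, kappa in {0, 0.05, ..., 1}.
grid = [(j / 10, k / 20) for j in range(-10, 11) for k in range(21)]
p_opt = minp_pvalue([1.2, -0.4], [0.8, 0.3], [1.0, 1.0], [1.0, 1.0], grid, n_sim=200)
```

The same set of simulated scores is used both to convert each $Q_{\rho,\kappa}$ into a p-value and to build the null distribution of the minimum, so the cost grows linearly rather than quadratically in the number of resampling steps.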

2.5. Existing approaches

There are several alternative approaches that are applicable to the setting considered in this paper. The SKAT has been successfully applied in identifying genetic regions associated with complex diseases.7 In the following discussion, the SKAT tests applied to the dichotomous and the quantitative trait are referred to as SKATD and SKATZ, respectively. In addition, the standard likelihood ratio test (LRT), which compares the additive model consisting of all the genetic markers with the null model, can be applied to each trait separately, leading to two tests, LRTD and LRTZ. These two tests may lose power because of their large degrees of freedom. Finally, we generalize the single-marker test in Wu et al.1 to a multi-locus score test. This generalized score test follows a $\chi^2_{2m}$ distribution under the null.

2.6. A random effect model for a retrospective case–control study

In a case–control study, we assume that the quantitative trait Z is observed only in cases. Then the likelihood of the observed data $\{D_i, Z_i, X_i, G_i : i = 1, \ldots, n\}$ can be written as

$$L(\beta,\theta,\alpha,\gamma,\sigma^2) = \prod_{i=1}^{n} \Pr(Z_i \mid X_i, G_i, D_i = 1)^{D_i} \Pr(X_i, G_i \mid D_i = 1)^{D_i} \Pr(X_i, G_i \mid D_i = 0)^{1 - D_i} \tag{6}$$

According to Qin and Zhang,18 the joint distribution of $X_i$ and $G_i$ satisfies

$$\Pr(X_i, G_i \mid D_i = 1) \propto \exp\left( X_i^T\alpha + G_i^T\beta \right) \Pr(X_i, G_i \mid D_i = 0)$$

if the risk model is assumed to be the logistic regression model in equation (1). Ignoring a constant, the profile likelihood of equation (6) is equivalent to the likelihood in equation (2) for a cohort study.18 Therefore, all the tests discussed in the previous subsections can be applied to case–control studies.

2.7. Missing data in multi-locus test

In the above, we have described the method assuming no missing genotypes at any of the considered genetic markers. In real applications, we might have a substantial proportion of individuals who have at least one missing genotype in the considered region, especially when the region consists of a large number of markers. Removing those subjects can result in a substantial loss of power. To make full use of the observed genotypes, we propose to use the following modified score statistics defined on the observed genotypes.

Without loss of generality, we consider the generalized linear model $E(Y) = g^{-1}(X^T\alpha + G^T\beta)$ and assume that the covariates $X$ are observed in the full dataset $S$ with sample size $n$. Other nuisance parameters (e.g., the variance parameter $\sigma^2$ for the quantitative trait $Z$), if any, are denoted as $\psi$. The $n_j$ individuals without missing genotypes on the jth marker are indexed as $S_j$, $j = 1, 2, \ldots, m$, where $m$ is the number of markers within the considered region or gene. Denote the log-likelihood as $\ell = \ell(\alpha,\beta,\psi)$ and let $\ell_\alpha = \partial\ell/\partial\alpha$, $\ell_\beta = \partial\ell/\partial\beta$, $\ell_{\alpha\alpha} = \partial^2\ell/\partial\alpha\,\partial\alpha^T$, $\ell_{\beta\beta} = \partial^2\ell/\partial\beta\,\partial\beta^T$, and $\ell_{\beta\alpha} = \partial^2\ell/\partial\beta\,\partial\alpha^T$. A superscript $j$ on these terms means that only individuals in $S_j$ are used. For example, $\ell^{j}_{\beta_j}$ is the score of $\beta_j$ defined on $S_j$. In contrast, the score of $\alpha$ can be defined on either $S$ (i.e., $\ell_\alpha$) or $S_j$ (i.e., $\ell^{j}_\alpha$). Similarly, a superscript $jk$ means that individuals in $S_j \cap S_k$ are used. Let $(\hat\alpha_j, \hat\psi_j)$ be the maximum likelihood estimates of $(\alpha,\psi)$ using $S_j$ under the null. Statistics denoted with the accent ^ are evaluated at the corresponding null estimates (e.g., $\hat\ell^{j}_{\beta_j} = \ell^{j}_{\beta_j}(\hat\alpha_j, 0, \hat\psi_j)$). We show in Appendix 1 that, under the assumption that genotypes are missing completely at random, the modified score

$$\hat\ell_\beta = \left\{ \hat\ell^{j}_{\beta_j} : j = 1, 2, \ldots, m \right\} = \left\{ \ell^{j}_{\beta_j}(\hat\alpha_j, 0, \hat\psi_j) : j = 1, 2, \ldots, m \right\}$$

asymptotically follows a multivariate normal distribution with mean 0. The covariance between $\hat\ell^{j}_{\beta_j}$ and $\hat\ell^{k}_{\beta_k}$ can be consistently estimated by

$$\nu_{jk} = -\hat\ell^{jk}_{\beta_j\beta_k} + \frac{n\, n_{jk}}{n_j n_k}\, \hat\ell^{j}_{\beta_j\alpha}\, \hat\ell_{\alpha\alpha}^{-1}\, \hat\ell^{k}_{\alpha\beta_k},$$

where $n_{jk}$ is the sample size of $S_j \cap S_k$. Replacing $\hat S_D$ and $\hat S_Z$ in equation (4) by the modified score $\hat\ell_\beta$ allows the proposed method to handle data with missing genotypes, where $(\nu_{jk})_{m \times m}$ is used as the modified variance–covariance matrix in the direct simulation for the evaluation of the p-value.

In this procedure, the score $\hat\ell^{j}_{\beta_j}$ uses all the observed genotypes at the jth SNP. The covariance between the scores $\hat\ell^{j}_{\beta_j}$ and $\hat\ell^{k}_{\beta_k}$ is estimated using information on $n_{jk}$ subjects, which is very close to the original total sample size if the proportion of missing genotypes at each marker is low. Thus, this strategy is much more efficient than one that requires the removal of subjects who have at least one missing genotype on the set of considered markers.
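The per-marker idea can be illustrated with a small Python sketch. It assumes an intercept-only null model with an identity link, so that the null fit on $S_j$ is simply the mean of the outcome among subjects with an observed genotype at marker j, and it returns the unscaled score (without the $1/\hat\sigma^2$ factor of the normal model). Missing genotypes are encoded as `None`; the function name `modified_scores` is hypothetical, and the covariance estimate $\nu_{jk}$ is not implemented here.

```python
def modified_scores(y, G):
    """Per-marker scores that use every subject with an observed genotype.

    y: list of outcomes; G: list of per-subject genotype lists (None = missing).
    Returns (scores, counts), where counts[j] is n_j, the size of S_j.
    """
    m = len(G[0])
    scores, counts = [], []
    for j in range(m):
        observed = [i for i in range(len(y)) if G[i][j] is not None]   # S_j
        mu_j = sum(y[i] for i in observed) / len(observed)             # null fit on S_j
        scores.append(sum(G[i][j] * (y[i] - mu_j) for i in observed))
        counts.append(len(observed))
    return scores, counts

y = [1.0, 2.0, 3.0, 4.0]
G = [[0, 1], [1, 1], [None, 1], [2, None]]
scores, counts = modified_scores(y, G)
```

In this toy example a complete-case analysis would keep only the first two subjects, whereas each per-marker score above still uses three of the four subjects.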

This procedure is very general for handling missing genotypes in various multi-locus tests, as long as the test is based on score statistics derived from the generalized linear model. We therefore integrated this procedure into SKAT, LRT, and Wu’s method, so that they can be applied to the real data application described below.

3. Simulation studies

We evaluated the performance of the proposed variance component tests through simulation studies with genes generated under various linkage disequilibrium (LD) structures. Similar to Wang and Elston,19 we considered a gene consisting of 20 SNPs and a study with 500 cases and 500 controls. To generate genotypes on the 20 SNPs, we first simulated continuous random variables $R = (R_1, \ldots, R_{20})$ from a multivariate normal distribution with mean zero and variance–covariance matrix $\Sigma = (\sigma_{ij})_{20 \times 20}$, where $\sigma_{ij} = r^{|i-j|}$. By properly choosing cut-points, we then discretized $R_i$ into a three-level genotype with levels 0, 1, and 2, so that the corresponding SNP had a MAF of 0.4. The LD within this gene was controlled by the parameter $r$, which was chosen as either 0 or 0.6 in our simulation. We used this algorithm to generate genotypes in controls. We assumed that the risk model for the primary trait (case–control status) has the following form

$$\operatorname{logit}\Pr(D = 1 \mid G) = \beta G_{10} + \beta G_{11}$$

with the 10th and 11th SNPs conferring the risk of disease. To simplify the simulation, we assumed that the two odds ratios are equal. Under the given risk model, we used the weighted sampling procedure18 to generate genotypes in cases from the following distribution

$$\Pr(G_1, \ldots, G_{20} \mid D = 1) \propto \exp\{\beta G_{10} + \beta G_{11}\} \Pr(G_1, \ldots, G_{20} \mid D = 0)$$

Within the stratum of $D = 1$, the secondary quantitative trait was simulated from

$$Z = \theta G_{10} + \theta G_{11} + N(0,1),$$

with the same risk SNPs as in the risk model of the dichotomous trait.
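The data-generating steps above can be sketched in Python as follows. Several details are assumptions made only for this illustration: the latent normal vector is given a first-order autoregressive correlation $r^{|i-j|}$, the cut-points are chosen so that genotype frequencies follow Hardy–Weinberg proportions at MAF 0.4, and the weighted sampling of case genotypes is implemented as rejection sampling. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

MAF = 0.4
# Assumed Hardy-Weinberg proportions: P(G=0)=0.36, P(G=1)=0.48, P(G=2)=0.16.
CUT_LOW = NormalDist().inv_cdf((1 - MAF) ** 2)
CUT_HIGH = NormalDist().inv_cdf(1 - MAF ** 2)

def control_genotypes(rng, r, m=20):
    """Discretize a latent AR(1) normal vector with correlation r**|i-j|."""
    latent = [rng.gauss(0.0, 1.0)]
    for _ in range(m - 1):
        latent.append(r * latent[-1] + math.sqrt(1 - r * r) * rng.gauss(0.0, 1.0))
    return [0 if x < CUT_LOW else (1 if x < CUT_HIGH else 2) for x in latent]

def case_genotypes(rng, r, beta, causal=(9, 10)):
    """Draw from Pr(G | D=1), proportional to exp{beta*(G10+G11)} * Pr(G | D=0).

    Rejection sampling: propose from the control distribution and accept with
    probability weight / max_weight. Indices 9 and 10 are SNPs 10 and 11.
    """
    max_weight = math.exp(max(0.0, beta) * 2 * len(causal))
    while True:
        g = control_genotypes(rng, r)
        weight = math.exp(beta * sum(g[j] for j in causal))
        if rng.random() < weight / max_weight:
            return g

def secondary_trait(rng, g, theta, causal=(9, 10)):
    """Z = theta*G10 + theta*G11 + N(0, 1), generated for cases only."""
    return theta * sum(g[j] for j in causal) + rng.gauss(0.0, 1.0)

rng = random.Random(2016)
case = case_genotypes(rng, r=0.6, beta=0.17)
z = secondary_trait(rng, case, theta=0.15)
```

Rejection sampling is used here because the tilting weight is bounded, so no normalizing constant of the case distribution needs to be computed.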

We also investigated the robustness of our method by additional simulations, in which the risk SNPs of the primary trait and the secondary trait were different. More specifically, $\operatorname{logit}\Pr(D = 1 \mid G) = \beta G_9 + \beta G_{10}$ and $Z = \theta G_{11} + \theta G_{12} + N(0,1)$.

3.1. Type I error

We evaluated the type I errors of all tests proposed in this paper, as well as the existing approaches discussed in Section 2.5. Only the results for the scenario with r = 0.6 are presented here. We generated 100,000 datasets under the null by setting β = θ = 0, each with 500 cases and 500 controls. The p-values of MAPSopt and MAPScor were first estimated with 100,000 resampling steps. Then, for those datasets with initial p-value estimates below $5 \times 10^{-4}$, more accurate estimates were obtained with 1,000,000 resampling steps. Table 1 shows that all tests can properly control the type I errors at nominal levels of 0.01, 0.001, and 0.0001.

Table 1.

Type I error based on 100,000 datasets generated from the null H0 : β = θ =0.

Level MAPSopt MAPScor MAPS0 MAPS0,1/2 Wu SKATD SKATZ LRTD LRTZ
0.01 0.00960 0.00933 0.00902 0.01039 0.00944 0.00999 0.00967 0.01169 0.01231
0.001 0.00099 0.00094 0.00093 0.00099 0.00083 0.00093 0.00098 0.00126 0.00134
0.0001 0.00010 0.00008 0.00012 0.00005 0.00009 0.00010 0.00008 0.00010 0.00024

3.2. Empirical power

To compare the empirical powers of the various tests under different LD structures configured by r, we chose β and θ so that the empirical powers of LRTD and LRTZ were close to the specified powers (pD, pZ) with the type I error controlled at the level of 0.01. We set (pD, pZ) to (0.2, 0.4), (0.3, 0.3), (0.4, 0.2), (0.6, 0.01), or (0.01, 0.6) to represent different scenarios. For example, when (pD, pZ) = (0.2, 0.4), (0.3, 0.3), or (0.4, 0.2), the gene has moderate effects on both traits. When (pD, pZ) = (0.6, 0.01) or (0.01, 0.6), the gene influences only one trait. We used 10,000 resampling steps to evaluate p-values for MAPSopt and MAPScor.

In Table 2, we summarize the empirical powers, each of which is based on 1000 simulated datasets. We can see from the table that, when the same SNPs in a causal gene influence both traits, MAPSopt and MAPScor have the best performance among all considered tests. When the gene is associated with only one trait, MAPS0 and MAPSopt are the most robust among the variance component tests, with MAPS0 being slightly more powerful than MAPSopt because of its smaller model selection penalty. MAPScor is very sensitive to the underlying risk model. For example, its power is less than 1/3 of that of MAPSopt when the gene is associated only with the primary trait and the SNPs are in linkage equilibrium (r = 0). When the two traits are influenced by different causal SNPs, the method extended from Wu et al.1 is more powerful than the other methods if all SNPs are in linkage equilibrium (r = 0). When SNPs are moderately correlated (r = 0.6), MAPSopt is the most robust test in discovering composite gene association.

Table 2.

Empirical power comparison when causal SNPs are observed.

(r,β,θ) MAPSopt MAPScor MAPS0 MAPS0,1/2 Wu SKATD SKATZ LRTD LRTZ
Two risk models share the same causal SNPs
(0.0, 0.21, 0.18) 0.686 0.702 0.531 0.584 0.561 0.232 0.421 0.209 0.404
(0.6, 0.17, 0.15) 0.853 0.897 0.767 0.809 0.542 0.404 0.646 0.193 0.397
(0.0, 0.23, 0.17) 0.672 0.688 0.524 0.527 0.559 0.318 0.320 0.304 0.298
(0.6, 0.19, 0.14) 0.857 0.889 0.770 0.789 0.582 0.523 0.552 0.302 0.315
(0.0, 0.26, 0.15) 0.656 0.626 0.515 0.447 0.537 0.433 0.200 0.409 0.194
(0.6, 0.21, 0.12) 0.836 0.853 0.776 0.747 0.584 0.651 0.407 0.409 0.215
Gene associates with one trait
(0.0, 0.30, 0.00) 0.478 0.145 0.512 0.151 0.392 0.619 0.010 0.593 0.008
(0.6, 0.24, 0.00) 0.726 0.350 0.746 0.376 0.418 0.820 0.007 0.607 0.011
(0.0, 0.00, 0.22) 0.458 0.475 0.496 0.539 0.391 0.004 0.601 0.004 0.600
(0.6, 0.00, 0.18) 0.726 0.760 0.752 0.794 0.391 0.001 0.821 0.007 0.599
Two risk models contain different causal SNPs
(0.0, 0.21, 0.18) 0.492 0.522 0.533 0.576 0.580 0.246 0.409 0.227 0.395
(0.6, 0.17, 0.15) 0.774 0.804 0.758 0.803 0.555 0.384 0.651 0.187 0.404
(0.0, 0.23, 0.17) 0.479 0.464 0.517 0.515 0.570 0.317 0.307 0.301 0.302
(0.6, 0.19, 0.14) 0.779 0.778 0.758 0.762 0.569 0.524 0.547 0.293 0.307
(0.0, 0.26, 0.15) 0.480 0.398 0.515 0.453 0.561 0.445 0.190 0.409 0.191
(0.6, 0.21, 0.12) 0.766 0.738 0.760 0.727 0.564 0.638 0.381 0.411 0.217

In Figure 1, we show the optimal $(\rho,\kappa)$ corresponding to $\min_{\rho,\kappa} p_{\rho,\kappa}$ for each simulated dataset, in which SNPs in a gene are moderately correlated (r = 0.6), and genotypes at causal SNPs shared by the two risk models are directly observed. When MAPSopt detects a significant association in practice (e.g., p < 0.01), a selected κ that is very close to 0 or 1 suggests that the gene under study is likely associated with only one trait.

Figure 1.

ρ and κ selected by MAPSopt in simulation studies, in which two risk models share the causal SNPs that are directly observed. The SNPs in a gene are moderately correlated (r = 0.6). Solid points: p-values of MAPSopt ≤ 0.01. Circles: p-values of MAPSopt > 0.01.

We also compared the empirical powers among tests when the causal SNPs are not directly observed. The parameter r controlling the LD structure was set at 0.6. All other settings were similar to those used in previous simulation with full observations. The results are summarized in Table 3. MAPSopt again appears to have the most robust performance among all considered tests, especially when the gene is associated with both traits.

Table 3.

Empirical power comparison when causal SNPs are not directly observed (r = 0.6).

(β,θ) MAPSopt MAPScor MAPS0 MAPS0,1/2 Wu SKATD SKATZ LRTD LRTZ
Two risk models share the same causal SNPs
(0.27, 0.25) 0.780 0.803 0.674 0.724 0.558 0.295 0.566 0.205 0.401
(0.31, 0.23) 0.781 0.781 0.675 0.689 0.568 0.440 0.453 0.300 0.293
(0.34, 0.20) 0.791 0.765 0.679 0.639 0.551 0.572 0.325 0.402 0.201
Gene associates with one trait
(0.39, 0.00) 0.604 0.243 0.650 0.270 0.372 0.755 0.009 0.599 0.014
(0.00, 0.30) 0.613 0.626 0.635 0.686 0.397 0.003 0.738 0.011 0.598
Two risk models contain different causal SNPs
(0.34, 0.32) 0.593 0.630 0.573 0.639 0.542 0.252 0.464 0.199 0.389
(0.38, 0.30) 0.604 0.616 0.581 0.617 0.560 0.360 0.373 0.291 0.309
(0.42, 0.26) 0.620 0.588 0.594 0.578 0.566 0.473 0.273 0.398 0.213

4. Application to a genome-wide association study of prostate cancer

We demonstrated the application of MAPS as multi-locus tests by applying them to a genome-wide association study (GWAS) of prostate cancer. We focused on 2841 controls and 4544 cases of European ancestry.20 For each prostate cancer case, we used the Gleason score (2–10), which indicates how likely it is that a tumor will spread, as a quantitative trait. We hypothesized that there are genes influencing the mechanism underlying the development of prostate cancer, as well as how fast the tumor cells spread. By looking at the two traits jointly (i.e., prostate cancer status and Gleason score), we intend to increase our chance of detecting that type of gene.

Of the SNPs genotyped using the Illumina HumanOmni2.5 BeadChip, 1,531,807 passed standard quality control criteria.20 We extracted SNPs within 20 kb upstream and 20 kb downstream of a gene or an annotated region. SNPs with a missing rate > 2% or a MAF < 2% were excluded from the analyses. For two SNPs with LD coefficient $r^2 > 0.95$, the one with the smaller MAF was discarded. Both traits were adjusted for center, age, and two eigenvectors.
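The per-SNP exclusion rule can be expressed as a small filter. The Python sketch below encodes only the missing-rate and MAF thresholds stated above (the LD-based pruning step is not included); genotypes are assumed to be coded as 0, 1, or 2 with `None` for a missing call, and the function name `passes_qc` is hypothetical.

```python
def passes_qc(genotypes, max_missing=0.02, min_maf=0.02):
    """Return True if a SNP has missing rate <= 2% and MAF >= 2%.

    genotypes: list of allele counts (0/1/2) across subjects, None if missing.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return False
    missing_rate = 1 - len(observed) / len(genotypes)
    freq = sum(observed) / (2 * len(observed))   # frequency of the counted allele
    maf = min(freq, 1 - freq)                    # minor allele frequency
    return missing_rate <= max_missing and maf >= min_maf

# A SNP with 10 heterozygotes among 100 fully genotyped subjects (MAF 0.05).
keep = passes_qc([1] * 10 + [0] * 90)
```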

We will provide a more detailed report on the analysis of over 20,000 genes/regions elsewhere. Here, we are interested in the 69 genes with p-values of both SKATZ and SKATD less than 0.05, as tests analyzing two traits jointly are most likely to be beneficial for those genes. In Table 4, we show results for the nine genes for which at least one of the considered two-trait joint tests gave a gene-level p-value less than 0.001. Among those, KLK3 and CLDN11 are known risk genes associated with prostate cancer in populations of European ancestry.21,22 IRX4 has been identified as associated with the risk of prostate cancer only in the Japanese population.23,24 Although the other six genes have not been reported in GWAS as prostate cancer susceptibility genes, overexpression of PIAS3 is known to induce apoptosis in prostate cancer cells.25 The forest plots in Figures 2 and 3 illustrate the marginal effects of each SNP in genes LOC643201 and PIAS3 on the prostate cancer risk and the Gleason score. They show that several SNPs in either gene are associated with both traits. This is the main reason why the joint test approach appears to be more advantageous than the single-trait test approaches. We can consider the genes in Table 4 as promising candidates underlying the development of prostate cancer, although further replications are needed.

Table 4.

Suggestive genes with at least one gene-level p-value < $10^{-3}$ among the two-trait joint tests and with p-values of SKATD and SKATZ both < 0.05.

Gene MAPSopt MAPScor MAPS0 MAPS0,1/2 Wu SKATD SKATZ LRTD LRTZ
SENP6 3.0E–5 2.3E–4 3.7E–5 1.3E–3 2.9E–2 3.7E–5 4.8E–2 1.8E–2 2.8E–1
LOC643201 8.1E–5 5.0E–5 1.8E–4 2.6E–4 1.5E–3 5.4E–3 1.0E–3 2.1E–3 8.7E–2
KLK3 8.6E–5 3.8E–5 8.1E–5 4.5E–5 1.6E–4 3.7E–2 1.0E–4 2.0E–2 1.0E–3
PIAS3 9.6E–5 7.7E–5 2.8E–4 4.8E–4 6.3E–3 3.8E–3 2.2E–3 2.4E–2 4.8E–2
IRX4 3.9E–4 4.9E–4 9.9E–4 2.7E–3 2.0E–1 1.9E–3 2.1E–2 5.0E–2 7.0E–1
MRPS31 4.1E–4 2.1E–4 7.1E–4 4.4E–4 4.3E–2 4.0E–2 8.4E–4 4.0E–1 1.8E–2
CLDN11 5.3E–4 1.3E–3 5.6E–4 3.1E–3 2.7E–4 7.4E–4 3.5E–2 4.8E–4 5.8E–2
ZNF526 2.0E–3 1.7E–3 8.8E–4 7.3E–4 6.9E–3 1.5E–2 2.3E–3 3.4E–2 3.3E–2
AGXT2L1 6.0E–3 4.4E–3 8.4E–3 6.9E–3 1.7E–4 1.7E–2 2.0E–2 1.1E–3 1.9E–2

The values marked in bold are known susceptibility genes of the risk of prostate cancer.

Figure 2.

The forest plot of 16 SNPs within gene LOC643201.

Figure 3.

The forest plot of 12 SNPs within gene PIAS3.

5. Discussion

Although genetic association studies are typically designed to study one primary trait, valuable information on other secondary phenotypes is often collected. There is a growing interest in studying secondary phenotypes using already measured genotypes. Several approaches have been developed to identify genetic markers associated with secondary phenotypes, taking into account the design of the original study.26–30 The proposed method has a different goal. It analyzes the primary trait and a secondary phenotype jointly, aiming at detecting genes influencing both traits. The family of proposed tests is derived from a random effect model with two variance components, each representing the genetic effect on one trait. Among its various versions, we found that the one that uses observed data to adaptively model the variance–covariance matrix of genetic effects has the most robust performance. We demonstrated the application of the new method by applying it to a GWAS of prostate cancer and identified several promising novel regions that appeared to influence the risk and progression of prostate cancer. An R package of the proposed test is available at https://github.com/zhangh12/MAPS

It has been shown that multi-locus association tests can be a valuable alternative to the commonly used single-marker test. There are many multi-locus approaches for genetic association studies,6,7,31 most of which assume that there are no missing genotypes. As a result, genotype imputation is usually needed before using the multi-locus test. However, it is not a trivial task to impute the missing genotypes, and the imputation accuracy depends on the reference genome.11 The strategy for dealing with missing genotypes proposed with our test is more flexible and easier to use, as it does not require imputing missing genotypes. Also, it uses all observed genotypes, and thus is more efficient than the strategy that requires the removal of subjects with at least one missing genotype on the set of considered markers. This strategy is especially helpful for GWAS, where the genotype missing rate at a given SNP is very low. With some modifications, this strategy can be adapted to other multi-locus tests defined by score statistics.7,16,31

In our method, we use three parameters to model the variance–covariance matrix for the genetic effects on a primary trait and a secondary phenotype. To extend the method to study more than two secondary phenotypes, it is important to find an appropriate model for the variance–covariance matrix. Using too many parameters would increase the penalty for model selection and reduce efficiency of the multi-locus test. On the other hand, an over-simplified model can introduce bias into the testing procedure. Further investigations are needed to extend the proposed method to study more secondary phenotypes.

Acknowledgements

The authors would like to thank Professor Hua Liang at The George Washington University for his helpful comments. This study utilized the high-performance computational capabilities of the compute cluster at the Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland.

Appendix 1

p-value of variance component test with ρ = 0

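We derive the formula for the p-value of MAPS0. Let $\tilde{T}$ be the observed value of $T$, and let $q_\kappa$ be the upper $100 \times \tilde{T}\,\%$ quantile of $Q_{0,\kappa}$, where $Q_{0,\kappa}$ follows a central mixture chi-square null distribution with non-zero weights. Under the null, $\hat S_D^T \hat S_D$ and $\hat S_Z^T \hat S_Z$ follow central mixture chi-square distributions with corresponding weights.31 Therefore

$$\begin{aligned}
\Pr(T \le \tilde{T}) &= 1 - \Pr(T > \tilde{T}) \\
&= 1 - \Pr\left( Q_{0,\kappa} < q_\kappa \ \text{for all } \kappa \right) \\
&= 1 - \Pr\left( \hat S_D^T \hat S_D < \min_{\kappa > 0} \frac{q_\kappa - (1-\kappa)\, \hat S_Z^T \hat S_Z}{\kappa} \ \text{and} \ \hat S_Z^T \hat S_Z < q_0 \right) \\
&= 1 - \int_0^{q_0} F_D\!\left( \min_{\kappa > 0} \frac{q_\kappa - (1-\kappa)\, t}{\kappa} \right) f_Z(t)\, dt
\end{aligned}$$

where $F_D(\cdot)$ is the cumulative distribution function of $\hat S_D^T \hat S_D$ and $f_Z(\cdot)$ is the probability density function of $\hat S_Z^T \hat S_Z$.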

Asymptotic distribution of modified score allowing missing data

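We derive the modified score to deal with missing genotypes. By Taylor's theorem and the law of large numbers

$$\hat\ell^{j}_{\beta_j} \approx \ell^{j}_{\beta_j} - I_{\beta_j\alpha} I_{\alpha\alpha}^{-1} \ell^{j}_{\alpha}, \quad j = 1, 2, \ldots, m$$

where $I$ is the Fisher information matrix. Thus

$$\begin{aligned}
\operatorname{Cov}\left( \hat\ell^{j}_{\beta_j}, \hat\ell^{k}_{\beta_k} \right) &= \operatorname{Cov}\left( \ell^{j}_{\beta_j}, \ell^{k}_{\beta_k} \right) + I_{\beta_j\alpha} I_{\alpha\alpha}^{-1} \operatorname{Cov}\left( \ell^{j}_{\alpha}, \ell^{k}_{\alpha} \right) I_{\alpha\alpha}^{-1} I_{\alpha\beta_k} \\
&\quad - I_{\beta_j\alpha} I_{\alpha\alpha}^{-1} \operatorname{Cov}\left( \ell^{j}_{\alpha}, \ell^{k}_{\beta_k} \right) - \operatorname{Cov}\left( \ell^{j}_{\beta_j}, \ell^{k}_{\alpha} \right) I_{\alpha\alpha}^{-1} I_{\alpha\beta_k} \\
&= n_{jk} I_{\beta_j\beta_k} + I_{\beta_j\alpha} I_{\alpha\alpha}^{-1} (n_{jk} I_{\alpha\alpha}) I_{\alpha\alpha}^{-1} I_{\alpha\beta_k} - I_{\beta_j\alpha} I_{\alpha\alpha}^{-1} (n_{jk} I_{\alpha\beta_k}) - (n_{jk} I_{\beta_j\alpha}) I_{\alpha\alpha}^{-1} I_{\alpha\beta_k} \\
&= n_{jk} \left( I_{\beta_j\beta_k} - I_{\beta_j\alpha} I_{\alpha\alpha}^{-1} I_{\alpha\beta_k} \right)
\end{aligned}$$

The information terms $I_{\beta_j\beta_k}$, $I_{\beta_j\alpha}$, $I_{\alpha\alpha}$, and $I_{\alpha\beta_k}$ can be consistently estimated by $-n_{jk}^{-1} \ell^{jk}_{\beta_j\beta_k}(\hat\alpha_{jk}, 0, \hat\psi_{jk})$, $-n_j^{-1} \ell^{j}_{\beta_j\alpha}(\hat\alpha_j, 0, \hat\psi_j)$, $-n^{-1} \ell_{\alpha\alpha}(\hat\alpha, 0, \hat\psi)$, and $-n_k^{-1} \ell^{k}_{\alpha\beta_k}(\hat\alpha_k, 0, \hat\psi_k)$, respectively. Therefore, we can estimate the covariance between the scores of two genetic effects by

$$-\hat\ell^{jk}_{\beta_j\beta_k} + \frac{n\, n_{jk}}{n_j n_k}\, \hat\ell^{j}_{\beta_j\alpha}\, \hat\ell_{\alpha\alpha}^{-1}\, \hat\ell^{k}_{\alpha\beta_k}.$$

Note that here we use the most informative estimates of $(\alpha,\beta,\psi)$ by using as many samples as possible.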

Footnotes

Declaration of conflicting interests

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Wu CO, Zheng G and Kwak M. A joint regression analysis for genetic association studies with outcome stratified samples. Biometrics 2013; 69: 417–426.
2. Yang J, Ferreira T, Morris AP, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 2012; 44: 369–375.
3. Liu JZ, McRae AF, Nyholt DR, et al. A versatile gene-based test for genome-wide association studies. Am J Hum Genet 2010; 87: 139–145.
4. Huang H, Chanda P, Alonso A, et al. Gene-based tests of association. PLoS Genet 2011; 7: e1002177.
5. Li MX, Gui HS, Kwan JS, et al. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet 2011; 88: 283–293.
6. Zhang H, Wheeler W, Wang Z, et al. A fast and powerful tree-based association test for detecting complex joint effects in case-control studies. Bioinformatics 2014; 30: 2171–2178.
7. Wu MC, Kraft P, Epstein MP, et al. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 2010; 86: 929–942.
8. Browning BL and Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 2009; 84: 210–223.
9. Li Y, Willer CJ, Ding J, et al. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 2010; 34: 816–834.
10. Howie B, Fuchsberger C, Stephens M, et al. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 2012; 44: 955–959.
11. Wang Z, Jacobs KB, Yeager M, et al. Improved imputation of common and uncommon SNPs with a new reference set. Nat Genet 2012; 44: 6–7.
12. Lin X. Variance component testing in generalised linear models with random effects. Biometrika 1997; 84: 309–326.
13. Davies RB. The distribution of a linear combination of χ² random variables. J R Stat Soc C 1980; 29: 323–333.
14. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 1999; 86: 929–935.
15. Liu H, Tang Y and Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput Stat Data Anal 2009; 53: 853–856.
16. Zhang H, Shi J, Liang F, et al. A fast multilocus test with adaptive SNP selection for large-scale genetic-association studies. Eur J Hum Genet 2014; 22: 696–702.
17. Ge Y, Dudoit S and Speed TP. Resampling-based multiple testing for microarray data analysis. Test 2003; 12: 1–77.
18. Qin J and Zhang B. A goodness-of-fit test for logistic regression models based on case-control data. Biometrika 1997; 84: 609–618.
19. Wang T and Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet 2007; 80: 353–360.
20. Berndt SI, Wang Z, Yeager M, et al. Two susceptibility loci identified for prostate cancer aggressiveness. Nat Commun 2015; 6: 6889.
21. Eeles RA, Kote-Jarai Z, Giles GG, et al. Multiple newly identified loci associated with prostate cancer susceptibility. Nat Genet 2008; 40: 316–321.
22. Kote-Jarai Z, Olama AAA, Giles GG, et al. Seven novel prostate cancer susceptibility loci identified by a multi-stage genome-wide association study. Nat Genet 2011; 43: 785–791.
23. Takata R, Akamatsu S, Kubo M, et al. Genome-wide association study identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat Genet 2010; 42: 751–754.
24. Nakagawa H, Akamatsu S, Takata R, et al. Prostate cancer genomics, biology, and risk assessment through genome-wide association studies. Cancer Sci 2012; 103: 607–613.
25. Wible BA, Wang L, Kuryshev YA, et al. Increased K+ efflux and apoptosis induced by the potassium channel modulatory protein KChAP/PIAS3β in prostate cancer cells. J Biol Chem 2002; 277: 17852–17862.
26. Lin D and Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol 2009; 33: 256–265.
27. Monsees GM, Tamimi RM and Kraft P. Genome-wide association scans for secondary traits using case-control samples. Genet Epidemiol 2009; 33: 717–728.
28. He J, Li H, Edmondson AC, et al. A Gaussian copula approach for the analysis of secondary phenotypes in case-control genetic association studies. Biostatistics 2012; 13: 497–508.
29. Li H and Gail MH. Efficient adaptively weighted analysis of secondary phenotypes in case-control genome-wide association studies. Hum Hered 2012; 73: 159–173.
30. Schifano ED, Li L, Christiani DC, et al. Genome-wide association analysis for multiple continuous secondary phenotypes. Am J Hum Genet 2013; 92: 744–759.
31. Lee S, Wu MC and Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 2012; 13: 762–775.
