Author manuscript; available in PMC: 2014 May 1.
Published in final edited form as: Stat Med. 2012 Sep 17;32(9):1494–1508. doi: 10.1002/sim.5613

Bias correction to secondary trait analysis with case–control design

Hua Yun Chen a,*, Rick Kittles b, Wei Zhang c
PMCID: PMC4006579  NIHMSID: NIHMS554472  PMID: 22987618

Abstract

In genetic association studies with densely typed genetic markers, it is often of substantial interest to examine not only the primary phenotype but also the secondary traits for their association with the genetic markers. For more efficient sample ascertainment of the primary phenotype, a case–control design or its variants, such as the extreme-value sampling design for a quantitative trait, are often adopted. The secondary trait analysis without correcting for the sample ascertainment may yield a biased association estimator. We propose a new method aiming at correcting the potential bias due to the inadequate adjustment of the sample ascertainment. The method yields explicit correction formulas that can be used to both screen the genetic markers and rapidly evaluate the sensitivity of the results to the assumed baseline case-prevalence rate in the population. Simulation studies demonstrate good performance of the proposed approach in comparison with the more computationally intensive approaches, such as the compensator approaches and the maximum prospective likelihood approach. We illustrate the application of the approach by analysis of the genetic association of prostate specific antigen in a case–control study of prostate cancer in the African American population.

Keywords: extreme-value sampling design, semi-parametric likelihood, odds ratio model, sensitivity analysis

1. Introduction

Biased sample ascertainment, such as the case–control design for a qualitative trait or the extreme-value sampling design for a quantitative trait, is frequently used for its sampling efficiency in genome-wide association studies with densely typed genetic markers. Data on secondary traits are often collected in such studies, and it is of substantial interest to explore the association of genetic markers with a secondary trait. The secondary trait sample thus ascertained may be neither a random sample nor a case–control sample from the study population because the induced sampling design for the secondary trait potentially depends on the genotype and the environmental factor in addition to the secondary trait itself. As a result, analysis of the secondary trait by the traditional prospective logistic regression [1–4] or its extensions [5–8] may yield a biased estimator of the true genetic association. This problem has been studied previously in the context of reusing case–control data [9–12]. It was further considered in the secondary trait analysis in [13–18].

Methods proposed in the literature for the analysis of secondary traits each have limitations. The secondary trait analysis using controls only requires a low case-prevalence rate of the primary phenotype in the general population and is inefficient. Analyses that ignore or condition on the primary phenotype [11–13], although simple, may yield a biased estimator of the genetic association. The method viewing the case–control sample of the primary phenotype as the second-stage sample in a two-stage sampling design [10–13, 16] is applicable only when additional data on the cohort sample wherein the case–control design is nested are available. The retrospective likelihood approach based on a joint model of the primary and secondary traits [15, 19] performs well when supplementary information in terms of the population prevalence rate of the primary phenotype is known. However, when the population prevalence rate of the primary phenotype is unknown or known inaccurately, the retrospective likelihood approach can be computationally difficult to implement [20]. In addition, the use of the known population prevalence rate can be questionable when pre-selection of subjects through eligibility criteria is involved in the case–control sample ascertainment. The bias correction approach in [18] requires knowledge of the population prevalence rates for both the primary phenotype and the secondary trait.

In this article, we propose a new approach to bias correction in the secondary trait analysis. Our approach yields closed-form formulas for correcting bias in the marginal or conditional analysis of secondary traits. The proposed approach first performs a supplemental analysis of the primary phenotype given the secondary trait, the genetic marker, and possible environmental factors. The correction can then be carried out on the basis of the estimates from the supplemental analysis using the derived explicit formulas. The simple correction formulas enable us to rapidly perform the sensitivity analysis to the baseline case-prevalence rate for tens of thousands of genetic markers. We also propose more accurate but computationally slower compensator approaches and the maximum prospective likelihood approach with fixed baseline case-prevalence rate. We can use these approaches to validate the fast approximation approach proposed.

2. Correction to conditional analysis

2.1. Distribution of sample ascertained by the case–control design

Consider the secondary trait as a binary phenotype Y. Assume that the association of the secondary trait Y with the genetic marker G, the environmental factor E, and possibly the interaction of G and E in the study population is described by the logistic regression model

$$p(Y=1\mid G,E)=\frac{\exp(\theta_0+\theta_1G+\theta_2E+\theta_{12}GE)}{1+\exp(\theta_0+\theta_1G+\theta_2E+\theta_{12}GE)}. \tag{1}$$

Define the odds ratio function between Y and (G, E) as

$$\eta_2\{Y;(G,E)\}=\exp(\theta_1YG+\theta_2YE+\theta_{12}YGE). \tag{2}$$

We can rewrite the logistic regression model as

$$p(Y\mid G,E)=\frac{\eta_2\{Y;(G,E)\}\exp(Y\theta_0)/\{1+\exp(\theta_0)\}}{\sum_{Y=0}^{1}\eta_2\{Y;(G,E)\}\exp(Y\theta_0)/\{1+\exp(\theta_0)\}}, \tag{3}$$

where exp(Yθ0)/{1 + exp(θ0)} = P(Y|G = 0, E = 0). The parameters (θ1, θ12) characterize the association between the secondary trait and the genetic marker. We can test for no genetic association by testing (θ1, θ12) = (0, 0).
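As a quick numerical check that the odds-ratio form (3) is just another way of writing the logistic model (1), the following minimal Python sketch evaluates both. The function names and the parameter values are illustrative assumptions, not part of the original analysis.

```python
import math

def eta2(y, g, e, theta1, theta2, theta12):
    """Odds ratio function between Y and (G, E), as in (2)."""
    return math.exp(theta1 * y * g + theta2 * y * e + theta12 * y * g * e)

def p_y1_logistic(g, e, theta0, theta1, theta2, theta12):
    """P(Y = 1 | G, E) from the logistic model (1)."""
    lin = theta0 + theta1 * g + theta2 * e + theta12 * g * e
    return math.exp(lin) / (1.0 + math.exp(lin))

def p_y_odds_ratio_form(y, g, e, theta0, theta1, theta2, theta12):
    """P(Y = y | G, E) written in the odds-ratio form (3)."""
    def term(v):
        # baseline distribution P(Y = v | G = 0, E = 0)
        baseline = math.exp(v * theta0) / (1.0 + math.exp(theta0))
        return eta2(v, g, e, theta1, theta2, theta12) * baseline
    return term(y) / (term(0) + term(1))

# Hypothetical parameter values, chosen only for illustration.
params = dict(theta0=-1.0, theta1=0.5, theta2=-0.5, theta12=-0.3)
lhs = p_y_odds_ratio_form(1, 2, 0.7, **params)
rhs = p_y1_logistic(2, 0.7, **params)
```

The two values agree up to floating-point error, because the baseline factor 1/{1 + exp(θ0)} cancels between the numerator and the denominator of (3).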

In the analysis of a secondary trait, the sample on the secondary trait is not a random sample from the study population when the sample is ascertained by the case–control status of the primary phenotype. It also differs from a case–control sample ascertained on the status of the secondary trait. We need to carefully consider the sample ascertainment mechanism to avoid bias in the analysis. This requires us to take the primary phenotype D into consideration in the analysis of the secondary trait. Assume that the primary phenotype given the secondary trait, the genotype, and the environmental factor follows the logistic regression model

$$P(D=1\mid Y,G,E)=\frac{\exp(\gamma_0+\gamma_{01}Y+\gamma_1G+\gamma_2E+\gamma_{12}GE)}{1+\exp(\gamma_0+\gamma_{01}Y+\gamma_1G+\gamma_2E+\gamma_{12}GE)}. \tag{4}$$

Similarly, define the odds ratio function between D and (Y, G, E) as

$$\eta_1\{D;(Y,G,E)\}=\exp(\gamma_{01}DY+\gamma_1DG+\gamma_2DE+\gamma_{12}DGE). \tag{5}$$

We can rewrite the logistic regression model as

$$p(D\mid Y,G,E)=\frac{\eta_1\{D;(Y,G,E)\}\exp(D\gamma_0)/\{1+\exp(\gamma_0)\}}{\sum_{D=0}^{1}\eta_1\{D;(Y,G,E)\}\exp(D\gamma_0)/\{1+\exp(\gamma_0)\}}, \tag{6}$$

where exp(Dγ0)/{1 + exp(γ0)} = P(D|Y = 0, G = 0, E = 0).

Let S = 1 denote a subject being included in the case–control sample. We model the sampling probability as

$$P(S=1\mid D,Y,G,E)=\pi_1(D)\,\pi_2(Y)\,\pi_3(G,E), \tag{7}$$

which is a little more general than the conventional case–control design. For example, it accommodates pre-selection of subjects on the basis of (G, E), which frequently happens in practice. However, using the case-prevalence rate in the general population as supplemental information in the retrospective likelihood approach under this general design [15] may lead to bias in the parameter estimation. The sampling design reduces to case–control ascertainment when π2(Y) = π3(G, E) = 1. Assume that πk, k = 1, 2, 3 are unknown. The sample ascertained by such a scheme has (D, Y) jointly distributed as

$$p(D,Y\mid G,E,S=1)=\frac{\pi_1(D)\,p(D\mid Y,G,E)\,\pi_2(Y)\,p(Y\mid G,E)}{\sum_{D}\sum_{Y}\pi_1(D)\,p(D\mid Y,G,E)\,\pi_2(Y)\,p(Y\mid G,E)}. \tag{8}$$

Under the sampling design (7), Chen [20] showed that maximizing this prospective joint likelihood yields an association parameter estimator identical to that from maximizing the retrospective likelihood [15] on the basis of p(Y, G, E|D). Plugging (3) and (6) into the right-hand side of expression (8), we obtain

$$p(D,Y\mid G,E,S=1)=\frac{\eta_1\{D;(Y,G,E)\}\,T(Y,G,E)\,\eta_2\{Y;(G,E)\}\,q_1(D)\,q_2(Y)}{\sum_{D}\sum_{Y}\eta_1\{D;(Y,G,E)\}\,T(Y,G,E)\,\eta_2\{Y;(G,E)\}\,q_1(D)\,q_2(Y)}, \tag{9}$$

where q1(D) = p(D|Y = 0, G = 0, E = 0, S = 1), q2(Y) = p(Y|D = 0, G = 0, E = 0, S = 1), and

$$T(Y,G,E)=\{1+\exp(\gamma_0+\gamma_{01}Y+\gamma_1G+\gamma_2E+\gamma_{12}GE)\}^{-1}. \tag{10}$$

Directly maximizing the likelihood on the basis of (9) is computationally difficult because it involves the intercept γ0, which is weakly identifiable [20]. In Appendix A, we show that T (Y, G, E) in (9) can be replaced by the odds ratio function

$$\eta_3\{Y;(G,E)\}=\exp\{-\tau^2\gamma_{01}(\gamma_1YG+\gamma_2YE+\gamma_{12}YGE)\}, \tag{11}$$

where (γ01, γ1, γ2, γ12) are the odds ratio parameters in (5) and τ² = r0(1 − r0), with r0 = {1 + exp(−γ0)}⁻¹ being the baseline prevalence rate of the primary phenotype. We give more general formulas in Appendix A for a non-zero reference point for (D, Y, G, E) and for a quantitative primary and/or secondary trait.
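To get a feel for how close the approximation (11) is, the sketch below compares the exact log odds ratio carried by T(Y, G, E) between Y and (G, E), taken relative to the reference point (0, 0, 0), with its approximation −τ²γ01(γ1G + γ2E + γ12GE). The function names and the parameter values are hypothetical and serve only as an illustration; the formal derivation is the one in Appendix A.

```python
import math

def log_T(y, g, e, gamma0, gamma01, gamma1, gamma2, gamma12):
    """log T(Y, G, E) with T defined as in (10)."""
    lin = gamma0 + gamma01 * y + gamma1 * g + gamma2 * e + gamma12 * g * e
    return -math.log1p(math.exp(lin))

def exact_log_or_of_T(g, e, **gam):
    """Exact log odds ratio of T between Y (1 vs 0) and (G, E) vs (0, 0)."""
    return (log_T(1, g, e, **gam) - log_T(0, g, e, **gam)
            - log_T(1, 0, 0, **gam) + log_T(0, 0, 0, **gam))

def approx_log_eta3(g, e, gamma0, gamma01, gamma1, gamma2, gamma12):
    """Approximate log odds ratio from (11), evaluated at Y = 1."""
    r0 = 1.0 / (1.0 + math.exp(-gamma0))
    tau2 = r0 * (1.0 - r0)
    return -tau2 * gamma01 * (gamma1 * g + gamma2 * e + gamma12 * g * e)

# Hypothetical values: baseline prevalence 10% and modest effects.
gam = dict(gamma0=math.log(0.1 / 0.9), gamma01=0.2,
           gamma1=0.1, gamma2=0.0, gamma12=0.0)
exact = exact_log_or_of_T(1, 0, **gam)
approx = approx_log_eta3(1, 0, **gam)
```

With these values the approximation equals −0.09 × 0.2 × 0.1 = −0.0018, and the exact quantity is of the same sign and similar size. Larger effects or a reference point far from the data would be expected to widen the gap, which is one reason the choice of reference point matters.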

2.2. Correction formula for the conditional analysis

When we analyze the secondary trait conditional on the primary phenotype, the distribution of the secondary trait from (9) approximately has the form

$$p(Y\mid D,G,E,S=1)=\frac{\eta_1\{D;(Y,G,E)\}\,\eta_3\{Y;(G,E)\}\,\eta_2\{Y;(G,E)\}\,q_2(Y)}{\sum_{Y}\eta_1\{D;(Y,G,E)\}\,\eta_3\{Y;(G,E)\}\,\eta_2\{Y;(G,E)\}\,q_2(Y)}, \tag{12}$$

which we can rewrite as

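$$p(Y\mid D,G,E,S=1)=\frac{\exp\{Y(\theta_0^{*}+\gamma_{01}D+\psi_1G+\psi_2E+\psi_{12}GE)\}}{1+\exp(\theta_0^{*}+\gamma_{01}D+\psi_1G+\psi_2E+\psi_{12}GE)}, \tag{13}$$

where $\theta_0^{*}=\theta_0+\log\{\pi_2(1)/\pi_2(0)\}$, and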

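$$\begin{pmatrix}\psi_1\\ \psi_2\\ \psi_{12}\end{pmatrix}\approx\begin{pmatrix}\theta_1\\ \theta_2\\ \theta_{12}\end{pmatrix}-\tau^2\gamma_{01}\begin{pmatrix}\gamma_1\\ \gamma_2\\ \gamma_{12}\end{pmatrix}. \tag{14}$$

From (14), we can obtain a correction formula for the conditional analysis as

$$\begin{pmatrix}\theta_1\\ \theta_2\\ \theta_{12}\end{pmatrix}\approx\begin{pmatrix}\psi_1\\ \psi_2\\ \psi_{12}\end{pmatrix}+\tau^2\gamma_{01}\begin{pmatrix}\gamma_1\\ \gamma_2\\ \gamma_{12}\end{pmatrix}. \tag{15}$$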

This approximation suggests that the magnitude of bias in the conditional analysis is influenced by the baseline variance of the primary phenotype (determined by the baseline prevalence rate), the strength of association between the primary and the secondary traits, and the strength of association between the primary phenotype and the genetic marker and the environmental factor. The conditional analysis yields approximately correct estimates of association between the secondary trait and the genetic marker in three situations: when the primary phenotype is rare in the study population (τ² ≈ 0); when no association exists between the primary phenotype and the secondary trait (γ01 = 0) given the genetic marker and the environmental factor; or when no association exists between the primary phenotype and the genetic marker ((γ1, γ12) = (0, 0)) after controlling for the secondary trait and the environmental factor.

We can carry out the correction in the following way. First, we run a logistic regression analysis of Y given D, G, and E to obtain estimates of (ψ1, ψ2, ψ12). We then run a supplemental logistic regression of D given Y, G, and E to obtain estimates of (γ01, γ1, γ2, γ12). When τ² is known, we can compute the corrected estimates of (θ1, θ2, θ12) directly from (15). When τ² is unknown or known inaccurately, we first choose a plausible range for τ², compute the estimates of (θ1, θ2, θ12) at different τ² values in that range, and plot the estimates against τ² to obtain a sensitivity plot. We show in Appendix A that the correction formula (15) is also applicable when either or both the primary phenotype and the secondary trait are quantitative.
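The correction step itself is a one-line computation once the two regressions have been fitted. The following Python sketch applies the correction θ ≈ ψ + τ²γ01γ over a grid of assumed baseline prevalence rates; the function names and the numerical estimates are hypothetical placeholders for the output of the two logistic regressions, and the variance calculation of Appendix A is not included.

```python
def correct_conditional(psi, gamma01, gamma, tau2):
    """Corrected (theta1, theta2, theta12) given conditional-analysis estimates
    psi, supplemental-analysis estimates gamma01 and gamma, and tau2 = r0(1 - r0)."""
    return [p + tau2 * gamma01 * g for p, g in zip(psi, gamma)]

def sensitivity_grid(psi, gamma01, gamma, r0_values):
    """Corrected estimates for each assumed baseline prevalence rate r0."""
    rows = []
    for r0 in r0_values:
        tau2 = r0 * (1.0 - r0)
        rows.append((r0, correct_conditional(psi, gamma01, gamma, tau2)))
    return rows

# Hypothetical estimates, for illustration only.
psi_hat = [0.40, -0.55, -0.35]      # from the regression of Y on D, G, E
gamma01_hat = 1.0                   # coefficient of Y in the regression of D
gamma_hat = [0.5, 0.5, 0.3]         # coefficients of G, E, GE in the regression of D

grid = sensitivity_grid(psi_hat, gamma01_hat, gamma_hat, [0.0, 0.05, 0.10])
for r0, theta in grid:
    print(r0, [round(t, 4) for t in theta])
```

At r0 = 0 the corrected and uncorrected estimates coincide, which matches the rare-disease case discussed above. Because each grid point costs only a few multiplications, the same loop can be repeated over a large number of markers.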

Although the correction to the conditional analysis is easy to obtain, obtaining the variance estimate can be more involved. We can combine the two logistic regressions into a single joint analysis based on (9) with T (Y, G, E) replaced by (11) so that we can obtain the variance by routine calculation in a maximum likelihood approach. The joint analysis yields estimates of (ψ1, ψ2, ψ12) and (γ01, γ1, γ2, γ12) simultaneously, and we can then obtain the association parameter (θ1, θ2, θ12) by the correction formula (15). We give the detailed variance estimate for the parameter estimator in Appendix A.

2.3. Testing no association between the secondary trait and a genetic marker

Let θ(τ) denote θ for a given τ through the formula (15). Testing no association between the secondary trait and a genetic marker corresponds to testing H0: θ1(τ0) = θ12(τ0) = 0, where τ0 is fixed at the true value. When τ0 is known, we can use a χ² test statistic with two degrees of freedom:

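$$Q(\tau_0)=n\{\hat\theta(\tau_0)-\theta(\tau_0)\}^{T}c^{T}\{c\hat V(\tau_0)c^{T}\}^{-1}c\{\hat\theta(\tau_0)-\theta(\tau_0)\}\sim\chi^2_2,$$

where $\hat V(\tau_0)$ is the variance estimate for $\hat\theta(\tau_0)$ and

$$c=\begin{pmatrix}1&0&0\\ 0&0&1\end{pmatrix}.$$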

For the case with unknown or inaccurately known τ0, we can perform a sensitivity analysis on the basis of the χ² statistic. Alternatively, we can perform a test based on the minimal statistic in the following way. Assume that τ0 is in [τa, τb]. For example, if the primary phenotype is binary, then r0 ∈ (0, 1/2] and [τa, τb] ⊂ (0, 1/4]. We set the nominal p-value as

$$1-\min_{\tau\in[\tau_a,\tau_b]}\chi^2_2\{Q(\tau)\},$$

where $\chi^2_{df}$ denotes the chi-square distribution function with df degrees of freedom. Because the chi-square distribution function is monotone, we can calculate the p-value as $1-\chi^2_2\{\min_{\tau\in[\tau_a,\tau_b]}Q(\tau)\}$. The actual type I error α for a nominal type I error p for individual tests is

$$\alpha=P\Bigl\{\min_{\tau\in[\tau_a,\tau_b]}Q(\tau)>(\chi^2_2)^{-1}(1-p)\,\Big|\,H_0\Bigr\}\le P\bigl\{Q(\tau_0)>(\chi^2_2)^{-1}(1-p)\,\big|\,H_0\bigr\}=p.$$
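The minimal-statistic test can be sketched in a few lines of Python. The sketch assumes that the Wald statistic Q(τ) has already been evaluated on a grid of τ values, and it uses the fact that the chi-square distribution with two degrees of freedom has the closed-form survival function exp(−q/2). The helper names and the numbers are illustrative assumptions.

```python
import math

def chi2_2df_sf(q):
    """P(X > q) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-q / 2.0)

def wald_statistic_2d(est, cov):
    """est^T cov^{-1} est for a 2-vector est and its 2x2 covariance matrix,
    with the 2x2 inverse written out explicitly."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    x, y = est
    return (d * x * x - (b + c) * x * y + a * y * y) / det

def min_test_pvalue(q_values):
    """Nominal p-value based on the smallest Q(tau) over the grid."""
    return chi2_2df_sf(min(q_values))

# Hypothetical (theta1, theta12) estimate and covariance at one tau value.
q_one = wald_statistic_2d((0.5, -0.3), [[0.016, 0.0], [0.0, 0.012]])

# Hypothetical Q(tau) values over a grid of tau.
q_grid = [q_one, 10.0, 12.0]
p_min = min_test_pvalue(q_grid)
```

Since the p-value uses the smallest statistic on the grid, it is never smaller than the p-value at any single τ, which is the source of the conservativeness discussed next.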

This approach to controlling type I error can be too conservative when the range of [τa, τb] is large. In this case, we may choose a prior distribution for τ0 and apply an alternative approach as follows. Suppose that τ has a normal distribution with mean τ0 and variance ω². It then follows that

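$$\sqrt{n}\{\hat\theta(\tau_0)-\theta(\tau_0)\}\to N\{0,\,V(\tau_0)+v_1\},$$

where $v_1=n\omega^2\gamma_{01}^2$. We can carry out a test for no association between the genetic marker and the secondary trait by using

$$n\{\hat\theta(\tau_0)-\theta(\tau_0)\}^{T}c^{T}\bigl[c\{\hat V(\tau_0)+\hat v_1\}c^{T}\bigr]^{-1}c\{\hat\theta(\tau_0)-\theta(\tau_0)\}\sim\chi^2_2.$$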

3. Other correction approaches

3.1. Correction to the marginal analysis

When the marginal analysis for the secondary trait is used, it follows from (9) that the correct model is

$$p(Y\mid G,E,S=1)=\frac{W(Y,G,E)\exp(Y\theta_0+\theta_1YG+\theta_2YE+\theta_{12}YGE)}{\sum_{Y}W(Y,G,E)\exp(Y\theta_0+\theta_1YG+\theta_2YE+\theta_{12}YGE)}, \tag{16}$$

where

$$W(Y,G,E)=\frac{1+\exp(\gamma_0^{*}+\gamma_{01}Y+\gamma_1G+\gamma_2E+\gamma_{12}GE)}{1+\exp(\gamma_0+\gamma_{01}Y+\gamma_1G+\gamma_2E+\gamma_{12}GE)}, \tag{17}$$

and $\gamma_0^{*}=\gamma_0+\log\{\pi_1(1)/\pi_1(0)\}$. By a method similar to that used in obtaining the replacement η3{Y; (G, E)} for T(Y, G, E), we can obtain a replacement for W(Y, G, E) in (16) as

$$\eta_3^{*}\{Y;(G,E)\}=\exp\{(\tau^{*2}-\tau^2)\gamma_{01}(\gamma_1YG+\gamma_2YE+\gamma_{12}YGE)\}, \tag{18}$$

where $\tau^{*2}=r^{*}(1-r^{*})$ is the baseline variance of the sample and $r^{*}=\{1+\exp(-\gamma_0^{*})\}^{-1}$. We can estimate the parameters ($\gamma_0^{*}$, γ1, γ2, γ12) by fitting a logistic regression model to the case–control sample, with $\gamma_0^{*}$ being the intercept of the regression model. An explicit correction formula for the association of the genetic marker and the secondary trait is

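$$\begin{pmatrix}\theta_1\\ \theta_2\\ \theta_{12}\end{pmatrix}\approx\begin{pmatrix}\psi_1\\ \psi_2\\ \psi_{12}\end{pmatrix}+(\tau^2-\tau^{*2})\gamma_{01}\begin{pmatrix}\gamma_1\\ \gamma_2\\ \gamma_{12}\end{pmatrix}. \tag{19}$$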

We give the variance estimate of the corrected estimator in Appendix A.
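The marginal correction can be sketched in the same way as the conditional one. The sketch below reads (19) as θ ≈ ψ + (τ² − τ*²)γ01γ, which is the sign implied by multiplying the secondary trait model by the replacement exp{(τ*² − τ²)γ01(·)} for W; that reading, the function names, and all numerical values are assumptions made for illustration.

```python
import math

def baseline_variance(intercept):
    """r(1 - r) with r = 1 / (1 + exp(-intercept))."""
    r = 1.0 / (1.0 + math.exp(-intercept))
    return r * (1.0 - r)

def correct_marginal(psi, gamma01, gamma, gamma0, gamma0_star):
    """Corrected theta from marginal-analysis estimates psi.

    gamma0      : population intercept of the primary phenotype model (assumed known)
    gamma0_star : intercept estimated from the case-control sample
    """
    tau2 = baseline_variance(gamma0)
    tau2_star = baseline_variance(gamma0_star)
    return [p + (tau2 - tau2_star) * gamma01 * g for p, g in zip(psi, gamma)]

# Hypothetical inputs: population baseline prevalence 10%, sample baseline 50%.
theta_hat = correct_marginal(
    psi=[0.58], gamma01=1.0, gamma=[0.5],
    gamma0=math.log(0.1 / 0.9), gamma0_star=0.0,
)
```

Here τ² = 0.09 and τ*² = 0.25, so the marginal estimate 0.58 is pulled down by 0.16 × 0.5 = 0.08. This direction agrees with the positive bias of the uncorrected marginal analysis when the primary phenotype is rare in the population but heavily oversampled.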

3.2. Correction approaches using compensators

We can rewrite expression (16) as

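$$p(Y\mid G,E,S=1)=\frac{\exp[Y\{\tilde\theta_0(G,E)+\theta_1G+\theta_2E+\theta_{12}GE\}]}{1+\exp\{\tilde\theta_0(G,E)+\theta_1G+\theta_2E+\theta_{12}GE\}}, \tag{20}$$

where

$$\tilde\theta_0(G,E)=\theta_0+\log\frac{W(1,G,E)}{W(0,G,E)}. \tag{21}$$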

If we know W(Y, G, E), (20) and (21) suggest an alternative approach to obtain an estimator of θ through a logistic regression with a known compensator log{W(1, G, E)/W(0, G, E)}. We can estimate the parameters ($\gamma_0^{*}$, γ1, γ2, γ12) in the compensator by fitting a prospective logistic regression model to the case–control sample. When the parameter γ0 is known, we can straightforwardly compute the compensator. We can then fit a logistic regression model based on (20) with a known compensator to the data to obtain the estimator of θ directly. When γ0 is unknown or known inaccurately, we can perform a sensitivity analysis of the association parameter of the secondary trait with the genetic marker to the assumed γ0 values based on (20) and (21).

We can carry out a similar compensator approach by the conditional analysis based on

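$$p(Y\mid D,G,E,S=1)=\frac{\exp[Y\{\tilde\theta_0(G,E)+\gamma_{01}D+\theta_1G+\theta_2E+\theta_{12}GE\}]}{1+\exp\{\tilde\theta_0(G,E)+\gamma_{01}D+\theta_1G+\theta_2E+\theta_{12}GE\}}, \tag{22}$$

where

$$\tilde\theta_0(G,E)=\theta_0+\log\frac{T(1,G,E)}{T(0,G,E)}. \tag{23}$$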

We can estimate the parameters (γ01, γ1, γ2, γ12) in the compensator log{T(1, G, E)/T(0, G, E)} by fitting a logistic regression model to the case–control sample. When the other parameter γ0 is known, we can straightforwardly compute the compensator. When γ0 is unknown or known inaccurately, we can perform a sensitivity analysis of the association parameters to the assumed γ0 values. We can likewise perform a joint analysis with a compensator.
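As an illustration of the conditional compensator, the following Python sketch computes log{T(1, G, E)/T(0, G, E)} for a subject, with T(Y, G, E) = {1 + exp(γ0 + γ01Y + γ1G + γ2E + γ12GE)}⁻¹, and repeats the computation over several assumed values of γ0. The function name and all numbers are hypothetical; in practice the resulting values would be supplied as a fixed offset to a logistic regression routine that supports offsets.

```python
import math

def compensator_T(g, e, gamma0, gamma01, gamma1, gamma2, gamma12):
    """log{T(1, G, E) / T(0, G, E)} for given covariates and parameters."""
    lin0 = gamma0 + gamma1 * g + gamma2 * e + gamma12 * g * e
    # log T(y, g, e) = -log(1 + exp(lin0 + gamma01 * y))
    return math.log1p(math.exp(lin0)) - math.log1p(math.exp(lin0 + gamma01))

# Hypothetical odds ratio parameter estimates from the supplemental regression.
est = dict(gamma01=1.0, gamma1=0.5, gamma2=0.5, gamma12=0.3)

# Sensitivity of the compensator to the assumed baseline prevalence rate r0.
offsets = {}
for r0 in (0.01, 0.05, 0.10, 0.15):
    gamma0 = math.log(r0 / (1.0 - r0))
    offsets[r0] = compensator_T(1, 0.5, gamma0=gamma0, **est)
```

When the primary phenotype is very rare the compensator is close to zero, so the offset barely changes the fit. It grows in magnitude as the assumed prevalence increases.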

Methods based on compensators along with the maximum joint prospective likelihood approach based on (9) with γ0 fixed can yield more accurate estimators. In terms of sensitivity analysis, however, they require much more computation, and the inference can be more difficult to carry out because of lack of explicit functional relationship between the association parameters and the baseline prevalence rate for the primary phenotype. As a result, we may use the compensator approaches and/or the maximum likelihood approach jointly with the approximation approach as follows. First, we may use the approximation approach to screen a large number of genetic markers in a genome-wide association study. We may then further study a relatively small number of identified markers by the more accurate compensator approaches or the maximum likelihood approach.

4. Simulation studies

In the first simulation study, we evaluate the impact of different choices of the reference point (Appendix A) in the approximation approaches. We simulated a binary primary phenotype and a binary secondary trait. The model for generating the primary phenotype is (4), and the model for generating the secondary trait is (1). Neither includes the environment factor E. We simulate genetic markers as single-nucleotide polymorphisms (SNPs) following the Hardy–Weinberg equilibrium with a minor allele frequency (MAF) of 20%. However, we do not make this assumption in the analysis of the simulated data. In model (4), we assume the effect of the secondary trait on the primary phenotype to be relatively large (γ01 = 1) because it enables us to assess the performance of the explicit correction formulas. The genetic effect is multiplicative, and the effect size is set to γ1 = 0.5 and θ1 = 0.5. We simulated the baseline prevalence rate ranging from r0 = 0.001 to 0.5 with an increment of 0.01. For each r0 value, we simulated 5000 repetitions of a sample of size 1200 with 600 cases and 600 controls.
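The data-generating scheme described above can be sketched as follows. This is a minimal illustration rather than the simulation code used for the reported results: the function name, the seed, and in particular the secondary trait intercept θ0 (not stated for this first study, and set here to a 10% baseline prevalence) are assumptions.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_case_control(r0, n_cases=600, n_controls=600, maf=0.2,
                          theta0=math.log(0.1 / 0.9), theta1=0.5,
                          gamma01=1.0, gamma1=0.5, seed=0):
    """Draw subjects from the population until the case and control quotas are filled.

    theta0 is a hypothetical secondary trait intercept (an assumption).
    Returns a list of (D, Y, G) tuples.
    """
    rng = random.Random(seed)
    gamma0 = math.log(r0 / (1.0 - r0))
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # genotype under Hardy-Weinberg equilibrium: two independent alleles
        g = int(rng.random() < maf) + int(rng.random() < maf)
        # secondary trait from model (1), without E
        y = int(rng.random() < expit(theta0 + theta1 * g))
        # primary phenotype from model (4), without E
        d = int(rng.random() < expit(gamma0 + gamma01 * y + gamma1 * g))
        if d == 1 and len(cases) < n_cases:
            cases.append((d, y, g))
        elif d == 0 and len(controls) < n_controls:
            controls.append((d, y, g))
    return cases + controls

sample = simulate_case_control(r0=0.1, n_cases=50, n_controls=50)
```

For very small r0 the loop needs many population draws to fill the case quota, so a full replication of the study design would likely sample cases and controls from their conditional distributions instead.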

We computed three sets of corrections. In the first set, (Y0, G0) = (0, 0). In the second set, (Y0, G0) is the sample mean. In the third set, (Y0, G0) = (0.5, 1), which is the middle point of the ranges of the variables. We display the simulation results in Figure 1. We see from Figure 1a that the conditional analysis has little bias when the baseline prevalence rate r0 is very small, but the bias increases very rapidly as r0 goes from 1% to 20% and remains at a similar magnitude for r0 between 20% and 50%. For the marginal analysis, the bias is very large when r0 is very small, decreases to nearly zero as r0 increases from 1% to 20%, and mostly stays at the same level for r0 between 20% and 50%. Figure 1b–d respectively shows the corrections with (Y0, G0) = (0, 0), (Y0, G0) being the sample mean, and (Y0, G0) = (0.5, 1). We can see that bias is reduced under all three choices. However, the best choice of the reference point appears to be the middle point of the ranges of the variables. We repeated the simulation with MAF being 30% and 10% and with different genetic effects; the relative merit (results not shown) remains the same.

Figure 1. Conditional and marginal analyses and the correction estimators.

In the second simulation study, we used the same data-generation models but added an environment variable and the gene–environment interaction to both models. The simulated environment factor is continuous and uniformly distributed on (−2, 2). We fixed the log-odds ratios for the genetic, environmental, and gene–environment interaction effects in the primary phenotype model at (0.5, 0.5, 0.3). We fixed the log-odds ratios for the genetic, environmental, and gene–environment interaction effects in the secondary trait model at (0.5, −0.5, −0.3). We fixed the baseline prevalence rate of the secondary trait at approximately 10%. The simulated baseline prevalence rates for the primary phenotype are 1%, 5%, 10%, and 15%, which are in the range of rates mostly encountered in practice. We use several methods to analyze the simulated data. They include the analysis using controls only; the marginal and conditional analyses ignoring the biased sampling design; the corrected marginal, conditional, and joint analyses using the explicit correction formulas; and the marginal, the conditional, and the joint correction estimators using the compensators. The joint correction estimator refers to the maximization of the likelihood based on (9) with γ0 in T fixed at the truth. We take the reference point used in the correction analysis at the average value of each variable. Note that the analysis under the rare case-prevalence assumption is the same as the conditional analysis that ignores the biased sampling design.

We list the simulation results based on 5000 repetitions of a sample of size 1200 with 600 cases and 600 controls in Table I. Because the conditional analyses yield results very close to those of the joint analysis, we do not include results on the conditional analyses in Table I. The analyses presented in Table I include the analysis based on controls only (Controls), marginal analysis (Marginal), conditional analysis based on the odds ratio model without correction (Condit.), corrected joint analysis (Correct.), the compensator approach (Comp.) based on the conditional analysis, and the analysis using the maximum prospective joint likelihood (MLE). The compensator approach based on the marginal analysis yielded results very close to those of Comp. and is thus omitted. We see from the results in Table I that the controls-only approach can be substantially biased when the cases are not very rare in the population, and the odds ratio estimator has large variation in general because of the reduced sample size. MLE with γ0 fixed at the true value has the best performance. Comp. with γ0 fixed at the true value is almost indistinguishable from the MLE. Correct. has a slightly larger bias than either Comp. or MLE. The variance of Comp. or MLE is the same as or slightly smaller than the variance of Correct. The other estimators listed in the table have clearly worse performance in at least some of the simulated cases. When computational complexity is not an issue, Comp. or MLE appears preferable. When we consider computational complexity in conjunction with performance, Correct. is favored.

Table I.

Parameter estimates for simulated data with a binary primary phenotype and a binary secondary trait.

            θ1 = 0.5                  θ2 = −0.5                 θ12 = −0.3
Methods     Bias    A.std   E.std     Bias    A.std   E.std     Bias    A.std   E.std
r0 approximates 1% (or β0 = −4.6)
Controls    −0.039  0.242   0.243     −0.007  0.151   0.152     −0.034  0.212   0.211
Marginal    0.073   0.122   0.124     0.116   0.091   0.094     0.018   0.105   0.109
Condit.     −0.025  0.125   0.127     −0.006  0.097   0.099     −0.029  0.109   0.111
Correct.    −0.008  0.124   0.126     0.011   0.095   0.097     −0.019  0.108   0.111
Comp.       0.002   0.124   0.127     0.001   0.096   0.098     −0.002  0.109   0.112
MLE         0.002   0.124   0.126     0.001   0.096   0.098     −0.002  0.109   0.112
r0 approximates 5% (or β0 = −2.95)
Controls    −0.081  0.267   0.265     −0.033  0.159   0.157     −0.076  0.216   0.227
Marginal    0.033   0.129   0.129     0.085   0.096   0.095     −0.018  0.112   0.117
Condit.     −0.066  0.132   0.132     −0.035  0.100   0.100     −0.067  0.116   0.119
Correct.    −0.004  0.129   0.130     0.028   0.096   0.096     −0.029  0.113   0.117
Comp.       0.008   0.130   0.133     0.003   0.097   0.100     −0.006  0.115   0.119
MLE         0.008   0.130   0.131     0.003   0.097   0.097     −0.006  0.115   0.120
r0 approximates 10% (or β0 = −2.2)
Controls    −0.104  0.295   0.286     −0.071  0.157   0.162     −0.072  0.243   0.241
Marginal    0.001   0.127   0.135     0.050   0.096   0.097     −0.022  0.121   0.123
Condit.     −0.096  0.130   0.138     −0.070  0.101   0.101     −0.071  0.125   0.125
Correct.    −0.003  0.128   0.136     0.026   0.096   0.097     −0.014  0.123   0.123
Comp.       0.000   0.128   0.138     −0.004  0.098   0.101     −0.016  0.125   0.126
MLE         0.000   0.129   0.136     0.004   0.098   0.098     −0.015  0.125   0.124
r0 approximates 15% (or β0 = −1.73)
Controls    −0.144  0.314   0.305     −0.093  0.165   0.167     −0.086  0.257   0.253
Marginal    −0.003  0.135   0.141     0.030   0.099   0.099     −0.027  0.127   0.127
Condit.     −0.100  0.137   0.144     −0.090  0.103   0.103     −0.076  0.129   0.130
Correct.    0.011   0.136   0.141     −0.023  0.099   0.099     −0.009  0.129   0.128
Comp.       0.003   0.136   0.144     −0.007  0.101   0.103     −0.008  0.130   0.130
MLE         0.003   0.136   0.142     −0.008  0.101   0.099     −0.008  0.130   0.129

Baseline case-prevalence rate r0 is assumed known in Correct., Comp., and MLE.

Bias, estimate minus truth; A.std, square root of the averaged estimated variance; E.std, empirical standard deviation from the repeated simulations; θ1, genetic effect on the secondary trait; θ2, environment effect on the secondary trait; θ12, gene–environment interaction on the secondary trait; Controls, analysis based on controls only; Marginal, marginal analysis; Condit., conditional analysis; Correct., corrected joint analysis based on the approximate odds ratio model; Comp., joint analysis by the compensator approach; MLE, the maximum likelihood approach.

We also evaluate the size and the power of the proposed tests on the basis of the corrected joint analysis of the approximate likelihood. We calculate the empirical sizes on the basis of the analysis of the simulated data with (θ1, θ2, θ12) = (0, −0.5, 0), that is, no genetic effect but with an environmental effect. We list results on tests in Table II. In each scenario, we perform five tests, and we list the estimated size and power for the five tests at the nominal type I error rates of 5% and 1%. The estimated type I errors are close to the nominal type I error. The estimated size and power suggest that, for a large range of the baseline prevalence rate, the size and power of the different tests do not differ much. This may suggest that the tests that control the type I error for a range of baseline prevalence rates may have reasonably good performance. We also varied MAF and the genetic and gene–environment interaction effects in the simulation; similar results (not shown) were obtained.

Table II.

Size and power of the tests under different assumptions on the baseline prevalence rate for the primary phenotype.

Nominal type I error 5%
                   Size, (θ1, θ2, θ12) = (0, −0.5, 0)         Power, (θ1, θ2, θ12) = (0.5, −0.5, −0.3)
r0    Null tested  Test 1  Test 2  Test 3  Test 4  Test 5     Test 1  Test 2  Test 3  Test 4  Test 5
1%    genetic      5.80    6.14    8.38    10.44   11.80      99.1    99.6    99.7    99.8    99.8
      G×E          5.96    5.10    5.50    6.10    6.64       86.6    78.5    71.7    67.0    64.7
5%    genetic      8.40    4.68    4.40    5.20    5.56       99.2    99.7    99.7    99.7    99.7
      G×E          7.46    5.16    4.98    4.86    4.74       88.1    81.7    76.6    73.0    70.9
10%   genetic      8.36    5.30    4.60    4.60    4.76       99.1    99.5    99.5    99.6    99.6
      G×E          8.24    5.68    4.94    4.86    4.92       86.2    79.6    73.7    70.1    68.0
15%   genetic      8.70    5.52    4.80    4.78    4.90       98.9    99.3    99.5    99.6    99.7
      G×E          7.64    5.22    4.58    4.42    4.54       84.1    77.5    72.1    68.5    66.4

Nominal type I error 1%
                   Size, (θ1, θ2, θ12) = (0, −0.5, 0)         Power, (θ1, θ2, θ12) = (0.5, −0.5, −0.3)
r0    Null tested  Test 1  Test 2  Test 3  Test 4  Test 5     Test 1  Test 2  Test 3  Test 4  Test 5
1%    genetic      1.16    1.12    1.86    2.92    3.68       95.9    97.8    98.5    98.9    99.0
      G×E          1.36    0.94    1.24    1.36    1.54       65.8    54.4    46.0    41.6    38.9
5%    genetic      1.44    0.68    0.82    1.12    1.24       96.1    98.0    98.3    98.6    98.8
      G×E          1.56    0.84    0.86    0.82    0.84       70.7    60.6    52.9    48.3    46.1
10%   genetic      1.90    0.90    0.92    1.12    1.20       95.8    97.5    98.2    98.3    98.4
      G×E          1.94    1.16    1.00    1.00    1.02       67.3    56.6    48.9    44.7    42.6
15%   genetic      2.18    0.98    0.84    0.88    0.88       95.6    97.1    97.7    98.0    98.2
      G×E          1.86    1.12    0.82    0.86    0.88       64.2    54.4    47.0    42.6    40.2

The first row under each baseline prevalence rate r0 is the test of no genetic effects, and the second row is the test of no gene–environment interaction effect. The entries under each test are the rejection rate times 100. Entries with bold face are tests with correct baseline prevalence rate assumed. The values in italics are the minimum p-values or power achieved among a range of r0.

Test 1, Wald test assuming τ² = 0.01 × 0.99; Test 2, Wald test assuming τ² = 0.05 × 0.95; Test 3, Wald test assuming τ² = 0.1 × 0.9; Test 4, Wald test assuming τ² = 0.15 × 0.85; Test 5, Wald test assuming τ² = 0.2 × 0.8.

5. Genetic association of the prostate specific antigen

Prostate specific antigen (PSA) level in blood is a reasonably good predictor for prostate cancer, although it is not a precise marker for prostate cancer [21]. In genetic association studies of prostate cancer, PSA is often measured. It is of substantial interest to investigate the association of PSA level with genetic variants. For example, in performing genome-wide association studies of prostate cancer by combining case–control studies that included thousands of prostate cancer cases and healthy controls, Eeles et al. [22, 23] performed secondary analyses of PSA level in blood on the basis of the control sample in an attempt to identify SNPs that may be associated with PSA. The control-only analysis can have substantially lower power in detecting possible association than analyses that include both cases and controls. In addition, such an analysis, which requires the disease to be rare in the population, can be biased because the prevalence rate of prostate cancer in the elderly male population can be high [21].

The proposed approach in this article can be more appropriate for the analysis of the PSA data in [22, 23]. Here, we use a similar case–control study of prostate cancer in the African American population to illustrate the use of our approach. The sample in this study was collected and genotyped in the lab of one of the coauthors (Dr. Kittles). There are 260 cases and 345 controls whose genomes were typed on over one million SNPs using an Illumina array. PSA level was measured on 218 cases and 325 controls. In this analysis, we concentrate on the 543 subjects with PSA measured. In a preliminary analysis of the prostate cancer status, we fit a logistic regression model without any SNPs in the model. We found that the log-transformed PSA, that is, log(1 + PSA), the age of the subject, and the interaction between the two are all strongly associated with the case–control status of the subject. In a preliminary analysis of the PSA, we fit a linear regression model for the log-transformed PSA. We found that the case–control status and the age of the subject are strongly associated with the log-transformed PSA, but there was no interaction between age and status. These models form the basis of our genome-wide association analysis of the PSA as a secondary trait. Our model for the primary phenotype, the prostate cancer status, is

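$$\log[\eta_1\{D;(\mathrm{lpsa},G,\mathrm{age})\}]=D(\gamma_{01}\,\mathrm{lpsa}+\gamma_{02}\,\mathrm{lpsa}\times\mathrm{age}+\gamma_1G+\gamma_2\,\mathrm{age}),$$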

where D is the prostate cancer status and lpsa = log(1 + PSA). Our model for the secondary trait is

$$\log[\eta_2\{\mathrm{lpsa};(G,\mathrm{age})\}]=\mathrm{lpsa}\,(\theta_1G+\theta_2\,\mathrm{age}).$$

The correction formula for the genetic effect remains the same. See Appendix B for more details.

The number of typed SNPs is over one million. The sample size is relatively small for detecting possible small genetic effects hidden in the large number of SNPs. We randomly sampled 10% of the typed SNPs in the 22 autosomes for the purpose of illustration. After excluding those SNPs with a MAF less than 5% and/or with a p-value less than 0.0001 in the Hardy–Weinberg equilibrium test, a total of 90,232 SNPs were selected. We performed analyses with baseline case-prevalence rates ranging from 0% to 10% in steps of 0.2%. We chose the fixed point as (age, G, lpsa) = (65, 1, 4); the baseline case-prevalence rate should therefore be interpreted as that for this subpopulation. In each case, we applied the genomic control approach in [24] to check for possible population stratification. The estimated variance inflation factors range from λ = 0.964 to λ = 0.970, so adjustment for population stratification does not appear to be necessary. We display the top SNPs in the p-value spectrum to show the impact of different baseline case-prevalence rates on the test results.
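A variance inflation factor of the kind reported above can be computed with a few lines of Python. The sketch below uses the common median-based form of genomic control for one-degree-of-freedom chi-square statistics, which is an assumption about how λ was computed here; the function name and the example statistics are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation_factor(chi2_stats):
    """Median of the observed 1-df test statistics divided by the null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

# Hypothetical per-SNP test statistics, for illustration only.
lam = genomic_inflation_factor([0.02, 0.31, 0.44, 0.47, 1.80, 3.10])
```

A value of λ near or below 1, as in the analysis above, gives no indication of inflation of the test statistics.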

Table III displays the genetic association test results for the secondary trait PSA for SNPs whose smallest p-value over the baseline rate range of 0% to 10% is smaller than 0.0001. Test 1 shows the results under the sparse disease assumption, an approach routinely used for the analysis of secondary traits. Test 2 and Test 3 are computed under 5% and 10% baseline case-prevalence rates, respectively. The impact of a non-zero case-prevalence rate ranges from relatively small (for example, rs10197545) to very large (for example, rs9316673). The impact is unevenly distributed over different case-prevalence rates because the correction affects not only the odds ratio estimate but also the variance estimate of the odds ratio estimator. The test (MinT) that controls the type I error for baseline rates ranging from 0% to 10% is a more robust test for the secondary trait analysis when the sparse disease assumption is suspect. MinT can yield the same results as Test 1 in some cases and different results in others. Even when MinT yields the same results as Test 1, the analysis is still useful because it provides information on the reliability of the results. For example, the sensitivity analysis indicates that the result for rs2255684, which is significant under the sparse disease assumption, is only moderately affected by non-zero baseline case-prevalence rates. This analysis assures us of the reliability of the test result even if the sparse disease assumption does not hold. On the other hand, the test based on the sparse disease assumption for rs9316673 can be much less reliable. Such a sensitivity analysis is easy to perform using our approach, whereas it can be very time-consuming using other approaches. We also performed a prospective maximum likelihood analysis of the selected SNPs; the results confirmed the effectiveness of the correction approach.

Table III.

Analysis of the secondary trait prostate specific antigen in the case–control study of prostate cancer: test of no genetic effect.

| CHR | SNP | γ01γ1 | Test 1 (×10⁻⁴) | Test 2 (×10⁻⁴) | Test 3 (×10⁻⁴) | MinT (×10⁻⁴) |
|---|---|---|---|---|---|---|
| 2 | rs10197545 | 0.704 | 0.953 / 0.953 | 0.552 / 0.400 | 0.729 / 0.460 | 0.953 |
| 3 | rs17034929 | −0.037 | 2.20 / 2.20 | 0.467 / 0.674 | 0.150 / 0.446 | 2.20 |
| 6 | rs9448387 | 0.687 | 1.40 / 1.40 | 0.93 / 0.449 | 1.33 / 0.480 | 1.40 |
| 10 | rs2255684 | 0.550 | 0.421 / 0.421 | 0.479 / 0.478 | 1.00 / 0.590 | 1.00 |
| 12 | rs7135149 | −0.150 | 1.46 / 1.46 | 0.675 / 1.50 | 0.465 / 1.44 | 1.46 |
| 12 | rs11068651 | 0.414 | 0.600 / 0.600 | 0.246 / 0.250 | 0.199 / 0.240 | 0.600 |
| 13 | rs9594159 | −0.022 | 7.41 / 7.41 | 1.98 / 3.91 | 0.776 / 2.40 | 7.41 |
| 13 | rs9316673 | 0.462 | 0.215 / 0.215 | 0.048 / 0.054 | 0.025 / 0.040 | 0.215 |
| 19 | rs350134 | 0.095 | 4.06 / 4.06 | 1.66 / 2.20 | 0.990 / 1.90 | 4.06 |

CHR, chromosome; SNP, single nucleotide polymorphism; Test 1, p-value of the Wald test assuming τ² = 0; Test 2, p-value of the Wald test assuming τ² = 0.05 × 0.95; Test 3, p-value of the Wald test assuming τ² = 0.1 × 0.9; MinT, p-value of the minimum test for baseline prevalence rates ranging from 0% to 10%, evaluated at intervals of 0.2%. In the Test 1 to Test 3 columns, the second entry is the p-value from the maximum likelihood approach.
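The sensitivity analysis over the baseline case-prevalence rate can be sketched in a few lines of Python. The sketch rests on several assumptions: it uses the simple correction θ1 ≈ ψ1 + τ²γ01γ1 with τ² = r0(1 − r0) (formula (A.10) with c3 = 0 and no interactions with the secondary trait), it obtains the variance by the δ method from a 3 × 3 covariance matrix of (ψ1, γ01, γ1), and it reads MinT as the largest Wald p-value over the grid, which is how the MinT column of Table III appears to behave. All function names and input values are hypothetical.

```python
import math

def wald_p(theta, var):
    """Two-sided p-value of the Wald test of theta = 0."""
    z = abs(theta) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2.0))

def corrected_test(psi1, g01, g1, V, r0):
    """Corrected estimate theta1 = psi1 + tau2*g01*g1 and its Wald p-value.

    V is the covariance matrix of the estimates of (psi1, g01, g1).
    """
    tau2 = r0 * (1.0 - r0)
    theta1 = psi1 + tau2 * g01 * g1
    grad = [1.0, tau2 * g1, tau2 * g01]  # delta-method gradient
    var = sum(grad[i] * V[i][j] * grad[j] for i in range(3) for j in range(3))
    return theta1, wald_p(theta1, var)

def sensitivity(psi1, g01, g1, V, r_max=0.10, step=0.002):
    """P-values on a grid of baseline rates and the largest of them (assumed MinT)."""
    n = int(round(r_max / step))
    grid = [k * step for k in range(n + 1)]
    pvals = [corrected_test(psi1, g01, g1, V, r0)[1] for r0 in grid]
    return grid, pvals, max(pvals)

# Hypothetical estimates for one SNP.
V_example = [[0.0064, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.01]]
grid, pvals, p_min_t = sensitivity(0.30, 1.5, 0.4, V_example)
```

Because each grid point only re-evaluates a closed-form expression, the full 51-point scan costs little more than a single test, which is the practical appeal of explicit correction formulas in a genome-wide setting.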

6. Discussion

For the analysis of secondary traits, the retrospective likelihood approach proposed by Lin and Zeng [15] covers three scenarios. The first assumes a known population case-prevalence rate. The second assumes a rare disease, which corresponds to a disease prevalence rate in the general population close to zero. The third does not assume any supplemental information on the disease prevalence rate in the general population. The third scenario is the most important one, and the first two may be regarded as special cases of it. In the first two scenarios, the retrospective likelihood approach performs well when the population-level case-prevalence rate is accurate and usable. In the third scenario, the retrospective likelihood approach can perform poorly, primarily because of weakly identifiable parameters [20] that cannot be estimated well with practically attainable sample sizes.

The approach we propose mainly tackles the problem in scenario 3. We consider an approximate model that does not involve estimating the weakly identifiable parameter. In carrying out the task, we work with the simpler joint prospective likelihood, which was shown in [20] to be equivalent to the retrospective likelihood in scenario 3. Following the estimation, we recover the parameters of interest through a sensitivity analysis over the baseline case-prevalence rate. Lin and Zeng [15] also suggested a sensitivity analysis over the population case-prevalence rate in scenario 3. However, the validity of their approach under scenarios 1 and 3 rests on the assumption that the covariate distribution in the general population can be identified from the case–control sample. This assumption is violated when subjects are pre-selected through eligibility criteria, as is routinely done in biomedical studies. Our approach is not subject to this limitation. Furthermore, the sensitivity analysis using the explicit correction formulas we propose is computationally much faster, which can be critical for genome-wide analysis. The explicit correction formulas also suggest that the magnitude of bias in the conditional or marginal analysis is determined by three factors: the baseline prevalence rate of the primary phenotype, the strength of association between the primary phenotype and the secondary trait, and the strength of association between the primary phenotype and the genetic marker. Such dependence has largely been described qualitatively but not quantified in the previous literature. Of note, correcting the bias does not require knowing the prevalence rate of the secondary trait in the general population. Our approach can also be applied to scenarios 1 and 2 when the baseline case-prevalence rate is used instead.

Our proposed approach is directly applicable to the secondary trait analysis even if either or both of the primary and secondary traits are quantitative and the sample is ascertained through an extreme-value sampling design rather than a case–control design. Both the environmental factor and the gene–environment interaction can be readily included in the model for analysis. In particular, when the primary phenotype is quantitative and a normal model is used to link the phenotype to the genotype and the environmental factor, the correction formula for the joint analysis becomes exact. Our analysis also suggests that when the case-prevalence rate for the primary phenotype is low, the uncorrected conditional analysis is much better than the uncorrected marginal analysis.
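The exactness claim for a normally distributed primary phenotype can be checked numerically. The following Python sketch is an illustration under stated assumptions: T(Y, G, E) is taken to be proportional to the reciprocal of the normalizing integral of η1 against the baseline density of D, which is a reading of the derivation in Appendix A and not a definition quoted from it, and the baseline density is N(μ0, τ²). The function names and parameter values are hypothetical.

```python
import math

def m_rel(y, g, e, ref, gam):
    """Linear predictor of log eta_1 per unit (D - D0), relative to the reference point."""
    y0, g0, e0 = ref
    g01, g1, g2, g12 = gam
    return g01 * (y - y0) + g1 * (g - g0) + g2 * (e - e0) + g12 * (g * e - g0 * e0)

def log_T(m, mu0, tau2, d0, n=4001, width=12.0):
    """-log of the integral of exp{(D - D0) m} against N(mu0, tau2), by the trapezoid rule."""
    sd = math.sqrt(tau2)
    lo = mu0 - width * sd
    h = 2.0 * width * sd / (n - 1)
    total = 0.0
    for k in range(n):
        d = lo + k * h
        dens = math.exp(-0.5 * (d - mu0) ** 2 / tau2) / math.sqrt(2.0 * math.pi * tau2)
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * math.exp((d - d0) * m) * dens
    return -math.log(total * h)

def log_eta_T_numeric(y, g, e, ref, gam, mu0, tau2, d0):
    """log eta_T built from its definition as a ratio of four T values."""
    y0, g0, e0 = ref
    t = lambda yy, gg, ee: log_T(m_rel(yy, gg, ee, ref, gam), mu0, tau2, d0)
    return t(y, g, e) + t(y0, g0, e0) - t(y0, g, e) - t(y, g0, e0)

def log_eta_T_closed(y, g, e, ref, gam, tau2):
    """Closed form -tau2*g01*(Y-Y0)*{g1(G-G0) + g2(E-E0) + g12(GE-G0E0)}."""
    y0, g0, e0 = ref
    g01, g1, g2, g12 = gam
    return -tau2 * g01 * (y - y0) * (g1 * (g - g0) + g2 * (e - e0) + g12 * (g * e - g0 * e0))
```

Under these assumptions the numerical and closed-form values agree to within integration error for any choice of the baseline mean, which is consistent with the statement that no Taylor approximation is needed in the normal case.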

In deriving the correction formulas in Appendix A, we have assumed that there is no interaction between the secondary trait and the genotype or the environmental factors. If such interactions exist, the idea we used to obtain the correction formulas can still be applied. However, the correction formulas need to be modified accordingly. See Appendix B. We chose to mainly deal with the simple situation in this article to keep the presentation easy to follow.

Acknowledgments

This research was supported by NSF grant DMS 1007726 (to H. Y. C.). We thank two anonymous reviewers and an associate editor for very helpful comments on the previous versions of this paper.

Appendix A. Derivation of the formulas

We derive general correction formulas for the marginal, conditional, and joint analyses for both discrete and continuous (D, Y) with an arbitrary reference point (D0, Y0, G0, E0). If D is a binary phenotype, we assume it follows model (4). If D is a quantitative trait, we model it by

$$D = \beta_0 + \beta_{01} Y + \beta_1 G + \beta_2 E + \beta_{12} GE + \varepsilon, \tag{A.1}$$

where ε ~ N(0, τ²). Note that the following derivation can also be modified to include interactions between Y and other variables. To keep the presentation simple, we only deal with the case without Y interaction. Define an odds ratio function for D versus (Y, G, E) relative to the reference point as

$$\eta_1\{D;(Y,G,E)\} = \frac{p(D\mid Y,G,E)\,p(D_0\mid Y_0,G_0,E_0)}{p(D_0\mid Y,G,E)\,p(D\mid Y_0,G_0,E_0)}. \tag{A.2}$$

The odds ratio function for model (A.1) is

$$\log\eta_1\{D;(Y,G,E)\} = \gamma_{01}(D-D_0)(Y-Y_0) + \gamma_1(D-D_0)(G-G_0) + \gamma_2(D-D_0)(E-E_0) + \gamma_{12}(D-D_0)(GE-G_0E_0),$$

where (γ01, γ1, γ2, γ12) = (β01, β1, β2, β12)/τ². The density of D under model (A.1) can be rewritten as

$$p(D\mid Y,G,E) = \frac{\eta_1\{D;(Y,G,E)\}\,p(D\mid Y_0,G_0,E_0)}{\int \eta_1\{D;(Y,G,E)\}\,p(D\mid Y_0,G_0,E_0)\,dD}. \tag{A.3}$$

This odds ratio representation also works for binary phenotype D [25, 26]. Expression (6) is the special case with the reference point (D0, Y0, G0, E0) = (0, 0, 0, 0). If Y is a quantitative trait and is modeled by

$$Y = \alpha_0 + \alpha_1 G + \alpha_2 E + \alpha_{12} GE + e, \tag{A.4}$$

where e ~ N(0, σ²), the following odds ratio representation

$$p(Y\mid G,E) = \frac{\eta_2\{Y;(G,E)\}\,p(Y\mid G_0,E_0)}{\int \eta_2\{Y;(G,E)\}\,p(Y\mid G_0,E_0)\,dY} \tag{A.5}$$

holds, where

$$\eta_2\{Y;(G,E)\} = \frac{p(Y\mid G,E)\,p(Y_0\mid G_0,E_0)}{p(Y_0\mid G,E)\,p(Y\mid G_0,E_0)}. \tag{A.6}$$

For the quantitative trait modeled by (A.4),

$$\log\eta_2\{Y;(G,E)\} = \theta_1(Y-Y_0)(G-G_0) + \theta_2(Y-Y_0)(E-E_0) + \theta_{12}(Y-Y_0)(GE-G_0E_0),$$

where (θ1, θ2, θ12) = (α1, α2, α12)/σ². The representation (A.5) also works for binary phenotype Y. Expression (3) is the special case with the reference point (Y0, G0, E0) = (0, 0, 0).
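The link between the normal regression model and the odds ratio function can be verified directly. The Python sketch below evaluates log η1 from definition (A.2) using normal log-densities under model (A.1), and compares it with the bilinear form with γ = β/τ². The function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math

def log_norm_pdf(x, mean, var):
    """Log-density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def mean_d(y, g, e, beta):
    """Mean of D under model (A.1); beta = (b0, b01, b1, b2, b12)."""
    b0, b01, b1, b2, b12 = beta
    return b0 + b01 * y + b1 * g + b2 * e + b12 * g * e

def log_eta1_from_density(d, y, g, e, ref, beta, tau2):
    """log eta_1 computed from its definition (A.2) as a ratio of four densities."""
    d0, y0, g0, e0 = ref
    mu, mu_ref = mean_d(y, g, e, beta), mean_d(y0, g0, e0, beta)
    return (log_norm_pdf(d, mu, tau2) + log_norm_pdf(d0, mu_ref, tau2)
            - log_norm_pdf(d0, mu, tau2) - log_norm_pdf(d, mu_ref, tau2))

def log_eta1_formula(d, y, g, e, ref, beta, tau2):
    """Bilinear form with gamma = beta / tau2, as stated after (A.2)."""
    d0, y0, g0, e0 = ref
    _, b01, b1, b2, b12 = beta
    g01, g1, g2, g12 = (b / tau2 for b in (b01, b1, b2, b12))
    return (d - d0) * (g01 * (y - y0) + g1 * (g - g0) + g2 * (e - e0)
                       + g12 * (g * e - g0 * e0))
```

The two functions agree up to rounding error for any reference point, because the quadratic terms in D cancel in the four-way ratio and only the cross term (D − D0)(μ − μ0)/τ² survives.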

When the sampling design is (7),

$$p(D,Y\mid G,E,S=1) = \frac{P(S=1\mid D,Y,G,E)\,p(D\mid Y,G,E)\,p(Y\mid G,E)}{\iint P(S=1\mid D,Y,G,E)\,p(D\mid Y,G,E)\,p(Y\mid G,E)\,dD\,dY}. \tag{A.7}$$

By replacing P(S = 1|D, Y, G, E) with (7), p(D|Y, G, E) with (A.3), and p(Y|G, E) with (A.5), we can rewrite (A.7) as

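$$p(D,Y\mid G,E,S=1) = \frac{\eta_1\{D;(Y,G,E)\}\,T(Y,G,E)\,\eta_2\{Y;(G,E)\}\,p(Y\mid G_0,E_0)\,p(D\mid Y_0,G_0,E_0,S=1)}{\iint \eta_1\{D;(Y,G,E)\}\,T(Y,G,E)\,\eta_2\{Y;(G,E)\}\,p(Y\mid G_0,E_0)\,p(D\mid Y_0,G_0,E_0,S=1)\,dD\,dY}. \tag{A.8}$$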

Let

$$\eta_T\{Y;(G,E)\} = \frac{T(Y,G,E)\,T(Y_0,G_0,E_0)}{T(Y_0,G,E)\,T(Y,G_0,E_0)}.$$

When D is a quantitative trait and is modeled by (A.1),

$$\log\eta_T\{Y;(G,E)\} = -\tau^2\gamma_{01}(Y-Y_0)\{\gamma_1(G-G_0) + \gamma_2(E-E_0) + \gamma_{12}(GE-G_0E_0)\}$$

by direct calculation. When D is a binary phenotype and is modeled by (4),

$$\eta_T\{Y;(G,E)\} = \frac{[1+\exp\{\gamma_0+m(Y_0,G,E)\}]\,[1+\exp\{\gamma_0+m(Y,G_0,E_0)\}]}{[1+\exp\{\gamma_0+m(Y,G,E)\}]\,[1+\exp\{\gamma_0+m(Y_0,G_0,E_0)\}]},$$

where m(y, G, E) = γ01 y + γ1G + γ2E + γ12GE. From the second-order Taylor expansion,

$$\log(1+e^{t}) \approx \log(1+e^{t_0}) + \frac{e^{t_0}}{1+e^{t_0}}(t-t_0) + \frac{e^{t_0}}{2(1+e^{t_0})^2}(t-t_0)^2,$$

it follows by expanding log[1 + exp{γ0 + m(y, G, E)}] around t0 = γ0 + m(y0, G0, E0) that

$$\log\eta_T\{Y;(G,E)\} \approx -\tau^2\gamma_{01}(Y-Y_0)\{\gamma_1(G-G_0) + \gamma_2(E-E_0) + \gamma_{12}(GE-G_0E_0)\},$$

where τ² = r0(1 − r0), which is the conditional variance of D given Y = Y0, G = G0, and E = E0, and r0 = e^{t0}/(1 + e^{t0}), which is the baseline case-prevalence rate for the primary phenotype. Note that τ², which was used to denote the residual variance in the regression model for the primary trait, can also be interpreted as the conditional variance of D given Y = Y0, G = G0, and E = E0. If the third-order Taylor expansion is used, that is,

$$\log(1+e^{t}) \approx \log(1+e^{t_0}) + \frac{e^{t_0}}{1+e^{t_0}}(t-t_0) + \frac{e^{t_0}}{2(1+e^{t_0})^2}(t-t_0)^2 + \frac{e^{t_0}(1-e^{t_0})}{6(1+e^{t_0})^3}(t-t_0)^3,$$

the approximation yields

$$\log\eta_T\{Y;(G,E)\} \approx -\tau^2\gamma_{01}(Y-Y_0)\{\gamma_1(G-G_0) + \gamma_2(E-E_0) + \gamma_{12}(GE-G_0E_0)\} - c_3\gamma_{01}\gamma_1\gamma_2(Y-Y_0)(G-G_0)(E-E_0)$$

when we exclude higher order terms from the approximation, where c3 = r0(1 − r0)(1 − 2r0). We can obtain the approximation with interactions between Y and (G, E) by changing m(y, G, E) in the derivation accordingly.
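The quality of these approximations for a binary primary phenotype can be inspected numerically. The Python sketch below computes log ηT exactly from the logistic expression and compares it with the second-order approximation and with the version that adds the c3 term. The parameter values are hypothetical, chosen so that the baseline case-prevalence rate at the reference point is 5%; the function names are illustrative only.

```python
import math

def m(y, g, e, gam):
    """m(y, G, E) = g01*y + g1*G + g2*E + g12*G*E."""
    g01, g1, g2, g12 = gam
    return g01 * y + g1 * g + g2 * e + g12 * g * e

def log_eta_T_exact(y, g, e, ref, gamma0, gam):
    """Exact log eta_T for binary D under the logistic model."""
    y0, g0, e0 = ref
    f = lambda yy, gg, ee: math.log1p(math.exp(gamma0 + m(yy, gg, ee, gam)))
    return f(y0, g, e) + f(y, g0, e0) - f(y, g, e) - f(y0, g0, e0)

def log_eta_T_approx(y, g, e, ref, gamma0, gam):
    """Second-order approximation and the version with the c3 term added."""
    y0, g0, e0 = ref
    g01, g1, g2, g12 = gam
    t0 = gamma0 + m(y0, g0, e0, gam)
    r0 = 1.0 / (1.0 + math.exp(-t0))   # baseline case-prevalence rate
    tau2 = r0 * (1.0 - r0)
    c3 = tau2 * (1.0 - 2.0 * r0)
    second = -tau2 * g01 * (y - y0) * (
        g1 * (g - g0) + g2 * (e - e0) + g12 * (g * e - g0 * e0))
    third = second - c3 * g01 * g1 * g2 * (y - y0) * (g - g0) * (e - e0)
    return second, third

# Hypothetical setting: 5% baseline rate at the reference point (0, 0, 0).
gamma0 = math.log(0.05 / 0.95)
gam = (0.3, 0.2, 0.1, 0.0)
exact = log_eta_T_exact(1.0, 1.0, 1.0, (0.0, 0.0, 0.0), gamma0, gam)
second, third = log_eta_T_approx(1.0, 1.0, 1.0, (0.0, 0.0, 0.0), gamma0, gam)
```

Comparing `exact` with `second` and `third` across a range of `gamma0` values gives a quick sense of how the approximation error grows as the baseline rate moves away from zero or the effect sizes become large.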

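When we express T(Y, G, E) through ηT{Y; (G, E)} and replace ηT by its approximation, we can rewrite (A.8) as

$$p(D,Y\mid G,E,S=1) = \frac{\xi(D,Y,G,E)\,q_1(D)\,q_2(Y)}{\iint \xi(D,Y,G,E)\,q_1(D)\,q_2(Y)\,dD\,dY}, \tag{A.9}$$

where q1(D) = p(D|Y0, G0, E0, S = 1),

$$q_2(Y) = \frac{T(Y,G_0,E_0)\,p(Y\mid G_0,E_0)}{\int T(Y,G_0,E_0)\,p(Y\mid G_0,E_0)\,dY},$$

$$\log\xi(D,Y,G,E) = \gamma_{01}(D-D_0)(Y-Y_0) + \gamma_1(D-D_0)(G-G_0) + \gamma_2(D-D_0)(E-E_0) + \gamma_{12}(D-D_0)(GE-G_0E_0) + \psi_1(Y-Y_0)(G-G_0) + \psi_2(Y-Y_0)(E-E_0) + \psi_{12}(Y-Y_0)(GE-G_0E_0),$$

and

$$\begin{pmatrix}\psi_1\\ \psi_2\\ \psi_{12}\end{pmatrix} \approx \begin{pmatrix}\theta_1\\ \theta_2\\ \theta_{12}\end{pmatrix} - \gamma_{01}\begin{pmatrix}\tau^2\gamma_1 - c_3\gamma_1\gamma_2E_0\\ \tau^2\gamma_2 - c_3\gamma_1\gamma_2G_0\\ \tau^2\gamma_{12} + c_3\gamma_1\gamma_2\end{pmatrix}. \tag{A.10}$$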

(A.10) becomes (14) when c3 = 0. From (A.9), we can straightforwardly obtain the correction formula (15) for the conditional analysis, which is the same as (A.10). To obtain the correction formula (19) for the marginal analysis, expand

$$U(Y,G,E) = \int \eta_1\{D;(Y,G,E)\}\,p(D\mid Y_0,G_0,E_0,S=1)\,dD$$

in the same way as for T(Y, G, E). The general formula appears as

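$$\begin{pmatrix}\psi_1\\ \psi_2\\ \psi_{12}\end{pmatrix} \approx \begin{pmatrix}\theta_1\\ \theta_2\\ \theta_{12}\end{pmatrix} - \gamma_{01}\begin{pmatrix}(\tau^2-\tilde\tau^2)\gamma_1 - (c_3-\tilde c_3)\gamma_1\gamma_2E_0\\ (\tau^2-\tilde\tau^2)\gamma_2 - (c_3-\tilde c_3)\gamma_1\gamma_2G_0\\ (\tau^2-\tilde\tau^2)\gamma_{12} + (c_3-\tilde c_3)\gamma_1\gamma_2\end{pmatrix}, \tag{A.11}$$

which leads to (18) when c3 = 0 = c̃3.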

We can estimate parameters in the joint correction formula by maximizing the likelihood on the basis of (A.9). We can approximate the variance of the parameter estimator by the inverse of the observed information matrix. We use the δ method to find the approximate variance estimate for the corrected odds ratio estimator for the association of the secondary trait and the genetic marker. Specifically, let

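$$\sqrt{n}\,\{(\hat\psi,\hat\gamma) - (\psi,\gamma)\} \to N\{(0,0),V\},$$

where (ψ̂, γ̂) is the estimate of (ψ, γ), ψ = (ψ1, ψ2, ψ12), and γ = (γ01, γ1, γ2, γ12). We can write the correction formulas based on (A.10) or (A.11) in the form θ = ψ + g(γ). The variance of θ̂ has the form

$$V_{\psi\psi} + V_{\psi\gamma}\frac{\partial g}{\partial\gamma} + \left(\frac{\partial g}{\partial\gamma}\right)^{T}V_{\gamma\psi} + \left(\frac{\partial g}{\partial\gamma}\right)^{T}V_{\gamma\gamma}\frac{\partial g}{\partial\gamma},$$

where

$$V = \begin{pmatrix}V_{\psi\psi} & V_{\psi\gamma}\\ V_{\gamma\psi} & V_{\gamma\gamma}\end{pmatrix}.$$

For the marginal analysis, we can use the same steps except that we now have γ = (γ0, γ01, γ1, γ2, γ12), where γ0 is the intercept of the logistic regression model from the case–control sample. For different approximations, g is different, and its derivatives with respect to γ can be found accordingly.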
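The variance formula can be written out for the conditional or joint analysis as a short Python sketch. It assumes the correction (A.10) with c3 = 0, so that g(γ) = τ²γ01(γ1, γ2, γ12), and it treats ∂g/∂γ as a 4 × 3 matrix whose rows index (γ01, γ1, γ2, γ12), which is the layout that makes the matrix products in the formula conformable. The function names and the ordering of V as the 7 × 7 covariance of (ψ1, ψ2, ψ12, γ01, γ1, γ2, γ12) are assumptions of this illustration.

```python
def g(gamma, tau2):
    """Correction term g(gamma) under (A.10) with c3 = 0."""
    g01, g1, g2, g12 = gamma
    return [tau2 * g01 * g1, tau2 * g01 * g2, tau2 * g01 * g12]

def jacobian(gamma, tau2):
    """4x3 matrix dg/dgamma: rows index (g01, g1, g2, g12), columns index the components of g."""
    g01, g1, g2, g12 = gamma
    return [[tau2 * g1, tau2 * g2, tau2 * g12],
            [tau2 * g01, 0.0, 0.0],
            [0.0, tau2 * g01, 0.0],
            [0.0, 0.0, tau2 * g01]]

def theta(psi, gamma, tau2):
    """Corrected estimate theta = psi + g(gamma)."""
    return [p + c for p, c in zip(psi, g(gamma, tau2))]

def var_theta(V, gamma, tau2):
    """Delta-method variance A V A^T with A = [I_3, J^T].

    Expanding A V A^T block by block gives the four-term expression
    V_pp + V_pg J + J^T V_gp + J^T V_gg J.
    """
    J = jacobian(gamma, tau2)
    A = [[1.0 if j == i else 0.0 for j in range(3)] + [J[k][i] for k in range(4)]
         for i in range(3)]
    return [[sum(A[i][a] * V[a][b] * A[j][b] for a in range(7) for b in range(7))
             for j in range(3)]
            for i in range(3)]
```

Because τ² enters only as a multiplier, the same estimated V can be reused for every baseline rate in a sensitivity scan; only the Jacobian needs to be recomputed.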

Appendix B. More general correction formulas

Suppose there are interactions between the secondary trait and the gene and/or environmental factors, that is, η1{D; (Y, G, E)} = exp{(D − D0)m(Y, G, E)}, where

$$m(Y,G,E) = \gamma_{01}(Y-Y_0) + \gamma_{02}(YG-Y_0G_0) + \gamma_{03}(YE-Y_0E_0) + \gamma_{023}(YGE-Y_0G_0E_0) + \gamma_1(G-G_0) + \gamma_2(E-E_0) + \gamma_{12}(GE-G_0E_0).$$

We can modify the correction formulas for the conditional or the joint analysis to

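$$\begin{aligned}\theta_1^* &= \psi_1 + (\mu-D_0)\gamma_{02}^* + \tau^2\gamma_{01}^*\gamma_1^*,\\ \theta_2^* &= \psi_2 + (\mu-D_0)\gamma_{03}^* + \tau^2\gamma_{01}^*\gamma_2^*,\\ \theta_{12} &= \psi_{12} + (\mu-D_0)\gamma_{023} + \tau^2(\gamma_{01}^*\gamma_{12}^* + \gamma_1^*\gamma_{03}^* + \gamma_2^*\gamma_{02}^*) + c_3\gamma_{01}^*\gamma_1^*\gamma_2^*,\end{aligned}$$

where θ1* = θ1 + θ12G0, θ2* = θ2 + θ12E0, γ12* = γ12 + γ023Y0, γ02* = γ02 + γ023E0, γ03* = γ03 + γ023G0, and

$$\begin{aligned}\gamma_1^* &= \gamma_1 + \gamma_{02}Y_0 + \gamma_{12}E_0 + \gamma_{023}Y_0E_0,\\ \gamma_2^* &= \gamma_2 + \gamma_{03}Y_0 + \gamma_{12}G_0 + \gamma_{023}Y_0G_0,\\ \gamma_{01}^* &= \gamma_{01} + \gamma_{02}G_0 + \gamma_{03}E_0 + \gamma_{023}G_0E_0.\end{aligned}$$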

As in the simpler case, we can obtain the variance by the δ method.

References

  • 1. Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35.
  • 2. Prentice RL, Pyke R. Logistic disease incidence models and case–control studies. Biometrika. 1979;66:403–411.
  • 3. Breslow NE, Day N. Statistical methods in cancer research. Volume I: The analysis of case–control studies. IARC Scientific Publications; Lyon: IARC; 1980.
  • 4. Breslow NE. Statistics in epidemiology: the case–control study. Journal of the American Statistical Association. 1996;91:14–28. doi: 10.1080/01621459.1996.10476660.
  • 5. Rabinowitz D. A note on efficient estimation from case–control data. Biometrika. 1997;84:486–488.
  • 6. Roeder K, Carroll RJ, Linsay BC. A nonparametric mixture approach to case–control studies with errors in variables. Journal of the American Statistical Association. 1996;91:722–732.
  • 7. Scott AJ, Wild CJ. Fitting regression models to case–control data by maximum likelihood. Biometrika. 1997;84:57–71.
  • 8. Chen HY. A note on prospective analysis of outcome-dependent samples. Journal of Royal Statistical Society, Ser B. 2003;65:575–584.
  • 9. Nagelkerke NJD, Moses S, Plummer FA, Brunham RC, Fish D. Logistic regression in case–control studies: the effect of using independent as dependent variables. Statistics in Medicine. 1995;14:769–775. doi: 10.1002/sim.4780140806.
  • 10. Lee AJ, McMurchy L, Scott AJ. Re-using data from case–control studies. Statistics in Medicine. 1997;16:1377–1389. doi: 10.1002/(sici)1097-0258(19970630)16:12<1377::aid-sim557>3.0.co;2-k.
  • 11. Jiang Y, Scott AJ, Wild CJ. Secondary analysis of case–control data. Statistics in Medicine. 2006;25:1323–1339. doi: 10.1002/sim.2283.
  • 12. Reilly M, Torrang A, Klint A. Reuse of case–control data for analysis of new outcome variables. Statistics in Medicine. 2005;24:4009–4019. doi: 10.1002/sim.2398.
  • 13. Richardson DB, Rzehak P, Klenk J, Weiland SK. Analysis of case–control data for additional outcomes. Epidemiology. 2007;18:441–445. doi: 10.1097/EDE.0b013e318060d25c.
  • 14. Kraft P. Analysis of genome-wide association scans for additional outcomes. Epidemiology. 2007;18:838. doi: 10.1097/EDE.0b013e318154c7e2.
  • 15. Lin DY, Zeng D. Proper analysis of secondary phenotype data in case–control association studies. Genetic Epidemiology. 2009;33:256–265. doi: 10.1002/gepi.20377.
  • 16. Monsees GM, Tamimi RM, Kraft P. Genome-wide association scans for secondary traits using case–control samples. Genetic Epidemiology. 2009;33:718–728. doi: 10.1002/gepi.20424.
  • 17. Li H, Gail M, Berndt S, Chatterjee N. Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies. Genetic Epidemiology. 2010;34:427–433. doi: 10.1002/gepi.20495.
  • 18. Wang J, Shete S. Estimation of odds ratios of genetic variants for the secondary phenotypes associated with primary diseases. Genetic Epidemiology. 2011;35:190–200. doi: 10.1002/gepi.20568.
  • 19. Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with discussion). Journal of the American Statistical Association. 2006;101:89–104.
  • 20. Chen HY. A unified framework for parameter identifiability and estimation in biased sampling designs. Biometrika. 2011;98:163–175.
  • 21. Thompson IM, Pauler DK, Goodman PJ, Tangen CM, Lucia MS, Parnes HL, Minasian LM, Ford LG, Lippman SM, Crawford ED, Crowley JJ, Coltman CA. Prevalence of prostate cancer among men with a prostate-specific antigen level ≤ 4.0 ng per milliliter. The New England Journal of Medicine. 2004;350:2239–2246. doi: 10.1056/NEJMoa031918.
  • 22. Eeles RA, Kote-Jarai Z, Giles GG, Olama AAA, Guy M, Jugurnauth SK, Mulholland S, Leongamornlert DA, Edwards SM, Morrison J, Field HI, Southey MC, Severi G, Donovan JL, Hamdy FC, Dearnaley DP, Muir KR, Smith C, Bagnato M, Ardern-Jones AT, Hall AL, O’Brien LT, Gehr-Swain BN, Wilkinson RA, Cox A, Lewis S, Brown PM, Jhavar SG, Tymrakiewicz M, Lophatananon A, Bryant SL, Horwich A, Huddart RA, Khoo VS, Parker CC, Woodhouse CJ, Thompson A, Christmas T, Ogden C, Fisher C, Jamieson C, Cooper CS, English DR, Hopper JL, Neal DE, Easton DF. Multiple newly identified loci associated with prostate cancer susceptibility. Nature Genetics. 2008;40:316–321. doi: 10.1038/ng.90.
  • 23. Eeles RA, Kote-Jarai Z, Olama AAA, Giles GG, Guy M, Severi G, Muir K, Hopper JL, Henderson BE, Haiman CA, Schleutker J, Hamdy FC, Neal DE, Donovan JL, Stanford JL, Ostrander EA, Ingles SA, John EM, Thibodeau SN, Schaid D, Park JY, Spurdle A, Clements J, Dickinson JL, Maier C, Vogel W, Dork T, Rebbeck TR, Cooney KA, Cannon-Albright L, Chappuis PO, Hutter P, Zeegers M, Kaneva R, Zhang HW, Lu YJ, Foulkes WD, English DR, Leongamornlert DA, Tymrakiewicz M, Morrison J, Ardern-Jones AT, Hall AL, O’Brien LT, Wilkinson RA, Saunders EJ, Page EC, Sawyer EJ, Edwards SM, Dearnaley DP, Horwich A, Huddart RA, Khoo VS, Parker CC, Van As N, Woodhouse CJ, Thompson A, Christmas T, Ogden C, Cooper CS, Southey MC, Lophatananon A, Liu JF, Kolonel LN, Le Marchand L, Wahlfors T, Tammela TL, Auvinen A, Lewis SJ, Cox A, FitzGerald LM, Koopmeiners JS, Karyadi DM, Kwon EM, Stern MC, Corral R, Joshi AD, Shahabi A, McDonnell SK, Sellers TA, Pow-Sang J, Chambers S, Aitken J, Gardiner RA, Batra J, Kedda MA, Lose F, Polanowski A, Patterson B, Serth J, Meyer A, Luedeke M, Stefflova K, Ray AM, Lange EM, Farnham J, Khan H, Slavov C, Mitkova A, Cao GW, Easton DF. Identification of seven new prostate cancer susceptibility loci through a genome-wide association study. Nature Genetics. 2009;41:1116–1121. doi: 10.1038/ng.450.
  • 24. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
  • 25. Chen HY. A semiparametric odds ratio model for measuring association. Biometrics. 2007;63:413–421. doi: 10.1111/j.1541-0420.2006.00701.x.
  • 26. Chen HY. Compatibility of conditionally specified models. Statistics & Probability Letters. 2010;80:670–677. doi: 10.1016/j.spl.2009.12.025.
