Author manuscript; available in PMC: 2010 Apr 1.
Published in final edited form as: Genet Epidemiol. 2009 Apr;33(3):256–265. doi: 10.1002/gepi.20377

Proper Analysis of Secondary Phenotype Data in Case-Control Association Studies

D Y Lin 1, D Zeng 1
PMCID: PMC2684820  NIHMSID: NIHMS94335  PMID: 19051285

Abstract

Case-control association studies often collect extensive information on secondary phenotypes, which are quantitative or qualitative traits other than the case-control status. Exploring secondary phenotypes can yield valuable insights into biological pathways and identify genetic variants influencing phenotypes of direct interest. All publications on secondary phenotypes have used standard statistical methods, such as least-squares regression for quantitative traits. Because of unequal selection probabilities between cases and controls, the case-control sample is not a random sample from the general population. As a result, standard statistical analysis of secondary phenotype data can be extremely misleading. Although one may avoid the sampling bias by analyzing cases and controls separately or by including the case-control status as a covariate in the model, the associations between a secondary phenotype and a genetic variant in the case and control groups can be quite different from the association in the general population. In this article, we present novel statistical methods that properly reflect the case-control sampling in the analysis of secondary phenotype data. The new methods provide unbiased estimation of genetic effects and accurate control of false-positive rates while maximizing statistical power. We demonstrate the pitfalls of the standard methods and the advantages of the new methods both analytically and numerically. The relevant software is available at our website.

Keywords: case-control sampling, complex diseases, genomewide association studies, linear regression, maximum likelihood, meta-analysis, quantitative traits, secondary traits, SNPs

INTRODUCTION

There is a proliferation of genomewide association (GWA) studies worldwide. These studies usually employ the case-control design, which consists of a sample of cases (i.e., diseased individuals) and a sample of controls (i.e., disease-free individuals). Most GWA studies measure a variety of quantitative or qualitative traits other than the disease trait that defines the case-control status. Exploring these secondary phenotypes can discover genetic variants influencing previously unstudied phenotypes and provide important clues about causal pathways. Although the main interest of case-control association studies lies in the comparison of cases and controls, analysis of secondary phenotype data may supplement the case-control comparison in the initial reports or become the primary focus of subsequent publications.

Indeed, recent months have seen an explosion of publications on genetic variants influencing human quantitative traits, such as height [Weedon et al., 2007; Sanna et al., 2008; Weedon et al., 2008; Lettre et al., 2008; Gudbjartsson et al., 2008], body mass index (BMI) [Frayling et al., 2007; Loos et al., 2008], and lipid levels [Saxena et al., 2007; Willer et al., 2008; Kathiresan et al., 2008]. The data for those publications came mostly from case-control association studies of complex diseases (e.g., diabetes, cancer and hypertension). The most recent publications were based on meta-analysis of multiple GWA studies involving thousands or tens of thousands of individuals.

All publications on quantitative traits have relied on standard linear regression analysis (i.e., classical least-squares estimation under the linear model). Five types of analysis have been conducted to assess the effects of SNPs on quantitative traits using data from case-control association studies: (1) controls only; (2) cases only; (3) combined sample of cases and controls; (4) meta-analysis of cases and controls; (5) joint analysis of cases and controls adjusted for the disease status. Methods (1) and (2) are restricted to controls and cases, respectively. Method (3) analyzes cases and controls together and ignores the disease status. In method (4), cases and controls are analyzed separately and the results are combined by a meta-analytic procedure. Method (5) analyzes cases and controls together and includes the disease status as a covariate in the model.

None of the aforementioned analysis methods is statistically correct. Because cases and controls are selected at different rates from their respective subpopulations, the case-control sample does not constitute a random sample of the general population. As a result, the population association between a SNP and a secondary trait can be distorted in the case-control sample. Thus, method (3) can be grossly misleading, especially when the secondary trait is strongly correlated with the disease and the sampling rates are very different between cases and controls. The other four methods are conditional on the disease status and are thus unaffected by the biased case-control sampling. However, the associations between a SNP and a secondary trait in the case and control groups can be quite different from the association in the general population if the secondary trait is correlated with the disease.

To illustrate the above points, we consider a dichotomous secondary trait taking the values 0 and 1 with equal probability and a SNP with minor allele frequency (MAF) of 0.2. Assume that the disease rate is 10%, the odds ratio of disease with the SNP is 1.5 under the dominant mode of inheritance, and the odds ratio of disease with the secondary trait is 2. If the SNP is unrelated to the secondary trait in the general population (i.e., the odds ratio is 1), then the odds ratios of the secondary trait with the SNP will be approximately 0.975 in the case and control groups, and the odds ratio in the combined sample of cases and controls with an equal number of cases and controls will be approximately 1.045; see Table I for details. In other words, there exist (spurious) associations between the SNP and the secondary trait in the case and control groups, as well as in the combined sample, despite the absence of association at the population level. This phenomenon implies that methods (1), (2) and (3) are all biased; methods (4) and (5) are also biased because they combine the biased results from the case and control groups. Because GWA studies typically have large sample sizes, even modest levels of bias can lead to grossly inflated false-positive rates, especially in meta-analysis.

TABLE I. Probability distributions of disease status, genotype score and secondary trait in a case-control setting

Group                 Genotype score   Secondary trait = 0   Secondary trait = 1   Odds ratio
General population    0                0.32                  0.32                  1
                      1                0.18                  0.18
Cases                 0                0.0192                0.0362                0.975
                      1                0.0157                0.0289
Controls              0                0.3008                0.2838                0.975
                      1                0.1643                0.1511
Case-control sample   0                0.2631                0.3387                1.045
                      1                0.1698                0.2285
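
The entries of Table I can be checked numerically. The following is a minimal sketch, not the authors' software: it assumes the logistic disease model with log odds ratios log 1.5 (SNP, dominant coding) and log 2 (secondary trait), finds the intercept that gives a 10% disease rate by bisection, and then forms the case, control and equal-allocation sample distributions. All variable names are illustrative. The case and control cells are joint probabilities with disease status, so they sum to 0.1 and 0.9, respectively.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

# Population distribution of (X, Y): dominant genotype score X and a
# dichotomous trait Y that is independent of X (population odds ratio 1).
maf = 0.2
p_x1 = 1.0 - (1.0 - maf) ** 2            # P(X = 1) = 0.36
p_xy = {(x, y): (p_x1 if x else 1.0 - p_x1) * 0.5
        for x in (0, 1) for y in (0, 1)}

g1, g2 = math.log(1.5), math.log(2.0)     # log odds ratios of disease

def disease_rate(g0):
    return sum(p * expit(g0 + g1 * x + g2 * y) for (x, y), p in p_xy.items())

# Bisection for the intercept that gives a 10% disease rate.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if disease_rate(mid) < 0.10:
        lo = mid
    else:
        hi = mid
g0 = 0.5 * (lo + hi)

# Joint probabilities P(D = 1, X, Y) and P(D = 0, X, Y).
cases = {k: p * expit(g0 + g1 * k[0] + g2 * k[1]) for k, p in p_xy.items()}
controls = {k: p_xy[k] - cases[k] for k in p_xy}
# Sample with equal numbers of cases and controls.
sample = {k: 0.5 * cases[k] / 0.10 + 0.5 * controls[k] / 0.90 for k in p_xy}

def odds_ratio(t):
    return t[(0, 0)] * t[(1, 1)] / (t[(0, 1)] * t[(1, 0)])

or_cases = odds_ratio(cases)
or_controls = odds_ratio(controls)
or_sample = odds_ratio(sample)
print(or_cases, or_controls, or_sample)
```

The three odds ratios should agree with the values reported in Table I up to rounding.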

The fact that method (3) is biased seems to contradict the universally accepted practice of performing standard logistic regression on case-control data. Standard logistic regression analysis of case-control data indeed yields correct maximum likelihood estimates of odds ratios, although the estimate of the disease rate is generally biased [Prentice and Pyke, 1979]. This remarkable result, however, applies only to the primary disease outcome that defines the case-control status, and not to a secondary outcome that is correlated with the disease.

We have developed valid and efficient statistical methods to analyze secondary phenotype data in case-control association studies. Our methods are based on likelihood functions that properly reflect the case-control sampling. The corresponding maximum likelihood estimates are approximately unbiased and normally distributed. Furthermore, the estimates are statistically efficient in that they have the smallest variances among all valid estimates, and the corresponding association tests are the most powerful among all valid tests.

In the next section, we describe in more detail the ingredients of the new methods and the pitfalls of standard methods, particularly methods (1)-(5). In the subsequent Results section, we use Monte Carlo simulation to quantify the bias of standard methods in common situations and to evaluate the operating characteristics of the new methods. We relegate the technical details about the new and standard methods to Appendices A and B, respectively.

METHODS

NEW METHODS

Let D denote the case-control status (1 = disease; 0 = no disease) and Y denote the secondary phenotype. Also, let X denote the genotype score for a SNP of interest. Under the additive mode of inheritance, X is the number of minor alleles; under the dominant (recessive) model, X indicates, by the values 1 versus 0, whether or not the individual carries at least one minor allele (two minor alleles). We use a generalized linear model to formulate the effects of X on Y, and write the conditional density of Y given X as P(Y|X). If Y is a quantitative trait, we use the linear regression model, which specifies that the conditional distribution of Y given X is normal with mean β0 + β1X and variance σ². If Y is a dichotomous trait, we use the logistic regression model, under which

$$P(Y=1\mid X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}.$$

We relate Y and X to D through the logistic regression model

$$P(D=1\mid X,Y)=\frac{e^{\gamma_0+\gamma_1X+\gamma_2Y}}{1+e^{\gamma_0+\gamma_1X+\gamma_2Y}}.$$

We are mainly interested in β1.

For a case-control study with a total of n subjects, the data consist of (Di, Yi, Xi) (i = 1,…, n). Because the sampling is conditional on the case-control status, the likelihood function takes the retrospective form $\prod_{i=1}^{n}P(Y_i,X_i\mid D_i)$, which is

$$\prod_{i=1}^{n}\left\{\frac{P(D_i=1\mid X_i,Y_i)\,P(Y_i\mid X_i)\,P(X_i)}{P(D_i=1)}\right\}^{D_i}\left\{\frac{P(D_i=0\mid X_i,Y_i)\,P(Y_i\mid X_i)\,P(X_i)}{P(D_i=0)}\right\}^{1-D_i},$$

where P(Di = 1) = ΣyΣx P(Di = 1|x, y)P (y|x)P(x), P(Di = 0) = 1 − P(Di = 1), and P(Di = 0|Xi, Yi) = 1 − P(Di = 1|Xi, Yi). We maximize this function by the Newton-Raphson algorithm. Likelihood-based statistics (i.e., Wald, score and likelihood-ratio statistics) can be used to make inference about the parameter of main interest β1.
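
As an illustration only (not the authors' software), the retrospective likelihood can be evaluated directly when both Y and X are discrete. The sketch below assumes a dichotomous trait, an additive genotype score in {0, 1, 2} with Hardy-Weinberg frequencies, and hypothetical parameter values; the function names `joint_given_d` and `retrospective_loglik` are invented for this example.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def joint_given_d(d, y, x, beta, gamma, q):
    """P(Y=y, X=x | D=d) for dichotomous Y and genotype score x in {0, 1, 2}.

    beta = (b0, b1) for the trait model, gamma = (g0, g1, g2) for the
    disease model, q[x] = P(X = x). All inputs are illustrative.
    """
    def p_d1(xx, yy):
        return expit(gamma[0] + gamma[1] * xx + gamma[2] * yy)

    def p_y(yy, xx):
        p1 = expit(beta[0] + beta[1] * xx)
        return p1 if yy == 1 else 1.0 - p1

    # P(D = 1) = sum over y and x of P(D=1|x,y) P(y|x) P(x)
    rate = sum(p_d1(xx, yy) * p_y(yy, xx) * q[xx]
               for xx in range(3) for yy in (0, 1))
    num = p_y(y, x) * q[x] * (p_d1(x, y) if d == 1 else 1.0 - p_d1(x, y))
    return num / (rate if d == 1 else 1.0 - rate)

def retrospective_loglik(data, beta, gamma, q):
    """Sum of log P(Y_i, X_i | D_i) over records (d, y, x)."""
    return sum(math.log(joint_given_d(d, y, x, beta, gamma, q))
               for d, y, x in data)

maf = 0.3
q = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
beta, gamma = (0.0, 0.2), (-3.0, math.log(1.3), math.log(2.0))
toy = [(1, 1, 2), (1, 0, 1), (0, 0, 0), (0, 1, 0)]   # hypothetical records
loglik = retrospective_loglik(toy, beta, gamma, q)
```

In practice this function would be passed to a numerical optimizer over β, γ and the genotype frequencies; for a quantitative trait the sum over y becomes an integral.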

The estimation of γ0 may be unstable, especially for dichotomous Y. When the disease is rare such that P(D = 1|X, Y) ≈ exp(γ0 + γ1X + γ2Y), the parameter γ0 cancels in the numerator and denominator of the likelihood function. When the disease is not rare but the disease rate is known approximately, we can incorporate the information about the disease rate into the estimation. We can include environmental covariates in the models for Y and for D, but then the probability distribution of continuous environmental covariates will enter into the likelihood function as a high-dimensional nuisance parameter. We eliminate such nuisance parameters through the profile-likelihood approach. Interested readers are referred to Appendix A for the theoretical and computational details of the new methods.

STANDARD METHODS

As described in the Introduction section, standard statistical methods have been applied to the secondary phenotype data from case-control studies in five different ways, which we will refer to as methods (1)-(5). Methods (1)-(3) are based on the prospective likelihood function $\prod_i P(Y_i\mid X_i)$, where the product is taken over the controls, the cases or all study subjects. The maximization of this function yields the classical least-squares estimation for quantitative traits and the standard logistic regression for dichotomous traits. Method (4) combines the least-squares or odds ratio estimates of the case and control samples through the inverse-variance meta-analysis procedure. Method (5) is based on the prospective likelihood function $\prod_{i=1}^{n} P(Y_i\mid X_i, D_i)$, which is a parametric way of combining the results of the case and control samples.
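
For concreteness, the inverse-variance combination used in method (4) can be sketched as follows. This is the standard fixed-effect formula; the function name and the two input estimates are hypothetical and serve only to show the arithmetic.

```python
import math

def inverse_variance_combine(estimates, std_errors):
    """Fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

# Hypothetical controls-only and cases-only slope estimates.
pooled, pooled_se = inverse_variance_combine([-0.10, -0.14], [0.04, 0.04])
```

Because the pooled estimate is a weighted average of the group-specific estimates, any bias shared by the case and control analyses carries over to the combined result, while the smaller pooled standard error makes that bias easier to detect as a spurious signal.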

We have conducted a thorough investigation into the properties of the five standard methods. We state here the main conclusions while relegating the details to Appendix B. If the secondary phenotype is not related to the case-control status, or more precisely, D is independent of Y given X (i.e., γ2 = 0), then all five methods are valid. If the SNP is not associated with the case-control status, or more precisely, D is independent of X given Y (i.e., γ1 = 0), then all five methods yield correct estimates of odds ratios for dichotomous traits, but the least-squares estimates for quantitative traits produced by the five methods are biased unless β1 = 0 or γ2 = 0. When the disease is rare in that the probability of disease is virtually 0, all standard methods except (3) are approximately valid.

The fact that standard methods are generally invalid unless γ1 = 0 or γ2 = 0 is disconcerting. Most secondary phenotypes are strongly correlated with the case-control status, so that γ2 ≠ 0. Thus, any SNPs that are associated with the case-control status will tend to be detected as being associated with secondary phenotypes by standard methods even when the latter associations do not exist. When the associations truly exist, all five methods may produce estimates that are biased toward the null and thus reduce statistical power.

RESULTS

We conducted extensive simulation studies to assess the performance of the standard and new methods in the analysis of secondary quantitative traits. We considered a SNP with MAF of 0.3 and additive mode of inheritance. For the model of the secondary quantitative trait, we set β0 = σ² = 1, and let β1 = 0 and −0.12 under the null and alternative hypotheses, respectively; β1 = −0.12 means that each copy of the minor allele decreases the trait value by 12% of its standard deviation. For the disease-risk model, we set γ2 = log 2, varied the value of γ1 from 0 to log 2, and chose the value of γ0 to yield a disease rate of 1% or 5%. Note that γ1 pertains to the change in the log odds ratio of disease for each copy of the minor allele of the SNP and γ2 to the change in the log odds ratio of disease for a one-standard-deviation increase of the quantitative trait. The choice of γ2 = log 2 represents a strong, but not uncommon, association between the secondary phenotype and the disease. For each combination of simulation parameters, we generated 1,000,000 data sets with 1,000 cases and 1,000 controls. We compared the new method to standard methods (1)-(3), i.e., the least-squares methods based on the data from the control group, the case group, and the combined sample of cases and controls. We focused on methods (1)-(3) because methods (4) and (5) are closely related to (1) and (2).
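
The simulation design can be mimicked on a small scale. The sketch below is an assumption-laden illustration rather than the code used for the reported results: it draws subjects from the population models with β0 = σ² = 1 and β1 = 0, uses a hypothetical intercept γ0 = −4 instead of calibrating γ0 to a target disease rate, takes γ1 = log 2 (the upper end of the range considered), and uses one large data set of 8,000 cases and 8,000 controls instead of many replicates of 1,000 each.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def simulate_case_control(n_cases, n_controls, beta1, g0, g1, g2,
                          maf=0.3, rng=None):
    """Draw (d, x, y) from the population and keep subjects until both
    the case and control quotas are filled."""
    rng = rng or random.Random()
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        x = (rng.random() < maf) + (rng.random() < maf)   # additive score
        y = rng.gauss(1.0 + beta1 * x, 1.0)                # beta0 = sigma^2 = 1
        d = rng.random() < expit(g0 + g1 * x + g2 * y)
        if d and len(cases) < n_cases:
            cases.append((1, x, y))
        elif not d and len(controls) < n_controls:
            controls.append((0, x, y))
    return cases, controls

def ls_slope(records):
    """Least-squares slope of y on x."""
    n = len(records)
    mx = sum(r[1] for r in records) / n
    my = sum(r[2] for r in records) / n
    sxy = sum((r[1] - mx) * (r[2] - my) for r in records)
    sxx = sum((r[1] - mx) ** 2 for r in records)
    return sxy / sxx

rng = random.Random(2009)
cases, controls = simulate_case_control(
    8000, 8000, beta1=0.0, g0=-4.0, g1=math.log(2.0), g2=math.log(2.0), rng=rng)
slope_controls = ls_slope(controls)          # method (1)
slope_cases = ls_slope(cases)                # method (2)
slope_combined = ls_slope(cases + controls)  # method (3)
```

Under these settings the true slope is zero, so any systematic departure of the three least-squares slopes from zero reflects the sampling bias discussed above; the combined-sample slope is expected to be the most affected.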

We summarize the key results in Figures 1 and 2. Figure 1 shows the biases of effect estimates and the coverage probabilities of 99% confidence intervals for β1 as a function of the odds ratio of disease with the SNP (i.e., exp(γ1)) under the alternative hypothesis (i.e., β1 = −0.12). Figure 2 displays the type I error and power for testing the null hypothesis of β1 = 0 at the nominal significance level of 1%. We restricted the horizontal axis to 1.5 because the odds ratios that have been observed in GWA studies thus far are mostly less than 1.5, although the odds ratios may be higher in candidate-gene studies.

Figure 1.

Relative biases of effect estimates and coverage probabilities of 99% confidence intervals for four analysis methods: least-squares methods based on controls only, cases only and combined sample of cases and controls versus the new method. The relative bias is the bias divided by the effect size. The odds ratio of disease with the SNP (i.e., exp(γ1)) was varied from 1 to 1.5 with 0.1 increment. Each result is based on 1,000,000 simulated data sets. For disease rate of 1%, least-squares methods based on controls only and cases only and the new method all yield coverage probabilities of virtually 99% at all values of the odds ratio of disease with the SNP, so the three curves are indistinguishable.

Figure 2.

Type I error rates and power of association tests at the 1% nominal significance level for four analysis methods: least-squares methods based on controls only, cases only and combined sample of cases and controls versus the new method. The odds ratio of disease with the SNP (i.e., exp(γ1)) was varied from 1 to 1.5 with 0.1 increment. Each result is based on 1,000,000 simulated data sets. For disease rate of 1%, least-squares methods based on controls only and cases only and the new method all yield type I error rates of virtually 1% at all values of the odds ratio of disease with the SNP, so the three curves are indistinguishable.

The new method performs very well. The effect estimates are virtually unbiased, and the variance estimates accurately reflect the true variations (latter data not shown). As a result, the confidence intervals have proper coverage probabilities, and the association tests have correct type I error rates. The power of the new method is always above 80%.

The least-squares method based on the combined sample of cases and controls, i.e., method (3), can be very wrong, especially when the SNP is strongly related to the disease. The effect estimates are biased, the confidence intervals have poor coverage probabilities, and the type I error is inflated. The problems associated with this method are more severe for rarer diseases. The type I error rates are 8 times and 5 times the nominal significance level under the disease rates of 1% and 5%, respectively, when the odds ratio of disease with the SNP is 1.3. This strategy may be more powerful or less powerful than the new method, depending on how the SNP affects the disease status and the secondary phenotype. Figure 2 shows that this strategy can be substantially less powerful than the new method.

When the disease is rare, say less than 1%, the least-squares methods based on controls only and cases only, i.e., methods (1) and (2), are appropriate in that the effect estimates are approximately unbiased, the confidence intervals have adequate coverage probabilities, and the association tests have reasonable type I error rates. These two methods, however, use half of the study subjects and are thus much less powerful than the new method. For relatively common diseases, methods (1) and (2) yield biased effect estimates, improper confidence intervals and inflated type I error, especially when the SNP is strongly related to the disease. When the disease rate is 5% and the odds ratio of disease with the SNP is 1.5, the type I error rates for methods (1) and (2) are about 1.3% and 1.8%, respectively.

All the aforementioned results pertain to MAF of 0.3, γ2 of log 2, 1,000 cases and 1,000 controls, and nominal significance level of 1%. We also considered other combinations of simulation parameters. The new methods continued to perform well. The performance of standard methods became worse as MAF, γ2 or sample size was increased and as the nominal significance level was lowered. In addition, the performance of methods (1) and (2) became worse as the disease rate was increased.
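
The dependence on sample size has a simple explanation: a fixed bias is measured against a standard error that shrinks roughly like 1/√n. The sketch below computes the rejection probability of a two-sided Wald test when the estimator is approximately normal with a given bias. The bias value and the constant in the standard error are hypothetical and are not taken from the simulations reported here.

```python
from statistics import NormalDist

def wald_rejection_rate(bias, std_error, alpha):
    """P(|Z + bias/std_error| > z_{1-alpha/2}) for Z ~ N(0, 1)."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha / 2.0)
    shift = bias / std_error
    return nd.cdf(-z - shift) + 1.0 - nd.cdf(z - shift)

alpha = 0.01
inflation = []
for n in (2000, 8000, 32000):
    se = 1.5 / n ** 0.5          # hypothetical standard error ~ 1/sqrt(n)
    inflation.append(wald_rejection_rate(0.02, se, alpha) / alpha)
print(inflation)
```

With zero bias the function returns the nominal level; with a nonzero bias the ratio to the nominal level grows as n increases, and it grows faster at more stringent significance levels.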

Figure 3 displays the type I error rates of the five standard methods for MAF of 0.2, disease rate of 7%, 2,000 cases and 2,000 controls, and nominal significance level of 10⁻⁶. The type I error rates for all five methods increase rapidly with increasing values of γ1 and γ2 and are seriously inflated even with moderate values of γ1 and γ2. Under γ1 = log 1.3 and γ2 = 0.5, the type I error rates of methods (1)–(5) are, respectively, 1.7, 2.4, 20, 3.2 and 3.2 times the nominal significance level; under γ1 = log 1.2 and γ2 = 1, the five type I error rates are 1.9, 6.2, 25, 6.7 and 7 times the nominal significance level. Controls-only analysis has the least inflation of type I error, followed by cases-only analysis. Meta-analysis of cases and controls has higher type I error than cases-only and controls-only analyses, mainly because it has twice the sample size. The joint analysis of cases and controls with the disease status as a covariate in the linear model has slightly higher inflation of type I error than the meta-analysis. The analysis of the combined sample without adjusting for the disease status, i.e., method (3), performs much worse than the other four methods.

Figure 3.

Type I error rates (×10⁶) at the nominal significance level of 10⁻⁶ for five linear regression methods: (1) controls only; (2) cases only; (3) combined sample of cases and controls; (4) meta-analysis of cases and controls; (5) joint analysis of cases and controls with adjustment for the disease status. The odds ratio of disease for the SNP (i.e., exp(γ1)) was varied from 1 to 1.5 with 0.1 increment. Trait effect on disease (γ2) pertains to the change in the log odds ratio of disease for a one-standard-deviation increase in the quantitative trait. Each result is based on 100,000,000 simulated data sets. For γ2 of 0.25, the type I error rates are virtually identical between methods (3) and (5), so the two curves are indistinguishable. The nominal significance level of 10⁻⁶ is indicated by the black line in each panel.

DISCUSSION

The purpose of this article is two-fold: to evaluate the standard methods and to develop better methods for the analysis of secondary phenotype data in case-control association studies. We have demonstrated both analytically and numerically that all the standard methods can have severely inflated type I error and reduced power in practical situations. The new methods provide accurate control of the type I error and have the highest efficiency/power among all valid methods. We have developed efficient numerical algorithms to implement the new methods and posted the software online. With our software, analysis of a GWA study (with 500K-1,000K SNPs) can be completed in a few hours.

As mentioned in the Introduction section, the recent publications on human height, BMI and lipid levels relied on standard linear regression analysis of quantitative trait data from case-control studies of complex diseases. We are not challenging the conclusions of those publications because some of the initial results were validated in cross-sectional or cohort data, but the effect estimates and P-values reported in the papers might be questionable. Using valid and efficient statistical methods in the initial scans would reduce the number of false positives and increase the number of true positives among the initial results and thus enhance the success rates of validation efforts. It would be even more important to use proper statistical methods in any validation analysis.

When the disease is rare, all standard methods except (3) are approximately valid. Our simulation studies revealed that this assumption is problematic when the disease rate is appreciably higher than 1%. As shown in Figures 1 and 2, methods (1) and (2) have serious problems when the disease rate is 5%. We recommend using the rare disease assumption only when the disease rate is less than 2%.

Many of the complex diseases currently under study are relatively common (5-10%), so the results of Figure 3 are particularly relevant. If the disease is type 2 diabetes, then γ2 is close to 1 for BMI and triglyceride levels and close to 0 for height. Thus, standard linear regression analysis of BMI and triglyceride levels in case-control studies of type 2 diabetes would be more biased than that of height. The most recent publications on quantitative traits were based on meta-analyses of tens of thousands of subjects, so the inflation of type I error would be much more profound than what is seen in Figure 3.

Although only a small fraction of SNPs are truly associated with a complex disease, any SNPs that are associated with the disease in the observed data (mostly by chance) will tend to be spuriously associated with secondary traits in the case and control groups as well as in the combined sample; therefore, all five standard methods can cause large-scale increases of false positive results in GWA studies. Indeed, the quantile-quantile plots in several publications [Weedon et al., 2007; Weedon et al., 2008; Gudbjartsson et al., 2008] showed substantial deviations of observed statistics from the expected (after correcting for population stratification), even at the 0.01 level of significance.

Acknowledgments

This research was supported by the National Institutes of Health. The authors thank Chad He for his assistance in preparing the figures.

APPENDIX A: THEORETICAL AND COMPUTATIONAL ASPECTS OF NEW STATISTICAL METHODS

We provide theoretical and computational details for the new statistical methods. In the main text of this article, X pertains to a single SNP. In this appendix, we expand X to contain all genetic and environmental factors of interest (including gene-environment interactions). We use Pθ(y|x) to denote the conditional density of Y given X = x, which is formulated through a parametric model with a set of parameters θ. We assume that

$$P(D=1\mid X,Y)=\frac{e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}{1+e^{\gamma_0+\gamma_1^TX+\gamma_2Y}},$$

where γ1 is now a vector.

The data consist of (Di, Yi, Xi) (i = 1,…, n). The likelihood function is

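$$\prod_{i=1}^{n}\left\{\frac{P_\theta(Y_i\mid X_i)\,P(X_i)\,e^{\gamma_0+\gamma_1^TX_i+\gamma_2Y_i}/(1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2Y_i})}{P(D_i=1)}\right\}^{D_i}\times\prod_{i=1}^{n}\left\{\frac{P_\theta(Y_i\mid X_i)\,P(X_i)/(1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2Y_i})}{P(D_i=0)}\right\}^{1-D_i},$$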

where P(x) is the density of X, and

$$P(D=0)=\int_{y,x}\frac{P_\theta(y\mid x)\,P(x)}{1+e^{\gamma_0+\gamma_1^Tx+\gamma_2y}}\,dy\,dx.$$

For discrete components, the integration is replaced by the summation.

Before deriving estimation methods, we need to determine which parameters are estimable or identifiable. We wish to show that two sets of parameters {γ0, γ1, γ2, θ, P(x)} and {γ0*, γ1*, γ2*, θ*, P*(x)} are identical if they yield the same likelihood. It is natural to assume that Pθ(Y|X) = Pθ*(Y|X) if and only if θ = θ* and that the data matrix for (1, X, Y) is of full rank.

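We first consider the case in which the disease is rare such that P(D = 1|X, Y) ≈ exp(γ0 + γ1ᵀX + γ2Y) and P(D = 0|X, Y) ≈ 1. Then the likelihood function becomes

$$\prod_{i=1}^{n}\left\{\frac{P_\theta(Y_i\mid X_i)\,P(X_i)\,e^{\gamma_1^TX_i+\gamma_2Y_i}}{\int_{y,x}P_\theta(y\mid x)\,P(x)\,e^{\gamma_1^Tx+\gamma_2y}\,dy\,dx}\right\}^{D_i}\times\prod_{i=1}^{n}\left\{P_\theta(Y_i\mid X_i)\,P(X_i)\right\}^{1-D_i}.$$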

We let D = 0 to obtain Pθ(Y|X)P(X) = Pθ*(Y|X)P*(X), which implies P(X) = P*(X) and θ = θ*. We let D = 1 to obtain γ1ᵀX + γ2Y = γ1*ᵀX + γ2*Y + c, where c is a constant. Because the data matrix for (1, X, Y) is of full rank, this equation yields γ1 = γ1* and γ2 = γ2* (implying c = 0). Thus, all parameters are identifiable.

Next, we consider the case in which the disease rate is known. Since P(D = 1) is known,

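$$P_\theta(Y\mid X)\,P(X)\,\frac{e^{D(\gamma_0+\gamma_1^TX+\gamma_2Y)}}{1+e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}=P_{\theta^*}(Y\mid X)\,P^*(X)\,\frac{e^{D(\gamma_0^*+\gamma_1^{*T}X+\gamma_2^*Y)}}{1+e^{\gamma_0^*+\gamma_1^{*T}X+\gamma_2^*Y}}.$$

Summing over D = 0 and 1 yields Pθ(Y|X)P(X) = Pθ*(Y|X)P*(X), which implies θ = θ* and P(X) = P*(X). This further implies γ0 + γ1ᵀX + γ2Y = γ0* + γ1*ᵀX + γ2*Y. Hence, all parameters are identifiable.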

The remaining case is when the disease is not rare and the disease rate is unknown. It can be shown that γ1 = γ1*, γ2 = γ2*, and

$$P_\theta(y\mid x)\,P(x)=c_0\,P_{\theta^*}(y\mid x)\,P^*(x)\,\frac{1+e^{\gamma_0+\gamma_1^Tx+\gamma_2y}}{1+e^{\gamma_0^*+\gamma_1^Tx+\gamma_2y}},$$

where c0 is a constant [Roeder et al., 1996]. In most cases, the above equation implies θ = θ*. (In particular, we can show that θ = θ* for continuous traits satisfying the linear regression model; this is also true for dichotomous traits satisfying the logistic regression model if γ2 = 0 or if γ2 ≠ 0 and there is a continuous component of X that is related to D.) Then

$$P(x)=c_0\,P^*(x)\,\frac{1+e^{\gamma_0+\gamma_1^Tx+\gamma_2y}}{1+e^{\gamma_0^*+\gamma_1^Tx+\gamma_2y}}.$$

If γ2 ≠ 0, then the above equation clearly implies γ0 = γ0* and P(x) = P*(x), so all parameters are identifiable. If γ2 = 0, we have

$$P(x)=c_0\,P^*(x)\,\frac{1+e^{\gamma_0+\gamma_1^Tx}}{1+e^{\gamma_0^*+\gamma_1^Tx}},$$

so γ1, γ2 and θ are identifiable while P(x)/(1 + exp(γ0 + γ1ᵀx)) is identifiable up to some constant.

In all cases, we estimate the parameters by maximizing the likelihood functions. Because P(x) is potentially high-dimensional, we use the profile likelihood approach. We provide estimation procedures separately for the three cases discussed above.

We first consider the rare-disease case. Write pi = P (Xi). By differentiating the log-likelihood function with respect to pi, we obtain

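$$\frac{1}{p_i}-n_1\,\frac{\int_yP_\theta(y\mid X_i)\,e^{\gamma_1^TX_i+\gamma_2y}\,dy}{\int_{y,x}P_\theta(y\mid x)\,P(x)\,e^{\gamma_1^Tx+\gamma_2y}\,dy\,dx}-\lambda=0,$$

where λ is the Lagrange multiplier for the constraint Σi pi = 1, and n1 is the number of cases. Multiplying the above equation by pi and summing over i, we see λ = n − n1. Thus,

$$p_i=\frac{1}{n-n_1+n_1\,\xi\int_yP_\theta(y\mid X_i)\,e^{\gamma_1^TX_i+\gamma_2y}\,dy},$$

where

$$\xi=\left\{\int_{y,x}P_\theta(y\mid x)\,P(x)\,e^{\gamma_1^Tx+\gamma_2y}\,dy\,dx\right\}^{-1}.$$

By the arguments of Lin and Zeng [2006], maximizing the likelihood function is equivalent to maximizing

$$\sum_{i=1}^{n}\left\{\log P_\theta(Y_i\mid X_i)+D_i(\gamma_1^TX_i+\gamma_2Y_i)\right\}+n_1\log\xi-\sum_{i=1}^{n}\log\left\{1-\frac{n_1}{n}+\frac{n_1}{n}\,\xi\int_yP_\theta(y\mid X_i)\,e^{\gamma_1^TX_i+\gamma_2y}\,dy\right\},$$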

which is called the profile log-likelihood function for γ1, γ2, θ and ξ.
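
For a quantitative trait following the normal linear model with a scalar X, the integral in the rare-disease profile log-likelihood has the closed form exp{γ1x + γ2(β0 + β1x) + γ2²σ²/2}, which is the moment generating function of the normal distribution. The sketch below evaluates the profile log-likelihood with this closed form and checks it against a midpoint-rule integration. It is an illustrative implementation under these assumptions, with invented function names and toy data, not the authors' software.

```python
import math

def tilt_integral(x, b0, b1, sigma2, g1, g2):
    """Closed form of the integral of N(y; b0 + b1*x, sigma2) * exp(g1*x + g2*y) dy."""
    return math.exp(g1 * x + g2 * (b0 + b1 * x) + 0.5 * g2 ** 2 * sigma2)

def profile_loglik_rare(data, b0, b1, sigma2, g1, g2, xi):
    """Rare-disease profile log-likelihood for records (d, y, x), scalar x."""
    n = len(data)
    n1 = sum(d for d, _, _ in data)
    total = n1 * math.log(xi)
    for d, y, x in data:
        log_density = (-0.5 * math.log(2.0 * math.pi * sigma2)
                       - (y - b0 - b1 * x) ** 2 / (2.0 * sigma2))
        total += log_density + d * (g1 * x + g2 * y)
        total -= math.log(1.0 - n1 / n
                          + (n1 / n) * xi * tilt_integral(x, b0, b1, sigma2, g1, g2))
    return total

def tilt_integral_numeric(x, b0, b1, sigma2, g1, g2, steps=20000):
    """Midpoint-rule check of the closed form over mean +/- 12 sd."""
    mean, sd = b0 + b1 * x, math.sqrt(sigma2)
    lo, width = mean - 12.0 * sd, 24.0 * sd / steps
    acc = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * width
        dens = math.exp(-(y - mean) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
        acc += dens * math.exp(g1 * x + g2 * y) * width
    return acc

# Hypothetical records (d, y, x) and parameter values.
toy = [(1, 2.1, 1), (1, 1.7, 2), (0, 0.6, 0), (0, 1.2, 1), (0, 0.9, 0)]
value = profile_loglik_rare(toy, 1.0, -0.12, 1.0, math.log(1.3), math.log(2.0), xi=0.5)
```

Here ξ is treated as a free parameter to be maximized jointly with γ1, γ2 and θ, as in the profile log-likelihood above.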

When the disease rate is known, we maximize

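$$\prod_{i=1}^{n}\left\{P_\theta(Y_i\mid X_i)\,p_i\,\frac{e^{D_i(\gamma_0+\gamma_1^TX_i+\gamma_2Y_i)}}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2Y_i}}\right\}$$

subject to the constraints that Σi pi = 1 and

$$\sum_{i=1}^{n}p_i\int_yP_\theta(y\mid X_i)\,\frac{e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}\,dy=\xi_0,$$

where ξ0 is the known value of P(D = 1). By using the Lagrange multipliers, we see that the estimate for pi satisfies

$$\frac{1}{p_i}-\lambda\int_yP_\theta(y\mid X_i)\,\frac{e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}\,dy-\tilde\lambda=0.$$

The two constraints imply λξ0 + λ̃ = n; therefore,

$$p_i=\left\{\lambda\int_yP_\theta(y\mid X_i)\,\frac{e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}\,dy+(n-\lambda\xi_0)\right\}^{-1},$$

where λ satisfies Σi pi = 1. Thus, the profile log-likelihood function for θ, γ0, γ1 and γ2 is

$$\sum_{i=1}^{n}\left\{\log P_\theta(Y_i\mid X_i)+D_i(\gamma_0+\gamma_1^TX_i+\gamma_2Y_i)-\log\left(1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2Y_i}\right)\right\}-\sum_{i=1}^{n}\log\left\{\lambda\int_yP_\theta(y\mid X_i)\,\frac{e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}\,dy+(n-\lambda\xi_0)\right\},$$

where λ is determined by the equation

$$\sum_{i=1}^{n}\left\{\lambda\int_yP_\theta(y\mid X_i)\,\frac{e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}\,dy+(n-\lambda\xi_0)\right\}^{-1}=1$$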

under the constraint that each term in the summation is positive.

Finally, we deal with the case in which the disease is not rare and the disease rate is unknown. If γ2 = 0, we can use standard statistical methods; see Appendix B. Suppose that γ2 ≠ 0. By introducing the Lagrange multiplier λ, we differentiate the log-likelihood function to obtain

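$$\frac{1}{p_i}-\frac{n_1}{P(D=1)}\int_yP_\theta(y\mid X_i)\,P(D=1\mid X_i,y)\,dy-\frac{n_0}{P(D=0)}\int_yP_\theta(y\mid X_i)\,P(D=0\mid X_i,y)\,dy-\lambda=0,$$

where n1 and n0 are the numbers of cases and controls. It is easy to see that λ = 0, so

$$p_i=\left\{\frac{n_1}{\xi}\int_yP_\theta(y\mid X_i)\,P(D=1\mid X_i,y)\,dy+\frac{n_0}{1-\xi}\int_yP_\theta(y\mid X_i)\,P(D=0\mid X_i,y)\,dy\right\}^{-1},$$

where ξ = P(D = 1). Plugging this expression into the log-likelihood function yields the profile log-likelihood function for γ0, γ1, γ2, θ and ξ

$$\sum_{i=1}^{n}\left\{D_i(\gamma_0+\gamma_1^TX_i+\gamma_2Y_i)-\log\left(1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2Y_i}\right)+\log P_\theta(Y_i\mid X_i)\right\}-\sum_{i=1}^{n}\log\left\{\frac{n_1}{\xi}\int_yP_\theta(y\mid X_i)\,\frac{e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}\,dy+\frac{n_0}{1-\xi}\int_yP_\theta(y\mid X_i)\,\frac{1}{1+e^{\gamma_0+\gamma_1^TX_i+\gamma_2y}}\,dy\right\}-n_1\log\xi-n_0\log(1-\xi).$$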

In all three cases, we maximize the profile log-likelihood functions via the Newton-Raphson algorithm or optimization algorithms. By the arguments used in the proofs of Theorems 2 and 3 of Lin and Zeng [2006], we can show that the maximum likelihood estimators are consistent and asymptotically normal. In addition, the limiting covariance matrix attains the efficiency bound and can be consistently estimated by the inverse of the negative Hessian matrix of the profile log-likelihood function.
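
The mechanics of Newton-Raphson maximization and of the variance estimate from the negative Hessian can be seen in a one-parameter toy problem. The sketch below is generic and is not the profile log-likelihood of this appendix: it maximizes a binomial log-likelihood in its logit parameter, for which the maximizer is known in closed form, using hypothetical counts.

```python
import math

def newton_raphson(score, hessian, start, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - score(theta) / hessian(theta)."""
    theta = start
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Toy log-likelihood: k events in n Bernoulli trials, logit parameter a.
k, n = 30, 100
expit = lambda a: 1.0 / (1.0 + math.exp(-a))
score = lambda a: k - n * expit(a)                       # first derivative
hessian = lambda a: -n * expit(a) * (1.0 - expit(a))     # second derivative

a_hat = newton_raphson(score, hessian, start=0.0)
std_error = math.sqrt(-1.0 / hessian(a_hat))             # inverse negative Hessian
```

In the multi-parameter setting the score becomes a gradient vector, the Hessian a matrix, and the division a linear solve; the standard errors are the square roots of the diagonal of the inverse negative Hessian.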

APPENDIX B: THEORETICAL PROPERTIES OF STANDARD STATISTICAL METHODS

We investigate the validity of standard statistical methods. Let P denote the true probability, and let Pobs denote the observed probability under the case-control design. Since

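$$P_{\mathrm{obs}}(Y\mid X)=\frac{P_{\mathrm{obs}}(Y,X\mid D=1)\,P_{\mathrm{obs}}(D=1)+P_{\mathrm{obs}}(Y,X\mid D=0)\,P_{\mathrm{obs}}(D=0)}{P_{\mathrm{obs}}(X)}$$

and Pobs(Y, X|D) = P(Y, X|D), we have

$$P_{\mathrm{obs}}(Y\mid X)=\left\{\frac{P_{\mathrm{obs}}(D=1)}{P(D=1)}\,\frac{e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}{1+e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}+\frac{P_{\mathrm{obs}}(D=0)}{P(D=0)}\,\frac{1}{1+e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}\right\}\frac{P(Y\mid X)\,P(X)}{P_{\mathrm{obs}}(X)}. \tag{1}$$

Likewise,

$$P_{\mathrm{obs}}(Y\mid X,D=0)=\frac{1}{P(D=0)}\,\frac{1}{1+e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}\,\frac{P(Y\mid X)\,P(X)}{P_{\mathrm{obs}}(X\mid D=0)}, \tag{2}$$

and

$$P_{\mathrm{obs}}(Y\mid X,D=1)=\frac{1}{P(D=1)}\,\frac{e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}{1+e^{\gamma_0+\gamma_1^TX+\gamma_2Y}}\,\frac{P(Y\mid X)\,P(X)}{P_{\mathrm{obs}}(X\mid D=1)}. \tag{3}$$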

If Pobs(Y |X) = P(Y |X), then the usual (prospective) likelihood based on the combined sample of cases and controls will yield correct estimates of θ as well as correct variance estimates. The sufficient and necessary condition for this equality is that Y is independent D conditional on X, i.e., γ2 = 0. To show the sufficiency, we note that equation (1) under γ2 = 0 entails

Pobs(X)=P(X|D=0){eγ0+γ1TXPobs(D=1)/P(D=1)+Pobs(D=0)/P(D=0)}P(D=0).

Similarly, P(X) = P(X |D = 0)(eγ0+γ1TX + 1)P(D = 0). Plugging these two expressions into equation (1), we see Pobs(Y |X) = P (Y |X). We show that γ2 = 0 is the necessary condition by contradiction. Suppose that Pobs(Y|X) = P(Y|X) but γ2 ≠ 0. Then equation (1) implies

$$
\frac{P_{\text{obs}}(D=1)}{P(D=1)}=\frac{P_{\text{obs}}(D=0)}{P(D=0)}.
$$

This equality does not hold in general, since Pobs(D = 1) is fixed by the study design (the proportion of cases sampled) and is therefore arbitrary relative to P(D = 1). Using equations (2) and (3), we can show that γ2 = 0 is also the necessary and sufficient condition for the standard analysis based on controls only or cases only to be valid. In most case-control studies, secondary phenotypes are correlated with the disease status, so that γ2 ≠ 0.
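The role of γ2 in equation (1) can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it takes a single binary X and a binary Y with arbitrary values for P(X) and P(Y = 1|X), builds the population joint distribution of (Y, X, D), applies case-control sampling with a chosen case fraction, and returns P(Y = 1|X) alongside Pobs(Y = 1|X). The function name and all numerical values are hypothetical.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def observed_vs_true(g0, g1, g2, case_fraction=0.5):
    """Return {x: (P(Y=1|X=x), Pobs(Y=1|X=x))} for binary X and Y."""
    p_x = {0: 0.7, 1: 0.3}      # assumed P(X)
    p_y1 = {0: 0.2, 1: 0.4}     # assumed P(Y=1|X)
    # Population joint P(Y, X, D) under the logistic disease model
    joint = {}
    for x in (0, 1):
        for y in (0, 1):
            p_yx = p_x[x] * (p_y1[x] if y == 1 else 1.0 - p_y1[x])
            p_d1 = expit(g0 + g1 * x + g2 * y)
            joint[(y, x, 1)] = p_yx * p_d1
            joint[(y, x, 0)] = p_yx * (1.0 - p_d1)
    # Population P(D) and design-determined Pobs(D)
    p_d = {d: sum(v for k, v in joint.items() if k[2] == d) for d in (0, 1)}
    p_obs_d = {1: case_fraction, 0: 1.0 - case_fraction}
    # Case-control sampling: Pobs(Y, X, D) = P(Y, X | D) * Pobs(D)
    obs = {k: v / p_d[k[2]] * p_obs_d[k[2]] for k, v in joint.items()}
    out = {}
    for x in (0, 1):
        num = obs[(1, x, 0)] + obs[(1, x, 1)]
        den = sum(obs[(y, x, d)] for y in (0, 1) for d in (0, 1))
        out[x] = (p_y1[x], num / den)
    return out

print(observed_vs_true(-3.0, 0.5, 0.0))   # gamma2 = 0
print(observed_vs_true(-3.0, 0.5, 1.0))   # gamma2 != 0
```

With γ2 = 0 the two probabilities in each pair agree up to rounding, whereas with γ2 ≠ 0 and oversampled cases they differ, in line with the argument above.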

What happens if D is independent of X conditional on Y, i.e., γ1 = 0? We first consider dichotomous traits. By switching the roles of X and Y in the above derivation, we see that Pobs(X|Y) = P(X|Y) under γ1 = 0; similar calculations show that Pobs(X|Y, D) is proportional to P(X|Y). Therefore, standard logistic regression analysis based on cases only, controls only, or the combined sample yields correct estimates of the odds ratios in the general population. For quantitative traits, the linear model specifies that $Y = \beta_0 + \beta_1^{T}X + \epsilon$, where $\epsilon$ is zero-mean normal with variance $\sigma^2$. Write $\tilde{Y} = Y - \beta_1^{T}X$. Clearly, $\tilde{Y}$ is independent of X, so standard linear regression will estimate β1 correctly if and only if $P_{\text{obs}}(\tilde{Y}\mid X)$ has mean zero. In light of equation (1), this means

$$
\int_y \frac{y}{1+e^{\gamma_0+(\gamma_1+\gamma_2\beta_1)^{T}X+\gamma_2y}}\,P_{\text{obs}}(y)\,dy=0
$$

for all X, implying γ1 = −γ2β1. Thus, standard regression analysis is biased under γ1 = 0 unless β1 = 0 or γ2 = 0. This explains the patterns of bias seen in Figure 2.
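The condition γ1 = −γ2β1 can be examined numerically. The following sketch is an illustration under assumptions that are not in the original text: a single binary X, a normal linear model for Y, and arbitrary parameter values. With a binary X, the least-squares slope of Y on X in the combined sample equals Eobs[Y|X = 1] − Eobs[Y|X = 0], which is computed here from equation (1) by midpoint-rule integration. The function names are hypothetical.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def integrate(f, lo, hi, n=4000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

def observed_slope(b0, b1, g0, g1, g2, sigma=1.0, p_x1=0.3, case_fraction=0.5):
    """Eobs[Y|X=1] - Eobs[Y|X=0] in the combined case-control sample."""
    def density(y, x):                        # P(y|x): normal linear model
        z = (y - b0 - b1 * x) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    def p_case(y, x):                         # P(D=1|x, y): logistic model
        return expit(g0 + g1 * x + g2 * y)

    def limits(x):                            # integration range for y
        m = b0 + b1 * x
        return m - 10.0 * sigma, m + 10.0 * sigma

    # Population disease rate P(D=1)
    p_d1 = sum(px * integrate(lambda y: p_case(y, x) * density(y, x), *limits(x))
               for x, px in ((0, 1.0 - p_x1), (1, p_x1)))
    a = case_fraction / p_d1                  # Pobs(D=1)/P(D=1)
    b = (1.0 - case_fraction) / (1.0 - p_d1)  # Pobs(D=0)/P(D=0)

    def obs_mean(x):
        # Equation (1): Pobs(y|x) is proportional to weight(y) * P(y|x)
        def weight(y):
            return a * p_case(y, x) + b * (1.0 - p_case(y, x))
        num = integrate(lambda y: y * weight(y) * density(y, x), *limits(x))
        den = integrate(lambda y: weight(y) * density(y, x), *limits(x))
        return num / den

    return obs_mean(1) - obs_mean(0)

# beta1 = 0.5 and gamma2 = 1 throughout; only gamma1 changes
print(observed_slope(0.0, 0.5, -3.0, -0.5, 1.0))  # gamma1 = -gamma2*beta1
print(observed_slope(0.0, 0.5, -3.0, 0.0, 1.0))   # gamma1 = 0
```

In the first call the slope recovers β1, while in the second it does not, which is the pattern of bias described above for γ1 = 0 with β1 ≠ 0 and γ2 ≠ 0.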

We now investigate the bias of including the case-control status as a predictor in a regression model. It follows from equations (2) and (3) that for dichotomous traits,

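$$
\log\frac{P_{\text{obs}}(Y=1\mid X,D)}{P_{\text{obs}}(Y=0\mid X,D)}=\beta_0+\beta_1^{T}X+\gamma_2D-\log\frac{1+e^{\gamma_0+\gamma_1^{T}X+\gamma_2}}{1+e^{\gamma_0+\gamma_1^{T}X}},
$$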

and for continuous traits,

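$$
\begin{aligned}
E_{\text{obs}}[Y\mid X,D=0]&=\int_y yP(y\mid X)\left(1+e^{\gamma_0+\gamma_1^{T}X+\gamma_2y}\right)^{-1}dy\Big/c_0(X),\\
E_{\text{obs}}[Y\mid X,D=1]&=\frac{\beta_0+\beta_1^{T}X-\int_y yP(y\mid X)\left(1+e^{\gamma_0+\gamma_1^{T}X+\gamma_2y}\right)^{-1}dy}{1-c_0(X)},
\end{aligned}
$$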

where $c_0(X) = \int_y P(y\mid X)\left(1+e^{\gamma_0+\gamma_1^{T}X+\gamma_2y}\right)^{-1}dy$. Thus, the logistic regression of Y on X and D is not a valid analysis of the effects of X on Y in the general population unless γ1 = 0 or γ2 = 0, whereas the linear regression of Y on X and D generally yields biased estimates of the effects of X on Y in the general population unless γ1 = 0 and γ2 = 0.
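For dichotomous traits, the log-odds expression can be verified directly. The sketch below is illustrative only: it assumes a single binary X and a logistic model logit P(Y = 1|X) = β0 + β1X, with arbitrary parameter values. It computes the within-stratum log odds from Bayes' rule, compares it with the closed form, and reports the X effect that a logistic regression of Y on X and D would see within a disease stratum. All function names are hypothetical.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_odds_direct(x, d, b0, b1, g0, g1, g2):
    """log{Pobs(Y=1|X=x,D=d) / Pobs(Y=0|X=x,D=d)} from Bayes' rule."""
    p_y1 = expit(b0 + b1 * x)                 # P(Y=1|X=x)

    def p_d(y):                               # P(D=d|X=x, Y=y)
        p1 = expit(g0 + g1 * x + g2 * y)
        return p1 if d == 1 else 1.0 - p1

    return math.log(p_y1 * p_d(1)) - math.log((1.0 - p_y1) * p_d(0))

def log_odds_formula(x, d, b0, b1, g0, g1, g2):
    """Closed form: b0 + b1*x + g2*d minus an offset that depends on x."""
    offset = math.log((1.0 + math.exp(g0 + g1 * x + g2)) /
                      (1.0 + math.exp(g0 + g1 * x)))
    return b0 + b1 * x + g2 * d - offset

def apparent_x_effect(d, b0, b1, g0, g1, g2):
    """Log odds ratio for X within stratum D=d (binary X)."""
    return (log_odds_formula(1, d, b0, b1, g0, g1, g2)
            - log_odds_formula(0, d, b0, b1, g0, g1, g2))

# beta1 = 0.5; both gamma1 and gamma2 nonzero
print(apparent_x_effect(0, -1.0, 0.5, -2.0, 0.7, 1.2))
# beta1 = 0.5; gamma1 = 0
print(apparent_x_effect(0, -1.0, 0.5, -2.0, 0.0, 1.2))
```

When γ1 = 0 the offset no longer depends on X and the apparent effect equals β1; when both γ1 and γ2 are nonzero the offset shifts the apparent effect away from β1.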

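Finally, we consider the rare-disease setting. When the disease is rare, $P(D=1\mid X,Y)\approx e^{\gamma_0+\gamma_1^{T}X+\gamma_2Y}$ and $P(D=0\mid X,Y)\approx 1$, so equation (1) becomes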

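$$
P_{\text{obs}}(Y\mid X)\approx\left\{\frac{P_{\text{obs}}(D=1)}{\int\!\!\int e^{\gamma_1^{T}x+\gamma_2y}P(y\mid x)P(x)\,dy\,dx}\,e^{\gamma_1^{T}X+\gamma_2Y}+\frac{P_{\text{obs}}(D=0)}{P(D=0)}\right\}\frac{P(Y\mid X)P(X)}{P_{\text{obs}}(X)}.
$$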

Clearly, Pobs(Y|X) is not equivalent to P(Y|X) unless γ2 = 0. That is, standard analysis based on the combined sample is generally invalid. For rare disease, equation (2) becomes

$$
P_{\text{obs}}(Y\mid X,D=0)\approx\frac{1}{P(D=0)}\,\frac{P(Y\mid X)P(X)}{P_{\text{obs}}(X\mid D=0)}\approx P(Y\mid X).
$$

Thus, standard analysis based on controls only is approximately valid. On the other hand, equation (3) becomes

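$$
P_{\text{obs}}(Y\mid X,D=1)\approx\frac{1}{\int\!\!\int e^{\gamma_1^{T}x+\gamma_2y}P(y\mid x)P(x)\,dy\,dx}\,e^{\gamma_1^{T}X+\gamma_2Y}\,\frac{P(Y\mid X)P(X)}{P_{\text{obs}}(X\mid D=1)}.
$$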

Thus, Pobs(Y|X, D = 1) is not equivalent to P(Y |X) unless γ2 = 0. Interestingly, when Y is dichotomous,

$$
\frac{P_{\text{obs}}(Y=1\mid X,D=1)}{P_{\text{obs}}(Y=0\mid X,D=1)}\approx e^{\gamma_2}\,\frac{P(Y=1\mid X)}{P(Y=0\mid X)},
$$

so standard logistic regression based on cases only yields approximately correct estimates of odds ratios. If Y is continuous satisfying the linear model, then Pobs(Y|X, D = 1) is approximately proportional to $\exp\{-(Y-\beta_1^{T}X-\beta_0-\sigma^2\gamma_2)^2/(2\sigma^2)\}$; therefore, standard linear regression based on cases only is approximately correct for rare disease. Whether the trait is dichotomous or continuous, the regression means between cases and controls differ only by a constant, so (for rare disease) the regression of Y on X and D yields approximately correct estimation of the effects of X on Y in the general population.
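The rare-disease approximation for continuous traits can be checked by computing the case-only regression mean without the approximation. The sketch below is illustrative and rests on assumptions not in the original text: a scalar X, σ = 1, and arbitrary parameter values. It integrates y·P(y|x)·P(D = 1|x, y) numerically and compares the result with β0 + β1x + σ²γ2. The function name `case_mean` is hypothetical.

```python
import math

def case_mean(x, b0, b1, g0, g1, g2, sigma=1.0, n=4000):
    """Eobs[Y|X=x, D=1] by midpoint-rule integration of
    y * P(y|x) * P(D=1|x, y), with no rare-disease approximation."""
    m = b0 + b1 * x
    lo, hi = m - 12.0 * sigma, m + 12.0 * sigma
    h = (hi - lo) / n
    num = den = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        dens = math.exp(-0.5 * ((y - m) / sigma) ** 2)   # normal kernel of P(y|x)
        p_case = 1.0 / (1.0 + math.exp(-(g0 + g1 * x + g2 * y)))
        num += y * dens * p_case
        den += dens * p_case
    return num / den

b0, b1, g1, g2 = 1.0, 0.5, 0.3, 0.8

# Rare disease (gamma0 = -12): case means are shifted by about sigma^2 * gamma2
rare_0 = case_mean(0, b0, b1, -12.0, g1, g2)
rare_1 = case_mean(1, b0, b1, -12.0, g1, g2)
print(rare_0 - b0, rare_1 - (b0 + b1), rare_1 - rare_0)

# Common disease (gamma0 = 0): the constant-shift approximation breaks down
common_1 = case_mean(1, b0, b1, 0.0, g1, g2)
print(common_1 - (b0 + b1))
```

For the rare-disease values the shift is close to σ²γ2 at both values of x, so the case-only slope is close to β1; for the common-disease value the shift departs markedly from σ²γ2.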

References

  1. Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, Perry JR, Elliott KS, Lango H, Rayner NW, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science. 2007;316:889–894. doi: 10.1126/science.1141634.
  2. Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV, Zusmanovich P, Sulem P, Thorlacius S, Gylfason A, Steinberg S, et al. Many sequence variants affecting diversity of adult human height. Nat Genet. 2008;40:609–615. doi: 10.1038/ng.122.
  3. Kathiresan S, Melander O, Guiducci C, Surti A, Burtt NP, Rieder MJ, Cooper GM, Roos C, Voight BF, Havulinna AS, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat Genet. 2008;40:189–197. doi: 10.1038/ng.75.
  4. Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S, Eyheramendy S, Voight BF, Butler JL, Guiducci C, et al. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet. 2008;40:584–591. doi: 10.1038/ng.125.
  5. Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. J Am Stat Ass. 2006;101:89–118 (with discussion).
  6. Loos RJF, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy RM, Attwood AP, Beckmann JS, et al. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet. 2008;40:768–775. doi: 10.1038/ng.140.
  7. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
  8. Roeder K, Carroll RJ, Lindsay BG. A semiparametric mixture approach to case-control studies with errors in covariables. J Am Stat Ass. 1996;91:722–732.
  9. Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen WM, Bonnycastle LL, Shen H, Timpson N, Lettre G, Usala G. Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet. 2008;40:198–203. doi: 10.1038/ng.74.
  10. Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PIW, Chen H, Roix JJ, Kathiresan S, Hirschhorn JN, Daly MJ, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336. doi: 10.1126/science.1142358.
  11. Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JRB, Stevens S, Hall AS, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008;40:575–583. doi: 10.1038/ng.121.
  12. Weedon MN, Lettre G, Freathy RM, Lindgren CM, Voight BF, Perry JR, Elliott KS, Hackett R, Guiducci C, Shields B, et al. A common variant of HMGA2 is associated with adult and childhood height in the general population. Nat Genet. 2007;39:1245–1250. doi: 10.1038/ng2121.
  13. Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC, Timpson NJ, Najjar SS, Stringham HM, et al. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet. 2008;40:161–169. doi: 10.1038/ng.76.
