Abstract
The simplest form of retrospective study allows the reconstruction of the dependence between a binary outcome, representing the contrast between cases and controls, and one or more explanatory variables. A different objective for such situations is considered, in which there are distinct explanatory variables, say determining . Reconstruction of the originating distribution of from the case-control data is considered for both continuous and binary variables. Emphasis is on the linear regression coefficient of on . That coefficient, but not the relevant intercept, shows considerable stability, as shown by theory and simulations. An approximation to the value of the coefficient not conditioning on is given.
Keywords: Bias removal, Case-control study, Indirect sampling
1. Introduction
One rather general formulation of the challenge of interpreting observational studies is to suppose data available on a sample of individuals with three broad types of observation, outcomes, explanatory variables, and background or intrinsic variables, . The ultimate objective of study is usually the dependence of on allowing for the presence of . An experiment or intervention will study this directly, including often an element of randomization of . Observational studies are constrained in various ways, implying that the distributions generating the data may be only indirectly related to the distribution of interest.
In particular, in a case-control design inclusion in the data depends strongly on an outcome variable. In the present paper we suppose, somewhat unusually, that emphasis lies on the dependence among the explanatory variables, in particular that of on in the underlying population, marginalizing over . Reconstruction of the underlying dependence of interest is not direct and it has been pointed out (Nagelkerke et al., 1995; Lee et al., 1997; Jiang et al., 2006; Lin and Zeng, 2009; Wei et al., 2013; Xing et al., 2016) that some methods in the literature may be misleading.
An area where this issue often arises is genetic epidemiology. Many genome-wide association studies (GWAS) have a case-control design because their main aim is to discover associations of genetic variants with relatively rare outcomes, but often data on other variables are collected and the data are re-used to assess the associations with other variables in the case-control sample. Lin and Zeng (2009) demonstrated that some approaches commonly employed in applications give biased estimates of the association of interest and proposed methods to address this issue. Monsees et al. (2009) discussed the issues with a focus on GWAS and testing. Dai and Zhang (2015) studied the Mendelian randomisation estimator for the relationship of a continuous exposure with an outcome in a case-control study.
Because the population proportion of cases is in general unknown, corrections for weighted sampling based on unequal but known probabilities of selection (Horvitz and Thompson, 1952) are not available, except possibly as the basis for a sensitivity analysis.
The emphasis in the present note is on the formal relations involved, not on explicit details of estimation procedures.
In Section 2 we give the formulation of the problem and theoretical relations involved for linear regression when is continuous and in particular a formula for the reconstructed regression coefficient. In Section 3 we present some results from a simulation study illustrating the theoretical findings. Section 4 introduces the corresponding relationships when all variables are binary, and in Section 5 the conclusions are discussed.
2. Theory
To study these issues in their simplest form we consider two random variables whose population distribution is of interest. The case/control binary outcome depends on and defines two random samples conditionally respectively on cases, and controls (Figure 1). From these we wish to reconstruct the population distribution of . Our arguments are general but we focus on the linear regression coefficient, of on . We treat both variables as one-dimensional; the results extend directly to vector .
Fig. 1. Path diagram.
An instructive but extreme special case arises when is conditionally independent of given . Then also is conditionally independent of given so that the form of the regression relation of on is the same within cases and within controls and within the population. This is concordant with the general notion that in fitting regression relations the explanatory variables are typically regarded as fixed at their observed values. The joint distribution of is, however, in general different in cases from that in controls. To estimate the linear least squares regression coefficient of on we may, however, in this situation find the regression coefficients and their standard errors separately within cases and within controls and, preferably subject to an informal check of consistency, calculate a weighted mean.
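The pooling step just described can be sketched as follows. This is a minimal illustration only, assuming inverse-variance weights (the weighting mentioned in the Discussion) and a simple z-type statistic as the informal consistency check; the function names `pool_slopes` and `consistency_z` and all numerical inputs are hypothetical and are not taken from the paper.

```python
import math

def pool_slopes(estimates, std_errors):
    """Inverse-variance weighted mean of slope estimates, with its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

def consistency_z(b0, se0, b1, se1):
    """Informal check: standardized difference between two independent slopes."""
    return (b1 - b0) / math.sqrt(se0 ** 2 + se1 ** 2)

# Hypothetical slopes and standard errors: controls first, then cases.
b_controls, se_controls = 0.48, 0.02
b_cases, se_cases = 0.52, 0.04

z = consistency_z(b_controls, se_controls, b_cases, se_cases)
pooled, pooled_se = pool_slopes([b_controls, b_cases], [se_controls, se_cases])
print(f"z = {z:.3f}, pooled slope = {pooled:.3f} (se {pooled_se:.4f})")
```

A large absolute value of `z` would suggest that the two within-group slopes are not estimating a common quantity, in which case pooling would not be appropriate.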
More generally we suppose without loss of generality that and also that
where is an increasing function with values in (0, 1). The regression coefficients, such as are defined for a given function so that if different such functions are involved in a specific study an extended notation would be required. Natural choices for are the standardized normal integral and the logistic function. Another important possibility, normally useful, however, only over a restricted range, is the linear in probability model, for . It is known that if the data are concentrated in the range of probabilities say in empirical choice between different ‘dose-response’ relations such as logistic, integrated normal and linear is feasible only with very large amounts of data (Chambers and Cox, 1967). Then marginally in the population, . For the cases, we have
on using the relation that It follows that the conditional distribution of given within the cases, is
| (1) |
To obtain results for the controls, we replace by and reverse the sign of the regression coefficients .
Thus the conditional distribution of given is the same in cases and controls and in the population if and only if consistently with the more general result noted previously. However the conditional mean of in (1) is, on writing for the population and simplifying,
where is the conditional variance of around its least squares regression on .
Thus when all regression coefficients are positive the regression line of on among the cases is somewhat lower than its population form but has the same slope. The replacement of by for the controls implies that because is typically small the distortion among the controls is much smaller.
In many applications, however, especially where some probabilities are quite small, the linear in probability model will not be reasonable. Indeed the most common reason for use of a case-control design is that cases are rare in the population, indeed possibly very rare. In such situations the relation between and in the population will be close to that in the controls and the linearity of the assumed dependence of suspect. We give a more realistic formulation later.
A more detailed analysis of the linear in probability model shows that if the regression of on were studied directly ignoring case/control status then to a first approximation the slope would be unchanged but the position of the line displaced.
For a more refined analysis abandoning the linearity assumption, we assume to have a bivariate normal distribution, taken without loss of generality to have zero means. The regression coefficient of on is again denoted by . We assume further that instead of (1)
where is the standard normal cumulative distribution function. This leads, after integrating over the conditional distribution of given and then over the distribution of to
Here It follows that
with a complementary expression given . If we assume that the population regression of on is linear with normal errors we may replace the first factor on the right-hand side by . In line with earlier results if and only if implying that case/control status is independent of given . In general, if we standardize and to have zero means and unit variances the regression coefficients involving are likely to be numerically small in realistic situations and expansion leads to
where related to Mills ratio. For controls, change the sign of and change to .
That is, to this order the impact of case-control sampling on the regression of on is to leave the slope unchanged but induce translations of the regression line in opposite directions for cases and controls.
In particular it follows that to this order for the cases
where in the last term refers to all regression coefficients with as outcome variable. For the controls, again reverse the appropriate signs and change to .
Inclusion of quadratic terms in the shows that relatively complicated nonlinearities are involved. Typically cases are rare, so that and is small. There is an upward displacement of the regression line but to this order no change in slope. The downwards shift in the line for controls is by contrast much smaller because now the denominator of is close to one whereas the numerator is usually small.
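Integration of a probit-type probability over a normal distribution, as used earlier in this section, rests on a standard identity: for a standard normal variable Z, the expectation of Phi(a + bZ) equals Phi(a / sqrt(1 + b^2)). The sketch below checks this numerically. The constants `a` and `b` are hypothetical stand-ins and do not correspond to particular parameters of the paper; the trapezoidal rule is simply a convenient choice of quadrature.

```python
import math

def Phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def marginal_probit(a, b, n=4000, lim=10.0):
    """Trapezoidal approximation to E[Phi(a + b Z)] for Z standard normal."""
    h = 2.0 * lim / n
    total = 0.0
    for k in range(n + 1):
        z = -lim + k * h
        end_weight = 0.5 if k in (0, n) else 1.0
        total += end_weight * Phi(a + b * z) * phi(z)
    return total * h

a, b = -2.0, 0.7  # hypothetical intercept and slope on the probit scale
numeric = marginal_probit(a, b)
closed_form = Phi(a / math.sqrt(1.0 + b * b))
print(numeric, closed_form)
```

The two printed values should agree to many decimal places, which is why marginal probabilities under this formulation remain of probit form with a rescaled argument.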
The conditional expectation of given among the cases now follows on multiplying by and integrating.
| (2) |
where is the standardized normal density. For the controls, reverse the signs of . The integral is best approximated by the delta method, that is local linearization around the expected value of namely to give the approximations
| (3) |
Here is the standardized normal density. For the controls, change the sign of the arguments of .
The second term in (2) specifies a nonlinear dependence on . It is most simply summarized by the slope at thus changing the linear regression coefficient to
| (4) |
Now for negative a convenient first approximation is that in effect the leading term of an asymptotic expansion of for large negative underestimating the ratio. This leads to a simple approximation to the regression coefficient among the cases of
| (5) |
For controls, however, a quite different approximation has to be used because being the population proportion of cases, is no longer small. The second term in (4) is thus typically small and the regression coefficient in the controls thus only slightly different from that in the population.
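The first approximation used above for the cases can be examined numerically. The sketch assumes that the approximation in question is the standard one for the lower tail of the normal distribution, namely that the ratio of the standard normal density to the standard normal distribution function at a negative argument t is roughly -t. The argument `t` is a hypothetical stand-in for the quantity appearing in the text.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal distribution function; erfc keeps accuracy in the lower tail."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def density_ratio(t):
    """Ratio phi(t) / Phi(t), the reciprocal of the Mills ratio at -t."""
    return phi(t) / Phi(t)

def relative_error(t):
    """Relative shortfall of the leading term -t compared with the exact ratio."""
    exact = density_ratio(t)
    return (exact - (-t)) / exact

for t in (-1.0, -2.0, -3.0, -4.0):
    print(f"t = {t:4.1f}  exact = {density_ratio(t):.4f}  "
          f"leading term = {-t:.1f}  rel. error = {relative_error(t):.3f}")
```

The leading term lies below the exact ratio at every negative argument tried, and the relative shortfall decreases as t becomes more negative, that is, as cases become rarer.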
3. Some simulations
To illustrate the problem and confirm the results of the previous section we carried out some simulations. We generated a ‘population’ sample with a given relationship between and and then selected a case-control sample within that, with case/control status depending on and/or . We selected a few plausible values for the relationship between the three variables involved to examine the resulting estimates of the regression of on by carrying out various types of analysis: the analysis in the full population and within the case-control sample (incorrectly) conditioning on case/control status, using inverse probability weighting assuming the proportion of cases is known, and applying the proposed correction.
We generated values of and where . Case/control status was generated to be equal to 1, denoting a case, with probability where is the logistic function. Then 2000 cases and 2000 controls were selected at random. For each parameter configuration 250 replicates were generated. We fitted a linear regression of on in the full sample, in controls only, in cases only, in both cases and controls adjusting for case/control status, in both cases and controls ignoring case/control status, in both cases and controls with weighting by the inverse of the probability of being selected in the case-control sample and in the case-control sample by using (5). Table 1 is a short summary of a much more extensive study, the summary concentrating on the region where the proportion of cases is relatively small. The most surprising result is the relative stability of the point estimate defined by pooling the data regardless of case/control status. This would not be expected to hold if the relation between and was systematically different in cases and controls.
Table 1. Simulation results for the continuous case. Columns, from left to right: the generating value of the regression coefficient; the two conditional effects on case/control status; the proportion of cases in the population; and the estimated regression coefficients from the population sample, from controls only, from cases only, from the case-control sample adjusting for case/control status, from the case-control sample ignoring case/control status, from inverse probability weighted regression, and from the reconstruction using (5). Estimates are averages over 250 simulation runs.

| Slope | Effect 1 | Effect 2 | Proportion of cases | Population | Controls | Cases | Adjusted | Ignoring status | IPW | Reconstructed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 0.5 | 0.5 | 0.02 | 0.50 | 0.49 | 0.49 | 0.49 | 0.55 | 0.50 | 0.50 |
| 0.5 | 0.5 | 0.5 | 0.1 | 0.50 | 0.47 | 0.46 | 0.47 | 0.53 | 0.49 | 0.51 |
| 0.5 | 0.5 | 1 | 0.02 | 0.50 | 0.48 | 0.46 | 0.47 | 0.56 | 0.50 | 0.50 |
| 0.5 | 0.5 | 1 | 0.1 | 0.50 | 0.46 | 0.42 | 0.44 | 0.53 | 0.47 | 0.53 |
| 0.5 | 1 | 0.5 | 0.02 | 0.50 | 0.48 | 0.43 | 0.45 | 0.60 | 0.50 | 0.51 |
| 0.5 | 1 | 0.5 | 0.1 | 0.50 | 0.44 | 0.39 | 0.41 | 0.54 | 0.46 | 0.54 |
| 0.5 | 1 | 1 | 0.02 | 0.50 | 0.46 | 0.37 | 0.42 | 0.59 | 0.50 | 0.52 |
| 0.5 | 1 | 1 | 0.1 | 0.50 | 0.41 | 0.32 | 0.37 | 0.54 | 0.44 | 0.60 |
| 0.8 | 0.5 | 0.5 | 0.02 | 0.80 | 0.80 | 0.79 | 0.79 | 0.83 | 0.80 | 0.80 |
| 0.8 | 0.5 | 0.5 | 0.1 | 0.80 | 0.78 | 0.78 | 0.78 | 0.81 | 0.79 | 0.81 |
| 0.8 | 0.5 | 1 | 0.02 | 0.80 | 0.79 | 0.77 | 0.78 | 0.83 | 0.80 | 0.80 |
| 0.8 | 0.5 | 1 | 0.1 | 0.80 | 0.78 | 0.76 | 0.77 | 0.81 | 0.79 | 0.82 |
| 0.8 | 1 | 0.5 | 0.02 | 0.80 | 0.78 | 0.75 | 0.77 | 0.85 | 0.80 | 0.81 |
| 0.8 | 1 | 0.5 | 0.1 | 0.80 | 0.76 | 0.72 | 0.74 | 0.82 | 0.77 | 0.83 |
| 0.8 | 1 | 1 | 0.02 | 0.80 | 0.78 | 0.71 | 0.75 | 0.84 | 0.79 | 0.82 |
| 0.8 | 1 | 1 | 0.1 | 0.80 | 0.75 | 0.70 | 0.73 | 0.82 | 0.77 | 0.87 |
When either or is zero and is zero, all regressions give the same estimate of . As or increases, the estimates of from the regression in cases only, controls only or on both conditioning on case/control status have a downward bias, the bias increasing with increasing and . The bias is larger when the proportion of cases in the population is larger. The estimates from the weighted regression have a similar pattern but smaller bias. The estimate from the case-control sample obtained without adjusting for case/control status, has a small upward bias when the cases are rare in the population but becomes unbiased when the proportion of cases in the sample becomes closer to the proportion of cases in the population. The reconstructed estimate has a small upward bias.
As a sensitivity analysis the simulation was repeated with and having a distribution (Supplementary table). The resulting estimates differed slightly but not substantially from those in Table 1.
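A single replicate of a simulation of this general kind can be sketched as follows. This is not the authors' code, which is in the supplementary appendix. Several choices are assumptions made only for illustration: standard normal explanatory variable and unit-variance normal errors, the intercept `-4.5` (giving a case proportion of roughly 0.02), a population of 100,000, and 1000 cases and 1000 controls rather than 2000 of each. The names `x`, `y`, `b_x`, `b_y` and the functions are hypothetical, and the reconstruction based on (5) is not implemented.

```python
import math
import random

def slope(x, y, w=None):
    """Least squares slope of y on x, optionally weighted."""
    w = w or [1.0] * len(x)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

def within_group_slope(groups):
    """Slope of y on x adjusting for a group indicator (pooled within-group slope)."""
    sxy = sxx = 0.0
    for x, y in groups:
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx += sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def simulate(beta=0.5, b_x=0.5, b_y=0.5, intercept=-4.5,
             n_pop=100_000, n_each=1000, seed=1):
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n_pop):
        x = rng.gauss(0.0, 1.0)
        y = beta * x + rng.gauss(0.0, 1.0)  # assumed unit error variance
        p_case = 1.0 / (1.0 + math.exp(-(intercept + b_x * x + b_y * y)))
        (cases if rng.random() < p_case else controls).append((x, y))
    x1, y1 = zip(*rng.sample(cases, n_each))
    x0, y0 = zip(*rng.sample(controls, n_each))
    xp, yp = zip(*(cases + controls))
    # Inverse selection probabilities, assuming the population counts are known.
    w = [len(controls) / n_each] * n_each + [len(cases) / n_each] * n_each
    return {
        "prevalence": len(cases) / n_pop,
        "population": slope(xp, yp),
        "controls": slope(x0, y0),
        "cases": slope(x1, y1),
        "adjusted": within_group_slope([(x0, y0), (x1, y1)]),
        "ignoring": slope(x0 + x1, y0 + y1),
        "ipw": slope(x0 + x1, y0 + y1, w),
    }

res = simulate()
for name, value in res.items():
    print(f"{name:>10}: {value:.3f}")
```

Averaging the output of `simulate` over many seeds would give quantities comparable in kind to the entries of Table 1, although the numerical values need not match because the generating distributions here are assumed.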
4. Binary variables
We now describe an argument broadly parallel to that in the previous section for the case where both variables are binary. Consider two binary variables and studied at two levels of the variable indicating, again, case/control status, . Then in the population for we have probabilities
where is the joint probability distribution of and is the population probability of an individual being a case. Thus any population property, such as, for example, the log odds ratio
can be found in terms of quantities that can be estimated and the population parameter . In particular when is small, we have that with error
where and is the log odds ratio in the controls, and is the conditional probability of in controls. Note that because probabilities sum to one,
Thus the adjustment term, the coefficient of is particularly important when both and have the same sign.
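The reconstruction of a population property from the case and control distributions can be sketched numerically. The code assumes only the mixture relation described above, in which the population joint probabilities are a weighted combination of those in cases and in controls with weight equal to the population proportion of cases. The two 2 x 2 tables and the range of values for the proportion of cases are hypothetical; because that proportion is in general unknown, the loop serves as a sensitivity analysis.

```python
import math

def population_table(p_cases, p_controls, pi):
    """Mix case and control joint distributions of two binary variables."""
    return [[pi * p_cases[i][j] + (1.0 - pi) * p_controls[i][j] for j in (0, 1)]
            for i in (0, 1)]

def log_odds_ratio(p):
    """Log odds ratio of a 2 x 2 table of joint probabilities."""
    return math.log(p[0][0] * p[1][1] / (p[0][1] * p[1][0]))

# Hypothetical joint probabilities; rows index one variable, columns the other.
p_controls = [[0.50, 0.20], [0.15, 0.15]]
p_cases = [[0.30, 0.20], [0.20, 0.30]]

for pi in (0.0, 0.01, 0.05, 0.10):
    table = population_table(p_cases, p_controls, pi)
    print(f"pi = {pi:.2f}  population log odds ratio = {log_odds_ratio(table):.4f}")
```

At a proportion of cases equal to zero the result is the log odds ratio in the controls, and for small proportions the departure from that value is approximately linear in the proportion of cases.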
An alternative approach more closely linked to the results of Section 2 is to regard the binary variables as dichotomized versions of unobserved normally distributed random variables. The earlier results may then be adapted.
5. Discussion
We have discussed the relationships involved when fitting a regression on a pair of variables in a non-random sample, selected on the basis of a third variable. We have found that the regression coefficient is relatively stable under different types of analysis and have proposed a correction to reconstruct the coefficient that would have been obtained under no selection. The intercepts from the different analyses vary substantially, as expected.
The simulations show that the regression coefficient estimated conditioning on case/control status has substantial downward bias, whereas the coefficients estimated by ignoring case/control status or by using inverse probability weighting are closer to the value estimated from the population from which the case-control sample has been drawn.
The account given here of sampling based on case/control status can be generalized in various ways, the most immediate of which is to replace the scalar explanatory variable by a vector. When are both binary the discussion can be extended to include adjustment for other covariates, including continuous ones. Yet another possibility is to consider the use of instrumental variables to assess the ‘causal’ effect of on from case-control sampling. Dai and Zhang (2015) reported that an uncorrected estimate is unbiased in the null case but otherwise biased away from the null.
In the very special case when case/control status, is independent of given there are three sources of information about within controls, within cases, and comparison of group means. First approximations to the first two, before adjustment for the special sampling procedure, are given, with their estimated variances, by standard linear regression formulae.
The third estimate, based on the comparison of means is asymptotically independent of the other two and is . Its variance is best found conditionally on the values of as proportional to the variance of the difference of two independent means. The final estimate is a weighted mean with weights the reciprocals of the estimated variances. A more refined estimate of precision, allowing for errors in the estimated weights, has not been investigated.
The investigation outlined here is one facet of the broad challenge of the study of dependences that can be investigated only indirectly.
Acknowledgments
We are grateful for helpful referees’ comments.
Footnotes
Supplementary appendix with simulation code and results.
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.ecosta.2020.10.003.
Appendix A. Supplementary materials
Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/
References
- Chambers E.A., Cox D.R. Discrimination between alternative binary response models. Biometrika. 1967;54:573–578.
- Dai J.Y., Zhang X.C. Mendelian randomization studies for a continuous exposure under case-control sampling. American Journal of Epidemiology. 2015;181:440–449. doi: 10.1093/aje/kwu291.
- Horvitz D.G., Thompson D.J. A generalization of sampling without replacement from a finite universe. J. American Statist. Assoc. 1952;47:663–685.
- Jiang Y., Scott A.J., Wild C.J. Secondary analysis of case-control data. Statist. Med. 2006;25:1323–1339. doi: 10.1002/sim.2283.
- Lee A.J., McMurphy L., Scott A.J. Re-using data from case-control studies. Statist. Med. 1997;16:1377–1389. doi: 10.1002/(sici)1097-0258(19970630)16:12<1377::aid-sim557>3.0.co;2-k.
- Lin D.Y., Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genetic Epidemiology. 2009;33:256–265. doi: 10.1002/gepi.20377.
- Monsees G.M., Tamimi R.M., Kraft P. Genome-wide association scans for secondary traits using case-control samples. Genetic Epidemiology. 2009;33:717–728. doi: 10.1002/gepi.20424.
- Nagelkerke N.J.D., Moses S., Plummer F.A., Brunham R.C., Fish D. Logistic regression in case-control studies: the effect of using independent as dependent variables. Statist. Med. 1995;14:769–775. doi: 10.1002/sim.4780140806.
- Wei J., Carroll R.J., Müller U.U., Van Keilegom I., Chatterjee N. Robust estimation for homoscedastic regression in the secondary analysis of case-control data. J. R. Statist. Soc. B. 2013;75:185–206. doi: 10.1111/j.1467-9868.2012.01052.x.
- Xing C., McCarthy J.M., Dupuis J., Cupples L.A., Meigs J.B., Lin X., Allen A.S. Robust analysis of secondary phenotypes in case-control genetic association studies. Statist. Med. 2016;35:4226–4237. doi: 10.1002/sim.6976.
