Author manuscript; available in PMC: 2021 Aug 4.
Published in final edited form as: Genet Epidemiol. 2008 Nov;32(7):638–646. doi: 10.1002/gepi.20338

On combining family and case-control studies

Ruth M Pfeiffer 1, David Pee 2, Maria T Landi 1
PMCID: PMC8336588  NIHMSID: NIHMS1726666  PMID: 18454494

Summary

Studies to detect genetic association with disease can be family-based, often using families with multiple affected members, or population based, as in population based case-control studies. If data on both study types are available from the same population, it is useful to combine them to improve power to detect genetic associations. Two aspects of the data need to be accommodated, the sampling scheme and potential residual correlations among family members. We propose two approaches for combining data from a case-control study and a family study that collected families with multiple cases. In the first approach, we view a family as the sampling unit and specify the joint likelihood for the family members using a two-level mixed effects model to account for random familial effects and for residual genetic correlations among family members. The ascertainment of the families is accommodated by conditioning on the ascertainment event. The individuals in the case-control study are treated as families of size one, and their unconditional likelihood is combined with the conditional likelihood for the families. This approach yields subject specific maximum likelihood estimates of covariate effects. In the second approach, we view an individual as the sampling unit. The sampling scheme is accommodated using two-phase sampling techniques, marginal covariate effects are estimated, and correlations among family members are accounted for in the variance calculations. The models are compared in simulations. Data from a case-control and a family study from North-Eastern Italy on melanoma and a low-risk melanoma-susceptibility gene, MC1R, are used to illustrate the approaches.

Keywords: Ascertainment, Nested random effects model, Marginal model, Family history, Stratified sampling

1. Introduction

Associations of disease with candidate genes or markers in linkage disequilibrium with candidate genes can be tested with either family-based association studies, often using families with multiple affected members, or population based case-control studies. The choice of a specific design depends on whether the investigators are looking for a rare mutation with a large relative risk, or a common genetic variant with moderate relative risk.

Sometimes both types of studies are available from the same population, as was the case in our motivating example of cutaneous melanoma, the major cause of skin cancer mortality world wide. To investigate the role of two rare high-risk melanoma susceptibility genes, CDKN2A and CDK4, investigators collected 55 melanoma-prone families in North-Eastern Italy. Given the relatively low incidence rate of melanoma in Mediterranean countries, melanoma prone families were defined as having at least two members with melanoma. To study the impact of DNA repair capacity, hypothesized to confer low risk of melanoma, the same investigators also conducted a case-control study, including 179 healthy controls and 183 sporadic cases recruited at the same hospital as the family study cases. To be more specific, a melanoma patient seen at the Dermatology Department of the Maurizio Bufalini Hospital of Cesena, Italy, who reported a family history of melanoma was invited to participate in the family study, and subsequently all family members were also enrolled. A melanoma patient who did not have a family history of melanoma was invited to participate in the case-control study. Controls for the case-control study were healthy individuals without a family history of melanoma. Thus subjects in the case-control study had no family history of melanoma. Clinical data, phenotypic characteristics and epidemiologic variables for both studies were collected by a single dermatologist.

After answering the initial study questions, the investigators were also interested in assessing the impact on melanoma risk of a low-risk melanoma-susceptibility gene, the human melanocortin-1 receptor gene (MC1R), which has been shown to regulate pigment formation. Specific variants of MC1R have been associated with red hair, low tanning ability, freckling, and thus increased susceptibility to melanoma in Northern European populations [Kennedy et al, 2001; Duffy et al, 2004].

As the subjects from the Italian case-control and family studies came from the same region, and study protocol and data collection methods were very similar, combining both studies to investigate the impact of MC1R variants and their interactions with other risk factors on melanoma risk seemed a sensible approach to improve power to detect associations. Two features of the data needed to be accommodated in a combined approach to provide valid estimates of risk, the sampling of the subjects and potential residual correlations among included family members. In section 2, we propose two approaches to jointly analyze case-control and family data to estimate effects of genetic variants on disease risk, a conditional approach, and a marginal approach. In section 3 we study both methods using simulated data. We apply the methods to the Italian melanoma data in section 4, and close with a discussion in section 5.

2. Data and Models

2.1. Data

Let Y denote the binary disease status of an individual in the study (Y = 1 if the person has disease and 0 otherwise), and X the corresponding vector of measured covariates including genetic and other information, for example, gender and age.

For subjects from the family study we use Yij for the jth member of the ith family and denote by mi the size of the ith family, and by m the total number of families in the study. The total number of individuals in the family study is given by $n_F = \sum_{i=1}^{m} m_i$, and the total number of individuals in the case-control study by nC, comprised of n0 controls and n1 cases. For simplicity, subjects sampled into the case-control study are indexed by a single subscript, e.g. Yk.

2.2. A conditional approach

In separate analyses, a standard approach would be to use logistic regression models to estimate odds ratios for the case-control study, and conditional logistic regression analysis, conditioning on the number of cases in a family to account for the ascertainment, for the family study. Such conditional logistic regression is strictly valid if the family members have exchangeable genotypes, such as siblings. For all other types of families, the usual conditional logistic model assumes that there is no residual correlation among family members, that is, individuals from the same family are independent given family and their covariates. To obtain joint risk estimates, it would be sensible to maximize the joint likelihood obtained by multiplying the conditional logistic model likelihood times the unconditional logistic model likelihood as the two studies are independent.

We extend this basic approach using a random effects model that yields family specific estimates of risk, introduced by Pfeiffer et al [2001], to allow for varying residual correlations among family members. To summarize the model, the probability pij = P(Yij = 1) is a function of the covariate Xij, the random familial effect ai, which affects all family members equally, and an individual level random genetic effect gij for the jth individual in the ith family:

$$\operatorname{logit}(p_{ij}) = \operatorname{logit} P(Y_{ij}=1 \mid a_i, g_{ij}, X_{ij}) = \mu + \sigma_a a_i + \sigma_g g_{ij} + \beta X_{ij}. \tag{1}$$

The ai are assumed to be independent and identically distributed from a standard normal distribution with E(ai) = 0 and var(ai) = 1, and are assumed to be independent of the gij’s in the general population. For the gij’s, we assume a polygenic model in which the gij’s are normally distributed, with mean zero, variance one, and an additive component of variance [Fisher, 1918]. The additive covariance matrix Σi for the ith family is a function of the degree of kinship k(j, l) between members j and l in the family:

$$\operatorname{cov}(g_{ij}, g_{il}) = (\Sigma_i)_{j,l} = 2^{-k(j,l)}. \tag{2}$$

For example, k(j, j) = 0, k(j, l) = 1 if j and l are first-degree relatives, e.g. siblings, and k(j, l) = 2 if j and l represent second-degree relatives. Thus for each extra generation that separates two family members, the correlation is multiplied by a factor of 1/2. For unrelated family members, such as spouses, k(j, l) = ∞ and (Σi)j,l = 0.
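As a minimal sketch of (2), the following Python builds the additive covariance matrix for a nuclear family of two parents and three offspring. The function name and the hard-coded kinship degrees are illustrative assumptions, not part of the original analysis.

```python
import math

def additive_covariance(kinship_degree):
    """Return the matrix with entries 2**(-k(j, l)), as in equation (2)."""
    m = len(kinship_degree)
    return [[2.0 ** (-kinship_degree[j][l]) for l in range(m)] for j in range(m)]

INF = math.inf  # degree of kinship for unrelated members, e.g. spouses

# Hypothetical family: members 0 and 1 are the parents, 2 to 4 their offspring.
k = [
    [0, INF, 1, 1, 1],
    [INF, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
]
sigma = additive_covariance(k)
for row in sigma:
    print(row)
```

The spouses receive covariance zero, and every parent-offspring or sibling pair receives 1/2.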

Under the logistic model (1), the marginal probability of the response in the ith family requires multidimensional integration over the random effects distribution, which cannot be carried out in closed form, and is written as

$$P(Y_{i1},\ldots,Y_{im_i} \mid X_{i1},\ldots,X_{im_i}) = \int \prod_{j=1}^{m_i} p_{ij}^{y_{ij}}\, q_{ij}^{1-y_{ij}} \, dF(a,g), \tag{3}$$

where qij = 1 − pij. To evaluate the integrals in the random effects models we used Monte Carlo integration [Pfeiffer et al, 2001].
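The Monte Carlo evaluation of (3) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names, the number of draws, and the use of a Cholesky factor to generate correlated gij are all choices made here for the example.

```python
import math
import random

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(low[r][t] * low[c][t] for t in range(c))
            if r == c:
                low[r][c] = math.sqrt(a[r][r] - s)
            else:
                low[r][c] = (a[r][c] - s) / low[c][c]
    return low

def family_probability(y, x, mu, beta, sigma_a, sigma_g, cov_g, draws=2000, seed=1):
    """Monte Carlo estimate of the integral in (3) for one family."""
    rng = random.Random(seed)
    low = cholesky(cov_g)
    m = len(y)
    total = 0.0
    for _ in range(draws):
        a = rng.gauss(0.0, 1.0)                       # shared familial effect a_i
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        g = [sum(low[j][t] * z[t] for t in range(j + 1)) for j in range(m)]
        prod = 1.0
        for j in range(m):
            eta = mu + sigma_a * a + sigma_g * g[j] + beta * x[j]
            p = 1.0 / (1.0 + math.exp(-eta))
            prod *= p if y[j] == 1 else 1.0 - p
        total += prod
    return total / draws

# Covariance (2) for two parents and three offspring; data values are hypothetical.
nuclear_cov = [
    [1.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 0.5, 1.0],
]
y_obs = [1, 0, 1, 0, 0]
x_obs = [1, 0, 1, 1, 0]
estimate = family_probability(y_obs, x_obs, -3.0, 1.0, 1.0, 1.0, nuclear_cov)
print(estimate)
```

With both variance components set to zero the integrand no longer depends on the draws, so the estimate collapses to the product of independent logistic probabilities, which is a convenient sanity check.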

Likelihood for the family study

To account for the fact that the families selected into the family study are not a random sample of families in the population, but have at least two cases, the likelihood function of the data should be conditioned on the ascertainment event. Letting $Y_{i.} = \sum_{j=1}^{m_i} Y_{ij}$ denote the number of cases in the ith family, conditioning on Yi. ≥ 2 leads to the following likelihood

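$$L_f(Y_1,\ldots,Y_{n_F},\theta) = \prod_{i=1}^{n_F} P(Y_{i1},Y_{i2},\ldots,Y_{im_i} \mid X_{i1},\ldots,X_{im_i}, Y_{i.} \ge 2) = \prod_{i=1}^{n_F} \frac{\int \prod_{j=1}^{m_i} \exp(\beta Y_{ij} X_{ij})\, \lambda_{ij}^{Y}(\theta)\, dF(a,g)}{1 - \int \prod_{l=1}^{m_i} \lambda_{il}^{0}(\theta)\, dF(a,g) - \sum_{l} \int \exp(\beta X_{il})\, \lambda_{il}^{1}(\theta) \prod_{j \ne l} \lambda_{ij}^{0}(\theta)\, dF(a,g)}, \tag{4}$$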

where

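$$\lambda_{ij}^{Y}(\theta) = \frac{\exp\{Y_{ij}(\mu + \sigma_a a_i + \sigma_g g_{ij})\}}{1 + \exp(\mu + \sigma_a a_i + \sigma_g g_{ij} + \beta X_{ij})}, \quad \text{for } Y = 0, 1, \tag{5}$$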

and θ = (μ, β, σa, σg).

Alternatively, one could condition on a slightly stronger event, the exact number Yi. of cases in a family, resulting in the conditional likelihood function

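$$L_f(Y_1,\ldots,Y_k,\theta) = \prod_{i=1}^{n_F} P(Y_{i1},Y_{i2},\ldots,Y_{im_i} \mid X_{i1},\ldots,X_{im_i}, Y_{i.}) = \prod_{i=1}^{n_F} \frac{\int \prod_{j=1}^{m_i} \exp(\beta Y_{ij} X_{ij})\, \lambda_{ij}^{Y}(\theta)\, dF(a,g)}{\sum \int \prod_{l=1}^{m_i} \exp(\beta Y_{il} X_{il})\, \lambda_{il}^{Y}(\theta)\, dF(a,g)}, \tag{6}$$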

where the summation is over all possible choices of Yi. cases out of mi family members. While (4) as well as (6) yield asymptotically unbiased estimates of β, the likelihood (4) results in more efficient estimates of β than the likelihood (6) [Pfeiffer et al, 2003].

Missing covariate data often occur in family studies. Indeed, in the Italian melanoma family study several cases were deceased at the time of the interview, and thus genetic and other covariate information could not be obtained. To avoid modeling of missing data, these cases were excluded, and for families with missing case information we used the likelihood (6), with Yi. denoting the number of the cases with complete covariate information.

Likelihood for the case-control study

As the random effects model (1) was specified in a form appropriate for prospectively sampled family data, we now show its validity for retrospectively sampled case-control data.

For the individuals sampled into the case-control study, information on family members or on family size is not available, and we therefore treat them as families of size one. Let the sampling indicator be R = 1 if a person is sampled into the case-control study and R = 0 otherwise, with P(R = 1|Y = i) = πi, i = 0, 1. For individuals sampled into the case-control study, we get under model (1) that

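$$P(Y=1 \mid R=1, X) = \int P(Y=1 \mid R=1, a, g, X)\, dF(a,g) = \int \frac{P(R=1 \mid Y=1, a, g, X)\, P(Y=1 \mid a, g, X)}{P(R=1 \mid Y=0, a, g, X)\, P(Y=0 \mid a, g, X) + P(R=1 \mid Y=1, a, g, X)\, P(Y=1 \mid a, g, X)}\, dF(a,g). \tag{7}$$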

As R depends only on case-control status, P(R = 1|Y = i, a, g, X) = P(R = 1|Y = i) = πi, the above expression reduces to

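$$P(Y=1 \mid R=1, X) = \int \frac{\pi_1 \exp(\mu + \sigma_a a + \sigma_g g + \beta X)}{\pi_0 + \pi_1 \exp(\mu + \sigma_a a + \sigma_g g + \beta X)}\, dF(a,g) = \int \frac{\exp(\mu^* + \sigma_a a + \sigma_g g + \beta X)}{1 + \exp(\mu^* + \sigma_a a + \sigma_g g + \beta X)}\, dF(a,g) \tag{8}$$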

where μ* = μ + ln(π1/π0). The integral is two-dimensional.
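The step from (7) to (8) can be checked numerically for fixed values of the random effects. In the sketch below the sampling fractions and linear predictor values are hypothetical; the point is only that Bayes' rule applied to outcome-dependent sampling shifts the intercept by ln(π1/π0) and leaves the other coefficients untouched.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def prob_case_given_sampled(eta, pi1, pi0):
    """P(Y=1 | R=1, a, g, X) by Bayes' rule, with eta = mu + sigma_a*a + sigma_g*g + beta*X."""
    p = expit(eta)
    return pi1 * p / (pi0 * (1.0 - p) + pi1 * p)

pi1, pi0 = 0.9, 0.001                 # hypothetical sampling fractions for cases and controls
shift = math.log(pi1 / pi0)           # mu* - mu
comparisons = []
for eta in (-5.0, -3.0, -1.0):
    lhs = prob_case_given_sampled(eta, pi1, pi0)
    rhs = expit(eta + shift)
    comparisons.append((lhs, rhs))
    print(eta, lhs, rhs)
```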

The likelihood contribution for the case-control data is thus

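$$L_c(Y_1,\ldots,Y_{n_c},\theta^*) = \prod_{i=1}^{n_c} \int \frac{\exp\{Y_i(\mu^* + \sigma_a a_i + \sigma_g g_i + \beta X_i)\}}{1 + \exp(\mu^* + \sigma_a a_i + \sigma_g g_i + \beta X_i)}\, dF(a,g) = \prod_{i=1}^{n_c} \int \exp(\beta Y_i X_i)\, \lambda_i^{Y}(\theta^*)\, dF(a,g), \tag{9}$$

with λiY given by (5) evaluated at θ* = (μ*, β, σa, σg). While the above likelihood contains some information on the random effects parameters, they cannot be estimated solely using case-control data.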

If the disease is rare, i.e. if σa, σg, βX << |μ|, and μ is small, using the moment generating function of the normal random effects a and g, we obtain

$$P(Y=1 \mid R=1, X) \approx e^{\mu^* + \beta X} \int e^{\sigma_a a + \sigma_g g}\, dF(a,g) = \exp(\mu^* + \beta X + \sigma_a^2/2 + \sigma_g^2/2). \tag{10}$$

The random effects parameters could thus also be absorbed into a case-control study specific intercept at little cost to the efficiency of the parameter estimates, leading to the usual unconditional likelihood

$$L_c(Y_1,\ldots,Y_{n_c},\beta,\tilde\mu) = \prod_{i=1}^{n_c} \frac{\exp\{Y_i(\tilde\mu + \beta X_i)\}}{1 + \exp(\tilde\mu + \beta X_i)}. \tag{11}$$

Combined likelihood for the family and case-control studies

Multiplying the likelihood contributions for the family study, given by (4) or (6), and the case-control study, given by (9) yields the combined data likelihood:

$$L(\mu, \mu^*, \beta, \sigma_a, \sigma_g) = L_f(\mu, \beta, \sigma_a, \sigma_g)\, L_c(\mu^*, \beta, \sigma_a, \sigma_g). \tag{12}$$

Note that if σg = σa = 0, and the likelihood (6), which conditions on the exact number of cases, is used for the families, then (12) reduces to the standard logistic likelihood for the case-control data, and conditional logistic regression for the family data. However, if residual correlations are present among family members, ignoring them leads to estimates of covariate effects that are attenuated towards the null [Pfeiffer et al, 2001].
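To illustrate the reduction to conditional logistic regression, the sketch below takes a single family with σa = σg = 0 and conditions on the exact number of cases. The family data and function names are hypothetical. The conditional probability computed from the full logistic model matches the usual conditional logistic contribution, and the intercept μ drops out.

```python
import math
from itertools import combinations

def conditional_probability(y, x, mu, beta):
    """P(Y_i1..Y_im | X, Y_i.) for independent logistic members (sigma_a = sigma_g = 0)."""
    def joint(cases):
        prob = 1.0
        for j, xj in enumerate(x):
            p = 1.0 / (1.0 + math.exp(-(mu + beta * xj)))
            prob *= p if j in cases else 1.0 - p
        return prob
    observed = {j for j, yj in enumerate(y) if yj == 1}
    # Sum over all ways of choosing the observed number of cases among the members
    denom = sum(joint(set(c)) for c in combinations(range(len(y)), len(observed)))
    return joint(observed) / denom

def conditional_logistic(y, x, beta):
    """Standard conditional logistic contribution; no intercept appears."""
    k = sum(y)
    num = math.exp(beta * sum(xj for xj, yj in zip(x, y) if yj == 1))
    den = sum(math.exp(beta * sum(x[j] for j in c)) for c in combinations(range(len(y)), k))
    return num / den

y_fam = [1, 1, 0, 0, 0]      # hypothetical family with two cases
x_fam = [1, 0, 1, 0, 0]      # hypothetical carrier indicators
print(conditional_probability(y_fam, x_fam, -3.0, 1.0))
print(conditional_probability(y_fam, x_fam, -1.0, 1.0))
print(conditional_logistic(y_fam, x_fam, 1.0))
```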

2.3. A marginal approach

While the random effects model likelihood allows one to accommodate ascertainment of the families as well as residual familial correlation, the interpretation of the parameters is conditional on the random effects. In some applications, it may be desirable to have a marginal or population averaged interpretation of the parameters. In a marginal model, an individual, rather than a family, is viewed as the sampling unit, and the relationship between covariates and outcome is described by the logistic regression model

$$\operatorname{logit} P(Y=1 \mid X) = \operatorname{logit} f_\beta(Y=1 \mid X) = \beta_0 + \beta_1 X. \tag{13}$$

When estimating parameters, however, the sampling process needs to be accounted for. To do so, we adapt a weighted likelihood approach, also known as the Horvitz-Thompson approach, that is a popular method for analyzing data from multiphase stratified designs.

The sampling in our setting depends not only on disease status, Y, of a subject, but also on the number of his or her diseased relatives. We therefore define a measured covariate, “family history”, FH, as follows: FH = 0 if a person has no relatives with disease, FH = 1 if the person has one relative with disease, and FH = 2 if the person has two or more diseased relatives. The source population can then be partitioned into six disjoint strata based on values of Y and FH: (Y, FH) ∈ {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)}, as shown in Table I. Cases from families with three or more affected family members belong to the stratum (Y = 1, FH = 2), cases arising from families with two affected members fall into the stratum (Y = 1, FH = 1), and sporadic cases are from the stratum (Y = 1, FH = 0). Unaffected individuals who are members of a multiplex family fall into the stratum (Y = 0, FH = 2), and all other unaffected persons belong to either (Y = 0, FH = 1) or (Y = 0, FH = 0).

Table I: Strata defined by case-control status, Y, and family history, FH

          FH = 0    FH = 1    FH = 2
Y = 0     N00       N01       N02
Y = 1     N10       N11       N12
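The stratum assignment can be sketched in a few lines of Python. The function names and the example families are hypothetical; the rule itself follows the definition of FH given above, where a person's own disease status is not counted among his or her relatives.

```python
from collections import Counter

def family_history(n_affected_relatives):
    """FH = 0, 1, or 2 for none, one, or two or more diseased relatives."""
    return min(n_affected_relatives, 2)

def stratum_counts(family_statuses):
    """Tabulate (Y, FH) over all members of the given families (lists of 0/1)."""
    counts = Counter()
    for statuses in family_statuses:
        n_cases = sum(statuses)
        for y in statuses:
            # A case does not count as his or her own affected relative
            counts[(y, family_history(n_cases - y))] += 1
    return counts

# Hypothetical families: three cases, two cases, a sporadic case, an unaffected person.
families = [[1, 1, 1, 0], [1, 1, 0], [1], [0]]
counts = stratum_counts(families)
for stratum in sorted(counts):
    print(stratum, counts[stratum])
```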

Let R = 1 if a person is sampled into the combined study, and R = 0 otherwise. The sampling probabilities are now defined as functions of family history and disease status of each person, P(R = 1|FH, Y) = π(Y, FH). Let $S_\beta(Y_{ij}, X_{ij}) = \partial/\partial\beta\, \log f_\beta(Y_{ij} \mid X_{ij})$, where fβ is given by model (13), and the subindices i and j denote the family and the subject within family, respectively. Individuals from the case-control study again are viewed as families of size one. Let N denote the total population size, and mi the size of the ith family in the population. Solving the estimating function that sums over the scores for each sampled unit weighted by the inverse of the sampling probability,

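$$S(\pi,\beta) = \sum_{i=1}^{N} \sum_{j=1}^{m_i} \frac{R_{ij}}{\pi(Y_{ij}, FH_{ij})}\, S_\beta(Y_{ij}, X_{ij}) = \sum_{i=1}^{n_F + n_c} \sum_{j=1}^{m_i} w_{ij}\, S_\beta(Y_{ij}, X_{ij}) = 0, \tag{14}$$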

where wij = 1/π(Yij, FHij), yields consistent estimates of β under model (13). This is easy to see, as using $E_{R \mid FH,Y,X}(R_{ij}) = \pi(Y_{ij}, FH_{ij})$,

$$E_{Y,X,FH,R}\, S(\pi,\beta) = E_{Y,X,FH} \sum_{i=1}^{N} \sum_{j=1}^{m_i} \frac{E_{R \mid FH,Y,X}(R_{ij})}{\pi(Y_{ij}, FH_{ij})}\, S_\beta(Y_{ij}, X_{ij}) = 0.$$

For multistage sampling, this approach has been termed the Horvitz-Thompson estimating function [Whittemore, 1997], and is also referred to as a pseudolikelihood estimating function [Pfeffermann, 1993].
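A minimal sketch of solving the weighted estimating equation (14) for a single binary covariate is given below. The population counts, the sampling probabilities, and the function name are hypothetical, and the FH = 1 strata are omitted for brevity. Because the sampled counts here equal population counts times π exactly, the weighted fit reproduces the population marginal log-odds ratio, while the unweighted fit does not.

```python
import math

def fit_weighted_logistic(rows, iterations=25):
    """Solve sum_i w_i * (y_i - p_i) * (1, x_i) = 0 by Newton-Raphson.

    rows: iterable of (weight, y, x), with logit p = b0 + b1 * x.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for w, y, x in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            u0 += w * (y - p)                 # weighted score, intercept
            u1 += w * (y - p) * x             # weighted score, slope
            v = w * p * (1.0 - p)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * u0 - h01 * u1) / det     # Newton step: H^{-1} U
        b1 += (h00 * u1 - h01 * u0) / det
    return b0, b1

# Hypothetical sampled counts by (Y, FH, X) and sampling probabilities pi(Y, FH).
sample = {(0, 0, 0): 80, (0, 0, 1): 10, (0, 2, 0): 600, (0, 2, 1): 400,
          (1, 0, 0): 40, (1, 0, 1): 20, (1, 2, 0): 12, (1, 2, 1): 28}
pi = {(0, 0): 0.01, (0, 2): 1.0, (1, 0): 1.0, (1, 2): 1.0}

weighted = [(n / pi[(y, fh)], y, x) for (y, fh, x), n in sample.items()]
unweighted = [(float(n), y, x) for (y, fh, x), n in sample.items()]
b_ht = fit_weighted_logistic(weighted)
b_naive = fit_weighted_logistic(unweighted)
print("weighted:", b_ht)
print("unweighted:", b_naive)
```

In this constructed example the implied population has 8600 and 1400 unaffected persons with X = 0 and X = 1, and 52 and 48 cases, so the target marginal log-odds ratio is log(48 · 8600 / (52 · 1400)).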

We accommodate correlations among family members in the variance calculations, by using a robust or sandwich estimate of the variance [Binder, 1983]. As n = nF + nc → ∞, using standard Taylor expansion, and assuming that the sampling weights are known and fixed,

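$$n^{1/2}(\hat\beta - \beta) = -n \left[ \sum_{i=1}^{n} \sum_{j=1}^{m_i} w_{ij} \frac{\partial}{\partial\beta} S_\beta(Y_{ij}, X_{ij}) \right]^{-1} n^{-1/2} \left[ \sum_{i=1}^{n} \sum_{j=1}^{m_i} w_{ij}\, S_\beta(Y_{ij}, X_{ij}) \right] + o_p(1) \to N\!\left(0, A^{-1} B (A^{-1})^{T}\right).$$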

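The pieces of the asymptotic variance are $A = E\left[\frac{\partial}{\partial\beta} S_\beta(Y, X)\right]$, estimated by its weighted sampling equivalent

$$\hat A = \frac{1}{n} \left[ \sum_{i=1}^{n} \sum_{j=1}^{m_i} w_{ij} \frac{\partial}{\partial\beta} S_\beta(Y_{ij}, X_{ij}) \right]$$

and $B = \sum_{l} \operatorname{cov}[S_\beta(Y, X) \mid \text{unit in stratum } l]$, estimated by

$$\hat B = \frac{1}{n} \sum_{i=1}^{n} \left[ \sum_{j=1}^{m_i} w_{ij}\, S_\beta(Y_{ij}, X_{ij}) \right] \left[ \sum_{j=1}^{m_i} w_{ij}\, S_\beta(Y_{ij}, X_{ij}) \right]^{T}$$

evaluated at $\beta = \hat\beta$.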
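The sandwich estimate can be sketched for the two-parameter model logit p = b0 + b1 x as follows. This is an illustrative implementation with hypothetical data and names, written out for the 2×2 case to stay free of external libraries. Scores are summed within families before forming the outer products, which is what accounts for within-family correlation.

```python
import math

def sandwich_variance(families, b0, b1):
    """Cluster-robust variance A^{-1} B A^{-T} / n for logit p = b0 + b1 * x.

    families: list of families, each a list of (weight, y, x) tuples.
    """
    n = len(families)
    a00 = a01 = a11 = 0.0
    b00 = b01 = b11 = 0.0
    for members in families:
        s0 = s1 = 0.0                          # weighted score summed within the family
        for w, y, x in members:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            s0 += w * (y - p)
            s1 += w * (y - p) * x
            v = w * p * (1.0 - p)              # minus the derivative of the score
            a00 += v / n
            a01 += v * x / n
            a11 += v * x * x / n
        b00 += s0 * s0 / n
        b01 += s0 * s1 / n
        b11 += s1 * s1 / n
    det = a00 * a11 - a01 * a01
    i00, i01, i11 = a11 / det, -a01 / det, a00 / det   # inverse of A (its sign cancels)
    # V = A^{-1} B A^{-1} / n, written out for the symmetric 2x2 case
    v00 = (i00 * (b00 * i00 + b01 * i01) + i01 * (b01 * i00 + b11 * i01)) / n
    v01 = (i00 * (b00 * i01 + b01 * i11) + i01 * (b01 * i01 + b11 * i11)) / n
    v11 = (i01 * (b00 * i01 + b01 * i11) + i11 * (b01 * i01 + b11 * i11)) / n
    return [[v00, v01], [v01, v11]]

# Hypothetical data: five singletons, then the same data with every person duplicated
# inside his or her own family (perfect within-family correlation).
singles = [[(1.0, 1, 1)], [(1.0, 0, 0)], [(1.0, 0, 1)], [(1.0, 1, 0)], [(2.0, 0, 0)]]
paired = [members + members for members in singles]
v_single = sandwich_variance(singles, -0.5, 0.3)
v_paired = sandwich_variance(paired, -0.5, 0.3)
print(v_single)
print(v_paired)
```

Duplicating every member within a family adds no independent information, and the two variance matrices agree, whereas a variance that treated the duplicated records as independent would be halved.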

While the above inverse probability weighted approach is appealing as it is computationally simple, it has two drawbacks. First, it can be inefficient [Whittemore and Halpern, 1997], especially if the weights are very different, and second, it requires that π(Y = i, FH = j) > 0 for all i, j. However, empty cells, and thus π(Y = i, FH = j) = 0, can occur in the combined family and case-control study setting when all the cases in a population are sampled into the study, i.e. π(1, FH) = P(R = 1|Y = 1, FH) = 1 for FH = 0, 1, 2. In this situation all healthy individuals in the FH = 2 stratum are also sampled with probability one, P(R = 1|Y = 0, FH = 2) = 1, as they are included in the family part of the study. Because persons sampled into the case-control part of the study are required to be unrelated, it follows that P(R = 1|Y = 0, FH = 1) = 0, while P(R = 1|Y = 0, FH = 0) = π, for some 0 < π < 1.

A marginal approach when cells are empty

Approaches to deal with the empty cell problem are available, if one is willing to assume that the probability of disease occurrence in the source population is described by the logistic regression model

$$\operatorname{logit} P(Y=1 \mid FH, X) = \operatorname{logit} f_\beta(Y=1 \mid FH, X) = \beta_0 + \beta_1 X + \beta_2 FH. \tag{15}$$

The above model, like model (13), assumes that there is no residual familial correlation, given family history and the measured covariates X.

We want to stress that the estimate of covariate effects based on model (15) will be smaller than the marginal estimate obtained from model (13), as X and FH are positively correlated. An approximation for the bias is given in the Appendix. However, if Y is independent of X, then FH and X are also uncorrelated, and both models (13) and (15) yield unbiased estimates of β1 = 0. The test of the hypothesis H0 : β1 = 0 derived from model (15) thus has the correct size.

Under model (15), a pseudo-score approach proposed by Chatterjee et al [2003] can be used even with empty cells and is briefly summarized here for our setting. Assuming that the scores and integrals to follow exist and letting $\partial/\partial\beta\, \log f_\beta = S_\beta$, where fβ is defined by (15), a score function for all individuals is given by

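$$S_F(\beta) = \sum_{j} \left\{ R_j\, S_\beta(Y_j \mid FH_j, X_j) + (1 - R_j)\, \frac{\int S_\beta(Y_j \mid FH_j, x)\, f_\beta(Y_j \mid FH_j, x)\, dF(x \mid FH_j)}{\int f_\beta(Y_j \mid FH_j, x)\, dF(x \mid FH_j)} \right\}, \tag{16}$$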

for a fixed choice of F. An estimate of F(.|FH) can be substituted into (16), leading to an estimated score, $S_{\hat F}$. Instead of the empirical distribution function, which cannot be used when there are empty cells, the following smoother consistent estimate of F is proposed:

$$dF(x \mid FH) = dF_n(x \mid FH, R=1)\, \frac{P(R=1 \mid FH)}{P(R=1 \mid x, FH)} \tag{17}$$

where using n = nF + nC,

$$F_n(x \mid FH = fh, R=1) = \frac{\sum_{i=1}^{n} I(X_i \le x,\ FH_i = fh,\ R_i = 1)}{\sum_{i=1}^{n} I(FH_i = fh,\ R_i = 1)}$$

denotes the empirical distribution function conditional on FH of the covariate x among the genotyped subjects. The denominator of (17) is

$$P(R=1 \mid x, FH) = \sum_{Y=0}^{1} P(R=1 \mid Y, FH)\, f_\beta(Y \mid x, FH) \tag{18}$$

as the sampling only depends on Y and FH. Plugging (17) into the score equation (16) results in the pseudoscore

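$$S_{\beta;\hat F} = \sum_{j} \left\{ R_j\, S_\beta(Y_j \mid FH_j, X_j) + (1 - R_j)\, \frac{\int S_\beta(Y_j \mid FH_j, x)\, h_\beta(Y_j \mid FH_j, x)\, dF_n(x \mid FH_j, R=1)}{\int h_\beta(Y_j \mid FH_j, x)\, dF_n(x \mid FH_j, R=1)} \right\} \tag{19}$$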

where hβ(Yj|FHj, x) = fβ(Yj|FHj, x)/P(R = 1|x, FH), with P(R = 1|x, FH) given by (18), as the term P(R = 1|FH) cancels out of the denominator and the numerator. An iterative procedure [Chatterjee et al, 2003] can be used to find estimates of the parameters in (19).
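The role of (17) and (18) can be illustrated for a discrete covariate within one FH stratum. In the sketch below the counts and names are hypothetical, and the model probabilities fβ are taken as known; in the iterative procedure they would be evaluated at the current parameter values. The example has an empty cell (no unaffected subjects sampled), yet the reweighted covariate distribution matches the population one.

```python
def covariate_distribution(sample_counts, pi, f):
    """Estimate P(X = x | FH) from sampled subjects via (17) and (18), for discrete X.

    sample_counts[x]: number of sampled subjects with covariate value x in one FH stratum
    pi[y]: sampling probability P(R = 1 | Y = y, FH), which may be zero for one y
    f[x][y]: model probability f_beta(Y = y | x, FH)
    """
    unnormalised = {}
    for x, n_x in sample_counts.items():
        p_sampled = sum(pi[y] * f[x][y] for y in (0, 1))   # equation (18)
        unnormalised[x] = n_x / p_sampled                  # equation (17), up to a constant
    total = sum(unnormalised.values())
    return {x: v / total for x, v in unnormalised.items()}

# Hypothetical FH = 1 stratum. Population counts N(y, x):
# N(0,0) = 900, N(0,1) = 300, N(1,0) = 20, N(1,1) = 30.
pi = {0: 0.0, 1: 1.0}                         # all cases sampled, no unaffected sampled
sample_counts = {0: 20, 1: 30}                # sampled subjects are the cases only
f = {0: {0: 900 / 920, 1: 20 / 920},
     1: {0: 300 / 330, 1: 30 / 330}}
estimated = covariate_distribution(sample_counts, pi, f)
print(estimated)
```

The empirical distribution among sampled subjects would put probability 30/50 on X = 1, whereas the reweighted estimate gives 330/1250, the population value in this constructed example.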

To accommodate dependency of family members in the variance calculations, we propose a bootstrap estimate for the variance, based on resampling families.
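Resampling whole families can be sketched as follows. The function names, the number of replicates, and the toy estimator (a simple case fraction standing in for the pseudo-score fit) are assumptions made for illustration only.

```python
import random

def family_bootstrap_variance(families, estimator, replicates=200, seed=0):
    """Variance of estimator(sample) over bootstrap samples that resample whole families."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        # Draw families with replacement; members of a family always stay together
        resampled = [rng.choice(families) for _ in families]
        estimates.append(estimator(resampled))
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)

def case_fraction(families):
    """Toy estimator standing in for the pseudo-score fit: fraction of affected members."""
    members = [y for fam in families for y in fam]
    return sum(members) / len(members)

mixed = [[1, 1, 0], [0, 0, 0], [1, 0], [0], [1, 1, 1, 0], [0, 0]]   # hypothetical families
identical = [[1, 0, 0]] * 10
var_mixed = family_bootstrap_variance(mixed, case_fraction)
var_identical = family_bootstrap_variance(identical, case_fraction)
print(var_mixed, var_identical)
```

Keeping family members together in each replicate preserves whatever dependence exists within families, which is the reason for resampling at the family level rather than the individual level.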

3. Simulation study

We used simulations to compare the estimates of β for the random effects model (1) and for the fixed effects model (15) that estimates parameters in the presence of a family history variable in the model, using the pseudo-score approach, as we were most interested in the combined family and case-control study situation with complete case ascertainment.

Phenotypes were generated according to the random effects model (1), for various choices of the parameters. For each data set in the simulations, we created a population of families and then sampled all individuals from families with two or more cases, and in addition, a fixed number of cases from single case families, and controls from families with no case. Due to the varying number of multiplex families, the number of cases varied for each of the simulations.

For simplicity we assumed that all families had the same size, mi = 5, and the same family structure. We generated the individual level random effects gij from a multivariate normal distribution with a correlation structure corresponding to a family comprised of a mother, father, and three offspring, i.e. (Σi)j,l = 0 for j = 1 and l = 2, and (Σi)j,l = 1/2 for all other j ≠ l. The random familial intercept ai was generated from a standard normal distribution.

The univariate covariate X was simulated from a single biallelic gene as follows. Let d denote the wild type allele, and D the disease associated allele with allele frequency p. Let Dij = 0, 1, 2 denote the number of alleles D that individual ij is carrying. To simulate the data, we used a dominant score function Xij = X(Dij) in model (1) defined as Xij = X(Dij) = 0 for Dij = 0, and Xij = X(Dij) = 1 for Dij = 1 and Dij = 2. We focused on a dominant model, as this was the model we used in the data example. To generate genotypes for a random family, we first selected the parental genotypes at random from the general population assuming Hardy-Weinberg equilibrium and then generated the genotypes for the offspring assuming Mendelian transmission. The values of μ, σa, σg and p were chosen to reflect plausible levels of genetic risk for rare diseases. The true value of β was 1.0.
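The data-generating steps described above can be sketched for one family as follows. The function name, seed, and parameter values are hypothetical. The construction of the offspring polygenic effects from the parental ones is one convenient way, assumed here, to obtain the stated covariance structure (0 between parents, 1/2 for all other pairs, unit variances). The sampling of families into the study is not shown.

```python
import math
import random

def simulate_family(rng, mu, beta, sigma_a, sigma_g, p):
    """One family (mother, father, three offspring) under model (1), dominant coding."""
    def parental_genotype():
        # Hardy-Weinberg: two independent allele draws, each D with probability p
        return [int(rng.random() < p), int(rng.random() < p)]
    mother, father = parental_genotype(), parental_genotype()
    genotypes = [mother, father]
    for _ in range(3):
        # Mendelian transmission: one randomly chosen allele from each parent
        genotypes.append([rng.choice(mother), rng.choice(father)])
    d = [sum(alleles) for alleles in genotypes]          # number of D alleles
    x = [1 if count > 0 else 0 for count in d]           # dominant score

    # Polygenic effects: parents independent; offspring share half of each parent
    g_m, g_f = rng.gauss(0, 1), rng.gauss(0, 1)
    g = [g_m, g_f] + [(g_m + g_f) / 2 + math.sqrt(0.5) * rng.gauss(0, 1) for _ in range(3)]
    a = rng.gauss(0, 1)                                  # shared familial effect

    y = []
    for j in range(5):
        eta = mu + sigma_a * a + sigma_g * g[j] + beta * x[j]
        y.append(int(rng.random() < 1 / (1 + math.exp(-eta))))
    return d, x, y

rng = random.Random(2008)
simulated = [simulate_family(rng, -3.0, 1.0, math.sqrt(0.5), math.sqrt(0.5), 0.1)
             for _ in range(200)]
print(sum(sum(y) for _, _, y in simulated), "cases among", 5 * len(simulated), "persons")
```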

Table II gives the means of the parameter estimates over 100 simulations for the pseudolikelihood and the random effects estimates with the corresponding empirical variances of the estimates. To obtain the true value of the parameter corresponding to the marginal model (13), we simulated 1000 datasets consisting of 10000 families each, and computed the mean value of β obtained by fitting (13) to all the data. We also provide results obtained from fitting models (13) and (15) to the data using a generalized estimating equations (GEE) approach [Liang and Zeger, 1986], ignoring that sampling probabilities for cases and controls depend on family history, and treating cases and controls as random samples from all cases and controls in the population.

Table II: Mean estimates over 100 simulations and the corresponding empirical variances for 1000 families of size 5 for a dominant genetic model. Standard deviations given below the estimates.

Simulation parameters    Marginal models                                              Random effects model
                         Model (13)¹         Model (15)²
                         true      GEE       pseudo-likelih.     GEE
μ, σa², σg², p           β1        β1        β1        β2        β1        β2        β         σa²       σg²
−3, 0.0, 0.0, 0.1        1.00      0.99      0.99      −0.01     0.99      −0.05     1.10      0.48      0.28
                         0.07      0.13      0.10      0.11      0.14      0.08      0.16      0.73      0.42
−3, 0.5, 0.5, 0.25       0.96      0.90      0.90      0.31      0.90      −0.15     1.01      0.33      0.29
                         0.06      0.11      0.06      0.05      0.12      0.07      0.06      0.37      0.31
−3, 1.0, 0.5, 0.1        0.88      0.79      0.80      0.47      0.77      0.11      1.01      0.45      0.89
                         0.07      0.12      0.10      0.06      0.12      0.07      0.13      0.63      0.66
−3, 0.5, 1.0, 0.1        0.88      0.81      0.80      0.47      0.82      −0.02     1.01      0.45      0.90
                         0.03      0.13      0.10      0.06      0.13      0.07      0.13      0.68      0.68
−3, 1.0, 1.0, 0.1        0.81      0.74      0.71      0.79      0.71      0.10      0.98      0.86      0.93
                         0.06      0.14      0.09      0.04      0.14      0.07      0.13      0.79      0.56

¹ Model (13): logit P(Y|X) = β0 + β1X.
² Model (15): logit P(Y|X, FH) = β0 + β1X + β2FH.

The estimates β̂ for the random effects model were unbiased for β = 1 for all settings in Table II. The random effects variances, however, were estimated with low precision, and were not statistically significantly different from zero.

When there was no residual correlation within a family given the measured gene, i.e., σa = 0 and σg = 0, the estimates based on the two-phase pseudolikelihood were virtually unbiased: β1, corresponding to X, was estimated to be 0.99, with standard error (0.10), and β2, corresponding to FH, was estimated to be −0.01, with standard error (0.11). When there was residual correlation, however, the estimates of β1 underestimated the true marginal β1. For example, even for relatively small random effects variances of σa = 0.5 and σg = 0.5, the true fixed effects β1 was 0.96, while the estimate from the model that adjusted for family history was β̂1 = 0.90, corresponding to a 6% bias. The bias increased with the increasing magnitude of σa and σg. For σa = 1 and σg = 1, the true fixed effects β1 was 0.81, while the estimate from the model that adjusted for family history was β̂1 = 0.71, corresponding to a 12% bias, clearly showing the impact of the over-adjustment.

As the naive GEE estimates did not take into account the sampling design, the resulting covariate effects were biased for model (13). For example, for μ = −3, σa = 1, σg = 1, β̂1 estimated using the GEE approach was 0.74, compared to the true marginal value of β1 = 0.81. Model (15), which includes FH, accommodated the sampling design by absorbing the sampling probabilities into the logistic regression coefficient associated with FH, β2. However, while β1 was estimated without bias, β̂2 was biased and lost its interpretability. In addition, there was a noticeable loss in efficiency (17% to 50%) in the estimates of β1 from the GEE compared to the pseudo-likelihood approach.

We repeated one of the settings in Table II, μ = −3, σa = 1 and σg = 1, when 200 controls from the FH = 1 cell were sampled into the case-control part of the study to assess the impact of empty cells on the pseudolikelihood estimates of β. The estimate of β1 was 0.71, with a slightly smaller standard deviation, 0.08, compared to the empty cell setting, and the estimate for β2 was 0.77 (0.05). Thus adding 200 controls did not noticeably affect the estimates.

4. Data Example

We illustrate the proposed methods on data from the motivating example, a case-control study and a family study on melanoma conducted in Northeastern Italy. Subjects from both studies were recruited at the Dermatology Department of the Maurizio Bufalini Hospital of Cesena, Italy; very similar questionnaires were used and pigmentation characteristics were assessed by a single dermatologist. The study hospital is the reference point for the regions of Southern Emilia-Romagna and Northern Marche, with a population of around 1,000,000. From 1994 to 1999, 589 incident melanoma patients were diagnosed at the M. Bufalini hospital [Calista et al, 2000], and invited to enroll in the respective studies. Eighty-five percent of all melanoma patients diagnosed in the region were seen in the study hospital, as verified with a regional cancer registry, and with records of melanoma diagnosis from the main hospitals in the study area. Ascertainment of melanoma prone families continued until 2004. Further study details are given elsewhere [Landi et al, 2001; Landi et al, 2005].

In the case-control study, DNA was available for 169 melanoma patients and 171 control subjects. Study subjects had no family history of melanoma. The family study included 55 melanoma-prone families, defined as having two or more individuals affected with melanoma per family, ranging in size from two to twelve family members. DNA was obtained from 84 affected and 203 unaffected family members. MC1R is highly polymorphic in the general population, and most variants are very rare. We therefore assessed genetic risk of individuals carrying any MC1R variant compared to those carrying the consensus sequence. Table III summarizes the numbers of cases and controls in both studies by their MC1R carrier status.

Table III: Distribution of melanocortin-1 receptor (MC1R) variants in Italian case-control and family study

MC1R status     Case-control study                     Family study
                Cases (n=169)    Controls (n=171)      Cases (n=84)    Controls (n=203)
No variant      35 (20.7%)       71 (41.5%)            19 (22.6%)      61 (30.0%)
Any variant     134 (79.3%)      100 (58.5%)           65 (77.4%)      142 (70.0%)

We first analyzed the case-control study and family study separately, using unconditional logistic regression for the case-control part, and standard conditional logistic regression, conditioning on family, and the random effects regression models using (6) or (4) for the family study (Table IV). The log-odds ratios associated with carrying any MC1R variant were 1.00, 95% CI: (0.52, 1.48) for the case-control study, and, statistically not significant, 0.67, 95%CI: (−0.14, 1.48) for the family study based on standard conditional logistic regression, and 0.70, 95%CI: (−0.31, 1.79) based on the random effects model. The estimate of the individual random effects variance was σ^g2=1.61, 95%CI: (0.00, 4.31). When we combined the two studies and fitted the random effects logistic regression models using (6) or (4) for the family part of the study, the combined log-odds ratio was 1.22, 95%CI: (0.57, 1.86). The estimate of the individual random effects variance was now σ^g2=1.55, 95%CI: (0.37, 2.73), and was statistically significantly different from zero, indicating that substantial residual correlation was present in the data. This is not surprising, as it is known that MC1R variants also influence pigmentation characteristics, which are independently associated with melanoma risk, and thus might induce correlations not captured by gene carrier status alone.

Table IV: Association between melanocortin-1 receptor (MC1R) gene and melanoma risk in Italian case-control study, family study and both studies combined

                                          Log-odds ratio for carrying
                                          any MC1R variant               95% Confidence interval
Logistic regression
  Case-control study                      1.00                           (0.52, 1.48)
Conditional logistic regression
  Family study                            0.67                           (−0.14, 1.48)
Random effects logistic regression
  Family study                            0.70                           (−0.31, 1.79)
Random effects logistic regression
  Combined studies                        1.22                           (0.57, 1.86)
Fixed effects (marginal) model            0.81                           (0.44, 1.17)
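As a worked check of the case-control row, the unadjusted log-odds ratio can be computed directly from the case-control counts in Table III. The function below is an illustrative sketch; the Wald interval with z = 1.96 is an assumption about how the interval was formed, and it applies only to the unconditional case-control analysis, not to the family-based rows.

```python
import math

def log_odds_ratio(case_exposed, case_unexposed, control_exposed, control_unexposed, z=1.96):
    """Log odds ratio with a Wald interval from a 2x2 table."""
    estimate = math.log(case_exposed * control_unexposed / (case_unexposed * control_exposed))
    se = math.sqrt(1 / case_exposed + 1 / case_unexposed
                   + 1 / control_exposed + 1 / control_unexposed)
    return estimate, estimate - z * se, estimate + z * se

# Case-control counts from Table III: carriers of any variant vs. no variant
estimate, lower, upper = log_odds_ratio(134, 35, 100, 71)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

On the odds ratio scale the point estimate corresponds to exp(1.00), roughly a 2.7-fold increase in the odds of melanoma for carriers.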

We then fitted a fixed effects model to the combined data. As subjects sampled into the case-control study did not have a family history of melanoma, π(Y = 0, FH = 1) = 0, the Horvitz-Thompson estimating function approach was not applicable. We therefore fitted the logistic model (15) that includes a family history variable, using the pseudoscore approach. To apply the two-phase sampling ideas, we reconstructed the population that gave rise to our study sample over the ten years during which individuals for the family study were ascertained. Calista et al [2000] reported 589 incident cases during five years, corresponding to 85% of all cases in the region. Assuming that case ascertainment did not differ between sporadic and familial cases, we estimated that the total number of cases over ten years was 2 · 589/0.85 = 1385.

Of the 589 cases, 30 reported having a relative with melanoma, leading to an estimated total of 70 incident cases with a family history of melanoma. Assuming that the distribution of having one or more relatives among all cases was the same as in the family study, we estimated that 47 cases had one, and 23 cases had two or more relatives with melanoma.

As two healthy individuals with a family history of melanoma were seen by study staff, but not enrolled into the case-control study, we estimated that (2/179) · 10,000,000 healthy individuals had a family history of melanoma. Again, assuming that the population distribution of FH = 1 and FH = 2 followed the distribution in our sample, 74,487 individuals were estimated to have had FH = 1 and 37,243 FH = 2. Given those population counts, the log-odds ratio estimate for the fixed effects model (15) for carrying any MC1R variant was 0.81 with 95%CI: (0.44, 1.17), and the estimated log-odds ratio for FH was 0.41, 95%CI: (0.28, 0.55). The estimates for the fixed effects model for carrying any MC1R variant, and for FH, agree well with the results of the simulations for σa² = 0.5 and σg² = 1.0 (Table II).
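The population reconstruction above is simple arithmetic and can be retraced as follows. The inputs are the figures reported in the text; truncating to whole persons is an assumption made here because it reproduces the reported totals.

```python
# Reported inputs from the text; truncation to whole persons is an assumption.
cases_five_years = 589          # incident cases reported by Calista et al [2000]
coverage = 0.85                 # share of regional cases seen at the study hospital
total_cases = int(2 * cases_five_years / coverage)             # ten-year total
familial_cases = int(30 / cases_five_years * total_cases)      # cases with a family history
healthy_with_fh = int(2 / 179 * 10_000_000)                    # healthy persons with FH > 0

print(total_cases, familial_cases, healthy_with_fh)
print(74_487 + 37_243)          # reported FH = 1 and FH = 2 stratum sizes
```

The two reported stratum sizes sum to 111,730, which agrees with the computed total of healthy individuals with a family history up to rounding.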

5. Discussion

We propose two approaches to combine data from case-control and family studies that were sampled from the same source population. The first approach views a family as the sampling unit, specifies a full likelihood for the families, and uses a two-level random effects model to account for residual correlations among family members. The likelihood for a family is conditioned on the ascertainment event, for example having at least two diseased members. When there is missing covariate information on cases, conditioning on the exact number of cases with full covariate information circumvents the problem of having to deal with missing exposure data. Individuals sampled into the case-control study are viewed as families of size one, and the retrospective sampling is accommodated in the intercept of the random effects logistic model without changing the interpretation of β. The likelihood contributions from the case-control and family studies are then combined, and maximum likelihood estimates of the effects are found.

When the random effects variances are zero and exact conditioning is used for the family study, this approach simplifies to multiplying an unconditional logistic regression likelihood for the case-control study by the conditional logistic likelihoods for the family study. However, if residual correlation is present in the data, this simple approach will lead to estimates of covariate effects that are biased towards the null [Pfeiffer et al, 2001]. While a criticism could be that no straightforward diagnostics are available to evaluate the validity of the random effects model assumptions, the estimates of the fixed effects are robust to moderate misspecifications of the underlying random effects distribution [Pfeiffer et al., 2003]. If the disease of interest is rare and the unmeasured residual component is small, then treating the family members as independent within a given family and relying on conditional logistic regression will result in nearly unbiased estimates of the covariate effects.
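For the special case just described (zero random effects variances and exact conditioning on the number of cases), the combined likelihood can be sketched in a few lines of Python. This is an illustration of the standard conditional logistic likelihood, not the authors' likelihood (6); the function names, the single scalar covariate, and the toy data are all hypothetical.

```python
import math
from itertools import combinations

def conditional_family_likelihood(beta, x, y):
    """Probability of the observed case pattern y, given the number of cases."""
    k = sum(y)
    observed = math.exp(beta * sum(xj for xj, yj in zip(x, y) if yj == 1))
    # Sum over every way of assigning k cases among the family members.
    total = sum(
        math.exp(beta * sum(x[j] for j in subset))
        for subset in combinations(range(len(x)), k)
    )
    return observed / total

def logistic_likelihood(beta0, beta, x, y):
    """Unconditional logistic likelihood for one subject (a family of size one)."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta * x)))
    return p if y == 1 else 1.0 - p

def combined_log_likelihood(beta0, beta, case_control, families):
    """Case-control contributions plus conditional family contributions."""
    ll = sum(math.log(logistic_likelihood(beta0, beta, x, y)) for x, y in case_control)
    ll += sum(math.log(conditional_family_likelihood(beta, x, y)) for x, y in families)
    return ll

# Hypothetical toy data: x = 1 for carriers, y = 1 for cases.
case_control = [(1, 1), (0, 1), (0, 0), (1, 0), (0, 0)]
families = [([1, 1, 0], [1, 1, 0]), ([1, 0, 0, 0], [1, 0, 1, 0])]
print(combined_log_likelihood(-0.5, math.log(2), case_control, families))
```

The intercept drops out of the family contribution because it cancels in the conditional likelihood, so only the case-control subjects inform beta0 in this simplified setting.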

To avoid modeling of missing covariate information in the likelihood based approach, we excluded individuals with missing covariates, and for those families used the likelihood (6), with conditioning on the number of the cases with complete covariate information. If the rate of missingness of a covariate, e.g. the gene of interest, is the same for all cases, there will not be a bias in the estimates. However, if cases who carry the gene have more missing information than cases who do not carry the gene, the estimates of genetic effects will be biased towards the null. As the relative survival rate following a melanoma diagnosis is excellent, we do not expect the missingness of MC1R information to differ strongly by genotype, and thus the impact of excluding cases with missing data on our results should be small.

The second approach views a study subject as the sampling unit, and estimates marginal effects of covariates. To accommodate the sampling scheme in the combined case-control and family study, we introduce sampling probabilities for each individual in the population that depend on case-control status of the person and on a family history variable, that is coded as having none, one or two or more relatives with disease. Using a weighted estimating equations approach, where the weights are the inverse of the sampling probabilities, one can then obtain unbiased estimates of marginal covariate effects. To apply this approach, however, the sampling probabilities need to be strictly greater than zero. This assumption is sometimes violated, as in our motivating example, where individuals sampled into the case-control part of the study were required to have no family history of melanoma. To deal with this setting, we apply a pseudoscore approach [Chatterjee et al, 2003], and estimate covariate effects in the presence of family history. In risk prediction models, family history is often used as a surrogate for genetic components of disease. In our example, family history was introduced to accommodate the sampling scheme in the combined case-control and family study, and can also be viewed as capturing the effects of residual genetic or environmental factors, similar to a random effect. However, the interpretation of the parameter corresponding to the genetic coefficient is conditional on family history, and does not have a marginal interpretation in its own right.
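The inverse-probability-weighted estimating equations mentioned above can be illustrated with a small simulation. The sketch below covers only the Horvitz-Thompson type weighting for a marginal logistic model when all sampling probabilities are strictly positive; it does not implement the pseudoscore estimator and does not compute variances that account for family correlation. The population, the parameter values, and the sampling probabilities π(Y, FH) are all hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(design, y, weights, n_iter=25):
    """Solve the weighted logistic score equations by Newton-Raphson."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta)
        score = design.T @ (weights * (y - p))
        info = (design * (weights * p * (1 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(2008)
n = 200_000
true_beta = np.array([-3.0, 0.8])  # hypothetical marginal intercept and log-odds ratio

# Hypothetical population: carrier status X, disease Y, family history FH (0/1).
x = (rng.random(n) < 0.3).astype(float)
y = (rng.random(n) < expit(true_beta[0] + true_beta[1] * x)).astype(float)
fh = (rng.random(n) < 0.02 + 0.05 * y + 0.05 * x).astype(int)

# Hypothetical sampling probabilities pi(Y, FH); rows index Y, columns index FH.
pi = np.array([[0.01, 0.20],
               [0.30, 0.90]])
pi_i = pi[y.astype(int), fh]
sampled = rng.random(n) < pi_i

design = np.column_stack([np.ones(sampled.sum()), x[sampled]])
ys = y[sampled]
weighted = fit_logistic(design, ys, 1.0 / pi_i[sampled])  # inverse-probability weights
unweighted = fit_logistic(design, ys, np.ones(ys.size))   # ignores the sampling design

print("true      ", true_beta)
print("weighted  ", weighted.round(3))
print("unweighted", unweighted.round(3))
```

If any entry of `pi` were set to zero, as in the motivating example, the corresponding weight would be undefined, which is why a different approach is needed there.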

One could also analyze the data simply by using a generalized estimating equations (GEE) approach [Liang and Zeger, 1986]. However, as the cases and controls are not a random sample of the cases and controls in the population, and the GEE estimates do not take into account the sampling design, the resulting covariate effect estimates will be biased for model (13). Model (15), however, can accommodate the sampling design by absorbing the sampling probabilities into the logistic regression coefficient associated with FH. While the parameter associated with the key exposure, X, can now be estimated without bias, the estimate of the logistic regression coefficient associated with FH is biased and can no longer be interpreted as a true population parameter. In addition, we found that the GEE based estimates of the logistic regression coefficient for X can be substantially less efficient than the estimates based on the pseudo-likelihood approach.

Ideally, the restricted design should be avoided, as the lack of validation subjects in certain strata of family history limits the ability to check the model based on study samples alone. However, in special situations, and under the proper assumptions, the pseudoscore method is flexible enough to accommodate those limitations.

We want to stress again the difference in interpretation of the parameters from the random effects model (1) and the two marginal models, (13) and (15). The random effects model estimate β(X|a, g) for a covariate X captures the effect of X for a person with given familial and polygenic effects. This estimate is useful for etiologic inference. The estimated effect of X in (13), β(X), expresses the average effect of X in the population. It could be useful to predict risk in a randomly selected individual or to gauge risk in the population. The logistic parameter β(X|FH) in (15) gives the average effect of X in the population in the presence of family history. Equation (15) may provide better prediction of risk than using X alone in equation (13). For positive association, the relationship between the parameters from the three models is β(X|a, g) ≥ β(X) ≥ β(X|FH). Parameter estimates from the random effects model and the marginal model (13) are equal only when the random effects variances equal zero. In view of these differences, the choice of model should be based on the intended application.
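The inequality β(X|a, g) ≥ β(X) can be checked numerically by averaging a subject-specific logistic model over a normal random effect. The sketch below is a simplification: it replaces the two-level structure of model (1) with a single random intercept b ~ N(0, σ²), which is an assumption made here for a single individual, and it uses Gauss-Hermite quadrature for the integral. The parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def marginal_log_odds_ratio(beta0, beta1, sigma2, n_nodes=60):
    """Log-odds ratio for X = 1 vs X = 0 after averaging over b ~ N(0, sigma2)."""
    nodes, weights = hermgauss(n_nodes)
    b = np.sqrt(2.0 * sigma2) * nodes   # change of variables for N(0, sigma2)
    w = weights / np.sqrt(np.pi)        # rescaled weights sum to one
    p0 = np.sum(w * expit(beta0 + b))
    p1 = np.sum(w * expit(beta0 + beta1 + b))
    return np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))

beta0, beta1 = -1.0, 0.8  # hypothetical subject-specific intercept and effect
for sigma2 in (0.0, 0.5, 1.5):
    print(sigma2, round(marginal_log_odds_ratio(beta0, beta1, sigma2), 3))
```

With σ² = 0 the marginal and subject-specific log-odds ratios coincide, and the marginal value shrinks towards zero once the random effect has positive variance, in line with the ordering stated above.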

Our family based approach relates to work by Neuhaus et al [2006], who propose methods to fit random effects models to case-control family data. In the case-control family design, cases and controls are sampled and then the data are augmented with response and covariate information from family members. To accommodate the sampling, Neuhaus et al divide the population into strata defined by the joint phenotypes in a family. If information on the sampling rates is available, it can be used to model stratum membership of a family. The authors propose a semi-parametric likelihood method for estimation and also investigate a survey sampling based approach, similar to our marginal model approach. However, they estimate parameters in a random effects model and do not fit marginal models.

Our study design is different from the case-control family design in Neuhaus et al. We do not have phenotype and covariate information for relatives of persons sampled into the case-control study. Indeed, whether such information is gathered at all depends on the phenotypes in a family. We also fit marginal models, where the parameter interpretation is the same, regardless of family size.

In other related work, Siegmund et al [1999] propose a weighted analysis for family data. In the first stage of a two stage design, cases are sampled and classified into strata based on their family history. In the second stage, cases are sampled randomly from the family history strata and information on family members is obtained. If controls are siblings of cases, the weights for cases and controls are constant. In this setting, matching on family leads to a cancellation of the sampling weights, and results in the same standard conditional logistic regression likelihood that is obtained for the family part of the study under the random effects model when the random effects variances are zero.

Several authors have proposed likelihood based methods for combining data from triads, sampled through an affected offspring, and independent cases and controls [Epstein et al, 2005; Nagelkerke et al, 2004]. However, in all those approaches it is assumed that the independent cases and controls are random samples from the cases and controls in the population, while in our setting sampling of cases and controls depends on family history. In addition, only offspring phenotypes are used in the triad data, while we would use phenotypes from all family members.

Another point worth mentioning is that family history as well as having two or more cases in a family is a function of family size. We do not explicitly account for family size in our approaches.

This work was motivated by our desire to combine data from a family and case-control study to increase power to detect associations between variants in the MC1R gene and melanoma risk in Italian subjects. However, our work may have broader applications, and could be adapted to other designs where data from family studies are enriched with other information. Such an example was studied in the Genetic Analysis Workshop (GAW) 15, where one of the problems consisted of jointly analyzing 1500 nuclear families of size 4 with two affected offspring and 2000 unrelated controls to detect genetic associations. Both of our approaches would be applicable in that setting.

ACKNOWLEDGMENTS

We thank Nilanjan Chatterjee and Mitchell Gail for helpful discussions and the referee for insightful comments that led to improvements of the paper. We also thank Yi-Hau Chen for help with the programming for the pseudoscore approach.

Appendix

Assume that for the joint distribution $F$ of $X$, $FH$, $Y$, $F(Y, X, FH) = P_F(Y \mid \beta, X)\,\tilde{F}(X, FH)$, the relationship between $Y$, $X$ and $FH$ is defined through the marginal model (13), and for $G$, $G(Y, X, FH) = P_G(Y \mid X, FH)\,\tilde{G}(X, FH)$, through the marginal model (15). Assume that $F$ is the true probability model, i.e. it gave rise to the observations $Y$, $X$, $FH$. The maximum likelihood estimate under the false model $G$ converges to the value $\beta^* = (\beta_0^*, \beta_1^*, \beta_2^*)$ which minimizes the Kullback-Leibler divergence between the true model $F$ and the misspecified model $G$ [Akaike, 1973; White, 1982]:

$$\beta^* = \arg\min_{\gamma}\, E_{X,FH}\, E_{Y \mid X,FH} \log \frac{P_F(Y \mid \beta, X)}{P_G(Y \mid \gamma, X, FH)}, \tag{20}$$

where the expectation is taken with respect to the true model $F$. In what follows, we assume without loss of generality that $E(X) = 0$ and $E(FH) = 0$. After differentiating the above expression with respect to $\gamma_i$ and some simplification, we see that each component $\beta_i^*$ of $\beta^*$ has to satisfy

$$E_{X,FH}\left\{ \sum_y P_F(y \mid \beta, X)\, \frac{\frac{\partial}{\partial \gamma_i} P_G(y \mid \gamma, X, FH)}{P_G(y \mid \gamma, X, FH)} \right\} = 0. \tag{21}$$

To be more specific, $\beta_1^*$ solves

$$E_{X,FH}\left\{ \sum_y X\, P_F(y \mid \beta, X)\,\bigl(1 - P_G(y \mid \beta^*, X, FH)\bigr) \right\} = 0,$$

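and similar equations are obtained for $\beta_0^*$ and $\beta_2^*$. After putting a Taylor expansion of $P_G(y \mid \beta^*, X, FH)$ around $\beta^* = (\beta_0, \beta_1, 0)$ in the above equation, we see that

$$(\beta_1 - \beta_1^*) = \frac{E_{X,FH}\left\{ [X(\beta_0 - \beta_0^*) + FH\,X\,\beta_2^*]\, P_G(y=0 \mid \beta, X, FH)\, P_G(y=1 \mid \beta, X, FH) \right\}}{E_{X,FH}\left\{ X^2\, P_G(y=0 \mid \beta, X, FH)\, P_G(y=1 \mid \beta, X, FH) \right\}} \tag{22}$$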

For $\beta_1 \approx 0$, we obtain

$$(\beta_1 - \beta_1^*) = E_{X,FH}(FH\,X)\,\beta_2^* / E(X^2). \tag{23}$$

From the above equation, we see that when $X$ and $FH$ are uncorrelated, which happens when $X$ does not influence $Y$, there is no bias in the estimate of the effect of $X$ using the model that also includes $FH$, i.e. $\beta_1^* = \beta_1$. In addition, if the correlation between $FH$ and $X$ is positive and $\beta_2^* > 0$, then $\beta_1^*$ is biased towards the null.
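The step from (22) to (23) can be sketched as follows. The sketch assumes that the probabilities $P_G$ in (22) are evaluated at the expansion point $(\beta_0, \beta_1, 0)$, where the coefficient of $FH$ is zero. For $\beta_1 \approx 0$ the product $P_G(y=0)\,P_G(y=1)$ then no longer depends on $X$ or $FH$ and is approximately a constant, written here as $c$ (a symbol introduced only for this illustration), and the intercept term vanishes because $E(X) = 0$.

```latex
\begin{align*}
(\beta_1 - \beta_1^*)
  &\approx \frac{c\,\{(\beta_0 - \beta_0^*)\,E(X) + \beta_2^*\,E_{X,FH}(FH\,X)\}}{c\,E(X^2)} \\
  &= \frac{E_{X,FH}(FH\,X)\,\beta_2^*}{E(X^2)},
  \qquad c = \frac{e^{\beta_0}}{(1 + e^{\beta_0})^2}.
\end{align*}
```

Because $E(X) = 0$ and $E(FH) = 0$, the term $E_{X,FH}(FH\,X)$ is the covariance of $FH$ and $X$, which is why the sign of the bias follows the sign of their correlation.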

References

  1. Akaike H. 1973. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, Ed. Petrov BN and Csaki F, pp 267–91. Budapest: Akademiai Kiado.
  2. Binder DA. 1983. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 51:279–292.
  3. Calista D, Goldstein AM, Landi MT. 2000. Familial melanoma aggregation in north-eastern Italy. J Invest Dermatol 115:764–765.
  4. Chatterjee N, Chen YH. 2007. Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling. J Roy Statist Soc Ser B 69:123–142.
  5. Chatterjee N, Chen YH, Breslow NE. 2003. A pseudoscore estimator for regression problems with two-phase sampling. J Amer Statist Assoc 98:158–168.
  6. Duffy DL, Box NF, Chen W, Palmer JS, Montgomery GW, James MR, Hayward NK, Martin NG, Sturm RA. 2004. Interactive effects of MC1R and OCA2 on melanoma risk phenotypes. Hum Mol Genet 13:447–461.
  7. Epstein MP, Veal CD, Trembath RC, Barker JNWN, Li C, Satten GA. 2005. Genetic association analysis using data from triads and unrelated subjects. Am J Hum Genet 76:592–608.
  8. Fisher RA. 1918. The correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edinb 52:399–433.
  9. Kennedy C, ter Huurne J, Berkhout M, Gruis N, Bastiaens M, Bergman W, Willemze R, Bavinck JN. 2001. Melanocortin 1 receptor (MC1R) gene variants are associated with an increased risk for cutaneous melanoma which is largely independent of skin type and hair color. J Invest Dermatol 117:294–300.
  10. Landi MT, Baccarelli A, Calista D, Pesatori A, Fears T, Tucker MA, Landi G. 2001. Combined risk factors for melanoma in a Mediterranean population. Br J Cancer 85:1304–1310.
  11. Landi MT, Kanetsky PA, Gold B, Tsang S, Munroe D, Rebbeck T, Swoyer J, Ter-Minassian M, Goldstein AM, Calista D, Pfeiffer RM. 2005. MC1R and ASIP variants, and DNA repair activity in sporadic and familial melanoma in a Mediterranean population. J Natl Cancer Inst 97:998–1007.
  12. Liang KY, Zeger SL. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73:13–22.
  13. Nagelkerke NJD, Hoebee B, Teunis P, Kimman TG. 2004. Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression. Eur J Hum Genet 12:964–970.
  14. Neuhaus JM, Scott AJ, Wild CJ. 2006. Family-specific approaches to the analysis of case-control family data. Biometrics 62:488–494.
  15. Pfeffermann D. 1993. The role of sampling weights when modeling survey data. International Statistical Review 61:317–337.
  16. Pfeiffer RM, Gail MH, Pee D. 2001. Inference for environmental effects based on family data that accounts for ascertainment. Biometrika 88:933–948.
  17. Pfeiffer R, Hildesheim A, Gail MH, Pee D, Cheng YJ, Goldstein A, Diehl S. 2003. Robustness of inference on measured covariates to misspecification of genetic random effects in family studies. Genetic Epidem 24:14–23.
  18. Siegmund KD, Whittemore AS, Thomas DC. 1999. Multistage sampling for disease family registries. J Natl Cancer Inst Monograph 26:43–49.
  19. White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50:1–25.
  20. Whittemore AS. 1997. Multistage sampling designs and estimating equations. J Roy Statist Soc Ser B 59:589–602.
  21. Whittemore AS, Halpern J. 1997. Multi-stage sampling in genetic epidemiology. Stat in Med 16:153–167.
