Author manuscript; available in PMC: 2011 Sep 16.
Published in final edited form as: Lifetime Data Anal. 2010 Dec 12;17(2):175–194. doi: 10.1007/s10985-010-9178-5

Missing Genetic Information in Case-Control Family Data with General Semi-Parametric Shared Frailty Model

Anna Graber-Naidich 1, Malka Gorfine 2, Kathleen E Malone 3, Li Hsu 4
PMCID: PMC3174530  NIHMSID: NIHMS323227  PMID: 21153764

Abstract

Case-control family data are now widely used to examine the role of gene-environment interactions in the etiology of complex diseases. In these types of studies, exposure levels are obtained retrospectively and, frequently, information on most risk factors of interest is available on the probands but not on their relatives. In this work we consider correlated failure time data arising from population-based case-control family studies with missing genotypes of relatives. We present a new method for estimating the age-dependent marginalized hazard function. The proposed technique has two major advantages: (1) it is based on the pseudo full likelihood function rather than a pseudo composite likelihood function, which usually suffers from substantial efficiency loss; (2) the cumulative baseline hazard function is estimated using a two-stage estimator instead of an iterative process. We assess the performance of the proposed methodology with simulation studies, and illustrate its utility on a real data example.

Keywords: case-control family study, missing genotypes, multivariate survival analysis, frailty model

1 Introduction

Case-control family studies are frequently used in epidemiologic and medical research to dissect the genetic and environmental etiology of complex traits such as cancer and coronary heart disease (see, for example, Becher et al., 2003; Hopper, 2003; Chatterjee et al., 2006; Malone et al., 2000; Malone et al., 2006). A typical case-control family study includes a sample of independent diseased individuals (case probands) and non-diseased individuals (control probands). Data on age at onset and an array of risk factors are collected on the probands and their relatives. Genetic and environmental background shared by family members leads to correlation of outcomes within a family. An objective of this type of study is to estimate the hazard function for developing disease given these covariates.

Several recent publications focus on multivariate survival models which take into account within-cluster dependencies and sampling design while assuming that all risk factors are observed for the probands and their relatives (see Shih and Chatterjee, 2002; Hsu et al., 2004; Hsu and Gorfine, 2006; Gorfine et al., 2009; and others). However, cost and logistics often prohibit collection of important risk factors on all family members. For example, candidate genes that may be related to the disease of interest are frequently genotyped only for probands and not for their relatives. In this paper we consider hazard function estimation when the genetic information of proband relatives is missing. Our work is motivated by a multicenter breast cancer study of women aged 35–64 (Malone et al., 2000; Malone et al., 2006). In this study cases were incident breast cancer cases ascertained from a set of geographically defined populations in the United States (four centers using population-based cancer registries, one center using a hospital-based sampling plan), and controls were selected by random-digit dialing and then frequency matched with case probands on study center, race and age. Data on potential breast cancer risk factors were obtained via structured in-person interviews. Each subject was asked to enumerate all first-degree female blood relatives (mother, sisters, daughters), including birth year, vital status, death year, history and type of cancer, and laterality (if breast cancer). Blood samples were collected on the case and control probands for genetic analysis. One study objective was to estimate the hazard function of age at breast cancer diagnosis for subjects who are or are not carriers of BRCA1/2 mutations. However, no blood was collected on relatives, so the mutation status of relatives could not be determined through genotyping.

In general, two modeling approaches are used to account for the correlation within clusters: the marginal model and the conditional model. In the marginal model (see, for example, Shih and Louis, 1995; Shih and Chatterjee, 2002; Genest and MacKay, 1986; Marshall and Olkin, 1988; and Zhao et al., 1998), the correlation within a cluster is modeled using a copula function. The main idea behind the copula model approach is to decompose the multivariate survival distribution into two components: the marginal survival distributions and the copula function that links these marginal functions. In the conditional model, also referred to as the frailty model, a latent family-specific random factor, i.e. the frailty, is included and represents the unobserved common risk shared by family members. Family member event times are assumed to be independent conditional on this unobserved frailty term. The frailty is usually assumed to act multiplicatively on the hazard function. One of the most extensively used frailty distributions is the gamma (Gill, 1985, 1989; Nielsen et al., 1992; Klein, 1992; among others), as it is convenient mathematically. However, any distribution with positive support can be used, and several have appeared in applications, such as the positive stable (Hougaard, 1986; Fine et al., 2003), inverse Gaussian, compound Poisson (Henderson and Oman, 1999) and log-normal (McGilchrist, 1993; Ripatti and Palmgren, 2000; Vaida and Xu, 2000). A comparison of the properties of the various frailty distributions can be found in Hougaard (2000) and Duchateau and Janssen (2008).

For a case-control family study with missing genotypes of relatives, Chatterjee et al. (2006) proposed a marginal piecewise proportional hazards model with pre-specified knots and a copula model for the joint survival function of the family members. For the case of more than one relative per proband, they proposed two estimation techniques. One is based on the full likelihood and the other on a “composite likelihood” that has both computational and robustness advantages. In the composite likelihood approach, families with M > 1 relatives are broken into M relative-proband doublets, and each doublet is assumed to follow a bivariate distribution and to be independent of the other doublets. By simulation, the authors showed that their estimators performed well in terms of bias. Chen et al. (2009) also proposed an estimation technique to accommodate missing genotypes of relatives. The approach is based on an expectation-conditional-maximization algorithm applied to the composite likelihood. It can accommodate the two-phase case-control study design, and also performs well in terms of bias in finite samples. Chen et al. then examined the coverage rates of bootstrap confidence intervals and found that they performed well in a range of situations except in the case of rare mutations. In both Chatterjee et al. (2006) and Chen et al. (2009), the carrier status of each relative is inferred individually from the observed genotype of the proband under Mendel’s law of inheritance.

In our survival analysis context, under the marginal model approach of Chatterjee et al. (2006), the regression coefficients describe the effect of risk factors on the population average and are salient to practical public health administration. In contrast, under the conditional modeling of Chen et al. (2009), the regression coefficients characterize the effect of risk factors on a subject’s disease risk due to exposure relative to the risk of an unexposed family member. Therefore, the conditional model is useful when the objective is to make inference about individual families, as in genetic counseling. A comprehensive comparison of conditional modeling versus the marginal population-average modeling can be found in Zeger et al. (1988).

In this work we focus on conditional modeling and provide a new estimation technique, which is a modification of the method proposed in Gorfine et al. (2009) to accommodate missing genotypes of relatives. We use the conditional model approach because it not only accounts for correlation among relatives, but also yields a marginalized hazard function that accommodates non-proportionality in the marginalized hazards without using the piecewise model of Chatterjee et al. (2006). In addition, due to the conditional independence commonly assumed under this model, the full likelihood can be easily implemented, although we still need to sum over the unknown genotypes of all the relatives. We show by simulation and the data example that the efficiency gain under the full likelihood can be substantial. Moreover, our approach avoids the use of an iterative procedure for estimating the unspecified cumulative baseline hazard function, which reduces the computing time compared to the iterative estimators used in Chatterjee et al. (2006) and Chen et al. (2009). In Section 2 we formulate model assumptions and sampling design. In Section 3 we construct the likelihood function. The proposed estimation procedure is described in Section 4, and its finite sample properties are examined via a comprehensive simulation study in Section 5. In Section 6 we analyze a breast cancer dataset using our method of estimation. Discussion is given in Section 7.

2 The sampling design and model assumptions

Let $T^o$ and C be the failure time and censoring time, respectively, and let Z be a p-vector of time-independent covariates. Let ω be a random variable with known density function f(ω) = f(ω; θ) where θ is an unknown parameter. For simplicity θ is assumed to be a scalar. Suppose that the conditional distribution of $T^o$ given Z and ω follows an extended Cox (1972) proportional hazards model of the form

\[ \lambda(t \mid Z, \omega) = \omega\, \lambda_0(t) \exp(\beta^T Z) \tag{1} \]

where β is a p-vector of unknown regression coefficients, and λ0 is an unspecified conditional baseline hazard function. Define $T = \min(T^o, C)$, $\delta = I(T^o \le C)$, $N(t) = I(T \le t, \delta = 1)$, $Y(t) = I(T \ge t)$, and $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$.

Consider a case-control family study where n0 control-probands (δ = 0) and n1 case-probands (δ = 1) are randomly sampled from a well-defined population. Relatives of each proband are ascertained as well, resulting in n = n0 + n1 independent families. Let $\beta^T = (\alpha^T, \gamma)$ and $Z^T = (X^T, G)$, where X denotes the covariates observed for all study subjects (e.g. gender, ethnicity) and G denotes the genotype related to the disease under study, which is observed only for the probands. For simplicity of presentation we do not include interaction terms between X and G, although a model with such interaction terms could be developed in a similar manner. The observed data consist of n independent vectors $(T_{i0}, X_{i0}^T, G_{i0}, \delta_{i0}, T_{iR}^T, X_{iR}^T, \delta_{iR}^T)$, $i = 1, \ldots, n$, where i0 corresponds to the proband of the ith family with $m_i$ relatives, $T_{iR}^T = (T_{i1}, \ldots, T_{im_i})$, $X_{iR}^T = (X_{i1}^T, \ldots, X_{im_i}^T)$, and $\delta_{iR}^T = (\delta_{i1}, \ldots, \delta_{im_i})$. The unobserved data are the relatives’ genotypes $\{G_{iR}\}_{i=1}^n$, with $G_{iR}^T = (G_{i1}, \ldots, G_{im_i})$, and the family-level frailty variates $\{\omega_i\}_{i=1}^n$.

For simplicity, we work with the case where G is binary, equaling 1 if the subject carries the high-risk genotype and 0 otherwise. Also, let π denote the unknown high-risk allele frequency, and assume a dominant allelic effect. Hence, $\Pr(G = 1) = \pi^2 + 2\pi(1 - \pi)$. The dominance assumption is not essential: if the mode of inheritance is recessive, then $\Pr(G = 1) = \pi^2$. We assume Mendelian inheritance, namely that each individual independently receives one allele from his/her mother and another from his/her father. Under this assumption, we can write the joint probabilities of the relatives’ genotypes given their proband’s genotype, Pr(GiR|Gi0), as a function of the unknown high-risk allele frequency π.
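As an illustration of how Pr(GiR|Gi0) depends on π, the following minimal sketch computes the carrier probability of a mother given her daughter's carrier status by enumerating allele transmissions. It assumes Hardy-Weinberg proportions for the mother's genotype (which is what the expression for Pr(G = 1) above implies) and a random population allele from the father; the function names are hypothetical and not part of the original method.

```python
from itertools import product

def hwe_genotype_probs(pi):
    """P(number of high-risk alleles = 0, 1, 2) under Hardy-Weinberg proportions."""
    return [(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2]

def is_carrier(n_alleles, dominant=True):
    """Binary G: 1 if the genotype counts as high-risk under the chosen mode."""
    return int(n_alleles >= 1) if dominant else int(n_alleles == 2)

def mother_given_daughter(pi, dominant=True):
    """Return table[g0][g1] = Pr(mother's G = g1 | daughter's G = g0)."""
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for a_mother, p_mother in enumerate(hwe_genotype_probs(pi)):
        # The mother transmits a high-risk allele with probability a_mother / 2;
        # the paternal allele is treated as a random draw from the population.
        for from_mother, from_father in product((0, 1), repeat=2):
            p_m = a_mother / 2 if from_mother else 1 - a_mother / 2
            p_f = pi if from_father else 1 - pi
            g0 = is_carrier(from_mother + from_father, dominant)
            g1 = is_carrier(a_mother, dominant)
            joint[g0][g1] += p_mother * p_m * p_f
    # Normalize each row to condition on the daughter's carrier status.
    return [[joint[g0][g1] / sum(joint[g0]) for g1 in (0, 1)] for g0 in (0, 1)]
```

Under the dominant model, a non-carrier daughter must have received a normal allele from her mother, so the mother is a carrier with probability π; the table returned above reproduces this.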

3 The likelihood function

For constructing the likelihood function we assume further that: (1) conditional on $\{X_{ij}, G_{ij}\}_{j=0}^{m_i}$ and ωi, the censoring times are independent of the failure times and noninformative for ωi and (β, π, Λ0); (2) the frailty ωi is independent of $\{X_{ij}, G_{ij}\}_{j=0}^{m_i}$; (3) the covariate effect is subject specific, namely Pr(Tij, δij|Xi0, Gi0, XiR, GiR, ωi) = Pr(Tij, δij|Xij, Gij, ωi). Then, in the spirit of Hsu et al. (2004) and Chatterjee et al. (2006), we write

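\[ L = \prod_{i=1}^{n} f(T_{iR}, \delta_{iR}, X_{iR}, X_{i0}, G_{i0} \mid T_{i0}, \delta_{i0}) = \prod_{i=1}^{n} f(T_{iR}, \delta_{iR} \mid X_{iR}, X_{i0}, G_{i0}, T_{i0}, \delta_{i0}) \times f(X_{iR} \mid X_{i0}, G_{i0}) \times f(X_{i0}, G_{i0} \mid T_{i0}, \delta_{i0}). \tag{2} \]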

Since f(XiR|Xi0, Gi0) does not depend on the parameters of interest (β, θ, π, Λ0), this term will be ignored and only the other two terms in (2) will be used in the following proposed estimation technique.

Denote the proband likelihood by $L^{(1)} = \prod_{i=1}^{n} f(X_{i0}, G_{i0} \mid T_{i0}, \delta_{i0})$ and the likelihood based on the relatives’ data by $L^{(2)} = \prod_{i=1}^{n} f(T_{iR}, \delta_{iR} \mid X_{iR}, X_{i0}, G_{i0}, T_{i0}, \delta_{i0})$. As in Chen et al. (2009), assume that X and G are independent in the target population so that, similarly to Chatterjee et al. (2006), we use Bayes’ theorem and get

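\[ L^{(1)} = \prod_{i=1}^{n} \frac{f(T_{i0}, \delta_{i0} \mid X_{i0}, G_{i0}) \Pr(G_{i0})\, f(X_{i0})}{\int_x \sum_g f(T_{i0}, \delta_{i0} \mid x, g) \Pr(G_{i0} = g)\, f(x)\, dx} \]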

where the Σ and the ∫ are over all possible values of G and X, respectively. Alternatively, Chen et al. (2009) proposed considering $\tilde{L}^{(1)} = \prod_{i=1}^{n} f(G_{i0} \mid T_{i0}, \delta_{i0}, X_{i0})$ so that

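\[ \tilde{L}^{(1)} = \prod_{i=1}^{n} \frac{f(T_{i0}, \delta_{i0} \mid X_{i0}, G_{i0}) \Pr(G_{i0})}{\sum_g f(T_{i0}, \delta_{i0} \mid X_{i0}, g) \Pr(G_{i0} = g)} \]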

is independent of the unknown marginal distribution of X. For the conditional density of (T, δ) given (X, G), we consider the conditional hazard function (1) and get

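\[ f(T_{i0}, \delta_{i0} \mid X_{i0}, G_{i0}) = \int \{\lambda(T_{i0} \mid X_{i0}, G_{i0}, \omega)\}^{\delta_{i0}}\, S(T_{i0} \mid X_{i0}, G_{i0}, \omega)\, f(\omega)\, d\omega = \{\lambda_0(T_{i0}) \exp(\alpha^T X_{i0} + \gamma G_{i0})\}^{\delta_{i0}} \int \omega^{\delta_{i0}} \exp\{-\omega H_{i0}(T_{i0})\}\, f(\omega)\, d\omega \]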

where $H_{i0}(T_{i0}) = \Lambda_0(T_{i0}) \exp(\alpha^T X_{i0} + \gamma G_{i0})$. The probability of the carrier status in $L^{(1)}$ or $\tilde{L}^{(1)}$ is written in terms of the unknown high-risk allele frequency π.
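For a concrete frailty distribution the integral above has a closed form. The sketch below assumes a gamma frailty with mean 1 and variance θ (the distribution used later in the simulations) and a scalar covariate X; the function names and the callables `lam0` and `Lam0` for the baseline hazard and cumulative hazard are illustrative assumptions, not part of the original method.

```python
import math

def gamma_frailty_moment(n, H, theta):
    """Closed form of the integral of w**n * exp(-w * H) * f(w) dw when f is the
    gamma density with mean 1 and variance theta (shape 1/theta, scale theta)."""
    k = 1.0 / theta
    log_val = (math.lgamma(k + n) - math.lgamma(k)
               + n * math.log(theta) - (k + n) * math.log1p(theta * H))
    return math.exp(log_val)

def proband_density(T0, delta0, x0, g0, alpha, gamma_, lam0, Lam0, theta):
    """f(T_i0, delta_i0 | X_i0, G_i0) for a scalar covariate x0.

    lam0 and Lam0 are callables returning the baseline hazard and the
    cumulative baseline hazard at a given time."""
    risk = math.exp(alpha * x0 + gamma_ * g0)
    H = Lam0(T0) * risk  # H_i0(T_i0)
    return (lam0(T0) * risk) ** delta0 * gamma_frailty_moment(delta0, H, theta)
```

With `delta0 = 0` the function returns the marginal survival probability, which under this gamma assumption is (1 + θ H)^(−1/θ).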

Next, we consider the computation of L(2). Write

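\[ f(T_{iR}, \delta_{iR} \mid X_{iR}, T_{i0}, \delta_{i0}, X_{i0}, G_{i0}) = \int f(T_{iR}, \delta_{iR} \mid X_{iR}, T_{i0}, \delta_{i0}, X_{i0}, G_{i0}, \omega)\, f(\omega \mid T_{i0}, \delta_{i0}, X_{i0}, G_{i0})\, d\omega. \tag{3} \]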

Given assumptions (2) and (3) above, and since X and G are assumed to be independent in the target population, we get

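\[ f(T_{iR}, \delta_{iR} \mid X_{iR}, T_{i0}, \delta_{i0}, X_{i0}, G_{i0}, \omega_i) = \sum_{g^R \in \{0,1\}^{m_i}} \prod_{j=1}^{m_i} f(T_{ij}, \delta_{ij} \mid X_{ij}, G_{ij} = g_{ij}^R, \omega_i) \Pr(G_{iR} = g^R \mid G_{i0}), \tag{4} \]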

where $\sum_{g^R \in \{0,1\}^{m_i}}$ represents a sum over all possible configurations of the genotypes of the mi relatives. From model (1) we get

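\[ f(T_{ij}, \delta_{ij} \mid X_{ij}, G_{ij} = g, \omega_i) = \{\lambda_0(T_{ij}) \exp(\alpha^T X_{ij} + \gamma g)\}^{\delta_{ij}}\, \omega_i^{\delta_{ij}} \exp\{-\omega_i H_{ij}(\tau)\}, \tag{5} \]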

for i = 1, …, n and j = 1, …, mi, with $H_{ij}(t) = \Lambda_0(\min\{t, T_{ij}\}) \exp(\alpha^T X_{ij} + \gamma G_{ij})$, where τ is the maximal follow-up time. The joint probability of the carrier status of the relatives given the proband’s status, P(GiR|Gi0), is expressed in terms of π by using Mendelian inheritance laws and can be easily written for each family structure, as we will demonstrate in Section 6. Combining equations (3), (4) and (5) we get the likelihood function $L^{(2)}$.
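The combination of (3), (4) and (5) for a single family can be sketched as follows. This is a minimal illustration under a gamma frailty with mean 1 and variance θ, for which the frailty given the proband's data is again gamma (shape 1/θ + δi0, rate 1/θ + Hi0(Ti0)) and the integral over ω is available in closed form. A scalar covariate is assumed, and `geno_prob`, `lam0` and `Lam0` are hypothetical user-supplied callables.

```python
import math
from itertools import product

def relatives_likelihood(rel, proband, geno_prob, alpha, gamma_, lam0, Lam0, theta):
    """One family's contribution to L^(2) under a gamma frailty.

    rel: list of (T, delta, x) for the relatives; proband: (T0, delta0, x0, g0);
    geno_prob(g_tuple, g0) returns Pr(G_iR = g_tuple | G_i0 = g0)."""
    T0, d0, x0, g0 = proband
    k = 1.0 / theta
    # Frailty given the proband's data: gamma with this shape and rate.
    shape = k + d0
    rate = k + Lam0(T0) * math.exp(alpha * x0 + gamma_ * g0)
    total = 0.0
    # Sum over all 2**m_i genotype configurations of the relatives, as in (4).
    for g in product((0, 1), repeat=len(rel)):
        log_haz, H, N = 0.0, 0.0, 0
        for (T, d, x), gj in zip(rel, g):
            lin = alpha * x + gamma_ * gj
            H += Lam0(T) * math.exp(lin)  # H_ij evaluated at the observed time
            if d:
                log_haz += math.log(lam0(T)) + lin
                N += 1
        # Integral of w**N * exp(-w * H) against the Gamma(shape, rate) density.
        log_int = (math.lgamma(shape + N) - math.lgamma(shape)
                   + shape * math.log(rate) - (shape + N) * math.log(rate + H))
        total += geno_prob(g, g0) * math.exp(log_haz + log_int)
    return total
```

The number of terms grows as 2^mi, which is manageable for the family sizes considered here but is the computational price of the full likelihood relative to the composite likelihood.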

The resulting likelihood $L^{(1)} L^{(2)}$ is a function of the unknown parameters (θ, β, π, Λ0) and the unknown density function of the observed covariates X. The modified likelihood $\tilde{L}^{(1)} L^{(2)}$ is a function of the unknown parameters (θ, β, π, Λ0) only.

4 Estimation

In what follows, we adopt the pseudo full likelihood estimation approach of Gorfine et al. (2009) with the required modification for the case of missing genotypes of relatives. We begin with an estimator of the cumulative baseline hazard function Λ0.

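Denote by $\mathcal{F}_t$ the σ-algebra generated by the known history up to time t,

\[ \mathcal{F}_t = \sigma\{T_{i0}, \delta_{i0}, X_{i0}, G_{i0}, N_{ij}(u), Y_{ij}(u), X_{ij};\ i = 1, \ldots, n;\ j = 1, \ldots, m_i;\ 0 \le u \le t\} \]

and let $\mathcal{F}_t^{+}$ consist of $\mathcal{F}_t$ plus the unobserved genotypes of the relatives, that is

\[ \mathcal{F}_t^{+} = \sigma\{T_{i0}, \delta_{i0}, X_{i0}, G_{i0}, N_{ij}(u), Y_{ij}(u), X_{ij}, G_{ij};\ i = 1, \ldots, n;\ j = 1, \ldots, m_i;\ 0 \le u \le t\}. \]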

For i = 1, …, n, j = 1, …, mi write

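\[ E\{\omega_i \exp(\alpha^T X_{ij} + \gamma G_{ij}) \mid \mathcal{F}_t\} = E\big[E\{\omega_i \exp(\alpha^T X_{ij} + \gamma G_{ij}) \mid \mathcal{F}_t, G_{iR}\}\big] = E\{\exp(\alpha^T X_{ij} + \gamma G_{ij})\, E(\omega_i \mid \mathcal{F}_t^{+})\} = \sum_{g^R \in \{0,1\}^{m_i}} \exp(\alpha^T X_{ij} + \gamma g_{ij}^R)\, E(\omega_i \mid \mathcal{F}_t, g^R) \Pr(G_{iR} = g^R \mid G_{i0}) \equiv \psi_{ij}(\beta, \theta, \pi, \Lambda_0, t) \tag{6} \]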

where

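\[ E(\omega_i \mid \mathcal{F}_t^{+}) = \frac{\int \omega^{N_{i\cdot}(t) + 1 + \delta_{i0}} \exp[-\omega\{H_{i\cdot}(t) + H_{i0}(T_{i0})\}]\, f(\omega)\, d\omega}{\int \omega^{N_{i\cdot}(t) + \delta_{i0}} \exp[-\omega\{H_{i\cdot}(t) + H_{i0}(T_{i0})\}]\, f(\omega)\, d\omega}, \tag{7} \]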

$N_{i\cdot}(t) = \sum_{j=1}^{m_i} N_{ij}(t)$ and $H_{i\cdot}(t) = \sum_{j=1}^{m_i} H_{ij}(t)$. Then, the stochastic intensity process of Nij(t), i = 1, …, n, j = 1, …, mi, with respect to $\mathcal{F}_t$ can be written as

\[ \lambda_0(t)\, Y_{ij}(t)\, \psi_{ij}(\beta, \theta, \pi, \Lambda_0, t). \tag{8} \]
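As a worked special case (an assumption for illustration, since the development above holds for a general frailty density), take the gamma frailty with mean 1 and variance θ. Using the standard gamma integral $\int \omega^{a-1} e^{-b\omega}\, d\omega = \Gamma(a)/b^{a}$, the ratio in (7) reduces to the mean of a gamma distribution with updated shape and rate:

```latex
% Gamma frailty: f(\omega) \propto \omega^{1/\theta - 1} e^{-\omega/\theta}
\[
E(\omega_i \mid \mathcal{F}_t^{+})
  = \frac{\Gamma\!\left(\theta^{-1} + N_{i\cdot}(t) + \delta_{i0} + 1\right)}
         {\Gamma\!\left(\theta^{-1} + N_{i\cdot}(t) + \delta_{i0}\right)}
    \cdot \frac{1}{\theta^{-1} + H_{i\cdot}(t) + H_{i0}(T_{i0})}
  = \frac{\theta^{-1} + N_{i\cdot}(t) + \delta_{i0}}
         {\theta^{-1} + H_{i\cdot}(t) + H_{i0}(T_{i0})}.
\]
```

The dependence on the relatives' genotypes enters only through $H_{i\cdot}(t)$, which is why the sum over genotype configurations in (6) is still required.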

Let τg, g = 1, …, G, denote the gth ordered failure time among the relatives and assume that dg failures were observed at time τg. We express the following estimators in a form that allows for a modest level of ties. In light of the stochastic intensity process (8), a Breslow-type estimator of the cumulative baseline hazard function, with a jump at each observed failure time among the relatives, is given by

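\[ \Delta\Lambda_0(\tau_g) = \frac{d_g}{\sum_{i=1}^{n} \sum_{j=1}^{m_i} Y_{ij}(\tau_g)\, \psi_{ij}(\beta, \theta, \pi, \Lambda_0, \tau_{g-1})}. \tag{9} \]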

However, as is evident from Equations (6) and (7), the conditional expectation in the denominator of (9) is a function of Λ0(Ti0), and Ti0 could be greater than the failure time τg at which the jump is being estimated. Consequently, the above Breslow-type formula for the jump in the baseline hazard estimator at time t will often involve values of Λ0 for times beyond time t, and thus an iterative procedure is required to compute the estimator.

Alternatively, we present here a two-stage non-iterative estimator of the baseline hazard function in the spirit of Gorfine et al. (2009) with the required modification for missing genotype of relatives. The first-stage estimator is defined as a step function whose g-th jump is given by

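\[ \Delta\tilde{\Lambda}_0(\tau_g) = \frac{\tilde{d}_g}{\sum_{i=1}^{n} \sum_{j=1}^{m_i} I(T_{i0} < \tau_g)\, Y_{ij}(\tau_g)\, \psi_{ij}(\beta, \theta, \pi, \tilde{\Lambda}_0, \tau_{g-1})}, \tag{10} \]

with

\[ \tilde{d}_g = \sum_{i=1}^{n} I(T_{i0} < \tau_g) \sum_{j=1}^{m_i} dN_{ij}(\tau_g). \]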

Formula (10) is of the same form as (9), but in computing the jump at each failure time τg we include only relatives whose proband’s observation time is less than τg. Hence, we avoid the aforementioned problem with (9), and consequently we avoid the need for an iterative optimization process. To compensate for the efficiency loss caused by the exclusion of some of the available data when using (10), we follow up with a second stage.

The second-stage estimator is defined as a step function whose g-th jump is given by

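\[ \Delta\hat{\Lambda}_0(\tau_g) = \frac{d_g}{\sum_{i=1}^{n} \sum_{j=1}^{m_i} Y_{ij}(\tau_g)\, \tilde{\psi}_{ij}(\beta, \theta, \pi, \tau_{g-1})}, \tag{11} \]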

where ψ̃ij(β, θ, π, t) is defined analogously to ψij(β, θ, π, Λ0, t), with Λ0(Ti0) replaced by Λ̃0(Ti0) if Ti0 ≥ t and by Λ̂0(Ti0) otherwise. In Section 5 we show, by simulation, that the efficiency loss associated with Λ̂0 in comparison to the iterative estimator (9) is negligible.
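The two stages can be sketched in code as follows. This is a structural illustration only: the conditional expectation ψij is left as a hypothetical user-supplied callable `psi`, the data layout is an assumption, and with `psi` identically 1 the second stage reduces to the usual Nelson-Aalen jumps d_g / (number at risk).

```python
from bisect import bisect_right

def step_value(times, jumps, u):
    """Value at u of a right-continuous step function with the given jumps."""
    return sum(jumps[:bisect_right(times, u)])

def two_stage_baseline(families, psi):
    """families: list of (T0, [(T, delta), ...]) with T0 the proband's time.

    psi(i, j, t_prev, lam_T0, lam) is a hypothetical callable evaluating psi_ij
    at the previous failure time t_prev, given the value lam_T0 used in place of
    Lambda_0(T_i0) and a function lam(u) for Lambda_0 at earlier times."""
    taus = sorted({T for _, rels in families for T, d in rels if d == 1})

    def jump(g, include, lam_T0_of, lam):
        tau = taus[g]
        t_prev = taus[g - 1] if g > 0 else 0.0
        d_g, denom = 0, 0.0
        for i, (T0, rels) in enumerate(families):
            if not include(T0, tau):
                continue
            lam_T0 = lam_T0_of(T0, tau)
            for j, (T, delta) in enumerate(rels):
                d_g += int(delta == 1 and T == tau)
                if T >= tau:  # Y_ij(tau) = 1
                    denom += psi(i, j, t_prev, lam_T0, lam)
        return d_g / denom if denom > 0 else 0.0

    tilde = []  # first stage, equation (10): only families with T_i0 < tau_g
    for g in range(len(taus)):
        lam = lambda u: step_value(taus[:len(tilde)], tilde, u)
        tilde.append(jump(g, lambda T0, tau: T0 < tau,
                          lambda T0, tau: lam(T0), lam))

    hat = []  # second stage, equation (11): all families
    for g in range(len(taus)):
        lam = lambda u: step_value(taus[:len(hat)], hat, u)
        # First-stage value if T_i0 >= tau_g, second-stage value otherwise.
        pick = lambda T0, tau: (step_value(taus, tilde, T0) if T0 >= tau
                                else lam(T0))
        hat.append(jump(g, lambda T0, tau: True, pick, lam))
    return taus, tilde, hat
```

Each stage is a single pass over the ordered failure times, which is the source of the computational saving over the iterative estimator.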

Given an estimator for the cumulative baseline hazard function, we propose to estimate (β, θ, π) by using a pseudo full likelihood estimation approach, where in the score functions based on $\tilde{L}^{(1)} L^{(2)}$ or $L^{(1)} L^{(2)}$ we replace Λ0 by its estimate, either the iterative estimator (9) or the two-stage estimator Λ̂0. It should be noted that when using $L^{(1)} L^{(2)}$ an estimator of the density of X is required. To sum up, our proposed estimation procedure is as follows: (1) provide initial values for (β, θ, π); (2) given the value of (β, θ, π), estimate Λ0 by using either the iterative estimator (9) or Λ̂0; (3) given the value of Λ0, estimate (β, θ, π) by maximizing $\tilde{L}^{(1)} L^{(2)}$; (4) repeat Steps 2 and 3 until convergence is reached with respect to the estimates of β, θ, π and Λ0. If $L^{(1)}$ instead of $\tilde{L}^{(1)}$ is used in Step (3), an estimator of the marginal density of X should be included in the procedure and Step (3) should be replaced by the following step: (3*) given the value of Λ0, estimate (β, θ, π) and f(X) by maximizing $L^{(1)} L^{(2)}$ and $L^{(1)}$, respectively. Note that either a parametric or non-parametric model can be considered for f(X).

5 Simulation studies

The objectives of the following simulation study are: (1) to investigate the finite sample properties of our proposed estimators; (2) to compare various popular bootstrap methods for the construction of confidence intervals; (3) to evaluate the efficiency loss associated with the two-stage estimator Λ̂0 relative to the iterative Breslow-type estimator (9); (4) to assess the efficiency under the composite likelihood approach of Chatterjee et al. (2006) and Chen et al. (2009) in comparison to our full likelihood approach.

The gamma distribution is a popular choice for the frailty variate. The gamma frailty model with expectation 1 and variance θ has the following interesting properties. For two given failure times Ti1 and Ti2 from the same family i, let λi1(s|Ti2 > t) denote the conditional hazard of Ti1 given {Ti2 > t}, and let λi1(s|Ti2 = t) denote the conditional hazard of Ti1 given Ti2 evaluated at Ti2 = t. Define χ(s, t) = [λi1(s|Ti2 = t)]/[λi1(s|Ti2 > t)], the cross-ratio, as a local measure of association between survival times (Clayton, 1978). Under the gamma frailty model, χ(s, t) is constant over s and t and equals 1 + θ, so the parameter θ has a direct interpretation. In addition, the gamma frailty model can be re-expressed in terms of the Clayton-Oakes copula-type model (Clayton, 1978; Oakes, 1989). The gamma frailty model is also convenient mathematically, because it admits a closed-form representation of the marginal survival distributions. Hence we conducted our simulation study under the gamma frailty model, using, as is customary, the gamma distribution with expectation 1 and variance θ.
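The constant cross-ratio can be checked directly. The short derivation below is a sketch in notation introduced here for illustration: $H_1(s)$ and $H_2(t)$ denote the conditional cumulative hazards of the two family members (including their covariate effects), $h_1$ and $h_2$ their derivatives, and $S(s,t)$ the joint survival function obtained by integrating out the gamma frailty.

```latex
\[
S(s,t) = \{1 + \theta(H_1(s) + H_2(t))\}^{-1/\theta} \equiv A^{-1/\theta},
\qquad
\chi(s,t) = \frac{S \cdot \partial^2 S / \partial s\, \partial t}
                 {(\partial S/\partial s)(\partial S/\partial t)} .
\]
\[
\frac{\partial S}{\partial s} = -h_1(s)\, A^{-1/\theta - 1}, \quad
\frac{\partial S}{\partial t} = -h_2(t)\, A^{-1/\theta - 1}, \quad
\frac{\partial^2 S}{\partial s\, \partial t} = (1+\theta)\, h_1(s) h_2(t)\, A^{-1/\theta - 2},
\]
\[
\chi(s,t) = \frac{A^{-1/\theta}\,(1+\theta)\, h_1 h_2\, A^{-1/\theta - 2}}
                 {h_1 h_2\, A^{-2/\theta - 2}} = 1 + \theta .
\]
```

The expression for $\chi$ in terms of $S$ follows from $\lambda_{i1}(s \mid T_{i2} = t) = (\partial^2 S/\partial s\,\partial t)/(-\partial S/\partial t)$ and $\lambda_{i1}(s \mid T_{i2} > t) = -(\partial S/\partial s)/S$.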

As a risk factor, we used a binary covariate that mimics the carrier status of a dominant gene with high-risk allele frequency π. We considered two popular simulation scenarios (Chatterjee et al., 2006; Chen et al., 2009): (1) a rare genetic variant, π = 0.01, with a large genetic effect β = ln(5); and (2) a more common variant, π = 0.1, with a modest genetic effect β = ln(2). Genotypes of family members followed Mendelian inheritance. We assumed a Weibull conditional baseline hazard function, $\lambda_0(t) = \mu p (\mu t)^{p-1}$, with p = 4.6 and μ = 0.01, so that $\Lambda_0(t) = (\mu t)^p$. Potential censoring time was assumed to be normally distributed with mean 60 and standard deviation 15, leading to approximately 83% censoring among the relatives. Simulation results are based on 2000 simulated data sets. Results with one relative for each proband are based on 500 case-families and 500 control-families. For three relatives for each proband, 300 case-families and 300 control-families were used.
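The data-generating mechanism for one family can be sketched as below. This is an illustrative sketch only: it generates a pair of sisters from the population model (shared gamma frailty, Mendelian genotypes, Weibull baseline, normal censoring) and does not implement the case-control ascertainment of probands; the function names are hypothetical.

```python
import math
import random

MU, P = 0.01, 4.6  # Weibull baseline parameters given in the text

def cum_baseline(t):
    """Lambda_0(t) = (mu * t)**p, the integral of mu * p * (mu * t)**(p - 1)."""
    return (MU * t) ** P

def parent_genotype(rng, pi):
    """Two alleles drawn independently, each high-risk with probability pi."""
    return (int(rng.random() < pi), int(rng.random() < pi))

def child_genotype(rng, mother, father):
    """Mendelian transmission: one randomly chosen allele from each parent."""
    return (rng.choice(mother), rng.choice(father))

def simulate_sister_pair(rng, pi, beta, theta):
    """Return [(time, delta, G), ...] for two sisters sharing a gamma frailty."""
    omega = rng.gammavariate(1.0 / theta, theta)  # mean 1, variance theta
    mother, father = parent_genotype(rng, pi), parent_genotype(rng, pi)
    sisters = []
    for _ in range(2):
        g = int(sum(child_genotype(rng, mother, father)) >= 1)  # dominant model
        # Invert S(t | omega, g) = exp(-omega * exp(beta * g) * (mu * t)**p).
        u = 1.0 - rng.random()  # uniform on (0, 1]
        t_fail = (-math.log(u) / (omega * math.exp(beta * g))) ** (1.0 / P) / MU
        c = rng.gauss(60.0, 15.0)  # censoring time as described in the text
        sisters.append((min(t_fail, c), int(t_fail <= c), g))
    return sisters
```

As a consistency check, `cum_baseline(45)` and `cum_baseline(60)` agree with the true values 0.0254 and 0.0954 reported for Λ0(45) and Λ0(60) in the tables that follow.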

Table 1 presents the means and the empirical standard errors, based on one sister for each proband. It is evident that our proposed estimators perform very well in terms of bias in both scenarios.

Table 1.

Simulation results: mean and standard error

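| Parameter | Rare & High Penetrant: True value | Rare & High Penetrant: Mean | Rare & High Penetrant: Empirical SE | Common & Low Penetrant: True value | Common & Low Penetrant: Mean | Common & Low Penetrant: Empirical SE |
|---|---|---|---|---|---|---|
| β | 1.6095 | 1.7263 | 0.4308 | 0.6932 | 0.6806 | 0.2052 |
| θ | 2 | 2.0246 | 0.5325 | 2 | 2.0782 | 0.5204 |
| π | 0.01 | 0.0089 | 0.0021 | 0.1 | 0.1015 | 0.0099 |
| Λ0(45) | 0.0254 | 0.0253 | 0.0054 | 0.0254 | 0.0250 | 0.0052 |
| Λ0(60) | 0.0954 | 0.0948 | 0.0153 | 0.0954 | 0.0939 | 0.0155 |
| Λ0(75) | 0.2663 | 0.2633 | 0.0409 | 0.2663 | 0.2608 | 0.0426 |
| Λ0(80) | 0.3583 | 0.3537 | 0.0592 | 0.3583 | 0.3514 | 0.0612 |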

5.1 Exploring various bootstrap techniques

To construct confidence intervals, we used the bootstrap approach. In the standard nonparametric bootstrap we resampled with replacement from the observed data, where the sampling unit is a family. In the setting of censored survival data, this leads to a substantial proportion of tied survival times. To investigate, numerically, the effect of ties in the bootstrap sample on the coverage rate of the confidence intervals, we compared three bootstrap techniques: (1) the standard nonparametric bootstrap; (2) the nonparametric bootstrap with a naive correction for ties (that is, a very small value, sampled from a uniform distribution, was added to each observed time in the bootstrap sample, which ensures there are no ties); (3) the weighted bootstrap approach of Kosorok et al. (2004, Section 4). In each bootstrap sample we generated n independent weights from the unit exponential distribution. Let ξ1, …, ξn be the standardized weights. Then, in the estimating functions, for any given function h the empirical mean $n^{-1} \sum_{i=1}^{n} h(T_i, \delta_i, Z_i)$ is replaced by its corresponding weighted empirical mean $n^{-1} \sum_{i=1}^{n} \xi_i h(T_i, \delta_i, Z_i)$. Under the setting of right-censored univariate failure times, Kosorok et al. (2004) proved that the weighted bootstrap procedure is valid. In Table 2 the coverage rates of 95% Wald confidence intervals associated with each of the above three techniques are presented. The results are based on one sister for each proband. It is evident that the proposed estimators perform reasonably well in terms of coverage rates, and the three bootstrap methods provide similar results. Under the scenario where π = 0.01, the coverage rate for π falls below the nominal level. A similar phenomenon was also observed in Chen et al. (2009). Similar results were observed with a larger number of sisters for each proband, such as 3, hence those results are omitted.
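The weighting step of the third technique can be sketched as follows. The exact form of the standardization is not spelled out in the text; the sketch assumes the weights are divided by their sample mean so that they average to 1, and the function names are hypothetical.

```python
import random

def standardized_exp_weights(n, rng):
    """n unit-exponential weights rescaled to have sample mean 1.

    Dividing by the sample mean is an assumed form of standardization."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    mean_w = sum(w) / n
    return [wi / mean_w for wi in w]

def weighted_mean(h_values, xi):
    """n^{-1} * sum_i xi_i * h_i, the weighted analogue of the empirical mean."""
    return sum(x * h for x, h in zip(xi, h_values)) / len(h_values)
```

Because every family keeps its original observed times and only its weight changes, a weighted bootstrap sample contains no more ties than the original data, in contrast to resampling families with replacement.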

Table 2.

Simulation results: coverage rates under three Bootstrap methods

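| Parameter | Rare & High Penetrant: Standard bootstrap | Rare & High Penetrant: Corrected bootstrap | Rare & High Penetrant: Weighted bootstrap | Common & Low Penetrant: Standard bootstrap | Common & Low Penetrant: Corrected bootstrap | Common & Low Penetrant: Weighted bootstrap |
|---|---|---|---|---|---|---|
| β | 0.9140 | 0.9225 | 0.9185 | 0.9330 | 0.9330 | 0.9365 |
| θ | 0.9610 | 0.9525 | 0.9465 | 0.9635 | 0.9565 | 0.9630 |
| π | 0.8370 | 0.8630 | 0.8625 | 0.9450 | 0.9490 | 0.9415 |
| Λ0(45) | 0.9225 | 0.9255 | 0.9345 | 0.9205 | 0.9275 | 0.9230 |
| Λ0(60) | 0.9215 | 0.9285 | 0.9310 | 0.9270 | 0.9295 | 0.9260 |
| Λ0(75) | 0.9275 | 0.9355 | 0.9280 | 0.9305 | 0.9345 | 0.9350 |
| Λ0(80) | 0.9225 | 0.9455 | 0.9345 | 0.9265 | 0.9275 | 0.9280 |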

5.2 Efficiency loss under Λ̂0 versus Λ0

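In Table 3 we contrast the proposed estimation procedure based on the iterative estimator, obtained by summing the jumps in (9) over the failure times τg ≤ t, to the one based on the two-stage estimator, $\hat{\Lambda}_0(t) = \sum_{\tau_g \le t} \Delta\hat{\Lambda}_0(\tau_g)$.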

Table 3.

Simulation results: comparing Λ̂0 versus Λ0 under Common & Low Penetrant

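| Parameter | True value | Λ̂0 (two-stage): Mean | Λ̂0 (two-stage): Empirical SE | Λ0 (iterative): Mean | Λ0 (iterative): Empirical SE |
|---|---|---|---|---|---|
| β | 0.6932 | 0.6806 | 0.2052 | 0.6707 | 0.2031 |
| θ | 2 | 2.0782 | 0.5204 | 2.0572 | 0.5162 |
| π | 0.1 | 0.1015 | 0.0099 | 0.1021 | 0.0100 |
| Λ0(45) | 0.0254 | 0.0250 | 0.0052 | 0.0253 | 0.0050 |
| Λ0(60) | 0.0954 | 0.0939 | 0.0155 | 0.0950 | 0.0148 |
| Λ0(75) | 0.2663 | 0.2608 | 0.0426 | 0.2637 | 0.0416 |
| Λ0(80) | 0.3583 | 0.3514 | 0.0612 | 0.3535 | 0.0594 |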

Given (θ, β, π), the following steps are required for the iterative estimator Λ0:

  • Step 1. Use a standard survival procedure to obtain an initial estimate of Λ0 (e.g. the Breslow estimate) while ignoring dependency among relatives and the effect of the partially observed genotype.

  • Step 2. Based on (9), compute Λ0 sequentially at the observed failure times of the relatives. At each observed failure time of the relatives τg, g = 1, …, G, use the current values of Λ0 for each Λ0(Ti0) with Ti0 ≥ τg.

  • Step 3. Repeat Step 2 until convergence with respect to Λ0.

In contrast, the following are the two non-iterative steps required for the two-stage estimator Λ̂0 given (θ, β, π): (1) based on (10), compute Λ̃0 sequentially at each observed failure time of a relative whose proband’s observation time is smaller than the failure time of the relative; (2) based on (11), compute Λ̂0 sequentially at the observed failure times of all the relatives. At each observed failure time τg, g = 1, …, G, use Λ̃0 for each Λ0(Ti0) with Ti0 ≥ τg.

Results in Table 3 are based on one sister for each proband and under the common genetic variant with modest genetic effect. It is evident that the efficiency loss is negligible with respect to all the parameters. Similar results were observed under the case of rare genetic variant and large genetic effect. Hence, these results are omitted.

5.3 The composite likelihood versus the proposed full likelihood

In the composite likelihood (CL) of Chatterjee et al. (2006) and Chen et al. (2009) each family with mi > 1 relatives is broken into mi relative-proband doublets. Each doublet is assumed to be independent of the others, ignoring dependency among relatives. Chatterjee et al. (2006) applied it to the marginal setting where the joint survival distribution of a family is determined by a copula model. Hence, the CL approach has a computational advantage in two respects: (1) simplifying the multivariate survival distribution of dimension mi + 1 to a two-dimensional distribution; (2) simplifying the joint probability of the relatives’ genotypes given their proband’s genotype, P(GiR|Gi0), to each relative’s genotype given the proband’s genotype, P(Gij|Gi0), j = 1, …, mi. Both approaches, the CL and the full likelihood (FL), provide unbiased estimators, but the above computational advantages may come at a price, namely efficiency loss. In our random effects model, due to the conditional independence assumption of the survival times given the frailty variate, the only relevant advantage is of type (2) above.

Table 4 provides simulation results for the CL versus the FL under the gamma-frailty model where each proband has three sisters, based on 300 case-families and 300 control-families. The table includes the relative efficiency (RE), which is the ratio of the empirical variance under the FL to that under the CL. As expected, the greatest efficiency loss under the CL relative to the FL is in estimating the dependency parameter θ and the regression coefficient β, and this loss is substantial. As θ and the number of relatives for each proband decrease, the efficiency loss of the CL relative to the FL decreases as well. For example, Chatterjee et al. (2006) observed negligible efficiency loss when θ = 1 and each proband has only two relatives. For this specific case, under the random effects model, we observed similar results (not shown). However, since the efficiency loss is substantial for a larger value of θ or a larger number of relatives, we recommend using the FL.

Table 4.

Simulation results: composite likelihood versus the full likelihood

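| Scenario | Parameter | True value | Composite Likelihood: Mean | Composite Likelihood: Empirical SE | Full Likelihood: Mean | Full Likelihood: Empirical SE | RE |
|---|---|---|---|---|---|---|---|
| Rare & High Penetrant | β | 1.6095 | 1.5961 | 0.5467 | 1.4997 | 0.4840 | 0.78 |
| Rare & High Penetrant | θ | 2 | 2.1034 | 0.5268 | 1.8937 | 0.3748 | 0.51 |
| Rare & High Penetrant | π | 0.01 | 0.0102 | 0.0030 | 0.0105 | 0.0030 | 1.00 |
| Rare & High Penetrant | Λ0(45) | 0.0254 | 0.0245 | 0.0045 | 0.0274 | 0.0047 | 1.09 |
| Rare & High Penetrant | Λ0(60) | 0.0954 | 0.0920 | 0.0136 | 0.1027 | 0.0138 | 1.03 |
| Rare & High Penetrant | Λ0(75) | 0.2663 | 0.2562 | 0.0345 | 0.2620 | 0.0328 | 0.90 |
| Rare & High Penetrant | Λ0(80) | 0.3583 | 0.3454 | 0.0501 | 0.3404 | 0.0451 | 0.81 |
| Common & Low Penetrant | β | 0.6932 | 0.7056 | 0.2239 | 0.6770 | 0.2016 | 0.81 |
| Common & Low Penetrant | θ | 2 | 2.0940 | 0.4789 | 1.8919 | 0.3628 | 0.57 |
| Common & Low Penetrant | π | 0.1 | 0.1002 | 0.0105 | 0.0982 | 0.0095 | 0.82 |
| Common & Low Penetrant | Λ0(45) | 0.0254 | 0.0248 | 0.0046 | 0.0275 | 0.0050 | 1.18 |
| Common & Low Penetrant | Λ0(60) | 0.0954 | 0.0926 | 0.0146 | 0.1029 | 0.0157 | 1.16 |
| Common & Low Penetrant | Λ0(75) | 0.2663 | 0.2574 | 0.0375 | 0.2611 | 0.0367 | 0.96 |
| Common & Low Penetrant | Λ0(80) | 0.3583 | 0.3453 | 0.0512 | 0.3371 | 0.0480 | 0.88 |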
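The RE column can be reproduced from the empirical standard errors in the table, since the variance ratio is the squared ratio of standard errors. A minimal sketch (the function name is illustrative):

```python
def relative_efficiency(se_fl, se_cl):
    """Ratio of empirical variances, FL over CL, from the empirical SEs."""
    return (se_fl / se_cl) ** 2

# Rare & High Penetrant scenario, values taken from Table 4.
re_beta = relative_efficiency(0.4840, 0.5467)
re_theta = relative_efficiency(0.3748, 0.5268)
```

Values below 1 indicate that the FL estimator has the smaller variance; rounded to two decimals, `re_beta` and `re_theta` match the tabulated 0.78 and 0.51.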

6 Example

To illustrate the use of our proposed estimators in practice, we analyze data collected in a population-based case-control study, the National Institute of Child Health and Human Development’s Women’s Contraceptive and Reproductive Experiences (CARE) study (Marchbanks et al., 2002; Malone et al., 2006). Briefly, in the first phase white or black women with invasive breast cancer diagnosed at ages 35–64 during 1994–1998 were included as case probands. Control probands were ascertained through random-digit dialing and were frequency matched with case probands on study center, race and age group. Each subject was asked to enumerate all first-degree female blood relatives (mother, sisters, daughters) and the relatives’ birth years, vital status, death years, types of cancer, and lateralities (if breast cancer). Due to resource limitations, it was possible to collect blood from only 33% of the probands. Therefore, probands were stratified based on case-control status, study center, race, and age, and a random sample from each stratum was selected for genotyping of BRCA1/2 mutations. The sampling fraction (the probability of a proband being selected for genotyping) of each proband was assigned and all the strata have representative samples. No data or blood specimens were collected from relatives, and thus no information was available on their mutation status. Chen et al. (2009) used a subset of the data focusing on the BRCA1 gene and whites to illustrate their method for estimating the population-based penetrance function. To facilitate a direct comparison with their method, we use the same subset for illustration.

In this subset there are 1603 white probands, 1144 cases and 459 controls. Among the case probands, there are 42 BRCA1 carriers and only one among the control probands. A total of 4568 first-degree relatives (mothers, sisters and daughters) aged 18 or older with known breast cancer status were included. Each proband has at least one relative and the highest number of relatives is 11. Among the relatives, 634 are observed with breast cancer. Conditional carrier status probabilities of the relatives given the proband’s carrier status, for each observed family structure, are given in Appendix 1.

The required modification of our estimation procedure, to accommodate the two-phase design, was done by using the weighted likelihood approach of Chen et al. (2009) where each selected family is weighted by its inverse selection probability. Details are given in Appendix 2. In analyzing the data, the popular gamma frailty model was used with expectation 1 and variance θ.

As noted in Gorfine et al. (2009), a key assumption in our procedure for estimating Λ0 by the two-stage estimator is that the range of the probands' observational times must be the same as that of the relatives' observational times. However, in the CARE data the probands' observed times are restricted by design to ages 35–64, while the relatives' observed failure times are unrestricted and range over ages 19–92. Hence, we expect the two-stage estimator, Λ̂0, to underestimate the true Λ0. This can be easily corrected by first estimating Λ0(35) as described in Gorfine et al. (2009), which yields a three-stage estimator. Alternatively, one may use the iterative Breslow-type estimator Λ0, defined in (9).

Results of the CARE data analysis are presented in Table 5. We report point estimates and 95% confidence intervals, based on the weighted bootstrap approach of Kosorok et al. (2004) with 200 bootstrap samples, for β, θ, π and the marginal cumulative probability of developing breast cancer by age t given the carrier status G, FM(t|G = g), g = 0, 1, at ages t = 50, 60, 70, 80. Confidence intervals based on the other two bootstrap approaches were similar and are not presented here. Under both the two- and three-stage estimators, the risk of developing breast cancer for carriers is approximately 12-fold (≈ exp(β̂)) greater than that of non-carriers within the same family. In addition, substantial residual dependency remains even after adjusting for BRCA1, with an estimated cross ratio of 2.043 (95% CI: 1.617–2.470) under the two-stage estimator. Using the iterative Breslow-type estimator, Λ0, the regression coefficient estimate is slightly larger and the estimate of θ is slightly smaller. For comparison, we also include the results of Chen et al. (2009), where β was estimated based on the CL rather than the FL as in our method. Note that Chen et al. did not include the proband data in estimating β because of numerical difficulty in obtaining the estimates of β and π. By using the FL, we avoid this numerical problem, since the allele frequency parameter π enters the FL through the complete familial relationship (Appendix 1) and not just through the relative-proband pair relationship, as in the CL approach. Hence, while the point estimates are about the same, our confidence intervals are much tighter than those of Chen et al., indicating that the efficiency gained by using the FL rather than the CL is substantial. It is worth noting that the efficiency gain here appears even larger than the simulation results indicated. This is partly because Chen et al. could not include the proband data in the estimation of β. Using the FL allows us not only to use the correlation information among the relatives but also, in this example, to use the information from the proband data by disentangling the estimation of π and β.
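The two summary figures quoted in this paragraph can be reproduced from the two-stage estimates in Table 5. The sketch below is illustrative only: the function names are hypothetical and not part of the authors' C++ code, and it assumes the standard gamma-frailty identity that the cross ratio equals 1 + θ, which agrees with the reported value and interval.

```python
import math


def hazard_ratio(beta: float) -> float:
    """Within-family hazard ratio of carriers versus non-carriers, exp(beta)."""
    return math.exp(beta)


def gamma_cross_ratio(theta: float) -> float:
    """Cross ratio under a gamma frailty with mean 1 and variance theta.

    Assumption: the standard gamma-frailty identity, cross ratio = 1 + theta.
    """
    return 1.0 + theta


# Two-stage estimates taken from Table 5.
beta_hat, theta_hat = 2.543, 1.043
theta_ci = (0.617, 1.470)

print(round(hazard_ratio(beta_hat), 2))        # carriers vs non-carriers
print(round(gamma_cross_ratio(theta_hat), 3))  # estimated cross ratio
print(tuple(round(gamma_cross_ratio(t), 3) for t in theta_ci))  # its 95% CI
```

Transforming the endpoints of the interval for θ is valid here because 1 + θ is a monotone function of θ.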

Table 5.

Analysis of the CARE data

Entries are point estimates with 95% CIs in parentheses. The two-stage, three-stage and Breslow-type columns are the proposed methods.

Parameter       Two-stage                 Three-stage               Breslow-type              Chen et al.
β               2.543 (1.985, 3.101)      2.487 (2.131, 2.844)      2.719 (2.339, 3.099)      2.354 (1.374, 3.534)
θ               1.043 (0.617, 1.470)      0.978 (0.647, 1.308)      0.947 (0.550, 1.442)      0.948 (0.565, 1.444)
π               0.165% (0.068%, 0.262%)   0.169% (0.098%, 0.241%)   0.152% (0.096%, 0.202%)   0.182% (0.080%, 0.401%)

Non-carrier's cumulative marginal disease probability
FM(50|G = 0)    0.021 (0.017, 0.024)      0.021 (0.018, 0.037)      0.021 (0.017, 0.035)      0.021 (0.017, 0.027)
FM(60|G = 0)    0.040 (0.034, 0.045)      0.040 (0.034, 0.067)      0.040 (0.033, 0.067)      0.041 (0.034, 0.050)
FM(70|G = 0)    0.069 (0.059, 0.078)      0.070 (0.059, 0.115)      0.069 (0.058, 0.114)      0.072 (0.060, 0.090)
FM(80|G = 0)    0.096 (0.082, 0.111)      0.098 (0.085, 0.160)      0.097 (0.082, 0.156)      0.102 (0.084, 0.125)

Carrier's cumulative marginal disease probability
FM(50|G = 1)    0.211 (0.182, 0.239)      0.207 (0.183, 0.315)      0.247 (0.208, 0.354)      0.188 (0.082, 0.423)
FM(60|G = 1)    0.342 (0.306, 0.374)      0.338 (0.296, 0.466)      0.388 (0.342, 0.519)      0.313 (0.143, 0.612)
FM(70|G = 1)    0.481 (0.443, 0.514)      0.478 (0.433, 0.613)      0.531 (0.485, 0.660)      0.454 (0.227, 0.743)
FM(80|G = 1)    0.571 (0.527, 0.608)      0.570 (0.530, 0.700)      0.620 (0.575, 0.737)      0.549 (0.304, 0.814)

7 Discussion

In this work, we modified the method of Gorfine et al. (2009) for case-control family data with a general semi-parametric shared frailty model to handle the common and important situation of missing genetic information of relatives. The finite-sample properties of the new estimators were investigated by simulation under the popular gamma frailty model. The results indicate that our estimators have minimal bias and provide proper coverage rates, except in estimating the allele frequency when the gene is rare and the disease incidence is low. The main difference between our approach and the approaches of Chatterjee et al. (2006) and Chen et al. (2009) is the use of the FL instead of the CL. Simulation and data analysis show that the gain in efficiency from the FL approach can be substantial. We also compared, by simulation, three common bootstrap methods for constructing confidence intervals with censored data, and concluded that the three methods are comparable in terms of confidence interval coverage rates.

The proposed approach was presented here for the case of a dominant allelic effect, so that G is binary. However, the extension to G with a higher number of states is straightforward. For example, consider a biallelic locus, as in SNP data, where G takes 3 possible states AA, Aa and aa. Then, under Hardy-Weinberg equilibrium (HWE) we have Pr(G = AA) = π², Pr(G = Aa) = 2π(1 − π) and Pr(G = aa) = (1 − π)², where π is the allele frequency of A. Denote the genotypes AA, Aa and aa by 1, 2 and 3, respectively. The proposed estimation procedure can then be applied directly, with the summation in (4) and (6) taken over $g_R \in \{1, 2, 3\}^{m_i}$. If the HWE assumption fails, a similar procedure can still be used, though the frequency of each genotype must then be estimated individually.
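As a small illustration of the three-state extension, the sketch below computes the HWE genotype frequencies and enumerates the genotype configurations over which the summation in (4) and (6) would run. The function names are hypothetical, and brute-force enumeration is used only for clarity; it is not claimed to be how the authors' code evaluates the sum.

```python
from itertools import product


def hwe_genotype_probs(pi: float) -> dict:
    """Genotype frequencies under HWE, keyed by the codes 1 (AA), 2 (Aa), 3 (aa)."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return {1: pi ** 2, 2: 2.0 * pi * (1.0 - pi), 3: (1.0 - pi) ** 2}


def relative_genotype_configs(m_i: int):
    """All genotype vectors in {1, 2, 3}^m_i for a family with m_i relatives."""
    return product((1, 2, 3), repeat=m_i)


probs = hwe_genotype_probs(0.1)
configs = list(relative_genotype_configs(2))
print(probs)         # frequencies of AA, Aa and aa for pi = 0.1
print(len(configs))  # number of configurations for two relatives
```

The number of configurations grows as 3^m_i, so for the largest families (11 relatives in the CARE subset) a direct enumeration involves 177147 terms per family.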

We showed by simulation that the FL estimates could be substantially more efficient than the CL estimates under the popular gamma frailty model. Although the proposed method can be applied to any positive frailty distribution, the true frailty distribution is often unknown and one may choose the gamma frailty for mathematical convenience. Hence, it is of practical importance to compare the robustness of the two likelihoods, FL versus CL, under misspecification of the frailty distribution. Simulation studies indicate that in some cases the FL is more robust, whereas in other cases the CL is more robust. For example, consider the case with a log-normal true frailty distribution such that ω = exp(W), W ~ N(−1.5, 2.5) so that E(ω) = 1, Kendall's τ = 0.4, with the analysis performed under the gamma frailty model. Simulation results show that the CL tends to be more robust when β = 0.6932 and π = 0.1, a common but relatively moderate risk allele situation. The FL tends to be more robust when β = 1.6095 and π = 0.01, a rare allele with high risk situation (results not shown). In both situations, the biases are modest.

Regarding the selection of the frailty distribution, we note that any positive distribution can be used in applying our proposed approach, and that each imposes a different dependence structure on the data. Ideally, the choice of a particular frailty model should depend on the actual setting. However, selecting the proper distribution can be a challenging task, since the frailty variates are unobservable and the sampling scheme is retrospective. There are works (Glidden, 1999, and references therein) dealing with graphical and statistical procedures for choosing the frailty distribution under simple prospective sampling, but for the more complex retrospective setting considered here these procedures need to be modified. Nevertheless, the effect of frailty distribution misspecification on the marginal regression estimates and marginal hazard functions under an assumed gamma distribution was investigated by Hsu et al. (2007). Their simulation results show that the biases are generally 10% or lower, even when the true frailty distribution deviates substantially from the assumed gamma distribution. This suggests that the gamma frailty model can be a practical choice in real data analysis if the regression parameters and the marginal hazard functions are of primary interest and individual cluster members are exchangeable with respect to their dependencies.

In this work we use the shared frailty model, which assumes that all the unobserved risk factors (genetic as well as environmental factors) are similar for all family members. Frequently, this is an unrealistic assumption in practice. We may want to extend the proposed estimation technique to accommodate varying degrees of dependence among different types of relatives. This can be done in the spirit of Hsu and Gorfine (2006), and is a topic for future work.

As a result of the two major complications that our setup presents, namely the sampling method and the missing covariates, one might suspect that direct pseudo maximum likelihood estimation of (β, θ, π) might be computationally difficult or unstable. However, no such problems were encountered in our simulation study and the real data analysis.

There are additional practical issues to be considered in analyzing population-based case-control family data, including misreported family history, false negative carriership, and multiple phenotypes and tumor pathology. We will address these issues in a separate communication.

As to the asymptotic properties of the proposed estimators, the simulation studies reported in Section 5 suggest that the proposed semiparametric estimators are consistent. Based on Gorfine et al. (2009), we believe that our estimators are consistent and asymptotically normal under assumptions similar to those listed there. A rigorous proof, however, would require the technique presented in Gorfine et al. (2009) with the necessary modifications. This is a potential theme for future work.

The C++ code used in this work is available on the internet at http://ie.technion.ac.il/~gorfinm/tables.html.

Acknowledgments

This research is supported in part by a grant from the United States-Israel Binational Science Foundation and grants from the National Institutes of Health (R01 CA98858 and R01 AG14358). The authors thank all of the CARE Study Investigators and Staff for their leadership in designing and conducting the CARE Study, which is supported by the National Institute of Child Health and Human Development, with additional support from the National Cancer Institute, through contracts with Emory University (N01 HD3-3168), Fred Hutchinson Cancer Research Center (N01 HD2-3166), Karmanos Cancer Institute at Wayne State University (N01 HD 3-3174), University of Pennsylvania (N01 HD3-3176), University of Southern California (N01 HD 3-3175), and through an intra-agency agreement with the Centers for Disease Control and Prevention (Y01 HD7022).

Appendix 1

In what follows we present the probabilities of the relatives’ carrier status given the proband carrier status according to each of the 7 types of family structure observed in the CARE data.

The following are the probabilities of a daughter's carrier status, $G_D$, given her parents' carrier statuses, $G_F$ and $G_M$, based on the Mendelian inheritance law:

$$\Pr(G_D = 0 \mid G_F = 1, G_M = 1) = \frac{\pi^2 (1-\pi)^2}{\{1 - (1-\pi)^2\}^2} = \left(\frac{1-\pi}{2-\pi}\right)^2,$$

$$\Pr(G_D = 0 \mid G_F = 1, G_M = 0) = \Pr(G_D = 0 \mid G_F = 0, G_M = 1) = \frac{1-\pi}{2-\pi},$$

and $\Pr(G_D = 0 \mid G_F = 0, G_M = 0) = 1$. For simplicity of presentation, let $\Pr(g_D \mid g_F, g_M) = \Pr(G_D = g_D \mid G_F = g_F, G_M = g_M)$ and $\Pr(G_j = g_j) = \Pr(g_j)$ for j = 0, F, M, S. Based on the above conditional probabilities one can easily calculate the following probabilities.

  1. The relatives consist of the mother:
    $$\Pr(G_M = g_M \mid G_0 = g_0) = \frac{\Pr(g_M)}{\Pr(g_0)} \sum_{g_F \in \{0,1\}} \Pr(g_0 \mid g_F, g_M) \Pr(g_F).$$
  2. The relatives consist of m sisters:
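    $$\Pr(G_{S_1} = g_{S_1}, \ldots, G_{S_m} = g_{S_m} \mid G_0 = g_0) = \frac{1}{\Pr(g_0)} \sum_{g_F \in \{0,1\}} \sum_{g_M \in \{0,1\}} \{\Pr(1 \mid g_F, g_M)\}^{cs_1} \{\Pr(0 \mid g_F, g_M)\}^{ncs_1} \Pr(g_F) \Pr(g_M)$$

    where $cs_1 = \sum_{k=1}^{m} g_{S_k} + g_0$ and $ncs_1 = m + 1 - cs_1$.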

  3. The relatives consist of n daughters:
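    $$\Pr(G_{D_1} = g_{D_1}, \ldots, G_{D_n} = g_{D_n} \mid G_0 = g_0) = \sum_{g_F \in \{0,1\}} \{\Pr(1 \mid g_F, g_0)\}^{cs_2} \{\Pr(0 \mid g_F, g_0)\}^{ncs_2} \Pr(g_F)$$

    where $cs_2 = \sum_{k=1}^{n} g_{D_k}$ and $ncs_2 = n - cs_2$.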

  4. The relatives consist of m sisters and the mother:
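    $$\Pr(G_{S_1} = g_{S_1}, \ldots, G_{S_m} = g_{S_m}, G_M = g_M \mid G_0 = g_0) = \frac{\Pr(g_M)}{\Pr(g_0)} \sum_{g_F \in \{0,1\}} \{\Pr(1 \mid g_F, g_M)\}^{cs_1} \{\Pr(0 \mid g_F, g_M)\}^{ncs_1} \Pr(g_F).$$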
  5. The relatives consist of n daughters and the mother:
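    $$\Pr(G_{D_1} = g_{D_1}, \ldots, G_{D_n} = g_{D_n}, G_M = g_M \mid G_0 = g_0) = \frac{\Pr(g_M)}{\Pr(g_0)} \left[ \sum_{g_F \in \{0,1\}} \Pr(g_0 \mid g_F, g_M) \Pr(g_F) \right] \left[ \sum_{g_F \in \{0,1\}} \{\Pr(1 \mid g_F, g_0)\}^{cs_2} \{\Pr(0 \mid g_F, g_0)\}^{ncs_2} \Pr(g_F) \right].$$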
  6. The relatives consist of m sisters and n daughters:
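    $$\Pr(G_{D_1} = g_{D_1}, \ldots, G_{D_n} = g_{D_n}, G_{S_1} = g_{S_1}, \ldots, G_{S_m} = g_{S_m} \mid G_0 = g_0) = \left[ \sum_{g_F \in \{0,1\}} \{\Pr(1 \mid g_F, g_0)\}^{cs_2} \{\Pr(0 \mid g_F, g_0)\}^{ncs_2} \Pr(g_F) \right] \times \frac{1}{\Pr(g_0)} \left[ \sum_{g_F \in \{0,1\}} \sum_{g_M \in \{0,1\}} \{\Pr(1 \mid g_F, g_M)\}^{cs_1} \{\Pr(0 \mid g_F, g_M)\}^{ncs_1} \Pr(g_F) \Pr(g_M) \right].$$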
  7. The relatives consist of m sisters, n daughters and the mother:
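    $$\Pr(G_{D_1} = g_{D_1}, \ldots, G_{D_n} = g_{D_n}, G_{S_1} = g_{S_1}, \ldots, G_{S_m} = g_{S_m}, G_M = g_M \mid G_0 = g_0) = \left[ \sum_{g_F \in \{0,1\}} \{\Pr(1 \mid g_F, g_0)\}^{cs_2} \{\Pr(0 \mid g_F, g_0)\}^{ncs_2} \Pr(g_F) \right] \times \frac{\Pr(g_M)}{\Pr(g_0)} \left[ \sum_{g_F \in \{0,1\}} \{\Pr(1 \mid g_F, g_M)\}^{cs_1} \{\Pr(0 \mid g_F, g_M)\}^{ncs_1} \Pr(g_F) \right].$$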
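The Mendelian conditional probabilities and the simplest family structure (mother only) can be checked numerically by enumerating parental genotypes. The sketch below is a sanity check under stated assumptions, namely a dominant model, HWE and independent parental genotypes (random mating); all function names are hypothetical and the code is not taken from the authors' implementation.

```python
from itertools import product

# Probability that a parent with the given genotype transmits the non-risk allele a.
TRANSMIT_NON_RISK = {"AA": 0.0, "Aa": 0.5, "aa": 1.0}


def hwe(pi):
    """Genotype frequencies under HWE for risk-allele frequency 0 < pi < 1."""
    return {"AA": pi ** 2, "Aa": 2 * pi * (1 - pi), "aa": (1 - pi) ** 2}


def is_carrier(genotype):
    """Dominant model: carrier status is 1 unless the genotype is aa."""
    return 0 if genotype == "aa" else 1


def carrier_prob(g, pi):
    """Pr(G = g) for binary carrier status g."""
    non_carrier = (1 - pi) ** 2
    return non_carrier if g == 0 else 1 - non_carrier


def prob_child(g_child, g_f, g_m, pi):
    """Pr(child status = g_child | father status g_f, mother status g_m)."""
    freq = hwe(pi)
    num = den = 0.0
    for f, m in product(freq, repeat=2):
        if is_carrier(f) != g_f or is_carrier(m) != g_m:
            continue
        weight = freq[f] * freq[m]  # assumption: independent parental genotypes
        den += weight
        # The child is a non-carrier only if both parents transmit allele a.
        num += weight * TRANSMIT_NON_RISK[f] * TRANSMIT_NON_RISK[m]
    p_non_carrier = num / den
    return p_non_carrier if g_child == 0 else 1 - p_non_carrier


def prob_mother_given_proband(g_m, g_0, pi):
    """Family structure 1: Pr(G_M = g_m | G_0 = g_0)."""
    total = sum(prob_child(g_0, g_f, g_m, pi) * carrier_prob(g_f, pi)
                for g_f in (0, 1))
    return carrier_prob(g_m, pi) / carrier_prob(g_0, pi) * total


pi = 0.1
print(prob_child(0, 1, 0, pi), (1 - pi) / (2 - pi))  # the two values should agree
print(sum(prob_mother_given_proband(g, 1, pi) for g in (0, 1)))  # should be 1
```

The same `prob_child` table can be plugged into the other six family structures, since each is a product or sum of these terms.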

Appendix 2

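Let $p_i$ denote the selection probability of the ith proband, i = 1, …, n. Also, let $L_i^{(1)} = f(G_{i0} \mid T_{i0}, \delta_{i0}, X_{i0})$ and $L_i^{(2)} = f(T_{iR}, \delta_{iR} \mid X_{iR}, X_{i0}, G_{i0}, T_{i0}, \delta_{i0})$. Then, (β, θ, π) is estimated by maximizing

$$\prod_{i=1}^{n} \{L_i^{(1)}\}^{1/p_i} \{L_i^{(2)}\}^{1/p_i}$$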

after replacing the unknown baseline hazard function by its estimate. The following is the modified two-stage estimator of the cumulative baseline hazard function. The g-th jump of the first-stage estimator is given by

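$$\frac{\sum_{i=1}^{n} p_i^{-1} I(T_{i0} < \tau_g) \sum_{j=1}^{m_i} dN_{ij}(\tau_g)}{\sum_{i=1}^{n} p_i^{-1} \sum_{j=1}^{m_i} I(T_{i0} < \tau_g) Y_{ij}(\tau_g) \psi_{ij}(\beta, \theta, \pi, \Lambda_0, \tau_{g-1})},$$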

and the g-th jump of the second-stage estimator is given by

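$$\frac{\sum_{i=1}^{n} p_i^{-1} \sum_{j=1}^{m_i} dN_{ij}(\tau_g)}{\sum_{i=1}^{n} p_i^{-1} \sum_{j=1}^{m_i} Y_{ij}(\tau_g) \psi_{ij}(\beta, \theta, \pi, \tau_{g-1})}.$$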
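The inverse-probability weighting in this appendix can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the list-of-lists data layout are assumptions, and the quantities ψij are treated as values supplied by the caller, since their definition lies outside this appendix.

```python
import math


def weighted_log_pseudo_likelihood(p, L1, L2):
    """Log of prod_i {L_i^(1)}^(1/p_i) {L_i^(2)}^(1/p_i).

    p  : selection probabilities p_i, one per family
    L1 : proband genotype contributions L_i^(1)
    L2 : relatives' contributions L_i^(2)
    """
    return sum((math.log(l1) + math.log(l2)) / p_i
               for p_i, l1, l2 in zip(p, L1, L2))


def weighted_jump(p, dN, Y, psi, proband_before=None):
    """Weighted jump of the cumulative baseline hazard estimator at one time tau_g.

    dN[i][j]  : dN_ij(tau_g), 1 if relative j of family i fails at tau_g
    Y[i][j]   : Y_ij(tau_g), at-risk indicator
    psi[i][j] : psi_ij evaluated at tau_{g-1} (hypothetical input, computed elsewhere)
    proband_before[i] : I(T_i0 < tau_g); pass it for the first-stage jump,
                        leave it as None for the second-stage jump
    """
    if proband_before is None:
        proband_before = [True] * len(p)
    num = den = 0.0
    for p_i, dN_i, Y_i, psi_i, keep in zip(p, dN, Y, psi, proband_before):
        if not keep:
            continue  # family excluded by the indicator I(T_i0 < tau_g)
        w = 1.0 / p_i
        num += w * sum(dN_i)
        den += w * sum(y * s for y, s in zip(Y_i, psi_i))
    return num / den


# Toy input: two families, selected with probabilities 0.5 and 1.0.
p = [0.5, 1.0]
dN = [[1, 0], [0]]
Y = [[1, 1], [1]]
psi = [[2.0, 1.0], [3.0]]
print(weighted_jump(p, dN, Y, psi))                     # second-stage form
print(weighted_jump(p, dN, Y, psi, [False, True]))      # first-stage form
```

With all selection probabilities equal to one, the weights drop out and the function reduces to the unweighted form of the jump.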

Contributor Information

Anna Graber-Naidich, Email: agraber@tx.technion.ac.il, Faculty of Industrial Engineering and Management, Technion City, Haifa 32000, Israel.

Malka Gorfine, Email: gorfinm@ie.technion.ac.il, Faculty of Industrial Engineering and Management, Technion City, Haifa 32000, Israel.

Kathleen E. Malone, Email: kmalone@fhcrc.org, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109-1024, USA

Li Hsu, Email: lih@fhcrc.org, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA 98109-1024, USA.

Bibliography

  1. Becher H, Schmidt S, Chang-Claude J. Reproductive factors and familial predisposition for breast cancer by age 50 years. A case-control-family study for assessing main effects and possible gene-environment interaction. International Journal of Epidemiology. 2003;32:38–48. doi: 10.1093/ije/dyg003.
  2. Chatterjee N, Kalaylioglu Z, Shih JH, Gail MH. Case-control and case-only designs with genotype and family history data: Estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics. 2006;62:36–48. doi: 10.1111/j.1541-0420.2005.00442.x.
  3. Chen L, Hsu L, Malone K. A frailty-model based approach to estimating the age dependent penetrance function of candidate genes using population based case-control study designs: An application to data on BRCA1 gene. Biometrics. 2009;65:1105–1114. doi: 10.1111/j.1541-0420.2008.01184.x.
  4. Clayton DG. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika. 1978;65:141–151.
  5. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society B. 1972;34:187–220.
  6. Duchateau L, Janssen P. The Frailty Model. New York: Springer; 2008.
  7. Fine JP, Glidden DV, Lee KE. A simple estimator for a shared frailty regression model. Journal of the Royal Statistical Society. 2003;65:317–329.
  8. Genest C, MacKay J. The joy of copulas: Bivariate distributions with given marginals. American Statistician. 1986;40:280–283.
  9. Gill RD. Discussion of the paper by D. Clayton and J. Cuzick. Journal of the Royal Statistical Society A. 1985;148:108–109.
  10. Gill RD. Non- and semi-parametric maximum likelihood estimators and the von Mises method (Part 1). Scandinavian Journal of Statistics. 1989;16:97–128.
  11. Glidden DV. Checking the adequacy of the gamma frailty model for multivariate failure times. Biometrika. 1999;86:381–393.
  12. Gorfine M, Zucker DM, Hsu L. Case-control survival analysis with a general semiparametric shared frailty model - a pseudo full likelihood approach. Annals of Statistics. 2009;37:1489–1517. doi: 10.1901/jaba.2009.37-1489.
  13. Henderson R, Oman P. Effect of frailty on marginal regression estimates in survival analysis. Journal of the Royal Statistical Society B. 1999;61:367–379.
  14. Hopper JL. Commentary: Case-control-family design: a paradigm for future epidemiology research? International Journal of Epidemiology. 2003;32:48–50. doi: 10.1093/ije/dyg114.
  15. Hougaard P. Survival models for heterogeneous populations derived from stable distributions. Biometrika. 1986;73:387–396.
  16. Hougaard P. Analysis of Multivariate Survival Data. New York: Springer; 2000.
  17. Hsu L, Chen L, Gorfine M, Malone K. Semiparametric estimation of marginal hazard function from case-control family studies. Biometrics. 2004;60:936–944. doi: 10.1111/j.0006-341X.2004.00249.x.
  18. Hsu L, Gorfine M. Multivariate survival analysis for case-control family data. Biostatistics. 2006;7:387–398. doi: 10.1093/biostatistics/kxj014.
  19. Hsu L, Gorfine M, Malone K. On robustness of marginal regression coefficient estimates and hazard functions in multivariate survival analysis of family data when the frailty distribution is misspecified. Statistics in Medicine. 2007;26:4657–4678. doi: 10.1002/sim.2870.
  20. Klein JP. Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics. 1992;48:795–806.
  21. Kosorok MR, Lee BL, Fine JP. Robust inference for univariate proportional hazards frailty regression models. Annals of Statistics. 2004;32:1448–1491.
  22. Malone KE, Daling JR, Neal C, Suter NM, O'Brien C, Cushing-Haugen K, Jonasdottir TJ, Thompson JD, Ostrander EA. Frequency of BRCA1/BRCA2 mutations in a population-based sample of young breast carcinoma cases. Cancer. 2000;88:1393–1402. doi: 10.1002/(sici)1097-0142(20000315)88:6<1393::aid-cncr17>3.0.co;2-p.
  23. Malone KE, Daling JR, Doody DR, Hsu L, Bernstein L, Coates RJ, Marchbanks PA, Simon MS, McDonald JA, Norman SA, Strom BL, Burkman RT, Ursin G, Deapen D, Weiss LK, Folger S, Madeoy JJ, Friedrichsen DM, Suter NM, Humphrey MC, Spirtas R, Ostrander EA. Prevalence and predictors of BRCA1 and BRCA2 mutations in a population-based study of breast cancer in white and black American women ages 35 to 64 years. Cancer Research. 2006;66:8297–8308. doi: 10.1158/0008-5472.CAN-06-0503.
  24. Marchbanks PA, et al. The NICHD women's contraceptive and reproductive experiences study: methods and results. Annals of Epidemiology. 2002;26:213–221. doi: 10.1016/s1047-2797(01)00274-5.
  25. Marshall AW, Olkin I. Families of multivariate distributions. Journal of the American Statistical Association. 1988;83:834–841.
  26. McGilchrist CA. REML estimation for survival models with frailty. Biometrics. 1993;49:221–225.
  27. Nielsen GG, Gill RD, Andersen PK, Sørensen TIA. A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics. 1992;19:25–43.
  28. Oakes D. Bivariate survival models induced by frailties. Journal of the American Statistical Association. 1989;84:487–493.
  29. Ripatti S, Palmgren J. Estimation of multivariate frailty models using penalized partial likelihood. Biometrics. 2000;56:1016–1022. doi: 10.1111/j.0006-341x.2000.01016.x.
  30. Shih JH, Chatterjee N. Analysis of survival data from case-control family studies. Biometrics. 2002;58:502–509. doi: 10.1111/j.0006-341x.2002.00502.x.
  31. Shih JH, Louis TA. Inferences on the association parameter in copula models for bivariate survival data. Biometrics. 1995;51:1384–1399.
  32. Vaida F, Xu RH. Proportional hazards model with random effects. Statistics in Medicine. 2000;19:3309–3324. doi: 10.1002/1097-0258(20001230)19:24<3309::aid-sim825>3.0.co;2-9.
  33. Zeger SL, Liang KY, Albert PS. Models for Longitudinal Data: A Generalized Estimating Equation Approach. Biometrics. 1988;44:1049–1060.
  34. Zhao LP, Hsu L, Holte S, Chen Y, Quiaoit F, Prentice RL. Combined association and aggregation analysis of data from case-control family studies. Biometrika. 1998;85:299–315.
