Improved Horvitz-Thompson Estimation of Model Parameters from Two-phase Stratified Samples: Applications in Epidemiology

Norman E Breslow; Thomas Lumley; Christie M Ballantyne; Lloyd E Chambless; Michal Kulich

doi:10.1007/s12561-009-9001-6

. Author manuscript; available in PMC: 2010 May 1.

Published in final edited form as: Stat Biosci. 2009 May 1;1(1):32. doi: 10.1007/s12561-009-9001-6

Improved Horvitz-Thompson Estimation of Model Parameters from Two-phase Stratified Samples: Applications in Epidemiology

Norman E Breslow ¹, Thomas Lumley ², Christie M Ballantyne ³, Lloyd E Chambless ⁴, Michal Kulich ⁵

PMCID: PMC2822363 NIHMSID: NIHMS137324 PMID: 20174455

Abstract

The case-cohort study involves two-phase sampling: simple random sampling from an infinite super-population at phase one and stratified random sampling from a finite cohort at phase two. Standard analyses of case-cohort data involve solution of inverse probability weighted (IPW) estimating equations, with weights determined by the known phase two sampling fractions. The variance of parameter estimates in (semi)parametric models, including the Cox model, is the sum of two terms: (i) the model based variance of the usual estimates that would be calculated if full data were available for the entire cohort; and (ii) the design based variance from IPW estimation of the unknown cohort total of the efficient influence function (IF) contributions. This second variance component may be reduced by adjusting the sampling weights, either by calibration to known cohort totals of auxiliary variables correlated with the IF contributions or by their estimation using these same auxiliary variables. Both adjustment methods are implemented in the R survey package. We derive the limit laws of coefficients estimated using adjusted weights. The asymptotic results suggest practical methods for construction of auxiliary variables that are evaluated by simulation of case-cohort samples from the National Wilms Tumor Study and by log-linear modeling of case-cohort data from the Atherosclerosis Risk in Communities Study. Although not semiparametric efficient, estimators based on adjusted weights may come close to achieving full efficiency within the class of augmented IPW estimators.

Keywords: Calibration, Case-cohort, Estimation, Log-linear model, Semiparametric

1 Introduction

Two phase stratified sampling designs were proposed by Neyman (1938) for estimation of the finite population mean of a target variable that was difficult to measure. The average amount of money spent on food by each family residing in a given district was mentioned as a possible target of inference. At the first phase, a large sample is drawn from the population. Information on an auxiliary variable, easier to measure but correlated with the target variable, is collected and used to stratify the sample. At phase two, random sub-samples are drawn without replacement from each stratum for measurement of the target variable. The technique is widely used in survey sampling to reduce costs.

Two phase designs have also been proposed for use in epidemiology. They are particularly valuable when a large cohort (the phase one sample) is under surveillance for a disease event of interest and sampling from the cohort is required to obtain information on additional covariates. White (1982) proposed stratifying the second phase of sampling on both disease status and exposure when more information was needed on confounding factors. She argued that, when both disease and exposure were rare, this was more efficient than the standard case-control design that stratified on disease status alone. Borgan et al. (2000) considered exposure stratified versions of the case-cohort study (Prentice, 1986).

In these epidemiologic applications, the target of inference is not a finite population mean but rather parameters in a probability model – an infinite “superpopulation” from which the cohort is regarded as constituting a phase one random sample. For the stratified case-control study the parameters of interest are odds ratios, exponentiated coefficients in a logistic regression model. For the stratified case-cohort study they are hazard ratios, exponentiated coefficients in the Cox (1972) proportional hazards model, and often also the baseline hazard function.

Current standard practice for estimation of regression coefficients in the Cox model is solution of a Horvitz and Thompson (1952) inverse probability weighted (IPW) version of the Cox (1975) partial likelihood equations (Barlow, 1994; Barlow et al., 1999; Borgan et al., 2000). Survey statisticians advocate this approach on grounds that when the model is misspecified, as is generally the case, it consistently estimates the parameters that would be estimated by fitting the “wrong” model to the cohort were complete data available fot it (Binder, 1992). The Horvitz-Thompson approach is known to be inefficient, however, sometimes seriously so (Robins et al., 1994). One reason is that it often ignores much of the information available for the cohort. The Atherosclerosis Risk in Communities (ARIC) investigators, for example, conduct numerous stratified case-control and case-cohort studies nested in their cohort of 15,972 subjects sampled from four U.S. communities (The ARIC Investigators 1989). Information on standard risk factors for cardiovascular disease is available for nearly the entire cohort from interviews, bioassays and imaging studies conducted at baseline. Additional information on candidate genes or biomarkers is collected for a phase two cohort random sample (CRS), also known as a sub-cohort, that is stratified on demographic factors and sometimes also on carotid wall thickness. Even when enriched by disease cases that occur outside the CRS, the phase two sample typically contains no more than 10–15% of the cohort. ARIC analyses have ignored information on adjustment factors available for the great majority of potential controls.

Recently improved communication between biostatisticians and survey methodologists has led to a better understanding of the two phase sampling designs used in epidemiology and to improved methods of analysis. Statistical efficiency may be enhanced by adjustment of the standard Horvitz-Thompson sampling weights, either by calibrating the weights to cohort totals of auxiliary variables (Deville and Särndal, 1992) or by using these variables to estimate the weights (Robins et al., 1994). We describe some theory and methods for calibration and estimation of weights when fitting semiparametric models that apply a fortiori to the fitting of parametric models. The resulting improvements in precision are illustrated by fitting a log-linear model to data from an ARIC study of coronary heart disease (Ballantyne et al., 2004). A companion paper for epidemiologists (Breslow et al., 2009) investigates the corresponding gains in efficiency of hazard ratios estimated from the ARIC data. By simulation of case-cohort samples from the National Wilms Tumor Study (NWTS) (D’Angio et al., 1989; Green et al., 1998), we also illustrate the problems that may occur when one attempts to use too many auxiliary variables for calibration or estimation of the weights.

A proposal is made for construction of auxiliary variables with the goal of achieving approximate optimality within the class of augmented IPW estimators (Wang and Chen, 2001). We do not consider semiparametric efficient methods of estimation that make greater use of the assumptions of the fitted model, usually at the price of introducing bias if the model does not hold. Such methods are reasonably well developed for logistic regression analysis of stratified case-control data when the covariates available for the entire cohort are all discrete (Scott and Wild, 1997; Breslow and Holubkov, 1997). Semiparametric efficient methods for estimation of hazard ratios from stratified case-cohort samples, which pose greater computational problems, are currently under development; see, for example, Nan (2004); Scheike and Martinussen (2004); Zeng and Lin (2007).

2 Properties of Horvitz-Thompson Estimators

This section reviews results of Breslow and Wellner (2007). Suppose N subjects are sampled at random from an infinite population and indexed by i = 1, …, N. These subjects constitute the (main) cohort following the first phase of sampling. Denote by V ∈ Inline graphic a vector of random variables that is observed for all cohort members. Suppose is partitioned = V₁ ∪···∪ and the cohort is divided correspondingly into J strata, with the i^th subject in stratum j if V_i ∈ . Let N_j denote the number of subjects in the j^th stratum, j = 1, …, J, so N = N₁ + … + N_J. For the epidemiologic designs, one stratum typically consists of the disease cases while the remaining strata contain the controls. At the second phase of sampling n_j ≤ N_j subjects are sampled at random without replacement from the j^th stratum, with sampling for different strata conducted independently. Additional variables are observed for the n = n₁ + … + n_J subjects sampled at phase two. Let W ∈ Inline graphic denote the vector of random variables that are potentially available for the cohort, namely, V plus the additional variables known only for the n subjects sampled at phase two. We denote by Σ_N = σ[W₁, …, W_N] the sigma field of information potentially available for everyone. An important aspect of this formulation for the case-cohort study is that it accomodates random sampling of both cases and controls. Often cases (and controls) are absent from the phase two sample because some data are missing by chance rather than by design, for example, because of uninformative biological samples (Mark and Katki, 2006). The methods described herein still apply provided that such absence is random within defined strata.

Let ξ_i be a binary indicator of whether (ξ_i = 1) or not (ξ_i = 0) the i^th subject is sampled at phase two and let π_i = Pr(ξ_i = 1) be the probability of such sampling. Thus, if j(i) denotes the stratum of the i^th subject, π_i = n_j₍_i₎/N_j₍_i₎. Furthermore, for any pair i ≠ i′,

π_{i, i^{'}} \overset{def}{=} Pr (ξ_{i} = ξ_{i^{'}} = 1) = {\begin{array}{l} π_{i} π_{i^{'}} = \frac{n_{j (i)}}{N_{j (i)}} \cdot \frac{n_{j (i^{'})}}{N_{j (i^{'})}} if j (i) \neq j (i^{'}) \\ \frac{n_{j (i)}}{N_{j (i)}} \cdot \frac{n_{j (i)} - 1}{N_{j (i)} - 1} if j (i) = j (i^{'}) \end{array} .

(1)

The goal of the investigation is to make inferences about parameters in a (superpopulation) model for a random variable X = X(W) ∈ Inline graphic which is completely observed for subjects in the CRS but in general only partially observed for those in the cohort. The model may be parametric or semiparametric. In the latter case it is specified by probability distributions P_θ,η for X that are indexed by a Euclidean parameter θ ∈ Θ ⊂ ℝ^p and an infinite dimensional parameter η ∈ Ξ. The paradigm is the Cox model, where θ denotes the vector of regression coefficients and η the baseline cumulative hazard function. We denote by P₀ = P_θ_₀,_η_₀ the “true” model and use operator notation Pf = E_P f (X) for expectations.

Let ℓ̇_θ,η denote the usual parametric likelihood score for θ and B_θ,η denote the score operator (Begun et al., 1983) that maps one dimensional submodels for η, indexed by directions h ∈ ℋ from which η’s in the submodel approach η₀, into the corresponding score functions. If X was known for all N cohort subjects, parameters could be estimated by solving likelihood equations (van der Vaart, 1998, Sect. 25.12)

P_{N} {\dot{ℓ}}_{θ, η} = 0

(2)

P_{N} B_{θ, η} h = 0 \forall h \in ℋ

(3)

for estimators (θ̃_N, η̃_N), where ℙ_N denotes empirical measure based on X₁, …, X_N. Assumptions are made at least sufficient to guarantee that $\sqrt{N} ({\tilde{θ}}_{N} - θ_{0}, {\tilde{η}}_{N} - η_{0})$ is asymptotically Gaussian. For two phase data, ℙ_N in (2) and (3) is replaced by IPW empirical measure defined by

P_{N}^{π} = \frac{1}{N} \sum_{i = 1}^{N} \frac{ξ_{i}}{π_{i}} δ_{X_{i}},

(4)

where δ_X_{_i} is Dirac measure that puts unit mass on X_i, and the solution is denoted by (θ̂_N, η̂_N). When applied with the Cox model this leads to an IPW version of the so called Breslow (1974) estimator of the cumulative hazard function and to the same IPW version of the partial-likelihood equations that has been proposed by Binder (1992) and Lin (2000), among others.

Denote the efficient information by ${\tilde{I}}_{0} = P_{0} [(I - B_{0} {(B_{0}^{*} B_{0})}^{- 1} B_{0}^{*}) {\dot{ℓ}}_{0} {\dot{ℓ}}_{0}^{T}]$ and the efficient influence function by ${\tilde{ℓ}}_{0} = {\tilde{I}}_{0}^{- 1} (I - B_{0} {(B_{0}^{*} B_{0})}^{- 1} B_{0}^{*}) {\dot{ℓ}}_{0}$ , where $B_{0}^{*}$ is the adjunct of B₀. Invertibility of the information operator $B_{0}^{*} B_{0}$ is implicit in the assumption that η is estimable at a $\sqrt{N}$ rate. The principal result of Breslow and Wellner (2007, p. 94) is

\begin{array}{l} \sqrt{N} ({\hat{θ}}_{N} - θ_{0}) = \sqrt{N} ({\tilde{θ}}_{N} - θ_{0}) + \sqrt{N} ({\hat{θ}}_{N} - {\tilde{θ}}_{N}) \\ = \sqrt{N} P_{N} {\tilde{ℓ}}_{0} + \sqrt{N} (P_{N}^{π} - P_{N}) {\tilde{ℓ}}_{0} + o_{p} (1) . \end{array}

(5)

The first term on the right hand side of (5) represents the usual asymptotic expansion for the unobservable estimator θ̃_N. It converges in distribution to Inline graphic ℓ̃₀, where is the P₀-Brownian bridge (van der Vaart and Wellner, 1996, Sect. 2.1). Conditionally on Σ_N, and hence considered as a random function only of the sampling indicators ξ_i, the second term converges similarly by virtue of the weak convergence

\sqrt{N} (P_{N}^{π} - P_{N}) ⇝ \sum_{j = 1}^{J} \sqrt{ν_{j}} \sqrt{\frac{1 - p_{j}}{p_{j}}} G_{j} in ℓ^{\infty} (F) .

(6)

Here ℱ is a (Donsker) class of functions that contains the likelihood scores for (θ, η) in a neighborhood of (θ₀, η₀); ν_j = lim N_j/N is the fraction of the population in stratum j; p_j = lim n_j/N_j is the limiting samping fraction; and Inline graphic denotes the Brownian bridge process restricted to stratum j, i.e., based on the distribution P_0|_j(·) = E(·|V ∈ ). Furthermore, the processes {, , …, } are mutually independent.

The asymptotic variance of θ̂_N is thus the sum of two components, one corresponding to each phase of sampling. The first component is the usual variability of an estimator based on random sampling from an infinite population (model) assuming no missing data. It is not amenable to modification using methods described here. The second component, which is amenable to improvement, represents the additional variability stemming from the fact that some components of X are observable only at phase two. It is the normalized design based variance of the standard Horvitz-Thompson estimator of an unknown finite population total, namely, the total of the efficient influence function (IF) contributions for all N phase one subjects. Similar results have appeared in the sample survey literature for estimation of Euclidean parameters by solving estimating equations that contain no nuisance functions (Rao et al., 2002; Rubin-Bleuer and Kratina, 2005). Equations (5) and (6) provide the extension to semiparametric models under two-phase stratified sampling, with the semiparametric efficient influence function ℓ̃₀ playing the role of the ordinary parametric IF. For phase two subjects the IF contributions ℓ̃₀(x_i) may be approximated with negligible error from the observed x_i by dfbeta’s (Therneau and Grambsch, 2000, p. 155) defined by

dfbet a_{i} \overset{def}{=} {\tilde{ℓ}}_{({\hat{θ}}_{N}, {\hat{η}}_{N})} (x_{i}) .

(7)

Explicit formulae are available for the Cox model (Cain and Lange, 1984; van der Vaart, 1998, Sect. 25.12.1). Values may be obtained from standard statistical packages as a type of residual following a model fit. In R, for example, they are obtained with the command db<-resid(model.fit, type=’dfbeta’).

3 Calibration and Estimation of the Weights

Survey statisticians are adept at improving estimates of finite population (here, cohort or phase one) totals when auxiliary variables, closely correlated with the target variable, are available for the entire population. We consider ℓ̃₀(x_i) to be the vector of target variables, denote by z_i = z(v_i) a q-vector of auxiliary variables and set $z_{tot} = \sum_{i = 1}^{N} z_{i}$ . Some suggestions for choice of the z_i are given later. The idea behind calibration is to modify the design weights $d_{i} = π_{i}^{- 1}$ to new weights w_i = g_id_i such that the w_i and d_i are as close as possible yet the phase one totals z_tot of the auxiliary variables are exactly estimated. If G(w, d) denotes a distance measure, the problem is thus to minimize $\sum_{i = 1}^{N} ξ_{i} G (w_{i}, d_{i})$ subject to the constraints

\hat{z_{tot}} \overset{def}{=} \sum_{i = 1}^{N} ξ_{i} w_{i} z_{i} = z_{tot}

(8)

known as the calibration equations. Here we consider G(w, d) = (w−d)²/2d and G(w, d) = w log(w/d)−w+d, the Poisson deviance. See Deville and Särndal (1992) for other possibilities.

Let λ be a q-vector of Lagrange multipliers corresponding to the constraints (8). For G(w, d) = (w−d)²/2d, solution of the constrained minimization problem by standard calculus yields g_i = 1 − λ^Tz_i which, when substituted into (8), leads to an explicit solution for λ:

{\hat{λ}}_{N} = {(\sum_{i = 1}^{N} ξ_{i} d_{i} z_{i} z_{i}^{T})}^{- 1} (\sum_{i = 1}^{N} ξ_{i} d_{i} z_{i} - z_{tot}) .

(9)

The estimator obtained with the resulting weights is known as the generalized regression or GREG estimator (Särndal et al., 1989). When used to estimate a finite population total $Y = \sum_{i = 1}^{N} y_{i}$ via

\begin{array}{l} {\hat{Y}}_{GREG} = \sum_{i = 1}^{N} ξ_{i} \frac{g_{i}}{π_{i}} y_{i} = \sum_{i = 1}^{N} z_{i}^{T} \hat{β} + \sum_{i = 1}^{N} \frac{ξ_{i}}{π_{i}} (y_{i} - z_{i}^{T} \hat{β}), where \\ \hat{β} = {(\sum_{i = 1}^{N} \frac{ξ_{i}}{π_{i}} z_{i} z_{i}^{T})}^{- 1} (\sum \frac{ξ_{i}}{π_{i}} z_{i} y_{i}), \end{array}

the GREG estimator is the finite population sum of the fitted values plus the Horvitz-Thompson estimator of the sum of the residuals.

When G(w, d) is the Poisson deviance, solution of the minimization problem gives g_i = exp(−λ^Tz_i) and the calibration equations, now solved iteratively for λ, become

\sum_{i = 1}^{N} ξ_{i} exp (- λ^{T} z_{i}) d_{i} z_{i} = z_{tot} .

Under standard regularity assumptions (Isaki and Fuller, 1982) applicable to design based inference, Deville and Särndal (1992, p. 379) show that the solution generally satisfies

{\hat{λ}}_{N} = {\hat{B}}_{N}^{- 1} (\sum_{i = 1}^{N} ξ_{i} d_{i} z_{i} - z_{tot}) + O_{p} (n^{- 1}), where

(10)

\frac{1}{N} {\hat{B}}_{N} = \frac{1}{N} (\sum_{i = 1}^{N} ξ_{i} d_{i} z_{i} z_{i}^{T}) \overset{p}{\to} P_{0} (Z Z^{T}) .

(11)

When $g_{i} = exp (- {\hat{λ}}_{n}^{T} z_{i})$ the calibrated weights are always positive (with GREG they may not be) and the resulting estimator is known as the (generalized) raking estimator. Since under mild regularity conditions (see Sect. 7, Appendix) ${\hat{λ}}_{N} = O_{p} (\frac{1}{\sqrt{N}})$ , the g_i converge to 1 and hence the estimated w_i converge to the design weights d_i.

The asymptotic properties of the estimator θ̂_N (λ̂_N) based on calibrated weights may be derived from results in Breslow and Wellner (2008) by letting the Lagrange multipliers λ play the role of “nuisance” parameters α. Regardless of which distance function is used for calibration (Deville and Särndal, 1992), one finds

\sqrt{N} ({\hat{θ}}_{N} ({\hat{λ}}_{N}) - θ_{0}) ⇝ G {\tilde{ℓ}}_{o} + \sum_{j = 1}^{J} \sqrt{ν_{j}} \sqrt{\frac{1 - p_{j}}{p_{j}}} G_{j} ({\tilde{ℓ}}_{0} - Q Z) .

(12)

Here $Q Z = P_{0} ({\tilde{ℓ}}_{0} Z^{T}) P_{0}^{- 1} (Z Z^{T}) Z$ is the least squares projection of each component of ℓ̃₀ onto the linear subspace of L₂(P₀) spanned by components of Z. Hence the effect of calibration on the (asymptotic) phase two component of variance is to replace ℓ̃₀ by the residual after its (population) least squares regression on Z. The Appendix contains a brief derivation of (12).

The asymptotic variance of the calibration estimator could be further reduced if we replaced Q in the j^th summand of (12) by $Q_{j} = P_{0 ∣ j} ({\tilde{ℓ}}_{0} Z^{T}) P_{0 ∣ j}^{- 1} (Z Z^{T})$ , the projection with respect to the conditional distribution of the data in stratum. This is because, with ${Var}_{j} f = P_{0 ∣ j} f^{\otimes 2} - P_{0 ∣ j}^{\otimes 2} f$ , Q_jZ minimizes Var_j (ℓ̃₀ − AZ) among all linear functions of Z. Consequently, instead of calibrating to the overall total, one might consider calibrating to the subtotals of z within each stratum. This is accomplished by defining a new q × J vector of calibration variables by ${\tilde{Z}}^{T} = [Z_{1}^{T} Z_{2}^{T} \dots Z_{J}^{T}]$ where Z_j = 1(V ∈ Inline graphic )Z and setting $\tilde{Q} = P_{0} ({\tilde{ℓ}}_{0} {\tilde{Z}}^{T}) P_{0}^{- 1} (\tilde{Z} {\tilde{Z}}^{T})$ . Then $\tilde{Q} \tilde{Z} = \sum_{j = 1}^{J} 1 (V \in V_{j}) Q_{j} Z$ and (ℓ̃ − Q̃Z̃) = (ℓ̃ − Q_jZ) as desired. However, since this may greatly increase the number of calibration variables, the increased variability in the weights may result in an estimator with increased rather than reduced finite sample variance in all but the largest samples. See Sect. 5 below.

Biostatisticians (Robins et al., 1994) have also suggested adjustment of the sampling weights to improve efficiency, namely, by estimating the weights using a correct parametric model π(z_i; α) = Pr(ξ_i = 1|Z_i = z_i). For two-phase stratified sampling the model is rendered correct by including the J binary stratum indicators among the adjustment variables z. Logistic regression is typically used for π(z_i; α), in which case the estimating equations for α become

\sum_{i = 1}^{N} ξ_{i} z_{i} = \sum_{i = 1}^{N} z_{i} / w_{i}

(13)

where w_i = p(z_i; α̂_N)⁻¹ denotes the weight estimated for subject i. Note the parallel with the calibration equations (8): the weights appear on opposite sides. Results of Breslow and Wellner (2008) may again be used to derive the asymptotic distribution of θ̂_N (α̂_N) under two-phase stratified sampling. See the Appendix for a brief derivation of the weak convergence

\begin{array}{l} \sqrt{N} ({\hat{θ}}_{N} ({\hat{α}}_{N}) - θ_{0}) ⇝ G {\tilde{ℓ}}_{0} + \sum_{j = 1} \sqrt{ν_{j}} \sqrt{\frac{1 - p_{j}}{p_{j}}} G_{j} ({\tilde{ℓ}}_{0} - p_{j} R Z) where \\ R = P_{0} (1 - π_{0}) {\tilde{ℓ}}_{0} Z^{T} {(P_{0} π_{0} (1 - π_{0}) Z Z^{T})}^{- 1} . \end{array}

(14)

Note that R is also a projection matrix, now of components of ℓ̃₀ onto the linear span of π₀Z with respect to the distribution P_π defined by $P_{π} f = P_{0} (\frac{1 - π_{0}}{π_{0}} f) / P_{0} (\frac{1 - π_{0}}{π_{0}})$ .

Previous discussions of Horvitz-Thompson methods with estimated weights have derived their properties for Bernoulli sampling, whereby the N phase one subjects are selected independently for membership in the phase two sample with known sampling probabilities π₀(v) (Robins et al., 1994; Mark and Katki, 2006). This facilitates comparison of asymptotic properties in terms of influence functions. Using the asymptotic expansions in the Appendix, which hold for both sampling schemes, the influence function under Bernoulli sampling for the estimator based on calibrated weights is readily shown to equal

\frac{ξ}{π_{0} (v)} {\tilde{ℓ}}_{0} (x) - \frac{ξ - π_{0} (v)}{π_{0} (v)} Q z (v)

(15)

where π₀(v) = p_j for v ∈ Inline graphic . The asymptotic phase two variance under Bernoulli sampling is therefore $\sum_{j = 1}^{J} \sqrt{ν_{j}} \sqrt{(1 - p_{j}) / p_{j}} P_{0 ∣ j} {({\tilde{ℓ}}_{0} - Q Z)}^{\otimes 2}$ . The limit laws under the two sampling schemes, however, are subtly different. The asymptotic phase two variance implied by (12) for two phase stratified sampling, namely, $\sum_{j = 1}^{J} \sqrt{ν_{j}} \sqrt{(1 - p_{j}) / p_{j}} {Var}_{j} ({\tilde{ℓ}}_{0} - Q Z)$ , is less than that for Bernoulli sampling.

Expression (15) shows immediately how to choose z. Selecting z^opt = E(ℓ̃₀|V = v), we find Qz^opt = z^opt and conclude that the class of calibrated IPW estimators contains the optimal member of the class of augmented IPW estimators considered by Robins et al. (1994), Wang and Chen (2001) and Mark and Katki (2006). Influence functions for this class take the form (15) with Qz(v) replaced by φ(v), an arbitrary function of v, and the choice φ = z^opt minimizes the asymptotic variance. Similarly, the influence function when weights are estimated using auxiliary variables z has the form (15) where Qz is replaced by Rπ₀z. To obtain the optimal influence function, therefore, one would include the variables z^opt/π₀(v) together with the stratum indicators as predictors of the weights.

Calculation of z^opt = E(ℓ̃₀|V = v) requires knowledge of the conditional distribution [X|V], which is generally unavailable. One approach to approximating z^opt is to specify parametric regression models for the components of ℓ̃₀ on V (Robins et al., 1994, Sect. 2.7) and to estimate the parameters by IPW regression of the dfbeta on V using the second phase data. Kulich and Lin (2004) proposed an alternative “plug in” method for approximating the conditional expectation that we adopt here. It is likely to be most useful when there are only one or two partially missing variables. The steps are as follows:

Using (linear or logistic) regression models fitted by IPW to the phase two data, develop and fit parametric models to predict each partially missing variable (known only at phase two) given variables v known for all.
Use the models in step 1 to impute values x̂_i for everyone in the phase one sample; variables already known for everyone are used in their original form.
Fit the interest model P_θ,η(X) to the phase one sample using the imputed values x̂_i.
Construct auxiliary variables z as imputed dfbetas (7) from the model fitted in step 3; these are estimates of z^opt.
Estimate θ using weights adjusted to the z_i.

Examples of this approach are given in the companion paper (Breslow et al., 2009) and in Sect. 5 below.

An interesting special case arises when V is discrete with K > J levels, say V ∈ {1, …, K}. Define Z^T = (1(V = 1), …, 1(V = K)). Then both calibration and estimation lead to the same adjusted weights, namely, the inverse sampling fractions within each of the K strata defined by levels of V. This is a finer stratification than that actually used for sampling. Since $Q Z = \sum_{k = 1}^{K} P_{0 ∣ k} ({\tilde{ℓ}}_{0}) 1 (V = k) = E ({\tilde{ℓ}}_{0} ∣ V)$ , where P_0|_k is defined analogously to P_0|_j, it follows that Horwitz-Thompson estimation based on the finer stratification yields the most efficient estimate within the augmented IPW class. Survey statisticians refer to this method as post-stratification.

4 Log-linear Modeling

We first consider log-linear modeling of multinomial outcomes δ corresponding to occupancy of one of L cells (ℓ = 1, …, L) in a multidimensional table, with calibration to marginal totals rather than to IF contributions in a regression model. The outcome for subject i is $δ_{i}^{T} = (δ_{i, 1}, \dots, δ_{i, L})$ where δ_i,_ℓ = 1 if the subject occupies cell ℓ, otherwise δ_i,_ℓ = 0. Redefine X to denote a L × p design matrix (p < L), without the intercept to maintain identifiability, and let θ ∈ Θ ⊂ R^p be a parameter vector such that, with x_ℓ denoting the ℓ^th row of X,

p_{ℓ} (θ) = Pr (occupy cell ℓ ∣ θ) = \frac{exp (x_{ℓ} θ)}{\sum_{m = 1}^{L} exp (x_{m} θ)} .

(16)

The fully parametric model P_θ(δ) thus has likelihood

lik (θ) = \prod_{i = 1}^{N} \prod_{ℓ = 1}^{L} {[p_{i, ℓ} (θ)]}^{δ_{i, ℓ}} .

Our results for semiparametric models and two-phase stratified samples apply to this situation with ℓ̃_θ denoting the ordinary (parametric) influence function rather than the semiparametric efficient one. Details have been worked out by Kovacevic and Rai (2002) for joint model-design based inference for parameters in log-linear models under more general sampling designs. With p(θ) defined by p^T(θ) = (p₁(θ), …, p_L(θ)) denoting the vector of cell occupancy probabilites under the model, the likelihood score is ℓ̇_θ = X^T[δ − p(θ)]. With $O = \sum_{i = 1}^{N} δ_{i}$ denoting the (unobserved) vector of table frequencies for the phase one sample, the MLE θ̃_N would be obtained from (2) as the solution to ℙ_N ℓ̇_θ(δ_i) = X^T[O − p(θ)N] = 0. The equation to be solved for θ̂_N is instead $P_{N}^{π} {\tilde{ℓ}}_{θ} = X^{T} [O^{π} - p (θ) \hat{N}]$ where $O^{π} = \sum_{i = 1}^{N} ξ_{i} π_{i}^{- 1} δ_{i}$ denotes the estimated vector of frequencies and $\hat{N} = \sum_{i = 1}^{N} ξ_{i} π_{i}^{- 1}$ (= N for two-phase stratified sampling) is the estimated phase one total. Thus one simply fits the log-linear model to the estimated table of frequencies.

With $I_{θ} = P_{θ} {\dot{ℓ}}_{θ} {\dot{ℓ}}_{θ}^{T} = X^{T} diag {p_{ℓ} (θ) [1 - p_{ℓ} (θ)]} X$ denoting the parametric information and ${\tilde{ℓ}}_{θ} = I_{θ}^{- 1} X^{T} [δ - p (θ)]$ the parametric influence function, one can use equations (5) and (6) to determine the asymptotic variance of θ̂_N. The phase one component of variance is estimated by (NI_θ̂_{_N})⁻¹ or more robustly

{\hat{Var}}_{PHS - I} = \frac{1}{N (N - 1)} I_{{\hat{θ}}_{N}}^{- 1} X^{T} [\sum_{ℓ = 1}^{L} O_{ℓ}^{π} (e_{ℓ} - \hat{p}) {(e_{ℓ} - \hat{p})}^{T}] X I_{{\hat{θ}}_{N}}^{- 1}

where p̂ = p(θ̂_N) and e_ℓ denotes the L-vector with one in the ℓ^th position and zeroes elsewhere. Similarly, if O_j,_ℓ denotes the number of phase two subjects in stratum j observed to occupy cell ℓ of the table, then the phase two component of variance is estimated by

\begin{array}{l} {\hat{Var}}_{PHS - II} = \frac{1}{N^{2}} I_{{\hat{θ}}_{N}}^{- 1} X^{T} [\sum_{j = 1}^{J} N_{j} \frac{1 - n_{j} / N_{j}}{n_{j} / N_{j}} S_{j}^{2}] X I_{{\hat{θ}}_{N}}^{- 1} where \\ S_{j}^{2} = \frac{1}{n_{j} - 1} \sum_{ℓ = 1}^{L} O_{ℓ, j} (e_{ℓ} - {\hat{p}}^{(j)}) {(e_{ℓ} - {\hat{p}}^{(j)})}^{T} \end{array}

and p̂⁽^j⁾ is the observed vector of cell occupancy fractions in stratum j.

As an illustration, we studied the association between high density lipoprotein-cholesterol (HDL-C) and lipoprotein-associated phospholipase A₂ (Lp-PLA₂) using data from the ARIC case-cohort study of Ballantyne et al. (2004). Measurements on HDL-C were available for the entire cohort of 12,345 participants who had plasma collected at their second follow-up visit and were free of coronary heart disease (CHD) at that time. Levels of Lp-PLA₂ were determined by assay of the stored plasma samples for the 604 who developed CHD during the 6–8 years of additional follow-up and for 732 controls who had been sampled for the cohort random sample after stratification on gender, age and ethnicity (two levels each). The ARIC investigators were interested in whether their data would support the finding by Persson et al. (2007) of a negative association between Lp-PLA₂, a new biomarker of inflammation, and the well known risk factor HDL-C. Sampling weights for the nine sampling strata ranged from 1.2 for the stratum consisting of the CHD cases to 32 for white female controls under age 55. (The CHD cases were effectively sampled at less than 100% since some plasma samples were uninformative.)

All statistical analyses were carried out using the R survey package (Lumley, 2004) available from http://cran.r-project.org/. Table 1A shows the frequencies for the cross-classification of the two risk factors, estimated together with their standard errors using standard sampling weights. They were obtained using IPW estimates of population means that are implemented in the svytotal function. Note that the marginal HDL-C totals of the frequencies estimated using standard weights do not agree with the actual (integral) HDL-C totals for the cohort.

Table 1.

Association between HDL-C and Lp-PLA₂: Standard vs Calibrated Weights

	A. Standard weights				B. Calibrated weights
HDL-C (mg/L)	Lp-PLA₂ (μG/L)				Lp-PLA₂ (μG/L)
	0–0.309	0.31–0.42	0.421–1	Total	0–0.309	0.31–0.42	0.421–1	Total

	Estimated frequencies				Estimated frequencies
<40	701.4	938.6	1,561.3	3,201.3	739.0	988.9	1,645.0	3,373
40–59.0	1,764.4	2,310.2	2,175.3	6,249.9	1,665.0	2,180.1	2,052.9	5,898
≥60.0	1,569.6	909.4	414.8	2,893.8	1,667.4	966.0	440.6	3,074

Total	4,035.4	4,158.2	4,154.4	12,345	4,071.4	4,135.1	4,138.5	12,345
	Standard errors				Standard errors
<40	105.7	111.3	138.9	185.4	99.7	105.0	117.7	49.5
40–59.0	166.9	187.5	170.0	234.1	144.4	155.9	146.7	55.5
≥60.0	164.2	124.1	81.9	197.7	128.0	117.0	83.1	48.0

Total	217.3	222.2	206.1		212.9	220.3	201.8

Open in a new tab

To formally assess the statistical significance of the clearly negative association, we fit a log-linear model with a term for a linear×linear interaction. The design matrix was specified as

X^{T} = (\begin{matrix} 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & - 1 & 0 & 0 & 0 & - 1 & 0 & 1 \end{matrix}) .

The first four columns of X (rows of X^T) describe the standard independence model, a log-linear model with main effects for rows (HDL-C) and columns (Lp-PLA₂). The interaction term in the last column corresponds to the single degree of freedom linear by linear association test for dependence in a two-way table. Fitting the model using svyloglin, the interaction coefficient was −0.5323 with standard error 0.0734 yielding a test statistic Z² = 52.6, p = 4 × 10⁻¹³. Expanding the model to saturation and testing for dependence using a Wald statistic yielded χ² = 60.1, DF = 4, p = 3 × 10⁻¹². Most of the association was explained by the linear interaction.

Table 1B shows the tabular frequencies estimated after calibration of the weights to the marginal totals of HDL-C using the generalized raking procedure described in Sect. 3. These are the classical raking estimates that are proportionally adjusted to the known marginal totals (Deming and Stephan, 1940) so that the estimated and actual marginal (HDL-C) totals do now agree. The standard errors of the estimated marginal totals reflect only the binomial sampling variability at phase one; the phase two component is zero. With one minor exception where the estimated frequency also increased, the standard errors of frequencies in the cross-classification decreased markedly. There was no meaningful change in the precision of estimation of the marginal totals of Lp-PLA₂, reflecting the fact that no phase one information about this margin was utilized. Fitting the interaction model with calibrated weights led to a slightly more significant linear by linear association test, Z² = 54.0, p = 2 × 10⁻¹³, but there was virtually no change in the 4 DF test (χ² = 60.2).

Results of estimating the table frequencies using estimated weights (not shown) were similar to those obtainned with calibrated weights in terms of precision. The estimated marginal HDL-C totals were 3412.9, 6014.9 and 2915.2, however, and hence differed from the actual totals shown in the margin of Table 2B. Nonetheless, the phase two contributions to the SEs of the marginal totals were still neglible if not absolutely zero. For the three levels of HDL-C, the SEs reported by svytotal were 49.693, 55.534 and 47.190, whereas the SEs based only on phase one variability were 49.687, 55.527 and 47.184.

Table 2.

Average Total Standard Error (SE) and Empirical Phase Two Standard Error (RMSE). Adjustment to Within Stratum Totals of Estimated Influence Function Contributions

	Cox	Standard weights		Calibrated weights^*		Estimated weights
Model term	SE	SE	RMSE	SE	RMSE	SE	RMSE
UH	.503	.537	.188	.572	.157	.517	.146
Age₀ (yr)	.321	.360	.158	.429	.228	.323	.085
Age₁ (yr)	.015	.026	.021	.024	.030	.016	.016
Stage (III, IV vs I, II)	.259	.346	.231	.365	.265	.271	.116
Diameter (cm)	.015	.021	.015	.021	.018	.015	.008
Stage × Diameter	.020	.029	.020	.028	.020	.021	.010
UH×Age₀	.552	.612	.287	.668	.321	.587	.258
UH×Age₁	.033	.051	.048	.053	.060	.046	.050

Open in a new tab

Based on 998 replications for which coefficients were obtained

See the companion paper for Cox regression analyses of the ARIC data.

5 Simulation Study of Cox Regression Modeling

To study the effect of different methods of adjustment of sampling weights on the performance of Cox regression coefficients estimated with the svycoxph function, we used data on N = 3, 915 NWTS patients that were also considered in the the companion paper. The data and sample R code are available from the website http://faculty.washington.edu/norm/software.html. These data were also used by Kulich and Lin (2004), whom we follow in choice of models, to illustrate their “combined, doubly weighted” estimator. The goal was to estimate hazards associated with prognostic variables in a failure time analysis, with event-free survival (time from diagnosis to disease progression or death) as the endpoint. The most important prognostic factor was histologic subtype, classified as favorable (FH) or unfavorable histology (UH). Information on histology was available both from the central pathology reference laboratory and from the institution where the patient was treated. This made possible the simulation of exposure stratified case-cohort studies (Borgan et al., 2000), treating the central pathology measurement as known only for the phase two sample and the institutional pathology measurement as one of the variables used for stratification of the sample and adjustment of the weights. The Cox model was first fit to the entire cohort, yielding the normally unobservable θ̃_N for comparison with the two-phase estimates θ̂_N. Mean squared differences (θ̃_N − θ̂_N)² averaged over 10,000 phase two samples provided an empirical estimate of the phase two variance component.

Phase two sampling was stratified on the basis of institutional histology, stage of disease (low=I, II vs high=III, IV), age at diagnosis (babies vs 1+ year olds) and whether (cases) or not (controls) the patient had relapsed or died before the end of follow-up. All cases and all but the three largest control strata were included in their entirety in the phase two sample: 120 of 452 FH control babies with low stage disease; 160 of 1,620 1+ year old controls with low stage disease; and 120 of 914 1+ year old controls with high stage disease were sampled at random for phase two. This provided 660 phase two controls for comparison with the 669 cases and a phase two sample size of n = 1, 329. The Cox model contained terms for histology (UH), age as a linear spline with separate slopes for babies (Age₀) and older children (Age₁), a binary indicator of high vs low stage, tumor diameter (cm) and interactions between stage and diameter and between histology and the two age terms.

Auxiliary variables for calibration and estimation of the weights were constructed using the procedure described at the end of Sect. 3. For each of the 10,000 simulated phase two samples we first used the svyglm function with standard sampling weights to predict histology (central pathology) based on a logistic regression model containing terms for institutional (local) histology (the main predictor), stage, age greater or less than 10 years, study (NWTS-3 vs NWTS-4) and an interaction between local UH and stage. The logistic regression coefficients were then used in conjunction with phase one data to yield a predicted probability of having UH for each of the 3,915 phase one subjects. The Cox model was fit using the ordinary coxph procedure with this predicted probability in place of the binary UH covariate. Estimated IF contributions (dfbetas) were extracted, augmented by 1 to ensure that they were positive, and used as the auxiliary variables z for calibration. For estimation the dfbetas were divided by the known sampling probabilities before adding 1, and combined with the stratum indicators in the prediction equation.

Here is a brief summary of results of the initial simulation study as reported in the companion paper (Breslow et al., 2009). Calibration again used the raking procedure. Mean values of the regression coefficients were nearly identical for the three case-cohort estimation methods (standard, calibrated and estimated weights) and close to those estimated for the entire cohort. Both calibration and estimation reduced the phase two variance components, sometimes to negligible levels. For example, the phase two standard error of the Age₀ coefficient was reduced from 0.162 with standard weights to 0.037 with calibrated and to 0.061 with estimated weights. The phase two SE for the UH main effect was reduced by 29% with calibration and to 28% with estimation. There were lesser reductions for the interaction with Age₀.

Additional simulations based on only 1,000 replications were carried out to determine whether the modification of the auxiliary variables suggested in the paragraph following equation (12) in Sect. 3, intended to reduce the asymptotic phase two variance further under two phase stratified sampling, would actually improve estimates in finite samples. This increased the number of adjustment variables from 8, the number of terms in the Cox model, to 32 in view of the four sampling strata. Substantial numerical difficulties were experienced with the larger number of adjustment variables, particularly for calibration. The survey option to force convergence of the iterative solution of the calibration equations, by returning the weights after 200 iterations, was required to prevent otherwise frequent failure of the algorithm but led to outliers in the estimated regression coefficients and the failure of the svycoxph procedure in 2 samples. This increased the phase two variability in comparison with calibration based on a smaller number of variables.

Results are shown in Table 2. The column labeled Cox SE shows the robust (Lin and Wei, 1989) standard errors for the Cox model fit to the entire cohort. These were noticeably larger than the model based standard errors, reflecting a lack of proportionality for several covariates. The phase one variance components estimated by svycoxph also are the robust versions, emphasizing the sample survey view that the goal is to approximate the results of fitting a possibly misspecified model to the entire cohort. The columns labeled SE for the various two-phase case-cohort methods are averages of the total estimated standard errors, incorporating both phase one and phase two variability. The columns labeled RMSE contain the empirical phase two standard errors. For most coefficients, in stark contrast to results obtained with just 8 adjustment variables, the RMSE using calibrated weights were larger even than with standard sampling weights. The svycoxph procedure returned regression coefficients for all 1,000 replicates when weights were determined by estimation, but they too had slightly larger RMSE than those obtained with fewer adjustment variables. On the other hand, they were more accurate than estimates obtained using standard weights. Qualitatively similar results were obtained when the adjustment variables were limited to the 24 corresponding to the three strata sampled at less than 100%.

6 Discussion and Conclusions

This paper has reviewed statistical properties of Horvitz-Thompson estimators of Euclidean parameters in semiparametric models with two-phase stratified samples, derived the corresponding properties for modified estimators where the sampling weights are adjusted by calibration or estimation, and illustrated the methods by log-linear and Cox regression modeling of stratified case-cohort data. A limitation of the results of van der Vaart (1998) for the Cox model, which were used by Breslow and Wellner (2007) and apply therefore to those reported here, is that time-dependent covariates were excluded from consideration. This is currently also a limitation of the svycoxph function in the survey package, at least so far as correct estimation of the variance in concerned. Work is in progress to modify svycoxph so that it will calculate the variances correctly when multiple time slice records are included in the survival analysis for each phase two subject. This will provide the extension needed to deal with time-dependent covariates. We think it likely that the asymptotic results hold for this situation and in fact much more generally. We are hopeful that a general theory for estimation of parameters in semiparametric models can be developed for sample survey designs along the lines of what Lin (2000) has provided for the Cox model.

The asymptotic properties of estimators with weights adjusted using calibration and estimation are very similar. Finite sample performance of the two methods of adjustment was also very similar in the initial simulation results for the NWTS cohort as reported in the companion paper. An advantage of calibration for the survey statistician is that information on the auxiliary variables is not needed for everyone in the cohort (phase one sample). Only the totals over the cohort are required. The more detailed information is required for estimation of the sampling probabilities.

Our asymptotic results suggested the possibility of improvement by using stratum specific versions of the imputed IF contributions for calibration of the weights. Robins et al. (1994) likewise noted that, so far as asymptotic variances are concerned, increasing the number of auxiliary variables used to estimate the weights can only reduce (or at least never increase) the asymptotic variance of the estimators. As the simulations reported here illustrate, however, actual performance in finite samples may be quite different. Adjustment based on calibration was particularly susceptible to numerical problems caused by a large number of auxiliary variables. In practical work, standard or estimated sampling weights should be used when the calibration algorithm fails to converge. Options for calibration that bound the discrepancy between design and calibrated weights have been implemented in the R survey package and further work exploring the finite sample properties of estimators using such weights would be desirable. It would also be of interest to compare our “plug in” method of approximating the optimal auxiliary variables z^opt to other methods of estimating the conditional mean.

Methods for improving the analysis of data from two phase stratified samples have now been implemented in the flexible survey package of the freely available R statistical system. There is no longer any excuse for epidemiologists and statisticians to waste valuable information by inefficient analysis of data from stratified case-control and case-cohort studies.

Acknowledgments

The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by the National Heart, Lung and Blood Institute contracts N01-HC-55-15, N01-NC-55016, N01-HC-55019, N01-HC-55020, N01-HC-55021 and N01-HC-55022. The National Wilms Tumor Study and Norman Breslow received financial support from grants R01-CA-054498, R01-CA-40644 and earlier grants from the National Cancer institute. Michal Kulich received financial support from Research Project MSM 0021620839 from Ministry of Education, Czech Republic. The authors thank the staff and participants of the ARIC and NWTS studies for their important contributions.

7 Appendix

Here we provide a brief outline of the derivation of (12) for the calibration estimator. For in a neighborhood of the limiting value λ₀ = 0 denote modified weights by $π_{λ}^{- 1} (v) = π_{0}^{- 1} (v) [1 - λ^{T} z]$ , where $π_{0}^{- 1} (v)$ are known sampling weights depending on v. Define classes of functions

\begin{array}{l} ψ_{1; θ, η, λ} (w, ξ) = \frac{ξ}{π_{λ} (v)} {\dot{ℓ}}_{θ, η} (x) and \\ ψ_{2; θ, η, λ, h} (w, ξ) = \frac{ξ}{π_{λ} (v)} B_{θ, η} h (x) \end{array}

for θ ∈ Θ, η ∈ Ξ, and h ∈ ℋ, and let Ψ_1;_θ,η,λ = P₀ ψ_1; _θ,η,λ (W, ξ) and Ψ_2; _θ,η,λ_,_h = P₀ ψ_2; _θ,η,λ_,_h(W, ξ) denote their expectation. Then

\begin{array}{l} {{\dot{Ψ}}_{1, λ} = \frac{\partial}{\partial λ^{T}} P_{0} ψ_{1; θ, η, λ, h} |}_{λ = 0} = - P_{0} ({\dot{ℓ}}_{0} Z^{T}) and \\ {{\dot{Ψ}}_{2; λ} h = \frac{\partial}{\partial λ^{T}} P_{0} ψ_{2; θ, η, λ, h} |}_{λ = 0} = - P_{0} (B_{0} h Z^{T}), h \in ℋ . \end{array}

Substituting these expressions for the analogous quantities Ψ̇₁_,α and Ψ̇_2;_αh in Breslow and Wellner (2008, Sect. 3) and reworking the calculations there shows that

\sqrt{N} ({\hat{θ}}_{N} ({\hat{λ}}_{N}) - θ_{0}) = \sqrt{N} ({\hat{θ}}_{N} (0) - θ_{0}) - P_{0} ({\tilde{ℓ}}_{0} Z^{T}) \sqrt{N} {\hat{λ}}_{N} + o_{p} (1) .

From (6), (9) and (11), assuming n ↑ ∞ faster than $\sqrt{N}$ as N ↑ ∞, we find

\sqrt{N} {\hat{λ}}_{N} = B^{- 1} \sqrt{N} (P_{N}^{π} - P_{N}) Z + o_{p} (1) ⇝ \sum_{j = 1}^{J} \sqrt{ν_{j}} \sqrt{\frac{1 - p_{j}}{p_{j}}} G_{j} B^{- 1} Z

while from (5) and (6)

\sqrt{N} ({\hat{θ}}_{N} (0) - θ_{0}) ⇝ G {\tilde{ℓ}}_{0} + \sum_{j = 1}^{J} \sqrt{ν_{j}} \sqrt{\frac{1 - p_{j}}{p_{j}}} G_{j} {\tilde{ℓ}}_{0} .

Combining the last three equations yields the desired result (12).

The limit law for θ̂_N based on estimated weights is derived similarly. From Breslow and Wellner (2008, eq. 18) one concludes for either Bernoulli or two-phase stratified sampling

\sqrt{N} ({\hat{θ}}_{N} ({\hat{α}}_{N}) - θ_{0}) = \sqrt{N} ({\hat{θ}}_{N} (α_{0}) - θ_{0}) - P_{0} ({\tilde{ℓ}}_{0} (1 - π_{0}) Z) \sqrt{N} ({\hat{α}}_{N} - α_{0}) + o_{p} (1) .

A Taylor expansion with remainder of α̂_N about α₀ in (13) leads to

\sqrt{N} ({\hat{α}}_{N} - α_{0}) = {[P_{0} π_{0} (1 - π_{0}) Z Z^{T}]}^{- 1} \frac{1}{\sqrt{N}} \sum_{i = 1}^{N} [ξ_{i} - π_{0} (v_{i})] Z + o_{p} (1) .

Combining these two equations with (5) we find

\sqrt{N} (\hat{θ} ({\hat{α}}_{N}) - θ_{0}) = \sqrt{N} P_{N} {\tilde{ℓ}}_{0} + \sqrt{N} (P_{N}^{π} - P_{N}) ({\tilde{ℓ}}_{0} - π_{0} R Z) + o_{p} (1)

and the conclusion (14) follows from the arguments in Breslow and Wellner (2007, Sect. 4).

Contributor Information

Norman E. Breslow, Email: norm@u.washington.edu, Department of Biostatistics, University of Washington, Seattle, WA, USA, Tel.: +1-206-543-2035, Fax: +1-206-616-2724

Thomas Lumley, Email: tlumley@u.washington.edu, Department of Biostatistics, University of Washington, Seattle, WA, USA.

Christie M Ballantyne, Email: cmb@bcm.edu, Department of Medicine, Baylor College of Medicine, Houston, TX, USA.

Lloyd E. Chambless, Email: Lloyd.Chambless@mail.cscc.unc.edu, Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA

Michal Kulich, Email: kulich@karlin.m3.cuni.cz, Department of Probability and Mathematical Statistics, Charles University, Prague, CZ.

References

Ballantyne CM, Hoogeveen RC, Bang H, et al. Lipoprotein-associated phospholipase A(2), high-sensitivity C-reactive protein, and risk for incident coronary heart disease in middle-aged men and women in the Atherosclerosis Risk in Communities (ARIC) study. Circulation. 2004;109:837–842. doi: 10.1161/01.CIR.0000116763.91992.F1. [DOI] [PubMed] [Google Scholar]
Barlow WE. Robust variance estimation for the case-cohort design. Biometrics. 1994;50:1064–1072. [PubMed] [Google Scholar]
Barlow WE, Ichikawa L, Rosner D, Izumi S. Analysis of case-cohort designs. J Clin Epidemiol. 1999;52:1165–1172. doi: 10.1016/s0895-4356(99)00102-x. [DOI] [PubMed] [Google Scholar]
Begun JM, Hall WJ, Huang W-M, Wellner JA. Information and asymptotic efficiency in parametric-nonparametric models. Ann Stat. 1983;11:432–452. [Google Scholar]
Binder DA. Fitting Cox’s proportional hazards model from survey data. Biometrika. 1992;79:139–147. [Google Scholar]
Borgan O, Langholz B, Samuelsen SO, et al. Exposure stratified case-cohort designs. Lifetime Data Anal. 2000;6:39–58. doi: 10.1023/a:1009661900674. [DOI] [PubMed] [Google Scholar]
Breslow N. Covariance analysis of censored survival data. Biometrics. 1974;30:89–99. [PubMed] [Google Scholar]
Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J Roy Stat Soc B. 1997;59:447–461. [Google Scholar]
Breslow NE, Lumley T, Ballantyne CM, et al. Using the whole cohort in the analysis of case-cohort data. Am J Epidemiol. 2009;192 doi: 10.1093/aje/kwp055. (in press) [DOI] [PMC free article] [PubMed] [Google Scholar]
Breslow NE, Wellner JA. Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand J Stat. 2007;34:86–102. doi: 10.1111/j.1467-9469.2007.00574.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Breslow NE, Wellner JA. A Z-theorem with estimated nuisance parameters and correction note for ’Weighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression’. Scand J Stat. 2008;35:186–192. doi: 10.1111/j.1467-9469.2007.00574.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cain KC, Lange NT. Approximate case influence for the proportional hazards regression model with censored data. Biometrics. 1984;40:493–499. [PubMed] [Google Scholar]
Cox DR. Regression models and life-tables (with discussion) J Roy Stat Soc B. 1972;34:187–220. [Google Scholar]
Cox DR. Partial likelihood. Biometrika. 1975;62:269–276. [Google Scholar]
D’Angio GJ, Breslow N, Beckwith JB, et al. Treatment of Wilms’ tumor: Results of the third National Wilms’ Tumor Study. Cancer. 1989;64:349–360. doi: 10.1002/1097-0142(19890715)64:2<349::aid-cncr2820640202>3.0.co;2-q. [DOI] [PubMed] [Google Scholar]
Deming WE, Stephan FF. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann Math Stat. 1940;11:427–444. [Google Scholar]
Deville JC, Särndal C-E. Calibration estimators in survey sampling. J Am Stat Assoc. 1992;87:376–382. [Google Scholar]
Green DM, Breslow NE, Beckwith JB, et al. Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin for patients with Wilms’ tumor: A report from the National Wilms’ Tumor Study Group. J Clin Oncol. 1998;16:237–245. doi: 10.1200/JCO.1998.16.1.237. [DOI] [PubMed] [Google Scholar]
Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685. [Google Scholar]
The ARIC Investigators. The Atherosclerosis Risk in Communities (ARIC) study: design and objectives. Am J Epidemiol. 1989;129:687–702. [PubMed] [Google Scholar]
Isaki CT, Fuller WA. Survey design under the regression superpopulation model. J Am Stat Assoc. 1982;77:89–96. [Google Scholar]
Kovacevic MS, Rai SN. Log-linear modelling of change using longitudinal survey data. Commun Stat Theory Methods. 2002;31:1815–1835. [Google Scholar]
Kulich M, Lin DY. Improving the efficiency of relative-risk estimation in case-cohort studies. J Am Stat Assoc. 2004;99:832–844. [Google Scholar]
Lin DY. On fitting Cox’s proportional hazards models to survey data. Biometrika. 2000;87:37–47. [Google Scholar]
Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. J Am Stat Assoc. 1989;84:1074–1078. [Google Scholar]
Lumley T. Analysis of complex survey samples. J Stat Software. 2004;9:1–19. [Google Scholar]
Mark SD, Katki HA. Specifying and implementing nonparametric and semiparametric survival estimators in two-stage (nested) cohort studies with missing case data. J Am Stat Assoc. 2006;101:460–471. [Google Scholar]
Nan B. E3cient estimation for case-cohort studies. Can J Stat. 2004;32:403–419. [Google Scholar]
Neyman J. Contribution to the theory of sampling human populations. J Am Stat Assoc. 1938;33:101–116. [Google Scholar]
Persson M, Nilsson JA, Nelson JJ, et al. The epidemiology of Lp-PLA(2): Distribution and correlation with cardiovascular risk factors in a population-based cohort. Atherosclerosis. 2007;190:388–396. doi: 10.1016/j.atherosclerosis.2006.02.016. [DOI] [PubMed] [Google Scholar]
Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11. [Google Scholar]
Rao JNK, Yung W, Hidiroglou M. Estimating equations for the analysis of survey data using poststratification information. Sankhya. 2002;64:364–378. [Google Scholar]
Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–866. [Google Scholar]
Rubin-Bleuer S, Kratina IS. On the two-phase framework for joint model and design-based inference. Ann Stat. 2005;33:2789–2810. [Google Scholar]
Särndal C-E, Swensson B, Wretman JH. The weighted residual technique for estimating the variance of the general regression estimator of the finite population total. Biometrika. 1989;76:527–537. [Google Scholar]
Scheike TH, Martinussen T. Maximum likelihood estimation for Cox’s regression model under case-cohort sampling. Scand J Stat. 2004;31:283–293. [Google Scholar]
Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84:57–71. [Google Scholar]
Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer-Verlag; New York: 2000. [Google Scholar]
van der Vaart AW. Asymptotic Statistics. Cambridge University Press; Cambridge: 1998. [Google Scholar]
van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes with Applications in Statistics. Springer; New York: 1996. [Google Scholar]
Wang CY, Chen HY. Augmented inverse probability weighted estimator for Cox missing covariate regression. Biometrics. 2001;57:414–419. doi: 10.1111/j.0006-341x.2001.00414.x. [DOI] [PubMed] [Google Scholar]
White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266. [DOI] [PubMed] [Google Scholar]
Zeng D, Lin DY. Maximum likelihood estimation in semiparametric regression models with censored data. J Roy Stat Soc B. 2007;69:507–536. [PMC free article] [PubMed] [Google Scholar]

[R1] Ballantyne CM, Hoogeveen RC, Bang H, et al. Lipoprotein-associated phospholipase A(2), high-sensitivity C-reactive protein, and risk for incident coronary heart disease in middle-aged men and women in the Atherosclerosis Risk in Communities (ARIC) study. Circulation. 2004;109:837–842. doi: 10.1161/01.CIR.0000116763.91992.F1. [DOI] [PubMed] [Google Scholar]

[R2] Barlow WE. Robust variance estimation for the case-cohort design. Biometrics. 1994;50:1064–1072. [PubMed] [Google Scholar]

[R3] Barlow WE, Ichikawa L, Rosner D, Izumi S. Analysis of case-cohort designs. J Clin Epidemiol. 1999;52:1165–1172. doi: 10.1016/s0895-4356(99)00102-x. [DOI] [PubMed] [Google Scholar]

[R4] Begun JM, Hall WJ, Huang W-M, Wellner JA. Information and asymptotic efficiency in parametric-nonparametric models. Ann Stat. 1983;11:432–452. [Google Scholar]

[R5] Binder DA. Fitting Cox’s proportional hazards model from survey data. Biometrika. 1992;79:139–147. [Google Scholar]

[R6] Borgan O, Langholz B, Samuelsen SO, et al. Exposure stratified case-cohort designs. Lifetime Data Anal. 2000;6:39–58. doi: 10.1023/a:1009661900674. [DOI] [PubMed] [Google Scholar]

[R7] Breslow N. Covariance analysis of censored survival data. Biometrics. 1974;30:89–99. [PubMed] [Google Scholar]

[R8] Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J Roy Stat Soc B. 1997;59:447–461. [Google Scholar]

[R9] Breslow NE, Lumley T, Ballantyne CM, et al. Using the whole cohort in the analysis of case-cohort data. Am J Epidemiol. 2009;192 doi: 10.1093/aje/kwp055. (in press) [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Breslow NE, Wellner JA. Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand J Stat. 2007;34:86–102. doi: 10.1111/j.1467-9469.2007.00574.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Breslow NE, Wellner JA. A Z-theorem with estimated nuisance parameters and correction note for ’Weighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression’. Scand J Stat. 2008;35:186–192. doi: 10.1111/j.1467-9469.2007.00574.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Cain KC, Lange NT. Approximate case influence for the proportional hazards regression model with censored data. Biometrics. 1984;40:493–499. [PubMed] [Google Scholar]

[R13] Cox DR. Regression models and life-tables (with discussion) J Roy Stat Soc B. 1972;34:187–220. [Google Scholar]

[R14] Cox DR. Partial likelihood. Biometrika. 1975;62:269–276. [Google Scholar]

[R15] D’Angio GJ, Breslow N, Beckwith JB, et al. Treatment of Wilms’ tumor: Results of the third National Wilms’ Tumor Study. Cancer. 1989;64:349–360. doi: 10.1002/1097-0142(19890715)64:2<349::aid-cncr2820640202>3.0.co;2-q. [DOI] [PubMed] [Google Scholar]

[R16] Deming WE, Stephan FF. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann Math Stat. 1940;11:427–444. [Google Scholar]

[R17] Deville JC, Särndal C-E. Calibration estimators in survey sampling. J Am Stat Assoc. 1992;87:376–382. [Google Scholar]

[R18] Green DM, Breslow NE, Beckwith JB, et al. Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin for patients with Wilms’ tumor: A report from the National Wilms’ Tumor Study Group. J Clin Oncol. 1998;16:237–245. doi: 10.1200/JCO.1998.16.1.237. [DOI] [PubMed] [Google Scholar]

[R19] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685. [Google Scholar]

[R20] The ARIC Investigators. The Atherosclerosis Risk in Communities (ARIC) study: design and objectives. Am J Epidemiol. 1989;129:687–702. [PubMed] [Google Scholar]

[R21] Isaki CT, Fuller WA. Survey design under the regression superpopulation model. J Am Stat Assoc. 1982;77:89–96. [Google Scholar]

[R22] Kovacevic MS, Rai SN. Log-linear modelling of change using longitudinal survey data. Commun Stat Theory Methods. 2002;31:1815–1835. [Google Scholar]

[R23] Kulich M, Lin DY. Improving the efficiency of relative-risk estimation in case-cohort studies. J Am Stat Assoc. 2004;99:832–844. [Google Scholar]

[R24] Lin DY. On fitting Cox’s proportional hazards models to survey data. Biometrika. 2000;87:37–47. [Google Scholar]

[R25] Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. J Am Stat Assoc. 1989;84:1074–1078. [Google Scholar]

[R26] Lumley T. Analysis of complex survey samples. J Stat Software. 2004;9:1–19. [Google Scholar]

[R27] Mark SD, Katki HA. Specifying and implementing nonparametric and semiparametric survival estimators in two-stage (nested) cohort studies with missing case data. J Am Stat Assoc. 2006;101:460–471. [Google Scholar]

[R28] Nan B. E3cient estimation for case-cohort studies. Can J Stat. 2004;32:403–419. [Google Scholar]

[R29] Neyman J. Contribution to the theory of sampling human populations. J Am Stat Assoc. 1938;33:101–116. [Google Scholar]

[R30] Persson M, Nilsson JA, Nelson JJ, et al. The epidemiology of Lp-PLA(2): Distribution and correlation with cardiovascular risk factors in a population-based cohort. Atherosclerosis. 2007;190:388–396. doi: 10.1016/j.atherosclerosis.2006.02.016. [DOI] [PubMed] [Google Scholar]

[R31] Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11. [Google Scholar]

[R32] Rao JNK, Yung W, Hidiroglou M. Estimating equations for the analysis of survey data using poststratification information. Sankhya. 2002;64:364–378. [Google Scholar]

[R33] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–866. [Google Scholar]

[R34] Rubin-Bleuer S, Kratina IS. On the two-phase framework for joint model and design-based inference. Ann Stat. 2005;33:2789–2810. [Google Scholar]

[R35] Särndal C-E, Swensson B, Wretman JH. The weighted residual technique for estimating the variance of the general regression estimator of the finite population total. Biometrika. 1989;76:527–537. [Google Scholar]

[R36] Scheike TH, Martinussen T. Maximum likelihood estimation for Cox’s regression model under case-cohort sampling. Scand J Stat. 2004;31:283–293. [Google Scholar]

[R37] Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84:57–71. [Google Scholar]

[R38] Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer-Verlag; New York: 2000. [Google Scholar]

[R39] van der Vaart AW. Asymptotic Statistics. Cambridge University Press; Cambridge: 1998. [Google Scholar]

[R40] van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes with Applications in Statistics. Springer; New York: 1996. [Google Scholar]

[R41] Wang CY, Chen HY. Augmented inverse probability weighted estimator for Cox missing covariate regression. Biometrics. 2001;57:414–419. doi: 10.1111/j.0006-341x.2001.00414.x. [DOI] [PubMed] [Google Scholar]

[R42] White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266. [DOI] [PubMed] [Google Scholar]

[R43] Zeng D, Lin DY. Maximum likelihood estimation in semiparametric regression models with censored data. J Roy Stat Soc B. 2007;69:507–536. [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Improved Horvitz-Thompson Estimation of Model Parameters from Two-phase Stratified Samples: Applications in Epidemiology

Norman E Breslow

Thomas Lumley

Christie M Ballantyne

Lloyd E Chambless

Michal Kulich

Abstract

1 Introduction

2 Properties of Horvitz-Thompson Estimators

3 Calibration and Estimation of the Weights

4 Log-linear Modeling

Table 1.

Table 2.

5 Simulation Study of Cox Regression Modeling

6 Discussion and Conclusions

Acknowledgments

7 Appendix

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Improved Horvitz-Thompson Estimation of Model Parameters from Two-phase Stratified Samples: Applications in Epidemiology

Norman E Breslow

Thomas Lumley

Christie M Ballantyne

Lloyd E Chambless

Michal Kulich

Abstract

1 Introduction

2 Properties of Horvitz-Thompson Estimators

3 Calibration and Estimation of the Weights

4 Log-linear Modeling

Table 1.

Table 2.

5 Simulation Study of Cox Regression Modeling

6 Discussion and Conclusions

Acknowledgments

7 Appendix

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases