Author manuscript; available in PMC: 2014 Feb 19.
Published in final edited form as: Ann Stat. 2013 Feb 1;41(1):269–295. doi: 10.1214/12-AOS1073

WEIGHTED LIKELIHOOD ESTIMATION UNDER TWO-PHASE SAMPLING

Takumi Saegusa 1,2, Jon A Wellner 1,2
PMCID: PMC3929280  NIHMSID: NIHMS496004  PMID: 24563559

Abstract

We develop asymptotic theory for weighted likelihood estimators (WLE) under two-phase stratified sampling without replacement. We also consider several variants of WLEs involving estimated weights and calibration. A set of empirical process tools is developed, including a Glivenko–Cantelli theorem, a theorem for rates of convergence of M-estimators, and a Donsker theorem for the inverse probability weighted empirical processes under two-phase sampling and sampling without replacement at the second phase. Using these general results, we derive asymptotic distributions of the WLE of a finite-dimensional parameter in a general semiparametric model where an estimator of a nuisance parameter is estimable either at regular or nonregular rates. We illustrate these results and methods in the Cox model with right censoring and interval censoring. We compare the methods via their asymptotic variances under both sampling without replacement and the more usual (and easier to analyze) assumption of Bernoulli sampling at the second phase.

Key words and phrases: Calibration, estimated weights, weighted likelihood, semiparametric model, regular, nonregular

1. Introduction

Two-phase sampling is a sampling technique that aims at cost reduction and improved efficiency of estimation. At phase I, a large sample is drawn from a population, and information on variables that are easier to measure is collected. These phase I variables may be important variables such as exposure in a regression model, or simply may be auxiliary variables correlated with variables unavailable at phase I. The sample space is then stratified based on these phase I variables. At phase II, a subsample is drawn without replacement from each stratum to obtain phase II variables that are costly or difficult to measure. Strata formation seeks either to oversample subjects with important phase I variables, or to effectively sample subjects with targeted phase II variables, or both. In this way, two-phase sampling achieves effective access to important variables at lower cost.

While two-phase sampling was originally introduced in survey sampling by Neyman [20] for estimation of the “finite population mean” of some variable, it has become increasingly important in a variety of areas of statistics, biostatistics and epidemiology, especially since [22, 33] and [27].

The setting treated here is as follows:

  • We begin with a semiparametric model 𝒫 for a vector of variables X with values in 𝒳. [The prime examples which we treat in detail in Section 4 are the Cox proportional hazards regression model with (a) right censoring, and (b) interval censoring.]

  • Let W = (X, U) ∈ 𝒳 × 𝒰 ≡ 𝒲 where U is a vector of “auxiliary variables,” not involved in the model 𝒫. Suppose that W ~ P̃0 and X ~ P0. Now suppose that V ≡ (X̃, U) ∈ 𝒱 where X̃ = X̃(X) is a coarsening of X.

  • At phase I we observe V1, …, VN i.i.d. as V, and then use the phase I data to form strata, that is, disjoint subsets 𝒱1, …, 𝒱J of 𝒱 with 𝒱1 ∪ ⋯ ∪ 𝒱J = 𝒱. We let Nj ≡ #{i ≤ N : Vi ∈ 𝒱j}.

  • Next, a phase II sample is drawn by sampling without replacement nj ≤ Nj items from stratum j. For the items selected we observe Xi. Thus for the selection indicators ξi we have P̃0(ξi = 1 | Vi) = (nj/Nj)1𝒱j(Vi) ≡ π0(Vi).

  • Finally weighted likelihood (or inverse probability weighted) estimation methods based on all the observed data are used to estimate the parameters of the model 𝒫 and to make further inferences about the model.
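The design described in the bullets above can be sketched in a short simulation; the stratum cut-points, sampling fractions, and sample sizes below are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Phase I: draw N i.i.d. copies of V; here V is a scalar auxiliary variable
# (a stand-in for (X~, U)) and J = 3 strata are formed by cutting its range.
N, J = 1000, 3
V = rng.normal(size=N)
strata = np.digitize(V, bins=[-0.5, 0.5])       # stratum label in {0, 1, 2}
N_j = np.array([(strata == j).sum() for j in range(J)])

# Phase II: sample n_j = [N_j p_j] without replacement from stratum j.
p = np.array([0.1, 0.5, 0.9])                   # design sampling fractions p_j
n_j = np.floor(N_j * p).astype(int)
xi = np.zeros(N, dtype=bool)                    # selection indicators xi_i
for j in range(J):
    idx = np.flatnonzero(strata == j)
    xi[rng.choice(idx, size=n_j[j], replace=False)] = True

# pi_0(V_i) = n_j/N_j for V_i in stratum j, known by design.
pi0 = (n_j / N_j)[strata]
```

Phase II variables X would then be collected only for the subjects with ξi = 1 and weighted by 1/π0(Vi).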

It is now well known that the classical Horvitz–Thompson estimators [9] use only the phase II data and are inefficient, sometimes quite severely so; see, for example, [2, 3, 14, 23] and [34]. Improvements in efficiency of estimation can be achieved by “adjusting” the weights by use of the phase I data (even though the sampling probabilities are known). Two basic methods of adjustment are:

  1. Estimated weights, a method originating in the missing data literature [23], and with significant further developments since in connection with many models in which the missingness mechanism is not known, in contrast to our current setting in which the missingness is by design.

  2. Calibration, a method originating in the sample survey literature [8]; see also [13, 14].

One of our goals here is to study existing methods for adjustment of the weights of weighted likelihood methods and to introduce several new methods: modified calibration as suggested by Chan [6] and centered calibration as proposed here in Section 2.

A second goal is to give a systematic treatment of estimators based on sampling without replacement at phase II in the setting of general semiparametric models and to make comparisons with the behavior of estimators based on Bernoulli (or independent) sampling at phase II, thus continuing and strengthening the comparisons made in [4, 5], and [2, 3] for a particular sub-class of semiparametric models and adjustments via estimated weights and ordinary calibration. Many studies of the theoretical properties of procedures based on two-phase design data have been made for the case of Bernoulli sampling; see, for example, [11] and the review of case-cohort sampling given there. On the other hand, while statistical practice continues to involve phase II data sampled without replacement, most available theory in this case (other than [4, 5]) has developed on a model-by-model basis. As has become clear from [4, 5], sampling without replacement results in smaller asymptotic variances, and hence inference based on asymptotic variances derived from Bernoulli sampling will often be conservative. Our treatment here provides theory and tools for dealing directly with the sampling without replacement design. We do this by providing the relevant theory both for semiparametric models in which the infinite-dimensional nuisance parameters can be estimated at a regular rate (√n) with complete data, and semiparametric models in which the infinite-dimensional nuisance parameters can only be estimated at slower (nonregular) rates.

The main contributions of our paper are three-fold: First, we establish two Z-theorems giving weak sufficient conditions for asymptotic distributions of the WLEs in general semiparametric models. The first theorem covers the case where the nuisance parameter is estimable at a regular rate; this yields rigorous justification of [2, 3] under weaker conditions. The second theorem covers the case of general semiparametric models with nonregular rates for estimators of the nuisance parameters. The conditions of our theorems, formulated in terms of complete data, are almost identical to those for the MLE with complete data. This formulation allows us to use tools from empirical process theory together with the new tools developed here in a straightforward way. Second, we propose centered calibration, a new calibration method. This new calibration method is the only one guaranteed to yield improved efficiency over the plain WLE under both Bernoulli sampling and sampling without replacement, while other methods are warranted only for Bernoulli sampling. Third, we establish general results for the inverse probability weighted (IPW) empirical process. Some results such as a Glivenko–Cantelli theorem (Theorem 5.1) and a Donsker theorem (Theorem 5.3) are of interest in their own right. These results accounting for dependence due to the sampling design are useful in verifying the conditions of Z-theorems in applications. For instance, Theorems 5.1 and 5.2 easily establish consistency and rates of convergence under our “without replacement” sampling scheme. We illustrate application of the general results with examples in Section 4.

The rest of the paper is organized as follows. In Section 2, we introduce our estimation procedures in the context of a general semiparametric model. The WLE and methods involving adjusted weights are discussed. Two Z-theorems are presented in Section 3; these yield asymptotic distributions of the WLEs of finite-dimensional parameters of the model. All estimators are compared under Bernoulli sampling and sampling without replacement with different methods of adjusting weights. In Section 4 we apply our Z-theorems to the Cox model with both right censoring and interval censoring. Section 5 consists of general results for IPW empirical processes. Several open problems are briefly discussed in Section 6. All proofs, except those in Section 4 and auxiliary results, are collected in [25].

2. Sampling, models and estimators

We use the basic notation introduced in the previous section. After stratified sampling, X is fully observed for nj subjects in the jth stratum at phase II. The observed data is (V, Xξ, ξ) where ξ is the indicator of being sampled at phase II. We use a doubly subscripted notation: for example, Vj,i denotes V for the ith subject in stratum j. We denote the stratum probability for the jth stratum by νj ≡ P̃0(V ∈ 𝒱j), and the conditional expectation given membership in the jth stratum by P0|j(·) ≡ P̃0(· | V ∈ 𝒱j).

The sampling probability is P(ξ = 1 | Vi) = π0(Vi) = nj/Nj for Vi ∈ 𝒱j. These sampling probabilities are assumed to be strictly positive; that is, there is a constant σ > 0 such that 0 < σ ≤ π0(υ) ≤ 1 for υ ∈ 𝒱. We assume that nj/Nj → pj > 0 for j = 1, …, J as N → ∞. Although dependence is induced among the observations (Vi, ξiXi, ξi) by the sampling indicators, the vector of sampling indicators (ξj,1, …, ξj,Nj) within stratum j is exchangeable for each j = 1, …, J, and the J random vectors (ξj,1, …, ξj,Nj) are independent.

The empirical measure is one of the most useful tools in empirical process theory. Because the Xi’s are observed only for a sub-sample at phase II, we define, instead, the IPW empirical measure Nπ by

\[ \mathbb{P}_N^{\pi} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_0(V_i)}\delta_{X_i} = \frac{1}{N}\sum_{j=1}^{J}\sum_{i=1}^{N_j}\frac{\xi_{j,i}}{n_j/N_j}\delta_{X_{j,i}}, \]

where δXi denotes a Dirac measure placing unit mass on Xi. The identity in the last display is justified by the arguments in Appendix A of [4]. We also define the IPW empirical process by 𝔾Nπ ≡ √N(ℙNπ − P0) and the phase II empirical process for the jth stratum by 𝔾j,Njξ ≡ √Nj(ℙj,Njξ − (nj/Nj)ℙj,Nj), j = 1, …, J, where ℙj,Njξ ≡ Nj−1 Σi=1Nj ξj,i δXj,i is the phase II empirical measure for the jth stratum, and ℙj,Nj ≡ Nj−1 Σi=1Nj δXj,i is the empirical measure for all the data in the jth stratum; note that the latter empirical measure is not observed. Then, following [4], we decompose 𝔾Nπ as follows:

\[ \mathbb{G}_N^{\pi} = \mathbb{G}_N + \sum_{j=1}^{J}\sqrt{\frac{N_j}{N}}\,\frac{N_j}{n_j}\,\mathbb{G}_{j,N_j}^{\xi}, \] (2.1)

where ℙN = N−1 Σj=1J Nj ℙj,Nj and 𝔾N = √N(ℙN − P0). Notice that the 𝔾j,Njξ correspond to “exchangeably weighted bootstrap” versions of the stratum-wise complete data empirical processes 𝔾j,Nj ≡ √Nj(ℙj,Nj − P0|j). This observation allows application of the “exchangeably weighted bootstrap” theory of [21] and [32], Section 3.6.
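As a toy numerical check of the definitions above (with illustrative strata and sampling fractions), note that ℙNπ applied to the constant function 1 equals 1 exactly under sampling without replacement, since stratum j contributes nj · (Nj/nj)/N = Nj/N:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-phase data: stratum labels, then an exact without-replacement draw.
N, J = 600, 3
strata = rng.integers(0, J, size=N)
X = rng.normal(loc=strata, size=N)              # in practice X is seen only when xi = 1
N_j = np.bincount(strata, minlength=J)
n_j = np.floor(N_j * np.array([0.2, 0.5, 0.8])).astype(int)
xi = np.zeros(N, dtype=bool)
for j in range(J):
    idx = np.flatnonzero(strata == j)
    xi[rng.choice(idx, size=n_j[j], replace=False)] = True
pi0 = (n_j / N_j)[strata]

def Pn_pi(f):
    """IPW empirical measure: P_N^pi f = N^{-1} sum_i xi_i f(X_i) / pi_0(V_i)."""
    return np.where(xi, f(X) / pi0, 0.0).mean()

total_mass = Pn_pi(lambda x: np.ones_like(x))   # equals 1 up to rounding error
ipw_mean = Pn_pi(lambda x: x)                   # IPW estimate of P_0 X
```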

2.1. Improving efficiency by adjusting weights

Efficiency of estimators based on IPW empirical processes can be improved by adjusting weights, either by estimated weights [23] or by calibration [8] via use of the phase I data; see also [14]. Besides these, we discuss two variants of calibration, modified calibration [6], and our proposed new method, centered calibration.

Let Zi ≡ g(Vi) be the auxiliary variables for the ith subject for a known transformation g. For estimated weights with binary regression, Zi contains the membership indicators for the strata, 1𝒱j(Vi), j = 1, …, J. Observations with π0(V) = 1 are dropped from the binary regression, and the original weight 1 is used. For notational simplicity, we write Zi for either method, and assume that sampling probabilities are strictly less than 1 for all strata.

2.1.1. Estimated weights

The method of estimated weights adjusts weights through binary regression on the phase I variables. The sampling probability for the ith subject is modeled by pα(ξi | Zi) = Ge(ZiTα)ξi(1 − Ge(ZiTα))1−ξi ≡ πα(Vi)ξi{1 − πα(Vi)}1−ξi, where α ∈ 𝒜e ⊂ ℝJ+k is a regression parameter and Ge : ℝ ↦ [0, 1] is a known function. If Ge(x) = ex/(1 + ex), for instance, then the adjustment simply involves logistic regression. Let α̂N be the estimator of α that maximizes the pseudo- (or composite) likelihood

\[ \prod_{i=1}^{N} p_\alpha(\xi_i \mid Z_i) = \prod_{i=1}^{N} G_e(Z_i^T\alpha)^{\xi_i}\{1 - G_e(Z_i^T\alpha)\}^{1-\xi_i}. \] (2.2)

We define the IPW empirical measure with estimated weights by

\[ \mathbb{P}_N^{\pi,e} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_{\hat\alpha_N}(V_i)}\delta_{X_i} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_0(V_i)}\,\frac{\pi_0(V_i)}{G_e(Z_i^T\hat\alpha_N)}\,\delta_{X_i}, \]

and the IPW empirical process with estimated weights by 𝔾Nπ,e ≡ √N(ℙNπ,e − P0).
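The pseudo-likelihood (2.2) with logistic Ge is just a binary regression of ξ on Z, so α̂N can be computed by Newton–Raphson. The data-generating step below is an illustrative stand-in (Bernoulli ξ rather than the without-replacement design), since the fitting step is the same pseudo-likelihood either way.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative phase I data: an intercept plus one auxiliary variable.
# (Toy setup; in the paper's design xi comes from without-replacement sampling.)
N = 2000
Z = np.column_stack([np.ones(N), rng.normal(size=N)])
alpha_true = np.array([-0.5, 1.0])
xi = rng.random(N) < 1.0 / (1.0 + np.exp(-Z @ alpha_true))

def fit_pseudo_likelihood(Z, xi, iters=25):
    """Newton-Raphson maximizer of (2.2) with logistic G_e(x) = e^x/(1 + e^x)."""
    alpha = np.zeros(Z.shape[1])
    for _ in range(iters):
        pr = 1.0 / (1.0 + np.exp(-Z @ alpha))
        grad = Z.T @ (xi - pr)                  # score of the log pseudo-likelihood
        hess = (Z * (pr * (1.0 - pr))[:, None]).T @ Z
        alpha += np.linalg.solve(hess, grad)
    return alpha

alpha_hat = fit_pseudo_likelihood(Z, xi)
pi_hat = 1.0 / (1.0 + np.exp(-Z @ alpha_hat))   # pi_alpha_hat(V_i) = G_e(Z_i^T alpha_hat)
w_e = xi / pi_hat                               # estimated weights entering P_N^{pi,e}
```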

2.1.2. Calibration

Calibration adjusts weights so that the inverse probability weighted average from the phase II sample is equated to the phase I average, whereby the phase I information is taken into account for estimation. Specifically, we find an estimator α̂N that is the solution for α ∈ 𝒜c ⊂ ℝk of the following calibration equation:

\[ \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i\,G_c(V_i;\alpha)}{\pi_0(V_i)}Z_i = \frac{1}{N}\sum_{i=1}^{N}Z_i, \] (2.3)

where Gc(V ; α) ≡ G(g(V)T α) = G(ZT α) for known G with G(0) = 1 and Ġ(0) > 0. We call πα(V) ≡ π0(V)/Gc(V ; α) the calibrated sampling probability. We define the calibrated IPW empirical measure by

\[ \mathbb{P}_N^{\pi,c} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_{\hat\alpha_N}(V_i)}\delta_{X_i} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_0(V_i)}G(Z_i^T\hat\alpha_N)\,\delta_{X_i} \]

and the calibrated IPW empirical process by 𝔾Nπ,c ≡ √N(ℙNπ,c − P0).

Examples for G in the definition of Gc are listed in [8] (F in their notation). For G(x) = 1 + x, ℙNπ,cX is a well-known regression estimator of the mean of X. Since we assume boundedness of G later, we may want to consider truncated versions of these examples instead. Note that the choice of G in (variants of) calibration does not affect asymptotic results on WLEs.

As noted in [13], there are several different approaches to calibration. Here, and in introducing variants of calibration below, we adopt the view that calibration proceeds by making the smallest possible change in the weights in order to match an estimated phase II average with the corresponding phase I average. Another approach proceeds via regression modeling of the variable X of interest on the auxiliary variables V, leading to a discussion of robustness: the effect of the validity of that model on estimation for X. We prefer the former view because we do not assume a model for X and V anywhere in this paper. In fact, our results are independent of such a modeling assumption.
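For the choice G(x) = 1 + x mentioned above, the calibration equation (2.3) is linear in α and can be solved in closed form; the data below are an illustrative stand-in with known π0 and Bernoulli ξ.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: known sampling probabilities pi_0 and auxiliary Z = g(V).
N = 1000
Z = np.column_stack([rng.normal(size=N), rng.normal(size=N) + 1.0])
pi0 = np.where(Z[:, 1] > 1.0, 0.8, 0.3)         # known by design (toy choice)
xi = rng.random(N) < pi0

# With G(x) = 1 + x the calibration equation (2.3) reads
#   (1/N) sum_i xi_i (1 + Z_i^T alpha) Z_i / pi_0(V_i) = (1/N) sum_i Z_i,
# which is linear in alpha: A alpha = b.
A = (Z * (xi / pi0)[:, None]).T @ Z / N
b = Z.mean(axis=0) - (Z * (xi / pi0)[:, None]).sum(axis=0) / N
alpha_hat = np.linalg.solve(A, b)

w_c = xi * (1.0 + Z @ alpha_hat) / pi0          # calibrated weights in P_N^{pi,c}
```

By construction the calibrated weighted average of Z matches the phase I average exactly.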

2.1.3. Modified calibration

Modifying the function Gc in calibration so that individuals with higher sampling probabilities π(Vi) receive less weight was proposed by [6] in a missing response problem where observations are i.i.d. (see, e.g., [28] for recent developments in this area and [14] for their connections with calibration methods). An interpretation of this method within the framework of [8] is discussed in [26]. In modified calibration, we find the estimator α̂N that is the solution for α ∈ 𝒜mc ⊂ ℝk of the following calibration equation:

\[ \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i\,G_{mc}(V_i;\alpha)}{\pi_0(V_i)}Z_i = \frac{1}{N}\sum_{i=1}^{N}Z_i, \] (2.4)

where Gmc(V ; α) ≡ G((π0(V)−1 − 1) ZT α) for known G with G(0) = 1 and Ġ(0) > 0. We call πα(V) ≡ π0(V)/Gmc(V ; α) the calibrated sampling probability with modified calibration. We define the IPW empirical measure with modified calibration by

\[ \mathbb{P}_N^{\pi,mc} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_{\hat\alpha_N}(V_i)}\delta_{X_i} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_0(V_i)}\,G\!\left(\frac{1-\pi_0(V_i)}{\pi_0(V_i)}Z_i^T\hat\alpha_N\right)\delta_{X_i} \]

and the corresponding IPW empirical process by 𝔾Nπ,mc ≡ √N(ℙNπ,mc − P0).

2.1.4. Centered calibration

We propose a new method, centered calibration, that calibrates on centered auxiliary variables with modified calibration. This method improves on the plain WLE under our sampling scheme, while retaining the good properties of modified calibration. See Section 3.4 for a discussion of its advantages and connections to the other methods.

In centered calibration, we find the estimator α̂N that is the solution for α ∈ 𝒜cc ⊂ ℝk of the following calibration equation:

\[ \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i\,G_{cc}(V_i;\alpha)}{\pi_0(V_i)}(Z_i - \bar{Z}_N) = 0, \] (2.5)

where Gcc(V; α) ≡ G((π0(V)−1 − 1)(Z − Z̄N)Tα) for known G with G(0) = 1 and Ġ(0) > 0, and Z̄N ≡ N−1 Σi=1N Zi. We call πα(V) ≡ π0(V)/Gcc(V; α) the calibrated sampling probability with centered calibration. We define the IPW empirical measure with centered calibration by

\[ \mathbb{P}_N^{\pi,cc} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_{\hat\alpha_N}(V_i)}\delta_{X_i} = \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i}{\pi_0(V_i)}\,G_{cc}(V_i;\hat\alpha_N)\,\delta_{X_i} \]

and the corresponding IPW empirical process by 𝔾Nπ,cc ≡ √N(ℙNπ,cc − P0).
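With G(x) = 1 + x, the centered calibration equation (2.5) is again linear in α; a minimal sketch on illustrative data (known π0, Bernoulli ξ for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data with known pi_0 and auxiliary Z = g(V).
N = 1000
Z = rng.normal(size=(N, 2)) + np.array([0.0, 2.0])
pi0 = np.where(Z[:, 0] > 0.0, 0.7, 0.4)
xi = rng.random(N) < pi0
Zc = Z - Z.mean(axis=0)                          # centered auxiliaries Z_i - Zbar_N

# With G(x) = 1 + x, G_cc(V; alpha) = 1 + (pi_0(V)^{-1} - 1)(Z - Zbar_N)^T alpha,
# and the centered calibration equation (2.5) is linear: A alpha = -b.
c = (1.0 / pi0 - 1.0) * (xi / pi0)
A = (Zc * c[:, None]).T @ Zc / N
b = (Zc * (xi / pi0)[:, None]).sum(axis=0) / N
alpha_hat = np.linalg.solve(A, -b)

G_cc = 1.0 + (1.0 / pi0 - 1.0) * (Zc @ alpha_hat)
w_cc = xi * G_cc / pi0                           # centered-calibrated weights in P_N^{pi,cc}
```

The resulting weights satisfy (2.5) exactly: the weighted average of the centered auxiliaries is zero.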

2.2. Estimators for a semiparametric model 𝒫

We study the asymptotic distribution of the weighted likelihood estimator of a finite-dimensional parameter θ in a general semiparametric model 𝒫 = {Pθ,η : θ ∈ Θ, η ∈ H} where Θ ⊂ ℝp and the nuisance parameter space H is a subset of some Banach space ℬ. Let P0 = Pθ0,η0 denote the true distribution.

The MLE for complete data is often obtained as a solution to the infinite-dimensional likelihood equations. In such models, the WLE under two-phase sampling is obtained by solving the corresponding infinite-dimensional inverse probability weighted likelihood equations. Specifically, the WLE (θ̂N, η̂N) is a solution to the following weighted likelihood equations:

\[ \Psi_{N,1}^{\pi}(\theta,\eta) = \mathbb{P}_N^{\pi}\dot{\ell}_{\theta,\eta} = o_{P^*}(N^{-1/2}),\qquad \Psi_{N,2}^{\pi}(\theta,\eta)h = \mathbb{P}_N^{\pi}(B_{\theta,\eta}h - P_{\theta,\eta}B_{\theta,\eta}h) = o_{P^*}(N^{-1/2}), \] (2.6)

where ℓ̇θ,η ∈ L20(Pθ,η)p is the score function for θ, and the score operator Bθ,η : ℋ → L20(Pθ,η) is the bounded linear operator mapping a direction h in some Hilbert space ℋ of one-dimensional submodels for η along which η → η0. The WLE with estimated weights (θ̂N,e, η̂N,e), the calibrated WLE (θ̂N,c, η̂N,c), the WLE with modified calibration (θ̂N,mc, η̂N,mc) and the WLE with centered calibration (θ̂N,cc, η̂N,cc) are obtained by replacing ℙNπ by ℙNπ,# with # ∈ {e, c, mc, cc} in (2.6), respectively. Let ℓ̇0 = ℓ̇θ0,η0 and B0 = Bθ0,η0.

3. Asymptotics for the WLE in general semiparametric models

We consider two cases: in the first case the nuisance parameter η is estimable at a regular (i.e., √n) rate and, for ease of exposition, η is assumed to be a measure. In the second case η is only estimable at a nonregular (slower than √n) rate. Our theorem (Theorem 3.2) concerning the second case nearly covers the former case, but requires slightly more smoothness and a separate proof of the rate of convergence for an estimator of η. On the other hand, our theorem (Theorem 3.1) concerning the former case includes a proof of the regular √n rate of convergence, and hence is of interest in itself.

3.1. Conditions for adjusting weights

We assume the following conditions for the estimators of α used in the adjusted weights. Throughout this paper, Conditions 3.1 and 3.2 may be assumed simultaneously, but it should be understood that the former is used exclusively for estimators based on estimated weights and the latter is imposed only for estimators based on (variants of) calibration. Also, it should be understood that Conditions 3.2(a)(i) and (d)(i), Conditions 3.2(a)(ii) and (d)(ii), and Conditions 3.2(a)(iii) and (d)(iii) are assumed for estimators defined via calibration, modified calibration and centered calibration, respectively.

Condition 3.1 (Estimated weights). (a) The estimator α̂N is a maximizer of the pseudo-likelihood (2.2).

  • (b)

    Z ∈ ℝJ+k is not concentrated on a (J + k)-dimensional affine space of ℝJ+k and has bounded support.

  • (c)

    Ge : ℝ ↦ [0, 1] is a twice continuously differentiable, monotone function.

  • (d)

    S0 ≡ P0[{Ġe(ZTα0)}2{π0(V)(1 − π0(V))}−1Z⊗2] is finite and nonsingular, where Ġe is the derivative of Ge.

  • (e)

    The “true” parameter α0 = (α0,1, …, α0,J+k) is given by α0,j = Ge−1(pj) for j = 1, …, J and α0,j = 0 for j = J + 1, …, J + k. The parameter α is identifiable; that is, pα = pα0 almost surely implies α = α0.

  • (f)

    For a fixed pj ∈ (0, 1), nj satisfies nj = [Njpj] for j = 1, …, J.

Condition 3.2 (Calibrations). (a) (i) The estimator α̂N = α̂Nc is a solution of calibration equation (2.3). (ii) The estimator α̂N = α̂Nmc is a solution of calibration equation (2.4). (iii) The estimator α̂N = α̂Ncc is a solution of calibration equation (2.5).

  • (b)

    Z ∈ ℝk is not concentrated at 0 and has bounded support.

  • (c)

    G is a strictly increasing continuously differentiable function on ℝ such that G(0) = 1 and, for all x, −∞ < m1 ≤ G(x) ≤ M1 < ∞ and 0 < Ġ(x) ≤ M2 < ∞, where Ġ is the derivative of G.

  • (d)

    (i) P0Z⊗2 is finite and positive definite. (ii) P0[π0(V)−1(1 − π0(V))Z⊗2] is finite and positive definite. (iii) P0[π0(V)−1(1 − π0(V))(Z − μZ)⊗2] is finite and positive definite, where μZ = P0Z.

  • (e)

    The “true” parameter α0 = 0.

Condition 3.1(f) may seem unnatural at first, but in practice the phase II sample sizes nj are chosen by the investigator, so the sampling probabilities pj can be regarded as automatically chosen to satisfy nj = [Njpj]. The other parts of Condition 3.1 are standard in binary regression, and Condition 3.2 is similar to Condition 3.1.

Asymptotic properties of α̂N for all methods are proved in [25].

3.2. Regular rate for a nuisance parameter

We assume the following conditions.

Condition 3.3 (Consistency). The estimator (θ̂N, η̂N) is consistent for (θ0, η0) and solves the weighted likelihood equations (2.6), where ℙNπ may be replaced by ℙNπ,# with # ∈ {e, c, mc, cc} for the estimators with adjusted weights.

Condition 3.4 (Asymptotic equicontinuity). Let ℱ1(δ) = {ℓ̇θ,η : |θ − θ0| + ∥η − η0∥ < δ} and ℱ2(δ) = {Bθ,ηh − Pθ,ηBθ,ηh : h ∈ ℋ, |θ − θ0| + ∥η − η0∥ < δ}. There exists a δ0 > 0 such that (1) ℱk(δ0), k = 1, 2, are P0-Donsker and suph∈ℋ P0|fj − f0,j|2 → 0 as |θ − θ0| + ∥η − η0∥ → 0, for every fj ∈ ℱj(δ0), j = 1, 2, where f0,1 = ℓ̇θ0,η0 and f0,2 = B0h − P0B0h; (2) ℱk(δ0), k = 1, 2, have integrable envelopes.

Condition 3.5. The map Ψ = (Ψ1, Ψ2) : Θ × H ↦ ℝp × ℓ∞(ℋ) with components

\[ \Psi_1(\theta,\eta) \equiv P_0\Psi_{N,1}(\theta,\eta) = P_0\dot{\ell}_{\theta,\eta},\qquad \Psi_2(\theta,\eta)h \equiv P_0\Psi_{N,2}(\theta,\eta) = P_0(B_{\theta,\eta}h - P_{\theta,\eta}B_{\theta,\eta}h),\quad h \in \mathcal{H}, \]

has a continuously invertible Fréchet derivative map Ψ̇0 = (Ψ̇11, Ψ̇12, Ψ̇21, Ψ̇22) at (θ0, η0), given by Ψ̇ij(θ0, η0)h = P0ψ̇ij,θ0,η0,h, i, j ∈ {1, 2}, in terms of L2(P0) derivatives of ψ1,θ,η,h = ℓ̇θ,η and ψ2,θ,η,h = Bθ,ηh − Pθ,ηBθ,ηh; that is,

\[
\begin{aligned}
\sup_{h}\bigl[P_0\{\psi_{i,\theta,\eta_0,h} - \psi_{i,\theta_0,\eta_0,h} - \dot{\psi}_{i1,\theta_0,\eta_0,h}(\theta - \theta_0)\}^2\bigr]^{1/2} &= o(|\theta - \theta_0|),\\
\sup_{h}\bigl[P_0\{\psi_{i,\theta_0,\eta,h} - \psi_{i,\theta_0,\eta_0,h} - \dot{\psi}_{i2,\theta_0,\eta_0,h}(\eta - \eta_0)\}^2\bigr]^{1/2} &= o(\|\eta - \eta_0\|).
\end{aligned}
\]

Furthermore, Ψ̇0 admits a partition

\[ (\theta - \theta_0,\ \eta - \eta_0) \mapsto \begin{pmatrix}\dot{\Psi}_{11} & \dot{\Psi}_{12}\\ \dot{\Psi}_{21} & \dot{\Psi}_{22}\end{pmatrix}\begin{pmatrix}\theta - \theta_0\\ \eta - \eta_0\end{pmatrix}, \]

where

\[
\begin{aligned}
\dot{\Psi}_{11}(\theta - \theta_0) &= -P_{\theta_0,\eta_0}\dot{\ell}_{\theta_0,\eta_0}\dot{\ell}_{\theta_0,\eta_0}^T(\theta - \theta_0),\\
\dot{\Psi}_{12}(\eta - \eta_0) &= -\int B_{\theta_0,\eta_0}^*\dot{\ell}_{\theta_0,\eta_0}\,d(\eta - \eta_0),\\
\dot{\Psi}_{21}(\theta - \theta_0)h &= -P_{\theta_0,\eta_0}B_{\theta_0,\eta_0}h\,\dot{\ell}_{\theta_0,\eta_0}^T(\theta - \theta_0),\\
\dot{\Psi}_{22}(\eta - \eta_0)h &= -\int B_{\theta_0,\eta_0}^*B_{\theta_0,\eta_0}h\,d(\eta - \eta_0)
\end{aligned}
\]

and Bθ0,η0*Bθ0,η0 is continuously invertible.

Let Ĩ0 = P0[(I − B0(B0*B0)−1B0*)ℓ̇0ℓ̇0T] be the efficient information for θ and ℓ̃0 = Ĩ0−1(I − B0(B0*B0)−1B0*)ℓ̇0 be the efficient influence function for θ for the semiparametric model with complete data.

Theorem 3.1. Under Conditions 3.1–3.5,

\[
\begin{aligned}
\sqrt{N}(\hat\theta_N - \theta_0) &= \sqrt{N}\,\mathbb{P}_N^{\pi}\tilde{\ell}_0 + o_{P^*}(1) \rightsquigarrow Z \sim N_p(0, \Sigma),\\
\sqrt{N}(\hat\theta_{N,\#} - \theta_0) &= \sqrt{N}\,\mathbb{P}_N^{\pi,\#}\tilde{\ell}_0 + o_{P^*}(1) \rightsquigarrow Z_{\#} \sim N_p(0, \Sigma_{\#}),
\end{aligned}
\]

where # ∈ {e, c, mc, cc},

\[ \Sigma \equiv I_0^{-1} + \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}\operatorname{Var}_{0|j}(\tilde{\ell}_0), \] (3.1)
\[ \Sigma_{\#} \equiv I_0^{-1} + \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}\operatorname{Var}_{0|j}\bigl((I - Q_{\#})\tilde{\ell}_0\bigr) \] (3.2)

and (recall Conditions 3.1 and 3.2)

\[
\begin{aligned}
Q_e f &\equiv P_0\bigl[\pi_0^{-1}(V)f\,\dot{G}_e(Z^T\alpha_0)Z^T\bigr]S_0^{-1}\{1 - \pi_0(V)\}^{-1}\dot{G}_e(Z^T\alpha_0)Z,\\
Q_c f &\equiv P_0[fZ^T]\{P_0Z^{\otimes 2}\}^{-1}Z,\\
Q_{mc} f &\equiv P_0\bigl[(\pi_0^{-1}(V) - 1)fZ^T\bigr]\bigl\{P_0\bigl[(\pi_0^{-1}(V) - 1)Z^{\otimes 2}\bigr]\bigr\}^{-1}Z,\\
Q_{cc} f &\equiv P_0\bigl[(\pi_0^{-1}(V) - 1)f(Z - \mu_Z)^T\bigr]\bigl\{P_0\bigl[(\pi_0^{-1}(V) - 1)(Z - \mu_Z)^{\otimes 2}\bigr]\bigr\}^{-1}(Z - \mu_Z).
\end{aligned}
\]

Remark 3.1. Our conditions in Theorem 3.1 are the same as those in [5] except the integrability condition. Our Condition 3.4(2) requires existence of integrable envelopes for the classes of scores, while condition (A1*) in [5] requires square-integrable envelopes. Note that this integrability condition is required only for the WLE with adjusted weights, as in [4].

Remark 3.2. As can be seen from the definition of Q#, the choice of G in calibration does not affect the asymptotic variances while Ge in the method of estimated weights does affect the asymptotic variance.

3.3. Nonregular rate for a nuisance parameter

For h = (h1, …, hp)T with hk ∈ H, k = 1, …, p, let Bθ,η[h] = (Bθ,ηh1, …, Bθ,ηhp)T. We assume the following conditions.

Condition 3.6 (Consistency and rate of convergence). An estimator (θ̂N,η̂N) of (θ00) satisfies |θ̂N − θ0| = oP (1), and ∥η̂N − η0∥ = OP (N−β) for some β > 0.

Condition 3.7 (Positive information). There is an h̄* = (h1*, …, hp*), where hk* ∈ H for k = 1, …, p, such that

\[ P_0\{(\dot{\ell}_0 - B_0[\bar{h}^*])B_0 h\} = 0\qquad \text{for all } h \in H. \]

The efficient information I0 ≡ P0(ℓ̇0 − B0[h̄*])⊗2 for θ for the semiparametric model with complete data is finite and nonsingular. Denote the efficient influence function for the semiparametric model with complete data by ℓ̃0 ≡ I0−1(ℓ̇0 − B0[h̄*]).

Condition 3.8 (Asymptotic equicontinuity). (1) For any δN ↓ 0 and C > 0,

\[
\begin{aligned}
\sup_{|\theta - \theta_0| \le \delta_N,\ \|\eta - \eta_0\| \le C N^{-\beta}}\bigl|\mathbb{G}_N(\dot{\ell}_{\theta,\eta} - \dot{\ell}_0)\bigr| &= o_P(1),\\
\sup_{|\theta - \theta_0| \le \delta_N,\ \|\eta - \eta_0\| \le C N^{-\beta}}\bigl|\mathbb{G}_N(B_{\theta,\eta} - B_0)[\bar{h}^*]\bigr| &= o_P(1).
\end{aligned}
\]

(2) There exists a δ > 0 such that the classes {ℓ̇θ,η : |θ − θ0| + ∥η − η0∥ ≤ δ} and {Bθ,η[h̄*] : |θ − θ0| + ∥η − η0∥ ≤ δ} are P0-Glivenko–Cantelli and have integrable envelopes. Moreover, ℓ̇θ,η and Bθ,η[h̄*] are continuous with respect to (θ, η) either pointwise or in L1(P0).

Condition 3.9 (Smoothness of the model). For some α > 1 satisfying αβ > 1/2 and for (θ, η) in the neighborhood {(θ, η) : |θ − θ0| ≤ δN, ∥η − η0∥ ≤ CN−β},

\[
\begin{aligned}
\bigl|P_0\{\dot{\ell}_{\theta,\eta} - \dot{\ell}_0 + \dot{\ell}_0(\dot{\ell}_0^T(\theta - \theta_0) + B_0[\eta - \eta_0])\}\bigr| &= o(|\theta - \theta_0|) + O(\|\eta - \eta_0\|^{\alpha}),\\
\bigl|P_0\{(B_{\theta,\eta} - B_0)[\bar{h}^*] + B_0[\bar{h}^*](\dot{\ell}_0^T(\theta - \theta_0) + B_0[\eta - \eta_0])\}\bigr| &= o(|\theta - \theta_0|) + O(\|\eta - \eta_0\|^{\alpha}).
\end{aligned}
\]

In the previous section, we required that the WLE solves the weighted likelihood equations (2.6) for all h ∈ ℋ. Here, we only assume that the WLE (θ̂N, η̂N) satisfies the weighted likelihood equations

\[ \Psi_{N,1}^{\pi}(\theta,\eta,\alpha) = \mathbb{P}_N^{\pi}\dot{\ell}_{\theta,\eta} = o_{P^*}(N^{-1/2}),\qquad \Psi_{N,2}^{\pi}(\theta,\eta,\alpha)[\bar{h}^*] = \mathbb{P}_N^{\pi}B_{\theta,\eta}[\bar{h}^*] = o_{P^*}(N^{-1/2}). \] (3.3)

The corresponding WLEs with adjusted weights, (θ̂N,#, η̂N,#) with # ∈ {e, c, mc, cc}, satisfy (3.3) with ℙNπ replaced by ℙNπ,#.

Theorem 3.2. Suppose that the WLE is a solution of (3.3), where ℙNπ may be replaced by ℙNπ,# with # ∈ {e, c, mc, cc} for the estimators with adjusted weights. Under Conditions 3.1, 3.2 and 3.6–3.9,

\[
\begin{aligned}
\sqrt{N}(\hat\theta_N - \theta_0) &= \sqrt{N}\,\mathbb{P}_N^{\pi}\tilde{\ell}_0 + o_{P^*}(1) \rightsquigarrow Z \sim N_p(0, \Sigma),\\
\sqrt{N}(\hat\theta_{N,\#} - \theta_0) &= \sqrt{N}\,\mathbb{P}_N^{\pi,\#}\tilde{\ell}_0 + o_{P^*}(1) \rightsquigarrow Z_{\#} \sim N_p(0, \Sigma_{\#}),
\end{aligned}
\]

where Σ and Σ# are as defined in (3.1) and (3.2) of Theorem 3.1, but now I0 and ℓ̃0 are defined in Condition 3.7, and Q# are defined in Theorem 3.1.

Remark 3.3. Our conditions are identical to those of the Z-theorem of [10] except Condition 3.8(2). This additional condition is not stringent for the following reasons. First, the Glivenko–Cantelli condition is usually assumed to prove consistency of estimators before deriving asymptotic distributions. Second, a stronger L2(P0)-continuity condition is standard as is seen in Condition 3.4 (see also Section 25.8 of [31]). Note that the L1(P0)-continuity condition is only required for the WLEs with adjusted weights.

3.4. Comparisons of methods

We compare asymptotic variances of five WLEs in view of improvement by adjusting weights and change of designs. We also include in comparison special cases of adjusting weights involving stratum-wise adjustment.

3.4.1. Stratified Bernoulli sampling

We first state the result corresponding to Theorem 3.1 for stratified Bernoulli sampling, where all subjects are sampled independently with probability pj when V ∈ 𝒱j. Let θ̂NBern and θ̂N,#Bern with # ∈ {e, c, mc, cc} denote the corresponding WLE and WLEs with adjusted weights.

Theorem 3.3. Suppose Conditions 3.1 [except Condition 3.1(f)] and 3.2 hold. Let ξ1, …, ξN ∈ {0, 1} be i.i.d. with E[ξ | V] = π0(V) = Σj=1J pjI(V ∈ 𝒱j).

  1. Suppose that the WLE is a solution of (3.3), where ℙNπ may be replaced by ℙNπ,# with # ∈ {e, c, mc, cc} for the estimators with adjusted weights. Under the same conditions as in Theorem 3.1,
    \[
    \begin{aligned}
    \sqrt{N}(\hat\theta_N^{\mathrm{Bern}} - \theta_0) &= \sqrt{N}\,\mathbb{P}_N^{\pi}\tilde{\ell}_0 + o_{P^*}(1) \rightsquigarrow Z^{\mathrm{Bern}} \sim N_p(0, \Sigma^{\mathrm{Bern}}),\\
    \sqrt{N}(\hat\theta_{N,\#}^{\mathrm{Bern}} - \theta_0) &= \sqrt{N}\,\mathbb{P}_N^{\pi,\#}\tilde{\ell}_0 + o_{P^*}(1) \rightsquigarrow Z_{\#}^{\mathrm{Bern}} \sim N_p(0, \Sigma_{\#}^{\mathrm{Bern}}),
    \end{aligned}
    \]
    where
    \[ \Sigma^{\mathrm{Bern}} \equiv I_0^{-1} + \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}P_{0|j}(\tilde{\ell}_0)^{\otimes 2}, \] (3.4)
    \[ \Sigma_{\#}^{\mathrm{Bern}} \equiv I_0^{-1} + \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}P_{0|j}\bigl((I - Q_{\#})\tilde{\ell}_0\bigr)^{\otimes 2}, \] (3.5)
    where Q# with # ∈ {e, c, mc, cc} are defined in Theorem 3.1.
  2. Under the same conditions as in Theorem 3.2, the same conclusions as in (1) hold with I0 and ℓ̃0 replaced by those defined in Condition 3.7.

Comparing the variance–covariance matrices in Theorem 3.3 to those in Theorems 3.1 and 3.2, we obtain the following corollary comparing designs. All estimators have smaller variances under sampling without replacement.

Corollary 3.1. Under the same conditions as in Theorem 3.3,

\[
\begin{aligned}
\Sigma &= \Sigma^{\mathrm{Bern}} - \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}\{P_{0|j}\tilde{\ell}_0\}^{\otimes 2},\\
\Sigma_{\#} &= \Sigma_{\#}^{\mathrm{Bern}} - \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}\{P_{0|j}(I - Q_{\#})\tilde{\ell}_0\}^{\otimes 2},\qquad \# \in \{e, c, mc, cc\}.
\end{aligned}
\]

Variance formulas (3.5) with # ∈ {e, mc, cc}, that is, for all adjustments except ordinary calibration, have the following alternative representations, which show the efficiency gains over the plain WLE under Bernoulli sampling.

Corollary 3.2. Under the same conditions as in Theorem 3.3,

\[ \Sigma_{\#}^{\mathrm{Bern}} = \Sigma^{\mathrm{Bern}} - \operatorname{Var}\Bigl(\frac{\xi - \pi_0(V)}{\pi_0(V)}\,Q_{\#}\tilde{\ell}_0\Bigr),\qquad \# \in \{e, mc, cc\}. \]
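Corollary 3.2 can be illustrated by Monte Carlo for the simplest functional, the mean of X, under Bernoulli sampling: calibrating on an auxiliary variable correlated with X reduces the variance of the IPW estimator. The population, design, and degree of correlation below are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo sketch: plain IPW mean vs. calibrated IPW mean under Bernoulli
# sampling, with auxiliary Z strongly correlated with X.
N, reps = 800, 2000
plain, calib = [], []
for _ in range(reps):
    Z = rng.normal(size=N)
    X = Z + 0.3 * rng.normal(size=N)            # X strongly correlated with Z
    pi0 = np.where(Z > 0.0, 0.7, 0.3)           # Bernoulli sampling probabilities
    xi = rng.random(N) < pi0
    w = xi / pi0
    plain.append((w * X).mean())                # plain IPW (Horvitz-Thompson) mean
    # Calibration on D = (1, Z) with G(x) = 1 + x: solve the linear form of (2.3).
    D = np.column_stack([np.ones(N), Z])
    A = (D * w[:, None]).T @ D / N
    b = D.mean(axis=0) - (D * w[:, None]).sum(axis=0) / N
    wc = w * (1.0 + D @ np.linalg.solve(A, b))
    calib.append((wc * X).mean())

var_plain, var_calib = np.var(plain), np.var(calib)
```

Across replications the calibrated estimator shows the markedly smaller variance predicted by the corollary.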

3.4.2. Within-stratum adjustment of weights

Adjusting weights can be carried out within each stratum. This was proposed by Breslow et al. [2, 3] for ordinary calibration. Consider calibration on Z̃ ≡ (Z̃(1), …, Z̃(J))T with Z̃(j) ≡ I(V ∈ 𝒱j)ZT. The calibration equation (2.3) becomes

\[ \frac{1}{N}\sum_{i=1}^{N}\frac{\xi_i\,G_c(\tilde{Z}_i;\alpha)}{\pi_0(V_i)}Z_iI(V_i \in \mathcal{V}_j) = \frac{1}{N}\sum_{i=1}^{N}Z_iI(V_i \in \mathcal{V}_j),\qquad j = 1, \ldots, J, \]

where α ∈ ℝJk. We call this special case within-stratum calibration. We define within-stratum modified and centered calibration analogously.

We also call estimated weights carried out within each stratum within-stratum estimated weights. Recall that Z in estimated weights contains the membership indicators for the strata, and the rest are other auxiliary variables, say Z[2]. Within-stratum estimated weights uses Z̃ ≡ (Z̃(1), …, Z̃(J))T where Z̃(j) ≡ I(V ∈ 𝒱j)(Z[2])T with 1 included in Z[2]. The “true” parameter α̃0 has zero for all elements except Ge−1(pj) for the element corresponding to I(V ∈ 𝒱j), j = 1, …, J.

The following corollary summarizes within-stratum adjustment of weights under stratified Bernoulli sampling and sampling without replacement. All methods achieve improved efficiency over the plain WLE under Bernoulli sampling while centered calibration is the only method to yield a guaranteed improvement under sampling without replacement. This is because centering yields the L20(P0|j)-projection suitable for the conditional variances in (3.2) while noncentering results in the L2(P0|j)-projection for the conditional expectations in (3.5).

Corollary 3.3. (1) (Bernoulli) Under the same conditions as in Theorem 3.3 with Z replaced by Z̃ and α0 replaced by α̃0 for within-stratum estimated weights,

\[ \Sigma_{\#}^{\mathrm{Bern}} = \Sigma^{\mathrm{Bern}} - \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}P_{0|j}\bigl(Q_{\#}^{(j)}\tilde{\ell}_0\bigr)^{\otimes 2}, \] (3.6)

where # ∈ {e, c, mc, cc} and

\[
\begin{aligned}
Q_e^{(j)} f &\equiv P_{0|j}\bigl[f\,\dot{G}_e(\tilde{Z}^T\tilde\alpha_0)(Z^{[2]})^T\bigr]\bigl\{P_{0|j}\dot{G}_e^2(\tilde{Z}^T\tilde\alpha_0)(Z^{[2]})^{\otimes 2}\bigr\}^{-1}\dot{G}_e(\tilde{Z}^T\tilde\alpha_0)I(V \in \mathcal{V}_j)Z^{[2]},\\
Q_c^{(j)} f &\equiv P_{0|j}[fZ^T]\{P_{0|j}[Z^{\otimes 2}]\}^{-1}I(V \in \mathcal{V}_j)Z,\\
Q_{mc}^{(j)} f &\equiv Q_c^{(j)} f,\\
Q_{cc}^{(j)} f &\equiv P_{0|j}\bigl[f(Z - \mu_{Z,j})^T\bigr]\bigl\{P_{0|j}\bigl[(Z - \mu_{Z,j})^{\otimes 2}\bigr]\bigr\}^{-1}I(V \in \mathcal{V}_j)(Z - \mu_{Z,j})
\end{aligned}
\]

with μZ,j ≡ E[I(V ∈ 𝒱j)Z] for j = 1, …, J.

(2) (Without replacement) Under the same conditions as in Theorem 3.1 or 3.2 with Z replaced by Z̃,

\[ \Sigma_{cc} = \Sigma - \sum_{j=1}^{J}\nu_j\frac{1 - p_j}{p_j}\operatorname{Var}_{0|j}\bigl(Q_{cc}^{(j)}\tilde{\ell}_0\bigr). \] (3.7)

3.4.3. Comparisons

We summarize Corollaries 3.1–3.3. Every method of adjusting weights improves efficiency over the plain WLE in a certain design and with a certain range of adjustment of weights (within-stratum or “across-strata” adjustment). However, particularly notable among all methods is centered calibration. While other methods gain efficiency only under Bernoulli sampling, centered calibration improves efficiency over the plain WLE under both sampling schemes. There is no known method of “across-strata” adjustment that is guaranteed to gain efficiency over the plain WLE under stratified sampling without replacement.

There are close connections among all the methods. When the auxiliary variables have mean zero, centered and modified calibration are essentially the same. Ordinary and modified calibration give the same asymptotic variance when carried out stratum-wise. For Z and α0 defined for estimated weights, estimated weights and modified calibration based on (1 − π0(V))−1Ġe(ZTα0)Z perform in the same way. Similarly, within-stratum estimated weights with Z̃ and α̃0 is as good as within-stratum calibration based on Ġe(Z̃Tα̃0)Z̃.

As seen from these relationships among the methods, no single method is superior to the others in every situation. In fact, performance depends on the choice and transformation of auxiliary variables, the true distribution P0 and the design. For our “without replacement” sampling scheme, within-stratum centered calibration is the only method guaranteed to gain efficiency, while the other methods may perform even worse than the plain WLE.

4. Examples

For asymptotic normality of the WLEs, consistency and rates of convergence need to be established first in order to apply our Z-theorems of Section 3. To this end, the general results on IPW empirical processes discussed in the next section are useful. We illustrate this in the Cox model with right censoring and with interval censoring under two-phase sampling.

Let T ~ F be a failure time, and let X be a vector of covariates with bounded support in the regression model. The Cox proportional hazards model [7] specifies the relationship

Λ(t|x)=exp(θTx)Λ(t),

where θ ∈ Θ ⊂ ℝp is the regression parameter and Λ ∈ H is the (baseline) cumulative hazard function. Here the space H for the nuisance parameter Λ is the set of nonnegative, nondecreasing càdlàg functions on the positive half-line. The true parameters are θ0 and Λ0.
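For intuition, the model above is easy to simulate. The sketch below (hypothetical code; for concreteness it assumes the exponential baseline Λ0(t) = t) draws failure times with cumulative hazard exp(θᵀx)Λ(t) by inverting the conditional survival function.

```python
import math, random

random.seed(0)

def draw_failure_time(theta, x):
    """Draw T with cumulative hazard exp(theta'x) * Lambda(t), Lambda(t) = t:
    if E ~ Exp(1), then T = E / exp(theta'x) has survival exp(-exp(theta'x) * t)."""
    e = -math.log(1.0 - random.random())   # standard exponential draw
    return e / math.exp(sum(t * xi for t, xi in zip(theta, x)))

theta0 = [0.5, -1.0]
# A larger linear predictor means a larger hazard, hence stochastically smaller T.
sample_hi = [draw_failure_time(theta0, [2.0, 0.0]) for _ in range(4000)]  # theta'x = 1
sample_lo = [draw_failure_time(theta0, [0.0, 0.0]) for _ in range(4000)]  # theta'x = 0
mean_hi = sum(sample_hi) / len(sample_hi)   # close to exp(-1), about 0.37
mean_lo = sum(sample_lo) / len(sample_lo)   # close to 1
```

The proportional-hazards structure shows up directly: multiplying the linear predictor scales the hazard, not the failure time distribution's shape.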

In addition to X, let U be a vector of auxiliary variables collected at phase I that are correlated with the covariates X. For notational simplicity, we assume that the covariates X are observed only for subjects sampled at phase II. Thus, if some coordinates of X are available at phase I, we include identical copies of those coordinates in the vector U.

4.1. Cox model with right censored data

Under right censoring, we observe only the minimum of the failure time T and the censoring time C ~ G. Define the observed time Y = T ∧ C and the censoring indicator Δ = I(T ≤ C). The phase I data are V = (Y, Δ, U), and the observed data are (Y, Δ, ξX, U, ξ), where ξ is the sampling indicator. The theory of maximum likelihood estimation in the Cox model with complete right censored data has received several treatments; the one we follow most closely here is that of [31]. For the Cox model with case-cohort data, see [27]; for treatments of even more general designs, see [1] and [12]. Here, for both sampling without replacement and Bernoulli sampling, we continue the developments of [4, 5]. We assume the following conditions:

Condition 4.1. The finite-dimensional parameter space Θ is compact and contains the true parameter θ0 as an interior point.

Condition 4.2. The failure time T and the censoring time C are conditionally independent given X, and there is τ > 0 such that P(T > τ) > 0 and P(C ≥ τ) = P(C = τ) > 0. Both T and C have continuous conditional densities given the covariates X = x.

Condition 4.3. The covariate X has bounded support. For any measurable function h, P(X ≠ h(Y)) > 0.

Let λ(t) = (d/dt)Λ(t) be the baseline hazard function. With complete data, the density of (Y, Δ, X) is

pθ,Λ(y, δ, x) = {λ(y)eθTx exp(−Λ(y)eθTx)(1 − G)(y|x)}δ {exp(−Λ(y)eθTx)g(y|x)}1−δ pX(x),

where pX is the marginal density of X and g(·|x) is the conditional density of C given X = x. The score for θ is given by ℓ̇θ,Λ(y, δ, x) = x{δ − eθTxΛ(y)}, and the score operator Bθ,Λ : ℋ ↦ L2(Pθ,Λ), defined on the unit ball ℋ in the space BV[0, τ], is Bθ,Λh(y, δ, x) = δh(y) − eθTx ∫[0,y] h dΛ. Because the likelihood based on the density above does not yield the MLE for complete data, we define the log likelihood for one observation for complete data by ℓθ,Λ(y, δ, x) = log{(eθTxΛ{y})δ exp(−Λ(y)eθTx)}, where Λ{t} is the (point) mass of Λ at t. Then maximizing the weighted log likelihood ℙNπℓθ,Λ reduces to solving the system of equations ℙNπℓ̇θ,Λ = 0 and ℙNπBθ,Λh = 0 for every h ∈ ℋ. The efficient score for θ for complete data is given by

ℓ*θ0,Λ0(y, δ, x) = δ(x − (M1/M0)(y)) − eθ0Tx ∫[0,y] (x − (M1/M0)(t)) dΛ0(t),

and the efficient information for θ for complete data is

Ĩθ0,Λ0 = E[(ℓ*θ0,Λ0)⊗2] = E[eθ0TX ∫0τ (X − (M1/M0)(y))⊗2 (1 − G)(y|X) dΛ0(y)],

where Mk(s) = Pθ0,Λ0[Xk eθ0TX I(Y ≥ s)], k = 0, 1.
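The weighted estimating equations above can be profiled: solving ℙNπBθ,Λh = 0 for all h at fixed θ gives an IPW (Breslow-type) estimator of Λ, whose jump at each observed failure time is the weighted number of failures divided by the weighted at-risk sum. A minimal, hypothetical sketch for a scalar covariate (not the paper's code):

```python
import math

def ipw_breslow(times, deaths, x, theta, weights):
    """IPW Breslow-type baseline cumulative hazard for fixed theta:
    jump at failure time t is
        sum_i w_i * d_i * 1{y_i = t} / sum_j w_j * exp(theta * x_j) * 1{y_j >= t}.
    Returns the cumulative hazard evaluated at each distinct failure time."""
    def risk(t):
        return sum(w * math.exp(theta * xj)
                   for yj, xj, w in zip(times, x, weights) if yj >= t)

    jumps = {}
    for yi, di, wi in zip(times, deaths, weights):
        if di:
            jumps[yi] = jumps.get(yi, 0.0) + wi

    cum, out = 0.0, {}
    for t in sorted(jumps):
        cum += jumps[t] / risk(t)
        out[t] = cum
    return out

# Toy data, theta = 0, unit weights: jump 1/3 at t = 1, jump 1/2 at t = 2.
Lam = ipw_breslow([1.0, 2.0, 3.0], [1, 1, 0], [0.0, 0.0, 0.0], 0.0, [1, 1, 1])
```

With unit weights this reduces to the classical Breslow estimator; under two-phase sampling the weights are the inverse inclusion probabilities (possibly adjusted as in Section 2).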

Theorem 4.1 (Consistency). Under Conditions 3.1, 3.2, 4.1–4.3, the WLEs are consistent for0, Λ0).

Proof. This proof follows along the lines of the proof given by [29], but with the usual empirical measure replaced by the IPW empirical measure (with adjusted weights), and by use of Theorem 5.1. For details see [25].

Our Z-theorem (Theorem 3.1) yields asymptotic normality of the WLEs.

Theorem 4.2 (Asymptotic normality). Under Conditions 3.1, 3.2, 4.1–4.3,

√N(θ̂N − θ0) = √N ℙNπℓ̃θ0,Λ0 + oP*(1) →d N(0, Σ),
√N(θ̂N,# − θ0) = √N ℙNπ,#ℓ̃θ0,Λ0 + oP*(1) →d N(0, Σ#),

where # ∈ {e, c, mc, cc}, ℓ̃θ0,Λ0 = Ĩθ0,Λ0−1 ℓ*θ0,Λ0 is the efficient influence function for θ for complete data, and Σ and Σ# are given in Theorem 3.1.

Proof. We verify the conditions of Theorem 3.1. Condition 3.3 holds by Theorem 4.1. Conditions 3.4 and 3.5 hold under the present hypotheses as was shown in [31], Section 25.12.

For variance estimation regarding θ̂N, ÎN ≡ ℙNπ(ℓ*θ̂N,Λ̂N)⊗2 can be used to estimate Ĩ0. Letting ℓ̃̂0 ≡ ÎN−1 ℓ*θ̂N,Λ̂N, we can estimate Var0|jℓ̃0 by P̂jℓ̃0⊗2 − {P̂jℓ̃0}⊗2, where P̂jℓ̃0 ≡ ℙNπℓ̃̂0 I(V ∈ 𝒱j) and P̂jℓ̃0⊗2 ≡ ℙNπℓ̃̂0⊗2 I(V ∈ 𝒱j). The other four cases are similar.
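The plug-in recipe above can be assembled numerically. The sketch below is a hypothetical scalar version: it identifies the complete-data term with the IPW variance of the estimated influence values (which agrees with Ĩ0−1 when the influence function has mean zero) and adds the stratum terms ν̂j(1 − pj)/pj Var̂0|j; all inputs are made-up numbers.

```python
def plugin_variance(infl, stratum, p, N):
    """Scalar plug-in for Sigma = Var(l~) + sum_j nu_j (1 - p_j)/p_j Var_{0|j}(l~).

    infl[i]    : estimated efficient influence value of sampled unit i
    stratum[i] : its stratum label j
    p[j]       : phase II sampling fraction n_j / N_j in stratum j
    N          : phase I sample size
    """
    w = [1.0 / p[s] for s in stratum]                  # inverse-probability weights
    m1 = sum(wi * fi for wi, fi in zip(w, infl)) / N   # IPW first moment
    m2 = sum(wi * fi * fi for wi, fi in zip(w, infl)) / N
    sigma = m2 - m1 * m1                               # complete-data (phase I) part
    for j in set(stratum):
        wj = [wi for wi, s in zip(w, stratum) if s == j]
        fj = [fi for fi, s in zip(infl, stratum) if s == j]
        tot = sum(wj)
        mean_j = sum(wi * fi for wi, fi in zip(wj, fj)) / tot
        var_j = sum(wi * (fi - mean_j) ** 2 for wi, fi in zip(wj, fj)) / tot
        nu_j = tot / N                                 # estimated stratum frequency
        sigma += nu_j * (1.0 - p[j]) / p[j] * var_j    # phase II penalty
    return sigma

# Made-up influence values: stratum 0 fully sampled, stratum 1 sampled at 1/2.
sigma_hat = plugin_variance([1.0, -1.0, 2.0, 0.0], [0, 0, 1, 1],
                            {0: 1.0, 1: 0.5}, N=6)
```

Note that a fully sampled stratum (pj = 1) contributes nothing to the phase II penalty, consistent with the variance formulas of Theorem 3.1.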

4.2. Cox model with interval censored data

Let Y be a censoring time that is assumed to be conditionally independent of the failure time T given the covariate vector X. Under case 1 interval censoring we do not observe T, but rather (Y, Δ) where Δ ≡ I(T ≤ Y). The phase I data are V = (Y, Δ, U), and the observed data are (Y, Δ, ξX, U, ξ), where ξ is the sampling indicator. In the case of complete data, maximum likelihood estimation for this model was studied by Huang [10]. For a generalized version of this model and two-phase data with Bernoulli sampling, weighted likelihood estimation with and without estimated weights has recently been studied by Li and Nan [11]. Here we treat two-phase data under sampling without replacement at phase II, with both estimated weights and calibration.

With complete data, the log likelihood for one observation is given by

ℓ(θ, F) ≡ δ log{1 − F̄(y)exp(θTx)} + (1 − δ) log F̄(y)exp(θTx) = δ log{1 − e−Λ(y)exp(θTx)} − (1 − δ)eθTxΛ(y) ≡ ℓ(θ, Λ),

where F̅ ≡ 1 − F = e−Λ. The score for θ and the score operator Bθ,Λ for Λ for complete data are ℓ̇θ,Λ = x exp(θT x)Λ(y)(δr(y, x; θ, Λ) − (1 − δ)) and Bθ,Λ[h] = exp(θT x)h(y){δr(y, x; θ, Λ) − (1 − δ)} where r(y, x; θ, Λ) = exp(−eθT x Λ (y))/{1 − exp(−eθT x Λ (y))}. The efficient score for θ for complete data is given by

ℓ*θ0,Λ0 = eθ0Tx Q(y, δ, x; θ0, Λ0) Λ0(y){x − E[Xe2θ0TX O(Y|X)|Y = y]/E[e2θ0TX O(Y|X)|Y = y]},

where Q(y, δ, x; θ, Λ) = δr(y, x; θ, Λ) − (1 − δ) and O(y|x) = r(y, x; θ0, Λ0). The efficient information for θ for complete data, Ĩθ0,Λ0 = E[(ℓ*θ0,Λ0)⊗2], is given by Ĩθ0,Λ0 = E[R(Y, X){X − E[XR(Y, X)|Y]/E[R(Y, X)|Y]}⊗2] where R(Y, X) = e2θ0TX Λ02(Y) O(Y|X). See [10] for further details.
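The complete-data building blocks ℓ(θ, Λ), r and Q above are elementary to evaluate; the following hypothetical sketch (scalar covariate) simply transcribes the formulas.

```python
import math

def ell(theta, Lam, delta, x):
    """Case 1 interval-censoring log likelihood:
    l(theta, Lambda) = delta * log(1 - exp(-Lam * e^{theta x}))
                       - (1 - delta) * e^{theta x} * Lam."""
    s = math.exp(theta * x) * Lam
    return delta * math.log(1.0 - math.exp(-s)) - (1 - delta) * s

def r_fun(theta, Lam, x):
    """r(y, x; theta, Lambda) = exp(-e^{theta x} Lam) / (1 - exp(-e^{theta x} Lam))."""
    s = math.exp(theta * x) * Lam
    return math.exp(-s) / (1.0 - math.exp(-s))

def Q_fun(theta, Lam, delta, x):
    """Q = delta * r - (1 - delta)."""
    return delta * r_fun(theta, Lam, x) - (1 - delta)

# With theta = 0, x = 0 and Lambda(y) = log 2, survival is exp(-Lam) = 1/2,
# so r = 1 and an observed event (delta = 1) has log likelihood log(1/2).
val = ell(0.0, math.log(2.0), 1, 0.0)
q = Q_fun(0.0, math.log(2.0), 1, 0.0)
```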

We impose the same assumptions made for complete data in [10].

Condition 4.4. The finite-dimensional parameter space Θ is compact and contains the true parameter θ0 as an interior point.

Condition 4.5. (a) The covariate X has bounded support; that is, there exists x0 such that |X| ≤ x0 with probability 1. (b) For any θ ≠ θ0, P(θTX ≠ θ0TX) > 0.

Condition 4.6. F0(0) = 0. Let τF0 = inf{t : F0(t) = 1}. The support of Y is an interval S[Y] = [lY, uY] with 0 < lY < uY < τF0.

Condition 4.7. The cumulative hazard function Λ0 has a strictly positive derivative on S[Y], and the joint distribution function G(y, x) of (Y, X) has bounded second-order (partial) derivative with respect to y.

4.2.1. Consistency

The characterization of the WLEs (θ̂N, Λ̂N) and (θ̂N,#, Λ̂N,#), # ∈ {e, c, mc, cc}, maximizing ℙNπℓ(θ, Λ) or ℙNπ,#ℓ(θ, Λ) is given in [25], Lemma A.5. We prove consistency of the WLEs in the metric d((θ1, Λ1), (θ2, Λ2)) ≡ ∥θ1 − θ2∥ + ∥Λ1 − Λ2∥PY, where ∥ · ∥ is the Euclidean metric, ∥Λ1 − Λ2∥PY2 = ∫(Λ1(y) − Λ2(y))2 dPY(y), and PY is the marginal probability measure of the censoring variable Y.

Theorem 4.3 (Consistency). Under Conditions 3.1, 3.2, 4.4–4.7, the WLEs are consistent in the metric d.

Proof. We only prove consistency for the WLE. Proofs for the other four estimators are similar.

Let H̃ be the set of all subdistribution functions defined on [0, ∞]. We denote the WLE of F by F̂N = 1 − e−Λ̂N. Define the set ℱ of functions by

ℱ ≡ {f(θ, F) = δ(1 − F̄(y)exp(θTx)) + (1 − δ)F̄(y)exp(θTx) : θ ∈ Θ, F ∈ H̃}.

Boundedness of X and compactness of Θ ⊂ ℝp imply that the set {eθTx : θ ∈ Θ} is Glivenko–Cantelli. The set H̃ is also Glivenko–Cantelli, since it is a subset of the set of bounded monotone functions. Thus, it follows from the boundedness of the functions in ℱ and the Glivenko–Cantelli preservation theorem [30] that ℱ is Glivenko–Cantelli.

Let 0 < α < 1 be a fixed constant. It follows by concavity of the function u ↦ log u and Jensen’s inequality that

P0[log{1 + α(f(θ, F)/f(θ0, F0) − 1)}] ≤ log(P0[1 + α(f(θ, F)/f(θ0, F0) − 1)]) = log(1 − α + αP0[f(θ, F)/f(θ0, F0)]) ≤ 0,

where the first inequality is an equality if and only if 1 + α(f(θ, F)/f(θ0, F0) − 1) is constant on S[Y], in other words (θ, F) = (θ0, F0) on S[Y] by the identifiability Condition 4.5. Note also that by monotonicity of the logarithm

P0[log{1 + α(f(θ, F)/f(θ0, F0) − 1)}] ≥ P0[log{1 + α(0 − 1)}] = log(1 − α).

Thus, the set 𝒢 = {log{1 + α(f (θ, F)/f0, F0) − 1)} : f (θ, F) ∈ ℱ} has an integrable envelope. To see this, form a sequence (θn, Fn) such that

gn ≡ log{1 + α(f(θn, Fn)/f(θ0, F0) − 1)} ↑ supθ∈Θ,F∈H̃ log{1 + α(f(θ, F)/f(θ0, F0) − 1)} ≡ G.

Then {gn − log(1 − α)}n∈ℕ is a monotone increasing sequence of nonnegative functions. By the monotone convergence theorem, P0gn − log(1 − α) → P0G − log(1 − α) ≤ −log(1 − α). Thus we may choose G ∨ (−log(1 − α)) as an integrable envelope. The set 𝒢 is also Glivenko–Cantelli by a Glivenko–Cantelli preservation theorem [30].

Now, by the concavity of the map u ↦ log u, and the definition of the WLE, we have

ℙNπ log{1 + α(f(θ̂N, F̂N)/f(θ0, F0) − 1)} ≥ ℙNπ{(1 − α) log 1 + α log{f(θ̂N, F̂N)/f(θ0, F0)}} = α{ℙNπ log f(θ̂N, F̂N) − ℙNπ log f(θ0, F0)} ≥ 0.

Since Θ and H̃ are compact, there is a subsequence of (θ̂N, F̂N) converging to some (θ, F) ∈ Θ × H̃. Along this subsequence, Theorem 5.1 implies that

0 ≤ ℙNπ log{1 + α(f(θ̂N, F̂N)/f(θ0, F0) − 1)} →P* Pθ0,F0[log{1 + α(f(θ, F)/f(θ0, F0) − 1)}] ≤ 0,

so that Pθ0,F0 log{1 + α(f(θ, F)/f(θ0, F0) − 1)} = 0. This is possible only when (θ, F) = (θ0, F0), because (θ, F) ↦ P0[log{1 + α(f(θ, F)/f(θ0, F0) − 1)}] attains its maximum only at (θ0, F0). Hence we conclude that (θ̂N, F̂N) converges to (θ0, F0) in the sense of Kullback–Leibler divergence. Since the Kullback–Leibler divergence bounds the Hellinger distance, it follows by Lemma A5 of [17] that d((θ̂N, Λ̂N), (θ0, Λ0)) = oP*(1).

4.2.2. Rate of convergence

We prove that the rate of convergence of the WLE is N1/3 by applying the rate theorem (Theorem 5.2) of Section 5. Since we have proved consistency of (θ̂N, Λ̂N) for (θ0, Λ0) on S[Y], under Condition 4.6 we can restrict the parameter space for Λ to HM ≡ {Λ ∈ H : M−1 ≤ Λ ≤ M on S[Y]}, where M is a positive constant such that M−1 ≤ Λ0 ≤ M on S[Y]. Define ℳ ≡ {ℓ(θ, Λ) : θ ∈ Θ, Λ ∈ HM}.

Theorem 4.4 (Rate of convergence). Under Conditions 4.4–4.7,

d((θ̂N, Λ̂N), (θ0, Λ0)) = OP*(N−1/3).

This holds if we replace the WLE by the WLEs with adjusted weights assuming Conditions 3.1 and 3.2.

Proof. Since the rate of convergence for the plain WLE is easier to verify than for the other four estimators, we prove the theorem only for the WLE with modified calibration; the cases of the other WLEs with adjusted weights are similar.

We proceed by verifying the conditions of Theorem 5.2. Bound (5.4) follows by Lemma 5.2 in Section 5 and Lemma A5 of [17]. For bound (5.5), we follow the proof of (5.3) in [10]. Since α̂N is consistent, we can specify a small neighborhood 𝒜mc,0 of the zero vector such that Gmc(z; α) is contained in a small interval that contains 1 and consists of strictly positive numbers. Thus, multiplying the log likelihood by the uniformly bounded quantity Gmc(z; α) requires only a slight modification of Huang's proof of his Lemma 3.1 to obtain supQ log N[·](ε, Gℳ, L2(Q)) ≲ ε−1 for ε small enough, where the supremum is taken over all discrete probability measures and Gℳ ≡ {Gmc(·; α)ℓ(θ, Λ) : α ∈ 𝒜mc,0, ℓ(θ, Λ) ∈ ℳ}. Let Gℳδ ≡ {m(θ, Λ, α) − m(θ0, Λ0, α) : m(θ, Λ, α) ∈ Gℳ, d((θ, Λ), (θ0, Λ0)) ≤ δ}. It follows by Lemma 3.2.2 of [32] that E*∥𝔾N∥Gℳδ ≲ δ1/2{1 + (δ1/2/(δ2√N))M} ≡ ϕN(δ). Apply Theorem 5.2 to conclude rN = N1/3.
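The final step is a small calculation worth making explicit: with ϕN(δ) ≍ δ1/2, the requirement rN2ϕN(1/rN) ≤ √N reads rN3/2 ≤ N1/2, that is, rN = N1/3. A throwaway numerical check (hypothetical code):

```python
def rate_ok(r, N):
    """Check r**2 * phi(1/r) <= sqrt(N) for phi(d) = d**0.5,
    which simplifies to r**1.5 <= N**0.5."""
    phi = lambda d: d ** 0.5
    return r ** 2 * phi(1.0 / r) <= N ** 0.5 * (1.0 + 1e-12)

# r_N = N^{1/3} satisfies the inequality (with equality), while the faster
# polynomial rate N^{0.4} eventually violates it.
checks = [rate_ok(N ** (1.0 / 3.0), N) for N in (10, 10**3, 10**6)]
fails = [not rate_ok(N ** 0.4, N) for N in (10**3, 10**6)]
```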

4.2.3. Asymptotic normality of the estimators

We apply Theorem 3.2 to derive the asymptotic distributions of the WLEs.

Theorem 4.5 (Asymptotic normality). Under Conditions 3.1, 3.2, 4.4–4.7,

√N(θ̂N − θ0) = √N ℙNπℓ̃θ0,Λ0 + oP*(1) →d N(0, Σ),
√N(θ̂N,# − θ0) = √N ℙNπ,#ℓ̃θ0,Λ0 + oP*(1) →d N(0, Σ#),

where # ∈ {e, c, mc, cc}, ℓ̃θ0,Λ0 = Ĩθ0,Λ0−1 ℓ*θ0,Λ0 is the efficient influence function for complete data, and Σ and Σ# are given in Theorem 3.2.

Proof. We proceed by verifying the conditions of Theorem 3.2 for the WLE with modified calibration. The other four cases are similar.

Condition 3.6 is satisfied with β = 1/3 by Theorems 4.3 and 4.4. Conditions 3.7–3.9 are verified by [10] with

h̄*(y) ≡ Λ0(y)E[Xe2θ0TX O(Y|X)|Y = y]/E[e2θ0TX O(Y|X)|Y = y].

Since ℙNπ,mcℓ̇θ̂N,mc,Λ̂N,mc = 0 by Lemma A.5, it remains to show that

ℙNπ,mcBθ̂N,mc,Λ̂N,mc[h̄*] = oP*(N−1/2).

Let g0 ≡ h̄* ∘ Λ0−1 be the composition of h̄* and the inverse of Λ0. Note that Λ0 is a strictly increasing continuous function by our assumptions. Since g0(Λ̂N,mc(y)) is a right continuous function with exactly the same jump points as Λ̂N,mc(y), Lemma A.5 yields ℙNπ,mc g0(Λ̂N,mc(Y)) eθ̂N,mcTX Q(Y, Δ, X; θ̂N,mc, Λ̂N,mc) = 0. By Conditions 4.5–4.7, h̄* has a bounded derivative. This and the fact that Λ0 has a strictly positive derivative by Condition 4.7 imply that g0 has a bounded derivative, too. So, noting that h̄* = g0 ∘ Λ0, we have

ℙNπ,mcBθ̂N,mc,Λ̂N,mc[h̄*]
= ℙNπ,mc h̄*(Y) eθ̂N,mcTX Q(Y, Δ, X; θ̂N,mc, Λ̂N,mc)
= ℙNπ,mc {g0 ∘ Λ0(Y) − g0(Λ̂N,mc(Y))} eθ̂N,mcTX Q(Y, Δ, X; θ̂N,mc, Λ̂N,mc)
= (ℙNπ,mc − Pθ0,Λ0){g0 ∘ Λ0(Y) − g0(Λ̂N,mc(Y))} eθ̂N,mcTX Q(Y, Δ, X; θ̂N,mc, Λ̂N,mc)
+ Pθ0,Λ0{g0 ∘ Λ0(Y) − g0(Λ̂N,mc(Y))} eθ̂N,mcTX Q(Y, Δ, X; θ̂N,mc, Λ̂N,mc).

Huang [10] showed that the second term in the display is oP*(N−1/2). We show that the first term is also oP*(N−1/2). Let C > 0 be an arbitrary constant. Define, for a fixed constant η > 0, 𝒟(η) ≡ {ψ(y, δ, x; θ, Λ) : d((θ, Λ), (θ0, Λ0)) ≤ η, Λ ∈ HM}, where ψ(y, δ, x; θ, Λ) ≡ {g0 ∘ Λ0(y) − g0(Λ(y))} eθTx Q(y, δ, x; θ, Λ). Because Huang [10] showed that 𝒟(η) is Donsker for every η > 0 and that ∥𝔾N∥𝒟(CN−1/3) = oP*(1), it follows by Lemma 5.4 with ℱN replaced by 𝒟(CN−1/3) that ∥𝔾Nπ,mc∥𝒟(CN−1/3) = oP*(1). This completes the proof.

Unlike in the previous example, ℓ*θ,Λ depends on additional unknown functions, and the method of variance estimation used there does not apply to the present case. See the discussion in Section 6.

5. General results for IPW empirical processes

The IPW empirical measure and IPW empirical process inherit important properties from the empirical measure and empirical process, respectively. We emphasize the similarity between empirical processes and IPW empirical processes.

5.1. Glivenko–Cantelli theorem

The next theorem states that the Glivenko–Cantelli property for complete data is preserved under two-phase sampling.

Theorem 5.1. Suppose that ℱ is P0-Glivenko–Cantelli. Then

∥ℙNπ − P0∥ℱ →P* 0, (5.1)

where ∥·∥ℱ is the supremum norm over ℱ. This also holds if we replace ℙNπ by ℙNπ,# with # ∈ {e, c, mc, cc}, assuming Conditions 3.1 and 3.2.
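Recall that ℙNπf = N−1 ∑i (ξi/π0(Vi))f(Xi). A useful sanity check, sketched below in hypothetical code: under stratified sampling without replacement the inverse-probability weights sum to N exactly, so ℙNπ1 = 1 and the stratum frequencies νj are reproduced exactly, whatever the phase II draw.

```python
import random

random.seed(1)

def ipw_mean(f_vals, sampled_sets, p):
    """P_N^pi f = N^{-1} * sum over sampled units i of f(x_i) / p_{j(i)}."""
    N = len(f_vals)
    return sum(sum(f_vals[i] for i in idx) / p[j]
               for j, idx in sampled_sets.items()) / N

# Two strata of phase I sizes 6 and 4; draw n_0 = 3 and n_1 = 2 without replacement.
p = {0: 3 / 6, 1: 2 / 4}
sampled = {0: random.sample(range(0, 6), 3), 1: random.sample(range(6, 10), 2)}

one = ipw_mean([1.0] * 10, sampled, p)              # exactly 1, for any draw
nu1 = ipw_mean([0.0] * 6 + [1.0] * 4, sampled, p)   # exactly N_1 / N
```

This exact reproduction of constants and stratum indicators is what makes the IPW empirical measure behave so much like the ordinary empirical measure in the uniform limit theorems of this section.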

5.2. Rate of convergence

The rate of convergence of an M-estimator for complete data is often established via maximal inequalities for empirical processes. If we follow the same line of reasoning, it is natural to derive maximal inequalities for IPW empirical processes, though this may require some effort. Fortunately, the maximal inequalities for empirical processes (or slight modifications of them) suffice to establish the same rate of convergence under two-phase sampling.

Theorem 5.2. Let ℳ = {mθ : θ ∈ Θ} be a set of criterion functions, and define ℳδ = {mθ − mθ0 : d(θ, θ0) < δ} for some fixed δ > 0, where d is a semimetric on the parameter space Θ.

(1) Suppose that for every θ in a neighborhood of θ0,

P0(mθ − mθ0) ≲ −d2(θ, θ0); (5.2)

here a ≲ b means a ≤ Kb for some constant K ∈ (0, ∞). Assume that there exists a function ϕN such that δ ↦ ϕN(δ)/δα is decreasing for some α < 2 (not depending on N) and that, for every N,

E*∥𝔾N∥ℳδ ≲ ϕN(δ), (5.3)

where 𝔾N is the empirical process. If an estimator θ̂N satisfying ℙNπmθ̂N ≥ ℙNπmθ0 − OP*(rN−2) converges in outer probability to θ0, then rNd(θ̂N, θ0) = OP*(1) for every sequence rN such that rN2ϕN(1/rN) ≤ √N for every N.

(2) Let # ∈ {e, c, mc, cc} be fixed. Suppose Condition 3.2 holds. Suppose also that for every θ ∈ Θ in a neighborhood of θ0,

P0{G̃#(V; α)(mθ − mθ0)} ≲ −d2(θ, θ0) + |α − α0|2, (5.4)

where G̃e = π0(V)/Ge or G̃# = G# with # ∈ {c, mc, cc}. Assume that

E*∥𝔾N∥G̃#ℳδ ≲ ϕN(δ), (5.5)

where G̃#δ ≡ {#(·; α)f : |α| ≤ δ, α ∈ 𝒜N, f ∈ ℳδ} for some 𝒜N ⊂ 𝒜#. Then an estimator θ̂N,# satisfying Nπ,#mθ^N,#Nπ,#mθ0OP*(rN2) has the same rate of convergence as θ̂N in part (1) if it is consistent.

Remark 5.1. The key to establishing a general theorem for the rate of convergence is to exploit the boundedness of the weights in the IPW empirical process while dealing with their dependence. In treating independent bootstrap weights in the weighted bootstrap, [15] (Lemmas 1–3) requires bounded bootstrap weights, because the product of an unbounded weight and a bounded function is no longer bounded. Our theorem exploits the boundedness of the sampling indicators in the IPW empirical processes by applying a multiplier inequality for bounded weights (Lemma 5.1) to cover more general cases.

The following is a multiplier inequality for bounded exchangeable weights. Note that the sum of stochastic processes in the second term is divided by n1/2 rather than k1/2.

Lemma 5.1. For i.i.d. stochastic processes Z1, …, Zn, every bounded, exchangeable random vector (ξ1, …, ξn) with each ξi ∈ [l, u] that is independent of Z1, …, Zn, and any 1 ≤ n0 ≤ n,

E*∥(1/√n)∑i=1n ξiZi∥ ≤ (2(n0 − 1)/n)∑i=1n E*∥Zi∥ · E max1≤i≤n|ξi|/√n + 2(u − l) maxn0≤k≤n E*∥(1/√n)∑i=n0k Zi∥.

Bound (5.5) is not difficult to verify in the presence of bound (5.3) since G#(· ; α) is a bounded monotone function indexed by a finite-dimensional parameter. Bound (5.4) may be verified through the lemma below for some applications including the Cox model with interval censoring.

Lemma 5.2. Suppose Conditions 3.1 and 3.2 hold. Let mθ be the log likelihood log pθ where pθ is the density with dominating measure μ, and d is the Hellinger distance. Then the bound (5.4) holds.

5.3. Donsker theorem

The next theorem yields weak convergence of the IPW empirical processes under sampling without replacement.

Theorem 5.3. Suppose that ℱ with ∥P0∥ℱ < ∞ is P0-Donsker and that Conditions 3.1 and 3.2 hold. Then

𝔾Nπ ⇝ 𝔾π ≡ 𝔾 + ∑j=1J √(νj(1 − pj)/pj) 𝔾j, (5.6)
𝔾Nπ,# ⇝ 𝔾π,# ≡ 𝔾 + ∑j=1J √(νj(1 − pj)/pj) 𝔾j(· − Q#·), (5.7)

in ℓ∞(ℱ), where # ∈ {e, c, mc, cc}, and the P0-Brownian bridge process 𝔾 and the P0|j-Brownian bridge processes 𝔾j, j = 1, …, J, all indexed by ℱ, are independent.

Remark 5.2. The integrability hypothesis ∥P0∥ℱ < ∞ is required only for the IPW empirical processes with adjusted weights.
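The limit in (5.6) has variance Var0(f) + ∑j νj(1 − pj)/pj Var0|j(f): an i.i.d. phase I part plus an independent penalty for phase II subsampling, which vanishes in any fully sampled stratum (pj = 1). A small deterministic computation of this formula with made-up inputs:

```python
def limit_variance(var0, nu, p, var_within):
    """Var(G^pi f) = Var_0(f) + sum_j nu_j * (1 - p_j) / p_j * Var_{0|j}(f),
    cf. (5.6).  Returns (total variance, phase II penalty)."""
    extra = sum(nj * (1.0 - pj) / pj * vj
                for nj, pj, vj in zip(nu, p, var_within))
    return var0 + extra, extra

# Two strata sampled at 100% and 50%: only the undersampled stratum
# contributes to the penalty.
total, extra = limit_variance(var0=2.0, nu=[0.5, 0.5], p=[1.0, 0.5],
                              var_within=[3.0, 1.2])
```

The penalty term is what separates the sampling-without-replacement limit from the classical Donsker limit, and it is exactly the term the calibration methods of Section 2 try to shrink.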

For a Donsker class ℱ, Theorem 5.3 and Lemma 2.3.11 of [32] yield asymptotic equicontinuity, in probability and in mean, for a metric that depends on the limit process. In applications it is of interest to have these results for the original metric ρP0(f, g) = σP0(f − g).

Theorem 5.4. Let ℱ be Donsker and define ℱδ = {f − g : f, g ∈ ℱ, ρP0(f, g) < δ} for some fixed δ > 0. Then, for every sequence δN ↓ 0,

E*∥𝔾Nπ∥ℱδN → 0,

and consequently ∥𝔾Nπ∥ℱδN = oP*(1). Moreover, ∥𝔾Nπ,#∥ℱδN = oP*(1) for # ∈ {e, c, mc, cc}, assuming Conditions 3.1 and 3.2.

We end this section with two important lemmas. The first lemma is an extension of Lemma 3.3.5 of [32] and will be used in our proof of Theorem 3.1 to verify asymptotic equicontinuity.

Lemma 5.3. Suppose ℱ = {ψθ,h − ψθ0,h : ∥θ − θ0∥ < δ, h ∈ ℋ} is P0-Donsker for some δ > 0 and that suph∈ℋ P0θ,h − ψθ0,h)2 → 0, as θ →θ0. If θ̂N converges in outer probability to θ0, then

suph∈ℋ |𝔾Nπ(ψθ̂N,h − ψθ0,h)| = oP*(1).

This also holds if we replace 𝔾Nπ by 𝔾Nπ,# with # ∈ {e, c, mc, cc}, assuming Conditions 3.1 and 3.2 hold and ∥P0∥ℱ < ∞.

The second lemma is used to verify asymptotic equicontinuity in the proof of Theorem 3.2, the first part for the IPW empirical process and the second part for the other four IPW empirical processes with adjusted weights.

Lemma 5.4. Let {ℱN} be a decreasing sequence of classes of functions such that ∥𝔾N∥ℱN = oP*(1). Assume that there exists an integrable envelope for ℱN0 for some N0. Then E*∥𝔾N∥ℱN → 0 as N → ∞. As a consequence, ∥𝔾Nπ∥ℱN = oP*(1).

Suppose, moreover, that ℱN1 is P0-Glivenko–Cantelli with ∥P0∥ℱN1 < ∞ for some N1, and that every f = fN ∈ ℱN converges to zero either pointwise or in L1(P0) as N → ∞. Then ∥𝔾Nπ,e∥ℱN = oP*(1), ∥𝔾Nπ,c∥ℱN = oP*(1), ∥𝔾Nπ,mc∥ℱN = oP*(1) and ∥𝔾Nπ,cc∥ℱN = oP*(1), assuming Conditions 3.1 and 3.2.

6. Discussion

We developed asymptotic theory for weighted likelihood estimation under two-phase sampling, introduced and studied a new calibration method, centered calibration, and compared several WLE estimation methods involving adjusted weights. The methods of proof and general results for the IPW empirical process are applicable to other estimation procedures. For example, the weighted Kaplan–Meier estimator can be shown to be asymptotically Gaussian via our Donsker theorem (Theorem 5.3) together with the functional delta method. A particularly interesting application is to study asymptotic properties of estimators that are known to be efficient under Bernoulli sampling (e.g., estimator of [19]). Whether or not these estimators are “efficient” under our sampling scheme is an open problem; see [16] for a definition of efficiency with non-i.i.d. data.

There are several other open problems. Variance estimation under two-phase sampling has been restricted to the case where the asymptotic variance is a known function up to parameters as discussed in Section 4, while there are several methods available for complete data in a general case (e.g., [18]). In [24] the first author has proposed and studied nonparametric bootstrap variance estimation methods which remain valid even under model misspecification; these results will appear elsewhere. Another direction of research is to study (local and global) model misspecification under two-phase sampling where missingness is by design. An interesting open problem beyond our sampling scheme is to study other complex survey designs. Stratified sampling without replacement is sufficiently simple for the existing bootstrap empirical process theory to apply. Other complex designs may provide interesting theoretical challenges, perhaps in connection with extensions of bootstrap empirical process theory.


Acknowledgements

We owe thanks to Kwun Chuen Gary Chan for suggesting the modified calibration method introduced in Section 2.1.3. We also thank Norman Breslow for many helpful conversations concerning two-phase sampling, and two referees for their constructive comments and suggestions.

Footnotes

1. Supported by NIH/NIAID Grant R01 AI089341.

2. Supported in part by NSF Grant DMS-11-04832, NIAID Grant 2R01 AI291968-04 and the Alexander von Humboldt Foundation.

Supplementary material for “Weighted likelihood estimation under two-phase sampling” (DOI: 10.1214/12-AOS1073SUPP;.pdf). Due to space constraints, the proofs and technical details have been given in the supplementary document [25]. References here beginning with “A.” refer to [25].

REFERENCES

  • 1. Binder DA. Fitting Cox's proportional hazards models from survey data. Biometrika. 1992;79:139–147. MR1158522.
  • 2. Breslow NE, Lumley T, Ballantyne C, Chambless L, Kulich M. Improved Horvitz–Thompson estimation of model parameters from two-phase stratified samples: Applications in epidemiology. Stat. Biosci. 2009;1:32–49. doi:10.1007/s12561-009-9001-6.
  • 3. Breslow NE, Lumley T, Ballantyne C, Chambless L, Kulich M. Using the whole cohort in the analysis of case-cohort data. Am. J. Epidemiol. 2009;169:1398–1405. doi:10.1093/aje/kwp055.
  • 4. Breslow NE, Wellner JA. Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand. J. Stat. 2007;34:86–102. doi:10.1111/j.1467-9469.2007.00574.x. MR2325244.
  • 5. Breslow NE, Wellner JA. A Z-theorem with estimated nuisance parameters and correction note for: "Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression" [Scand. J. Stat. 34 (2007) 86–102; MR2325244]. Scand. J. Stat. 2008;35:186–192. MR2391566.
  • 6. Chan KCG. Uniform improvement of empirical likelihood for missing response problem. Electron. J. Stat. 2012;6:289–302.
  • 7. Cox DR. Regression models and life-tables (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 1972;34:187–220. MR0341758.
  • 8. Deville J-C, Särndal C-E. Calibration estimators in survey sampling. J. Amer. Statist. Assoc. 1992;87:376–382. MR1173804.
  • 9. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 1952;47:663–685. MR0053460.
  • 10. Huang J. Efficient estimation for the proportional hazards model with interval censoring. Ann. Statist. 1996;24:540–568. MR1394975.
  • 11. Li Z, Nan B. Relative risk regression for current status data in case-cohort studies. Canad. J. Statist. 2011;39:557–577. MR2860827.
  • 12. Lin DY. On fitting Cox's proportional hazards models to survey data. Biometrika. 2000;87:37–47. MR1766826.
  • 13. Lumley T. Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: Wiley; 2010.
  • 14. Lumley T, Shaw PA, Dai JY. Connections between survey calibration estimators and semiparametric models for incomplete data. Int. Stat. Rev. 2011;79:200–232. doi:10.1111/j.1751-5823.2011.00138.x.
  • 15. Ma S, Kosorok MR. Robust semiparametric M-estimation and the weighted bootstrap. J. Multivariate Anal. 2005;96:190–217. MR2202406.
  • 16. McNeney B, Wellner JA. Application of convolution theorems in semiparametric models with non-i.i.d. data. J. Statist. Plann. Inference. 2000;91:441–480. MR1814795.
  • 17. Murphy SA, van der Vaart AW. Semiparametric likelihood ratio inference. Ann. Statist. 1997;25:1471–1509. MR1463562.
  • 18. Murphy SA, van der Vaart AW. Observed information in semi-parametric models. Bernoulli. 1999;5:381–412. MR1693616.
  • 19. Nan B. Efficient estimation for case-cohort studies. Canad. J. Statist. 2004;32:403–419. MR2125853.
  • 20. Neyman J. Contribution to the theory of sampling human populations. J. Amer. Statist. Assoc. 1938;33:101–116.
  • 21. Præstgaard J, Wellner JA. Exchangeably weighted bootstraps of the general empirical process. Ann. Probab. 1993;21:2053–2086. MR1245301.
  • 22. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
  • 23. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 1994;89:846–866. MR1294730.
  • 24. Saegusa T. Weighted likelihood estimation under two-phase sampling. Ph.D. thesis. Seattle, WA: Univ. Washington; 2012.
  • 25. Saegusa T, Wellner JA. Supplement to "Weighted likelihood estimation under two-phase sampling". 2012. doi:10.1214/12-AOS1073.
  • 26. Saegusa T, Wellner JA. Weighted likelihood estimation under two-phase sampling. Technical Report 592. Seattle, WA: Dept. Statistics, Univ. Washington; 2012. Available at arXiv:1112.4951.
  • 27. Self SG, Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann. Statist. 1988;16:64–81. MR0924857.
  • 28. Tan Z. Efficient restricted estimators for conditional mean models with missing data. Biometrika. 2011;98:663–684. MR2836413.
  • 29. van der Vaart A. Semiparametric statistics. In: Lectures on Probability Theory and Statistics (Saint-Flour, 1999). Lecture Notes in Math. Vol. 1781. Berlin: Springer; 2002. pp. 331–457. MR1915446.
  • 30. van der Vaart A, Wellner JA. Preservation theorems for Glivenko–Cantelli and uniform Glivenko–Cantelli classes. In: High Dimensional Probability II (Seattle, WA, 1999). Progress in Probability. Vol. 47. Boston, MA: Birkhäuser; 2000. pp. 115–133. MR1857319.
  • 31. van der Vaart AW. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Vol. 3. Cambridge: Cambridge Univ. Press; 1998. MR1652247.
  • 32. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996. MR1385671.
  • 33. White JE. A two-stage design for the study of the relationship between a rare exposure and a rare disease. Am. J. Epidemiol. 1982;115:119–128. doi:10.1093/oxfordjournals.aje.a113266.
  • 34. Zheng H, Little RJA. Penalized spline nonparametric mixed models for inference about a finite population mean from two-stage samples. Survey Methodology. 2004;30:209–218.
