Abstract
An objective of randomized placebo-controlled preventive HIV vaccine efficacy trials is to assess the relationship between the vaccine effect to prevent infection and the genetic distance of the exposing HIV to the HIV strain represented in the vaccine construct. Motivated by this objective, recently a mark-specific proportional hazards model with a continuum of competing risks has been studied, where the genetic distance of the transmitting strain is the continuous `mark' defined and observable only in failures. A high percentage of genetic marks of interest may be missing for a variety of reasons, predominantly due to rapid evolution of HIV sequences after transmission before a blood sample is drawn from which HIV sequences are measured. This research investigates the stratified mark-specific proportional hazards model with missing marks where the baseline functions may vary with strata. We develop two consistent estimation approaches, the first based on the inverse probability weighted complete-case (IPW) technique, and the second based on augmenting the IPW estimator by incorporating auxiliary information predictive of the mark. We investigate the asymptotic properties and finite-sample performance of the two estimators, and show that the augmented IPW estimator, which satisfies a double robustness property, is more efficient.
Keywords: Augmented inverse probability weighted complete-case estimator, auxiliary marks, competing risks, double robustness, failure time data, genetic data, HIV vaccine trial, mark-specific vaccine efficacy, missing at random, semiparametric model
1 Introduction
In many medical studies, evaluation of treatment efficacy is based on comparison of survival times between two treatment groups. This is often done through the ratio of two hazard functions of an endpoint. In a preventive HIV vaccine efficacy trial, the usual primary objective is to assess vaccine efficacy (VE) to prevent HIV infection, where typically VE is defined as one minus the hazard ratio (vaccine/placebo) of the failure event. However, it may be quite difficult to achieve an efficacious vaccine due to genetic variation of HIV (Fauci et al., 2008). The study population is exposed to many genetic types of circulating HIVs but the vaccine only contains antigens based on one or a few strains, and the vaccine protection is likely to be lower against strains that are not in the vaccine or that have greater genetic distance from the strain(s) in the vaccine (Gilbert et al., 1999). Assessment of this problem can be formulated with a competing risks failure time model where the endpoint is HIV infection and the mark variable is the genetic type or distance of the transmitting strain(s) from the strain(s) represented in the vaccine; a `mark' by definition is only meaningful/observable in failures. The goal is to evaluate mark-specific vaccine efficacy, defined as one minus the mark-specific hazard ratio (vaccine/placebo) of infection.
In HIV vaccine efficacy trials, the mark variable of interest is some measure of genetic distance between an HIV sequence measured from an infected subject and an HIV sequence represented in the vaccine. For example, if the two HIV sequences are aligned, then the genetic distance may be defined as the weighted percent mismatch of amino acids (e.g., Wu et al., 2001), with weights selected to make the distance scientifically/immunologically meaningful. Definitions of genetic distance between unaligned sequences are also possible, for example based on the difference in total counts of amino acids that possess certain physico-chemical properties. For many definitions of genetic distances of interest, the value of the distance may be unique for almost all infected subjects, making it natural to consider it as a continuous mark variable. Gilbert et al. (2004, 2008) developed procedures for estimation and testing of continuous mark-specific hazard rates and relative risks, and Sun et al. (2009) studied the continuous mark-specific proportional hazards model, which allows evaluating mark-specific vaccine efficacy with adjustment for covariate effects.
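As a rough illustration of the aligned-sequence distance described above, the following sketch computes a weighted fraction of mismatched amino-acid positions between two aligned sequences. The function name, the uniform default weights, and the toy sequences are assumptions made for illustration only; the immunologically motivated weights of Wu et al. (2001) are not reproduced here.

```python
def weighted_mismatch_distance(seq_a, seq_b, weights=None):
    """Weighted fraction of aligned positions at which two sequences differ.

    seq_a, seq_b: aligned amino-acid strings of equal length.
    weights: optional non-negative per-position weights (hypothetical;
             defaults to equal weight at every position).
    Returns a distance in [0, 1].
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if weights is None:
        weights = [1.0] * len(seq_a)
    if len(weights) != len(seq_a):
        raise ValueError("need exactly one weight per aligned position")
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    # Sum the weights of the positions where the residues disagree.
    mismatched = sum(w for a, b, w in zip(seq_a, seq_b, weights) if a != b)
    return mismatched / total


# Toy example: one mismatch in four positions.
d_equal = weighted_mismatch_distance("ACDE", "ACDF")
d_weighted = weighted_mismatch_distance("ACDE", "ACDF", [1, 1, 1, 3])
```

Because the result lies in [0, 1], a distance of this kind can serve directly as a mark on the unit interval.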
However, the previous work for a continuous mark did not account for missingness of the mark variable in failures, so that its valid implementation relies on a dubious missing completely at random (MCAR) assumption. While these methods will perform satisfactorily for mark variables measured in almost all failures, for which violations of MCAR are inconsequential, they may be invalid and misleading under moderate-to-high rates of missingness. Moreover, arguably the marks of greatest scientific interest in HIV vaccine trials are subject to high rates of missingness. In particular, ideally the mark would be defined based on the HIV strain that is actually transmitted. However, because vaccine trials only periodically test for HIV infection, the earliest observable HIV genotype is measured using a subject's earliest available blood sample, which is drawn days, weeks, or several months after the actual transmission event. Therefore the ideal mark definition is unachievable, and we must consider knowledge of HIV evolution and the HIV testing algorithm to develop a workable definition.
HIV vaccine trials test volunteers for anti-HIV antibodies at periodic intervals (e.g., every 3 or 6 months); these antibody-based tests have near-perfect sensitivity to detect infections that occurred at least 4 weeks ago but otherwise may miss the infection. For all subjects with an HIV antibody positive (Ab+) test, a “look-back” procedure is applied wherein earlier available blood samples are tested for HIV infection using a more sensitive antigen-based HIV-specific PCR assay, which has near-perfect sensitivity if the infection occurred at least one week ago; thus for example if blood samples are drawn 2 weeks and 6 weeks after HIV acquisition, and the Ab and PCR tests are applied to both samples, then with very high probability the 2-week sample will be Ab−, PCR+ and the 6-week sample will be Ab+, PCR+. Therefore, each infected subject is classified into one of two groups, defined by whether the earliest sample is Ab− and PCR+ (an `early' sample) or is Ab+ and PCR+ (a `later' sample). These groups approximately correspond to the first available sample being drawn less than or greater than 4 weeks after HIV acquisition. Extensive analysis of early viruses suggests that these viruses well-approximate the transmitting strain, whereas later viruses tend to have undergone significant evolution, with many genetic mutations (Keele et al., 2008). If the mark is defined based on the earliest sampled virus from each infected subject, then the fact that it is often significantly mutated relative to the transmitted virus may make the analysis misleading about the relationship between vaccine efficacy and the exposing virus. One remedy to this problem defines the mark as the early virus in the Ab− and PCR+ phase, which will likely be missing from many infected subjects, with missingness rate depending on the frequency of HIV testing. 
For the recent Step HIV vaccine trial with genetic data on 87 diagnosed HIV infections, with 6-monthly HIV antibody testing, 41 were Ab−/PCR+ viruses, for a missingness rate of 53% (unpublished data). The sequel trial that is ongoing (named HVTN 505), with 3-monthly HIV antibody testing, is expected to give a missingness rate of less than 50%. The `early' mark of interest may be missing for other reasons besides HIV evolution, for example because of a missing blood sample or a technical problem in measuring the HIV sequence; the developed methods accommodate this issue by allowing separate models of the different missingness types.
To analyze data with the mark defined as the genetic distance from the early virus, a `complete-case' application of the previously developed methods, which excludes infected subjects without a measured mark, may be highly biased and inefficient. Accordingly, this article develops valid and more efficient estimation methods for data-sets with missing at random continuous marks. The new methods restrict attention to a univariate mark variable, although if the vaccine contains multiple HIV sequences and/or if multiple sequences are transmitted to a study participant (which happens in a minority of cases, Keele et al., 2008), it would be of interest to study a multivariate mark variable defined as the set of genetic distances to the vaccine sequences. The current methods can accommodate multiple genetic distances by defining the univariate mark as the minimum of the multiple distances.
Early work in nonparametric estimation for failure time data with missingness of a discrete mark (failure type) includes Dinse (1986), Racine-Poon & Hoel (1984) and Lagakos & Louis (1988). Goetghebeur & Ryan (1990) and Dewanji (1992) derived modified logrank tests to compare survival experience between two groups. Goetghebeur & Ryan (1995) and Lu & Tsiatis (2001) studied proportional cause-specific hazards models for one particular failure cause. Gao & Tsiatis (2005) studied linear transformation models of cause-specific hazard functions and Lu & Liang (2008) studied semiparametric additive models of cause-specific hazard functions.
We develop two estimation approaches for the stratified mark-specific proportional hazards model. The first uses inverse probability weighting (IPW) of the complete-case estimator, which leverages auxiliary predictors of whether the mark is observed. The second approach, adapting the theory of Robins et al. (1994), augments the IPW complete-case estimator with auxiliary predictors of the missing marks.
The article is organized as follows. Notation, assumptions, and the stratified mark-specific proportional hazards model are introduced in Section 2. The estimation procedures are proposed in Section 3. The asymptotic properties of the proposed estimators are derived in Section 4. Procedures for constructing confidence intervals for mark-specific vaccine efficacies are also given in Section 4. The finite-sample performance and robustness analysis of the estimators are studied in simulations in Section 5. Proofs of the main results are placed in the web Appendices.
2 Stratified mark-specific proportional hazards models and missing marks
Let T be the failure time, V a continuous mark variable, and Z(t) a possibly time-dependent p-dimensional covariate. Under the competing risks model, the mark V is only defined and observable when T is observed, whereas if T is right-censored, the mark is undefined and meaningless. Suppose that the conditional mark-specific hazard function at time t given the covariate history Z(s), for s ≤ t, defined below only depends on the current value Z(t). The conditional mark-specific hazard function given Z(t) = z is defined as
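$$\lambda(t, v \mid z) = \lim_{h_1, h_2 \to 0} \frac{1}{h_1 h_2}\, P\{T \in [t, t + h_1),\ V \in [v, v + h_2) \mid T \ge t,\ Z(t) = z\} \tag{1}$$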
with t ranging over a fixed interval [0, τ]. Assume that V is continuous and has a known bounded support; rescaling V if necessary, this support is taken to be [0, 1]. The function given in (1) is the natural analog of its discrete counterpart, with similar interpretation.
Gilbert et al. (2008) defined mark-specific vaccine efficacy as VE(t, v) = 1 − λ(t, v|z = 1)/λ(t, v|z = 0), with z the indicator of assignment to the vaccine group; they developed several nonparametric and semiparametric tests concerning VE(t, v) as well as a nonparametric estimation method. Sun et al. (2009) studied the mark-specific proportional hazards (PH) model, λ(t, v|z(t)) = λ0(t, v) exp{β(v)^T z(t)}, where λ0(·, v) is the unspecified baseline hazard function and β(v) is the p-dimensional unknown regression coefficient function of v. For the HIV vaccine trial application, the covariate is partitioned as z(t) = (z1, z2(t))^T, where z1 is the treatment (vaccine) group indicator and z2(t) is a vector of possibly time-dependent covariates. Then the covariate-adjusted mark-specific vaccine efficacy defined above takes the simpler form VE(v) = 1 − exp(β1(v)), without any dependence on t. Assuming the vaccine and placebo groups have the same distribution of the number of exposures to viruses with genetic distance marks in a neighborhood of v (by randomization and double-blinding), VE(v) approximates the multiplicative effect of the vaccine to reduce the instantaneous rate of infection with this virus type (Gilbert et al., 2008; Sun et al., 2009). The statistical procedures developed by Sun et al. (2009) are based on observations of the random variables (X, Z(·), V) for δ = 1 and (X, Z(·)) for δ = 0, where X = min{T, C}, δ = I(T ≤ C), and C is a censoring random variable. Whereas the earlier work assumed V was observed whenever T is observed, here we allow V to be missing, and incorporate in the estimation procedures auxiliary covariates and/or auxiliary mark variables that inform about the probability V is observed and about the distribution of V.
For our motivating example discussed above, genetic data for the `later' virus measured in almost all infected subjects is a natural auxiliary mark to help predict the genetic distance V for an `early' virus that is frequently missing.
In practice, different key subgroups (e.g., men and women; subjects living in different geographic regions) typically have different baseline mark-specific hazards of HIV infection. The stratified mark-specific PH model postulates that the conditional mark-specific hazard function given covariate z(t) for an individual in the kth stratum equals
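$$\lambda_k(t, v \mid z(t)) = \lambda_{0k}(t, v) \exp\{\beta(v)^T z(t)\}, \qquad k = 1, \ldots, K, \tag{2}$$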
where β(v) is the p-dimensional unknown regression coefficient function of v, λ0k(·, v) is the unspecified baseline hazard function for the kth stratum and K is the number of strata. Model (2) allows different baseline functions for different strata. A similar generalization of the Cox model was studied by Dabrowska (1997).
Next we introduce some notation and assumptions that are used throughout the article. Let nk be the number of subjects in the kth stratum; the total sample size is n = n1 + ⋯ + nK. As above, the right-censored mark-specific failure time is represented by (X, δ, V) and Z(·) is the covariate process. Let R be the indicator of whether all possible data are observed for a subject; R = 1 if either δ = 0 (right-censored) or if δ = 1 and V is observed; and R = 0 otherwise. Auxiliary variables A may be helpful for predicting missing marks. Since the mark can only be missing for failures, supplemental information is potentially useful only for failures, for predicting missingness and for informing about the distribution of missing marks. For example, the `later' HIV virus V* can be considered a subset of A. In general, A could include multiple viral sequences per infected subject at multiple time-points along with the time points at which the viruses are sequenced, giving a picture of the intra-subject evolution of a subject's HIV infection. The relationship between A and V can be modelled to help predict V (see Section 5 for a simulation example).
We assume that the censoring time C is conditionally independent of (T, V) given Z(·) for an individual in the kth stratum. We also assume the mark V is missing at random (Rubin, 1976); that is, given δ = 1 and W = (T, Z(T), A) of an individual in the kth stratum, the probability that the mark V is missing depends only on the observed W, not on the value of V; this assumption is expressed as
$$r_k(W) \equiv P(R = 1 \mid \delta = 1, W) = P(R = 1 \mid V, \delta = 1, W). \tag{3}$$
Let πk(Q) = P (R = 1|Q) where Q = (δ, W). Then πk(Q) = δrk(W) + (1 − δ). The missing at random assumption (3) also implies that V is independent of R given Q:
$$\rho_k(v, W) \equiv P(V \le v \mid \delta = 1, W) = P(V \le v \mid R = 1, \delta = 1, W). \tag{4}$$
For an observed value w of W of an individual in the kth stratum, we write rk(w) = P(R = 1|δ = 1, W = w) and ρk(v, w) = P(V ≤ v|δ = 1, W = w). The stratum-specific definitions of rk(w) and ρk(v, w) allow the models for the complete-case probability and for the mark distribution to differ across strata.
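The definitions of R and πk(Q) can be made concrete with a small sketch. It computes the complete-case indicator, the probability πk(Q) = δ rk(W) + (1 − δ), and the ratio R/πk(Q), which is the kind of weight that inverse probability weighted methods attach to each subject. The function names are hypothetical, and the value of rk(W) is simply passed in as a number rather than estimated.

```python
def complete_case_indicator(delta, mark_observed):
    """R = 1 if the subject is censored (delta = 0) or is a failure with an
    observed mark; R = 0 for a failure whose mark is missing."""
    return 1 if (delta == 0 or mark_observed) else 0


def pi_complete(delta, r_w):
    """pi_k(Q) = delta * r_k(W) + (1 - delta), with r_w standing in for r_k(W)."""
    return delta * r_w + (1 - delta)


def ipw_weight(delta, mark_observed, r_w):
    """Ratio R / pi_k(Q): 1 for censored subjects, 1 / r_k(W) for failures with
    an observed mark, and 0 for failures with a missing mark."""
    return complete_case_indicator(delta, mark_observed) / pi_complete(delta, r_w)


# Censored subject, failure with observed mark, failure with missing mark.
w_censored = ipw_weight(0, False, 0.5)
w_complete = ipw_weight(1, True, 0.5)
w_missing = ipw_weight(1, False, 0.5)
```

Censored subjects always receive weight one, which reflects the fact that the mark can only be missing for failures.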
For each k, let {Xki, Zki(·), δki, Rki, Vki, Aki; i = 1, …, nk} be iid replicates of {X, Z(·), δ, R, V, A} from that stratum. The observed data are denoted by {Oki; i = 1, …, nk, k = 1, …, K}, where Oki = {Xki, Zki(·), Rki, RkiVki, Aki} for δki = 1 and Oki = {Xki, Zki(·), Rki = 1} for δki = 0. We assume that {Oki; i = 1, …, nk, k = 1, …, K} are independent for all subjects. Similarly, we denote Wki = (Tki, Zki(Tki), Aki) and Qki = (δki, Wki).
We consider a parametric model rk(Wki, ψk) for rk(Wki), where ψk is an unknown vector of parameters to be further discussed in the next section. Let πk(Qki, ψk) = δkirk(Wki, ψk) + (1 − δki). Additional notation is introduced in the following. For , t ≥ 0, let Yki(t) = I(Xki ≥ t),
where for any , z⊗0 = 1, z⊗1 = z and z⊗2 = zzT. Define and . Under the missing at random assumption (3), if the model rk(Wki, ψk) is correctly specified. Let
Let and .
3 Estimation procedures
When there are no missing marks, β(v) in model (2) can be estimated by maximizing the following local log-partial likelihood function for β = β(v) at a fixed v:
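$$l(v, \beta) = \sum_{k=1}^{K} \sum_{i=1}^{n_k} \int_0^1 \int_0^\tau K_h(u - v) \Big[ \beta^T Z_{ki}(t) - \log\Big( \sum_{j=1}^{n_k} Y_{kj}(t) \exp\{\beta^T Z_{kj}(t)\} \Big) \Big] N_{ki}(dt, du), \tag{5}$$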
where Kh(x) = K(x/h)/h, K(·) is a kernel function with support [−1, 1], τ is the end of the follow-up period and h = hn is a bandwidth. Here Nki(t, v) = I(Xki ≤ t, δki = 1, Vki ≤ v) is the marked counting process with a jump at the uncensored failure time Xki and the associated mark Vki. The log partial likelihood function (5) resembles that of Kalbfleisch & Prentice (1980) in the case of discrete marks, except that it borrows strength from observations having marks in the neighborhood of v. The kernel function is designed to give greater weight to observations with marks near v than those further away. As discussed by Silverman (1986, p.43), the choice of kernel function has little to do with the performance of the estimators. For example, the asymptotic relative efficiency of the Tukey kernel compared to the optimal Epanechnikov kernel is 0.9897 and the Gaussian kernel has a relative efficiency of 0.9512. It is common to assume compact support for technical simplicity. The bandwidth, on the other hand, is an important parameter in the estimation of β(v). The cross-validation method is often used for bandwidth selection. This is further discussed in Section 5.
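The scaled kernel Kh(x) = K(x/h)/h and the resulting local weights can be sketched as follows. The Epanechnikov kernel is used as the example kernel; the function names and the toy marks are assumptions for illustration.

```python
import numpy as np


def epanechnikov(x):
    """Epanechnikov kernel K(x) = 0.75 (1 - x^2) on the support [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return 0.75 * (1.0 - x**2) * (np.abs(x) <= 1.0)


def scaled_kernel(x, h, kernel=epanechnikov):
    """K_h(x) = K(x / h) / h for a bandwidth h > 0."""
    return kernel(np.asarray(x, dtype=float) / h) / h


def mark_weights(marks, v, h):
    """Weights K_h(V_i - v) given to failures with marks V_i when estimating at v."""
    return scaled_kernel(np.asarray(marks, dtype=float) - v, h)


# Marks farther than h = 0.15 from v = 0.5 receive weight zero.
weights = mark_weights([0.30, 0.45, 0.50, 0.70], v=0.5, h=0.15)
```

Only failures whose marks fall within h of the target value v contribute to the local likelihood, which is why the bandwidth governs the trade-off between bias and variance.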
Taking the derivative of l(v, β) with respect to β gives the score function
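$$U(v, \beta) = \sum_{k=1}^{K} \sum_{i=1}^{n_k} \int_0^1 \int_0^\tau K_h(u - v) \Big[ Z_{ki}(t) - \frac{\sum_{j=1}^{n_k} Y_{kj}(t) \exp\{\beta^T Z_{kj}(t)\} Z_{kj}(t)}{\sum_{j=1}^{n_k} Y_{kj}(t) \exp\{\beta^T Z_{kj}(t)\}} \Big] N_{ki}(dt, du). \tag{6}$$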
The maximum partial likelihood estimator is a solution to U(v, β) = 0. This estimator was studied by Sun et al. (2009) when there is no stratification, i.e. K = 1.
In the presence of missing marks, (5) can only be applied directly to estimate β(v) through a complete-case analysis (e.g., excluding failures with a missing mark), which may be biased and inefficient. Our first approach to remedying this problem, following the idea of Horvitz & Thompson (1952), uses inverse probability weighting of complete cases. However, this approach has been shown to be inefficient in several situations (Gao & Tsiatis, 2005; Lu & Liang, 2008); also see Scharfstein et al. (1999). Our second approach, adapting the idea of Robins et al. (1994), augments the inverse probability weighted complete-case estimating equation with a consistent estimator of the conditional distribution of the mark that uses the estimator from the first approach and auxiliary data from subjects with missing marks.
Our model formulation is different from those in the existing literature on the discrete competing risks model, where only the cause-specific hazard function for one particular cause is modeled while the cause-specific hazard functions for the other causes are left unspecified. To evaluate mark-specific vaccine effects, model (2) specifies a proportional mark-specific hazards model for each mark v, 0 ≤ v ≤ 1. This model induces restrictions on the conditional mark distribution ρk(v, Wk) given in (4), so that arbitrary modeling may run into conflicts with the mark-specific PH model (2) and result in inconsistent estimation. This modeling restriction requires a different approach, as described below.
3.1 Inverse probability weighted complete-case estimator
Following Horvitz & Thompson (1952), inverse probability weighting of complete-cases has been commonly used in missing data problems. Let rk(Wki, ψk) be the parametric model for the probability of complete-case rk(Wki) defined in (3), where ψk is a q-dimensional parameter. For example, one can assume the logistic model with logit for those with δki = 1. By (3), the maximum likelihood estimator of ψ = (ψ1,…,ψK) is obtained by maximizing the observed data likelihood,
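$$L(\psi) = \prod_{k=1}^{K} \prod_{i=1}^{n_k} \{r_k(W_{ki}, \psi_k)\}^{R_{ki} \delta_{ki}} \{1 - r_k(W_{ki}, \psi_k)\}^{(1 - R_{ki}) \delta_{ki}}. \tag{7}$$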
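For a logistic model, maximizing a Bernoulli likelihood of this kind over the observed failures within one stratum can be sketched with a few Newton-Raphson steps. The sketch assumes a linear logit in a user-supplied design matrix, which is one possible choice and not necessarily the specification used in this article; the function name is hypothetical.

```python
import numpy as np


def fit_missingness_logistic(design, R, n_iter=25):
    """Newton-Raphson MLE of psi in logit P(R = 1 | delta = 1, W) = design @ psi.

    design: (m, q) array with one row per observed failure in the stratum
            (include a column of ones for an intercept).
    R:      length-m 0/1 array, 1 if the mark of that failure was observed.
    """
    design = np.asarray(design, dtype=float)
    R = np.asarray(R, dtype=float)
    psi = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ psi)))        # fitted r_k(W_ki, psi)
        score = design.T @ (R - p)                        # gradient of log-likelihood
        info = design.T @ (design * (p * (1.0 - p))[:, None])  # observed information
        step = np.linalg.solve(info, score)
        psi = psi + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return psi


# Intercept-only toy example: 3 of 4 failures have an observed mark,
# so the fitted probability is 0.75 and psi = log(3).
psi_hat = fit_missingness_logistic(np.ones((4, 1)), [1, 1, 0, 1])
```

Censored subjects do not enter the fit, since their contribution to the likelihood does not involve ψ.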
We propose the following inverse probability weighted (IPW) estimating equation for β:
(8) |
The IPW estimator of β(v) solves the above equation and is denoted by .
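A minimal numerical sketch of an inverse probability weighted local partial likelihood fit is given below. It is written under several assumptions that should be checked against the estimating equation: the weights R/π enter both the event terms and the risk-set sums, covariates are time-independent, and the complete-case probability is treated as known rather than estimated. All function and variable names are hypothetical, and the data are simulated only to exercise the code.

```python
import numpy as np


def epanechnikov(x):
    return 0.75 * (1.0 - x**2) * (np.abs(x) <= 1.0)


def weighted_local_score(beta, v, h, X, delta, Z, V, weight, stratum):
    """Score and information of a weighted local partial likelihood at mark v.

    weight[i] plays the role of R_i / pi_i (0 for failures with a missing mark).
    """
    p = Z.shape[1]
    score, info = np.zeros(p), np.zeros((p, p))
    risk = weight * np.exp(Z @ beta)              # weighted relative risks
    for i in np.flatnonzero((delta == 1) & (weight > 0)):
        kern = epanechnikov((V[i] - v) / h) / h   # K_h(V_i - v)
        if kern == 0.0:
            continue
        # Risk set: same stratum, still under observation at time X_i.
        at_risk = (X >= X[i]) & (stratum == stratum[i])
        s0 = risk[at_risk].sum()
        s1 = risk[at_risk] @ Z[at_risk]
        s2 = (Z[at_risk] * risk[at_risk][:, None]).T @ Z[at_risk]
        zbar = s1 / s0
        score += weight[i] * kern * (Z[i] - zbar)
        info += weight[i] * kern * (s2 / s0 - np.outer(zbar, zbar))
    return score, info


def solve_local_beta(v, h, X, delta, Z, V, weight, stratum, n_iter=30):
    """Newton-Raphson solution of the weighted local score equation at mark v."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        score, info = weighted_local_score(beta, v, h, X, delta, Z, V, weight, stratum)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


# Simulated toy data (one stratum, binary covariate, marks missing at random).
rng = np.random.default_rng(0)
n = 400
Z = rng.integers(0, 2, size=(n, 1)).astype(float)
T = rng.exponential(1.0 / np.exp(-0.5 * Z[:, 0]))
X = np.minimum(T, 2.0)                            # administrative censoring at 2
delta = (T <= 2.0).astype(int)
V = np.where(delta == 1, rng.uniform(size=n), np.nan)
r_known = 0.6                                     # assumed complete-case probability
R = np.where(delta == 1, rng.uniform(size=n) < r_known, True)
weight = np.where(delta == 1, R / r_known, 1.0)
stratum = np.zeros(n, dtype=int)

beta_hat = solve_local_beta(0.5, 0.15, X, delta, Z, V, weight, stratum)
final_score, _ = weighted_local_score(beta_hat, 0.5, 0.15, X, delta, Z, V, weight, stratum)
```

Setting every weight to one recovers an unweighted local partial likelihood fit, so the same routine can also serve for a full-data or complete-case comparison.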
The baseline function λ0k(t, v) can be estimated by , obtained by smoothing the increments of the following estimator of the doubly cumulative baseline function dsdu:
(9) |
For example, one can use kernel smoothing,
(10) |
where and , with K(1)(·) and K(2)(·) the kernel functions and b1 and b2 the bandwidths.
3.2 Augmented inverse probability weighted complete-case estimator
The IPW estimator uses only complete cases and is inefficient. Following Robins et al. (1994), Gao & Tsiatis (2005) proposed augmented inverse probability weighted complete-case estimators for the linear transformation model of cause-specific functions subject to missing failure types. Lu & Liang (2008) studied a semiparametric additive model for cause-specific functions with missing failure types. In both papers, an additional model is supplied for the conditional probability of the failure type of interest P(J = 2|W) to utilize the data from the failures with missing types as well as from the complete cases when there is only one stratum, where J is the type of failure in a typical competing risks set-up. Whereas Gao & Tsiatis (2005) and Lu & Liang (2008) had the flexibility to model P(J = 2|W), our conditional distribution of the mark, ρk(v, w), being specified for each v ∈ [0, 1], has a built-in structure that must be considered. Nevertheless, we will show that the idea of Robins et al. (1994) can still be used for model (2) to improve efficiency. Specifically, more efficient estimation of (2) can be achieved by incorporating knowledge of ρk(v, w) into the estimation procedure. Let gk(a|t, v, z) = P(Aki = a|Tki = t, Vki = v, Zki = z, δki = 1). Then
(11) |
where w = (t, z, a). If Aki is independent of Vki given (Tki, Zki, δki), then . In this case, ρk(v, w) can be estimated by , where . Routine kernel methods can be used to show that at the rate of (nh)−1/2 when b1 ≍ b2 ≍ h, where ≍ means convergence to zero at the same rate.
When the auxiliary marks Aki are correlated with Vki, the conditional distribution ρk(v, w) involves the function gk(a|t, u, z), which can be modeled to capture the correlation. The stronger the correlation, the greater the potential to improve efficiency. As considered in Section 1, if Vki is the genetic distance of an early virus, then an auxiliary mark Aki that is expected to predict Vki consists of the genetic distances of all available later viruses and their sampling times.
Consider a parametric model gk(a|t, u, z, θk) for gk(a|t, u, z). Let be an estimator of gk(a|t, u, z). Then ρk(v, w) can be estimated by
(12) |
Taking b1 ≍ b2 ≍ h, it is easy to show that at the rate of (nh)−1/2.
Let , and . Following Robins et al. (1994), we obtain the following augmented (AUG) inverse probability weighted estimating equation for β:
(13) |
The AUG estimator of β(v) solves the above equation and is denoted by .
The doubly cumulative baseline function is estimated by
(14) |
The baseline hazard function λ0k(t, v) can be estimated by the kernel estimator
(15) |
4 Asymptotic properties
Let , be the (right-continuous) filtration generated by the full data processes {Nki(s, v), Yki(s), Zki(s); 0 ≤ s ≤ t, 0 ≤ v ≤ 1, i = 1, …, nk, k = 1, …, K}. Assume , that is, the mark-specific instantaneous failure rate at time t given the observed information up to time t only depends on the failure status and the current covariate value. By the definition (1), . Hence, the mark-specific intensity of Nki(t, v) with respect to equals Yki(t)λki(t, v|Zki(t)). Let . By Aalen & Johansen (1978), Mki(·, v1) and Mki(·, v2) − Mki(·, v1) are orthogonal square integrable martingales with respect to for any 0 ≤ v1 ≤ v2 ≤ 1.
Let be the right continuous filtration obtained by adding Rki and δkiAki to . Let . Then
(16) |
where is the intensity of Nki(t, v) with respect to . Assume that is continuous in (t, v). Let . By Aalen & Johansen (1978), for any 0 ≤ v1 ≤ v2 ≤ 1, the processes and , 0 ≤ t ≤ τ are orthogonal square integrable martingales.
The following regularity conditions are assumed throughout the rest of the paper. Most of the notation can be found at the end of Section 2.
Condition A
(A.1) β(v) has componentwise continuous second derivatives on [0, 1]. For each k = 1, …, K, the second partial derivative of λ0k(t, v) with respect to v exists and is continuous on [0, τ] × [0, 1]. The covariate process Zki(t) has paths that are left continuous and of bounded variation, and satisfies the moment condition E[∥Zki(t)∥^4 exp(2M∥Zki(t)∥)] < ∞, where M is a constant such that (v, β(v)) ∈ [0, 1] × (−M, M)^p for all v and ∥A∥ = max_{k,l} |akl| for a matrix A = (akl).
(A.2) Each component of is continuous on [0, τ] × [−M, M]^p, is continuous on [0, τ] × [−M, M]^p × [−L, L]^q for some M, L > 0 and j = 0, 1, 2. , and .
(A.3) The limit pk = lim_{n→∞} nk/n exists and 0 < pk < 1. on [0, τ] × [−M, M]^p and the matrix is positive definite, where .
(A.4) The kernel functions K(·), K(1)(·) and K(2)(·) are symmetric with support [−1, 1] and have bounded variation. The bandwidths satisfy b1 ≍ b2 ≍ h and nh^2 → ∞ and nh^5 = O(1) as n → ∞.
(A.5) There is an ε > 0 such that rk(Wki) ≥ ε for all k, i with δki = 1.
Discussion of some of these conditions can be found in Sun et al. (2009). To avoid the problems at the boundaries v = 0, 1, we study the asymptotic properties of for interior values of v ∈ [a, b] ⊂ (0, 1).
4.1 Asymptotic results of inverse probability weighted complete-case estimator
The consistency and asymptotic normality of are established in the next two theorems.
Theorem 1. Under Condition A, if the model for rk(Wki) is correctly specified, then uniformly in v ∈ [a, b] ⊂ (0, 1) as n → ∞.
Let ν0 = ∫ K^2(x) dx, μ2 = ∫ x^2 K(x) dx,
Theorem 2. Under Condition A, if the model for rk(Wki) is correctly specified, then for v ∈ [a, b] as n → ∞, where and Σ(v) is given in condition (A.3).
By (16) and by the property of expectation that E(·) = E{E(·|Wki)},
ν0Σ*(v) can be consistently estimated by
The proof of Theorem 2 uses a Taylor expansion of the score function, leading to , where β*(v) is on the line segment between and β(v), and
is the derivative of with respect to β. Here is defined at the end of Section 2. It can be shown that as n → ∞. Thus, the asymptotic variance ν0Σ−1(v)Σ*(v)Σ−1(v) can be estimated by .
4.2 Asymptotic results of augmented inverse probability weighted complete-case estimator
Let and be the score vector and information matrix for under (7). Then
(17) |
(18) |
and .
We also introduce the following notation:
(19) |
The next theorem shows that the AUG estimator is consistent if either rk(w, ψk) or gk(a|t, v, z, θk) is correctly specified, a double robustness property.
Theorem 3. Assuming Condition A, uniformly in v ∈ [a, b] ⊂ (0, 1) as n → ∞. This consistency holds if either rk(w; ψk) or gk(a|t, v, z, θk) is correctly specified.
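The double robustness property can be illustrated in a much simpler setting than model (2). The sketch below is a toy example for estimating a mean with outcomes missing at random; it is not the AUG estimator of this article, and all names and simulation settings are assumptions. The augmented estimator stays close to the true mean when either the missingness model or the outcome model is wrong, provided the other one is correct, whereas the complete-case mean is biased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = 1.0 + x + rng.normal(size=n)              # true mean of y is 1
pi_true = 1.0 / (1.0 + np.exp(-x))            # P(y observed | x), depends on x only
r = rng.uniform(size=n) < pi_true             # observation indicator


def aipw_mean(y, r, pi, m):
    """Augmented IPW estimate of E[y]: mean of m(x) + (r / pi) * (y - m(x)).

    pi: working probability that y is observed; m: working regression E[y | x].
    Values of y with r = False are never used.
    """
    resid = np.where(r, y - m, 0.0)
    return np.mean(m + resid / pi)


# Complete-case mean ignores the missingness mechanism and is biased upward.
cc_mean = y[r].mean()
# Wrong missingness model (constant 0.5), correct outcome model.
est_wrong_pi = aipw_mean(y, r, np.full(n, 0.5), 1.0 + x)
# Correct missingness model, wrong outcome model (identically zero).
est_wrong_m = aipw_mean(y, r, pi_true, np.zeros(n))
```

With both working models wrong, no such protection is available, which matches the "either ... or" form of the consistency statement.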
Following the proofs of Theorems 2 and 3, it is easy to show that is asymptotically normal for v ∈ [a, b] ⊂ (0, 1) for some function ϑ*(v) if either rk(w, ψk) or gk(a|t, v, z, θk) is correctly specified. When both rk(w, ψk) and gk(a|t, v, z, θk) are correctly specified, Theorem 4 below shows that is more efficient than .
Theorem 4. Assuming Condition A, if both rk(w, ψk) and gk(a|t, v, z, θk) are correctly specified, we have for v ∈ [a, b] as n → ∞ where and Σ(v) is given in the condition (A.3). The estimator is more efficient than in the sense that
(20) |
Equation (20) shows that the asymptotic covariance is smaller than and the difference between the two covariances is at the order of h. The demonstrated efficiency gain for is reasonable since the estimation procedures for both and are based on the local partial likelihood with h as its bandwidth.
Let be defined similarly to given in (12) with the AUG estimator . Let , and
The asymptotic variance of can be consistently estimated by .
Let where β1(v) is the coefficient for vaccination status and β2(v) for other covariates. Let be the first element on the diagonal of the matrix ν0Σ−1(v)Σ*(v)Σ−1(v) and its consistent estimator, which is the first element on the diagonal of . By Theorem 4, a large sample 100(1 − α)% pointwise confidence interval for β1(v) is given by , 0 < v < 1, where zα/2 is the upper α/2 quantile of the standard normal distribution. The confidence intervals for the other coefficient functions can be constructed similarly.
The mark-specific vaccine efficacy VE(v) = 1 − exp(β1(v)) can be estimated by . By Theorem 4 and the delta method, for v ∈ (0, 1). A large sample 100(1 − α)% pointwise confidence interval for VE(v) is then given by , 0 < v < 1.
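The delta-method step can be sketched numerically. The sketch assumes the interval has the usual symmetric form, an estimate plus or minus a normal quantile times a standard error, and that the standard error of the estimate of β1(v) (including the (nh)^{-1/2} scaling) is already available; the function and argument names are hypothetical.

```python
from math import exp
from statistics import NormalDist


def ve_confidence_interval(beta1_hat, se_beta1, alpha=0.05):
    """Pointwise delta-method interval for VE(v) = 1 - exp(beta_1(v)).

    beta1_hat: estimate of beta_1(v) at a fixed mark v.
    se_beta1:  estimated standard error of beta1_hat (assumed supplied).
    Returns (ve_hat, lower, upper).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 normal quantile
    ve_hat = 1.0 - exp(beta1_hat)
    # Delta method: |d/db (1 - e^b)| = e^b, so se(VE) is about e^b * se(b).
    se_ve = exp(beta1_hat) * se_beta1
    return ve_hat, ve_hat - z * se_ve, ve_hat + z * se_ve


ve_hat, ve_lower, ve_upper = ve_confidence_interval(0.0, 0.1)
```

An alternative with the same first-order behaviour is to transform the endpoints of the interval for β1(v) through b ↦ 1 − exp(b), which keeps the upper limit for VE(v) below one.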
5 Simulation study
5.1 Assessment of estimation procedures under correctly specified models
First, we conduct a simulation study to check the finite sample performance of the two proposed estimation procedures when both rk(w) and gk(a|t, v, z) are correctly specified. The augmented inverse probability weighted complete-case estimator (AUG) for (2) with missing marks is compared to the complete-case estimator (CC), wherein the failures with missing marks are deleted for the analysis, and to the inverse probability weighted complete-case estimator (IPW). These estimators are compared to the unachievable benchmark, the full data likelihood estimator (Full), which analyzes the full data-set before some simulated marks are deleted. We consider the case with K = 1 stratum, in which case the CC and Full methods are those of Sun et al. (2009).
With k = 1 throughout this section, let Zki be the treatment indicator taking value 0 or 1 with probability of 0.5 for each value. The (Tki, Vki) are generated from the following mark-specific proportional hazards model:
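$$\lambda(t, v \mid z) = \exp\{\gamma v + (\alpha + \beta v) z\}, \qquad t \ge 0,\ 0 \le v \le 1, \tag{21}$$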
where α, β and γ are constants. Under model (21), the mark-specific baseline function is λ0(t, v) = exp(γv) and the mark-specific vaccine efficacy is VE(v) = 1 − exp(α + βv). The vaccine efficacy VE(v) = 0 for α = 0 and β = 0, indicating no vaccine efficacy. The vaccine efficacy does not depend on the type of infecting strain if β = 0, while the VE(v) decreases in v if β > 0. We examine the estimation procedures for the following specific models:
(M1) (α, β, γ) = (0, 0, 0.3), where VE(v) = 0;
(M2) (α, β, γ) = (−0.69, 0, 0.3), where VE(v) does not depend on v;
(M3) (α, β, γ) = (−0.6, 0.6, 0.3), where VE(v) decreases;
(M4) (α, β, γ) = (−1.2, 1.2, 0.3), where VE(v) decreases.
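One way to generate (T, V) under these settings is sketched below. It assumes the mark-specific hazard has the form exp{γv + (α + βv)z}, which is what the stated baseline exp(γv) and VE(v) = 1 − exp(α + βv) imply. Since this hazard does not depend on t, the failure time is exponential with rate equal to the hazard integrated over v ∈ [0, 1], and the mark given z has density proportional to exp{(γ + βz)v}, which can be sampled by inverting its distribution function. Function names are hypothetical, and censoring and missingness are not included.

```python
import numpy as np


def total_hazard(z, alpha, beta, gamma):
    """Integral over v in [0, 1] of exp{gamma*v + (alpha + beta*v)*z}."""
    z = np.asarray(z, dtype=float)
    c = gamma + beta * z                          # slope of the exponent in v
    safe_c = np.where(c == 0.0, 1.0, c)           # avoid 0/0 when c = 0
    integral = np.where(c == 0.0, 1.0, np.expm1(safe_c) / safe_c)
    return np.exp(alpha * z) * integral


def simulate_failure_and_mark(n, alpha, beta, gamma, rng):
    """Draw (z, t, v): treatment indicator, failure time, and mark."""
    z = rng.integers(0, 2, size=n)                # P(z = 1) = 0.5
    t = rng.exponential(1.0 / total_hazard(z, alpha, beta, gamma))
    c = gamma + beta * z
    safe_c = np.where(c == 0.0, 1.0, c)
    u = rng.uniform(size=n)
    # Inverse of F(v | z) = (exp(c v) - 1) / (exp(c) - 1); uniform when c = 0.
    v = np.where(c == 0.0, u, np.log1p(u * np.expm1(safe_c)) / safe_c)
    return z, t, v


# Setting (M3): (alpha, beta, gamma) = (-0.6, 0.6, 0.3).
z_sim, t_sim, v_sim = simulate_failure_and_mark(1000, -0.6, 0.6, 0.3,
                                                np.random.default_rng(0))
```

Under (M2), where β = 0, the ratio of the total hazards in the two groups reduces to exp(α), so the overall hazard ratio and the mark-specific hazard ratio coincide.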
All failure times greater than τ = 2.0 are considered as censored. In addition, we generate random censoring times from an exponential distribution, independent of (T, V), with parameter adjusted so that the overall censoring rates range from 25% to 35%. The complete-case indicator Rki is generated with conditional probability rk(Wki) = P(Rki = 1|δki = 1, Wki), where
(22) |
Here ψk0 = 0.2 and ψk1 = −0.2. The proportion of missing marks is about 50%, similar to the rate observed in current HIV vaccine efficacy trials.
For the proposed AUG estimator, conditional on (Tki, Zki, Vki), we assume that a single auxiliary mark follows the model
(23) |
for i = 1, …, nk, where Vki is the possibly missing mark generated from model (21), Uki is uniformly distributed on [0, 1] independent of Vki, and θ > 0 measures the association between Aki and Vki. The correlation coefficient ρ between Aki and Vki is 1 for θ = 0. Since Aki is observed for all observed failure times, the AUG estimator in this case is the Full estimator. The Aki and Vki are independent for θ = ∞ (ρ = 0), in which case the AUG estimator is denoted by AUG-A0. The AUG estimator is denoted by AUG-A1 for θ = 0.8 yielding ρ ≈ 0.78, by AUG-A2 for θ = 0.4 with ρ ≈ 0.92 and by AUG-A3 for θ = 0.2 with ρ ≈ 0.98. Thus the simulation compares the AUG estimators at four levels of association between Aki and Vki.
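The reported correlations can be roughly reproduced under two assumptions that are not taken from the model statement itself: that the auxiliary mark is an affine function of V + θU, a form consistent with the limits ρ = 1 at θ = 0 and ρ = 0 as θ → ∞, and that the variance of V is close to that of a uniform variable on [0, 1]. The function name is hypothetical.

```python
from math import sqrt


def aux_mark_correlation(theta, var_v=1.0 / 12.0):
    """Correlation between V and A = V + theta * U (up to affine rescaling of A),
    with U ~ Uniform(0, 1) independent of V.

    var_v defaults to 1/12, the Uniform(0, 1) variance, as an approximation
    to the variance of the simulated mark.
    """
    var_u = 1.0 / 12.0
    # Cov(V, A) = Var(V) and Var(A) = Var(V) + theta^2 Var(U).
    return sqrt(var_v / (var_v + theta**2 * var_u))


rho_a1 = aux_mark_correlation(0.8)
rho_a2 = aux_mark_correlation(0.4)
rho_a3 = aux_mark_correlation(0.2)
```

Under these assumptions the values are about 0.78, 0.93 and 0.98, in line with the three association levels used for AUG-A1, AUG-A2 and AUG-A3; small differences are expected because the simulated mark is not exactly uniform.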
We study settings with relatively high correlations because these occur for real data-sets due to the fact that inter-subject HIV sequence diversity is considerably larger than intra-subject HIV sequence diversity (Keele et al., 2008). We have found linear correlations between early and later sequence distances as high as 0.98.
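The reported correlations can be roughly reproduced under a simple working assumption. The sketch below assumes, hypothetically, that the auxiliary mark has the additive form A = V + θU with V and U independent and both uniform on [0, 1]. The exact form of model (23) and the true distribution of V under model (21) are not reproduced here, so the numbers are only indicative. Under that assumption Cov(A, V) = Var(V) = 1/12 and Var(A) = (1 + θ²)/12, which gives ρ = 1/√(1 + θ²).

```python
import math


def rho_additive_uniform(theta):
    """Correlation of A = V + theta * U with V, for independent V, U ~ Uniform[0, 1].

    ASSUMPTION: the additive form and the uniform law of V are illustrative
    choices; they are not taken from the article's model (23).
    """
    return 1.0 / math.sqrt(1.0 + theta ** 2)


if __name__ == "__main__":
    for label, theta in [("AUG-A1", 0.8), ("AUG-A2", 0.4), ("AUG-A3", 0.2)]:
        print(label, theta, round(rho_additive_uniform(theta), 3))
```

The formula also matches the two limiting cases described in the text: ρ = 1 at θ = 0, and ρ tends to 0 as θ grows without bound.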
Under model (23), the conditional density of Aki given (Tki, Zki, Vki) is
(24) |
The likelihood for θ is
It is easy to show that the maximum likelihood estimator is . Plugging in the density estimator into (12) yields , which is used to construct the AUG estimator of β in (13).
Performance of the proposed estimation procedures is evaluated for the models described in (21), (22) and (24) through simulations. We use the Epanechnikov kernel K(x) = 0.75(1 − x²)I{|x| ≤ 1}. The same kernel is also used for K(1)(x) and K(2)(x). For sample size n = 500 and bandwidths b1 = 0.1 and h = b2 = 0.15, Figure 1 shows the average estimates of β(v) under the models (M1)–(M4) based on 500 simulations for four different procedures: Full, CC, IPW and AUG. Figure 2 shows the standard errors of these estimates and Figure 3 shows the pointwise coverage probabilities of 95% confidence intervals for β(v) constructed using the AUG estimators based on 500 simulations.
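For reference, the Epanechnikov kernel can be written in a few lines. The rescaled form K_b(u) = K(u/b)/b used below is the usual convention for a bandwidth b and is assumed here rather than quoted from the article; the helper `integrate` is a hypothetical utility used only to check that the kernel integrates to one.

```python
def epanechnikov(x):
    """Epanechnikov kernel K(x) = 0.75 * (1 - x^2) on |x| <= 1, zero elsewhere."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0


def scaled_kernel(u, bandwidth):
    """Rescaled kernel K_b(u) = K(u / b) / b, with support [-b, b]."""
    return epanechnikov(u / bandwidth) / bandwidth


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral, used only to check normalisation."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


if __name__ == "__main__":
    print(integrate(epanechnikov, -1.0, 1.0))
    print(integrate(lambda u: scaled_kernel(u, 0.15), -0.15, 0.15))
```

With b2 = 0.15, only failures whose marks lie within 0.15 of the target value v receive positive weight, which is why estimates near the ends of the mark range rest on less data.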
The simulation results show that the bias of both the IPW and AUG estimators is very small, at a level comparable to that of the full data likelihood estimator (Figure 1). However, the complete-case estimator yields large bias, especially for models (M2)–(M4). The simulations also confirm that the AUG estimator is more efficient than the IPW estimator (Figure 2). In particular, the AUG estimator without auxiliary marks (AUG-A0) has smaller standard errors than the IPW estimator, and the AUG estimator with the auxiliary mark has smaller standard errors than the AUG estimator without auxiliary marks. Moreover, the standard errors of the AUG estimator decrease as the correlation between the auxiliary mark and the mark of interest increases, with the standard errors for AUG-A1, AUG-A2 and AUG-A3 getting closer to that of the Full estimator. The pointwise coverage probabilities displayed in Figure 3 are almost all between 92.5% and 97.5% for v ∈ (0, 1), indicating adequate performance of the proposed variance estimator for the AUG estimator. The deviation from the 95% nominal level is attributed to both the limited sample size (n = 500) and the limited number of simulation runs (also 500). The pointwise coverage probabilities are expected to be closer to 95% when both the sample size and the number of simulation runs increase.
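The contribution of the limited number of simulation runs can be gauged with a standard binomial calculation. The sketch below is not taken from the article; it assumes that each simulated interval covers the truth independently with probability exactly equal to the nominal level, and the function names are illustrative.

```python
import math


def coverage_mc_se(nominal, n_sims):
    """Binomial standard error of an estimated coverage probability."""
    return math.sqrt(nominal * (1.0 - nominal) / n_sims)


def coverage_band(nominal, n_sims, z=1.96):
    """Approximate 95% range for the estimated coverage when true coverage is nominal."""
    half = z * coverage_mc_se(nominal, n_sims)
    return nominal - half, nominal + half


if __name__ == "__main__":
    print(coverage_mc_se(0.95, 500))
    print(coverage_band(0.95, 500))
```

With 500 runs the Monte Carlo standard error is just under one percentage point, so estimated coverages within roughly two points of 95% are compatible with simulation noise alone.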
Bandwidth selection is important in nonparametric smoothing procedures. The bias tends to be larger and the standard errors smaller for larger bandwidths. A reasonable bandwidth choice trades off bias against standard error. We have tried simulations with different bandwidths, taking bandwidth b2 equal to 0.05 and 0.1 and b1 equal to 0.15, and found our procedures relatively insensitive to the choice. The standard errors are larger at the boundary points than at the inner points, v ∈ (b2, 1 − b2). In practice, the appropriate bandwidth selection can be based on a K-fold cross-validation method. This approach is widely used in the nonparametric function estimation literature and has been investigated by Efron & Tibshirani (1993) and Tian et al. (2005) among others. The simulations also show satisfactory performance of the AUG estimator of VE(v); see Figures 1–3 in the Web Appendices.
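The mechanics of K-fold cross-validation for a bandwidth can be sketched on a simpler problem. The example below swaps in a plain Nadaraya-Watson kernel regression with squared prediction error as the criterion; it is not the article's estimating-equation setting, and all function names, the candidate grid and the simulated data are hypothetical.

```python
import math
import random


def epanechnikov(x):
    """Epanechnikov kernel, zero outside [-1, 1]."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0


def kernel_smooth(x0, xs, ys, b):
    """Nadaraya-Watson estimate at x0; None if no point has positive weight."""
    weights = [epanechnikov((x - x0) / b) for x in xs]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, ys)) / total


def kfold_cv_error(xs, ys, b, k=5):
    """Mean squared prediction error of bandwidth b over k interleaved folds."""
    n = len(xs)
    sq_err = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for i in held_out:
            pred = kernel_smooth(xs[i], tx, ty, b)
            if pred is None:
                # Fall back to the training mean when b covers no training point.
                pred = sum(ty) / len(ty)
            sq_err += (ys[i] - pred) ** 2
    return sq_err / n


def select_bandwidth(xs, ys, candidates, k=5):
    """Return the candidate bandwidth with the smallest cross-validated error."""
    return min(candidates, key=lambda b: kfold_cv_error(xs, ys, b, k))


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(200)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    print(select_bandwidth(xs, ys, [0.05, 0.1, 0.15, 0.3]))
```

Folds are formed by interleaving indices, which presumes the observations are in random order; data sorted by the covariate should be shuffled first.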
5.2 Robustness analysis of the estimation procedures under mis-specified models
This subsection considers robustness of the proposed estimators to mis-specifications of rk(w) and/or gk(a|t, v, z), and to violation of the missing at random assumption. As above, Zki is a binary random variable taking the value 0 or 1 with probability 0.5 each. The (Tki, Vki) are generated from the mark-specific proportional hazards model (21) with (α, β, γ) given in (M3). The censoring times Cki for model (M3) remain the same as before, yielding approximately 30% censoring, and Xki = min(Tki, Cki).
Robustness of the estimators to mis-specification of rk(w) is examined by assuming model (22) while the actual complete-case indicator Rki is generated with the conditional probability rk(Wki) = P(Rki = 1|δki = 1, Wki), where
(25) |
Robustness of the AUG estimator is also examined when gk(a|t, v, z) is mis-specified. This is carried out by assuming model (23) for the auxiliary mark, or equivalently model (24) for gk(a|t, v, z), while the actual auxiliary mark is generated from
(26) |
for i = 1, …, nk, where Uki is uniformly distributed on [0, 1] independent of Vki.
Robustness of the estimators to violation of the missing at random assumption (3) is examined by assuming model (22) while the actual Rki depends on Vki through the model
(27) |
The proportion of missing marks among the observed failure times is kept around 50% in all of the cases.
For sample size n = 500 and bandwidths b1 = 0.1 and h = b2 = 0.15, Figure 4 shows the bias of the estimators of β(v) under model (M3) based on 500 simulations for four different procedures: Full, CC, IPW and AUG. Figure 5 shows the standard errors of these estimators. In both figures, panel (a) shows the plots when rk(w) is mis-specified following (25) and gk(a|t, v, z) is correctly specified by (24) with θ = 0.2; panel (b) shows the plots when gk(a|t, v, z) is mis-specified following (26) and rk(w) is correctly specified by (22) with ψk0 = 0.2 and ψk1 = −0.2; panel (c) shows the plots when rk(w) is mis-specified following (25) and gk(a|t, v, z) is mis-specified following (26); and panel (d) shows the plots when rk(w) depends on Vki following (27) and gk(a|t, v, z) is correctly specified by (24) with θ = 0.2.
The bias of the IPW estimator is large when rk(w) is mis-specified, as seen in Figure 4(a). Figures 4(a) and (b) show that the AUG estimator has very little bias, closely tracking the full data likelihood estimator, when one of rk(w) and gk(a|t, v, z) is mis-specified, reflecting the double robustness property of the AUG estimator. When both rk(w) and gk(a|t, v, z) are mis-specified, the AUG estimator still has smaller bias than the IPW estimator, as seen in Figure 4(c). Surprisingly, the AUG estimator yields as little bias as the full data likelihood estimator when the missing at random assumption is violated (Figure 4(d)), while the IPW estimator shows very large bias. In all the scenarios, the complete-case estimator has the largest bias. Figure 5 shows that the AUG estimator is more efficient than the IPW estimator, with smaller standard error, when both rk(w) and gk(a|t, v, z) are correctly specified, and this efficiency gain diminishes when one of them is mis-specified. Again the AUG estimator seems to have smaller standard error when the missing at random assumption is violated, as shown in Figure 5(d).
6 Discussion
To address the fact that mark variables are subject to missingness in HIV vaccine efficacy trials and other applications, this article extends previous work on the continuous mark-specific proportional hazards model to accommodate marks missing at random. We developed two approaches based on inverse probability weighting (IPW) of complete-cases, wherein failures in the score equation are weighted by the reciprocal of the probability the mark is observed. As always for IPW-based methods, the methods provide unstable estimation if some failures have outlying estimated probabilities near zero; therefore effective implementation in practice requires careful definition of the mark and covariates in the missingness model. Moreover, performance of the IPW-based methods depends on the ability to build a good model for the probability of observing the mark in failures. For HIV vaccine efficacy trials with the mark defined as the genetic distance of an `early' virus (i.e., measured on an HIV antibody negative blood sample), subject characteristics that could help predict whether the `early' virus is observed include viral load (very high viral load predicts the virus is early, and very low/undetectable viral load makes the sequencing technology prone to fail to measure the sequence), the timing of the subject's HIV tests including any extra tests beyond those scheduled, whether the infection was diagnosed during the series of injections (when HIV testing is more frequent), and whether any antiretroviral pre-exposure prophylaxis was received.
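The weighting idea, and the instability it can cause, can be illustrated outside the survival setting. The sketch below is a generic Horvitz-Thompson style example for a mean, not the article's weighted score equation; the values `y`, the observation probabilities `pi` and the function names are hypothetical. It enumerates every missingness pattern for a small sample to obtain the exact mean and variance of the weighted estimator.

```python
from itertools import product


def ipw_mean(y, pi, r):
    """IPW estimate of the mean: observed values weighted by 1 / pi_i."""
    n = len(y)
    return sum(ri * yi / p for yi, p, ri in zip(y, pi, r)) / n


def exact_moments(y, pi):
    """Exact mean and variance of the IPW mean over all missingness patterns."""
    mean = 0.0
    second = 0.0
    for r in product((0, 1), repeat=len(y)):
        prob = 1.0
        for ri, p in zip(r, pi):
            prob *= p if ri else 1.0 - p
        est = ipw_mean(y, pi, r)
        mean += prob * est
        second += prob * est * est
    return mean, second - mean * mean


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]
    print(exact_moments(y, [0.8, 0.6, 0.4, 0.2]))
    # One observation probability close to zero inflates the variance.
    print(exact_moments(y, [0.8, 0.6, 0.4, 0.02]))
```

In both cases the estimator is unbiased for the full-data mean of 2.5, but shrinking a single observation probability from 0.2 to 0.02 increases the variance roughly tenfold, which is the instability referred to above.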
Following Robins et al. (1994), our second approach augments the IPW estimating equation with a term for failures with a missing mark, which recovers information through modeling and estimation of the mark distribution. The simulation study showed that this approach effectively gains back efficiency compared to the IPW approach, and that the quality of predictive modeling heavily influences the extent of efficiency improvement. In our formulation the conditional density of the auxiliaries given the mark of interest is modeled and estimated. Therefore in practice a one-dimensional (or at most two-dimensional) auxiliary may be all that the data can support, given the need to either use nonparametric density estimation or to specify a reasonable parametric model. Fortunately, for HIV vaccine efficacy trials in which the mark is the genetic distance of the `early' virus, the genetic distance of a `later' virus may be a well-predicting univariate auxiliary mark, as supported by examination of HIV sequence databases. Recently completed and ongoing efficacy trials are measuring distances from early and later viruses, and it will be of interest to apply the new methods to the forthcoming data-sets.
This article focused on point and interval estimation, leaving the problem of inference for future work.
Acknowledgements
The research of Yanqing Sun was partially supported by NSF grants DMS-0604576 and DMS-0905777, NIH grant R37 AI054165-09 and a fund provided by UNC Charlotte. The research of Peter Gilbert was partially supported by NIH grant R37 AI054165-09. The authors thank the associate editor and two referees for their constructive suggestions.
Footnotes
Supplementary Materials Web Appendices A and B referenced in this article are available at the website of the Scandinavian Journal of Statistics, http://onlinelibrary.wiley.com/journal.
REFERENCES
- Aalen OO, Johansen S. An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scandinavian Journal of Statistics. 1978;5:141–150.
- Dabrowska DM. Smoothed Cox regression. The Annals of Statistics. 1997;25:1510–1540.
- Dewanji A. A note on a test for competing risks with missing failure type. Biometrika. 1992;79:855–857.
- Dinse GE. Nonparametric prevalence and mortality estimators for animal experiments with incomplete cause-of-death data. Journal of the American Statistical Association. 1986;81:328–336.
- Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman & Hall; New York: 1993.
- Fauci AS, Johnston MI, Dieffenbach CW, Burton DR, Hammer SM, Hoxie JA, Martin M, Overbaugh J, Watkins DI, Mahmoud A, Greene WC. HIV vaccine research: the way forward. Science. 2008;321:530–532. doi: 10.1126/science.1161000.
- Flynn NM, Forthal DN, Harro CD, Judson FN, Mayer KH, Para MF, Gilbert PB, The rgp120 HIV Vaccine Study Group. Placebo-controlled phase 3 trial of recombinant glycoprotein 120 vaccine to prevent HIV-1 infection. Journal of Infectious Diseases. 2005;191:654–665. doi: 10.1086/428404.
- Gao G, Tsiatis AA. Semiparametric estimators for the regression coefficients in the linear transformation competing risks model with missing cause of failure. Biometrika. 2005;92:875–891.
- Gilbert PB, Lele S, Vardi Y. Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika. 1999;86:27–43.
- Gilbert PB, McKeague IW, Sun Y. Tests for comparing mark-specific hazards and cumulative incidence functions. Lifetime Data Analysis. 2004;10:5–28. doi: 10.1023/b:lida.0000019253.69537.91.
- Gilbert PB, McKeague IW, Sun Y. The Two-Sample Problem for Failure Rates Depending on a Continuous Mark: An Application to Vaccine Efficacy. Biostatistics. 2008;9:263–276. doi: 10.1093/biostatistics/kxm028.
- Goetghebeur E, Ryan L. A modified logrank test for competing risks with missing failure type. Biometrika. 1990;77:207–211.
- Goetghebeur E, Ryan L. Analysis of competing risks survival data when some failure types are missing. Biometrika. 1995;82:821–834.
- Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley; New York: 1980.
- Keele B, Giorgi E, Salazar-Gonzalez J, Decker JM, Pham KT, Salazar MG, Sun C, Grayson T, Wang S, Li H, Wei X, Jiang C, Kirchherr JL, Gao F, Anderson JA, Ping L-H, Swanstrom R, Tomaras GD, Blattner WA, Goepfert PA, Kilby JM, Saag MS, Delwart EL, Busch MP, Cohen MS, Montefiori DC, Haynes BF, Gaschen B, Athreya GS, Lee HY, Wood N, Seoighe C, Perelson AS, Bhattacharya T, Korber BT, Hahn BH, Shaw GM. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proceedings of the National Academy of Sciences. 2008;105:7552–7557. doi: 10.1073/pnas.0802203105.
- Lagakos SW, Louis TA. Use of tumour lethality to interpret tumorigenicity experiments lacking cause of death data. Journal of Applied Statistics. 1988;37:169–179.
- Lu W, Liang Y. Analysis of competing risks data with missing cause of failure under additive hazards model. Statistica Sinica. 2008;19:219–234.
- Lu K, Tsiatis AA. Multiple imputation methods for estimating regression coefficients in the competing risks model with missing cause of failure. Biometrics. 2001;57:1191–1197. doi: 10.1111/j.0006-341x.2001.01191.x.
- Racine-Poon AH, Hoel DG. Nonparametric estimation of survival function when cause of death is uncertain. Biometrics. 1984;40:1151–1158.
- Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
- Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models: rejoinder. Journal of the American Statistical Association. 1999;94:1135–1146.
- Silverman BW. Density Estimation for Statistics and Data Analysis. Chapman and Hall; London: 1986.
- Sun Y, Gilbert PB, McKeague IW. Proportional hazards models with continuous marks. The Annals of Statistics. 2009;37:394–426. doi: 10.1214/07-AOS554.
- Tian L, Zucker D, Wei LJ. On the Cox model with time-varying regression coefficients. Journal of the American Statistical Association. 2005;100:172–183.
- Wu TJ, Hsieh YC, Li LA. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics. 2001;57:441–448. doi: 10.1111/j.0006-341x.2001.00441.x.