A Hybrid Approach for the Stratified Mark-Specific Proportional Hazards Model with Missing Covariates and Missing Marks, with Application to Vaccine Efficacy Trials

Yanqing Sun; Li Qi; Fei Heng; Peter B Gilbert

doi:10.1111/rssc.12417

Author manuscript; available in PMC: 2021 Aug 1.

Published in final edited form as: J R Stat Soc Ser C Appl Stat. 2020 May 22;69(4):791–814. doi: 10.1111/rssc.12417

PMCID: PMC7664296 NIHMSID: NIHMS1643050 PMID: 33191955

Abstract

Deployment of the recently licensed CYD-TDV dengue vaccine requires understanding of how the risk of dengue disease in vaccine recipients depends jointly on a host biomarker measured after vaccination (neutralization titer – NAb) and on a “mark” feature of the dengue disease failure event (the amino acid sequence distance of the dengue virus to the dengue sequence represented in the vaccine). The CYD14 phase 3 trial of CYD-TDV measured NAb via case-cohort sampling and the mark in dengue disease failure events, with about a third missing marks. We addressed the question of interest by developing inferential procedures for the stratified mark-specific proportional hazards model with missing covariates and missing marks. Two hybrid approaches are investigated that leverage both augmented inverse probability weighting and nearest neighborhood hot deck (NNHD) multiple imputation. The two approaches differ in how the imputed marks are pooled in estimation. Our investigation shows that NNHD imputation can lead to biased estimation without a properly selected neighborhood. Simulations show that the developed hybrid methods perform well with unbiased NNHD imputations from proper neighborhood selection. The new methods applied to CYD14 show that NAb is strongly inversely associated with risk of dengue disease in vaccine recipients, more strongly against dengue viruses with shorter distances.

Keywords: Augmented inverse probability weighting, competing risks failure time, missing marks, nearest neighborhood hot deck multiple imputation, semiparametric regression, two-phase sampling

1. Introduction

The CYD14 phase 3 trial randomized 2 to 14 year-old children within five countries of southeast Asia in 2:1 allocation to receive the CYD-TDV vaccine or placebo in three injections at months 0, 6, and 12, where the CYD-TDV vaccine (Sanofi Pasteur) is a recombinant, live-attenuated, tetravalent vaccine containing one representative dengue strain from each of the four dengue serotypes (Capeding et al., 2014). Participants underwent active surveillance for the primary study endpoint symptomatic virologically confirmed dengue (henceforth “dengue disease”) between Month 13 and Month 25 post first vaccination. Partly based on this trial that showed that the rate of dengue disease was an estimated 56% lower in the vaccine group than the placebo group (p < 0.001), this vaccine has been licensed in more than a dozen countries. The vaccine has been thought to work by inducing anti-dengue neutralizing antibodies (NAbs).

We develop statistical methods to analyze the CYD14 efficacy trial data that are appropriate for interrogating how the association of anti-dengue NAb with dengue disease risk may differ depending on the amino acid (AA) sequence of the dengue virus causing the study endpoint, accounting for the fact that the expensive covariate of interest (NAb titer) was measured through a classic case-cohort sampling design (measured from a Bernoulli simple random sample of 19.5% of participants at enrollment and from all disease cases) and that there is a substantial percentage (about a third) of missing dengue sequences among cases. This integrated assessment of how a host biomarker and a “mark” feature of the failure event relate to failure risk has many applications, including general prospective studies that follow a cohort for acquisition of a genetically-diverse infectious disease, encompassing many pathogens including HIV-1, influenza, and malaria.

To define the general statistical problem, let T be the time to a failure event of interest, and Z be a time-independent p-dimensional covariate. Under the competing risks model, a cause-of-failure mark V is observed when a failure event occurs. Let V be a continuous mark variable with bounded support [0, 1]. The mark-specific failure time data follow a competing risks model where the mark variable V plays the role of cause-of-failure that is only observable upon failure. In the motivating dengue vaccine study, the mark V measures the AA sequence distance of a dengue-disease causing dengue sequence to the nearest dengue sequence inside the vaccine, which can only be observed in subjects experiencing the dengue disease endpoint and is neither available nor meaningfully defined in subjects without the endpoint.

The mark-specific hazard function, defined as $λ(t, v) = \lim_{Δ_{1} \to 0, Δ_{2} \to 0} P\{T \in [t, t + Δ_{1}), V \in [v, v + Δ_{2}) ∣ T \geq t\} / (Δ_{1} Δ_{2})$, was studied by Gilbert et al. (2004). It measures the instantaneous risk of failure by a mark in the presence of all marks, e.g., dengue sequences circulating in the efficacy trial region exposing trial participants through mosquito bites, and can be considered as an extension of the cause-specific hazard function to a continuous mark. Subsequent statistical methods have been developed to model the conditional mark-specific hazard function with applications to HIV vaccine efficacy studies; see Sun et al. (2009), Sun and Gilbert (2012), Juraska and Gilbert (2013, 2016) and Yang et al. (2017).

Suppose that the population of interest includes K subpopulations or strata, each with different baseline mark-specific hazard functions. Let λ_k(t, v|z) be the conditional mark-specific hazard function at (T, V) = (t, v) given covariate Z = z for an individual in the kth stratum. The stratified mark-specific proportional hazards (PH) model postulates that

$$λ_{k}(t, v ∣ z) = λ_{0k}(t, v) \exp\{β(v)^{T} z\}, \quad k = 1, \dots, K, \tag{1}$$

where λ_0k(t, v) = λ_k(t, v|z = 0) is the unspecified baseline hazard function for the kth stratum, and β(v) is the p-dimensional unknown regression coefficient function of v. Model (1) allows different baseline functions for different strata.

The mark-specific PH model (1) was first studied by Sun et al. (2009) under K = 1 with the objective of evaluating mark-specific HIV vaccine efficacy, where the mark is an amino acid sequence distance of an infecting HIV strain to an HIV strain inside the vaccine. The model (1) was further studied by Sun and Gilbert (2012), Gilbert and Sun (2015) and Juraska and Gilbert (2016) for the situation where the marks are subject to missingness in subjects with observed failure times. Yang et al. (2017) investigated model (1) under two-phase sampling of components of Z allowing some participants to have missing covariates. However, the methods accounting for missing marks assumed complete measurements of all covariates, and the methods accounting for missing covariates assumed complete data on the marks of failures. Therefore, new methods are needed to account for both types of missing data.

In the motivating CYD14 efficacy trial, there are two types of missing data. The covariate NAb titer is missing through a case-cohort sampling design and the mark V (dengue sequence distance) is missing for some cases. Multiple imputation has been widely used for handling missing data, cf. Rubin (1987). Two-phase sampling/case-cohort designs are common forms of studies with missing covariates, where covariates are divided into phase one or phase two, with the former measured in all enrolled subjects and the latter measured only in a subset, typically because of the expense of measurement. A “case-cohort” design typically refers to randomly sampling subjects at enrollment into a subcohort for measuring the phase-two covariates, which are also measured in all subjects outside the subcohort who experience the failure event and have the requisite samples available (White, 1982; Prentice, 1986; Breslow and Lumley, 2013). “Two-phase sampling” typically refers to the generalization of outcome-dependent case-control sampling, where within each cell of a 2 × K table defined by outcome status cross-classified with the K levels of a discrete phase-one covariate, subjects are randomly sampled for measuring the phase-two covariates (Breslow et al., 2009). These designs can be implemented with Bernoulli or without-replacement sampling, and our methods apply for any of the Bernoulli sampling versions. As is the usual case, application of the methods to the without-replacement sampling versions provides approximately correct results, with inferences tending to be slightly conservative. There is extensive literature on statistical methods for two-phase sampling/case-cohort designs, e.g. Prentice (1986); Robins et al. (1994); Borgan et al. (2000); Scheike and Martinussen (2004); Kulich and Lin (2004); Nan (2004); Breslow et al. (2009); Breslow and Lumley (2013).

Nearest neighborhood imputation is one of the hot deck imputation methods commonly used in survey sampling (Sedransk, 1985; Kovar et al., 1988; Jonsson and Wohlin, 2004; Andridge and Little, 2010). The idea of nearest neighborhood hot deck (NNHD) imputation is to replace each missing value with an observed response from a matching subject from the same dataset. The proposed hybrid approach leverages both augmented inverse probability weighted complete-case (AIPW) estimation (Robins et al., 1994) to handle the two-phase sampled covariates and NNHD imputation to fill in missing marks in failure cases (Chen and Shao, 2000; Beretta and Santaniello, 2016). AIPW estimation possesses a double robust property, yielding consistent estimates if either the model for whether phase-two covariates are missing, or the model for the conditional expectations of phase-two covariates, is correctly specified (Robins et al., 1994; Gao and Tsiatis, 2005). Most imputation methods assume a parametric model for the variable to be imputed. In contrast, as a nonparametric technique, NNHD imputation does not rely on model fitting for the variable to be imputed, and thus is potentially less sensitive to model misspecification than a parametric-model based imputation method. However, our investigation shows that NNHD imputation can lead to biased estimation without proper neighborhood selection.

We develop hybrid estimation and hypothesis testing procedures for model (1) that use both AIPW estimation and NNHD imputation. We investigate the neighborhood selection for the NNHD imputation for unbiased estimation. The NNHD imputation is employed to impute the values of missing marks, followed by completed-marks two-phase sampling data analysis with an AIPW method similar to that of Yang et al. (2017) that did not account for missing marks. We investigate two hybrid estimation methods using the completed-marks two-phase sampling data that differ in the way in which the imputed marks are pooled in estimation. We develop hypothesis testing procedures to evaluate whether the mark-specific hazard ratios are unity and whether they change with the mark. The main contribution of this paper is the development of hybrid estimation and hypothesis testing methods for model (1) that relates the hazard of an outcome to both covariates and marks, accounting for missingness in both, including the investigation of neighborhood selection for the NNHD imputation of marks to achieve valid inference on the association parameters. The developed procedures enable assessment of whether and how the hazard rate of an infectious disease with a pathogen genetically close to or far from a reference genetic sequence is modified by participant covariates. This application is exemplified by the dengue vaccine efficacy trial, with reference sequence the closest dengue strain in the vaccine construct and covariates age and immune response to the dengue vaccine strains.

In Section 2, we formulate the missing data problem, presenting notations and assumptions. The NNHD imputation technique is introduced in Section 2.1. The two hybrid estimation procedures are developed in Section 2.2. Techniques for estimation of the mark-specific cumulative incidence function rate are given in Section 2.3. Statistical procedures for hypothesis testing of the mark-specific hazard ratios are developed in Section 3. An extensive simulation study is conducted in Section 4 to examine the performances of the newly proposed methods, which are applied to the CYD14 data in Section 5. Some concluding remarks are given in Section 6. Additional discussions about the proposed hybrid methods along with more simulation results, analysis of the simulated data based on the CYD14 efficacy trial and additional analysis of the CYD14 efficacy trial are presented in the Supplementary Materials.

2. Hybrid estimation using AIPW and NNHD multiple imputations

The AIPW estimation method was proposed by Robins, Rotnitzky and Zhao (1994) for missing data to improve robustness and efficiency over simple inverse probability weighted estimators. This important methodology has been widely used, and has shown efficiency and the double robust property in many studies, cf. Gao and Tsiatis (2005), Sun and Gilbert (2012), Yang et al. (2017) and Sun et al. (2018), among others. We investigate two hybrid methods of estimation of the mark-specific proportional hazards model that use both AIPW and NNHD imputation. We propose employing NNHD to impute the values of missing marks, followed by completed-marks two-phase sampling data analysis with an AIPW method similar to that of Yang et al. (2017) that did not account for missing marks. The first approach follows the standard multiple imputation scheme of Rubin (1987) while the second approach incorporates multiple imputations in estimating equations (MIEE).

Suppose the failure time T is subject to right censoring, and is partially observed through observation of X = min{T, C*} and δ = I(T ≤ C*), where I(·) is the indicator function and C* = min(C, τ) is the right censoring time with τ the end of follow-up and C the right censoring random variable. Let Z be a time-independent covariate vector. We assume independent censoring – that C is independent of (T, V) conditional on Z. Suppose that $Z = (Z_{1}^{T}, Z_{2}^{T})^{T}$ consists of two parts – Z₁ are observed in all subjects (phase one) and Z₂ are only measured in a subset (phase-two sample). In addition, the mark variable V is subject to missingness. Let ξ = (ξ_z, ξ_v) be the vector of missing data indicators, where ξ_z is the indicator for whether a subject has complete covariate information, and ξ_v is the indicator for whether the mark variable V is observed. We set ξ_v = 1 if δ = 0 since the mark V is inherently not available and is not considered as missing. We also set ξ_v = 1 if δ = 1 and V is observed; otherwise ξ_v = 0. Let A = (A_z, A_v) be auxiliary variables, with A_z the auxiliary variable predictive of phase-two covariates and A_v the auxiliary variable predictive of missing marks. For convenience, we denote Ω = (X, Z₁, A) and represent the observed data by $\tilde{Ω}_{o} = (Ω, ξ_{z} Z_{2}, ξ_{v} δ V, δ)$.

We assume that Z₂ and V are missing at random (MAR; Rubin, 1976), satisfying the following three conditions:

(i) P(ξ_z = 1|X, Z₁, Z₂, A, δV, δ) = P(ξ_z = 1|X, Z₁, A_z, δ),
(ii) P(ξ_v = 1|X, Z₁, Z₂, A, δV, δ = 1) = P(ξ_v = 1|X, Z₁, Z₂, A_v, δ = 1), and
(iii) P(ξ_z = 0, ξ_v = 0|X, Z₁, Z₂, A, δV, δ) = 0.

MAR (i) assumes that the missingness of Z₂ does not depend on the values of Z₂ and δV; MAR (ii) assumes that the missingness of V does not depend on the value of V; and MAR (iii) implies that Z₂ and V do not have missing values on the same subjects, which is always satisfied under Prentice’s (1986) original case-cohort sampling design for which no cases have missing Z₂ values. It is approximately satisfied for implemented case-cohort sampling designs (including our example) that intend to measure Z₂ in all cases but end up with a small number of happenstance missing values.

Suppose there are K strata. Let n_k be the number of subjects in the kth stratum and $n = \sum_{k = 1}^{K} n_{k}$. We label the ith subject in the kth stratum with a pair of subscripts {ki}. Let Z_1,ki and Z_2,ki be copies of covariates Z₁ and Z₂ for subject i in stratum k, respectively. Similarly, ξ_z,ki and ξ_v,ki are copies of ξ_z and ξ_v, respectively. Let $Z_{k i} = (Z_{1, k i}^{T}, Z_{2, k i}^{T})^{T}$, ξ_ki = (ξ_z,ki, ξ_v,ki), and Ω_ki = (X_ki, Z_1,ki, A_ki). The observed data are $\tilde{Ω}_{o, k i} = (Ω_{k i}, ξ_{z, k i} Z_{2, k i}, ξ_{v, k i} δ_{k i} V_{k i}, δ_{k i})$, for i = 1, …, n_k, k = 1, …, K. We assume that {T_ki, C_ki, V_ki, Z_ki, ξ_ki, A_ki; i = 1, …, n_k} are independent and identically distributed (i.i.d.) replicates of (T, C, V, Z, ξ, A) from stratum k, k = 1, …, K.
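As a purely illustrative sketch, the observed-data record and the missingness indicators defined above could be represented as follows. The class name `Subject`, its field names, and both helper functions are hypothetical and are not part of the original methodology; `None` is assumed to encode an unmeasured Z₂ or an unobserved mark.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Subject:
    """One observed-data record; all names here are hypothetical."""
    stratum: int                      # stratum label k
    x: float                          # observed follow-up time X
    delta: int                        # 1 if the failure was observed, else 0
    z1: Tuple[float, ...]             # phase-one covariates (always measured)
    z2: Optional[Tuple[float, ...]]   # phase-two covariates; None if not measured
    v: Optional[float]                # mark; None if not observed


def missing_indicators(s: Subject) -> Tuple[int, int]:
    """Return (xi_z, xi_v) following the conventions stated in the text."""
    xi_z = 1 if s.z2 is not None else 0
    # A censored subject has no mark by definition, so it is not counted as missing.
    xi_v = 1 if (s.delta == 0 or s.v is not None) else 0
    return xi_z, xi_v


def violates_mar_iii(subjects: List[Subject]) -> List[Subject]:
    """Subjects missing both Z2 and the mark, which MAR (iii) rules out."""
    return [s for s in subjects if missing_indicators(s) == (0, 0)]
```

A check of this kind may be useful before analysis, since implemented case-cohort designs can contain a few cases with happenstance missing Z₂.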

2.1. Nearest neighborhood hot deck imputation of missing marks

In the competing risks setting, V_ki is observable if a failure is observed, i.e., δ_ki = 1. If the mark value V_ki is not available for δ_ki = 1, then we have a missing mark indicated by ξ_v,ki = 0. The standard imputation approach involves first drawing the parameters of the posterior distribution of the missing variables given the observed data, and then drawing M sets of imputed values for the missing data from their posterior distribution given the observed data, cf. Rubin (1987). However, parametric multiple imputation can be sensitive to misspecification of the imputation model (Carroll et al., 1984).

NNHD imputation, as a hot deck imputation method, replaces each missing value with an observed response from a matching subject from the same dataset. Hot deck imputation methods have been studied by Little (1988), Reilly (1993), Chen and Shao (2000), Beretta and Santaniello (2016), among others. Using the hot deck method, we impute a missing value V of a subject by choosing at random from observed V values among matching donors. Donors are matched for their similarity in regard to some metric. This approach does not rely on model fitting for the variable to be imputed, and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric model.

We describe the NNHD imputation procedure as follows. Suppose V_ki is missing, in which case ξ_v,ki = 0. We impute missing values V_ki using hot deck imputation from donors with similar $H_{k i} = (T_{k i}, Z_{k i})$, or $H_{k i} = (T_{k i}, Z_{k i}, A_{v, k i})$ when a relevant A_v,ki is available. Let $d (H_{k i}, H_{k j})$ be a measure of similarity between $H_{k i}$ and $H_{k j}$. Each hot deck imputation of V_ki is obtained by randomly selecting a donor’s mark from the L-nearest neighborhood $L_{k i}$ matched based on the similarity measure $d (H_{k i}, H_{k j})$, where L is a number less than the number of non-missing marks for observed failures. Let R_ki,L be the Lth order statistic of $d (H_{k i}, H_{k j})$ for subjects with δ_kj = 1, ξ_v,kj = 1, j = 1, …, n_k. An L-nearest neighborhood of V_ki is defined as $L_{k i} = \{V_{k j} : d (H_{k i}, H_{k j}) \leq R_{k i, L}, δ_{k j} = 1, ξ_{v, k j} = 1, j = 1, \dots, n_{k}\}$. The implementation of the nearest neighborhood hot deck depends on the choice of metric and the variables included for the neighborhood selection. If some components of Z_ki are discrete, then the L-nearest neighborhood imputations are carried out based on the remaining variables $H_{k i}^{s}$ in $H_{k i}$ stratified by the values of the discrete components of Z_ki. Further, the similarity measure $d (H_{k i}^{s}, H_{k j}^{s})$ can be calculated based on the z-scores or the ranks of variables, which eliminates the effects of scales or units of the variables on the nearest neighbor selections. Let $V_{k i}^{(m)}$, m = 1, …, M, be M random selections from $L_{k i}$ with replacement. If V_ki is not missing, in which case ξ_v,ki = 1, then we let $V_{k i}^{(m)} = V_{k i}$.
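The procedure above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that are not in the original text: a single stratum, matching variables already standardized (for example to z-scores), Euclidean distance as the similarity measure d, and the hypothetical names `Record`, `euclidean` and `nnhd_impute`.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

# (matching variables H, failure indicator delta, mark V or None if missing)
Record = Tuple[Sequence[float], int, Optional[float]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity measure d(H_i, H_j); other metrics are possible."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nnhd_impute(records: List[Record], L: int, M: int, seed: int = 0) -> Dict[int, List[float]]:
    """Return {record index: M imputed marks} for failures with a missing mark."""
    rng = random.Random(seed)
    # Donors are observed failures with an observed mark.
    donors = [(h, v) for h, delta, v in records if delta == 1 and v is not None]
    if not 1 <= L < len(donors):
        raise ValueError("L must be at least 1 and less than the number of donors")

    imputed: Dict[int, List[float]] = {}
    for idx, (h, delta, v) in enumerate(records):
        if delta == 1 and v is None:
            dists = [euclidean(h, hd) for hd, _ in donors]
            radius = sorted(dists)[L - 1]  # L-th order statistic R_{ki,L}
            hood = [vd for (_, vd), d in zip(donors, dists) if d <= radius]
            # M draws with replacement from the L-nearest neighborhood.
            imputed[idx] = [rng.choice(hood) for _ in range(M)]
    return imputed
```

With ties in the distances the neighborhood can contain more than L donors, which matches the set definition based on R_ki,L. Discrete covariates would be handled by running the function separately within each level.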

NNHD imputation is related to variable-bandwidth L-nearest neighbors kernel smoothing that is widely used in nonparametric density estimation and regression, cf. Stone (1977), Li (1984) and Altman (1992). Every case with missing V has the same number of marks imputed from the L-nearest neighbors. The NNHD approach with a fixed number, L, of neighbors is similar to defining neighborhoods by a metric with varying bandwidth such as R_ki,L, whereas an alternative approach with a fixed bandwidth B is similar to allowing variable L. A fixed B-bandwidth neighborhood of V_ki is defined as $B_{k i} = \{V_{k j} : d (H_{k i}, H_{k j}) \leq B, δ_{k j} = 1, ξ_{v, k j} = 1, j = 1, \dots, n_{k}\}$. In this case, the number, L, of neighbors belonging to $B_{k i}$ varies among subjects with missing marks. An advantage of using fixed L is that the bandwidth is allowed to be larger when data are sparse, which is a common nonparametric smoothing approach to guard against incorporating too few points that could occur by using fixed bandwidth. While we study NNHD with fixed L, the method could also be implemented with a fixed bandwidth, and hence variable L.

Choosing the set, $H_{k i}$ , of variables for neighborhood selection is very important. Our investigation shows that NNHD imputation can lead to biased estimation without proper selection of the neighborhood. Let W = (T, Z, A_v) and ρ_k(v, W) = P(V ≤ v|δ = 1, W) be the conditional distribution of V given W for cases. For an observed value w = (t, z, a) of W of an individual in the kth stratum, ρ_k(v, w) = P(V ≤ v|δ = 1, W = w). Let g_k(a|t, v, z) = P(A_v,ki = a|T_ki = t, V_ki = v, Z_ki = z, δ_ki = 1) be the probability density of a possible auxiliary variable for V. By Sun and Gilbert (2012),

$$ρ_{k}(v, w) = \frac{\int_{0}^{v} λ_{k}(t, u ∣ z) g_{k}(a ∣ t, u, z) \, du}{\int_{0}^{1} λ_{k}(t, u ∣ z) g_{k}(a ∣ t, u, z) \, du}. \tag{2}$$

If A_v,ki is not available or is independent of V_ki given (T_ki, Z_ki, δ_ki), then $ρ_{k} (v, w) = \int_{0}^{v} λ_{k} (t, u ∣ z) d u / \int_{0}^{1} λ_{k} (t, u ∣ z) d u$. Equation (2) shows that the conditional distribution of V_ki depends on (T_ki, Z_ki, A_v,ki) in general. Unbiased imputation of V_ki should be selected from a neighborhood defined based on $H_{k i} = (T_{k i}, Z_{k i}, A_{v, k i})$ except for certain special situations where β(v) in model (1) does not change with v and A_v,ki is conditionally independent of Z_ki given (T_ki, V_ki, δ_ki). In this case, z cancels out from (2) under model (1). A simulation example in Section 4 shows that the NNHD imputation leads to biased estimation without including Z_2,ki in $H_{k i}$.
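A small numerical check can make the cancellation argument concrete. The sketch below evaluates the no-auxiliary version of (2) under model (1) at a fixed failure time t, for a scalar covariate z. The function name `rho`, the constant default baseline, and the midpoint-rule integration are illustrative assumptions, not part of the original paper.

```python
import math
from typing import Callable


def rho(
    v: float,
    z: float,
    beta: Callable[[float], float],
    baseline: Callable[[float], float] = lambda u: 1.0,
    n_grid: int = 2000,
) -> float:
    """P(V <= v | failure at t, Z = z) under model (1) with no auxiliary variable.

    `baseline(u)` plays the role of lambda_0k(t, u) at the fixed time t and
    `beta(u)` is the coefficient function; both are hypothetical inputs.
    """
    def integral(upper: float) -> float:
        # Midpoint rule for the integral of baseline(u) * exp(beta(u) * z) on [0, upper].
        step = upper / n_grid
        return step * sum(
            baseline((j + 0.5) * step) * math.exp(beta((j + 0.5) * step) * z)
            for j in range(n_grid)
        )

    return integral(v) / integral(1.0)


# Constant beta: z cancels, so the mark distribution is the same for every z.
rho_const_z0 = rho(0.3, 0.0, lambda u: 0.7)
rho_const_z2 = rho(0.3, 2.0, lambda u: 0.7)

# Mark-varying beta(u) = u: the mark distribution now depends on z.
rho_vary_z0 = rho(0.5, 0.0, lambda u: u)
rho_vary_z2 = rho(0.5, 2.0, lambda u: u)
```

In this toy setting the first pair agrees while the second pair differs, which is why donors for a missing mark generally need to be matched on Z as well as on T.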

2.2. Hybrid estimation procedures

We propose two hybrid approaches for estimation of model (1). The first approach follows the standard multiple imputation scheme of Rubin (1987) such that the NNHD estimator is the average of the AIPW estimates of Yang et al. (2017) for two-phase sampling of covariates for completed marks under each imputation and the variance estimator is adjusted using Rubin’s formula. The second approach utilizes multiple imputations in a single AIPW estimating equation.

Let Y_ki(t) = I(X_ki ≥ t) be the at-risk process. The sampling probabilities of the phase-two covariates are given by π_z,ki(t) = P_k{ξ_z,ki = 1|Ω_ki, δ_ki, Y_ki(t) = 1}. Suppose that ${\hat{π}}_{z, k i} (t)$ is an estimator of π_z,ki(t) based on parametric models as discussed in Yang et al. (2017). Let W_ki(t) = ξ_z,ki(π_z,ki(t))⁻¹ and ${\hat{W}}_{k i} (t) = ξ_{z, k i} {({\hat{π}}_{z, k i} (t))}^{- 1}$ . We define the marked counting processes for the completed marks by $N_{k i}^{(m)} (t, v) = I (X_{k i} \leq t, δ_{k i} = 1, V_{k i}^{(m)} \leq v)$ for m = 1, …, M. If V_ki is not missing, $V_{k i}^{(m)} = V_{k i}$ and $N_{k i}^{(m)} (t, v) = N_{k i} (t, v) \equiv I (X_{k i} \leq t, δ_{k i} = 1, V_{k i} \leq v)$ .

2.2.1. Hybrid estimation using standard multiple imputation

Standard MI estimation of β(v) uses the average of estimates obtained for each imputation. Following Yang et al. (2017), for the mth imputation, m = 1, …, M, let $\hat{β}_{R}^{(m)} (v)$ be the solution, for β = β(v) and v ∈ (0, 1), to the estimating equation $U^{(m)} (v, β) = 0$, where

$$U^{(m)}(v, β) = \sum_{k=1}^{K} \sum_{i=1}^{n_{k}} \int_{0}^{1} \int_{0}^{τ} K_{h}(u - v) \Big( \hat{W}_{ki}(t) \{Z_{ki} - \hat{Z}_{k}(t, β)\} + (1 - \hat{W}_{ki}(t)) \big[\hat{E}_{k}\{Z_{ki} ∣ Ω_{ki}, δ_{ki} V_{ki}^{(m)}\} - \hat{Z}_{k}(t, β)\big] \Big) N_{ki}^{(m)}(dt, du), \tag{3}$$

where K_h(x) = K(x/h)/h, K(·) is a kernel function and h the bandwidth, and $\hat{E}_{k} \{Z_{k i} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}\}$ and $\hat{Z}_{k} (t, β) = \hat{S}_{k}^{(1)} (t, β) / \hat{S}_{k}^{(0)} (t, β)$ are the estimates described in Yang et al. (2017).

In particular, $\hat{E}_{k} \{Z_{k i} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}\}$ is the estimate of $E_{k} \{Z_{k i} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}\}$, and

$$\hat{S}_{k}^{(j)}(t, β) = n_{k}^{-1} \sum_{i=1}^{n_{k}} \Big\{ \hat{W}_{ki}(t) Y_{ki}(t) \exp\{β^{T} Z_{ki}\} Z_{ki}^{\otimes j} + (1 - \hat{W}_{ki}(t)) Y_{ki}(t) \hat{E}_{k}\big[\exp\{β^{T} Z_{ki}\} Z_{ki}^{\otimes j} ∣ Ω_{ki}, δ_{ki} V_{ki}^{(m)}\big] \Big\}, \tag{4}$$

for j = 0, 1, and 2, where $\hat{E}_{k} [\exp\{β^{T} Z_{k i}\} Z_{k i}^{\otimes j} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}]$ is the estimate of the conditional expectation $E_{k} [\exp\{β^{T} Z_{k i}\} Z_{k i}^{\otimes j} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}]$. Write $β = (β_{1}^{T}, β_{2}^{T})^{T}$, where β₁ and β₂ are the coefficients for Z_1,ki(t) and Z_2,ki, respectively. Note that Z_1,ki(t) is a part of Ω_ki. For given β, the first part of $E_{k} \{Z_{k i} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}\}$ is Z_1,ki(t) and the second part is $E_{k} \{Z_{2, k i} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}\}$. Similarly, $E_{k} [\exp\{β^{T} Z_{k i}\} Z_{k i}^{\otimes j} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}]$, for j = 0, 1 and 2, depend on the observed data, and are functions of the conditional expectations $E_{k} [\exp\{β_{2}^{T} Z_{2, k i}\} Z_{2, k i}^{\otimes r} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}]$, r = 0, 1 and 2. Yang et al. (2017) considered using parametric models for $E_{k} \{g (Z_{2, k i}) ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}\}$ to obtain the estimate $\hat{E}_{k} \{g (Z_{2, k i}) ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}\}$, where g(Z_2,ki) is a specified function of Z_2,ki such as Z_2,ki, exp(β₂Z_2,ki) or Z_2,ki exp(β₂Z_2,ki).

By the standard multiple imputation scheme of Rubin (1987), the hybrid-Rubin estimator is defined by $\hat{β}_{R} (v) = M^{- 1} \sum_{m = 1}^{M} \hat{β}_{R}^{(m)} (v)$. The variance estimate of $\hat{β}_{R} (v)$, adjusted for multiple imputation using Rubin’s rule (Rubin, 1987, p. 76), equals

$$\hat{Var}(\hat{β}_{R}(v)) = M^{-1} \sum_{m=1}^{M} \hat{Var}(\hat{β}_{R}^{(m)}(v)) + (1 + M^{-1}) (M - 1)^{-1} \sum_{m=1}^{M} (\hat{β}_{R}^{(m)}(v) - \hat{β}_{R}(v))^{2}, \tag{5}$$

where $\hat{Var} ({\hat{β}}_{R}^{(m)} (v))$ is the variance estimator of Yang et al. (2017) based on the mth imputation. The first part accounts for within-imputation variability, and the second part ${(M - 1)}^{- 1} \sum_{m = 1}^{M} {({\hat{β}}_{R}^{(m)} (v) - {\hat{β}}_{R} (v))}^{2}$ for between-imputation variability. The term 1 + M⁻¹ corrects for bias due to the finite number of multiply imputed data sets.
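For a scalar coefficient at a fixed mark value v, the pooling in (5) reduces to a few arithmetic steps. The sketch below is illustrative only; the function name `rubin_combine` and its inputs (per-imputation estimates and their variance estimates) are hypothetical.

```python
from typing import List, Tuple


def rubin_combine(estimates: List[float], variances: List[float]) -> Tuple[float, float]:
    """Pool M per-imputation estimates and variances with Rubin's rule (scalar case)."""
    M = len(estimates)
    if M < 2 or len(variances) != M:
        raise ValueError("need at least two imputations and matching variances")
    pooled = sum(estimates) / M
    within = sum(variances) / M                                    # first term of (5)
    between = sum((b - pooled) ** 2 for b in estimates) / (M - 1)  # between-imputation part
    total = within + (1.0 + 1.0 / M) * between
    return pooled, total


# Toy inputs: three imputations with equal within-imputation variance.
pooled_beta, pooled_var = rubin_combine([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
```

In the toy call the between-imputation variance is 1, so the pooled variance is 0.1 + (4/3) × 1, which shows how strongly disagreement across imputations can dominate the total.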

2.2.2. Hybrid estimation via the estimating equations approach

This subsection proposes another hybrid approach that incorporates multiple imputations into a single estimating equation. A subject with a missing mark gets M imputed marks, which are associated with a particular subject and are dependent. The M imputed marks can be considered as a cluster. We consider the following hybrid estimating equation for β = β(v) for v ∈ (0, 1):

$$U(v, β) = M^{-1} \sum_{m=1}^{M} U^{(m)}(v, β) = 0, \tag{6}$$

where U^(m)(v, β) is defined in (3). The estimating equation (6) for the hybrid-MIEE resembles the generalized estimating equation (GEE) approach for repeated measures analysis that assumes working independence (Liang and Zeger, 1986). A subject with observed mark, in which case $V_{k i}^{(m)} = V_{k i}$, receives weight one, while the weight for a subject with missing mark is 1/M for each imputation. The estimator $\hat{β} (v)$ that solves U(v, β) = 0 is termed the hybrid multiple imputation estimating equation (hybrid-MIEE) estimator.

The estimator $\hat{β} (v)$ can be implemented using the Newton-Raphson iterative algorithm. Starting with an initial value β⁽⁰⁾(v), let β^(l)(v) be the estimate of β(v) at step l. The estimator $\hat{β} (v)$ is obtained by iterating steps (i) and (ii) as follows until convergence: (i) Estimate the conditional expectations $E_{k} [\exp\{(β_{2}^{(l)} (v))^{T} Z_{2, k i}\} Z_{2, k i}^{\otimes j} ∣ Ω_{k i}, δ_{k i} V_{k i}]$ for j = 0, 1, 2, and calculate $\hat{Z}_{k} (t, β^{(l)} (v))$; (ii) Update the estimate β^(l+1)(v) at step l + 1 by β^(l+1)(v) = β^(l)(v) − (∂U(v, β^(l)(v))/∂β)⁻¹ U(v, β^(l)(v)).
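The Newton-Raphson update in step (ii), applied to the pooled equation (6), can be illustrated with a scalar toy problem. This sketch makes several simplifying assumptions that are not in the original: the per-imputation score functions are hypothetical closed-form stand-ins for U^(m)(v, β), step (i) (re-estimating the conditional expectations) is omitted, and a central finite difference replaces the analytic derivative.

```python
import math
from typing import Callable, Sequence


def solve_pooled_equation(
    score_fns: Sequence[Callable[[float], float]],
    beta0: float = 0.0,
    tol: float = 1e-10,
    max_iter: int = 50,
    eps: float = 1e-6,
) -> float:
    """Newton-Raphson root of U(beta) = M^{-1} sum_m U^(m)(beta), scalar beta."""
    def pooled(beta: float) -> float:
        return sum(f(beta) for f in score_fns) / len(score_fns)

    beta = beta0
    for _ in range(max_iter):
        # Central difference stands in for the analytic derivative of U.
        slope = (pooled(beta + eps) - pooled(beta - eps)) / (2.0 * eps)
        step = pooled(beta) / slope
        beta -= step
        if abs(step) < tol:
            return beta
    raise RuntimeError("Newton-Raphson did not converge")


# Hypothetical per-imputation scores U^(m)(beta) = c_m - exp(beta).
toy_scores = [lambda b, c=c: c - math.exp(b) for c in (1.0, 2.0, 3.0)]

# MIEE-style pooling: average the scores first, then solve once.
beta_miee = solve_pooled_equation(toy_scores)

# Rubin-style pooling: solve each imputation separately, then average the roots.
beta_rubin = sum(solve_pooled_equation([f]) for f in toy_scores) / len(toy_scores)
```

Because the toy scores are nonlinear in β, the two pooling orders give different answers (log 2 versus the mean of log 1, log 2 and log 3), which illustrates in miniature how the hybrid-MIEE and hybrid-Rubin estimators can differ.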

Estimation of the stratified mark-specific PH model (1) also involves estimation of the baseline mark-specific hazard function λ_0k(t, v). The MIEE approach treats the multiple imputations for a given subject as a cluster. As such, the Nelson-Aalen type estimator $\hat{Λ}_{0 k} (t, v) = M^{- 1} \sum_{m = 1}^{M} \sum_{i = 1}^{n_{k}} \int_{0}^{t} \int_{0}^{v} \{n_{k} \hat{S}_{k}^{(0)} (s, \hat{β} (u))\}^{- 1} N_{k i}^{(m)} (d s, d u)$ is a natural estimator of the doubly cumulative baseline function $Λ_{0 k} (t, v) = \int_{0}^{t} \int_{0}^{v} λ_{0 k} (s, u) \, ds \, du$. The baseline function λ_0k(t, v) can be estimated by $\hat{λ}_{0 k} (t, v)$ obtained by smoothing the increments of the estimator $\hat{Λ}_{0 k} (t, v)$. For example, one can use kernel smoothing $\hat{λ}_{0 k} (t, v) = \int_{0}^{τ} \int_{0}^{1} K_{h_{1}}^{(1)} (t - s) K_{h_{2}}^{(2)} (v - u) \hat{Λ}_{0 k} (d s, d u)$, where $K_{h_{1}}^{(1)} (x) = K^{(1)} (x / h_{1}) / h_{1}$ and $K_{h_{2}}^{(2)} (x) = K^{(2)} (x / h_{2}) / h_{2}$, with K⁽¹⁾(·) and K⁽²⁾(·) kernel functions and h₁ and h₂ bandwidths. Other model parameters of interest discussed in Section 2.3 such as the overall conditional survival function, the conditional cumulative incidence function, and the mark-specific cumulative incidence function rate can be estimated under the same framework.
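Since the cumulative baseline estimator is a step function, the kernel-smoothed baseline hazard is a finite sum over its jumps. The following sketch assumes the Epanechnikov kernel for both K⁽¹⁾ and K⁽²⁾ (the text does not fix a kernel here) and takes the jump locations and increments as given inputs; the names `Jump`, `epanechnikov` and `smooth_baseline` are hypothetical.

```python
from typing import List, Tuple

# (failure time s, completed mark u, increment of the cumulative baseline at (s, u))
Jump = Tuple[float, float, float]


def epanechnikov(x: float) -> float:
    """Assumed kernel K(x) = 0.75 (1 - x^2) on [-1, 1]."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0


def smooth_baseline(t: float, v: float, jumps: List[Jump], h1: float, h2: float) -> float:
    """Kernel-smoothed baseline hazard at (t, v) from the jumps of the cumulative estimator."""
    total = 0.0
    for s, u, increment in jumps:
        k_time = epanechnikov((t - s) / h1) / h1   # K_{h1}(t - s)
        k_mark = epanechnikov((v - u) / h2) / h2   # K_{h2}(v - u)
        total += k_time * k_mark * increment
    return total
```

Under the MIEE pooling, each increment would carry the factor 1/M for an imputed mark, so a case with a missing mark contributes M smaller jumps at its imputed mark values. Boundary effects near t = 0, t = τ, v = 0 and v = 1 are ignored in this sketch.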

Next, we propose an estimator of the variance of $\hat{β} (v)$ . Let ${\hat{J}}_{k} (t, β) = {\hat{S}}_{k}^{(2)} (t, β) / {\hat{S}}_{k}^{(0)} (t, β) - {({\hat{Z}}_{k} (t, β))}^{\otimes 2}$ . The derivative of U(v, β) with respect to β equals $U^{'} (v, β) = - M^{- 1} \sum_{m = 1}^{M} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} \int_{0}^{1} \int_{0}^{τ} K_{h} (u - v) {\hat{J}}_{k} (t, β) N_{k i}^{(m)} (d t, d u)$ . Following the proof of Theorem 2 of Yang et al. (2017), we have the approximation

$$(n h)^{1/2} (\hat{β}(v) - β(v)) \approx (\hat{Σ}(v))^{-1} n^{-1/2} h^{1/2} M^{-1} \sum_{m=1}^{M} \sum_{k=1}^{K} \sum_{i=1}^{n_{k}} Q_{ki}^{(m)}(v), \tag{7}$$

where $\hat{Σ} (v) = - n^{- 1} U^{'} (v, \hat{β} (v))$ , and, similar to Yang et al. (2017), $Q_{k i}^{(m)} (v)$ is approximated by

$$\hat{Q}_{ki}^{(m)}(v) = \int_{0}^{1} \int_{0}^{τ} K_{h}(u - v) \hat{W}_{ki}(t) [Z_{ki} - \hat{Z}_{k}(t, \hat{β}(u))] \hat{M}_{ki}^{(m)}(dt, du) + \int_{0}^{1} \int_{0}^{τ} K_{h}(u - v) (1 - \hat{W}_{ki}(t)) \hat{M}_{ki, \bar{z}}^{o(m)}(dt, du). \tag{8}$$

Here $\hat{M}_{k i}^{(m)} (d t, d u) = N_{k i}^{(m)} (d t, d u) - Y_{k i} (t) \exp\{\hat{β} (u)^{T} Z_{k i}\} \hat{Λ}_{0 k} (d t, d u)$ and $\hat{M}_{k i, \bar{z}}^{o (m)} (d t, d u)$ is the estimator of $M_{k i, \bar{z}}^{o (m)} (d t, d u)$ given by

$$\big[E\{Z_{ki} ∣ Ω_{ki}, δ_{ki} V_{ki}^{(m)}\} - \bar{z}_{k}(t, β_{0}(u))\big] N_{ki}^{(m)}(dt, du) - \Big(E\big[Z_{ki} \exp\{β_{0}^{T}(v) Z_{ki}\} ∣ Ω_{ki}, δ_{ki} V_{ki}^{(m)}\big] - \bar{z}_{k}(t, β_{0}(u)) E\big[\exp\{β_{0}^{T}(v) Z_{ki}\} ∣ Ω_{ki}, δ_{ki} V_{ki}^{(m)}\big]\Big) Y_{ki}(t) λ_{0k}(t, u) \, dt \, du.$$

Hence, ${\hat{M}}_{k i, \bar{z}}^{o (m)} (d t, d u)$ is obtained by replacing ${\bar{z}}_{k} (t, β_{0} (u))$ with ${\hat{Z}}_{k} (t, \hat{β} (u))$ , λ_0k(t, u) dtdu with ${\hat{Λ}}_{0 k} (d t, d u)$ , and by replacing $E {Z_{k i} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}}$ , $E [exp {β_{0}^{T} (v) Z_{k i}} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}]$ and $E [Z_{k i} exp {β_{0}^{T} (v) Z_{k i}} ∣ Ω_{k i}, δ_{k i} V_{k i}^{(m)}]$ with their estimates.

Using Rubin’s idea to account for the between-imputation variability, we propose to estimate the variance of $\hat{β} (v)$ by ${\hat{Σ}}_{\hat{β}} (v) = {(\hat{Σ} (v))}^{- 1} {\hat{Σ}}_{R}^{*} (v) {(\hat{Σ} (v))}^{- 1} / (n h)$ , where

{\hat{Σ}}_{R}^{*} (v) = \frac{h}{n} [\frac{1}{M} \sum_{m = 1}^{M} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {{\hat{Q}}_{k i}^{(m)} (v)}^{\otimes 2} + \frac{M + 1}{M} \frac{1}{M - 1} \sum_{m = 1}^{M} {(U^{(m)} (v, \hat{β}) - U (v, \hat{β}))}^{\otimes 2}] .

(9)

In Web Appendix A, we present heuristic arguments to show that the proposed hybrid-MIEE and hybrid-Rubin estimators are unbiased in large samples when NNHD imputation is used and the model assumptions given by Yang et al. (2017) hold. The hybrid-MIEE and hybrid-Rubin estimators also enjoy double robustness properties similar to those of the AIPW estimators of Yang et al. (2017).

Parametric multiple imputation often uses between 2 and 10 imputations (Rubin, 1987, p. 15). Reilly (1993) recommended that hot deck estimation be performed with 3, 5 and 10 imputations.
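The between-imputation term in (9) has the same structure as Rubin's combining rule for a scalar estimate. The following is a minimal sketch of that scalar rule, assuming M point estimates and their within-imputation variances are available. The function name and the toy numbers are illustrative and not from the paper, and the paper applies the adjustment to the estimating functions rather than to the estimates themselves.

```python
from statistics import mean, variance


def rubin_total_variance(estimates, within_variances):
    """Combine M imputation-specific results with Rubin's rule (scalar case).

    total = mean within-imputation variance
            + (M + 1) / M * between-imputation sample variance
    """
    m = len(estimates)
    if m < 2 or len(within_variances) != m:
        raise ValueError("need M >= 2 estimates with matching variances")
    within = mean(within_variances)
    between = variance(estimates)  # sample variance with divisor M - 1
    return within + (m + 1) / m * between


# Illustrative values only (hypothetical, M = 3 imputations).
total = rubin_total_variance([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The factor (M + 1)/M shrinks toward 1 as M grows, so the between-imputation penalty is largest for the small values of M discussed above.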

2.3. Estimation of the mark-specific cumulative incidence function rate

By definition of the conditional mark-specific hazard function, λ_k(t, v|z) dv measures the instantaneous rate of failure at time t with failure type/mark (e.g., dengue sequence distance) V ∈ [v, v+dv) in the presence of all other possible failure types (dengue viruses with different sequence distances) for a very small dv. In this section, we introduce the mark-specific cumulative incidence function rate that provides interpretable results (e.g., through visual display of estimates) and is useful for prediction. The conditional cumulative incidence function (CIF) for stratum k is defined by $F_{k} (t, v ∣ z) = P (T_{k} \leq t, V_{k} \leq v ∣ Z_{k} = z)$ , which has the interpretation of the classical CIF as the conditional probability of failure by time t with failure cause V_k ≤ v. The conditional mark-specific cumulative incidence function rate (CIFR) f_k,v(t, v|z) is the derivative of $F_{k} (t, v ∣ z)$ with respect to v. The quantity f_k,v(t, v|z) dv is the conditional probability that failure with mark V ∈ [v, v + dv) occurs by time t. The cumulative incidence of failure with mark V in an interval (v₁, v₂] ⊂ (0, 1) is given by $P (T_{k} \leq t, v_{1} < V_{k} \leq v_{2} ∣ Z_{k} = z) = \int_{v_{1}}^{v_{2}} f_{k, v} (t, v ∣ z) d v$ .

While λ_k(t, v|z) is useful for measuring the instantaneous rate of failure occurrence at time t for those at risk, the mark-specific CIFR f_k,v(t, v|z) is useful to estimate/predict the probability of failure by time t with V ∈ [v, v+dv). As with the classical competing risks model, the mark-specific hazard function is related to the CIFR through the simple formula $f_{k, v} (t, v ∣ z) = exp {β {(v)}^{T} z} \int_{0}^{t} S_{k} (s ∣ z) A_{0 k} (d s, v)$ , where $A_{0 k} (t, v) = \int_{0}^{t} λ_{0 k} (s, v) d s$ and S_k(t|z) is the conditional overall survival function of T_k given Z_k = z that is given by $S_{k} (t ∣ z) = exp {- \int_{0}^{1} A_{0 k} (t, v) exp {β {(v)}^{T} z} d v}$ under model (1). The CIF and CIFR can be estimated by plugging in the estimates of β(v) and λ_0k(t, v). The details of estimation are given in Web Appendix B. Under the hybrid-MIEE approach with multiple imputations, the relationships among the estimated conditional mark-specific hazard function, CIF and CIFR are the same as those among their population quantities; this is not the case for the hybrid-Rubin approach.
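The plug-in formula can be evaluated numerically once β(v) and λ_0k(t, v) (or their estimates) are available as functions. The sketch below is illustrative only: it uses a midpoint rule on user-chosen grids, the function names are hypothetical, and the example inputs are assumed known functions rather than estimates from data.

```python
import math


def linpred(beta, v, z):
    """Linear predictor beta(v)^T z for a coefficient function beta."""
    return sum(b * x for b, x in zip(beta(v), z))


def cifr(t, v, z, beta, lam0, n_s=400, n_v=200):
    """Return (f(t, v | z), S(t | z)) by midpoint-rule integration.

    f(t, v | z) = exp{beta(v)^T z} * int_0^t S(s | z) lam0(s, v) ds,
    S(s | z)    = exp{-int_0^1 A0(s, u) exp{beta(u)^T z} du}.
    """
    ds, dv = t / n_s, 1.0 / n_v
    v_mid = [(j + 0.5) * dv for j in range(n_v)]
    risk = [math.exp(linpred(beta, u, z)) for u in v_mid]
    cum_hazard = 0.0  # overall cumulative hazard up to the start of the step
    integral = 0.0
    for i in range(n_s):
        s = (i + 0.5) * ds
        # Overall hazard at time s: integrate the mark-specific hazard over marks.
        total_rate = sum(lam0(s, u) * r for u, r in zip(v_mid, risk)) * dv
        surv_mid = math.exp(-(cum_hazard + 0.5 * total_rate * ds))
        integral += surv_mid * lam0(s, v) * ds
        cum_hazard += total_rate * ds
    return math.exp(linpred(beta, v, z)) * integral, math.exp(-cum_hazard)


# Illustrative inputs (assumed, not estimated): a baseline constant in time.
beta = lambda v: (-0.2 * v, -1.0 + 1.5 * v)
lam0 = lambda s, v: math.exp(-0.3 * v)
f_val, surv = cifr(t=1.0, v=0.4, z=(1.5, 0.5), beta=beta, lam0=lam0)
```

When the baseline does not depend on time, the result can be checked against the closed form exp{β(v)ᵀz} λ₀(v) (1 − e^{−ct})/c, where c is the integral over marks of λ₀(u) exp{β(u)ᵀz}.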

3. Statistical inferences for β(v)

We develop procedures for testing two sets of hypotheses regarding β(v). Let β_r(v) be the rth component of β(v), 1 ≤ r ≤ p. We first test the null hypothesis H₁₀: β_r(v) = 0 for v ∈ [a, b] ⊂ (0, 1) against the general alternative H_1a: β_r(v) ≠ 0 for at least some v ∈ [a, b], and against the monotone alternative H_1m: β_r(v) ≤ 0 with β_r(v) < 0 for some v ∈ [a, b]. The testing procedure can be used to test β_r(v) ≥ 0 with simple modifications. The second null hypothesis H₂₀ is that β_r(v) does not depend on v for v ∈ [a, b]. We test H₂₀ against the general alternative H_2a that β_r(v) depends on v for v ∈ [a, b] and the monotone alternative H_2m that β_r(v) is a monotone increasing function. The test can be modified to test the monotone alternative that β_r(v) is a monotone decreasing function. The tests of H₁₀ are helpful for identifying covariates that are correlated with risk for at least some failure types/marks. The tests of H₂₀ evaluate whether the strength of association of a covariate with risk varies with values of the failure type/mark V.

We construct the following test procedures based on the hybrid-MIEE estimator of β(v). Let 0 < v₁ < ⋯ < v_G < 1 be a grid of G points in the range of the marks (0, 1). By Aalen and Johansen (1978) and Sun et al. (2009), it can be shown that $\hat{β} (v_{1}), \dots, \hat{β} (v_{G})$ are asymptotically independent and approximately normal. The estimated variance of ${\hat{β}}_{r} (v)$ , $\hat{Var} ({\hat{β}}_{r} (v))$ , is the rth element on the diagonal of ${\hat{Σ}}_{\hat{β}} (v)$ . Let ${\hat{β}}_{r} = {({\hat{β}}_{r} (v_{1}), {\hat{β}}_{r} (v_{2}), \dots, {\hat{β}}_{r} (v_{G}))}^{T}$ . We propose the following test statistic to test H₁₀: β_r(v) = 0 against H_1a: β_r(v) ≠ 0:

T_{1 a} = \sum_{g = 1}^{G} {({\hat{β}}_{r} (v_{g}))}^{2} / \hat{Var} ({\hat{β}}_{r} (v_{g})) .

The following test statistic is used to test H₁₀ against H_1m: β_r(v) ≤ 0:

T_{1 m} = \sum_{g = 1}^{G} {\hat{β}}_{r} (v_{g}) / {(\hat{Var} ({\hat{β}}_{r} (v_{g})))}^{1 / 2} .

Under the null hypothesis H₁₀, T_1a has an approximately chi-square distribution $χ_{G}^{2}$ with G degrees of freedom, and T_1m has an approximately normal distribution with mean zero and variance G. A larger value of T_1a indicates departure from H₁₀: we reject H₁₀ in favor of H_1a at significance level α if T_1a is greater than the 100(1 − α)th percentile of $χ_{G}^{2}$ . A smaller value of T_1m shows evidence in favor of H_1m: we reject H₁₀ at significance level α if T_1m is less than the 100αth percentile of N(0, G).
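To make the two statistics concrete, the following sketch computes T_1a, T_1m and their approximate p-values from grid-point estimates and estimated variances. It is an illustration only: the function names and the input values are hypothetical, and the chi-square tail probability is computed with a standard closed-form recursion for integer degrees of freedom so that only the Python standard library is needed.

```python
import math
from statistics import NormalDist


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0.0:
        return 1.0
    if df % 2 == 0:
        q, k = math.exp(-x / 2.0), 2          # exact tail for df = 2
    else:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1  # exact tail for df = 1
    while k < df:
        # Recursion Q(k + 2) = Q(k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1).
        q += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def h10_tests(beta_hat, var_hat):
    """T_1a and T_1m with p-values, given estimates on G grid points."""
    g = len(beta_hat)
    t1a = sum(b * b / s for b, s in zip(beta_hat, var_hat))
    t1m = sum(b / math.sqrt(s) for b, s in zip(beta_hat, var_hat))
    return {
        "T1a": t1a,
        "p_general": chi2_sf(t1a, g),                        # reject for large T_1a
        "T1m": t1m,
        "p_monotone": NormalDist(0.0, math.sqrt(g)).cdf(t1m),  # reject for small T_1m
    }


# Hypothetical estimates on G = 3 grid points.
result = h10_tests([-0.2, -0.2, -0.2], [0.01, 0.01, 0.01])
```

Because T_1m sums signed standardized estimates, it gains power when all β_r(v_g) share the same sign, whereas T_1a is sensitive to departures in either direction.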

To test the null hypothesis H₂₀ that the rth component β_r(v) does not depend on v, we let $Q_{\hat{β}} = {({\hat{β}}_{r} (v_{2}) - {\hat{β}}_{r} (v_{1}), \dots, {\hat{β}}_{r} (v_{G}) - {\hat{β}}_{r} (v_{G - 1}))}^{T}$ . Then $Q_{\hat{β}} = A {\hat{β}}_{r}$ , where A is the (G − 1) × G matrix with −1 as the (i, i)th element, 1 as the (i, i + 1)th element for i = 1, …, G − 1, and the rest of the elements zero. Thus the covariance matrix of $Q_{\hat{β}}$ is $C o v (Q_{\hat{β}}) = A C o v ({\hat{β}}_{r}) A^{T}$ , where $C o v ({\hat{β}}_{r})$ is the diagonal matrix with $Var ({\hat{β}}_{r} (v_{g}))$ , g = 1, …, G, on the diagonal. The following are the two test statistics for testing H₂₀:

T_{2 a} = {(Q_{\hat{β}})}^{T} {(\hat{C o v} (Q_{\hat{β}}))}^{- 1} Q_{\hat{β}}, and T_{2 m} = J^{T} {(\hat{C o v} (Q_{\hat{β}}))}^{- 1 / 2} Q_{\hat{β}},

where $\hat{C o v} (Q_{\hat{β}}) = A \hat{C o v} ({\hat{β}}_{r}) A^{T}$ , $\hat{C o v} ({\hat{β}}_{r})$ is the diagonal matrix with $\hat{Var} ({\hat{β}}_{r} (v_{g}))$ , g = 1, …, G, on the diagonal, and J is a (G − 1)-dimensional vector of 1’s. Under H₂₀, T_2a has an approximately chi-square distribution $χ_{G - 1}^{2}$ with G − 1 degrees of freedom, and T_2m has an approximately normal distribution with mean zero and variance G − 1. We reject H₂₀ in favor of H_2a at significance level α if T_2a is greater than the 100(1 − α)th percentile of $χ_{G - 1}^{2}$ and reject H₂₀ in favor of H_2m if T_2m is greater than the 100(1 − α)th percentile of N(0, G − 1).
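As an illustration of the quadratic form T_2a, the sketch below builds the covariance of the successive differences from the grid-point variances and solves the resulting linear system directly. Because the grid-point estimators are treated as independent, A Cov(β̂_r) Aᵀ is tridiagonal. The function names and inputs are hypothetical; T_2m is omitted because it additionally requires a matrix square root.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def t2a_statistic(beta_hat, var_hat):
    """T_2a = Q^T Cov(Q)^{-1} Q for successive differences Q of beta_hat."""
    g = len(beta_hat)
    q = [beta_hat[i + 1] - beta_hat[i] for i in range(g - 1)]
    cov = [[0.0] * (g - 1) for _ in range(g - 1)]
    for i in range(g - 1):
        cov[i][i] = var_hat[i] + var_hat[i + 1]
        if i + 1 < g - 1:
            # Adjacent differences share one grid point, with opposite signs.
            cov[i][i + 1] = cov[i + 1][i] = -var_hat[i + 1]
    x = solve(cov, q)
    return sum(qi * xi for qi, xi in zip(q, x))


# Hypothetical estimates on G = 3 grid points; compare with chi-square(G - 1).
stat = t2a_statistic([0.0, 1.0, 2.0], [1.0, 1.0, 1.0])
```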

In practice, we recommend that G take a value from 3 to 5, with approximately evenly spaced grid points whose spacing is greater than the bandwidth, for better approximations of the null distributions of the test statistics.

4. Simulation study

We conducted a simulation study to evaluate the finite-sample performance of the proposed estimation and hypothesis testing procedures. Let U₁, U₂ and U₃ be independent uniformly distributed random variables on (0, 1). Let Z₁ = U₁ +2U₃ be a phase-one covariate, and Z₂ = −U₁+2U₂ a phase-two covariate, with resulting correlation coefficient of −0.2. We study the scenario of one stratum K = 1. The (T_1i, V_1i) are generated from the following mark-specific proportional hazards model:

λ (t, v ∣ Z) = λ_{0} (t, v) exp {β_{1} (v) Z_{1} + β_{2} (v) Z_{2}}, t \geq 0, 0 \leq v \leq 1,

(10)

where the mark-specific baseline function is λ₀(t, v) = exp(−0.3v), β₁(v) = −0.2v and β₂(v) = α+θv. We study performance of the hypothesis tests of H₁₀ and H₂₀ for β₂(v). The parameters α and θ are chosen to examine the sizes and powers of the proposed tests. All failure times greater than τ = 2.0 are right-censored at τ. Censoring times are generated from an exponential distribution, independent of (T, V), with parameter adjusted so that the overall censoring rate during follow-up is approximately 40%.
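One way to draw (T, V) from model (10) exploits the fact that the hazard does not depend on t: given Z, the failure time is exponential with rate equal to the hazard integrated over the mark, and the mark is independent of the time with density proportional to exp(cv) on [0, 1], where c = −0.3 − 0.2Z₁ + θZ₂. The sketch below is an assumption about how such data could be generated, not the authors' code. It applies only the administrative censoring at τ and omits the random exponential censoring, whose rate is not stated.

```python
import math
import random

TAU = 2.0  # administrative censoring time from the text


def draw_subject(alpha, theta, rng):
    """Draw (X, delta, V, Z1, Z2) under model (10), censoring only at TAU."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    z1 = u1 + 2.0 * u3
    z2 = -u1 + 2.0 * u2
    # lambda(t, v | Z) = exp(alpha * z2) * exp(c * v), constant in t.
    c = -0.3 - 0.2 * z1 + theta * z2
    small = abs(c) < 1e-12
    mass = 1.0 if small else math.expm1(c) / c  # integral of exp(c v) on [0, 1]
    total_rate = math.exp(alpha * z2) * mass
    t = rng.expovariate(total_rate)
    # Inverse-CDF draw of the mark from the density proportional to exp(c v).
    u = rng.random()
    v = u if small else math.log1p(u * math.expm1(c)) / c
    if t > TAU:
        return TAU, 0, None, z1, z2  # mark is unobserved for censored subjects
    return t, 1, v, z1, z2


rng = random.Random(2020)
sample = [draw_subject(alpha=-1.0, theta=1.5, rng=rng) for _ in range(5)]
```

With these covariate definitions, Cov(Z₁, Z₂) = −Var(U₁) = −1/12 and Var(Z₁) = Var(Z₂) = 5/12, which gives the correlation of −0.2 stated above.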

For the two-phase sampling, we consider a simple Bernoulli random sample taken separately for cases and controls, with selection probability π_z,1i = 1 for cases (δ_1i = 1) and 0.5 for controls (δ_1i = 0). Suppose there is an auxiliary variable A_z correlating with Z₂, A_z = Z₂ + ϵ, where ϵ is normally distributed with mean zero and standard deviation 0.5, which corresponds to a Pearson correlation coefficient between Z₂ and A_z of ρ = 0.75. The conditional expectations involving the phase-two covariate Z₂ are estimated using linear models with (1, δ, Z₁, A_z, δZ₁, δA_z) as predictors based on the subjects with observed Z₂. Covariates (1, Z₁, A_z) are used for estimating the logit linear model for π_z,1i for subjects with δ_1i = 0.

The mark V_1i is missing following the conditional probability logit(π_v,1i) = logit(P(ξ_v,1i = 1|Ω_1i)) = 0.3Z_1,1i + 0.8 for δ_1i = 1, yielding about 22% missing marks. The hot deck imputation of a missing V_ki is obtained from donors from the same stratum k with δ_ki = 1 and with similar $H_{k i}$ defined by Euclidean distance and z-scores of (T_ki, Z_ki); for our simulations we study only one stratum k = 1. The L-nearest neighborhood imputations are carried out based on $H_{1 i}$ with the Euclidean metric. By considering the z-scores of variables, we eliminate the effects of scales or units of the variables on the nearest neighbor selections. We consider M = 3 and M = 5 imputations from the M-nearest neighborhoods for the cases with missing marks.
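The neighborhood selection described above can be sketched as follows. This is a hedged illustration: the data layout and function name are hypothetical, and drawing the M imputations with replacement from the L nearest donors is an assumption, since the text does not spell out the sampling scheme within a neighborhood.

```python
import math
import random
from statistics import mean, pstdev


def nnhd_impute(cases, n_neighbors, n_imputations, rng):
    """Nearest-neighborhood hot deck imputation of missing marks among cases.

    cases: list of dicts with 'h' (tuple such as (T, Z1, Z2)) and 'mark'
    (float, or None if missing). Returns {case index: list of imputed marks}.
    """
    dim = len(cases[0]["h"])
    cols = [[c["h"][d] for c in cases] for d in range(dim)]
    mu = [mean(col) for col in cols]
    sd = [pstdev(col) or 1.0 for col in cols]  # guard against a constant column

    def zscore(h):
        return [(x - m) / s for x, m, s in zip(h, mu, sd)]

    # Donors are cases with an observed mark.
    donors = [(zscore(c["h"]), c["mark"]) for c in cases if c["mark"] is not None]
    imputed = {}
    for idx, c in enumerate(cases):
        if c["mark"] is None:
            zi = zscore(c["h"])
            nearest = sorted(donors, key=lambda d: math.dist(zi, d[0]))[:n_neighbors]
            # Assumed scheme: sample donors with replacement within the neighborhood.
            imputed[idx] = [rng.choice(nearest)[1] for _ in range(n_imputations)]
    return imputed


# Toy data (hypothetical): the last case has a missing mark.
toy = [
    {"h": (0.10, 0.00), "mark": 0.10},
    {"h": (0.20, 0.10), "mark": 0.20},
    {"h": (1.90, 2.00), "mark": 0.90},
    {"h": (1.80, 1.90), "mark": 0.80},
    {"h": (0.15, 0.05), "mark": None},
]
draws = nnhd_impute(toy, n_neighbors=2, n_imputations=5, rng=random.Random(1))
```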

The performance of the proposed test procedures is evaluated through simulations under model (10) for the parameter settings P1 to P5, defined by P1: (α, θ) = (0, 0); P2: (α, θ) = (−0.4, 0); P3: (α, θ) = (−0.5, 0); P4: (α, θ) = (−1, 1.5); P5: (α, θ) = (−1, 2). P1 and P3 are models under the null hypotheses H₁₀ and H₂₀, respectively; P2-P3 are H_1m alternatives to H₁₀, and P4-P5 are H_2m alternatives to H₂₀.

The Epanechnikov kernel K(x) = .75(1−x²)I{|x| ≤ 1} is used for the kernel smoothing. The bandwidth is selected using the formula $h = C {\hat{σ}}_{v} n^{- 1 / 3}$ , where ${\hat{σ}}_{v}$ is the estimated standard deviation of the observed marks for uncensored failure times and C is a constant ranging from 2 to 5. Sun et al. (2016) and Yang et al. (2017) showed that this formula works well in simulations. A larger C can be used if the distribution of the observed marks is skewed or marks are sparse in some areas. Alternatively, the formula $h = C {\hat{σ}}_{v} n_{o}^{- 1 / 3}$ has also been used in situations with a very large phase-one sample and low event rate (Yang et al., 2017), where n_o is the observed number of events. The values of ${\hat{σ}}_{v}$ under model (10) for the settings P1 to P5 are approximately 0.29, yielding $h = 4 {\hat{σ}}_{v} n^{- 1 / 3} = 0.15$ for n = 500 and h = 0.13 for n = 800. We also studied the impact of using larger bandwidths: h = 0.20 for n = 500 and h = 0.17 for n = 800.
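The kernel and the bandwidth rule translate directly into a few lines of code. The sketch below is illustrative, with hypothetical function names. It also computes how much of the scaled kernel's mass falls inside the mark range [0, 1], which is the quantity that drops near the endpoints of the mark range.

```python
def epanechnikov(x):
    """K(x) = 0.75 (1 - x^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0


def scaled_kernel(x, h):
    """K_h(x) = K(x / h) / h."""
    return epanechnikov(x / h) / h


def bandwidth(c, sigma_v, n):
    """h = C * sigma_v * n^(-1/3)."""
    return c * sigma_v * n ** (-1.0 / 3.0)


def kernel_mass_in_unit_interval(v, h, steps=2000):
    """Midpoint-rule integral of K_h(u - v) over u in [0, 1]."""
    du = 1.0 / steps
    return sum(scaled_kernel((j + 0.5) * du - v, h) for j in range(steps)) * du


h = bandwidth(4, 0.29, 500)                       # setting with n = 500
interior = kernel_mass_in_unit_interval(0.5, h)   # full kernel inside [0, 1]
boundary = kernel_mass_in_unit_interval(0.0, h)   # kernel truncated at v = 0
```

At v = 0 only half of the symmetric kernel lies inside the mark range, so fewer observed marks contribute to the local estimating equation there than at interior points.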

We estimate β(v) over 21 evenly spaced grid points in [0, 1] with spacing 0.05 such that v₁ = 0, v₂ = 0.05, …, v₂₁ = 1. The initial value for estimating β(v₁) is set to zero. Each estimate is then used to start the iteration at the next grid point: $\hat{β} (v_{i - 1})$ is used as the initial value for estimating β(v_i) for i = 2, …, 21. The Newton-Raphson iterative algorithm proposed in Section 2.2.2 is not overly sensitive to the choice of initial values.

Figure 1 shows the simulation results for estimating β(v) = (β₁(v), β₂(v))^T under the setting P4 for model (10) without the auxiliary variable A_z, based on 1000 simulations using bandwidths h = 0.13 and h = 0.15. Missing marks are imputed with M = 5 imputations from the 5-nearest neighborhoods $L_{1 i}$ of cases with missing marks, calculated using Euclidean distance and z-scores of the $H_{1 j} = (T_{1 j}, Z_{1 j})$ for cases (with δ_1j = 1). In the figures, Bias is the empirical bias, SEE is the sample standard error of the estimator, ESE is the sample mean of the estimated standard errors, and CP is the 95% empirical coverage probability. Figure 2 compares the estimates using different neighborhood selections under the setting P4 for model (10).

Figure 2. — Bias, SEE, ESE and CP for ${\hat{β}}_{1} (v)$ and ${\hat{β}}_{2} (v)$ for n=800 under the setting P4 for model (10) with M = 5 imputations from the 5-nearest neighborhoods of cases with missing marks based on 1000 simulations. MIEE-NN(T,Z) is for the hybrid-MIEE estimator with the 5-nearest neighborhoods $L_{1 i}$ calculated using Euclidean distance and z-scores of the $H_{1 j} = (T_{1 j}, Z_{1 j})$ for cases (with δ_1j = 1), while MIEE-NN(T) is the same except z-scores of the $H_{1 j} = (T_{1 j})$ are used. The legends Rubin-NN(T,Z) and Rubin-NN(T) are defined similarly for the hybrid-Rubin estimator. The red lines are for the hybrid-MIEE estimator while the blue lines are for the hybrid-Rubin estimator.

Additional simulation studies are presented in Web Appendix C, which includes simulation results under setting P3 of model (10), and for a different mark-specific proportional hazards model with K = 2 strata. Web Appendix C also includes a real data simulation study that applies the proposed methods to a data set generated based on the CYD14 trial data.

The simulation study shows that, when the 5-nearest neighborhoods $L_{1 i}$ of cases with missing marks are calculated using Euclidean distance and z-scores of the $H_{k j} = (T_{k j}, Z_{k j})$ for cases, the biases of both the hybrid-MIEE estimator and the hybrid-Rubin estimator are very small except in the left and right tails for β₂(v) (which are the expected boundary effects in nonparametric estimation). The pointwise coverage probabilities are slightly below but very close to 95% for v ∈ (0, 1) except in the left and right tails for β₂(v), indicating adequate performance of the proposed variance estimators. For larger bandwidths, the SEE and ESE of the proposed estimators are smaller. The study also shows that estimation based on the L-nearest neighborhoods calculated using only a subset of $H_{k j} = (T_{k j}, Z_{k j})$ can yield much larger biases unless β(v) does not depend on v. In particular, under the setting P4 for model (10), Figure 2(b) shows that using $H_{k j} = (T_{k j})$ yields much larger biases than using $H_{k j} = (T_{k j}, Z_{k j})$ . We also notice from Web Figure 3 in Web Appendix C that the biases are small for both selections of $H_{k j}$ under setting P3 since β(v) does not vary with v in this setting.

Using the 5-nearest neighborhoods $L_{1 i}$ calculated using Euclidean distance and z-scores of the $H_{1 j} = (T_{1 j}, Z_{1 j})$ for all cases (with δ_1j = 1), Figure 3 shows the simulation results for estimating the conditional mark-specific cumulative incidence function rate f_1,v(t, v|z) at Z₁ = 1.5 and at the 10th, 50th and 90th percentiles of Z₂ for t = 1 and n = 800 under the setting P4 for model (10) with M = 5 based on 1000 simulations using bandwidth h = 0.13. The figures show that the averages of the estimated f_1,v(t, v|z) are close to the true values f_1,v(t, v|z).

The simulations are carried out to examine the performances of the proposed tests with the nearest neighborhoods $L_{1 i}$ calculated using Euclidean distance and z-scores of the $H_{k j} = (T_{k j}, Z_{k j})$ for subjects with δ_1j = 1. Table 1 presents the empirical sizes and powers of the tests T_1a and T_1m for testing H₁₀ and the tests T_2a and T_2m for testing H₂₀ at nominal level 0.05 using M = 3 and M = 5 imputations from the M-nearest neighborhoods based on 1000 simulations. The test statistics are calculated using bandwidth h = 0.15 for n = 500 and h = 0.13 for n = 800 and using G = 3 grid points with v₁ = 0.2, v₂ = 0.5, v₃ = 0.8. The empirical sizes for testing H₁₀ under P1 and for testing H₂₀ under P3 are slightly higher but very close to the nominal level 0.05, indicating adequate performance of the proposed tests. The powers of the tests for testing H₁₀ increase as the model moves from P1 to P3, while the powers of the tests for testing H₂₀ increase as the model moves from P3 to P5, representing increasing departures from the null hypotheses H₁₀ and H₂₀, respectively. Powers of the tests with auxiliary variable A_z are slightly higher than those without using A_z. Powers of the tests are not overly sensitive to the number of imputations M, but seem to increase slightly for larger bandwidth.
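When reading empirical sizes based on 1000 replications, it helps to keep the Monte Carlo error in mind. The short sketch below (an illustrative aid with a hypothetical function name, using the normal approximation to the binomial) gives the range in which the empirical size of an exactly nominal 0.05-level test would fall about 95% of the time.

```python
import math


def monte_carlo_range(p, n_sim, z=1.96):
    """Approximate 95% range of an empirical rejection rate with true rate p."""
    se = math.sqrt(p * (1.0 - p) / n_sim)  # binomial standard error
    return p - z * se, p + z * se


low, high = monte_carlo_range(0.05, 1000)
```

With 1000 simulations the standard error is about 0.007, so the range is roughly 0.036 to 0.064.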

Table 1.

Empirical sizes and powers of the test statistics T_1a and T_1m for testing H₁₀ and the test statistics T_2a and T_2m for testing H₂₀ under model (10) with M = 3 and M = 5 imputations from the M-nearest neighborhoods of the missing marks at nominal level 0.05 based on 1000 simulations. The test statistics are constructed using G = 3, v₁ = 0.2, v₂ = 0.5, v₃ = 0.8. “Without A_z” refers to the scenario where there is no auxiliary A_z, and “With A_z” refers to the scenario where the auxiliary A_z is used. BAND1 is the bandwidth setting of h = 0.15 for n = 500 and h = 0.13 for n = 800 while BAND2 is the bandwidth setting of h = 0.20 for n = 500 and h = 0.17 for n = 800.

Model	n	M	Without A_z				With A_z
			BAND1		BAND2		BAND1		BAND2
			Testing H₁₀		Testing H₁₀		Testing H₁₀		Testing H₁₀
			T_1a	T_1m	T_1a	T_1m	T_1a	T_1m	T_1a	T_1m
P1	500	3	0.064	0.062	0.067	0.065	0.070	0.051	0.063	0.054
	500	5	0.067	0.066	0.071	0.072	0.079	0.061	0.081	0.061
	800	3	0.069	0.060	0.062	0.058	0.072	0.051	0.070	0.045
	800	5	0.073	0.060	0.060	0.058	0.066	0.045	0.063	0.043
P2	500	3	0.771	0.934	0.876	0.968	0.814	0.959	0.915	0.983
	500	5	0.746	0.921	0.843	0.960	0.800	0.941	0.891	0.974
	800	3	0.897	0.974	0.959	0.996	0.933	0.991	0.980	0.999
	800	5	0.886	0.982	0.946	0.990	0.911	0.991	0.964	0.997
P3	500	3	0.912	0.982	0.966	0.995	0.945	0.992	0.984	0.999
	500	5	0.907	0.977	0.955	0.989	0.943	0.990	0.976	0.997
	800	3	0.974	0.998	0.992	1.000	0.986	0.998	0.999	1.000
	800	5	0.983	0.999	0.998	1.000	0.994	0.999	1.000	1.000
			Testing H₂₀		Testing H₂₀		Testing H₂₀		Testing H₂₀
			T_2a	T_2m	T_2a	T_2m	T_2a	T_2m	T_2a	T_2m
P3	500	3	0.070	0.049	0.055	0.047	0.079	0.055	0.062	0.053
	500	5	0.056	0.063	0.050	0.059	0.070	0.069	0.062	0.069
	800	3	0.066	0.060	0.069	0.054	0.075	0.062	0.083	0.058
	800	5	0.066	0.061	0.056	0.059	0.073	0.066	0.069	0.063
P4	500	3	0.757	0.883	0.839	0.942	0.779	0.891	0.867	0.953
	500	5	0.755	0.909	0.868	0.964	0.772	0.918	0.896	0.970
	800	3	0.894	0.973	0.959	0.990	0.904	0.980	0.967	0.993
	800	5	0.871	0.970	0.955	0.991	0.891	0.975	0.966	0.992
P5	500	3	0.953	0.990	0.989	0.999	0.996	0.998	0.993	0.999
	500	5	0.946	0.988	0.987	0.998	0.999	1.000	0.993	0.999
	800	3	0.991	1.000	1.000	1.000	1.000	1.000	1.000	1.000
	800	5	0.995	0.999	0.999	1.000	1.000	1.000	0.999	1.000


5. Dengue Vaccine Efficacy Trial Analysis

The CYD14 cohort for data analysis is all participants attending the Month 13 study visit without previously experiencing the dengue disease primary endpoint, comprising 6639 vaccine recipients and 3220 placebo recipients. Of these, 116 vaccine recipients and 129 placebo recipients experienced the dengue endpoint by Month 25, constituting an estimated 56.5% reduction in the hazard of dengue disease between Months 13 and 25 for vaccine versus placebo (Capeding et al., 2014). The percentage of right-censoring by Month 25 was 98.3%. An important scientific question is how natural and vaccine immunity work in preventing dengue disease. Neutralizing antibodies, which are present in many placebo recipients (caused by prior dengue exposures and infections) and are boosted/increased in many vaccine recipients (caused by dengue vaccination), are generally believed to be important for both natural and vaccine-induced protection (Moodie et al., 2018). In this section, we apply the developed methods to analyze the CYD14 data with the objective of understanding, for each of the placebo and vaccine groups, the association of Month 13 neutralizing antibody levels (“NAb titer”) with subsequent occurrence of the dengue disease primary endpoint through Month 25, and whether and how the associations depend on dengue amino acid sequence. The NAb titer marker is the average of an individual’s log-base-10 50% neutralization titer to each of the four dengue strains in the vaccine (one strain for each dengue serotype), where the 50% neutralization titer quantifies the ability of antibodies in an individual’s blood sample to kill a given dengue strain (defined in detail in Moodie et al., 2018). The analyses by treatment group can be interpreted as assessing NAb titer as a marker of different kinds of acquired protection/disease resistance: naturally-acquired resistance for placebo recipients, and a combination of naturally- and vaccine-acquired resistance for vaccine recipients. This integrated analysis of host and pathogen data types would increase knowledge of NAb titer as a correlate of risk of dengue disease, with many applications including aiding refinement of models for bridging vaccine efficacy to new settings not studied in CYD14.

As summarized in the Introduction, NAb titer was measured from Month 13 blood samples from a subset of participants selected through a case-cohort sampling design. With controls defined as participants reaching the Month 25 visit never experiencing the dengue disease study endpoint, the NAb titer marker was measured from n=1879 controls (1275 vaccine, 604 placebo), and from all n=245 cases (116 vaccine, 129 placebo).

From blood samples drawn at dengue disease failure event times, dengue virus nucleotide sequences of the complete antigen-coding region of the dengue genome represented in the CYD-TDV vaccine (prM/E) were measured using 454 sequencing (Rabaa et al., 2017). The prM/E dengue genome (1,985 base pairs for serotypes 1, 2, 4 and 1,979 base pairs for serotype 3) was sequenced and translated to 661 amino acid positions (659 for serotype 3). The amino acid sequences were multiply aligned with the four vaccine strain sequences. A subset of 65 of the prM/E amino acid positions have been documented to be “NAb contact sites”, defined as positions on the outer surface of dengue that have been documented to interact with anti-dengue neutralizing antibodies. Because sequence variation in these contact sites was hypothesized to be especially relevant for potential protection against dengue disease, we studied the mark V defined as the Hamming distance based on these NAb contact sites. The distance V, labeled “Hamming distances: NAb contact sites”, is the percent amino acid mismatch in the 65 NAb contact sites between the dengue sequence from a given disease case and the closest dengue sequence among the four vaccine strain sequences. The mark V was measured from 76 (66%) of the 116 vaccine recipient cases and from 84 (65%) of the 129 placebo recipient cases.
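The mark definition can be illustrated with a short sketch. It assumes that the case and vaccine sequences are already aligned strings of equal length and that the contact sites are given as 0-based positions. The function name and the toy sequences are hypothetical, the treatment of alignment gaps is not addressed, and the function returns the mismatch as a proportion (multiply by 100 for a percentage).

```python
def contact_site_distance(case_seq, vaccine_seqs, contact_sites):
    """Mismatch proportion at the contact sites to the closest vaccine strain."""

    def mismatch(a, b):
        # Hamming distance restricted to the contact sites, as a proportion.
        return sum(a[p] != b[p] for p in contact_sites) / len(contact_sites)

    return min(mismatch(case_seq, strain) for strain in vaccine_seqs)


# Toy aligned sequences (hypothetical) with three "contact sites".
vaccine_strains = ["ACDEFG", "AKDQFG", "TTTTTT"]
distance = contact_site_distance("AKDEFG", vaccine_strains, contact_sites=[1, 3, 5])
```

Taking the minimum over the four vaccine strains means a case virus is scored against whichever vaccine component it most resembles at the contact sites.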

Vaccine recipients exposed to dengue sequences with short distances to the vaccine may be more likely to be protected by antibodies than vaccine recipients exposed to dengue sequences with large distances. Therefore, if NAb titer is important for protection, its inverse correlation with dengue disease risk would be expected to be strongest against dengue viruses with small distances and be weakest or non-existent against dengue sequences with large distances. The results on these hypotheses may provide insights into how the vaccine partially worked and thereby guide next steps of vaccine research. In addition, the same analysis in placebo recipients aids understanding of how naturally acquired NAb titers associate with sequence-specific dengue risk.

Let T be the time from the Month 13 visit until diagnosis of dengue disease through Month 25. We consider the following mark-specific PH model:

λ_{k} (t, v ∣ z (t)) = λ_{0 k} (t, v) exp {β_{1} (v) Age + β_{2} (v) NAb},

(11)

with K = 1 baseline stratum, where NAb is the Month 13 NAb titer and Age is age at enrollment. Age is a phase-one variable while NAb is a phase-two variable. For both the vaccine and placebo groups, NAb titer is observed for all the cases but missing for 80.5% of the non-cases.

We implement the proposed estimation and testing procedures described in Sections 2 and 3. We estimated the probability of observing the NAb titer marker with a logistic regression model, with logit(P(ξ_z = 1|Ω)) a linear function of (1, Age, Sex). To implement the AIPW method, we use linear models for E(NAb|Ω) and E(exp(β₂(v)NAb^⊗j)|Ω) for j = 0, 1, 2, with predictors (1, Age, Sex). For each case i with missing mark V_i, we use M = 5 imputed marks from the 5-nearest neighborhoods $L_{1 i}$ calculated using z-scores of $H_{1 j} = (T_{1 j}, A g e_{1 j}, N A b_{1 j})$ from all cases j (with δ_1j = 1).

Due to the very large phase-one sample and the low event rate, we used the bandwidth $h = 5 {\hat{σ}}_{v} n_{o}^{- 1 / 3}$ , where ${\hat{σ}}_{v}$ is the estimated standard deviation of the observed marks and n_o is the number of cases. The standard deviation of the observed mark “Hamming distances: NAb contact sites” is 0.0225 for the vaccine group and 0.0243 for the placebo group, resulting in bandwidths h = 0.023 and h = 0.024, respectively.
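As a check on the arithmetic, plugging the reported standard deviations into the formula, with n_o = 116 cases in the vaccine group and n_o = 129 cases in the placebo group, gives

h_{vaccine} = 5 \times 0.0225 \times 116^{- 1 / 3} \approx 0.1125 \times 0.2050 \approx 0.023, \quad h_{placebo} = 5 \times 0.0243 \times 129^{- 1 / 3} \approx 0.1215 \times 0.1979 \approx 0.024 .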

Figure 4 shows point and 95% confidence interval estimates of β₁(v) for Age and β₂(v) for NAb titer, by treatment arm. Greater age is associated with lower risk of dengue disease for the vaccine group and apparently not for the placebo group, and the associations do not appear to depend on the mark. NAb titer is strongly inversely associated with risk of dengue disease in the vaccine group, with stronger association for dengue viruses closest to the vaccine strains. In the placebo group the results suggest a weak inverse association of NAb titer with dengue disease, only for dengue viruses close to the vaccine strains.

Augmenting results from Figure 4, Figure 5 shows the estimated conditional mark-specific cumulative incidence function rate f_v(τ, v|z) at Month τ = 25 for the 10th, 50th and 90th percentiles of the NAb titer marker and at the average age 8.35 years old, by treatment arm. The figure shows that ${\hat{f}}_{1, v} (τ = 25, v ∣ z)$ is highest at the 10th percentile of NAb titer and lowest at the 90th percentile. Figure 5 also shows the ratios of ${\hat{f}}_{v} (τ = 25, v ∣ z)$ for the 10th vs. 90th percentiles and 50th vs. 90th percentiles of NAb titer at the average age. For the vaccine group, this ratio for the 10th vs. 90th percentile is almost twice that of the ratio for the 50th vs. 90th percentile for mark values v < 0.021. Such differences in estimated f_v(τ = 25, v|z) are not observed for the placebo group.

Table 2 presents the results of the hypothesis testing for β₁(v) and β₂(v) under H₁₀ and H₂₀ for the vaccine group and placebo group. The p-values are calculated using G = 4 grid points with v₁ = 0.01, v₂ = 0.03, v₃ = 0.05, v₄ = 0.07. The results support that the risk of dengue disease decreases as the NAb titer increases for the vaccine group, but not for the placebo group. Older children are at lower risk of dengue disease for both treatment groups, with stronger statistical evidence for the vaccine group. There is statistically significant evidence that the magnitude of the mark-specific association parameter β₂(v) for NAb titer decreases with increasing mark values for both treatment groups.

Table 2.

Results of hypothesis tests of H₁₀ and H₂₀ for the CYD14 trial.

	NAb titer				Age
	Testing H₁₀		Testing H₂₀		Testing H₁₀		Testing H₂₀
	T_1a	T_1m	T_2a	T_2m	T_1a	T_1m	T_2a	T_2m
Vaccine	< 0.001	< 0.001	0.069	0.005	< 0.001	< 0.001	0.857	0.762
Placebo	0.116	0.393	0.070	0.012	0.170	0.016	0.383	0.961


The analysis presented in this manuscript imputes missing dengue sequence distances from subjects with similar event times, ages, and NAb titer in a neighborhood. In Web Appendix C, we present the results of the data analysis using the alternative hotdeck imputations implemented by Juraska et al. (2018), which were obtained using information on study site and local clinic, as well as on dengue genotype and serotype. Similar results are obtained but with slightly weaker evidence that the magnitude of β₂(v) decreases with increasing mark values in the vaccine group.

Because the CYD-TDV vaccine is licensed for children 9 years of age or older, we repeated the analyses restricting to 9–14 year-olds (Web Appendix D). For the vaccine group, the results for inference on β₂(v) with covariate NAb titer are similar to those for all ages 6–14 (Web Appendix D Table 3, Figure 8, Figure 9). However, for the placebo group, the analysis restricting to 9–14 year-olds supports an inverse correlation of NAb titer with dengue disease for low dengue mark values, whereas the analysis of 6–14 year-olds did not suggest a correlation for any mark values.

6. Concluding Remarks

Motivated by the CYD14 dengue vaccine efficacy trial, this article developed estimation and hypothesis testing procedures for β(v) in model (1) under two-phase sampling of some covariates and with missing marks for some individuals with the failure event. We investigated two hybrid approaches that utilize nonparametric NNHD multiple imputations to impute missing marks of observed failures, followed by application of the AIPW technique to the completed-marks case-cohort sampled data sets. The two hybrid methods differ in how the imputed marks are pooled. Our simulations show that the hybrid-Rubin and the hybrid-MIEE estimators have similar performances in estimation.

We consider hot deck imputations of missing marks from donors with similar characteristics $H_{k j} = (T_{k j}, Z_{k j}, A_{v, k j})$ among the observed failures. The implementation of the NNHD depends on the choice of metric and the variables included for the neighborhood selection. The imputation based on a subset of (T_kj, Z_kj) can lead to biased estimation. Our L-nearest neighborhoods imputations are carried out based on z-scores of the $H_{k j}$ for cases and with the Euclidean metric. By considering the z-scores of variables, we eliminate the effects of scales or units of the variables on the nearest neighbor selections. Hsu and Yu (2019) recently studied the Cox model with missing covariates using the nonparametric multiple imputation approach with the neighborhood selected based on the predictive scores of two working regression models. We conducted a limited simulation study and found no advantages of the predictive score approach for the neighborhood selection.

Achieving consistent variance estimation in the presence of imputed data remains a challenge. Rubin’s (Rubin, 1987) rule of adjusting for multiple imputation has been widely used in practice. Other methods for estimating variances have been investigated, but few are rigorously justified; see, for example, Kovar and Chen (1994); Lee et al. (1994); Rancourt et al. (1994); Lee et al. (1995); Montaquila and Jernigan (1997). Chen and Shao (2000) investigated the theoretical properties of the NNHD imputation method and showed that the NNHD method provides asymptotically unbiased estimators for population means, quantiles and univariate distributions. They also derived consistent variance estimators of the NNHD estimators. The proposed hybrid-Rubin and hybrid-MIEE estimators for the mark-specific proportional hazards model (1) work very well, with small biases in the many different models we examined. However, finding consistent variance estimators is very challenging for the NNHD imputation of missing marks under two-phase sampling of covariates. We adopted Rubin’s rule for the variance estimators, which seems to slightly underestimate the true variances in some situations. The underestimated variances also lead to slightly inflated observed sizes for the proposed tests. Further investigation of variance estimation is needed.

For the analysis of the CYD14 efficacy trial, model (11) assumes that the mark-specific log-hazard ratio for Age is the same for every unit increase in Age, and similarly for the mark-specific log-hazard ratio for NAb. However, these model assumptions may fail, so model checking is an important problem. Sun et al. (2016) proposed a goodness-of-fit test procedure for the stratified mark-specific proportional hazards model (1) when covariates are observed and there are no missing marks. Developing a goodness-of-fit test procedure for model (1) with missing data merits future research.

The main manuscript presents the analysis of model (11) for children of all ages. However, the mark-specific effects β(v) may be different for different age groups. In Web Appendix D of the Supplementary Material, we conducted separate analyses for children in two different age groups: 2–8 and 9–14 year-olds. The additional analyses provide some insights on whether the effects of Age and NAb titer on the mark-specific risk of the dengue disease are different for different age groups.

Web Appendix E of the Supplementary Material also includes analyses using the hot deck imputations implemented in Juraska et al. (2018), which define the neighborhood based on biological and geographic information, i.e., dengue genotype, serotype, study site and local clinic. Juraska et al. validated that these hot deck imputations were highly accurate. They are scientifically based and more robust to model misspecification. In contrast, the hot deck imputation approach studied in this manuscript exploits the link between the failure time data and the observed marks specified by the mark-specific proportional hazards model, which can improve power at the expense of being less robust to model misspecification. Further research is warranted to investigate neighborhood selections and their impacts.

Acknowledgements

The authors thank the participants and investigators of the CYD14 trial. This research was partially supported by NIAID NIH award number R37AI054165, and by a contract from the CYD14 study sponsor Sanofi Pasteur. Dr. Sun’s research was partially supported by the National Science Foundation grants DMS1513072 and DMS1915829. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Footnotes

Supplementary Materials

Web Appendices A, B, C, D and E referenced in this article are given in the Supplementary Material available at the journal’s website. The MATLAB code and instructions for carrying out the analysis of a simulated data set, presented in Section 3.3 of the Web-based Supplementary Material, are also available from the website.

Contributor Information

Yanqing Sun, University of North Carolina at Charlotte, Charlotte, U.S.A.

Li Qi, Sanofi, Bridgewater, U.S.A.

Fei Heng, University of North Florida, Jacksonville, U.S.A.

Peter B. Gilbert, University of Washington and Fred Hutchinson Cancer Research Center, Seattle, U.S.A.

References

Aalen OO and Johansen S (1978) An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scandinavian Journal of Statistics, 5, 141–150.
Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46, 175–185.
Andridge RR and Little RJA (2010) A review of hot deck imputation for survey non-response. International Statistical Review, 78, 40–64.
Beretta L and Santaniello A (2016) Nearest neighbor imputation algorithms: a critical evaluation. BMC Medical Informatics and Decision Making, 16, 197–208.
Borgan O, Langholz B, Samuelsen SO, Goldstein L and Pogoda J (2000) Exposure stratified case-cohort designs. Lifetime Data Analysis, 6, 39–58.
Breslow NE and Lumley T (2013) Semiparametric models and two-phase samples: Applications to Cox regression, vol. Volume 9 of Collections, 65–77. Beachwood, Ohio, USA: Institute of Mathematical Statistics.
Breslow NE, Lumley T, Ballantyne CM, Chambless L and Kulich M (2009) Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: Applications in epidemiology. Statistics in Biosciences, 1, 32–49.
Capeding M, Tran N, Hadinegoro S, Ismail H, Chotpitayasunondh T, Chua M, Luong C, Rusmil K, Wirawan D, Nallusamy R, Pitisuttithum P, Thisyakorn U, Yoon I, van der Vliet D, Langevin E, Laot T, Hutagalung Y, Frago C, Boaz M, Wartel T, Tornieporth N, Saville M, Bouckenooghe A and CYD14 Study Group (2014) Clinical efficacy and safety of a novel tetravalent dengue vaccine in healthy children in Asia: a phase 3, randomised, observer-masked, placebo-controlled trial. Lancet, 384, 1358–65.
Carroll RJ, Spiegelman CH, Lan KKG, Bailey KT and Abbott RD (1984) On errors-in-variables for binary regression models. Biometrika, 71, 19–25.
Chen J and Shao J (2000) Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–141.
Gao G and Tsiatis AA (2005) Semiparametric estimators for the regression coefficients in the linear transformation competing risks model with missing cause of failure. Biometrika, 92, 875–891.
Gilbert P, McKeague I and Sun Y (2004) Tests for comparing mark-specific hazards and cumulative incidence functions. Lifetime Data Analysis, 10, 5–28.
Gilbert P and Sun Y (2015) Inferences on relative failure rates in stratified mark-specific proportional hazards models with missing marks, with application to human immunodeficiency virus vaccine efficacy trials. Journal of the Royal Statistical Society: Series C (Applied Statistics), 64, 49–73.
Hsu C-H and Yu M (2019) Cox regression analysis with missing covariates via nonparametric multiple imputation. Statistical Methods in Medical Research.
Jonsson P and Wohlin C (2004) An evaluation of k-nearest neighbour imputation using Likert data. In 10th International Symposium on Software Metrics, 2004. Proceedings, 108–118. IEEE.
Juraska M and Gilbert P (2013) Mark-specific hazard ratio model with multivariate continuous marks: An application to vaccine efficacy. Biometrics, 69, 328–337.
Juraska M and Gilbert P (2016) Mark-specific hazard ratio model with missing multivariate marks. Lifetime Data Analysis, 22, 606–625.
Juraska M, Magaret C, Shao J, Carpp L, Fiore-Gartland A, Benkeser D, Girerd-Chambaz Y, Langevin E, Frago C, Guy B, Jackson N, Duong T, Simmons C, Edlefsen P and Gilbert P (2018) Viral genetic diversity and protective efficacy of a tetravalent dengue vaccine in two phase 3 trials. Proceedings of the National Academy of Sciences, 115, E8378–E838.
Kovar J, Whitridge P and MacMillan J (1988) Generalized edit and imputation system for economic surveys at Statistics Canada. In Proceedings of the Survey Research Methods Section, American Statistical Association, 627–630.
Kovar JG and Chen EJ (1994) Jackknife variance estimation of imputed survey data. Survey Methodology, 20, 45–52.
Kulich M and Lin D (2004) Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association, 99, 832–844.
Lee H, Rancourt E and Särndal C (1995) Variance estimation in the presence of imputed data for the generalized estimation system. In Proceedings of the Section on Survey Research Methods, vol. 1, 384–389. American Statistical Association USA.
Lee H, Rancourt E and Särndal CE (1994) Experiments with variance estimation from survey data with imputed values. Journal of Official Statistics, 10, 231–243.
Li K-C (1984) Consistency for cross-validated nearest neighbor estimates in nonparametric regression. The Annals of Statistics, 12, 230–240.
Liang K-Y and Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Little RJA (1988) Missing-data adjustments in large surveys. Journal of Business & Economic Statistics, 6, 287–296.
Montaquila J and Jernigan R (1997) Variance estimation in the presence of imputed data. In Proceedings of the Survey Research Methods Section of the American Statistical Association, 273–277.
Moodie Z, Juraska M, Huang Y, Zhuang Y, Fong Y, Carpp L, Self S, Chambonneau L, Small R, Jackson N, Noriega F and Gilbert P (2018) Neutralizing antibody correlates analysis of tetravalent dengue vaccine efficacy trials in Asia and Latin America. Journal of Infectious Diseases, 217(5), 742–753.
Nan B (2004) Efficient estimation for case-cohort studies. Canadian Journal of Statistics, 32, 403–419.
Prentice RL (1986) A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 73, 1–11.
Rabaa MA, Girerd-Chambaz Y, Duong Thi Hue K, Vu Tuan T, Wills B, Bonaparte M, van der Vliet D, Langevin E, Cortes M, Zambrano B, Dunod C, Wartel-Tram A, Jackson N and Simmons CP (2017) Genetic epidemiology of dengue viruses in phase III trials of the CYD tetravalent dengue vaccine and implications for efficacy. eLife, 6.
Rancourt E, Särndal C and Lee H (1994) Estimation of the variance in the presence of nearest neighbor imputation. In Proceedings of the Section on Survey Research Methods, 888–893. American Statistical Association.
Reilly M (1993) Data analysis using hot deck multiple imputation. Journal of the Royal Statistical Society, Series D: The Statistician, 42, 307–313.
Robins J, Rotnitzky A and Zhao L (1994) Estimation of regression-coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.
Rubin D (1987) Multiple Imputation for Nonresponse in Surveys. Wiley, New York.
Scheike TH and Martinussen T (2004) Maximum likelihood estimation for Cox’s regression model under case–cohort sampling. Scandinavian Journal of Statistics, 31, 283–293.
Sedransk J (1985) The objective and practice of imputation. In Proceedings of the First Annual Research Conference, U.S. Bureau of the Census, Washington D.C., 445–452.
Stone CJ (1977) Consistent nonparametric regression. The Annals of Statistics, 5, 595–620.
Sun Y and Gilbert P (2012) Estimation of stratified mark-specific proportional hazards models with missing marks. Scandinavian Journal of Statistics, 39, 34–52.
Sun Y, Gilbert P and McKeague I (2009) Proportional hazards models with continuous marks. The Annals of Statistics, 37, 394–426.
Sun Y, Li M and Gilbert P (2016) Goodness-of-fit test of the stratified mark-specific proportional hazards model with continuous mark. Computational Statistics and Data Analysis, 93, 348–358.
Sun Y, Qi L, Yang G and Gilbert P (2018) Hypothesis tests for stratified mark-specific proportional hazards models with missing covariates, with application to HIV vaccine efficacy trials. Biometrical Journal, 60(3), 516–536.
White JE (1982) A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology, 115, 119–128.
Yang G, Sun Y, Qi L and Gilbert P (2017) Estimation of stratified mark-specific proportional hazards models under two-phase sampling with application to HIV vaccine efficacy trials. Statistics in Biosciences, 9, 259–283.
