Generalized accelerated recurrence time model for multivariate recurrent event data with missing event type

Huijuan Ma; Limin Peng; Zhumin Zhang; HuiChuan J Lai

doi:10.1111/biom.12847

Author manuscript; available in PMC: 2018 Sep 27.

Published in final edited form as: Biometrics. 2018 Feb 9;74(3):954–965. doi: 10.1111/biom.12847

PMCID: PMC6085173 NIHMSID: NIHMS953529 PMID: 29427311

Summary

Recurrent events data are frequently encountered in biomedical follow-up studies. The generalized accelerated recurrence time (GART) model (Sun et al., 2016), which formulates covariate effects on the time scale of the mean function of recurrent events (i.e. time to expected frequency), has arisen as a useful secondary analysis tool to provide meaningful physical interpretations. In this paper, we investigate the GART model in a multivariate recurrent events setting, where subjects may experience multiple types of recurrent events and some event types may be missing. We propose methods for the GART model that utilize the inverse probability weighting technique or the estimating equation projection strategy to handle event types that are missing at random. The new methods do not require imposing any parametric model for the missing mechanism, and thus are robust; moreover they enjoy easy and stable implementation. We establish the uniform consistency and weak convergence of the resulting estimators and develop appropriate inferential procedures. Extensive simulation studies and an application to a dataset from Cystic Fibrosis Foundation Patient Registry (CFFPR) illustrate the validity and practical utility of the proposed methods.

Keywords: Accelerated recurrence time model, Missing at random, Multivariate recurrent event data, Nadaraya–Watson kernel estimator

1. Introduction

Recurrent events data are frequently encountered in biomedical follow-up studies where subjects may experience events of interest repeatedly over time. A major analytic strategy for recurrent events data is to assess and model the mean or rate functions of recurrent events (Pepe and Cai, 1993; Lawless and Nadeau, 1995; Lin et al., 2000, among others), which are intuitive to interpret and impose only weak assumptions on the within-subject dependency. While existing mean or rate function based approaches mostly attend to the frequency scale of the mean function, there is increasing interest in characterizing the progression of recurrent events on the time scale of the mean function, which yields meaningful physical interpretations. For example, the classic accelerated failure time (AFT) model for recurrent events (Lin et al., 1998) specifies covariate effects as constant time scale changes of the mean function. To explicitly quantify the changes in the time scale of the mean function, Huang and Peng (2009) introduced the concept of time to expected frequency, defined as the inverse function of the mean function, and proposed the accelerated recurrence time (ART) model. The ART model extends the AFT model by allowing for evolving covariate effects on time to expected frequency. More recently, Sun et al. (2016) derived the generalized accelerated recurrence time (GART) model from a counting process modeling perspective. The GART model is a strict extension of the ART model, permitting a more flexible transformation from the frequency scale to the time scale of the mean function. The inferential methods developed by Sun et al. (2016) accommodate recurrent events data subject to observation windows that take the form of general time interval(s).

However, the aforementioned methods are all oriented to settings where recurrent events are of the same type. In practice, subjects may experience multiple recurrent events of different types, and moreover the identification of the event type can be missing for a variety of reasons. For example, Pseudomonas aeruginosa (Pa) is a major respiratory pathogen acquired in the early life of patients with cystic fibrosis (CF) and usually leads to chronic infections. The organism can also transition from a motile, virulent, nonmucoid type to a nonmotile, comparatively avirulent, mucoid type. The mucoid type is more likely to be drug-resistant and is associated with more severe CF disease progression. In the past, the two were not always differentiated in clinics. Recently, it has become common practice to classify Pa positive cultures into mucoid and nonmucoid types to aid in treatment decisions. However, unknown or missing Pa infection types still occur. As shown by our simulation studies (see Section 5), ignoring this data complication can seriously bias the estimation of the recurrence pattern of Pa infections of each type. This can potentially misguide CF disease management.

In this paper, we consider the problem of fitting the GART models to the multivariate recurrent events data with missing event types. Several authors have addressed such recurrent events data in other model settings. For example, Chen and Cook (2009) specified a multiplicative conditional Poisson model for the multivariate recurrent events data and derived an EM-algorithm to perform the maximum likelihood analysis in the presence of missing event types. Schaubel and Cai (2006a) and Schaubel and Cai (2006b) studied the semiparametric proportional rate model, using the multiple imputation technique and weighted estimating equations respectively to account for missing event types. Schaubel and Cai (2006a)’s weighted estimating equations were further adapted to the additive rate model (Ye, Zhao, Sun, and Xu, 2015) and the additive-multiplicative rate models (Ye, Sun, Zhao, and Xu, 2015). More recently, Lin et al. (2013) proposed a fully nonparametric estimator of the mean function in the one-sample case. While these methods shed useful insight for dealing with missing event types, they are not readily extendable to the GART model. This is because the GART model does not imply a likelihood, unlike a parametric model. In addition, the semiparametric rate models mentioned above only involve real valued coefficients, while the coefficients of the GART model, which accommodate varying covariate effects, take the form of functions.

To tackle the data complication caused by missing event types under the GART models, we consider two strategies. One is to apply the inverse probability weighting (IPW) technique to correct the bias using only the data with observed event types. The other is to impute the missing event type by its estimated probability of being each specific event type in the estimating equation that assumes a complete observation of event types. The second strategy shares the same spirit as that of Schaubel and Cai (2006a), Schaubel and Cai (2006b), and Lin et al. (2013), and we shall refer to it as the estimating equation projection (EEP) strategy hereafter. To carry out the IPW or EEP strategy, the key task is to estimate the conditional probability of the event type being observed, or of the missing event type being a specific type, given covariates and/or other observed data. To this end, we propose nonparametric Nadaraya-Watson type estimators to avoid additional parametric modeling. As in Lin et al. (2013), the proposed conditional probability estimators can be justified from the local likelihood estimation perspective. Our estimators also have explicit closed forms despite the incorporation of covariates, which are not available in Lin et al. (2013)’s method. As another appealing feature, the two methods derived from the IPW and EEP strategies can be unified in an inferential framework that resembles Sun et al. (2016)’s method. This entails simple and stable implementations of the proposed methods. For example, we are able to obtain the proposed estimators via algorithms that only involve minimizations of a sequence of L₁-type convex functions, which can readily be solved by existing functions in R and S-PLUS. By our asymptotic studies, the two proposed estimators are shown to be asymptotically equivalent.

We organize the rest of the paper as follows. In Section 2, we present the generalized accelerated recurrence time (GART) model and propose the estimating methods derived from the IPW and EEP strategies. We establish the asymptotic properties of the proposed estimators, including the uniform consistency and weak convergence, in Section 3, and discuss the inference procedures in Section 4. The simulation studies that investigate the finite sample performance of the proposed estimators are reported in Section 5. We illustrate the proposed methods via an application to a dataset from the Cystic Fibrosis Foundation Patient Registry (CFFPR) in Section 6. Finally, we provide some concluding remarks in Section 7.

2. The Proposed Methods

2.1 Data and Model

Suppose that a subject may experience K types of recurrent events, and recurrent events are subject to an observation window that is a time interval, (L,R]. Let $T^{(j)}$ denote the j-th recurrent event time, δ̄(t) ∈ {1, …, K} denote the type of the event that occurs at time t, and A(t) denote a binary indicator that equals 1 if the event type is observed at time t and 0 otherwise. Define δ_k(t) = I{δ̄(t) = k}, where I(·) is the indicator function, and write $δ_{k}^{(j)} ≐ δ_{k} (T^{(j)})$ and $A^{(j)} ≐ A (T^{(j)})$. Let X̃ be a (p − 1) × 1 covariate vector and $X = (1, {\tilde{X}}^{⊤})^{⊤}$.

For type-k events, the underlying counting process is given by $N_{k}^{*} (t) = \sum_{j = 1}^{\infty} I (T^{(j)} \leq t, δ_{k}^{(j)} = 1)$, which represents the total number of type-k events that have occurred by time t. The observation of recurrent events is only available in the time interval (L,R]. When all event types are known, $N_{k} (t) ≐ \sum_{j = 1}^{\infty} I (L < T^{(j)} \leq t \land R, δ_{k}^{(j)} = 1)$ captures the total number of type-k events observed by time t, where $a \land b$ denotes the minimum of a and b. Accounting for the fact that some event types may be missing, we define ${\overset{ˇ}{N}}_{k} (t) = \sum_{j = 1}^{\infty} I (L < T^{(j)} \leq t \land R, A^{(j)} = 1, δ_{k}^{(j)} = 1)$ to represent the total number of events that are observed by time t and are known to be of type k. Finally, we define $N_{\cdot} (t) = \sum_{k = 1}^{K} N_{k} (t)$, which captures the total number of recurrent events (regardless of their types) observed by time t. We assume that $N_{k}^{*} (\cdot)$ is independent of L and R given X for each k, $d N_{k}^{*} (s) \in \{0, 1\}$, and $d N_{k}^{*} (s) d N_{l}^{*} (s) = 0$ for k ≠ l. This means that the observation window (L,R] is non-informative about the recurrent events, and at most one type of event can occur at any time point.

For the multivariate recurrent events data considered in this paper, recurrent event times are always observed but the corresponding event types may be unknown/missing. That is, the observed data consist of n independent and identically distributed (i.i.d.) replicates of $\{N_{\cdot} (t), {\overset{ˇ}{N}}_{k} (t), L, R, d N_{\cdot} (t) A (t), d N_{\cdot} (t) A (t) δ_{k} (t), X; t > 0, k = 1, \dots, K\}$, denoted by $\{N_{i \cdot} (t), {\overset{ˇ}{N}}_{i k} (t), L_{i}, R_{i}, d N_{i \cdot} (t) A_{i} (t), d N_{i \cdot} (t) A_{i} (t) δ_{i k} (t), X_{i}; t > 0, k = 1, \dots, K\}_{i = 1}^{n}$.

For type-k events, time to expected frequency u (Huang and Peng, 2009) is defined as

τ_{X, k} (u) = inf {t \geq 0 : μ_{X, k} (t) \geq u},

where $μ_{X, k} (t) ≐ E {N_{i k}^{*} (t) ∣ X}$ is the conditional mean function of the type-k event given X. For each event type k, we assume the generalized accelerated recurrence time (GART) model (Sun et al., 2016):

τ_{X, k} {G (u)} = exp {X^{⊤} β_{0 k} (u)}, u \in (0, U],

(1)

where $G (u) = \int_{0}^{u} g (s) d s$ with g being a known positive continuous function, $β_{0 k} (\cdot)$ is a p × 1 vector of unknown coefficient functions, and U is a positive constant. The non-intercept components of $β_{0 k} (u)$ represent covariate effects on the time to expected frequency G(u) of the type-k event. When they are all constant over u and g(·) = 1, model (1) becomes the AFT model for recurrent events. In the non-recurrent event setting (i.e. $T_{i}^{(j)} = \infty$ for all j > 1), model (1) with g(·) = 1 reduces to a standard quantile regression model for the type-k event time.
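To make the notion of time to expected frequency concrete, the following minimal Python sketch evaluates inf{t ≥ 0 : μ(t) ≥ u} by bisection. It is an illustration only: the function name `time_to_expected_frequency` and the two linear mean functions are hypothetical choices, not taken from the paper. Under model (1) with g(·) = 1 and a single binary covariate x, the difference in log time to expected frequency between the two groups equals the non-intercept coefficient at u.

```python
import math

def time_to_expected_frequency(mu, u, t_max=1e6, tol=1e-10):
    """Return inf{t >= 0 : mu(t) >= u} for a nondecreasing, continuous mean function mu.

    Uses bisection on [0, t_max]; returns math.inf if mu(t_max) < u.
    """
    if mu(t_max) < u:
        return math.inf
    lo, hi = 0.0, t_max
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if mu(mid) >= u:
            hi = mid      # frequency u is already reached at mid
        else:
            lo = mid      # more time is needed to reach frequency u
    return hi

# Hypothetical mean functions (illustrative only): linear in t with different rates.
def mu_x0(t):
    return 0.50 * t       # group with x = 0

def mu_x1(t):
    return 0.25 * t       # group with x = 1

u = 2.0
tau_x0 = time_to_expected_frequency(mu_x0, u)
tau_x1 = time_to_expected_frequency(mu_x1, u)
# Covariate effect on the log time scale at frequency u.
effect_at_u = math.log(tau_x1) - math.log(tau_x0)
```

Because both hypothetical mean functions are linear in t, the effect is the same at every u, which corresponds to the AFT special case mentioned above.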

2.2 The Proposed Estimating Equations

By Sun et al. (2016), model (1) implies $E {N_{i k} (e^{X_{i}^{⊤} β_{0 k} (u)}) ∣ X_{i}} = E {\int_{0}^{u} Y_{i} (e^{X_{i}^{⊤} β_{0 k} (s)}) g (s) d s ∣ X_{i}}$, where $Y_{i} (t) = I (L_{i} < t \leq R_{i})$ denotes the at-risk process for recurrent events. When the event types are always observed, we have $N_{i k} (t) = {\overset{ˇ}{N}}_{i k} (t)$. Thus we can apply Sun et al. (2016)’s method to estimate $β_{0 k} (u)$. That is, we solve the following estimating equation for $β_{k} (\cdot)$:

n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} {{\overset{ˇ}{N}}_{i k} (exp {X_{i}^{⊤} β_{k} (u)}) - \int_{0}^{u} Y_{i} (exp {X_{i}^{⊤} β_{k} (s)}) g (s) d s} = 0.

(2)

When some of the event type information is missing, using equation (2) to estimate $β_{0 k} (u)$ corresponds to the so-called complete-case (CC) analysis, which ignores the events of unknown type. In this case, ${\overset{ˇ}{N}}_{i k} (\cdot)$ deviates from $N_{i k} (\cdot)$ if $A_{i}^{(j)} = 0$ for some j. Consequently, the expectation of the left-hand side of estimating equation (2) with $β_{k} (u) = β_{0 k} (u)$ is generally nonzero, even when the event type is missing completely at random (MCAR) (Little and Rubin, 2002). This suggests that the CC analysis based on estimating equation (2) is problematic and can yield a biased estimator of $β_{0 k} (u)$.

To obtain an unbiased estimator of $β_{0 k} (u)$, our basic idea is to find an appropriate proxy of $N_{i k} (t)$, denoted by ${\hat{N}}_{i k} (t)$, and then solve the following equation for $β_{k} (\cdot)$:

n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} {{\hat{N}}_{i k} (exp {X_{i}^{⊤} β_{k} (u)}) - \int_{0}^{u} Y_{i} (exp {X_{i}^{⊤} β_{k} (s)}) g (s) d s} = 0.

(3)

To attain consistent estimation of $β_{0 k} (\cdot)$, we shall properly design ${\hat{N}}_{i k} (t)$ so that the left-hand side of equation (3) (multiplied by $n^{- 1 / 2}$) approaches zero as n → ∞ when $β_{k} (\cdot) = β_{0 k} (\cdot)$.

To proceed, we assume a missing-at-random (MAR) mechanism (Little and Rubin, 2002) for event types that implies the conditional independence between $A_{i} (t)$ and $δ_{i k} (t)$ given $d N_{i \cdot} (t)$ and $Z_{i}$, where $Z_{i}$ encompasses covariate $X_{i}$ and possibly other observed time-independent data, such as $L_{i}$ and $R_{i}$. Similar MAR assumptions for recurrent event type were adopted in previous work, such as Schaubel and Cai (2006a,b) and Lin et al. (2013). With $Z_{i}$ formulated as independent of time, our MAR assumption imposes an implicit constraint that the event type missing probability is only influenced by the observed data that are fixed over time. As shown in Sections 2.2.1 and 2.2.2, this MAR assumption facilitates the derivation of an appropriate inverse probability weight and the construction of the proposed EEP equation.

In the following subsections, we give two specific forms of N̂_ik(t) based on the inverse probability weighting (IPW) technique and the estimating equation projection (EEP) strategy respectively.

2.2.1 Inverse Probability Weighting (IPW) Method

Let $π_{k} (t, z) = E \{A_{i} (t) ∣ d N_{i k} (t) = 1, Z_{i} = z\}$, $A_{i}^{(j)} = A_{i} (T_{i}^{(j)})$, and $π_{i k}^{(j)} = π_{k} (T_{i}^{(j)}, Z_{i})$. Using the standard IPW arguments, we can show that $E {d N_{i k} (t) ∣ Z_{i}} = E {\frac{1}{π_{k} (t, Z_{i})} d {\overset{ˇ}{N}}_{i k} (t) ∣ Z_{i}}$, and thus $E {N_{i k} (t) ∣ Z_{i}} = E {\int_{0}^{t} \frac{1}{π_{k} (s, Z_{i})} d {\overset{ˇ}{N}}_{i k} (s) ∣ Z_{i}}$. Therefore, a special form of ${\hat{N}}_{i k} (t)$ is suggested as

{\hat{N}}_{i k}^{IPW} (t) = \int_{0}^{t} \frac{1}{{\hat{π}}_{k} (s, Z_{i})} d {\overset{ˇ}{N}}_{i k} (s) ≐ \sum_{j = 1}^{\infty} \frac{1}{{\hat{π}}_{i k}^{(j)}} I (L_{i} < T_{i}^{(j)} \leq t \land R_{i}, A_{i}^{(j)} = 1, δ_{i k}^{(j)} = 1),

where π̂_k(t, z) (or ${\hat{π}}_{i k}^{(j)}$ ) is a reasonable estimate for π_k(t, z) (or $π_{i k}^{(j)}$ ).
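As a minimal sketch of the sum form of the IPW proxy for a single subject, the Python snippet below contrasts it with the complete-case count. The function names and the toy data are hypothetical, and the weights `pi_hat` are simply taken as given here rather than estimated.

```python
def ipw_count(times, observed, is_type_k, pi_hat, L, R, t):
    """IPW proxy: sum of 1/pi_hat over events in (L, min(t, R)] known to be type k."""
    upper = min(t, R)
    total = 0.0
    for T, A, d, p in zip(times, observed, is_type_k, pi_hat):
        if L < T <= upper and A == 1 and d == 1:
            total += 1.0 / p
    return total

def complete_case_count(times, observed, is_type_k, L, R, t):
    """Complete-case count: events in (L, min(t, R)] known to be type k, unweighted."""
    upper = min(t, R)
    return sum(1 for T, A, d in zip(times, observed, is_type_k)
               if L < T <= upper and A == 1 and d == 1)

# Toy data for one subject (hypothetical). The third event has a missing type,
# so its type indicator is never used and is set to 0 as a placeholder.
times     = [0.5, 1.0, 2.0, 3.0, 4.5]
observed  = [1,   1,   0,   1,   1]      # A^(j)
is_type_k = [1,   0,   0,   1,   1]      # delta_k^(j), meaningful only when A^(j) = 1
pi_hat    = [0.8, 0.8, 0.8, 0.5, 0.5]    # assumed given, not estimated here

n_ipw = ipw_count(times, observed, is_type_k, pi_hat, L=0.0, R=4.0, t=3.5)
n_cc = complete_case_count(times, observed, is_type_k, L=0.0, R=4.0, t=3.5)
```

In this toy example the complete-case count is 2, while the IPW proxy up-weights the two observed type-k events to 1/0.8 + 1/0.5 = 3.25 to account for events whose type may have been missing.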

To derive π̂_k(t, z) (or ${\hat{π}}_{i k}^{(j)}$ ), we first note that under the assumed MAR mechanism,

π_{k} (t, z) = E {A_{i} (t) ∣ δ_{i k} (t) = 1, d N_{i \cdot} (t) = 1, Z_{i} = z} = E {A_{i} (t) ∣ d N_{i \cdot} (t) = 1, Z_{i} = z} .

(4)

This implies that π_k(t, z)’s are the same for all k ∈ {1, …, K}. Thus, we can drop the subscript k in π_k(t, z), $π_{i k}^{(j)}$ , and ${\hat{π}}_{i k}^{(j)}$ , and use the notation π(t, z), $π_{i}^{(j)}$ , and ${\hat{π}}_{i}^{(j)}$ instead.

Intuitively, one may adopt a parametric regression model, such as a logistic regression model, for A_i(t) to obtain an estimate for π(t, z). However, such an estimator may be biased when the parametric model is misspecified. To avoid this issue, we propose a fully nonparametric method to estimate π(t, z). Specifically, we propose a Nadaraya-Watson type nonparametric estimator of π(t, z) that takes the form

\hat{π} (t, z) = \frac{\sum_{i = 1}^{n} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) A_{i} (s) d N_{i \cdot} (s)}{\sum_{i = 1}^{n} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) d N_{i \cdot} (s)},

(5)

where $K_{h} (u) = h^{- 1} K (u / h)$ for scalar u, h is a bandwidth depending on n, $K_{h} (u) = Π_{i = 1}^{d} K_{h} (u_{i})$ for $u = (u_{1}, u_{2}, \dots, u_{d}) \in ℛ^{d}$, $z = (z_{1}, z_{2})$, d is the number of continuous elements in $Z_{i}$, and $Z_{1 i}$ and $Z_{2 i}$ are the continuous and discrete elements of $Z_{i}$ respectively. Here K(u) is an rth-order (r > d + 1) kernel function with compact support satisfying $\int K (u) d u = 1$, $\int u^{m} K (u) d u = 0$ for m = 1, 2, …, r − 1, $\int u^{r} K (u) d u \neq 0$, and $\int K (u)^{2} d u < \infty$. In the Supplementary Materials, we show that $\hat{π} (t, z)$ is the (kernel-based) local likelihood estimator of π(t, z) via a locally constant likelihood approximation. Similar types of estimators have been used in other methods that deal with missing data, for example, Zhou et al. (2008), Chen et al. (2015), and Qiu et al. (2017).
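The structure of estimator (5) can be sketched in a few lines of Python for the case of one continuous and one discrete component in Z. This is an illustration under stated assumptions: the Epanechnikov kernel used here is only second order and therefore does not meet the order requirement r > d + 1 above, the dictionary layout of `subjects` is hypothetical, and event times are assumed to have been restricted to each subject's observation window already.

```python
def epanechnikov(u):
    """Second-order kernel with compact support [-1, 1] (a simplification)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def k_h(u, h):
    """Scaled kernel K_h(u) = K(u / h) / h."""
    return epanechnikov(u / h) / h

def pi_hat(t, z1, z2, subjects, h):
    """Nadaraya-Watson estimate of the probability that the event type is observed.

    Each subject is a dict with keys 'z1' (continuous), 'z2' (discrete),
    'times' (event times) and 'observed' (0/1 indicators A for each event).
    """
    num = den = 0.0
    for s in subjects:
        if s["z2"] != z2:
            continue                      # indicator I(Z_2i = z_2)
        wz = k_h(s["z1"] - z1, h)         # kernel weight in the continuous covariate
        for T, A in zip(s["times"], s["observed"]):
            wt = k_h(T - t, h)            # kernel weight in time
            num += wz * wt * A
            den += wz * wt
    return num / den if den > 0.0 else float("nan")

# Toy data (hypothetical): two comparable subjects, one with the type observed.
subjects = [
    {"z1": 0.0, "z2": 1, "times": [1.0], "observed": [1]},
    {"z1": 0.0, "z2": 1, "times": [1.0], "observed": [0]},
    {"z1": 0.0, "z2": 0, "times": [1.0], "observed": [0]},   # other stratum, ignored
    {"z1": 5.0, "z2": 1, "times": [1.0], "observed": [0]},   # outside kernel support
]
p = pi_hat(t=1.0, z1=0.0, z2=1, subjects=subjects, h=0.5)
```

The integrals against the counting process reduce to sums over observed event times, which is why the estimator has a closed form and needs no iterative fitting.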

Plugging ${\hat{N}}_{i k}^{IPW} (t)$ with ${\hat{π}}_{i k}^{(j)} = \hat{π} (T_{i}^{(j)}, Z_{i})$ into (3), we obtain an IPW type estimating equation for $β_{0 k} (\cdot)$:

S_{n k}^{IPW} (β_{k}) ≐ n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} {{\hat{N}}_{i k}^{IPW} (exp {X_{i}^{⊤} β_{k} (u)}) - \int_{0}^{u} Y_{i} (exp {X_{i}^{⊤} β_{k} (s)}) g (s) d s} = 0.

(6)

The procedure to solve this estimating equation is elaborated in Section 2.3.

2.2.2 Estimating Equation Projection (EEP) Method

Following the EEP strategy exploited in the literature (Schaubel and Cai, 2006a,b; Lin et al., 2013, among others), we write

N_{i k} (t) = \int_{0}^{t} [A_{i} (s) δ_{i k} (s) + {1 - A_{i} (s)} δ_{i k} (s)] d N_{i \cdot} (s) ≐ \sum_{j = 1}^{\infty} {A_{i}^{(j)} δ_{i k}^{(j)} + (1 - A_{i}^{(j)}) δ_{i k}^{(j)}} I (L_{i} < T_{i}^{(j)} \leq R_{i} \land t),

and propose to recover the missing component of $N_{i k} (t)$ (i.e. $\{1 - A_{i} (s)\} δ_{i k} (s)$) by replacing each $δ_{i k} (s)$ for which $A_{i} (s) = 0$ with its estimated expectation.

Specifically, define $p_{k} (t, z) = E \{δ_{i k} (t) ∣ A_{i} (t) = 0, d N_{i \cdot} (t) = 1, Z_{i} = z\}$. Under the assumed MAR mechanism, we have $p_{k} (t, z) = \Pr \{δ_{i k} (t) = 1 ∣ A_{i} (t) = 1, d N_{i \cdot} (t) = 1, Z_{i} = z\}$. We propose a Nadaraya–Watson type nonparametric estimator of $p_{k} (t, z)$, given by

{\hat{p}}_{k} (t, z) = \frac{\sum_{i = 1}^{n} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) A_{i} (s) δ_{i k} (s) d N_{i \cdot} (s)}{\sum_{i = 1}^{n} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) A_{i} (s) d N_{i \cdot} (s)} .

Similar to the derivation of π̂(t, z), p̂_k(t, z) is a maximum local likelihood estimator when p_k(t, z) is approximated by a constant within a kernel band in t and z; more details can be found in the Supplementary Materials. Note that Lin et al. (2013) also adopted a similar local likelihood method to estimate the counterpart of p_k(t, z) in the one-sample case. They used a local polynomial with order q to approximate the imputed probability, and it is hard to generalize their estimator to account for covariates. Our idea of using a locally constant approximation circumvents such a difficulty. Moreover it enables a closed form for p̂_k(t, z), which facilitates the computation while not sacrificing the estimation efficiency.

A special form of N̂_ik(t) derived by the EEP strategy is given by

{\hat{N}}_{i k}^{EEP} (t) = \int_{0}^{t} [A_{i} (s) δ_{i k} (s) + {1 - A_{i} (s)} {\hat{p}}_{k} (s, Z_{i})] d N_{i \cdot} (s) ≐ \sum_{j = 1}^{\infty} [A_{i}^{(j)} δ_{i k}^{(j)} + (1 - A_{i}^{(j)}) {\hat{p}}_{i k}^{(j)}] I (L_{i} < T_{i}^{(j)} \leq R_{i} \land t)

where ${\hat{p}}_{i k}^{(j)} = {\hat{p}}_{k} (T_{i}^{(j)}, Z_{i})$ . The resulting EEP type estimating equation takes the form,

S_{n k}^{EEP} (β_{k}) ≐ n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} {{\hat{N}}_{i k}^{EEP} (exp {X_{i}^{⊤} β_{k} (u)}) - \int_{0}^{u} Y_{i} (exp {X_{i}^{⊤} β_{k} (s)}) g (s) d s} = 0.

(7)
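The sum form of the EEP proxy can be sketched as follows for a single subject. The function name and toy data are hypothetical, and the imputation probabilities are taken as given rather than computed from the kernel estimator. Events of known type contribute 0 or 1, while events of unknown type contribute their estimated probability of being type k.

```python
def eep_count(times, observed, types, p_hat_k, k, L, R, t):
    """EEP proxy: known type-k events count 1, unknown-type events count p_hat_k."""
    upper = min(t, R)
    total = 0.0
    for T, A, typ, p in zip(times, observed, types, p_hat_k):
        if not (L < T <= upper):
            continue                          # outside the observation window
        if A == 1:
            total += 1.0 if typ == k else 0.0
        else:
            total += p                        # imputed contribution
    return total

# Toy data for one subject with K = 2 event types (hypothetical).
times    = [0.5, 1.0, 2.0, 3.0, 4.5]
observed = [1,   1,   0,   1,   1]
types    = [1,   2,   None, 1,  1]            # None marks a missing event type
p_hat = {1: [0.6] * 5, 2: [0.4] * 5}          # assumed given; used only where A = 0

n1 = eep_count(times, observed, types, p_hat[1], 1, L=0.0, R=4.0, t=3.5)
n2 = eep_count(times, observed, types, p_hat[2], 2, L=0.0, R=4.0, t=3.5)
```

Here the type-specific proxies are 2.6 and 1.4, and they add up to the 4 events observed in the window by time 3.5. This holds whenever the imputation probabilities sum to one over k, since each event then contributes a total of one across types.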

2.3 Computation algorithm

We generally denote the proposed estimating equations by

S_{n k}^{L} (β_{k}) = n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} {{\hat{N}}_{i k}^{L} (exp {X_{i}^{⊤} β_{k} (u)}) - \int_{0}^{u} Y_{i} (exp {X_{i}^{⊤} β_{k} (s)}) g (s) d s} = 0,

(8)

with L = IPW or EEP. The resulting estimators are denoted as ${\hat{β}}_{k}^{L} (\cdot)$. Following Peng and Huang (2008) and Sun et al. (2016), we adopt a grid-based algorithm to get ${\hat{β}}_{k}^{L} (\cdot)$ based on equation (8). Specifically, define a grid $\mathcal{S}_{L (n)} = \{0 = u_{0} < u_{1} < \cdots < u_{L (n)} = U\}$, and denote its size by $‖ \mathcal{S}_{L (n)} ‖ = \max_{j = 1, \dots, L (n)} | u_{j} - u_{j - 1} |$. We define ${\hat{β}}_{k}^{L} (\cdot)$ as a right continuous piecewise-constant function that jumps only at the grid points of $\mathcal{S}_{L (n)}$. We set $exp {X_{i}^{⊤} {\hat{β}}_{k}^{L} (0)} = 0$ for every i since $τ_{X, k} (0) = \exp \{X^{⊤} β_{0 k} (0)\} = 0$. We obtain ${\hat{β}}_{k}^{L} (u_{l})$, l = 1, 2, …, L(n), by sequentially solving the estimating equation,

n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} {{\hat{N}}_{i k}^{L} (exp {X_{i}^{⊤} β_{k} (u_{l})}) - \sum_{m = 0}^{l - 1} Y_{i} (exp {X_{i}^{⊤} {\hat{β}}_{k}^{L} (u_{m})}) \int_{u_{m}}^{u_{m + 1}} g (s) d s} = 0,

(9)

with L = IPW or EEP.
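The sequential structure of equation (9) is easiest to see in the intercept-only (one-sample) case, where $X_{i} = 1$ and the unknown at each grid point is a scalar. The Python sketch below is a simplified illustration under that assumption and is not the authors' implementation: the function name `gart_grid_one_sample` is hypothetical, the integral of g over each grid cell is approximated by the midpoint rule, and the generalized solution is taken to be the first pooled event time at which the weighted event count reaches the accumulated at-risk target.

```python
import math

def gart_grid_one_sample(events, windows, grid, g=lambda s: 1.0):
    """Grid algorithm for the intercept-only case of equation (9).

    events  : list of (time, weight) pooled over subjects, with weights from IPW or EEP
    windows : list of (L_i, R_i) observation windows
    grid    : increasing list [u_0 = 0, u_1, ..., u_L]
    Returns the estimated log time to expected frequency at u_1, ..., u_L.
    """
    events = sorted(events)
    t_prev = 0.0                 # exp{beta_hat(u_0)} is set to 0
    target = 0.0                 # running value of sum_i sum_m Y_i(.) * integral of g
    betas = []
    for m in range(len(grid) - 1):
        at_risk = sum(1 for (L, R) in windows if L < t_prev <= R)
        du = grid[m + 1] - grid[m]
        target += at_risk * g(0.5 * (grid[m] + grid[m + 1])) * du
        cum, t_new = 0.0, math.inf
        for time, w in events:   # first crossing of the weighted count
            cum += w
            if cum >= target:
                t_new = time
                break
        betas.append(math.log(t_new))
        t_prev = t_new
    return betas

# Toy data (hypothetical): 3 subjects, each with events at times 1, ..., 5,
# all types observed (weights 1), windows (0, 10], and g = 1.
events = [(float(t), 1.0) for _ in range(3) for t in (1, 2, 3, 4, 5)]
windows = [(0.0, 10.0)] * 3
grid = [0.25 * l for l in range(13)]          # u from 0 to 3 in steps of 0.25
betas = gart_grid_one_sample(events, windows, grid)
tau_hat = [math.exp(b) for b in betas]        # tau_hat[l - 1] corresponds to u_l
```

In this toy example every subject has one event per unit time, so the estimated time to expected frequency at u = 1, 2, 3 is 1, 2, 3. Between integers the estimate lags slightly behind because of the grid discretization, which is consistent with the requirement in Section 3 that the grid size shrink with n.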

An exact solution that makes equation (9) hold strictly may not exist owing to the fact that the left-hand side of (9) is not continuous. Since equation (9) is monotone, ${\hat{β}}_{k}^{L} (u_{l})$ is defined as a generalized solution to equation (9), and the set of generalized solutions is convex with diameter $O (n^{- 1})$. An equivalent alternative approach to finding a generalized solution to (9) is to locate the minimizer of the L₁-type convex function,

W_{l, k}^{L} (h) = \sum_{i = 1}^{n} \sum_{j = 1}^{\infty} {\hat{ω}}_{i, j, k}^{L} | log T_{i}^{(j)} - X_{i}^{⊤} h | + | R^{*} - {\sum_{i = 1}^{n} \sum_{j = 1}^{\infty} {\hat{ω}}_{i, j, k}^{L} (- X_{i}^{⊤} h)} | + | R^{*} - {\sum_{i = 1}^{n} 2 X_{i}^{⊤} h \sum_{m = 0}^{l - 1} Y_{i} (exp {X_{i}^{⊤} {\hat{β}}_{k}^{L} (u_{m})}) \int_{u_{m}}^{u_{m + 1}} g (s) d s} |,

where L = IPW or EEP, ${\hat{ω}}_{i, j, k}^{IPW} = \frac{A_{i}^{(j)}}{{\hat{π}}_{i}^{(j)}} δ_{i k}^{(j)} I (L_{i} < T_{i}^{(j)} \leq R_{i})$, ${\hat{ω}}_{i, j, k}^{EEP} = [A_{i}^{(j)} δ_{i k}^{(j)} + (1 - A_{i}^{(j)}) {\hat{p}}_{i k}^{(j)}] I (L_{i} < T_{i}^{(j)} \leq R_{i})$, and $R^{*}$ is a large constant that bounds $| \sum_{i = 1}^{n} \sum_{j = 1}^{\infty} {\hat{ω}}_{i, j, k}^{L} (- X_{i}^{⊤} h) |$ and $| \sum_{i = 1}^{n} 2 X_{i}^{⊤} h \sum_{m = 0}^{l - 1} Y_{i} (exp {X_{i}^{⊤} {\hat{β}}_{k}^{L} (u_{m})}) \int_{u_{m}}^{u_{m + 1}} g (s) d s |$ from above.

We can show that $\partial W_{l, k}^{L} (β (u_{l})) / \partial β (u_{l})$ equals −2 times the estimating equation in (9) by following arguments similar to those in the Appendix of Peng and Fine (2009). This justifies the use of the minimizer of $W_{l, k}^{L} (h)$ as a generalized solution to equation (9). We can solve the minimization of $W_{l, k}^{L} (h)$ by using standard statistical software, for example the l1fit() function in S-PLUS or the rq() function in the R package quantreg. More specifically, let $m_{i} = N_{i \cdot} (R_{i})$, let $1_{m_{i}}$ denote an $m_{i} × 1$ vector with all components equal to 1, and let ⊗ denote the Kronecker product. One may directly apply l1fit() or rq() to solve a median regression problem with an augmented dataset, where the response vector is

{(log (T_{1}^{(1)}), \dots, log (T_{1}^{(m_{1})}), \dots, log (T_{n}^{(1)}), \dots, log (T_{n}^{(m_{n})}), R^{*}, R^{*})}^{⊤},

the covariate matrix is

({(1_{m_{1}} \otimes X_{1}^{⊤})}^{⊤}, \dots, {(1_{m_{n}} \otimes X_{n}^{⊤})}^{⊤}, - \sum_{i = 1}^{n} \sum_{j = 1}^{\infty} {\hat{ω}}_{i, j, k}^{L} X_{i}, \sum_{i = 1}^{n} 2 X_{i} \sum_{m = 0}^{l - 1} Y_{i} (exp {X_{i}^{⊤} {\hat{β}}_{k}^{L} (u_{m})}) \cdot \int_{u_{m}}^{u_{m + 1}} g (s) d s),

and the weight vector is ${({\hat{ω}}_{1, 1, k}^{L}, \dots, {\hat{ω}}_{1, m_{1}, k}^{L}, \dots, {\hat{ω}}_{n, 1, k}^{L}, \dots, {\hat{ω}}_{n, m_{n}, k}^{L}, 1, 1)}^{⊤}$ .
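To illustrate how the augmented dataset encodes the estimating equation, the following Python sketch solves the weighted L₁ problem in the scalar case p = 1 (intercept only), where minimizing a weighted sum of absolute residuals reduces to a weighted median. This is a simplified illustration and not a replacement for l1fit() or rq(): the function names, the toy data, and the value `c` (standing for the accumulated at-risk term in equation (9)) are hypothetical.

```python
import math

def weighted_l1_scalar(y, x, w):
    """Minimize sum_r w[r] * |y[r] - x[r] * h| over scalar h via a weighted median.

    Each row with x != 0 contributes w * |x| * |y / x - h|, so the minimizer is a
    weighted median of the ratios y / x with weights w * |x|.
    """
    pts = sorted((yr / xr, wr * abs(xr))
                 for yr, xr, wr in zip(y, x, w) if xr != 0.0 and wr > 0.0)
    half = 0.5 * sum(wt for _, wt in pts)
    cum = 0.0
    for ratio, wt in pts:
        cum += wt
        if cum >= half:
            return ratio
    raise ValueError("no rows with positive weight")

def objective(h, y, x, w):
    """Weighted L1 objective evaluated at h."""
    return sum(wr * abs(yr - xr * h) for yr, xr, wr in zip(y, x, w))

# Toy pooled data (hypothetical): 9 events with unit weights and X_i = 1.
event_times = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
omega = [1.0] * 9
c = 5.25           # assumed value of sum_i sum_m Y_i(.) * integral of g at this grid step
R_star = 1e6       # large constant

# Augmented dataset: event rows followed by the two pseudo-observations.
y = [math.log(T) for T in event_times] + [R_star, R_star]
x = [1.0] * 9 + [-sum(omega), 2.0 * c]
w = omega + [1.0, 1.0]

h = weighted_l1_scalar(y, x, w)
```

The minimizer is log 2, the first log event time at which the cumulative weight (6) reaches the target c = 5.25. The two pseudo-observations never change the sign of their residuals because $R^{*}$ is large, so they only shift the subgradient by the constants needed to reproduce equation (9).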

3. Asymptotic Properties

In this Section, we establish the uniform consistency and weak convergence of the proposed estimator ${\hat{β}}_{k}^{L} (\cdot)$ . Denote the density of Z by f(z). Define

N_{i k}^{AIPW} (t) = \sum_{j = 1}^{\infty} [\frac{A_{i}^{(j)}}{π_{i}^{(j)}} δ_{i k}^{(j)} + (1 - \frac{A_{i}^{(j)}}{π_{i}^{(j)}}) p_{i k}^{(j)}] I (L_{i} < T_{i}^{(j)} \leq R_{i} \land t),

${\tilde{μ}}_{Z, k} (t) = E \{N_{i k} (t) ∣ Z_{i}\}$, $g_{Z, k} (t) = d {\tilde{μ}}_{Z, k} (t) / d t$, $g_{Z} (t) = \sum_{k = 1}^{K} g_{Z, k} (t)$, ${\tilde{μ}}_{X, k} (x) = E \{N_{i k} (x) ∣ X_{i}\}$, $g_{X, k} (x) = d {\tilde{μ}}_{X, k} (x) / d x$, $υ_{k} (b) = E [X_{i} N_{i k} \{\exp (X_{i}^{⊤} b)\}]$, and $B_{k} (b) = d υ_{k} (b) / d b^{⊤}$. It follows from simple algebra that $B_{k} (b) = E \{X^{\otimes 2} e^{X^{⊤} b} g_{X, k} (e^{X^{⊤} b})\}$, where $v^{\otimes 2} = v v^{⊤}$ for any vector v. Let $f_{X}^{L} (x)$ and $f_{X}^{R} (x)$ be the conditional density functions of L and R given X respectively, $\tilde{υ} (b) = E [X_{i} Y_{i} \{\exp (X_{i}^{⊤} b)\}]$, and $J (b) = d \tilde{υ} (b) / d b^{⊤}$; then we have $J (b) = E [X^{\otimes 2} e^{X^{⊤} b} \{f_{X}^{L} (e^{X^{⊤} b}) - f_{X}^{R} (e^{X^{⊤} b})\}]$. Denote $ℬ_{k} (d) = \{b \in R^{p} : \inf_{u \in (0, U]} ‖ υ_{k} (b) - υ_{k} \{β_{0 k} (u)\} ‖ \leq d\}$ as a neighborhood containing $\{β_{0 k} (u), u \in (0, U]\}$, where ‖·‖ is the Euclidean norm.

We assume the following regularity conditions:

C1
$X_{i}$ and $N_{i k} (R_{i})$ are bounded, and $E (X^{\otimes 2})$ is positive definite.
C2
Each component of $υ_{k} \{β_{0 k} (u)\}$ is Lipschitz continuous for u ∈ (0, U], k = 1, …, K.
C3
For some $d_{0} > 0$, $g_{X, k} (\exp (X^{⊤} b)) > 0$ for any $b \in ℬ_{k} (d_{0})$ and X ∈ 𝒳.
C4
Each component of $J (b) B_{k} (b)^{- 1}$ is uniformly bounded in $b \in ℬ_{k} (d_{0})$.
C5
For any v ∈ (0, U], $\inf_{u \in [v, U]} eigmin B_{k} \{β_{0 k} (u)\} > 0$, where eigmin(·) denotes the minimum eigenvalue of a matrix.
C6
The bandwidth sequence h satisfies $n h^{2 r} \to 0$, $n h^{2 (d + 1)} \to \infty$, and $n h^{d + 1} / \log n \to \infty$.
C7
The functions f(z), $g_{Z} (t)$, π(t, z), and $p_{k} (t, z)$ are uniformly bounded away from zero, and have r continuous and bounded partial derivatives with respect to t and the continuous components of z almost surely.

Note that conditions C1–C5 are the same as those adopted by Sun et al. (2016) for justifying the use of equation (2) for estimating the GART model with fully observed event types. It is worth mentioning that condition C3 implies that the support of L must include 0 and the support of R must cover $\exp \{X^{⊤} β_{0 k} (U)\}$ for all X ∈ 𝒳. This constraint is necessary to ensure the identifiability of $\{β_{0 k} (u) : u \in (0, U]\}$. Conditions C6 and C7 are common assumptions in the literature (Chen et al., 2015; Qiu et al., 2017, for example) that ensure the desirable large sample properties of the nonparametric kernel estimators $\hat{π} (t, z)$ and ${\hat{p}}_{k} (t, z)$. We have the following theorems:

Theorem 1

Suppose model (1) holds for u ∈ (0, U]. Under the regularity conditions C1–C7, if $\lim_{n \to \infty} ‖ \mathcal{S}_{L (n)} ‖ = 0$, then ${sup}_{u \in [v, U]} ‖ {\hat{β}}_{k}^{L} (u) - β_{0 k} (u) ‖ \overset{p}{\to} 0$ for k = 1, …, K and L = IPW or EEP, where 0 < v < U.

Theorem 2

Suppose model (1) holds for u ∈ (0, U]. Under the regularity conditions C1–C7, if $\lim_{n \to \infty} n^{1 / 2} ‖ \mathcal{S}_{L (n)} ‖ = 0$, then $n^{1 / 2} \{{\hat{β}}_{k}^{L} (u) - β_{0 k} (u)\}$ converges weakly to a Gaussian process for u ∈ [v,U] with covariance $Σ (s, t) ≐ E [η_{i k} (s) η_{i k} (t)^{⊤}]$, where 0 < v < U, $η_{i k} (u) = B_{k} \{β_{0 k} (u)\}^{- 1} ϕ (ξ_{i k})$,

ξ_{i k} (τ) = X_{i} {N_{i k}^{AIPW} (exp {X_{i}^{⊤} β_{0 k} (τ)}) - \int_{0}^{τ} Y_{i} (exp {X_{i}^{⊤} β_{0 k} (u)}) g (u) d u},

$ϕ (w) (u) = \int_{0}^{u} I (s, u) d w (s)$ is a linear operator, and

I (s, t) = \prod_{u \in (s, t]} [I_{p} + J {β_{0 k} (u)} B_{k} {β_{0 k} (u)}^{- 1} g (u) d u] .

Note that Theorem 2 not only establishes the weak convergence result for the proposed estimators but also indicates that the proposed IPW and EEP estimators have the same limiting distribution. Detailed proofs of Theorems 1–2 are provided in the Supplementary Materials.
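As a reading aid for the product integral I(s, t) in Theorem 2, consider the scalar case p = 1, an illustrative special case that is not treated separately in the paper. Write $a (u) = J \{β_{0 k} (u)\} B_{k} \{β_{0 k} (u)\}^{- 1} g (u)$ and assume, for this illustration, that a(u) is continuous on [s, t]. For partitions $s = u_{0} < u_{1} < \cdots < u_{m} = t$ the product integral is the limit of finite products and reduces to an ordinary exponential:

$$
I (s, t) = \prod_{u \in (s, t]} \{1 + a (u) d u\} = \lim_{\max_{j} | u_{j} - u_{j - 1} | \to 0} \prod_{j = 1}^{m} \{1 + a (u_{j}) (u_{j} - u_{j - 1})\} = \exp \left\{ \int_{s}^{t} a (u) d u \right\} .
$$

For p > 1 the matrices $J \{β_{0 k} (u)\} B_{k} \{β_{0 k} (u)\}^{- 1}$ at different u need not commute, so in general no such closed form is available and I(s, t) is evaluated as the limit of ordered matrix products.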

4. Inferences

4.1 Resampling approach

For inference on β₀_k(u), we propose a simple resampling procedure by adapting the work of Jin et al. (2001). Suppose {ζ_i, i = 1, …, n} are independent and identically distributed variables from a nonnegative known distribution with mean 1 and variance 1, such as the exponential distribution with rate 1.

We first need to obtain the resampled versions of π(t, z) and p_k(t, z), which are respectively

{\hat{π}}^{*} (t, z) = \frac{\sum_{i = 1}^{n} ζ_{i} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) A_{i} (s) d N_{i \cdot} (s)}{\sum_{i = 1}^{n} ζ_{i} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) d N_{i \cdot} (s)}

and

{\hat{p}}_{k}^{*} (t, z) = \frac{\sum_{i = 1}^{n} ζ_{i} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) A_{i} (s) δ_{i k} (s) d N_{i \cdot} (s)}{\sum_{i = 1}^{n} ζ_{i} K_{h} (Z_{1 i} - z_{1}) I (Z_{2 i} = z_{2}) \int K_{h} (s - t) A_{i} (s) d N_{i \cdot} (s)} .

Then we define $β_{k}^{L *} (\cdot)$ as the generalized solution to the perturbed estimating equation,

n^{- 1 / 2} \sum_{i = 1}^{n} ζ_{i} X_{i} {{\hat{N}}_{i k}^{L *} (exp {X_{i}^{⊤} β_{k} (u)}) - \int_{0}^{u} Y_{i} (exp {X_{i}^{⊤} β_{k} (s)}) g (s) d s} = 0,

(10)

where ${\hat{N}}_{i k}^{L *} (\cdot)$ is ${\hat{N}}_{i k}^{L} (\cdot)$ with $\hat{π}$ or ${\hat{p}}_{k}$ replaced by ${\hat{π}}^{*}$ or ${\hat{p}}_{k}^{*}$ respectively. We can obtain $β_{k}^{L *} (\cdot)$ using a procedure similar to that described in Section 2.3. It can be shown that the conditional distribution of $n^{1 / 2} \{β_{k}^{L *} (u) - {\hat{β}}_{k}^{L} (u)\}$ given the observed data and the unconditional distribution of $n^{1 / 2} \{{\hat{β}}_{k}^{L} (u) - β_{0 k} (u)\}$ have the same limiting distribution. By fixing the data at the observed values and repeatedly generating $\{ζ_{i}, i = 1, \dots, n\}$, we can obtain a large number of realizations of $β_{k}^{L *} (u)$. The empirical distribution of $β_{k}^{L *} (u)$ can be used to estimate the covariance of ${\hat{β}}_{k}^{L} (u)$ or to construct a confidence interval for $β_{0 k} (u)$.
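The mechanics of the perturbation resampling can be sketched in Python for a deliberately simplified setting: one sample (intercept only), fully observed event types, g(·) = 1, and all subjects treated as at risk over the relevant time range, so that the perturbed equation (10) reduces to matching a ζ-weighted event count to u times the sum of the ζ's. The function names and toy data are hypothetical, and the kernel estimators are not involved in this reduced case.

```python
import math
import random
import statistics

def tau_hat(subject_times, zeta, u):
    """Smallest pooled event time t with sum_i zeta_i * N_i(t) >= u * sum_i zeta_i."""
    pooled = sorted((T, z) for times, z in zip(subject_times, zeta) for T in times)
    target = u * sum(zeta)
    cum = 0.0
    for T, z in pooled:
        cum += z
        if cum >= target:
            return T
    return math.inf

def perturbation_se(subject_times, u, n_rep=200, seed=0):
    """Standard error of the log time to expected frequency via Exp(1) perturbations."""
    rng = random.Random(seed)
    n = len(subject_times)
    draws = []
    for _ in range(n_rep):
        zeta = [rng.expovariate(1.0) for _ in range(n)]   # mean 1, variance 1
        draws.append(math.log(tau_hat(subject_times, zeta, u)))
    return statistics.stdev(draws), draws

# Toy data (hypothetical): event times for 5 subjects.
subject_times = [
    [0.8, 1.9, 3.1],
    [1.2, 2.5],
    [0.5, 1.1, 1.8, 2.9],
    [1.5, 3.4],
    [0.9, 2.2, 3.0],
]
point_estimate = tau_hat(subject_times, [1.0] * 5, u=1.0)   # unperturbed estimate
se_log_tau, draws = perturbation_se(subject_times, u=1.0)
```

The data are held fixed across replications and only the weights change. The spread of the perturbed estimates around the unperturbed one is then used as an approximation to the sampling variability of the estimator, and percentiles of `draws` could be used for a confidence interval in the same way.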

4.2 Sample-based variance and covariance estimation

We develop a sample-based approach to estimate the variance and covariance of ${\hat{β}}_{k}^{L} (\cdot)$, along the lines of Sun et al. (2016). Specifically, define $L_{n k}^{L} (b) = n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} {\hat{N}}_{i k}^{L} (exp {X_{i}^{⊤} b})$, $ι_{i k}^{L} (u) = X_{i} {\hat{N}}_{i k}^{L} (exp {X_{i}^{⊤} {\hat{β}}_{k}^{L} (u)})$, $Ω_{n k}^{L} (u) = n^{- 1} \sum_{i = 1}^{n} {ι_{i k}^{L} (u)}^{\otimes 2}$, and ${\tilde{L}}_{n} (b) = n^{- 1 / 2} \sum_{i = 1}^{n} X_{i} Y_{i} (exp {X_{i}^{⊤} b})$. The following are steps to obtain consistent estimates for $B_{k} \{β_{0 k} (u)\}$ and $J \{β_{0 k} (u)\}$, the key unknown components of the asymptotic covariance from Theorem 2:

Step 1: Find a nonsingular and symmetric p × p matrix $E_{n k}^{L} (u) \equiv \{e_{n k, 1}^{L} (u), \dots, e_{n k, p}^{L} (u)\}$ such that $Ω_{n k}^{L} (u) = \{E_{n k}^{L} (u)\}^{2}$.
Step 2: Find the solution $b_{n k, j}^{L} (u)$ by solving the equation
$L_{n k}^{L} (b) = L_{n k}^{L} ({\hat{β}}_{k}^{L} (u)) + e_{n k, j}^{L} (u)$ (11)

for b, j = 1, …, p. The working estimating equation (11) is monotone and can be solved by minimizing the following L₁-type function:
$\sum_{i = 1}^{n} \sum_{j = 1}^{\infty} {\hat{ω}}_{i, j, k}^{L} | \log (T_{i}^{(j)}) - X_{i}^{⊤} b | + | R^{*} - \{- \sum_{i = 1}^{n} \sum_{j = 1}^{\infty} X_{i} {\hat{ω}}_{i, j, k}^{L} + 2 \sum_{i = 1}^{n} X_{i} {\hat{N}}_{i k}^{L} (\exp \{X_{i}^{⊤} {\hat{β}}_{k}^{L} (u)\}) + 2 n^{1 / 2} e_{n k, j}^{L} (u)\}^{⊤} b |$

with the same strategy presented for minimizing $W_{l, k}^{L *} (h)$.
Step 3: Compute $D_{n k}^{L} (u) ≐ \{b_{n k, 1}^{L} (u) - {\hat{β}}_{k}^{L} (u), \dots, b_{n k, p}^{L} (u) - {\hat{β}}_{k}^{L} (u)\}$ and ${\tilde{E}}_{n k}^{L} (u) ≐ \{{\tilde{L}}_{n} (b_{n k, 1}^{L} (u)) - {\tilde{L}}_{n} ({\hat{β}}_{k}^{L} (u)), \dots, {\tilde{L}}_{n} (b_{n k, p}^{L} (u)) - {\tilde{L}}_{n} ({\hat{β}}_{k}^{L} (u))\}$.
Step 4: Calculate $n^{- 1 / 2} E_{n k}^{L} (u) \{D_{n k}^{L} (u)\}^{- 1}$ and $n^{- 1 / 2} {\tilde{E}}_{n k}^{L} (u) \{D_{n k}^{L} (u)\}^{- 1}$, which are consistent estimates for $B_{k} \{β_{k 0} (u)\}$ and $J \{β_{k 0} (u)\}$, respectively.
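The matrix operations in the first and last steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the symmetric square root is obtained through an eigendecomposition (one of several valid choices), and the L₁ minimization that produces the columns of D is not shown.

```python
import numpy as np


def omega_from_iota(iota):
    """Omega = n^{-1} * sum_i iota_i iota_i^T for an (n, p) array iota."""
    iota = np.asarray(iota, dtype=float)
    return iota.T @ iota / iota.shape[0]


def symmetric_sqrt(omega):
    """Symmetric, nonsingular E with E @ E == omega (omega positive definite)."""
    omega = np.asarray(omega, dtype=float)
    vals, vecs = np.linalg.eigh((omega + omega.T) / 2.0)
    if vals.min() <= 0.0:
        raise ValueError("omega must be positive definite")
    # V diag(sqrt(lambda)) V^T
    return (vecs * np.sqrt(vals)) @ vecs.T


def slope_estimate(E, D, n):
    """n^{-1/2} E D^{-1}, the form used in the final step."""
    return n ** -0.5 * E @ np.linalg.inv(D)


# Toy check with random vectors standing in for the iota_i.
rng = np.random.default_rng(1)
iota = rng.normal(size=(50, 3))
omega = omega_from_iota(iota)
E = symmetric_sqrt(omega)
```

The columns of `E` play the role of the perturbation directions added to the right-hand side of equation (11).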

Denote by B̂_k(u) and Ĵ_k(u) the resulting estimators of $B_{k} \{β_{k 0} (u)\}$ and $J \{β_{k 0} (u)\}$, respectively, and let ${\hat{η}}_{i k}^{L} (t) = {\hat{B}}_{k} {(t)}^{- 1} \hat{ϕ} ({\hat{ξ}}_{i k}^{L})$, where ϕ̂(·) is the plug-in estimate for the operator ϕ(·) (defined in Theorem 2). Let ${\hat{N}}_{i k}^{AIPW} (t) = \sum_{j = 1}^{\infty} [\frac{A_{i}^{(j)}}{{\hat{π}}_{i}^{(j)}} δ_{i k}^{(j)} + (1 - \frac{A_{i}^{(j)}}{{\hat{π}}_{i}^{(j)}}) {\hat{p}}_{i k}^{(j)}] I (L_{i} < T_{i}^{(j)} \leq R_{i} \land t)$, and ${\hat{ξ}}_{i k}^{L} (u) = X_{i} \{{\hat{N}}_{i k}^{AIPW} (\exp \{X_{i}^{⊤} {\hat{β}}_{k}^{L} (u)\}) - \int_{0}^{u} Y_{i} (\exp \{X_{i}^{⊤} {\hat{β}}_{k}^{L} (s)\}) g (s) d s\}$, for i = 1, …, n, k = 1, …, K, and L = IPW or EEP. A consistent sample-based estimate for Σ(s, t) is given by $n^{- 1} \sum_{i = 1}^{n} {\hat{η}}_{i k}^{L} (s) {\hat{η}}_{i k}^{L} {(t)}^{⊤}$.

4.3 Second-stage exploration of varying effects

Given ${\hat{β}}_{k}^{L} (τ)$ ’s on a range of τ’s, we can employ second-stage inference to summarize and explore the underlying varying pattern of β₀_k(u). The second-stage inference procedures can be carried out by adapting the lines of Sun et al. (2016).

Below we illustrate the second-stage inference procedures via a case where the interest is to assess the constancy of a covariate effect. This problem corresponds to testing the null hypothesis $H_{k 0, j} : β_{k 0}^{(j)} (u) = ρ_{0}$, u ∈ [u_L, u_U], where ρ₀ is an unspecified constant. Here and in the rest of this subsection, the superscript $^{(j)}$ indicates the jth component of a vector (j = 2, …, p), and we omit the superscript $^{L}$ that indicates IPW or EEP.

For $H_{k 0, j}$, we can use the test statistic $\mathcal{T} = n^{1 / 2} \int_{u_{L}}^{u_{U}} Ξ (u) \{{\hat{β}}_{k}^{(j)} (u) - {\hat{ρ}}_{k}^{(j)}\} d u$, where Ξ(u) is a non-constant weight function satisfying $\int_{u_{L}}^{u_{U}} Ξ (u) d u = 1$, and ${\hat{ρ}}_{k} = {(u_{U} - u_{L})}^{- 1} \int_{u_{L}}^{u_{U}} {\hat{β}}_{k} (u) d u$. Let $\mathcal{T}^{*} = n^{1 / 2} \int_{u_{L}}^{u_{U}} Ξ (u) [\{{\hat{β}}_{k}^{* (j)} (u) - {\hat{β}}_{k}^{(j)} (u)\} - ({\hat{ρ}}_{k}^{* (j)} - {\hat{ρ}}_{k}^{(j)})] d u$, where ${\hat{ρ}}_{k}^{*} = {(u_{U} - u_{L})}^{- 1} \int_{u_{L}}^{u_{U}} {\hat{β}}_{k}^{*} (u) d u$. We may reject $H_{k 0, j}$ if $\mathcal{T} > d_{1 - α / 2}$ or $\mathcal{T} < d_{α / 2}$, where $d_{α / 2}$ and $d_{1 - α / 2}$ are the (α/2)th and the (1 − α/2)th empirical quantiles of $\mathcal{T}^{*}$. Accepting $H_{k 0, j}$ for all j = 2, …, p may indicate the adequacy of an AFT model when g(·) = 1. Following the arguments of Li and Peng (2014), we can show that the presented constancy test procedure has a type-I error approaching α as n → ∞. The power of the test may be influenced by the choice of the weight function Ξ(u). In practice, one may choose Ξ(u) according to the observed pattern of β̂_k(u) so that it emphasizes the differences from the null and avoids poor power. Note that we can also show that ρ̂_k is a consistent estimate for the average covariate effect, defined as ${(u_{U} - u_{L})}^{- 1} \int_{u_{L}}^{u_{U}} β_{k 0} (u) d u$. The standard error of ρ̂_k can be obtained as the empirical standard deviation of ${\hat{ρ}}_{k}^{*}$. When $β_{k 0} (u)$ is indeed constant over u, such a constant effect equals the average covariate effect, and hence can be estimated by ρ̂_k.
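The constancy test can be sketched numerically as below, assuming the coefficient estimate and its resampled versions are available on a common grid of u values. This is an illustrative sketch rather than the authors' implementation: the integrals are approximated by the trapezoidal rule, the function names are hypothetical, the step-shaped weight is one admissible choice of Ξ(u), and the resampled curves are simulated noise rather than output of the resampling procedure.

```python
import numpy as np


def integrate(y, u):
    """Trapezoidal rule along the last axis."""
    y = np.asarray(y, dtype=float)
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(u) / 2.0, axis=-1)


def constancy_test(u, beta_hat, beta_star, weight, n, alpha=0.05):
    """u: (m,) grid; beta_hat: (m,); beta_star: (B, m); weight: (m,)."""
    length = u[-1] - u[0]
    rho_hat = integrate(beta_hat, u) / length
    rho_star = integrate(beta_star, u) / length            # shape (B,)
    T = np.sqrt(n) * integrate(weight * (beta_hat - rho_hat), u)
    centred = (beta_star - beta_hat) - (rho_star - rho_hat)[:, None]
    T_star = np.sqrt(n) * integrate(weight * centred, u)   # shape (B,)
    lo, hi = np.quantile(T_star, [alpha / 2.0, 1.0 - alpha / 2.0])
    return {
        "T": float(T),
        "rho_hat": float(rho_hat),
        "se_rho": float(rho_star.std(ddof=1)),
        "reject": bool(T < lo or T > hi),
    }


rng = np.random.default_rng(2)
n = 200
u = np.linspace(0.02, 3.0, 150)
# Step weight putting all mass on the first half of [u_L, u_U]; integrates to 1.
weight = 2.0 * (u <= (u[0] + u[-1]) / 2.0) / (u[-1] - u[0])
# Hypothetical resampling noise: random level and slope shifts.
a = rng.normal(0.0, 0.05, size=(200, 1))
b = rng.normal(0.0, 0.05, size=(200, 1))
noise = a + b * (u - u.mean())

varying = np.minimum(1.0, u / 1.5)   # effect that increases with u
constant = np.full_like(u, 0.5)      # effect that is constant in u
result_varying = constancy_test(u, varying, varying + noise, weight, n)
result_constant = constancy_test(u, constant, constant + noise, weight, n)
```

In this toy setup the increasing effect is flagged as non-constant while the flat effect is not; `se_rho` corresponds to the empirical standard deviation of the resampled average effects.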

5. Simulation Studies

We conduct Monte Carlo simulations to examine the finite-sample performance of the proposed methods. We consider the situation with two event types (i.e. K = 2). Let $\{T_{k}^{* (j)}, j = 1, 2, …\}$ be the ordered event times of a standard homogeneous Poisson process; in other words, $\{T_{k}^{* (j)} - T_{k}^{* (j - 1)} : j = 1, 2, …\}$ are independent and identically distributed Exponential(1) random variables with $T_{k}^{* (0)} = 0$. The type-k recurrent event times are generated as

$T_{k}^{(j)} = \exp \{\min (1, \frac{ρ_{1 k} \cdot T_{k}^{* (j)}}{1.5 γ_{k}}) \cdot X_{1} + ρ_{2 k} \cdot X_{2}\} \frac{ρ_{0 k} \cdot T_{k}^{* (j)}}{γ_{k}}, \quad k = 1, 2; \; j = 1, 2, \dots,$

where the two covariates X₁ and X₂ follow the Bernoulli distribution, Bernoulli(0.5), and the uniform distribution Uniform(−0.5, 0.5), respectively. The frailty γ_k, which determines the level of intra-individual correlation, is drawn from the following two cases:

Case 1: γ_k = 1;
Case 2: γ_k ~ Gamma(2, 1/2) with E(γ_k) = 1 and Var(γ_k) = 1/2.

Under these simulation setups,

$τ_{X, k} (u) = \exp \{\log (ρ_{0 k} \cdot u) + \min (1, ρ_{1 k} \cdot u / 1.5) \cdot X_{1} + ρ_{2 k} \cdot X_{2}\}$

for k = 1, 2. It is seen that X₁’s effect on time to expected frequency increases with u, while X₂’s effect is constant. In addition, we generate L_i from ω · Uniform(0, 1) and R_i from Uniform(L_i, 12), where ω is a Bernoulli(0.8) random variable. We set ρ₀₁ = ρ₁₁ = ρ₂₁ = 1.5, which yields about 2.7 observed type-1 recurrent events per subject on average, and ρ₀₂ = ρ₁₂ = ρ₂₂ = 2, which yields approximately 2 observed type-2 events per subject.
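The data-generating scheme above can be sketched as follows for Case 2. This is a hedged illustration, not the authors' simulation code: the function names are hypothetical, the frailties for the two event types are drawn independently (an assumption), and Gamma(2, 1/2) is read as shape 2 and scale 1/2 so that the mean is 1 and the variance is 1/2.

```python
import numpy as np


def event_times(x1, x2, gamma, rho, left, right, rng):
    """Type-k event times falling in the observation window (left, right].

    rho = (rho0, rho1, rho2) for the event type; gamma is the frailty.
    """
    rho0, rho1, rho2 = rho
    times = []
    t_star = 0.0
    while True:
        t_star += rng.exponential(1.0)  # next arrival of a rate-1 Poisson process
        s = t_star / gamma
        t = np.exp(min(1.0, rho1 * s / 1.5) * x1 + rho2 * x2) * rho0 * s
        if t > right:  # the map t_star -> t is increasing, so we can stop here
            break
        if t > left:
            times.append(t)
    return np.array(times)


def simulate_subject(rng, frailty=True):
    x1 = rng.binomial(1, 0.5)
    x2 = rng.uniform(-0.5, 0.5)
    left = rng.binomial(1, 0.8) * rng.uniform(0.0, 1.0)
    right = rng.uniform(left, 12.0)
    out = {"x1": x1, "x2": x2, "left": left, "right": right}
    for k, rho in {1: (1.5, 1.5, 1.5), 2: (2.0, 2.0, 2.0)}.items():
        # Case 2 frailty; set frailty=False for Case 1 (gamma = 1).
        gamma = rng.gamma(shape=2.0, scale=0.5) if frailty else 1.0
        out[f"type{k}"] = event_times(x1, x2, gamma, rho, left, right, rng)
    return out


rng = np.random.default_rng(3)
subjects = [simulate_subject(rng) for _ in range(2000)]
mean_type1 = np.mean([len(s["type1"]) for s in subjects])
```

Here `mean_type1` is the simulated average number of observed type-1 events per subject, which can be compared informally with the value reported above.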

We simulate missing event types by drawing $A_{i}^{(j)}$ at each recurrent event time $T_{i}^{(j)}$ from a $Bernoulli (π_{i}^{(j)})$ distribution, where $π (t, z) = 1 - \frac{1}{1 + \exp \{z {(t)}^{⊤} α\}}$ and z(t) = (X₁, t)^⊤. In our simulations, we set α = (1, 0.15)^⊤, leading to about 30% missing event types. For each data scenario, we generate 500 datasets of sample size n = 200.
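A minimal sketch of this missingness mechanism is given below, assuming that A = 1 marks an event whose type is observed (consistent with the weights A/π̂ used earlier). The function names and the example event times are hypothetical.

```python
import numpy as np


def observe_prob(t, x1, alpha=(1.0, 0.15)):
    """pi(t, z) = 1 - 1 / (1 + exp(z(t)^T alpha)) with z(t) = (x1, t)."""
    lin = alpha[0] * x1 + alpha[1] * np.asarray(t, dtype=float)
    return 1.0 - 1.0 / (1.0 + np.exp(lin))


def draw_type_indicators(times, x1, rng, alpha=(1.0, 0.15)):
    """Draw A ~ Bernoulli(pi) independently at each event time."""
    return rng.binomial(1, observe_prob(times, x1, alpha))


rng = np.random.default_rng(5)
times = np.array([0.5, 2.0, 7.5])   # hypothetical event times for one subject
A = draw_type_indicators(times, x1=1, rng=rng)
```

Under this model the probability of observing the event type increases with both X₁ and the event time.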

We fit the GART model (1) to each simulated dataset, setting g(u) = 1. We apply the proposed IPW and EEP methods, adopting an equally spaced grid on u ∈ (0, 3] with step size 0.02 and choosing the Normal kernel, $K(x) = (2π)^{-1/2} \exp(-x^{2}/2)$. We compare our methods with the naive complete-case (CC) analysis, which uses only the events with known event types, and with the hypothetical Full analysis, which applies the method of Sun et al. (2016) to the underlying full data containing the complete event type information. In Figure 1, we present the simulation results for the type-1 event coefficient estimates in Case 2. In the first row of Figure 1, we plot the empirical bias of the IPW estimator (dotted lines), the EEP estimator (dash-dotted lines), the CC estimator (dashed lines), and the Full estimator (solid lines). The results show that the proposed IPW and EEP estimators exhibit very small bias except at small values of u. In contrast, the CC method produces substantially biased coefficient estimates. The second row of Figure 1 depicts the empirical standard deviation (SD) and the average standard error (ASE, based on the resampling method) versus the expected frequency u for the proposed IPW and EEP estimators. We observe that the empirical SD and the ASE agree with each other very well. The standard errors of the IPW estimator are slightly larger than those of the EEP estimator.

Figure 1. Simulation results for the event type 1 coefficients in Case 2. IPW, the inverse probability weighting estimator; EEP, the estimating equation projection estimator; CC, the complete-case estimator; Full, the full data estimator; SD, the empirical standard deviation; ASE, the average standard error; CP, the coverage probability.

In our simulations, we evaluate both resampling-based and sample-based inference procedures. For the resampling method, a resampling size of 100 is chosen. The coverage probabilities of 95% confidence intervals obtained from the two inference approaches are depicted in the third and fourth rows of Figure 1, respectively. The results show that the resampling procedure and the sample-based strategy have quite comparable performance. The resulting coverage probabilities (CP) of the two proposed estimators are fairly close to the nominal value; the resampling procedure may perform slightly better than the sample-based method. This is consistent with the observed large bias of the CC estimator. The computation of the sample-based approach is about 2 to 3 times faster than that of the resampling procedure.

We have very similar observations on the results from fitting the GART model for type-2 events in Case 2 and results obtained in Case 1; these results are relegated to the Supplementary Materials (see Figures S1-S3). In some unreported simulations, we find that using a different kernel function, such as the Epanechnikov kernel K(x) = 0.75(1 − x²)I(|x| < 1), yields little change to the empirical performance of the proposed estimators.
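For reference, the two kernels mentioned in this section can be written down directly, together with a generic Nadaraya–Watson smoother. The smoother below is a simplified stand-in that smooths a binary indicator over time only; it ignores the covariates z and is not the estimator of π(t, z) defined earlier in the paper. All function names are hypothetical.

```python
import numpy as np


def normal_kernel(x):
    """K(x) = (2*pi)^{-1/2} * exp(-x^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)


def epanechnikov_kernel(x):
    """K(x) = 0.75 * (1 - x^2) * I(|x| < 1)."""
    x = np.asarray(x, dtype=float)
    return 0.75 * (1.0 - x**2) * (np.abs(x) < 1.0)


def nw_estimate(t0, t, a, h, kernel=normal_kernel):
    """Kernel-weighted average of a at time t0 with bandwidth h."""
    w = kernel((t0 - np.asarray(t, dtype=float)) / h)
    total = w.sum()
    if total == 0.0:
        raise ValueError("no observations within the kernel support")
    return float(np.sum(w * a) / total)


# Toy data: indicators observed with constant probability 0.7.
rng = np.random.default_rng(4)
t = rng.uniform(0.0, 10.0, size=2000)
a = rng.binomial(1, 0.7, size=2000)
pi_hat = nw_estimate(5.0, t, a, h=1.0)
pi_hat_epan = nw_estimate(5.0, t, a, h=1.0, kernel=epanechnikov_kernel)
```

With a compactly supported kernel such as the Epanechnikov kernel, the denominator can be zero when no observation lies within one bandwidth of `t0`, which is why the sketch checks for that case.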

We also investigate the sensitivity of the proposed procedures to bandwidth selection. We consider Case 2 with 500 replications of sample size n = 200. Figure 2 presents the proposed coefficient estimates for type-1 events with different choices of h: h = 0.6 (solid lines), h = 0.8 (dashed lines), h = 1.0 (dotted lines), and h = 1.2 (dash-dotted lines). The results for type-2 events are presented in Figure S4 of the Supplementary Materials. As seen from Figure 2, the empirical biases and empirical standard deviations corresponding to different values of h are almost the same. This indicates that the performance of the proposed IPW and EEP methods is insensitive to the choice of bandwidth h.

Figure 2. Comparison of the event type 1 coefficient estimates for Case 2 under different values of h. IPW, the inverse probability weighting estimator; EEP, the estimating equation projection estimator; SD, the empirical standard deviation.

6. A Real Data Example

Cystic fibrosis (CF) is a life-limiting genetic disorder with an incidence of approximately 1:3400 among Caucasians (Boat and Acton, 2007). The Cystic Fibrosis Foundation (CFF) Patient Registry (CFFPR) has documented the diagnosis, treatments, and health of all known cystic fibrosis patients at more than 120 CFF-accredited care centers across the United States since the 1970s (Knapp et al., 2016). Pseudomonas aeruginosa (Pa) is one of the major pathogens in CF lungs and leads to chronic infections and lung function decay. Pa types (mucoid, nonmucoid, or mucoid status unknown) have been reported in the CFFPR. It is of scientific interest to assess how the recurrence times of nonmucoid and mucoid Pa infections are each influenced by potential risk factors in young CF children.

We consider a dataset from the 2007 CFFPR registry data, which includes 4,144 subjects who were born after 1997 and had known diagnosis factor mode before the end of year 2007. During the follow-up of these subjects, 9,615 nonmucoid Pa infections and 3,393 mucoid Pa infections were recorded, along with 1,585 Pa infections with unknown types. The percentages of nonmucoid, mucoid, and missing Pa infection types are 65.9%, 23.2%, and 10.9%, respectively. The number of positive Pa infections (nonmucoid and mucoid) observed for each subject ranges from 1 to 40, with mean 3.5 and median 2.

In our data analysis, with the time origin set as the birth of each subject, the recurrent event time $T^{(j)}$ stands for the age of a CF child at his/her jth Pa infection, L corresponds to the age at registry entry, and R corresponds to the age at death or the last follow-up. In our dataset, 13.8% of subjects entered the study right after birth, and L = 0 in these cases. We consider risk factors including sex and diagnosis factor (meconium ileus status; newborn screening; family history; and signs/symptoms). The summary statistics of these risk factors are provided in Table 1. The covariates included in our models are coded as follows: Sex, 1 if the subject was female and 0 otherwise; MI, 1 if the subject was diagnosed by meconium ileus and 0 otherwise; NewScreen, 1 if the subject was diagnosed through newborn screening and 0 otherwise; FamilyHis, 1 if the subject’s family had a history of CF and 0 otherwise.

Table 1.

Summary Statistics of Sex and Diagnosis Factor in the CFFPR dataset

	Sex	Sex	Diagnosis Factor	Diagnosis Factor	Diagnosis Factor	Diagnosis Factor
	Male	Female	MI	NewScreen	FamilyHis	Symptoms
n (%)	2031 (49%)	2113 (51%)	1090 (26%)	624 (15%)	197 (5%)	2233 (54%)


MI: meconium ileus; NewScreen: newborn screening; FamilyHis: family history; Symptoms: signs/symptoms

We apply the proposed methods to this CFFPR dataset with the covariates described above, setting g(u) = 1. We choose the Normal kernel function and the bandwidth $h = 4 n^{-1/3} \mathrm{sd}(T)$, as suggested in Qiu et al. (2017). We use the proposed resampling procedure for inference, such as constructing confidence intervals. In our analysis, we adopt the MAR assumption (4) with Z_i including the covariates Sex, MI, NewScreen, and FamilyHis, which means that these observed covariates can fully account for the missingness of the Pa infection type. This is a reasonable assumption for the CFFPR dataset because, according to the investigation of Gouskova et al. (2017), the two major causes of missing Pa infection types are (a) lack of technology to classify the type of Pa infection as mucoid or nonmucoid and (b) data recording negligence. Since the MAR assumption is not statistically verifiable (Little and Rubin, 2002), we perform a sensitivity analysis by considering different specifications of Z_i. As shown in the Supplementary Materials (see Section S4), when Z_i only includes NewScreen and FamilyHis, the analysis results are very similar to those in Figure 4. This suggests the robustness of the proposed method to variations of the adopted MAR model.
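The bandwidth rule quoted above is simple to compute. The sketch below assumes that n is the number of subjects and that sd(T) is the sample standard deviation of the pooled event times (with denominator n − 1); both readings are assumptions, and the function name is hypothetical.

```python
import numpy as np


def rule_of_thumb_bandwidth(event_times, n):
    """h = 4 * n^{-1/3} * sd(T) for pooled event times and n subjects."""
    event_times = np.asarray(event_times, dtype=float)
    return 4.0 * n ** (-1.0 / 3.0) * np.std(event_times, ddof=1)


# Toy example: five event times pooled over n = 8 subjects.
toy_times = [1.0, 2.0, 3.0, 4.0, 5.0]
h = rule_of_thumb_bandwidth(toy_times, n=8)
```

Since the rule scales with n^{-1/3}, the bandwidth shrinks slowly as the number of subjects grows.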

Figure 4. CFFPR data example: the proposed IPW coefficient estimates (solid lines in the top row) and their corresponding 95% pointwise confidence intervals, the proposed EEP coefficient estimates (solid lines in the bottom row) and their corresponding 95% pointwise confidence intervals, along with the complete-case (CC) coefficient estimates (dotted lines) for mucoid Pa infection.

In Figures 3 and 4, we plot the estimated coefficients along with their 95% pointwise confidence intervals for the nonmucoid and mucoid Pa infections, respectively. The inverse probability weighting (IPW) estimates are shown as solid lines in the first row, while the estimating equation projection (EEP) estimates are shown as solid lines in the second row. It can be seen that the two proposed estimators differ little.

In Figures 3 and 4, the intercept coefficient estimates represent the estimated log time to expected frequency of nonmucoid or mucoid Pa infection for the reference group, which consists of CF boys diagnosed by signs/symptoms. For example, for this reference group, the times from birth to an expected nonmucoid and mucoid Pa infection frequency of 1.0 are approximately 0.36 and 2.04 years, respectively. This indicates a much later development of mucoid Pa infection compared to nonmucoid Pa infection in CF children, which is consistent with the common clinical manifestations of Pa infections.

The nonintercept coefficient estimates depict the estimated effects of covariates, where negative values indicate more rapid progression to recurrence of nonmucoid or mucoid Pa infections. We see from Figure 3 that there is no significant difference in recurrence times of nonmucoid Pa infections between CF boys and CF girls. Newborn screening (NewScreen) shows a positive effect on the time to expected frequency of nonmucoid Pa infections with u < 0.1; however, its effect seems to diminish at larger u’s. The estimated coefficients for MI and FamilyHis are mostly significantly above zero. These results may reflect the benefits of early CF diagnosis, as CF children typically are diagnosed earlier through MI, newborn screening, and family history than through signs/symptoms.

Considering mucoid Pa infections, we have some different findings regarding the covariate effects. In Figure 4, the estimated coefficients for Sex are significantly negative for most u’s, suggesting that CF girls tend to develop mucoid Pa infections sooner than CF boys. The estimated coefficients for MI, NewScreen, and FamilyHis are significantly positive except for those with large u’s, indicating that CF children diagnosed by symptoms developed mucoid Pa earlier than those diagnosed by the other methods. Importantly, the beneficial effect of newborn screening on mucoid Pa is stronger than that on nonmucoid Pa. This finding is encouraging in that early diagnosis through newborn screening significantly delays the onset of mucoid Pa, as well as repeated mucoid Pa.

In Figures 3 and 4, we also plot the coefficient estimates of the complete-case (CC) analysis (dotted lines). Some major discrepancies exist between the CC analysis, which naively excludes Pa infections with unknown Pa types, and our estimates for nonmucoid Pa infections. Specifically, the intercept coefficients for nonmucoid Pa estimated by the CC method are significantly larger than those from the proposed IPW and EEP methods. This indicates that the CC analysis would significantly overestimate the time to expected frequency of nonmucoid Pa infection. One possible explanation is that the majority of missing Pa types may in fact be nonmucoid Pa but are ignored by the CC analysis, leading to over-optimistic estimates of the time to expected frequency of nonmucoid Pa infections. Moreover, the proposed estimates and the naive CC estimates generally diverge as u increases. This may relate to the fact that the total number of events with missing event type accumulates over time.

We also conduct constancy tests for each covariate effect. The weight function is chosen as Ξ(u) = 2I{u ≤ (u_L + u_U)/2}/(u_U − u_L) with u_L = 0.02 and u_U = 3. Our constancy tests confirm the diminishing pattern of the estimated coefficients for NewScreen observed in Figure 3, with p < 0.01. Our tests also suggest that constant effects may be adequate for all the other covariates considered in the fitted GART models. The average covariate effect estimates provided in Table 2 may serve as the estimates for these constant effects.

Table 2.

The CFFPR example: Estimated average covariate effects (EstAvg) and the corresponding standard errors (SE)

Event Type	Method	Statistic	Sex	MI	FamilyHis
Mucoid	IPW	EstAvg	−0.066	0.219	0.173
	IPW	SE	0.038	0.049	0.065
	EEP	EstAvg	−0.067	0.217	0.172
	EEP	SE	0.038	0.048	0.063
Nonmucoid	IPW	EstAvg	−0.019	0.182	0.260
	IPW	SE	0.043	0.050	0.095
	EEP	EstAvg	−0.020	0.170	0.248
	EEP	SE	0.042	0.048	0.093
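Because the intercept is interpreted as a log time to expected frequency, an average effect in Table 2 can be exponentiated to give an approximate multiplicative change in time to expected frequency. The sketch below does this for the mucoid IPW rows. The Wald-type interval based on the reported standard error is an assumption made for illustration (the paper's intervals come from resampling), and the function name is hypothetical.

```python
import math
from statistics import NormalDist

# (EstAvg, SE) for mucoid Pa infections, IPW method, copied from Table 2.
mucoid_ipw = {
    "Sex": (-0.066, 0.038),
    "MI": (0.219, 0.049),
    "FamilyHis": (0.173, 0.065),
}


def time_ratio(est, se, level=0.95):
    """exp(est) with a normal-approximation interval on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.exp(est), math.exp(est - z * se), math.exp(est + z * se)


ratios = {name: time_ratio(est, se) for name, (est, se) in mucoid_ipw.items()}
```

For example, the MI entry corresponds to a time ratio of about 1.24, that is, roughly 24% longer time to a given expected frequency of mucoid Pa infection relative to the reference diagnosis group, under the assumption of a constant effect.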


7. Concluding Remarks

In this paper, we investigate the generalized accelerated recurrence time model for multivariate recurrent event data with missing event types. We employ two strategies, the inverse probability weighting and the estimating equation projection, to handle the missing event types. The two proposed estimators have desirable asymptotic properties and are shown to be asymptotically equivalent.

As discussed in Section 2, we adopt a missing at random (MAR) mechanism for the missing event types, which is weaker than the assumption of missing completely at random. Our MAR mechanism implies π_k(t, z) is the same for each event type k. This may not be realistic in practice when some types of events are more likely to be missing. In that case, the event types are not missing at random (NMAR). Some additional unverifiable modeling of the event type missing mechanism would be warranted to tackle the non-identifiability issue. When the event types are missing at random but under a mechanism changing over time, we expect the kernel estimator of π(t, z) or p_k(t, z) would take a much more complicated form and likely lack sufficient efficiency with moderate sample sizes. Developing methods for handling these situations merits future research.

Regarding the bandwidth h for the nonparametric kernel estimators of π(t, z) and p_k(t, z), the optimal bandwidths may be chosen by minimizing the mean squared errors of the kernel estimators, but they may be difficult to estimate. Several authors (Wang and Wang, 2001; Chen et al., 2015; Qiu et al., 2017) have studied data-driven methods for selecting bandwidths in the classical survival setting with only non-recurrent events. It is worth investigating their extensions to settings with multivariate recurrent events data.


Acknowledgments

This work is partially supported by National Institutes of Health Grants R01HL113548 and R01DK072126. The authors would like to thank the Cystic Fibrosis Foundation for the use of CF Foundation Patient Registry data to conduct this study. Additionally, we would like to thank the patients, care providers and clinic coordinators at CF Centers throughout the United States for their contributions to the CF Foundation Patient Registry.

Footnotes

8. Supplementary Materials

Supplementary Materials, which include justifications of the proposed methods and additional numerical results referenced in Sections 2, 3, and 5, are available at the Biometrics website on Wiley Online Library.

References

Boat T, Acton J. Cystic fibrosis. In: Kliegman, et al., editors. Nelson Textbook of Pediatrics. 18th ed. Philadelphia: Saunders Elsevier; 2007.
Chen B, Cook R. The analysis of multivariate recurrent events with partially missing event types. Lifetime Data Analysis. 2009;15:41–58. doi: 10.1007/s10985-008-9091-3.
Chen X, Wan A, Zhou Y. Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association. 2015;110:723–741.
Gouskova N, Lin F, Fine J. Nonparametric analysis of competing risks data with event category missing at random. Biometrics. 2017;73:104–113. doi: 10.1111/biom.12547.
Huang Y, Peng L. Accelerated recurrence time models. Scandinavian Journal of Statistics. 2009;36:636–648. doi: 10.1111/j.1467-9469.2009.00645.x.
Jin Z, Ying Z, Wei L. A simple resampling method by perturbing the minimand. Biometrika. 2001;88:381–390.
Knapp EA, Goss FA, Sewall C, Ostrenga A, Dowd J, Elbert C, Petren AK, Marshall B. The cystic fibrosis foundation patient registry: Design and methods of a national observational disease registry. Annals of the American Thoracic Society. 2016;13(7):1173–1179. doi: 10.1513/AnnalsATS.201511-781OC.
Lawless JF, Nadeau C. Some simple robust methods for the analysis of recurrent events. Technometrics. 1995;37:158–168.
Li R, Peng L. Varying coefficient subdistribution regression for left-truncated semi-competing risks data. Journal of Multivariate Analysis. 2014;131:65–78. doi: 10.1016/j.jmva.2014.06.005.
Lin D, Wei L, Ying Z. Accelerated failure time models for counting processes. Biometrika. 1998;85:605–618.
Lin DY, Wei LJ, Yang I, Ying Z. Semiparametric regression for the mean and rate functions of recurrent events. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2000;62:711–730.
Lin F, Cai J, Fine J, Lai H. Nonparametric estimation of the mean function for recurrent event data with missing event category. Biometrika. 2013;100:727–740. doi: 10.1093/biomet/ast016.
Little R, Rubin D. Statistical Analysis with Missing Data. New York: Wiley; 2002.
Peng L, Fine J. Competing risks quantile regression. Journal of the American Statistical Association. 2009;104:1440–1453.
Peng L, Huang Y. Survival analysis with quantile regression models. Journal of the American Statistical Association. 2008;103:637–649.
Pepe MS, Cai J. Some graphical displays and marginal regression analyses for recurrent failure times and time dependent covariates. Journal of the American Statistical Association. 1993;88:811–820.
Qiu Z, Wan A, Zhou Y, Gilbert P. Smoothed rank regression for the accelerated failure time competing risks model with missing cause of failure. Statistica Sinica. 2017. doi: 10.5705/ss.202016.0231.
Schaubel D, Cai J. Multiple imputation methods for recurrent event data with missing event category. Canadian Journal of Statistics. 2006a;34:677–692.
Schaubel D, Cai J. Rate/mean regression for multiple-sequence recurrent event data with missing event category. Scandinavian Journal of Statistics. 2006b;33:191–207.
Sun X, Peng L, Huang Y, Lai H. Generalizing quantile regression for counting processes with applications to recurrent events. Journal of the American Statistical Association. 2016;111:145–156. doi: 10.1080/01621459.2014.995795.
Wang S, Wang C. A note on kernel assisted estimators in missing covariate regression. Statistics & Probability Letters. 2001;55:439–449.
Ye P, Sun L, Zhao X, Xu W. An additive-multiplicative rates model for multivariate recurrent events with event categories missing at random. Science China Mathematics. 2015;58:1163–1178.
Ye P, Zhao X, Sun L, Xu W. A semiparametric additive rates model for multivariate recurrent events with missing event categories. Computational Statistics & Data Analysis. 2015;89:39–50.
Zhou Y, Wan A, Wang X. Estimating equations inference with missing data. Journal of the American Statistical Association. 2008;103:1187–1199.
