Biostatistics (Oxford, England). 2009 Mar 18;10(3):451–467. doi: 10.1093/biostatistics/kxp004

Conditional GEE for recurrent event gap times

David Y Clement 1,2, Robert L Strawderman 1,2,*
PMCID: PMC2697342  PMID: 19297655

Abstract

This paper deals with the analysis of recurrent event data subject to censored observation. Using a suitable adaptation of generalized estimating equations for longitudinal data, we propose a straightforward methodology for estimating the parameters indexing the conditional means and variances of the process interevent (i.e. gap) times. The proposed methodology permits the use of both time-fixed and time-varying covariates, as well as transformations of the gap times, creating a flexible and useful class of methods for analyzing gap-time data. Censoring is dealt with by imposing a parametric assumption on the censored gap times, and extensive simulation results demonstrate the relative robustness of parameter estimates even when this parametric assumption is incorrect. A suitable large-sample theory is developed. Finally, we use our methods to analyze data from a randomized trial of asthma prevention in young children.

Keywords: Asthma, Censoring, Generalized estimating equation, Intensity model, Longitudinal data, Marginal model

1. INTRODUCTION

In longitudinal studies, each subject may experience several consecutive events of the same basic type. Such “recurrent event” outcome data now constitute a heavily studied area in statistics, with applications ranging from economics to engineering to biomedicine. Examples common in medicine and public health applications include recurrent infections and other diseases, hospitalizations, and seizures. The development of useful regression models for recurrent outcome data is therefore a problem of significant practical and methodological interest. Beginning with the extension of the proportional hazards regression model of Cox (1972) to the case of multivariate counting processes by Andersen and Gill (1982), a significant literature on this topic has developed over the past 25 years. In general, methods for analyzing recurrent event data can be cross-classified into one of the 4 categories determined by the following: (i) the choice of “calendar” versus “gap” times as the fundamental temporal scale and (ii) the use of “marginal” versus “intensity” models for analyzing the data. Several classes of marginal and intensity models have been proposed for analyzing recurrent event outcomes on each timescale, and an extensive, contemporary review of existing methods is available in Cook and Lawless (2007).

When the events are all considered to be of the same type, the gap timescale is arguably the most natural and informative timescale for analysis. Examples of marginal models developed for the gap-time setting include Chang and Wang (1999), Chang (2004), Chen and Wang (2004), Huang (2002), Huang and Chen (2004), Lin and others (1999), and Prentice and others (1981). Gap-time-focused intensity models are considered in Aalen and Husebye (1991), Oakes and Cui (1994), Peña and others (2001), Duchateau and others (2003), and Strawderman (2005), Strawderman (2006), among others. In an interesting paper, Murphy and others (1995) propose a methodology that is difficult to wholly classify into a single category. Specifically, Murphy and others (1995) introduce a variant of the estimating equations considered in Murphy and Li (1995) in order to develop a model appropriate for describing the conditional mean length of women's menstrual cycles. In this model, the mean of the current cycle is allowed to depend on aspects of past cycle behavior and/or other time-fixed and time-dependent covariates. The methodology developed in Murphy and others (1995) forms a starting point for developing a large and useful class of methods for conducting gap-time analyses that relaxes the stringent restrictions imposed by simpler marginal models while avoiding the need to fully specify how the probability of subsequent recurrence depends on the prior event and covariate histories.

This paper expands upon the work of Murphy and others (1995) in several useful ways: allowing for transformations of gap times, clarifying the censoring conditions under which the proposed estimating equations are unbiased, correcting 2 errors in the original paper, further illuminating the connections to generalized estimating equations (GEEs), and providing an appropriate large-sample theory. The paper proceeds as follows. The proposed methodology is first developed in Sections 2.1–2.3, and Sections 2.4 and 2.5 then, respectively, establish the connections to the methodology originally proposed in Murphy and others (1995) and to GEE. Section 3 contains an expanded version of the simulation study summarized in Tables II and III of Murphy and others (1995). In Section 4, the proposed methodology is used to reanalyze the asthma prevention trial data of Duchateau and others (2003). We close the paper in Section 5 with a discussion of several interesting directions for further study. Technical details are provided in an appendix; the supplementary material available at Biostatistics online contains further results and details that are referenced but not included in the main article.

2. METHODOLOGY

2.1. Notation and model

For simplicity, we introduce the notation for just one subject; the inclusion of an additional subscript permits immediate extension to the case of multiple subjects, as will be required in Section 2.3. We assume that the time origin for analysis is $S_0 = 0$, with subsequent events occurring at times $0 < S_1 < S_2 < \cdots$ until observation terminates at an observed time $C > 0$. It is assumed that $S_1$ represents a complete observation time, thereby covering those cases in which observation begins with the occurrence of an event. In settings where observation starts in between 2 events, $S_0$ is taken to represent the time of the first event subsequent to the start of observation. Despite a mild loss of information, such a convention does not cause bias provided that the decision to delete this initial time period is made independently of its length (e.g. Aalen and Husebye, 1991; Murphy and others, 1995).

Let $N(u) = \max\{n \ge 1 : S_n \le u\}$ count the number of events up to and including time $u$, and let $N = N(C)$. The observed data on this subject are assumed to take the form

2.1. (2.1)

where the covariate entry indexed by $j$ in (2.1) denotes the covariate information available at time $S_{j-1}$ for $j \ge 1$ and may include baseline covariates, covariates measured at or before each event time, and summaries of the past event and covariate history. It is not assumed that this covariate information necessarily captures the full covariate or event history up to time $S_{j-1}$. Let $X_j = S_j - S_{j-1}$ denote the $j$th gap time and define $Y_j = h(X_j)$, where $h(\cdot)$ is a specified monotone nondecreasing transformation. Also, let $\mathcal{F}_j$ denote the cumulative information concerning the event and covariate histories assumed available through time $S_{j-1}$; note that $\mathcal{F}_j \subseteq \mathcal{F}_{j+1}$ for $j \ge 1$.
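To make the notation concrete, the following minimal Python sketch computes the event count $N(u)$, the complete gap times $X_j = S_j - S_{j-1}$, and the censored remainder $C - S_N$ for a single subject. The event times and the censoring time are hypothetical values chosen for illustration only, and the function names are ours rather than part of the original paper.

```python
from bisect import bisect_right


def count_events(event_times, u):
    """N(u): number of events with S_n <= u (event_times sorted, S_0 = 0 excluded)."""
    return bisect_right(event_times, u)


def gap_times(event_times):
    """Gap times X_j = S_j - S_{j-1}, taking S_0 = 0 as the time origin."""
    previous = 0.0
    gaps = []
    for s in event_times:
        gaps.append(s - previous)
        previous = s
    return gaps


# Hypothetical calendar event times S_1 < S_2 < ... and censoring time C.
event_times = [28.0, 59.0, 85.0, 118.0]
C = 100.0

N = count_events(event_times, C)            # events observed by time C
observed_gaps = gap_times(event_times[:N])  # complete gap times X_1, ..., X_N
# Time elapsed since the last observed event; the (N+1)th gap exceeds this.
censored_remainder = C - event_times[N - 1] if N > 0 else C
```

Only the first $N$ gap times are fully observed; the final quantity is the lower bound on the censored gap $X_{N+1}$ that the later sections impute around.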

The fundamental modeling assumption of this paper is

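$$E[Y_j \mid \mathcal{F}_j] = \mu_j(\theta), \qquad \mathrm{Var}[Y_j \mid \mathcal{F}_j] = \sigma^2 V_j^2(\theta), \qquad j \ge 1, \qquad (2.2)$$

where $\mu_j(\theta)$ and $V_j(\theta) > 0$ are known scalar functions of the parameter vector $\theta$ and $\sigma^2 > 0$. Importantly, these means and variances are defined conditionally upon $\mathcal{F}_j$, the cumulative event and covariate history available through time $S_{j-1}$, and are therefore independent of censoring information. Further assumptions are needed in order to properly deal with the presence of censoring; see, for example, Sections 2.2 and 2.3 as well as conditions (A0) and (A1) of the Appendix. Three examples of interesting model choices include the following: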

2.1. (2.3)
2.1. (2.4)
2.1. (2.5)

Murphy and others (1995) consider the model (2.3), allowing for the possibility of a general variance function $V_j^2(\theta)$. Models (2.4) and (2.5) provide useful generalizations of the accelerated gap times (AGT) model proposed in Strawderman (2005). The AGT model assumes that the gap times of the recurrent event process satisfy $X_j = R_j \mu(\theta)$, where $\{R_j, j \ge 1\}$ are independent and identically distributed with a distribution independent of $\theta$ and $\mu(\theta)$ accelerates or decelerates the baseline gap times for a subject based on the values of time-independent covariates. Assuming that $E[R_j] = 1$, model (2.4) is observed to be a direct generalization of this model upon taking $\mu_j(\theta) = \mu(\theta)$, $V_j(\theta) = \mu(\theta)$, and $\sigma^2 = \mathrm{Var}[R_j]$ for $j \ge 1$. Taking logs, the alternative model $\log X_j = \log \mu(\theta) + \log R_j$ is obtained, and (2.5) evidently covers this form of the AGT model with $\mu_j(\theta) = \log \mu(\theta)$, $E[\log R_j] = 0$, $\sigma^2 = \mathrm{Var}[\log R_j]$, and $V_j(\theta) = 1$ for $j \ge 1$.
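The mean and variance structure implied by the AGT model can be checked numerically. The sketch below is illustrative only: it assumes a gamma distribution for $R_j$ with unit mean (our choice, not one made in the paper) and hypothetical values $\mu(\theta) = 30$ and $\mathrm{Var}[R_j] = 1/4$. It confirms that the simulated gap times have mean close to $\mu(\theta)$ and variance close to $\sigma^2 \mu(\theta)^2$, as in the identification $V_j(\theta) = \mu(\theta)$.

```python
import random
import statistics


def simulate_agt_gaps(mu, shape, n_gaps, rng):
    """Draw X_j = R_j * mu with R_j ~ Gamma(shape, scale=1/shape).

    The gamma choice is an assumption for illustration; it gives
    E[R_j] = 1 and Var[R_j] = 1/shape.
    """
    return [mu * rng.gammavariate(shape, 1.0 / shape) for _ in range(n_gaps)]


rng = random.Random(12345)
mu, shape = 30.0, 4.0            # hypothetical mu(theta) and gamma shape
sigma2 = 1.0 / shape             # sigma^2 = Var[R_j]

gaps = simulate_agt_gaps(mu, shape, 100_000, rng)
sample_mean = statistics.fmean(gaps)
sample_var = statistics.pvariance(gaps)

# Variance implied by sigma^2 * V_j(theta)^2 with V_j(theta) = mu(theta).
implied_var = sigma2 * mu ** 2
print(f"mean {sample_mean:.2f} vs {mu}, variance {sample_var:.1f} vs {implied_var}")
```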

2.2. Estimation with “full” and observed data: single subject

For convenience, define for j ≥ 1 the following notation:

2.2. (2.6)

For simplicity, we make no distinction between $\eta = (\theta^T, \sigma^2)^T$ and the data-generating parameter $\eta_0$ throughout this section. Under (2.2), it follows that $E[Z_j(\theta) \mid \mathcal{F}_j] = 0$ and $E[Z_j^2(\theta) \mid \mathcal{F}_j] = \sigma^2$ for each $j \ge 1$. A naïve approach to the estimation of $\eta$ might therefore begin with consideration of the estimating equations

2.2.

where $b_j(\eta)$ is a scalar weight function satisfying $E[b_j(\eta) \mid \mathcal{F}_j] = b_j(\eta)$ for each $j \ge 1$. However, these estimating equations generally fail to be unbiased because each utilizes only the complete gap times $X_1, \ldots, X_N$, each of which satisfies $X_j \le C$, $j = 1, \ldots, N$.
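The source of this bias can be seen in a small simulation. The sketch below is an illustration under assumptions of our own (independent gamma gap times with mean 30 and a fixed observation window of 125 time units); it is not the simulation design used later in the paper. Averaging only the gap times that are completed within the window underestimates the true mean, because long gaps are more likely to be the one that is censored.

```python
import random


def simulate_complete_gaps(rng, horizon, shape, scale):
    """Return the gap times fully observed within [0, horizon] for one subject."""
    gaps = []
    elapsed = 0.0
    while True:
        gap = rng.gammavariate(shape, scale)
        if elapsed + gap > horizon:
            # This gap straddles the end of observation and is censored.
            return gaps
        elapsed += gap
        gaps.append(gap)


rng = random.Random(2009)
TRUE_MEAN = 30.0                 # shape * scale = 4.0 * 7.5 (hypothetical)

all_complete = []
for _ in range(20_000):          # 20,000 hypothetical subjects
    all_complete.extend(simulate_complete_gaps(rng, 125.0, 4.0, 7.5))

naive_mean = sum(all_complete) / len(all_complete)
print(f"naive mean of complete gaps: {naive_mean:.2f} (true mean {TRUE_MEAN})")
```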

Consider the observed data (2.1), augmented with the additional information on the first event time $S_{N+1}$ following time $C$. Since $S_{N+1}$ is not generally observable, one may view the augmented data as a suitable representation of “full data” in this setting. Consider the pair of full-data estimating equations (cf. Murphy and Li, 1995)

2.2. (2.7)

Theorem 2.1, proved in Section S.1.1 of the supplementary material available at Biostatistics online, shows that the full-data estimating equations (2.7) are unbiased under condition (A0) of the Appendix.

THEOREM 2.1

Under condition (A0), the estimating equations (2.7) are unbiased.

While unbiased, the estimating equations (2.7) cannot be used directly for estimating $\eta$ from the observed data (2.1) because each depends on $Y_{N+1} = h(X_{N+1})$, information not available under (2.1). Similarly to Murphy and others (1995), one starting point for developing practically useful estimating equations is to project (2.7) onto the observed data:

2.2. (2.8)

and

2.2. (2.9)

Using iterated expectation, an easy calculation shows that the observed-data estimating equations (2.8) and (2.9) remain unbiased under the conditions of Theorem 2.1.

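Defining $W_j(\eta) = \sigma^{-1} Z_j(\theta)$ and letting $H_j$ denote $\mathcal{F}_j$ augmented with the event $\{C \ge S_{j-1}\}$ for $j \ge 1$, the expectations appearing on the right-hand side of (2.8) and (2.9) can be rewritten as follows: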

2.2.

where $w(\eta) = [\sigma V_{N+1}(\theta)]^{-1}\{h(C - S_N) - \mu_{N+1}(\theta)\}$ is considered fixed in each expression.

The dependence of $W_{N+1}(\eta)$ on $Y_{N+1}$ implies that $W_{N+1}(\eta)$ represents missing data under (2.1). The projections (2.8) and (2.9) therefore correspond to using an obvious form of conditional imputation, and further modeling assumptions that permit computation of these conditional expectations are needed. Under condition (A0), $Y_{N+1}$ and hence $W_{N+1}(\eta)$ may be considered missing at random (MAR; e.g. Little and Rubin, 2002), and immediate progress is possible under a parametric specification for the conditional distribution of $W_{N+1}(\eta) \mid H_{N+1}$. Noting that $\{W_j(\eta), j \ge 1\}$ is a sequence of dependent, standardized (i.e. mean 0, variance 1) random variables, we propose to proceed under the simplifying assumption that $W_{N+1}(\eta) \mid H_{N+1}$ is distributed according to a fully specified parametric distribution $F_0(\cdot)$ having mean 0 and variance 1. This immediately implies that

2.2. (2.10)

where

2.2. (2.11)

Under (2.10) and (2.11), (2.8) and (2.9) reduce to

2.2. (2.12)

and

2.2. (2.13)

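For example, with $F_0(x) = \Phi(x)$ and $\phi(x) = \Phi'(x)$ the standard normal density,

$$K_1(x) = \frac{\phi(x)}{1 - \Phi(x)}, \qquad K_2(x) = 1 + \frac{x\,\phi(x)}{1 - \Phi(x)},$$

whereas the choice $F_0(x) = 1 - e^{-(x+1)}$ for $x > -1$ leads to $K_1(x) = x + 1$ and $K_2(x) = 1 + (x+1)^2$. Each of (2.12) and (2.13) has mean 0 provided that (2.10) holds and $F_0(\cdot)$ is correctly specified.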
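For readers who wish to evaluate these quantities, the following Python sketch computes the standardized censoring threshold $w(\eta)$ and the truncated moments $K_1(x) = E[W \mid W > x]$ and $K_2(x) = E[W^2 \mid W > x]$ for the normal and shifted exponential choices of $F_0$. The function names and the numerical inputs are hypothetical, and the treatment of thresholds below the support of the shifted exponential (where the truncation is vacuous) is our own convention.

```python
import math


def standardized_threshold(h_censored, mu_next, v_next, sigma):
    """w(eta) = {h(C - S_N) - mu_{N+1}(theta)} / {sigma * V_{N+1}(theta)}."""
    return (h_censored - mu_next) / (sigma * v_next)


def k_normal(x):
    """(K1, K2) when F0 is standard normal; unstable for very large x."""
    density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    survival = 0.5 * math.erfc(x / math.sqrt(2.0))
    hazard = density / survival
    return hazard, 1.0 + x * hazard


def k_shifted_exponential(x):
    """(K1, K2) when F0(x) = 1 - exp(-(x + 1)) for x > -1."""
    if x <= -1.0:
        # No truncation below the support: unconditional mean 0, second moment 1.
        return 0.0, 1.0
    return x + 1.0, 1.0 + (x + 1.0) ** 2


# Hypothetical inputs: identity transformation h, 15 days since the last event,
# conditional mean 30 for the next gap, V = 1, and sigma^2 = 11.
w = standardized_threshold(15.0, 30.0, 1.0, math.sqrt(11.0))
print(w, k_normal(w), k_shifted_exponential(w))
```

Because both functions are available in closed form, no simulation is needed to evaluate the imputed moments.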

The moment assumptions on $F_0(\cdot)$ are natural in view of the mean–variance specification of the model. In addition, this parsimonious model facilitates straightforward implementation and justification of inference procedures, as will be seen in Section 2.3. We emphasize here that the parametric specification of $F_0(\cdot)$ is only introduced for the purposes of dealing with the censored time $Y_{N+1} = h(X_{N+1})$. We have not assumed that each member of the sequence $\{W_j(\eta), j \ge 1\}$ has distribution $F_0(\cdot)$; in addition, we have not introduced any assumptions that impose a fully parametric dependence structure on $\{W_j(\eta), j \ge 1\}$. The easiest way to see this is to note that such distributional assumptions are neither needed nor used in the development of (2.7); see also Murphy and Li (1995) for related results and discussion.

REMARK:

One may write $K_1(x) = E[W \mid W > x] = x + M(x)$, where $W$ is a random variable with distribution function $F_0(\cdot)$ and $M(x) = E[W - x \mid W > x]$ is the corresponding mean residual life function. An immediate consequence of this relationship is that one cannot specify a valid parametric model for $K_1(\cdot)$ without also specifying a valid parametric model for $F_0(\cdot)$ (e.g. Cox, 1962; Oakes and Dasu, 1990; Kotz and Shanbhag, 1980).

2.3. Estimation and inference with observed data: n subjects

Suppose there are data available on $n$ independent subjects, say $\mathcal{O}_i$, $i = 1, \ldots, n$, where $\mathcal{O}_i$ is the data (2.1) on the $i$th subject. Then, (2.12) and (2.13) immediately generalize, yielding the pair of estimating equations

2.3. (2.14)

and

2.3. (2.15)

where

2.3.

Assuming that (2.10) holds and $F_0(\cdot)$ has been correctly specified, (2.14) and (2.15) together form a collection of unbiased estimating equations for $\eta$. In order to solve these equations and expect to obtain a unique solution $\hat{\eta}$ for finite $n$, smoothness assumptions on $K_r(\cdot)$, $r = 1, 2$, are required. Conditions (A3)–(A5) of the Appendix impose sufficient smoothness restrictions on $K_r(\cdot)$, $r = 1, 2$; see the Appendix and Section S.1.4 of the supplementary material available at Biostatistics online for further details and discussion.

Let $S_n(\eta) = (S_{n,1}(\eta)^T, S_{n,2}(\eta))^T$; then, we may write

2.3. (2.16)

where $\psi(\eta, \mathcal{O}_i) = (\psi_1(\eta, \mathcal{O}_i)^T, \psi_2(\eta, \mathcal{O}_i))^T$ is a vector of known functions of $\eta$ and the observed data $\mathcal{O}_i$ on subject $i$, $i = 1, \ldots, n$. Define $S(\eta) = E_{\eta_0}[\psi(\eta, \mathcal{O}_1)]$ and Inline graphic. Theorems 2.2 and 2.3 show that the estimator $\hat{\eta}$ solving these equations is both consistent and asymptotically normal as $n \to \infty$; see the Appendix for the statement of regularity conditions and further details on proof.

THEOREM 2.2

Under conditions (A0)–(A4) and as $n \to \infty$, there exists a sequence $\hat{\eta}_n$ and a unique $\eta_0$ such that $S_n(\hat{\eta}_n) = 0$ with probability going to 1, and $\hat{\eta}_n \to \eta_0$ in probability.

THEOREM 2.3

Under conditions (A0)–(A7), the sequence $\sqrt{n}(\hat{\eta}_n - \eta_0)$ is asymptotically normal with mean 0 and covariance matrix

THEOREM 2.3 (2.17)

It further follows that one can consistently estimate this covariance matrix via

THEOREM 2.3 (2.18)

REMARK:

The asymptotic results for $\hat{\eta}$ rely on the assumption that (2.10) holds with $F_0(\cdot)$ correctly specified. As pointed out earlier in Section 2.2, the parametric imputation assumption (2.10) has been introduced in order to deal with the censoring of $Y_{i,N_i+1} = h(X_{i,N_i+1})$, $i = 1, \ldots, n$. Other models for imputation, as well as methods for handling the missing data problem, are possible. However, under a MAR specification, all such methods rely on further modeling assumptions and, similarly to the problem of misspecifying $F_0(\cdot)$, incorrect specifications create bias. We refer the reader to Section 5 for further discussion.

2.4. Connections to Murphy and others (1995)

Similarly to Section 2.3, Murphy and others (1995) focus on estimating $\eta$ from the observed data on subjects $i = 1, \ldots, n$ when $h(x) = x$, suggesting an adaptation of the expectation–maximization algorithm of Dempster and others (1977) for use with estimating equations, an idea recently explored in greater generality by Elashoff and Ryan (2004). Specifically, for each $i = 1, \ldots, n$ and given current estimates $\hat{\theta}^{(k)}$ and $\hat{\sigma}^{(k)}$ of $\theta$ and $\sigma$, Murphy and others (1995) suggest imputing

2.4. (2.19)

where $\epsilon_r$, $r = 1, \ldots, B$, are independent and identically distributed mean 0, variance 1 random variables. These $B$ variables are then used to compute both $\tilde{E}_i^{(k)}$ and the corresponding Monte Carlo approximation to the conditional variance:

2.4.

The estimates $\hat{\theta}^{(k)}$ and $\hat{\sigma}^{(k)}$ are then updated according to the following procedure. Expressed in our notation, $\hat{\theta}^{(k+1)}$ is first computed by solving

2.4.

As indicated in the Appendix of Murphy and others (1995), one then computes

2.4. (2.20)

This iteration continues until the relative change in each estimated model parameter is small.

We now illuminate the connections to the methodology summarized in Section 2.3, as well as an important problem with the methodology described above. The “E-type” step described above involves computing Monte Carlo approximations to the conditional cumulants $E[Y_{i,N_i+1} \mid Y_{i,N_i+1} > C_i - S_{iN_i}, H_{i,N_i+1}]$ and $\mathrm{Var}[Y_{i,N_i+1} \mid Y_{i,N_i+1} > C_i - S_{iN_i}, H_{i,N_i+1}]$, evaluated at the current parameter values $\hat{\theta}^{(k)}$ and $\hat{\sigma}^{(k)}$. Under the imputation assumption (2.19), it is apparent that $\tilde{E}_i^{(k)}$ is a Monte Carlo approximation to $\sigma K_1(C_i - S_{iN_i})$, $i = 1, \ldots, n$, where $K_1(\cdot)$ is defined in (2.11) and $\epsilon_r \sim F_0$, $r = 1, \ldots, B$. The use of (2.14) in place of the Monte Carlo approximation SM(θ) used by Murphy and others (1995) reduces computational demands and leads to a stable estimation procedure independent of Monte Carlo error. The use of (2.15) corrects a fundamental error in Murphy and others (1995). Specifically, as shown in the supplementary material available at Biostatistics online, the estimator (2.20) is biased. Moreover, the degree of bias increases with fewer complete observations per subject because the censored cycles contribute an increased proportion of the information to the estimating equation.
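The trade-off between the analytical and Monte Carlo computations can be illustrated numerically. The sketch below is not the algorithm of Murphy and others (1995); it assumes a standard normal $F_0$ and approximates the truncated mean $K_1(x) = E[W \mid W > x]$ by simply discarding draws that fall at or below the threshold, a sampling scheme chosen here for simplicity. The threshold value and the number of draws are hypothetical.

```python
import math
import random


def k1_normal(x):
    """Closed-form E[W | W > x] for standard normal W."""
    density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    survival = 0.5 * math.erfc(x / math.sqrt(2.0))
    return density / survival


def k1_monte_carlo(x, n_draws, rng):
    """Average of standard normal draws exceeding x (simple rejection)."""
    kept = [e for e in (rng.gauss(0.0, 1.0) for _ in range(n_draws)) if e > x]
    return sum(kept) / len(kept)


rng = random.Random(1995)
threshold = 0.5                                  # hypothetical standardized threshold
exact = k1_normal(threshold)
approx = k1_monte_carlo(threshold, 200_000, rng)
print(f"closed form {exact:.4f}, Monte Carlo {approx:.4f}")
```

The closed form is evaluated once per subject and carries no simulation noise, whereas the Monte Carlo value changes with the seed and with the number of draws retained.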

REMARK:

A less subtle error corrected by this paper involves the variance estimate (2.18). The corresponding estimate proposed in Murphy and others (1995, Appendix) assumes independence of the gap times within subjects and is therefore inconsistent when this assumption fails.

2.5. Connections with GEE

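Similarly to Section 2.2, define the full data for subject $i$ as the observed data augmented with the first event time $S_{i,N_i+1}$ following $C_i$. Then, assuming these full data, $i = 1, \ldots, n$, represent the available data and also that subjects are independent of each other, the results of Theorem 2.1 imply that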

2.5. (2.21)

and

2.5. (2.22)

form a system of unbiased (full data) estimating equations for $\eta$. Under the same assumptions leading to (2.14) and (2.15), it follows that the projections of (2.21) and (2.22) onto the observed data for subjects $i = 1, \ldots, n$ reproduce (2.14) and (2.15).

The estimating equations (2.21) and (2.22) represent a particular example of a full data GEE system. Specifically, define for $i = 1, \ldots, n$ the matrices

2.5.

In addition, let $I_{N_i+1}$ denote the identity matrix of dimension $(N_i+1) \times (N_i+1)$, $i = 1, \ldots, n$. Then, one may write $U_{n,1}(\eta)$ as $(n\sigma)^{-1} \sum_{i=1}^{n} G_i(\theta)[A_i(\theta) I_{N_i+1} A_i(\theta)]^{-1} \epsilon_i(\theta)$, where $\epsilon_i(\theta)$ is a vector with elements $Y_{ik} - \mu_{ik}(\theta)$, $k = 1, \ldots, N_i+1$. The correspondence between (2.21) and a GEE system is now evident, the use of $I_{N_i+1}$ in $[A_i(\theta) I_{N_i+1} A_i(\theta)]^{-1}$ further imposing a “working independence” correlation structure on $Y_{i1}, \ldots, Y_{i,N_i+1}$ (cf. Molenberghs and Verbeke, 2005, Section 8.2). A similar construction is possible for (2.22); moreover, (2.21) and (2.22) together form a particular example of a GEE2 system that imposes a block diagonal covariance structure on $U_n(\eta) = (U_{n,1}(\eta)^T, U_{n,2}(\eta))^T$ (cf. Molenberghs and Verbeke, 2005, Section 8.5). In the supplementary material available at Biostatistics online, we consider the use of alternative working correlation structures, demonstrating in particular that valid structures must respect the conditional specification of the model in order for (2.21) and (2.22) to remain unbiased.

3. SIMULATIONS

The work of Murphy and others (1995) was motivated by the analysis of menstrual cycle patterns. More specifically, the authors were interested in developing insight into the relationship between cycle length and covariates such as location, body mass index (BMI), and age. Murphy and others (1995) also carried out a small simulation study modeled after these data in order to evaluate the robustness of their methods to assumptions regarding the nature of the censored cycle length. The following conditional mean and variance specifications were, respectively, used in both the analysis and the simulation study: $E[Y_{ij} \mid \mathcal{F}_{ij}] = \mu_{ij}(\theta)$, where

3. (3.1)

and $\theta = (\gamma^T, \rho)^T$, and $\mathrm{Var}[Y_{ij} \mid \mathcal{F}_{ij}] = \sigma^2 V_j^2(\theta)$, where

3. (3.2)

The use of (3.1) and (3.2) evidently corresponds to a special case of (2.3). Assuming an equal number of cycles per woman, the proposed mean function corresponds to a simple linear mixed effects model using woman as a random effect. The parameter ρ can thus be interpreted as an intrawoman correlation coefficient, with ρ=0 denoting that past cycle lengths are not useful in predicting current cycle length.

In this section, we consider an expanded version of the simulations considered in Murphy and others (1995). In addition to allowing for the possibility of a misspecified censored cycle length distribution, we consider the impact of both varying and shorter observation periods (i.e. fewer observed events), varying number of subjects, and 2 specifications of $V_{ij}(\theta)$. More specifically, cycle data on either 50 or 200 independent women are simulated by generating cycle times according to the model $Y_{ij} = \max\{Y_{ij}^*, 1\}$, $j \ge 1$, $i = 1, \ldots, 50$, where $Y_{ij}^* = \mu_{ij}(\theta) + \sigma V_{ij}(\theta)\epsilon_{ij}$ and $\epsilon_{ij}$ are independent, identically distributed observations from either a standard normal density or a shifted exponential density with mean 0 and unit variance. As in Murphy and others (1995), we assume that $\mu_{ij}(\theta)$ is specified according to the following trivial modification of (3.1):

3. (3.3)

where $\gamma_0 = 0.6$, $\gamma_1 = 0.4$, and $\rho = 0.03$. The single time-varying covariate is defined in terms of $\mathrm{BMI}_{ij}$, where $\mathrm{BMI}_{ij}$ is assumed to decrease linearly from 22 kg/m² on day 1 to 20 kg/m² on day 195, increase linearly back to 21 kg/m² on day 225, and then remain constant thereafter. As in Murphy and others (1995), we consider the specification (3.2) for $V_{ij}(\theta)$ in conjunction with $\sigma^2 = 11$; in addition, we also consider the specification $V_{ij}(\theta) = |\mu_{ij}(\theta)|$, where $\mu_{ij}(\theta)$ is given in (3.3) and $\sigma^2 = 1/72.2 = 0.014$. These 2 choices of $V_{ij}(\theta)$ correspond to the model specifications (2.3) and (2.4); our choices of $\sigma^2$ approximately equalize the variances of the average gap times across the 2 settings. Finally, we assume the observation period for subject $i$ is $[0, C_i]$, where $C_i = \max\{C_i^*, 1\}$ and $C_i^*$ is normally distributed. Four possible settings are considered, with $E[C_i^*]$ set to either 125 or 225 days and $\mathrm{Var}(C_i^*)$ set to either 0 or 50. The average number of events per subject under an observation period with $E[C_i^*] = 125$ ($E[C_i^*] = 225$) is approximately 3.9 (7.4). All simulations were run using $b_{ij}(\eta) = 1$, a choice that corresponds to the use of generalized least squares.
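Several ingredients of this design are fully specified in the text and can be sketched directly. The Python fragment below is a partial illustration only: it codes the piecewise-linear BMI trajectory, the two standardized error distributions, and the censoring time $C_i = \max\{C_i^*, 1\}$. It does not attempt the mean function (3.3) or the exact covariate transformation, and all function names are hypothetical.

```python
import math
import random


def bmi_on_day(day):
    """BMI trajectory: 22 on day 1, 20 on day 195, 21 on day 225, then constant."""
    if day <= 1:
        return 22.0
    if day <= 195:
        return 22.0 - 2.0 * (day - 1) / 194.0
    if day <= 225:
        return 20.0 + 1.0 * (day - 195) / 30.0
    return 21.0


def standardized_error(rng, family):
    """Draw an error with mean 0 and unit variance from the named family."""
    if family == "normal":
        return rng.gauss(0.0, 1.0)
    if family == "exponential":
        return rng.expovariate(1.0) - 1.0   # Exp(1) shifted to have mean 0
    raise ValueError(f"unknown error family: {family}")


def censoring_time(rng, mean, variance):
    """C_i = max(C_i*, 1) with C_i* normal; variance 0 gives a fixed time."""
    c_star = rng.gauss(mean, math.sqrt(variance)) if variance > 0 else float(mean)
    return max(c_star, 1.0)


rng = random.Random(50)
draws = [standardized_error(rng, "exponential") for _ in range(100_000)]
draw_mean = sum(draws) / len(draws)
draw_var = sum((d - draw_mean) ** 2 for d in draws) / len(draws)
print(bmi_on_day(98), censoring_time(rng, 225, 0), draw_mean, draw_var)
```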

Tables 1 and 2, respectively, summarize the results for $C_i^* \sim N(225, 0)$ and $C_i^* \sim N(125, 0)$; a comparison of these tables demonstrates the impact of the expected number of events. The remaining simulation results are summarized in Tables 5–10 of the supplementary material available at Biostatistics online. The top panel of Table 1 corresponds to the simulation summarized in Table II of Murphy and others (1995). Each table corresponds to one combination of censoring distribution and sample size and summarizes the results for the 4 possible combinations of true and assumed error distributions for each choice of variance function (i.e. $V_{ij}(\theta)$). In the tables, the absolute relative bias (|rBias|) and empirical standard errors (ESEs) are, respectively, the average and standard deviation computed from 1000 simulated data sets of the indicated sample size. The average estimated asymptotic standard errors (ASE) are obtained by averaging the square root of the diagonal of (2.18) over the 1000 simulated data sets. Relative bias is reported because the magnitude of $\sigma^2$ differs greatly across the top and bottom halves of each table.
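The three summary measures can be computed from the per-replicate output of a simulation as sketched below. The exact formulas are assumptions on our part: we take |rBias| to be the absolute difference between the average estimate and the true value, divided by the absolute true value, and ESE to be the sample standard deviation of the estimates. The toy inputs are hypothetical and are not results from the paper.

```python
import statistics


def summarize(estimates, std_errors, true_value):
    """Summarize one parameter across simulated data sets.

    estimates   -- point estimates, one per simulated data set
    std_errors  -- estimated asymptotic standard errors, one per data set
    true_value  -- data-generating value of the parameter (nonzero)
    """
    mean_estimate = statistics.fmean(estimates)
    return {
        "abs_rel_bias": abs((mean_estimate - true_value) / true_value),
        "ese": statistics.stdev(estimates),     # empirical standard error
        "ase": statistics.fmean(std_errors),    # average asymptotic standard error
    }


# Toy inputs for a parameter whose true value is 0.6.
toy_estimates = [0.5, 0.6, 0.7]
toy_std_errors = [0.09, 0.10, 0.11]
summary = summarize(toy_estimates, toy_std_errors, true_value=0.6)
print(summary)
```

Comparing the "ese" and "ase" entries is the check used in the text to judge whether the estimated asymptotic standard errors track the sampling variability.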

Table 1.

Simulation Results for n = 50, Ci ∼ N(225, 0)

Expected number of events ≐ 7.4

                                                True F0: Normal            True F0: Exponential
Model               Imputed F0    Parameter     |rBias|   ESE     ASE      |rBias|   ESE     ASE
μij(θ) = (3.3),     Normal        γ0            0.003     0.193   0.188    0.009     0.189   0.191
Vij(θ) = (3.2)                    γ1            0.014     0.276   0.272    0.034     0.267   0.260
                                  ρ             0.113     0.032   0.032    0.007     0.034   0.032
                                  σ2            0.004     0.865   0.848    0.035     1.524   1.486
                    Exponential   γ0            0.006     0.189   0.187    0.010     0.187   0.190
                                  γ1            0.006     0.285   0.272    0.013     0.269   0.261
                                  ρ             0.193     0.034   0.032    0.023     0.034   0.032
                                  σ2            0.025     0.931   0.879    0.003     1.655   1.561
μij(θ) = (3.3),     Normal        γ0            0.023     0.194   0.189    0.013     0.193   0.191
Vij(θ) = |(3.3)|                  γ1            0.035     0.278   0.274    0.006     0.263   0.264
                                  ρ             0.098     0.033   0.032    0.036     0.035   0.032
                                  σ2            0.008     0.001   0.001    0.031     0.002   0.002
                    Exponential   γ0            0.007     0.198   0.189    0.007     0.196   0.189
                                  γ1            0.010     0.295   0.275    0.012     0.272   0.264
                                  ρ             0.105     0.034   0.032    0.136     0.035   0.032
                                  σ2            0.015     0.001   0.001    0.003     0.002   0.002

Table 2.

Simulation Results for n = 50, Ci ∼ N(125, 0)

Expected number of events ≐ 3.9

                                                True F0: Normal            True F0: Exponential
Model               Imputed F0    Parameter     |rBias|   ESE     ASE      |rBias|   ESE     ASE
μij(θ) = (3.3),     Normal        γ0            0.075     0.469   0.463    0.012     0.455   0.458
Vij(θ) = (3.2)                    γ1            0.152     0.713   0.701    0.011     0.664   0.687
                                  ρ             0.069     0.058   0.057    0.066     0.062   0.061
                                  σ2            0.004     1.206   1.236    0.037     2.173   1.985
                    Exponential   γ0            0.025     0.489   0.475    0.017     0.489   0.474
                                  γ1            0.078     0.736   0.706    0.005     0.708   0.710
                                  ρ             0.104     0.060   0.058    0.024     0.062   0.061
                                  σ2            0.015     1.331   1.274    0.006     2.247   2.136
μij(θ) = (3.3),     Normal        γ0            0.044     0.475   0.466    0.032     0.471   0.460
Vij(θ) = |(3.3)|                  γ1            0.094     0.727   0.702    0.067     0.702   0.688
                                  ρ             0.092     0.064   0.059    0.182     0.067   0.061
                                  σ2            0.014     0.001   0.001    0.047     0.002   0.002
                    Exponential   γ0            0.028     0.472   0.470    0.031     0.463   0.474
                                  γ1            0.060     0.720   0.698    0.046     0.698   0.701
                                  ρ             0.172     0.064   0.058    0.047     0.068   0.063
                                  σ2            0.008     0.001   0.001    0.015     0.003   0.002

In general, the simulation results demonstrate that the model specifications (2.3) and (2.4) produce estimates of $\theta$ and $\sigma^2$ with comparable relative biases. The standard errors for the parameter estimates are also comparable across models, and ASE provides an acceptable approximation to ESE in all cases. While a strong effect of misspecifying $F_0(\cdot)$ is absent, the impact of doing so on $\rho$ and $\sigma^2$ does become more apparent when the sample size is increased to $n = 200$. In addition, comparing the results for Tables 7 and 9 of the supplementary material available at Biostatistics online, one can see that biases increase under both correct and incorrect model specifications when the average number of expected events decreases. In general, though, the relative biases remain modest in most cases and the signs of all estimated parameters are also correct (results not shown).

Other interesting patterns also arise in these tables. For example, the bias of $\rho$ is typically the largest, especially so when the true $F_0(\cdot)$ is normally distributed. In addition, the sign of the bias is frequently negative (results not shown), indicating that $\rho$ is often underestimated. However, as expected, this bias drops dramatically with an increase in sample size. Comparing results for $E[C_i^*] = 125$ versus $E[C_i^*] = 225$, we further observe that a decrease in the number of complete times generally leads to substantially increased standard errors, as might be expected. Interestingly, with a mean censoring time of 125, the use of a nonconstant censoring time often leads to lower standard errors in comparison with a fixed censoring time; however, with a mean censoring time of 225, the situation is reversed. The exact reasons for this change in behavior are unclear, though they may have something to do with the fact that the BMI variable changes from a decreasing to an increasing function at 195 days, a time that lies in between these 2 censoring times.

Finally, we remark that Table 1 demonstrates a substantial increase in the standard error of Inline graphic in comparison with Table II in Murphy and others (1995). We repeated the simulation corresponding to Table 1, replacing the analytical computation (2.11) with the Monte Carlo approximation described in Section 2.4 using B=10000. The empirical and estimated asymptotic standard errors were very similar to those in Table 1; hence, such a large discrepancy in the estimated standard error of Inline graphic almost certainly reflects the use of an incorrect standard error formula by Murphy and others (1995), as discussed earlier in Section 2.4.

4. DATA ANALYSIS: RECURRENT ASTHMA IN CHILDREN

In this section, we use our methodology to reanalyze patterns of recurrent asthma events occurring in young children. Briefly, 232 children aged 6 months at high risk of experiencing an asthmatic event, but who have not yet done so, were randomized to receive either drug or placebo and then followed for up to 18 months. The data are available from the data archive at http://blackwellpublishing.com/rss. Duchateau and others (2003, Table 1) analyze these data using various frailty models and timescales for recurrent event counting processes and find a statistically significant treatment effect. Such a difference seems apparent in Figure 2 (supplementary material available at Biostatistics online), which, respectively, summarizes the number of asthmatic events experienced in the drug and placebo groups over the course of the trial.

The gap times of particular interest in this study are the “asthma-free” periods, namely, (i) the time elapsed between randomization and the beginning of the first asthmatic episode and (ii) the time elapsed between the end of one asthmatic episode and the beginning of the next attack. A complicating factor in this analysis is the fact that a child is not considered to be at risk for another asthmatic event until the current episode ends, with such periods possibly lasting several days. However, with a median length of 4 days and 95% of the asthmatic events having durations less than 20 days, such times are generally short in comparison with the asthma-free periods (median of 39 days, with 95% of the times less than 430 days). For the purposes of this analysis, we therefore focus on the asthma-free gap times, accounting for the potential impact of asthmatic episodes through covariate adjustment. A secondary analysis, conducted by redefining the gap times of interest as the time elapsed between the start of each asthma-free period, led to the same qualitative and similar quantitative conclusions (results not shown).

In addition to evaluating the treatment effect using various frailty models, Duchateau and others (2003) comment that there is interest in “the evolution of the asthma recurrent event rate over time,” “how the appearance of an event influences the event rate,” and “how the asthma event rate changes with age.” As suggested in Aalen and others (2004), heterogeneity (i.e. frailty) can sometimes be accounted for using covariate information that changes over the course of observation. Thus, in the context of modeling gap-time data as proposed here, one might investigate how the average length of the current asthma-free episode depends on treatment, the occurrence and length of prior asthmatic and asthma-free episodes, and child age. Due to a significant right skew in the complete gap times, we model the conditional mean of $Y_{ij} = \log X_{ij}$ (i.e. $\mu_{ij}(\theta)$), as in (2.5). Two models, described later, are fit assuming $\mu_{ij}(\theta)$ depends on some function of the following covariates: $D_i = I\{\text{child took the drug}\}$; an indicator of a prior event (i.e. “0” for the first event, 1 otherwise); the length of the most recent asthmatic episode; the length of the most recent asthma-free episode; and the age of the child (in days) at the beginning of the $j$th asthma-free period. To account for the possibility that the $Y_{ij}$ might be heavy tailed, the cumulative distribution function $F_0(\cdot)$ is chosen as a standardized t with 3 degrees of freedom. The use of $F_0(\cdot) = \Phi(\cdot)$ results in no qualitative and minimal quantitative changes (results not shown). The results, reported in Tables 3 and 4, use $V_{ij}(\theta) = 1$; in general, selecting $V_{ij}(\theta) = |\mu_{ij}(\theta)|$ leads to nearly identical answers for the regression coefficients $\theta$ but rather different estimates of $\sigma^2$. The reported standard errors (in brackets) are based on (2.18).

Table 3.

Estimated regression coefficients and standard errors, Models 1 and 2

| | θ0 | θ1 | θ2 | θ3 | θ4 | θ5 | θ6 | θ7 |
|---|---|---|---|---|---|---|---|---|
| Model 1a | 4.874 (0.163) | | −0.758 (0.281) | −0.012 (0.155) | | 0.846 (0.188) | −0.204 (0.180) | −0.846 (0.234) |
| Model 1b | 4.404 (0.194) | | −0.771 (0.382) | 0.312 (0.139) | | 0.748 (0.287) | −0.434 (0.171) | −0.230 (0.144) |
| Model 2 | 4.420 (0.161) | 0.437 (0.175) | −0.747 (0.257) | 0.313 (0.134) | −0.449 (0.202) | 0.766 (0.185) | −0.325 (0.125) | −0.457 (0.143) |

Table 4.

Estimated regression coefficients and standard errors, Model 3

| | θ0 | θ1 | θ2 | θ3 | θ4 | θ5 | θ6 |
|---|---|---|---|---|---|---|---|
| Model 3 | 4.376 (0.162) | 0.529 (0.167) | −1.263 (0.349) | 0.121 (0.071) | −0.258 (0.107) | 0.369 (0.063) | −0.778 (0.136) |

Define Inline graphic and Inline graphic to, respectively, represent the lengths of the most recent asthmatic and asthma-free episodes as being above and below the sample median values. Also, let Inline graphic and Inline graphic denote indicators of a child's age to be 1–1.5 years or greater than 1.5 years. Then, the 2 mean models summarized in Table 3 are described below:Inline graphic.
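For orientation, linear predictors consistent with the columns of Table 3 and with the coefficient interpretations given in the text can be sketched as follows. The symbols E_ij (indicator that j > 1), A_ij and F_ij (indicators that the most recent asthmatic and asthma-free episodes exceed their sample medians), and G_ij^(1), G_ij^(2) (the two age indicators) are placeholder names introduced only for this sketch and are not the authors' notation. The assignment of θ4 to the interaction between treatment and the asthmatic-episode indicator is inferred from the discussion of Model 2, not copied from a displayed equation.

```latex
\begin{align*}
\text{Models 1a, 1b:}\quad
\mu_{ij}(\theta) &= \theta_0 + \theta_2 E_{ij} + \theta_3 A_{ij}
                  + \theta_5 F_{ij} + \theta_6 G^{(1)}_{ij} + \theta_7 G^{(2)}_{ij}, \\
\text{Model 2:}\quad
\mu_{ij}(\theta) &= \theta_0 + \theta_1 D_i + \theta_2 E_{ij} + \theta_3 A_{ij}
                  + \theta_4 D_i A_{ij} + \theta_5 F_{ij}
                  + \theta_6 G^{(1)}_{ij} + \theta_7 G^{(2)}_{ij}.
\end{align*}
```

Under this reading, Models 1a and 1b have no θ1 or θ4 terms because each is fit within a single treatment group.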

In Model 2, the intercept θ0 has a useful interpretation as the average of the logarithm of the first gap time for subjects taking placebo. Notably, Model 1 is not fit directly; rather, this model is fit separately by treatment group in order to help illuminate important interactions with treatment, resulting in Models 1a (drug group) and 1b (control group). The intercepts in Models 1a and 1b, respectively, represent the mean log first gap times for subjects taking drug and placebo. Model 2 assumes a parametric form for the interaction between Di and Inline graphic and is therefore a more restricted version of Models 1a and 1b.

The results for θ0 under Models 1a and 1b and for θ1 under Model 2, summarized in Table 3, indicate the presence of a treatment effect on the initial asthma-free episode, the average gap time being larger in the drug group. Through θ2, both models also indicate that the mean gap time decreases after the occurrence of the first asthmatic episode, the magnitude of this effect varying little by treatment status. Qualitatively, these results are consistent with those of the gap-time-intensity models reported in Table 3 of Duchateau and others (2003, Models 2 and 3), where statistically significant differences by treatment and between the effects of the first and subsequent events are reported.

Together, Models 1a and 1b further suggest that the mean gap times for the drug and control groups tend to be considerably closer to each other for Inline graphic = 1 (i.e. among children whose most recent asthmatic attack is on the longer side) than for Inline graphic = 0. These patterns are present whether or not the effects corresponding to θ5–θ7 are included in the model (results not shown). Models 1a and 1b further suggest that a longer previous asthma-free episode tends to result in a longer current asthma-free episode, regardless of treatment. We also observe that the average length of asthma-free episodes tends to decrease with increasing age, with the drug having a relatively protective effect in younger children (i.e. less than 1.5 years old) that may start to wear off with increasing age.

Models 1a and 1b also demonstrate the existence of a modest level of interaction between certain patient history variables and treatment. Specifically, there are noticeable changes in θ3 (length of most recent asthmatic episode) and θ7 (children older than 1.5 years) and more minor changes in θ5 (length of most recent asthma-free episode) and θ6 (children aged 1–1.5 years). However, with the exception of θ3, these effects exhibit changes only in magnitude, not in direction. Therefore, Model 2 can be expected to provide a parsimonious description of the trends exhibited in Models 1a and 1b, as well as a more direct evaluation of the treatment effect. In fact, under Model 2, we observe that the treatment effect and its interaction with Inline graphic are both statistically significant (p=0.012 and p=0.026, respectively).
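The two reported p-values can be checked approximately from the Model 2 entries in Table 3. The sketch below assumes two-sided Wald tests with a standard normal reference distribution, which the text does not state explicitly, and it assumes that θ1 is the treatment effect and θ4 the interaction term. The function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(estimate: float, std_error: float) -> float:
    """Two-sided p-value for H0: coefficient = 0, normal approximation."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Model 2 entries from Table 3: (estimate, standard error).
model2 = {
    "theta1 (treatment)": (0.437, 0.175),
    "theta4 (interaction)": (-0.449, 0.202),
}

p_values = {name: wald_p_value(est, se) for name, (est, se) in model2.items()}

for name, (est, se) in model2.items():
    print(f"{name}: z = {est / se:.3f}, p = {p_values[name]:.4f}")
```

Because the tabulated estimates and standard errors are rounded to three decimals, agreement with the reported p-values can only be expected up to rounding.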

Model 3, summarized in Table 4, is specified similarly to Model 2, except that continuous versions of certain covariates are used in order to investigate the impact of discretizing covariates. Let Inline graphic if j=1 and Inline graphic define Inline graphic similarly. Also, let Inline graphic be the log age of a child, centered at its minimum value of 182 days. Then, the mean function for Model 3 is the following:

4.

The results of fitting this model reflect the same trends observed in Model 2.

Residual analysis for GEE models with longitudinal data, particularly in the presence of missing data, is not a well-developed field. In the current setting, one might consider using

4.

that is, the estimated “standardized residuals” derived from the complete gap-time information. Figure 1 provides histograms of these quantities, respectively, obtained under Models 1–3. While such plots cannot be used to validate that the individual mean and variance functions have been well specified, the lack of unusually large standardized residuals is an indication that the model has done a reasonable overall job in describing the observed gap-time data. However, care must be exercised in the interpretation of such plots due to the presence of correlation between residuals and also the fact that Inline graphic. The last result is a consequence of using residuals derived only from complete gap-time information and likely explains the mild left skew observed in each plot.
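As a minimal sketch of how such residuals might be computed, the snippet below assumes a standardized residual of the form (Y_ij − μ_ij) / (σ · sqrt(V_ij)), with all quantities evaluated at the parameter estimates and only complete (uncensored) gap times retained. This form is an assumption based on the mean and variance functions used above. The function name, argument names, and toy numbers are hypothetical and are not taken from the asthma data.

```python
import math


def standardized_residuals(y, mu_hat, v_hat, sigma_hat, complete):
    """Return (y - mu) / (sigma * sqrt(V)) for complete gap times only.

    y         : transformed gap times (e.g. log gap times)
    mu_hat    : fitted conditional means
    v_hat     : fitted variance-function values V_ij
    sigma_hat : estimated scale parameter
    complete  : booleans, True where the gap time is uncensored
    """
    residuals = []
    for y_ij, mu_ij, v_ij, is_complete in zip(y, mu_hat, v_hat, complete):
        if is_complete:  # censored gap times are skipped
            residuals.append((y_ij - mu_ij) / (sigma_hat * math.sqrt(v_ij)))
    return residuals


# Toy illustration with V_ij = 1, as in the reported analyses.
y = [3.2, 4.9, 5.5, 2.1]
mu_hat = [4.0, 4.5, 4.5, 3.7]
v_hat = [1.0, 1.0, 1.0, 1.0]
complete = [True, True, False, True]

res = standardized_residuals(y, mu_hat, v_hat, sigma_hat=0.8, complete=complete)
print([round(r, 3) for r in res])
```

Dropping the censored gap times is what makes the retained residuals a selected sample, which is the reason for the caution about interpretation noted above.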

Fig. 1. Standardized complete gap-time residuals for Models 1–3.

5. DISCUSSION

The use of GEE-type methods for analyzing recurrent event counting processes is now common. In contrast, the use of such methods for directly analyzing gap-time data has not been systematically investigated. The methodology proposed here extends and corrects methodology originally proposed in Murphy and others (1995) for the purposes of analyzing menstrual cycle data. The result is a simple yet flexible class of models for analyzing gap-time data. An especially attractive feature is the ability to specify rich models through mean and variance structures, leading to direct interpretability of regression effects on the gap times.

The main limitation of the proposed methodology is the reliance of (2.14) and (2.15) on the parametric model specified by (2.11) and (2.10). This parametric assumption is used for the sole purpose of dealing with censored gap-time information, and our simulation results also demonstrate a degree of robustness to the misspecification of F0(·). However, for reasons explained earlier, both consistency and asymptotic normality of Inline graphic do rely on the correct specification of the imputation model (2.10). Consequently, it is worthwhile to consider estimating η using alternative methods for handling the censored gap times.

For example, one might try to estimate the required conditional moments for the censored gap times under (2.10) without imposing a specific parametric model in (2.11). Let M > 0 be an arbitrarily large fixed integer and define, for i = 1, …, n and j = 1, …, M, the probabilities

5. (5.1)

Assume that π_ij ≥ ϵ > 0 for i, j ≥ 1 and, in addition, that E[W_ij^r(η0) | W_ij(η0) > w] = K_r(w), r = 1, 2, and i, j ≥ 1. Under mild regularity conditions

5.

is a pointwise-consistent estimator of K_r(w) for r = 1, 2. This estimator avoids the need to use a parametric specification for F0(·) in (2.11) at the price of assuming that E[W_ij^r(η0) | W_ij(η0) > w] = K_r(w), r = 1, 2, for all standardized gap times. This latter assumption is stronger than that made in connection with (2.10), which is only imposed on the last incomplete gap time on each subject. In practice, use of this inverse-probability-of-censoring-weighted (IPCW) estimator also requires estimating η0 and the censoring probabilities in (5.1). In the special case where censoring is completely independent of the event process and covariates, it is possible to consistently estimate the π_ij via Inline graphic, where Inline graphic is the empirical cumulative distribution function of the censoring times C_1, …, C_n. More generally, a model for the censoring mechanism must be imposed; if this model is misspecified, bias can be expected for reasons analogous to the misspecification of F0(·). For these and other reasons (e.g. the need for uniform asymptotic results, proper methods of variance estimation), we have not investigated the utility of this estimator any further.
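The following sketch illustrates the two ingredients just described under completely independent censoring: an empirical estimate of the probability that a censoring time is at least t, and a generic inverse-probability-weighted ratio estimate of a conditional moment. It is not the estimator displayed above. The ratio form, the function names, and the toy inputs are assumptions chosen for illustration, and the exact time at which each censoring probability should be evaluated is left to the definition in (5.1).

```python
from bisect import bisect_left


def censoring_survival(censor_times):
    """Return a function t -> fraction of censoring times >= t."""
    ordered = sorted(censor_times)
    n = len(ordered)

    def surv(t):
        # bisect_left counts the censoring times strictly less than t.
        return (n - bisect_left(ordered, t)) / n

    return surv


def ipcw_conditional_moment(w_values, pi_values, w, r):
    """Weighted mean of W**r over complete standardized gaps with W > w.

    w_values  : standardized complete gap times
    pi_values : estimated probabilities that each gap is uncensored
    """
    num = 0.0
    den = 0.0
    for w_ij, pi_ij in zip(w_values, pi_values):
        if w_ij > w:
            num += (w_ij ** r) / pi_ij
            den += 1.0 / pi_ij
    if den == 0.0:
        raise ValueError("no complete gap time exceeds w")
    return num / den


# Toy illustration (hypothetical numbers).
surv = censoring_survival([100, 200, 300, 400])
print(surv(250))

k1 = ipcw_conditional_moment([0.5, 1.0, 2.0], [1.0, 0.5, 0.25], w=0.75, r=1)
print(k1)
```

Gaps that were less likely to be observed receive larger weights, which is how the weighting compensates for the selective loss of long gap times.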

An alternative use of IPCW estimation is to construct direct analogs of (2.14) and (2.15). Specifically, one might proceed by estimating η using

5.

While such an approach obviates the need to impose assumptions (2.11) and (2.10), the avoidance of bias continues to require that one be able to correctly model and consistently estimate the censoring probabilities (5.1). In addition, because the information on censored gap times is no longer utilized directly, the efficiency of such an approach may suffer. This efficiency loss may be offset by including information on the censored gap times via augmented estimating equations (e.g. Rotnitzky and others, 1998; Scharfstein and others, 1999; Bang and Robins, 2005). We hope to explore the utility of this approach for analyzing gap-time data in subsequent work.

Finally, the methods developed in this paper can in principle be extended to multivariate recurrent event processes arising either as a result of having clustered data or due to the presence of multiple recurrent event outcomes on each subject. However, if the dependence structure between processes or especially between censored multivariate gap times must be modeled, the robustness of the present approach is likely to suffer and it may be advantageous to use a proper extension of the IPCW-type estimation scheme described above.

FUNDING

National Institutes of Health (R01 GM056182) to D.Y.C. and R.L.S.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at http://www.biostatistics.oxfordjournals.org.

Acknowledgments

Conflict of interest: None declared.

Appendix

Regularity conditions sufficient for Theorems 2.1–2.3 to hold are summarized below:

  • (A0) The parameter η = (θ^T, σ)^T lies in some compact subset 𝒪 ⊂ ℝ^p, and the data-generating parameter η0 is assumed to lie in the interior of 𝒪. The known transformation h(x) is monotone nondecreasing and bounded for x ∈ (0, ∞). Subjects are independent and identically distributed. Noninformative censoring holds in the sense that for j ≥ 1, we have E[Y1j|H1j]=E[Y1j|1j] and Var[Y1j|H1j]=Var[Y1j|1j] for H1j=1j{C1S1,j1}.

  • (A1) Assumption (2.10) holds, with F0(·) correctly specified. In addition, bij(η), μij(θ), and Vij^{-1}(θ) are each bounded and twice continuously differentiable for i, j ≥ 1 and η ∈ 𝒪.

  • (A2) Sn(η) is continuous for η ∈ 𝒪, and Sn(η) converges uniformly in probability to S(η) := Eη0[ψ(η, Inline graphic1)] in some open neighborhood containing η0, where S(η0) = 0.

  • (A3) Inline graphic exists and is continuous for η ∈ 𝒪, and Sn(η) converges uniformly in probability to Inline graphic in some open neighborhood containing η0.

  • (A4) The derivative matrix of S(η) with respect to η, evaluated at η0, is nonsingular.

  • (A5) ψ(η, Inline graphic1) satisfies the Lipschitz condition Inline graphic, where η1 and η2 both lie in a neighborhood containing η0 and Inline graphic is a measurable, scalar-valued function with Inline graphic.

  • (A6) Inline graphic.

  • (A7) Inline graphic.

Conditions (A0) and (A1) impose assumptions specific to the problem at hand, and conditions (A2)–(A7) are more general, consisting of a combination of regularity conditions taken from Yuan and Jennrich (1998) and van der Vaart (1998, Section 5.3) adapted to the current problem. Yuan and Jennrich (1998), extending results originally due to Foutz (1977), use (A2)–(A4) and the inverse function theorem to prove the consistency of a sequence of solutions obtained via an unbiased estimating equation. van der Vaart (1998, Section 5.3) uses (A4)–(A7) and the assumption that a consistent estimator exists in order to prove asymptotic normality under an i.i.d. sampling assumption. In particular, under conditions (A0)–(A7), the proofs of Theorems 2.2 and 2.3 are, respectively, direct consequences of Theorem 3 of Yuan and Jennrich (1998) and Theorem 5.21 of van der Vaart (1998). The remaining details are therefore omitted.

Condition (A1) says very little about the nature of Kr(·),r=1,2, appearing in (2.14) and (2.15), and the requisite assumptions are embedded in (A2)–(A7). For example, the derivatives appearing in (A3)–(A5) involve the functions Inline graphic. Using integration by parts and assuming that F0(·) is continuously differentiable, we see that

graphic file with name biostskxp004fx38_ht.jpg (A.1)

for h = 1, 2, where λ0(u) = f0(u)/(1 − F0(u)) is the hazard function corresponding to F0. Thus, conditions (A3)–(A5) impose implicit smoothness assumptions on Kh(·) and F0(·). These sufficient conditions can be refined in a way that makes the required smoothness assumptions more transparent, and further details may be found in Section S.1.4 of the supplementary material available at Biostatistics online.
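To illustrate how the hazard function enters, the following worked derivation assumes that K_h(w) = E[W^h | W > w] for a random variable W with distribution function F_0 and density f_0, that the h-th moment is finite, and that u^h{1 − F_0(u)} → 0 as u → ∞. It is a sketch of one identity obtainable by integration by parts under these assumptions; the form in which (A.1) is displayed may differ.

```latex
\begin{align*}
K_h(w) &= \frac{1}{1 - F_0(w)} \int_w^\infty u^h f_0(u)\,du \\
       &= \frac{1}{1 - F_0(w)}
          \left\{ \Big[ -u^h \{1 - F_0(u)\} \Big]_{u=w}^{\infty}
          + h \int_w^\infty u^{h-1} \{1 - F_0(u)\}\,du \right\} \\
       &= w^h + \frac{h}{1 - F_0(w)} \int_w^\infty u^{h-1} \{1 - F_0(u)\}\,du , \\[4pt]
\frac{d}{dw} K_h(w)
       &= \frac{-w^h f_0(w)\{1 - F_0(w)\} + f_0(w) \int_w^\infty u^h f_0(u)\,du}
               {\{1 - F_0(w)\}^2}
        = \lambda_0(w) \left\{ K_h(w) - w^h \right\}.
\end{align*}
```

The last line shows that differentiating K_h(·) brings in λ_0(·) directly, so smoothness requirements on the derivatives of K_h translate into requirements on f_0 and F_0.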

References

  1. Aalen O, Husebye E. Statistical analysis of repeated events forming renewal processes. Statistics in Medicine. 1991;10:1227–1240. doi: 10.1002/sim.4780100806.
  2. Aalen OO, Fosen J, Weedon-Fekjær H, Borgan Ø, Husebye E. Dynamic analysis of multivariate failure time data. Biometrics. 2004;60:764–773. doi: 10.1111/j.0006-341X.2004.00227.x.
  3. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Annals of Statistics. 1982;10:1100–1120.
  4. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–972. doi: 10.1111/j.1541-0420.2005.00377.x.
  5. Chang S. Estimating marginal effects in accelerated failure time models for serial sojourn times among repeated events. Lifetime Data Analysis. 2004;10:175–190. doi: 10.1023/b:lida.0000030202.20842.c9.
  6. Chang SH, Wang MC. Conditional regression analysis for recurrence time data. Journal of the American Statistical Association. 1999;94:1221–1230.
  7. Chen Y, Wang M. Semiparametric regression analysis on longitudinal pattern of recurrent gap times. Biostatistics. 2004;5:277–290. doi: 10.1093/biostatistics/5.2.277.
  8. Cook R, Lawless J. The Statistical Analysis of Recurrent Events. New York: Springer; 2007.
  9. Cox DR. Renewal Theory. London: Methuen; 1962.
  10. Cox DR. Regression models and life-tables (with discussion by F. Downton, Richard Peto, D.J. Bartholomew, D.V. Lindley, P.W. Glassborow, D.E. Barton, Susannah Howard, B. Benjamin, John J. Gart, L.D. Meshalkin, A.R. Kagan, M. Zelen, R.E. Barlow, Jack Kalbfleisch, R.L. Prentice and Norman Breslow, and a reply by D.R. Cox). Journal of the Royal Statistical Society Series B. 1972;34:187–220.
  11. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B. 1977;39:1–38.
  12. Duchateau L, Janssen P, Kezic I, Fortpied C. Evolution of recurrent asthma event rate over time in frailty models. Journal of the Royal Statistical Society Series C. 2003;52:355–363.
  13. Elashoff M, Ryan L. An EM algorithm for estimating equations. Journal of Computational and Graphical Statistics. 2004;13:48–65.
  14. Foutz RV. On the unique consistent solution to the likelihood equations. Journal of the American Statistical Association. 1977;72:147–148.
  15. Huang Y. Censored regression with the multistate accelerated sojourn times model. Journal of the Royal Statistical Society Series B. 2002;64:17–29.
  16. Huang Y, Chen Y. Marginal regression of gaps between recurrent events. Lifetime Data Analysis. 2004;9:293–303. doi: 10.1023/a:1025892922453.
  17. Kotz S, Shanbhag DN. Some new approaches to probability distributions. Advances in Applied Probability. 1980;12:903–921.
  18. Lin DY, Sun W, Ying Z. Nonparametric estimation of the gap time distributions for serial events with censored data. Biometrika. 1999;86:59–70.
  19. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd edition. New York: Wiley Series in Probability and Statistics, Wiley-Interscience; 2002.
  20. Molenberghs G, Verbeke G. Models for Discrete Longitudinal Data. New York: Springer; 2005.
  21. Murphy SA, Bentley GR, O'Hanesian MA. An analysis for menstrual data with time-varying covariates. Statistics in Medicine. 1995;14:1843–1857. doi: 10.1002/sim.4780141702.
  22. Murphy SA, Li B. Projected partial likelihood and its application to longitudinal data. Biometrika. 1995;82:399–406.
  23. Oakes D, Cui L. On semiparametric inference for modulated renewal processes. Biometrika. 1994;81:83–90.
  24. Oakes D, Dasu T. A note on residual life. Biometrika. 1990;77:409–410.
  25. Peña EA, Strawderman RL, Hollander M. Nonparametric estimation with recurrent event data. Journal of the American Statistical Association. 2001;96:1299–1315.
  26. Prentice RL, Williams BJ, Peterson AV. On the regression analysis of multivariate failure time data. Biometrika. 1981;68:373–379.
  27. Rotnitzky A, Robins JM, Scharfstein DO. Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association. 1998;93:1321–1339.
  28. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models (with comments and a rejoinder by the authors). Journal of the American Statistical Association. 1999;94:1096–1146.
  29. Strawderman RL. The accelerated gap times model. Biometrika. 2005;92:647–666.
  30. Strawderman RL. A regression model for dependent gap times. International Journal of Biostatistics. 2006:34. 2, Article 1 (electronic).
  31. van der Vaart A. Asymptotic Statistics. New York: Cambridge University Press; 1998.
  32. Yuan K-H, Jennrich RI. Asymptotics of estimating equations under natural conditions. Journal of Multivariate Analysis. 1998;65:245–260.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
