Author manuscript; available in PMC: 2024 Mar 19.
Published in final edited form as: Stat Med. 2014 Jul 23;33(27):4770–4789. doi: 10.1002/sim.6262

Regression modeling of longitudinal data with outcome-dependent observation times: extensions and comparative evaluation

Kay See Tan 1,*, Benjamin French 1, Andrea B Troxel 1
PMCID: PMC10949856  NIHMSID: NIHMS1971342  PMID: 25052289

Abstract

Conventional longitudinal data analysis methods assume that outcomes are independent of the data-collection schedule. However, the independence assumption may be violated, for example, when a specific treatment necessitates a different follow-up schedule than the control arm or when adverse events trigger additional physician visits in between prescheduled follow-ups. Dependence between outcomes and observation times may introduce bias when estimating the marginal association of covariates on outcomes using a standard longitudinal regression model. We formulate a framework of outcome-observation dependence mechanisms to describe conditional independence given observed observation-time process covariates or shared latent variables. We compare four recently developed semi-parametric methods that accommodate one of these mechanisms. To allow greater flexibility, we extend these methods to accommodate a combination of mechanisms. In simulation studies, we show how incorrectly specifying the outcome-observation dependence may yield biased estimates of covariate-outcome associations and how our proposed extensions can accommodate a greater number of dependence mechanisms. We illustrate the implications of different modeling strategies in an application to bladder cancer data. In longitudinal studies with potentially outcome-dependent observation times, we recommend that analysts carefully explore the conditional independence mechanism between the outcome and observation-time processes to ensure valid inference regarding covariate-outcome associations.

Keywords: joint models, observation-time process, outcome process, outcome-dependent follow-up, semi-parametric regression, informative observation times

1. Introduction

Longitudinal studies commonly assume that the data-collection schedule is independent of a subject’s outcomes and measured or unmeasured characteristics. However, this independence assumption may be violated if observed covariate or outcome values influence the occurrence or timing of subsequent visits. For example, in a study following patients with diabetes, routine visits are scheduled every 6 months. However, spikes in blood sugar levels, exacerbation of other symptoms, or underlying patient characteristics may trigger additional closely spaced physician visits until the blood sugar level has stabilized. The intensity of events such as physician visits is dependent on previous outcomes and measured or unmeasured covariates. Less healthy patients may be over-represented in the analysis because of more frequent data collection. In the presence of the resultant selection bias, conventional methods such as generalized estimating equations (GEE) [1] may yield biased estimates of covariate-outcome associations [2,3]. Proper estimation must account for such selection bias. We focus on a marginal mean regression model to evaluate the association between observed covariates and a continuous outcome of interest. We denote the longitudinal outcomes as the outcome process and the occurrence of visits over time as the observation-time process.

Several authors have proposed parametric models to account for the potential dependence between the outcome and observation-time processes. Lipsitz et al. [4] developed a likelihood-based procedure for continuous outcomes, Fitzmaurice et al. [5] proposed a pseudo-likelihood estimation procedure for binary outcomes, and Lin et al. [6] and Bůžková and Lumley [7] utilized inverse intensity-weighted estimators with observation-level inverse weights. Others focused on estimation procedures based on joint likelihood approaches: Ryu et al. [8] developed a Bayesian fully parametric regression model; Liu et al. [9] considered a joint mixed-effects model in which the outcomes, observation times, and censoring times were correlated through latent variables. The study of outcome-dependent observation times shares features of research regarding incomplete [10,11] and recurrent marked point process data [12] but differs in that subjects do not share a common set of visit times, and outcomes (e.g., blood sugar level) exist even if an event (e.g., a physician visit) does not occur.

We introduce a framework of three outcome-observation dependence mechanisms. The first mechanism applies when the outcome and the observation-time processes are conditionally independent given outcome-model covariates. The second mechanism applies when the processes are conditionally independent given observation-time model covariates, which may include outcome-model covariates and previous outcomes. The third applies when the processes are conditionally independent given shared, unobserved, latent variables. We consider four semi-parametric marginal regression methods that do not require estimation of the mean effect of time on the outcomes: the Lin method [13] accommodates the first mechanism, the Bůžková method [14] accommodates the second mechanism, and the Liang [15] and Sun [16] methods accommodate the third mechanism. We extend both the Liang and the Sun methods to accommodate a combination of the three mechanisms, thereby increasing the flexibility of the models.

In this article, we compare currently available and newly extended methods that accommodate outcome-dependent observation times. Our goal is to provide much-needed clarification of the strengths and limitations of each estimation method under alternative outcome-observation dependence mechanisms. In Section 2, we elaborate on our framework of outcome-observation dependence mechanisms. We review existing methods under each of these mechanisms (Section 2.2) and detail our extensions to both the Liang and Sun methods to accommodate conditional independence through observation-time model covariates, and our extension to the Liang method to accommodate time-dependent covariates in the observation-time model (Section 2.3). We present simulation studies to evaluate the performance of the reviewed methods under alternative outcome-observation dependence mechanisms in Section 3 and illustrate their application to a bladder cancer study in Section 4. Section 5 provides guidance on the selection of estimation methods.

2. Estimation methods

Let $Y_i(t)$ denote a continuous outcome of interest at time $t$ and $X_i(t)$ denote a $p \times 1$ vector of possibly time-dependent covariates for subject $i = 1, \ldots, n$. We only consider external covariates, such that any time-dependent covariate process at time $t$ is conditionally independent of all previous outcomes given the history of the covariate process [17]. The outcome $Y_i(\cdot)$ is measured at $m_i$ observation times $0 \le T_{i1} < T_{i2} < \cdots < T_{im_i} \le \tau$, for which $m_i$ denotes the number of follow-up measurements on the $i$th individual, and $\tau$ denotes the maximum study duration. Using counting process notation, let $N_i(t) = \sum_{s \le t} dN_i(s)$ denote the number of observations on the $i$th subject by $t \le C_i$. The censoring time $C_i \le \tau$ is the time of last visit or an administrative end-of-study time. The indicator variable $dN_i(t)$ is 1 if a follow-up visit occurred at $t$ and 0 otherwise. We assume non-informative censoring, such that $E[Y_i(t) \mid X_i(t), C_i \ge t] = E[Y_i(t) \mid X_i(t)]$. That is, the covariate-outcome associations are the same in subjects who are censored at $C_i$ as in those who are still in the study.

2.1. Models and assumptions

2.1.1. Semi-parametric outcome model.

We assume that primary scientific interest lies in a semi-parametric regression model for the longitudinal continuous outcomes [13]:

$$Y_i(t) = \mu(t) + \beta^\top X_i(t) + \epsilon_i(t), \tag{1}$$

for which $\mu(t)$ is an arbitrary function of time, $\beta$ is a $p \times 1$ vector of regression parameters of interest, and $\epsilon_i(t)$ is a zero-mean process independent of $X_i(t)$. Model (1) specifies a parametric structure for the effect of $X_i(t)$ and a non-parametric structure for $\mu(t)$ [3,18,19]. A semi-parametric model is appealing if the effect of $X_i(t)$ is of primary interest and the effect of time is considered a nuisance. Model (1) does not condition on the entire covariate process or on past outcomes. Instead, it includes covariate information available at $t$, such as baseline covariates, covariates measured at or before $t$, and summaries of the covariate history; that is, Model (1) is a partly conditional mean regression model [20].

2.1.2. Observation-time model.

We use a standard recurrent events model to describe the observation-time process. Given observation-time model covariates $Z_i(t)$ and a non-negative latent variable $\eta_i$ with mean 1 and unknown variance $\sigma^2$, $N_i$ is a non-homogeneous Poisson process with intensity function [21]:

$$\lambda_i(t) = \eta_i \exp\{\gamma^\top Z_i(t)\}\,\lambda(t), \tag{2}$$

in which $\gamma$ is a vector of unknown parameters, $\lambda(t)$ is an unspecified baseline intensity function with $\Lambda(t) = \int_0^t \lambda(u)\,du$, and $X_i(t) \subseteq Z_i(t)$. Unless otherwise specified, we assume that $\eta_i$ is independent of $Z_i(t)$. Model (2) implies that the occurrence of observations follows a proportional intensity model, in which $\eta_i$ inflates or deflates the visit intensity. The parameter $\gamma$ from the observation-time model is considered a nuisance. However, incorporating the observation-time process into the estimation of $\beta$ facilitates prediction from longitudinal data under similar outcome-observation dependence mechanisms, which we detail in the next section.
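To make the proportional intensity model (2) concrete, the following minimal sketch simulates visit times for one subject by thinning (the Lewis-Shedler algorithm), which is one standard way to draw from a non-homogeneous Poisson process. The scalar time-independent covariate, the baseline intensity $t/2$, the parameter values, and all function names are assumptions chosen for illustration; they are not taken from the methods reviewed here.

```python
import math
import random


def simulate_visit_times(eta, gamma, z, baseline, baseline_max, censor_time, rng):
    """Simulate visit times on (0, censor_time] with intensity
    eta * exp(gamma * z) * baseline(t), using thinning.

    Assumes a scalar, time-independent covariate z (illustration only) and
    baseline(t) <= baseline_max for all t in [0, censor_time].
    """
    scale = eta * math.exp(gamma * z)
    upper = scale * baseline_max  # constant rate that dominates the intensity
    times = []
    if upper <= 0:
        return times
    t = 0.0
    while True:
        t += rng.expovariate(upper)  # next candidate from a homogeneous process
        if t > censor_time:
            break
        # Accept the candidate with probability lambda_i(t) / upper.
        if rng.random() <= scale * baseline(t) / upper:
            times.append(t)
    return times


rng = random.Random(1)
# Gamma frailty with shape 2 and scale 0.5, i.e. mean 1 and variance 0.5.
eta = rng.gammavariate(2.0, 0.5)
censor = 5.0
visits = simulate_visit_times(
    eta, gamma=0.5, z=1.0,
    baseline=lambda t: t / 2.0,   # hypothetical baseline intensity
    baseline_max=censor / 2.0,    # its maximum on [0, censor]
    censor_time=censor, rng=rng,
)
print(len(visits), [round(t, 2) for t in visits])
```

A larger value of `eta` raises the acceptance rate uniformly over time, which is the sense in which the latent variable inflates or deflates the visit intensity.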

2.1.3. A framework of outcome-observation dependence mechanisms.

We distinguish three mechanisms that describe the dependence between the outcome and observation-time processes:

(M1) Conditional independence given past outcome-model covariates.

(M2) Conditional independence given past observation-time model covariates.

(M3) Conditional independence given shared latent variables.

Throughout the paper, ‘conditional independence given covariates’ implies conditional independence given past observed covariates. Recall that $X_i(t)$ incorporates covariate information available at $t$, which may include baseline covariates, covariates measured at or before $t$, and summaries of the covariate history.

(M1) Conditional independence given past outcome-model covariates

The first mechanism applies when the outcome process is conditionally independent of the observation-time process given observed outcome-model covariates $X_i(t)$, or a subset of $X_i(t)$:

$$E[dN_i(t) \mid X_i(t), Y_i(t), C_i \ge t] = E[dN_i(t) \mid X_i(t)].$$

The probability of observation at time $t$ depends on $X_i(t)$, $Y_i(t)$, and $C_i$ only through observed outcome-model covariates $X_i(t)$. (M1) is plausible when the occurrence of a visit is due to the features of the study design instead of subject-specific behaviors.

(M2) Conditional independence given past observation-time model covariates

The second mechanism applies when the probability of an observation at $t$ depends on observation-time model covariates $Z_i(t)$, in which all or a subset of $X_i(t)$ is contained in $Z_i(t)$:

$$E[dN_i(t) \mid Z_i(t), Y_i(t), C_i \ge t] = E[dN_i(t) \mid Z_i(t)].$$

The set of observation-time model covariates $Z_i(t)$ can include the outcome-model covariates, any additional covariates measured at or before $t$, and summaries of previous outcomes. Note that (M1) ⊂ (M2) because $X_i(t) \subseteq Z_i(t)$.

(M3) Conditional independence given shared latent variables

The third mechanism applies when the outcome process is conditionally independent of the observation-time process given observed outcome-model covariates $X_i(t)$ and an unobserved mean 1 subject-specific latent variable $\eta_i$:

$$E[dN_i(t) \mid X_i(t), Y_i(t), \eta_i, C_i \ge t] = E[dN_i(t) \mid X_i(t), \eta_i].$$

The latent variable $\eta_i$ expresses subject-specific unmeasured confounders and propensity for an observation. Note that (M1) ⊂ (M3) in situations to be described in Section 2.2.3.

Our framework for outcome-observation dependence in the analysis of longitudinal data provides guidance for the selection of reliable methods. (M2) and (M3) place fewer restrictions on the probability of having a visit than (M1) and are reasonable assumptions in most observational studies. However, (M2) and (M3) are more restrictive because fewer analysis methods are available to provide valid inference, which we detail in the following section.

2.2. Existing methods

In this section, we describe four existing methods to estimate covariate-outcome associations in the presence of outcome-observation dependence. All of the methods require estimation of an observation-time model. If the observation-time process is conditionally independent of the censoring times, the parameter $\gamma$ can be consistently estimated by $\hat\gamma$, obtained by setting the following estimating function from Lin et al. [21] to zero:

$$U(\gamma) = \sum_{i=1}^n \int_0^\tau \{Z_i(t) - \bar Z(t;\gamma)\}\,dN_i(t), \tag{3}$$

for which $\xi_i(t) = I(C_i > t)$ and:

$$\bar Z(t;\gamma) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top Z_i(t)\}\,Z_i(t)}{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top Z_i(t)\}}.$$
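As an illustration of how (3) can be solved in practice, the sketch below applies Newton-Raphson to a toy data set with a single time-independent covariate. The data, the restriction to a scalar covariate, and the function names are hypothetical; the derivative used in the update is the weighted variance of the covariate among subjects at risk, which follows from differentiating $\bar Z(t;\gamma)$ with respect to $\gamma$.

```python
import math


def z_bar_and_var(t, subjects, gamma):
    """Weighted mean and variance of z among subjects at risk at time t,
    with weights exp(gamma * z). At risk means xi_i(t) = I(C_i > t)."""
    s0 = s1 = s2 = 0.0
    for z, c, _ in subjects:
        if c > t:
            w = math.exp(gamma * z)
            s0 += w
            s1 += w * z
            s2 += w * z * z
    mean = s1 / s0
    return mean, s2 / s0 - mean * mean


def score(subjects, gamma):
    """Return U(gamma) from (3) and minus its derivative."""
    u = info = 0.0
    for z, _, visits in subjects:
        for t in visits:
            mean, var = z_bar_and_var(t, subjects, gamma)
            u += z - mean
            info += var
    return u, info


def solve_gamma(subjects, gamma=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations for U(gamma) = 0 (scalar covariate only)."""
    for _ in range(max_iter):
        u, info = score(subjects, gamma)
        step = u / info
        gamma += step
        if abs(step) < tol:
            break
    return gamma


# Hypothetical subjects: (covariate z, censoring time C, visit times).
subjects = [
    (0.0, 10.0, [2.0, 7.0]),
    (1.0, 10.0, [1.0, 3.0, 5.0, 8.0]),
    (0.0, 6.0, [4.0]),
    (1.0, 9.0, [2.5, 4.5, 6.5]),
]
gamma_hat = solve_gamma(subjects)
print(round(gamma_hat, 4))
```

In this toy example the subjects with $z = 1$ are seen more often, so the estimate is positive.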

2.2.1. Method under (M1).

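Lin and Ying [13] assume that the observation-time process is conditionally independent of the outcome process given the outcome-model covariates, as in (M1). The Lin method specifies a marginal semi-parametric outcome model $E[Y_i(t) \mid X_i(t)] = \mu(t) + \beta^\top X_i(t)$ and a proportional rate observation-time model $E[dN_i(t) \mid V_i(t)] = \exp\{\gamma^\top V_i(t)\}\,d\Lambda(t)$, in which $V_i(t)$ is a subset of $X_i(t)$. We define a zero-mean stochastic process [13]:

$$M_i(t;\mathcal{A},\beta,\gamma) = \int_0^t \{Y_i(s) - \beta^\top X_i(s)\}\,dN_i(s) - \int_0^t \exp\{\gamma^\top V_i(s)\}\,\xi_i(s)\,d\mathcal{A}(s), \tag{4}$$

in which $\mathcal{A}(t) = \int_0^t \mu(s)\,d\Lambda(s)$. Based on (4), one set of estimating equations to solve for $\mathcal{A}(t)$ and $\beta$ is:

$$\sum_{i=1}^n M_i(t;\beta,\gamma) = 0, \quad 0 < t \le \tau, \tag{5}$$
$$\sum_{i=1}^n \int_0^\tau W(t)\,X_i(t)\,dM_i(t;\beta,\gamma) = 0. \tag{6}$$

The common weight $W(t)$ can improve efficiency and may be data-dependent, such as the proportion of subjects left in the study, that is, $n^{-1}\sum_{i=1}^n \xi_i(t)$. Solving (5) for $\mathcal{A}(t)$ yields the closed-form expression $\tilde{\mathcal{A}}(t;\beta) = \sum_{i=1}^n \int_0^t \frac{\{Y_i(s) - \beta^\top X_i(s)\}\,dN_i(s)}{\sum_{j=1}^n \xi_j(s)\exp\{\gamma^\top V_j(s)\}}$, which replaces $\mathcal{A}(t)$ in (6) to form the estimating equation:

$$\sum_{i=1}^n \int_0^\tau W(t)\,\{X_i(t) - \bar X(t;\gamma)\}\,\{Y_i(t) - \beta^\top X_i(t)\}\,dN_i(t) = 0. \tag{7}$$

The centering term is defined as:

$$\bar X(t;\gamma) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top V_i(t)\}\,X_i(t)}{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top V_i(t)\}}.$$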

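We note that:

$$E\left[\sum_{i=1}^n \int_0^\tau \{X_i(t) - \bar X(t;\gamma)\}\,g(t)\,dN_i(t) \,\Big|\, X_i(t), C_i;\ i = 1, \ldots, n\right] = \sum_{i=1}^n \int_0^\tau \{X_i(t) - \bar X(t;\gamma)\}\,g(t)\,E[dN_i(t) \mid X_i(t), C_i] = 0$$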

for any function $g(\cdot)$, so we extend the left side of (7) to obtain the class of estimating functions for $\beta$:

$$U_g(\beta;\gamma) = \sum_{i=1}^n \int_0^\tau W(t)\,\{X_i(t) - \bar X(t;\gamma)\}\,\{Y_i(t) - \beta^\top X_i(t) - g(t;\gamma)\}\,dN_i(t).$$

One optimal choice of $g(\cdot)$ that minimizes the variance of $U_g(\beta;\gamma)$ is $g(t;\gamma) = \bar Y^*(t;\gamma) - \beta^\top \bar X(t;\gamma)$, in which:

$$\bar Y^*(t;\gamma) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top V_i(t)\}\,Y_i^*(t)}{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top V_i(t)\}},$$

and $Y_i^*(t)$ is the measurement of $Y_i$ at the observation nearest to $t$. Hence, $\beta$ can be consistently estimated from the estimating equation [13]:

$$U(\beta;\gamma) = \sum_{i=1}^n \int_0^\tau W(t)\,\{X_i(t) - \bar X(t;\gamma)\}\,\bigl[Y_i(t) - \bar Y^*(t;\gamma) - \beta^\top\{X_i(t) - \bar X(t;\gamma)\}\bigr]\,dN_i(t) = 0,$$

in which $\gamma$ is estimated by (3) conditioning on the covariates $V_i(t)$. The inclusion of the centering term for covariates accounts for the probability of being observed at $t$ and removes the need for estimation of $\mu(t)$. The centering of the outcome increases the efficiency of the estimation procedure. Note that $Y_i^*(t)$ is the nearest-neighbor approximation of $Y_i(t)$ if the true measurement is not evaluable or collected at $t$. Li and Ryan [22] documented the potential issue of such mismeasured covariates. Discussion of other forms of $g(\cdot)$ and $Y_i^*(t)$ can be found in the comments and rejoinder section of Lin and Ying [13].
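Because the final estimating function of this section is linear in $\beta$, its root can be written explicitly. The display below is a worked rearrangement under the assumption that the matrix in brackets is invertible; the notation $a^{\otimes 2} = a a^\top$ is introduced here for brevity and is not part of the original presentation.

```latex
\hat\beta
  = \left[ \sum_{i=1}^n \int_0^\tau W(t)\,
      \{X_i(t) - \bar X(t;\hat\gamma)\}^{\otimes 2}\, dN_i(t) \right]^{-1}
    \sum_{i=1}^n \int_0^\tau W(t)\,
      \{X_i(t) - \bar X(t;\hat\gamma)\}\,
      \{Y_i(t) - \bar Y^*(t;\hat\gamma)\}\, dN_i(t)
```

Since $dN_i(t)$ is nonzero only at the observed visit times, each integral is a finite sum over visits, so the estimator has the form of a weighted least-squares fit of centered outcomes on centered covariates.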

2.2.2. Method under (M1) and (M2).

Bůžková and Lumley [14] relax the assumption of (M1) by addressing the dependence between the outcome and observation-time processes through observation-time model covariates. The set of covariates $Z_i(t)$ may include the outcome-model covariates $X_i(t)$ and past outcomes.

The Bůžková method uses inverse intensity rate ratio (IIRR)-weighted estimators to estimate $\beta$ in the outcome model $E[Y_i(t) \mid X_i(t)] = \mu(t) + \beta^\top X_i(t)$. The observation-level inverse weights standardize the observed data to the time-specific underlying population under the proportional rate model for observation times $E[dN_i(t) \mid Z_i(t)] = \exp\{\gamma^\top Z_i(t)\}\,d\Lambda(t)$. Inverse weighting has also been shown to reduce bias when cluster size is informative (i.e., the outcomes measured among clustered units are not independent of cluster size) [23, 24] and when missing data are missing at random (i.e., missingness depends only on observed covariates and outcomes) [25, 26]. One particular weight with variance-stabilizing properties is:

$$\rho_i(t;\gamma,\delta) = \frac{\exp\{\gamma^\top Z_i(t)\}}{\exp\{\delta^\top X_i(t)\}},$$

for which $\delta$ is estimated by $\hat\delta$ using (3) conditioning on $X_i(t)$ instead of $Z_i(t)$. The proposed estimating equation for $\beta$ is:

$$U(\beta;\hat\gamma,\delta) = \sum_{i=1}^n \int_0^\tau \frac{W(t)}{\rho_i(t;\hat\gamma,\delta)}\,\{X_i(t) - \bar X(t;\delta)\}\,\bigl[Y_i(t) - \bar Y^*(t;\delta) - \beta^\top\{X_i(t) - \bar X(t;\delta)\}\bigr]\,dN_i(t) = 0,$$

in which:

$$\bar X(t;\delta) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,X_i(t)}{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}},$$

and:

$$\bar Y^*(t;\delta) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,Y_i^*(t)}{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}}.$$

If $Z_i(t) = X_i(t)$, then $\rho_i(t;\gamma,\delta) = 1$, and the Bůžková method reduces to the Lin method (Section 2.2.1). The IIRR-weighted estimates are asymptotically consistent and normal, but the validity of the proposed IIRR-weighted estimator is contingent upon correct specification of $Z_i(t)$ in the observation-time model [14].
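The following minimal sketch computes the intensity rate ratio for a single observation and the corresponding inverse weight. The covariate and coefficient values are hypothetical, and the helper names are ours; the point is only to show that the ratio equals 1, so that weighting has no effect, when the observation-time model contains nothing beyond the outcome-model covariates.

```python
import math


def linear_predictor(coef, covariates):
    """Inner product coef' covariates."""
    return sum(c * x for c, x in zip(coef, covariates))


def intensity_rate_ratio(gamma, z, delta, x):
    """exp(gamma' Z_i(t)) / exp(delta' X_i(t)) for one observation."""
    return math.exp(linear_predictor(gamma, z) - linear_predictor(delta, x))


# Hypothetical values: Z adds one covariate to the outcome-model covariate X.
x = [0.4]
z = [0.4, 1.5]
delta = [0.5]
gamma = [0.5, 0.2]

rho = intensity_rate_ratio(gamma, z, delta, x)
weight = 1.0 / rho  # inverse weight applied to this observation
print(round(rho, 4), round(weight, 4))

# With Z = X and the same coefficients, the ratio is exactly 1.
print(intensity_rate_ratio(delta, x, delta, x))
```

Observations made when the extra covariate raises the visit intensity receive weights below 1, which is how over-represented observations are down-weighted.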

2.2.3. Methods under (M1) and (M3).

The following two methods accommodate subject-specific observation-time processes with arbitrary visit patterns through the use of latent variables. The Liang method [15] specifies a semi-parametric mixed-effects outcome model:

$$E[Y_i(t) \mid X_i(t), Q_i(t)] = \mu(t) + \beta^\top X_i(t) + \eta_{i1}^\top Q_i(t), \tag{8}$$

in which $\eta_{i1}$ is a vector of unobserved subject-specific latent variables and $Q_i(t)$ is a subset of $X_i(t)$. The observation-time process is modeled as $\lambda_i(t) = \eta_{i2}\,\lambda(t)\exp(\gamma^\top V_i)$, and $V_i$ is a subset of baseline covariates in $X_i(t)$. The Gamma-distributed latent variable $\eta_{i2}$ is independent of $V_i$, $E[\eta_{i2}] = 1$, and $\mathrm{Var}(\eta_{i2}) = \sigma^2$ is unknown. The relationship between $\eta_{i1}$ and $\eta_{i2}$ is defined by the conditional expectation $E[\eta_{i1} \mid \eta_{i2}] = \theta(\eta_{i2} - 1)$, so $\theta$ describes the magnitude and direction of the association between the outcome and observation-time processes. Note that the marginal expectation of $\eta_{i1}$ is 0. The linear link between $\eta_{i1}$ and $\eta_{i2}$ can also be extended to other specified link functions [15]. When $\eta_{i1} = 0$, the Liang method reduces to the Lin method (Section 2.2.1).

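Conditioning on $\eta_{i2}$, the observation-time process is a non-homogeneous Poisson process such that $m_i$ has a Poisson distribution with mean $\eta_{i2}\exp(\gamma^\top V_i)\,\Lambda(C_i)$. The cumulative baseline intensity function $\Lambda(t)$ can be consistently estimated by the Aalen-Breslow-type estimator $\hat\Lambda(t) = \hat\Lambda(t,\hat\gamma)$:

$$\hat\Lambda(t,\hat\gamma) = \sum_{i=1}^n \int_0^t \frac{dN_i(s)}{\sum_{j=1}^n \xi_j(s)\exp(\hat\gamma^\top V_j)},$$

for which $\gamma$ is estimated by (3) conditioning on the baseline covariates $V_i$. Given $(C_i, m_i, \eta_{i2})$, the observation times $T_{i1}, T_{i2}, \ldots, T_{im_i}$ are the order statistics of a set of independently and identically distributed random variables with the density function:

$$\prod_{i=1}^n p(t_{i1}, t_{i2}, \ldots, t_{im_i} \mid C_i, m_i, \eta_{i2}) = \prod_{i=1}^n m_i! \prod_{j=1}^{m_i} \frac{d\Lambda(t_{ij})}{\Lambda(C_i)}.$$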

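Hence, $E[dN_i(t) \mid C_i, m_i, \eta_{i2}] = \xi_i(t)\,m_i\,d\Lambda(t)/\Lambda(C_i)$. It follows that:

$$E[\{Y_i(t) - \beta^\top X_i(t)\}\,dN_i(t) \mid C_i, m_i] = E\bigl[E[\{\mu(t) + \eta_{i1}^\top Q_i(t) + \epsilon_i(t)\}\,dN_i(t) \mid C_i, m_i, \eta_{i2}] \bigm| C_i, m_i\bigr] = \mu(t)\,\xi_i(t)\,\frac{m_i\,d\Lambda(t)}{\Lambda(C_i)} + \theta^\top Q_i(t)\,E[\eta_{i2} - 1 \mid C_i, m_i]\,E[dN_i(t) \mid C_i, m_i]. \tag{9}$$

We define $B_i(t) = Q_i(t)\,E[\eta_{i2} - 1 \mid C_i, m_i]$ as a covariate based on the subject-specific propensity of visit, and $\mathcal{A}(t) = \int_0^t \mu(s)\,d\Lambda(s)$. Then (9) can be expressed as:

$$E[\{Y_i(t) - \beta^\top X_i(t) - \theta^\top B_i(t)\}\,dN_i(t) \mid C_i, m_i] = \xi_i(t)\,\frac{m_i}{\Lambda(C_i)}\,d\mathcal{A}(t).$$

We can then formulate the zero-mean process:

$$M_{i2}(t;\mathcal{A},\beta,\theta,\gamma) = \int_0^t \{Y_i(s) - \beta^\top X_i(s) - \theta^\top B_i(s)\}\,dN_i(s) - \int_0^t \xi_i(s)\,\frac{m_i}{\Lambda(C_i)}\,d\mathcal{A}(s), \tag{10}$$

and define the set of estimating equations based on (10) to estimate $\mathcal{A}(t)$, $\beta$, and $\theta$ simultaneously:

$$\sum_{i=1}^n M_{i2}(t;\beta,\theta,\gamma) = 0, \quad 0 < t \le \tau, \tag{11}$$
$$\sum_{i=1}^n \int_0^\tau \begin{pmatrix} X_i(t) \\ \hat B_i(t) \end{pmatrix} dM_{i2}(t;\beta,\theta,\gamma) = 0. \tag{12}$$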

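The closed-form expression for $\mathcal{A}(t)$ in (11) replaces $\mathcal{A}(t)$ in (12), so $\beta$ and $\theta$ can be consistently estimated using the class of estimating equations [15]:

$$U(\beta,\theta;\hat\Lambda,\hat B) = \sum_{i=1}^n \int_0^\tau \begin{pmatrix} X_i(t) - \bar X(t) \\ \hat B_i(t) - \bar{\hat B}(t) \end{pmatrix} \{Y_i(t) - \beta^\top X_i(t) - \theta^\top \hat B_i(t)\}\,dN_i(t) = 0,$$

for which:

$$\bar X(t) = \frac{\sum_{i=1}^n \xi_i(t)\,X_i(t)\,m_i/\hat\Lambda(C_i)}{\sum_{i=1}^n \xi_i(t)\,m_i/\hat\Lambda(C_i)},$$

and:

$$\bar{\hat B}(t) = \frac{\sum_{i=1}^n \xi_i(t)\,\hat B_i(t)\,m_i/\hat\Lambda(C_i)}{\sum_{i=1}^n \xi_i(t)\,m_i/\hat\Lambda(C_i)}.$$

To estimate $B_i(t)$, the conditional expectation of $\eta_{i2}$ given $(C_i, m_i)$ is required. If we assume that $\eta_{i2}$ is Gamma distributed with mean 1 and variance $\sigma^2$, the conditional expectation of $\eta_{i2}$ can be expressed as:

$$E[\eta_{i2} \mid C_i, m_i] = \frac{1 + m_i\,\sigma^2}{1 + \exp(\gamma^\top V_i)\,\Lambda(C_i)\,\sigma^2}.$$

The covariate $B_i(t)$ can thus be estimated by:

$$\hat B_i(t) = \left\{\frac{1 + m_i\,\hat\sigma^2}{1 + \exp(\hat\gamma^\top V_i)\,\hat\Lambda(C_i)\,\hat\sigma^2} - 1\right\} Q_i(t),$$

for which $\hat\sigma^2$ is a consistent estimator of $\sigma^2$ defined as:

$$\hat\sigma^2 = \max\left\{\frac{\sum_{i=1}^n \{m_i^2 - m_i - \exp(2\hat\gamma^\top V_i)\,\hat\Lambda^2(C_i)\}}{\sum_{i=1}^n \exp(2\hat\gamma^\top V_i)\,\hat\Lambda^2(C_i)},\ 0\right\}. \tag{13}$$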
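The sketch below evaluates the Gamma-frailty quantities numerically, reading the conditional expectation as $(1 + m_i\sigma^2)/\{1 + \exp(\gamma^\top V_i)\Lambda(C_i)\sigma^2\}$ and (13) as a truncated moment estimator. The visit counts and the expected counts `a[i]`, which stand for $\exp(\hat\gamma^\top V_i)\hat\Lambda(C_i)$, are hypothetical, and the function names are ours.

```python
def frailty_variance(m, a):
    """Moment estimator of the frailty variance, truncated at zero.

    m[i] is the number of visits for subject i; a[i] stands for
    exp(gamma' V_i) * Lambda(C_i), the expected count when the frailty is 1.
    """
    numerator = sum(mi * mi - mi - ai * ai for mi, ai in zip(m, a))
    denominator = sum(ai * ai for ai in a)
    return max(numerator / denominator, 0.0)


def frailty_posterior_mean(mi, ai, sigma2):
    """Conditional mean of a mean-1 Gamma frailty given the visit count."""
    return (1.0 + mi * sigma2) / (1.0 + ai * sigma2)


def b_hat(mi, ai, sigma2, q):
    """Estimated propensity covariate for a scalar Q_i(t) = q."""
    return (frailty_posterior_mean(mi, ai, sigma2) - 1.0) * q


# Hypothetical data: four subjects who would each average 4 visits.
m = [2, 9, 4, 1]
a = [4.0, 4.0, 4.0, 4.0]

sigma2 = frailty_variance(m, a)
print(sigma2)
for mi, ai in zip(m, a):
    print(mi, round(b_hat(mi, ai, sigma2, q=1.0), 4))
```

Subjects seen more often than expected get a positive covariate value and subjects seen less often get a negative one, while a subject whose count matches the expected count gets zero.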

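Similar to the Liang method, the Sun method [16] accommodates (M3). In contrast to the Liang method, the distribution of the latent variable is completely unspecified, and the same latent variable $\eta_i$ is shared between the outcome and observation-time models. The Sun method specifies the semi-parametric marginal model:

$$E[Y_i(t) \mid X_i(t), \eta_i] = \mu(t) + \beta^\top X_i(t) + \alpha\,\eta_i. \tag{14}$$

Similar to $\theta$ in the Liang method, $\alpha$ parameterizes the correlation between the outcome and observation-time processes. If $\alpha = 0$, then the Sun method reduces to the Lin method.

Conditioning on $\eta_i$, $N_i(t)$ is a non-homogeneous Poisson process with intensity function $\lambda_i(t) = \eta_i\,\lambda(t)\exp\{\gamma^\top X_i(t)\}$. The distribution of $\eta_i$ under the Sun method may depend on observed time-independent outcome-model covariates $V_i$ with $E[\eta_i \mid V_i] = 1$. Discussion regarding covariate-dependent latent variables or frailties can be found in recent literature [27–30]. Let $\hat\pi(t;X_i) = \int_0^t \exp\{\hat\gamma^\top X_i(u)\}\,d\hat\Lambda(u)$, $\hat\eta_i = (m_i - 1)/\hat\pi(C_i;X_i)$, and $\hat\Omega_i = (m_i - 1)(m_i - 2)/\hat\pi(C_i;X_i)^2$. The class of estimating equations for $\beta$ and $\alpha$ has the form:

$$U_1(\beta,\alpha;\gamma) = \sum_{i=1}^n \int_0^\tau W(t)\,\{X_i(t) - \bar X(t;\gamma)\}\,\{Y_i(t) - \beta^\top X_i(t) - \alpha\,\hat\eta_i\}\,dN_i(t) = 0,$$

and:

$$U_2(\beta,\alpha;\gamma) = \sum_{i=1}^n \int_0^\tau W(t)\,\bigl[\{\hat\eta_i - \bar\eta(t;\gamma)\}\,\{Y_i(t) - \beta^\top X_i(t)\} - \alpha\,\{\hat\Omega_i - \hat\eta_i\,\bar\eta(t;\gamma)\}\bigr]\,dN_i(t) = 0,$$

for which:

$$\bar X(t;\gamma) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,X_i(t)\,m_i/\hat\pi(C_i;X_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,m_i/\hat\pi(C_i;X_i)},$$

and:

$$\bar\eta(t;\gamma) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,\hat\eta_i\,m_i/\hat\pi(C_i;X_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,m_i/\hat\pi(C_i;X_i)}.$$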
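One way to motivate the plug-in quantities of the Sun method, reading them as $\hat\eta_i = (m_i - 1)/\hat\pi(C_i;X_i)$ and $\hat\Omega_i = (m_i - 1)(m_i - 2)/\hat\pi(C_i;X_i)^2$, is that the estimating equations sum over visits, so a subject with $m_i$ visits contributes $m_i$ times. The sketch below checks, by exact summation over a Poisson distribution, that under this visit-level weighting the two quantities average to $\eta$ and $\eta^2$, whereas the naive ratio $m_i/\pi$ does not. This interpretation, the parameter values, and the function names are ours and are offered only as an illustration.

```python
import math


def poisson_pmf(m, mean):
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(-mean + m * math.log(mean) - math.lgamma(m + 1))


def visit_weighted_mean(stat, eta, pi, m_max=200):
    """E[m * stat(m)] / E[m] for m ~ Poisson(eta * pi).

    This is the average of stat(m) taken over visits rather than over
    subjects, because a subject with m visits is counted m times.
    """
    mean = eta * pi
    numerator = sum(poisson_pmf(m, mean) * m * stat(m) for m in range(m_max + 1))
    denominator = sum(poisson_pmf(m, mean) * m for m in range(m_max + 1))
    return numerator / denominator


eta, pi = 1.4, 6.0  # hypothetical frailty and cumulative intensity

eta_hat_mean = visit_weighted_mean(lambda m: (m - 1) / pi, eta, pi)
omega_hat_mean = visit_weighted_mean(lambda m: (m - 1) * (m - 2) / pi**2, eta, pi)
naive_mean = visit_weighted_mean(lambda m: m / pi, eta, pi)

print(round(eta_hat_mean, 6), round(omega_hat_mean, 6), round(naive_mean, 6))
```

The naive ratio averages to $\eta + 1/\pi$ under visit-level weighting, which is the bias that subtracting 1 from the count removes.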

2.3. Extensions

2.3.1. Extension to Liang method to accommodate time-dependent covariates.

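The estimation procedure of Liang et al. [15] allows adjustment for time-independent covariates in the observation-time model. Here, we extend the Liang method to accommodate time-dependent covariates. Let $\hat\pi(t;V_i) = \int_0^t \exp\{\hat\gamma^\top V_i(u)\}\,d\hat\Lambda(u)$. The class of estimating equations for $\beta$ and $\theta$ permitting time-dependent covariates in the observation-time model has the form:

$$U(\beta,\theta;\hat\Lambda,\hat B) = \sum_{i=1}^n \int_0^\tau \begin{pmatrix} X_i(t) - \bar X(t) \\ \hat B_i(t) - \bar{\hat B}(t) \end{pmatrix} \{Y_i(t) - \beta^\top X_i(t) - \theta^\top \hat B_i(t)\}\,dN_i(t) = 0,$$

for which:

$$\bar X(t) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,X_i(t)\,m_i/\hat\pi(C_i;X_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,m_i/\hat\pi(C_i;X_i)},$$
$$\bar{\hat B}(t) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,\hat B_i(t)\,m_i/\hat\pi(C_i;X_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\gamma^\top X_i(t)\}\,m_i/\hat\pi(C_i;X_i)},$$

and $\hat B_i(t)$ can be estimated as before by replacing $\hat\Lambda(C_i)$ with $m_i/\hat\pi(C_i;V_i)$. We provide details on consistency and asymptotic normality of the estimators in Appendix A.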

2.3.2. Weighted-Liang and weighted-Sun methods.

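We propose extensions to the Liang and Sun methods to offer additional flexibility when parameterizing outcome-observation dependence under both (M2) and (M3). Recall that we denote $X_i(t)$ as the outcome-model covariates and $Z_i(t)$ as the observation-time model covariates. With the inclusion of observation-level weights based on $\rho_i(t;\hat\gamma,\delta)$, the set of estimating equations for the weighted-Liang method can be expressed as:

$$U(\beta,\theta;\hat\Lambda,\hat B) = \sum_{i=1}^n \int_0^\tau \frac{W(t)}{\rho_i(t;\hat\gamma,\delta)} \begin{pmatrix} X_i(t) - \bar X(t) \\ \hat B_i(t) - \bar{\hat B}(t) \end{pmatrix} \{Y_i(t) - \beta^\top X_i(t) - \theta^\top \hat B_i(t)\}\,dN_i(t) = 0,$$

for which:

$$\bar X(t) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,X_i(t)\,m_i/\hat\pi(C_i;Z_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,m_i/\hat\pi(C_i;Z_i)},$$

and:

$$\bar{\hat B}(t) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,\hat B_i(t)\,m_i/\hat\pi(C_i;Z_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,m_i/\hat\pi(C_i;Z_i)}.$$

Similarly, the set of estimating functions for the weighted-Sun method is:

$$U_1(\beta,\alpha;\hat\gamma) = \sum_{i=1}^n \int_0^\tau \frac{W(t)}{\rho_i(t;\hat\gamma,\delta)}\,\{X_i(t) - \bar X(t)\}\,\{Y_i(t) - \beta^\top X_i(t) - \alpha\,\hat\eta_i\}\,dN_i(t) = 0,$$

and:

$$U_2(\beta,\alpha;\hat\gamma) = \sum_{i=1}^n \int_0^\tau \frac{W(t)}{\rho_i(t;\hat\gamma,\delta)}\,\bigl[\{\hat\eta_i - \bar\eta(t)\}\,\{Y_i(t) - \beta^\top X_i(t)\} - \alpha\,\{\hat\Omega_i - \hat\eta_i\,\bar\eta(t)\}\bigr]\,dN_i(t) = 0,$$

for which $\hat\eta_i = (m_i - 1)/\hat\pi(C_i;Z_i)$,

$$\bar\eta(t) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,\hat\eta_i\,m_i/\hat\pi(C_i;Z_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,m_i/\hat\pi(C_i;Z_i)},$$

and:

$$\bar X(t) = \frac{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,X_i(t)\,m_i/\hat\pi(C_i;Z_i)}{\sum_{i=1}^n \xi_i(t)\exp\{\delta^\top X_i(t)\}\,m_i/\hat\pi(C_i;Z_i)}.$$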

We provide details on consistency and asymptotic normality of our extensions in Appendices A and B.

2.4. Summary

In this section, we formulated a semi-parametric linear regression model to evaluate the marginal association between covariates and a continuous outcome of interest in the presence of outcome-dependent observation times. We presented a framework of outcome-observation dependence mechanisms. The Lin method is the most restrictive of the reviewed methods, because it is suitable only for the stronger assumption of (M1); the Bůžková method accommodates (M2) and reduces to (M1) when the additional covariates in the observation-time model are not required; the Liang and Sun methods accommodate (M3), with (M1) as a special case. We proposed two methods, the weighted-Liang and weighted-Sun methods, which offer considerable flexibility in that they can accommodate all (or any combination) of the three outcome-observation dependence mechanisms. We note that standard error estimation for all methods is most easily obtained using bootstrap procedures; in this setting, a cluster bootstrap, in which subjects are sampled with replacement, is required [31,32]. The resampling of subjects assumes that the correlation structure within each subject is retained [33,34]. R code for the cluster-bootstrap procedure is included in the Appendix. In subsequent sections, we evaluate the statistical properties of these methods in simulation studies (Section 3). We illustrate their application and propose model verification in a case study (Section 4).
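The cluster bootstrap described above can be sketched as follows. This Python illustration is not the R code referred to in the Appendix; the toy data, the pooled least-squares slope used as a stand-in estimator, and all names are hypothetical. The essential step is that whole subjects, with all of their visits, are drawn with replacement.

```python
import random
import statistics


def resample_subjects(subjects, rng):
    """Draw len(subjects) whole subjects with replacement."""
    n = len(subjects)
    return [subjects[rng.randrange(n)] for _ in range(n)]


def pooled_slope(subjects):
    """Placeholder estimator: least-squares slope of y on x over all visits."""
    pairs = [pair for records in subjects for pair in records]
    x_mean = sum(x for x, _ in pairs) / len(pairs)
    y_mean = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    return sxy / sxx


def cluster_bootstrap_se(subjects, estimator, n_boot, rng):
    """Standard deviation of the estimator across subject-level resamples."""
    estimates = [estimator(resample_subjects(subjects, rng)) for _ in range(n_boot)]
    return statistics.stdev(estimates)


# Toy data: each subject is a list of (x, y) visits sharing a random shift,
# which induces within-subject correlation.
rng = random.Random(7)
subjects = []
for _ in range(30):
    shift = rng.gauss(0.0, 1.0)
    records = []
    for _ in range(rng.randint(2, 6)):
        x = rng.random()
        records.append((x, 1.0 * x + shift + rng.gauss(0.0, 0.5)))
    subjects.append(records)

se = cluster_bootstrap_se(subjects, pooled_slope, n_boot=200, rng=rng)
print(round(se, 3))
```

In an application, `pooled_slope` would be replaced by a function that refits the observation-time model and the chosen outcome estimating equation on each resample, so that uncertainty in the estimated nuisance parameters is carried through.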

3. Simulation study

We evaluated the statistical properties of the reviewed methods through simulation studies under two outcome-observation dependence settings: (i) (M2) and (ii) (M2) and (M3). All simulations were conducted in R 2.13.1 (R Development Core Team, Vienna, Austria).

3.1. Setting 1: simulations under (M2)

3.1.1. Parameters.

In this setting, we used covariates to induce correlation between the outcome and observation-time processes. Following the simulation procedure of Bůžková and Lumley [14], we generated continuous outcomes at each of 1000 iterations using the linear mixed-effects model:

$$Y_i(t) = \mu(t) + \beta_1 X_{i1}(t) + \beta_2\{X_{i2} - E[X_{i2} \mid X_{i1}]\} + \epsilon_i(t), \tag{15}$$

for which $\mu(t) = t$, $\epsilon_i(t) \sim \mathrm{Normal}(0,1)$, and $\beta_1$ was the target of inference. The time-dependent covariate $X_{i1}(t) = X_{i1}\log(t)$ was a known function of time, in which $X_{i1}$ followed a Uniform[0,1] distribution. The time-independent covariate $X_{i2}$ was drawn from a mixture distribution, for which $X_{i2} \sim \mathrm{Normal}(2,1)$ if $X_{i1} \le 0.5$ and $X_{i2} \sim \mathrm{Normal}(0,4)$ if $X_{i1} > 0.5$. Hence, $X_{i2}$ in model (15) influenced the covariate-outcome association of $X_{i1}(t)$. To ensure proper marginalization of model (15), $X_{i2}$ was centered by its conditional mean given $X_{i1}$, resulting in the marginal semi-parametric outcome model:

$$E[Y_i(t) \mid X_i(t)] = \mu(t) + \beta_1 X_{i1}(t). \tag{16}$$

We generated the observation times $T_{ik}$ following a non-homogeneous Poisson process with intensity function $\lambda_i(t) = \eta_i\,\lambda(t)\exp\{\gamma_1 X_{i1}(t) + \gamma_2 X_{i2}\}$. Note that $X_{i2}$ induced additional correlation between the outcome and observation-time processes. We set λ(t)=t2 and generated the latent variable $\eta_i$ from a Gamma distribution with mean 1 and variance $\sigma_\eta^2 = 0.5$. The independent censoring time $C_i$ was generated from Uniform[5,10]. We considered various combinations of outcome parameters ($\beta_1 = 1$, $\beta_2 \in \{0, 0.3, 1\}$) and intensity parameters ($\gamma_1 = 0.5$, $\gamma_2 \in \{0, -0.2, 0.5\}$, as tabulated in Table I). When $\beta_2 = 0$ and $\gamma_2 = 0$, the outcome-observation dependence model satisfied (M1); when $\gamma_2 \ne 0$, the outcome-observation dependence model satisfied (M2).
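The outcome-generating step of this setting can be sketched as below. The fixed evaluation times are hypothetical stand-ins for visit times that would come from the Poisson process, Normal(0,4) is assumed to denote variance 4 (standard deviation 2), and the function name is ours. The check at the end illustrates why centering $X_{i2}$ by its conditional mean leaves the marginal model (16) free of $X_{i2}$.

```python
import math
import random


def draw_subject(rng, beta1=1.0, beta2=0.3, times=(1.0, 2.0, 4.0)):
    """Draw covariates and outcomes for one subject following model (15).

    The evaluation times are fixed here for illustration only.
    """
    x1 = rng.random()  # Uniform[0, 1)
    if x1 <= 0.5:
        x2, x2_cond_mean = rng.gauss(2.0, 1.0), 2.0
    else:
        # Assumption: Normal(0, 4) means variance 4, so the sd is 2.
        x2, x2_cond_mean = rng.gauss(0.0, 2.0), 0.0
    outcomes = []
    for t in times:
        x1_t = x1 * math.log(t)  # time-dependent covariate X_i1(t)
        y = t + beta1 * x1_t + beta2 * (x2 - x2_cond_mean) + rng.gauss(0.0, 1.0)
        outcomes.append((t, x1_t, y))
    return x1, x2, outcomes


rng = random.Random(0)
x1, x2, outcomes = draw_subject(rng)
print(round(x1, 3), round(x2, 3), [round(y, 3) for _, _, y in outcomes])

# The centered covariate averages to roughly zero within the population,
# so the beta2 term drops out of the marginal mean.
draws = [draw_subject(rng) for _ in range(20000)]
centered = [b - (2.0 if a <= 0.5 else 0.0) for a, b, _ in draws]
print(round(sum(centered) / len(centered), 3))
```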

3.1.2. Results.

Table I provides the estimated bias, empirical standard error estimates, and mean-squared error estimates for estimation of $\beta_1$ in model (16). Recall that the Lin, Liang, and Sun methods estimate $\beta_1$ without accounting for $X_{i2}$ in any way, whereas the Bůžková, weighted-Liang, and weighted-Sun methods incorporate the effect of $X_{i2}$ through observation-level weights. As anticipated, all six methods yielded approximately unbiased parameter estimates for $\beta_1$ if (M1) was satisfied ($\gamma_2 = 0$), that is, when the outcome process was conditionally independent of the observation-time process given outcome-model covariates. The Lin, Bůžková, weighted-Liang, and weighted-Sun estimates of $\beta_1$ were comparable in bias and efficiency to both the Liang and Sun estimators. However, if (M1) was violated ($\gamma_2 \ne 0$), that is, the source of additional correlation between the two processes was induced by an additional covariate $X_{i2}$, then only the Bůžková, weighted-Liang, and weighted-Sun methods performed well, with negligible biases in all settings. When $\beta_2 = 0$ and $\gamma_2 \ne 0$, all methods performed well because $X_{i2}$ was not associated with the outcome. As $\beta_2$ increased, the biases of Lin, Liang, and Sun estimates for $\beta_1$ increased. A positive value of $\gamma_2$ with positive values of $X_{i2}$ led to more observations per subject, which increased efficiency in the estimation of $\beta_1$ in most cases.

Table I.

Simulation results for $\beta_1$ under (M2). Liang (ext.) denotes the Liang method with the extension for time-dependent covariates; W-Liang and W-Sun denote the weighted-Liang and weighted-Sun extensions.

| n | β2 | γ2 | Lin Bias | Lin ESE | Lin MSE | Bůžková Bias | Bůžková ESE | Bůžková MSE | Liang (ext.) Bias | Liang (ext.) ESE | Liang (ext.) MSE | W-Liang Bias | W-Liang ESE | W-Liang MSE | W-Liang ERE(a) | Sun Bias | Sun ESE | Sun MSE | W-Sun Bias | W-Sun ESE | W-Sun MSE | W-Sun ERE(a) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0 | 0 | 0.003 | 0.284 | 0.081 | 0.003 | 0.284 | 0.080 | 0.004 | 0.280 | 0.078 | −0.006 | 0.393 | 0.155 | 1.915 | 0.004 | 0.282 | 0.079 | −0.006 | 0.393 | 0.155 | 1.915 |
| 100 | 0 | −0.2 | −0.001 | 0.292 | 0.085 | −0.002 | 0.289 | 0.084 | 0.003 | 0.287 | 0.082 | −0.012 | 0.428 | 0.183 | 2.193 | 0.003 | 0.288 | 0.083 | −0.013 | 0.428 | 0.183 | 2.193 |
| 100 | 0 | 0.5 | −0.020 | 0.342 | 0.118 | −0.009 | 0.289 | 0.084 | −0.018 | 0.324 | 0.105 | −0.007 | 0.379 | 0.143 | 1.720 | −0.018 | 0.329 | 0.109 | −0.008 | 0.379 | 0.144 | 1.724 |
| 100 | 0.3 | 0 | 0.003 | 0.318 | 0.101 | 0.002 | 0.313 | 0.098 | 0.004 | 0.313 | 0.098 | −0.008 | 0.411 | 0.169 | 1.724 | 0.004 | 0.315 | 0.099 | −0.007 | 0.411 | 0.169 | 1.313 |
| 100 | 0.3 | −0.2 | −0.136 | 0.336 | 0.132 | −0.010 | 0.332 | 0.110 | −0.098 | 0.323 | 0.114 | −0.023 | 0.467 | 0.219 | 1.979 | −0.119 | 0.328 | 0.121 | −0.024 | 0.467 | 0.219 | 1.979 |
| 100 | 0.3 | 0.5 | 0.336 | 0.399 | 0.273 | −0.003 | 0.323 | 0.104 | 0.231 | 0.352 | 0.177 | 0.002 | 0.385 | 0.149 | 1.421 | 0.300 | 0.367 | 0.225 | 0.011 | 0.393 | 0.155 | 1.480 |
| 100 | 1 | 0 | 0.003 | 0.543 | 0.294 | −0.001 | 0.521 | 0.271 | 0.004 | 0.536 | 0.288 | −0.010 | 0.577 | 0.333 | 1.227 | 0.004 | 0.539 | 0.290 | −0.010 | 0.577 | 0.333 | 1.227 |
| 100 | 1 | −0.2 | −0.452 | 0.624 | 0.594 | −0.029 | 0.587 | 0.345 | −0.333 | 0.555 | 0.419 | −0.048 | 0.687 | 0.475 | 1.370 | −0.402 | 0.582 | 0.500 | −0.050 | 0.686 | 0.473 | 1.366 |
| 100 | 1 | 0.5 | 1.167 | 0.769 | 1.954 | 0.011 | 0.558 | 0.312 | 0.811 | 0.549 | 0.960 | 0.022 | 0.560 | 0.314 | 1.007 | 1.043 | 0.629 | 1.484 | 0.054 | 0.599 | 0.362 | 1.152 |
| 200 | 0 | 0 | −0.004 | 0.203 | 0.041 | −0.003 | 0.203 | 0.041 | −0.002 | 0.202 | 0.041 | −0.010 | 0.284 | 0.081 | 1.957 | −0.003 | 0.202 | 0.041 | −0.010 | 0.285 | 0.081 | 1.971 |
| 200 | 0 | −0.2 | −0.004 | 0.213 | 0.045 | −0.004 | 0.209 | 0.044 | −0.001 | 0.210 | 0.044 | −0.009 | 0.304 | 0.092 | 2.116 | −0.001 | 0.211 | 0.044 | −0.009 | 0.304 | 0.092 | 2.116 |
| 200 | 0 | 0.5 | 0.001 | 0.258 | 0.067 | 0.004 | 0.202 | 0.041 | −0.003 | 0.237 | 0.056 | 0.002 | 0.269 | 0.072 | 1.773 | −0.001 | 0.243 | 0.059 | 0.001 | 0.269 | 0.072 | 1.773 |
| 200 | 0.3 | 0 | −0.007 | 0.227 | 0.052 | −0.007 | 0.224 | 0.050 | −0.006 | 0.225 | 0.051 | −0.014 | 0.299 | 0.090 | 1.782 | −0.006 | 0.226 | 0.051 | −0.014 | 0.299 | 0.090 | 1.782 |
| 200 | 0.3 | −0.2 | −0.140 | 0.246 | 0.080 | −0.008 | 0.237 | 0.056 | −0.099 | 0.235 | 0.065 | −0.013 | 0.329 | 0.108 | 1.927 | −0.121 | 0.239 | 0.072 | −0.013 | 0.329 | 0.108 | 1.927 |
| 200 | 0.3 | 0.5 | 0.378 | 0.305 | 0.236 | 0.006 | 0.220 | 0.048 | 0.255 | 0.255 | 0.130 | 0.007 | 0.271 | 0.073 | 1.517 | 0.326 | 0.267 | 0.178 | 0.010 | 0.270 | 0.073 | 1.506 |
| 200 | 1 | 0 | −0.014 | 0.389 | 0.152 | −0.015 | 0.374 | 0.140 | −0.015 | 0.386 | 0.149 | −0.023 | 0.421 | 0.178 | 1.267 | −0.014 | 0.388 | 0.150 | −0.023 | 0.421 | 0.178 | 1.267 |
| 200 | 1 | −0.2 | −0.457 | 0.462 | 0.422 | −0.016 | 0.420 | 0.177 | −0.328 | 0.405 | 0.272 | −0.023 | 0.486 | 0.236 | 1.339 | −0.401 | 0.428 | 0.344 | −0.023 | 0.485 | 0.236 | 1.333 |
| 200 | 1 | 0.5 | 1.257 | 0.602 | 1.943 | 0.012 | 0.369 | 0.136 | 0.856 | 0.389 | 0.884 | 0.018 | 0.379 | 0.144 | 1.055 | 1.089 | 0.449 | 1.386 | 0.031 | 0.377 | 0.143 | 1.044 |

Bias, $\hat\beta_1 - \beta_1$ with $\beta_1 = 1$; ESE, empirical standard error; MSE, mean-squared error; ERE, estimated relative efficiency.

(a) Estimated relative efficiency was calculated for unbiased estimators with the variance of the Bůžková parameter estimate in the denominator.

In this setting, we also quantified the price of assuming (M3) when the latent variable was unnecessary. We calculated the estimated relative efficiency (ERE) of unbiased estimators with the estimated variance of the weighted-Liang and weighted-Sun methods in the numerator and the estimated variance of the Bůžková method in the denominator. The ERE indicated that the loss of efficiency was reasonable and comparable between the weighted-Liang and weighted-Sun methods. As $\beta_2$ increased (i.e., the dependence between the outcome and observation-time models increased), the ERE decreased. In addition, we calculated the ERE of IIRR-weighted versus unweighted methods to investigate the loss of efficiency due to inclusion of the additional covariate $X_{i2}$ when none was needed (Appendix C). The EREs between the Bůžková and Lin methods were close to 1 under all scenarios. The loss of efficiency was greater for the weighted-Liang and weighted-Sun methods but decreased as the number of observations increased (i.e., greater $\gamma_2$) and when $\beta_2$ increased.
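As a quick arithmetic check, the ERE values reported in Table I appear to be reproducible as the ratio of squared empirical standard errors, with the Bůžková method in the denominator. The snippet below recomputes two entries from the rows for n = 100 and β2 = 0; reading the ERE as a ratio of squared ESEs is our interpretation of the definition, and the function name is hypothetical.

```python
def estimated_relative_efficiency(ese_numerator, ese_denominator):
    """Ratio of estimated variances, computed from empirical standard errors."""
    return (ese_numerator / ese_denominator) ** 2


# Weighted-Liang versus Buzkova, first row of Table I (ESE 0.393 vs 0.284).
print(round(estimated_relative_efficiency(0.393, 0.284), 3))  # 1.915

# Weighted-Liang versus Buzkova, second row of Table I (ESE 0.428 vs 0.289).
print(round(estimated_relative_efficiency(0.428, 0.289), 3))  # 2.193
```

An ERE near 2 means that the weighted latent-variable estimator has roughly twice the variance of the Bůžková estimator in that scenario.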

3.2. Setting 2: simulation under (M2) and (M3)

3.2.1. Parameters.

In the previous setting, we focused on outcome-observation dependence induced through covariates. In this setting, we focus on estimation of β1 under various forms of latent variable structures. To simulate data under both (M2) and (M3), we generated outcomes at each of 1000 iterations using the linear mixed-effects model:

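$$Y_i(t) = \mu(t) + \beta_1 X_{i1}(t) + \beta_2\left\{X_{i2} - E\left[X_{i2} \mid X_{i1}\right]\right\} + \alpha\,\eta_{i1} Q_i(t) + \epsilon_i(t), \tag{17}$$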

in which μ(t), ϵi(t), Xi1(t), and Xi2 were as defined in Section 3.1.1. The observation times Tik were generated from a non-homogeneous Poisson process with intensity function $\lambda_i(t) = \eta_{i2}\,\lambda(t)\exp\{\gamma_1 X_{i1}(t) + \gamma_2 X_{i2}\}$, for which λ(t)=t2. The independent censoring time Ci was generated from Uniform[7,10]. The coefficients were set at β1=1, β2=0.3, γ1=0.5, γ2=0.2, and α=1. Because $\alpha \neq 0$ in model (17), correlation was introduced between the outcome and the observation-time processes through latent variables. We generated the latent variable ηi2 under two scenarios:

  1. ηi2 from a Gamma distribution with mean 1 and variance 0.5; hereafter, ηi2(1).

  2. ηi2 from a mixture distribution, following Uniform[0.5,1.5] if $X_{i1} \le 0.5$ and a Gamma distribution with mean 1 and variance 0.7 if $X_{i1} > 0.5$; hereafter, ηi2(2).

The latent variable ηi1 was generated under two scenarios:

  1. ηi1=ηi2; hereafter, ηi1(1).

  2. $E[\eta_{i1} \mid \eta_{i2}] = \theta(\eta_{i2} - 1)$, θ=1; hereafter, ηi1(2).

We let Qi(t)=1 or Qi(t)=Xi1. When Qi(t)=Xi1, Model (17) can be considered a random coefficient model. The latent variables were dependent on the outcome process either through Qi(t)=Xi1 or ηi2(2). The simulation setup mirrored the setup of Sun et al. [16] if ηi1=ηi2 and Qi(t)=1 and mirrored the setup of Liang et al. [15] if α=1, ηi2 was Gamma distributed with mean 1, and ηi1 and ηi2 were linearly linked through $E[\eta_{i1} \mid \eta_{i2}] = \theta(\eta_{i2} - 1)$.
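The observation-time part of this design can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the covariate distributions (Xi1 uniform on [0,1], Xi2 binary), the time-invariant covariates, the baseline intensity $\sqrt{t}/2$, and all function names are assumptions made for the example. The two Gamma parameterizations, the mixture rule, the censoring distribution, and the values of γ1 and γ2 follow the text.

```python
import math
import random


def draw_eta2(x1, scenario, rng):
    """Draw the observation-time latent variable eta_i2."""
    if scenario == 1:
        # Gamma with mean 1 and variance 0.5: shape 2, scale 0.5.
        return rng.gammavariate(2.0, 0.5)
    # Scenario 2: the distribution depends on X_i1.
    if x1 <= 0.5:
        return rng.uniform(0.5, 1.5)
    # Gamma with mean 1 and variance 0.7: shape 1/0.7, scale 0.7.
    return rng.gammavariate(1.0 / 0.7, 0.7)


def simulate_visit_times(x1, x2, eta2, censor, rng, gamma1=0.5, gamma2=0.2):
    """Visit times from a Poisson process with intensity
    eta2 * lambda0(t) * exp(gamma1 * x1 + gamma2 * x2).

    ASSUMPTION: lambda0(t) = sqrt(t) / 2, so Lambda0(t) = t ** 1.5 / 3.
    """
    rate = eta2 * math.exp(gamma1 * x1 + gamma2 * x2)
    times = []
    s = 0.0
    while True:
        s += rng.expovariate(1.0)            # next arrival of a unit-rate process
        t = (3.0 * s / rate) ** (2.0 / 3.0)  # solve rate * Lambda0(t) = s for t
        if t > censor:
            return times
        times.append(t)


rng = random.Random(2014)
x1 = rng.uniform(0.0, 1.0)        # ASSUMPTION: distribution of X_i1
x2 = float(rng.random() < 0.5)    # ASSUMPTION: distribution of X_i2
eta2 = draw_eta2(x1, scenario=2, rng=rng)
censor = rng.uniform(7.0, 10.0)   # censoring time, Uniform[7, 10] as in the text
visits = simulate_visit_times(x1, x2, eta2, censor, rng)
print(len(visits), "visits before censoring at", round(censor, 2))
```

The time-change construction used here (mapping unit-rate arrivals through the inverse cumulative intensity) is valid only because the covariates are held fixed over time in this sketch; time-dependent covariates would call for thinning or piecewise inversion instead.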

3.2.2. Results.

Table II provides the estimated bias, empirical standard error estimates, and mean-squared error estimates for estimation of β1 in (17). The inclusion of Xi2 in the observation-time model satisfied (M2) and induced additional correlation between the outcome and observation-time processes, so the IIRR-weighted methods (Bůžková, weighted-Liang, and weighted-Sun) performed better than their unweighted counterparts, reflecting the results of Setting 1.
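The three summaries reported in Table II are linked by the usual decomposition of mean-squared error into squared bias plus variance. The short check below applies that identity to three entries from the first row of Table II (n=100, ηi1(1), Qi=1, ηi2(1)); the helper name is illustrative, and small discrepancies are expected because the tabulated inputs are rounded to three decimals.

```python
def mean_squared_error(bias, ese):
    """MSE as squared bias plus squared empirical standard error."""
    return bias ** 2 + ese ** 2


# (bias, ESE, tabulated MSE) from the first row of Table II.
first_row = {
    "Lin": (-0.155, 0.430, 0.209),
    "Buzkova": (-0.030, 0.426, 0.182),
    "Weighted-Sun": (0.016, 0.407, 0.166),
}

for method, (bias, ese, tabulated) in first_row.items():
    computed = mean_squared_error(bias, ese)
    print(f"{method}: computed {computed:.3f}, tabulated {tabulated:.3f}")
```

Reading the table this way makes clear that, for a method such as Lin, most of the MSE comes from variance, while for the unweighted latent-variable methods the squared bias dominates.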

Table II.

Simulation results for β1 under (M2) and (M3).

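| n | ηi1 (a) | Qi | ηi2 (b) | Lin Bias | Lin ESE | Lin MSE | Bůžková Bias | Bůžková ESE | Bůžková MSE | Liang (extension) Bias | Liang (extension) ESE | Liang (extension) MSE | Weighted-Liang (extension) Bias | Weighted-Liang (extension) ESE | Weighted-Liang (extension) MSE | Sun Bias | Sun ESE | Sun MSE | Weighted-Sun (extension) Bias | Weighted-Sun (extension) ESE | Weighted-Sun (extension) MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | ηi1(1) | 1 | ηi2(1) | −0.155 | 0.430 | 0.209 | −0.030 | 0.426 | 0.182 | −0.603 | 0.496 | 0.611 | 0.007 | 0.421 | 0.177 | −0.595 | 0.494 | 0.598 | 0.016 | 0.407 | 0.166 |
| | | | ηi2(2) | 0.335 | 0.426 | 0.294 | 0.469 | 0.418 | 0.394 | −0.281 | 0.464 | 0.294 | 0.116 | 0.405 | 0.178 | −0.305 | 0.462 | 0.306 | 0.008 | 0.398 | 0.158 |
| | ηi1(2) | 1 | ηi2(1) | −0.171 | 0.488 | 0.267 | −0.041 | 0.484 | 0.236 | −0.592 | 0.539 | 0.641 | 0.002 | 0.481 | 0.231 | −0.585 | 0.536 | 0.630 | 0.014 | 0.472 | 0.223 |
| | | | ηi2(2) | 0.356 | 0.528 | 0.406 | 0.490 | 0.528 | 0.519 | −0.223 | 0.541 | 0.342 | 0.134 | 0.470 | 0.239 | −0.247 | 0.540 | 0.353 | 0.025 | 0.465 | 0.217 |
| | | Xi1 | ηi2(1) | 0.104 | 0.415 | 0.183 | 0.232 | 0.411 | 0.223 | −0.358 | 0.481 | 0.359 | 0.004 | 0.445 | 0.198 | −0.284 | 0.493 | 0.324 | 0.250 | 0.442 | 0.258 |
| | | | ηi2(2) | 0.348 | 0.483 | 0.354 | 0.486 | 0.485 | 0.472 | −0.233 | 0.504 | 0.309 | 0.045 | 0.443 | 0.198 | −0.163 | 0.506 | 0.283 | 0.127 | 0.438 | 0.208 |
| 200 | ηi1(1) | 1 | ηi2(1) | −0.166 | 0.307 | 0.122 | −0.025 | 0.310 | 0.097 | −0.621 | 0.351 | 0.509 | 0.012 | 0.302 | 0.092 | −0.615 | 0.349 | 0.500 | 0.018 | 0.292 | 0.085 |
| | | | ηi2(2) | 0.336 | 0.311 | 0.209 | 0.484 | 0.301 | 0.325 | −0.285 | 0.342 | 0.198 | 0.113 | 0.294 | 0.100 | −0.304 | 0.341 | 0.209 | 0.007 | 0.288 | 0.083 |
| | ηi1(2) | 1 | ηi2(1) | −0.179 | 0.342 | 0.149 | −0.043 | 0.350 | 0.124 | −0.607 | 0.387 | 0.518 | −0.004 | 0.353 | 0.124 | −0.601 | 0.385 | 0.510 | 0.004 | 0.346 | 0.120 |
| | | | ηi2(2) | 0.338 | 0.372 | 0.253 | 0.492 | 0.377 | 0.384 | −0.240 | 0.390 | 0.210 | 0.104 | 0.346 | 0.131 | −0.258 | 0.391 | 0.220 | −0.002 | 0.343 | 0.118 |
| | | Xi1 | ηi2(1) | 0.100 | 0.297 | 0.098 | 0.235 | 0.303 | 0.147 | −0.361 | 0.352 | 0.254 | 0.003 | 0.324 | 0.105 | −0.289 | 0.358 | 0.211 | 0.257 | 0.329 | 0.174 |
| | | | ηi2(2) | 0.335 | 0.339 | 0.227 | 0.490 | 0.344 | 0.358 | −0.237 | 0.367 | 0.190 | 0.024 | 0.323 | 0.105 | −0.168 | 0.365 | 0.162 | 0.110 | 0.320 | 0.114 |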

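Bias, $\hat{\beta}_1 - \beta_1$, with β1=1; ESE, empirical standard error; MSE, mean-squared error.

a Two possible links: ηi1(1): ηi1=ηi2; ηi1(2): $E[\eta_{i1} \mid \eta_{i2}] = \theta(\eta_{i2} - 1)$, θ=1.

b Latent variable distributions: ηi2(1): $\eta_{i2} \sim \mathrm{Gamma}(\text{mean} = 1, \sigma^2 = 0.5)$; ηi2(2): $\eta_{i2} \sim I(X_{i1} \le 0.5)\,\mathrm{Uniform}[0.5, 1.5] + I(X_{i1} > 0.5)\,\mathrm{Gamma}(1, 0.7)$.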

Under the Sun setup (i.e., ηi1(1): ηi1=ηi2 and Qi(t)=1), all IIRR-weighted methods yielded approximately unbiased parameter estimates for β1 under ηi2(1). Under the Liang setup (i.e., ηi1(2): $E[\eta_{i1} \mid \eta_{i2}] = \theta(\eta_{i2} - 1)$ and Qi(t)=1), all IIRR-weighted methods likewise yielded approximately unbiased estimates under ηi2(1). Under ηi2(2), in which the distribution of the latent variable depended on Xi1, only the weighted-Sun method yielded approximately unbiased estimates under Qi(t)=1, although the bias under the weighted-Liang method was smaller in magnitude than that under the Bůžková method. Note that the Bůžková method is not expected to perform well in this setting (in which unobserved latent variables affect the outcome and observation-time processes), because latent variable models represent a different class of models. If the effect of the latent variable ηi1 on the outcomes was associated with the value of Xi1 (i.e., Qi(t)=Xi1), then the bias in the estimate of β1 was small under the weighted-Liang method but large under all other methods.

3.3. Summary

Our simulation results quantified the potential for bias in estimated covariate-outcome associations under various outcome-observation dependence mechanisms. The Bůžková, weighted-Liang, and weighted-Sun methods performed better than their unweighted counterparts when (M2) was satisfied. In Setting 1, we examined the robustness of the methods that included latent variables when they were not needed. We showed that the potential loss of efficiency was moderate and decreased when the dependence between the outcome and observation-time models increased. We also examined the relative efficiency between IIRR-weighted and unweighted methods to examine potential loss of efficiency due to including an unnecessary additional covariate in the observation-time model. The results indicated that the loss of efficiency was moderate and decreased with a greater number of observations or increased dependence between the outcome and observation-time processes. The weighted-Liang and weighted-Sun methods were the most flexible in that they could accommodate a combination of outcome-observation dependence mechanisms. They also provided estimates with negligible bias, depending on the relationship between the latent variable and the outcome-model covariates. In practice, ensuring unbiased estimates through a more complex dependence model may be more important than a potential loss in efficiency.

In our simulation study, we generated data using the same set of covariates in settings with and without latent variables. In settings without such latent variables, the Bůžková method performs well, and simulations here (Section 3.1) and by others [7,14,35] have demonstrated small empirical bias. In practice, observed covariates (including previous outcomes) that are correlated with an unobserved latent variable may be used to partially capture information regarding subject-specific visit intensities.

Exploratory data analysis, model diagnostics, and sensitivity analyses can be used to investigate the relationship between the outcome and observation-time processes and to ensure selection of an appropriate analysis method. We illustrate and discuss strategies for model selection in Sections 4 and 5.

4. Case study

4.1. Background

We compared the reviewed methods using a subset of data from a bladder cancer study conducted by the Veterans Administration Cooperative Urological Research Group [36]. Eighty-five patients with superficial bladder tumors were randomly assigned to placebo (n=47) or thiotepa treatment (n=38). At each follow-up visit, new tumors were counted before being removed transurethrally. The maximum study duration was 53 months. There was notable heterogeneity in visit patterns across patients. The median (25th, 75th percentile) number of visits in the placebo group and treatment group was 9 (5, 12) and 9 (4, 23), respectively. The average time between visits for the placebo group was 3.7 months, compared with 2.3 months for the treatment group. These differences suggested that the patients in the treatment group visited the clinic more often. Hence, the observation-time process must account for this difference to estimate properly the effect of treatment on tumor recurrence.

Our analysis focused on Yi(t), the natural logarithm of one plus the cumulative number of new tumors observed up to time t, to retain a marginal response. The outcome model included a treatment indicator X1 and X2, the natural logarithm of one plus the initial number of tumors. We considered the following outcome models:

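Lin and Bůžková methods: $E[Y_i(t) \mid X_i(t)] = \mu(t) + \beta_1 X_{i1} + \beta_2 X_{i2}$;

Liang and weighted-Liang methods: $E[Y_i(t) \mid X_i(t), \eta_{i1}] = \mu(t) + \beta_1 X_{i1} + \beta_2 X_{i2} + \eta_{i1} Q_i$, $Q_i = X_{i1}$;

Sun and weighted-Sun methods: $E[Y_i(t) \mid X_i(t), \eta_{i1}] = \mu(t) + \beta_1 X_{i1} + \beta_2 X_{i2} + \alpha\,\eta_{i1}$.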
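The response entering each of these models is the log-transformed cumulative tumor count. A minimal sketch of that transformation is given below; the per-visit counts are hypothetical and the function name is illustrative.

```python
import math
from itertools import accumulate


def log_cumulative_outcome(new_tumors_per_visit):
    """Return log(1 + cumulative number of new tumors) at each visit."""
    return [math.log(1 + total) for total in accumulate(new_tumors_per_visit)]


# Hypothetical counts of new tumors found at five successive visits.
counts = [0, 2, 1, 0, 3]
y = log_cumulative_outcome(counts)
print([round(value, 3) for value in y])
```

Because the count is cumulative, the transformed outcome is non-decreasing within a patient, which is one reason longer gaps between visits are expected to coincide with larger observed values.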

The consensus of previous analyses was that the tumor recurrence and observation-time processes were dependent [15,37,38]. We note that the outcome may be intrinsically dependent upon the measurement process, in that longer intervals between visits allow more tumors to grow; the outcome is therefore expected to increase with longer time between visits. We considered two observation-time models:

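Case 1: $\lambda_i(t) = \eta_{i2}\exp\{\gamma_1 X_{i1} + \gamma_2 X_{i2}\}\,\lambda(t)$

Case 2: $\lambda_i(t) = \eta_{i2}\exp\{\gamma_1 X_{i1} + \gamma_2 X_{i2} + \gamma_3 \log(\#\,\text{new tumors since baseline} + 1)\}\,\lambda(t)$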

Case 1 specified the same set of covariates in both the outcome and observation-time models. Case 2 specified an additional covariate based on the number of tumors since baseline because it is common for the physician to schedule a patient’s next visit based on the outcomes so far. Recall that ηi1=ηi2 in the Sun and weighted-Sun methods and $E[\eta_{i1} \mid \eta_{i2}] = \theta(\eta_{i2} - 1)$ in the Liang and weighted-Liang methods.

4.2. Results

Table III provides estimates for β and γ under the Lin, Liang, and Sun methods in Case 1. We obtained $\hat{\gamma}_1 = 0.444$ (SE, 0.093) and $\hat{\gamma}_2 = 0.001$ (SE, 0.115), which suggested that treatment assignment was significantly associated with the observation-time process. We specified Qi=Xi1 because Xi1 had a significant effect in the observation-time model, and the results in Table III mirrored the conclusion from Liang et al. [15].
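To make the reported observation-time coefficients concrete, the sketch below converts each estimate and standard error into a Wald z-statistic, a two-sided p-value, and a visit-rate ratio with a 95% interval. It assumes a normal approximation with the reported standard errors, which may differ from the procedure the authors used to assess significance; the function name is illustrative.

```python
import math


def wald_summary(estimate, se, z_crit=1.96):
    """Wald z, two-sided normal p-value, rate ratio, and 95% interval."""
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    rate_ratio = math.exp(estimate)
    interval = (math.exp(estimate - z_crit * se), math.exp(estimate + z_crit * se))
    return z, p_value, rate_ratio, interval


# Case 1 estimates and standard errors reported in the text.
z1, p1, rr1, ci1 = wald_summary(0.444, 0.093)  # treatment
z2, p2, rr2, ci2 = wald_summary(0.001, 0.115)  # log(initial tumors + 1)

print(f"treatment: z = {z1:.2f}, rate ratio = {rr1:.2f}")
print(f"initial tumors: z = {z2:.2f}, rate ratio = {rr2:.2f}")
```

Under the proportional rate model, the treatment coefficient corresponds to a visit rate about 1.56 times that of the placebo group, whereas the coefficient for the initial number of tumors is indistinguishable from no effect.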

Table III.

Parameter estimates and estimated standard errors (SE) under Case 1.

| Method | $\hat{\beta}_1$ (SE) | $\hat{\beta}_2$ (SE) | $\hat{\theta}$ (SE)* | $\hat{\alpha}$ (SE)* |
|---|---|---|---|---|
| Lin | −0.701 (0.172) | 0.657 (0.165) | | |
| Liang | −0.588 (0.175) | 0.682 (0.147) | −0.235 (0.243) | |
| Sun | −0.751 (0.188) | 0.680 (0.159) | | −0.043 (0.398) |

$\hat{\gamma}_1 = 0.444$ (0.093), $\hat{\gamma}_2 = 0.001$ (0.115).

* The parameters θ and α represent the association between the outcome and observation-time processes for the Liang and Sun methods, respectively.

Next, we examined the importance of the additional covariate in Case 2. Table IV provides estimates for β and γ for IIRR-weighted methods under Case 2. We found that the cumulative number of tumors since baseline was significantly related to the observation-time process. The Wald test of γ3=0 in the observation-time model provided a p-value < 0.001, implying that the inclusion of the additional covariate was appropriate. Hence, the IIRR-weighted methods were more appropriate than the unweighted methods, and we focused on the results in Table IV. The observation-level weights applied to the Bůžková, weighted-Liang, and weighted-Sun methods ranged from 0.50 to 1.26, with median (25th, 75th percentile) = 0.93 (0.84, 1.06). With the incorporation of observation-level weights, the treatment effect under the Bůžková method was attenuated compared with the Lin method. Likewise, the treatment effect estimates under the weighted-Liang and weighted-Sun methods were smaller in magnitude than those under the Liang and Sun methods. Because the initial number of tumors was not significantly related to the observation-time process, the corresponding estimates $\hat{\beta}_2$ were comparable under all methods.

Table IV.

Parameter estimates and their estimated standard errors (SE) under Case 2.

| Method | $\hat{\beta}_1$ (SE) | $\hat{\beta}_2$ (SE) | $\hat{\theta}$ (SE)* | $\hat{\alpha}$ (SE)* |
|---|---|---|---|---|
| Bůžková | −0.565 (0.170) | 0.572 (0.165) | | |
| Weighted-Liang | −0.395 (0.166) | 0.584 (0.147) | −0.266 (0.229) | |
| Weighted-Sun | −0.423 (0.182) | 0.580 (0.156) | | −0.247 (0.247) |

$\hat{\gamma}_1 = 0.536$ (0.090), $\hat{\gamma}_2 = 0.105$ (0.128), $\hat{\gamma}_3 = 0.227$ (0.076).

Stabilized weights: median (25th, 75th percentile) = 0.93 (0.84, 1.06).

* The parameters θ and α represent the association between the outcome and observation-time processes for the weighted-Liang and weighted-Sun methods, respectively.

To determine the necessity of latent variables in the outcome models, we focused on the variance of the latent variable ηi2 in the observation-time model. Under Case 2, the estimated variance based on (13) was 0.448, indicating that the latent variable approaches were appropriate. Based on the variance property of the Gamma distribution, we partitioned the variance of ηi2 as the contribution from the placebo group (0.059) and the thiotepa group (0.417). The difference in variance estimates indicated the possibility of covariate-dependent ηi2, in that the distribution of ηi2 was different between the treatment groups. Next, we used the density curve of $\hat{\eta}_{i2}$ to graphically check if ηi2 was covariate dependent. The density curves were indeed different between the treatment groups (Appendix D). Given the evidence of (M2) and (M3), we focused on the results under the weighted-Liang and weighted-Sun methods. We note that the same Zi(t) was used for the methods in Table IV. As in the simulation study, correct specification of covariates in Zi(t) may recover the effect of the latent variable under the Bůžková method. We did not have access to other measured covariates in this data set; if those had been available, it may have been possible to find candidates for Zi(t) such that the treatment estimate under the Bůžková method was closer to those under the weighted-Liang and weighted-Sun methods.

The choice between the weighted-Liang and weighted-Sun methods relied on the distribution of ηi2. The weighted-Liang method assumes that ηi2 is derived from a Gamma distribution with a common variance for all subjects, whereas the weighted-Sun method places no distributional assumption on ηi2. Considering the evidence of covariate dependence based on the density curves, the results from the weighted-Sun method best described the data, although the estimates for β1 were similar between the weighted-Liang and weighted-Sun methods. Overall, the results indicated that treatment and the initial number of tumors had significant effects on tumor recurrence. We also observed a negative correlation between tumor recurrence and the observation-time processes ($\hat{\alpha} = -0.247$).

Lastly, we evaluated the fit of the outcome model based on the procedure presented in Liang et al. [15]. We derived residuals $\hat{\epsilon}_i(t) = Y_i(t) - \hat{y}_i(t)$ using parameter estimates from Table IV. Denote $0 \le t_1 < t_2 < \cdots < t_M$ as the M total observation times among all subjects. The estimate of μ(t) is a step function with jumps at unique observation times: $\hat{\mu}(t_k) = d\hat{\mathcal{A}}(t_k)/d\hat{\Lambda}(t_k) = \{\hat{\mathcal{A}}(t_k) - \hat{\mathcal{A}}(t_k-)\}/\{\hat{\Lambda}(t_k) - \hat{\Lambda}(t_k-)\}$, $1 \le k \le M$. More information on $\hat{\mathcal{A}}(t_k)$ and $d\hat{\Lambda}(t_k)$ can be found in Appendix E. Based on the residual plots of $\hat{\epsilon}_i(t)$ against the observation times (shown in Appendix E), there was some evidence of lack of fit for large outcome values, but it was not systematic with respect to time and was similar across all weighted methods.
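The step-function estimate of μ(t) described above is a ratio of increments of two cumulative estimates at each observation time. The sketch below shows only that final ratio step; the cumulative values are hypothetical, both cumulative functions are assumed to equal zero before the first observation time, and the estimators that produce them (Appendix E) are not reproduced here.

```python
def baseline_mean_steps(a_hat, lambda_hat):
    """Ratio of jumps in A-hat to jumps in Lambda-hat at ordered times t_1 < ... < t_M.

    ASSUMPTION: both cumulative estimates are 0 just before t_1.
    """
    mu_hat = []
    previous_a, previous_lambda = 0.0, 0.0
    for a, lam in zip(a_hat, lambda_hat):
        d_lambda = lam - previous_lambda
        if d_lambda <= 0.0:
            raise ValueError("Lambda-hat must jump at every observation time")
        mu_hat.append((a - previous_a) / d_lambda)
        previous_a, previous_lambda = a, lam
    return mu_hat


# Hypothetical cumulative estimates at three ordered observation times.
a_hat = [0.2, 0.5, 0.9]
lambda_hat = [0.1, 0.3, 0.5]
mu_hat = baseline_mean_steps(a_hat, lambda_hat)
print([round(value, 3) for value in mu_hat])
```

Fitted values at each visit can then be formed from the matching step of this estimate plus the covariate terms, and residuals plotted against observation time as described in the text.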

5. Discussion

In this paper, we evaluated the statistical properties of currently available and newly extended semi-parametric methods for the analysis of longitudinal data with outcome-dependent observation times. Table V summarizes the strengths and limitations of each method under various outcome-observation dependence mechanisms. The performance of each method hinges on the assumed mechanism of dependence between the outcome and observation-time processes. For conditional independence given covariates in the outcome model only (M1), all reviewed methods are appropriate. For conditional independence given observation-time model covariates only (M2), the Bůžková method is preferred. For conditional independence given unobserved latent variables only (M3), all methods perform well when the latent variables are independent of outcome-model covariates. However, if the distribution of the latent variables is covariate dependent, then the Sun method is preferred; if the effect of the latent variable in the outcome model is modified by any outcome-model covariates, then the Liang method is preferred. Under both (M2) and (M3), our extensions, the weighted-Liang and weighted-Sun methods, are the most flexible and remove the bias otherwise associated with the original Liang and Sun methods under (M2). In addition, our extension of the method by Liang et al. [15] allows time-dependent covariates in the observation-time process, which would otherwise not be possible.

Table V.

Summary of methods for various outcome-observation dependence mechanisms.

Lin Bůžková Liang Liang (extension) Weighted-Liang (extension) Sun Weighted-Sun (extension)
Time-independent covariates in the observation-time model Mechanism (M1) Conditional independence given outcome-model covariates + + + + + + +
Mechanism (M2) Conditional independence given observation-time model covariates + + +
Mechanism (M3) Conditional independence given latent variables (LV) LV not associated with outcome-model covariates LV distribution specified Gamma + + + + + + +
LV distribution unspecified + + + + + + +
LV associated with outcome-model covariates Effect of LV modified by Q(t)* + +
Distribution of LV based on Q(t)* + +
Mechanisms (M2) + (M3) +/− +/−
Time-dependent covariates in the observation-time model +/− +/− N/A +/− +/− +/− +/−

+ appropriate; − not appropriate; +/− appropriate under certain situations; N/A not applicable.

* Q(t) is a subset of X(t).

In practice, empirical model checking can be useful to decide which method is most appropriate. First, to decide between (M1) and (M2), one can focus on the observation-time model and perform a Wald test of the additional $q - p$ covariates [21]. If the Wald test yields a significant result, the data suggest (M2). Next, one can determine the necessity of latent variables in the outcome model using the variance of the latent variable ηi2. If the estimated variance of the latent variables is small (i.e., close to 0), latent variables may not be required. One method to estimate Var[ηi2] is to assume a parametric distribution for the latent variables, such as using Equation (13) if we can assume ηi2 is Gamma distributed. The distribution of the latent variable in the observation-time model is unspecified in the Lin, Bůžková, and Sun methods but is assumed to be Gamma distributed in the Liang method. There is a lack of formal techniques to check the Gamma distribution assumption of the unobserved latent variable, so a series of sensitivity analyses is recommended. Liang et al. [15] showed that the Liang method provided reasonable estimates for covariate-outcome association even if the distribution of the latent variable ηi2 was misspecified, especially when the variance of the distribution was small. Robustness of the Liang and weighted-Liang methods to misspecification of the distribution of ηi2 can be improved by replacing the estimate of ηi2 by $\hat{\eta}_{i2} = m_i \big/ \int_0^{C_i} \exp\{\hat{\gamma}' X_i(t)\}\, d\Lambda(t)$, which removes any distributional assumption. The choice between the Liang and Sun methods rests upon whether the distribution of the latent variable is covariate dependent. An informal check is to partition the estimated variance of ηi2 by the covariate values to determine if the partitioned variances are similar across levels of Xi(t). We can also graphically display the density curves of $\hat{\eta}_{i2}$ to check for covariate-dependent latent variables. Lastly, we can evaluate the overall fit of the models based on residuals. Formal model selection is an area of future research.
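The distribution-free estimate of ηi2 and the informal partition check can be sketched as follows, assuming time-invariant covariates so that the integral reduces to exp(γ̂'Xi) times the cumulative baseline at the censoring time. The coefficient values are the Case 1 estimates from Table III, used only for illustration; the subject-level data, the cumulative baseline values, and the function name are hypothetical.

```python
import math
from statistics import variance


def estimate_eta2(n_visits, covariates, gamma_hat, cum_baseline_at_censoring):
    """m_i divided by exp(gamma' x_i) * Lambda(C_i), for time-invariant x_i."""
    linear_predictor = sum(g * x for g, x in zip(gamma_hat, covariates))
    return n_visits / (math.exp(linear_predictor) * cum_baseline_at_censoring)


gamma_hat = (0.444, 0.001)  # Case 1 estimates (Table III), for illustration only

# Hypothetical subjects: (visits, (treatment, log(initial tumors + 1)), Lambda(C_i)).
subjects = [
    (8, (0.0, 0.69), 9.0),
    (10, (0.0, 1.10), 9.5),
    (9, (0.0, 0.69), 8.5),
    (4, (1.0, 0.69), 9.0),
    (22, (1.0, 1.39), 9.5),
    (12, (1.0, 0.69), 8.0),
]

eta_by_group = {0.0: [], 1.0: []}
for n_visits, covariates, cum_baseline in subjects:
    eta = estimate_eta2(n_visits, covariates, gamma_hat, cum_baseline)
    eta_by_group[covariates[0]].append(eta)

# Informal check: compare the spread of eta-hat across treatment groups.
variance_by_group = {group: variance(values) for group, values in eta_by_group.items()}
print(variance_by_group)
```

Markedly different variances across groups would point toward a covariate-dependent latent variable. The sample variance of the estimates also contains sampling noise from the visit counts, so this comparison is a screening device rather than a formal test.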

Several features of the methods discussed here deserve comment. First, the semi-parametric outcome model does not require the estimation of μ(t). However, the potential gain from the flexibility of the form of μ(t) is countered by the potential loss in efficiency of estimation of the parameters of interest. Second, we assume that censoring times are independent of the outcome and observation-time model processes, that is, non-informative censoring. This assumption may be relaxed to allow censoring to depend on the outcome and observation-time processes by estimating γ and Λ using the method proposed by Huang et al. [39]. In addition, the parameters in the outcome model are time-independent, which may not be appropriate in some cases. We refer readers to the procedure in Sun et al. [16] to derive time-dependent regression coefficients. Third, our goal is to generate inference regarding the marginal association between a set of covariates and the outcome of interest, rather than to conduct formal causal inference. If we allow intervention on Xi(t), modification of the exposure may influence not only the outcome of interest but also the occurrence of a visit. Hence, the quantification of the causal effect of the exposure on the outcome of interest requires techniques that establish the temporal association between exposure and outcome. A g-computation algorithm [40] or inverse-probability-of-treatment-weighted estimators [41] may provide insight into estimation of causal effects. Lastly, the observation-time process can be modeled on two time scales: total time scale (i.e., time-to-events model) in which each recurrent event is measured from a time of origin, and gap time scale (i.e., time-between-events model) in which the measure of interest is time between successive events [42]. The methods in this paper adopt the total time scale, but it may be appropriate to consider the alternative parameterization. The time-between-events approach is well studied within the recurrent events field [43], but the use of the gap time scale in the regression modeling of longitudinal data with outcome-dependent observation times warrants future research.
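The difference between the two time scales is only a change of representation for the same visit history, as the short sketch below shows. The visit times are hypothetical and the function name is illustrative.

```python
def to_gap_times(total_times):
    """Convert visit times measured from the origin into times between visits."""
    gaps = []
    previous = 0.0
    for t in total_times:
        gaps.append(t - previous)
        previous = t
    return gaps


# Hypothetical visit times in months since study entry (total time scale).
visits_total = [3.0, 5.5, 9.0, 14.0]
visits_gap = to_gap_times(visits_total)
print(visits_gap)  # time between successive visits (gap time scale)
```

On the total time scale the intensity is a function of time since origin, whereas on the gap time scale it restarts after each visit, so the two parameterizations imply different models for the same data.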

It is of interest to note that in the framework of incomplete data, GEE is able to accommodate missing completely at random data and the special case of covariate-dependent missingness [44,45]. Similarly, in the current focus on outcome-dependent observation times, GEE does provide reliable estimates of β under (M1), assuming a correctly specified function of time in the outcome model. With the inclusion of observation-level inverse intensity weights, a weighted-GEE model may also provide reasonable estimates of β under (M2) with the ease of currently available software packages [7]. However, the advantage of the methods in Section 2 is the flexibility provided by the non-parametric specification of the effect of time.

The methods we described are currently limited to linear models for continuous outcomes. Recent research has focused on the development of log-linear models for count outcomes [35, 46]. Future research will likely extend to semi-parametric models for binary outcomes. In addition, broader application of existing methods is likely hampered by the lack of available general-purpose statistical software. R code to generate Table IV along with some model-checking procedures is provided in Appendix F.

Acknowledgements

We gratefully acknowledge the University of Pennsylvania for supporting this research. We also thank the associate editor and reviewers for invaluable comments that greatly improved the manuscript.

Footnotes

Supporting information

Additional supporting information may be found in the online version of this article at the publisher’s web site.

References

  1. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73(1):13–22.
  2. Huang CY, Wang MC, Zhang Y. Analysing panel count data with informative observation times. Biometrika 2006; 93(4):763–775.
  3. Sun J, Park DH, Sun L, Zhao X. Semiparametric regression analysis of longitudinal data with informative observation times. Journal of the American Statistical Association 2005; 100(471):882–889.
  4. Lipsitz S, Fitzmaurice G, Ibrahim J. Parameter estimation in longitudinal studies with outcome-dependent follow-up. Biometrics 2002; 58(3):621–630.
  5. Fitzmaurice GM, Lipsitz SR, Ibrahim JG, Gelber R, Lipshultz S. Estimation in regression models for longitudinal binary data with outcome-dependent follow-up. Biostatistics 2006; 7(3):469–485.
  6. Lin H, Scharfstein DO, Rosenheck RA. Analysis of longitudinal data with irregular, outcome-dependent follow-up. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2004; 66(3):791–813.
  7. Bůžková P, Lumley T. Longitudinal data analysis for generalized linear models with follow-up dependent on outcome-related variables. Canadian Journal of Statistics 2007; 35(4):485–500.
  8. Ryu D, Sinha D, Mallick B, Lipsitz SR, Lipshultz SE. Longitudinal studies with outcome-dependent follow-up. Journal of the American Statistical Association 2007; 102(479):952–961.
  9. Liu L, Huang X, O’Quigley J. Analysis of longitudinal data in the presence of informative observational times and a dependent terminal event, with application to medical cost data. Biometrics 2008; 64(3):950–958.
  10. Troxel AB, Lipsitz SR, Fitzmaurice GM, Ibrahim JG, Sinha D, Molenberghs G. A weighted combination of pseudo-likelihood estimators for longitudinal binary data subject to non-ignorable non-monotone missingness. Statistics in Medicine 2010; 29(14):1511–1521.
  11. Albert PS. A transitional model for longitudinal binary data subject to nonignorable missing data. Biometrics 2000; 56(2):602–608.
  12. French B, Heagerty PJ. Marginal mark regression analysis of recurrent marked point process data. Biometrics 2009; 65(2):415–422.
  13. Lin DY, Ying Z. Semiparametric and nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association 2001; 96(453):103–126.
  14. Bůžková P, Lumley T. Semiparametric modeling of repeated measurements under outcome-dependent follow-up. Statistics in Medicine 2009; 28:987–1003.
  15. Liang Y, Lu W, Ying Z. Joint modeling and analysis of longitudinal data with informative observation times. Biometrics 2009; 65(2):377–384.
  16. Sun L, Song X, Zhou J. Regression analysis of longitudinal data with time-dependent covariates in the presence of informative observation and censoring times. Journal of Statistical Planning and Inference 2011; 141(8):2902–2919.
  17. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data 2nd edn. Wiley: New York, 2002.
  18. Brumback BA, Rice JA. Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association 1998; 93:961–976.
  19. Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association 2001; 96(455):1045–1056.
  20. Pepe MS, Couper D. Modeling partly conditional means with longitudinal data. Journal of the American Statistical Association 1997; 92(439):991–998.
  21. Lin DY, Wei LJ, Yang I, Ying Z. Semiparametric regression for the mean and rate functions of recurrent events. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2000; 62(4):711–730.
  22. Li Y, Ryan L. Survival analysis with heterogeneous covariate measurement error. Journal of the American Statistical Association 2004; 99(467):724–735.
  23. Williamson JM, Datta S, Satten G. Marginal analyses of clustered data when cluster size is informative. Biometrics 2003; 59(1):36–42.
  24. Seaman SR, Pavlou M, Copas AJ. Methods for observed-cluster inference when cluster size is informative: a review and clarifications. Biometrics 2014; 70(2):449–456.
  25. Rotnitzky A, Robins JM. Semiparametric regression estimation in the presence of dependent censoring. Biometrika 1995; 82(4):805–820.
  26. Zhao LP, Rotnitzky A, Robins JM. Analysis of semiparametric models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 1995; 90(429):106–121.
  27. McCulloch CE, Neuhaus JM. Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical Science 2011; 26(3):388–402.
  28. Liu D, Kalbfleisch JD, Schaubel DE. A positive stable frailty model for clustered failure time data with covariate-dependent frailty. Biometrics 2011; 67(1):8–17.
  29. Neuhaus JM, McCulloch CE. Separating between- and within-cluster covariate effects by using conditional and partitioning methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2006; 68(5):859–872.
  30. Heagerty PJ, Kurland BF. Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika 2001; 88(4):973–985.
  31. Efron B, Tibshirani R. An Introduction to the Bootstrap 1st edn. Chapman & Hall/CRC: Boca Raton, 1993.
  32. Field CA, Welsh AH. Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2007; 69(3):369–390.
  33. Chernick MR. Bootstrap Methods: A Guide for Practitioners and Researchers 2nd edn. Wiley: New York, 2007.
  34. Cheng G, Yu Z, Huang JZ. The cluster bootstrap consistency in generalized estimating equations. Journal of Multivariate Analysis 2013; 115:33–47.
  35. Bůžková P, Lumley T. Semiparametric log-linear regression for longitudinal measurements subject to outcome-dependent follow-up. Journal of Statistical Planning and Inference 2008; 138(8):2450–2461.
  36. Andrews DF, Herzberg A. Data: a Collection of Problems from Many Fields for the Student and Research Worker 1st edn. Springer: New York, 1985.
  37. Sun J, Wei L. Regression analysis of panel count data with covariate-dependent observation and censoring times. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2000; 62(2):293–302.
  38. Hu XJ, Sun J, Wei LJ. Regression parameter estimation from panel counts. Scandinavian Journal of Statistics 2003; 30(1):25–43.
  39. Huang CY, Qin J, Wang MC. Semiparametric analysis for recurrent event data with time-dependent covariates and informative censoring. Biometrics 2010; 66(1):39–49.
  40. Robins JM, Greenland S, Hu FC. Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome. Journal of the American Statistical Association 1999; 94:687–700.
  41. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11(5):550–560.
  42. Cook RJ, Lawless J. The Statistical Analysis of Recurrent Events 1st edn. Springer: New York, 2007.
  43. Huang X, Liu L. A joint frailty model for survival and gap times between recurrent events. Biometrics 2007; 63(2):389–397.
  44. Rubin DB. Inference and missing data. Biometrika 1976; 63(3):581–592.
  45. Little RJA. Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association 1995; 90(431):1112–1121.
  46. Sun J, Tong X, He X. Regression analysis of panel count data with dependent observation times. Biometrics 2007; 63(4):1053–1059.
