Second-Order Analysis of Semiparametric Recurrent Event Processes

Yongtao Guan

doi:10.1111/j.1541-0420.2011.01557.x

Author manuscript; available in PMC: 2012 Sep 1.

Published in final edited form as: Biometrics. 2011 Mar 1;67(3):730–739. doi: 10.1111/j.1541-0420.2011.01557.x

PMCID: PMC3137716 NIHMSID: NIHMS267780 PMID: 21361885

Summary

A typical recurrent event dataset consists of an often large number of recurrent event processes, each of which contains multiple event times observed from an individual during a follow-up period. Such data have become increasingly available in medical and epidemiological studies. In this paper, we introduce novel procedures to conduct second-order analysis for a flexible class of semiparametric recurrent event processes. Such an analysis can provide useful information regarding the dependence structure within each recurrent event process. Specifically, we will use the proposed procedures to test whether the individual recurrent event processes are all Poisson processes and to suggest sensible alternative models for them if they are not. We apply these procedures to a well-known recurrent event dataset on chronic granulomatous disease and an epidemiological dataset on meningococcal disease cases in Merseyside, UK, to illustrate their practical value.

Keywords: Pair correlation function, recurrent event process, second-order analysis

1. Introduction

In this paper we introduce the concept of second-order analysis for recurrent event data. A major interest when conducting such an analysis for recurrent event processes is to test whether the individual recurrent event processes can be viewed as Poisson processes. If the Poisson assumption is reasonable, then it would justify the validity of an analysis that is based on this assumption (e.g., Ng and Cook, 1999). A departure from the Poisson assumption often implies one of two alternatives, namely a clustered or a regular point pattern. For the chronic granulomatous disease (CGD) data analyzed in Section 6.1, these alternatives in turn correspond to a more or less clustered infection pattern than what would be expected under complete randomness, and therefore are of relevance to how the disease progresses over time.

Upon a rejection of the Poisson assumption, it is often necessary to study in detail the dependence structure within the individual recurrent event processes. For the childhood meningococcal disease (CMD) data analyzed in Section 6.2, the diagnosis times of the CMD cases are expected to be clustered due to the contagious nature of the disease. From a disease-control point of view, it is of interest to know the time window in which an elevated disease risk should be expected following a confirmed CMD case, and how long that period will last. As we will illustrate in Section 6.2, our proposed second-order analysis procedures can be used to help answer these questions.

Second-order analysis has been widely used for spatial point pattern analysis (e.g., Diggle, 2003). We stress that there are important differences between second-order analysis of spatial point patterns and recurrent event data. For the former, it is often necessary to fully specify the mean structure of the process. Furthermore, both the size of the observation window and the number of events observed on it need to be large. For the latter, it is common to assume a semiparametric regression model for the mean structure of the recurrent event processes, and the number of events per process can be extremely small. Specifically, for the CGD data, the total number of events is 76 for 128 CGD patients; for the CMD data, the total number of CMD cases is 864 from 182 reporting units. It is challenging to extract dependence information from individually sparse and also not fully parametric recurrent event processes.

Overcoming the above difficulties, we develop novel statistical methods to conduct second-order analysis for recurrent event processes. Specifically, we propose consistent estimators for two second-order integrals, based on which useful second-order functions can be derived. Our main theoretical results are based on a flexible asymptotic framework that combines features from both spatial point processes and (commonly studied) recurrent event processes. Consequently, the proposed methods built upon these results can be applied to real data arising from a broad range of settings. As an important application, we develop a nonparametric graphical procedure to test whether the individual recurrent event processes are Poisson processes.

We organize the rest of the paper as follows. In Section 2 we give necessary background and set up the notation. In Section 3 we develop the proposed estimators for the second-order integrals and study their theoretical properties. In Section 4 we introduce the graphical procedure to test the Poisson assumption. We conduct a simulation study in Section 5 to assess the performance of the proposed estimators and apply them to two real examples in Section 6. We conclude this article with a discussion in Section 7. Proofs for our theoretical results and some additional simulation results are given in the web Supplementary Materials.

2. Background and Notation

2.1 Notation

Consider n independent recurrent event processes N₁, ···, N_n. Let dN_i(t) denote the number of events from the ith process observed over the time interval [t, t+dt), where dt > 0. Define the first- and second-order rate functions of N_i as

\begin{matrix} λ (t; N_{i}) = \lim_{dt \to 0} \frac{E \{ dN_{i} (t) \}}{dt}, \\ λ_{2} (t_{1}, t_{2}; N_{i}) = \lim_{dt_{1}, dt_{2} \to 0} \frac{E \{ dN_{i} (t_{1}) \, dN_{i} (t_{2}) \}}{dt_{1} \, dt_{2}} . \end{matrix}

Intuitively, λ(t; N_i)dt is the approximate probability of observing one event in [t, t + dt), and λ₂(t₁, t₂; N_i)dt₁dt₂ is the approximate probability of observing one event in each of [t₁, t₁ + dt₁) and [t₂, t₂ + dt₂). We will refer to λ(t; N_i) as the rate function (Pepe and Cai, 1993; Lawless and Nadeau, 1995) and assume that it takes the following semiparametric form:

λ (t; N_{i}) = λ_{0} (t) ρ (β, X_{i}),

(1)

where λ₀(t) is an unspecified baseline rate function and ρ(β, X_i) is a parametric function depending on both some predictors X_i and unknown parameters β. A consistent estimator for β can be obtained by applying the estimating equation based approaches given in Lawless and Nadeau (1995) and Lin et al. (2000) without having to specify λ₀(t); these estimating equations have the same form as the score equations of Andersen and Gill (1982) except that they are expressed in terms of the rate function. We assume that X_i does not change with t for ease of theoretical derivation. However, our methods presented in Section 3 can be easily generalized to the case when X_i is time varying.

Given the first- and second-order rate functions, we define the pair correlation function

g (t_{1}, t_{2}; N_{i}) = \frac{λ_{2} (t_{1}, t_{2}; N_{i})}{λ (t_{1}; N_{i}) λ (t_{2}; N_{i})} .

The pair correlation function is very useful for second-order analysis of point processes (Møller and Waagepetersen, 2004). If the point process N_i is Poisson (conditional on the covariates X_i), then g(t₁, t₂; N_i) = 1. Otherwise, g(t₁, t₂; N_i) is larger than 1 if N_i shows attraction at the time locations (t₁, t₂) and is less than 1 if it exhibits inhibition at those locations.

2.2 Pair correlation function for Cox processes

Cox (1955) introduced the so-called doubly stochastic Poisson process, which generalizes a Poisson process by defining its rate function as a realization of a random field. This type of process is commonly referred to as a Cox process. Henderson et al. (2000) discussed a very general class of Cox processes in the context of modeling recurrent event data. Here we consider one such example.

Specifically, let u_i be an unobserved normal random variable with mean zero and variance σ², and let Z_i(t) be a stationary Gaussian process that is independent of u_i and has mean zero and an isotropic covariance function R(|t₁ − t₂|) = cov{Z_i(t₁), Z_i(t₂)}. Assume that conditional on u_i and Z_i(t), N_i is a Poisson process with rate function

λ \{ t; N_{i} ∣ u_{i}, Z_{i} (t) \} = λ_{0} (t) \exp \{ u_{i} + X_{i}^{T} β + η Z_{i} (t) \},

(2)

where β and η are some real constants and X_i is a vector of known covariates. By definition, N_i is a Cox process. For the CGD data given in Section 6.1, u_i in (2) accounts for the variability across subjects due to non-treatment factors, e.g., one’s overall health condition, X_i indicates the treatment assignment, and Z_i(t) is a latent process that explains the local temporal variations. In the special case that η = 0, (2) becomes a random effect (frailty) model, which has been widely used in the literature, e.g., see Ng and Cook (1999) and the references therein.

Conditional on u_i, N_i is a log-Gaussian Cox process (Møller et al., 1998) and its pair correlation function is given by

g (t_{1}, t_{2}; N_{i} ∣ u_{i}) = \exp \{ η^{2} R (∣ t_{1} - t_{2} ∣) \} .

Because u_i is not observed, it is more sensible to consider the pair correlation function without conditioning on u_i. This is given by

g (t_{1}, t_{2}; N_{i}) = \exp \{ σ^{2} + η^{2} R (∣ t_{1} - t_{2} ∣) \} .

Interestingly, g(t₁, t₂; N_i) is simply g(t₁, t₂; N_i|u_i) multiplied by a constant and g(t₁, t₂; N_i) = g(t₁, t₂; N_i|u_i) if and only if σ² = 0, i.e., if the random effect term u_i does not apply. Such a relationship is generally true even if u_i and Z_i(t) are nonnormal. Note also that g(t₁, t₂; N_i|u_i) = 1, which implies g(t₁, t₂; N_i) being a constant, for all t₁ ≠ t₂ if η = 0. In this case (2) leads to a Poisson process when conditional on u_i. Moreover, g(t₁, t₂; N_i|u_i) converges to one and g(t₁, t₂; N_i) converges to a constant as R(|t₁ − t₂|) goes to zero.
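The unconditional pair correlation function above can be checked with a short derivation. The following is a sketch under the assumptions of model (2), namely that u_i and Z_i(t) are independent, zero-mean and Gaussian, and it uses only the moment generating function of a normal variable; it is a worked check, not a reproduction of the author's proof.

```latex
\begin{aligned}
\lambda(t; N_i)
  &= \lambda_0(t)\, e^{X_i^{T}\beta}\, E\{e^{u_i}\}\, E\{e^{\eta Z_i(t)}\}
   = \lambda_0(t) \exp\{X_i^{T}\beta + \sigma^{2}/2 + \eta^{2} R(0)/2\}, \\
\lambda_2(t_1, t_2; N_i)
  &= \lambda_0(t_1)\lambda_0(t_2)\, e^{2 X_i^{T}\beta}\,
     E\{e^{2u_i}\}\, E\bigl[e^{\eta\{Z_i(t_1) + Z_i(t_2)\}}\bigr] \\
  &= \lambda_0(t_1)\lambda_0(t_2)
     \exp\{2 X_i^{T}\beta + 2\sigma^{2} + \eta^{2} R(0) + \eta^{2} R(|t_1 - t_2|)\}, \\
g(t_1, t_2; N_i)
  &= \frac{\lambda_2(t_1, t_2; N_i)}{\lambda(t_1; N_i)\,\lambda(t_2; N_i)}
   = \exp\{\sigma^{2} + \eta^{2} R(|t_1 - t_2|)\}, \qquad t_1 \neq t_2 .
\end{aligned}
```

The second line uses var{Z_i(t₁) + Z_i(t₂)} = 2R(0) + 2R(|t₁ − t₂|). Setting σ² = 0 recovers the conditional expression g(t₁, t₂; N_i|u_i).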

In practice, it is often difficult to estimate g(t₁, t₂; N_i|u_i) consistently due to the unobserved random effect u_i. We will introduce in Section 2.3 two second-order integrals based on which a consistent nonparametric estimator for g(t₁, t₂; N_i) can be derived.

Our proposed second-order analysis is useful for model building and model diagnostics for Cox processes. Specifically, the proposed test for the Poisson assumption can be used to check whether η in (2) is equal to zero, i.e., whether (2) can be reduced to a simpler random effect model, whereas our proposed estimator for g(t₁, t₂; N_i) (see Section 2.3) can be used to assess the goodness-of-fit of a fitted model such as (2) by comparing it against its theoretical counterpart under the fitted model.

2.3 The proposed second-order integrals

Let f(t₁, t₂) be a known, nonnegative, symmetric function, i.e., f(t₁, t₂) = f(t₂, t₁). Our main interest is to estimate the following two quantities:

A = \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) f (t_{1}, t_{2}) g (t_{1}, t_{2}; N_{i}) {d t}_{1} {d t}_{2},

(3)

B = \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) f (t_{1}, t_{2}) {d t}_{1} {d t}_{2},

(4)

where [0, τ_i): i = 1, ···, n, are the observation intervals for the n recurrent event processes and τ_i’s are assumed to be independent of N_i’s.

Integrals of the forms (3) and (4) play an essential role in second-order analysis of recurrent event processes. To see this point, consider

f (t_{1}, t_{2}) = \frac{1}{h} \cdot K (\frac{∣ t_{1} - t_{2} ∣ - r}{h}),

(5)

where K(·) is a bounded kernel function with a compact support over [−1, 1]. Assume that g(t₁, t₂; N_i) = g(|t₁− t₂|) for i = 1, ···, n, i.e., the pair correlation functions of the n recurrent event processes are identical and also isotropic. If g(·) is continuous and h is small, then

\begin{array}{l} A = \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) \frac{1}{h} \cdot K (\frac{∣ t_{1} - t_{2} ∣ - r}{h}) g (∣ t_{1} - t_{2} ∣) {d t}_{1} {d t}_{2} \\ \approx \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) \frac{1}{h} \cdot K (\frac{∣ t_{1} - t_{2} ∣ - r}{h}) g (r) {d t}_{1} {d t}_{2} \\ = g (r) B . \end{array}

Thus, A/B ≈ g(r), so the ratio of consistent estimators of A and B forms an approximately unbiased estimator for the pair correlation function at lag r.

The two quantities A and B defined through (5) depend on a user-specified bandwidth. This can be avoided by considering

f (t_{1}, t_{2}) = I (∣ t_{1} - t_{2} ∣ \leq r),

(6)

which leads to the cumulative quantities

\begin{array}{l} A = \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) I (∣ t_{1} - t_{2} ∣ \leq r) g (∣ t_{1} - t_{2} ∣) {d t}_{1} {d t}_{2}, \\ B = \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) I (∣ t_{1} - t_{2} ∣ \leq r) {d t}_{1} {d t}_{2} . \end{array}

If N_i is Poisson conditional on the covariates X_i and some random effect u_i, then g(r) = c for some c > 0 and for all r > 0, where c = 1 if N_i depends only on X_i. Then,

A = c \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) I (∣ t_{1} - t_{2} ∣ \leq r) {d t}_{1} {d t}_{2} = c B .

Thus, A/B is a constant for all r > 0, under the null hypothesis that all individual recurrent event processes are Poisson processes. If an estimate for A/B can be obtained, we can then plot it against r so as to check for departures from the null hypothesis.
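The two choices of f in (5) and (6) can be written as small Python functions. This is only an illustrative sketch: the Epanechnikov kernel is an assumed choice of K(·) (the text only requires a bounded kernel with support on [−1, 1]), and the names `epanechnikov`, `kernel_f` and `indicator_f` are hypothetical.

```python
def epanechnikov(u):
    """An assumed bounded kernel K with compact support on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_f(r, h, kernel=epanechnikov):
    """f(t1, t2) = K((|t1 - t2| - r) / h) / h, as in (5)."""
    def f(t1, t2):
        return kernel((abs(t1 - t2) - r) / h) / h
    return f


def indicator_f(r):
    """f(t1, t2) = I(|t1 - t2| <= r), as in (6); no bandwidth is needed."""
    def f(t1, t2):
        return 1.0 if abs(t1 - t2) <= r else 0.0
    return f
```

Both functions are symmetric in their arguments, as required of f. The kernel version targets g at a single lag r and depends on the bandwidth h, whereas the indicator version accumulates all lags up to r.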

3. Estimation of Second-Order Integrals

In this section we develop consistent estimators for the two second-order integrals defined in (3) and (4). Because the function f(·) used to define these integrals may take very different forms, we wish to make our estimators as general as possible so they can be applied in a broad range of settings. For ease of presentation but without losing generality, we assume that the ending times τ_i’s are all distinct and more specifically, 0 = τ₀ < τ₁ < τ₂ < ··· < τ_n. Our proposed estimators can be easily extended to the situation with tied ending times and we will give the detailed expressions in Appendix A.

3.1 The proposed estimators

Estimator for A

For any given process N_i, define

A_{i} = \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} λ_{0} (t_{1}) λ_{0} (t_{2}) f (t_{1}, t_{2}) g (t_{1}, t_{2}; N_{i}) {d t}_{1} {d t}_{2} .

We can then rewrite A as

A = \sum_{i = 1}^{n} A_{i} .

Assume that a consistent estimator β̂ is available for β. It then follows from Campbell’s Theorem (Daley and Vere-Jones, 2008, Section 13.1) that

\int_{0}^{τ_{i}} \int_{0}^{τ_{i}} \frac{f (t_{1}, t_{2}) I (t_{1} \neq t_{2})}{ρ {(\hat{β}, X_{i})}^{2}} {d N}_{i} (t_{1}) {d N}_{i} (t_{2})

is an approximately unbiased estimator for A_i. Note that we consider only one recurrent event process at a time, because distinct recurrent event processes are assumed to be independent and consequently contain no information regarding the pair correlation functions in A_i: i = 1, ···, n. This then leads to the following estimator for A:

\hat{A} = \sum_{i = 1}^{n} \int_{0}^{τ_{i}} \int_{0}^{τ_{i}} \frac{f (t_{1}, t_{2}) I (t_{1} \neq t_{2})}{ρ {(\hat{β}, X_{i})}^{2}} {d N}_{i} (t_{1}) {d N}_{i} (t_{2}) .

(7)
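Because dN_i places unit mass at each event time, the double integral in (7) reduces to a sum over ordered pairs of distinct events within the same process. A minimal sketch follows, assuming each process is stored as a list of event times with no ties and that the fitted values ρ(β̂, X_i) have already been computed; the function name `a_hat` and the data layout are hypothetical.

```python
def a_hat(events, rho, f):
    """Sum f(t1, t2) / rho_i**2 over ordered pairs of distinct events
    within each process, following (7).

    events : list of lists, events[i] holds the event times of process i
    rho    : list of fitted values rho(beta_hat, X_i), one per process
    f      : symmetric nonnegative function of two time points
    """
    total = 0.0
    for times, rho_i in zip(events, rho):
        for a, t1 in enumerate(times):
            for b, t2 in enumerate(times):
                if a != b:  # the indicator I(t1 != t2), assuming no tied times
                    total += f(t1, t2) / rho_i ** 2
    return total
```

Processes with fewer than two events contribute nothing to the sum, which is why sparse data such as the CGD data carry limited second-order information per process.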

Estimator for B

For any two positive integers l, m ≤ n, define

B_{l m} = \int_{τ_{l - 1}}^{τ_{l}} \int_{τ_{m - 1}}^{τ_{m}} λ_{0} (t_{1}) λ_{0} (t_{2}) f (t_{1}, t_{2}) {d t}_{1} {d t}_{2} .

Because the ending times are assumed to be distinct, we can rewrite B as

B = \sum_{i = 1}^{n} \sum_{l = 1}^{i} \sum_{m = 1}^{i} B_{l m} = \sum_{l = 1}^{n} \sum_{m = 1}^{n} \{ n - \max (l, m) + 1 \} B_{l m} .

Then, for any two distinct recurrent event processes N_j and N_k satisfying that j ≥ l and k ≥ m, i.e., τ_j ≥ τ_l and τ_k ≥ τ_m,

\int_{τ_{l - 1}}^{τ_{l}} \int_{τ_{m - 1}}^{τ_{m}} \frac{f (t_{1}, t_{2})}{ρ (\hat{β}, X_{j}) ρ (\hat{β}, X_{k})} {d N}_{j} (t_{1}) {d N}_{k} (t_{2})

is an approximately unbiased estimator for B_lm. Let C_lm = {n−max(l, m)+1}{n−min(l, m)} be the total number of such distinct pairs of recurrent event processes. Define

{\hat{B}}_{l m} = \frac{1}{C_{l m}} \sum_{j = l}^{n} \sum_{\begin{matrix} k = m \\ k \neq j \end{matrix}}^{n} \int_{τ_{l - 1}}^{τ_{l}} \int_{τ_{m - 1}}^{τ_{m}} \frac{f (t_{1}, t_{2})}{ρ (\hat{β}, X_{j}) ρ (\hat{β}, X_{k})} {d N}_{j} (t_{1}) {d N}_{k} (t_{2}) .

The above leads to the following estimator for B:

\hat{B} = \sum_{l = 1}^{n} \sum_{m = 1}^{n} \frac{1}{n - \min (l, m)} \sum_{j = l}^{n} \sum_{\begin{matrix} k = m \\ k \neq j \end{matrix}}^{n} \int_{τ_{l - 1}}^{τ_{l}} \int_{τ_{m - 1}}^{τ_{m}} \frac{f (t_{1}, t_{2})}{ρ (\hat{β}, X_{j}) ρ (\hat{β}, X_{k})} {d N}_{j} (t_{1}) {d N}_{k} (t_{2}) .

(8)

Note that unlike in the case of estimating A, we estimate B based on all possible pairs of distinct recurrent event processes.
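In computational terms, (8) is a weighted sum over pairs of events taken from two different processes, where the weight depends only on which subintervals [τ_{l−1}, τ_l) and [τ_{m−1}, τ_m) the two events fall in. The sketch below assumes distinct, increasingly sorted ending times and events stored per process in the same order; the name `b_hat` and the data layout are hypothetical. Since two different processes cannot both have events in the last subinterval, the weight 1/{n − min(l, m)} is always well defined.

```python
from bisect import bisect_right


def b_hat(events, taus, rho, f):
    """Cross-process estimator following (8).

    events : events[i] holds the event times of process i, all in [0, taus[i])
    taus   : strictly increasing ending times tau_1 < ... < tau_n
    rho    : fitted values rho(beta_hat, X_i), one per process
    f      : symmetric nonnegative function of two time points
    """
    n = len(taus)

    def interval_index(t):
        # 1-based index l such that tau_{l-1} <= t < tau_l, with tau_0 = 0
        return bisect_right(taus, t) + 1

    total = 0.0
    for j in range(n):
        for k in range(n):
            if j == k:  # only pairs of distinct processes are used
                continue
            for t1 in events[j]:
                l = interval_index(t1)
                for t2 in events[k]:
                    m = interval_index(t2)
                    weight = 1.0 / (n - min(l, m))
                    total += weight * f(t1, t2) / (rho[j] * rho[k])
    return total
```

Together with an estimate of A, the ratio Â/B̂ can then be evaluated on a grid of r values when f is the indicator function in (6).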

A major novelty of this paper is that the proposed estimators do not require estimating the unknown baseline rate function λ₀(·). This is desirable because the estimation of λ₀(·) can be problematic especially when the number of processes is small.

3.2 Theoretical justification

Let α < 1 be a positive real value. Given the conditions in web Appendix A, the following two theorems give the mean squared errors of Â and B̂, respectively.

Theorem 1

Assume that conditions (A.1)–(A.4), (A.6)–(A.9) in web Appendix A are true. Then

(\hat{A} - A)^{2} = O_{p} \{ (\sum_{i = 1}^{n} τ_{i})^{1 + α} \} + o_{p} \{ (\sum_{i = 1}^{n} τ_{i})^{2} \} .

Proof

See web Appendix B.

Theorem 2

Assume that conditions (A.1)–(A.8) in web Appendix A are true. Then

(\hat{B} - B)^{2} = O_{p} \{ (n \log n) τ_{n} + τ_{n} (\sum_{i = 1}^{n} τ_{i})^{α} \} + o_{p} \{ (\sum_{i = 1}^{n} τ_{i})^{2} \} .

Proof

See web Appendix C.

Assume that both A and B are of order $\sum_{i = 1}^{n} τ_{i}$. Then Theorems 1 and 2 suggest that both Â/A and B̂/B converge to one in probability, if

\sum_{i = 1}^{n} τ_{i} \to \infty \quad \text{and} \quad \frac{(n \log n) τ_{n}}{(\sum_{i = 1}^{n} τ_{i})^{2}} \to 0 .

The above result has important practical implications. Specifically, it says that if the lengths of the observation intervals (i.e., τ_i’s) are small but the number of independent recurrent event processes (i.e., n) is large, then the two integrals A and B can still be estimated accurately. This is also true if τ_i’s are large, even though n may be small. In the most extreme case, consistent estimators for A and B can still be obtained with n = 2. Thus, our asymptotic framework is a hybrid of the commonly used asymptotic framework in spatial statistics including spatial point processes, which requires the size of the observation window to increase to infinity (Cressie, 1993), and that in recurrent event processes, which assumes that the number of independent processes n increases to infinity (Lin et al., 2000). Such an asymptotic framework is rarely considered in practice.
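To see how the two regimes described above satisfy the rate condition, consider an idealized case that is an assumption made here for illustration only: all ending times are of comparable size, c₁τ ≤ τ_i ≤ c₂τ for constants 0 < c₁ ≤ c₂ and a common scale τ.

```latex
\begin{aligned}
\frac{(n \log n)\,\tau_n}{\bigl(\sum_{i=1}^{n} \tau_i\bigr)^{2}}
  &\le \frac{(n \log n)\, c_2 \tau}{(n c_1 \tau)^{2}}
   = \frac{c_2}{c_1^{2}} \cdot \frac{\log n}{n \tau}, \\
\frac{\log n}{n \tau} &\to 0
  \quad \text{if } n \to \infty \text{ with } \tau \text{ fixed,}
  \quad \text{or if } \tau \to \infty \text{ with } n \ge 2 \text{ fixed.}
\end{aligned}
```

In both cases the sum of the τ_i is at least n c₁ τ and therefore diverges, so both parts of the condition hold. With n = 2 the bound is proportional to (log 2)/(2τ), which vanishes as τ grows, matching the extreme case mentioned above.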

4. Graphical Assessment of Poisson Assumption

4.1 The null hypothesis

Let γ_i > 0 be an unobserved random effect for N_i. We are interested in testing the null hypothesis that conditional on γ_i and X_i, N_i is a Poisson process and has the rate function

λ (t; N_{i} ∣ γ_{i}) = λ_{0} (t) γ_{i} ρ (β, X_{i}) \equiv λ_{0} (t) θ_{i} .

(9)

In general, θ_i can be any positive function that involves γ_i and X_i. We in fact do not need to know the exact form of this function in order to conduct the test.

In terms of model (2), γ_i = exp(u_i) and $ρ (β, X_{i}) = \exp (X_{i}^{T} β)$, and testing (9) is equivalent to testing whether η = 0. It is worth noting that when η ≠ 0, model estimation, in particular maximum likelihood estimation, can be quite challenging, because Z_i(t) is an infinite-dimensional process (Møller and Waagepetersen, 2004). In contrast, model (2) with η = 0 is simply a random frailty model and can therefore be estimated with relative ease. This example shows the potential benefits of conducting the Poisson test.

We will perform the test based on the second-order statistics defined in the last section. To be specific, let Ĝ = Â/B̂, where Â and B̂ are defined in (7) and (8) with f(t₁, t₂) = I(|t₁ − t₂| ≤ r). We rewrite Ĝ as Ĝ(r) to make explicit its dependence on r. In light of the consistency results given in Theorems 1 and 2, we expect Ĝ(r) ≈ c for some unknown constant c under the null hypothesis. However, if N_i’s are clustered (regular), we then expect Ĝ(r) to take larger (smaller) values at some small lag values r.

4.2 The proposed procedure

We propose to generate permutated replicates, say Ĝ_l(r): l = 1, ···, L, for some positive integer L. The original statistic Ĝ(r) can then be plotted together with the permutation envelope obtained from the replicates Ĝ_l(r) in order to assess evidence of departure from the null hypothesis. Specifically, we will conclude that the processes are clustered if Ĝ(r) falls above the permutation envelope and regular if it falls below.

To that end, we first note that the union of all recurrent event processes, which we refer to as the combined process, is also a Poisson process but has a different rate function $λ (t) = λ_{0} (t) \sum_{i = 1}^{n} θ_{i}$ . The original n recurrent event processes can be viewed as a result of randomly labeling events in the combined process into n processes. To better illustrate this point, assume that the ending times satisfy 0 = τ₀ < τ₁ < τ₂ < ··· < τ_n. Because of (9), an event t ∈ [τ_j₋₁, τ_j) is assigned to N_i with probability $θ_{i} / \sum_{k = j}^{n} θ_{k}$ if i ≥ j and with probability zero if i < j. Note that these probabilities are the same for all events in [τ_j₋₁, τ_j).

Let m_ij denote the number of events in [τ_j₋₁, τ_j) from N_i and define $m_{i} = \sum_{j = 1}^{n} m_{i j}$ . An obvious approach to generate permutated replicates for the original processes is to randomly select m_ij events from all events in [τ_j₋₁, τ_j) with equal probability for all possible i and j. This would then generate n new processes that have the same distribution as the original ones, given that they share the same m_ij values. However, if the subintervals [τ_j₋₁, τ_j) are all small, e.g., τ_j = j/n, then the resulting permutated replicates may follow the micro-structure of the original processes too closely so that Ĝ _l(r): l = 1, ···, L, are all very similar to Ĝ(r); this may lead to a test with little power.

We instead propose to generate permutated replicates conditional on the total number of events (i.e., m_i) in each process. The rationale is as follows. First, note that for an arbitrary event t ∈ [0, τ₁) in the combined process, the probability that it comes from N₁ is $θ_{1} / \sum_{i = 1}^{n} θ_{i}$ . In other words, all events observed in [0, τ₁) have the same probability for being from N₁. We may thus select a simple random sample of m₁ events from all events in [0, τ₁) as a replicate for N₁. The remaining $\sum_{i = 2}^{n} m_{i}$ events will then form a replicate for $\cup_{i = 2}^{n} N_{i}$ . We next generate a replicate for N₂. This can be done by selecting m₂ random samples from the remaining events in [0, τ₂), because every such event has the same probability $θ_{2} / \sum_{i = 2}^{n} θ_{i}$ to be from N₂. The process can be repeated until the last m_n events are selected as a replicate for N_n. Under the null hypothesis and conditional on the total number of events in each process, the resulting n new recurrent event processes have the same distribution as the original ones.

It is worth noting that the proposed permutation procedure is independent of the assumed parametric model in terms of X_i. As such, it is valid even if the fitted parametric part of the model is wrong. Moreover, because the X_i’s are fixed, the estimate for β typically depends only on the total number of events in each process; the same β̂ will therefore be used given that the m_i’s are fixed. Our proposed testing procedure can be summarized in three steps:

Step 1. Fit the multiplicative rate function model (1) to the data and calculate Ĝ(r).
Step 2. Set m₀ = 0. Randomly select m_j events from all available events in [0, τ_j), assign the selected m_j events as a permutated sample for the jth process, and remove them from the available events. Repeat this for j = 1, ···, n to obtain a permutated sample for all processes; calculate Ĝ_l(r) based on the lth permutated sample, for l = 1, ···, L.
Step 3. Plot Ĝ(r) and the permutation envelope based on Ĝ_l(r): l = 1, ···, L.
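Step 2 can be sketched in Python as follows. This is an illustrative sketch rather than the author's implementation: the function name `permute_processes` and the list-based data layout are hypothetical, and the processes are visited in order of increasing ending time as the procedure requires.

```python
import random


def permute_processes(events, taus, rng):
    """Generate one permutated replicate conditional on the event counts m_i.

    events : events[i] holds the event times of process i, all in [0, taus[i])
    taus   : ending times, one per process (assumed distinct)
    rng    : a random.Random instance
    """
    order = sorted(range(len(taus)), key=lambda i: taus[i])
    pool = sorted(t for times in events for t in times)  # the combined process
    replicate = [None] * len(events)
    for i in order:
        # events still available that fall inside [0, tau_i)
        eligible = [idx for idx, t in enumerate(pool) if t < taus[i]]
        chosen = set(rng.sample(eligible, len(events[i])))
        replicate[i] = sorted(pool[idx] for idx in chosen)
        # remove the selected events before moving to the next process
        pool = [t for idx, t in enumerate(pool) if idx not in chosen]
    return replicate
```

Enough eligible events always remain: the processes with the j smallest ending times jointly contribute m₁ + ··· + m_j events before τ_j, and only m₁ + ··· + m_{j−1} of the pooled events have been removed when the jth process is filled. Calling the function L times with the same `rng` yields the L replicates from which Ĝ_l(r) is computed.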

4.3 Comparison with existing methods

There is a rich literature available on testing for the Poisson assumption, often given under the names of testing for overdispersion or homogeneity; for example, see Dean (1992), Lambert and Roeder (1995), and Ng and Cook (1999). For most of the existing methods, a rejection of the Poisson assumption is interpreted as the presence of unobserved heterogeneity between subjects, i.e., a larger-than-zero variance for the random effect term γ_i given in (9). However, this can be overly simplistic, because the presence of within-process correlation, as is the case with η ≠ 0 in model (2), could also lead to overdispersion in the total counts of events from each process and hence to a rejection of the Poisson assumption. Our proposed testing procedure can be a valuable supplement to the existing approaches by testing the within-process correlation. Moreover, most available tests are expected to be sensitive to misspecification of the parametric part of model (1). In contrast, our test can maintain the correct test size as long as the more general form (9) is valid.

5. Simulation

We have conducted a simulation study to evaluate the finite sample performance of the proposed estimators. Specifically, we have generated n independent recurrent event processes from either Poisson or Poisson cluster processes over the time intervals [0, $a τ_{i}^{*}$): i = 1, ···, n, where (n, a) = (128, 1), (128, 4) or (512, 1) and the $τ_{i}^{*}$’s are the ending times in the CGD data. The minimum and maximum of the $τ_{i}^{*}$’s are equal to 91 and 439, respectively.

For the Poisson processes, we use the rate function λ_i(t) = 2.5κ exp(trt_iβ) for the ith process, where κ = 0.001, trt_i is a binary treatment variable for the ith process (=1 for treatment and =0 for control) and β = −1.0971 as in the CGD data (Lin et al., 2000). For the Poisson cluster processes, we first simulate a Poisson process with intensity κ = 0.001 as the parent process, and then generate a Poisson random variable with mean equal to 2.5 exp(trt_iβ) for the number of offspring that each parent generates. The position of each offspring relative to its parent is then determined by a Gaussian random variable with standard deviation ω = 10 or 20. Note that a smaller ω value leads to a more clustered process. The resulting rate function for the ith process is the same as that in the Poisson case. For n = 128 and a = 1, the expected numbers of events in all cases are 16 and 46 for the treatment and control groups, respectively. Note that the total number of simulated events in each simulation is slightly smaller than, but nevertheless comparable to, that for the CGD data given in Section 6.1 (=76). For (n, a) = (128, 4) or (512, 1), the expected numbers of events become 64 and 184.
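One way to simulate a single Poisson cluster process of this kind is sketched below using only the Python standard library. Several details are assumptions made for illustration and are not stated in the text: parents are generated on a window padded by 4ω on each side to reduce edge effects, offspring falling outside [0, τ) are discarded, and the names `poisson_count` and `simulate_cluster_process` are hypothetical.

```python
import random


def poisson_count(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate exponential arrivals."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_cluster_process(tau, kappa, mean_offspring, omega, rng, pad=None):
    """Simulate one Poisson cluster process observed on [0, tau).

    kappa          : intensity of the parent Poisson process
    mean_offspring : mean of the Poisson number of offspring per parent
    omega          : standard deviation of the Gaussian displacement
    pad            : assumed padding of the parent window (default 4 * omega)
    """
    pad = 4.0 * omega if pad is None else pad
    events = []
    parent = -pad + rng.expovariate(kappa)  # first parent in the padded window
    while parent < tau + pad:
        for _ in range(poisson_count(mean_offspring, rng)):
            child = rng.gauss(parent, omega)
            if 0.0 <= child < tau:  # keep only offspring inside the window
                events.append(child)
        parent += rng.expovariate(kappa)
    return sorted(events)
```

For example, `simulate_cluster_process(439.0, 0.001, 2.5, 10.0, random.Random(1))` would correspond to a control-group process with the longest follow-up under ω = 10; only the offspring, not the parents, form the observed process.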

For each simulated set of n realizations, we calculate Â, B̂ and Ĝ with f(t₁, t₂) = I(|t₁ − t₂| ≤ r), where r = 20, 40, 80. Table 1 shows the empirical bias and standard deviation in each case from 1000 simulations. Both the bias and standard deviation have been divided by the target parameter for a fair comparison. The biases for all three estimators are typically small, especially when n = 512 or a = 4, but appear to be larger for Ĝ and when r = 20. The larger bias for Ĝ is expected given that Ĝ is defined as the ratio of Â and B̂; the larger bias when r = 20 is probably due to the fact that fewer events are separated by a distance less than or equal to 20. The standard deviations are much smaller when n = 512 or a = 4 than when (n, a) = (128, 1). Combining this result with the fact that the biases are also smaller in these cases, we can conclude that the accuracy of the proposed estimators improves as n and a increase. This provides support for our theoretical results given in Theorems 1 and 2. In general, the standard deviations become smaller when r increases, especially when (n, a) = (128, 1). Moreover, the standard deviations typically decrease for Â and Ĝ but increase for B̂ when the clustering becomes stronger.

Table 1.

Empirical biases and standard deviations (STDs) of Â, B̂ and Ĝ = Â/B̂ from 1,000 simulations, with f(t₁, t₂) = I(|t₁ − t₂| ≤ r) for r = 20, 40, 80. PCP₁ and PCP₂ stand for the Poisson cluster process with ω = 20 and 10, respectively.

	Process	(n, a)	Â, r = 20	Â, r = 40	Â, r = 80	B̂, r = 20	B̂, r = 40	B̂, r = 80	Ĝ, r = 20	Ĝ, r = 40	Ĝ, r = 80
BIAS	Poisson	(128,1)	.031	.000	.001	.005	.004	.006	.040	−.006	.005
		(128,4)	.005	.006	−.003	.004	.005	.004	.004	.001	−.006
		(512,1)	−.009	−.016	−.012	.004	.003	.003	−.009	−.017	−.012
	PCP₁	(128,1)	.027	.010	.002	.018	.018	.017	.082	.050	.037
		(128,4)	.015	.017	.016	.012	.012	.013	.022	.022	.017
		(512,1)	−.004	−.001	−.002	.000	.001	.000	.011	.012	.011
	PCP₂	(128,1)	.014	.008	.007	.025	.024	.022	.074	.052	.043
		(128,4)	.016	.017	.016	.007	.011	.013	.030	.024	.019
		(512,1)	.005	.004	.001	.002	.002	.002	.018	.016	.013
STD	Poisson	(128,1)	1.322	.906	.668	.302	.293	.293	1.274	.819	.617
		(128,4)	.569	.410	.321	.174	.171	.169	.541	.377	.276
		(512,1)	.559	.408	.306	.151	.151	.152	.540	.380	.276
	PCP₁	(128,1)	.676	.615	.598	.571	.565	.560	.527	.394	.362
		(128,4)	.329	.315	.312	.313	.311	.308	.210	.179	.163
		(512,1)	.277	.265	.260	.271	.271	.273	.188	.158	.145
	PCP₂	(128,1)	.620	.603	.600	.608	.588	.575	.433	.377	.361
		(128,4)	.314	.311	.310	.314	.313	.311	.200	.175	.164
		(512,1)	.271	.265	.266	.279	.278	.278	.162	.151	.148

We have also evaluated the performance of our proposed test procedure given in Section 4.2. The detailed results are presented in web Appendix D. Briefly speaking, our test procedure has high power in detecting clustering even with a remarkably small sample size. The power increases as the sample size increases, especially as a increases, but decreases as ω increases, i.e., when the clustering strength decreases. The maximum power is typically achieved at r ≈ 3ω. When the lag is set to be too large or too small, the power can be greatly reduced. For the Poisson cluster processes, the dependence beyond a lag distance of 4ω is often negligible; hence it is desirable to base the test on a lag value that is slightly smaller than the dependence range (≈ 4ω in this case). In practice, we can roughly estimate the dependence range based on a plot of the empirical pair correlation function.

The size of our test is not affected by whether the parametric part of the multiplicative model (1) is correctly specified or not, and does not appear to be very sensitive to modest departures from the overall multiplicative structure either. In terms of power, we have surprisingly found that an incorrect choice of β = 0 has led to a better power in our simulation; hence, it may be sufficient to directly test for clustering without first modeling the potential heterogeneity across subjects. Finally, we note that although our test is effective in detecting clustering, it generally has a low power in detecting regularity especially when the sample size is relatively small. This is because the lower permutation envelope is bounded by zero.

6. Applications to Real Data

6.1 Application to chronic granulomatous disease data

Chronic granulomatous disease is a group of inherited rare disorders of the immune function that are characterized by recurrent pyogenic infections. The data being considered here came from a placebo controlled trial of gamma interferon in CGD and consist of recurring pyogenic infection time for 128 patients. We define the rate function as λ_i(t) = λ₀(t) exp(trt_iβ), where β̂ = −1.0971 is obtained by the estimating equation approach given in Lin et al. (2000).

We apply our proposed graphical procedures in Section 4 to assess whether an individual’s recurring pyogenic infection times follow a Poisson process. To that end, we calculate the function Ĝ(r) and its permutated replicates Ĝ_l(r): l = 1, ···, 49, for 0 < r ≤ 40. Figure 1 plots Ĝ(r) and the associated permutation envelope defined as the maximum and minimum of Ĝ_l(r): l = 1, ···, 49 for each given r value. Under the Poisson assumption, the use of 49 permutated replicates implies that Ĝ(r) has a 2/50 chance of falling outside the permutation envelope for each given r. Because Ĝ(r) is completely included in the permutation envelope, we find no evidence against the null hypothesis. Note that the permutation envelope is quite wide, especially at the beginning. This is a result of the very small number (=76) of events being observed. Nevertheless, Ĝ(r) appears to be quite flat, indicating good compatibility with the Poisson assumption. This result justifies the use of a random effect model (Ng and Cook, 1999), as opposed to a more complex model.
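The pointwise envelope and its nominal level can be computed as in the sketch below. It assumes that Ĝ(r) and each replicate Ĝ_l(r) have already been evaluated on a common grid of r values and stored as equal-length lists; the function names are hypothetical.

```python
def permutation_envelope(replicates):
    """Pointwise minimum and maximum over replicate curves on a common r grid."""
    lower = [min(values) for values in zip(*replicates)]
    upper = [max(values) for values in zip(*replicates)]
    return lower, upper


def flag_departures(observed, lower, upper):
    """Label each grid point by where the observed curve lies."""
    flags = []
    for g, lo, hi in zip(observed, lower, upper):
        if g > hi:
            flags.append("clustered")  # above the envelope
        elif g < lo:
            flags.append("regular")    # below the envelope
        else:
            flags.append("inside")
    return flags


def pointwise_level(num_replicates):
    """Chance that the observed value is the largest or smallest of L + 1
    exchangeable values at a fixed r, i.e. 2 / (L + 1)."""
    return 2.0 / (num_replicates + 1)
```

With 49 replicates, `pointwise_level(49)` gives the 2/50 chance quoted above. The level applies to each r separately, not to the whole curve simultaneously.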

Figure 1. Histograms of counts of events and plots of Ĝ(r) for the CGD data and the simulated Poisson cluster process data, where the counts are the total number of events for each recurrent event process. In plots (b) and (d), the solid lines are for Ĝ(r), the dashed lines are the permutation envelopes, and the dotted lines are the line of constant one.

To further illustrate the power of our test procedure, we apply it to 128 realizations from the Poisson cluster processes considered in Section 5 with κ = 0.001 and ω = 10. The total number of simulated events is comparable to that in the real data example (see the two histograms in Figure 1). In contrast to the real data example, however, Ĝ(r) now falls completely above the permutation envelope, indicating strong evidence of clustering (see Figure 1). This is quite remarkable given such a small sample size. Note also that the shapes of the permutation envelopes for the real data and the simulated data are very similar, in that both are very wide at the beginning.

6.2 Application to childhood meningococcal disease data

Meningococcal disease is a severe bacterial infection of the bloodstream or meninges that is caused by the meningococcus germ. Infants and children are particularly vulnerable to being infected. The data being considered here consist of time of diagnosis for 864 meningococcal disease patients of age 0–15 living in Merseyside, UK during the period January 1, 1981 to December 31, 2007. For each case, the full unit post-code of residence is also available. Each post-code is converted to a grid reference using the online tool GeoConvert, developed by the Census Dissemination Unit at the University of Manchester, UK (http://cdu.mimas.ac.uk).

There are 182 Middle Level Super Output Areas (MSOAs) in Merseyside, belonging to five local authorities, namely Knowsley, Liverpool, St Helens, Sefton and Wirral, that make up the region. Each MSOA contains at least 5000 residents and 2000 households and will be treated as a reporting unit in our analysis. This in turn leads to 182 recurrent event processes. For comparison, we also use the five local authorities as the reporting units; this corresponds to the case of n being small in our asymptotic framework given in Section 3.

The main aims of our analysis are to assess whether the reported time of diagnosis was random (i.e., Poisson) in each reporting unit and to understand the dependence if it was not. Clustering is suspected given the contagious nature of the disease. To formally verify this, we calculate the function Ĝ(r) and its permuted replicates Ĝ_l(r): l = 1, ···, 49, for 0 < r ≤ 40 days. The first plot of Figure 2 shows the resulting Ĝ(r) and the associated permutation envelope. Because Ĝ(r) exceeds the upper permutation envelope at multiple locations for r < 20, we conclude that there is strong evidence of clustering in the reported time of diagnosis. Our analysis is novel in the sense that we require only (9) and do not need to assume any specific model for the reporting patterns in order to perform the test.

Figure 2. Plots of Ĝ(r) (solid lines) and the associated permutation envelopes (dashed lines) for the CMD data. The reporting units are MSOAs for plot (a) and local authorities for plot (b). The dotted line in each plot is the line of constant one.

Figure 3 plots the estimated pair correlation function, ĝ(r), for 0 < r ≤ 60 days using bandwidth h = 19. Note that ĝ(r) decreases with r for r ≤ 25 and becomes relatively flat for r > 25. The former again suggests clustering in the time of diagnosis within a given reporting unit, whereas the latter suggests that the correlation diminishes roughly after r = 25. Combining these facts, we conclude that a meningococcal disease case is more likely to lead to another case in the same reporting unit within a short period (say ≤ 10 days) after its diagnosis, but is expected to have no significant effect beyond 25 days. Such a conclusion cannot be reached from the plot of Ĝ(r). Note also that ĝ(r) is consistently larger than one, which indicates strong between-unit variation. As a result, variables specific to a reporting unit, such as socioeconomic variables, should be considered in order to explain this variability.

Figure 3. Plot of the empirical pair correlation ĝ(r) for the CMD data with h = 19.

The second plot of Figure 2 shows the function Ĝ(r) and its associated permutation envelope when the local authorities are used as the reporting units. The plot suggests no evidence of clustering but strong evidence of between-unit variation. The former is due to the facts that temporal clustering is expected to be significant only at relatively small spatial scales and that the use of larger reporting units tends to dilute the evidence of clustering. The latter is a result of the high heterogeneity in social composition and degree of urbanization among the five local authorities, which could affect the risk of developing meningococcal disease (Diggle et al., 2010). The purpose of this analysis is simply to show what one should expect when the between-process variation dominates the within-process clustering.

7. Discussion

We have introduced the new concept of second-order analysis for semiparametric recurrent event processes. We have applied the proposed methods to simulated data and to two real data examples, and have shown through these applications that they can provide valuable information on the dependence structure of individual recurrent event processes. Specifically, for the CGD example, we have concluded that an individual’s recurrent pyogenic infection times are compatible with a Poisson process. This justifies the use of the popular random effect models for this dataset. For the CMD example, we have formally confirmed the existence of clustering in the diagnosis time of meningococcal disease cases, and revealed that the clustering is highly significant at lag distances less than or equal to 10 days but diminishes after a lag distance of 25 days. We do not need to model the potentially heterogeneous temporal trend for either example, which is a great strength of our proposed methods.

Our main theoretical results are derived under a flexible asymptotic framework that combines the main features of those commonly used in both spatial statistics and recurrent event processes. Consequently, the proposed methods based on these theoretical results can be used in many different settings, including processes that are either individually sparse but have many replicates or individually rich in events but have few replicates.

Our theoretical results are currently limited to consistency of the proposed estimators for the two second-order integrals defined in Section 2.3. We conjecture that the joint distribution of these estimators will be asymptotically normal under suitable conditions. However, a proof of this result will require advanced statistical theory on either empirical processes (e.g., Lin et al., 2000) or mixing conditions (e.g., Guan and Sherman, 2007) or both, depending on the exact type of asymptotics to be considered. Our preliminary results also show that the asymptotic covariance involves a large number of complex integrals in terms of the rate functions up to the fourth order. It is nontrivial to develop a consistent yet computationally realistic estimator for the asymptotic covariance. These problems are important but are beyond the scope of this paper. To test the Poisson assumption, we prefer the proposed permutation test over a test based on asymptotic theory, because it is exact and is not affected by the validity of the fitted rate function model.

8. Supplementary Materials

Theoretical results and additional simulation results, referenced in Sections 1, 3.2 and 5, are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

Supplementary Material

Supp Materials

NIHMS267780-supplement-Supp_Materials.pdf (306KB, pdf)

Acknowledgments

This research was supported by NSF grant DMS-0845368 and by NIH grant 1R01DA029081. The CMD data were collected by Alistair Thomson, Omnia Marzouk, Andrew Riordan, Paul Baines, Enitan Carrol, Fauzia Paize, Scott Hackett and Niten Makwana, Alder Hey Children’s Hospital, Merseyside. The authors thank the reviewers for their constructive comments and suggestions, Peter Diggle for helpful discussions, and Michelle Stanton for organizing the CMD data.

Appendix A: Expressions with Tied Ending Time

The expression of Â remains unchanged, so we consider only B̂. For ease of presentation, write $\rho_{i}(\beta)$ for $\rho(\beta, X_{i})$. Assume that $0 = \tau_{(0)} < \tau_{(1)} < \cdots < \tau_{(n^{*})}$ are the distinct ending times, where $n^{*} < n$. Let $n_{(i)}$ denote the number of point processes with ending time equal to $\tau_{(i)}$. Define $R_{(l)} = \sum_{i = l}^{n^{*}} n_{(i)}$, $L_{(l)} = n - R_{(l)}$ and

$$B_{(l)(m)} = \int_{\tau_{(l - 1)}}^{\tau_{(l)}} \int_{\tau_{(m - 1)}}^{\tau_{(m)}} \lambda_{0}(t_{1}) \lambda_{0}(t_{2}) f(t_{1}, t_{2}) \, dt_{1} \, dt_{2} .$$

Then,

$$B = \sum_{i = 1}^{n^{*}} n_{(i)} \sum_{l = 1}^{i} \sum_{m = 1}^{i} B_{(l)(m)} = 2 \sum_{l = 1}^{n^{*}} \sum_{m < l} R_{(l)} B_{(l)(m)} + \sum_{l = 1}^{n^{*}} R_{(l)} B_{(l)(l)} .$$

For $m \leq l$, we estimate $B_{(l)(m)}$ by

$$\hat{B}_{(l)(m)} = \frac{1}{R_{(l)} (R_{(m)} - 1)} \sum_{j = L_{(l)} + 1}^{n} \sum_{\substack{k = L_{(m)} + 1 \\ k \neq j}}^{n} \int_{\tau_{(l - 1)}}^{\tau_{(l)}} \int_{\tau_{(m - 1)}}^{\tau_{(m)}} \frac{f(t_{1}, t_{2})}{\rho_{j}(\hat{\beta}) \rho_{k}(\hat{\beta})} N_{j}(dt_{1}) N_{k}(dt_{2}) .$$
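As a sanity check on the bookkeeping in this appendix, the following sketch computes the tie counts n_(i), R_(l) and L_(l) from a list of ending times and verifies numerically that the two expressions for B agree. It is only an illustration under stated assumptions: the function names are hypothetical, the block values B_(l)(m) are arbitrary placeholder numbers rather than integrals of any fitted rate function, and the regrouped form is checked for a symmetric block matrix, i.e. assuming f(t₁, t₂) = f(t₂, t₁).

```python
import numpy as np

def tie_summaries(end_times):
    """Distinct ending times with the counts n_(i), R_(l) and L_(l)."""
    end_times = np.asarray(end_times, dtype=float)
    tau, n_tied = np.unique(end_times, return_counts=True)
    # R_(l) = sum over i >= l of n_(i): a reverse cumulative sum.
    R = np.cumsum(n_tied[::-1])[::-1]
    # L_(l) = n - R_(l): processes that ended strictly before tau_(l).
    L = end_times.size - R
    return tau, n_tied, R, L

def B_direct(n_tied, B_blocks):
    """First expression: sum_i n_(i) * sum_{l, m <= i} B_(l)(m)."""
    total = 0.0
    for i in range(len(n_tied)):
        total += n_tied[i] * B_blocks[: i + 1, : i + 1].sum()
    return total

def B_regrouped(R, B_blocks):
    """Second expression: 2 * sum_{m < l} R_(l) B_(l)(m) + sum_l R_(l) B_(l)(l).

    Matches B_direct only when B_blocks is symmetric (assumption).
    """
    below_diag = np.tril(B_blocks, k=-1)  # entries with m < l
    off_diag = 2.0 * (R[:, None] * below_diag).sum()
    diag = (R * np.diag(B_blocks)).sum()
    return off_diag + diag

if __name__ == "__main__":
    # Six hypothetical processes with three distinct ending times.
    tau, n_tied, R, L = tie_summaries([2, 5, 5, 7, 7, 7])
    rng = np.random.default_rng(1)
    half = rng.random((tau.size, tau.size))
    B_blocks = half + half.T  # arbitrary symmetric placeholder values
    print(n_tied, R, L)
    print(np.isclose(B_direct(n_tied, B_blocks), B_regrouped(R, B_blocks)))
```

The regrouping works because each block B_(l)(m) is counted once for every process whose ending time is at least τ_(max(l, m)), and there are R_(max(l, m)) such processes.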

References

Andersen PK, Gill RD. Cox’s regression model for counting processes: A large sample study (Com: P1121–1124). The Annals of Statistics. 1982;10:1100–1120.
Cox DR. Some statistical models related with series of events. Journal of the Royal Statistical Society, Series B. 1955;17:129–164.
Cressie NAC. Statistics for Spatial Data. 2nd ed. New York: John Wiley & Sons; 1993.
Daley DJ, Vere-Jones D. An Introduction to the Theory of Point Processes, Volume II: General Theory and Structure. 2nd ed. New York: Springer-Verlag; 2008.
Dean CB. Testing for overdispersion in Poisson and binomial regression models. Journal of the American Statistical Association. 1992;87:451–457.
Diggle PJ. Statistical Analysis of Spatial Point Patterns. 2nd ed. London: Edward Arnold; 2003.
Diggle PJ, Guan Y, Hart AC, Paize F, Stanton M. Estimating individual-level risk in spatial epidemiology using spatially aggregated information on the population at risk. Journal of the American Statistical Association. 2010. doi: 10.1198/jasa.2010.ap09323. To appear.
Guan Y, Sherman M. On least squares fitting for stationary spatial point processes. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 2007;69(1):31–49.
Henderson R, Diggle P, Dobson A. Joint modelling of longitudinal measurements and event time data. Biostatistics (Oxford). 2000;1(4):465–480. doi: 10.1093/biostatistics/1.4.465.
Lambert D, Roeder K. Overdispersion diagnostics for generalized linear models. Journal of the American Statistical Association. 1995;90:1225–1236.
Lawless JF, Nadeau C. Some simple robust methods for the analysis of recurrent events. Technometrics. 1995;37:158–168.
Lin DY, Wei LJ, Yang I, Ying Z. Semiparametric regression for the mean and rate functions of recurrent events. Journal of the Royal Statistical Society, Series B, Methodological. 2000;62(4):711–730.
Møller J, Syversveen AR, Waagepetersen RP. Log Gaussian Cox processes. Scandinavian Journal of Statistics. 1998;25(3):451–482.
Møller J, Waagepetersen RP. Statistical Inference and Simulation for Spatial Point Processes. New York: Chapman & Hall; 2004.
Ng ETM, Cook RJ. Adjusted score tests of homogeneity for Poisson processes. Journal of the American Statistical Association. 1999;94:308–319.
Pepe MS, Cai J. Some graphical displays and marginal regression analyses for recurrent failure times and time dependent covariates. Journal of the American Statistical Association. 1993;88:811–820.

