Semiparametric Relative-risk Regression for Infectious Disease Transmission Data

Eben Kenah

doi:10.1080/01621459.2014.896807

Author manuscript; available in PMC: 2016 Mar 1.

Published in final edited form as: J Am Stat Assoc. 2014 Mar 27;110(509):313–325. doi: 10.1080/01621459.2014.896807

PMCID: PMC4489164 NIHMSID: NIHMS596023 PMID: 26146425

Abstract

This paper introduces semiparametric relative-risk regression models for infectious disease data. The units of analysis in these models are pairs of individuals at risk of transmission. The hazard of infectious contact from i to j consists of a baseline hazard multiplied by a relative risk function that can be a function of infectiousness covariates for i, susceptibility covariates for j, and pairwise covariates. When who-infects-whom is observed, we derive a profile likelihood maximized over all possible baseline hazard functions that is similar to the Cox partial likelihood. When who-infects-whom is not observed, we derive an EM algorithm to maximize the profile likelihood integrated over all possible combinations of who-infected-whom. This extends the most important class of regression models in survival analysis to infectious disease epidemiology. These methods can be implemented in standard statistical software, and they will be able to address important scientific questions about emerging infectious diseases with greater clarity, flexibility, and rigor than current statistical methods allow.

Keywords: Survival analysis, Epidemiology, EM algorithm, Chain-binomial model

1 INTRODUCTION

Infectious diseases are an important threat to human health and commerce, and understanding the transmission of disease is crucial to the design of public health interventions. The statistical analysis of infectious disease data is complicated by the fact that infections are inherently dependent, especially when they are transmitted directly from person to person (Becker, 1989; Andersson and Britton, 2000). Epidemiologists have dealt with this problem in three ways. The most common approach is to model susceptibility to disease using standard statistical methods such as logistic or Cox regression, ignoring disease transmission. A second approach is to use chain binomial models (Rampey et al., 1992), which estimate the probability of escaping infectious contact from infected members of groups such as households, classrooms, or hospital wards. The third approach is to model the spread of disease as a branching process where infectees are the offspring of their infectors (Wallinga and Teunis, 2004; White and Pagano, 2008). The time between the infections of an infector and an infectee is called a generation interval. In this approach, the generation intervals are assumed to be independent and identically distributed (iid).

To understand transmission, it is crucial to separate the effects of covariates on infectiousness and susceptibility from their association with exposure to infected people (Rhodes et al., 1996). Regression models that ignore transmission cannot do this. When disease transmission is modeled as a branching process, uninfected people do not exist and cannot be exposed to infected people. The failure to account for uninfected person-time and competing risks of infection causes several problems with this approach (Svensson, 2007; Kenah et al., 2008). The assumption that generation intervals are iid is difficult to relax, making estimation of covariate effects difficult (Kenah, 2013). Chain binomial models are a statistically sound response to the problem of dependence and can be used to estimate covariate effects. However, their use is limited in two ways. First, they are not implemented in standard statistical software, a problem solved partially by the publicly available package TranStat (www.epimodels.org/midas/transtat.do). Second, they use discrete time. Since infectious disease data are usually recorded by the day or week, this is not unnatural. However, continuous-time models corrected for ties may offer a more flexible modeling framework.

Kenah (2011) extended parametric methods from survival analysis to infectious disease data by modeling the contact interval. In the ordered pair ij, the contact interval τ_ij is the time between the onset of infectiousness in i and the first infectious contact from i to j, where infectious contact is a contact sufficient to infect a susceptible individual. It is right-censored if the infectious period of i ends before i makes infectious contact with j or if j is infected by someone other than i. These methods solve the problem of dependence by treating ordered pairs of individuals, not the individuals themselves, as the units of analysis. Kenah (2013) showed that the contact interval distribution could be estimated nonparametrically by adapting the Nelson-Aalen estimator from standard survival analysis. These methods assume a homogeneous population where the contact interval distribution is the same for all pairs ij in which transmission from i to j is possible. They are unable to estimate covariate effects on transmission, which is a primary goal of vaccine trials, outbreak investigations, and many other studies of infectious disease.

The goal of this paper is to extend the methods of Kenah (2013) to develop a relative-risk regression model similar to that of Cox (1972). This model will allow semiparametric estimation of the effects of covariates on the hazard of infectious contact in pairs of individuals. For the ordered pair ij, the covariate vector can include infectiousness covariates for i, susceptibility covariates for j, and pairwise covariates. This semiparametric regression model will allow many of the most important scientific questions in infectious disease epidemiology to be addressed with greater clarity, flexibility, and rigor.

1.1 Stochastic S(E)IR epidemic model

Consider a closed population of n individuals assigned indices 1 . . . n. Each individual is in one of four states: susceptible (S), exposed (E), infectious (I), or removed (R). Person i moves from S to E at his or her infection time t_i, with t_i = ∞ if i is never infected. After infection i has a latent period of length ε_i, during which he or she is infected but not infectious. At time t_i + ε_i, i moves from E to I, beginning an infectious period of length ι_i. At time t_i + ε_i + ι_i, i moves from I to R. Once in R, i can no longer infect others or be infected. The states and notation are illustrated at the top of Figure 1. The latent period is a nonnegative random variable, the infectious period is a strictly positive random variable, and both have finite mean and variance.

Figure 1. Notation for the stochastic SEIR model natural history (top) and infectious contact process (bottom). In the bottom diagram, the infectious contact interval $τ_{i j}^{*}$ is equal to the contact interval *τ_ij* because *τ_ij* ≤ *ι_i*. Otherwise, we would have $τ_{i j}^{*} = \infty$ and no infectious contact from i to j would occur.

An epidemic begins with one or more persons infected from outside the population, which we call imported infections. The methods in this paper require that the set of imported infections is known. For simplicity, we assume that epidemics begin with one or more imported infections at time 0 and there are no other imported infections.

After becoming infectious at time t_i + ε_i, person i makes infectious contact with j ≠ i at time $t_{i j} = t_{i} + ε_{i} + τ_{i j}^{*}$ , where the infectious contact interval $τ_{i j}^{*}$ is a strictly positive random variable with $τ_{i j}^{*} = \infty$ if infectious contact never occurs. Since infectious contact must occur while i is infectious or never, $τ_{i j}^{*} \in (0, ι_{i}]$ or $τ_{i j}^{*} = \infty$ . We define infectious contact to be a contact sufficient to infect a susceptible person, so t_j ≤ t_ij for all i ≠ j. The infectious contact interval is illustrated at the bottom of Figure 1.

For each ordered pair ij, let C_ij = 1 if infectious contact from i to j is possible and C_ij = 0 otherwise. These C_ij could be the entries in an adjacency matrix for a static contact network. We assume that the infectious contact interval $τ_{i j}^{*}$ is generated in the following way: A contact interval τ_ij is drawn from a distribution with hazard function λ_ij(τ). If τ_ij ≤ ι_i and C_ij = 1, then $τ_{i j}^{*} = τ_{i j}$ . Otherwise, $τ_{i j}^{*} = \infty$ . In this paper, we assume the contact intervals in all ordered pairs ij are independent and have finite mean and variance.
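The generative rule for the infectious contact interval can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used in this paper: the function names are hypothetical, and the constant-hazard (exponential) contact interval is chosen only for brevity.

```python
import math
import random


def draw_infectious_contact_interval(draw_contact_interval, iota_i, c_ij, rng):
    """Return tau*_ij given a sampler for tau_ij, the infectious period iota_i, and C_ij."""
    tau_ij = draw_contact_interval(rng)  # contact interval with hazard lambda_ij
    if c_ij == 1 and tau_ij <= iota_i:
        return tau_ij  # infectious contact occurs while i is infectious
    return math.inf  # infectious contact from i to j never occurs


def exponential_sampler(rate):
    """Hypothetical contact interval sampler with constant hazard `rate` (an assumption)."""
    return lambda rng: rng.expovariate(rate)


rng = random.Random(2014)
sampler = exponential_sampler(0.5)
tau_star = draw_infectious_contact_interval(sampler, iota_i=1.0, c_ij=1, rng=rng)
no_edge = draw_infectious_contact_interval(sampler, iota_i=1.0, c_ij=0, rng=rng)
```

Whatever the seed, `tau_star` is either in (0, 1] or infinite, and `no_edge` is always infinite because C_ij = 0.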

1.2 Observation and censoring

Our population has size n, and we observe the times of all S → E (infection), E → I (onset of infectiousness), and I → R (removal) transitions in the population between time 0 and time T . For all ordered pairs ij such that i is infected, we observe C_ij. We first consider the case where who-infects-whom is observed and then consider the more realistic case where it is not.

We assume that we can observe τ_ij only if j is infected by i at time t_i+ε_i+τ_ij. Clearly, τ_ij can be observed only if C_ij = 1. We also have right-censoring of τ_ij:

1. Since infectious contact can occur only while i is infectious, τ_ij can be right-censored by the infectious period ι_i of i. Let $I_{i} (τ) = 1_{τ \in (0, ι_{i}]}$ indicate whether i remains infectious at infectious age τ.

2. Since j is susceptible to infection by i only if he or she has not been infected by anyone else, τ_ij can be right-censored by t_i′j − t_i − ε_i for i′ ≠ i. Let $S_{i j} (τ) = 1_{t_{i} + ε_{i} + τ \leq t_{j}}$ indicate whether j remains susceptible at infectious age τ of i.

3. Let T denote the time at which observation ends. Then τ_ij can be right-censored by the end of observation at infectious age T − t_i − ε_i of i. Let $Y_{i} (τ) = 1_{t_{i} + ε_{i} + τ \leq T}$ indicate whether observation is ongoing when i reaches infectious age τ.

Since $I_{i} (τ)$ , $S_{i j} (τ)$ , and $Y_{i} (τ)$ are left-continuous,

Y_{i j} (τ) = C_{i j} I_{i} (τ) S_{i j} (τ) Y_{i} (τ)

(1)

is a left-continuous process that indicates the risk of an observed infectious contact from i to j at infectious age τ of i. The assumptions made in the stochastic S(E)IR model above ensure that $I_{i} (τ)$ and $S_{i j} (τ)$ independently censor τ_ij. We also require that T is a stopping time with respect to the observed data such that, for all i, $Y_{i} (τ)$ independently censors τ_ij for each j exposed to infectious contact from i. The possible censoring scenarios are illustrated in Figure 2.
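The at-risk process in equation (1) is a product of four indicators, which the following sketch evaluates for a single ordered pair. The function name and argument layout are assumptions made for illustration.

```python
import math


def at_risk(tau, c_ij, t_i, eps_i, iota_i, t_j, T):
    """Y_ij(tau) from equation (1) for a single ordered pair ij."""
    infectious = 0 < tau <= iota_i          # I_i(tau): i is still infectious
    susceptible = t_i + eps_i + tau <= t_j  # S_ij(tau): j is not yet infected
    observing = t_i + eps_i + tau <= T      # Y_i(tau): observation is ongoing
    return int(c_ij == 1 and infectious and susceptible and observing)


# i becomes infectious at time 0 and stays infectious for 1 time unit;
# j is never infected (t_j = inf); observation ends at T = 10.
example = [at_risk(tau, 1, 0.0, 0.0, 1.0, math.inf, 10.0) for tau in (0.5, 1.0, 1.5)]
```

In this example the pair is at risk at infectious ages 0.5 and 1.0 but not at 1.5, because the infectious period of i has ended.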

Figure 2. The three censoring processes for the contact interval *τ_ij*. The onset of infectiousness in i occurs at time *t_i* + *ε_i*, and the infection of j occurs at *t_j*. At the top, *τ_ij* is censored because $I_{i} (t_{j} - t_{i} - ε_{i}) = 0$ . In the middle, *τ_ij* is observed if $S_{i j} (t_{j} - t_{i} - ε_{i}) = 1$ and censored otherwise. At the bottom, *τ_ij* is censored because $Y_{i} (t_{j} - t_{i} - ε_{i}) = 0$ .

1.3 Transmission trees and infectious sets

Following Wallinga and Teunis (2004), let v_j denote the index of the person who infected person j, with v_j = 0 for imported infections and v_j = ∞ for persons not infected prior to the end of observation. The transmission tree is the directed network with an edge from v_j to j for each j such that t_j ≤ T . It can be represented by a vector v = (v₁, . . . , v_n). Let $V_{j} = \{i : C_{i j} I_{i} (t_{j}) = 1\}$ denote the set of possible infectors of person j, which we call the infectious set of j. Let $V$ denote the set of all v consistent with the observed data. A $v \in V$ can be generated by choosing a $v_{j} \in V_{j}$ for each non-imported infection j.
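As a rough sketch, the infectious sets and one transmission tree consistent with the data could be computed as follows. Several choices here are assumptions rather than part of the paper: indices are 0-based, imported infections are passed as a set instead of being coded v_j = 0, and the infectiousness indicator is evaluated at the infectious age t_j − t_i − ε_i of i.

```python
import math
import random


def infectious_sets(t, eps, iota, C, imported):
    """Map each non-imported infected person j to the list of possible infectors."""
    n = len(t)
    sets = {}
    for j in range(n):
        if j in imported or math.isinf(t[j]):
            continue  # imported or never infected: no infectious set needed
        sets[j] = [i for i in range(n)
                   if i != j and C[i][j] == 1
                   and 0 < t[j] - t[i] - eps[i] <= iota[i]]
    return sets


def sample_tree(sets, rng):
    """Choose one infector from each infectious set (uniformly, for illustration only)."""
    return {j: rng.choice(candidates) for j, candidates in sets.items() if candidates}


t = [0.0, 0.5, 0.8, math.inf]   # infection times; person 3 is never infected
eps = [0.0, 0.0, 0.0, 0.0]      # latent periods
iota = [1.0, 1.0, 1.0, 1.0]     # infectious periods
C = [[int(i != j) for j in range(4)] for i in range(4)]  # everyone can contact everyone
sets = infectious_sets(t, eps, iota, C, imported={0})
tree = sample_tree(sets, random.Random(1))
```

Here person 1 can only have been infected by person 0, while person 2 has two possible infectors. The uniform choice in `sample_tree` only enumerates consistent trees; it does not weight them by their probability.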

2 METHODS

Kenah (2013) described the nonparametric estimation of the contact interval distribution for a homogeneous population. Here, we consider a semiparametric relative-risk model like that of Prentice and Self (1983). Let

λ_{i j} (τ) = r (β_{0}^{T} X_{i j} (τ)) λ_{0} (τ),

(2)

where λ₀(τ) is an unspecified baseline hazard function, $r : \mathbb{R} \to (0, \infty)$ is a relative risk function, β₀ is an unknown b × 1 coefficient vector, and X_ij(τ) is a b × 1 predictable covariate process taking values in a set $\mathcal{X}$ . The covariates X_ij(τ) can include individual-level covariates predicting the infectiousness of i or the susceptibility of j as well as pairwise covariates (e.g., membership in the same household) that predict the hazard of infectious contact from i to j.

We assume that r has continuous first and second derivatives, r(0) = 1, and ln r(β^T X) is bounded on $\mathcal{X}$ . Letting r(x) = exp(x) gives us a loglinear relative risk regression model like that of Cox (1972), and letting r(x) = 1 + x gives us a linear relative risk regression model. To fit these semiparametric models, we adapt the nonparametric estimators from Kenah (2013) to account for the relative risk function.

2.1 Who-infects-whom is observed

Let $N_{i j} (τ) = 1_{τ_{i j} \leq τ}$ indicate whether an observed infectious contact from i to j has occurred by infectious age τ in i, and let $N (τ) = \sum_{j = 1}^{n} \sum_{i \neq j} N_{i j} (τ)$ . Note that we must observe who-infected-whom in order to calculate N(τ).

Let $Λ_{0} (τ) = \int_{0}^{τ} λ_{0} (u) d u$ . Given β, the Breslow estimator (Breslow, 1972) of Λ₀(τ) is

\hat{Λ} (β, τ) = \int_{0}^{τ} \frac{1}{Y (β, u)} d N (u),

(3)

where

Y (β, u) = \sum_{j = 1}^{n} \sum_{i \neq j} r (β^{T} X_{i j} (u)) Y_{i j} (u) .

(4)
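For the loglinear case r(x) = exp(x) with a single time-fixed covariate, equations (3) and (4) reduce to a few sums. The sketch below assumes that each pair is summarised by a hypothetical tuple (x, exit_age, event), where exit_age is the infectious age at which the pair leaves the risk set and event = 1 if that exit is an observed infectious contact. This data layout is an illustration, not the one used in the paper.

```python
import math


def weighted_risk(beta, u, pairs):
    """Y(beta, u): sum of exp(beta * x) over pairs still at risk at infectious age u."""
    return sum(math.exp(beta * x) for x, exit_age, _ in pairs if exit_age >= u)


def breslow(beta, tau, pairs):
    """Breslow estimate of Lambda_0(tau): sum of 1 / Y(beta, u) over contacts observed at u <= tau."""
    return sum(1.0 / weighted_risk(beta, exit_age, pairs)
               for _, exit_age, event in pairs
               if event == 1 and exit_age <= tau)


# (covariate x, exit age, event indicator) for four illustrative pairs
pairs = [(0.0, 0.5, 1), (1.0, 0.8, 0), (1.0, 1.2, 1), (0.0, 2.0, 0)]
null_estimate = breslow(0.0, 2.0, pairs)  # 1/4 + 1/2
```

With β = 0 every pair at risk has weight one, so the estimate jumps by 1/4 at age 0.5 (four pairs at risk) and by 1/2 at age 1.2 (two pairs at risk), which is the Nelson-Aalen estimate.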

The Breslow estimator has two desirable properties. First, $\hat{Λ} (β_{0}, τ)$ is an unbiased estimator of Λ₀(τ). Let

M (β, τ) = N (τ) - \int_{0}^{τ} Y (β, u) λ_{0} (u) d u .

(5)

Then for all τ such that Y (τ) > 0,

\hat{Λ} (β_{0}, τ) - Λ_{0} (τ) = \int_{0}^{τ} \frac{1_{Y (u) > 0}}{Y (β_{0}, u)} d M (β_{0}, u)

(6)

is a mean-zero martingale when β = β₀. Second, $\hat{Λ} (β, τ)$ maximizes the log likelihood

ℓ (β, Λ) = \sum_{j = 1}^{n} \sum_{i \neq j} \ln (r (β^{T} X_{v_{j} j} (τ_{v_{j} j})) d Λ (τ_{v_{j} j})) - \int_{0}^{\infty} Y (β, u) d Λ (u)

(7)

over all step functions Λ(τ). Substituting $\hat{Λ} (β, τ)$ into ℓ(β, Λ) gives the profile log likelihood

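ℓ (β, \hat{Λ}) = (\sum_{j = 1}^{n} \ln \frac{r (β^{T} X_{v_{j} j} (τ_{v_{j} j}))}{Y (β, τ_{v_{j} j})}) - N (\mathcal{T}),

(8)

where $\mathcal{T} = \max \{τ : Y (τ) > 0\}$ is the largest infectious age at which any pair is at risk, so that $\int_{0}^{\infty} Y (β, u) d \hat{Λ} (β, u) = N (\mathcal{T})$ is the total number of observed infectious contacts. The first term is similar to the log partial likelihood from Cox (1972) and the second term does not depend on β. Dropping the second term, let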

p l (β) = \sum_{j = 1}^{n} \ln \frac{r (β^{T} X_{v_{j} j} (τ_{v_{j} j}))}{Y (β, τ_{v_{j} j})}

(9)

be the log partial likelihood for β. This derivation of the partial likelihood as a profile likelihood follows that of Johansen (1983). Let $\hat{β}$ denote the value of β that maximizes pl(β), and let ${\hat{Λ}}_{0} (τ) = \hat{Λ} (\hat{β}, τ)$ denote the corresponding Breslow estimate of the baseline cumulative hazard.
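A minimal sketch of equation (9) for r(x) = exp(x) and one time-fixed covariate follows. The (x, exit_age, event) tuples and the ternary search are assumptions made for illustration; the search relies on the log partial likelihood being concave in β in the loglinear case, and a real analysis would use a Cox regression routine from standard software.

```python
import math


def log_partial_likelihood(beta, pairs):
    """pl(beta): sum over observed contacts of beta * x - ln Y(beta, contact age)."""
    total = 0.0
    for x, exit_age, event in pairs:
        if event == 1:
            risk = sum(math.exp(beta * xk) for xk, ek, _ in pairs if ek >= exit_age)
            total += beta * x - math.log(risk)
    return total


def maximize(f, lo=-5.0, hi=5.0, tol=1e-8):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# (covariate x, exit age, event indicator) for six illustrative pairs
pairs = [(1, 0.3, 1), (0, 0.5, 1), (1, 0.7, 0), (0, 0.9, 1), (1, 1.1, 1), (0, 1.5, 0)]
beta_hat = maximize(lambda b: log_partial_likelihood(b, pairs))
```

At β = 0 the four observed contacts have risk sets of sizes 6, 5, 3, and 2, so pl(0) = −ln 180.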

2.2 Partial likelihood score process

We can rewrite pl(β) as a sum of stochastic integrals:

p l (β) = \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{\infty} \ln \frac{r (β^{T} X_{i j} (u))}{Y (β, u)} d N_{i j} (u) .

(10)

The corresponding score process is

U (β, τ) = \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{τ} \left[ \frac{\partial}{\partial β} \ln r (β^{T} X_{i j} (u)) - E (β, u) \right] d N_{i j} (u),

(11)

where

E (β, u) = \frac{\sum_{j = 1}^{n} \sum_{i \neq j} r (β^{T} X_{i j} (u)) Y_{i j} (u) \frac{\partial}{\partial β} \ln r (β^{T} X_{i j} (u))}{\sum_{j = 1}^{n} \sum_{i \neq j} r (β^{T} X_{i j} (u)) Y_{i j} (u)}

(12)

is the expected value of $\frac{\partial}{\partial β} \ln r (β^{T} X_{i j} (u))$ over the risk set at u when each pair is weighted by its hazard of infectious contact at u. By the Doob-Meyer decomposition, there is a mean-zero martingale M_ij(u) for each ij such that

d N_{i j} (u) = r (β_{0}^{T} X_{i j} (u)) λ_{0} (u) Y_{i j} (u) d u + d M_{i j} (u) .

(13)

Expanding equation (11) using this decomposition and simplifying, we get

U (β_{0}, τ) = \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{τ} \frac{\partial}{\partial β} \ln \frac{r (β_{0}^{T} X_{i j} (u))}{Y (β_{0}, u)} d M_{i j} (u) .

(14)

Since it is a sum of integrals of predictable processes with respect to martingales, U(β₀, τ) is a mean-zero martingale.

2.3 Observed and expected information

Let $v^{\otimes 2} = v v^{T}$ for a column vector v. Since the N_ij(τ) do not jump simultaneously in continuous time, the predictable variation process of U(β₀, τ) is

〈 U (β_{0}) 〉 (τ) = \int_{0}^{τ} V (β_{0}, u) Y (β_{0}, u) λ_{0} (u) d u,

(15)

where

V (β, u) = \sum_{j = 1}^{n} \sum_{i \neq j} {(\frac{\partial}{\partial β} \ln \frac{r (β^{T} X_{i j} (u))}{Y (β, u)})}^{\otimes 2} \frac{r (β^{T} X_{i j} (u)) Y_{i j} (u)}{Y (β, u)}

(16)

is the variance of $\frac{\partial}{\partial β} \ln r (β^{T} X_{i j} (u))$ over the risk set at u when each pair ij is weighted by its hazard of infectious contact at u.

Let $I (β) = - \frac{\partial^{2}}{\partial β^{2}} p l (β)$ be the observed information. Then

I (β) = \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{\infty} \left[ {(\frac{\partial}{\partial β} \ln r (β^{T} X_{i j} (u)))}^{\otimes 2} - E {(β, u)}^{\otimes 2} \right] d N_{i j} (u) - \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{\infty} \left[ \frac{\frac{\partial^{2}}{\partial β^{2}} r (β^{T} X_{i j} (u))}{r (β^{T} X_{i j} (u))} - \frac{\frac{\partial^{2}}{\partial β^{2}} Y (β, u)}{Y (β, u)} \right] d N_{i j} (u) .

(17)

Expanding I(β₀) via the Doob-Meyer decomposition (13) and simplifying, we get

I (β_{0}) = \int_{0}^{\infty} V (β_{0}, u) Y (β_{0}, u) λ_{0} (u) d u + \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{\infty} \frac{\partial^{2}}{\partial β^{2}} \ln \frac{r (β_{0}^{T} X_{i j} (u))}{Y (β_{0}, u)} d M_{i j} (u) .

(18)

The second term has expectation zero, so I(β₀) is an unbiased estimate of Var(U(β₀, ∞)).

Another estimate of Var(U(β₀, ∞)) is obtained by substituting the increments of the Breslow estimator (3) for λ₀(u) du in equation (15). This gives us the estimated expected information

\mathcal{I} (β) = \int_{0}^{\infty} V (β, u) d N (u) .

(19)

Expanding $\mathcal{I} (β_{0})$ using the Doob-Meyer decomposition and simplifying, we get

\mathcal{I} (β_{0}) = \int_{0}^{\infty} V (β_{0}, u) Y (β_{0}, u) λ_{0} (u) d u + \int_{0}^{\infty} V (β_{0}, u) d M (u),

(20)

where $M (τ) = \sum_{j = 1}^{n} \sum_{i \neq j} M_{i j} (τ)$ . The second term has expectation zero, so $\mathcal{I} (β_{0})$ is also an unbiased estimate of the variance of U(β₀, ∞). $\mathcal{I} (β_{0})$ may be a better estimator of Var(U(β₀, ∞)) than I(β₀) because it is guaranteed to be positive semidefinite (Prentice and Self, 1983) and it depends only on aggregates over risk sets (Aalen et al., 2009).

When r(x) = exp(x) as in the Cox model, $\mathcal{I} (β) = I (β)$ for all β. For general r(x), I(β₀) and $\mathcal{I} (β_{0})$ are asymptotically equivalent under weak regularity conditions (see Appendix A online).
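For r(x) = exp(x) and a scalar covariate, the score in equation (11) and the information in equation (19) are sums of risk-set means and variances, so a Newton-Raphson fit takes only a few lines. The sketch below is an illustration under assumptions: the (x, exit_age, event) tuples and function names are hypothetical, and because the observed and estimated expected information coincide in the loglinear case, a single information function serves for both.

```python
import math


def score_and_information(beta, pairs):
    """Score U(beta, inf) and information, summed over observed infectious contacts."""
    score, info = 0.0, 0.0
    for x, exit_age, event in pairs:
        if event != 1:
            continue
        risk = [(xk, math.exp(beta * xk)) for xk, ek, _ in pairs if ek >= exit_age]
        total = sum(w for _, w in risk)
        mean = sum(xk * w for xk, w in risk) / total               # E(beta, u)
        var = sum((xk - mean) ** 2 * w for xk, w in risk) / total  # V(beta, u)
        score += x - mean
        info += var
    return score, info


def newton_raphson(pairs, beta=0.0, tol=1e-10, max_iter=50):
    """Solve U(beta, inf) = 0 by Newton-Raphson steps beta += U / I."""
    for _ in range(max_iter):
        score, info = score_and_information(beta, pairs)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# (covariate x, exit age, event indicator) for six illustrative pairs
pairs = [(1, 0.3, 1), (0, 0.5, 1), (1, 0.7, 0), (0, 0.9, 1), (1, 1.1, 1), (0, 1.5, 0)]
beta_hat = newton_raphson(pairs)
final_score, final_info = score_and_information(beta_hat, pairs)
wald_se = 1.0 / math.sqrt(final_info)  # standard error for a Wald interval
```

The inverse of the information at the fitted coefficient gives the variance estimate used for Wald tests and confidence intervals.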

2.4 Large-sample estimation of β₀ and Λ₀(τ)

Appendix A, available online, outlines sufficient conditions for the asymptotic normality of U(β₀, τ) and $\hat{β}$ as m → ∞, where m is the number of pairs ij at risk of transmission. Under these conditions, hypothesis tests and confidence intervals for β₀ can be obtained using score, Wald, or likelihood ratio statistics. These conditions are very similar to those for the standard Cox model (Prentice and Self, 1983) except for the additional requirement that both the number of susceptibles and the number of pairs be large such that each susceptible is exposed to a number of infectors ≪ m. When a given susceptible j is infected, all pairs ij are censored. If there were many pairs but few susceptibles, each susceptible would be exposed to a very high hazard of infection and most pairs would be censored very quickly. To take an extreme case, imagine a single susceptible exposed to m infecteds. The number of pairs at risk of transmission is m but the susceptible will be infected almost immediately when m is large. After this, there are no more pairs at risk of transmission and we can learn nothing further about β₀ or Λ₀(τ).

Given $\hat{β}$ , the Breslow estimator of $Λ_{0} (τ)$ is ${\hat{Λ}}_{0} (τ) = \hat{Λ} (\hat{β}, τ)$ . Its variance is consistently estimated by

{\hat{σ}}_{0}^{2} (τ) = {(\frac{\partial}{\partial β} \hat{Λ} (\hat{β}, τ))}^{T} I {(\hat{β})}^{- 1} (\frac{\partial}{\partial β} \hat{Λ} (\hat{β}, τ)) + \int_{0}^{τ} \frac{1}{Y {(\hat{β}, u)}^{2}} d N (u),

(21)

which is derived in Appendix B.1. The observed information $I (\hat{β})$ can be replaced by the estimated expected information $\mathcal{I} (\hat{β})$ from equation (19). Using the martingale central limit theorem and a log transformation, we get the approximate pointwise 1 − α confidence limits

{\hat{Λ}}_{0} (τ) \exp (\pm \frac{{\hat{σ}}_{0} (τ)}{{\hat{Λ}}_{0} (τ)} Φ^{- 1} (1 - \frac{α}{2})) .

(22)

Point and interval estimates for the baseline survival function can be obtained using the relationship S₀(τ) = exp(−Λ₀(τ)).
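The log-transformed limits in equation (22) and the corresponding survival limits can be computed as in the following sketch. The function names and the numerical inputs are illustrative assumptions; the standard error would come from equation (21).

```python
import math
from statistics import NormalDist


def cumulative_hazard_limits(cum_hazard, std_error, alpha=0.05):
    """Pointwise 1 - alpha limits for Lambda_0(tau) using a log transformation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    factor = math.exp(z * std_error / cum_hazard)
    return cum_hazard / factor, cum_hazard * factor


def survival_limits(cum_hazard, std_error, alpha=0.05):
    """Limits for S_0(tau) = exp(-Lambda_0(tau)); the endpoints swap order."""
    lower, upper = cumulative_hazard_limits(cum_hazard, std_error, alpha)
    return math.exp(-upper), math.exp(-lower)


# Illustrative values: estimate 0.75 with standard error 0.2
lower, upper = cumulative_hazard_limits(0.75, 0.2)
s_lower, s_upper = survival_limits(0.75, 0.2)
```

Because the interval is symmetric on the log scale, the limits are always positive and their geometric mean equals the point estimate.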

2.5 Who-infects-whom is not observed

When we do not observe who-infected-whom, we do not know which contact intervals are observed and which are censored. It is impossible to calculate the partial likelihood pl(β) or the Breslow estimate $\hat{Λ} (β, τ)$ . In this section, we show how an EM algorithm similar to that of Kenah (2013) can be used to obtain consistent and asymptotically normal estimates of β₀ and Λ₀(τ).

Given β, λ(τ), and the observed information, the probability that j was infected by i is

p_{i j} (β, λ) = \frac{r (β^{T} X_{i j} (t_{j} - t_{i} - ε_{i})) λ (t_{j} - t_{i} - ε_{i}) 1_{i \in V_{j}}}{\sum_{k \in V_{j}} r (β^{T} X_{k j} (t_{j} - t_{k} - ε_{k})) λ (t_{j} - t_{k} - ε_{k})},

(23)

and the infectors of different infected persons can be chosen independently (Kenah et al., 2008). The probability of a transmission network v = (v₁, . . . , v_n) given β, λ(τ), and the observed data is

\Pr (v ∣ β, λ, \text{observed data}) = \prod_{j : 0 < v_{j} < \infty} p_{v_{j} j} (β, λ) .

(24)

Note that the last two equations assume a continuous contact interval distribution, so simultaneous infectious contacts have probability zero.
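Equations (23) and (24) can be illustrated with a short sketch for r(x) = exp(x). The input layout is an assumption: each possible infector i of j is given as a hypothetical tuple (i, x_ij, age_ij), where age_ij = t_j − t_i − ε_i, and the baseline hazard is passed as a function.

```python
import math


def infector_probabilities(candidates, beta, baseline_hazard):
    """Equation (23): probability that each i in the infectious set of j infected j."""
    weights = {i: math.exp(beta * x) * baseline_hazard(age) for i, x, age in candidates}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def tree_probability(v, probabilities):
    """Equation (24): product of p_{v_j j} over the non-imported infections j in v."""
    return math.prod(probabilities[j][i] for j, i in v.items())


# Person 2 has two possible infectors: person 0 (x = 1) and person 1 (x = 0).
# A constant baseline hazard and beta = ln 2 are used purely for illustration.
probabilities = {
    2: infector_probabilities([(0, 1.0, 0.8), (1, 0.0, 0.3)], math.log(2), lambda age: 1.0)
}
p_tree = tree_probability({2: 0}, probabilities)
```

With a hazard ratio of 2 and a constant baseline hazard, person 0 is twice as likely as person 1 to be the infector, so the probabilities are 2/3 and 1/3.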

Let pl_v(β) be the log partial likelihood that we would have calculated had we observed the transmission network v. Given a coefficient vector β^* and a baseline hazard function λ^*(τ), the expected log partial likelihood is

{\tilde{p l}}_{β^{*}, λ^{*}} (β) = \sum_{v \in V} p l_{v} (β) \Pr (v ∣ β^{*}, λ^{*}, \text{observed data}) = \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{T} \ln \frac{r (β^{T} X_{i j} (u))}{Y (β, u)} d {\tilde{N}}_{i j} (u ∣ β^{*}, λ^{*}),

(25)

where ${\tilde{N}}_{i j} (τ ∣ β^{*}, λ^{*}) = p_{i j} (β^{*}, λ^{*}) 1_{τ \geq t_{j} - t_{i} - ε_{i}}$ . Now let N(τ|v) be the value of N(τ) that we would have calculated had we observed the transmission network v. The corresponding Breslow estimate is

{\hat{Λ}}_{v} (β, τ) = \int_{0}^{τ} \frac{1}{Y (β, u)} d N (u ∣ v) .

(26)

The marginal Breslow estimate given β^* and λ^* (τ) is

{\tilde{Λ}}_{β^{*}, λ^{*}} (β, τ) = \sum_{v \in V} {\hat{Λ}}_{v} (β, τ) \Pr (v ∣ β^{*}, λ^{*}, \text{observed data}) = \int_{0}^{τ} \frac{1}{Y (β, u)} d \tilde{N} (u ∣ β^{*}, λ^{*}),

(27)

where $\tilde{N} (τ ∣ β^{*}, λ^{*}) = \sum_{j = 1}^{n} \sum_{i \neq j} {\tilde{N}}_{i j} (τ ∣ β^{*}, λ^{*})$ .

For the relative risk function r(x) = exp(x), the expected log partial likelihood ${\tilde{p l}}_{β^{*}, λ^{*}} (β)$ is the log partial likelihood of a weighted Cox regression model (Therneau and Grambsch, 2000) with two copies of each pair ij: an uncensored copy with weight p_ij(β^* , λ^*) and a censored copy with weight 1 − p_ij(β^* , λ^*). The baseline hazard estimate from this model is the marginal Breslow estimate ${\tilde{Λ}}_{β^{*}, λ^{*}} (\tilde{β}, τ)$ , where $\tilde{β} = \arg \max_{β} {\tilde{p l}}_{β^{*}, λ^{*}} (β)$ .
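Given fixed weights p_ij(β^*, λ^*), the expected log partial likelihood (25) and the marginal Breslow estimate (27) can be sketched as follows for r(x) = exp(x) and one time-fixed covariate. The (x, exit_age, p) tuples are a hypothetical layout: p is the weight of the uncensored copy of the pair, with p = 0 for pairs in which i cannot have infected j. Every pair contributes to the risk set with total weight one, since its two copies have weights p and 1 − p.

```python
import math


def risk_total(beta, u, pairs):
    """Y(beta, u): each pair counts once, whatever its weight p."""
    return sum(math.exp(beta * xk) for xk, ek, _ in pairs if ek >= u)


def expected_log_partial_likelihood(beta, pairs):
    """Equation (25): possible contacts contribute with weight p_ij."""
    return sum(p * (beta * x - math.log(risk_total(beta, exit_age, pairs)))
               for x, exit_age, p in pairs if p > 0)


def marginal_breslow(beta, tau, pairs):
    """Equation (27): jumps of size p_ij / Y(beta, u) at the possible contact ages."""
    return sum(p / risk_total(beta, exit_age, pairs)
               for _, exit_age, p in pairs if p > 0 and exit_age <= tau)


# One infected person with two possible infectors (weights 2/3 and 1/3)
# plus two pairs that are censored with certainty.
pairs = [(1.0, 0.8, 2.0 / 3), (0.0, 0.3, 1.0 / 3), (0.0, 1.0, 0.0), (1.0, 0.5, 0.0)]
epl_null = expected_log_partial_likelihood(0.0, pairs)
cumhaz_null = marginal_breslow(0.0, math.inf, pairs)
```

When every weight is 0 or 1, both functions reduce to the ordinary log partial likelihood and Breslow estimate for observed who-infects-whom.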

2.6 EM algorithm

When who-infects-whom is not observed, the semiparametric regression model can be fit using the ECM algorithm of Meng and Rubin (1993), which is an extension of the EM algorithm of Dempster et al. (1977). In each iteration, we first estimate β₀ using the expected log partial likelihood and then calculate the marginal Breslow estimator of Λ₀(τ). We use these new estimates to re-weight the possible v. The entire process is described in Algorithm 1.

Algorithm 1 ECM algorithm for semiparametric estimation of β₀ and Λ₀(τ).

To show that this is an ECM algorithm, we must show that the CM1 and CM2 steps are conditional maximizations of the expected log likelihood. The CM1 step is a conditional maximization by definition, so it remains to show that the CM2 step is a conditional maximization. Given a coefficient vector β^* and a hazard function λ^*, the expected log likelihood is

{\tilde{ℓ}}_{β^{*}, λ^{*}} (β, Λ) = \sum_{j = 1}^{n} \sum_{i \neq j} p_{i j} (β^{*}, λ^{*}) \ln (r (β^{T} X_{i j} (t_{j} - t_{i} - ε_{i})) d Λ (t_{j} - t_{i} - ε_{i})) - \int_{0}^{\infty} Y (β, u) d Λ (u) .

(28)

Differentiating with respect to dΛ(t_j − t_i − ε_i) for each i and j shows that, for a fixed β, ${\tilde{ℓ}}_{β^{*}, λ^{*}} (β, Λ)$ is maximized over all step functions Λ(τ) by setting

d Λ (t_{j} - t_{i} - ε_{i}) = \frac{p_{i j} (β^{*}, λ^{*})}{Y (β, t_{j} - t_{i} - ε_{i})},

(29)

exactly as in the marginal Breslow estimator ${\tilde{Λ}}_{β^{*}, λ^{*}} (β, τ)$ . Therefore, Algorithm 1 is an ECM algorithm. When it is known that β = 0, it reduces to the EM algorithm in Kenah (2013). Because each iteration updates both the coefficient estimates and the baseline hazard estimate, the convergence of both β^(k) and Λ^(k)(τ) should be monitored.

2.7 Large-sample estimation of β₀

Let $\tilde{β}$ denote the estimate of β₀ to which the ECM algorithm converges, and let $\tilde{λ} (τ)$ denote the corresponding estimate of λ₀(τ). Let U_v(β, τ) and I_v(β) denote the score and the observed information that we would have calculated had we observed the transmission network v. Using the methods of Louis (1982), the observed information is

\tilde{I} (\tilde{β}) = E_{\tilde{β}, \tilde{λ}} [I_{v} (\tilde{β})] - E_{\tilde{β}, \tilde{λ}} [U_{v} {(\tilde{β}, \infty)}^{\otimes 2}],

(30)

where $E_{β, λ} [\cdot]$ denotes an expectation taken under the assumption that the true coefficient vector is β and the true baseline hazard function is λ(τ). The first term in (30) is

- \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{\infty} \frac{\partial^{2}}{\partial β^{2}} \ln \frac{r ({\tilde{β}}^{T} X_{i j} (u))}{Y (\tilde{β}, u)} d {\tilde{N}}_{i j} (u),

(31)

where ${\tilde{N}}_{i j} (u) = {\tilde{N}}_{i j} (u ∣ \tilde{β}, \tilde{λ})$ . This is the observed information matrix from a weighted regression model where each ij has an uncensored copy with weight $p_{i j} (\tilde{β}, \tilde{λ})$ and a censored copy with weight $1 - p_{i j} (\tilde{β}, \tilde{λ})$ . To evaluate the second term in (30), let

{\tilde{U}}_{. j} (β, τ) = \sum_{i \neq j} \int_{0}^{τ} \frac{\partial}{\partial β} \ln \frac{r (β^{T} X_{i j} (u))}{Y (β, u)} d {\tilde{N}}_{i j} (u),

(32)

be the expected score contribution from all pairs with j as a susceptible. Since $\sum_{j = 1}^{n} {\tilde{U}}_{. j} (\tilde{β}, \infty) = 0$ , each infected person j has only one infector in any v, and the infectors of different individuals can be chosen independently, we have

E_{\tilde{β}, \tilde{λ}} [U_{v} {(\tilde{β}, \infty)}^{\otimes 2}] = \sum_{j = 1}^{n} \sum_{i \neq j} \int_{0}^{\infty} {(\frac{\partial}{\partial β} \ln \frac{r ({\tilde{β}}^{T} X_{i j} (u))}{Y (\tilde{β}, u)})}^{\otimes 2} d {\tilde{N}}_{i j} (u) - \sum_{j = 1}^{n} {\tilde{U}}_{. j} {(\tilde{β}, \infty)}^{\otimes 2} .

(33)

2.8 Large-sample estimation of Λ₀(τ)

Let ${\tilde{Λ}}_{0} (τ)$ be the marginal Breslow estimate obtained after convergence of the ECM algorithm. Appendix B.2, available online, derives the variance estimate

{\tilde{σ}}_{0}^{2} (τ) = {(\frac{\partial}{\partial β} {\tilde{Λ}}_{\tilde{β}, \tilde{λ}} (\tilde{β}, τ))}^{T} \tilde{I} {(\tilde{β})}^{- 1} (\frac{\partial}{\partial β} {\tilde{Λ}}_{\tilde{β}, \tilde{λ}} (\tilde{β}, τ))

(34)

+ 2 \int_{0}^{τ} \frac{1}{Y {(\tilde{β}, u)}^{2}} d \tilde{N} (u) - \sum_{j = 1}^{n} {(\int_{0}^{τ} \frac{1}{Y (\tilde{β}, u)} d {\tilde{N}}_{. j} (u))}^{2},

(35)

where ${\tilde{N}}_{. j} (u) = \sum_{i \neq j} {\tilde{N}}_{i j} (u)$ . Using the martingale central limit theorem and a log transformation, we get the approximate pointwise 1 − α confidence limits

{\tilde{Λ}}_{0} (τ) \exp (\pm \frac{{\tilde{σ}}_{0} (τ)}{{\tilde{Λ}}_{0} (τ)} Φ^{- 1} (1 - \frac{α}{2})) .

(36)

Point and interval estimates for the baseline survival function can be obtained using the relationship S₀(τ) = exp(−Λ₀(τ)).

3 SIMULATIONS

The performance of the methods from Section 2 was tested with a series of 12000 network-based epidemic simulations. All epidemics took place on a Watts-Strogatz small-world network (Watts and Strogatz, 1998), which mimics the high clustering and low diameter of real human contact networks. Starting with a ring of 50000 nodes, each node was connected to its 10 nearest neighbors and each edge was rewired to a randomly chosen node with probability 0.1. A new contact network was built for each simulation.

All epidemic models were written in Python 2.7 (www.python.org) using the packages NetworkX 1.6 (networkx.lanl.gov), NumPy 1.6, and SciPy 0.9 (www.scipy.org). Statistical analysis was done in R 2.15 (www.r-project.org) via the Rpy2 2.2 package (rpy.sourceforge.net). The code for the models is available as Online Supplementary Information.

3.1 Transmission model

The transmission model had a latent period of zero and an exponential infectious period with mean one. The baseline contact interval distribution was Weibull(α, γ), where α is the shape parameter and γ is the rate parameter. 6000 simulations had a Weibull(0.5, 0.2) distribution, which has Λ₀(τ) = (0.2τ)^0.5. The other 6000 had a Weibull(2, 0.6) distribution, which has Λ₀(τ) = (0.6τ)². These distributions gave a basic reproduction number (expected number of infectious contacts made by a typical infectious person) R₀ ≈ 3 in a null model.
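The Weibull(α, γ) parameterization above, with cumulative hazard Λ₀(τ) = (γτ)^α, can be checked and sampled with a short sketch. The function names are hypothetical, and the inverse-transform sampler is one standard way to draw such intervals, not necessarily the one used in the simulation code.

```python
import math
import random


def weibull_cum_hazard(tau, shape, rate):
    """Lambda_0(tau) = (rate * tau) ** shape for a Weibull(shape, rate) contact interval."""
    return (rate * tau) ** shape


def draw_contact_interval(shape, rate, rng):
    """Inverse transform: solve Lambda_0(tau) = -ln(U) for tau, with U uniform on (0, 1]."""
    u = 1.0 - rng.random()
    return (-math.log(u)) ** (1.0 / shape) / rate


rng = random.Random(3)
draws = [draw_contact_interval(2.0, 0.6, rng) for _ in range(1000)]
# The cumulative hazard evaluated at each draw behaves like a unit exponential variable.
mean_cum_hazard = sum(weibull_cum_hazard(d, 2.0, 0.6) for d in draws) / len(draws)
```

For example, the Weibull(0.5, 0.2) baseline reaches Λ₀(τ) = 1 at τ = 5, and the Weibull(2, 0.6) baseline has Λ₀(1) = 0.36.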

In the transmission model, each person i had an infectiousness covariate $X_{i}^{\inf}$ and a susceptibility covariate $X_{i}^{sus}$ . Each pair ij connected by an edge had a pairwise covariate $X_{i j}^{pair}$ . All covariates were independent Bernoulli(.5) random variables. For a connected pair ij, the hazard of infectious contact from i to j at infectious age τ of i was

λ_{i j} (τ) = \exp (β_{\inf} X_{i}^{\inf} + β_{sus} X_{j}^{sus} + β_{pair} X_{i j}^{pair}) λ_{0} (τ)

(37)

For each parameter β, there were 4000 simulations where its true value was chosen from a uniform distribution on (−1, 1). Of these, 2000 simulations used the Weibull(0.5, 0.2) baseline hazard and 2000 used the Weibull(2, 0.6) baseline hazard. Of the 2000 simulations for each baseline hazard, 1000 had the other two β set to 0 and 1000 had the other two β set to 1.

Each simulated epidemic began with a single person infected at time 0. Data from the next 1000 infections was used to fit two regression models, one using information on who-infected-whom as in Section 2.1 and one using an EM algorithm as in Section 2.5. The EM algorithm used a minimum of 2 and a maximum of 25 iterations. At each iteration, a weighted Cox model was run using the last parameter estimates as the initial parameter estimates. Convergence was defined as a change less than 0.002 in the expected log likelihood (tighter convergence criteria yielded nearly identical parameter estimates). After convergence, a Cox model was run using the final weights and initial parameters β_inf = β_sus = β_pair = 0.

After each simulation, we recorded the true value, estimate, and 95% confidence interval endpoints for each β in the model and for the baseline hazard at the 10^th, 25^th, 50^th, 75^th, and 90^th percentiles of all censored and observed contact intervals. We also recorded the α and γ of the baseline hazard function and the number of EM iterations.

3.2 Results

Figure 3 shows good agreement between the estimated and true β_inf, β_sus, and β_pair for both $\hat{β}$ (who-infects-whom observed) and $\tilde{β}$ (who-infects-whom unobserved). Table 1 shows excellent 95% confidence interval coverage probabilities for all combinations of baseline hazards and parameters. The $\tilde{β}$ estimates had slightly lower coverage probabilities than the $\hat{β}$ estimates. The lower right panel of Figure 3 shows that this was achieved with relatively few iterations. The median number of iterations was 6, and 98% of simulations required ≤ 10 iterations. Only 2 out of 12000 simulations failed to converge within 25 iterations.

Figure 3.

The top two panels and the bottom left panel show $\hat{β}$ (black circles) and $\tilde{β}$ (gray circles) versus true β for β_inf, β_sus, and β_pair. The bottom right panel shows a histogram of the number of EM iterations required for convergence.

Table 1.

Hazard ratio 95% confidence interval coverage probabilities in simulations. Each probability is based on the results of 1000 simulations.

Baseline hazard	Parameter: β_inf
	β_sus = β_pair = 0		β_sus = β_pair = 1
	$\hat{β}_{inf}$	$\tilde{β}_{inf}$	$\hat{β}_{inf}$	$\tilde{β}_{inf}$
α = .5	.952	.937	.937	.945
α = 2	.955	.952	.957	.941
Baseline hazard	Parameter: β_sus
	β_inf = β_pair = 0		β_inf = β_pair = 1
	$\hat{β}_{sus}$	$\tilde{β}_{sus}$	$\hat{β}_{sus}$	$\tilde{β}_{sus}$
α = .5	.951	.950	.939	.939
α = 2	.952	.948	.945	.946
Baseline hazard	Parameter: β_pair
	β_inf = β_sus = 0		β_inf = β_sus = 1
	$\hat{β}_{pair}$	$\tilde{β}_{pair}$	$\hat{β}_{pair}$	$\tilde{β}_{pair}$
α = .5	.942	.927	.955	.946
α = 2	.942	.929	.951	.951
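When reading the coverage probabilities in Table 1, it can help to keep the Monte Carlo error of each entry in mind. The sketch below assumes each coverage estimate is a binomial proportion with a true coverage of 0.95; the function name `coverage_mc_interval` is hypothetical.

```python
import math


def coverage_mc_interval(nominal=0.95, n_sims=1000, z=1.96):
    """Approximate range of simulated coverage expected from Monte Carlo
    error alone, if the true coverage equals `nominal`."""
    se = math.sqrt(nominal * (1.0 - nominal) / n_sims)
    return nominal - z * se, nominal + z * se


low_1000, high_1000 = coverage_mc_interval(n_sims=1000)  # roughly (0.936, 0.964)
low_6000, high_6000 = coverage_mc_interval(n_sims=6000)  # roughly (0.944, 0.956)
```

With 1000 simulations per entry, differences of about one percentage point between entries are within the range expected from simulation noise alone.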


Figures 4 and 5 show good agreement between the estimated and true baseline hazard for both ${\hat{Λ}}_{0} (τ)$ (who-infects-whom observed) and ${\tilde{Λ}}_{0} (τ)$ (who-infects-whom unobserved). The smoothed means show almost no bias in ${\hat{Λ}}_{0} (τ)$ or ${\tilde{Λ}}_{0} (τ)$ . Table 2 shows good 95% confidence interval coverage probabilities for both baseline hazards and all percentiles except for α = 2 at the 10^th and 25^th percentiles. When α = 2, the baseline hazard of infectious contact is λ₀(τ) = 1.2τ. At low values of τ, the hazard of infectious contact is very small, so almost all of the contact intervals will be censored. Since the percentiles are calculated for all censored and observed contact intervals, there may be too few observed intervals at the 10^th and 25^th percentiles when α = 2 for the large-sample normal approximation to be valid.

Figure 4.

${\hat{Λ}}_{0} (τ)$ and ${\tilde{Λ}}_{0} (τ)$ versus true Λ₀(τ) for the 6000 simulations with a Weibull(0.5, 0.2) baseline contact interval distribution. For each simulation and each estimator, a circle is shown for the 10^th, 25^th, 50^th, 75^th, and 90^th percentiles of all possible contact intervals. The smoothed means were calculated using cubic smoothing splines.

Figure 5.

${\hat{Λ}}_{0} (τ)$ and ${\tilde{Λ}}_{0} (τ)$ versus true Λ₀(τ) for the 6000 simulations with a Weibull(2, 0.6) baseline contact interval distribution. For each simulation and each estimator, a circle is shown for the 10^th, 25^th, 50^th, 75^th, and 90^th percentiles of all possible contact intervals. The smoothed means were calculated using cubic smoothing splines.

Table 2.

95% confidence interval coverage probabilities in simulations. Each probability is based on the results of 6000 simulations.

Quantile	α = .5		α = 2
	${\hat{Λ}}_{0} (τ)$	${\tilde{Λ}}_{0} (τ)$	${\hat{Λ}}_{0} (τ)$	${\tilde{Λ}}_{0} (τ)$
10%	.949	.938	.957	.875
25%	.949	.940	.951	.905
50%	.950	.941	.955	.934
75%	.949	.936	.949	.939
90%	.949	.941	.953	.939


Figure 6 shows the widths of the confidence intervals when who-infects-whom is not observed in terms of the width of the confidence interval when who-infects-whom is observed for β_inf, β_sus, β_pair, and Λ₀(τ). For β_inf and β_pair, the precision gained by observing who-infects-whom is roughly equivalent to a 20-40% increase in sample size. The baseline hazard plays an important role in how much precision is gained, with a larger gain for α = 0.5 than for α = 2. There is no gain in precision for β_sus because observing who-infects-whom does not add to our knowledge of who was infected. Seeing who-infects-whom only slightly improves the precision of baseline hazard estimates.

Figure 6.

The widths of 95% confidence intervals when who-infects-whom is not observed divided by the widths when who-infects-whom is observed for β_inf, β_sus, β_pair, and Λ₀(τ). The solid gray lines show smoothed means for α = 0.5 and dashed gray lines show smoothed means for α = 2. The smoothed means were calculated using cubic smoothing splines.
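A ratio of confidence interval widths can be translated into an equivalent change in sample size under the usual large-sample assumption that interval width is proportional to 1/√n. The sketch below makes that assumption explicit; the function names are hypothetical, and the source does not state that its 20-40% figure was computed this way.

```python
import math


def width_ratio_for_sample_gain(gain):
    """Width ratio (unobserved / observed) equivalent to a fractional
    sample-size gain, assuming interval width scales as 1 / sqrt(n)."""
    return math.sqrt(1.0 + gain)


def sample_gain_for_width_ratio(ratio):
    """Inverse map: fractional sample-size gain equivalent to a width ratio."""
    return ratio ** 2 - 1.0


ratio_20 = width_ratio_for_sample_gain(0.20)  # about 1.095
ratio_40 = width_ratio_for_sample_gain(0.40)  # about 1.183
```

Under this scaling, a 20-40% increase in sample size corresponds to intervals that are roughly 10-18% wider when who-infects-whom is not observed.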

Observing who-infects-whom allows point estimates that are closer to the truth and interval estimates with better coverage probabilities. However, the EM algorithm can recover a great deal of information when who-infects-whom is not observed, making the iterative regression model of Section 2.5 a promising tool for infectious disease epidemiology.

4 DATA ANALYSIS

To show how the methods of Section 2 can be applied, we will look at the effect of antiviral prophylaxis and age on the transmission of pandemic influenza A(H1N1) in Los Angeles County in 2009. The Los Angeles County Department of Public Health (LACDPH) collected household surveillance data between April 22 and May 19 according to the following protocol (Sugimoto et al., 2011):

1. Nasopharyngeal swabs and aspirates were taken from individuals who reported to the LACDPH or other health care providers with acute febrile respiratory illness (AFRI), defined as a fever ≥ 100°F plus cough, sore throat, or runny nose. These specimens were tested for influenza, and the age, gender, and symptom onset date of the AFRI patient were recorded.

2. Patients whose specimens tested positive for pandemic influenza A(H1N1) or for influenza A of undetermined subtype were enrolled as index cases. Each of them was given a structured phone interview to collect the following information about his or her household contacts: age, gender, type of contact (household, intimate, in-home daycare, non-home daycare), and high risk status (pregnant, child on long-term aspirin therapy, immuno-suppressed, or history of a chronic cardiac, pulmonary, renal, liver, or neurologic condition). The interviewer also recorded whether prophylactic antiviral medication was being taken by the household contacts. They were asked to report the symptom onset date of any AFRI episodes among their household contacts.

3. When necessary, a follow-up interview was given 14 days after the symptom onset date of the index case to assess whether any additional AFRI episodes had occurred in the household, including their illness onset date.

There were 58 households with a total of 299 members. There were 99 infections, of which 62 were index cases (4 of the 58 households had co-primary cases) and 37 were household contacts with an AFRI. For simplicity, we assume these were all influenza A(H1N1) cases and that all household members were susceptible to infection.

Our natural history assumptions were adapted from Yang et al. (2009) and are identical to those in Kenah (2013). In the primary analysis, we assumed an incubation period of 2 days, a latent period of 0 days, and an infectious period of 6 days. Under these assumptions, a person j with symptom onset at time $t_{j}^{sym}$ was infected at time $t_{j} = t_{j}^{sym} - 2$ , can transmit infection on days t_j + 1 to t_j + 6, and stops being infectious at time $t_{j} + 6 = t_{j}^{sym} + 4$ . In a sensitivity analysis, we vary the latent period from 0 to 1 days and the infectious period from 5 to 7 days.
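These natural history assumptions can be written as a small helper that maps a symptom onset day to an infection day and a set of days on which transmission is possible. The class and function names are hypothetical. The way a nonzero latent period shifts the transmission window is an assumption of this sketch, since the text spells out only the case with a latent period of 0 days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaturalHistory:
    incubation_days: int = 2   # infection to symptom onset
    latent_days: int = 0       # infection to onset of infectiousness
    infectious_days: int = 6   # length of the infectious period


def infection_day(symptom_onset_day: int, nh: NaturalHistory) -> int:
    """Infection time t_j implied by the symptom onset time."""
    return symptom_onset_day - nh.incubation_days


def transmission_days(symptom_onset_day: int, nh: NaturalHistory) -> range:
    """Days on which the person can transmit infection.

    With latent_days = 0 this is t_j + 1, ..., t_j + 6, as in the text.
    Shifting the window by latent_days is an assumption of this sketch.
    """
    t_j = infection_day(symptom_onset_day, nh)
    start = t_j + nh.latent_days + 1
    end = t_j + nh.latent_days + nh.infectious_days
    return range(start, end + 1)


primary = NaturalHistory()
days = list(transmission_days(10, primary))  # symptom onset on day 10
```

For a person with symptom onset on day 10, the primary assumptions give infection on day 8 and possible transmission on days 9 through 14, so infectiousness ends 4 days after symptom onset.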

We modeled influenza transmission within households, not between households or from outside the observed households. In each household, infected household members who had no possible infector within the household according to our natural history assumptions were assumed to be imported infections. We assumed that any infected household member could infect any susceptible household member. We used the regression model of Section 2.5 to estimate influenza transmission hazard ratios for the following covariates:

age_inf = 0 if the infectious person is < 18 years old and 1 otherwise,
age_sus = 0 if the susceptible is < 18 years old and 1 otherwise,
proph_sus = 0 if the susceptible is not on antiviral prophylaxis and 1 otherwise.

Since antiviral prophylaxis was initiated after the initial case in each household, it was considered only as a susceptibility covariate. All statistical analysis was done in R 2.15 (www.r-project.org).
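The within-household assumptions above (any infected member can infect any susceptible member, and infections with no possible infector are treated as imported) can be sketched as follows. The function `possible_infectors`, the toy household, and the rule that i is a possible infector of j when j's infection day falls on one of i's transmission days are assumptions made for illustration.

```python
def possible_infectors(infection_day, latent_days=0, infectious_days=6):
    """Map each infected household member j to the set of members i who
    were infectious on j's infection day.

    infection_day: dict mapping member id -> infection day (infected members only).
    """
    sets = {}
    for j, t_j in infection_day.items():
        sets[j] = {
            i
            for i, t_i in infection_day.items()
            if i != j
            and t_i + latent_days + 1 <= t_j <= t_i + latent_days + infectious_days
        }
    return sets


# Hypothetical household with four infected members.
household = {"a": 0, "b": 3, "c": 4, "d": 20}
infector_sets = possible_infectors(household)

# Members with an empty set have no possible infector in the household
# and would be treated as imported infections.
imported = sorted(j for j, s in infector_sets.items() if not s)
```

In this toy household, members "a" and "d" would be classified as imported, "b" has one possible infector, and "c" has two.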

4.1 Results

There were 114 people aged < 18 years and 185 aged ≥ 18 years, with no missing age data. There were 91 people taking antiviral prophylaxis and 152 not taking prophylaxis, with missing prophylaxis data for 56 people. When who-infects-whom is not observed, a complete-case analysis requires the removal of all rows corresponding to infectious-susceptible pairs ij where $i \in V_{j}$ and any member of $V_{j}$ is missing data. Otherwise, the remaining members of $V_{j}$ get too much credit for the infection of j.
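The complete-case rule described above can be sketched as a filter on pairwise rows. The row layout (dictionaries with keys `i`, `j`, and `missing`) and the function name `complete_case` are hypothetical; the point is only that one incomplete row among the possible infectors of j removes every row for those possible infectors.

```python
def complete_case(rows, infector_sets):
    """Drop rows with missing covariates, and drop all rows (i, j) with
    i in V_j whenever any such row for the same j has missing covariates.

    rows: list of dicts with keys 'i', 'j', 'missing' (bool).
    infector_sets: dict mapping j -> set of possible infectors V_j.
    """
    incomplete_j = {
        r["j"]
        for r in rows
        if r["missing"] and r["i"] in infector_sets.get(r["j"], set())
    }
    kept = []
    for r in rows:
        in_v_j = r["i"] in infector_sets.get(r["j"], set())
        if r["missing"]:
            continue  # ordinary complete-case removal
        if in_v_j and r["j"] in incomplete_j:
            continue  # avoid giving remaining members of V_j too much credit
        kept.append(r)
    return kept


# Hypothetical example: c was infected with V_c = {a, b}; d escaped infection.
example_sets = {"c": {"a", "b"}}
example_rows = [
    {"i": "a", "j": "c", "missing": False},
    {"i": "b", "j": "c", "missing": True},
    {"i": "a", "j": "d", "missing": False},
    {"i": "b", "j": "d", "missing": True},
]
kept_rows = complete_case(example_rows, example_sets)
```

Here the complete row for the pair (a, c) is removed along with the incomplete row for (b, c), while the complete row for (a, d) is kept.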

In the main analysis, there were 70 people infected from outside the household (i.e., no possible infector in the household), 16 with 1 possible infector, 7 with 2 possible infectors, 4 with 4 possible infectors, and 2 with 8 possible infectors, giving us 1¹⁶ × 2⁷ × 4⁴ × 8² = 2097152 possible transmission trees. The pairwise data contains 443 infectious-susceptible pairs with a total of 2455 pair-days at risk of infection. Of these, 16 × 1 + 7 × 2 + 4 × 4 + 2 × 8 = 62 rows represent possible infection events. All models used the Efron approximation (Efron, 1977) for the partial likelihood with tied failure times.
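The counts in this paragraph can be checked directly. The dictionary below restates the numbers from the text; the variable names are hypothetical.

```python
from math import prod

# Number of possible infectors -> number of infected people with that many.
infector_counts = {1: 16, 2: 7, 4: 4, 8: 2}
imported_infections = 70

# Each within-household infection independently picks one of its possible
# infectors, so the number of transmission trees is a product.
n_trees = prod(k ** n for k, n in infector_counts.items())

# Each possible infector contributes one row representing a possible infection event.
n_possible_infection_rows = sum(k * n for k, n in infector_counts.items())

n_within_household = sum(infector_counts.values())
n_infections = imported_infections + n_within_household
```

The same arithmetic gives 29 within-household infections and 99 infections in total.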

The top panel of Table 3 shows the results of seven models. All of the models including prophylaxis suggested that antiviral prophylaxis reduced the hazard of infectious contact by about 60%, with low p-values. Hazard ratio point estimates for the main effects of age in all models suggest that adults are more infectious and less susceptible than children. However, evidence for this result is very weak. Only one of the age effects was statistically significant in univariable models, and none were significant in any multivariable model. Multivariable and stratified models with interaction terms for age and antiviral prophylaxis suggest a stronger effect of antiviral prophylaxis on transmission to and from adults than on transmission to and from children. However, the evidence for this result is also weak; these coefficients had high p-values and wide confidence intervals. The bottom panel of Table 3 shows the results of a sensitivity analysis using the multivariable model without interaction. Varying the latent and infectious periods has little effect on the results of the model.

Table 3.

Hazard ratios with 95% confidence intervals and p-values for different models of the 2009 pandemic influenza A(H1N1) household surveillance data from Los Angeles County. Likelihood ratio p-values comparing models with and without interaction terms are also given. The multivariable and stratified models without interaction were used as the final models.

	Main effects			Interactions
	age_inf	age_sus	proph_sus	Variables	HR (p-value)
Regression model
Univariable	1.53 (0.66, 3.54) p = .321	0.41 (0.20, 0.85) p = .016	0.43 (0.18, 1.02) p = .057
Multivariable	1.78 (0.69, 4.62) p = .234	0.69 (0.29, 1.64) p = .399	0.41 (0.17, 0.98) p = .046
Multivariable + interaction	1.59 (0.32, 7.84) p = .570	0.63 (0.14, 2.73) p = .532	0.04 (0.00, 9.62) p = .253	age_inf:age_sus	0.66 (p = .71)
				age_inf:proph_sus	9.28 (p = .45)
				age_sus:proph_sus	2.72 (p = .36)
					LR p = .101
Stratified	strata	0.69 (0.29, 1.64) p = .401	0.41 (0.17, 0.99) p = .047
Stratified + interaction	strata	0.52 (0.29, 1.64) p = .219	0.23 (0.05, 1.16) p = .075	age_inf:proph_sus	2.37 (p = .38)
					LR p = .353
Sensitivity analysis (multivariable model without interaction)
Latent period 1 day	1.44 (0.64, 3.26) p = .378	0.83 (0.36, 1.93) p = .670	0.35 (0.15, 0.80) p = .013
Infectious period 5 days	1.59 (0.60, 4.20) p = .348	0.64 (0.27, 1.55) p = .322	0.45 (0.18, 1.07) p = .073
Infectious period 7 days	1.45 (0.62, 3.40) p = .378	0.89 (0.38, 2.04) p = .670	0.34 (0.17, 0.87) p = .013
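A reader can roughly cross-check a reported p-value against its hazard ratio and confidence interval. The sketch below assumes the intervals are Wald intervals on the log hazard-ratio scale and that the p-values come from the corresponding Wald test; the source does not state this, so small discrepancies are to be expected. The function name `wald_p_from_ci` is hypothetical.

```python
import math


def wald_p_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the standard error, z statistic, and two-sided p-value
    implied by a hazard ratio and its 95% Wald confidence interval."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return se, z, p


# Multivariable model, antiviral prophylaxis: HR 0.41 (0.17, 0.98), reported p = .046.
se_proph, z_proph, p_proph = wald_p_from_ci(0.41, 0.17, 0.98)
```

Because the published estimates are rounded to two decimal places, this reconstruction should be read as a consistency check rather than a reproduction of the fitted model.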


Figure 7 shows estimates of the cumulative transmission probability based on the multivariable and stratified models without interaction. The results of the two models are similar, but the stratified model showed lower probabilities of transmission from children and higher probabilities of transmission from adults. All four panels clearly show the estimated effect of antiviral prophylaxis. All curves show bigger jumps on the first four days after infection than on days 5 and 6, which is consistent with the results of Kenah (2013). Comparing the top and bottom rows shows that children are estimated to be less infectious than adults. Comparing the left and right columns shows that children are estimated to be more susceptible than adults. As noted above, these differences are not statistically significant.

Figure 7.

Household transmission of 2009 pandemic influenza A(H1N1) in Los Angeles County. Each panel shows separate curves for susceptible contacts with (gray lines) and without (black lines) antiviral prophylaxis. The solid lines are based on the multivariable model, and the dotted lines are based on the model stratified by age_inf.
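Curves of this kind can be obtained from a fitted relative-risk model through the usual relationship between cumulative hazard and cumulative probability, 1 − exp(−r · Λ₀(τ)), where r is the relative risk for the pair. The sketch below uses the multivariable hazard ratios from Table 3; the cumulative baseline hazard value is hypothetical, since the estimated baseline hazard is not tabulated in the text.

```python
import math

# Hazard ratios from the multivariable model without interaction (Table 3).
HAZARD_RATIOS = {"age_inf": 1.78, "age_sus": 0.69, "proph_sus": 0.41}


def transmission_probability(cum_baseline_hazard, covariates):
    """Cumulative probability of transmission in a pair with the given
    binary covariates, given a cumulative baseline hazard."""
    relative_risk = 1.0
    for name, value in covariates.items():
        relative_risk *= HAZARD_RATIOS[name] ** value
    return 1.0 - math.exp(-relative_risk * cum_baseline_hazard)


hypothetical_cum_hazard = 0.10  # placeholder value, not an estimate from the paper

# Adult infector, child susceptible, without and with prophylaxis.
p_no_proph = transmission_probability(
    hypothetical_cum_hazard, {"age_inf": 1, "age_sus": 0, "proph_sus": 0})
p_proph = transmission_probability(
    hypothetical_cum_hazard, {"age_inf": 1, "age_sus": 0, "proph_sus": 1})
```

Evaluating the function over a grid of cumulative baseline hazard values, one per day since infection, would trace out one curve per covariate pattern.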

This data analysis is intended primarily to illustrate the flexibility of the regression modeling framework for analyzing transmission data. There are several important limitations of the analysis itself. With only 29 within-household transmissions, the large-sample normal approximations may not hold and there is limited power to estimate the effects of age and antiviral prophylaxis. The age classification is crude, so it may not accurately capture the effects of age. The prophylaxis variable was missing for many pairs and was binary, allowing no consideration of the timing of prophylaxis relative to exposure. Analyses of the household transmission of influenza A(H3N2) found greater child-to-child than adult-to-adult transmission (Addy et al., 1991). In our analysis of influenza A(H1N1), children appeared less infectious and more susceptible than adults, but these differences were not statistically significant. If not due to random noise, such a result could reflect a difference between the H3N2 and H1N1 subtypes of influenza A or a bias caused by the failure to account for infection from outside the household. In any case, this analysis shows that the model needs to be extended to model infection from outside the household and to handle missing data.

5 DISCUSSION

The semiparametric relative-risk regression model proposed here has several important advantages over the chain binomial model. It can be fit using standard statistical software with parameter interpretations that resemble the Cox model. Standard software can be used to convert the results into curves representing the cumulative probability of transmission in pairs of individuals with specific characteristics. It does not make any parametric assumptions about the baseline hazard of infectious contact, and it allows many of the same extensions as the Cox model, including stratification, interaction, and time-dependent covariates. This flexibility and ease of use will make it an important tool for infectious disease epidemiology. To realize this potential, there are several limitations that remain to be addressed.

We assumed that the set of imported infections is known. The chain binomial model handles unknown imported infections by including a per-time-unit probability of escaping infection from outside the household. In the semiparametric regression model, this could be achieved by fitting two models in each step of the EM algorithm: a pairwise contact interval model within the household and an individual-level model in absolute time for infection from outside the household. At each step, the weights would be recalculated based on covariates, coefficient estimates, the baseline hazard of the contact interval distribution, and the baseline hazard of infection from outside the household.

We assumed that infection times, latent periods, and infectious periods were all observed. We can usually observe only the clinical course of the disease, so these times must be imputed. In Section 4, we had missing data on covariates. Simple missing data (such as antiviral prophylaxis) could be handled by extending the EM algorithm to calculate the expected log likelihood over the possible values of the missing data as well as who-infected-whom. More complex missing data (such as infection and removal times) could be handled using data augmentation in a profile sampler (Lee et al., 2005), getting a posterior distribution for the model coefficients while treating the baseline hazards as a nuisance parameter.

We assumed that all possible infectors of each person were observed. Unobserved infectors could occur because of incomplete contact tracing or asymptomatic infection. The possible bias caused by unobserved sources of infection needs to be studied, and methods for controlling it analytically or assessing its severity in a sensitivity analysis need to be developed.

We assumed a static contact network where the C_ij were binary and constant. In reality, people are exposed to close contacts at home, at work, at school, and at other locations in a dynamic process. The extension of these methods to dynamic contact networks is possible but nontrivial. We could allow C_ij(τ) to be a time-dependent process in the infectious age τ of i. The contact interval distribution would then be defined as the distribution of the contact interval that would occur if C_ij(τ) = 1 for all τ. For estimation, we would have to observe the process C_ij(τ) for each ij.

Some of our assumptions must be relaxed to capture the natural history of complex diseases. We assumed an SEIR framework best suited to acute, immunizing diseases that spread directly from person to person. Many foodborne and waterborne diseases, pneumococcal and meningococcal diseases, and other infectious diseases of major public health importance do not fit easily into this framework. To extend the proposed regression model to complex diseases, we could allow individuals to experience multiple events (e.g., first infection, second infection) or to experience different types of events (e.g., colonization, infection, relapse). We assumed that contact intervals were independent of infectious periods even though both are affected by the same host-pathogen interaction. In some cases, there may be a covariate process X(τ) such that I_i(τ) and $N_{i j} (τ)$ are conditionally independent given X(τ⁻). Otherwise, infectious contact and the infectious period must be modeled as a multivariate survival process.

Several technical issues need further attention. The smoothing step is crucial to fitting the regression model when who-infects-whom is not observed. Here, we used cubic smoothing splines because they were convenient and worked well. However, these do not guarantee that the smoothed hazard function is monotonically increasing and lack a convenient interpretation in terms of the likelihood. A penalized likelihood estimator that guarantees monotonicity, such as that of Anderson and Senthilselvan (1980), would be more consistent with the EM algorithm. Model diagnostics, goodness-of-fit tests, and small-sample methods for point and interval estimation need to be developed, and a more rigorous study of the model asymptotics needs to be done.

Despite these limitations, the semiparametric relative-risk regression model presented here is a powerful new framework for the analysis of infectious disease transmission data. Placing statistical methods for infectious disease epidemiology on the broad and deep theoretical foundation of survival analysis will help clarify study design and causal inference for communicable diseases and allow statistical methods to develop in concert with advances in molecular biology. Ultimately, these improvements may lead to more efficient vaccine trials and a better-informed public health response to future outbreaks and epidemics.

Supplementary Material

Supplementary Material - Appendices

NIHMS596023-supplement-Supplementary_Material_-_Appendices.pdf (216.8 KB, pdf)

Supplementary Material - Python Module

NIHMS596023-supplement-Supplementary_Material_-_Python_Module.zip (6.2 KB, zip)

Acknowledgments

This research was supported by National Institute of Allergy and Infectious Diseases (NIAID) grant K99/R00 AI095302. The content is solely the responsibility of the author and does not necessarily represent the official views of NIAID or the National Institutes of Health. The author thanks Tom Britton, Yang Yang, M. Elizabeth Halloran, and Ira M. Longini, Jr. for their comments and suggestions.

Footnotes

SUPPLEMENTARY MATERIALS

Supplementary material available online at the Journal of the American Statistical Association website includes:

Appendices for Section 2 Outline of sufficient conditions for consistency and asymptotic normality (Appendix A) and derivation of the asymptotic variance of baseline hazard estimates (Appendix B).

Simulation code for Section 3 Python module used for the simulations, including statistical analyses in R.

References

Aalen Odd O., Borgan Ørnulf, Gjessing Hakon. Survival and Event History Analysis: A Process Point of View. Statistics for Biology and Health. Springer-Verlag; New York: 2009.
Addy Cheryl L., Longini Ira M., Jr, Haber Michael. A generalized stochastic model for the analysis of infectious disease final size data. Biometrics. 1991;47:961–974.
Anderson JA, Senthilselvan A. Smooth estimates for the hazard function. Journal of the Royal Statistical Society, Series B. 1980;42:322–327.
Andersson Håkan, Britton Tom. Stochastic Epidemic Models and Their Statistical Analysis. Lecture Notes in Statistics. Springer; New York: 2000.
Becker Niels G. Analysis of Infectious Disease Data. Monographs on Statistics and Applied Probability. Chapman & Hall/CRC; Boca Raton, FL: 1989.
Breslow N. Contribution to the discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society, Series B. 1972;34:216–217.
Cox David R. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
Efron Bradley. The efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association. 1977;72:557–565.
Johansen Soren. An extension of Cox's regression model. International Statistical Review. 1983;51:165–174.
Kenah Eben. Contact intervals, survival analysis of epidemic data, and estimation of R0. Biostatistics. 2011;12:548–566. doi: 10.1093/biostatistics/kxq068.
Kenah Eben. Nonparametric survival analysis of epidemic data. Journal of the Royal Statistical Society, Series B. 2013;75:277–303. doi: 10.1111/j.1467-9868.2012.01042.x.
Kenah Eben, Lipsitch Marc, Robins James M. Generation interval contraction and epidemic data analysis. Mathematical Biosciences. 2008;213:71–79. doi: 10.1016/j.mbs.2008.02.007.
Lee Bee Leng, Kosorok Michael R., Fine Jason P. The profile sampler. Journal of the American Statistical Association. 2005;100:960–969.
Louis Thomas A. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B. 1982;44:226–233.
Meng Xiao-Li, Rubin Donald. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika. 1993;80:267–278.
Prentice Ross L., Self Steven G. Asymptotic distribution theory for Cox-type regression models with general relative risk form. Annals of Statistics. 1983;11:804–813.
Rampey Alvin H., Jr, Longini Ira M., Jr, Haber Michael, Monto Arnold S. A discrete-time model for the statistical analysis of infectious disease incidence data. Biometrics. 1992;48:117–128.
Rhodes Philip H., Halloran M. Elizabeth, Longini Ira M., Jr. Counting process models for infectious disease data: Distinguishing exposure to infection from susceptibility. Journal of the Royal Statistical Society, Series B. 1996;58:751–762.
Sugimoto Jonathan D., Yang Yang, Halloran M. Elizabeth, Dean Brandon, Oiulfstad Brit, Bagwell Dee Ann, Mascola Laurene, Bancroft Elizabeth, Longini Ira M., Jr. Accounting for unobserved immunity and asymptomatic infection in the early household transmission of the pandemic influenza A (H1N1) 2009. Submitted to American Journal of Epidemiology. 2011.
Svensson Åke. A note on generation times in epidemic models. Mathematical Biosciences. 2007;208:300–311. doi: 10.1016/j.mbs.2006.10.010.
Therneau Terry M., Grambsch Patricia M. Modeling Survival Data: Extending the Cox Model. Statistics for Biology and Health. Springer-Verlag; New York: 2000.
Wallinga Jacco, Teunis Peter. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of Epidemiology. 2004;160:509–516. doi: 10.1093/aje/kwh255.
Watts Duncan J., Strogatz Steven H. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918.
White L. Forsberg, Pagano Marcello. A likelihood-based method for real-time estimate of the serial interval and reproductive number of an epidemic. Statistics in Medicine. 2008;27:2999–3016. doi: 10.1002/sim.3136.
Yang Yang, Sugimoto Jonathan, Halloran M. Elizabeth, Basta Nicole E., Chao Dennis L., Matrajt Laura, Potter Gail, Kenah Eben, Longini Ira M., Jr. The transmissibility and control of pandemic influenza A(H1N1) virus. Science. 2009;326:729–733. doi: 10.1126/science.1177373.
