Author manuscript; available in PMC: 2016 Mar 1.
Published in final edited form as: J Am Stat Assoc. 2014 Mar 27;110(509):313–325. doi: 10.1080/01621459.2014.896807

Semiparametric Relative-risk Regression for Infectious Disease Transmission Data

Eben Kenah 1
PMCID: PMC4489164  NIHMSID: NIHMS596023  PMID: 26146425

Abstract

This paper introduces semiparametric relative-risk regression models for infectious disease data. The units of analysis in these models are pairs of individuals at risk of transmission. The hazard of infectious contact from i to j consists of a baseline hazard multiplied by a relative risk function that can be a function of infectiousness covariates for i, susceptibility covariates for j, and pairwise covariates. When who-infects-whom is observed, we derive a profile likelihood maximized over all possible baseline hazard functions that is similar to the Cox partial likelihood. When who-infects-whom is not observed, we derive an EM algorithm to maximize the profile likelihood integrated over all possible combinations of who-infected-whom. This extends the most important class of regression models in survival analysis to infectious disease epidemiology. These methods can be implemented in standard statistical software, and they will be able to address important scientific questions about emerging infectious diseases with greater clarity, flexibility, and rigor than current statistical methods allow.

Keywords: Survival analysis, Epidemiology, EM algorithm, Chain-binomial model

1 INTRODUCTION

Infectious diseases are an important threat to human health and commerce, and understanding the transmission of disease is crucial to the design of public health interventions. The statistical analysis of infectious disease data is complicated by the fact that infections are inherently dependent, especially when they are transmitted directly from person to person (Becker, 1989; Andersson and Britton, 2000). Epidemiologists have dealt with this problem in three ways. The most common approach is to model susceptibility to disease using standard statistical methods such as logistic or Cox regression, ignoring disease transmission. A second approach is to use chain binomial models (Rampey et al., 1992), which estimate the probability of escaping infectious contact from infected members of groups such as households, classrooms, or hospital wards. The third approach is to model the spread of disease as a branching process where infectees are the offspring of their infectors (Wallinga and Teunis, 2004; White and Pagano, 2008). The time between the infections of an infector and an infectee is called a generation interval. In this approach, the generation intervals are assumed to be independent and identically distributed (iid).

To understand transmission, it is crucial to separate the effects of covariates on infectiousness and susceptibility from their association with exposure to infected people (Rhodes et al., 1996). Regression models that ignore transmission cannot do this. When disease transmission is modeled as a branching process, uninfected people do not exist and cannot be exposed to infected people. The failure to account for uninfected person-time and competing risks of infection causes several problems with this approach (Svensson, 2007; Kenah et al., 2008). The assumption that generation intervals are iid is difficult to relax, making estimation of covariate effects difficult (Kenah, 2013). Chain binomial models are a statistically sound response to the problem of dependence and can be used to estimate covariate effects. However, their use is limited in two ways: First, they are not implemented in standard statistical software, a problem solved partially by the publicly-available package TranStat (www.epimodels.org/midas/transtat.do). Second, they use discrete time. Since infectious disease data are usually recorded by the day or week, this is not unnatural. However, continuous-time models corrected for ties may offer a more flexible modeling framework.

Kenah (2011) extended parametric methods from survival analysis to infectious disease data by modeling the contact interval. In the ordered pair ij, the contact interval τij is the time between the onset of infectiousness in i and the first infectious contact from i to j, where infectious contact is a contact sufficient to infect a susceptible individual. It is right-censored if the infectious period of i ends before i makes infectious contact with j or if j is infected by someone other than i. These methods solve the problem of dependence by treating ordered pairs of individuals, not the individuals themselves, as the units of analysis. Kenah (2013) showed that the contact interval distribution could be estimated nonparametrically by adapting the Nelson-Aalen estimator from standard survival analysis. These methods assume a homogeneous population where the contact interval distribution is the same for all pairs ij in which transmission from i to j is possible. They are unable to estimate covariate effects on transmission, which is a primary goal of vaccine trials, outbreak investigations, and many other studies of infectious disease.

The goal of this paper is to extend the methods of Kenah (2013) to develop a relative-risk regression model similar to that of Cox (1972). This model will allow semiparametric estimation of the effects of covariates on the hazard of infectious contact in pairs of individuals. For the ordered pair ij, the covariate vector can include infectiousness covariates for i, susceptibility covariates for j, and pairwise covariates. This semiparametric regression model will allow many of the most important scientific questions in infectious disease epidemiology to be addressed with greater clarity, flexibility, and rigor.

1.1 Stochastic S(E)IR epidemic model

Consider a closed population of n individuals assigned indices 1, ..., n. Each individual is in one of four states: susceptible (S), exposed (E), infectious (I), or removed (R). Person i moves from S to E at his or her infection time $t_i$, with $t_i = \infty$ if i is never infected. After infection i has a latent period of length $\varepsilon_i$, during which he or she is infected but not infectious. At time $t_i + \varepsilon_i$, i moves from E to I, beginning an infectious period of length $\iota_i$. At time $t_i + \varepsilon_i + \iota_i$, i moves from I to R. Once in R, i can no longer infect others or be infected. The states and notation are illustrated at the top of Figure 1. The latent period is a nonnegative random variable, the infectious period is a strictly positive random variable, and both have finite mean and variance.

Figure 1.

Notation for the stochastic SEIR model natural history (top) and infectious contact process (bottom). In the bottom diagram, the infectious contact interval $\tau^*_{ij}$ is equal to the contact interval $\tau_{ij}$ because $\tau_{ij} \le \iota_i$. Otherwise, we would have $\tau^*_{ij} = \infty$ and no infectious contact from i to j would occur.

An epidemic begins with one or more persons infected from outside the population, which we call imported infections. The methods in this paper require that the set of imported infections is known. For simplicity, we assume that epidemics begin with one or more imported infections at time 0 and there are no other imported infections.

After becoming infectious at time $t_i + \varepsilon_i$, person i makes infectious contact with $j \ne i$ at time $t_{ij} = t_i + \varepsilon_i + \tau^*_{ij}$, where the infectious contact interval $\tau^*_{ij}$ is a strictly positive random variable with $\tau^*_{ij} = \infty$ if infectious contact never occurs. Since infectious contact must occur while i is infectious or never, $\tau^*_{ij} \in (0, \iota_i]$ or $\tau^*_{ij} = \infty$. We define infectious contact to be a contact sufficient to infect a susceptible person, so $t_j \le t_{ij}$ for all ij. The infectious contact interval is illustrated at the bottom of Figure 1.

For each ordered pair ij, let $C_{ij} = 1$ if infectious contact from i to j is possible and $C_{ij} = 0$ otherwise. These $C_{ij}$ could be the entries in an adjacency matrix for a static contact network. We assume that the infectious contact interval $\tau^*_{ij}$ is generated in the following way: A contact interval $\tau_{ij}$ is drawn from a distribution with hazard function $\lambda_{ij}(\tau)$. If $\tau_{ij} \le \iota_i$ and $C_{ij} = 1$, then $\tau^*_{ij} = \tau_{ij}$. Otherwise, $\tau^*_{ij} = \infty$. In this paper, we assume the contact intervals in all ordered pairs ij are independent and have finite mean and variance.
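The generation rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code used in the paper: the function names are hypothetical, and the Weibull parameterization with cumulative hazard $(\gamma\tau)^{\alpha}$ and the unit-mean exponential infectious period are borrowed from the simulation settings in Section 3.1.

```python
import math
import random

def weibull_contact_interval(shape, rate, rng):
    """Draw a contact interval with cumulative hazard (rate * tau) ** shape by inversion."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return (-math.log(u)) ** (1.0 / shape) / rate

def infectious_contact_interval(tau, iota, c_ij):
    """Return tau*_ij: tau if contact is possible and falls in the infectious period, else infinity."""
    return tau if (c_ij == 1 and tau <= iota) else math.inf

rng = random.Random(1)
iota_i = rng.expovariate(1.0)                     # infectious period of i (assumed exponential, mean 1)
tau_ij = weibull_contact_interval(2.0, 0.6, rng)  # contact interval for the pair ij
tau_star_ij = infectious_contact_interval(tau_ij, iota_i, c_ij=1)
print(iota_i, tau_ij, tau_star_ij)
```

A pair with $C_{ij} = 0$ always gets an infinite infectious contact interval, whatever contact interval is drawn.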

1.2 Observation and censoring

Our population has size n, and we observe the times of all S → E (infection), E → I (onset of infectiousness), and I → R (removal) transitions in the population between time 0 and time T . For all ordered pairs ij such that i is infected, we observe Cij. We first consider the case where who-infects-whom is observed and then consider the more realistic case where it is not.

We assume that we can observe τij only if j is infected by i at time ti+εi+τij. Clearly, τij can be observed only if Cij = 1. We also have right-censoring of τij:

1. Since infectious contact can occur only while i is infectious, $\tau_{ij}$ can be right-censored by the infectious period $\iota_i$ of i. Let $I_i(\tau) = \mathbb{1}_{\tau \in (0, \iota_i]}$ indicate whether i remains infectious at infectious age τ.

2. Since j is susceptible to infection by i only if he or she has not been infected by anyone else, $\tau_{ij}$ can be right-censored by $t_{i'j} - t_i - \varepsilon_i$ for $i' \ne i$. Let $S_{ij}(\tau) = \mathbb{1}_{t_i + \varepsilon_i + \tau \le t_j}$ indicate whether j remains susceptible at infectious age τ of i.

3. Let T denote the time at which observation ends. Then $\tau_{ij}$ can be right-censored by the end of observation at infectious age $T - t_i - \varepsilon_i$ of i. Let $Y_i(\tau) = \mathbb{1}_{t_i + \varepsilon_i + \tau \le T}$ indicate whether observation is ongoing when i reaches infectious age τ.

Since $I_i(\tau)$, $S_{ij}(\tau)$, and $Y_i(\tau)$ are left-continuous,

$$Y_{ij}(\tau) = C_{ij}\, I_i(\tau)\, S_{ij}(\tau)\, Y_i(\tau) \tag{1}$$

is a left-continuous process that indicates the risk of an observed infectious contact from i to j at infectious age τ of i. The assumptions made in the stochastic S(E)IR model above ensure that Ii(τ) and Sij(τ) independently censor τij. We also require that T is a stopping time with respect to the observed data such that, for all i, Yi(τ) independently censors τij for each j exposed to infectious contact from i. The possible censoring scenarios are illustrated in Figure 2.
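As a small illustration of equation (1), the at-risk indicator can be computed directly from the observed times. This is a sketch with a hypothetical function name and argument order; it assumes all times are on a common scale and that `t_j` is infinite when j is never infected.

```python
def at_risk(tau, c_ij, t_i, eps_i, iota_i, t_j, t_end):
    """Y_ij(tau) = C_ij * I_i(tau) * S_ij(tau) * Y_i(tau) at infectious age tau of i."""
    still_infectious = 0 < tau <= iota_i              # I_i(tau)
    still_susceptible = t_i + eps_i + tau <= t_j      # S_ij(tau)
    under_observation = t_i + eps_i + tau <= t_end    # Y_i(tau)
    return int(c_ij == 1 and still_infectious and still_susceptible and under_observation)

# i becomes infectious at time 0 and stays infectious for 2 time units;
# j is infected at time 5 and observation ends at time 10.
print(at_risk(1.0, 1, 0.0, 0.0, 2.0, 5.0, 10.0))  # pair is at risk
print(at_risk(3.0, 1, 0.0, 0.0, 2.0, 5.0, 10.0))  # i is no longer infectious
```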

Figure 2.

The three censoring processes for the contact interval $\tau_{ij}$. The onset of infectiousness in i occurs at time $t_i + \varepsilon_i$, and the infection of j occurs at $t_j$. At the top, $\tau_{ij}$ is censored because $I_i(t_j - t_i - \varepsilon_i) = 0$. In the middle, $\tau_{ij}$ is observed if $S_{ij}(t_j - t_i - \varepsilon_i) = 1$ and censored otherwise. At the bottom, $\tau_{ij}$ is censored because $Y_i(t_j - t_i - \varepsilon_i) = 0$.

1.3 Transmission trees and infectious sets

Following Wallinga and Teunis (2004), let $v_j$ denote the index of the person who infected person j, with $v_j = 0$ for imported infections and $v_j = \infty$ for persons not infected prior to the end of observation. The transmission tree is the directed network with an edge from $v_j$ to j for each j such that $t_j \le T$. It can be represented by a vector $v = (v_1, \ldots, v_n)$. Let $\mathcal{V}_j = \{i : C_{ij} I_i(t_j - t_i - \varepsilon_i) = 1\}$ denote the set of possible infectors of person j, which we call the infectious set of j. Let $\mathcal{V}$ denote the set of all v consistent with the observed data. A $v \in \mathcal{V}$ can be generated by choosing a $v_j \in \mathcal{V}_j$ for each non-imported infection j.
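The construction of infectious sets and of the transmission trees consistent with the data can be sketched as follows. The function names and the toy data are hypothetical, and the code uses 0-based indices, so it represents a tree as a mapping from each non-imported infectee to its infector instead of using the $v_j = 0$ convention for imported infections.

```python
import math
from itertools import product

def infectious_sets(t, eps, iota, contact, imported):
    """Map each non-imported infected j to the list of i that were infectious at t_j with C_ij = 1."""
    sets = {}
    for j in range(len(t)):
        if j in imported or math.isinf(t[j]):
            continue
        sets[j] = [i for i in range(len(t))
                   if i != j and contact[i][j] == 1
                   and 0 < t[j] - t[i] - eps[i] <= iota[i]]
    return sets

def transmission_trees(sets):
    """Yield every assignment of one infector from each infectious set."""
    infectees = sorted(sets)
    for choice in product(*(sets[j] for j in infectees)):
        yield dict(zip(infectees, choice))

t = [0.0, 1.0, 2.5, math.inf]   # infection times; person 3 is never infected
eps = [0.0, 0.0, 0.0, 0.0]      # latent periods
iota = [3.0, 2.0, 1.0, 1.0]     # infectious periods
contact = [[int(i != j) for j in range(4)] for i in range(4)]
sets = infectious_sets(t, eps, iota, contact, imported={0})
trees = list(transmission_trees(sets))
print(sets)
print(trees)
```

In this toy example person 1 can only have been infected by person 0, while person 2 has two possible infectors, so two transmission trees are consistent with the data.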

2 METHODS

Kenah (2013) described the nonparametric estimation of the contact interval distribution for a homogeneous population. Here, we consider a semiparametric relative-risk model like that of Prentice and Self (1983). Let

$$\lambda_{ij}(\tau) = r\big(\beta_0^{T} X_{ij}(\tau)\big)\, \lambda_0(\tau), \tag{2}$$

where $\lambda_0(\tau)$ is an unspecified baseline hazard function, $r : \mathbb{R} \to (0, \infty)$ is a relative risk function, $\beta_0$ is an unknown b × 1 coefficient vector, and $X_{ij}(\tau)$ is a b × 1 predictable covariate process taking values in a set $\mathcal{X}$. The covariates $X_{ij}(\tau)$ can include individual-level covariates predicting the infectiousness of i or the susceptibility of j as well as pairwise covariates (e.g., membership in the same household) that predict the hazard of infectious contact from i to j.

We assume that r has continuous first and second derivatives, r(0) = 1, and ln r(βTX) is bounded on X. Letting r(x) = exp(x) gives us a loglinear relative risk regression model like that of Cox (1972), and letting r(x) = 1 + x gives us a linear relative risk regression model. To fit these semiparametric models, we adapt the nonparametric estimators from Kenah (2013) to account for the relative risk function.

2.1 Who-infects-whom is observed

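Let $N_{ij}(\tau) = \mathbb{1}_{\tau_{ij} \le \tau}$ indicate whether an observed infectious contact from i to j has occurred by infectious age τ in i, and let $N(\tau) = \sum_{j=1}^{n} \sum_{i \ne j} N_{ij}(\tau)$. Note that we must observe who-infected-whom in order to calculate N(τ).

Let $\Lambda_0(\tau) = \int_0^{\tau} \lambda_0(u)\, du$. Given β, the Breslow estimator (Breslow, 1972) of $\Lambda_0(\tau)$ is

$$\hat{\Lambda}(\beta, \tau) = \int_0^{\tau} \frac{1}{Y(\beta, u)}\, dN(u), \tag{3}$$

where

$$Y(\beta, u) = \sum_{j=1}^{n} \sum_{i \ne j} r\big(\beta^{T} X_{ij}(u)\big)\, Y_{ij}(u). \tag{4}$$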

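The Breslow estimator has two desirable properties. First, $\hat{\Lambda}(\beta_0, \tau)$ is an unbiased estimator of $\Lambda_0(\tau)$. Let

$$M(\beta, \tau) = N(\tau) - \int_0^{\tau} Y(\beta, u)\, \lambda_0(u)\, du. \tag{5}$$

Then for all τ such that Y(τ) > 0,

$$\hat{\Lambda}(\beta_0, \tau) - \Lambda_0(\tau) = \int_0^{\tau} \frac{\mathbb{1}_{Y(u) > 0}}{Y(\beta_0, u)}\, dM(\beta_0, u) \tag{6}$$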

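is a mean-zero martingale when β = β0. Second, $\hat{\Lambda}(\beta, \tau)$ maximizes the log likelihood

$$\ell(\beta, \Lambda) = \sum_{j=1}^{n} \ln\Big( r\big(\beta^{T} X_{v_j j}(\tau_{v_j j})\big)\, d\Lambda(\tau_{v_j j}) \Big) - \int_0^{\infty} Y(\beta, u)\, d\Lambda(u) \tag{7}$$

over all step functions Λ(τ). Substituting $\hat{\Lambda}(\beta, \tau)$ into $\ell(\beta, \Lambda)$ gives the log profile likelihood

$$\ell(\beta, \hat{\Lambda}) = \left( \sum_{j=1}^{n} \ln \frac{r\big(\beta^{T} X_{v_j j}(\tau_{v_j j})\big)}{Y(\beta, \tau_{v_j j})} \right) - N(\mathcal{T}), \tag{8}$$

where $\mathcal{T} = \max\{\tau : Y(\tau) > 0\}$. The first term is similar to the log partial likelihood from Cox (1972) and the second term does not depend on β. Dropping the second term, let

$$pl(\beta) = \sum_{j=1}^{n} \ln \frac{r\big(\beta^{T} X_{v_j j}(\tau_{v_j j})\big)}{Y(\beta, \tau_{v_j j})} \tag{9}$$

be the log partial likelihood for β. This derivation of the partial likelihood as a profile likelihood follows that of Johansen (1983). Let $\hat{\beta}$ denote the value of β that maximizes pl(β), and let $\hat{\Lambda}_0(\tau) = \hat{\Lambda}(\hat{\beta}, \tau)$ denote the corresponding Breslow estimate of the baseline cumulative hazard.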
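The Breslow estimator (3) and the log partial likelihood (9) can be sketched for the special case r(x) = exp(x) with a single time-fixed covariate. This is an illustrative sketch, not the paper's implementation: the data layout (one tuple per ordered pair holding the covariate, the infectious age at which the pair leaves the risk set, and an indicator that it left through an observed transmission) and the function names are assumptions.

```python
import math

def risk_total(beta, pairs, u):
    """Y(beta, u): sum of exp(beta * x) over pairs still at risk at infectious age u."""
    return sum(math.exp(beta * x) for x, exit_age, _ in pairs if exit_age >= u)

def log_partial_likelihood(beta, pairs):
    """pl(beta): sum over observed transmissions of ln(r / Y) with r(x) = exp(x)."""
    return sum(beta * x - math.log(risk_total(beta, pairs, exit_age))
               for x, exit_age, event in pairs if event)

def breslow(beta, pairs, tau):
    """Breslow estimate of the cumulative baseline hazard at infectious age tau."""
    return sum(1.0 / risk_total(beta, pairs, exit_age)
               for _, exit_age, event in pairs if event and exit_age <= tau)

# (covariate, exit age, event indicator) for three ordered pairs
pairs = [(0.0, 1.0, 1), (1.0, 2.0, 1), (0.0, 3.0, 0)]
print(log_partial_likelihood(0.0, pairs))
print(breslow(0.0, pairs, 2.5))
```

At β = 0 the risk totals are simply the numbers of pairs at risk (3 at age 1 and 2 at age 2), so the log partial likelihood is −ln 6 and the Breslow estimate at age 2.5 is 1/3 + 1/2.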

2.2 Partial likelihood score process

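We can rewrite pl(β) as a sum of stochastic integrals:

$$pl(\beta) = \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\infty} \ln \frac{r\big(\beta^{T} X_{ij}(u)\big)}{Y(\beta, u)}\, dN_{ij}(u). \tag{10}$$

The corresponding score process is

$$U(\beta, \tau) = \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\tau} \left[ \frac{\partial}{\partial\beta} \ln r\big(\beta^{T} X_{ij}(u)\big) - E(\beta, u) \right] dN_{ij}(u), \tag{11}$$

where

$$E(\beta, u) = \frac{\sum_{j=1}^{n} \sum_{i \ne j} r\big(\beta^{T} X_{ij}(u)\big)\, Y_{ij}(u)\, \frac{\partial}{\partial\beta} \ln r\big(\beta^{T} X_{ij}(u)\big)}{\sum_{j=1}^{n} \sum_{i \ne j} r\big(\beta^{T} X_{ij}(u)\big)\, Y_{ij}(u)} \tag{12}$$

is the expected value of $\frac{\partial}{\partial\beta} \ln r\big(\beta^{T} X_{ij}(u)\big)$ over the risk set at u when each pair is weighted by its hazard of infectious contact at u. By the Doob-Meyer decomposition, there is a mean-zero martingale $M_{ij}(u)$ for each ij such that

$$dN_{ij}(u) = r\big(\beta_0^{T} X_{ij}(u)\big)\, \lambda_0(u)\, Y_{ij}(u)\, du + dM_{ij}(u). \tag{13}$$

Expanding equation (11) using this decomposition and simplifying, we get

$$U(\beta_0, \tau) = \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\tau} \frac{\partial}{\partial\beta} \ln \frac{r\big(\beta_0^{T} X_{ij}(u)\big)}{Y(\beta_0, u)}\, dM_{ij}(u). \tag{14}$$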

Since it is a sum of integrals of predictable processes with respect to martingales, U(β0, τ) is a mean-zero martingale.

2.3 Observed and expected information

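Let $v^{\otimes 2} = v v^{T}$ for a column vector v. Since the $N_{ij}(\tau)$ do not jump simultaneously in continuous time, the predictable variation process of U(β0, τ) is

$$\langle U(\beta_0) \rangle(\tau) = \int_0^{\tau} V(\beta_0, u)\, Y(\beta_0, u)\, \lambda_0(u)\, du, \tag{15}$$

where

$$V(\beta, u) = \sum_{j=1}^{n} \sum_{i \ne j} \left( \frac{\partial}{\partial\beta} \ln \frac{r\big(\beta^{T} X_{ij}(u)\big)}{Y(\beta, u)} \right)^{\otimes 2} \frac{r\big(\beta^{T} X_{ij}(u)\big)\, Y_{ij}(u)}{Y(\beta, u)} \tag{16}$$

is the variance of $\frac{\partial}{\partial\beta} \ln r\big(\beta^{T} X_{ij}(u)\big)$ over the risk set at u when each pair ij is weighted by its hazard of infectious contact at u.

Let $I(\beta) = -\frac{\partial^2}{\partial\beta^2}\, pl(\beta)$ be the observed information. Then

$$I(\beta) = \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\infty} \left[ \left( \frac{\partial}{\partial\beta} \ln r\big(\beta^{T} X_{ij}(u)\big) \right)^{\otimes 2} - E(\beta, u)^{\otimes 2} \right] dN_{ij}(u) - \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\infty} \left[ \frac{\frac{\partial^2}{\partial\beta^2} r\big(\beta^{T} X_{ij}(u)\big)}{r\big(\beta^{T} X_{ij}(u)\big)} - \frac{\frac{\partial^2}{\partial\beta^2} Y(\beta, u)}{Y(\beta, u)} \right] dN_{ij}(u). \tag{17}$$

Expanding I(β0) via the Doob-Meyer decomposition (13) and simplifying, we get

$$I(\beta_0) = \int_0^{\infty} V(\beta_0, u)\, Y(\beta_0, u)\, \lambda_0(u)\, du - \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\infty} \frac{\partial^2}{\partial\beta^2} \ln \frac{r\big(\beta_0^{T} X_{ij}(u)\big)}{Y(\beta_0, u)}\, dM_{ij}(u). \tag{18}$$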

The second term has expectation zero, so I(β0) is an unbiased estimate of Var(U(β0, ∞)).

Another estimate of Var(U(β0, ∞)) is obtained by substituting the increments of the Breslow estimator (3) for λ0(u) du in equation (15). This gives us the estimated expected information

$$\mathcal{I}(\beta) = \int_0^{\infty} V(\beta, u)\, dN(u). \tag{19}$$

Expanding $\mathcal{I}(\beta_0)$ using the Doob-Meyer decomposition and simplifying, we get

$$\mathcal{I}(\beta_0) = \int_0^{\infty} V(\beta_0, u)\, Y(\beta_0, u)\, \lambda_0(u)\, du + \int_0^{\infty} V(\beta_0, u)\, dM(u), \tag{20}$$

where $M(\tau) = \sum_{j=1}^{n} \sum_{i \ne j} M_{ij}(\tau)$. The second term has expectation zero, so $\mathcal{I}(\beta_0)$ is also an unbiased estimate of the variance of U(β0, ∞). The expected information $\mathcal{I}(\beta_0)$ may be a better estimator of Var(U(β0, ∞)) than the observed information $I(\beta_0)$ because it is guaranteed to be positive semidefinite (Prentice and Self, 1983) and it depends only on aggregates over risk sets (Aalen et al., 2009).

When r(x) = exp(x) as in the Cox model, $I(\beta) = \mathcal{I}(\beta)$ for all β. For general r(x), $I(\beta_0)$ and $\mathcal{I}(\beta_0)$ are asymptotically equivalent under weak regularity conditions (see Appendix A online).

2.4 Large-sample estimation of β0 and Λ0(τ)

Appendix A, available online, outlines sufficient conditions for the asymptotic normality of U(β0, τ) and $\hat{\beta}$ as m → ∞, where m is the number of pairs ij at risk of transmission. Under these conditions, hypothesis tests and confidence intervals for β0 can be obtained using score, Wald, or likelihood ratio statistics. These conditions are very similar to those for the standard Cox model (Prentice and Self, 1983) except for the additional requirement that both the number of susceptibles and the number of pairs be large such that each susceptible is exposed to a number of infectors $\ll m$. When a given susceptible j is infected, all pairs ij are censored. If there were many pairs but few susceptibles, each susceptible would be exposed to a very high hazard of infection and most pairs would be censored very quickly. To take an extreme case, imagine a single susceptible exposed to m infecteds. The number of pairs at risk of transmission is m but the susceptible will be infected almost immediately when m is large. After this, there are no more pairs at risk of transmission and we can learn nothing further about β0 or Λ0(τ).

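Given $\hat{\beta}$, the Breslow estimator of $\Lambda_0(\tau)$ is $\hat{\Lambda}_0(\tau) = \hat{\Lambda}(\hat{\beta}, \tau)$. Its variance is consistently estimated by

$$\hat{\sigma}_0^2(\tau) = \left( \frac{\partial}{\partial\beta} \hat{\Lambda}(\hat{\beta}, \tau) \right)^{T} I(\hat{\beta})^{-1} \left( \frac{\partial}{\partial\beta} \hat{\Lambda}(\hat{\beta}, \tau) \right) + \int_0^{\tau} \frac{1}{Y(\hat{\beta}, u)^2}\, dN(u), \tag{21}$$

which is derived in Appendix B.1. The observed information $I(\hat{\beta})$ can be replaced by the estimated expected information $\mathcal{I}(\hat{\beta})$. Using the martingale central limit theorem and a log transformation, we get the approximate pointwise 1 − α confidence limits

$$\hat{\Lambda}_0(\tau) \exp\left( \pm \frac{\hat{\sigma}_0(\tau)}{\hat{\Lambda}_0(\tau)}\, \Phi^{-1}\Big(1 - \frac{\alpha}{2}\Big) \right). \tag{22}$$

Point and interval estimates for the baseline survival function can be obtained using the relationship S0(τ) = exp(−Λ0(τ)).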
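The log-transformed limits in equation (22) and the implied limits for the baseline survival function are simple to compute once an estimate and its standard error are available. The sketch below uses hypothetical function names and made-up input values, and takes the standard error as given instead of computing it from equation (21).

```python
import math
from statistics import NormalDist

def cumulative_hazard_limits(cum_hazard, std_error, alpha=0.05):
    """Pointwise 1 - alpha limits for the cumulative hazard using a log transformation."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    factor = math.exp(z * std_error / cum_hazard)
    return cum_hazard / factor, cum_hazard * factor

def survival_limits(cum_hazard, std_error, alpha=0.05):
    """Limits for S0 = exp(-Lambda0); the upper hazard limit gives the lower survival limit."""
    lower, upper = cumulative_hazard_limits(cum_hazard, std_error, alpha)
    return math.exp(-upper), math.exp(-lower)

lower, upper = cumulative_hazard_limits(0.5, 0.1)
print(lower, upper)
print(survival_limits(0.5, 0.1))
```

Because the interval is symmetric on the log scale, the limits stay positive and their geometric mean equals the point estimate.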

2.5 Who-infects-whom is not observed

When we do not observe who-infected-whom, we do not know which contact intervals are observed and which are censored. It is impossible to calculate the partial likelihood pl(β) or the Breslow estimate $\hat{\Lambda}(\beta, \tau)$. In this section, we show how an EM algorithm similar to that of Kenah (2013) can be used to obtain consistent and asymptotically normal estimates of β0 and Λ0(τ).

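Given β, λ(τ), and the observed data, the probability that j was infected by i is

$$p_{ij}(\beta, \lambda) = \frac{r\big(\beta^{T} X_{ij}(t_j - t_i - \varepsilon_i)\big)\, \lambda(t_j - t_i - \varepsilon_i)\, \mathbb{1}_{i \in \mathcal{V}_j}}{\sum_{k \in \mathcal{V}_j} r\big(\beta^{T} X_{kj}(t_j - t_k - \varepsilon_k)\big)\, \lambda(t_j - t_k - \varepsilon_k)}, \tag{23}$$

and the infectors of different infected persons can be chosen independently (Kenah et al., 2008). The probability of a transmission network $v = (v_1, \ldots, v_n)$ given β, λ(τ), and the observed data is

$$\Pr(v \mid \beta, \lambda, \text{observed data}) = \prod_{j : 0 < v_j < \infty} p_{v_j j}(\beta, \lambda). \tag{24}$$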

Note that the last two equations assume a continuous contact interval distribution, so simultaneous infectious contacts have probability zero.
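Equation (23) amounts to normalizing each candidate infector's hazard of infectious contact at the moment j was infected. A minimal sketch with r(x) = exp(x) and a scalar covariate follows; the function name, the dictionary layout, and the linear hazard used in the example are assumptions chosen for illustration.

```python
import math

def infector_probabilities(candidates, beta, hazard):
    """candidates maps each i in the infectious set of j to (x_ij, infectious age of i at t_j)."""
    weights = {i: math.exp(beta * x) * hazard(age) for i, (x, age) in candidates.items()}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

# Two possible infectors with equal covariates, at infectious ages 1 and 2 when j is infected.
probs = infector_probabilities({"a": (0.0, 1.0), "b": (0.0, 2.0)},
                               beta=0.0, hazard=lambda tau: 0.72 * tau)
print(probs)
```

With a hazard that increases linearly in infectious age, the candidate at age 2 is twice as likely to be the infector as the candidate at age 1.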

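Let $pl_v(\beta)$ be the log partial likelihood that we would have calculated had we observed the transmission network v. Given a coefficient vector β* and a baseline hazard function λ*(τ), the expected log partial likelihood is

$$\widetilde{pl}_{\beta^*, \lambda^*}(\beta) = \sum_{v \in \mathcal{V}} pl_v(\beta)\, \Pr(v \mid \beta^*, \lambda^*, \text{observed data}) = \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\mathcal{T}} \ln \frac{r\big(\beta^{T} X_{ij}(u)\big)}{Y(\beta, u)}\, d\tilde{N}_{ij}(u \mid \beta^*, \lambda^*), \tag{25}$$

where $\tilde{N}_{ij}(\tau \mid \beta^*, \lambda^*) = p_{ij}(\beta^*, \lambda^*)\, \mathbb{1}_{\tau \ge t_j - t_i - \varepsilon_i}$. Now let $N(\tau \mid v)$ be the value of N(τ) that we would have calculated had we observed the transmission network v. The corresponding Breslow estimate is

$$\hat{\Lambda}_v(\beta, \tau) = \int_0^{\tau} \frac{1}{Y(\beta, u)}\, dN(u \mid v). \tag{26}$$

The marginal Breslow estimate given β* and λ*(τ) is

$$\tilde{\Lambda}_{\beta^*, \lambda^*}(\beta, \tau) = \sum_{v \in \mathcal{V}} \hat{\Lambda}_v(\beta, \tau)\, \Pr(v \mid \beta^*, \lambda^*, \text{observed data}) = \int_0^{\tau} \frac{1}{Y(\beta, u)}\, d\tilde{N}(u \mid \beta^*, \lambda^*), \tag{27}$$

where $\tilde{N}(\tau \mid \beta^*, \lambda^*) = \sum_{j=1}^{n} \sum_{i \ne j} \tilde{N}_{ij}(\tau \mid \beta^*, \lambda^*)$.

For the relative risk function r(x) = exp(x), the expected log partial likelihood $\widetilde{pl}_{\beta^*, \lambda^*}(\beta)$ is the log partial likelihood of a weighted Cox regression model (Therneau and Grambsch, 2000) with two copies of each pair ij: an uncensored copy with weight $p_{ij}(\beta^*, \lambda^*)$ and a censored copy with weight $1 - p_{ij}(\beta^*, \lambda^*)$. The baseline hazard estimate from this model is the marginal Breslow estimate $\tilde{\Lambda}_{\beta^*, \lambda^*}(\tilde{\beta}, \tau)$, where $\tilde{\beta} = \arg\max_{\beta} \widetilde{pl}_{\beta^*, \lambda^*}(\beta)$.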

2.6 EM algorithm

When who-infects-whom is not observed, the semiparametric regression model can be fit using the ECM algorithm of Meng and Rubin (1993), which is an extension of the EM algorithm of Dempster et al. (1977). In each iteration, we first estimate β0 using the expected log partial likelihood and then calculate the marginal Breslow estimator of Λ0(τ). We use these new estimates to re-weight the possible v. The entire process is described in Algorithm 1.

Algorithm 1 ECM algorithm for semiparametric estimation of β0 and Λ0(τ).
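The body of Algorithm 1 is not reproduced in this text, so the following Python sketch is only one possible reading of the steps described above (an E step that re-weights possible infectors via equation (23), a CM1 step that maximizes the expected log partial likelihood over β, and a CM2 step that computes the marginal Breslow increments). It assumes r(x) = exp(x), a single time-fixed covariate, a Newton update with a capped step size, and a hypothetical data layout in which each ordered pair records its covariate `x`, the infectious age `exit` at which it leaves the risk set, and the infectee `j` if the pair is a possible transmission (otherwise `None`).

```python
import math
from collections import defaultdict

def risk_sums(beta, pairs, u):
    """Sums of w, x*w and x*x*w over pairs at risk at infectious age u, with w = exp(beta*x)."""
    s0 = s1 = s2 = 0.0
    for p in pairs:
        if p["exit"] >= u:
            w = math.exp(beta * p["x"])
            s0, s1, s2 = s0 + w, s1 + p["x"] * w, s2 + p["x"] ** 2 * w
    return s0, s1, s2

def e_step(beta, d_lambda, pairs):
    """Weights p_ij proportional to exp(beta*x_ij) * dLambda(age), normalized within each infectee."""
    raw = [math.exp(beta * p["x"]) * d_lambda[p["exit"]] if p["j"] is not None else 0.0
           for p in pairs]
    totals = defaultdict(float)
    for p, r in zip(pairs, raw):
        if p["j"] is not None:
            totals[p["j"]] += r
    return [r / totals[p["j"]] if p["j"] is not None else 0.0 for p, r in zip(pairs, raw)]

def cm1_step(beta, weights, pairs, newton_steps=25):
    """Maximize the expected log partial likelihood over beta by Newton's method."""
    for _ in range(newton_steps):
        score = info = 0.0
        for p, w in zip(pairs, weights):
            if w > 0.0:
                s0, s1, s2 = risk_sums(beta, pairs, p["exit"])
                score += w * (p["x"] - s1 / s0)
                info += w * (s2 / s0 - (s1 / s0) ** 2)
        if info <= 0.0 or abs(score) < 1e-10:
            break
        beta += max(-1.0, min(1.0, score / info))  # capped step (an added safeguard)
    return beta

def cm2_step(beta, weights, pairs):
    """Marginal Breslow increments: dLambda(age) = p_ij / Y(beta, age), summed over tied ages."""
    d_lambda = defaultdict(float)
    for p, w in zip(pairs, weights):
        if w > 0.0:
            d_lambda[p["exit"]] += w / risk_sums(beta, pairs, p["exit"])[0]
    return d_lambda

def ecm(pairs, max_iter=25, tol=1e-8):
    beta, d_lambda = 0.0, defaultdict(lambda: 1.0)  # start from beta = 0 and a flat hazard
    for _ in range(max_iter):
        weights = e_step(beta, d_lambda, pairs)
        new_beta = cm1_step(beta, weights, pairs)
        d_lambda = cm2_step(new_beta, weights, pairs)
        converged = abs(new_beta - beta) < tol
        beta = new_beta
        if converged:
            break
    return beta, dict(d_lambda), weights

pairs = [{"x": 1.0, "exit": 1.0, "j": "a"}, {"x": 0.0, "exit": 1.5, "j": "a"},
         {"x": 0.0, "exit": 2.0, "j": "b"}, {"x": 1.0, "exit": 2.5, "j": "b"},
         {"x": 1.0, "exit": 3.0, "j": None}, {"x": 0.0, "exit": 3.5, "j": None}]
beta, d_lambda, weights = ecm(pairs)
print(beta, weights)
```

The sketch monitors only the change in β; a fuller implementation would also monitor the baseline hazard and, as in the simulations of Section 3, could delegate the CM1 step to a weighted Cox regression routine in standard software.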

To show that this is an ECM algorithm, we must show that the CM1 and CM2 steps are conditional maximizations of the expected log likelihood. The CM1 step is a conditional maximization by definition, so it remains to show that the CM2 step is a conditional maximization. Given a coefficient vector β* and a hazard function λ*, the expected log likelihood is

$$\tilde{\ell}_{\beta^*, \lambda^*}(\beta, \Lambda) = \sum_{j=1}^{n} \sum_{i \ne j} p_{ij}(\beta^*, \lambda^*) \ln\Big( r\big(\beta^{T} X_{ij}(t_j - t_i - \varepsilon_i)\big)\, d\Lambda(t_j - t_i - \varepsilon_i) \Big) - \int_0^{\infty} Y(\beta, u)\, d\Lambda(u). \tag{28}$$

Differentiating with respect to $d\Lambda(t_j - t_i - \varepsilon_i)$ for each i and j shows that, for a fixed β, $\tilde{\ell}_{\beta^*, \lambda^*}(\beta, \Lambda)$ is maximized over all step functions Λ(τ) by setting

$$d\Lambda(t_j - t_i - \varepsilon_i) = \frac{p_{ij}(\beta^*, \lambda^*)}{Y(\beta, t_j - t_i - \varepsilon_i)}, \tag{29}$$

exactly as in the marginal Breslow estimator $\tilde{\Lambda}_{\beta^*, \lambda^*}(\beta, \tau)$. Therefore, Algorithm 1 is an ECM algorithm. When it is known that β = 0, it reduces to the EM algorithm in Kenah (2013). Therefore, the convergence of both $\beta^{(k)}$ and $\Lambda^{(k)}(\tau)$ should be monitored.

2.7 Large-sample estimation of β0

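Let $\tilde{\beta}$ denote the estimate of β0 to which the ECM algorithm converges, and let $\tilde{\lambda}(\tau)$ denote the corresponding estimate of λ0(τ). Let $U_v(\beta, \tau)$ and $I_v(\beta)$ denote the score and the observed information that we would have calculated had we observed the transmission network v. Using the methods of Louis (1982), the observed information is

$$\tilde{I}(\tilde{\beta}) = E_{\tilde{\beta}, \tilde{\lambda}}\big[ I_v(\tilde{\beta}) \big] - E_{\tilde{\beta}, \tilde{\lambda}}\big[ U_v(\tilde{\beta}, \infty)^{\otimes 2} \big], \tag{30}$$

where $E_{\beta, \lambda}[\cdot]$ denotes an expectation taken under the assumption that the true coefficient vector is β and the true baseline hazard function is λ(τ). The first term in (30) is

$$-\sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\infty} \frac{\partial^2}{\partial\beta^2} \ln \frac{r\big(\tilde{\beta}^{T} X_{ij}(u)\big)}{Y(\tilde{\beta}, u)}\, d\tilde{N}_{ij}(u), \tag{31}$$

where $\tilde{N}_{ij}(u) = \tilde{N}_{ij}(u \mid \tilde{\beta}, \tilde{\lambda})$. This is the observed information matrix from a weighted regression model where each ij has an uncensored copy with weight $p_{ij}(\tilde{\beta}, \tilde{\lambda})$ and a censored copy with weight $1 - p_{ij}(\tilde{\beta}, \tilde{\lambda})$. To evaluate the second term in (30), let

$$\tilde{U}_{\cdot j}(\beta, \tau) = \sum_{i \ne j} \int_0^{\tau} \frac{\partial}{\partial\beta} \ln \frac{r\big(\beta^{T} X_{ij}(u)\big)}{Y(\beta, u)}\, d\tilde{N}_{ij}(u) \tag{32}$$

be the expected score contribution from all pairs with j as a susceptible. Since $\sum_{j=1}^{n} \tilde{U}_{\cdot j}(\tilde{\beta}, \infty) = 0$, each infected person j has only one infector in any v, and the infectors of different individuals can be chosen independently, we have

$$E_{\tilde{\beta}, \tilde{\lambda}}\big[ U_v(\tilde{\beta}, \infty)^{\otimes 2} \big] = \sum_{j=1}^{n} \sum_{i \ne j} \int_0^{\infty} \left( \frac{\partial}{\partial\beta} \ln \frac{r\big(\tilde{\beta}^{T} X_{ij}(u)\big)}{Y(\tilde{\beta}, u)} \right)^{\otimes 2} d\tilde{N}_{ij}(u) - \sum_{j=1}^{n} \tilde{U}_{\cdot j}(\tilde{\beta}, \infty)^{\otimes 2}. \tag{33}$$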

2.8 Large-sample estimation of Λ0(τ)

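Let $\tilde{\Lambda}_0(\tau)$ be the marginal Breslow estimate obtained after convergence of the ECM algorithm. Appendix B.2, available online, derives the variance estimate

$$\tilde{\sigma}_0^2(\tau) = \left( \frac{\partial}{\partial\beta} \tilde{\Lambda}_{\tilde{\beta}, \tilde{\lambda}}(\tilde{\beta}, \tau) \right)^{T} \tilde{I}(\tilde{\beta})^{-1} \left( \frac{\partial}{\partial\beta} \tilde{\Lambda}_{\tilde{\beta}, \tilde{\lambda}}(\tilde{\beta}, \tau) \right) \tag{34}$$

$$\qquad + 2 \int_0^{\tau} \frac{1}{Y(\tilde{\beta}, u)^2}\, d\tilde{N}(u) - \sum_{j=1}^{n} \left( \int_0^{\tau} \frac{1}{Y(\tilde{\beta}, u)}\, d\tilde{N}_{\cdot j}(u) \right)^{2}, \tag{35}$$

where $\tilde{N}_{\cdot j}(u) = \sum_{i \ne j} \tilde{N}_{ij}(u)$. Using the martingale central limit theorem and a log transformation, we get the approximate pointwise 1 − α confidence limits

$$\tilde{\Lambda}_0(\tau) \exp\left( \pm \frac{\tilde{\sigma}_0(\tau)}{\tilde{\Lambda}_0(\tau)}\, \Phi^{-1}\Big(1 - \frac{\alpha}{2}\Big) \right). \tag{36}$$

Point and interval estimates for the baseline survival function can be obtained using the relationship S0(τ) = exp(−Λ0(τ)).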

3 SIMULATIONS

The performance of the methods from Section 2 was tested with a series of 12000 network-based epidemic simulations. All epidemics took place on a Watts-Strogatz small-world network (Watts and Strogatz, 1998), which mimics the high clustering and low diameter of real human contact networks. Starting with a ring of 50000 nodes, each node was connected to its 10 nearest neighbors and each edge was rewired to a randomly chosen node with probability 0.1. A new contact network was built for each simulation.

All epidemic models were written in Python 2.7 (www.python.org) using the packages NetworkX 1.6 (networkx.lanl.gov), NumPy 1.6, and SciPy 0.9 (www.scipy.org). Statistical analysis was done in R 2.15 (www.r-project.org) via the Rpy2 2.2 package (rpy.sourceforge.net). The code for the models is available as Online Supplementary Information.

3.1 Transmission model

The transmission model had a latent period of zero and an exponential infectious period with mean one. The baseline contact interval distribution was Weibull(α, γ), where α is the shape parameter and γ is the rate parameter. 6000 simulations had a Weibull(0.5, 0.2) distribution, which has $\Lambda_0(\tau) = (0.2\tau)^{0.5}$. The other 6000 had a Weibull(2, 0.6) distribution, which has $\Lambda_0(\tau) = (0.6\tau)^{2}$. These distributions gave a basic reproduction number (expected number of infectious contacts made by a typical infectious person) R0 ≈ 3 in a null model.
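As a rough plausibility check of R0 ≈ 3, one can integrate the probability that a Weibull contact interval ends before a unit-mean exponential infectious period and multiply by the number of neighbors. The sketch below is not from the paper: it assumes 10 neighbors per node (the ring degree before rewiring), ignores that one neighbor is usually the infector, and uses a plain midpoint rule with a hypothetical function name.

```python
import math

def contact_probability(shape, rate, upper=40.0, steps=200000):
    """P(tau <= iota) for a Weibull(shape, rate) contact interval and an Exp(1) infectious period."""
    h = upper / steps
    escape = 0.0
    for k in range(steps):
        x = (k + 0.5) * h                       # midpoint of the k-th subinterval
        survival = math.exp(-((rate * x) ** shape))  # P(tau > x)
        escape += survival * math.exp(-x) * h   # weight by the Exp(1) density of iota
    return 1.0 - escape

degree = 10  # assumed number of neighbors per infectious person
r0_values = {}
for shape, rate in [(0.5, 0.2), (2.0, 0.6)]:
    r0_values[(shape, rate)] = degree * contact_probability(shape, rate)
    print(shape, rate, round(r0_values[(shape, rate)], 2))
```

Under these assumptions both baseline distributions give an expected number of infectious contacts close to 3.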

In the transmission model, each person i had an infectiousness covariate $X_i^{inf}$ and a susceptibility covariate $X_i^{sus}$. Each pair ij connected by an edge had a pairwise covariate $X_{ij}^{pair}$. All covariates were independent Bernoulli(0.5) random variables. For a connected pair ij, the hazard of infectious contact from i to j at infectious age τ of i was

$$\lambda_{ij}(\tau) = \exp\big( \beta_{inf} X_i^{inf} + \beta_{sus} X_j^{sus} + \beta_{pair} X_{ij}^{pair} \big)\, \lambda_0(\tau). \tag{37}$$

For each parameter β, there were 4000 simulations where its true value was chosen from a uniform distribution on (−1, 1). Of these, 2000 simulations used the Weibull(0.5, 0.2) baseline hazard and 2000 used the Weibull(2, 0.6) baseline hazard. Of the 2000 simulations for each baseline hazard, 1000 had the other two β set to 0 and 1000 had the other two β set to 1.

Each simulated epidemic began with a single person infected at time 0. Data from the next 1000 infections was used to fit two regression models, one using information on who-infected-whom as in Section 2.1 and one using an EM algorithm as in Section 2.5. The EM algorithm used a minimum of 2 and a maximum of 25 iterations. At each iteration, a weighted Cox model was run using the last parameter estimates as the initial parameter estimates. Convergence was defined as a change less than 0.002 in the expected log likelihood (tighter convergence criteria yielded nearly identical parameter estimates). After convergence, a Cox model was run using the final weights and initial parameters βinf = βsus = βpair = 0.

After each simulation, we recorded the true value, estimate, and 95% confidence interval endpoints for each β in the model and the baseline hazard at the 10th, 25th, 50th, 75th, and 90th percentiles of all censored and observed contact intervals. We also recorded the α and γ of the baseline hazard function and the number of EM iterations.

3.2 Results

Figure 3 shows good agreement between the estimated and true βinf, βsus, and βpair for both $\hat{\beta}$ (who-infects-whom observed) and $\tilde{\beta}$ (who-infects-whom unobserved). Table 1 shows excellent 95% confidence interval coverage probabilities for all combinations of baseline hazards and parameters. The $\tilde{\beta}$ estimates had slightly lower coverage probabilities than the $\hat{\beta}$ estimates. The lower right panel of Figure 3 shows that this was achieved with relatively few iterations. The median number of iterations was 6, and 98% of simulations required ≤ 10 iterations. Only 2 out of 12000 simulations failed to converge within 25 iterations.

Figure 3.

The top two panels and the bottom left panel show $\hat{\beta}$ (black circles) and $\tilde{\beta}$ (gray circles) versus true β for βinf, βsus, and βpair. The bottom right panel shows a histogram of the number of EM iterations required for convergence.

Table 1.

Hazard ratio 95% confidence interval coverage probabilities in simulations. Each probability is based on the results of 1000 simulations.

| Parameter | Baseline hazard | Other two β = 0: $\hat{\beta}$ | Other two β = 0: $\tilde{\beta}$ | Other two β = 1: $\hat{\beta}$ | Other two β = 1: $\tilde{\beta}$ |
|---|---|---|---|---|---|
| βinf | α = 0.5 | .952 | .937 | .937 | .945 |
| βinf | α = 2 | .955 | .952 | .957 | .941 |
| βsus | α = 0.5 | .951 | .950 | .939 | .939 |
| βsus | α = 2 | .952 | .948 | .945 | .946 |
| βpair | α = 0.5 | .942 | .927 | .955 | .946 |
| βpair | α = 2 | .942 | .929 | .951 | .951 |

Figures 4 and 5 show good agreement between the estimated and true baseline hazard for both $\hat{\Lambda}_0(\tau)$ (who-infects-whom observed) and $\tilde{\Lambda}_0(\tau)$ (who-infects-whom unobserved). The smoothed means show almost no bias in $\hat{\Lambda}_0(\tau)$ or $\tilde{\Lambda}_0(\tau)$. Table 2 shows good 95% confidence interval coverage probabilities for both baseline hazards and all percentiles except for α = 2 at the 10th and 25th percentiles. When α = 2, the cumulative baseline hazard is $\Lambda_0(\tau) = (0.6\tau)^2$, so the baseline hazard of infectious contact is λ0(τ) = 0.72τ. At low values of τ, the hazard of infectious contact is very small, so almost all of the contact intervals will be censored. Since the percentiles are calculated for all censored and observed contact intervals, there may be too few observed intervals at the 10th and 25th percentiles when α = 2 for the large-sample normal approximation to be valid.

Figure 4.

$\hat{\Lambda}_0(\tau)$ and $\tilde{\Lambda}_0(\tau)$ versus true $\Lambda_0(\tau)$ for the 6000 simulations with a Weibull(0.5, 0.2) baseline contact interval distribution. For each simulation and each estimator, a circle is shown for the 10th, 25th, 50th, 75th, and 90th percentiles of all possible contact intervals. The smoothed means were calculated using cubic smoothing splines.

Figure 5.

$\hat{\Lambda}_0(\tau)$ and $\tilde{\Lambda}_0(\tau)$ versus true $\Lambda_0(\tau)$ for the 6000 simulations with a Weibull(2, 0.6) baseline contact interval distribution. For each simulation and each estimator, a circle is shown for the 10th, 25th, 50th, 75th, and 90th percentiles of all possible contact intervals. The smoothed means were calculated using cubic smoothing splines.

Table 2.

95% confidence interval coverage probabilities in simulations. Each probability is based on the results of 6000 simulations.

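| Quantile | α = 0.5: $\hat{\Lambda}_0(\tau)$ | α = 0.5: $\tilde{\Lambda}_0(\tau)$ | α = 2: $\hat{\Lambda}_0(\tau)$ | α = 2: $\tilde{\Lambda}_0(\tau)$ |
|---|---|---|---|---|
| 10% | .949 | .938 | .957 | .875 |
| 25% | .949 | .940 | .951 | .905 |
| 50% | .950 | .941 | .955 | .934 |
| 75% | .949 | .936 | .949 | .939 |
| 90% | .949 | .941 | .953 | .939 |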

Figure 6 shows the widths of the confidence intervals when who-infects-whom is not observed in terms of the width of the confidence interval when who-infects-whom is observed for βinf, βsus, βpair, and Λ0(τ). For βinf and βpair, the precision gained by observing who-infects-whom is roughly equivalent to a 20-40% increase in sample size. The baseline hazard plays an important role in how much precision is gained, with a larger gain for α = 0.5 than for α = 2. There is no gain in precision for βsus because observing who-infects-whom does not add to our knowledge of who was infected. Seeing who-infects-whom only slightly improves the precision of baseline hazard estimates.

Figure 6.


The width of 95% confidence intervals when who-infects-whom is not observed divided by its width when who-infects-whom is observed for βinf, βsus, βpair, and Λ0(τ). The solid gray lines show smoothed means for α = 0.5 and dashed gray lines show smoothed means for α = 2. The smoothed means were calculated using cubic smoothing splines.

Observing who-infects-whom allows point estimates that are closer to the truth and interval estimates with better coverage probabilities. However, the EM algorithm can recover a great deal of information when who-infects-whom is not observed, making the iterative regression model of Section 2.5 a promising tool for infectious disease epidemiology.

4 DATA ANALYSIS

To show how the methods of Section 2 can be applied, we will look at the effect of antiviral prophylaxis and age on the transmission of pandemic influenza A(H1N1) in Los Angeles County in 2009. The Los Angeles County Department of Public Health (LACDPH) collected household surveillance data between April 22 and May 19 according to the following protocol (Sugimoto et al., 2011):

1. Nasopharyngeal swabs and aspirates were taken from individuals who reported to the LACDPH or other health care providers with acute febrile respiratory illness (AFRI), defined as a fever ≥ 100°F plus cough, sore throat, or runny nose. These specimens were tested for influenza, and the age, gender, and symptom onset date of the AFRI patient were recorded.

2. Patients whose specimens tested positive for pandemic influenza A(H1N1) or for influenza A of undetermined subtype were enrolled as index cases. Each of them was given a structured phone interview to collect the following information about his or her household contacts: age, gender, type of contact (household, intimate, in-home daycare, non-home daycare), and high risk status (pregnant, child on long-term aspirin therapy, immuno-suppressed, or history of a chronic cardiac, pulmonary, renal, liver, or neurologic condition). The interviewer also recorded whether prophylactic antiviral medication was being taken by the household contacts. Index cases were also asked to report the symptom onset date of any AFRI episodes among their household contacts.

3. When necessary, a follow-up interview was given 14 days after the symptom onset date of the index case to assess whether any additional AFRI episodes had occurred in the household, including their illness onset date.

There were 58 households with a total of 299 members. There were 99 infections, of which 62 were index cases (4 of the 58 households had co-primary cases) and 37 were household contacts with an AFRI. For simplicity, we assume these were all influenza A(H1N1) cases and that all household members were susceptible to infection.

Our natural history assumptions were adapted from Yang et al. (2009) and are identical to those in Kenah (2013). In the primary analysis, we assumed an incubation period of 2 days, a latent period of 0 days, and an infectious period of 6 days. Under these assumptions, a person j with symptom onset at time $t_j^{\mathrm{sym}}$ was infected at time $t_j = t_j^{\mathrm{sym}} - 2$ and stops being infectious at time $t_j + 6 = t_j^{\mathrm{sym}} + 4$, so person j can transmit infection on days $t_j + 1$ to $t_j + 6$. In a sensitivity analysis, we varied the latent period from 0 to 1 days and the infectious period from 5 to 7 days.
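The natural history assumptions above amount to simple date arithmetic. The following sketch shows one way to compute the infection time and infectious days from a symptom onset day. The function names are hypothetical, and the way a nonzero latent period shifts the infectious window is an assumption made for illustration; only the primary-analysis defaults are taken from the text.

```python
def infection_time(symptom_onset, incubation=2):
    """Infection day implied by a symptom onset day and a fixed incubation period."""
    return symptom_onset - incubation

def infectious_days(symptom_onset, incubation=2, latent=0, infectious=6):
    """Days on which a person can transmit infection.

    With the primary-analysis defaults this is t_j + 1, ..., t_j + 6.
    Shifting the window by `latent` days is an illustrative assumption.
    """
    t_j = infection_time(symptom_onset, incubation)
    first_day = t_j + latent + 1
    last_day = t_j + latent + infectious
    return list(range(first_day, last_day + 1))

# Example: symptom onset on day 10 of the study.
print(infection_time(10))
print(infectious_days(10))
```

For a person with symptom onset on day 10, the last infectious day is day 14, which is the onset day plus 4, in agreement with the text.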

We modeled influenza transmission within households, not between households or from outside the observed households. In each household, infected household members who had no possible infector within the household according to our natural history assumptions were assumed to be imported infections. We assumed that any infected household member could infect any susceptible household member. We used the regression model of Section 2.5 to estimate influenza transmission hazard ratios for the following covariates:

  • ageinf = 0 if the infectious person is < 18 years old and 1 otherwise,

  • agesus = 0 if the susceptible is < 18 years old and 1 otherwise,

  • prophsus = 0 if the susceptible is not on antiviral prophylaxis and 1 otherwise.

Since antiviral prophylaxis was initiated after the initial case in each household, it was considered only as a susceptibility covariate. All statistical analysis was done in R 2.15 (www.r-project.org).
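As a rough illustration of how the pairwise data might be assembled, the sketch below finds the possible infectors of each infected household member and codes the three covariates for a pair. The `Member` class, the example household, and all function names are hypothetical; the 2-day incubation period and 6-day infectious period are the primary-analysis assumptions.

```python
from dataclasses import dataclass
from typing import Optional

INCUBATION = 2   # days from infection to symptom onset
INFECTIOUS = 6   # length of the infectious period in days

@dataclass
class Member:
    pid: str
    age: int
    on_prophylaxis: Optional[bool]  # None if missing
    onset: Optional[int]            # symptom onset day; None if never ill

def possible_infectors(j, household):
    """Household members whose infectious days include j's infection time."""
    if j.onset is None:
        return []
    t_j = j.onset - INCUBATION
    infectors = []
    for i in household:
        if i is j or i.onset is None:
            continue
        t_i = i.onset - INCUBATION
        if t_i + 1 <= t_j <= t_i + INFECTIOUS:
            infectors.append(i)
    return infectors

def pair_covariates(i, j):
    """Covariates for the pair ij, coded as in the list above."""
    proph = None if j.on_prophylaxis is None else int(j.on_prophylaxis)
    return {"age_inf": int(i.age >= 18), "age_sus": int(j.age >= 18), "proph_sus": proph}

# Hypothetical household, not taken from the LACDPH data.
household = [
    Member("A", 35, False, 10),
    Member("B", 8, True, 13),
    Member("C", 40, None, None),
]
for j in household:
    if j.onset is None:
        continue
    v_j = possible_infectors(j, household)
    label = "imported" if not v_j else "possible infectors: " + ", ".join(m.pid for m in v_j)
    print(j.pid, label)
```

In this example, A has no possible infector within the household and is treated as an imported infection, while B has A as the only possible infector.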

4.1 Results

There were 114 people aged < 18 years and 185 aged ≥ 18 years, with no missing age data. There were 91 people taking antiviral prophylaxis and 152 not taking prophylaxis, with missing prophylaxis data for 56 people. When who-infects-whom is not observed, a complete-case analysis requires the removal of all rows corresponding to infectious-susceptible pairs ij where i ∈ Vj and any member of Vj is missing data. Otherwise, the remaining members of Vj get too much credit for the infection of j.
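The complete-case rule can be sketched as a filter on the pairwise data. This is a minimal illustration, assuming one entry per pair rather than one row per pair-day and a single hypothetical flag per person indicating whether a covariate needed by the model is missing. The function and variable names are not from the original analysis.

```python
def complete_case_pairs(pairs, infector_sets, has_missing):
    """Return the pairs kept in a complete-case analysis.

    pairs: iterable of (i, j) infectious-susceptible pairs.
    infector_sets: dict mapping each infected j to its set V_j of possible infectors.
    has_missing: dict mapping each person to True if a needed covariate is missing.
    """
    kept = []
    for i, j in pairs:
        v_j = infector_sets.get(j, set())
        # Rule from the text: drop every pair ij with i in V_j
        # if any member of V_j has missing data.
        if i in v_j and any(has_missing[k] for k in v_j):
            continue
        # Usual complete-case rule (assumed here): drop pairs whose
        # own members have missing data.
        if has_missing[i] or has_missing[j]:
            continue
        kept.append((i, j))
    return kept

# Hypothetical example: j was infected with V_j = {a, b}; k escaped infection.
pairs = [("a", "j"), ("b", "j"), ("a", "k"), ("b", "k")]
infector_sets = {"j": {"a", "b"}}
has_missing = {"a": False, "b": True, "j": False, "k": False}
kept = complete_case_pairs(pairs, infector_sets, has_missing)
print(kept)
```

Here the pair (a, j) is removed even though a and j have complete data, because keeping it would give a all of the credit for infecting j.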

In the main analysis, there were 70 people infected from outside the household (i.e., no possible infector in the household), 16 with 1 possible infector, 7 with 2 possible infectors, 4 with 4 possible infectors, and 2 with 8 possible infectors, giving us $1^{16} \times 2^{7} \times 4^{4} \times 8^{2} = 2097152$ possible transmission trees. The pairwise data contains 443 infectious-susceptible pairs with a total of 2455 pair-days at risk of infection. Of these, 16 × 1 + 7 × 2 + 4 × 4 + 2 × 8 = 62 rows represent possible infection events. All models used the Efron approximation (Efron, 1977) for the partial likelihood with tied failure times.
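The counts in this paragraph can be checked directly. The sketch below multiplies the sizes of the infector sets to count transmission trees and sums them to count rows representing possible infection events; the dictionary layout and variable names are illustrative.

```python
from math import prod

# Number of possible infectors -> number of infected people with that many.
infector_counts = {1: 16, 2: 7, 4: 4, 8: 2}
n_imported = 70

# Each infected person independently picks one infector from V_j.
n_trees = prod(size ** n for size, n in infector_counts.items())
# One row per (possible infector, infectee) combination.
n_event_rows = sum(size * n for size, n in infector_counts.items())
# People with at least one possible infector in the household.
n_within_household = sum(infector_counts.values())

print(n_trees)
print(n_event_rows)
print(n_within_household + n_imported)
```

This gives 2097152 transmission trees, 62 rows representing possible infection events, and 29 within-household infections, which together with the 70 imported infections account for all 99 infections.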

The top panel of Table 3 shows the results of seven models. All of the models including prophylaxis suggested that antiviral prophylaxis reduced the hazard of infectious contact by about 60%, with low p-values. Hazard ratio point estimates for the main effects of age in all models suggest that adults are more infectious and less susceptible than children. However, evidence for this result is very weak. Only one of the age effects was statistically significant in univariable models, and none were significant in any multivariable model. Multivariable and stratified models with interaction terms for age and antiviral prophylaxis suggest a stronger effect of antiviral prophylaxis on transmission to and from adults than on transmission to and from children. However, the evidence for this result is also weak; these coefficients had high p-values and wide confidence intervals. The bottom panel of Table 3 shows the results of a sensitivity analysis using the multivariable model without interaction. Varying the latent and infectious periods has little effect on the results of the model.

Table 3.

Hazard ratios with 95% confidence intervals and p-values for different models of the 2009 pandemic influenza A(H1N1) household surveillance data from Los Angeles County. Likelihood ratio p-values comparing models with and without interaction terms are also given. The multivariable and stratified models without interaction were used as the final models.

Regression model

| Model | ageinf | agesus | prophsus | Interactions: HR (p-value) | LR p-value |
|---|---|---|---|---|---|
| Univariable | 1.53 (0.66, 3.54) p = .321 | 0.41 (0.20, 0.85) p = .016 | 0.43 (0.18, 1.02) p = .057 | | |
| Multivariable | 1.78 (0.69, 4.62) p = .234 | 0.69 (0.29, 1.64) p = .399 | 0.41 (0.17, 0.98) p = .046 | | |
| Multivariable + interaction | 1.59 (0.32, 7.84) p = .570 | 0.63 (0.14, 2.73) p = .532 | 0.04 (0.00, 9.62) p = .253 | ageinf:agesus 0.66 (p = .71); ageinf:prophsus 9.28 (p = .45); agesus:prophsus 2.72 (p = .36) | p = .101 |
| Stratified | strata | 0.69 (0.29, 1.64) p = .401 | 0.41 (0.17, 0.99) p = .047 | | |
| Stratified + interaction | strata | 0.52 (0.29, 1.64) p = .219 | 0.23 (0.05, 1.16) p = .075 | ageinf:prophsus 2.37 (p = .38) | p = .353 |

Sensitivity analysis (multivariable model without interaction)

| Assumption | ageinf | agesus | prophsus |
|---|---|---|---|
| Latent period 1 day | 1.44 (0.64, 3.26) p = .378 | 0.83 (0.36, 1.93) p = .670 | 0.35 (0.15, 0.80) p = .013 |
| Infectious period 5 days | 1.59 (0.60, 4.20) p = .348 | 0.64 (0.27, 1.55) p = .322 | 0.45 (0.18, 1.07) p = .073 |
| Infectious period 7 days | 1.45 (0.62, 3.40) p = .378 | 0.89 (0.38, 2.04) p = .670 | 0.34 (0.17, 0.87) p = .013 |

Figure 7 shows estimates of the cumulative transmission probability based on the multivariable and stratified models without interaction. The results of the two models are similar, but the stratified model showed lower probabilities of transmission from children and higher probabilities of transmission from adults. All four panels clearly show the estimated effect of antiviral prophylaxis. All curves show bigger jumps on the first four days after infection than on days 5 and 6, which is consistent with the results of Kenah (2013). Comparing the top and bottom rows shows that children are estimated to be less infectious than adults. Comparing the left and right columns shows that children are estimated to be more susceptible than adults. As noted above, these differences are not statistically significant.
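Curves like those in Figure 7 can be sketched from the hazard ratios and a cumulative baseline hazard. The example below assumes the usual proportional hazards relationship, in which the cumulative transmission probability is 1 − exp(−r Λ0(τ)) for relative risk r. The hazard ratios are the multivariable estimates from Table 3, but the baseline cumulative hazard values are made up for illustration and are not estimates from the data.

```python
import math

# Multivariable hazard ratios from Table 3.
HR = {"age_inf": 1.78, "age_sus": 0.69, "proph_sus": 0.41}

def relative_risk(age_inf, age_sus, proph_sus):
    """Relative risk for a pair with the given 0/1 covariates."""
    return (HR["age_inf"] ** age_inf) * (HR["age_sus"] ** age_sus) * (HR["proph_sus"] ** proph_sus)

def cumulative_transmission_probability(cum_baseline_hazard, rr):
    """1 - exp(-rr * Lambda0(tau)) at each infectious age tau."""
    return [1.0 - math.exp(-rr * lam) for lam in cum_baseline_hazard]

# Hypothetical cumulative baseline hazard on infectious days 1 to 6.
baseline = [0.02, 0.04, 0.06, 0.08, 0.085, 0.09]

# Adult infector, child susceptible, without and with prophylaxis.
no_proph = cumulative_transmission_probability(baseline, relative_risk(1, 0, 0))
with_proph = cumulative_transmission_probability(baseline, relative_risk(1, 0, 1))
for day, (p0, p1) in enumerate(zip(no_proph, with_proph), start=1):
    print(f"day {day}: no prophylaxis {p0:.3f}, prophylaxis {p1:.3f}")
```

Each of the four panels corresponds to one combination of the two age covariates, with the two curves in a panel differing only in the prophylaxis indicator.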

Figure 7.

Household transmission of 2009 pandemic influenza A(H1N1) in Los Angeles County. Each panel shows separate curves for susceptible contacts with (gray lines) and without (black lines) antiviral prophylaxis. The solid lines are based on the multivariable model, and the dotted lines are based on the model stratified by ageinf.

This data analysis is intended primarily to illustrate the flexibility of the regression modeling framework for analyzing transmission data. There are several important limitations of the analysis itself. With only 29 within-household transmissions, the large-sample normal approximations may not hold and there is limited power to estimate the effects of age and antiviral prophylaxis. The age classification is crude, so it may not accurately capture the effects of age. The prophylaxis variable was missing for many pairs and was binary, allowing no consideration of the timing of prophylaxis relative to exposure. Analyses of the household transmission of influenza A(H3N2) found greater child-to-child than adult-to-adult transmission (Addy et al., 1991). In our analysis of influenza A(H1N1), children appeared less infectious and more susceptible than adults, but these differences were not statistically significant. If not due to random noise, such a result could reflect a difference between the H3N2 and H1N1 subtypes of influenza A or a bias caused by the failure to account for infection from outside the household. In any case, this analysis shows that the model needs to be extended to model infection from outside the household and to handle missing data.

5 DISCUSSION

The semiparametric relative-risk regression model proposed here has several important advantages over the chain binomial model. It can be fit using standard statistical software with parameter interpretations that resemble those of the Cox model. Standard software can be used to convert the results into curves representing the cumulative probability of transmission in pairs of individuals with specific characteristics. It does not make any parametric assumptions about the baseline hazard of infectious contact, and it allows many of the same extensions as the Cox model, including stratification, interaction, and time-dependent covariates. This flexibility and ease of use will make it an important tool for infectious disease epidemiology. To realize this potential, there are several limitations that remain to be addressed.

We assumed that the set of imported infections is known. The chain binomial model handles unknown imported infections by including a per-time-unit probability of escaping infection from outside the household. In the semiparametric regression model, this could be achieved by fitting two models in each step of the EM algorithm: a pairwise contact interval model within the household and an individual-level model in absolute time for infection from outside the household. At each step, the weights would be recalculated based on covariates, coefficient estimates, the baseline hazard of the contact interval distribution, and the baseline hazard of infection from outside the household.
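One way such a weight calculation might look is sketched below. It assumes that the probability that a given source infected j is proportional to that source's hazard at the infection time of j, with the outside of the household treated as one additional source. This is an illustrative assumption about the proposed extension, not a method developed in this paper, and the function name, the "outside" key, and the example hazards are hypothetical.

```python
def infector_weights(pair_hazards, external_hazard=0.0):
    """Weights for each possible source of infection for a person j.

    pair_hazards: dict mapping each possible infector i in V_j to the hazard of
        infectious contact from i to j at the infection time of j.
    external_hazard: hazard of infection from outside the household at that
        time (0.0 reproduces the case where imported infections are known).
    """
    total = external_hazard + sum(pair_hazards.values())
    if total <= 0:
        raise ValueError("at least one source must have a positive hazard")
    weights = {i: h / total for i, h in pair_hazards.items()}
    weights["outside"] = external_hazard / total
    return weights

# Hypothetical hazards for two possible infectors and the outside source.
weights = infector_weights({"a": 0.3, "b": 0.1}, external_hazard=0.1)
print(weights)
```

With the external hazard set to zero, the weights reduce to the pairwise hazards normalized over the possible infectors within the household.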

We assumed that infection times, latent periods, and infectious periods were all observed. We can usually observe only the clinical course of the disease, so these times must be imputed. In Section 4, we had missing data on covariates. Simple missing data (such as antiviral prophylaxis) could be handled by extending the EM algorithm to calculate the expected log likelihood over the possible values of the missing data as well as who-infected-whom. More complex missing data (such as infection and removal times) could be handled using data augmentation in a profile sampler (Lee et al., 2005), getting a posterior distribution for the model coefficients while treating the baseline hazards as a nuisance parameter.

We assumed that all possible infectors of each person were observed. Unobserved infectors could occur because of incomplete contact tracing or asymptomatic infection. The possible bias caused by unobserved sources of infection needs to be studied, and methods for controlling it analytically or assessing its severity in a sensitivity analysis need to be developed.

We assumed a static contact network where the Cij were binary and constant. In reality, people are exposed to close contacts at home, at work, at school, and at other locations in a dynamic process. The extension of these methods to dynamic contact networks is possible but nontrivial. We could allow Cij(τ) to be a time-dependent process in the infectious age τ of i. The contact interval distribution would then be defined as the distribution of the contact interval that would occur if Cij(τ) = 1 for all τ. For estimation, we would have to observe the process Cij(τ) for each ij.

Some of our assumptions must be relaxed to capture the natural history of complex diseases. We assumed an SEIR framework best suited to acute, immunizing diseases that spread directly from person to person. Many foodborne and waterborne diseases, pneumococcal and meningococcal diseases, and other infectious diseases of major public health importance do not fit easily into this framework. To extend the proposed regression model to complex diseases, we could allow individuals to experience multiple events (e.g., first infection, second infection) or to experience different types of events (e.g., colonization, infection, relapse). We assumed that contact intervals were independent of infectious periods even though both are affected by the same host-pathogen interaction. In some cases, there may be a covariate process X(τ) such that Ii(τ) and Nij(τ) are conditionally independent given X(τ). Otherwise, infectious contact and the infectious period must be modeled as a multivariate survival process.

Several technical issues need further attention. The smoothing step is crucial to fitting the regression model when who-infects-whom is not observed. Here, we used cubic smoothing splines because they were convenient and worked well. However, these do not guarantee that the smoothed hazard function is monotonically increasing and lack a convenient interpretation in terms of the likelihood. A penalized likelihood estimator that guarantees monotonicity, such as that of Anderson and Senthilselvan (1980), would be more consistent with the EM algorithm. Model diagnostics, goodness-of-fit tests, and small-sample methods for point and interval estimation need to be developed, and a more rigorous study of the model asymptotics needs to be done.

Despite these limitations, the semiparametric relative-risk regression model presented here is a powerful new framework for the analysis of infectious disease transmission data. Placing statistical methods for infectious disease epidemiology on the broad and deep theoretical foundation of survival analysis will help clarify study design and causal inference for communicable diseases and allow statistical methods to develop in concert with advances in molecular biology. Ultimately, these improvements may lead to more efficient vaccine trials and a better-informed public health response to future outbreaks and epidemics.

Supplementary Material

Supplementary Material - Appendices
Supplementary Material - Python Module

Acknowledgments

This research was supported by National Institute of Allergy and Infectious Diseases (NIAID) grant K99/R00 AI095302. The content is solely the responsibility of the author and does not necessarily represent the official views of NIAID or the National Institutes of Health. The author thanks Tom Britton, Yang Yang, M. Elizabeth Halloran, and Ira M. Longini, Jr. for their comments and suggestions.

Footnotes

SUPPLEMENTARY MATERIALS

Supplementary material available online at the Journal of the American Statistical Association website includes:

Appendices for Section 2: Outline of sufficient conditions for consistency and asymptotic normality (Appendix A) and derivation of the asymptotic variance of baseline hazard estimates (Appendix B).

Simulation code for Section 3: Python module used for the simulations, including statistical analyses in R.

References

  1. Aalen Odd O., Borgan Ørnulf, Gjessing Hakon. Statistics for Biology and Health. Springer-Verlag; New York: 2009. Survival and Event History Analysis: A Process Point of View.
  2. Addy Cheryl L., Longini Ira M., Jr, Haber Michael. A generalized stochastic model for the analysis of infectious disease final size data. Biometrics. 1991;47:961–974.
  3. Anderson JA, Senthilselvan A. Smooth estimates for the hazard function. Journal of the Royal Statistical Society, Series B. 1980;42:322–327.
  4. Andersson Håkan, Britton Tom. Lecture Notes in Statistics. Springer; New York: 2000. Stochastic Epidemic Models and Their Statistical Analysis.
  5. Becker Niels G. Monographs on Statistics and Applied Probability. Chapman & Hall/CRC; Boca Raton, FL: 1989. Analysis of Infectious Disease Data.
  6. Breslow N. Contribution to discussion of paper by D. R. Cox. Journal of the Royal Statistical Society, Series B. 1972;34:216–217.
  7. Cox David R. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  8. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
  9. Efron Bradley. The efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association. 1977;72:557–565.
  10. Johansen Søren. An extension of Cox's regression model. International Statistical Review. 1983;51:165–174.
  11. Kenah Eben. Contact intervals, survival analysis of epidemic data, and estimation of R0. Biostatistics. 2011;12:548–566. doi: 10.1093/biostatistics/kxq068.
  12. Kenah Eben. Nonparametric survival analysis of epidemic data. Journal of the Royal Statistical Society, Series B. 2013;75:277–303. doi: 10.1111/j.1467-9868.2012.01042.x.
  13. Kenah Eben, Lipsitch Marc, Robins James M. Generation interval contraction and epidemic data analysis. Mathematical Biosciences. 2008;213:71–79. doi: 10.1016/j.mbs.2008.02.007.
  14. Lee Bee Leng, Kosorok Michael R., Fine Jason P. The profile sampler. Journal of the American Statistical Association. 2005;100:960–969.
  15. Louis Thomas A. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B. 1982;44:226–233.
  16. Meng Xiao-Li, Rubin Donald. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika. 1993;80:267–278.
  17. Prentice Ross L., Self Steven G. Asymptotic distribution theory for Cox-type regression models with general relative risk form. Annals of Statistics. 1983;11:804–813.
  18. Rampey Alvin H., Jr, Longini Ira M., Jr, Haber Michael, Monto Arnold S. A discrete-time model for the statistical analysis of infectious disease incidence data. Biometrics. 1992;48:117–128.
  19. Rhodes Philip H., Halloran M. Elizabeth, Longini Ira M., Jr. Counting process models for infectious disease data: Distinguishing exposure to infection from susceptibility. Journal of the Royal Statistical Society, Series B. 1996;58:751–762.
  20. Sugimoto Jonathan D., Yang Yang, Halloran M. Elizabeth, Dean Brandon, Oiulfstad Brit, Bagwell Dee Ann, Mascola Laurene, Bancroft Elizabeth, Longini Ira M., Jr. Accounting for unobserved immunity and asymptomatic infection in the early household transmission of the pandemic influenza A (H1N1) 2009. Submitted to American Journal of Epidemiology. 2011.
  21. Svensson Åke. A note on generation times in epidemic models. Mathematical Biosciences. 2007;208:300–311. doi: 10.1016/j.mbs.2006.10.010.
  22. Therneau Terry M., Grambsch Patricia M. Statistics for Biology and Health. Springer-Verlag; New York: 2000. Modeling Survival Data: Extending the Cox Model.
  23. Wallinga Jacco, Teunis Peter. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of Epidemiology. 2004;160:509–516. doi: 10.1093/aje/kwh255.
  24. Watts Duncan J., Strogatz Steven H. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918.
  25. White L. Forsberg, Pagano Marcello. A likelihood-based method for real-time estimate of the serial interval and reproductive number of an epidemic. Statistics in Medicine. 2008;27:2999–3016. doi: 10.1002/sim.3136.
  26. Yang Yang, Sugimoto Jonathan, Halloran M. Elizabeth, Basta Nicole E., Chao Dennis L., Matrajt Laura, Potter Gail, Kenah Eben, Longini Ira M., Jr. The transmissibility and control of pandemic influenza A(H1N1) virus. Science. 2009;326:729–733. doi: 10.1126/science.1177373.
