Author manuscript; available in PMC: 2008 Aug 15.
Published in final edited form as: Comput Stat Data Anal. 2007 Aug 15;51(12):6582–6595. doi: 10.1016/j.csda.2007.03.007

A Data-Augmentation Method for Infectious Disease Incidence Data from Close Contact Groups

Yang Yang 1,*, Ira M Longini Jr 1,2, M Elizabeth Halloran 1,2
PMCID: PMC2131714  NIHMSID: NIHMS29356  PMID: 18704156

Abstract

A broad range of studies of preventive measures in infectious diseases gives rise to incidence data from close contact groups. Parameters of common interest in such studies include transmission probabilities and efficacies of preventive or therapeutic interventions. We estimate these parameters using discrete-time likelihood models. We augment the data with unobserved pairwise transmission outcomes and fit the model using the EM algorithm. A linear model derived from the likelihood based on the augmented data and fitted with the iteratively re-weighted least squares method is also discussed. Using simulations, we demonstrate the comparable accuracy and lower sensitivity to initial estimates of the proposed methods with data augmentation relative to the likelihood model based solely on the observed data. Two randomized household-based trials of zanamivir, an influenza antiviral agent, are analyzed using the proposed methods.

Keywords: Antiviral agent, Data augmentation, EM algorithm, Infectious disease, Intervention efficacy, Linear model, MLE

1 Introduction

Close contact groups, such as households, are important sites of transmission for many infectious diseases. Data collected from these contact groups provide a basis for evaluating person-to-person transmission risks and the effectiveness of interventions such as antiviral treatments or vaccines (Halloran, Struchiner and Longini, 1997; Becker, Britton and O'Neill, 2003). Various statistical methods have been developed for different levels of information available in the data. If only the final infection status of participants is known, methods utilizing recursive final-size probabilities can be applied, including likelihood maximization (Longini and Koopman, 1982; Addy, Longini and Haber, 1991), Bayesian approaches (O'Neill and Roberts, 1999), generalized linear models (Magder and Brookmeyer, 1993), and estimating equations with martingale techniques (Becker and Hasofer, 1997). In many modern clinical trials, sequential laboratory tests and symptom diaries of participants provide time-to-event data with individual-specific longitudinal exposure information. To take into account exposure and transmission dynamics at the individual level, Rampey et al. (1992) constructed discrete-time likelihoods based on assumptions about the natural history of the disease, such as the distributions of the latent and infectious periods. Yang, Longini and Halloran (2006) extended this method to the more realistic case-ascertained design. Cauchemez et al. (2004) proposed a Bayesian model with the flexibility of estimating the natural history of the disease, but time-dependent covariates have not been accommodated.

The discrete-time likelihoods in Rampey et al. (1992) and Yang et al. (2006) are built solely upon the observed data, including symptom onset dates, laboratory test results and household structure (which individuals live in which households), and involve summing probability components over the latent period. Summations or integrals are common in likelihoods based solely on the observed data, and such complicated structure may present difficulties for standard analyses or prevent extension by other methods (O'Neill et al., 2000). More importantly, when data are sparse because of rare incidence and/or a multicovariate structure, iterative estimation procedures (e.g., the Newton-Raphson algorithm) using only the observed data may be sensitive to the initial estimates in locating the maximum likelihood estimates (MLEs). This sensitivity can be seen in Sections 3 and 4 of this paper, and is also mentioned in Yang et al. (2006). Data augmentation is a popular technique to circumvent computational difficulties in classical likelihood methods because likelihood functions conditional on unobserved variables are often simpler (van Dyk and Meng, 2001; Paap, 2002). In a transmission model for infectious diseases, a basic element is the transmission probability given a contact between an infective person and a susceptible person. The contact may be defined in various ways, for example, one day of living in the same household. The outcome of each contact, infection or escape, is generally not observable since a person may make multiple contacts before infection. In this paper, we revise the discrete-time likelihood in Yang et al. (2006) by augmenting the observed symptom onset data with the unobserved transmission outcome for each contact. This likelihood based on the augmented data has a simpler form than the one based on only the observed data and can be maximized with the EM algorithm.
To illustrate the potential use of the simple likelihood by a different method, we derive a linear model that can be fitted using the iteratively re-weighted least squares (IRLS) procedure. We show via simulation studies that both the maximum likelihood (ML) and the IRLS methods using the augmented data are less sensitive to initial estimates as compared to the ML method using only the observed data in Yang et al. (2006). We use the proposed approaches to estimate the prophylactic and treatment effectiveness of an influenza antiviral agent in two household trials.

2 Methods

Suppose that the disease under investigation is influenza and the data arise from a clinical trial in which household members are randomized to either an antiviral agent or control when an index case is identified by clinical symptoms. Let us assume the antiviral agent provides temporary protection for susceptible contacts and therapy for cases. In the discrete-time likelihood model setting, risks are evaluated for each susceptible participant in each time interval. Suppose that the time intervals are consecutive days, and define a contact as the exposure of a susceptible person to an infective person in the same household throughout a day. The pairwise transmission probability per contact between a susceptible person $i$ with covariates $x_i$ and an infective person $j$ with covariates $x_j$ in the same household is expressed as $p(x_i, x_j)$. If $x_i$ and $x_j$ are scalars denoting antiviral treatment status (1 = yes, 0 = no), then one can define the efficacy measures $\mathrm{AVE}_S = 1 - \frac{p(1,0)}{p(0,0)}$, $\mathrm{AVE}_I = 1 - \frac{p(0,1)}{p(0,0)}$ and $\mathrm{AVE}_T = 1 - \frac{p(1,1)}{p(0,0)}$, where in the epidemiological literature $\mathrm{AVE}_S$ measures the antiviral efficacy in reducing susceptibility, $\mathrm{AVE}_I$ measures the efficacy in reducing infectiousness, and $\mathrm{AVE}_T$ is called the total effectiveness (Halloran et al., 1997). Let $p = p(0, 0)$ be the baseline daily pairwise transmission probability without any treatment. For notational convenience, a reparameterization leads to $p(x_i, x_j) = \theta^{x_i(1-x_j)} \phi^{(1-x_i)x_j} \eta^{x_i x_j} p$, where $\theta = 1 - \mathrm{AVE}_S$, $\phi = 1 - \mathrm{AVE}_I$ and $\eta = 1 - \mathrm{AVE}_T$. For simplicity, we assume multiplicativity of $\theta$ and $\phi$ such that $\eta = \theta\phi$, and thus $p(x_i, x_j) = \theta^{x_i} \phi^{x_j} p$. In Yang et al. (2006), we explored the assumption of multiplicativity for the ML method using only the observed data.
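The relationship between the pairwise transmission probability and the three efficacy measures can be illustrated with a minimal sketch, assuming the multiplicative form in which the total effect is the product of the susceptibility and infectiousness effects. The function names `pairwise_prob` and `efficacies` are hypothetical, and the numerical values are illustrative only.

```python
def pairwise_prob(x_i, x_j, p, theta, phi):
    """Daily pairwise transmission probability p(x_i, x_j) under multiplicativity.

    x_i, x_j are 0/1 treatment indicators for the susceptible and the infective.
    """
    return (theta ** x_i) * (phi ** x_j) * p


def efficacies(p, theta, phi):
    """Efficacy measures defined as one minus a ratio of transmission probabilities."""
    base = pairwise_prob(0, 0, p, theta, phi)
    return {
        "AVE_S": 1 - pairwise_prob(1, 0, p, theta, phi) / base,
        "AVE_I": 1 - pairwise_prob(0, 1, p, theta, phi) / base,
        "AVE_T": 1 - pairwise_prob(1, 1, p, theta, phi) / base,
    }


# Illustrative values only (not estimates from any trial).
ave = efficacies(p=0.1, theta=0.4, phi=0.7)
print(ave)
```

Because every measure is a ratio to the baseline, the result does not depend on the value of the baseline probability itself.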

As our interest centers around estimation of transmission probabilities and treatment efficacies, we assume that: 1. the latent period (time from infection to being infectious) coincides with the incubation period (time from infection to the onset of symptoms); and 2. durations of the latent and the infectious periods have known probability distributions. If the latent and the incubation periods do not coincide but are both known, the model can be adjusted for such situation.

2.1 The Maximum Likelihood Method Based on the Augmented Data

Suppose that the trial is conducted on a population of size $N$ and is observed on a daily basis from day 1 to day $T$. Let us assume day 1 is the first calendar day of exposure for the whole study population. The observed data for each subject include household membership, the date of symptom onset, laboratory test result, randomized treatment and treatment period, as well as other characteristics such as age and gender. On day $t$, the probability that an infective person $j$ with treatment status $r_j(t)$ (0: untreated, 1: treated) infects a susceptible person $i$ with treatment status $r_i(t)$ in the same household is expressed as

$$p_{ji}(t) = \theta^{r_i(t)} \phi^{r_j(t)}\, p\, f(t \mid \tilde t_j), \tag{1}$$

where $f(t \mid \tilde t_j)$ is the probability that person $j$ stays infectious on day $t$ given the day of symptom onset $\tilde t_j$, and is derived from the known distribution of the infectious period. For simplicity in notation, we use $\tilde t_i$ to denote the observed symptom onset time for each person, although $\tilde t_i$ is right-censored for those who are free of symptoms up to day $T$. We allow a constant common infective source from outside of the household by setting $p_{ci}(t) = \theta^{r_i(t)} b$, where $c$ refers to the common source and $b$ is the baseline probability of being infected by the common source per day. Let $\psi_j = 1$ if the infective source $j$ is a person and $\psi_j = 0$ if $j = c$. A modification of (1) that takes the common source into account is

$$p_{ji}(t) = \theta^{r_i(t)} \phi^{r_j(t)}\, p^{\psi_j}\, b^{1-\psi_j}\, f(t \mid \tilde t_j), \tag{2}$$

where $f(t \mid \tilde t_c) = 1$ and $r_c(t) = 0$ for all $t$. A likelihood involving only the observed data, $\{\tilde t_i : 1 \le t \le T,\ 1 \le i \le N\}$, can be constructed from (2) and the known distribution of the latent period as in Yang et al. (2006).
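As a minimal sketch of equation (2), the function below evaluates the daily transmission probability for either a household infective or the common source. The name `daily_transmission_prob` and its argument names are hypothetical; the infectiousness value `f_val` is assumed to have been computed beforehand from the infectious period distribution.

```python
def daily_transmission_prob(r_i, r_j, is_person, p, b, theta, phi, f_val=1.0):
    """Daily probability that source j infects susceptible i, following equation (2).

    r_i, r_j  : 0/1 treatment status of the susceptible and the infective on day t
    is_person : True for a household infective, False for the common source c
    f_val     : probability that j is still infectious on day t (assumed precomputed)
    """
    if not is_person:
        # Conventions for the common source: never treated, always infectious.
        r_j, f_val = 0, 1.0
    psi = 1 if is_person else 0
    return (theta ** r_i) * (phi ** r_j) * (p ** psi) * (b ** (1 - psi)) * f_val


# Illustrative values only.
within = daily_transmission_prob(1, 1, True, p=0.1, b=0.005, theta=0.4, phi=0.7, f_val=0.5)
outside = daily_transmission_prob(1, 0, False, p=0.1, b=0.005, theta=0.4, phi=0.7)
print(within, outside)
```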

Let $Y_{ji}(t)$ be the transmission outcome (1: infection, 0: escape) between an infective source $j$ and a susceptible person $i$ on day $t$. Let $l_{\max}$ and $l_{\min}$ be the maximum and minimum durations of the latent period, so that $\underline{t}_i = \tilde t_i - l_{\max}$ and $\bar t_i = \tilde t_i - l_{\min}$ are the earliest and latest potential infection days for person $i$. Given the observed symptom onset day $\tilde t_i$, the sequence of $Y_{ji}(t)$'s for $t \le \bar t_i$ remains unknown. It should be noted that $Y_{ji}(t)$ is a random variable only if $Y_{ji}(\tau) = 0$ for all $\tau < t$, and that $Y_{ji}(t)$ is independent of $Y_{ki}(t)$ for the same day $t$. Define

$$Z_{ji}(t) = Y_{ji}(t) \prod_{k \in D_i,\ \tau < t} \bigl(1 - Y_{ki}(\tau)\bigr)$$

and

$$\bar Z_{ji}(t) = \bigl(1 - Y_{ji}(t)\bigr) \prod_{k \in D_i,\ \tau < t} \bigl(1 - Y_{ki}(\tau)\bigr),$$

where $D_i$ is the collection of potential infective sources for person $i$, i.e., people living in the same household as person $i$ plus the external common source. $Z_{ji}(t) = 1$ is the event that person $i$ escapes infection from any source before day $t$ but is infected by source $j$ on day $t$, while $\bar Z_{ji}(t) = 1$ is the event that person $i$ escapes infection from any source before day $t$ and from source $j$ on day $t$. Let $\max_{j \in D_i} Z_{ji}(t)$ indicate whether $Z_{ji}(t) = 1$ for any $j$ on day $t$. The likelihood of the augmented data is

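$$
\begin{aligned}
L_i\bigl(b, p, \theta, \phi \mid \tilde t_j, Z_{ji}(t), j \in D_i, t \le T\bigr)
&= \prod_{t=1}^{T} \Bigl\{ g(\tilde t_i \mid t)^{\max_{j \in D_i} Z_{ji}(t)} \prod_{j \in D_i} \bigl(p_{ji}(t)\bigr)^{Z_{ji}(t)} \bigl(1 - p_{ji}(t)\bigr)^{\bar Z_{ji}(t)} \Bigr\} \\
&\propto \Bigl\{ \prod_{t=1}^{T} \prod_{j \in D_i} \bigl(p_{ji}(t)\bigr)^{Z_{ji}(t)} \bigl(1 - p_{ji}(t)\bigr)^{\bar Z_{ji}(t)} \Bigr\},
\end{aligned}
\tag{3}
$$

where $g(\tilde t_i \mid t)$ denotes the probability of illness onset on day $\tilde t_i$ given infection on day $t$ and is derived from the distribution of the latent period. According to our assumption, both $f(t \mid \tilde t_j)$ and $g(\tilde t_i \mid t)$ are known. This likelihood is a product of binomial probability components, much simpler than the one in Yang et al. (2006). To apply the EM algorithm, we need to determine the distributions of $Z_{ji}(t)$ and $\bar Z_{ji}(t)$ conditional on current estimates of $b$, $p$, $\theta$ and $\phi$ as well as $\tilde t_j$, $j \in D_i$ (Dempster, Laird and Rubin, 1977). Define $S_i(t)$ as the event that person $i$ has symptom onset on day $t$, $I_i(t)$ the event that person $i$ is infected on day $t$, and $I_{ji}(t)$ the event that person $i$ is infected by $j$ on day $t$. Then, the conditional distributions are given by (Appendix A)

$$
\Pr\bigl(Z_{ji}(t) = 1 \mid b, p, \theta, \phi, \tilde t_i\bigr) =
\begin{cases}
\dfrac{\Pr(I_{ji}(t))}{\Pr(S_i(\tilde t_i))} \times \Pr\bigl(S_i(\tilde t_i) \mid I_i(t)\bigr), & \underline{t}_i \le t \le \bar t_i, \\[1ex]
0, & \text{otherwise}
\end{cases}
\tag{4}
$$

and

$$
\Pr\bigl(\bar Z_{ji}(t) = 1 \mid b, p, \theta, \phi, \tilde t_i\bigr) =
\begin{cases}
\dfrac{\Pr\bigl(S_i(\tilde t_i) \mid I_i(t)\bigr) \times \bigl\{\Pr(I_i(t)) - \Pr(I_{ji}(t))\bigr\}}{\Pr(S_i(\tilde t_i))} + \displaystyle\sum_{\tau = t+1}^{\bar t_i} \dfrac{\Pr\bigl(S_i(\tilde t_i) \mid I_i(\tau)\bigr) \times \Pr(I_i(\tau))}{\Pr(S_i(\tilde t_i))}, & \underline{t}_i \le t \le \bar t_i, \\[1ex]
1, & t < \underline{t}_i, \\[1ex]
0, & \text{otherwise.}
\end{cases}
\tag{5}
$$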
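To make the augmented indicators whose conditional probabilities appear in (4) and (5) concrete, the following sketch builds them from a fully known set of daily transmission outcomes for one susceptible person. In practice these outcomes are unobserved, so this is only an illustration of the definitions; the function name `augment` and the source labels are hypothetical.

```python
def augment(y):
    """Build infection and escape indicators from daily outcomes for one person.

    y maps each source label to a list of 0/1 outcomes, one per day.
    Returns (z, z_bar): z[j][t] = 1 if the person was still susceptible at the
    start of day t and was infected by j that day; z_bar[j][t] = 1 if still
    susceptible at the start of day t and escaped infection from j that day.
    """
    sources = list(y)
    n_days = len(y[sources[0]])
    z = {j: [0] * n_days for j in sources}
    z_bar = {j: [0] * n_days for j in sources}
    still_susceptible = 1  # product of (1 - Y) over all sources and earlier days
    for t in range(n_days):
        for j in sources:
            z[j][t] = y[j][t] * still_susceptible
            z_bar[j][t] = (1 - y[j][t]) * still_susceptible
        if any(y[j][t] for j in sources):
            still_susceptible = 0
    return z, z_bar


# Hypothetical outcomes: common source "c" never infects; housemate "h" infects on day 3.
z, z_bar = augment({"c": [0, 0, 0, 0], "h": [0, 0, 1, 0]})
print(z, z_bar)
```

After the infection day, both indicators are zero for every source, which is why only contacts up to the infection day contribute to the augmented likelihood.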

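Given estimates $(\hat b_{l-1}, \hat p_{l-1}, \hat\theta_{l-1}, \hat\phi_{l-1})$ from the $(l-1)$th iteration, in the $l$th iteration we have

$$
\begin{aligned}
\Pr(I_{ji}(t)) &=
\begin{cases}
\hat Q_i(t-1)\, \hat\theta_{l-1}^{r_i(t)}\, \hat\phi_{l-1}^{r_j(t)}\, \hat p_{l-1}\, f(t \mid \tilde t_j), & j \in D_i,\ j \ne c, \\
\hat Q_i(t-1)\, \hat\theta_{l-1}^{r_i(t)}\, \hat b_{l-1}, & j = c,
\end{cases} \\
\Pr(I_i(t)) &= \hat Q_i(t-1) \Bigl\{ 1 - \bigl(1 - \hat\theta_{l-1}^{r_i(t)} \hat b_{l-1}\bigr) \prod_{j \in D_i,\ j \ne c} \bigl(1 - \hat\theta_{l-1}^{r_i(t)} \hat\phi_{l-1}^{r_j(t)} \hat p_{l-1} f(t \mid \tilde t_j)\bigr) \Bigr\}, \\
\Pr(S_i(\tilde t_i)) &= \sum_{\tau = \underline{t}_i}^{\bar t_i} \Pr\bigl(S_i(\tilde t_i) \mid I_i(\tau)\bigr) \times \Pr(I_i(\tau)), \\
\Pr\bigl(S_i(\tilde t_i) \mid I_i(\tau)\bigr) &= g(\tilde t_i \mid \tau),
\end{aligned}
$$

where $\hat Q_i(t-1)$ is the estimated cumulative escape probability based on $(\hat b_{l-1}, \hat p_{l-1}, \hat\theta_{l-1}, \hat\phi_{l-1})$. The likelihood history before day $\underline{t}_i$ can be dropped from $\Pr(I_{ji}(t))$ and $\Pr(I_i(t))$, since $\hat Q_i(\underline{t}_i - 1)$ is a common factor that cancels in the calculation of (4) and (5). The implementation of the EM algorithm is straightforward. In the E-step, (4) and (5) are calculated and plugged into the logarithm of (3) to obtain

$$
\log\Bigl(L_i\bigl(b, p, \theta, \phi \mid \tilde t_j, Z_{ji}(t), j \in D_i, t \le T\bigr)\Bigr) \propto \sum_{t=1}^{T} \sum_{j \in D_i} \Bigl\{ \Pr\bigl(Z_{ji}(t) = 1 \mid b, p, \theta, \phi, \tilde t_i\bigr) \log\bigl(p_{ji}(t)\bigr) + \Pr\bigl(\bar Z_{ji}(t) = 1 \mid b, p, \theta, \phi, \tilde t_i\bigr) \log\bigl(1 - p_{ji}(t)\bigr) \Bigr\}, \tag{6}
$$

which is maximized in the M-step.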
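The E-step for a single symptomatic person can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the daily probabilities for each source are assumed to be precomputed from the current parameter estimates (including treatment effects and infectiousness), the earliest potential infection day is assumed to fall on or after day 1, and the cumulative escape probability is started at 1 on that day because the earlier history cancels. The names `e_step_weights`, `p_by_source` and `latent_pmf` are hypothetical.

```python
from math import prod


def e_step_weights(onset_day, p_by_source, latent_pmf):
    """Conditional probabilities of infection and escape, as in (4) and (5).

    onset_day   : observed symptom onset day of the person
    p_by_source : {source: {day: daily transmission probability}} at current estimates
    latent_pmf  : {duration in days: probability} for the latent period
    Returns two dicts keyed by (source, day), covering potential infection days only.
    """
    l_min, l_max = min(latent_pmf), max(latent_pmf)
    days = range(onset_day - l_max, onset_day - l_min + 1)
    q = 1.0  # cumulative escape probability; earlier history cancels out
    pr_inf_by, pr_inf = {}, {}
    for t in days:
        probs = {j: p_t[t] for j, p_t in p_by_source.items()}
        escape_all = prod(1.0 - v for v in probs.values())
        pr_inf_by[t] = {j: q * v for j, v in probs.items()}  # infected by j on day t
        pr_inf[t] = q * (1.0 - escape_all)                    # infected on day t
        q *= escape_all
    g = {t: latent_pmf.get(onset_day - t, 0.0) for t in days}
    pr_onset = sum(g[t] * pr_inf[t] for t in days)
    w_inf, w_esc = {}, {}
    for t in days:
        later = sum(g[s] * pr_inf[s] for s in days if s > t)
        for j in p_by_source:
            w_inf[(j, t)] = g[t] * pr_inf_by[t][j] / pr_onset
            w_esc[(j, t)] = (g[t] * (pr_inf[t] - pr_inf_by[t][j]) + later) / pr_onset
    return w_inf, w_esc


# Hypothetical example: onset on day 8, a common source and one infectious housemate.
pmf = {1: 0.2, 2: 0.6, 3: 0.2}
p_now = {"c": {t: 0.005 for t in range(1, 11)}, "h": {t: 0.1 for t in range(1, 11)}}
w_inf, w_esc = e_step_weights(8, p_now, pmf)
for key in sorted(w_inf):
    print(key, round(w_inf[key], 4), round(w_esc[key], 4))
```

Days before the earliest potential infection day are not returned here; on those days the escape weight is 1 and the infection weight is 0 for every source to which the person was exposed.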

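Variances of the parameter estimates can be evaluated using Louis' method (Louis, 1982). Let $Z$ be the collection of $Z_{ji}(t)$, and $\tilde t$ the collection of $\tilde t_i$, for all $i$, $j$ and $t$, so that $\tilde t$ is the observed data and $Z$ is the partially latent data. Let $\lambda = \{b, p, \theta, \phi\}$. Louis' method states that

$$
\frac{\partial^2 \log\bigl(L(\lambda \mid \tilde t)\bigr)}{\partial \lambda^2} = E_{Z \mid \tilde t, \lambda}\left\{ \frac{\partial^2 \log\bigl(L(\lambda \mid \tilde t, Z)\bigr)}{\partial \lambda^2} \right\} + \mathrm{VAR}_{Z \mid \tilde t, \lambda}\left\{ \frac{\partial \log\bigl(L(\lambda \mid \tilde t, Z)\bigr)}{\partial \lambda} \right\}.
$$

The first component on the right side can be evaluated analytically based on (6), while the second component can be estimated via sampling from the distribution of $Z$ conditional on $\tilde t$ and $\hat\lambda$.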

2.2 The Linear Model Based on the Augmented Data

A linear model is a natural consequence of modeling the daily pairwise transmissions. Taking the logarithm on both sides of (2),

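$$
\begin{aligned}
\log\bigl(p_{ji}(t)\bigr) &= \log(b) + \psi_j \log\frac{p}{b} + r_i(t) \log(\theta) + r_j(t) \log(\phi) + \log\bigl(f(t \mid \tilde t_j)\bigr) \\
&= \beta_0 + \beta_1 \psi_j + \beta_2 r_i(t) + \beta_3 r_j(t) + \log\bigl(f(t \mid \tilde t_j)\bigr).
\end{aligned}
\tag{7}
$$

The response of this model is $Y_{ji}(t)$, since $p_{ji}(t) = \Pr\bigl(Y_{ji}(t) = 1 \mid Y_{ki}(\tau) = 0, k \in D_i, \tau < t\bigr)$. From (6), it is clear that one should assign weights $\Pr\bigl(Z_{ji}(t) = 1 \mid b, p, \theta, \phi, \tilde t_i\bigr)$ to the outcome $Y_{ji}(t) = 1$ and $\Pr\bigl(\bar Z_{ji}(t) = 1 \mid b, p, \theta, \phi, \tilde t_i\bigr)$ to the outcome $Y_{ji}(t) = 0$. As the weights need to be calculated from pre-estimated parameters, we use the iteratively re-weighted least squares (IRLS) method to fit the model.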

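To apply the IRLS method, suppose the conditional expected frequencies of the $Y_{ji}(t)$'s have been summarized into $H$ binomial proportions $P_h$, $h = 1, \ldots, H$, with the $H$ covariate patterns defined by $r_i(t)$, $r_j(t)$, $\psi_j$ and $f(t \mid \tilde t_j)$. We fit model (7) by minimizing the objective function $\sum_{h=1}^{H} w_h \bigl\{\log(\tilde P_h) - \log(P_h)\bigr\}^2$, the weighted squared difference, on the log scale, between the observed proportion $\tilde P_h$ and the mean proportion $P_h$. Let $n_h$ be the number of observations in the $h$th pattern. The weight for the $h$th pattern, $w_h = \mathrm{VAR}^{-1}\bigl(\log(\hat P_h)\bigr)$, could be estimated from either $\tilde P_h$ (data-based) or the fitted response $\hat P_h$ (model-based). Our simulations suggest that combinations such as the arithmetic mean $\frac{1}{2}\left\{ n_h \times \frac{\tilde P_h}{1 - \tilde P_h} + n_h \times \frac{\hat P_h}{1 - \hat P_h} \right\}$ or the geometric mean $n_h \sqrt{\frac{\tilde P_h \hat P_h}{(1 - \tilde P_h)(1 - \hat P_h)}}$ provide estimates close to the MLEs. If $\tilde P_h = 0$, we replace $\tilde P_h$ by $\hat P_h$ from the previous iteration. Let $\hat\beta_0, \ldots, \hat\beta_3$ be the WLS estimates of the coefficients in model (7); then the WLS estimates of the parameters at the $l$th iteration are

$$\hat b_l = \exp(\hat\beta_0), \quad \hat p_l = \exp(\hat\beta_0 + \hat\beta_1), \quad \hat\theta_l = \exp(\hat\beta_2), \quad \text{and} \quad \hat\phi_l = \exp(\hat\beta_3).$$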

We then update the parameters and re-fit the model until the estimates converge. We have generalized the linear model method to populations with heterogeneity in the transmission probabilities (Appendix B).
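Two ingredients of one IRLS iteration can be sketched in a few lines: the pattern weights (arithmetic or geometric mean of the data-based and model-based inverse variances of the log proportion) and the back-transformation of the fitted coefficients to the model parameters. The weighted least squares fit itself is omitted. The function names `irls_weight` and `back_transform` are hypothetical, and the numbers are illustrative only.

```python
from math import exp, log, sqrt


def irls_weight(n_h, p_obs, p_fit, kind="arithmetic"):
    """Weight for one covariate pattern, combining data-based and model-based terms.

    Each term is n * P / (1 - P), the inverse of the approximate variance of
    the log of a binomial proportion P based on n observations.
    """
    w_obs = n_h * p_obs / (1.0 - p_obs)
    w_fit = n_h * p_fit / (1.0 - p_fit)
    if kind == "arithmetic":
        return 0.5 * (w_obs + w_fit)
    if kind == "geometric":
        return sqrt(w_obs * w_fit)
    raise ValueError("kind must be 'arithmetic' or 'geometric'")


def back_transform(beta):
    """Map the coefficients (beta0, ..., beta3) of the log-linear model to parameters."""
    b0, b1, b2, b3 = beta
    return {"b": exp(b0), "p": exp(b0 + b1), "theta": exp(b2), "phi": exp(b3)}


# Illustrative coefficients chosen so that b=0.005, p=0.1, theta=0.4, phi=0.7.
params = back_transform((log(0.005), log(0.1 / 0.005), log(0.4), log(0.7)))
print(params)
print(irls_weight(100, 0.20, 0.25), irls_weight(100, 0.20, 0.25, kind="geometric"))
```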

At each iteration, the variances of $\hat b_l$, $\hat p_l$, $\hat\theta_l$ and $\hat\phi_l$ estimated from the linear model have been averaged over the conditional distribution of $Z$. With the loss of randomness in $Z$, the final estimates will under-estimate the true variances. Since $\mathrm{VAR}(\hat\lambda) = E\bigl(\mathrm{VAR}(\hat\lambda \mid Z)\bigr) + \mathrm{VAR}\bigl(E(\hat\lambda \mid Z)\bigr)$, similar to Louis' method for the ML method, one can employ the following adjustment procedure to approximate $\mathrm{VAR}\{\hat\lambda\}$:

  • Sample $Z$ from $\Pr(Z \mid \tilde t, \hat\lambda)$, where $\hat\lambda$ denotes the final parameter estimates.

  • Use the sampled $Z$ as the weights to fit model (7) and obtain new point estimates of the parameters and their variances.

  • Repeat the previous steps a sufficient number of times. The sample average of the newly estimated variances approximates $E\bigl(\mathrm{VAR}(\hat\lambda \mid Z)\bigr)$, and the sample variance of the newly estimated parameters approximates $\mathrm{VAR}\bigl(E(\hat\lambda \mid Z)\bigr)$.
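The final combination step of the adjustment procedure above can be sketched as follows, assuming the repeated sampling and refitting have already produced one point estimate and one model-based variance per replicate for a given parameter. The function name `adjusted_variance` and the replicate values are hypothetical.

```python
from statistics import mean, variance


def adjusted_variance(estimates, variances):
    """Total variance as within-replicate average plus between-replicate variance.

    estimates : point estimates of one parameter, one per sampled Z
    variances : model-based variance estimates, one per sampled Z
    """
    within = mean(variances)      # approximates the expected conditional variance
    between = variance(estimates) # sample variance of the conditional estimates
    return within + between


# Hypothetical replicate results for a single parameter.
total = adjusted_variance([0.40, 0.42, 0.38], [0.004, 0.005, 0.003])
print(total)
```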

3 Simulation Study

To compare the ML and IRLS methods using the augmented data with the ML method using only the observed data, we conducted simulations under two scenarios: with a large number of cases and with sporadic cases. A pseudo-community of 1000 people living in households of size two or larger was generated according to the distributions of age and household size from the US Census 2000. The distribution of the simulated household sizes is {2 : 67%, 3 : 13%, 4 : 10%, 5 : 7%, 6 : 2%, 7 : 1%}. Simulated epidemics were stopped on day 100, the typical length of the influenza season for a community. The empirical latent and infectious period distributions, from which $f(t \mid \tilde t_i)$ and $g(\tilde t_i \mid t)$ were derived, were obtained from Elveback, Fox and Ackerman (1976) and are given in Table 1. Our simulations were implemented with individual-level randomization of treatments, where individuals, including index cases, in the same household may receive different treatments. In the Newton-Raphson procedure for likelihood maximization, we apply the complementary log-log transformation to $b$ and $p$ and the log transformation to $\theta$ and $\phi$ to help improve convergence. One thousand stochastic replications were carried out for each scenario investigated.
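The parameter transformations mentioned for the Newton-Raphson procedure can be sketched as below: the complementary log-log transformation maps a probability in (0, 1) to the real line, and the log transformation does the same for a positive efficacy parameter. The function and dictionary key names are hypothetical.

```python
from math import exp, log


def cloglog(prob):
    """Complementary log-log transformation of a probability in (0, 1)."""
    return log(-log(1.0 - prob))


def inv_cloglog(x):
    """Inverse of the complementary log-log transformation."""
    return 1.0 - exp(-exp(x))


def to_unconstrained(params):
    """Map (b, p, theta, phi) to an unconstrained scale for iterative maximization."""
    return {
        "b": cloglog(params["b"]),
        "p": cloglog(params["p"]),
        "theta": log(params["theta"]),
        "phi": log(params["phi"]),
    }


def from_unconstrained(u):
    """Map unconstrained values back to probabilities and efficacy parameters."""
    return {
        "b": inv_cloglog(u["b"]),
        "p": inv_cloglog(u["p"]),
        "theta": exp(u["theta"]),
        "phi": exp(u["phi"]),
    }


start = {"b": 0.005, "p": 0.1, "theta": 0.4, "phi": 0.7}
round_trip = from_unconstrained(to_unconstrained(start))
print(round_trip)
```

Working on the transformed scale keeps every iterate inside the valid parameter range, which is one plausible reason the transformation helps convergence.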

Table 1.

Empirical cumulative distributions of the latent period and the infectious period for influenza (Elveback et al., 1976).

| Latent period: duration (days) | Latent period: cumulative probability | Infectious period: duration (days) | Infectious period: cumulative probability |
|---|---|---|---|
| 0 | 0 | ≤ 2 | 0 |
| 1 | 0.2 | 3 | 0.3 |
| 2 | 0.8 | 4 | 0.7 |
| 3 | 1.0 | 5 | 0.9 |
|   |     | 6 | 1.0 |
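The way the two known natural-history functions might be derived from the cumulative distributions in Table 1 can be sketched as follows. The day-indexing convention is an assumption made here for illustration, not a statement of the authors' convention: infectiousness is assumed to start on the symptom onset day, and a person is counted as still infectious a given number of days after onset if the infectious period lasts longer than that number of days. All function names are hypothetical.

```python
# Cumulative probabilities transcribed from Table 1.
LATENT_CDF = {0: 0.0, 1: 0.2, 2: 0.8, 3: 1.0}
INFECTIOUS_CDF = {2: 0.0, 3: 0.3, 4: 0.7, 5: 0.9, 6: 1.0}


def latent_pmf(cdf=LATENT_CDF):
    """Probability of each latent period duration, from successive CDF differences."""
    days = sorted(cdf)
    return {d: round(cdf[d] - cdf[prev], 10) for prev, d in zip(days, days[1:])}


def onset_given_infection(onset_day, infection_day, pmf):
    """Probability of symptom onset on onset_day given infection on infection_day."""
    return pmf.get(onset_day - infection_day, 0.0)


def still_infectious(day, onset_day, cdf=INFECTIOUS_CDF):
    """Probability of being infectious on `day` given onset on `onset_day`.

    Assumed convention: infectious from the onset day, and infectious k days
    after onset if the infectious period lasts longer than k days.
    """
    elapsed = day - onset_day
    if elapsed < 0:
        return 0.0
    if elapsed <= min(cdf):
        return 1.0
    if elapsed >= max(cdf):
        return 0.0
    return round(1.0 - cdf[elapsed], 10)


pmf = latent_pmf()
print(pmf)
print([still_infectious(day, 10) for day in range(9, 18)])
```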

We first set the values of the parameters to $b = 0.005$, $p = 0.1$, $\theta = 0.4$, $\phi = 0.7$. Under this setting, on average 69% of the households and 51% of the contacts were attacked in simulated epidemics, and 20% of the contacts were infected when receiving treatment. The three iterative procedures were initiated from the true values of the parameters and, with adequate numbers of events, converged most of the time. By convergence we mean that the estimates of all four parameters converge to reasonable values. Specifically, estimates of $b$ and $p$ in $(10^{-10}, 1)$ and estimates of $\theta$ and $\phi$ in $(10^{-10}, 10)$ are considered reasonable. Given convergence, the MLEs obtained from only the observed data are exactly the same as those obtained from the augmented data, and the estimates of the SDs are also similar. Therefore, we present only the MLEs obtained from the augmented data. Table 2 shows mean parameter estimates, Monte Carlo standard deviations (SD of point estimates), mean model-estimated SDs and coverage rates of 95% confidence intervals (CI) based on model-estimated SDs for the two approaches using the augmented data. The IRLS method yielded about the same estimates of the parameters and SDs as the MLEs. The small differences between the IRLS estimates and the MLEs for $b$, SD($\hat b$) and SD($\hat\theta$) decrease as the sample size increases (not shown).

Table 2.

Comparison between MLEs and IRLS estimates based on the augmented data. Results are based on 1000 simulations.

| Parameter | Mean of point estimates: MLE | Mean of point estimates: IRLS | Monte Carlo SD: MLE | Monte Carlo SD: IRLS | Mean of SD estimates: MLE | Mean of SD estimates: IRLS | Coverage of 95% CI (%): MLE | Coverage of 95% CI (%): IRLS |
|---|---|---|---|---|---|---|---|---|
| b | 0.0051 | 0.0051 | 0.00028 | 0.00028 | 0.00028 | 0.00027 | 95.5 | 95.2 |
| p | 0.10 | 0.10 | 0.011 | 0.011 | 0.011 | 0.011 | 94.9 | 96.1 |
| θ | 0.40 | 0.40 | 0.067 | 0.067 | 0.067 | 0.069 | 95.4 | 95.7 |
| ϕ | 0.71 | 0.71 | 0.13 | 0.13 | 0.13 | 0.13 | 95.4 | 94.7 |

True parameters are set to b=0.005, p=0.1, θ=0.40, ϕ=0.70.

MLEs are the same for observed and augmented data.

To compare the sensitivity of the three methods to starting parameter values when data are sparse, we reduced the true values of b from 0.005 to 0.002 and p from 0.1 to 0.01 so as to reduce transmissions within households. Under this setting, the average attack rates decreased to 39% for households and to 12% for contacts, and only 10% of the contacts were infected when receiving treatment. We ran simulations under different starting values of b and p, as log(pji(t)) is generally more sensitive to the transmission probabilities than to the efficacies. Simulation results including convergence rates and parameter estimates are compared in Table 3. Clearly, the ML method using only the observed data is highly sensitive to initial values of b and p. The convergence rate of the ML method using only the observed data was comparable to the methods using the augmented data when the iteration started from the true parameters, but dropped dramatically when starting from larger values (b = 0.02, p = 0.1) or smaller values (b = 0.0002, p = 0.001) of the probability parameters. In contrast, the convergence rate was relatively stable for the approaches using the augmented data, regardless of the starting values. Parameter estimates and associated Monte Carlo standard deviations were similar across methods, except that the IRLS method appeared to overestimate θ to a larger extent compared to the ML methods. All methods overestimated ϕ as a consequence of sparse data. In addition, the ML methods overestimated, while the IRLS method underestimated, the standard deviation of ϕ. For example, when starting from true values of b and p, the mean standard errors are 1.10, 1.16 and 0.78 (not shown in Table 3) for the MLE based on the observed data, the MLE based on the augmented data and the IRLS estimate of ϕ respectively, in contrast to Monte Carlo standard deviations 0.95, 0.96 and 0.93.

Table 3.

Comparing sensitivity to initial estimates between the ML method using observed data and the approaches using the augmented data when data are sparse. Results are based on 1000 simulations.

| Initial values (b0, p0) | Method§ | Conv. rate (/1000) | b§§ | p§§ | θ§§ | ϕ§§ |
|---|---|---|---|---|---|---|
| (0.002, 0.01) | ML(Obs) | 903 | 0.0020 (0.00016) | 0.010 (0.0049) | 0.42 (0.25) | 0.98 (0.95) |
| | ML(Aug) | 889 | 0.0020 (0.00016) | 0.010 (0.0048) | 0.42 (0.24) | 1.01 (0.96) |
| | IRLS(Aug) | 937 | 0.0020 (0.00016) | 0.011 (0.0047) | 0.48 (0.24) | 1.07 (0.93) |
| (0.02, 0.1) | ML(Obs) | 524 | 0.0020 (0.00016) | 0.010 (0.0048) | 0.41 (0.24) | 1.19 (1.11) |
| | ML(Aug) | 878 | 0.0020 (0.00016) | 0.010 (0.0049) | 0.42 (0.24) | 0.99 (1.00) |
| | IRLS(Aug) | 920 | 0.0020 (0.00016) | 0.011 (0.0048) | 0.48 (0.24) | 1.07 (1.00) |
| (0.0002, 0.001) | ML(Obs) | 92 | 0.0020 (0.00016) | 0.010 (0.0054) | 0.38 (0.23) | 1.04 (0.79) |
| | ML(Aug) | 864 | 0.0020 (0.00015) | 0.010 (0.0047) | 0.44 (0.26) | 1.03 (1.08) |
| | IRLS(Aug) | 928 | 0.0020 (0.00015) | 0.011 (0.0047) | 0.49 (0.24) | 1.08 (0.90) |

True parameters are set to b=0.002, p=0.01, θ=0.40, ϕ=0.70.

Initial values for θ and ϕ are set to the true values.

§ Obs: observed data, Aug: augmented data.

§§ Values in the parentheses are Monte Carlo standard deviations.

As seen in Table 3, sparse data generally lead to biased and unstable efficacy estimates for the parametric methods, particularly for the IRLS method. At the same time, sparse data also increase the chance of non-convergence for the standard likelihood maximization algorithms. Household-level randomization, in which individuals in the same household receive the same treatments, provides much less information for estimating θ and ϕ separately compared to individual-level randomization with the same population size. More discussion on trial design issues can be found in Donner (1998), Datta, Halloran and Longini (1999), Halloran et al. (2006) and Yang et al. (2006).

4 Data Analysis

Two randomized multi-center efficacy trials of zanamivir, an inhaled influenza antiviral agent, were conducted during October 1998 - April 1999 (Hayden et al., 2000) and June 2000 - April 2001 (Monto et al., 2002). In both trials, households were randomized to zanamivir or placebo but only eligible household members (aged 5+ years) were treated. In the later trial, index cases were not treated. Characteristics of the two trials are given in Table 4.

Table 4.

Two randomized multi-center trials of zanamivir, an influenza antiviral agent

| | Hayden et al., 2000 | Monto et al., 2002 |
|---|---|---|
| Time of trial | Oct. 1998 - Apr. 1999 | Jun. 2000 - Apr. 2001 |
| Households | 336 | 484 |
| Population | 1186 | 1770 |
| Index case randomization | Yes | No |
| Duration of medication: index case | 5 days | N/A |
| Duration of medication: contact | 10 days | 10 days |
| Follow up (symptom diary) | 14 days | 14 days |
| Infected/Symptomatic (index) | 164/336 | 281/484 |
| Infected/Exposed (contacts): control | 52/435 | 76/626 |
| Infected/Exposed (contacts): zanamivir | 17/415 | 27/660 |

Numbers may slightly differ from references due to different criteria of data inclusion for analysis.

Infected: laboratory-confirmed infections with clinical symptoms.

The earlier trial adopted a typical household-level randomization, providing information about $\mathrm{AVE}_T = 1 - \theta\phi$ if we assume multiplicativity between $\theta$ and $\phi$, and the later trial contains information mainly about $\mathrm{AVE}_S$. Neither trial alone provides any information about $\mathrm{AVE}_I$, and thus we combine the two trials to estimate $\mathrm{AVE}_S$ and $\mathrm{AVE}_I$ simultaneously. While transmission probabilities and antiviral efficacies might differ from center to center, the limited sample size prohibits estimation of center-specific parameters. As a result, we assume all the centers in both trials share the same parameters. The two reference papers used slightly different definitions of clinical symptoms. We used the one in Monto et al. (2002) for both trials, i.e., presence of at least two of the following: temperature ≥ 37.8°C or feverishness (counted as one), cough, headache, sore throat and myalgia. As it is well known that influenza is more transmissible among children, we assume age-specific transmission probabilities in two age groups, children (< 18) and adults (≥ 18). Our primary endpoint is laboratory-confirmed influenza with clinical symptoms (clinical infection). Households in both trials were followed from the ascertainment time of index cases, and the resulting selection bias was adjusted for based on Yang et al. (2006) and Appendix C. In this adjustment, index cases were excluded from analyses regardless of laboratory results, but their effects on the exposure level of the contacts were considered.

Results are given in Table 5. For this data set, both ML methods converge and thus give the same MLEs. Prophylaxis with zanamivir showed significant preventive efficacy against clinical infection, with $\widehat{\mathrm{AVE}}_S = 0.75$ (95% C.I. = (0.56, 0.86)). Hence, a susceptible person taking zanamivir has his or her chance of developing influenza illness reduced by 75% per daily exposure to an untreated symptomatic infected person. Zanamivir did not show significant efficacy in reducing the infectiousness of infected people, with $\widehat{\mathrm{AVE}}_I = 0.23$ (95% C.I. = (−1.33, 0.75)). Assuming multiplicativity of $\theta$ and $\phi$, the total efficacy $\mathrm{AVE}_T$ reached 0.81 (95% C.I. = (0.50, 0.93)). Based on final data of clinical influenza illness provided in Hayden et al. (2000) and Monto et al. (2002), similar $\mathrm{AVE}_T$ (0.80; 95% C.I. = (0.53, 0.91)) and $\mathrm{AVE}_S$ (0.84; 95% C.I. = (0.61, 0.90)) were reported by Halloran et al. (2006). They also reported $\mathrm{AVE}_S$ (0.75; 95% C.I. = (0.54, 0.86)), $\mathrm{AVE}_I$ (0.19; 95% C.I. = (−1.60, 0.75)) and $\mathrm{AVE}_T$ (0.87; 95% C.I. = (0.63, 0.95)) based on secondary attack rates (SAR) during days 2-7 after the ascertainment of index cases. These sets of estimates differ in interpretation: ours refer to the transmission probability per daily pairwise contact, whereas those of Halloran et al. (2006) are based on final illness status or on SARs over a fixed follow-up window.
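As a quick arithmetic check of how the reported total efficacy follows from the two component estimates under multiplicativity, the sketch below combines the rounded point estimates quoted in the text. The function name `total_efficacy` is hypothetical, and because the inputs are rounded, agreement is expected only to about two decimal places.

```python
def total_efficacy(ave_s, ave_i):
    """Total efficacy under multiplicativity: one minus the product of (1 - efficacy) terms."""
    theta = 1.0 - ave_s  # relative susceptibility of a treated contact
    phi = 1.0 - ave_i    # relative infectiousness of a treated case
    return 1.0 - theta * phi


# Rounded MLE point estimates quoted in the text.
ave_t = total_efficacy(0.75, 0.23)
print(round(ave_t, 4))
```

With these inputs the product of the two relative effects is 0.25 × 0.77 = 0.1925, so the total efficacy is 0.8075, consistent with the reported value of 0.81.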

Table 5.

Estimates of efficacies and transmission probabilities by age (1-17 vs. 18+) for pooled zanamivir trials conducted in 1998-1999 and 2000-2001. Results are obtained by approaches using the augmented data.

| Parameter | IRLS: point estimate | IRLS: SD | MLE: point estimate | MLE: SD | MLE: 95% CI |
|---|---|---|---|---|---|
| $b_c$ | 0.0024 | 0.00052 | 0.0028 | 0.00063 | (0.0017, 0.0042) |
| $b_a$ | 0.00086 | 0.00030 | 0.0010 | 0.00039 | (0.00045, 0.0021) |
| $p_{cc}$ | 0.040 | 0.0074 | 0.040 | 0.0077 | (0.027, 0.057) |
| $p_{ca}$ | 0.028 | 0.0045 | 0.029 | 0.0048 | (0.021, 0.040) |
| $p_{ac}$ | 0.023 | 0.0071 | 0.020 | 0.0071 | (0.009, 0.037) |
| $p_{aa}$ | 0.040 | 0.011 | 0.032 | 0.011 | (0.016, 0.058) |
| $\mathrm{AVE}_S$ | 0.68 | 0.086 | 0.75 | 0.072 | (0.56, 0.86) |
| $\mathrm{AVE}_I$ | 0.24 | 0.38 | 0.23 | 0.44 | (−1.33, 0.75) |
| $\mathrm{AVE}_T$ | | | 0.81 | 0.094 | (0.50, 0.93) |

Subscript c denotes child (1-17), a denotes adult (18+), and ca denotes child-to-adult transmission.

The estimated probability of infection from the common source per daily exposure is 0.0028 for children and 0.0010 for adults. Within households, the daily pairwise transmission probability is also higher in children ($\hat p_{cc} = 0.040$) than in adults ($\hat p_{aa} = 0.032$). These estimates of transmission probabilities are comparable to those found in two trials of oseltamivir, another influenza antiviral agent, conducted at about the same time in North America and Europe (Yang et al., 2006).

The IRLS estimates are fairly close to the MLEs except for $p_{aa}$ and $\theta$. In addition, the IRLS method might have under-estimated the SD for $\phi$. The two trials combined still do not provide sufficient information for estimating $\phi$, as suggested by the large SD for the MLE of $\phi$. Starting estimates for all three methods were provided by a non-iteratively evaluated linear model (Appendix D). With a complementary log-log transformation for probability parameters and a log transformation for efficacy parameters, all three methods converge very well. Without such transformations, the Newton-Raphson procedure applied to the observed data converges if started from the IRLS estimates or the MLEs obtained via data augmentation, but not if started from the non-iteratively obtained estimates, which confirms the relative robustness of the methods using data augmentation to starting estimates.

5 Discussion

By augmenting the observed sequential symptom onsets in close contact groups with unobserved daily pairwise transmission outcomes, we identified a likelihood that has a simpler form than the one based solely on observed data and that can be maximized via the EM algorithm. Reilly and Lawlor (1999) used a similar approach to study hepatitis C infection in women with known exposure to anti-D immunoglobulin in sequential years before testing. However, the presence of multiple infective sources in the same time interval and the involvement of latent and infectious periods of influenza make our situation more complex. This simple form of the likelihood offers the flexibility of using other potential methods, for instance, the Fisher-scoring method instead of the Newton-Raphson algorithm for iterative maximization. As another example, we derived from this likelihood a linear model fitted with the IRLS method in combination with an EM-analogous algorithm. In a simulation study, the two approaches using the augmented data performed better than the ML method using the observed data in terms of robustness to initial estimates, especially for sparse data. The IRLS method is the most robust to initial estimates, and asymptotically provides estimates of the same quality as the MLEs. The IRLS estimates are likely biased and have larger variances when data are sparse, but can serve as good initial estimates for the ML methods.

We have assumed known distributions for the latent and infectious periods and the coincidence between the latent and the incubation periods, which may not be realistic for some infectious diseases. If these assumptions do not hold, estimates could be biased and misleading. Cauchemez et al. (2004) used a Bayesian hierarchical model to allow estimation of the latent and infectious periods, assuming that the latent and the incubation periods were equal, but such estimation requires a sufficient number of cases. In addition, our models are limited to symptomatic infections. However, asymptomatic influenza infections can provide further information about the efficacies and transmission probabilities from a virological point of view, although such "silent" cases complicate the likelihood to a large extent. A future research topic of potential public health interest would be to extend our data augmentation scheme to a Bayesian framework that can estimate the natural history of the disease and take into account asymptomatic cases.

In the data analysis, index cases were excluded regardless of their laboratory test results. According to the rationale of adjustment for selection bias, i.e., conditioning on the symptom status (caused by true infection) of the index case on the ascertainment day, a test-negative index case should be viewed as a susceptible and followed in the same way as contacts. However, not all clinical trials required symptom diaries for index cases after enrollment, e.g., the 2000-2001 trial of zanamivir. Households with test-negative index cases are generally excluded from calculations of SARs, but in our case the inclusion of the contacts in these households can improve estimation of $b$ and $\theta$ and, to a lesser extent, of $p$. This issue could be resolved by improving the follow-up of index cases.

In this paper we have assumed fixed antiviral effects and non-random susceptibility. If sufficient data are available, random effects on the transmission probabilities as well as the antiviral efficacies could be considered to address potential heterogeneity among centers, households, or individuals (Longini and Halloran, 1996; Halloran, Préziosi and Chu, 2003).

With the potential for pandemic influenza a rising global concern, zanamivir is one of the major available influenza antiviral agents (Hayden, 2001). Our estimates can be used in modeling research to evaluate the effects of intervention options at different levels of contact groups (Longini et al., 2004; Longini et al., 2005; Germann et al., 2006). This research also emphasizes the need for proper study design for the parameters to be adequately estimated.

Acknowledgements

This work was partially supported by National Institute of Allergy and Infectious Diseases grant R01-AI32042. The data on the clinical trials of zanamivir were provided by GlaxoSmithKline Laboratories Inc.

Appendix A: Conditional Expected Frequency of Transmission Status

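Define $\bar{I}_{ji}(t)$ as the event that a susceptible person $i$ escapes infection from infective source $j$ on day $t$. Note that the following basic facts hold:

  • $I_i(t) \cap I_{ji}(t) = I_{ji}(t)$.

  • $\Pr(I_{ji}(t) \mid I_i(t), S_i(\tilde{t}_i)) = \Pr(I_{ji}(t) \mid I_i(t))$.

  • $\bar{I}_{ji}(t) \cap I_i(\tau) = I_i(\tau)$ for $\tau > t$.

  • $\Pr(\bar{I}_{ji}(t) \cap I_i(\tau)) = 0$ for $\tau < t$.

Then,

$$
\begin{aligned}
\Pr(Z_{ji}(t)=1 \mid b,p,\theta,\phi,\tilde{t}_i)
&= \Pr(I_{ji}(t) \mid S_i(\tilde{t}_i)) \\
&= \Pr(I_i(t) \cap I_{ji}(t) \mid S_i(\tilde{t}_i)) \\
&= \Pr(I_{ji}(t) \mid I_i(t), S_i(\tilde{t}_i)) \times \Pr(I_i(t) \mid S_i(\tilde{t}_i)) \\
&= \Pr(I_{ji}(t) \mid I_i(t)) \times \frac{\Pr(S_i(\tilde{t}_i) \mid I_i(t)) \times \Pr(I_i(t))}{\Pr(S_i(\tilde{t}_i))} \\
&= \frac{\Pr(I_{ji}(t) \cap I_i(t))}{\Pr(I_i(t))} \times \frac{\Pr(S_i(\tilde{t}_i) \mid I_i(t)) \times \Pr(I_i(t))}{\Pr(S_i(\tilde{t}_i))} \\
&= \frac{\Pr(I_{ji}(t))}{\Pr(S_i(\tilde{t}_i))} \times \Pr(S_i(\tilde{t}_i) \mid I_i(t))
\end{aligned}
\tag{8}
$$

and

$$
\begin{aligned}
\Pr(\bar{Z}_{ji}(t)=1 \mid b,p,\theta,\phi,\tilde{t}_i)
&= \Pr(\bar{I}_{ji}(t) \mid S_i(\tilde{t}_i)) \\
&= \sum_{\tau=t}^{\bar{t}_i} \Pr(\bar{I}_{ji}(t) \cap I_i(\tau) \mid S_i(\tilde{t}_i)) \\
&= \sum_{\tau=t}^{\bar{t}_i} \frac{\Pr(S_i(\tilde{t}_i) \mid \bar{I}_{ji}(t) \cap I_i(\tau)) \times \Pr(\bar{I}_{ji}(t) \cap I_i(\tau))}{\Pr(S_i(\tilde{t}_i))} \\
&= \frac{\Pr(S_i(\tilde{t}_i) \mid \bar{I}_{ji}(t) \cap I_i(t)) \times \Pr(\bar{I}_{ji}(t) \cap I_i(t))}{\Pr(S_i(\tilde{t}_i))}
 + \sum_{\tau=t+1}^{\bar{t}_i} \frac{\Pr(S_i(\tilde{t}_i) \mid I_i(\tau)) \times \Pr(I_i(\tau))}{\Pr(S_i(\tilde{t}_i))} \\
&= \frac{\Pr(S_i(\tilde{t}_i) \mid I_i(t)) \times \{\Pr(I_i(t)) - \Pr(I_{ji}(t))\}}{\Pr(S_i(\tilde{t}_i))}
 + \sum_{\tau=t+1}^{\bar{t}_i} \frac{\Pr(S_i(\tilde{t}_i) \mid I_i(\tau)) \times \Pr(I_i(\tau))}{\Pr(S_i(\tilde{t}_i))}.
\end{aligned}
\tag{9}
$$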
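As a rough illustration of how (8) and (9) might be evaluated within an E-step, the following Python sketch computes both conditional probabilities for a single susceptible-infective pair. The function name, the list layout (one entry per candidate infection day of person $i$), and the use of the law of total probability to obtain $\Pr(S_i(\tilde{t}_i))$ are assumptions made for illustration only; this is not the authors' implementation.

```python
def conditional_transmission_probs(pr_Iji, pr_Ii, pr_S_given_Ii):
    """Evaluate equations (8) and (9) on every candidate infection day.

    The three inputs are equal-length lists indexed by the candidate
    infection days of person i, ordered from earliest to latest:
      pr_Iji[t]        : Pr(I_ji(t)), source j infects i on day t
      pr_Ii[t]         : Pr(I_i(t)), i is infected on day t by any source
      pr_S_given_Ii[t] : Pr(S_i | I_i(t)), onset on the observed day
    Returns (z, zbar) with z[t] = Pr(Z_ji(t)=1 | data) and
    zbar[t] = Pr(Zbar_ji(t)=1 | data).
    """
    n = len(pr_Ii)
    # Assumption: a symptomatic person was infected on exactly one
    # candidate day, so Pr(S_i) is a sum over those days.
    pr_S = sum(s * q for s, q in zip(pr_S_given_Ii, pr_Ii))
    z, zbar = [], []
    for t in range(n):
        # Equation (8): infected by j on day t.
        z.append(pr_Iji[t] * pr_S_given_Ii[t] / pr_S)
        # Equation (9): escaped j on day t, and was infected either on
        # day t by another source or on a later day.
        same_day = pr_S_given_Ii[t] * (pr_Ii[t] - pr_Iji[t])
        later = sum(pr_S_given_Ii[u] * pr_Ii[u] for u in range(t + 1, n))
        zbar.append((same_day + later) / pr_S)
    return z, zbar
```

On the first candidate day the two probabilities returned by this sketch sum to one, since infection before that day is impossible; on later days they sum to the conditional probability that infection occurred on or after day $t$.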

Appendix B: Generalization of the Linear Model to Heterogeneous Populations

For a heterogeneous population composed of $k$ risk categories of people (e.g., age groups), let $p_{vu}$ be the pairwise transmission probability per unprotected contact between a susceptible individual in category $u$ and an infective person in category $v$. Further, let $b_u$ be the probability of infection from the common source for category $u$. Assume that the $\mathrm{AVE}_S$ and the $\mathrm{AVE}_I$ are the same for all categories for notational simplicity. The models can be easily generalized to situations with heterogeneous efficacies as well. There are $k$ parameters for common source transmission probabilities and $k^2$ parameters for household transmission probabilities.

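Let group $k$ be the reference stratum. The model in matrix form derived from (7) would be

$$
\log(p_{ji}(t)) = \beta^{(b)\tau} I_i + J_i^{\tau} \beta^{(p)} I_i + \beta^{(\theta)} r_i(t) + \beta^{(\phi)} r_j(t) + \log(f_j(t \mid \tilde{t}_j)), \tag{10}
$$

where the superscript $\tau$ denotes transposition, $\beta^{(\theta)} = \log(\theta)$, $\beta^{(\phi)} = \log(\phi)$, and

$$
\begin{aligned}
I_i &= (I_{\{i \in 1\}}, \ldots, I_{\{i \in k-1\}}, 1)^{\tau}, \\
J_i &= (\psi_j I_{\{j \in 1\}}, \ldots, \psi_j I_{\{j \in k-1\}}, 1)^{\tau}, \\
\beta^{(b)} &= (\beta^{(b)}_1, \ldots, \beta^{(b)}_k)^{\tau}
 = \left( \log\frac{b_1}{b_k}, \ldots, \log\frac{b_{k-1}}{b_k}, \log(b_k) \right)^{\tau}, \\
\beta^{(p)} &= \{\beta^{(p)}_{vu}\}_{k \times k} =
\begin{pmatrix}
\log\dfrac{p_{11}\,p_{kk}}{p_{1k}\,p_{k1}} & \cdots & \log\dfrac{p_{1(k-1)}\,p_{kk}}{p_{1k}\,p_{k(k-1)}} & \log\dfrac{p_{1k}}{p_{kk}} \\
\vdots & \ddots & \vdots & \vdots \\
\log\dfrac{p_{(k-1)1}\,p_{kk}}{p_{(k-1)k}\,p_{k1}} & \cdots & \log\dfrac{p_{(k-1)(k-1)}\,p_{kk}}{p_{(k-1)k}\,p_{k(k-1)}} & \log\dfrac{p_{(k-1)k}}{p_{kk}} \\
\log\dfrac{p_{k1}\,b_k}{p_{kk}\,b_1} & \cdots & \log\dfrac{p_{k(k-1)}\,b_k}{p_{kk}\,b_{k-1}} & \log\dfrac{p_{kk}}{b_k}
\end{pmatrix}.
\end{aligned}
$$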
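The reference-stratum coding above can be checked numerically. The sketch below, a minimal illustration rather than the authors' code, builds $\beta^{(b)}$ and $\beta^{(p)}$ from hypothetical values of $b_u$ and $p_{vu}$ and confirms that the first two terms of (10) recover $\log(p_{vu})$ for a household infective. It assumes $\psi_j = 1$ and omits the treatment and infectiousness terms (i.e., $r_i(t) = r_j(t) = 0$ and $f_j = 1$); all function names are invented for this example.

```python
import math

def beta_b(b):
    """Common-source coefficients with category k as the reference."""
    k = len(b)
    return [math.log(b[u] / b[k - 1]) for u in range(k - 1)] + [math.log(b[k - 1])]

def beta_p(p, b):
    """k x k coefficient matrix; p[v][u] is the transmission probability
    from an infective in category v to a susceptible in category u."""
    k = len(b)
    last = k - 1
    B = [[0.0] * k for _ in range(k)]
    for v in range(k):
        for u in range(k):
            if v < last and u < last:
                B[v][u] = math.log(p[v][u] * p[last][last] / (p[v][last] * p[last][u]))
            elif v < last:            # last column
                B[v][u] = math.log(p[v][last] / p[last][last])
            elif u < last:            # last row
                B[v][u] = math.log(p[last][u] * b[last] / (p[last][last] * b[u]))
            else:                     # bottom-right corner
                B[v][u] = math.log(p[last][last] / b[last])
    return B

def indicator(category, k):
    """(I{category = 1}, ..., I{category = k-1}, 1) with 0-based categories."""
    return [1.0 if category == c else 0.0 for c in range(k - 1)] + [1.0]

def log_p_household(v, u, p, b):
    """First two terms of (10) for a household infective (psi_j = 1)."""
    k = len(b)
    I_i, J = indicator(u, k), indicator(v, k)
    bb, Bp = beta_b(b), beta_p(p, b)
    term_b = sum(x * y for x, y in zip(bb, I_i))
    term_p = sum(J[a] * Bp[a][c] * I_i[c] for a in range(k) for c in range(k))
    return term_b + term_p
```

For any pair of categories, `log_p_household(v, u, p, b)` should agree with `math.log(p[v][u])` up to rounding, which is a convenient sanity check before fitting.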

Appendix C: Adjustment for Selection Bias in Case-ascertained Follow-up Design

In a prospective follow-up design, exposure to risks of infection starts on day 1. However, in real clinical trials, households are generally enrolled when one or more index cases are identified by symptom onsets, to which we refer as a case-ascertained design. To reduce bias caused by such selective enrollment, Yang et al. (2006) suggest that the individual likelihood contributions be conditioned on observed symptom status up to the symptom onset day of the index case. The consequences of such adjustment are the following:

  • Index cases do not contribute to the likelihood.

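  • The likelihood calculation for person $i$ starts from day $\underline{t}_{d_i}+1$, where $d_i$ denotes the index case in the household of person $i$.

  • The term $\log(A_i)$ is subtracted from the individual log-likelihood, where
    $$A_i = \sum_{t=\underline{t}_{d_i}+1}^{\bar{t}_{d_i}} \left\{ \left( \prod_{\tau=\underline{t}_{d_i}+1}^{t-1} e_i(\tau) \right) (1 - e_i(t)) \Pr(\tilde{t} > \tilde{t}_{d_i} \mid t) \right\} + \prod_{t=\underline{t}_{d_i}+1}^{\bar{t}_{d_i}} e_i(t). \tag{11}$$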
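The adjustment term in (11) can be accumulated in a single pass over the days in the summation window. The Python sketch below is a minimal illustration under the assumption, suggested by the structure of (11), that $e_i(t)$ is the probability that person $i$ escapes infection on day $t$; the function and argument names are hypothetical.

```python
def selection_adjustment(escape, late_onset):
    """Compute A_i of equation (11).

    escape[n]     : e_i(t) for the n-th day of the summation window
    late_onset[n] : Pr(onset later than the index onset | infected that day)
    """
    a = 0.0
    survive = 1.0  # running product of e_i(tau) over earlier days
    for e, q in zip(escape, late_onset):
        # Infected on this day, with symptom onset after the index case.
        a += survive * (1.0 - e) * q
        survive *= e
    # Escaped infection on every day of the window.
    return a + survive
```

If every infection in the window is certain to produce onset after the index case, the function returns 1 and the adjustment vanishes; for example, `selection_adjustment([0.9, 0.8], [0.5, 1.0])` gives 0.95 (up to rounding).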

For the ML method using the augmented data, the same adjustment can be applied. For the linear model method, such a conditional adjustment is difficult. However, since minimizing the weighted least squares is analogous to maximizing the log-likelihood, it is natural to use the same adjusting term to penalize the objective function

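$$\sum_{h=1}^{H} \omega_h \left\{ -\log(\tilde{P}_h) + \log(P_h) \right\}^2 + \sum_i \log(A_i(\beta)),$$

where $A_i$ is re-expressed as a function of $\beta = (\beta_0, \ldots, \beta_3)$. Denote the covariate matrix by $X$, the diagonal weight matrix by $W$ and the observed response vector by $\log(\tilde{P})$; then at the $l$th iteration,

$$\hat{\beta}^{(l)} = \left( X' W^{(l-1)} X \right)^{-1} \left\{ X' W^{(l-1)} \log(\tilde{P}) - \frac{1}{2} \sum_i \frac{d \log(A_i(\hat{\beta}^{(l-1)}))}{d \hat{\beta}^{(l-1)}} \right\}.$$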
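A single update of this penalized weighted least squares iteration can be sketched as follows. The code is a self-contained illustration in plain Python, not the authors' implementation: the gradient of $\sum_i \log(A_i)$ at the previous estimate is assumed to be supplied by the caller, and the helper `solve` is an ordinary Gaussian elimination routine included only to keep the example free of external libraries.

```python
def solve(A, y):
    """Solve the linear system A x = y by Gaussian elimination with pivoting."""
    n = len(y)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def penalized_wls_step(X, w, y, grad_log_A):
    """One update: beta = (X'WX)^{-1} {X'W y - 0.5 * grad_log_A}.

    X          : covariate matrix as a list of rows
    w          : diagonal of the weight matrix from the previous iteration
    y          : observed responses log(P~)
    grad_log_A : gradient of sum_i log(A_i) at the previous estimate
                 (assumed to be computed elsewhere)
    """
    n, p = len(X), len(X[0])
    XtWX = [[sum(w[h] * X[h][a] * X[h][c] for h in range(n)) for c in range(p)]
            for a in range(p)]
    XtWy = [sum(w[h] * X[h][a] * y[h] for h in range(n)) for a in range(p)]
    rhs = [XtWy[a] - 0.5 * grad_log_A[a] for a in range(p)]
    return solve(XtWX, rhs)
```

With a zero gradient the step reduces to ordinary weighted least squares, so the penalty only shifts the right-hand side of the normal equations.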

Appendix D: Non-iteratively Fitted Linear Model for Initial Estimates

The ML and IRLS methods require initial estimates to start the iteration. Iteration could be avoided if we model $I_i(t)$ instead of $I_{ji}(t)$, i.e., infection status of person $i$ on day $t$ instead of pairwise transmission, and assume equal $\Pr(I_i(t))$ for all $\underline{t}_i \le t \le \bar{t}_i$.

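Let $N_i(t)$ be the number of treated infective individuals and $M_i(t)$ be the number of untreated infective individuals that a susceptible person $i$ is exposed to within the household on day $t$. Given $N_i(t)$ and $M_i(t)$, the probability that person $i$ is infected on day $t$ is given by

$$
p_i(t) = 1 - (1-b)^{1-r_i(t)} (1-\theta b)^{r_i(t)} \times (1-p)^{M_i(t)(1-r_i(t))} (1-\theta p)^{M_i(t) r_i(t)} (1-\phi p)^{N_i(t)(1-r_i(t))} (1-\theta\phi p)^{N_i(t) r_i(t)}.
$$

A reparameterization leads to

$$
\log(1 - p_i(t)) = \beta_0 + \beta_1 r_i(t) + \beta_2 M_i(t) + \beta_3 r_i(t) M_i(t) + \beta_4 N_i(t) + \beta_5 r_i(t) N_i(t), \tag{12}
$$

where

$$
\beta_0 = \log(1-b), \quad
\beta_1 = \log\frac{1-\theta b}{1-b}, \quad
\beta_2 = \log(1-p), \quad
\beta_3 = \log\frac{1-\theta p}{1-p}, \quad
\beta_4 = \log(1-\phi p), \quad \text{and} \quad
\beta_5 = \log\frac{1-\theta\phi p}{1-\phi p}.
$$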
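The equivalence between the product form of $p_i(t)$ and the linear form (12) can be verified numerically. The following Python sketch, offered only as an illustration with hypothetical function names and parameter values, evaluates both expressions.

```python
import math

def infection_prob(b, p, theta, phi, r, M, N):
    """p_i(t) in product form; r is the treatment indicator of person i,
    M and N the numbers of untreated and treated infective housemates."""
    escape = ((1 - b) ** (1 - r) * (1 - theta * b) ** r
              * (1 - p) ** (M * (1 - r)) * (1 - theta * p) ** (M * r)
              * (1 - phi * p) ** (N * (1 - r)) * (1 - theta * phi * p) ** (N * r))
    return 1 - escape

def betas(b, p, theta, phi):
    """Coefficients beta_0, ..., beta_5 of model (12)."""
    return [math.log(1 - b),
            math.log((1 - theta * b) / (1 - b)),
            math.log(1 - p),
            math.log((1 - theta * p) / (1 - p)),
            math.log(1 - phi * p),
            math.log((1 - theta * phi * p) / (1 - phi * p))]

def log_escape_linear(beta, r, M, N):
    """Right-hand side of (12)."""
    return (beta[0] + beta[1] * r + beta[2] * M + beta[3] * r * M
            + beta[4] * N + beta[5] * r * N)
```

For any admissible parameter values, `math.log(1 - infection_prob(...))` and `log_escape_linear(betas(...), ...)` should agree up to rounding error.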

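Let $Y_i(t)$ indicate the infection status (1: infection, 0: escape) for person $i$ on day $t$. Similar to Section 2.1, define

$$Z_i(t) = Y_i(t) \prod_{\tau < t} (1 - Y_i(\tau))$$

and

$$\bar{Z}_i(t) = \prod_{\tau \le t} (1 - Y_i(\tau)).$$

$Z_i(t) = 1$ is the event that person $i$ escapes infection from any source before day $t$ and is infected on day $t$, while $\bar{Z}_i(t) = 1$ is the event that person $i$ escapes infection from any source up to and including day $t$. Assume that $\Pr(I_i(t))$ is equal for all $t \in [\underline{t}_i, \bar{t}_i]$. Then the conditional probabilities

$$
\Pr(Z_i(t)=1 \mid b,p,\theta,\phi,\tilde{t}_i) = \Pr(S_i(\tilde{t}_i) \mid I_i(t)), \qquad
\Pr(\bar{Z}_i(t)=1 \mid b,p,\theta,\phi,\tilde{t}_i) = \sum_{\tau=t+1}^{\bar{t}_i} \Pr(S_i(\tilde{t}_i) \mid I_i(\tau)),
$$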

do not involve unknown parameters, and can be used as the weights for fitting (12). While $N_i(t)$ and $M_i(t)$ are generally unknown, they can be obtained by randomly sampling the duration of infectious period for each infective individual according to the known empirical distribution $f$. Alternatively, all possible combinations of $N_i(t)$ and $M_i(t)$ can contribute to model (12) with the weights multiplied by the joint probability $\Pr(N_i(t), M_i(t))$ derived from $f$.
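To make the weights concrete, the sketch below computes them for one symptomatic person. It rests on an assumption not spelled out in this appendix: that $\Pr(S_i(\tilde{t}_i) \mid I_i(t))$ equals the probability that the incubation period lasts $\tilde{t}_i - t$ days under a known incubation distribution. The function name and the distribution used in the usage note are hypothetical.

```python
def initial_fit_weights(onset_day, incubation_pmf):
    """Weights for the infection and escape records of one person.

    incubation_pmf maps an incubation duration in days to its probability
    (assumed known). Returns {day: (w_infected, w_escaped)}, where
    w_infected plays the role of Pr(Z_i(t)=1 | data) and w_escaped the
    role of Pr(Zbar_i(t)=1 | data) on each candidate infection day.
    """
    durations = sorted(incubation_pmf)
    first_day = onset_day - durations[-1]   # earliest candidate infection day
    last_day = onset_day - durations[0]     # latest candidate infection day
    weights = {}
    for t in range(first_day, last_day + 1):
        w_infected = incubation_pmf.get(onset_day - t, 0.0)
        # Escaped through day t, hence infected on some later candidate day.
        w_escaped = sum(incubation_pmf.get(onset_day - u, 0.0)
                        for u in range(t + 1, last_day + 1))
        weights[t] = (w_infected, w_escaped)
    return weights
```

For instance, with a made-up incubation distribution `{1: 0.2, 2: 0.5, 3: 0.3}` and onset on day 10, the candidate infection days are 7, 8 and 9, and day 7 receives weights of about 0.3 (infected) and 0.7 (escaped).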

Model (12) gives rise to multiple estimators for the efficacy parameters because of the increase in parameter dimension:

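$$
\hat{\theta}_1 = \frac{1 - \exp(\hat{\beta}_0 + \hat{\beta}_1)}{1 - \exp(\hat{\beta}_0)}, \quad
\hat{\theta}_2 = \frac{1 - \exp(\hat{\beta}_2 + \hat{\beta}_3)}{1 - \exp(\hat{\beta}_2)} \quad \text{and} \quad
\hat{\theta}_3 = \frac{1 - \exp(\hat{\beta}_4 + \hat{\beta}_5)}{1 - \exp(\hat{\beta}_4)}
$$

for $\theta$ and

$$
\hat{\phi}_1 = \frac{1 - \exp(\hat{\beta}_4)}{1 - \exp(\hat{\beta}_2)} \quad \text{and} \quad
\hat{\phi}_2 = \frac{1 - \exp(\hat{\beta}_4 + \hat{\beta}_5)}{1 - \exp(\hat{\beta}_2 + \hat{\beta}_3)}
$$

for $\phi$. The average of the multiple estimates weighted by reciprocal standard errors can serve as the initial estimate, e.g., $\hat{\theta} = \sum_{i=1}^{3} \omega_i \hat{\theta}_i$, where $\omega_i = \dfrac{1/\mathrm{s.e.}(\hat{\theta}_i)}{\sum_{j=1}^{3} 1/\mathrm{s.e.}(\hat{\theta}_j)}$.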
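The estimators above and their weighted average can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names; the standard errors are assumed to be available from the fitted linear model.

```python
import math

def efficacy_estimates(beta):
    """Return the three estimates of theta and the two estimates of phi
    implied by the fitted coefficients beta_0, ..., beta_5 of model (12)."""
    b0, b1, b2, b3, b4, b5 = beta
    theta = [(1 - math.exp(b0 + b1)) / (1 - math.exp(b0)),
             (1 - math.exp(b2 + b3)) / (1 - math.exp(b2)),
             (1 - math.exp(b4 + b5)) / (1 - math.exp(b4))]
    phi = [(1 - math.exp(b4)) / (1 - math.exp(b2)),
           (1 - math.exp(b4 + b5)) / (1 - math.exp(b2 + b3))]
    return theta, phi

def precision_weighted_mean(estimates, std_errors):
    """Average the estimates with weights proportional to 1 / s.e."""
    inverse = [1.0 / s for s in std_errors]
    total = sum(inverse)
    return sum(e * w / total for e, w in zip(estimates, inverse))
```

When the coefficients are computed exactly from given values of $b$, $p$, $\theta$ and $\phi$, all three estimates of $\theta$ coincide, as do both estimates of $\phi$; with fitted coefficients they generally differ, which is why a weighted average is used.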

References

  1. Addy CL, Longini IM, Haber MJ. A generalized stochastic model for the analysis of infectious disease final size data. Biometrics. 1991;47:961–974.
  2. Becker NG. Analysis of Infectious Disease Data. Chapman and Hall; New York, NY: 1989.
  3. Becker NG, Hasofer AM. Estimation in Epidemics with Incomplete Observations. Journal of the Royal Statistical Society, Series B. 1997;59:415–429.
  4. Becker NG, Britton T, O'Neill PD. Estimating vaccine effects on transmission of infection from household outbreak data. Biometrics. 2003;59:467–475. doi: 10.1111/1541-0420.00056.
  5. Cauchemez S, Carrat F, Viboud C, Valleron AJ, Boëlle PY. A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data. Statist. Med. 2004;23:3469–3487. doi: 10.1002/sim.1912.
  6. Datta S, Halloran ME, Longini IM. Efficiency of estimating vaccine efficacy for susceptibility and infectiousness: randomization by individual versus household. Biometrics. 1999;55:792–798. doi: 10.1111/j.0006-341x.1999.00792.x.
  7. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
  8. Donner A. Some aspects of the design and analysis of cluster randomized trials. Statistics in Medicine. 1998;47:95–113.
  9. Elveback LR, Fox JP, Ackerman E. An influenza simulation model for immunization studies. American Journal of Epidemiology. 1976;103:152–165. doi: 10.1093/oxfordjournals.aje.a112213.
  10. Germann TC, Kadau K, Longini IM, Macken CA. Mitigation strategies for pandemic influenza in the United States. Proceedings of the National Academy of Sciences of the U.S.A. 2006;103:5935–5940. doi: 10.1073/pnas.0601266103.
  11. Halloran ME, Struchiner CJ, Longini IM. Study designs for different efficacy and effectiveness aspects of vaccination. American Journal of Epidemiology. 1997;146:789–803. doi: 10.1093/oxfordjournals.aje.a009196.
  12. Halloran ME, Préziosi M-P, Chu H. Estimating vaccine efficacy from secondary attack rates. Journal of the American Statistical Association. 2003;98:38–46.
  13. Halloran ME, Hayden FG, Yang Y, Longini IM, Monto AS. Antiviral effects on influenza viral transmission and pathogenicity: observations from household-based trials. American Journal of Epidemiology. 2006;165:212–222. doi: 10.1093/aje/kwj362.
  14. Hayden FG, Gubareva LV, Monto AS, Klein TC, Elliott MJ, Hammond JM, Sharp SJ, Ossi MJ, Zanamivir Family Study Group. Inhaled zanamivir for the prevention of influenza in families. New England Journal of Medicine. 2000;343:1282–1289. doi: 10.1056/NEJM200011023431801.
  15. Hayden FG. Perspectives on antiviral use during pandemic influenza. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences. 2001;356:1877–1884. doi: 10.1098/rstb.2001.1007.
  16. Longini IM, Koopman JS. Household and Community Transmission Parameters from Final Distributions of Infections in Households. Biometrics. 1982;38:115–126.
  17. Longini IM, Halloran ME. A frailty mixture model for estimating vaccine efficacy. Journal of the Royal Statistical Society, Series C. 1996;45:165–173.
  18. Longini IM, Halloran ME, Nizam A, Yang Y. Containing pandemic influenza with antiviral agents. American Journal of Epidemiology. 2004;159:623–633. doi: 10.1093/aje/kwh092.
  19. Longini IM, Nizam A, Xu S, Ungchusak K, Hanshaoworakul W, Cummings DAT, Halloran ME. Containing pandemic influenza at the source. Science. 2005;309:1083–1087. doi: 10.1126/science.1115717.
  20. Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B. 1982;44:226–233.
  21. Magder L, Brookmeyer R. Analysis of infectious disease data from partner studies with unknown source of infection. Biometrics. 1993;49:1110–1116.
  22. Monto AS, Pichichero ME, Blanckenberg SJ, Ruuskanen O, Cooper C, Fleming DM, Kerr C. Zanamivir prophylaxis: an effective strategy for the prevention of influenza types A and B within households. Journal of Infectious Diseases. 2002;186:1582–1588. doi: 10.1086/345722.
  23. O'Neill P, Roberts GO. Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society, Series A. 1999;162:121–129.
  24. O'Neill P, Balding DJ, Becker NG, Eerola M, Mollison D. Analyses of infectious disease data from household outbreaks by Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series C. 2000;49:517–542.
  25. Paap R. What are the advantages of MCMC based inference in latent variable models? Statistica Neerlandica. 2002;56:2–22.
  26. Rampey AH, Longini IM, Haber MJ, Monto AS. A discrete-time model for the statistical analysis of infectious disease incidence data. Biometrics. 1992;48:117–128.
  27. Reilly M, Lawlor E. A likelihood-based method for identifying contaminated lots of blood product. International Journal of Epidemiology. 1999;28:787–792. doi: 10.1093/ije/28.4.787.
  28. van Dyk DA, Meng X. The art of data augmentation. Journal of Computational and Graphical Statistics. 2001;10:1–50.
  29. Welliver R, Monto AS, Carewicz O, Schattemanet E, Hassman M, Hedrick J, Jackson HC, Huson L, Ward P, Oxford JS. Effectiveness of oseltamivir in preventing influenza in household contacts: a randomized controlled trial. Journal of the American Medical Association. 2001;285:748–754. doi: 10.1001/jama.285.6.748.
  30. Yang Y, Longini IM, Halloran ME. Design and evaluation of prophylactic interventions using infectious disease incidence data from close contact groups. Journal of the Royal Statistical Society, Series C. 2006;55:317–330. doi: 10.1111/j.1467-9876.2006.00539.x.
