A Data-Augmentation Method for Infectious Disease Incidence Data from Close Contact Groups

Yang Yang; Ira M Longini, Jr; M Elizabeth Halloran

doi:10.1016/j.csda.2007.03.007

Author manuscript; available in PMC: 2008 Aug 15.

Published in final edited form as: Comput Stat Data Anal. 2007 Aug 15;51(12):6582–6595. doi: 10.1016/j.csda.2007.03.007

PMCID: PMC2131714 NIHMSID: NIHMS29356 PMID: 18704156

Abstract

A broad range of studies of preventive measures in infectious diseases gives rise to incidence data from close contact groups. Parameters of common interest in such studies include transmission probabilities and efficacies of preventive or therapeutic interventions. We estimate these parameters using discrete-time likelihood models. We augment the data with unobserved pairwise transmission outcomes and fit the model using the EM algorithm. A linear model derived from the likelihood based on the augmented data and fitted with the iteratively re-weighted least squares method is also discussed. Using simulations, we demonstrate the comparable accuracy and lower sensitivity to initial estimates of the proposed methods with data augmentation relative to the likelihood model based solely on the observed data. Two randomized household-based trials of zanamivir, an influenza antiviral agent, are analyzed using the proposed methods.

Keywords: Antiviral agent, Data augmentation, EM algorithm, Infectious disease, Intervention efficacy, Linear model, MLE

1 Introduction

Close contact groups, such as households, are important settings of transmission for many infectious diseases. Data collected from these contact groups provide a basis for evaluating person-to-person transmission risks and the effectiveness of interventions such as antiviral treatments or vaccines (Halloran, Struchiner and Longini, 1997; Becker, Britton and O'Neill, 2003). Various statistical methods have been developed to exploit the different levels of information available in such data. If only the final infection status of participants is known, methods utilizing recursive final-size probabilities can be applied, including likelihood maximization (Longini and Koopman, 1982; Addy, Longini and Haber, 1991), Bayesian approaches (O'Neill and Roberts, 1999), generalized linear models (Magder and Brookmeyer, 1993), and estimating equations with martingale techniques (Becker and Hasofer, 1997). In many modern clinical trials, sequential laboratory tests and symptom diaries of participants provide time-to-event data with individual-specific longitudinal exposure information. To take into account exposure and transmission dynamics at the individual level, Rampey et al. (1992) constructed discrete-time likelihoods based on assumptions about the natural history of the disease such as the distributions of the latent and infectious periods. Yang, Longini and Halloran (2006) extended this method to the more realistic case-ascertained design. Cauchemez et al. (2004) proposed a Bayesian model with the flexibility of estimating the natural history of the disease, but it does not accommodate time-dependent covariates.

The discrete-time likelihoods in Rampey et al. (1992) and Yang et al. (2006) are built solely upon the observed data, including symptom onset dates, laboratory test results and household structure (which individuals live in which households), and involve summing probability components over the latent period. Summations or integrals are commonly seen in likelihoods based solely on the observed data, and such complicated structure may present difficulties for standard analyses or prevent extension by other methods (O'Neill et al., 2000). More importantly, when data are sparse because of rare incidences and/or a multicovariate structure, iterative estimation procedures (e.g., the Newton-Raphson algorithm) using only the observed data may be sensitive to the initial estimates in locating the maximum likelihood estimates (MLEs). This sensitivity is demonstrated in Sections 3 and 4 of this paper, and is also mentioned in Yang et al. (2006). Data augmentation is a popular technique to circumvent computational difficulties in classical likelihood methods because likelihood functions conditional on unobserved variables are often simpler (van Dyk and Meng, 2001; Paap, 2002). In a transmission model for infectious diseases, a basic element is the transmission probability given a contact between an infective person and a susceptible person. The contact may be defined in various ways, for example, one day of living in the same household. The outcome of each contact, infection or escape, is generally not observable since a person may make multiple contacts before infection. In this paper, we revise the discrete-time likelihood in Yang et al. (2006) by augmenting the observed symptom onset data with the unobserved transmission outcome for each contact. This likelihood based on the augmented data has a simpler form than the one based on only the observed data and can be maximized with the EM algorithm. To illustrate how this simpler likelihood lends itself to other methods, we derive a linear model that can be fitted using the iteratively re-weighted least squares (IRLS) procedure. We show via simulation studies that both the maximum likelihood (ML) and the IRLS methods using the augmented data are less sensitive to initial estimates than the ML method using only the observed data in Yang et al. (2006). We use the proposed approaches to estimate the prophylactic and treatment effectiveness of an influenza antiviral agent in two household trials.

2 Methods

Suppose that the disease under investigation is influenza and the data arise from a clinical trial in which household members are randomized to either an antiviral agent or control when an index case is identified by clinical symptoms. Let us assume the antiviral agent provides temporary protection for susceptible contacts and therapy for cases. In the discrete-time likelihood model setting, risks are evaluated for each susceptible participant in each time interval. Suppose that the time intervals are consecutive days, and define a contact as the exposure of a susceptible person to an infective person in the same household throughout a day. The pairwise transmission probability per contact between a susceptible person i with covariates x_i and an infective person j with covariates x_j in the same household is expressed as p(x_i, x_j). If x_i and x_j are scalars denoting treatment status of antiviral agent (1=yes, 0=no), then one can define efficacy measures ${AVE}_{S} = 1 - \frac{p (1, 0)}{p (0, 0)}$ , ${AVE}_{I} = 1 - \frac{p (0, 1)}{p (0, 0)}$ and ${AVE}_{T} = 1 - \frac{p (1, 1)}{p (0, 0)}$ , where in the epidemiological literature AVE_S measures the antiviral efficacy in reducing susceptibility, AVE_I measures the efficacy in reducing infectiousness, and AVE_T is called the total effectiveness (Halloran et al., 1997). Let p = p(0, 0) be the baseline daily pairwise transmission probability without any treatment. For notational convenience, a reparameterization leads to $p (x_{i}, x_{j}) = θ^{x_{i} (1 - x_{j})} ϕ^{(1 - x_{i}) x_{j}} η^{x_{i} x_{j}} p$ where θ = 1 − AVE_S, ϕ = 1 − AVE_I and η = 1 − AVE_T. For simplicity, we assume multiplicativity of θ and ϕ such that η = θϕ, and thus $p (x_{i}, x_{j}) = θ^{x_{i}} ϕ^{x_{j}} p$ . In Yang et al. (2006), we explored the assumption of multiplicativity for the ML method using only the observed data.
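The reparameterization above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `pairwise_prob` and `efficacies` are hypothetical, and the numerical values are illustrative only (they happen to match the first simulation setting used later in the paper).

```python
def pairwise_prob(p, theta, phi, eta, x_i, x_j):
    """p(x_i, x_j) = theta^{x_i(1-x_j)} * phi^{(1-x_i)x_j} * eta^{x_i x_j} * p,
    where x = 1 means treated and x = 0 means untreated."""
    return (theta ** (x_i * (1 - x_j))
            * phi ** ((1 - x_i) * x_j)
            * eta ** (x_i * x_j)
            * p)


def efficacies(p, theta, phi, eta=None):
    """Return (AVE_S, AVE_I, AVE_T) as one minus ratios of pairwise probabilities.

    If eta is omitted, multiplicativity (eta = theta * phi) is assumed."""
    if eta is None:
        eta = theta * phi
    p00 = pairwise_prob(p, theta, phi, eta, 0, 0)
    ave_s = 1 - pairwise_prob(p, theta, phi, eta, 1, 0) / p00
    ave_i = 1 - pairwise_prob(p, theta, phi, eta, 0, 1) / p00
    ave_t = 1 - pairwise_prob(p, theta, phi, eta, 1, 1) / p00
    return ave_s, ave_i, ave_t


# Illustrative values: p = 0.1, theta = 0.4, phi = 0.7, multiplicative eta.
ave_s, ave_i, ave_t = efficacies(p=0.1, theta=0.4, phi=0.7)
```

Under multiplicativity the total effectiveness reduces to AVE_T = 1 − θϕ, so only θ and ϕ need to be estimated.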

As our interest centers around estimation of transmission probabilities and treatment efficacies, we assume that: 1. the latent period (time from infection to being infectious) coincides with the incubation period (time from infection to the onset of symptoms); and 2. durations of the latent and the infectious periods have known probability distributions. If the latent and the incubation periods do not coincide but are both known, the model can be adjusted for such a situation.

2.1 The Maximum Likelihood Method Based on the Augmented Data

Suppose that the trial is conducted on a population of size N and is observed on a daily basis from day 1 to day T. Let us assume day 1 is the first calendar day of exposure for the whole study population. The observed data for each subject include household membership, the date of symptom onset, laboratory test result, randomized treatment and treatment period as well as other characteristics such as age and gender. On day t, the probability that an infective person j with treatment status r_j(t) (0: untreated, 1: treated) infects a susceptible person i with treatment status r_i(t) in the same household is expressed as

p_{ji} (t) = θ^{r_{i} (t)} ϕ^{r_{j} (t)} p \, f (t ∣ {\tilde{t}}_{j}),

(1)

where f(t|t̃_j) is the probability that person j stays infectious on day t given the day of symptom onset t̃_j and is derived from the known distribution of the infectious period. For simplicity in notation, we use t̃_i to denote the observed symptom onset time for each person, although t̃_i is right-censored for those who are free of symptoms up to day T. We allow a constant common infective source from outside of the household by setting $p_{ci} (t) = θ^{r_{i} (t)} b$ , where c refers to the common source, and b is the baseline probability of being infected by the common source per day. Let ψ_j = 1 if the infective source j is a person and 0 if j = c. A modification of (1) takes the common source into account as follows:

p_{ji} (t) = θ^{r_{i} (t)} ϕ^{r_{j} (t)} p^{ψ_{j}} b^{1 - ψ_{j}} f (t ∣ {\tilde{t}}_{j}),

(2)

where f_c(t|t̃_c) = 1 and r_c(t) = 0 for all t. A likelihood involving only the observed data, {t̃_i : 1 ≤ t ≤ T, 1 ≤ i ≤ N}, can be constructed from (2) and the known distribution of the latent period as in Yang et al. (2006).
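Equation (2) translates directly into code. The sketch below is illustrative only: the function name `daily_infection_prob` and its argument names are hypothetical, and the value of f(t|t̃_j) is passed in as a number rather than derived from the infectious period distribution.

```python
def daily_infection_prob(b, p, theta, phi, r_i, r_j=0, from_person=True, f_t=1.0):
    """Equation (2): probability that source j infects susceptible i on one day.

    r_i, r_j    : treatment indicators on that day (1 = treated, 0 = untreated)
    from_person : True for a household member (psi_j = 1),
                  False for the external common source (psi_j = 0)
    f_t         : f(t | onset of j), probability that j is infectious on that day
    """
    if not from_person:
        # The common source is never treated (r_c = 0) and always active (f = 1).
        r_j, f_t = 0, 1.0
    psi = 1 if from_person else 0
    return theta ** r_i * phi ** r_j * p ** psi * b ** (1 - psi) * f_t


# Treated susceptible exposed to a treated housemate who is infectious with probability 0.7.
within = daily_infection_prob(0.005, 0.1, 0.4, 0.7, r_i=1, r_j=1, f_t=0.7)
# The same treated susceptible exposed to the common source.
outside = daily_infection_prob(0.005, 0.1, 0.4, 0.7, r_i=1, from_person=False)
```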

Let Y_ji(t) be the transmission result (1: infection, 0: escape) between an infective source j and a susceptible person i on day t. Let l_max and l_min be the maximum and minimum duration of the latent period, so that $\underline{t}_{i} = {\tilde{t}}_{i} - l_{\max}$ and $\overline{t}_{i} = {\tilde{t}}_{i} - l_{\min}$ are the earliest and latest potential infection days for person i. Given the observed symptom onset day t̃_i, the sequence of Y_ji(t)'s for $t \geq \underline{t}_{i}$ remains unknown. It should be noted that Y_ji(t) is a random variable only if Y_ji(τ) = 0 for all τ < t, and Y_ji(t) is independent of Y_ki(t) for the same day t. Define

Z_{ji} (t) = Y_{ji} (t) \prod_{k \in D_{i}, τ < t} (1 - Y_{ki} (τ))

and

{\bar{Z}}_{ji} (t) = (1 - Y_{ji} (t)) \prod_{k \in D_{i}, τ < t} (1 - Y_{ki} (τ)),

where D_i is the collection of potential infective sources for person i, i.e., people living in the same household with person i plus the external common source. Z_ji(t) = 1 is the event that person i escapes infection from any source before day t but is infected by source j on day t, while ${\bar{Z}}_{ji} (t) = 1$ is the event that person i escapes infection from any source before day t and from source j on day t. Let $\max_{j \in D_{i}} Z_{ji} (t)$ indicate if Z_ji(t) = 1 for any j on day t. The likelihood of the augmented data is

\begin{aligned} & L_{i} (b, p, θ, ϕ ∣ {\tilde{t}}_{j}, Z_{ji} (t), j \in D_{i}, t \leq T) \\ & = \prod_{t = 1}^{T} \left\{ g ({\tilde{t}}_{i} ∣ t)^{\max_{j \in D_{i}} Z_{ji} (t)} \prod_{j \in D_{i}} (p_{ji} (t))^{Z_{ji} (t)} (1 - p_{ji} (t))^{{\bar{Z}}_{ji} (t)} \right\} \\ & \propto \prod_{t = 1}^{T} \prod_{j \in D_{i}} (p_{ji} (t))^{Z_{ji} (t)} (1 - p_{ji} (t))^{{\bar{Z}}_{ji} (t)}, \end{aligned}

(3)

where g(t̃_i|t) denotes the probability of illness onset on day t̃_i given infection on day t and is derived from the distribution of the latent period. According to our assumption, both f(t|t̃_j) and g(t̃_i|t) are known. This likelihood is a product of binomial probability components, much simpler than the one in Yang et al. (2006). To apply the EM algorithm, we need to determine the distributions of Z_ji(t) and ${\bar{Z}}_{ji} (t)$ conditioning on current estimates of b, p, θ and ϕ as well as t̃_j, j ∈ D_i (Dempster, Laird and Rubin, 1977). Define S_i(t) as the event that person i has symptom onset on day t, I_i(t) the event that person i is infected on day t and I_ji(t) the event that person i is infected by j on day t. Then, the conditional distributions are given by (Appendix A)

\Pr (Z_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) = \begin{cases} \frac{\Pr (I_{ji} (t))}{\Pr (S_{i} ({\tilde{t}}_{i}))} \times \Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (t)), & \underline{t}_{i} \leq t < \overline{t}_{i} \\ 0, & \text{otherwise} \end{cases}

(4)

and

\Pr ({\bar{Z}}_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) = \begin{cases} \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (t)) \times \{ \Pr (I_{i} (t)) - \Pr (I_{ji} (t)) \}}{\Pr (S_{i} ({\tilde{t}}_{i}))} + \sum_{τ = t + 1}^{\overline{t}_{i}} \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (τ)) \times \Pr (I_{i} (τ))}{\Pr (S_{i} ({\tilde{t}}_{i}))}, & \underline{t}_{i} \leq t < \overline{t}_{i} \\ 1, & t < \underline{t}_{i} \\ 0, & \text{otherwise} \end{cases} .

(5)

Given estimates $({\hat{b}}_{l - 1}, {\hat{p}}_{l - 1}, {\hat{θ}}_{l - 1}, {\hat{ϕ}}_{l - 1})$ from the (l – 1)^th iteration, in the l^th iteration we have

\begin{aligned} \Pr (I_{ji} (t)) & = \begin{cases} {\hat{Q}}_{i} (t - 1) \, {\hat{θ}}_{l - 1}^{r_{i} (t)} {\hat{ϕ}}_{l - 1}^{r_{j} (t)} {\hat{p}}_{l - 1} f (t ∣ {\tilde{t}}_{j}), & j \in D_{i} \\ {\hat{Q}}_{i} (t - 1) \, {\hat{θ}}_{l - 1}^{r_{i} (t)} {\hat{b}}_{l - 1}, & j = c \end{cases} \\ \Pr (I_{i} (t)) & = {\hat{Q}}_{i} (t - 1) \left\{ 1 - (1 - {\hat{θ}}_{l - 1}^{r_{i} (t)} {\hat{b}}_{l - 1}) \prod_{j \in D_{i}} (1 - {\hat{θ}}_{l - 1}^{r_{i} (t)} {\hat{ϕ}}_{l - 1}^{r_{j} (t)} {\hat{p}}_{l - 1} f (t ∣ {\tilde{t}}_{j})) \right\}, \\ \Pr (S_{i} ({\tilde{t}}_{i})) & = \sum_{τ = \underline{t}_{i}}^{\overline{t}_{i}} \Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (τ)) \times \Pr (I_{i} (τ)), \\ \Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (τ)) & = g ({\tilde{t}}_{i} ∣ τ), \end{aligned}

where ${\hat{Q}}_{i} (t - 1)$ is the estimated cumulative escape probability based on $({\hat{b}}_{l - 1}, {\hat{p}}_{l - 1}, {\hat{θ}}_{l - 1}, {\hat{ϕ}}_{l - 1})$ . The likelihood history before day $\underline{t}_{i}$ can be dropped from Pr(I_ji(t)) and Pr(I_i(t)), since ${\hat{Q}}_{i} (\underline{t}_{i} - 1)$ is a common factor that cancels in the calculations of (4) and (5). The implementation of the EM algorithm is straightforward. In the E-step, (4) and (5) are calculated and plugged into the logarithm of (3) to obtain

\begin{aligned} & \log (L_{i} (b, p, θ, ϕ ∣ {\tilde{t}}_{j}, Z_{ji} (t), j \in D_{i}, t \leq T)) \\ & \propto \sum_{t = 1}^{T} \sum_{j \in D_{i}} \left\{ \Pr (Z_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) \log (p_{ji} (t)) + \Pr ({\bar{Z}}_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) \log (1 - p_{ji} (t)) \right\} \end{aligned}

(6)

which is maximized in the M-step.
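As a rough illustration of the E-step, the following Python sketch computes the weights in (4) and (5) for one symptomatic person, given the daily probabilities p_ji(t) evaluated at the current parameter estimates. The function name `e_step_weights` and the dictionary-based data layout are assumptions of this sketch, not the authors' implementation. The window of candidate infection days is taken to include both endpoints, as in the summation for Pr(S_i(t̃_i)); that endpoint convention is also an assumption. People who remain symptom-free up to day T (right-censored onset) are not handled here.

```python
def e_step_weights(p_src, g, onset, l_min, l_max):
    """E-step weights for one symptomatic person i.

    p_src : dict source -> dict day -> p_ji(day) under the current estimates
    g     : dict latent duration in days -> probability, so g(onset | t) = g[onset - t]
    onset : observed symptom onset day of person i
    Returns (w_inf, w_esc): dict source -> dict day -> E[Z_ji(day)] and E[Zbar_ji(day)]
    for days inside the window of potential infection days.
    """
    t_lo, t_hi = onset - l_max, onset - l_min
    days = range(t_lo, t_hi + 1)

    # The escape history before t_lo cancels in (4) and (5), so start from Q = 1.
    Q = 1.0
    pr_I = {}
    pr_Ij = {s: {} for s in p_src}
    for t in days:
        escape_all = 1.0
        for s in p_src:
            pr_Ij[s][t] = Q * p_src[s][t]      # escaped so far, then infected by s
            escape_all *= 1.0 - p_src[s][t]
        pr_I[t] = Q * (1.0 - escape_all)       # escaped so far, then infected by anyone
        Q *= escape_all

    # Pr(S_i(onset)) sums over all candidate infection days.
    pr_S = sum(g[onset - t] * pr_I[t] for t in days)

    w_inf = {s: {} for s in p_src}
    w_esc = {s: {} for s in p_src}
    for s in p_src:
        for t in days:
            later = sum(g[onset - u] * pr_I[u] for u in days if u > t)
            w_inf[s][t] = g[onset - t] * pr_Ij[s][t] / pr_S
            w_esc[s][t] = (g[onset - t] * (pr_I[t] - pr_Ij[s][t]) + later) / pr_S
    return w_inf, w_esc


# Hypothetical example: one housemate with constant daily probability 0.1, onset on day 10.
g = {1: 0.2, 2: 0.6, 3: 0.2}
p_src = {"housemate": {7: 0.1, 8: 0.1, 9: 0.1}}
w_inf, w_esc = e_step_weights(p_src, g, onset=10, l_min=1, l_max=3)
```

With a single source, the infection weights sum to one over the window, since the person must have been infected by that source on exactly one of the candidate days. The weights then replace Z_ji(t) and Z̄_ji(t) in (6), which is maximized numerically in the M-step.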

Variances of the parameter estimates can be evaluated using Louis' method (Louis, 1982). Let Z be the collection of Z_ji(t), and t̃ the collection of t̃_i, for all i, j and t, so that t̃ is the observed data and Z is the partially latent data. Let λ = {b, p, θ, ϕ}. Louis' method states that

- \frac{\partial^{2} \log (L (λ ∣ \tilde{t}))}{\partial λ^{2}} = E_{Z ∣ \tilde{t}, λ} \left\{ - \frac{\partial^{2} \log (L (λ ∣ \tilde{t}, Z))}{\partial λ^{2}} \right\} - {VAR}_{Z ∣ \tilde{t}, λ} \left\{ \frac{\partial \log (L (λ ∣ \tilde{t}, Z))}{\partial λ} \right\} .

The first component on the right side can be evaluated analytically based on (6), while the second component can be estimated via sampling from the distribution of Z conditioning on t̃ and $\hat{λ}$ .

2.2 The Linear Model Based on the Augmented Data

A linear model is a natural consequence of modeling the daily pairwise transmissions. Taking the logarithm on both sides of (2),

\begin{aligned} \log (p_{ji} (t)) & = \log (b) + ψ_{j} \log \frac{p}{b} + r_{i} (t) \log (θ) + r_{j} (t) \log (ϕ) + \log (f (t ∣ {\tilde{t}}_{j})) \\ & = β_{0} + β_{1} ψ_{j} + β_{2} r_{i} (t) + β_{3} r_{j} (t) + \log (f (t ∣ {\tilde{t}}_{j})) . \end{aligned}

(7)

The response of this model is Y_ji(t) since $p_{ji} (t) = \Pr (Y_{ji} (t) = 1 ∣ Y_{ki} (τ) = 0, k \in D_{i}, τ < t)$ . From (6), it is clear that one should assign weights $\Pr (Z_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i})$ to the outcome Y_ji(t) = 1 and $\Pr ({\bar{Z}}_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i})$ to the outcome Y_ji(t) = 0. As the weights need to be calculated from pre-estimated parameters, we use the iteratively re-weighted least squares (IRLS) method to fit the model.

To apply the IRLS method, suppose the conditional expected frequencies of Y_ji(t)'s have been summarized into H binomial proportions P_h, h = 1, …, H, with the H covariate patterns defined by r_i(t), r_j(t), ψ_j and $f (t ∣ {\tilde{t}}_{j})$ . We fit model (7) by minimizing the objective function $\sum_{h = 1}^{H} w_{h} \{ \log ({\tilde{P}}_{h}) - \log (P_{h}) \}^{2}$ , the weighted sum of squared differences between the logarithms of the observed proportion P̃_h and the mean proportion P_h. Let n_h be the number of observations in the h^th pattern. The weight for the h^th pattern $w_{h} = {VAR}^{- 1} (\log ({\hat{P}}_{h}))$ could be estimated from either P̃_h (data-based) or the fitted response P̂_h (model-based). Our simulations suggest that combinations such as the arithmetic mean $\frac{1}{2} \left\{ \frac{n_{h} \times {\tilde{P}}_{h}}{1 - {\tilde{P}}_{h}} + \frac{n_{h} \times {\hat{P}}_{h}}{1 - {\hat{P}}_{h}} \right\}$ or the geometric mean $n_{h} \sqrt{\frac{{\tilde{P}}_{h} {\hat{P}}_{h}}{(1 - {\tilde{P}}_{h}) (1 - {\hat{P}}_{h})}}$ provide estimates close to the MLEs. If P̃_h = 0, we replace P̃_h by P̂_h from the previous iteration. Let ${\hat{β}}_{0}, \dots, {\hat{β}}_{3}$ be the WLS estimates of the coefficients in model (7); then the WLS estimates of the parameters at the l^th iteration are

{\hat{b}}_{l} = \exp ({\hat{β}}_{0}), \quad {\hat{p}}_{l} = \exp ({\hat{β}}_{0} + {\hat{β}}_{1}), \quad {\hat{θ}}_{l} = \exp ({\hat{β}}_{2}), \quad \text{and} \quad {\hat{ϕ}}_{l} = \exp ({\hat{β}}_{3}) .

We then update the parameters and re-fit the model until the estimates converge. We have generalized the linear model method to populations with heterogeneity in the transmission probabilities (Appendix B).
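The weight formulas and the back-transformation of the coefficients can be sketched as follows. By the delta method, VAR(log P̃_h) is approximately (1 − P_h)/(n_h P_h), which is consistent with the terms n_h P_h/(1 − P_h) appearing in the two means. The function names `irls_weight` and `back_transform` are hypothetical, and the weighted least squares fit itself is not shown.

```python
import math


def irls_weight(n_h, p_obs, p_fit, kind="arithmetic"):
    """Weight for covariate pattern h, combining the data-based and model-based
    estimates of 1 / VAR(log P_h), each of the form n_h * P / (1 - P)."""
    if p_obs == 0:
        # As described in the text: fall back on the fitted value from the
        # previous iteration when the observed proportion is zero.
        p_obs = p_fit
    data_based = n_h * p_obs / (1 - p_obs)
    model_based = n_h * p_fit / (1 - p_fit)
    if kind == "arithmetic":
        return 0.5 * (data_based + model_based)
    if kind == "geometric":
        return math.sqrt(data_based * model_based)
    raise ValueError("kind must be 'arithmetic' or 'geometric'")


def back_transform(beta):
    """Map WLS coefficients (beta0, beta1, beta2, beta3) of model (7)
    to the parameters (b, p, theta, phi)."""
    b0, b1, b2, b3 = beta
    return math.exp(b0), math.exp(b0 + b1), math.exp(b2), math.exp(b3)


# Hypothetical pattern with 100 observations and agreeing proportions.
w_arith = irls_weight(100, 0.1, 0.1, kind="arithmetic")
w_geom = irls_weight(100, 0.1, 0.1, kind="geometric")
```

When the observed and fitted proportions agree, the two combinations coincide; they differ only when the data and the current fit disagree.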

At each iteration, the variances of b̂_l, p̂_l, ${\hat{θ}}_{l}$ and ${\hat{ϕ}}_{l}$ estimated from the linear model have been averaged over the conditional distribution of Z. Because the randomness in Z is lost, the final variance estimates will under-estimate the true variances. Since $VAR (\hat{λ}) = E (VAR (\hat{λ} ∣ Z)) + VAR (E (\hat{λ} ∣ Z))$ , similar to Louis' method for the ML method, one can employ the following adjustment procedure to approximate $VAR (\hat{λ})$ :

1. Sample Z from $\Pr (Z ∣ \tilde{t}, \hat{λ})$ , where $\hat{λ}$ denotes the final parameter estimates.
2. Use the sampled Z as the weights to fit model (7) and obtain new point estimates of the parameters and their variances.
3. Repeat steps 1 and 2 a sufficient number of times. The sample average of the newly-estimated variances approximates $E (VAR (\hat{λ} ∣ Z))$ , and the sample variance of the newly-estimated parameters approximates $VAR (E (\hat{λ} ∣ Z))$ .
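The final combination step of this procedure amounts to a few lines. In the sketch below, `estimates` and `variances` are hypothetical inputs: the point estimates and model-based variances of one parameter obtained from repeated draws of Z. Sampling Z and re-fitting model (7) are not shown.

```python
from statistics import mean, variance


def total_variance(estimates, variances):
    """Approximate VAR(lambda_hat) = E(VAR(lambda_hat | Z)) + VAR(E(lambda_hat | Z)).

    estimates : point estimates of one parameter, one per sampled Z
    variances : model-based variance estimates of that parameter, one per sampled Z
    """
    within = mean(variances)       # approximates E(VAR(lambda_hat | Z))
    between = variance(estimates)  # sample variance, approximates VAR(E(lambda_hat | Z))
    return within + between


# Made-up numbers purely to show the call.
adjusted = total_variance([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```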

3 Simulation Study

To compare the ML and IRLS methods using the augmented data with the ML method using only the observed data, we conducted simulations under two scenarios: with a large number of cases and with sporadic cases. A pseudo-community composed of households of size two or larger with 1000 people was generated according to the distributions of age and household sizes from the US Census 2000. The distribution of the simulated household sizes is {2 : 67%, 3 : 13%, 4 : 10%, 5 : 7%, 6 : 2%, 7 : 1%}. Simulated epidemics were stopped on day 100, the typical length of the influenza season for a community. The empirical latent and infectious period distributions, from which f(t|t̃_i) and g(t̃_i|t) were derived, were obtained from Elveback, Fox and Ackerman (1976) and given in Table 1. Our simulations were implemented with individual-level randomization of treatments, where individuals including index cases in the same household may receive different treatments. In the Newton-Raphson procedure for likelihood maximization, we apply the complementary log-log transformation for b and p and the log transformation for θ and ϕ to help improve convergence. One thousand stochastic replications were carried out for each scenario investigated.
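The transformations used in the Newton-Raphson procedure can be written out as below. This is a minimal sketch: the helper names `to_working_scale` and `from_working_scale` are hypothetical, and the optimizer itself is omitted. The point is only that the working parameters are unconstrained, so an update step cannot leave the valid parameter range.

```python
import math


def cloglog(prob):
    """Complementary log-log transform: maps (0, 1) onto the real line."""
    return math.log(-math.log1p(-prob))


def inv_cloglog(x):
    """Inverse complementary log-log: maps the real line back into (0, 1)."""
    return -math.expm1(-math.exp(x))


def to_working_scale(b, p, theta, phi):
    """cloglog for the probabilities b and p, log for the efficacy ratios theta and phi."""
    return cloglog(b), cloglog(p), math.log(theta), math.log(phi)


def from_working_scale(u_b, u_p, u_theta, u_phi):
    return inv_cloglog(u_b), inv_cloglog(u_p), math.exp(u_theta), math.exp(u_phi)


start = (0.002, 0.01, 0.4, 0.7)
round_trip = from_working_scale(*to_working_scale(*start))
```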

Table 1.

Empirical cumulative distributions of the latent period and the infectious period for influenza (Elveback et al., 1976).

Latent Period		Infectious Period
Duration (days)	Cumulative Probability	Duration (days)	Cumulative Probability
0	0	≤ 2	0
1	0.2	3	0.3
2	0.8	4	0.7
3	1.0	5	0.9
		6	1.0
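For illustration, the functions g and f can be derived from Table 1 as sketched below. The day-counting convention is an assumption of this sketch: infectiousness is taken to start on the day of symptom onset (consistent with the latent period coinciding with the incubation period), and that day is counted as day 1 of the infectious period. The paper states only that f and g are derived from these distributions, so a different convention would shift f by one day.

```python
# Cumulative probabilities from Table 1 (duration in days -> Pr(duration <= days)).
LATENT_CDF = {0: 0.0, 1: 0.2, 2: 0.8, 3: 1.0}
INFECTIOUS_CDF = {2: 0.0, 3: 0.3, 4: 0.7, 5: 0.9, 6: 1.0}


def g(onset_day, infection_day):
    """Pr(symptom onset on onset_day | infection on infection_day)."""
    d = onset_day - infection_day          # latent duration in days
    if d < 1 or d > 3:
        return 0.0
    return LATENT_CDF[d] - LATENT_CDF[d - 1]


def infectious_cdf(d):
    """Pr(infectious period <= d days), extended outside the tabulated range."""
    if d <= 2:
        return 0.0
    if d >= 6:
        return 1.0
    return INFECTIOUS_CDF[d]


def f(day, onset_day):
    """Pr(still infectious on `day` | symptom onset on onset_day).

    ASSUMPTION: the onset day is day 1 of the infectious period, so the person
    is infectious on `day` if the period lasts at least day - onset_day + 1 days.
    """
    d = day - onset_day + 1
    if d < 1:
        return 0.0
    return 1.0 - infectious_cdf(d - 1)
```

Under this convention a case is certainly infectious for the first three days after onset, and the probability then declines to 0.7, 0.3 and 0.1 on days four to six.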

We first set the values of the parameters to b = 0.005, p = 0.1, θ = 0.4, ϕ = 0.7. Under this setting, on average 69% of the households and 51% of the contacts were attacked in simulated epidemics, and 20% of the contacts were infected when receiving treatment. The three iterative procedures were initiated from the true values of the parameters and, with adequate numbers of events, converged most of the time. By convergence we mean that the estimates of all four parameters converge to reasonable values. Specifically, estimates of b and p in (10⁻¹⁰, 1) and estimates of θ and ϕ in (10⁻¹⁰, 10) are considered reasonable. Given convergence, the MLEs obtained from only the observed data are exactly the same as those obtained from the augmented data, and the estimates of the SDs are also similar. Therefore, we present only the MLEs obtained from the augmented data. Table 2 shows mean parameter estimates, Monte Carlo standard deviations (SD of point estimates), mean model-estimated SDs and coverage rates of 95% confidence intervals (CI) based on model-estimated SDs for the two approaches using the augmented data. The IRLS method yielded about the same estimates of the parameters and SDs as the MLEs. The small differences between the IRLS estimates and the MLEs for b, SD(b̂) and $SD (\hat{θ})$ decrease as the sample size increases (not shown).
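The convergence criterion described above is easy to state in code. The function name `is_reasonable` is hypothetical; the bounds are the ones given in the text.

```python
def is_reasonable(b, p, theta, phi, lower=1e-10):
    """Convergence check from the text: b and p must lie in (1e-10, 1),
    theta and phi must lie in (1e-10, 10)."""
    probabilities_ok = lower < b < 1 and lower < p < 1
    efficacies_ok = lower < theta < 10 and lower < phi < 10
    return probabilities_ok and efficacies_ok


converged = is_reasonable(0.0051, 0.10, 0.40, 0.71)
```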

Table 2.

Comparison between MLEs and IRLS estimates based on the augmented data. Results are based on 1000 simulations.^†^‡

	Mean of Point Estimates		Monte Carlo SD		Mean of SD Estimates		Coverage of 95% CI
Parameter	MLE	IRLS	MLE	IRLS	MLE	IRLS	MLE	IRLS
b	0.0051	0.0051	0.00028	0.00028	0.00028	0.00027	95.5	95.2
p	0.10	0.10	0.011	0.011	0.011	0.011	94.9	96.1
θ	0.40	0.40	0.067	0.067	0.067	0.069	95.4	95.7
ϕ	0.71	0.71	0.13	0.13	0.13	0.13	95.4	94.7

^† True parameters are set to b=0.005, p=0.1, θ=0.40, ϕ=0.70.

^‡ MLEs are the same for observed and augmented data.

To compare the sensitivity of the three methods to starting parameter values when data are sparse, we reduced the true values of b from 0.005 to 0.002 and p from 0.1 to 0.01 so as to reduce transmissions within households. Under this setting, the average attack rates decreased to 39% for households and to 12% for contacts, and only 10% of the contacts were infected when receiving treatment. We ran simulations under different starting values of b and p, as log(p_ji(t)) is generally more sensitive to the transmission probabilities than to the efficacies. Simulation results including convergence rates and parameter estimates are compared in Table 3. Clearly, the ML method using only the observed data is highly sensitive to initial values of b and p. The convergence rate of the ML method using only the observed data was comparable to the methods using the augmented data when the iteration started from the true parameters, but dropped dramatically when starting from larger values (b = 0.02, p = 0.1) or smaller values (b = 0.0002, p = 0.001) of the probability parameters. In contrast, the convergence rate was relatively stable for the approaches using the augmented data, regardless of the starting values. Parameter estimates and associated Monte Carlo standard deviations were similar across methods, except that the IRLS method appeared to overestimate θ to a larger extent compared to the ML methods. All methods overestimated ϕ as a consequence of sparse data. In addition, the ML methods overestimated, while the IRLS method underestimated, the standard deviation of ϕ. For example, when starting from true values of b and p, the mean standard errors are 1.10, 1.16 and 0.78 (not shown in Table 3) for the MLE based on the observed data, the MLE based on the augmented data and the IRLS estimate of ϕ respectively, in contrast to Monte Carlo standard deviations 0.95, 0.96 and 0.93.

Table 3.

Comparing sensitivity to initial estimates between the ML method using observed data and the approaches using the augmented data when data are sparse. Results are based on 1000 simulations.^†

			Parameters^§§
Initial Values (b₀, p₀)^‡	Method^§	Conv. Rate (/1000)	b	p	θ	ϕ
(0.002, 0.01)
	ML(Obs)	903	0.0020 (0.00016)	0.010 (0.0049)	0.42 (0.25)	0.98 (0.95)
	ML(Aug)	889	0.0020 (0.00016)	0.010 (0.0048)	0.42 (0.24)	1.01 (0.96)
	IRLS(Aug)	937	0.0020 (0.00016)	0.011 (0.0047)	0.48 (0.24)	1.07 (0.93)
(0.02, 0.1)
	ML(Obs)	524	0.0020 (0.00016)	0.010 (0.0048)	0.41 (0.24)	1.19 (1.11)
	ML(Aug)	878	0.0020 (0.00016)	0.010 (0.0049)	0.42 (0.24)	0.99 (1.00)
	IRLS(Aug)	920	0.0020 (0.00016)	0.011 (0.0048)	0.48 (0.24)	1.07 (1.00)
(0.0002, 0.001)
	ML(Obs)	92	0.0020 (0.00016)	0.010 (0.0054)	0.38 (0.23)	1.04 (0.79)
	ML(Aug)	864	0.0020 (0.00015)	0.010 (0.0047)	0.44 (0.26)	1.03 (1.08)
	IRLS(Aug)	928	0.0020 (0.00015)	0.011 (0.0047)	0.49 (0.24)	1.08 (0.90)

^† True parameters are set to b=0.002, p=0.01, θ=0.40, ϕ=0.70.

^‡ Initial values for θ and ϕ are set to the true values.

^§ Obs: observed data, Aug: augmented data.

^§§ Values in the parentheses are Monte Carlo standard deviations.

As seen in Table 3, sparse data generally lead to biased and unstable efficacy estimates for the parametric methods, particularly for the IRLS method. At the same time, sparse data also increase the chance of non-convergence for the standard likelihood maximization algorithms. Household-level randomization, in which individuals in the same household receive the same treatments, provides much less information for estimating θ and ϕ separately compared to individual-level randomization with the same population size. More discussion on trial design issues can be found in Donner (1998), Datta, Halloran and Longini (1999), Halloran et al. (2006) and Yang et al. (2006).

4 Data Analysis

Two randomized multi-center efficacy trials of zanamivir, an inhaled influenza antiviral agent, were conducted during October 1998 - April 1999 (Hayden et al., 2000) and June 2000 - April 2001 (Monto et al., 2002). In both trials, households were randomized to zanamivir or placebo but only eligible household members (aged 5+ years) were treated. In the later trial, index cases were not treated. Characteristics of the two trials are given in Table 4.

Table 4.

Two randomized multi-center trials of zanamivir, an influenza antiviral agent

	Hayden et al., 2000	Monto et al., 2002
Time of trial	Oct. 1998 - Apr. 1999	Jun. 2000 - Apr. 2001
Households	336	484
Population	1186	1770
Index case randomization	Yes	No
Duration of medication
Index case	5 days	N/A
Contact	10 days	10 days
Follow up (symptom diary)	14 days	14 days
Infected^†/Symptomatic(index)	164/336	281/484
Infected^†/Exposed(contacts)
Control	52/435	76/626
Zanamivir	17/415	27/660

Numbers may slightly differ from references due to different criteria of data inclusion for analysis.

^† Laboratory-confirmed infections with clinical symptoms

The earlier trial adopted a typical household-level randomization, providing information about AVE_T = 1 − θϕ, if we assume multiplicativity between θ and ϕ, and the later trial contains information mainly about AVE_S. Neither trial alone provides any information about AVE_I, and thus we combine the two trials to estimate AVE_S and AVE_I simultaneously. While transmission probabilities and antiviral efficacies might differ from center to center, the limited sample size prohibits estimation of center-specific parameters. As a result, we assume all the centers in both trials share the same parameters. The two reference papers used slightly different definitions of clinical symptoms. We used the one in Monto et al. (2002) for both trials, i.e., presence of at least two of temperature ≥ 37.8°C or feverishness (counted as one), cough, headache, sore throat and myalgia. As it is well known that influenza is more transmissible among children, we assume age-specific transmission probabilities in two age groups, children (< 18) and adults (≥ 18). Our primary endpoint is laboratory-confirmed influenza with clinical symptoms (clinical infection). Households in both trials were followed from the ascertainment time of index cases, and the resulting selection bias was adjusted for following Yang et al. (2006) and Appendix C. In this adjustment, index cases were excluded from analyses regardless of laboratory results, but their effects on the exposure level of the contacts were considered.

Results are given in Table 5. For this data set, both ML methods converge and thus give the same MLEs. Prophylaxis with zanamivir showed significant preventive efficacy against clinical infection, with ${\hat{AVE}}_{S} = 0.75$ (95% C.I.=(0.56, 0.86)). Hence, a susceptible person taking zanamivir has the chance of developing influenza illness reduced by 75% per daily exposure to an untreated symptomatic infected person. Zanamivir did not show significant efficacy in reducing the infectiousness of infected people, with ${\hat{AVE}}_{I} = 0.23$ (95% C.I.=(−1.33, 0.75)). Assuming multiplicativity of θ and ϕ, the total efficacy AVE_T reached 0.81 (95% C.I.=(0.50, 0.93)). Based on final data of clinical influenza illness provided in Hayden et al. (2000) and Monto et al. (2002), similar AVE_T (0.80; 95% C.I.=(0.53, 0.91)) and AVE_S (0.84; 95% C.I.=(0.61, 0.90)) were reported by Halloran et al. (2006). They also reported AVE_S (0.75; 95% C.I.=(0.54, 0.86)), AVE_I (0.19; 95% C.I.=(−1.60, 0.75)) and AVE_T (0.87; 95% C.I.=(0.63, 0.95)) based on secondary attack rates (SAR) during 2-7 days since the ascertainment of index cases. These results differ in their interpretation.
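As a quick arithmetic check, the reported total efficacy follows from the two component estimates under multiplicativity, AVE_T = 1 − θϕ with θ = 1 − AVE_S and ϕ = 1 − AVE_I. The snippet uses the rounded point estimates quoted above, so the agreement is only up to rounding.

```python
ave_s_hat = 0.75                 # reported estimate of AVE_S
ave_i_hat = 0.23                 # reported estimate of AVE_I

theta_hat = 1 - ave_s_hat        # relative susceptibility under treatment
phi_hat = 1 - ave_i_hat          # relative infectiousness under treatment

ave_t_hat = 1 - theta_hat * phi_hat   # total efficacy under multiplicativity
```

The result is about 0.8075, which rounds to the reported value of 0.81.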

Table 5.

Estimates of efficacies and transmission probabilities by age (1-17 vs. 18+) for pooled zanamivir trials conducted in 1998-1999 and 2000-2001. Results are obtained by approaches using the augmented data.

	IRLS		MLE
Parameter	Point Estimate	SD	Point Estimate	SD	95% CI
b_c^†	0.0024	0.00052	0.0028	0.00063	(0.0017, 0.0042)
b_a	0.00086	0.00030	0.0010	0.00039	(0.00045, 0.0021)
p_cc^†	0.040	0.0074	0.040	0.0077	(0.027, 0.057)
p_ca	0.028	0.0045	0.029	0.0048	(0.021, 0.040)
p_ac	0.023	0.0071	0.020	0.0071	(0.009, 0.037)
p_aa	0.040	0.011	0.032	0.011	(0.016, 0.058)
AVE_S	0.68	0.086	0.75	0.072	(0.56, 0.86)
AVE_I	0.24	0.38	0.23	0.44	(−1.33, 0.75)
AVE_T			0.81	0.094	(0.50, 0.93)

^† Subscript c denotes child (1-17), a denotes adult (18+), and ca denotes child-to-adult transmission.

The estimated probability of infection from the common source per daily exposure is 0.0028 for children and 0.0010 for adults. Within households, the daily pairwise transmission probability is also higher in children $({\hat{p}}_{cc} = 0.040)$ than in adults $({\hat{p}}_{aa} = 0.032)$ . These estimates of transmission probabilities are comparable to those found in two trials of oseltamivir, another influenza antiviral agent, conducted about the same time in North America and Europe (Yang et al., 2006).

The IRLS estimates are fairly close to the MLEs except for p_aa and θ. In addition, the IRLS method might have under-estimated the SD for ϕ. The two trials combined still do not provide sufficient information for estimating ϕ, as suggested by the large SD for the MLE of ϕ. Starting estimates for all three methods were provided by a non-iteratively evaluated linear model (Appendix D). With a complementary log-log transformation for probability parameters and a log transformation for efficacy parameters, all three methods converge very well. Without such transformations, the Newton-Raphson procedure applied to the observed data converges if started from the IRLS estimates or the MLEs obtained via data augmentation but not from the non-iteratively obtained estimates, which confirms the relative robustness of the methods using data augmentation to starting estimates.

5 Discussion

By augmenting the observed sequential symptom onsets in close contact groups with unobserved daily pairwise transmission outcomes, we identified a likelihood that has a simpler form than the one based solely on observed data and that can be maximized via the EM algorithm. Reilly and Lawlor (1999) used a similar approach to study hepatitis C infection in women with known exposure to anti-D immunoglobulin in sequential years before testing. However, the presence of multiple infective sources in the same time interval and the involvement of latent and infectious periods of influenza make our situation more complex. This simple form of the likelihood offers the flexibility of using other potential methods, for instance, the Fisher-scoring method instead of the Newton-Raphson algorithm for iterative maximization. As another example, we derived from this likelihood a linear model fitted with the IRLS method in combination with the EM-analogous algorithm. In a simulation study, the two approaches using the augmented data performed better than the ML method using the observed data in terms of robustness to initial estimates, especially for sparse data. The IRLS method is the most robust to initial estimates, and asymptotically provides estimates of the same quality as the MLEs. The IRLS estimates are likely biased and have larger variances when data are sparse, but can serve as good initial estimates for the ML methods.

We have assumed known distributions for the latent and infectious periods and the coincidence between the latent and the incubation periods, which may not be realistic for some infectious diseases. If these assumptions do not hold, estimates could be biased and misleading. Cauchemez et al. (2004) used a Bayesian hierarchical model to allow estimation of the latent and infectious periods, assuming that the latent and the incubation periods were equal, but such estimation requires a sufficient number of cases. In addition, our models are limited to symptomatic infections. However, asymptomatic influenza infections can provide further information about the efficacies and transmission probabilities from a virological point of view, although such "silent" cases complicate the likelihood to a large extent. A future research topic of potential public health interest would be to extend our data augmentation scheme to a Bayesian framework that can estimate the natural history of the disease and take into account asymptomatic cases.

In the data analysis, index cases were excluded regardless of their laboratory test results. According to the rationale of adjustment for selection bias, i.e., conditioning on the symptom status (caused by true infection) of the index case on the ascertainment day, a test-negative index case should be viewed as susceptible and followed in the same way as contacts. However, not all clinical trials required symptom diaries for index cases after enrollment, e.g., the 2000-2001 trial of zanamivir. Households with test-negative index cases are generally excluded from calculations of SARs, but in our case the inclusion of the contacts in these households can improve estimation of b and θ and, to a lesser extent, of p. This issue could be resolved by improving the follow-up of index cases.

In this paper we have assumed fixed antiviral effects and non-random susceptibility. If sufficient data are available, random effects on the transmission probabilities as well as the antiviral efficacies could be considered to address potential heterogeneity among centers, households, or individuals (Longini and Halloran, 1996; Halloran, Préziosi and Chu, 2003).

With the potential for pandemic influenza a rising global concern, zanamivir is one of the major available influenza antiviral agents (Hayden, 2001). Our estimates can be used in modeling research to evaluate the effects of intervention options at different levels of contact groups (Longini et al., 2004; Longini et al., 2005; Germann et al., 2006). This research also emphasizes the need for proper study design for the parameters to be adequately estimated.

Acknowledgements

This work was partially supported by National Institute of Allergy and Infectious Diseases grant R01-AI32042. The data on the clinical trials of zanamivir were provided by GlaxoSmithKline Laboratories Inc.

Appendix A: Conditional Expected Frequency of Transmission Status

Define ${\bar{I}}_{ji} (t)$ as the event that a susceptible person i escapes infection from infective source j on day t. Note that the following basic facts hold:

$I_{i} (t) \cap I_{ji} (t) = I_{ji} (t)$.
$\Pr (I_{ji} (t) ∣ I_{i} (t) \cap S_{i} ({\tilde{t}}_{i})) = \Pr (I_{ji} (t) ∣ I_{i} (t))$.
${\bar{I}}_{ji} (t) \cap I_{i} (τ) = I_{i} (τ)$ for $τ > t$.
$\Pr ({\bar{I}}_{ji} (t) \cap I_{i} (τ)) = 0$ for $τ < t$.

Then,

\begin{matrix} \Pr (Z_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) & = \Pr (I_{ji} (t) ∣ S_{i} ({\tilde{t}}_{i})) = \Pr (I_{i} (t) \cap I_{ji} (t) ∣ S_{i} ({\tilde{t}}_{i})) \\ & = \Pr (I_{ji} (t) ∣ I_{i} (t) \cap S_{i} ({\tilde{t}}_{i})) \times \Pr (I_{i} (t) ∣ S_{i} ({\tilde{t}}_{i})) \\ & = \Pr (I_{ji} (t) ∣ I_{i} (t)) \times \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (t)) \times \Pr (I_{i} (t))}{\Pr (S_{i} ({\tilde{t}}_{i}))} \\ & = \frac{\Pr (I_{ji} (t) \cap I_{i} (t))}{\Pr (I_{i} (t))} \times \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (t)) \times \Pr (I_{i} (t))}{\Pr (S_{i} ({\tilde{t}}_{i}))} \\ & = \frac{\Pr (I_{ji} (t))}{\Pr (S_{i} ({\tilde{t}}_{i}))} \times \Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (t)) \end{matrix}

(8)

and

\begin{matrix} \Pr ({\bar{Z}}_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) & = \Pr ({\bar{I}}_{ji} (t) ∣ S_{i} ({\tilde{t}}_{i})) = Σ_{τ = t}^{\overset{‒}{t_{i}}} \Pr ({\bar{I}}_{ji} (t) \cap I_{i} (τ) ∣ S_{i} ({\tilde{t}}_{i})) \\ & = Σ_{τ = t}^{\overset{‒}{t_{i}}} \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ {\bar{I}}_{ji} (t) \cap I_{i} (τ)) \times \Pr ({\bar{I}}_{ji} (t) \cap I_{i} (τ))}{\Pr (S_{i} ({\tilde{t}}_{i}))} \\ & = \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ {\bar{I}}_{ji} (t) \cap I_{i} (t)) \times \Pr ({\bar{I}}_{ji} (t) \cap I_{i} (t))}{\Pr (S_{i} ({\tilde{t}}_{i}))} \\ & \quad + Σ_{τ = t + 1}^{\overset{‒}{t_{i}}} \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (τ)) \times \Pr (I_{i} (τ))}{\Pr (S_{i} ({\tilde{t}}_{i}))} \\ & = \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (t)) \times \{\Pr (I_{i} (t)) - \Pr (I_{ji} (t))\}}{\Pr (S_{i} ({\tilde{t}}_{i}))} \\ & \quad + Σ_{τ = t + 1}^{\overset{‒}{t_{i}}} \frac{\Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (τ)) \times \Pr (I_{i} (τ))}{\Pr (S_{i} ({\tilde{t}}_{i}))} . \end{matrix}

(9)
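The final lines of (8) and (9) can be evaluated numerically once the component probabilities are available. The Python sketch below is illustrative only: the function name, the list-based inputs, and the numbers are hypothetical, and it assumes that Pr(S_i(t̃_i)) is obtained by the law of total probability over the candidate infection days of person i.

```python
def conditional_expectations(pr_I_ji, pr_I_i, pr_S_given_I):
    """Evaluate the last lines of (8) and (9) for one susceptible i and one source j.

    Each input is a list indexed by the candidate infection days of person i,
    in increasing order of t (hypothetical inputs):
      pr_I_ji[k]      -- Pr(I_ji(t)): i is infected by source j on day t
      pr_I_i[k]       -- Pr(I_i(t)): i is infected by any source on day t
      pr_S_given_I[k] -- Pr(S_i(t~_i) | I_i(t)): onset on the observed day
    """
    # Assumption: symptom onset implies infection on one of the candidate days,
    # so Pr(S_i) is a sum over those days.
    pr_S = sum(s * q for s, q in zip(pr_S_given_I, pr_I_i))
    n = len(pr_I_i)
    z, z_bar = [], []
    for k in range(n):
        # Equation (8): i is infected by j on day t, given the observed onset.
        z.append(pr_I_ji[k] * pr_S_given_I[k] / pr_S)
        # Equation (9): i escapes j on day t and is infected by another source
        # on day t, or is infected on a later day.
        same_day = pr_S_given_I[k] * (pr_I_i[k] - pr_I_ji[k])
        later = sum(pr_S_given_I[m] * pr_I_i[m] for m in range(k + 1, n))
        z_bar.append((same_day + later) / pr_S)
    return z, z_bar

# Hypothetical probabilities for three candidate infection days.
pr_I_i = [0.05, 0.04, 0.03]
pr_I_ji = [0.02, 0.015, 0.0]
pr_S_given_I = [0.2, 0.5, 0.3]
z, z_bar = conditional_expectations(pr_I_ji, pr_I_i, pr_S_given_I)
```

Under the stated assumption, z[k] + z_bar[k] equals the conditional probability that person i was still uninfected at the start of day t, so the sum is 1 on the first candidate day and decreases afterwards.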

Appendix B: Generalization of the Linear Model to Heterogeneous Populations

For a heterogeneous population composed of k risk categories of people (e.g., age groups), let p_vu be the pairwise transmission probability per unprotected contact between a susceptible individual in category u and an infective person in category v. Further, let b_u be the probability of infection from the common source for category u. Assume that the AVE_S and the AVE_I are the same for all categories for notational simplicity. The models can be easily generalized to situations with heterogeneous efficacies as well. There are k parameters for common source transmission probabilities and k² parameters for household transmission probabilities.

Let group k be the reference stratum. The model in matrix form derived from (7) would be

\log (p_{ji} (t)) = {β^{(b)}}^{τ} I_{i} + {J_{i}}^{τ} β^{(p)} I_{i} + β^{(θ)} r_{i} (t) + β^{(ϕ)} r_{j} (t) + \log (f_{j} (t ∣ {\tilde{t}}_{j})),

(10)

where β^(θ) = log(θ), β^(ϕ) = log(ϕ), and

\begin{matrix} I_{i} & = {(I_{{i \in 1}}, \dots, I_{{i \in k - 1}}, 1)}^{τ}, \\ J_{i} & = {(ψ_{j} I_{{j \in 1}}, \dots, ψ_{j} I_{{j \in k - 1}}, 1)}^{τ}, \\ β^{(b)} & = {(β_{1}^{(b)}, \dots, β_{k}^{(b)})}^{τ} = {(\log (\frac{b_{1}}{b_{k}}), \dots, \log (\frac{b_{k - 1}}{b_{k}}), \log (b_{k}))}^{τ}, \\ β^{(p)} & = \{β_{vu}^{(p)}\}_{k \times k} = \left\{ \begin{matrix} \log (\frac{p_{11} p_{kk}}{p_{1 k} p_{k 1}}) & \dots & \log (\frac{p_{1 (k - 1)} p_{kk}}{p_{1 k} p_{k (k - 1)}}) & \log (\frac{p_{1 k}}{p_{kk}}) \\ ⋮ & ⋱ & ⋮ & ⋮ \\ \log (\frac{p_{(k - 1) 1} p_{kk}}{p_{(k - 1) k} p_{k 1}}) & \dots & \log (\frac{p_{(k - 1) (k - 1)} p_{kk}}{p_{(k - 1) k} p_{k (k - 1)}}) & \log (\frac{p_{(k - 1) k}}{p_{kk}}) \\ \log (\frac{p_{k 1} b_{k}}{p_{kk} b_{1}}) & \dots & \log (\frac{p_{k (k - 1)} b_{k}}{p_{kk} b_{k - 1}}) & \log (\frac{p_{kk}}{b_{k}}) \end{matrix} \right\} . \end{matrix}
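As a worked check of this parameterization (an illustration, not part of the original derivation), take an infective household member j in category v < k and a susceptible person i in category u < k, and assume that ψ_j = 1 for a household source. Then I_i has ones in positions u and k, J_i has ones in positions v and k, and the category-specific terms of (10) become

\begin{matrix} {β^{(b)}}^{τ} I_{i} + {J_{i}}^{τ} β^{(p)} I_{i} & = β_{u}^{(b)} + β_{k}^{(b)} + β_{vu}^{(p)} + β_{vk}^{(p)} + β_{ku}^{(p)} + β_{kk}^{(p)} \\ & = \log (\frac{b_{u}}{b_{k}}) + \log (b_{k}) + \log (\frac{p_{vu} p_{kk}}{p_{vk} p_{ku}}) + \log (\frac{p_{vk}}{p_{kk}}) + \log (\frac{p_{ku} b_{k}}{p_{kk} b_{u}}) + \log (\frac{p_{kk}}{b_{k}}) \\ & = \log (p_{vu}) . \end{matrix}

All terms in b cancel, so in this case the coefficients reproduce the pairwise transmission probability p_vu on the log scale, with the remaining terms of (10) left unchanged.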

Appendix C: Adjustment for Selection Bias in Case-ascertained Follow-up Design

In a prospective follow-up design, exposure to risks of infection starts on day 1. However, in real clinical trials, households are generally enrolled when one or more index cases are identified by symptom onsets, to which we refer as a case-ascertained design. To reduce bias caused by such selective enrollment, Yang et al. (2006) suggest that the individual likelihood contributions be conditioned on observed symptom status up to the symptom onset day of the index case. The consequences of such adjustment are the following:

Index cases do not contribute to the likelihood.
The likelihood calculation for person i starts from the day $\underset{‒}{t_{d_{i}}} + 1$ , where d_i denotes the index case in the household of person i.
The term log(A_i) is subtracted from the individual log-likelihood, where
$A_{i} = Σ_{t = \underset{‒}{t_{d_{i}}} + 1}^{\overset{‒}{t_{d_{i}}}} \{(\prod_{τ = \underset{‒}{t_{d_{i}}} + 1}^{t - 1} e_{i} (τ)) (1 - e_{i} (t)) \Pr (\tilde{t} > {\tilde{t}}_{d_{i}} ∣ t)\} + \prod_{t = \underset{‒}{t_{d_{i}}} + 1}^{\overset{‒}{t_{d_{i}}}} e_{i} (t) .$ (11)

For the ML method using the augmented data, the same adjustment can be applied. For the linear model method, such a conditional adjustment is difficult. However, since minimizing the weighted least squares is analogous to maximizing the log-likelihood, it is natural to use the same adjusting term to penalize the objective function

Σ_{h = 1}^{H} ω_{h} \{\log ({\tilde{P}}_{h}) - \log (P_{h})\}^{2} + \underset{i}{Σ} \log (A_{i} (β)),

where A_i is re-expressed as a function of β = (β₀, … ,β₃). With the covariate matrix denoted by X, the diagonal weight matrix by W, and the observed response vector by $\log (\tilde{P})$, the update at the l^th iteration is

{\hat{β}}_{l} = {(X^{'} W_{l - 1} X)}^{- 1} \left\{ X^{'} W_{l - 1} \log (\tilde{P}) - \frac{1}{2} \underset{i}{Σ} \frac{d \log (A_{i} ({\hat{β}}_{l - 1}))}{d {\hat{β}}_{l - 1}} \right\} .
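A single update of this form can be sketched in plain Python. The sketch reads the update as the inverse of X′WX applied to the whole difference between X′W log(P̃) and half the summed gradient of log(A_i). The helper names, the small linear solver, and the toy data are all hypothetical, and the gradient is supplied by the caller rather than derived from (11).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def penalized_wls_step(X, w, y, grad_log_A):
    """One update: beta = (X'WX)^{-1} {X'W y - 0.5 * sum_i d log(A_i) / d beta}.

    X          -- covariate matrix as a list of rows
    w          -- diagonal of the weight matrix W from the previous iteration
    y          -- observed responses log(P~)
    grad_log_A -- summed gradient of log(A_i) at the previous estimate
    """
    n, p = len(X), len(X[0])
    xtwx = [[sum(w[h] * X[h][a] * X[h][b] for h in range(n)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(w[h] * X[h][a] * y[h] for h in range(n)) for a in range(p)]
    rhs = [xtwy[a] - 0.5 * grad_log_A[a] for a in range(p)]
    return solve(xtwx, rhs)

# Toy data (hypothetical): responses that are exactly linear in one covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
w = [1.0, 2.0, 1.0]
y = [-3.0, -2.5, -2.0]
beta_plain = penalized_wls_step(X, w, y, [0.0, 0.0])  # no penalty term
beta_pen = penalized_wls_step(X, w, y, [0.4, 0.0])    # with a penalty gradient
```

With a zero gradient the step reduces to ordinary weighted least squares. In a full fit the weights and the gradient would be recomputed from the current estimate before each step.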

Appendix D: Non-iteratively Fitted Linear Model for Initial Estimates

The ML and IRLS methods require initial estimates to start the iteration. Iteration could be avoided if we model I_i(t) instead of I_ji(t), i.e., infection status of person i on day t instead of pairwise transmission, and assume equal Pr(I_i(t)) for all $\underset{‒}{t_{i}} \leq t \leq \overset{‒}{t_{i}}$ .

Let N_i(t) be the number of treated infective individuals and M_i(t) be the number of untreated infective individuals that a susceptible person i is exposed to within the household on day t. Given N_i(t) and M_i(t), the probability that person i is infected on day t is given by

p_{i} (t) = 1 - {(1 - b)}^{1 - r_{i} (t)} {(1 - θ b)}^{r_{i} (t)} \times {(1 - p)}^{M_{i} (t) (1 - r_{i} (t))} {(1 - θ p)}^{M_{i} (t) r_{i} (t)} {(1 - ϕ p)}^{N_{i} (t) (1 - r_{i} (t))} {(1 - θ ϕ p)}^{N_{i} (t) r_{i} (t)} .

A reparameterization leads to

\log (1 - p_{i} (t)) = β_{0} + β_{1} r_{i} (t) + β_{2} M_{i} (t) + β_{3} r_{i} (t) M_{i} (t) + β_{4} N_{i} (t) + β_{5} r_{i} (t) N_{i} (t) .

(12)

where

\begin{matrix} β_{0} = \log (1 - b), \quad β_{1} = \log (\frac{1 - θ b}{1 - b}), \quad β_{2} = \log (1 - p), \quad β_{3} = \log (\frac{1 - θ p}{1 - p}), \\ β_{4} = \log (1 - ϕ p), \quad \text{and} \quad β_{5} = \log (\frac{1 - θ ϕ p}{1 - ϕ p}) . \end{matrix}
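The reparameterization can be checked numerically. The Python sketch below, with hypothetical function names and illustrative parameter values, computes p_i(t) from the product form and compares log(1 − p_i(t)) with the linear form (12) for several combinations of r_i(t), M_i(t), and N_i(t).

```python
import math

def infection_prob(b, p, theta, phi, r, M, N):
    """Daily infection probability p_i(t) for treatment status r (0 or 1),
    with M untreated and N treated infective household members."""
    escape = ((1 - b) ** (1 - r) * (1 - theta * b) ** r
              * (1 - p) ** (M * (1 - r)) * (1 - theta * p) ** (M * r)
              * (1 - phi * p) ** (N * (1 - r)) * (1 - theta * phi * p) ** (N * r))
    return 1 - escape

def betas(b, p, theta, phi):
    """Coefficients beta_0, ..., beta_5 of model (12)."""
    return [math.log(1 - b),
            math.log((1 - theta * b) / (1 - b)),
            math.log(1 - p),
            math.log((1 - theta * p) / (1 - p)),
            math.log(1 - phi * p),
            math.log((1 - theta * phi * p) / (1 - phi * p))]

def log_escape_linear(beta, r, M, N):
    """Right-hand side of (12)."""
    return (beta[0] + beta[1] * r + beta[2] * M + beta[3] * r * M
            + beta[4] * N + beta[5] * r * N)

# Illustrative values only; they are not estimates from the trials.
b, p, theta, phi = 0.003, 0.04, 0.3, 0.8
beta = betas(b, p, theta, phi)
max_gap = max(
    abs(math.log(1 - infection_prob(b, p, theta, phi, r, M, N))
        - log_escape_linear(beta, r, M, N))
    for r in (0, 1) for M in range(3) for N in range(3)
)
```

The largest discrepancy, max_gap, should be at the level of floating-point rounding, since (12) is an exact rewriting of the product form for r_i(t) equal to 0 or 1.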

Let Y_i(t) indicate the infection status (1:infection, 0:escape) for person i on day t. Similar to Section 2.1, define

Z_{i} (t) = Y_{i} (t) \prod_{τ < t} (1 - Y_{i} (τ))

and

{\bar{Z}}_{i} (t) = \prod_{τ \leq t} (1 - Y_{i} (τ)) .

Z_i(t) = 1 is the event that person i escapes infection from all sources before day t and is infected on day t, while ${\bar{Z}}_{i} (t) = 1$ is the event that person i escapes infection from all sources up to and including day t. Assume that Pr(I_i(t)) is equal for all $t \in [\underset{‒}{t_{i}}, \overset{‒}{t_{i}}]$ . Then the conditional probabilities

\begin{matrix} \Pr (Z_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) & = \Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (t)), \\ \Pr ({\bar{Z}}_{ji} (t) = 1 ∣ b, p, θ, ϕ, {\tilde{t}}_{i}) & = Σ_{τ = t + 1}^{\overset{‒}{t_{i}}} \Pr (S_{i} ({\tilde{t}}_{i}) ∣ I_{i} (τ)), \end{matrix}

do not involve unknown parameters, and can be used as the weights for fitting (12). While N_i(t) and M_i(t) are generally unknown, they can be obtained by randomly sampling the duration of infectious period for each infective individual according to the known empirical distribution f. Alternatively, all possible combinations of N_i(t) and M_i(t) can contribute to model (12) with the weights multiplied by the joint probability Pr (N_i(t), M_i(t)) derived from f.

Model (12) gives rise to multiple estimators for the efficacy parameters because of the increase in parameter dimension:

{\hat{θ}}_{1} = \frac{1 - \exp ({\hat{β}}_{0} + {\hat{β}}_{1})}{1 - \exp ({\hat{β}}_{0})}, \quad {\hat{θ}}_{2} = \frac{1 - \exp ({\hat{β}}_{2} + {\hat{β}}_{3})}{1 - \exp ({\hat{β}}_{2})} \quad \text{and} \quad {\hat{θ}}_{3} = \frac{1 - \exp ({\hat{β}}_{4} + {\hat{β}}_{5})}{1 - \exp ({\hat{β}}_{4})}

for θ and

{\hat{ϕ}}_{1} = \frac{1 - \exp ({\hat{β}}_{4})}{1 - \exp ({\hat{β}}_{2})} \quad \text{and} \quad {\hat{ϕ}}_{2} = \frac{1 - \exp ({\hat{β}}_{4} + {\hat{β}}_{5})}{1 - \exp ({\hat{β}}_{2} + {\hat{β}}_{3})}

for ϕ. The average of the multiple estimates weighted by reciprocal standard errors can serve as the initial estimate, e.g., $\hat{θ} = Σ_{i = 1}^{3} ω_{i} {\hat{θ}}_{i}$ , where $ω_{i} = \frac{1 / \mathrm{s.e.} ({\hat{θ}}_{i})}{Σ_{j = 1}^{3} 1 / \mathrm{s.e.} ({\hat{θ}}_{j})}$ .
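The estimators above and their weighted average can be sketched in Python as follows. The function names are hypothetical, the coefficients are generated from known illustrative parameter values rather than fitted, and the standard errors are made-up placeholders, since how they are obtained is not specified here.

```python
import math

def efficacy_estimates(beta):
    """Candidate estimates of theta and phi from the coefficients of model (12)."""
    b0, b1, b2, b3, b4, b5 = beta
    theta_hats = [(1 - math.exp(b0 + b1)) / (1 - math.exp(b0)),
                  (1 - math.exp(b2 + b3)) / (1 - math.exp(b2)),
                  (1 - math.exp(b4 + b5)) / (1 - math.exp(b4))]
    phi_hats = [(1 - math.exp(b4)) / (1 - math.exp(b2)),
                (1 - math.exp(b4 + b5)) / (1 - math.exp(b2 + b3))]
    return theta_hats, phi_hats

def weighted_by_reciprocal_se(estimates, std_errors):
    """Average of the estimates with weights proportional to 1 / s.e."""
    inv = [1.0 / s for s in std_errors]
    total = sum(inv)
    return sum(v / total * e for v, e in zip(inv, estimates))

# Coefficients implied by illustrative true values (not trial estimates).
b, p, theta, phi = 0.003, 0.04, 0.3, 0.8
beta = [math.log(1 - b), math.log((1 - theta * b) / (1 - b)),
        math.log(1 - p), math.log((1 - theta * p) / (1 - p)),
        math.log(1 - phi * p), math.log((1 - theta * phi * p) / (1 - phi * p))]
theta_hats, phi_hats = efficacy_estimates(beta)

# Hypothetical standard errors, used only to form the weights.
theta_init = weighted_by_reciprocal_se(theta_hats, [0.1, 0.2, 0.4])
phi_init = weighted_by_reciprocal_se(phi_hats, [0.3, 0.5])
```

With exact coefficients all candidate estimates coincide with the true values, so the weighted averages return θ and ϕ regardless of the weights. With fitted coefficients the candidates differ, and the reciprocal standard errors give more weight to the more precise ones.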

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

Addy CL, Longini IM, Haber MJ. A generalized stochastic model for the analysis of infectious disease final size data. Biometrics. 1991;47:961–974.
Becker NG. Analysis of Infectious Disease Data. Chapman and Hall; New York, NY: 1989.
Becker NG, Hasofer AM. Estimation in Epidemics with Incomplete Observations. Journal of the Royal Statistical Society, Series B. 1997;59:415–429.
Becker NG, Britton T, O'Neill PD. Estimating vaccine effects on transmission of infection from household outbreak data. Biometrics. 2003;59:467–475. doi: 10.1111/1541-0420.00056.
Cauchemez S, Carrat F, Viboud C, Valleron AJ, Boëlle PY. A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data. Statistics in Medicine. 2004;23:3469–3487. doi: 10.1002/sim.1912.
Datta S, Halloran ME, Longini IM. Efficiency of estimating vaccine efficacy for susceptibility and infectiousness: randomization by individual versus household. Biometrics. 1999;55:792–798. doi: 10.1111/j.0006-341x.1999.00792.x.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
Donner A. Some aspects of the design and analysis of cluster randomized trials. Statistics in Medicine. 1998;47:95–113.
Elveback LR, Fox JP, Ackerman E. An influenza simulation model for immunization studies. American Journal of Epidemiology. 1976;103:152–165. doi: 10.1093/oxfordjournals.aje.a112213.
Germann TC, Kadau K, Longini IM, Macken CA. Mitigation strategies for pandemic influenza in the United States. Proceedings of the National Academy of Sciences of the U.S.A. 2006;103:5935–5940. doi: 10.1073/pnas.0601266103.
Halloran ME, Struchiner CJ, Longini IM. Study designs for different efficacy and effectiveness aspects of vaccination. American Journal of Epidemiology. 1997;146:789–803. doi: 10.1093/oxfordjournals.aje.a009196.
Halloran ME, Préziosi M-P, Chu H. Estimating vaccine efficacy from secondary attack rates. Journal of the American Statistical Association. 2003;98:38–46.
Halloran ME, Hayden FG, Yang Y, Longini IM, Monto AS. Antiviral effects on influenza viral transmission and pathogenicity: observations from household-based trials. American Journal of Epidemiology. 2006;165:212–222. doi: 10.1093/aje/kwj362.
Hayden FG, Gubareva LV, Monto AS, Klein TC, Elliott MJ, Hammond JM, Sharp SJ, Ossi MJ, Zanamivir Family Study Group. Inhaled zanamivir for the prevention of influenza in families. New England Journal of Medicine. 2000;343:1282–1289. doi: 10.1056/NEJM200011023431801.
Hayden FG. Perspectives on antiviral use during pandemic influenza. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences. 2001;356:1877–1884. doi: 10.1098/rstb.2001.1007.
Longini IM, Koopman JS. Household and Community Transmission Parameters from Final Distributions of Infections in Households. Biometrics. 1982;38:115–126.
Longini IM, Halloran ME. A frailty mixture model for estimating vaccine efficacy. Journal of the Royal Statistical Society, Series C. 1996;45:165–173.
Longini IM, Halloran ME, Nizam A, Yang Y. Containing pandemic influenza with antiviral agents. American Journal of Epidemiology. 2004;159:623–633. doi: 10.1093/aje/kwh092.
Longini IM, Nizam A, Xu S, Ungchusak K, Hanshaoworakul W, Cummings DAT, Halloran ME. Containing pandemic influenza at the source. Science. 2005;309:1083–1087. doi: 10.1126/science.1115717.
Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B. 1982;44:226–233.
Magder L, Brookmeyer R. Analysis of infectious disease data from partner studies with unknown source of infection. Biometrics. 1993;49:1110–1116.
Monto AS, Pichichero ME, Blanckenberg SJ, Ruuskanen O, Cooper C, Fleming DM, Kerr C. Zanamivir prophylaxis: an effective strategy for the prevention of influenza types A and B within households. Journal of Infectious Diseases. 2002;186:1582–1588. doi: 10.1086/345722.
O'Neill P, Roberts GO. Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society, Series A. 1999;162:121–129.
O'Neill P, Balding DJ, Becker NG, Eerola M, Mollison D. Analyses of infectious disease data from household outbreaks by Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series C. 2000;49:517–542.
Paap R. What are the advantages of MCMC based inference in latent variable models? Statistica Neerlandica. 2002;56:2–22.
Rampey AH, Longini IM, Haber MJ, Monto AS. A discrete-time model for the statistical analysis of infectious disease incidence data. Biometrics. 1992;48:117–128.
Reilly M, Lawlor E. A likelihood-based method for identifying contaminated lots of blood product. International Journal of Epidemiology. 1999;28:787–792. doi: 10.1093/ije/28.4.787.
van Dyk DA, Meng X. The art of data augmentation. Journal of Computational and Graphical Statistics. 2001;10:1–50.
Welliver R, Monto AS, Carewicz O, Schattemanet E, Hassman M, Hedrick J, Jackson HC, Huson L, Ward P, Oxford JS. Effectiveness of oseltamivir in preventing influenza in household contacts: a randomized controlled trial. Journal of the American Medical Association. 2001;285:748–754. doi: 10.1001/jama.285.6.748.
Yang Y, Longini IM, Halloran ME. Design and evaluation of prophylactic interventions using infectious disease incidence data from close contact groups. Journal of the Royal Statistical Society, Series C. 2006;55:317–330. doi: 10.1111/j.1467-9876.2006.00539.x.
