PLOS Computational Biology. 2021 Jan 20;17(1):e1008601. doi: 10.1371/journal.pcbi.1008601

Estimating and interpreting secondary attack risk: Binomial considered biased

Yushuf Sharker 1, Eben Kenah 2,*
Editor: Cecile Viboud3
PMCID: PMC7850487  PMID: 33471806

Abstract

The household secondary attack risk (SAR), often called the secondary attack rate or secondary infection risk, is the probability of infectious contact from an infectious household member A to a given household member B, where we define infectious contact to be a contact sufficient to infect B if he or she is susceptible. Estimation of the SAR is an important part of understanding and controlling the transmission of infectious diseases. In practice, it is most often estimated using binomial models such as logistic regression, which implicitly attribute all secondary infections in a household to the primary case. In the simplest case, the number of secondary infections in a household with m susceptibles and a single primary case is modeled as a binomial(m, p) random variable where p is the SAR. Although it has long been understood that transmission within households is not binomial, it is thought that multiple generations of transmission can be neglected safely when p is small. We use probability generating functions and simulations to show that this is a mistake. The proportion of susceptible household members infected can be substantially larger than the SAR even when p is small. As a result, binomial estimates of the SAR are biased upward and their confidence intervals have poor coverage probabilities even if adjusted for clustering. Accurate point and interval estimates of the SAR can be obtained using longitudinal chain binomial models or pairwise survival analysis, which account for multiple generations of transmission within households, the ongoing risk of infection from outside the household, and incomplete follow-up. We illustrate the practical implications of these results in an analysis of household surveillance data collected by the Los Angeles County Department of Public Health during the 2009 influenza A (H1N1) pandemic.

Author summary

The household secondary attack risk (SAR), often called the secondary attack rate or secondary infection risk, is the probability of infectious contact from an infectious household member A to a given household member B, where we define infectious contact to be a contact sufficient to infect B if he or she is susceptible. The most common statistical models used to estimate the SAR are binomial models such as logistic regression, which implicitly assume that all secondary infections in a household are infected by the primary case. Here, we use analytical calculations and simulations to show that estimation of the SAR must account for multiple generations of transmission within households. As an example, we show that binomial models and statistical models that account for multiple generations of within-household transmission reach different conclusions about the household SAR for 2009 influenza A (H1N1) in Los Angeles County, with the latter models fitting the data better. In an epidemic, accurate estimation of the SAR allows rigorous evaluation of the effectiveness of public health interventions such as social distancing, prophylaxis or treatment, and vaccination.

Introduction

In infectious disease epidemiology, the household secondary attack risk (SAR) is the probability of infectious contact from an infected household member A to a susceptible household member B during A’s infectious period, where we define infectious contact as a contact sufficient to infect B if he or she is susceptible. It is often called the secondary attack rate, but we prefer to call it a risk because it is a probability [1]. SARs can also be defined in other groups of close contacts, such as schools or hospital wards [2].

The SAR is used to assess the transmissibility of disease and to evaluate control measures [3–7]. The idea was originally developed by Charles V. Chapin in 1903 to study the transmission of diphtheria and scarlet fever, and it was extended to influenza, tuberculosis, and other infectious diseases by Wade Hampton Frost [8–10]. Household surveillance data from emerging infections are often used to estimate the SAR, including 1957 and 1968 pandemic influenza [11–13], meningococcal disease [2], pertussis [6], SARS coronavirus [14], seasonal influenza [15–18], rotavirus [19], 2009 pandemic influenza A (H1N1) [20–26], MERS coronavirus [27, 28], Ebola virus disease [29–31], norovirus [32, 33], hand-foot-and-mouth disease [34], cryptosporidium [35], measles [36], and COVID-19 [37, 38].

It has been understood that within-household transmission is not binomial since the work of En’ko in 1899 [39], Reed and Frost in 1928 [40], and Greenwood in 1931 [41]. The process is binomial only if the primary case (the first infected household member [42]) is the only possible source of infection for susceptible household members throughout his or her infectious period. However, binomial models continue to be used for the estimation of the SAR because it is thought that multiple generations of transmission within households can be neglected safely when the SAR is small. In its simplest form, this assumes that the number of secondary infections in a household with m susceptible individuals and a single primary case is a binomial(m, p) random variable, where p is the household SAR. A given transmission path of length k from a primary case A to a given susceptible B has probability $p^k$, which decays exponentially as k increases. Up to and including the COVID-19 pandemic, the vast majority of studies of household transmission use a binomial model (often a logistic regression model) to estimate the household SAR [2, 6, 7, 9, 11–13, 19, 22–28, 30–32, 35–38]. A smaller number of studies have used explicit statistical models of transmission [15–18, 20, 21, 29, 33, 43]. Here, we hope to establish that the latter approach should become universal.

Although the probability of each given transmission path of length k from A to B decays as $p^k$, the risk of infection through k generations of transmission also depends on the number of possible paths of length k. A path of length k ≥ 1 from A to B can be specified by choosing k − 1 individuals from the m − 1 susceptible household members other than B. Each ordering of these k − 1 individuals produces a unique transmission path. For 1 ≤ k ≤ m, the total number of paths from A to B of length k equals the number of permutations of k − 1 objects chosen from m − 1 objects:

$$P(m-1, k-1) = \frac{(m-1)!}{(m-k)!}. \tag{1}$$

Table 1 shows that the number of paths of length k can grow quickly with household size. Each path can carry infection from A to B, so the total risk of transmission from A to B along any path of length k can be much greater than $p^k$. A binomial model attributes this additional risk of infection to direct transmission from the primary case, so the estimated SAR is too high.

Table 1. Number of paths from the primary case to a given susceptible.

| Susceptibles (m) | Path length k = 1 | k = 2 | k = 3 | k = 4 |
|---|---|---|---|---|
| 2 | 1 | 1 | 0 | 0 |
| 4 | 1 | 3 | 6 | 6 |
| 9 | 1 | 8 | 56 | 336 |
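The path counts in Table 1 follow directly from Eq (1). The short Python sketch below reproduces them; the function names `n_paths` and `path_table` are hypothetical and chosen only for this illustration.

```python
from math import perm  # available in Python 3.8+


def n_paths(m: int, k: int) -> int:
    """Number of transmission paths of length k from the primary case A
    to a given susceptible B in a household with m susceptibles (Eq 1)."""
    if m < 1 or k < 1:
        raise ValueError("need m >= 1 and k >= 1")
    # Order k - 1 intermediaries chosen from the m - 1 susceptibles other
    # than B. math.perm returns 0 when k - 1 > m - 1 (no such path exists).
    return perm(m - 1, k - 1)


def path_table(sizes=(2, 4, 9), max_k=4):
    """Rows of Table 1: one list of path counts (k = 1..max_k) per m."""
    return {m: [n_paths(m, k) for k in range(1, max_k + 1)] for m in sizes}


if __name__ == "__main__":
    for m, row in path_table().items():
        print(m, row)
```

With nine susceptibles there are already 336 distinct four-step paths to each susceptible, which is why the combined risk along indirect paths need not be negligible even when each path has small probability.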

The binomial variance assumes that infections in different household members are independent. Because each new infection in a household increases the risk of infection in the remaining susceptibles, infections within a household are positively correlated. This correlation makes the true variance in the number of infections larger than the binomial variance. To address this issue, cluster-adjusted variances [6, 21, 23, 24, 31] and random effects [32] have been used to account for correlation among household members. Because of the bias in the point estimate of the SAR, this adjustment for clustering does not produce confidence intervals that have the expected coverage probabilities.

When the latent period (between infection and the onset of infectiousness) and incubation period (between infection and the onset of symptoms) are longer than the infectious period, generations of infection can be separated in time. This was seen most famously by Peter Panum in a measles epidemic on the Faroe Islands in 1846 [44]. With such separation, a binomial model could be used to estimate the risk of infection within a follow-up interval designed to capture only the first generation of transmission. However, such separation of generations is unusual. For example, the incubation period of influenza is roughly 1–2 days and the duration of viral shedding is 4–6 days [45, 46]. For influenza and most other infectious diseases, the follow-up times of households cannot be adjusted to capture exactly one generation of transmission.

In its original usage, the SAR was defined as the probability that a susceptible in a household with a primary case is infected by within-household transmission, whether or not there were multiple generations of transmission within the household [8, 9]. Here, we will call this the household final attack risk (FAR). With complete follow-up of all households, a cluster-adjusted binomial model could produce an unbiased estimate of the FAR. However, the estimated FAR will be biased upward if there are co-primary cases or if household members are at risk of infection from outside the household during the follow-up period [3, 9, 47]. Such conditions are common in practice.

In its modern interpretation, the household SAR is an extremely useful measure of the transmissibility of infection. However, this interpretation requires us to abandon the use of binomial models for estimation. Here, we use probability generating functions and simulations to show that (1) a binomial model produces biased estimates of the household SAR even when the probability of transmission is small and (2) cluster adjustment of the variances does not produce interval estimates with the expected coverage probabilities. To estimate the household SAR, explicit statistical models of disease transmission such as longitudinal chain binomial models [40, 48] or pairwise survival analysis [49–52] should always be used. We illustrate the practical implications of these results using household surveillance data collected by the Los Angeles County Department of Public Health during the 2009 influenza A (H1N1) pandemic. In these data, binomial models produce SAR estimates that are too high to be interpreted as probabilities of transmission.

Methods

For simplicity, our analytical calculations and simulations assume a uniform SAR within households (i.e., no variation in infectiousness or susceptibility) and no risk of infection from outside the household except for the primary case. These assumptions are not realistic, but binomial models break down even under these ideal conditions. We use probability generating functions (PGFs) to calculate the true outbreak size distributions at different combinations of the number of susceptibles (m) and the SAR (p), and we verify these calculations in simulations of household outbreaks.

Household outbreak size distributions

Assume that each infectious member of a household makes infectious contact with each other member of the household with probability p during his or her infectious period. Let $p_{mi}$ be the probability that i out of m susceptibles are infected by within-household transmission in a household with a single primary case. Then

$$g_m(x) = \sum_{i=0}^{m} p_{mi}\, x^{i} \tag{2}$$

is the probability generating function (PGF) for the outbreak size distribution in a household with m susceptibles and one primary case. Because a household with zero susceptibles has zero secondary infections with probability one, $g_0(x) = 1$.

The PGF for the outbreak size distribution in a household with m + 1 susceptibles can be derived from the PGFs for smaller households. Imagine a household with m susceptibles of whom i were infected. Now imagine that the household had one more susceptible. There are two possible outcomes:

  1. With probability $(1-p)^{i+1}$, the additional susceptible escapes infection from all i + 1 infected household members. The total number of infections in the household is i.

  2. With probability $1-(1-p)^{i+1}$, the additional susceptible gets infected. He or she acts like a primary case in a household containing the m − i susceptibles who escaped infection. There are i + 1 infections, and the number of infections among the remaining susceptibles has the PGF $g_{m-i}(x)$.

Combining these results, we conclude that

$$g_{m+1}(x) = \sum_{i=0}^{m} p_{mi}\left[(1-p)^{i+1}x^{i} + \left(1-(1-p)^{i+1}\right)x^{i+1}g_{m-i}(x)\right]. \tag{3}$$

The first few iterations yield

$$g_0(x) = 1, \tag{4}$$
$$g_1(x) = (1-p) + px, \tag{5}$$
$$g_2(x) = (1-p)^2 + 2p(1-p)^2 x + (3p^2 - 2p^3)x^2, \tag{6}$$

which can be checked by hand. We calculated these polynomials using Python code in S2 File. As shown in Eq (2), the coefficient on $x^i$ in the PGF $g_m(x)$ is the probability that i of m susceptibles are infected in a household outbreak started by a single primary case. Using these probabilities, we can calculate the mean and variance of the number of infections among the m susceptibles.
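The recursion in Eq (3) can be sketched in a few lines of Python. This is a minimal illustration under the assumptions stated above (uniform SAR, one primary case, no outside infection), not the code in S2 File; the function names `outbreak_size_pgfs` and `final_attack_risk` are hypothetical. Each PGF is stored as a list of coefficients, so entry i of the list for $g_m$ is $p_{mi}$.

```python
def outbreak_size_pgfs(m_max, p):
    """Return [g_0, ..., g_{m_max}] as coefficient lists.

    pgfs[m][i] is p_{mi}: the probability that i of m susceptibles are
    infected in a household with a single primary case and SAR p.
    """
    pgfs = [[1.0]]  # g_0(x) = 1
    for m in range(m_max):
        new = [0.0] * (m + 2)  # coefficients of g_{m+1}, degrees 0..m+1
        for i, p_mi in enumerate(pgfs[m]):
            escape = (1.0 - p) ** (i + 1)
            # Outcome 1: the extra susceptible escapes all i + 1 infecteds.
            new[i] += p_mi * escape
            # Outcome 2: the extra susceptible is infected and acts as a
            # primary case for the m - i who escaped: x^{i+1} g_{m-i}(x).
            for j, coef in enumerate(pgfs[m - i]):
                new[i + 1 + j] += p_mi * (1.0 - escape) * coef
        pgfs.append(new)
    return pgfs


def final_attack_risk(m, p):
    """Expected proportion of the m >= 1 susceptibles infected (the FAR)."""
    dist = outbreak_size_pgfs(m, p)[m]
    return sum(i * prob for i, prob in enumerate(dist)) / m


if __name__ == "__main__":
    print(outbreak_size_pgfs(2, 0.5)[2])  # coefficients of g_2 at p = 0.5
    print(final_attack_risk(2, 0.5))
```

Comparing `final_attack_risk(m, p)` with `p` for a few household sizes gives a quick numerical view of how far the proportion infected can sit above the SAR.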

Household outbreak simulations

We simulated household outbreaks using Erdős-Rényi random graphs [53, 54], where each pair of nodes is connected independently with probability p. In our graphs, each node represents a household member and p is the SAR. One node is fixed as the primary case, and all household members connected to the primary case by a series of edges are infected. This captures the final outcome of one realization of a within-household epidemic with SAR p. This simple model allows but does not require multiple generations of infection within households, allowing us to evaluate the performance of binomial estimates based on the assumption of a single generation of transmission.
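The random-graph construction described above can be sketched with the Python standard library alone. This is an illustrative reimplementation, not the code in S2 File; the function names and the seed are hypothetical. Node 0 plays the role of the primary case, and the outbreak size is the number of other nodes reachable from it.

```python
import random
from collections import deque


def simulate_outbreak_size(m, p, rng):
    """One household outbreak: an Erdos-Renyi graph on m + 1 nodes.

    Returns the number of the m susceptibles connected to the primary
    case (node 0) by a series of edges.
    """
    n = m + 1
    neighbors = [[] for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p:  # each pair is connected independently
                neighbors[a].append(b)
                neighbors[b].append(a)
    infected = {0}
    queue = deque([0])
    while queue:  # breadth-first search from the primary case
        a = queue.popleft()
        for b in neighbors[a]:
            if b not in infected:
                infected.add(b)
                queue.append(b)
    return len(infected) - 1  # exclude the primary case


def mean_attack_proportion(m, p, n_sims, seed=0):
    """Average proportion of susceptibles infected over n_sims households."""
    rng = random.Random(seed)
    total = sum(simulate_outbreak_size(m, p, rng) for _ in range(n_sims))
    return total / (n_sims * m)


if __name__ == "__main__":
    print(mean_attack_proportion(m=4, p=0.1, n_sims=20000))
```

Because the graph records only who could infect whom, the breadth-first search recovers the final size of the outbreak without tracking the timing of infections.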

We performed 40,000 simulations for each combination of household size and SAR. In each simulation, there were 200 independent households of the same size. We used logistic regression to calculate the proportion of susceptible household members who were infected with a naive 95% confidence interval. We then calculated a cluster-adjusted confidence interval using generalized estimating equations (GEE) with a robust variance estimate. The variance inflation factor (VIF) was calculated as the ratio of the robust variance to the naive variance. All confidence intervals were calculated on the logit scale as $\hat{\beta} \pm 1.96\,\hat{\sigma}$, where $\hat{\beta} = \operatorname{logit}(\hat{p})$ is the estimated log odds of infection and $\hat{\sigma}$ is the naive or robust standard error estimate. Finally, we transformed the confidence intervals to the probability scale and estimated the coverage probabilities for the true household SAR and the true household FAR.
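For an intercept-only model, the naive and robust intervals described above have closed forms, which the following Python sketch uses. It is an assumption-laden illustration rather than the R analysis in S1 File: it takes the robust variance to be the standard sandwich estimator under an independence working correlation with no small-sample correction, and the function name `binomial_sar_intervals` and the input format are hypothetical.

```python
import math


def binomial_sar_intervals(households, z=1.96):
    """Naive and cluster-robust Wald intervals on the logit scale.

    households: list of (infected, susceptibles) pairs, one per household.
    Assumes an intercept-only logistic model, so p_hat is the pooled
    proportion infected.
    """
    y = sum(inf for inf, _ in households)
    n = sum(m for _, m in households)
    p_hat = y / n
    if not 0.0 < p_hat < 1.0:
        raise ValueError("logit scale is undefined when p_hat is 0 or 1")

    info = n * p_hat * (1.0 - p_hat)  # Fisher information for the intercept
    var_naive = 1.0 / info
    # Sandwich "meat": squared household-level score contributions.
    meat = sum((inf - m * p_hat) ** 2 for inf, m in households)
    var_robust = meat / info ** 2

    beta = math.log(p_hat / (1.0 - p_hat))

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    def interval(var):
        half = z * math.sqrt(var)
        return expit(beta - half), expit(beta + half)

    return {
        "p_hat": p_hat,
        "naive_ci": interval(var_naive),
        "robust_ci": interval(var_robust),
        "vif": var_robust / var_naive,
    }


if __name__ == "__main__":
    # Hypothetical data: four households with two susceptibles each.
    print(binomial_sar_intervals([(0, 2), (2, 2), (1, 2), (0, 2)]))
```

When infections cluster within households, the squared household-level residuals exceed the binomial variance and the VIF rises above one, widening the robust interval relative to the naive one.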

Source code

Simulations were implemented in Python 3 [55], and statistical analysis was performed in R [56]. The R code is available in S1 File, and the Python code is available in S2 File. All software used is free and open-source, and further details are given in the Supporting Information.

Household data analysis

To give a practical example of the consequences of using a binomial model to estimate the household SAR, we use influenza A (H1N1) household surveillance data collected by the Los Angeles County Department of Public Health (LACDPH) between April 22 and May 19, 2009. The data were collected using the following protocol [49]:

  1. Nasopharyngeal swabs and aspirates were taken from individuals who reported to the LACDPH or other health care providers with acute febrile respiratory illness (AFRI), defined as a fever ≥100°F plus cough, sore throat, or runny nose. These specimens were tested for influenza, and the age, gender, and symptom onset date of the AFRI patient were recorded.

  2. Patients whose specimens tested positive for pandemic influenza A (H1N1) or for influenza A of undetermined subtype were enrolled as primary cases. Each of them was given a structured phone interview to collect information about his or her household contacts. They were asked to report the symptom onset date of any AFRI episodes among their household contacts.

  3. When necessary, a follow-up interview was given 14 days after the symptom onset date of the primary case to assess whether any additional AFRI episodes had occurred in the household, including their illness onset date.

For simplicity, we assume all AFRI episodes among household members were caused by influenza A (H1N1) and that all household members except the primary case were susceptible to infection. All analyses use natural history assumptions adapted from Ref [20] and consistent with Ref [46]. Identical assumptions were used in Refs [50, 51]. In the primary analysis, we assumed an incubation period of 2 days, a latent period of zero days, and an infectious period of 6 days. We also consider 4-day and 8-day infectious periods.

We estimated the household SAR for 2009 pandemic influenza A (H1N1) using binomial models, a longitudinal chain binomial model [40, 48], and parametric pairwise regression models [52, 57]. In each household, we censored observations at the end of the infectious period of the primary case. Thus, the models are fit only to infections that could have been caused by primary cases, giving the binomial models the best possible chance of accurately estimating the household SAR. For each assumed infectious period, all statistical models were fit to exactly the same data. For simplicity, we did not include any covariates in these analyses. Final size chain binomial models were not used because they require complete observation of each within-household epidemic, so they cannot be fit to data censored at the end of the infectious period of the primary case in each household.

Binomial models

Two binomial models were fit to the LACDPH households data. The first model was an intercept-only logistic regression model—a binomial generalized linear model (GLM) with logit link. The second model was an intercept-only binomial GEE model [58]. For both models, we calculated Wald confidence intervals using naive and cluster-adjusted variances [59].

Longitudinal chain binomial model

The chain binomial model assumes that a given infectious person A makes infectious contact with a given susceptible household member B with an unknown probability p on each day that A is infectious. On day t, an individual B who is exposed to k infectious household members will escape infection with probability $q^k$ and be infected with probability $1 - q^k$, where q = 1 − p. The likelihood contribution from observation of individual B is the product of these likelihood contributions over all days where B was at risk of infection. The overall likelihood is the product of the likelihood contributions of all susceptibles who were at risk of infection for at least one day.

The household SAR is $1 - q^{\iota}$ where ι is the infectious period. Because p ∈ (0, 1), our likelihood was defined in terms of $\operatorname{logit}(p) = \ln(p/q)$. To get a point estimate of the SAR, the unknown true q is replaced by a point estimate $\hat{q} = 1 - \hat{p}$. Standard maximum likelihood estimation was used to get point and interval estimates on the logit scale, which were transformed back to the probability scale. For simplicity, we have assumed that the probability of escaping infection from an infectious household member does not depend on how long he or she has been infectious or on any covariates. More sophisticated longitudinal chain binomial models can allow the escape probability to vary with the time since infection or with covariates [40, 48].
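The likelihood described above can be sketched as follows. This is a minimal illustration, not the analysis code in S3 File: it assumes the data have been reduced to person-day records `(k, infected)`, it maximizes over θ = −ln q (in which the log-likelihood is concave) with a simple ternary search instead of working on the logit scale, and it returns point estimates only. The function names and the record format are hypothetical.

```python
import math


def chain_binomial_loglik(theta, person_days):
    """Log-likelihood with q = exp(-theta) as the daily escape probability.

    person_days: iterable of (k, infected) records, where k is the number
    of infectious household members a susceptible was exposed to that day
    and infected is True if the susceptible was infected that day.
    """
    total = 0.0
    for k, infected in person_days:
        if k == 0:
            continue  # no within-household exposure, so no information on q
        if infected:
            total += math.log(-math.expm1(-k * theta))  # ln(1 - q^k)
        else:
            total += -k * theta  # ln(q^k)
    return total


def fit_chain_binomial(person_days, infectious_period, lo=1e-9, hi=20.0):
    """Return (p_hat, sar_hat) by ternary search over theta = -ln(q)."""
    records = list(person_days)
    for _ in range(200):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if chain_binomial_loglik(a, records) < chain_binomial_loglik(b, records):
            lo = a
        else:
            hi = b
    q_hat = math.exp(-(lo + hi) / 2.0)
    return 1.0 - q_hat, 1.0 - q_hat ** infectious_period


if __name__ == "__main__":
    # Hypothetical records: ten person-days of exposure to one infectious
    # household member, two of which ended in infection.
    data = [(1, False)] * 8 + [(1, True)] * 2
    print(fit_chain_binomial(data, infectious_period=6))
```

In the example, every record has k = 1, so the daily estimate reduces to the proportion of exposed person-days that ended in infection, and the SAR follows from compounding the daily escape probability over the infectious period.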

Pairwise survival analysis

Pairwise survival analysis estimates failure times in ordered pairs consisting of an infectious individual and a susceptible household member [57]. The pair AB is at risk of transmission starting with the onset of infectiousness in A, and failure occurs if A infects B. This failure time, called a contact interval, is right-censored if B is infected by someone other than A or if observation of the pair stops. To account for uncertainty about who-infected-whom, the overall likelihood is the sum of the likelihoods for all possible combinations of who-infected-whom consistent with the data [49]. The survival function S(τ, θ), where θ is a parameter vector, is the probability that the contact interval is greater than τ. If $\theta_0$ is the true value of θ and the infectious period is ι, then the household SAR is $1 - S(\iota, \theta_0)$. To get a point estimate of the SAR, the unknown true parameter $\theta_0$ is replaced by the maximum likelihood estimate $\hat{\theta}$.

We used intercept-only exponential, Weibull, and log-logistic regression models [52]. For the exponential distribution, $S(\tau, \lambda) = \exp(-\lambda\tau)$ where λ is the rate parameter. For the Weibull distribution, $S(\tau, \lambda, \gamma) = \exp[-(\lambda\tau)^{\gamma}]$ where λ is the rate and γ is the shape parameter. For the log-logistic distribution, $S(\tau, \lambda, \gamma) = [1 + (\lambda\tau)^{\gamma}]^{-1}$ for rate λ and shape γ. For all three distributions, λ > 0 and γ > 0, so we defined our likelihoods in terms of their natural logarithms ln λ and ln γ. Standard maximum likelihood estimation was used to get point estimates and a covariance matrix for the rate and shape parameters. To get a 95% confidence interval for the SAR, we sampled ln λ and ln γ from their approximate multivariate normal distribution, calculated the household SAR for each sample, and took the 2.5% and 97.5% quantiles of the calculated SARs as confidence limits.
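The mapping from fitted survival parameters to a SAR, and the sampling-based interval, can be sketched as below. This is an illustration only, not the code in S3 File: the parameter values and covariance matrix in the example are hypothetical, the function names are invented for this sketch, and the empirical quantiles are taken by simple index selection.

```python
import math
import random


def sar_exponential(lam, iota):
    """SAR = 1 - S(iota) with S(tau) = exp(-lam * tau)."""
    return 1.0 - math.exp(-lam * iota)


def sar_weibull(lam, gamma, iota):
    """SAR = 1 - S(iota) with S(tau) = exp(-(lam * tau) ** gamma)."""
    return 1.0 - math.exp(-((lam * iota) ** gamma))


def sar_loglogistic(lam, gamma, iota):
    """SAR = 1 - S(iota) with S(tau) = 1 / (1 + (lam * tau) ** gamma)."""
    return 1.0 - 1.0 / (1.0 + (lam * iota) ** gamma)


def sampled_sar_ci(sar_fn, mean, cov, iota, n=5000, seed=0):
    """95% interval for the SAR from bivariate normal draws.

    mean: (ln lam, ln gamma) estimates; cov: their 2x2 covariance matrix
    as nested tuples. sar_fn takes (lam, gamma, iota).
    """
    rng = random.Random(seed)
    (v11, v12), (_, v22) = cov
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(v11)
    l21 = v12 / l11
    l22 = math.sqrt(v22 - l21 ** 2)
    draws = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        ln_lam = mean[0] + l11 * z1
        ln_gam = mean[1] + l21 * z1 + l22 * z2
        draws.append(sar_fn(math.exp(ln_lam), math.exp(ln_gam), iota))
    draws.sort()
    return draws[int(0.025 * n)], draws[int(0.975 * n) - 1]


if __name__ == "__main__":
    # Hypothetical estimates, for illustration only.
    mean = (math.log(0.02), 0.0)
    cov = ((0.04, 0.0), (0.0, 0.01))
    print(sar_weibull(0.02, 1.0, 6.0))
    print(sampled_sar_ci(sar_weibull, mean, cov, iota=6.0))
```

Sampling on the log scale keeps every draw of λ and γ positive, and transforming each draw to a SAR before taking quantiles avoids a delta-method approximation for a nonlinear function of two parameters.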

Goodness of fit

To see how well the SAR estimates fit the data, we simulated outbreaks in the Los Angeles households using SAR point estimates from the binomial model, the chain binomial model, and pairwise survival models. In each simulation, we calculated the total number of infections among susceptible household members. For each SAR estimate, we performed 4,000 simulations. We then compared the simulated household epidemics to the observed final size of the outbreak started by the primary cases (i.e., the total number of cases who can be linked to a primary case through one or more generations of transmission). For all infectious periods shorter than 12 days, there are a few observed cases that occur after the end of the initial within-household outbreak. Given each assumed infectious period, these late cases are excluded because they can only be explained by later introductions of infection into the household or by transmission paths that include undetected cases.

Source code

Statistical analyses were done with R [56], and the simulations were implemented in Python 3 [55]. The R code is available in S3 File, the Python code is available in S4 File, and the household data are available in S5 File. All software used is free and open-source, and further details are given in the Supporting Information.

Results

Household outbreak simulations

Fig 1 shows the household FAR calculated using PGFs (lines) and from simulations (symbols) as a function of the true SAR and the number of susceptibles. There is excellent agreement between the analytical calculations and the simulations. Both show that the household FAR is larger than the household SAR when there is more than one susceptible. At a fixed SAR, the difference between the SAR and the FAR increases with household size. Thus, a binomial model will produce a point estimate of the SAR that is biased upward whenever there is more than one susceptible household member.
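For the smallest case with more than one susceptible, the gap between the FAR and the SAR can be written out explicitly. As a worked illustration (not part of the original analysis), take m = 2 and use the coefficients of $g_2(x)$ in Eq (6). The expected number of infections is $g_2'(1)$, so

$$\mathrm{FAR}_2(p) = \frac{g_2'(1)}{2} = \frac{2p(1-p)^2 + 2\,(3p^2 - 2p^3)}{2} = p + p^2(1-p).$$

The excess over the SAR is $p^2(1-p)$, which is positive for every p ∈ (0, 1). At p = 0.1, for example, the FAR is 0.109, about 9% larger than the SAR even though there are only two susceptibles.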

Fig 1. The household FAR as a function of the SAR for households with different numbers of susceptibles m.

Lines show analytical calculations using probability generating functions, and symbols show estimates from 40,000 simulated household outbreaks. Each simulated household outbreak had a single primary case, so the total household size was m + 1.

Fig 2 shows the VIF calculated using PGFs (lines) and from simulations (symbols) as a function of the true SAR and the number of susceptibles. Again, there is excellent agreement between the analytical calculations and the simulations. The variance of the number of infections within households is substantially larger than the binomial variance, and this difference increases with increasing household size. Thus, confidence intervals based on a binomial estimate will have coverage probabilities that are too low even if the estimated SAR is correct.
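A small worked example shows the size of this effect. It assumes that the analytical VIF is the ratio of the true variance of the number of infections N to the binomial variance evaluated at the FAR, which is our reading of the comparison described above rather than a formula stated in the text. For m = 2 and p = 0.1, Eq (6) gives

$$E[N] = 2p(1-p)^2 + 2\,(3p^2 - 2p^3) = 0.218, \qquad E[N^2] = 2p(1-p)^2 + 4\,(3p^2 - 2p^3) = 0.274,$$

so $\operatorname{Var}(N) = 0.274 - 0.218^2 \approx 0.2265$. With FAR = 0.109, the binomial variance is $2 \times 0.109 \times 0.891 \approx 0.1942$, and

$$\mathrm{VIF} \approx \frac{0.2265}{0.1942} \approx 1.17.$$

Even with two susceptibles and a SAR of 0.1, the variance is about 17% larger than the binomial variance under this definition.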

Fig 2. The VIF as a function of the SAR for households with m susceptibles.

Lines show analytical calculations, and symbols show estimates from 40,000 simulated household outbreaks. Each simulated household outbreak started with a single primary case, so the total household size was m + 1. For numerical stability, symbols are shown only for simulations with an observed FAR <0.99.

Fig 3 shows the household SAR coverage probabilities for unadjusted and cluster-adjusted binomial 95% confidence intervals. Even for small households, the coverage probabilities are below 95% and decrease rapidly as the true SAR increases. Cluster adjustment increases the coverage probabilities only slightly. With or without adjustment for clustering by household, a binomial model does not produce reliable point or interval estimates of the household SAR.

Fig 3. Coverage probabilities of binomial 95% confidence intervals for the household SAR with different numbers of susceptibles (m).

Gray lines are coverage probabilities for unadjusted confidence intervals, and black lines are coverage probabilities for cluster-adjusted confidence intervals. Each symbol represents 1,000 simulations with 100 households each.

Fig 4 shows coverage probabilities of unadjusted and cluster-adjusted 95% confidence intervals for the household FAR. Coverage of the FAR is much higher than coverage of the SAR. However, the coverage probabilities for unadjusted confidence intervals are always below 95%, and they decrease with increasing household size or increasing SAR. Adjustment for clustering by household corrects this problem, producing coverage probabilities close to 95% for all household sizes. Under these ideal conditions, a binomial model can produce reliable point and interval estimates of the household FAR when the variance is adjusted for clustering within households. This does not imply that FAR can be defined clearly or estimated accurately under more realistic conditions, and it does not imply that the FAR is an acceptable substitute for the SAR in practice.

Fig 4. Coverage probabilities of binomial 95% confidence intervals for the household FAR with different numbers of susceptibles (m).

Gray lines are coverage probabilities for unadjusted confidence intervals, and black lines are coverage probabilities for cluster-adjusted confidence intervals. Each symbol represents 1,000 simulations with 100 households each.

Household data analysis

In the LACDPH pandemic influenza A (H1N1) data, there were 58 households with a total of 299 members. There were 99 infections, of which 62 were classified as primary cases because 4 of 58 households had two co-primary cases. There were 37 household contacts who were infected while under observation. The median household size was 5 with a range from 2 to 20. Both in this example and more generally, co-primary cases and varying household sizes are practical problems for estimation of the household SAR.

There are three types of cases relevant to our analyses: Possible second generation cases are susceptible household members who are infected during the infectious period of the primary case, so it is possible that they were infected by the primary case. Final size cases are susceptible household members who could have been infected through a transmission path starting from a primary case. Late cases are susceptible household members who were infected after the end of the infectious period of the last final size case in the household. Given the assumed infectious period, these cases can only be explained by a new introduction of infection to the household or by transmission paths that include undetected cases. In volunteer challenge studies, approximately 71% of influenza A (H1N1) infections result in symptoms and 37% result in fever ≥100°F [46]. In the analyses comparing binomial models to chain binomial and pairwise survival models, we use only the possible second-generation cases to give the binomial models the best possible chance of producing a good estimate of the household SAR.

Table 2 shows the numbers of possible second generation cases, final size cases, and late cases for each assumed infectious period from 3 days (probably too short) to 12 days (almost surely too long). Assuming an infectious period of 6 days as in our primary analysis, there are 24 possible second generation cases, 26 final size cases, and 11 late cases. We also show analyses with 4-day and 8-day infectious periods.

Table 2. The number of possible second generation cases, final size cases, and late cases for each assumed infectious period.

There are always 37 total final size and late cases.

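| Infectious period (days) | Possible second generation cases | Final size cases | Late cases |
|---|---|---|---|
| 3 | 13 | 16 | 21 |
| 4 | 17 | 22 | 15 |
| 5 | 20 | 25 | 12 |
| 6 | 24 | 26 | 11 |
| 7 | 28 | 32 | 5 |
| 8 | 28 | 32 | 5 |
| 9 | 28 | 32 | 5 |
| 10 | 32 | 36 | 1 |
| 11 | 33 | 36 | 1 |
| 12 | 34 | 37 | 0 |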

Table 3 shows point estimates and 95% confidence intervals for the household SAR. The point estimates for all binomial models are identical. As expected, binomial models produce much higher estimates than the chain binomial or pairwise regression models. Adjustment for clustering produced wider confidence intervals, with very similar results in the binomial GLM and GEE models. The chain binomial and exponential pairwise regression models produced nearly identical point and interval estimates of the household SAR. For each assumed infectious period, the Weibull and log-logistic pairwise regression models produced slightly higher SAR estimates and wider confidence intervals than the exponential model. To compare goodness-of-fit among the parametric pairwise survival models, we used the Akaike Information Criterion (AIC). For the 4-day infectious period, the Weibull and log-logistic models had lower AICs than the exponential model. For the 6-day and 8-day infectious periods, the exponential model had the lowest AIC. The chain binomial and pairwise regression estimates are consistent with each other, but neither is consistent with the binomial estimates.

Table 3. Estimates of the household SAR with 95% confidence limits and Akaike information criterion (AIC) for pairwise regression models.

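| Infectious period | Model | Estimated SAR (95% CI) | AIC |
|---|---|---|---|
| 6-day | Binomial: GLM (naive) | 0.101 (0.067, 0.144) | |
| 6-day | Binomial: GLM (adjusted) | 0.101 (0.052, 0.189) | |
| 6-day | Binomial: GEE (naive) | 0.101 (0.069, 0.147) | |
| 6-day | Binomial: GEE (robust) | 0.101 (0.052, 0.188) | |
| 6-day | Longitudinal chain binomial | 0.076 (0.051, 0.109) | |
| 6-day | Pairwise regression: exponential | 0.075 (0.051, 0.110) | 235.96 |
| 6-day | Pairwise regression: Weibull | 0.079 (0.056, 0.138) | 236.38 |
| 6-day | Pairwise regression: log-logistic | 0.079 (0.055, 0.133) | 236.26 |
| 4-day | Binomial: GLM (naive) | 0.072 (0.043, 0.109) | |
| 4-day | Binomial: GLM (adjusted) | 0.072 (0.033, 0.150) | |
| 4-day | Binomial: GEE (naive) | 0.072 (0.045, 0.112) | |
| 4-day | Binomial: GEE (robust) | 0.072 (0.033, 0.149) | |
| 4-day | Longitudinal chain binomial | 0.059 (0.035, 0.090) | |
| 4-day | Pairwise regression: exponential | 0.058 (0.036, 0.092) | 166.54 |
| 4-day | Pairwise regression: Weibull | 0.063 (0.041, 0.135) | 162.67 |
| 4-day | Pairwise regression: log-logistic | 0.063 (0.042, 0.133) | 162.65 |
| 8-day | Binomial: GLM (naive) | 0.118 (0.081, 0.163) | |
| 8-day | Binomial: GLM (adjusted) | 0.118 (0.063, 0.211) | |
| 8-day | Binomial: GEE (naive) | 0.118 (0.083, 0.166) | |
| 8-day | Binomial: GEE (robust) | 0.118 (0.063, 0.210) | |
| 8-day | Longitudinal chain binomial | 0.085 (0.058, 0.118) | |
| 8-day | Pairwise regression: exponential | 0.084 (0.059, 0.120) | 281.88 |
| 8-day | Pairwise regression: Weibull | 0.085 (0.062, 0.145) | 283.77 |
| 8-day | Pairwise regression: log-logistic | 0.085 (0.062, 0.139) | 283.49 |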

Fig 5 shows histograms of the simulated outbreak sizes in the LA households based on the four different SAR estimates that assume a 6-day infectious period. The binomial estimates predict outbreaks much larger than observed, but the chain binomial and pairwise estimates predict outbreak size distributions centered near observed final sizes. Figs 6 and 7 show a similar pattern for estimates that assume 4-day and 8-day infectious periods, respectively. For the binomial estimates, the predicted outbreak sizes increase rapidly as the assumed infectious period gets longer. For the chain binomial and pairwise regression estimates, the predicted outbreak sizes increase much more slowly. To the extent that a true household SAR exists, it is almost certainly below the binomial estimates and closer to the chain binomial and pairwise regression estimates.

Fig 5. Histograms of simulated final outbreak sizes in the LA households based on household SAR estimates assuming a 6-day infectious period.

Vertical black lines indicate the observed final size of 26 cases.

Fig 6. Histogram of simulated final outbreak sizes in the LA households based on SAR estimates assuming a 4-day infectious period.

Vertical black lines indicate the observed final size of 22 cases.

Fig 7. Histogram of simulated final outbreak sizes in the LA households based on SAR estimates assuming an 8-day infectious period.

Vertical black lines indicate the observed final size of 32 cases.

An important advantage of the longitudinal chain binomial and pairwise regression models is that they can estimate the SAR using the entire period of household observation. Table 4 shows point and interval estimates of the SAR based on the full data set collected by the LACDPH. As before, the chain binomial and pairwise exponential models produce nearly identical point and interval estimates. Using the full data set, the pairwise Weibull and log-logistic models produce point estimates closer to those of the one-parameter models than in Table 3, but their confidence intervals remain slightly wider. For the 6-day and 8-day assumed infectious periods, all four models produce lower point estimates of the SAR when using the full data set than when using only the possible second-generation data. For the 4-day assumed infectious period, the point estimates from the full data are near the higher point estimates from the possible second-generation data. Fig 8 shows the distribution of outbreak sizes under the pairwise exponential estimate of the SAR assuming 6-, 4-, and 8-day infectious periods. The light gray histograms in the background show the distributions based on the point estimates from Table 3, which used the possible second-generation data. In all three cases, there is a small but clear improvement in the predictive fit of the model when the full data set is used. Similar results were seen for the longitudinal chain binomial and pairwise Weibull and log-logistic regression models (see figures produced by S3 File).
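The near-identical chain binomial and pairwise exponential estimates are what one would expect if both models describe the same constant-risk process at different time resolutions. The sketch below illustrates this under assumptions made here for illustration only; it is not the code in S3 File. It assumes a constant hazard `rate` of infectious contact per day, a fixed infectious period of `infectious_days`, and a daily transmission probability defined as `1 - exp(-rate)`. All function names are hypothetical.

```python
import math


def sar_exponential(rate, infectious_days):
    """SAR implied by a constant daily hazard of infectious contact."""
    return 1.0 - math.exp(-rate * infectious_days)


def daily_probability(rate):
    """Probability of infectious contact within one day at a constant hazard."""
    return 1.0 - math.exp(-rate)


def sar_discrete(p_daily, infectious_days):
    """SAR implied by a constant daily probability over whole days."""
    return 1.0 - (1.0 - p_daily) ** infectious_days


if __name__ == "__main__":
    # Hypothetical hazard chosen so that the 6-day SAR is 0.068.
    rate = -math.log(1.0 - 0.068) / 6
    print(sar_exponential(rate, 6))
    print(sar_discrete(daily_probability(rate), 6))
```

With these definitions the two routes to the SAR agree for any whole number of infectious days, because `(1 - p_daily) ** d` equals `exp(-rate * d)`.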

Table 4. Full-data estimates of the household SAR with 95% confidence limits and Akaike information criterion (AIC) for pairwise regression models.

Model Estimated SAR (95% confidence limits) AIC
6-day infectious period
Longitudinal chain binomial 0.069 (0.048, 0.096)
Pairwise regression: exponential 0.068 (0.048, 0.097) 288.33
Pairwise regression: Weibull 0.069 (0.050, 0.111) 289.74
Pairwise regression: log-logistic 0.069 (0.050, 0.111) 289.54
4-day infectious period
Longitudinal chain binomial 0.063 (0.043, 0.089)
Pairwise regression: exponential 0.062 (0.043, 0.088) 256.67
Pairwise regression: Weibull 0.063 (0.045, 0.108) 254.30
Pairwise regression: log-logistic 0.063 (0.044, 0.105) 254.24
8-day infectious period
Longitudinal chain binomial 0.080 (0.056, 0.108)
Pairwise regression: exponential 0.079 (0.056, 0.111) 339.98
Pairwise regression: Weibull 0.079 (0.058, 0.122) 341.97
Pairwise regression: log-logistic 0.079 (0.058, 0.119) 341.66

Fig 8. Histograms of simulated outbreak sizes based on pairwise exponential SAR estimates using the full data (dark gray) superimposed on the corresponding histograms from Figs 5–7 based on estimates using second-generation data (light gray).

For each assumed infectious period, a vertical black line shows the observed final outbreak size.

Discussion

Studies of disease transmission in households and other clearly defined groups at risk of infection will always be one of the most effective means of obtaining critical information about routes of transmission, predictors of infectiousness and susceptibility, and the natural history of epidemic diseases [3, 8, 60]. Every author of the studies cited above has made an important contribution to infectious disease epidemiology and to public health. However, these studies should no longer be analyzed using binomial models. Even when the SAR is small, it is important to account for multiple generations of transmission. Unless these generations are clearly separated in time, a binomial estimate of the SAR will be biased upward and have a confidence interval with low coverage probability whether or not it is adjusted for clustering.
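A small worked example shows why the bias does not vanish when the SAR is small. It rests on simplifying assumptions chosen here for illustration: a household with one primary case and two susceptibles B and C, independent infectious contacts that each occur with probability p, and no infection from outside the household. B is infected either directly by the primary case or indirectly, when the primary case fails to infect B, infects C, and C then infects B:

```latex
\begin{align*}
\Pr(\text{B infected}) &= p + (1 - p)\, p \cdot p \\
                       &= p + p^2 - p^3 .
\end{align*}
```

The expected proportion infected therefore exceeds the SAR by p^2(1 - p). For p = 0.1 it is 0.109, a relative excess of 9% in a household of only three people, and under the same assumptions the excess tends to grow with the number of susceptibles because there are more indirect paths of infection.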

A binomial model can estimate the household FAR accurately if cluster-adjusted confidence intervals are used. However, the FAR was clearly defined in our simulations only because we made the following assumptions: (1) each household had at most one primary case, (2) all households were the same size, and (3) susceptibles were not at risk of infection from outside the household after the occurrence of a primary case. In practice, these assumptions are extremely unlikely to hold. The LACDPH data had households with multiple primary cases, household sizes that varied from 2 to 20, and an ongoing risk of infection from outside the household. Unlike the household FAR, the household SAR can be clearly defined under more realistic conditions.

Our simulation study of the SAR and FAR also assumed that individuals were identical in terms of infectiousness and susceptibility, which is extremely unlikely to hold in practice. In the LACDPH data, individuals varied in age, antiviral prophylaxis use, and other possible predictors of infectiousness and susceptibility. Some household studies have seen evidence of lower transmission intensity between individuals in larger households [21]. The chain binomial model and pairwise survival models allow the probability or hazard of transmission to depend on individual-level, pairwise, and household-level covariates [51, 52]. These covariate effects are estimated simultaneously, which is critical to preventing bias for contagious outcomes [61]. Accurate estimates of these effects can provide critical insight into the effectiveness of public health interventions such as handwashing, social distancing, antiviral prophylaxis or treatment, and vaccination.

The discrete-time chain binomial model [48] and pairwise survival models [49–51] require more detailed follow-up of each household than binomial models, but they can account accurately for delayed entry, loss to follow-up, and the risk of infection from outside the household. Most household studies already collect the data needed to use chain binomial and pairwise survival models for analysis. An additional advantage of these models is that data augmentation and Markov chain Monte Carlo (MCMC) can be used to fit them if there are undetected infections or if infection times cannot be determined precisely [62].
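How a discrete-time model can absorb delayed entry, loss to follow-up, and outside infection is easiest to see from the shape of its likelihood. The following is a minimal sketch in the spirit of a longitudinal chain binomial model, not the implementation in TranStat or in S3 File. It assumes that the data have been reduced to one record per susceptible person-day actually observed, that `p_daily` is the daily probability of infectious contact from each infectious household member, and that `b_daily` is the daily probability of infection from outside the household. The record layout, function names, and grid search are all hypothetical.

```python
import math


def person_day_loglik(records, p_daily, b_daily):
    """Log-likelihood over observed susceptible person-days.

    Each record is (n_infectious, infected): the number of infectious
    household members that day and whether the person was infected (0 or 1).
    Person-days before entry or after loss to follow-up are simply absent.
    Note: a day with infected=1 needs escape < 1, so b_daily must be > 0
    if any infection occurs on a day with no infectious household member.
    """
    loglik = 0.0
    for n_infectious, infected in records:
        # Escape infection from outside and from every infectious member.
        escape = (1.0 - b_daily) * (1.0 - p_daily) ** n_infectious
        loglik += math.log(1.0 - escape) if infected else math.log(escape)
    return loglik


def fit_p_daily(records, b_daily, grid_size=999):
    """Crude grid-search estimate of p_daily with b_daily held fixed."""
    candidates = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(candidates, key=lambda p: person_day_loglik(records, p, b_daily))


if __name__ == "__main__":
    # Hypothetical data: 100 person-days of exposure to one infectious
    # household member, 10 of which ended in infection.
    toy = [(1, 1)] * 10 + [(1, 0)] * 90
    print(fit_p_daily(toy, b_daily=0.001))
```

Because each term depends only on who was infectious on that day, later generations of transmission enter the likelihood through `n_infectious` rather than being attributed to the primary case.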

Whereas binomial models can be fit using almost any standard statistical package, the lack of available software has been a major obstacle to the adoption of statistical models of infectious disease transmission in household studies. Chain binomial models are available in the free and open source software package TranStat (www.cidid.org/transtat), which incorporates several advanced methods [63, 64] and has been used in analyses of influenza [20], Zika virus [65], and Ebola virus [29]. Pairwise survival models are available in the free and open source transtat package for R, which was used to analyze the LA household data above. This package includes parametric models and semiparametric models [49–52, 57].

In the COVID-19 pandemic, there have been too few studies of SARS-CoV-2 transmission in households or other clearly defined populations at risk of infection, leaving many questions about the intensity of transmission and the predictors of infectiousness and susceptibility unanswered [60]. This has forced public health decisions that affect millions of lives to be made under far greater uncertainty than there could or should have been. Household studies can provide critical scientific insights to guide public health interventions and policies. The results above show that replacing binomial models with chain binomial or pairwise survival models will help these studies contribute more effectively to the prevention and control of epidemics.

Supporting information

S1 File. R [56] (https://www.r-project.org/) code used to analyze the household outbreak simulations in section.
Produces Figs 1–4. Requires the following packages: Directions and package versions used for publication are in comments.

(R)

S2 File. Python 3 [55] (https://www.python.org) functions called by S1 File.
Requires the following packages: Directions and package versions used for publication are in comments.

(PY)

S3 File. R [56] code used to analyze LACDPH household surveillance data in section.
Produces Tables 2 to 4 and Figs 5–8. In addition to the packages listed in S1 File, the following packages are required: Directions and package versions used for publication are in comments.

(R)

S4 File. Python 3 [55] functions called by S3 File.

Requires the NetworkX and pandas packages listed under S2 File: Directions and package versions used for publication are in comments.

(PY)

S5 File. De-identified LACDPH household surveillance data in CSV format used by S3 File.

(CSV)

Acknowledgments

The subtitle of the paper was inspired by Edsger Dijkstra’s letter “Go To Statement Considered Harmful” (Communications of the ACM 11:147–148, 1968). Brit Oiulfstad, Dee Ann Bagwell, Brandon Dean, Laurene Mascola, and Elizabeth Bancroft of the Los Angeles County Department of Public Health (LACDPH) generously provided the household influenza surveillance data.

Disclaimer: The contribution of YS was completed prior to his Food and Drug Administration (FDA) employment. The content is solely the responsibility of the authors and does not represent the official views or policies of LACDPH, NIAID, NIGMS, NIH, or FDA.

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

EK and YS were supported by National Institute of Allergy and Infectious Diseases (NIAID) grants R01 AI116770 and R03 AI124017. EK was also supported by National Institute of General Medical Sciences (NIGMS) grant U54 GM111274, and YS was also supported by National Institutes of Health (NIH) grant DP2 HD09179. NIAID: https://www.niaid.nih.gov/ NIGMS: https://www.nigms.nih.gov/ NIH: https://www.nih.gov/ The funders played no role in the study design, data collection and analysis, decision to publish, or the preparation of the manuscript.

References

  • 1. Morgenstern H, Kleinbaum DG, Kupper LL. Measures of disease incidence used in epidemiologic research. International Journal of Epidemiology. 1980;9(1):97–104. doi: 10.1093/ije/9.1.97
  • 2. De Wals P, Hertoghe L, Borlée-Grimée I, De Maeyer-Cleempoel S, Reginster-Haneuse G, Dachy A, et al. Meningococcal disease in Belgium. Secondary attack rate among household, day-care nursery and pre-elementary school contacts. Journal of Infection. 1981;3:53–61. doi: 10.1016/S0163-4453(81)80009-6
  • 3. Fox JP. Family-based epidemiologic studies. American Journal of Epidemiology. 1974;99(3):165–79. doi: 10.1093/oxfordjournals.aje.a121600
  • 4. Elveback LR, Fox JP, Ackerman E, Langworthy A, Boyd M, Gatewood L. An influenza simulation model for immunization studies. American Journal of Epidemiology. 1976;103(2):152–165. doi: 10.1093/oxfordjournals.aje.a112213
  • 5. Monto AS. Studies of the community and family: acute respiratory illness and infection. Epidemiologic Reviews. 1994;16(2):351. doi: 10.1093/oxfordjournals.epirev.a036158
  • 6. Halloran ME, Préziosi MP, Chu H. Estimating vaccine efficacy from secondary attack rates. Journal of the American Statistical Association. 2003;98(461):38–46. doi: 10.1198/016214503388619076
  • 7. Terry J, Flatley C, van den Berg D, Morgan G, Trent M, Turahui J, et al. A field study of household attack rates and the effectiveness of macrolide antibiotics in reducing household transmission of pertussis. Communicable Diseases Intelligence Quarterly Report. 2014;39(1):E27–33.
  • 8. Frost WH. The familial aggregation of infectious diseases. American Journal of Public Health and the Nations Health. 1938;28(1):7–13. doi: 10.2105/ajph.28.1.7
  • 9. Wilson EB, Bennett C, Allen M, Worcester J. Measles and scarlet fever in Providence, RI, 1929-1934 with respect to age and size of family. Proceedings of the American Philosophical Society. 1939; p. 357–476.
  • 10. Terris M. Charles V. Chapin (1856-1941), “Dean of City Health Officers”. Journal of Public Health Policy. 1999;20(2):215–220. doi: 10.2307/3343212
  • 11. Jordan WS Jr, Denny FW Jr, Badger GF, Dingle JH, Oseasohn R, Stevens D, et al. A study of illness in a group of Cleveland families. XVII. The occurrence of Asian influenza. American Journal of Hygiene. 1958;68(2):190–212.
  • 12. Chin TD, Foley JF, Doto IL, Gravelle CR, Weston J. Morbidity and mortality characteristics of Asian strain influenza. Public Health Reports. 1960;75(2):149. doi: 10.2307/4590751
  • 13. Davis LE, Caldwell GG, Lynch RE, Bailey RE, Chin TD. Hong Kong influenza: the epidemiologic features of a high school family study analyzed and compared with a similar study during the 1957 Asian influenza epidemic. American Journal of Epidemiology. 1970;92(4):240–247. doi: 10.1093/oxfordjournals.aje.a121203
  • 14. Goh DLM, Lee BW, Chia KS, Heng BH, Chen M, Ma S, et al. Secondary household transmission of SARS, Singapore. Emerging Infectious Diseases. 2004;10(2):232. doi: 10.3201/eid1002.030676
  • 15. Cauchemez S, Carrat F, Viboud C, Valleron A, Boelle P. A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data. Statistics in Medicine. 2004;23(22):3469–3487. doi: 10.1002/sim.1912
  • 16. Tsang TK, Cauchemez S, Perera RA, Freeman G, Fang VJ, Ip DK, et al. Association between antibody titers and protection against influenza virus infection within households. The Journal of Infectious Diseases. 2014;210(5):684–692. doi: 10.1093/infdis/jiu186
  • 17. Tsang TK, Cowling BJ, Fang VJ, Chan KH, Ip DK, Leung GM, et al. Influenza A virus shedding and infectivity in households. The Journal of Infectious Diseases. 2015;212(9):1420–1428. doi: 10.1093/infdis/jiv225
  • 18. Petrie JG, Eisenberg MC, Ng S, Malosh RE, Lee KH, Ohmit SE, et al. Application of an individual-based transmission hazard model for estimation of influenza vaccine effectiveness in a household cohort. American Journal of Epidemiology. 2017;186(12):1380–1388. doi: 10.1093/aje/kwx217
  • 19. Banerjee I, Primrose Gladstone B, Iturriza-Gomara M, Gray JJ, Brown DW, Kang G. Evidence of intrafamilial transmission of rotavirus in a birth cohort in South India. Journal of Medical Virology. 2008;80(10):1858–1863. doi: 10.1002/jmv.21263
  • 20. Yang Y, Sugimoto JD, Halloran ME, Basta NE, Chao DL, Matrajt L, et al. The transmissibility and control of pandemic influenza A (H1N1) virus. Science. 2009;326(5953):729–733. doi: 10.1126/science.1177373
  • 21. Cauchemez S, Donnelly CA, Reed C, Ghani AC, Fraser C, Kent CK, et al. Household transmission of 2009 pandemic influenza A (H1N1) virus in the United States. New England Journal of Medicine. 2009;361(27):2619–2627. doi: 10.1056/NEJMoa0905498
  • 22. Morgan OW, Parks S, Shim T, Blevins PA, Lucas PM, Sanchez R, et al. Household transmission of pandemic (H1N1) 2009, San Antonio, Texas, USA, April-May 2009. Emerging Infectious Diseases. 2010;16(4):631–637. doi: 10.3201/eid1604.091658
  • 23. France AM, Jackson M, Schrag S, Lynch M, Zimmerman C, Biggerstaff M, et al. Household transmission of 2009 influenza A (H1N1) virus after a school-based outbreak in New York City, April-May 2009. The Journal of Infectious Diseases. 2010;201(7):984–992. doi: 10.1086/651145
  • 24. Carcione D, Giele C, Goggin L, Kwan KS, Smith D, Dowse G, et al. Secondary attack rate of pandemic influenza A (H1N1) 2009 in Western Australian households, 29 May-7 August 2009. Eurosurveillance. 2011;16(3):19765.
  • 25. Savage R, Whelan M, Johnson I, Rea E, LaFreniere M, Rosella LC, et al. Assessing secondary attack rates among household contacts at the beginning of the influenza A (H1N1) pandemic in Ontario, Canada, April-June 2009: A prospective, observational study. BMC Public Health. 2011;11(1):234. doi: 10.1186/1471-2458-11-234
  • 26. Ng S, Saborio S, Kuan G, Gresh L, Sanchez N, Ojeda S, et al. Association between Haemagglutination inhibiting antibodies and protection against clade 6B viruses in 2013 and 2015. Vaccine. 2017;35(45):6202–6207. doi: 10.1016/j.vaccine.2017.09.036
  • 27. Drosten C, Meyer B, Müller MA, Corman VM, Al-Masri M, Hossain R, et al. Transmission of MERS-coronavirus in household contacts. New England Journal of Medicine. 2014;371(9):828–835. doi: 10.1056/NEJMoa1405858
  • 28. Arwady MA, Alraddadi B, Basler C, Azhar EI, Abuelzein E, Sindy AI, et al. Middle East respiratory syndrome coronavirus transmission in extended family, Saudi Arabia, 2014. Emerging Infectious Diseases. 2016;22(8):1395. doi: 10.3201/eid2208.152015
  • 29. Fang LQ, Yang Y, Jiang JF, Yao HW, Kargbo D, Li XL, et al. Transmission dynamics of Ebola virus disease and intervention effectiveness in Sierra Leone. Proceedings of the National Academy of Sciences. 2016;113(16):4488–4493. doi: 10.1073/pnas.1518587113
  • 30. Glynn JR, Bower H, Johnson S, Turay C, Sesay D, Mansaray SH, et al. Variability in intrahousehold transmission of Ebola virus, and estimation of the household secondary attack rate. The Journal of Infectious Diseases. 2018;217(2):232–237. doi: 10.1093/infdis/jix579
  • 31. Reichler MR, Bangura J, Bruden D, Keimbe C, Duffy N, Thomas H, et al. Household transmission of Ebola virus: risks and preventive factors, Freetown, Sierra Leone, 2015. The Journal of Infectious Diseases. 2018;218(5):757–767. doi: 10.1093/infdis/jiy204
  • 32. Marsh Z, Grytdal S, Beggs J, Leshem E, Gastanaduy P, Rha B, et al. The unwelcome houseguest: secondary household transmission of norovirus. Epidemiology & Infection. 2018;146(2):159–167. doi: 10.1017/S0950268817002783
  • 33. Tsang TK, Chen TM, Longini IM Jr, Halloran ME, Wu Y, Yang Y. Transmissibility of norovirus in urban versus rural households in a large community outbreak in China. Epidemiology. 2018;29(5):675. doi: 10.1097/EDE.0000000000000855
  • 34. Hoang CQ, Nguyen TTT, Ho NX, Nguyen HD, Nguyen AB, Nguyen THT, et al. Transmission and serotype features of hand foot mouth disease in household contacts in Dong Thap, Vietnam. BMC Infectious Diseases. 2019;19(1):933. doi: 10.1186/s12879-019-4583-1
  • 35. Korpe PS, Gilchrist C, Burkey C, Taniuchi M, Ahmed E, Madan V, et al. Case-control study of cryptosporidium transmission in Bangladeshi households. Clinical Infectious Diseases. 2019;68(7):1073–1079. doi: 10.1093/cid/ciy593
  • 36. Banerjee E, Griffith J, Kenyon C, Christianson B, Strain A, Martin K, et al. Containing a measles outbreak in Minnesota, 2017: methods and challenges. Perspectives in Public Health. 2020;140(3):162–171. doi: 10.1177/1757913919871072
  • 37. Bi Q, Wu Y, Mei S, Ye C, Zou X, Zhang Z, et al. Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study. The Lancet Infectious Diseases. 2020;20(8):911–919. doi: 10.1016/S1473-3099(20)30287-5
  • 38. Li W, Zhang B, Lu J, Liu S, Chang Z, Cao P, et al. The characteristics of household transmission of COVID-19. Clinical Infectious Diseases. 2020. doi: 10.1093/cid/ciaa450
  • 39. Dietz K. The first epidemic model: a historical note on PD En’ko. Australian Journal of Statistics. 1988;30(1):56–65. doi: 10.1111/j.1467-842X.1988.tb00464.x
  • 40. Becker NG. Analysis of Infectious Disease Data. vol. 33. CRC Press; 1989.
  • 41. Greenwood M. On the statistical measure of infectiousness. Epidemiology & Infection. 1931;31(3):336–351. doi: 10.1017/s002217240001086x
  • 42. Giesecke J. Primary and index cases. The Lancet. 2014;384(9959):2024. doi: 10.1016/S0140-6736(14)62331-X
  • 43. Lau MS, Cowling BJ, Cook AR, Riley S. Inferring influenza dynamics and control in households. Proceedings of the National Academy of Sciences. 2015;112(29):9094–9099. doi: 10.1073/pnas.1423339112
  • 44. Panum PL, Petersen JJ. Observations Made During the Epidemic of Measles on the Faroe Islands in the Year 1846. Delta Omega Society; New York; 1940.
  • 45. Longini IM, Nizam A, Xu S, Ungchusak K, Hanshaoworakul W, Cummings DA, et al. Containing pandemic influenza at the source. Science. 2005;309(5737):1083–1087. doi: 10.1126/science.1115717
  • 46. Carrat F, Vergu E, Ferguson NM, Lemaitre M, Cauchemez S, Leach S, et al. Time lines of infection and disease in human influenza: a review of volunteer challenge studies. American Journal of Epidemiology. 2008;167(7):775–785. doi: 10.1093/aje/kwm375
  • 47. Kemper JT. Error sources in the evaluation of secondary attack rates. American Journal of Epidemiology. 1980;112(4):457–464. doi: 10.1093/oxfordjournals.aje.a113013
  • 48. Rampey AH Jr, Longini IM Jr, Haber M, Monto AS. A discrete-time model for the statistical analysis of infectious disease incidence data. Biometrics. 1992;48:117–128. doi: 10.2307/2532743
  • 49. Kenah E. Contact intervals, survival analysis of epidemic data, and estimation of R0. Biostatistics. 2011;12(3):548–566. doi: 10.1093/biostatistics/kxq068
  • 50. Kenah E. Non-parametric survival analysis of infectious disease data. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013;75(2):277–303. doi: 10.1111/j.1467-9868.2012.01042.x
  • 51. Kenah E. Semiparametric relative-risk regression for infectious disease transmission data. Journal of the American Statistical Association. 2015;110(509):313–325. doi: 10.1080/01621459.2014.896807
  • 52. Sharker Y, Kenah E. Pairwise accelerated failure time models for infectious disease transmission with external sources of infection. arXiv preprint arXiv:1901.04916. 2019.
  • 53. Bollobás B. Random Graphs. Springer; 1998.
  • 54. Gilbert EN. Random graphs. The Annals of Mathematical Statistics. 1959;30(4):1141–1144. doi: 10.1214/aoms/1177706098
  • 55. Van Rossum G, Drake FL. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace; 2009.
  • 56. R Core Team. R: A Language and Environment for Statistical Computing; 2020. Available from: https://www.R-project.org/.
  • 57. Kenah E. Pairwise survival analysis of infectious disease transmission data. In: Held L, Hens N, O’Neill PD, Wallinga J, editors. Handbook of Infectious Disease Data Analysis. CRC Press; 2019. p. 221–244.
  • 58. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986; p. 13–22. doi: 10.1093/biomet/73.1.13
  • 59. Cameron AC, Gelbach JB, Miller DL. Robust inference with multiway clustering. Journal of Business & Economic Statistics. 2011;29(2):238–249. doi: 10.1198/jbes.2010.07136
  • 60. Lipsitch M, Swerdlow DL, Finelli L. Defining the epidemiology of Covid-19 – studies needed. New England Journal of Medicine. 2020;382(13):1194–1196. doi: 10.1056/NEJMp2002125
  • 61. Morozova O, Cohen T, Crawford FW. Risk ratios for contagious outcomes. Journal of The Royal Society Interface. 2018;15(138):20170696. doi: 10.1098/rsif.2017.0696
  • 62. O’Neill PD, Roberts GO. Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society: Series A. 1999;162(1):121–129. doi: 10.1111/1467-985X.00125
  • 63. Yang Y, Longini IM Jr, Halloran ME. A resampling-based test to detect person-to-person transmission of infectious diseases. Annals of Applied Statistics. 2007;1:211–228. doi: 10.1214/07-AOAS105
  • 64. Yang Y, Longini IM Jr, Halloran ME, Obenchain V. A hybrid EM and Monte Carlo EM algorithm and its application to analysis of transmission of infectious diseases. Biometrics. 2012;68:1238–1249. doi: 10.1111/j.1541-0420.2012.01757.x
  • 65. Rojas DP, Dean NE, Yang Y, Kenah E, Quintero J, Tomasi S, et al. The epidemiology and transmissibility of Zika virus in Girardot and San Andres island, Colombia, September 2015 to January 2016. Eurosurveillance. 2016;21(28):30283. doi: 10.2807/1560-7917.ES.2016.21.28.30283
  • 66. Carey VJ. gee: Generalized Estimation Equation Solver. Ported to R by Thomas Lumley and Brian Ripley; 2019. Available from: https://CRAN.R-project.org/package=gee.
  • 67. Ushey K, Allaire J, Tang Y. reticulate: Interface to ‘Python’; 2020. Available from: https://CRAN.R-project.org/package=reticulate.
  • 68. Therneau TM. A Package for Survival Analysis in R; 2020. Available from: https://CRAN.R-project.org/package=survival.
  • 69. Kenah E, Yang Y. transtat: Statistical Methods for Infectious Disease Transmission; 2020. Available from: https://github.com/ekenah/transtat.
  • 70. Hagberg A, Swart P, Schult D. Exploring network structure, dynamics, and function using NetworkX. Los Alamos National Lab. (LANL), Los Alamos, NM (United States); 2008.
  • 71. Oliphant TE. A guide to NumPy. vol. 1. Trelgol Publishing; USA; 2006.
  • 72. McKinney W. Data Structures for Statistical Computing in Python. In: van der Walt S, Millman J, editors. Proceedings of the 9th Python in Science Conference; 2010. p. 56–61.
  • 73. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002. Available from: http://www.stats.ox.ac.uk/pub/MASS4.
  • 74. Zeileis A. Object-Oriented Computation of Sandwich Estimators. Journal of Statistical Software. 2006;16(9):1–16. doi: 10.18637/jss.v016.i09
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008601.r001

Decision Letter 0

Cecile Viboud, Virginia E Pitzer

8 Sep 2020

Dear Dr. Kenah,

Thank you very much for submitting your manuscript "Estimating and interpreting secondary attack risk: Binomial considered harmful" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Cecile Viboud

Associate Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Please see attachment for detailed comments

Reviewer #2: This manuscript highlights the risk of using binomial models to estimate the secondary attack risk in clusters and the importance of accounting for multiple generations of cases. This is a well-known problem for mathematical epidemiologists that, unfortunately, is much less clear to more traditional epidemiologists. I believe this manuscript deals with a very important issue and provides a clear and accessible way to understand it for traditional epidemiologists. The theoretical framework looks solid to me, while the analysis of the LACDPH household data has a key flaw that should be addressed (see below).

Detailed comments

When analyzing the LACDPH household data, the authors are assuming that all influenza infections will result in an acute febrile respiratory illness. There are several studies showing that the probability of developing fever after influenza infection is <30%; I would like to point the authors to Carrat et al, Am J Epidemiol, 2008 and references therein. Therefore, it is very likely that the LACDPH missed more than half of the influenza infections. I agree with the authors’ choice to keep the transmission model as simple as possible and not include other factors such as age-specific susceptibility to infection, age-specific infectiousness, etc. However, the probability of developing fever is so important to properly interpreting the LACDPH household data that it cannot be neglected in the simulation analysis. As such, all claims that some cases cannot be explained without re-importation of the infection into the household should be removed (e.g., lines 279, 346). Also, the definition of “late case” (line 276) should be revisited accordingly.

As for most infectious diseases, the duration of the infectious period of influenza is unknown and we have only rather indirect estimates of it. Therefore, I agree with the authors’ idea of exploring a wide range of values. However, I see two issues here. First, the list of explored values is way too large. Second, the distribution is clearly not uniform. We have a wide range of studies showing that the mean generation time of influenza is about 2-4 days. An infectious period of 12 days would be possible only if the transmission probability after the very first few days is extremely low. This should be clearly discussed, and I suggest using a more realistic (shorter) value for the infectious period in the baseline analysis and also decreasing the value of the upper bound, as 12 days appears to be highly unrealistic.

I think it would be very interesting to look at the FAR by household size in the LACDPH and provide a comparison with the obtained modeling results. In fact, I fear that the model may fail the comparison with the data in this respect. If so, this should be clearly stated and acknowledged as a study limitation possibly linked to the many additional factors that are not included in the simple models used here.
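To illustrate the kind of comparison suggested here, the following is a minimal sketch, not the authors' code. It assumes a Reed-Frost chain binomial model: one primary case, a fixed infectious period so that every infectious-susceptible pair has the same infectious contact probability p, and no infection from outside the household. The function name `expected_far` and the value p = 0.2 are hypothetical choices made only for this example. It computes the expected FAR for several household sizes so that the values can be set beside the SAR.

```python
from functools import lru_cache
from math import comb


def expected_far(m, p):
    """Expected final attack risk among m initial susceptibles.

    Assumes a Reed-Frost chain binomial: one primary case, per-pair
    infectious contact probability p (the SAR), no outside infection.
    """
    q = 1.0 - p  # probability of escaping one infectious household member

    @lru_cache(maxsize=None)
    def expected_infections(s, i):
        # Expected number of further infections given s susceptibles and
        # i infectious members in the current generation.
        if s == 0 or i == 0:
            return 0.0
        escape = q ** i  # escape all i infectious members this generation
        total = 0.0
        for k in range(s + 1):
            prob = comb(s, k) * (1.0 - escape) ** k * escape ** (s - k)
            total += prob * (k + expected_infections(s - k, k))
        return total

    return expected_infections(m, 1) / m


if __name__ == "__main__":
    p = 0.2  # hypothetical SAR used only for illustration
    for m in range(1, 7):
        print(f"m = {m}: SAR = {p:.3f}, expected FAR = {expected_far(m, p):.3f}")
```

With a single susceptible the FAR equals the SAR. With two susceptibles the same recursion gives p + p^2 - p^3, and the gap widens as the household grows, which is the pattern one would compare against the observed FAR by household size.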

Line 61-62. As stated before, the duration of the infectious period of influenza is unknown. The same applies to the latent period. What we do know are the lengths of the generation time (mean roughly in the range 2-4 days) and of the incubation period (mean roughly in the range 1-2 days). I strongly recommend rephrasing this sentence in terms of incubation period and generation time, which would better support the authors’ (correct) reasoning here.

Line 71-72, “so the binomial […] of the SAR”. I recommend dropping this part of the sentence.

Line 331. I suggest dropping the reproduction number from the list given here.

There are a few very minor English mistakes here and there (e.g., lines 47, 55, 57).

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Marco Ajelli

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: comments.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008601.r003

Decision Letter 1

Cecile Viboud, Virginia E Pitzer

2 Dec 2020

Dear Dr. Kenah,

We are pleased to inform you that your manuscript 'Estimating and interpreting secondary attack risk: Binomial considered biased' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Cecile Viboud

Associate Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008601.r004

Acceptance letter

Cecile Viboud, Virginia E Pitzer

15 Jan 2021

PCOMPBIOL-D-20-01278R1

Estimating and interpreting secondary attack risk: Binomial considered biased

Dear Dr Kenah,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Jutka Oroszlan

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. R [56] (https://www.r-project.org/) code used to analyze the household outbreak simulations in section.
    Produces Figs 1–4. Requires the following packages: Directions and package versions used for publication are in comments.

    (R)

    S2 File. Python 3 [55] (https://www.python.org) functions called by S1 File.
    Requires the following packages: Directions and package versions used for publication are in comments.

    (PY)

    S3 File. R [56] code used to analyze LACDPH household surveillance data in section.
    Produces Tables 2 to 4 and Figs 5–8. In addition to the packages listed in S1 File, the following packages are required: Directions and package versions used for publication are in comments.

    (R)

    S4 File. Python 3 [55] functions called by S3 File.

    Requires the NetworkX and pandas packages listed under S2 File: Directions and package versions used for publication are in comments.

    (PY)

    S5 File. De-identified LACDPH household surveillance data in CSV format used by S3 File.

    (CSV)

    Attachment

    Submitted filename: comments.pdf

    Attachment

    Submitted filename: Response.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
