PLOS Computational Biology. 2021 Jul 6;17(7):e1009182. doi: 10.1371/journal.pcbi.1009182

Sample size calculation for phylogenetic case linkage

Shirlee Wohl 1, John R Giles 1, Justin Lessler 1,*
Editor: Virginia E Pitzer 2
PMCID: PMC8284614  PMID: 34228722

Abstract

Sample size calculations are an essential component of the design and evaluation of scientific studies. However, there is a lack of clear guidance for determining the sample size needed for phylogenetic studies, which are becoming an essential part of studying pathogen transmission. We introduce a statistical framework for determining the number of true infector-infectee transmission pairs identified by a phylogenetic study, given the size and population coverage of that study. We then show how characteristics of the criteria used to determine linkage and aspects of the study design can influence our ability to correctly identify transmission links, in sometimes counterintuitive ways. We test the overall approach using outbreak simulations and provide guidance for calculating the sensitivity and specificity of the linkage criteria, the key inputs to our approach. The framework is freely available as the R package phylosamp, and is broadly applicable to designing and evaluating a wide array of pathogen phylogenetic studies.

Author summary

Sequencing the genetic material of viral and bacterial pathogens has become an important part of tracking and combating human infectious diseases. Specifically, comparing the pathogen DNA or RNA sequences collected from infected individuals can allow researchers and public health experts to determine who infected whom, or detect when a pathogen entered a specific country or geographic area. However, it is often impossible to collect samples from every single infected person, and these missing sequences can pose problems for this type of analysis, especially if there is some bias behind which samples were selected for sequencing. We have developed a mathematical framework that allows users to determine the probability their conclusions about pathogen transmission are correct given the number and proportion of samples from a pathogen outbreak they have sequenced. This framework is freely available, easy to use, and broadly generalizable to any pathogen, and we hope that it can be used to inform the design and sampling strategies behind future sequencing-based studies.

Introduction

As the cost of pathogen sequencing has declined, the number and size of studies based on pathogen sequence analysis have increased dramatically [1]. Traditionally, researchers have sequenced convenience samples collected as part of routine clinical or public health activities (e.g., diagnostic specimens collected as part of an outbreak response), or as part of studies where specimens are collected for other purposes. However, the analysis of pathogen genomic sequences is increasingly becoming a primary goal of both research studies and public health surveillance efforts [2–5]. This shift has been driven by the apparent utility of pathogen sequence data for understanding aspects of pathogen spread ranging from the frequency and source of introductions into a region [6–10], to identifying endogenous spread of emerging diseases [11,12], to understanding the role of “hotspots” in maintaining broader community epidemics [13], to understanding transmission patterns at an individual or “microscale” level [3,14].

Despite these many examples, there is a lack of clear and accessible guidance for appropriately designing and sizing studies aimed at understanding pathogen transmission, or for evaluating the design and conclusions of past studies. Without such guidance, it is difficult for researchers to design studies in a way that maximizes the chances of success, and difficult for reviewers to appropriately evaluate papers and grant applications centered around molecular or phylogenetic outcomes [15,16]. In particular, undersampling or biased sampling can lead to poorly supported inferences about patterns of disease spread [17,18]. While there are examples of researchers conducting careful a priori analyses of sampling strategies [19–21], these have largely relied on sophisticated techniques that are not broadly generalizable. Hence, there is a need for broadly accepted and accessible guidance for the selection of specimens for sequencing and phylogenetic analyses.

As noted above, pathogen sequences have been used to understand multiple aspects of infectious disease transmission at scales ranging from the global (e.g., movement of pathogens between countries) to the individual (e.g., reconstruction of individual transmission chains). Arguably, all such analyses can be reduced to the basic question of whether pairs of infected units (individuals, locations, etc.) are related or connected within a particular number of generations of transmission. Therefore, developing tools for assessing the number of sequences needed to confidently identify linked individuals (infections separated by no more than a specific number of generations of transmission) is a good place to start building a theory for power calculations for phylogenetic inference that can later be applied to questions at vastly different spatial or temporal scales. In this paper, we present a framework for making critical decisions about study design when the goal is to identify infector-infectee pairs, and we illustrate this approach with simulation studies.

Methods

General principles

In this paper we will focus on studies that aim to identify infector-infectee pairs from phylogenetic analysis of pathogen sequence data collected from infected individuals. We assume the study aims to achieve some level of certainty that identified infector-infectee pairs are correct, and may also require identification of some minimum number of pairs. Below we lay out a precise terminology (Table 1) and general principles.

Table 1. Parameters used in calculations and simulations.

Parameter Description
M Number of infections sampled
N Total number of (relevant) infected individuals in an outbreak
ρ Proportion of outbreak infections sampled (M/N)
η Sensitivity of the linkage criteria
χ Specificity of the linkage criteria
ϕ Probability that an identified link represents a true transmission event (1-False Discovery Rate)
R Reproductive number of a pathogen
Rpop Average reproductive number of a pathogen in a finite population (always <1)
μ Substitution rate of the pathogen (in substitutions observed per genome per transmission event)

To start, we define the term linkage criteria to represent all the criteria used to infer whether a set of infected individuals are linked to one another by direct transmission. The linkage criteria can be derived from a combination of genetic distance between pathogens isolated from different individuals, tree structure (e.g., clade support), and epidemiologic information (e.g., relative dates of symptom onset). We refer to infections inferred to be connected by transmission using these criteria as linked pairs. Some of these linked pairs will represent actual transmission events (true transmission pairs) and some will be false positives. We want to determine the sample size (M) and proportion of the population (ρ) required to recover a predetermined number of linked pairs, while keeping the false discovery rate (the proportion of these linked pairs that are false positives) below a predetermined threshold. When applied to a study where design was dictated by other factors (e.g., specimen availability), the same methods can be used to determine the false discovery rate, which will inform the confidence we have in any conclusions about disease transmission in that study.

To capture true transmission pairs, the infector and their partner infectee must both be in the sample. Therefore, correctly identifying direct transmission links (and, conversely, calculating the false discovery rate) depends on the sampling fraction (ρ), which is equal to the sample size (M) divided by the total number of infected individuals in the relevant population (N). Identification of these links will further depend on the sensitivity (η) and specificity (χ) of the criteria used to define linkage. We define sensitivity as the probability that the linkage criteria will identify a true transmission pair as a linked pair given that both the infector and infectee are in the sample. Similarly, the specificity is the probability that two infections not linked by transmission are not linked by the linkage criteria.

Here we show that, if we have reasonable estimates of the sampling fraction, sensitivity, and specificity, we can estimate the false discovery rate for a sample of size M. The relationship between these parameters can then be used to design studies with a sample size and sampling fraction that minimize the false discovery rate and therefore maximize our ability to draw inferences from identified infections.

Calculating sample size and false discovery rate

Multiple links and multiple true transmissions

In most transmission scenarios, we will be interested in linking an infected individual to both their infector and anyone they infect. Therefore, we must account for the fact that each infection in an outbreak may be linked by transmission to multiple other infections, only some of which may have been sampled. If the goal is to identify all true transmission pairs in the sample, the linkage criteria used must similarly allow for each infection to be linked to multiple other infections. Given this, we can calculate the probability of correctly identifying a true transmission pair, ϕ (equal to one minus the false discovery rate), as a function of just the sensitivity and specificity of the linkage criteria, the proportion sampled, and the sample size. Conceptually, this probability of correctly identifying a transmission pair is equal to the number of true positives (correctly identified true transmission pairs) divided by the total number of positives (linked pairs, regardless of true transmission status):

$$\phi = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Because we allow each infection to have multiple transmission partners, this probability will also depend on the average number of transmission links per infection, which is determined by the epidemiological parameter R, the expected number of other individuals each infected individual infects in a fully susceptible population. However, sampling infections over a finite period of time produces a bounded sampling frame, in which the average number of infectees per infector, denoted Rpop, may differ from R. This is because terminal nodes in the transmission network within this finite sampling frame have no known child infections inside the frame (though they may have child infections outside it) and therefore contribute an R value of zero, decreasing the average number of infectees per infector. In fact, Rpop must be less than 1 (see ‘Estimating the average reproductive number’ below). Because each infection is linked to, on average, Rpop infectees as well as its infector, each infection has on average Rpop+1 true transmission partners. If we assume that the number of transmission partners per infection is Poisson distributed, we get the following equation for the true discovery rate, ϕ (see S1 Text for full derivation):

$$\phi = \frac{\eta\rho(R_{pop}+1)}{\eta\rho(R_{pop}+1) + (1-\chi)\left(M - \rho(R_{pop}+1) - 1\right)} \tag{1}$$

Under the same assumptions, we show that the total number of sampled true pairs, E[number of true pairs], can be calculated as:

$$E[\text{number of true pairs}] = \frac{M\rho(R_{pop}+1)\eta}{2}$$

Through algebraic rearrangement of these equations we can determine the expected number of pairs observed in this sample, E[number of pairs observed]:

$$E[\text{number of pairs observed}] = \frac{M}{2}\left[\eta\rho(R_{pop}+1) + (1-\chi)\left(M - \rho(R_{pop}+1) - 1\right)\right]$$

These equations can be used to determine the false discovery rate (1−ϕ) and the expected number of linked pairs given particular linkage criteria, a sample size, and a sampling proportion. Additionally, we can use these equations to observe how the expected number of links and the true discovery rate vary with the proportion sampled and the sample size (Fig 1A). For a given sensitivity and specificity of the linkage criteria, we observe that the false discovery rate increases with sample size if the proportion sampled remains constant, suggesting that studies aimed at correctly identifying the highest proportion of transmission links should prioritize sampling proportion over an arbitrary number of samples. Additionally, the relationship between false discovery rate and sampling proportion depends on the sample size needed to obtain that sampling proportion, such that the impact of sampling proportion increases with sample size. We also observe the effects of changing sensitivity and specificity on the false discovery rate and find that the specificity of the linkage criteria is of key importance when attempting to minimize the false discovery rate of transmission pairs (Fig 1B).
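The three expressions above can be transcribed into a few lines of code. The sketch below is in Python purely for illustration (the authors' implementation is the R package phylosamp); the function names and the input values are our own assumptions, not taken from the paper or the package.

```python
def true_discovery_rate(eta, chi, rho, M, R_pop=1.0):
    """Eq 1: probability that a linked pair is a true transmission pair."""
    true_links = eta * rho * (R_pop + 1)                   # expected true links per infection
    false_links = (1 - chi) * (M - rho * (R_pop + 1) - 1)  # expected false links per infection
    return true_links / (true_links + false_links)


def expected_true_pairs(eta, rho, M, R_pop=1.0):
    """Expected number of true transmission pairs identified in the sample."""
    return M * rho * (R_pop + 1) * eta / 2


def expected_observed_pairs(eta, chi, rho, M, R_pop=1.0):
    """Expected number of linked pairs (true and false) in the sample."""
    return (M / 2) * (eta * rho * (R_pop + 1)
                      + (1 - chi) * (M - rho * (R_pop + 1) - 1))


# Illustrative inputs only (assumed values, not from the paper).
eta, chi, rho = 0.9, 0.999, 0.5
for M in (50, 200, 800):
    fdr = 1 - true_discovery_rate(eta, chi, rho, M)
    pairs = expected_observed_pairs(eta, chi, rho, M)
    print(f"M={M:4d}  FDR={fdr:.3f}  expected linked pairs={pairs:.1f}")
```

Holding ρ fixed while increasing M grows only the false-link term, which is why the false discovery rate rises with sample size in this sketch.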

Fig 1. Sample size and false discovery rate given multiple linkage and multiple transmissions.

(A) Effect of sample size (red lines) or proportion sampled (blue lines) on the expected number of linked pairs (upper plots) or the false discovery rate of linked pairs (lower plots). The specificity and sensitivity are held constant. (B) Effect of varying the sensitivity and specificity of the linkage criteria on the false discovery rate (FDR). White dots: theoretical sensitivity and specificity values at different genetic distance thresholds (1–10 substitutions between infections; leftmost white dot represents a threshold of 1 substitution) for a hypothetical pathogen with substitution rate = 1 substitution/genome/transmission and R = 2 (see ‘Determining sensitivity and specificity’ below for details). In both panels, Rpop = 1.

Single link and single true transmission

We can also derive the relationship between the sample size and false discovery rate for the special case where each infection is the transmission pair of exactly one other sample, relevant when we are only interested in identifying the correct infector of a given infection. In this case, the linkage criteria will similarly identify exactly one probable link for each infection [15]. These assumptions about transmission simplify the relationship between sample size and false discovery rate. Here, we calculate the false discovery rate for transmission pairs under these assumptions (see S1 Text for full derivation).

The probability of correctly identifying a true transmission pair (ϕ) under the assumptions of single transmission and single linkage is:

$$\phi = \frac{\eta\rho}{\eta\rho + \left(1-\chi^{M-2}\right)(1-\eta)\rho + \left(1-\chi^{M-1}\right)(1-\rho)} \tag{2}$$

Under the same assumptions, we can also calculate the expected total number of true transmission pairs that will be identified in our sample, E[number of true pairs], as:

$$E[\text{number of true pairs}] = \frac{M}{2}\eta\rho$$

Through algebraic rearrangement of these equations, we can determine the expected number of linked pairs (identified with the linkage criteria) observed in this sample (E[number of pairs observed]):

$$E[\text{number of pairs observed}] = \frac{M}{2}\left[\eta\rho + \left(1-\chi^{M-2}\right)(1-\eta)\rho + \left(1-\chi^{M-1}\right)(1-\rho)\right]$$

As in the multiple links and multiple transmissions case, we observe that the false discovery rate increases with the sample size, but decreases with the proportion sampled. We also again see the important effect of the specificity of the linkage criteria on the false discovery rate (S1 Fig). The relationships between these parameters and our ability to correctly identify transmission links are clearly robust to transmission model specification.
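For comparison, the single-link, single-transmission expressions can be sketched the same way. This Python transcription of Eq 2 is illustrative only: the function names are ours, the inputs are assumed, and the comments describing each term are our reading of the equation rather than the S1 Text derivation.

```python
def true_discovery_rate_single(eta, chi, rho, M):
    """Eq 2: true discovery rate under single link and single transmission."""
    true_link = eta * rho
    # Partner sampled but missed, and at least one of the other M-2 samples falsely linked.
    false_if_missed = (1 - chi ** (M - 2)) * (1 - eta) * rho
    # Partner not sampled, and at least one of the other M-1 samples falsely linked.
    false_if_unsampled = (1 - chi ** (M - 1)) * (1 - rho)
    return true_link / (true_link + false_if_missed + false_if_unsampled)


def expected_true_pairs_single(eta, rho, M):
    """Expected number of true transmission pairs identified."""
    return (M / 2) * eta * rho


def expected_observed_pairs_single(eta, chi, rho, M):
    """Expected number of linked pairs (true and false) identified."""
    return (M / 2) * (eta * rho
                      + (1 - chi ** (M - 2)) * (1 - eta) * rho
                      + (1 - chi ** (M - 1)) * (1 - rho))


# Illustrative inputs only (assumed values, not from the paper).
for M in (50, 200, 800):
    fdr = 1 - true_discovery_rate_single(eta=0.9, chi=0.999, rho=0.5, M=M)
    print(f"M={M:4d}  FDR={fdr:.3f}")
```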

Estimating the average reproductive number

In the previous section, we distinguished R, the basic reproductive number of a pathogen, from Rpop, the average reproductive number in a bounded sampling frame. This is an important distinction because we can show that the average reproductive number (Rpop) is at most one. This is because any sampling frame contains a finite number of infected individuals, and individuals on terminal nodes of the captured transmission chain have not, by definition, infected any other individuals within the sampling frame (though they may have passed the infection to others outside the finite sample). Averaging the R value from these terminal nodes (which is zero, because they are terminal nodes) with the R value from all other nodes is what allows the Rpop average to drop below one, even when the true value of R is significantly greater than one. In other words, there will always be more infections (at minimum, all infectees in a transmission chain plus a single index case) than infection events (see S2A Fig). Hence, Rpop, which is equal to the number of actual transmission events divided by the number of infections, will be at most one.

In epidemic situations where there is a single introduction, Rpop will be close to one, as the number of infections will exceed the number of infection events by precisely one. In situations where there are multiple introductions (e.g., transmission chains that are persistently seeded from sources outside the sampling frame) then Rpop may be substantially less than one (S2B Fig). Specifically:

$$R_{pop} = \frac{\text{cases} - \text{introductions}}{\text{cases}}$$

The examples shown in this paper focus on epidemics seeded by a single introduction, where Rpop is approximately equal to one.
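As a small numerical illustration of this relationship, the Python sketch below computes Rpop from counts of cases and introductions. The function name and the example counts are hypothetical.

```python
def r_pop(cases, introductions):
    """Average reproductive number within a bounded sampling frame."""
    if cases <= 0 or not 1 <= introductions <= cases:
        raise ValueError("need cases > 0 and 1 <= introductions <= cases")
    # Every case that is not an introduction corresponds to one transmission event.
    return (cases - introductions) / cases


# Hypothetical counts: one introduction versus repeated seeding from outside.
print(r_pop(cases=1500, introductions=1))    # just under 1
print(r_pop(cases=1500, introductions=300))  # substantially below 1
```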

Determining sensitivity and specificity

In the framework presented here, the sensitivity and specificity of the linkage criteria are needed to estimate the false discovery rate from sample size and vice versa. These criteria can be based on a number of phylogenetic and epidemiological metrics, and may depend on the data available for a particular study. In this section, we outline two methods for approximating the sensitivity and specificity of a simple genomic metric: genetic distance.

Both methods involve determining these parameters from the discrete distributions of genetic distances between linked and unlinked infections, but they differ in how these distributions are obtained. Given the distributions, we can consider a number of different genetic distance thresholds (e.g., 1 or 2 mutations observed between sequences) that could be used as the criterion for differentiating between linked and unlinked pairs, and we can calculate the sensitivity and specificity at each. The optimal threshold and its associated sensitivity and specificity can be selected in a variety of ways [22–25] based on the specific study goals.

Below, we describe two ways to obtain the genetic distance distributions of linked and unlinked infection pairs for a hypothetical pathogen with R = 2 and a substitution rate (μ) of 1 substitution per genome per generation. We use the substitution rate rather than the pathogen mutation rate because our method concerns mutations observed between pathogen transmission events. We then use these genetic distance distributions to determine sensitivity and specificity, and ultimately to calculate the false discovery rate given a specific sample size and proportion. Here and henceforth, “generation” refers to a generation of transmission (not viral replication time).

Empirical method

One way to estimate the relevant genetic distance distributions is to use existing data. Specifically, we need a subsample of infections for which sequencing data is available and for which we have a high degree of confidence, based on epidemiological data, in the true transmission relationships between included infections (for example, infected individuals who share a household versus community members with no known relationship). We can compute the genetic distance between every pair of pathogen sequences from this subsample and use the results to approximate the underlying genetic distance distributions between linked and unlinked infections in the population.

We illustrate this method on a simulated outbreak of approximately 1500 infections (data available at https://github.com/HopkinsIDD/phylosamplesize), created using the outbreaker R package [26,27] (see ‘Outbreak simulations’ below). To create our known subsample, we selected a small number of infections from early in the outbreak and extracted their true transmission links and simulated genomes. We then calculated the genetic distance matrix of sequences in this subsample and determined the genetic distance distributions for linked and unlinked infection pairs (Fig 2A). Next, we estimated the sensitivity and specificity at every mutation threshold (0 mutations, 1 mutation, etc.) and used the point on the receiver operating characteristic (ROC) curve closest to the (0,1) corner to determine the optimal threshold for differentiating between linked and unlinked infections. In this case, the optimal threshold was 3 mutations, which had a sensitivity of 0.95 and a specificity of 0.88.
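The threshold selection described above can be sketched as follows. This Python example is illustrative only: the distance lists are made up, the function names are ours, and we assume a pair is called linked when its genetic distance is at or below the threshold (the text does not state whether the comparison is inclusive).

```python
import math


def sens_spec_by_threshold(linked, unlinked, max_threshold):
    """Sensitivity and specificity at each genetic distance threshold.

    linked / unlinked: genetic distances for pairs with known status.
    Assumption: a pair is called linked when distance <= threshold.
    """
    rows = []
    for t in range(max_threshold + 1):
        sens = sum(d <= t for d in linked) / len(linked)
        spec = sum(d > t for d in unlinked) / len(unlinked)
        rows.append((t, sens, spec))
    return rows


def optimal_threshold(rows):
    """Row closest to the ROC corner (false positive rate 0, sensitivity 1)."""
    return min(rows, key=lambda r: math.hypot(1 - r[2], 1 - r[1]))


# Made-up distances for a small subsample with known transmission links.
linked = [0, 1, 1, 2, 2, 3]
unlinked = [2, 4, 5, 5, 6, 7, 8, 9]

rows = sens_spec_by_threshold(linked, unlinked, max_threshold=9)
best = optimal_threshold(rows)
print("threshold, sensitivity, specificity:", best)
```

Distance to the ROC corner is only one of several ways to pick a threshold; a study that prioritizes a low false discovery rate might instead weight specificity more heavily.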

Fig 2. Determining the sensitivity and specificity of a genetic distance threshold.

(A) Empirical distribution of genetic distances for linked (purple) and unlinked (yellow) infections for 50 infections selected from early in a simulated outbreak (μ = 1 substitution/genome/generation, R = 2). Inset: receiver operating characteristic (ROC) for all possible genetic distance thresholds. Optimal threshold shown as green dot (ROC) and dashed vertical line (distribution). (B) Estimated distribution of genetic distances for linked and unlinked infections generated by the substitution rate method. Parameters and plots are as in (A).

Substitution rate method

Observed pathogen substitution rates can also be used to estimate the genetic distance distributions, especially when a subsample of infections with known transmission histories is not available. If we assume that the number of mutations observed between two linked infections is Poisson distributed around the substitution rate and that we know the distribution of the number of generations between infections in the population, the probability of observing a specific genetic distance (d) between the sequences from any two infected individuals linked by transmission is:

$$\frac{1}{\sum_{i=1}^{g_{link}} g(i)} \sum_{i=1}^{g_{link}} g(i)\, f(d; i\mu) \tag{3}$$

where g(i) is the probability of observing i generations between infections, g_link is the maximum number of generations between infections considered linked, f(d;iμ) is the probability of observing d mutations between two infections separated by i generations, and μ is the substitution rate per genome per generation (see S2 Text).

Similarly, the probability of observing a genetic distance d between two infections not linked by transmission is:

$$\frac{1}{\sum_{i=g_{link}+1}^{g_{max}} g(i)} \sum_{i=g_{link}+1}^{g_{max}} g(i)\, f(d; i\mu) \tag{4}$$

where g_max is the maximum number of generations considered.

Since we assume that the number of substitutions between two linked infections is Poisson distributed, f(d;iμ) is simply the probability mass function of a Poisson distribution with mean i×μ, evaluated at d. Determining the distribution of generations between infections, however, is a non-trivial task [28–30] and depends on several factors, including the shape of the epidemic and the period of time from which infections are sampled (S3 Fig). In the examples included herein, we use simulations to empirically approximate this distribution (see S2 Text), but it is likely that adequate approximations can be obtained by other means, or that more sophisticated approaches can be employed to directly estimate the necessary genetic distance distributions [31].
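To make Eqs 3 and 4 concrete, the Python sketch below evaluates both mixtures for a made-up generation distribution and derives sensitivity and specificity at one threshold. The generation probabilities, the threshold, and the convention that a pair is linked when its distance is at or below the threshold are all assumptions for illustration; they are not the distributions used in the paper.

```python
import math


def poisson_pmf(d, mean):
    """Probability of observing d events under a Poisson distribution."""
    return math.exp(-mean) * mean ** d / math.factorial(d)


def distance_pmf(d, g, mu, first_gen, last_gen):
    """Eqs 3 and 4: Poisson mixture over generations first_gen..last_gen.

    g maps a number of generations i to its probability g(i).
    """
    gens = range(first_gen, last_gen + 1)
    total = sum(g[i] for i in gens)
    return sum(g[i] * poisson_pmf(d, i * mu) for i in gens) / total


# Hypothetical generation distribution (not from the paper).
g = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.30, 5: 0.35}
mu, g_link, g_max = 1.0, 1, 5
threshold = 2  # assumed: pairs with distance <= threshold are called linked

sens = sum(distance_pmf(d, g, mu, 1, g_link) for d in range(threshold + 1))
spec = 1 - sum(distance_pmf(d, g, mu, g_link + 1, g_max)
               for d in range(threshold + 1))
print(f"sensitivity={sens:.3f}  specificity={spec:.3f}")
```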

Given the approximate generation distribution between infections, we calculated the genetic distance distributions for linked and unlinked infections for the pathogen described above. The optimal genetic distance threshold for distinguishing between linked and unlinked infections was 4 mutations (sensitivity = 0.98, specificity = 0.99) (Fig 2B). The empirical and substitution rate methods result in a similar, but not identical, optimal threshold for the pathogen in this example, likely due to sparse sampling in the empirical case.

Regardless of which method we choose, we can use the sensitivity and specificity values to calculate the probability of correctly identifying a true transmission pair (ϕ) for this pathogen. We use Eq 1, allowing for each infection to have multiple transmission partners. We will also assume that we are able to sample 50% of the cases in this hypothetical outbreak of 1500 infections:

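$$\phi = \frac{\eta\rho(R_{pop}+1)}{\eta\rho(R_{pop}+1) + (1-\chi)\left(M - \rho(R_{pop}+1) - 1\right)} = \frac{0.98 \times 0.5 \times (1+1)}{0.98 \times 0.5 \times (1+1) + (1-0.99)\left(750 - 0.5 \times (1+1) - 1\right)} = 0.116$$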

We note that, despite a reproductive number (R) of 2, a single introduction into this outbreak means we should use Rpop = 1. Given our assumptions, we find that under 12% of our inferred linked infections (using a genetic distance threshold of 4 mutations) are likely to reflect true transmission relationships. A higher specificity is needed to achieve more confidence in direct transmission links; such specificity can be attained for pathogens that incur a significant number of mutations between infections considered linked [32]. For pathogens that do not meet these criteria (as in the example here), it may not be possible to use genetic distance alone to distinguish between linked and unlinked infections (S4 Fig).
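The arithmetic of this example, and the size of the specificity improvement it calls for, can be checked with a short Python sketch. The rearrangement of Eq 1 for χ is ours, and the target true discovery rate of 0.9 is an assumed value chosen for illustration.

```python
def true_discovery_rate(eta, chi, rho, M, R_pop):
    """Eq 1 (multiple links, multiple transmissions)."""
    true_links = eta * rho * (R_pop + 1)
    false_links = (1 - chi) * (M - rho * (R_pop + 1) - 1)
    return true_links / (true_links + false_links)


def required_specificity(phi_target, eta, rho, M, R_pop):
    """Eq 1 rearranged for chi (our algebra, not a phylosamp function)."""
    true_links = eta * rho * (R_pop + 1)
    other_samples = M - rho * (R_pop + 1) - 1
    return 1 - true_links * (1 - phi_target) / (phi_target * other_samples)


# Values from the worked example: 50% of 1500 infections sampled, R_pop = 1.
phi = true_discovery_rate(eta=0.98, chi=0.99, rho=0.5, M=750, R_pop=1)
print(f"phi = {phi:.3f}")

# Assumed target: 90% of linked pairs should be true transmission pairs.
chi_needed = required_specificity(0.9, eta=0.98, rho=0.5, M=750, R_pop=1)
print(f"specificity needed for phi = 0.9: {chi_needed:.5f}")
```

Under these inputs the required specificity is above 0.9998, which shows how little room there is for false links once each sampled infection can be compared against hundreds of others.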

Outbreak simulations

We used outbreak simulations to validate our approach. We simulated outbreaks using the ‘simOutbreak’ function implemented in the outbreaker R package [26]. For all simulations we assumed a large number of susceptible individuals in the population (n.hosts = 100,000), a genome length of 1,000 nucleotides, and no importation events (single source outbreak). We also assumed every infected individual transmitted their infection exactly one time step after infection, and ran the simulation for the number of generations needed to achieve a final outbreak size of approximately 1,000 infections (ln(1000)/ln(R)). We discarded simulations with an outbreak size of less than 100 or more than 2000 infected individuals; these discarded simulations did not count towards the total number of simulations for a given set of parameters. After simulating the source population, we randomly selected a predetermined proportion of infections from that population.
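The run length ln(1000)/ln(R) used above is easy to tabulate. The Python sketch below evaluates it for a few reproductive numbers; the function name is ours, and the result is left unrounded because the text does not say how fractional generations were handled.

```python
import math


def generations_for_size(R, target_size=1000):
    """Generations g such that R**g equals the target outbreak size."""
    return math.log(target_size) / math.log(R)


for R in (1.3, 2, 18):
    print(f"R={R}: about {generations_for_size(R):.1f} generations")
```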

For each sampling proportion, we simulated outbreaks over a variety of substitution rates and reproductive numbers. We allowed the substitution rate to vary between 0.0001–4 mutations per genome per generation, and allowed the reproductive number to vary between 1.3–18. We chose these ranges to encompass substitution rates [33,34] and reproductive numbers [35] observed in actual human pathogens, and set the transition rate to be equal to the transversion rate for the purposes of this simulation. We note that, while pathogens can have reproductive numbers below 1.3, this was the minimum value that produced enough outbreaks with greater than 100 individuals in a reasonable amount of time. We divided each parameter range into 100 discrete values and ran simulations with all combinations of substitution rate and reproductive number, for a total of 10,000 simulations for each sampling proportion. We required simulated outbreaks to contain at least 100 and no more than 2000 infections for analysis. Validation plots were made in R using ggplot2 [36], and smoothed conditional means were calculated with the geom_smooth function from this package.

Implementation

Functions for calculating the false discovery rate for a specific sample size or proportion are implemented in the R package phylosamp, freely available at: https://github.com/HopkinsIDD/phylosamp. This package also includes functions for calculating the necessary sample size based on a desired false discovery rate (inverse of Eqs 1 and 2), and functions to estimate the number of transmission pairs that will be observed given a sample size and a set of assumptions (e.g., multiple links and multiple transmissions, single link and single transmission, etc.). We also provide generation distributions for values of R between 1.3–18, derived from the simulations described in S2 Text.

Applications to existing datasets

We used the phylosamp package to apply our method to an existing mumps virus dataset. We converted the reported substitution rate of 4.76×10⁻⁴ substitutions/site/year [37] to 0.36 substitutions/genome/generation as follows:

$$\frac{4.76\times10^{-4}\ \text{substitutions}}{\text{site}\cdot\text{year}} \times \frac{15384\ \text{sites}}{\text{genome}} \times \frac{1\ \text{year}}{365\ \text{days}} \times \frac{18\ \text{days}}{\text{generation}} = 0.36\ \text{subs/genome/generation}$$

We used a sampling proportion of 0.93, which is the fraction of samples from patients affiliated with Harvard University (71) that resulted in complete genomes. We also noted that the original mumps manuscript reports multiple lineages circulating within Harvard University, which would reduce the average reproductive number (Rpop) used to calculate the true discovery rate. However, decreasing this value again only decreases confidence in identified links, so we used Rpop = 1 to again calculate the upper bound of this estimate.

When applying the methods to a hypothetical SARS-CoV-2 outbreak, we converted a substitution rate of 24.896 substitutions/genome/year [38–40] to 0.34 substitutions/genome/generation using a generation time of 5 days [41]. The samplesize function in the phylosamp package gave the following error message when used with the optimum sensitivity and specificity (along with an outbreak size of 120 and true discovery rate of 0.9), indicating no amount of sampling would lead to high confidence in identified links: “Input values do no produce a viable solution.”
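Both unit conversions in this section follow the same pattern, sketched below in Python. The function name is ours; the inputs are the values quoted in the text, with the SARS-CoV-2 rate entered per genome (genome length set to 1) because it is already reported per genome per year.

```python
def subs_per_genome_per_generation(rate_per_site_per_year, genome_length,
                                   generation_days, days_per_year=365):
    """Convert a per-site, per-year substitution rate to per genome, per generation."""
    return rate_per_site_per_year * genome_length * generation_days / days_per_year


# Mumps virus: 4.76e-4 subs/site/year, 15384 sites, 18-day generation time.
mumps = subs_per_genome_per_generation(4.76e-4, 15384, 18)

# SARS-CoV-2: 24.896 subs/genome/year, 5-day generation time.
sars_cov_2 = subs_per_genome_per_generation(24.896, 1, 5)

print(f"mumps: {mumps:.2f}  SARS-CoV-2: {sars_cov_2:.2f}")
```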

Results

Method performance with known sensitivity and specificity

We used simulated outbreaks to validate the relationship between sample size and false discovery rate using genetic distance as our linkage criteria. We subsampled each outbreak and, using the known transmission relationships and genetic distances between simulated infections, calculated the false discovery rate at each possible genetic distance threshold in the subsample (“simulated FDR”). For each simulation (before subsampling), we also calculated the actual specificity and sensitivity at every relevant genetic distance threshold. We used these values and the observed Rpop (roughly equal to one in most simulations) to then calculate the theoretical false discovery rate at a particular sampling proportion using Eq 1. We find that the theoretical false discovery rate is consistent with the simulated value for a wide array of pathogen substitution rates and reproductive numbers (Fig 3).
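The simulated false discovery rate for a subsample can be computed directly from the known transmission status of each pair. The Python sketch below is a minimal illustration with made-up pairs; the function name is ours, and we assume that pairs with genetic distance at or below the threshold are called linked.

```python
def simulated_fdr(pairs, threshold):
    """False discovery rate among pairs called linked at a distance threshold.

    pairs: iterable of (genetic_distance, is_true_transmission_pair).
    Assumption: a pair is called linked when distance <= threshold.
    Returns None when no pair is called linked (the rate is undefined).
    """
    called = [is_true for dist, is_true in pairs if dist <= threshold]
    if not called:
        return None
    false_positives = sum(1 for is_true in called if not is_true)
    return false_positives / len(called)


# Made-up subsample: (distance, true transmission pair?)
pairs = [(0, True), (1, True), (1, False), (2, True), (3, False), (5, False)]
for t in range(4):
    print(t, simulated_fdr(pairs, t))

# A subsample that happens to contain no true pairs but some linked pairs.
print(simulated_fdr([(1, False), (3, False)], threshold=1))
```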

Fig 3. Predicted versus observed false discovery rate in outbreak simulations.

Theoretical versus simulated false discovery rate (FDR) for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number (approximately 260,000 points per plot, see Tables 2 and 3). Outbreak sizes range from 100–2000, as described in Methods. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color; because the maximum outbreak size is fixed at 2000, the maximum sample size differs for each sampling proportion. Increasing both the sample size and proportion reduces bias and error, see Tables 2 and 3.

Overall, the bias of our estimate of the false discovery rate approached zero for all sampling proportions (Table 2). The average error was less than 0.04 for each sampling proportion (i.e., averaged over all simulations, the false discovery rate estimate is off by less than 0.04), decreasing significantly with increased sample size or proportion sampled (Tables 3 and S1). We note that special care should be taken with low sample sizes and low theoretical false discovery rates, as error rates can be particularly high in this range. Additionally, while our method is an unbiased estimator and overall correct in expectation, it is always possible for performance in a particular set of individuals sampled from a population to deviate substantially from expectation. As an example, in a small fraction of simulations, there were by chance no true transmission links (or, in some cases, no false positives) in our subsample. This fixes the simulated false discovery rate at 1 (or 0, when there are no false positives), which may not be representative of the overall relationship between sample size and false discovery rate and highlights how the specific infections sampled can affect results, particularly when sample sizes are low.

Table 2. Bias of calculated false discovery rate for simulations with fixed sampling proportion.

FDR range | ρ = 0.10 | ρ = 0.25 | ρ = 0.50 | ρ = 0.75 | All ρ values | N
FDR = 0.00–0.25 | -0.0006 | 0.0045 | 0.0001 | 0.0036 | 0.0022 | 17,900
FDR = 0.25–0.50 | 0.0044 | 0.0045 | 0.0009 | 0.0032 | 0.0032 | 31,633
FDR = 0.50–0.75 | 0.0064 | 0.0039 | 0.0006 | 0.001 | 0.0029 | 51,069
FDR = 0.75–1.00 | 0.0001 | 0.0001 | <0.0001 | <0.0001 | 0.0001 | 965,125
All FDR values | 0.0005 | 0.0005 | 0.0001 | 0.0002 | 0.0003 | 1,065,727
N | 261,360 | 267,239 | 268,900 | 268,228 | 1,065,727 |

Table 3. Error of calculated false discovery rate for simulations with fixed sampling proportion.

FDR range | ρ = 0.10 | ρ = 0.25 | ρ = 0.50 | ρ = 0.75 | All ρ values | N
FDR = 0.00–0.25 | 0.2135 | 0.1359 | 0.0799 | 0.0401 | 0.098 | 17,900
FDR = 0.25–0.50 | 0.2751 | 0.1583 | 0.079 | 0.0416 | 0.1275 | 31,633
FDR = 0.50–0.75 | 0.2057 | 0.0979 | 0.0478 | 0.0259 | 0.092 | 51,069
FDR = 0.75–1.00 | 0.0155 | 0.0069 | 0.0035 | 0.002 | 0.007 | 965,125
All FDR values | 0.032 | 0.0181 | 0.0097 | 0.0052 | 0.0161 | 1,065,727
N | 261,360 | 267,239 | 268,900 | 268,228 | 1,065,727 |

To better understand why the error rate of our estimator increases as the false discovery rate decreases, we stratified the simulation data by the sensitivity and specificity given a particular genetic distance threshold. We found that the error is highest when sensitivity is low and specificity is high (S5A and S5B Fig), which occurs when a low (stringent) genetic distance threshold is used, so that few pairs are called linked. This combination often produces low false discovery rates, but is highly dependent on sampling (namely, whether any true positives or false positives are sampled). This leads to highly variable simulated false discovery rates and consequently higher error rates. Unsurprisingly, this analysis also highlights that a discrete threshold like genetic distance produces a limited number of possible sensitivity and specificity combinations (S5C and S5D Fig). Therefore, obtaining reasonable estimates for these values in tandem is of key importance when using our method to estimate the false discovery rate of a phylogenetic study.

Method performance with estimated sensitivity and specificity

We repeated the false discovery rate comparison described above, but instead of using the actual sensitivity and specificity observed in each simulation, we calculated these parameters from the substitution rate used to generate that simulated outbreak (Fig 4). To reduce reliance on simulation data to calculate necessary parameters, we used Rpop = 1 rather than the empirical value.

Fig 4. Validation of substitution rate method to calculate sensitivity and specificity.

Theoretical versus simulated false discovery rate (FDR) for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number (approximately 260,000 points per plot, see Tables 2 and 3). Outbreak sizes range from 100–2000, as described in Methods. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color; increasing both the sample size and proportion reduces bias and error, see S2 and S3 Tables.

Under this more realistic set of assumptions, we observe a slight bias, though overall values remain less than one percent (S2 and S3 Tables). However, although the bias is very low on average, it is greater when the theoretical false discovery rate is low, with estimates falling on average nearly 8% from the simulated value when the predicted false discovery rate is below 25%. Average error rates were also slightly higher, but remained less than 4% overall. Despite these trends, the vast majority of false discovery rate estimates (as well as sensitivity and specificity estimates) fall very close to their true values (Fig 5). This observation also holds when examining only the optimal genetic distance threshold (chosen with the closest-to-(0,1)-corner method, as described in Methods) (S6 Fig) rather than the estimates at all thresholds shown in Figs 4 and 5.
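The closest-to-(0,1)-corner rule mentioned above can be sketched in a few lines of Python: among candidate thresholds, choose the one whose ROC point (1 - specificity, sensitivity) has the smallest Euclidean distance to the corner (0, 1). The function name and the ROC values below are hypothetical and serve only to show the selection step.

```python
import math


def closest_to_corner(roc_points):
    """Pick the threshold whose ROC point is nearest the (0, 1) corner.

    roc_points: list of (threshold, sensitivity, specificity) tuples.
    The ROC point is (1 - specificity, sensitivity), so its distance to
    (0, 1) is hypot(1 - specificity, 1 - sensitivity).
    """
    return min(roc_points, key=lambda p: math.hypot(1.0 - p[2], 1.0 - p[1]))


# Hypothetical (threshold, sensitivity, specificity) values
toy_roc = [(1, 0.70, 0.99), (2, 0.95, 0.95), (3, 0.99, 0.80)]

best_threshold, best_sens, best_spec = closest_to_corner(toy_roc)
```

Other cut-point rules, such as maximizing the Youden index, can select a different threshold from the same ROC curve, so the choice of rule is itself a design decision.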

Fig 5. Histogram of raw parameter error using substitution rate method.

Theoretical minus simulated parameter values for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number for a given sampling proportion (see Fig 4). Top row: theoretical minus simulated false discovery rate; middle row: theoretical minus simulated sensitivity; bottom row: theoretical minus simulated specificity. Colors correspond to sampling proportion as in Fig 4.

Given that correct sensitivity and specificity values are an important component of calculating the theoretical false discovery rate, we looked at the specific estimates for these parameters generated by our substitution rate method. When considering only direct transmissions as linked (as we do throughout these simulations), Eq 3 reduces to a Poisson distribution with mean equal to the substitution rate, resulting in highly accurate and precise sensitivity estimates (Figs 5 and S7). However, we find that our estimates for specificity have a positive bias regardless of sample size or proportion (Figs 5 and S8 and S9). We hypothesized that inaccuracies in the estimated specificity cause the bias observed in the false discovery rate estimate, and that these inaccuracies were due to the distribution of generations between infections used in our calculation; as discussed in Methods, this is a non-trivial distribution that we estimated by averaging over many simulations (see S2 Text for details).
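To make the Poisson simplification concrete, the sketch below computes sensitivity as the probability that a directly linked pair differs by fewer mutations than the cutoff, with the number of mutations drawn from a Poisson distribution whose mean is the per-genome, per-generation substitution rate. The strict "fewer than the cutoff" convention and the function names are assumptions for illustration. With the rates used later in the text (0.36 and 0.34) and a cutoff of two mutations, this sketch gives values that round to 0.95.

```python
import math


def prob_distance_below(cutoff, mean):
    """P(X < cutoff) for X ~ Poisson(mean), summed term by term."""
    return sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(cutoff))


def sensitivity(cutoff, mut_rate):
    """Probability that a direct transmission pair is called linked.

    Assumption: direct pairs are one generation apart, so their genetic
    distance is Poisson with mean mut_rate.
    """
    return prob_distance_below(cutoff, mut_rate)


sens_rate_036 = sensitivity(cutoff=2, mut_rate=0.36)
sens_rate_034 = sensitivity(cutoff=2, mut_rate=0.34)
```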

To test this hypothesis, we used the actual distribution of generations between infections from each simulation in our calculation of specificity (sensitivity estimates are unaffected by this distribution when considering only direct transmissions, as described above). We find that this does in fact reduce bias in our specificity estimates (Fig 6) and leads to largely unbiased (<2%) estimates of the false discovery rate, even at low theoretical false discovery rate values (S10 Fig and S4 Table).
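The dependence of specificity on the distribution of generations between infections can be sketched as a Poisson mixture. The snippet below assumes that a pair separated by g generations has a genetic distance that is Poisson with mean (substitution rate × g), and that all pairs with g ≥ 2 are unlinked; both the mixture form and the generation distribution shown are illustrative assumptions, not the values shipped with phylosamp or derived in S2 Text.

```python
import math


def prob_distance_below(cutoff, mean):
    """P(X < cutoff) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(cutoff))


def specificity(cutoff, mut_rate, gen_dist):
    """Probability that an unlinked pair is correctly called unlinked.

    gen_dist maps the number of generations g between two infections to
    its probability. Assumption: pairs with g >= 2 are unlinked and have
    Poisson(mut_rate * g) mutations between them.
    """
    unlinked = {g: p for g, p in gen_dist.items() if g >= 2}
    total = sum(unlinked.values())
    if total <= 0:
        raise ValueError("gen_dist has no probability mass for g >= 2")
    called_linked = sum(
        p * prob_distance_below(cutoff, mut_rate * g) for g, p in unlinked.items()
    )
    return 1.0 - called_linked / total


# Hypothetical distribution of generations between infections
toy_gen_dist = {1: 0.02, 2: 0.05, 3: 0.13, 4: 0.30, 5: 0.50}

spec_cutoff_2 = specificity(cutoff=2, mut_rate=0.36, gen_dist=toy_gen_dist)
```

Because each term in the mixture is weighted by the probability of g, shifting mass toward closely related pairs lowers the computed specificity, which is one way an imperfect generation distribution could bias the estimate.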

Fig 6. Effect of the generation distribution on specificity of the linkage criteria.

Theoretical versus simulated specificity for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number (proportion sampled = 0.75). White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color. (A) Theoretical sensitivity and specificity calculated using average distribution of generations between infections from simulations (see S2 Text). (B) Theoretical sensitivity and specificity calculated using the actual distribution of generations between infections from that simulated outbreak.

Application of the sampling framework

Illustrative retrospective example

To illustrate our sample size calculation framework, we used a publicly available dataset from an outbreak caused by a well-characterized pathogen (mumps virus) that had been subject to both genomic and epidemiological analysis [37]. We first used the substitution rate method described above to calculate the sensitivity and specificity of genetic distance as a linkage criterion using the substitution rate reported in the study (molecular clock rate = 4.76 × 10^-4 substitutions per site per year). We converted this substitution rate to 0.36 substitutions per genome per generation using the mean generation interval estimated in the study (18 days), which falls within previous estimates of this parameter [42–44]. We used the effective reproductive number reported for Harvard University (1.70) to select the simulated distribution of the number of generations between infections provided with our phylosamp package, as shown in the R code below:

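library(phylosamp)

# Simulated distributions of generations between infections, provided with the package
data("gen_dist_sim")

# Keep the row simulated for R = 1.70, dropping the first two columns
mgd <- as.numeric(gen_dist_sim[gen_dist_sim$R == 1.70, -(1:2)])

# Store the optimal ROC point so that it can be reused in later calls
optim <- get_optim_roc(sens_spec_roc(cutoff = 1:20, mut_rate = 0.36, mean_gens_pdf = mgd))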

This method results in an optimal sensitivity of 0.95 and specificity of 0.95 using a cutoff of two mutations.
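The unit conversion behind the 0.36 substitutions per genome per generation used above can be checked with a short calculation. The sketch assumes a mumps virus genome length of 15,384 nucleotides, a value that is not stated in the text, and a 365-day year; the function name is ours.

```python
def per_generation_rate(subs_per_site_per_year, genome_length, generation_days):
    """Convert a clock rate to substitutions per genome per generation."""
    per_genome_per_year = subs_per_site_per_year * genome_length
    return per_genome_per_year * generation_days / 365.0


# Assumption: mumps virus genome length in nucleotides (not given in the text)
MUMPS_GENOME_LENGTH = 15384

mumps_rate = per_generation_rate(
    subs_per_site_per_year=4.76e-4,
    genome_length=MUMPS_GENOME_LENGTH,
    generation_days=18,
)
```

Under these assumptions the result rounds to 0.36, in agreement with the value used in the call to sens_spec_roc.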

We then used these parameter values to calculate the true discovery rate of our linkage criteria, i.e., the proportion of identified links (whole mumps genomes differing by <2 mutations) that represent actual transmission pairs. We focused on the part of the mumps outbreak within Harvard University, for which 66 whole genome sequences were generated from 71 unique patient samples. While the true number of cases at Harvard was likely significantly higher, this provides a maximum sampling proportion of 93% of infections. Using the phylosamp package, we calculated the true discovery rate as follows:

truediscoveryrate(eta = optim$sensitivity, chi = 1 - optim$specificity, rho = 0.93, M = 66, R = 1)

Using our method, we calculated a true discovery rate of 0.35. This low value suggests that genetic distance alone would not be sufficient to identify specific transmission links within the Harvard community during this mumps outbreak. This is in line with the findings of the original paper, which demonstrates the need for both genomic and epidemiological data to understand transmission, and emphasizes the frequent need for such epidemiological data to achieve the required specificity for high confidence estimation of transmission links.
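A rough pair-counting approximation shows why the true discovery rate is so low here. The sketch assumes that each sampled infection has on average ρ(R + 1) sampled transmission partners (its infector plus R infectees), that a fraction η of these true pairs are called linked, and that a fraction (1 − χ) of the remaining M − 1 − ρ(R + 1) possible partners are falsely called linked, where χ is the specificity itself. This is our simplified reading and should be checked against Eq 1 and the truediscoveryrate function before being relied on.

```python
def true_discovery_rate(eta, chi, rho, M, R):
    """Pair-counting approximation of the true discovery rate.

    eta: sensitivity, chi: specificity, rho: proportion sampled,
    M: number of infections sampled, R: average reproductive number.
    Assumed form, not necessarily identical to the package implementation.
    """
    partners = rho * (R + 1)  # expected sampled true partners per sampled infection
    true_links = eta * partners
    false_links = (1.0 - chi) * (M - 1 - partners)
    return true_links / (true_links + false_links)


# Rounded inputs from the mumps example in the text
tdr_mumps = true_discovery_rate(eta=0.95, chi=0.95, rho=0.93, M=66, R=1)
```

With these rounded inputs the approximation gives about 0.36, close to the reported 0.35; the small difference may reflect rounding of the sensitivity and specificity or details of the exact formula. Even at 95% specificity, the roughly 63 unlinked partners per sampled infection generate more false links than the fewer than two true partners generate true links.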

Illustrative prospective example

To demonstrate how our method could be used to estimate the sample size needed to identify transmission links with 90% confidence (i.e., a true discovery rate of 0.9), we applied our method to a hypothetical COVID-19 outbreak in an unvaccinated community with 120 infections. We calculated the sensitivity and specificity of genetic distance using a substitution rate of 0.34 mutations per genome per generation [38–41] and an R value of 3, consistent with many published estimates [45,46]:

mgd <- as.numeric(gen_dist_sim[gen_dist_sim$R == 3, -(1:2)])

optim <- get_optim_roc(sens_spec_roc(cutoff = 1:20, mut_rate = 0.34, mean_gens_pdf = mgd))

This method results in an optimal sensitivity of 0.95 and a specificity of 0.84 using a cutoff of two mutations. Using these parameters, we found that not even perfect sampling could lead to a true discovery rate of at least 0.9:

samplesize(eta = optim$sensitivity, chi = 1 - optim$specificity, N = 120, R = 1, phi = 0.9)

This suggests that genetic distance alone is not sufficient to differentiate linked and unlinked SARS-CoV-2 infections at high confidence. However, if we could identify additional phylogenetic or epidemiological criteria that would increase the specificity to 0.999 (keeping the sensitivity at 0.95), a sample size of 11 would achieve our desired confidence in direct transmission links. Additionally, it may be more fruitful to focus on cases linked within several generations of transmission, during which additional mutations would have time to accumulate.
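The conclusion that no sample size reaches a true discovery rate of 0.9 at a specificity of 0.84 can be illustrated by scanning every possible sample size under a simple pair-counting approximation. The approximation (true links proportional to ηρ(R + 1), false links proportional to (1 − χ)(M − 1 − ρ(R + 1)), with ρ = M/N) is an assumption of this sketch and is not the samplesize function from phylosamp, so it does not attempt to reproduce the sample size of 11 reported for the higher-specificity scenario.

```python
def true_discovery_rate(eta, chi, rho, M, R):
    """Pair-counting approximation of the true discovery rate (assumed form)."""
    partners = rho * (R + 1)
    true_links = eta * partners
    false_links = (1.0 - chi) * (M - 1 - partners)
    return true_links / (true_links + false_links)


def best_achievable_tdr(eta, chi, N, R):
    """Scan sample sizes M = 2..N (rho = M / N); return (best TDR, its M)."""
    candidates = (
        (true_discovery_rate(eta, chi, M / N, M, R), M) for M in range(2, N + 1)
    )
    return max(candidates)


# Inputs from the hypothetical COVID-19 example in the text
best_tdr, best_M = best_achievable_tdr(eta=0.95, chi=0.84, N=120, R=1)
target_reachable = best_tdr >= 0.9
```

Under this approximation the best achievable value stays far below 0.9 for every sample size, including complete sampling, which matches the qualitative conclusion in the text.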

Discussion

We have developed a mathematical framework for making informed sampling decisions in pathogen genome sequencing studies. Specifically, this framework allows for easy calculation of the relationship between the number or proportion of infections sampled during an outbreak and the ability of some phylogenetic or epidemiological criteria to correctly identify infections within this sample that are linked by direct transmission. Understanding this relationship is crucial to making correct inferences about pathogen transmission patterns, especially as genomic studies are becoming more feasible and widely used to answer both scientific and public health questions.

This framework is broadly applicable to a variety of phylogenetic or epidemiological approaches, as long as the sensitivity and specificity of the criteria can be approximated. With a basic understanding of the pathogen and the criteria being used, researchers can more effectively design studies that correctly identify transmission pairs with a known level of confidence. Additionally, this generalizable method (available as free software, the R package phylosamp) provides a metric by which reviewers of these studies can evaluate their conclusions. We apply our method to simulated outbreaks using genetic distance as the linkage criterion and find that we can effectively estimate the false discovery rate for a variety of pathogen substitution rates, reproductive numbers, and relevant genetic distance thresholds. It is important to note, however, that for a given sensitivity and specificity, there may not always be a study design that achieves the desired false discovery rate.

Performance of the method presented depends on our ability to estimate the sensitivity and specificity of a particular linkage criterion. While we present two methods for doing this (empirically, and theoretically using the substitution rate of the pathogen), implementing either in practice is not without challenges, and improved estimation of these values may be a fruitful area for future research. For instance, the substitution-rate-based approach also depends on the distribution of the number of generations of transmission between infections in the underlying population. Although distributions derived from simulations (provided as part of the phylosamp package) provide a reasonable proxy, estimates of sensitivity and specificity are much improved when using the exact generation distribution, which currently can only be determined from complete knowledge of all transmission events. Further research into all the factors affecting this distribution will be necessary to improve its estimation. Likewise, there are challenges to the empirical approach, particularly for novel pathogens.

Better performance can likely be obtained by not restricting ourselves to genetic distance alone when determining a linkage criterion. Genetic distance is easy to determine from sequence data, but this simple metric does not take into account ancestral relationships or uncertainty around these relationships, and is limited to discrete mutational changes. Applying more complex phylogenetic criteria may allow us to learn more about transmission relationships, though there is a limit to the extent to which genetic data can be used to distinguish infections in fast-spreading (or slow-mutating) pathogen outbreaks. There are several examples of outbreaks in which multiple infected individuals have the same consensus viral genome [32]. In this case, incorporating epidemiological data (e.g., location, time of symptom onset) may be important in determining which infections are unlikely to be linked. This incorporation of additional data may complicate calculation of the sensitivity and specificity, so developing the methodology around calculating these parameters will be important to further development of our method. This will likely build on a larger effort to better integrate epidemiological and genomic data into pathogen transmission studies [26,47–49].

The application of our methodology to a previous mumps outbreak and a hypothetical COVID-19 outbreak highlights the need to move beyond genetic distance as a linkage criterion; for pathogens with a substitution rate similar to that of mumps virus, genetic distance is not enough to differentiate between linked and unlinked cases even in densely sampled outbreaks. In trying to apply this method to other outbreaks, it also became clear that well-characterized substitution rates and reproductive numbers are essential for calculating sensitivity and specificity using our method, and that these parameters are less clearly defined for pathogens with long and variable generation times, such as bacterial infections. Variable periods of replication within a host make it difficult to characterize a per-generation substitution rate that is broadly applicable over the entire outbreak and can be used to estimate sensitivity and specificity. In these cases, more nuanced criteria such as phylogenetic relatedness will likely be more informative than the number of mutations between sequenced infections; while we provide instructions for using genetic distance as a linkage criterion in order to give a concrete example of calculating sensitivity and specificity, the primary focus of this manuscript is to demonstrate how these parameters can be used to calculate or evaluate sample sizes.

While in this manuscript we have focused on direct transmission pairs, our framework is designed to be extensible to alternative definitions of linkage; for example, infections connected within a specified number of transmission events. Expanding the definition of linkage to include such indirect transmissions has a number of useful applications in outbreak research, such as identifying and connecting transmission clusters. This method could also be extended to more complex direct transmission relationships, for example when within-host evolution results in the existence of viral quasispecies within infected individuals, each of which has some potential of being transmitted. In all of these scenarios, it is equally important to understand the sample size needed to make the desired inferences.

We hope that this work represents a step towards developing a larger theory of study design for making inferences from pathogen sequence data, but recognize it is only a step. The focus of this paper is sample size and the impact of undersampling, but spatial and/or temporal biases are also important for determining which infections are sampled [50–52]. For example, understanding routes of direct transmission may require dense sampling of a small group of highly-connected individuals, while understanding general transmission trends over the course of a geographically-dispersed outbreak may require us to sample broadly over space and time. Additionally, it will be important to take into account the contact network underlying pathogen transmission, since some individuals may be more likely to transmit their infection to others. Finally, the goal of linking infections is seldom the linkages themselves, but the larger inferences about risk and transmission derived from those linkages. Adapting the techniques here to more directly link sample size calculations to these outcomes is an important next step.

Supporting information

S1 Fig. Sample size and false discovery rate given single linkage and single transmission.

(A) Effect of sample size (red lines) or proportion sampled (blue lines) on the expected number of linked pairs (upper plots) or the false discovery rate of linked pairs (lower plots). The specificity and sensitivity are held constant. (B) Effect of varying the sensitivity and specificity of the linkage criteria on the false discovery rate (FDR).

(TIF)

S2 Fig. Estimating the average reproductive number in a population.

Two hypothetical outbreaks with a pathogen reproductive number (R) equal to 2 and a total of 15 infections. Black circles represent infections; blue circles represent infections that have not yet infected others, or whose descendants are outside the sampling frame. (A) Outbreak caused by a single introduction, meaning there were 14 transmission events and 15 total infections. In other words, Rpop = 14/15 = 0.933. (B) Outbreak caused by two separate introductions, meaning there were only 13 infection events in the sampling frame, resulting in Rpop = 13/15 = 0.867.

(TIF)

S3 Fig. Effects of R and G on the distribution of generations between cases.

Distribution of the number of generations between infections averaged over 1000 simulated outbreaks with reproduction number R and number of generations of transmission G. Distributions are shown for three values of R (rows). Left column: distribution of generations between infections after 3 generations of transmission; middle column: distribution after ln(1000)/ln(R) generations of transmission (see Methods); right column: distribution after ln(1000)/ln(R)+2 generations of transmission.

(TIF)

S4 Fig. Genetic distance distributions for different types of pathogens.

(A) Distribution of genetic distances for linked (purple) and unlinked (yellow) infections for a hypothetical pathogen with substitution rate = 1 substitution/genome/generation and R = 1.5. Inset: receiver operating characteristic (ROC) curve for all possible genetic distance cutoff values. Optimal threshold shown as green dot (ROC) and dashed vertical line (distribution). (B) Distribution of genetic distances for linked and unlinked cases for a hypothetical pathogen with substitution rate = 0.2 mutations/genome/generation and R = 3. Inset: ROC curve for all possible genetic distance cutoff values for this pathogen. The optimal threshold is shown as in (A).

(TIF)

S5 Fig. Error of false discovery rate calculation by sensitivity and specificity.

(A) Average error of the false discovery rate estimate from 10,000 simulated outbreaks (proportion sampled = 0.75) binned by sensitivity and specificity (bin size = 0.02). Grey = no genetic distance thresholds in simulation produced this combination of sensitivity and specificity. (B) Zoom view of (A), with specificity ranging from 0.9–1 (bin size = 0.002). (C) Number of data points with sensitivity and specificity in the desired bins (i.e., number of data points used to calculate average error in panel (A)). (D) Zoom view of (C), with specificity ranging from 0.9–1.

(TIF)

S6 Fig. Histogram of raw parameter error using substitution rate method (optimal threshold only).

Theoretical minus simulated parameter values for the optimal genetic distance threshold (determined by selecting the threshold for which the point at (1-specificity, sensitivity) is closest to the (0,1) corner) in 10,000 simulations of varying substitution rate and reproductive number for a given sampling proportion. Top row: theoretical minus simulated false discovery rate; middle row: theoretical minus simulated sensitivity; bottom row: theoretical minus simulated specificity. Colors correspond to sampling proportion as in Fig 4.

(TIF)

S7 Fig. Predicted versus observed sensitivity using substitution rate method.

Theoretical versus simulated sensitivity for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color.

(TIF)

S8 Fig. Predicted versus observed specificity using substitution rate method.

Theoretical versus simulated specificity for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number. Outbreak sizes range from 100–2000, as described in Methods. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color.

(TIF)

S9 Fig. Histogram of raw specificity error using substitution rate method by sample size and proportion.

Theoretical minus simulated specificity for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number for a given sampling proportion. Each column represents 10,000 simulations with a specific sampling proportion (colors as in Fig 4) and sample size within each proportion (determined by the final outbreak size) goes from low (top row) to high (bottom row).

(TIF)

S10 Fig. Predicted versus observed false discovery rate using actual generation distribution.

Theoretical versus simulated false discovery rate (FDR) for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number. Theoretical FDR is calculated using the actual distribution of generations between infections from the corresponding simulated outbreak. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color.

(TIF)

S1 Table. Error of false discovery rate calculation by sample size.

(PDF)

S2 Table. Bias and error of false discovery rate calculation using substitution rate method.

(PDF)

S3 Table. Error of false discovery rate calculation using substitution rate method by sample size.

(PDF)

S4 Table. Bias and error of false discovery rate using actual generation distribution.

(PDF)

S1 Text. Deriving probability of transmission given linkage.

(PDF)

S2 Text. Determining sensitivity and specificity of genetic distance as a linkage criteria.

(PDF)

Acknowledgments

We thank Stuart Ray for his insightful comments on the manuscript.

Data Availability

All code and simulation data are available at: https://github.com/HopkinsIDD/phylosamplesize.

Funding Statement

Funding was provided by Bill and Melinda Gates Foundation OPP1195157 (S.W. and J.L.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Neher RA, Bedford T. Real-Time Analysis and Visualization of Pathogen Sequence Data. J Clin Microbiol. 2018;56. doi: 10.1128/JCM.00480-18
2. Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, Cowley L, et al. Real-time, portable genome sequencing for Ebola surveillance. Nature. 2016;530: 228–232. doi: 10.1038/nature16996
3. Gardy JL, Johnston JC, Ho Sui SJ, Cook VJ, Shah L, Brodkin E, et al. Whole-genome sequencing and social-network analysis of a tuberculosis outbreak. N Engl J Med. 2011;364: 730–739. doi: 10.1056/NEJMoa1003176
4. Jackson BR, Tarr C, Strain E, Jackson KA, Conrad A, Carleton H, et al. Implementation of Nationwide Real-time Whole-genome Sequencing to Enhance Listeriosis Outbreak Detection and Investigation. Clin Infect Dis. 2016;63: 380–386. doi: 10.1093/cid/ciw242
5. Snider CJ, Diop OM, Burns CC, Tangermann RH, Wassilak SGF. Surveillance Systems to Track Progress Toward Polio Eradication - Worldwide, 2014–2015. MMWR Morb Mortal Wkly Rep. 2016;65: 346–351. doi: 10.15585/mmwr.mm6513a3
6. Lei F, Shi W. Prospective of Genomics in Revealing Transmission, Reassortment and Evolution of Wildlife-Borne Avian Influenza A (H5N1) Viruses. Curr Genomics. 2011;12: 466–474. doi: 10.2174/138920211797904052
7. Nelson MI, Simonsen L, Viboud C, Miller MA, Holmes EC. Phylogenetic analysis reveals the global migration of seasonal influenza A viruses. PLoS Pathog. 2007;3: 1220–1228. doi: 10.1371/journal.ppat.0030131
8. Gonzalez-Reiche AS, Hernandez MM, Sullivan MJ, Ciferri B, Alshammary H, Obla A, et al. Introductions and early spread of SARS-CoV-2 in the New York City area. Science. 2020. doi: 10.1126/science.abc1917
9. Thézé J, Li T, du Plessis L, Bouquet J, Kraemer MUG, Somasekar S, et al. Genomic Epidemiology Reconstructs the Introduction and Spread of Zika Virus in Central America and Mexico. Cell Host Microbe. 2018;23: 855–864.e7. doi: 10.1016/j.chom.2018.04.017
10. Weill F-X, Domman D, Njamkepo E, Almesbahi AA, Naji M, Nasher SS, et al. Genomic insights into the 2016–2017 cholera epidemic in Yemen. Nature. 2019;565: 230–233. doi: 10.1038/s41586-018-0818-3
11. Carroll MW, Matthews DA, Hiscox JA, Elmore MJ, Pollakis G, Rambaut A, et al. Temporal and spatial analysis of the 2014–2015 Ebola virus outbreak in West Africa. Nature. 2015;524: 97–101. doi: 10.1038/nature14594
12. Park DJ, Dudas G, Wohl S, Goba A, Whitmer SLM, Andersen KG, et al. Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone. Cell. 2015;161: 1516–1526. doi: 10.1016/j.cell.2015.06.007
13. Ratmann O, Kagaayi J, Hall M, Golubchick T, Kigozi G, Xi X, et al. Quantifying HIV transmission flow between high-prevalence hotspots and surrounding communities: a population-based study in Rakai, Uganda. Lancet HIV. 2020;7: e173–e183. doi: 10.1016/S2352-3018(19)30378-9
14. Salje H, Lessler J, Endy TP, Curriero FC, Gibbons RV, Nisalak A, et al. Revealing the microscale spatial signature of dengue transmission and immunity in an urban population. Proc Natl Acad Sci U S A. 2012;109: 9535–9538. doi: 10.1073/pnas.1120621109
15. Volz EM, Frost SDW. Inferring the source of transmission with phylogenetic data. PLoS Comput Biol. 2013;9: e1003397. doi: 10.1371/journal.pcbi.1003397
16. Frost SDW, Pybus OG, Gog JR, Viboud C, Bonhoeffer S, Bedford T. Eight challenges in phylodynamic inference. Epidemics. 2015;10: 88–92. doi: 10.1016/j.epidem.2014.09.001
17. Grabowski MK, Lessler J. Phylogenetic insights into age-disparate partnerships and HIV. Lancet HIV. 2017. pp. e8–e9. doi: 10.1016/S2352-3018(16)30184-9
18. Mavian C, Marini S, Manes C, Capua I, Prosperi M, Salemi M. Regaining perspective on SARS-CoV-2 molecular tracing and its implications. medRxiv. 2020; 2020.03.16.20034470.
19. Farhat MR, Shapiro BJ, Sheppard SK, Colijn C, Murray M. A phylogeny-based sampling strategy and power calculator informs genome-wide associations study design for microbial pathogens. Genome Med. 2014;6: 101. doi: 10.1186/s13073-014-0101-7
20. Kelly BJ, Gross R, Bittinger K, Sherrill-Mix S, Lewis JD, Collman RG, et al. Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA. Bioinformatics. 2015;31: 2461–2468. doi: 10.1093/bioinformatics/btv183
21. Network HPT, Others. HPTN 071: population effects of antiretroviral therapy to reduce HIV transmission (PopART): a cluster-randomized trial of the impact of a combination prevention package on population-level HIV incidence in Zambia and South Africa. 2013.
22. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3: 32–35.
23. Perkins NJ, Schisterman EF. The inconsistency of “optimal” cutpoints obtained using two criteria based on the receiver operating characteristic curve. Am J Epidemiol. 2006;163: 670–675. doi: 10.1093/aje/kwj063
24. Liu X. Classification accuracy and cut point selection. Stat Med. 2012;31: 2676–2686. doi: 10.1002/sim.4509
25. Zou KH, Yu C-R, Liu K, Carlsson MO, Cabrera J. Optimal thresholds by maximizing or minimizing various metrics via ROC-type analysis. Acad Radiol. 2013;20: 807–815. doi: 10.1016/j.acra.2013.02.004
26. Jombart T, Cori A, Didelot X, Cauchemez S, Fraser C, Ferguson N. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLoS Comput Biol. 2014;10: e1003457. doi: 10.1371/journal.pcbi.1003457
27. Team RC, Others. R: A language and environment for statistical computing. 2013. Available: http://finzi.psych.upenn.edu/R/library/dplR/doc/intro-dplR.pdf
28. Dobrow RP. On the distribution of distances in recursive trees. J Appl Probab. 1996;33: 749–757.
29. Mahmoud HM, Neininger R. Distribution of distances in random binary search trees. Ann Appl Probab. 2003;13: 253–276.
30. Salje H, Cummings DAT, Lessler J. Estimating infectious disease transmission distances using the overall distribution of cases. Epidemics. 2016;17: 10–18. doi: 10.1016/j.epidem.2016.10.001
31. Worby CJ, Chang H-H, Hanage WP, Lipsitch M. The distribution of pairwise genetic distances: a tool for investigating disease transmission. Genetics. 2014;198: 1395–1404. doi: 10.1534/genetics.114.171538
32. Campbell F, Strang C, Ferguson N, Cori A, Jombart T. When are pathogen genome sequences informative of transmission events? PLoS Pathog. 2018;14: e1006885. doi: 10.1371/journal.ppat.1006885
33. Jenkins GM, Rambaut A, Pybus OG, Holmes EC. Rates of molecular evolution in RNA viruses: a quantitative phylogenetic analysis. J Mol Evol. 2002;54: 156–165. doi: 10.1007/s00239-001-0064-3
34. Duchêne S, Holt KE, Weill F-X, Le Hello S, Hawkey J, Edwards DJ, et al. Genome-scale rates of evolutionary change in bacteria. Microb Genom. 2016;2: e000094. doi: 10.1099/mgen.0.000094
35. van den Driessche P. Reproduction numbers of infectious disease models. Infect Dis Model. 2017;2: 288–303. doi: 10.1016/j.idm.2017.06.002
36. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag; New York; 2016. Available: https://ggplot2.tidyverse.org
37. Wohl S, Metsky HC, Schaffner SF, Piantadosi A, Burns M, Lewnard JA, et al. Combining genomics and epidemiology to track mumps virus transmission in the United States. PLoS Biol. 2020;18: e3000611. doi: 10.1371/journal.pbio.3000611
38. Genomic epidemiology of novel coronavirus - Global subsampling. [cited 20 Mar 2021]. Available: https://nextstrain.org/ncov/global?l=clock
39. Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018;34: 4121–4123. doi: 10.1093/bioinformatics/bty407
40. Sagulenko P, Puller V, Neher RA. TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evol. 2018;4: vex042. doi: 10.1093/ve/vex042
41. Ferretti L, Wymant C, Kendall M, Zhao L, Nurtay A, Abeler-Dörner L, et al. Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing. bioRxiv. medRxiv; 2020. doi: 10.1126/science.abb6936
  • 42.Vink MA, Bootsma MCJ, Wallinga J. Serial intervals of respiratory infectious diseases: a systematic review and analysis. Am J Epidemiol. 2014;180: 865–875. doi: 10.1093/aje/kwu209 [DOI] [PubMed] [Google Scholar]
  • 43.Anderson RM, May RM. Infectious Diseases of Humans: Dynamics and Control. OUP Oxford; 1992. [Google Scholar]
  • 44.Vynnycky E, White R. An Introduction to Infectious Disease Modelling. OUP Oxford; 2010. doi: 10.1093/aje/kwp394 [DOI] [Google Scholar]
  • 45.Billah MA, Miah MM, Khan MN. Reproductive number of coronavirus: A systematic review and meta-analysis based on global level evidence. PLoS One. 2020;15: e0242128. doi: 10.1371/journal.pone.0242128 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Katul GG, Mrad A, Bonetti S, Manoli G, Parolari AJ. Global convergence of COVID-19 basic reproduction number and estimation from early-time SIR dynamics. PLoS One. 2020;15: e0239800. doi: 10.1371/journal.pone.0239800 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Klinkenberg D, Backer JA, Didelot X, Colijn C, Wallinga J. Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput Biol. 2017;13: e1005495. doi: 10.1371/journal.pcbi.1005495 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Ypma RJF, Bataille AMA, Stegeman A, Koch G, Wallinga J, van Ballegooijen WM. Unravelling transmission trees of infectious diseases by combining genetic and epidemiological data. Proc Biol Sci. 2012;279: 444–450. doi: 10.1098/rspb.2011.0913 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Morelli MJ, Thébaud G, Chadœuf J, King DP, Haydon DT, Soubeyrand S. A Bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data. PLoS Comput Biol. 2012;8: e1002768. doi: 10.1371/journal.pcbi.1002768 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Stack JC, Welch JD, Ferrari MJ, Shapiro BU, Grenfell BT. Protocols for sampling viral sequences to study epidemic dynamics. J R Soc Interface. 2010;7: 1119–1127. doi: 10.1098/rsif.2009.0530 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.de Silva E, Ferguson NM, Fraser C. Inferring pandemic growth rates from sequence data. J R Soc Interface. 2012;9: 1797–1808. doi: 10.1098/rsif.2011.0850 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Hall MD, Woolhouse MEJ, Rambaut A. The effects of sampling strategy on the quality of reconstruction of viral population dynamics using Bayesian skyline family coalescent methods: A simulation study. Virus Evol. 2016;2: vew003. doi: 10.1093/ve/vew003 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009182.r002

Decision Letter 0

Virginia E Pitzer

20 Jan 2021

Dear Dr. Lessler,

Thank you very much for submitting your manuscript "Sample Size Calculation for Phylogenetic Case Linkage" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

As you'll see, the reviews are mixed. While Reviewer 2 has only minor suggestions for improvements in the clarity of the text, Reviewers 1 and 3 have more substantive comments. In particular, both would like to see the methods illustrated using an openly available real data set, and I strongly encourage the authors to do so. Reviewer 3 (who previously reviewed the manuscript at eLife) still has some substantive methodological concerns, although I think that these can be addressed.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Virginia E. Pitzer, Sc.D.

Deputy Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Wohl et al present a statistical framework for calculating sample sizes for robust determinations of the infector-infectee pairs within transmission chains of pathogen genomic epidemiology studies. Their framework also provides methods for calculating FDR and the expected number of true transmission pairs from the specificity and sensitivity of the linkage criteria (genetic distances), sample size, the proportion of samples sequenced, and the effective reproductive number of the pathogen analysed. The authors demonstrate the utility of this framework with simulation data and developed the R package “phylosamp” to provide an implementation of their framework.
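
The quantities named in this summary can be related through a generic confusion-matrix calculation. The Python sketch below is an illustration only: it is not the derivation used in the manuscript, nor the interface of phylosamp. The function name `expected_fdr` and the input `n_true_pairs` (the number of true transmission pairs assumed to be present among the sampled cases) are hypothetical, as are the example values. It shows why the false discovery rate can be high even when specificity is close to 1: the number of unlinked pairs grows roughly with the square of the sample size.

```python
def expected_fdr(n_samples, n_true_pairs, sensitivity, specificity):
    """Generic expected false discovery rate among pairs called 'linked'.

    n_true_pairs is a hypothetical input: the number of true transmission
    pairs assumed to be present among the sampled cases.
    """
    total_pairs = n_samples * (n_samples - 1) // 2   # all unordered pairs of samples
    unlinked_pairs = total_pairs - n_true_pairs      # pairs that are not true links
    true_positives = sensitivity * n_true_pairs
    false_positives = (1.0 - specificity) * unlinked_pairs
    called_linked = true_positives + false_positives
    if called_linked == 0:
        return 0.0                                   # nothing called linked, nothing falsely discovered
    return false_positives / called_linked


# Hypothetical study: 200 sequenced cases, 50 true pairs among them
fdr = expected_fdr(n_samples=200, n_true_pairs=50, sensitivity=0.9, specificity=0.99)
print(round(fdr, 3))
```

Under these assumed inputs most pairs called linked are false positives, although only 1% of unlinked pairs are misclassified.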

This manuscript addresses a neglected problem in many genetic epidemiology studies: how much sequencing is required to draw robust conclusions when reconstructing transmission chains of pathogen outbreaks from WGS data. The work is novel, as there is a lack of formal, agreed-upon standards for this aspect of study design, and it is both relevant and timely given the increasingly widespread adoption of genetic epidemiology techniques for understanding pathogen transmission dynamics. Further, the manuscript is well written, the underlying methodology is well described, and the use cases and limitations of the software are appropriately discussed.

Please find my comments below, divided into different sections for (a) the manuscript describing the framework and (b) the R package phylosamp. I hope these are useful to the authors.

A. Manuscript comments:

(1) The reliance of phylosamp at present on genetic distances alone as the linkage criteria presents a key limitation in calculating appropriate sample sizes and other parameters for a study concerning slowly evolving pathogens where there is limited genetic variation accumulating between transmission pairs/generations which prohibits their detection from WGS alone. I recognise that the focus of this manuscript is a first step towards more comprehensive approaches, and that these concerns are discussed in both the manuscript, and in previous supplied reviews from a submission to eLife, but also believe that this limits the utility of the software for many genetic epidemiology studies.

(2) While the simulation data provide a useful and convincing illustration of the framework, it would be excellent to also see an example application of phylosamp to an existing published pathogen dataset to further demonstrate its utility. Again, I recognise that this has been discussed in previous reviews from a submission to eLife, but the inclusion of such data would present a substantial improvement to the work and encourage further adoption of the framework.

(3) The definition of Rpop provided from line 100, where it is first introduced, requires rephrasing for clarity. While this is better described later in the manuscript from line 149, the earlier text could be clarified to avoid the reader having to scroll back and forth throughout the paper. I recognise that this text has already been refined based on the reviewer comments from the previous submission to eLife; however, it could benefit from further refinement for improved clarity and flow.

(4) Figure 1B: Does each white dot indicate the sensitivity and 1-specificity for a SNP/genetic distance increased in increments of 1 (i.e. 0, 1, 2, 3, 4, … SNPs)? If so, it would be helpful to indicate the values of these increments either by annotation of the figure itself or expansion of the figure legend to improve clarity.

(5) Line 242: There don’t appear to be any citations for the range of effective reproductive numbers of human pathogens explored in simulation studies.

(6) Figure S5: It appears that either the figure panels or the legend descriptions might be inverted for A and B, as well as C and D.

(7) The authors have put substantial effort into making their work openly available by submitting a preprint on medRxiv and providing all code and data files required to reproduce their analyses and manuscript figures via GitHub (available at: https://github.com/HopkinsIDD/phylosamplesize). I was able to reproduce all figures and analyses until line #113 of figures.Rmd, at which point I was unable to proceed further.

i.e.

# first time only: calculate tfdr from simulations and save to file
calc.tfdr(simdata="data/simdata_var_N10000",rho_values=c(0.1,0.25,0.5,0.75),max_sim_size=2000,
          sens_spec_method="sim",mgd=mgd,outdir="data/full_data_sim.Rdata")

I think this might be due to the files being specified by the prefix “simdata_var_N10000” where it might need to be instead specified as “simdata_var_gen_N10000”, but the authors may need to look into this further.
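
Comment (4) above asks whether each point on the curve corresponds to one integer genetic distance threshold. The Python sketch below illustrates that reading under loudly stated assumptions: genetic distances for linked and unlinked pairs are taken to be Poisson distributed with hypothetical means (1 and 5), and a pair is called linked when its distance is at or below the threshold. None of these choices, nor the function names, are taken from the manuscript or from phylosamp. The last step picks the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1), one common cut-point criterion.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def roc_points(linked_mean, unlinked_mean, max_threshold):
    """One (threshold, sensitivity, 1 - specificity) point per integer threshold."""
    points = []
    for threshold in range(max_threshold + 1):
        # Linked pairs correctly called linked (distance <= threshold)
        sensitivity = poisson_cdf(threshold, linked_mean)
        # Unlinked pairs wrongly called linked, i.e. 1 - specificity
        false_positive_rate = poisson_cdf(threshold, unlinked_mean)
        points.append((threshold, sensitivity, false_positive_rate))
    return points


def best_by_youden(points):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] - p[2])[0]


# Hypothetical distance distributions (assumed, not from the paper)
points = roc_points(linked_mean=1.0, unlinked_mean=5.0, max_threshold=10)
for threshold, sensitivity, fpr in points:
    print(threshold, round(sensitivity, 3), round(fpr, 3))
print("Youden-optimal threshold:", best_by_youden(points))
```

Each printed row is one point on the curve, so annotating the dots with their thresholds amounts to labeling them with the first column.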

B. Phylosamp R package and documentation comments:

Code from the R package was clearly structured and generally well commented. The package is freely available and easily installed via the devtools library. I was able to reproduce the results from the vignette code easily and without issue, and found the explanations very clear and informative. I have provided some comments on the R package and documentation below that I hope are useful to the authors, but do not regard any of these to be critical changes, nor do I require that these suggested changes be made for the publication of this manuscript.

- In the vignettes it may be worth providing a simple reiteration of what each argument provided to the function is in the vignette (e.g. for eta, chi, rho)

- There appears to be a typo at the top of the ‘Illustrated examples’ vignette page, I think “this vignette…” should perhaps be “In this vignette…”.

- When using the help operator in R, I found the package to be well documented for all functions, but at times it was a little unclear to me which defaults are used when arguments are not supplied explicitly by the user, e.g. the assumption argument. I think, based on the manuscript and the example provided via the help operator in R, that this is mtml for ‘multiple-transmission multiple-linkage’, but perhaps this could be further clarified in the package documentation.

Reviewer #2: Wohl et al. present a method for understanding how sampling, both in terms of overall depth and in terms of proportion, influences how accurately we can identify true infector-infectee pairs (linked cases) from a phylogeny of pathogen genomes. This theoretical area of genomic epidemiology is sorely underdeveloped, especially when compared to the rigorous theoretical framework for sampling design available for traditional epidemiological studies. This work is the first real step I’ve seen to develop sample size calculations for genomic epidemiological studies. The manuscript is clearly written, and I am satisfied by how the authors have addressed previous reviewer comments. While this work should be accepted, I do have some minor comments that should be addressed to avoid reader confusion and position this paper in the appropriate context. These comments do not require further analytic work; they are only textual changes.

1. In the Introduction the authors draw on many examples of how pathogen genomic information can be used to investigate public health questions (lines 34-37) at multiple scales (lines 47-49), and declare that all of those questions can be boiled down to a question of asking whether pairs of infections are related. I disagree with this, especially within the context of sampling. Sampling considerations within phylogeographic studies, which seek to infer patterns of spatial linkage, center on the assumption that sampling must be sufficiently broad and random to have fully sampled all circulating genetic lineages, generally at an intensity that is proportional to a lineage’s prevalence. For those questions I don’t see how it’s important that linked pairs are captured, and thus I don’t see how this method would help me to design better phylogeographic studies. I would recommend that the authors pivot their introduction to orient this work towards phylogenetic studies of “Who Infected Whom” or phylogenetic birth-death processes, where this method seems most useful.

2. In the section “Determining sensitivity and specificity” the discussion of “mutation rate” is confusing. Given that the generation time is the serial interval between infections, the rate at which changes in the genome would accrue AND be observed at the consensus level should be referred to as the pathogen “substitution rate” rather than the “mutation rate”. I realize that may sound pedantic, but this actually caused some confusion for me given that the selected example rate of 1 mutation/genome/generation is actually a reasonable expectation of the biological mutation rate per pathogen replication cycle.

3. I presume that the high substitution rate was selected such that differences in the distributions of expected mutations between linked and unlinked cases (Fig 2B) would appear more distinct. Using genetic distance as the sole basis for distinguishing linked and unlinked cases gets significantly murkier for “natural” substitution rates, as the authors have shown nicely in Fig S4, mentioned on lines 229-230, and discussed in the Discussion. I appreciate those efforts, and I want to stress that I do not feel that this rate selection is disingenuous in any way. However, in the Discussion the authors’ solution to this issue is to incorporate epidemiological data (such as location data, symptom onset date, contact history etc) to improve resolution of linked versus unlinked cases. Again, I don’t deny that multiple data sources would improve these designations, but it is unclear to me then how one would then calculate sensitivity and specificity. Given that this method relies upon knowing those values, this solution actually seems quite challenging to implement and at least mentioning that in the Discussion is important.

4. I find the R_pop quantity to be highly unintuitive. While we generally discuss R_eff as changing over an outbreak given depletion of susceptibles, I’ve never seen a formulation where the average R is calculated across the population with terminal samples presumed to be 0 because their child infections are not sampled. I will say that Figure S2 helped to clarify this concept greatly, and I’m thankful for that addition. However, I still find the in-text explanation (lines 145-157) very confusing. I think the key to making this clearer is to explicitly say that, within the bounded sampling frame, any terminal nodes (leaves) in the tree/transmission network are presumed to have no known child infections, and thus contribute an R value of 0, which is what allows R_pop to drop below one even for diseases where R_eff is easily greater than one.
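
The reasoning in comment 4 can be checked on a toy transmission tree. The Python sketch below is an illustration of the reviewer's description only, not the manuscript's definition of R_pop; the tree, the case labels, and the function name `mean_offspring` are hypothetical. Every case that transmits infects exactly two others, yet the average over all cases in the bounded sampling frame is below one, because the terminal cases each contribute zero.

```python
from collections import Counter


def mean_offspring(links, cases):
    """Average number of observed secondary infections per case in `cases`."""
    children = Counter(infector for infector, _ in links)
    # Counter returns 0 for cases that never appear as an infector (leaves)
    return sum(children[case] for case in cases) / len(cases)


# Hypothetical outbreak: (infector, infectee) pairs within the sampling frame
links = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("C", "F"), ("C", "G")]
cases = sorted({case for link in links for case in link})
transmitters = sorted({infector for infector, _ in links})

print(mean_offspring(links, transmitters))  # average over cases that transmitted
print(mean_offspring(links, cases))         # average over all cases, leaves count as 0
```

More generally, a tree with N cases has N - 1 links, so the average over all cases is (N - 1)/N, which is always below one regardless of how many infections each transmitting case causes.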

Reviewer #3: In this work the authors seek to provide guidance to understand how sampling impacts the discovery of transmission events using genomic data. The question is interesting and important but the exploration here is limited to the simplest transmission scenario, with a single introduction, uniform random sampling, a known sensitivity and specificity of the genetic linkage system used (or this can be estimated but again it requires some strong assumptions) and Poisson distributed secondary infections. There is no application to real data, either for a sequenced (or partially sequenced) outbreak with analysis of the study design, or for the exploration of the linkage criteria.

The "single linkage" assumption seems hard to justify and the authors give a derivation of the main result in S1 Text part D, so it's not clear why this assumption merits so much discussion earlier.

On page 16 of SI Text, k_i is the number of i's true transmission links that are in the sample. So k_i has to add to something less than M, the number of samples. This means that K (sum_i k_i) is not a sum of *independent* Poisson distributed random variables with rate parameter lambda - they are dependent because their sum is constrained. This impacts the expected number of pairs. It would be approximately correct if the sampling fraction is very small, because the sum of k_i would not approach M so the constraint would have minimal impact. But particularly in this paper, something whose bias gets more severe in a way that depends on the sampling fraction is not good. Also the distribution of the number of pairs is important (not just the expectation).
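
One way to gauge the size of the effect raised in this comment is to ask how often an unconstrained sum of independent Poisson variables would violate the constraint. The Python sketch below is a minimal illustration under assumptions: it uses the standard fact that a sum of M independent Poisson(lambda) variables is Poisson(M * lambda), takes the constraint to be K < M as stated above, and uses hypothetical values of M and lambda. The function name is also hypothetical, and this is not a calculation from the manuscript.

```python
import math


def prob_sum_at_least(n_samples, rate):
    """P(K >= n_samples) when K is a sum of n_samples independent Poisson(rate) draws."""
    mean = n_samples * rate          # sum of independent Poissons is Poisson(mean)
    pmf = math.exp(-mean)            # P(K = 0)
    cdf = pmf
    for k in range(1, n_samples):    # accumulate P(K <= n_samples - 1)
        pmf *= mean / k
        cdf += pmf
    return max(0.0, 1.0 - cdf)       # clamp tiny negative rounding error


# Hypothetical rates; larger rates stand in for larger sampling fractions
for rate in (0.1, 0.5, 0.9):
    print(rate, prob_sum_at_least(100, rate))
```

With these assumed values the violation probability is negligible at small rates and becomes appreciable as the rate approaches one, which matches the reviewer's point that the independence approximation is best when the sampling fraction is small.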

On the same page I don't get the E(number of true pairs) / Pr(pair is true) - could this be a typo?

- Chi, not X, should be in Table 1

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009182.r004

Decision Letter 1

Virginia E Pitzer

20 May 2021

Dear Dr. Lessler,

Thank you very much for submitting your manuscript "Sample Size Calculation for Phylogenetic Case Linkage" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please address the very minor points raised by the reviewer. Also, note that some of the variables did not render correctly in the pdf of the main text (at least not on my computer). Please check the final submission and ensure that it looks correct. Once these minor points have been addressed, we should be able to accept the manuscript without further review.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Virginia E. Pitzer, Sc.D.

Deputy Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Many thanks to the authors for considering the points outlined in my previous review. I am satisfied that the authors have adequately addressed all points raised and include only minor typographical feedback below.

Line 136 (marked up version): It may be worth changing mutation to substitution here "rate = 1 mutation/genome/transmission"

Line 281 (marked up version): It might be worth changing the section heading to reflect that it contains multiple examples i.e. "Application to existing datasets"

Lines 386 and 413 (marked up version): The same subheading is used for both examples; it may be worth making each one more specific to the example detailed in its section.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009182.r006

Decision Letter 2

Virginia E Pitzer

14 Jun 2021

Dear Dr. Lessler,

We are pleased to inform you that your manuscript 'Sample Size Calculation for Phylogenetic Case Linkage' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Virginia E. Pitzer, Sc.D.

Deputy Editor-in-Chief

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009182.r007

Acceptance letter

Virginia E Pitzer

30 Jun 2021

PCOMPBIOL-D-20-02147R2

Sample Size Calculation for Phylogenetic Case Linkage

Dear Dr Lessler,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    Supplementary Materials

    S1 Fig. Sample size and false discovery rate given single linkage and single transmission.

    (A) Effect of sample size (red lines) or proportion sampled (blue lines) on the expected number of linked pairs (upper plots) or the false discovery rate of linked pairs (lower plots). The specificity and sensitivity are held constant. (B) Effect of varying the sensitivity and specificity of the linkage criteria on the false discovery rate (FDR).

    (TIF)

    S2 Fig. Estimating the average reproductive number in a population.

    Two hypothetical outbreaks with a pathogen reproductive number (R) equal to 2 and a total of 15 infections. Black circles represent infections; blue circles represent infections who have not yet infected others, or whose descendants are outside the sampling frame. (A) Outbreak caused by a single introduction, meaning there were 14 transmission events and 15 total infections. In other words, R_pop = 14/15 = 0.933. (B) Outbreak caused by two separate introductions, meaning there were only 13 infection events in the sampling frame, resulting in R_pop = 13/15 = 0.867.

    (TIF)

    S3 Fig. Effects of R and G on the distribution of generations between cases.

    Distribution of the number of generations between infections averaged over 1000 simulated outbreaks with reproduction number R and number of generations of transmission G. Distributions are shown for three values of R (rows). Left column: distribution of generations between infections after 3 generations of transmission; middle column: distribution after ln(1000)/ln(R) generations of transmission (see Methods); right column: distribution after ln(1000)/ln(R)+2 generations of transmission.

    (TIF)

    S4 Fig. Genetic distance distributions for different types of pathogens.

    (A) Distribution of genetic distances for linked (purple) and unlinked (yellow) infections for a hypothetical pathogen with substitution rate = 1 substitution/genome/generation and R = 1.5. Inset: receiver operating characteristic (ROC) curve for all possible genetic distance cutoff values. Optimal threshold shown as green dot (ROC) and dashed vertical line (distribution). (B) Distribution of genetic distances for linked and unlinked infections for a hypothetical pathogen with substitution rate = 0.2 substitutions/genome/generation and R = 3. Inset: ROC curve for all possible genetic distance cutoff values for this pathogen. The optimal threshold is shown as in (A).

    (TIF)

    S5 Fig. Error of false discovery rate calculation by sensitivity and specificity.

    (A) Average false discovery from 10,000 simulated outbreaks (proportion sampled = 0.75) binned by sensitivity and specificity (bin size = 0.02). Grey = no genetic distance thresholds in simulation produced this combination of sensitivity and specificity. (B) Zoom view of (A), with specificity ranging from 0.9–1 (bin size = 0.002). (C) Number of data points with sensitivity and specificity in the desired bins (i.e., the number of data points used to calculate the average error in panel (A)). (D) Zoom view of (C), with specificity ranging from 0.9–1.

    (TIF)

    S6 Fig. Histogram of raw parameter error using substitution rate method (optimal threshold only).

    Theoretical minus simulated parameter values for the optimal genetic distance threshold (determined by selecting the threshold for which the point at (1-specificity, sensitivity) is closest to the (0,1) corner) in 10,000 simulations of varying substitution rate and reproductive number for a given sampling proportion. Top row: theoretical minus simulated false discovery rate; middle row: theoretical minus simulated sensitivity; bottom row: theoretical minus simulated specificity. Colors correspond to sampling proportion as in Fig 4.

    (TIF)

    S7 Fig. Predicted versus observed sensitivity using substitution rate method.

    Theoretical versus simulated sensitivity for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color.

    (TIF)

    S8 Fig. Predicted versus observed specificity using substitution rate method.

    Theoretical versus simulated specificity for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number. Outbreak sizes range from 100–2000, as described in Methods. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color.

    (TIF)

    S9 Fig. Histogram of raw specificity error using substitution rate method by sample size and proportion.

    Theoretical minus simulated specificity for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number for a given sampling proportion. Each column represents 10,000 simulations with a specific sampling proportion (colors as in Fig 4) and sample size within each proportion (determined by the final outbreak size) goes from low (top row) to high (bottom row).

    (TIF)

    S10 Fig. Predicted versus observed false discovery rate using actual generation distribution.

    Theoretical versus simulated false discovery rate (FDR) for each genetic distance threshold in 10,000 simulations of varying substitution rate and reproductive number. Theoretical FDR is calculated using the actual distribution of generations between infections from the corresponding simulated outbreak. White line: smoothed conditional mean; grey dashed line: y = x line. Increasing values of the sample size (M) are plotted in darker color.

    (TIF)

    S1 Table. Error of false discovery rate calculation by sample size.

    (PDF)

    S2 Table. Bias and error of false discovery rate calculation using substitution rate method.

    (PDF)

    S3 Table. Error of false discovery rate calculation using substitution rate method by sample size.

    (PDF)

    S4 Table. Bias and error of false discovery rate using actual generation distribution.

    (PDF)

    S1 Text. Deriving probability of transmission given linkage.

    (PDF)

    S2 Text. Determining sensitivity and specificity of genetic distance as a linkage criterion.

    (PDF)

    Attachment

    Submitted filename: phylosamp_elife_reviewerresponse.pdf

    Attachment

    Submitted filename: phylosamp_ploscompbio_reviewerresponse.pdf

    Attachment

    Submitted filename: phylosamp_ploscompbio_reviewerresponse2.docx

    Data Availability Statement

    All code and simulation data are available at: https://github.com/HopkinsIDD/phylosamplesize.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
