Author manuscript; available in PMC: 2020 Jun 1.
Published in final edited form as: Theor Popul Biol. 2019 Mar 12;127:7–15. doi: 10.1016/j.tpb.2019.03.001

An Accurate Approximation for the Expected Site Frequency Spectrum in a Galton-Watson Process under an Infinite Sites Mutation Model

John L Spouge 1,*
PMCID: PMC6640632  NIHMSID: NIHMS1523779  PMID: 30876864

Abstract

If viruses or other pathogens infect a single host, the outcome of infection often hinges on the fate of the initial invaders. The initial basic reproduction number R0, the expected number of cells infected by a single infected cell, helps determine whether the initial viruses can establish a successful beachhead. To determine R0, the Kingman coalescent or continuous-time birth-and-death process can be used to infer the rate of exponential growth in an historical population. Given M sequences sampled in the present, the two models can make the inference from the site frequency spectrum (SFS), the count of mutations that appear in exactly k sequences (k = 1,2,…,M). In the case of viruses, however, if R0 is large and an infected cell bursts while propagating virus, the two models are suspect, because they are Markovian with only binary branching. Accordingly, this article develops an approximation for the SFS of a discrete-time branching process with synchronous generations (i.e., a Galton-Watson process). When evaluated in simulations with an asynchronous, non-Markovian model (a Bellman-Harris process) with parameters intended to mimic the bursting viral reproduction of HIV, the approximation proved superior to approximations derived from the Kingman coalescent or continuous-time birth-and-death process. This article demonstrates that in analogy to methods in human genetics, the SFS of viral sequences sampled well after latent infection can remain informative about the initial R0. Thus, it suggests the utility of analyzing the SFS of sequences derived from patient and animal trials of viral therapies, because in some cases, the initial R0 may be able to indicate subtle therapeutic progress, even in the absence of statistically significant differences in the infection of treatment and control groups.

Keywords: site frequency spectrum, Galton-Watson process, viral reproduction number

1. Introduction

The theory in this article is strongly motivated by the practical observation that in infection, the success of the invasive population often hinges on the fate of the first arrivals. In the viral infection of a single host, the invasive population often descends from a small set of founder viruses. In some cases, it even descends from a single founder, e.g., about 80% of human immunodeficiency virus (HIV) infections have a single founder (Keele et al. 2008, Haaland et al. 2009, Love et al. 2016). The initial basic reproduction number R0, in viral infection the expected number of cells that a single infected cell infects in the next generation (Giorgi et al. 2010), contributes fundamentally to the chances of an invasive population establishing a successful beachhead. On one hand, if R0 < 1, the invasive population falls below replacement and dies. On the other hand, if R0 is slightly greater than 1, the invaders have a small but positive chance of survival, and if R0 is large, the invaders are likely to flourish. Thus, whether preventing infection or mitigating its impact by reducing an initial viral load, the initial R0, the basic reproduction number at the start of infection, could in principle provide a measure for setting therapeutic goals and benchmarking therapeutic progress.

Presently, if a researcher wishes to assess therapeutic progress by demonstrating a decrease in host infectability, only one type of datum has direct pertinence, i.e., the binary datum corresponding to whether or not infection occurs (Pegu et al. 2013, Strbo et al. 2013, Gordon et al. 2016). Animal trials testing viral therapies have therefore developed ingenious designs (Regoes et al. 2005, Nolen et al. 2015), primarily to squeeze binary data for the statistical power required to detect subtle changes in infectability. An estimate of R0 would provide an additional measure of therapeutic success, beyond the binary datum of infection or viral extinction. A quantitative analysis estimating the initial R0 would therefore help bypass a stubborn bottleneck in the systematic development of viral therapies.

Unfortunately, the initial R0 is usually not directly accessible, because the initial viral population is typically latent, i.e., below the threshold of detectability. Although the literature for HIV gives estimates such as R0 ≈ 6 (Stafford et al. 2000) or R0 ≈ 8 (Ribeiro et al. 2010), these estimates pertain to viremia (i.e., after HIV becomes detectable in blood) (Fiebig et al. 2003), which starts on average about 10 days after the founders infect (Kahn and Walker 1998). The lower limit of HIV detection in blood is about 20 viruses/ml (Kosaka et al. 2017), so a viremia implies that the total blood volume (5 L) contains at least about $10^5$ viruses. When HIV is detected in blood, therefore, the viral invasion has long since secured its beachhead. Indeed, if current estimates (R0 ≈ 6 or R0 ≈ 8) of the basic reproduction number in viremia were pertinent during the initial infection, a single infected cell would probably ensure successful infection. Note, however, that the invasive routes with the highest estimates of per-act HIV transmission risk are (per 10,000 exposures) blood transfusion (9250), mother-to-child transmission (2260), and receptive anal intercourse (138). Other routes have an HIV transmission risk less than 63 per 10,000 exposures (see Table 1 in (Patel et al. 2014)). Because HIV transmission is so uncertain, a typical HIV founder virus probably faces a much more hostile initial environment than the current direct estimates of R0 in viremia suggest.

Although detection thresholds prevent direct measurement of the initial R0, genetic researchers have shown that the nucleic acids of present-day humans retain footprints from the population history (e.g., (Durrett and Limic 2001, Durrett 2008)). Similarly, viral sequences sampled during early viremia may be informative about the initial R0. Figure 1 illustrates the concept, and the remainder of this article refines the qualitative insight there. In particular, Figure 1 suggests that mutations appearing in two or more sampled sequences contain information about the initial R0 of an expanding population. Although the viral applications motivate the analysis, the theory presented here has broad applicability to inferring the early demographics of populations with very few founders.

Figure 1: Two viral ancestries, illustrating the effect of the initial R0 on samples.

Figure 1 illustrates two hypothetical viral ancestries. The circles represent viruses. In the two ancestries, time runs downward, and for simplicity each viral population at the bottom has a single founder at the top (multiple founders introduce technical complications, but do not change the concept illustrated). The present population at the bottom contains some large number Z of individuals, from which M sequences are sampled. The ancestry on the left has a large initial R0, so the founder has Z1 daughters, where for illustration, we assume Z1 is much larger than M (M − 1) / 2. Thus, any two of the M sampled sequences are likely to descend from different daughters (many standard references on the “Birthday problem” implicitly prove this statement). If so, the sampled sequences cannot share any mutations away from the founder sequence. In contrast, the ancestry on the right has a small initial R0, so the founder has only Z1 = 2 daughters. If either daughter’s sequence has a mutation away from the founder sequence, about half the M sampled sequences share the mutation. Thus, mutations appearing in more than one of the M sampled sequences are informative about the initial R0. Sampled mutations are conveniently partitioned into a “site frequency spectrum”, i.e., the numbers ηm of the sites where a mutation appears in exactly m of the M sampled sequences (m = 1,2,…,M).
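The "Birthday problem" argument in the legend can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each sampled sequence descends from any of the Z1 daughters with equal probability and independently of the others; the function name `prob_distinct_daughters` is hypothetical and does not come from the article.

```python
def prob_distinct_daughters(M, Z1):
    """Probability that M sampled sequences all descend from different daughters,
    assuming (as a simplification) that every daughter is equally likely to be
    the ancestor of each sampled sequence, independently."""
    p = 1.0
    for i in range(M):
        p *= max(0.0, 1.0 - i / Z1)
    return p

# Large initial R0: Z1 is much larger than M (M - 1) / 2 = 120, so sharing is unlikely.
print(prob_distinct_daughters(M=16, Z1=10_000))
# Small initial R0: with Z1 = 2 daughters, any sample of 3 or more must share a daughter.
print(prob_distinct_daughters(M=3, Z1=2))  # 0.0
```

Under this simplification, the probability is close to 1 when Z1 greatly exceeds M (M − 1) / 2, which is the regime the left ancestry in Figure 1 depicts.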

Thus, given M aligned viral sequences sampled simultaneously from a single infected host, the aim of this article is to reconstruct as a function of R0 the site frequency spectrum (SFS) described in Figure 1. Before proceeding to the mathematical abstractions of Section 2, we first establish the parameter ranges relevant to an important application. In studies sampling HIV gp120 sequences from patients, M typically ranged from 16 to 30 per patient (Lee et al. 2009). The gp120 gene is about 2550 nt long, and (with crossovers neglected) each HIV replication averages $\varepsilon \approx 2.16 \times 10^{-5}$ point mutations/base/replication (Giorgi et al. 2010). On average, therefore, each RNA replication entails μ ≈ 0.0551 mutations in gp120. (The three significant figures represent an unrealistic precision but retain consistency with the literature.)

Studies in human genetics suggest some methods for estimating R0 from the SFS in gp120 sequence data, but the methods are not directly suited to viral infection in a single host. To elaborate, consider the following idealized model of an initial HIV infection, focusing first on a typical initial infected cell in a host. HIV lyses the cell, releasing about $10^3$ to $10^4$ viral particles (Chen et al. 2007, De Boer et al. 2010). As noted above, the initial R0 is likely smaller than R0 = 8, so a typical viral particle has a minuscule chance of infecting. Given an initial environment, if viral particles infect independently, the count of infected daughter cells approximately follows a Poisson distribution whose mean equals the unknown initial R0 (see, e.g., (Arratia et al. 1989a, Arratia et al. 1989b)). Thus, the viruses from the initial infected cell infect Z1 daughter cells, where Z1 is a Poisson random variate with mean R0. Estimates of the replication time for HIV range from 1.76 days to 4.2 days (Love et al. 2016), 2 days being a reasonable approximation (Markowitz et al. 2003). We therefore model the time-intervals between the lysis of a mother cell and her infected daughter cells as independent random variates with a gamma distribution, approximating the mean by 2 days and the standard deviation by 0.24 = 2 – 1.76 days (1.76 days is the minimum estimate of the HIV replication time, so the resulting gamma distribution is likely less tightly concentrated around its mean than the true random HIV replication time). The gamma distribution approximating the random HIV replication time therefore has shape and rate parameters (n, λ) = (69.4, 34.7) (i.e., its mean $n\lambda^{-1} = 2$ and variance $n\lambda^{-2} \approx 0.24^2$). The reproductive cycle begins anew with the lysis of each infected daughter.
This idealized but relatively realistic model of HIV reproduction is a Bellman-Harris process (Bellman and Harris 1948); call it the “Gamma model”, to emphasize that its parameters and distribution are chosen to model HIV (see Figure 2).
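The shape and rate parameters quoted above follow from matching the mean and standard deviation of a gamma distribution. The sketch below reproduces that arithmetic; the function name `gamma_shape_rate` is hypothetical, and the single draw at the end is only an illustration of one replication time, not the article's simulation code.

```python
import random

def gamma_shape_rate(mean, sd):
    """Moment matching: mean = shape / rate, variance = shape / rate**2."""
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate

n, lam = gamma_shape_rate(mean=2.0, sd=0.24)
print(round(n, 1), round(lam, 1))  # 69.4 34.7

# One replication time (in days) drawn from this gamma distribution.
# random.gammavariate takes (shape, scale), so the scale is 1 / rate.
rng = random.Random(0)
replication_time = rng.gammavariate(n, 1.0 / lam)
```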

Figure 2: A diagram schematically illustrating the Gamma and Delta models.

Figure 2 illustrates two hypothetical models of viral ancestries. As in Figure 1, black circles represent the lysis of infected cells, and each ancestry starts from a single founder at the top, with time running downward. The numbers in the circles count the generations away from the founder, with the single founder in generation 0. In the Gamma model on the left, the lysis of each infected cell gives birth to a random number of viral daughters whose progeny lyse cells at random times chosen independently from a gamma distribution. The graph on the extreme left illustrates the relevant gamma distribution. The Delta model on the right is like the Gamma model, except that each cell lysis gives birth to viral daughters who lyse their cells simultaneously, in synchronous generations at time-intervals equal to the mean generation time in the Gamma model.

The theory in Section 2 exploits the Delta model illustrated in Figure 2 (a Galton-Watson branching process, in discrete-time) to provide analytic approximations to the SFS of the Gamma model. To make the Delta model comparable to the Gamma model, the count of daughters in the Delta model is also a Poisson variate with mean R0, and the deterministic replication time in the Delta model is 2 days, the mean replication time in the Gamma model. As the models’ names suggest, the synchronous generations in the Delta model therefore substitute a Dirac delta distribution (i.e., a distribution having a single atom of probability 1, a distribution as tightly concentrated around its mean as possible) for the gamma distribution in the Gamma model. For comparison with simulations of the Gamma model, a birth-and-death process with constant birth and death intensities and the Kingman coalescent model with an exponentially expanding population also provide analytic expressions for the SFS. All approximations in this article are uncontrolled, so (often without comment) we rely on the simulations of the Gamma model with parameters relevant to HIV gp120 to assess the accuracy of SFS approximations. Other parameter regimes, which require separate assessment, are beyond the immediate practical purview of this article.

The organization of this article is as follows. Section 2 (Theory) approximates the mean SFS in the Delta model as a function of the basic reproduction number R0 and compares it to the SFS derived from the Kingman coalescent for an expanding population or from continuous-time birth-and-death processes. Section 3 (Methods) describes simulation of the Gamma and Delta models. It also discusses the dependence of the SFS variance on the mutation rate. Section 4 (Results) examines the accuracy of the various approximations when the simulations use the parameters relevant to HIV gp120. Section 5 is our Discussion, which examines the few (superable) difficulties impeding direct data analysis with the theory presented here.

2. Theory

An Approximation for the Site Frequency Spectrum of a Galton-Watson Process

With the Delta model of Figure 2 in mind, consider a discrete-time branching process starting with one individual in generation 0. Let $\tilde{\mathcal{G}}_g$ denote generation g of the branching process (g = 0,1,…), let $\tilde{Z}_g$ count the individuals in $\tilde{\mathcal{G}}_g$ (e.g., $\tilde{Z}_0 = 1$), let $\tilde{I}_{g,i} \in \tilde{\mathcal{G}}_g$ denote individual i in generation g ($i = 1,2,\ldots,\tilde{Z}_g$; g = 0,1,…), and let $\tilde{I}_{g,i}$ have $\tilde{X}_{g,i}$ daughters. Thus, $\tilde{Z}_{g+1} = \sum_{i=1}^{\tilde{Z}_g} \tilde{X}_{g,i}$. The $\{\tilde{X}_{g,i}\}$ are mutually independent, identically distributed non-negative integer random variates, i.e., the process is a Galton-Watson process.

To streamline the subscripted notation R0 for mathematical manipulation, define $r = E\tilde{Z}_1 = R_0$, e.g., $E\tilde{Z}_g = r^g$. The case r > 1 (a supercritical Galton-Watson process) is of greatest interest in the viral application, because (as the Introduction indicates) the sampling in viremia entails a large viral population. Let $\tilde{P}(s) = E s^{\tilde{Z}_1}$ denote the probability generating function (pgf) of the number of daughters in $\tilde{\mathcal{G}}_1$, so $r = E\tilde{Z}_1 = \tilde{P}'(1)$. The Galton-Watson process goes extinct if $\tilde{Z}_g = \tilde{Z}_{g+1} = \cdots = 0$ for some g. The extinction probability 0 ≤ q < 1 satisfies $q = \tilde{P}(q)$ (i.e., an individual’s lineage goes extinct if and only if all its daughters’ lineages go extinct).

The Delta model is a Galton-Watson process where the {X~g,i} share a Poisson distribution with mean R0, so the relevant pgf is

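$$\tilde{P}_\Delta(s) = \tilde{P}(s) = E s^{\tilde{Z}_1} = \sum_{k=0}^{\infty} P(\tilde{Z}_1 = k)\, s^k = \sum_{k=0}^{\infty} e^{-r} \frac{r^k}{k!} s^k = e^{-r} \sum_{k=0}^{\infty} \frac{(rs)^k}{k!} = e^{r(s-1)}. \tag{1}$$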

Perhaps predictably, the approximations below depend only on $r = \tilde{P}'(1)$, and no other feature of $\tilde{P}(s)$.
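For the Poisson pgf in Eq (1), the fixed-point equation q = exp(r(q − 1)) for the extinction probability has no closed form in elementary functions, but iterating the pgf from q = 0 converges to the smallest non-negative root. The following is a minimal sketch under that standard result; the function name and tolerance are illustrative assumptions, not part of the article.

```python
import math

def poisson_extinction_probability(r, tol=1e-14, max_iter=100_000):
    """Smallest root of q = exp(r * (q - 1)), found by iterating the pgf from q = 0."""
    q = 0.0
    for _ in range(max_iter):
        q_next = math.exp(r * (q - 1.0))
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q  # not converged within max_iter (e.g., r very close to 1)

q = poisson_extinction_probability(2.0)
print(q, math.exp(2.0 * (q - 1.0)))  # the two values agree at the fixed point
```

For r > 1 the iteration returns a value strictly below 1, whereas for r < 1 it approaches 1, matching the replacement argument in the Introduction.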

Fix some large generation G (where G → ∞ later), and let $\tilde{I}_{g,i}$ have $\tilde{Z}_{g,i;G}$ descendants in $\tilde{\mathcal{G}}_G$ (g = 0,1,…,G; $i = 1,2,\ldots,\tilde{Z}_g$). For example, $\tilde{Z}_{G,i;G} = 1$ for every $i = 1,2,\ldots,\tilde{Z}_G$ (because by convention, every individual in $\tilde{\mathcal{G}}_G$ is her own descendant), and $\sum_{i=1}^{\tilde{Z}_g} \tilde{Z}_{g,i;G} = \tilde{Z}_G$ for every g = 0,1,…,G (because every individual in $\tilde{\mathcal{G}}_G$ has exactly one ancestor in $\tilde{\mathcal{G}}_g$). Sample M individuals with replacement from $\tilde{\mathcal{G}}_G$, and let $\tilde{S}_{g,i;G}$ count the descendants of $\tilde{I}_{g,i}$ in the sample from generation G. (Sampling with replacement is mathematically more convenient than sampling without replacement, but as with other uncontrolled approximations here, simulation results are required to assess its accuracy.) Each sampled individual has exactly one ancestor in $\tilde{\mathcal{G}}_g$ (g = 0,1,…,G), so the sample from $\tilde{\mathcal{G}}_G$ implicitly samples ancestors from $\tilde{\mathcal{G}}_g$, with $\sum_{i=1}^{\tilde{Z}_g} \tilde{S}_{g,i;G} = M$ for every g = 0,1,…,G.

If the Galton-Watson process survives long enough to be sampled at some sufficiently large G (in the Introduction, e.g., the viral population has become detectable in blood), all sampled ancestors in early generations had lineages that for any practical purpose have survived forever. For brevity, call ancestors whose lineages survive forever “immortal”; all others, “doomed”. As another uncontrolled approximation, delete all doomed individuals from the Galton-Watson process. The individuals remaining, if any, form another Galton-Watson process, the so-called skeleton process of immortal individuals.

To rephrase, if an individual is doomed, for G large enough samples from $\tilde{\mathcal{G}}_G$ never contain any of her descendants. To focus on relevant individuals, therefore, we replace the original Galton-Watson process with the skeleton process and drop the over-tildes from the associated quantities $I_{g,i}$, $Z_g$, $Z_{g,i;G}$, $S_{g,i;G}$, etc. To avoid the triviality of an empty skeleton process, all probabilities and expectations below implicitly condition on survival of the original process. The Appendix shows that the skeleton process has the same basic reproduction number $r = \tilde{P}'(1) = P'(1)$ as the original process. Because the approximations below depend only on the basic reproduction number r, dropping over-tildes and replacing the original process with the skeleton process does not impact the analysis.

To introduce genealogical conventions into the skeleton process, for linguistic convenience let $I_{g,i}$ be both her own ancestor and her own descendant. In contrast, let the strict partial order $I_{g,i} < I_{g',i'}$ denote that $I_{g,i}$ is an ancestor of $I_{g',i'}$ and g < g′. If $I_{g,i} < I_{g',i'}$ and g′ = g + 1, therefore, $I_{g,i}$ is the mother of $I_{g',i'}$ and $I_{g',i'}$ is the daughter of $I_{g,i}$.

To introduce mutations into the genealogy of the skeleton process, note that a daughter may differ in nucleic acid sequence from her mother. To compare the sequence differences, conceptually we align the relevant sequences from every individual Ig,i (g = 0,1,…; i = 1,2,…,Zg). Let the Poisson random variate ξg,i (e.g., (Giorgi et al. 2010)) count the novel point mutations in Ig,i, namely, the alignment columns where Ig,i has a different letter from her mother (e.g., ξg,i = 0 if Ig,i and her mother have the same nucleic acid sequence). Assume an infinite sites model (Kimura 1969), which restricts the mutations allowed: if Ig,i and her mother differ in alignment column j (i.e., Ig,i bears a novel mutation in column j), no other mother-daughter pair differs in column j. Thus, each column contains at most two different letters, and all individuals bearing a mutation (relative to I0,1, the founder) descend from a single ancestor Ig,i. Assume that Eξg,i=μ, i.e., the expectation of the count of novel mutations in a daughter sequence is some fixed constant μ.

Define an indicator random variate for k-fold sampling of the ancestor Ig,i (g = 0,1,…,G; i = 1, 2,…, Zg): Ig,i;k = 1 if Sg,i;G = k; and Ig,i;k = 0 otherwise. With the theoretical preliminaries in hand, consider now the random variate

$$\eta_m = \sum_{g=1}^{G} \sum_{i=1}^{Z_g} \xi_{g,i}\, I_{g,i;m} \tag{2}$$

(m = 1,…,M), which counts novel mutations over all individuals Ig,i, but only those novel mutations displayed by exactly m of the M sampled individuals. Under the infinite sites model, each novel mutation occurs in a different column, so from the perspective of the sampled sequences, Eq (2) counts all columns containing m letters mutated away from the founder, i.e., ηm counts the distinct mutations that appear in exactly m sampled sequences. Thus, η = (η1, η2,…, ηM) constitutes the SFS in Figure 1.
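From the perspective of sampled data, the SFS can be tallied directly from an alignment by counting, in each column, how many sampled sequences differ from the founder. The sketch below assumes the founder sequence is known and the alignment is gap-free; the function name and the toy sequences are hypothetical illustrations, not data from the article.

```python
def site_frequency_spectrum(founder, samples):
    """Return [eta_1, ..., eta_M]: eta_m counts the alignment columns in which
    exactly m of the M sampled sequences differ from the founder sequence."""
    M = len(samples)
    eta = [0] * (M + 1)
    for j, ancestral in enumerate(founder):
        m = sum(seq[j] != ancestral for seq in samples)
        if m > 0:
            eta[m] += 1
    return eta[1:]

founder = "ACGTAC"
samples = ["ACGTAC", "ATGTAC", "ATGTCC", "ACGAAC"]  # toy alignment, M = 4
print(site_frequency_spectrum(founder, samples))  # [2, 1, 0, 0]
```

In the toy alignment, one column is mutated in two sequences and two columns are mutated in one sequence each, consistent with the infinite sites restriction that each column carries at most two letters.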

The approximation in Eq (7) below for the expected SFS $E\eta = (E\eta_1, E\eta_2, \ldots, E\eta_M)$ is our central result, derived as follows. As in the so-called Delta method (Ver Hoef 2012), expand

$$f(X) = f(EX) + f'(EX)(X - EX) + \tfrac{1}{2} f''(EX)(X - EX)^2 + \cdots \tag{3}$$

into a Taylor series around EX. Truncate the resulting series before $\tfrac{1}{2} f''(EX)\,\sigma^2(X)$ to yield a linear function, and take expectations to yield the uncontrolled approximation $E f(X) \approx f(EX)$. As a preliminary,

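$$E\left[\frac{Z_{g,i;G}}{Z_G}\right] = E\left[\frac{1}{Z_g} \sum_{i=1}^{Z_g} \frac{Z_{g,i;G}}{Z_G}\right] = E\left[\frac{1}{Z_g}\right] \approx \frac{1}{E Z_g} = r^{-g}, \tag{4}$$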

where the first equality follows from the exchangeability of $\{I_{g,i} : i = 1,2,\ldots,Z_g\}$; the second, from the fact that every individual in the G-th generation has exactly one ancestor in the g-th generation (g ≤ G); and the approximation, from a linear truncation applied to f(X) = 1/X. A second application of a linear truncation to $f(X) = X^m (1 - X)^{M-m}$ with $X = Z_{g,i;G} / Z_G$ shows that

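$$P(S_{g,i;G} = m) = E\left[\binom{M}{m} \left(\frac{Z_{g,i;G}}{Z_G}\right)^{m} \left(1 - \frac{Z_{g,i;G}}{Z_G}\right)^{M-m}\right] \approx \binom{M}{m} \left(r^{-g}\right)^{m} \left(1 - r^{-g}\right)^{M-m}. \tag{5}$$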

Assume that the implicit sampling $\{S_{g,i;G}\}$ in $\mathcal{G}_g$ depends only weakly on the size $Z_g$ of $\mathcal{G}_g$. Then

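$$E\eta_m = E\left[\sum_{g=1}^{G} \sum_{i=1}^{Z_g} \xi_{g,i}\, I_{g,i;m}\right] = \mu \sum_{g=1}^{G} E\left[\sum_{i=1}^{Z_g} E\left[I_{g,i;m} \mid Z_g\right]\right] \approx \mu \sum_{g=1}^{G} E\left[\sum_{i=1}^{Z_g} P(S_{g,i;G} = m)\right] = \mu \sum_{g=1}^{G} r^{g}\, P(S_{g,1;G} = m), \tag{6}$$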

where the second equality uses the mean $E\xi_{g,i} = \mu$, the independence properties of the $\{\xi_{g,i}\}$ (another model assumption), and a property of the expectations conditioned on $Z_g$; the third (approximate) equality depends on the above assumption of weak dependence; and the final equality uses both the Malthusian population growth $E Z_g = r^g$ of the skeleton process and the exchangeability of $\{I_{g,i} : i = 1,2,\ldots,Z_g\}$ (so $P(S_{g,i;G} = m) = P(S_{g,1;G} = m)$).

Together, therefore, Eqs (2)–(6) approximate Eηm as a function of r : for m ≥ 1,

$$E\eta_m \approx \mu \binom{M}{m} \sum_{g=1}^{G} \left(r^{-g}\right)^{m-1} \left(1 - r^{-g}\right)^{M-m} = E_\Delta \eta_m, \tag{7}$$

where the equality on the right of Eq (7) defines the (approximate) Delta SFS $E_\Delta \eta_m$.

For m ≥ 2, convergence of the geometric series $\sum_{g=1}^{\infty} (r^{-g})^{m-1}$ shows that the sum $E_\Delta \eta_m$ in Eq (7) converges as G → ∞. For m = 1, $E_\Delta \eta_m$ has a very different behavior:

$$\mu M G - E_\Delta \eta_1 = \mu M \sum_{g=1}^{G} \left[1 - \left(1 - r^{-g}\right)^{M-1}\right]. \tag{8}$$

Now,

$$1 - \left(1 - r^{-g}\right)^{M-1} = (M-1) \int_{1 - r^{-g}}^{1} x^{M-2}\, dx \le (M-1) \int_{1 - r^{-g}}^{1} dx = (M-1)\, r^{-g}, \tag{9}$$

so the sum on the right of Eq (8) increases with G and is bounded above by the finite sum $\sum_{g=1}^{\infty} (M-1)\, r^{-g}$ of a geometric series. Thus, the difference on the left of Eq (8) increases to a finite limit as G → ∞. Consequently, $\lim_{G \to \infty} E_\Delta \eta_1 / (\mu M G) = 1$.

For m ≥ 2, the second factor in the middle expression of Eq (7) expands to yield an incidental alternative expression,

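$$E_\Delta \eta_m = \mu \binom{M}{m} \sum_{g=1}^{G} \sum_{j=0}^{M-m} \binom{M-m}{j} \left(r^{-g}\right)^{m-1} \left(-r^{-g}\right)^{j} = \mu \binom{M}{m} \sum_{j=0}^{M-m} \binom{M-m}{j} (-1)^{j} \sum_{g=1}^{G} r^{-(m-1+j)g} = \mu \binom{M}{m} \sum_{j=0}^{M-m} \binom{M-m}{j} (-1)^{j}\, \frac{1 - r^{-(m-1+j)G}}{r^{m-1+j} - 1}, \tag{10}$$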

whereas

$$E_\Delta \eta_1 = \mu M \left[G - \sum_{j=1}^{M-1} \binom{M-1}{j} (-1)^{j-1}\, \frac{1 - r^{-jG}}{r^{j} - 1}\right]. \tag{11}$$

For computing, because the final sum in Eq (10) alternates in sign, it is less reliable numerically than Eq (7). For m = M, however, Eq (10) provides a convenient check on programming errors:

$$E_\Delta \eta_M = \mu\, \frac{1 - r^{-(M-1)G}}{r^{M-1} - 1}. \tag{12}$$
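The programming check suggested above can be sketched as follows: Eq (7) is evaluated by direct summation, then compared with the alternating form of Eq (10) and the closed form of Eq (12) for m = M. The function names are hypothetical, and the parameter values (M = 8, r = 2, G = 60) are chosen only for illustration.

```python
from math import comb, isclose

def delta_sfs(m, M, r, mu, G):
    """Eq (7): mu * C(M, m) * sum over g = 1..G of (r^-g)^(m-1) * (1 - r^-g)^(M-m)."""
    total = 0.0
    for g in range(1, G + 1):
        x = r ** (-g)
        total += x ** (m - 1) * (1.0 - x) ** (M - m)
    return mu * comb(M, m) * total

def delta_sfs_alternating(m, M, r, mu, G):
    """Eq (10), for m >= 2; the alternating sum can lose precision for large M."""
    total = 0.0
    for j in range(M - m + 1):
        a = m - 1 + j
        total += comb(M - m, j) * (-1) ** j * (1.0 - r ** (-a * G)) / (r ** a - 1.0)
    return mu * comb(M, m) * total

def delta_sfs_top(M, r, mu, G):
    """Eq (12): closed form for m = M."""
    return mu * (1.0 - r ** (-(M - 1) * G)) / (r ** (M - 1) - 1.0)

M, r, mu, G = 8, 2.0, 0.0551, 60
for m in range(2, M + 1):
    assert isclose(delta_sfs(m, M, r, mu, G), delta_sfs_alternating(m, M, r, mu, G), rel_tol=1e-9)
assert isclose(delta_sfs(M, M, r, mu, G), delta_sfs_top(M, r, mu, G), rel_tol=1e-12)
```

For small M the two forms agree closely; for large M the binomial coefficients in Eq (10) grow quickly, which is the numerical concern raised in the text.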

To compare the above approximations to other models of interest, consider a continuous-time birth-and-death branching process with mutation rate μ, birth rate b, and death rate d, yielding a basic reproduction number r = b / d. Let N1 be the total population at the time t1 of sampling M individuals. For m ≥ 2 (our main interest), if N1 is large, the approximate expected SFS is

$$E\eta_m \approx \frac{\mu M}{1 - r^{-1}} \cdot \frac{1}{m(m-1)} = E_C \eta_m \tag{13}$$

independent of t1 and the exact value of N1, where Eηm=Ssample(m,μ,t1M) in the authors’ notation of Eqs (27) and (28) of (Ohtsuki and Innan 2017). According to those authors, Kingman’s coalescent for an exponentially growing population (e.g., (Griffiths and Tavare 1998)) yields the same approximation. (See also Theorem 2 of (Durrett 2013) about the Moran model.) Accordingly, the right side of Eq (13) defines ECηm, the (approximate) coalescent SFS.

To relate Eq (13) to Eq (7), let G → ∞. For m ≥ 2, the Appendix shows that

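$$\lim_{r \downarrow 1} \frac{E_\Delta \eta_m}{E_C \eta_m} = \lim_{r \downarrow 1} \frac{\mu \binom{M}{m} \sum_{g=1}^{\infty} \left(r^{-g}\right)^{m-1} \left(1 - r^{-g}\right)^{M-m}}{\dfrac{\mu M}{1 - r^{-1}} \cdot \dfrac{1}{m(m-1)}} = 1, \tag{14}$$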

i.e., the ratio of the Delta and coalescent SFSs approaches 1 as r decreases to 1.
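The limit in Eq (14) can be examined numerically by evaluating the ratio of Eq (7) (with G → ∞) to Eq (13) at a few values of r approaching 1. The sketch below is illustrative only: the function names, the truncation threshold, and the choice M = 16, m = 2 are assumptions, not the article's code.

```python
from math import comb

def delta_sfs_limit(m, M, r, mu):
    """Eq (7) as G -> infinity (m >= 2), truncated once (r^-g)^(m-1) is negligible."""
    total, g = 0.0, 1
    while True:
        x = r ** (-g)
        if x ** (m - 1) < 1e-17:
            break
        total += x ** (m - 1) * (1.0 - x) ** (M - m)
        g += 1
    return mu * comb(M, m) * total

def coalescent_sfs(m, M, r, mu):
    """Eq (13): mu * M / (1 - 1/r) / (m * (m - 1))."""
    return mu * M / (1.0 - 1.0 / r) / (m * (m - 1))

def sfs_ratio(m, M, r, mu=0.0551):
    return delta_sfs_limit(m, M, r, mu) / coalescent_sfs(m, M, r, mu)

for r in (2.0, 1.1, 1.01):
    print(r, round(sfs_ratio(2, 16, r), 3))
```

In this illustration the ratio rises toward 1 as r decreases to 1, in line with Eq (14); the mutation rate μ cancels from the ratio.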

In many models of exponentially expanding populations (e.g., the birth-and-death process (Champagnat and Henry 2016)), the moments of η2,…, ηM converge to fixed, finite values at infinite times. Here in a Galton-Watson process, the SFS approximations $E_\Delta \eta_m$ (m ≥ 2) display a similar convergence as G → ∞. Accordingly, Section 4 presents numerical results only for the limit G → ∞ in Eq (7).

3. Methods

A technique for reducing variance (Hammersley and Handscomb 1964) also reduces the programming effort in simulating the SFS for the Gamma and Delta models, as follows. For either model, let Am count the non-founding ancestors with m descendants in the sample, and call A = (A1, A2, …, AM) the ancestral sample frequency spectrum. (Note that the related term “allele frequency spectrum” is sometimes used synonymously with SFS, but it might be preferable to reserve it (e.g., (Griffiths and Pakes 1988)) for the analogous variate under the infinite alleles model (Kimura and Crow 1964).) For any model with synchronous generations (e.g., the Delta model),

$$A_m = \sum_{g=1}^{G} \sum_{i=1}^{Z_g} I_{g,i;m}, \tag{15}$$

so

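$$\sum_{m=1}^{M} m A_m = \sum_{m=1}^{M} m \sum_{g=1}^{G} \sum_{i=1}^{Z_g} I_{g,i;m} = \sum_{g=1}^{G} \sum_{i=1}^{Z_g} \sum_{m=1}^{M} m\, I_{g,i;m} = \sum_{g=1}^{G} M = MG, \tag{16}$$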

because $\sum_{m=1}^{M} m\, I_{g,i;m}$ counts the descendants of $I_{g,i}$ in the sample. Some equivalent of the following result is doubtless stated elsewhere.

Theorem 1: Consider an infinite sites model where the novel mutation counts in every daughter of every mother are independent Poisson variates with fixed mean μ (e.g., the Gamma or Delta model). Given A, the coordinates of η are independent Poisson variates, with ηm having mean μAm.

Proof: The coordinates of η are determined by the novel mutation counts in disjoint ancestral sets of sizes A. Thus, independence of mutation implies each ηm is the sum of Am Poisson variates of mean μ, with all Poisson variates independent. The sum of Am Poisson variates of mean μ is a Poisson variate ηm of mean μAm, where the variates {ηm} are independent. □

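Theorem 1 implies that under the Gamma or Delta model, $E\eta_m = \mu E A_m$. In addition, the law of total variance yields

$$\sigma^2(\eta_m) = E\left[\sigma^2(\eta_m \mid A)\right] + \sigma^2\left(E[\eta_m \mid A]\right) = E[\mu A_m] + \sigma^2(\mu A_m) = \mu E A_m + \mu^2 \sigma^2(A_m), \tag{17}$$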

where the second equality follows from Theorem 1 and the fact that the variance of a Poisson variate equals its mean. Under the Gamma or Delta model, therefore, the simulation of ancestries alone suffices to estimate the first two moments of ηm. On the right side of Eq (17), the first term is sometimes called the mutational variance; the second, the evolutionary variance. Besides variance reduction, the simulation of A clarifies the relative contributions of the mutational and evolutionary variance in Eq (17) by displaying their dependence on μ.
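Given simulated realizations of the ancestral count Am, the two terms of Eq (17) can be estimated without simulating any mutations. The sketch below uses the sample mean and sample variance as estimators; the function name and the five toy values of Am are hypothetical, not simulation output from the article.

```python
from statistics import fmean, variance

def sfs_variance_components(mu, A_m_realizations):
    """Estimate the two terms on the right of Eq (17) from realizations of A_m:
    mutational variance mu * E[A_m] and evolutionary variance mu^2 * var(A_m)."""
    mutational = mu * fmean(A_m_realizations)
    evolutionary = mu ** 2 * variance(A_m_realizations)
    return mutational, evolutionary

mu = 0.0551                      # HIV gp120 value from the Introduction
toy_A_m = [10, 14, 9, 12, 15]    # hypothetical realizations of A_m
mutational, evolutionary = sfs_variance_components(mu, toy_A_m)
print(mutational, evolutionary, mutational + evolutionary)
```

Because the mutational term scales with μ and the evolutionary term with μ², a small μ tends to make the mutational term dominate, other quantities being fixed.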

For different r > 1, simulation yielded 1000 realizations of the Gamma and Delta models. As a check on using the skeleton process in Section 2, each realization started with one founder and propagated the population, simply restarting it with another founder if the population became extinct. The realization continued until the population reached a threshold of 6000 live individuals. Here, 6000 is an arbitrary large number, chosen on the (possibly irrelevant) basis that a neutral model in coalescent theory estimated the effective size of the viral population in HIV patients between 2000 and 6000 (Seo et al. 2002). As a check on using sampling with replacement in Section 2, each realization sampled the 6000 live individuals uniformly without replacement.
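A minimal sketch of one Delta-model realization along these lines is given below. It is not the article's simulation code: the function names are hypothetical, the Poisson sampler is Knuth's simple method, and the sketch assumes the sample is drawn without replacement from the whole first generation whose size reaches the threshold. It returns the ancestral sample frequency spectrum A of Eq (15), so the identity in Eq (16) serves as a check.

```python
import math
import random
from collections import Counter

def poisson(lam, rng):
    """Knuth's multiplication method; adequate for small means such as R0 < 10."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def grow_delta(r, threshold, rng):
    """Grow synchronous generations from one founder until a generation reaches
    `threshold` individuals, restarting with a new founder after extinction.
    parents[g-1][j] is the index, in generation g-1, of the mother of individual j."""
    while True:
        parents, size = [], 1
        while 0 < size < threshold:
            gen = [i for i in range(size) for _ in range(poisson(r, rng))]
            parents.append(gen)
            size = len(gen)
        if size >= threshold:
            return parents

def ancestral_spectrum(parents, M, rng):
    """A[m] = number of non-founding ancestors with exactly m sampled descendants."""
    current = Counter(rng.sample(range(len(parents[-1])), M))  # without replacement
    A = Counter()
    for gen in reversed(parents):
        A.update(current.values())      # tally this generation's sampled ancestors
        previous = Counter()
        for idx, count in current.items():
            previous[gen[idx]] += count  # pass descendant counts to the mothers
        current = previous
    return A

rng = random.Random(1)
parents = grow_delta(r=2.0, threshold=500, rng=rng)
A = ancestral_spectrum(parents, M=16, rng=rng)
G = len(parents)
assert sum(m * count for m, count in A.items()) == 16 * G  # Eq (16)
```

By Theorem 1, multiplying the averaged A over many realizations by μ would then estimate the expected SFS; the Gamma model would additionally require random replication times, which this sketch omits.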

4. Results

Typically, the sample means and sample standard deviations of the SFS simulated from the Delta and Gamma models were visually indistinguishable, and the differences between the models’ results were always subtle. Using the Gamma model, Figure 3 exemplifies some other numerical trends. Typically for fixed m, the sample mean SFS $\hat{E}\eta_m$ simulated from the Gamma model increased with the sample number M, while the sample standard deviation decreased (see plots pertinent to η2 in Figure 3a, b, c, and d for M = 8,32,128,512), whereas for fixed M, $\hat{E}\eta_m$ decreased with the number m of mutations in an alignment column, while the sample standard deviation increased (see plots pertinent to η2, η3, and η4 in Figure 3d, e, and f for M = 512). Unlike Eq (13) for the coalescent SFS $E_C\eta_m$, Eq (7) for the Delta SFS $E_\Delta\eta_m$ generally provided accurate approximations to the sample mean $\hat{E}\eta_m$ of the Gamma SFS in all simulations, except possibly where it crossed over from $\hat{E}\eta_m$ to $E_C\eta_m$ as r ↓ 1, e.g., at r = 1.1 for large M (e.g., Figure 3c, d, e, and f, where M = 512 and m = 2,3,4); or for small r for M = 128 and m = 2 (in Figure 3c).

Figure 3. Selected plots of the SFS ηm vs. the basic reproduction number r = R0.

In Figure 3, all axes are logarithmic. All X-axes share the same scale (likewise for all Y-axes). All results displayed use the HIV value μ = 0.0551. In each subfigure, Eq (13) for the coalescent SFS $E_C\eta_m$ yields the black curve joining triangular points. Because the black curves are therefore all translates of $f(r) = 1/(1 - r^{-1})$, the translated shape provides a ready reference for comparing subfigures. In each subfigure, Eq (7) for the Delta SFS $E_\Delta\eta_m$ yields the red curve joining square points. Generally, each red curve obscures a gold curve joining circular points. The gold curve corresponds to the sample mean $\hat{E}\eta_m$ estimating $E\eta_m$ from simulations of the Gamma model, with the error bars giving the sample standard deviation. Figure 3a, b, c, and d show plots pertinent to η2 (i.e., ηm for m = 2) for different sample numbers M = 8,32,128,512, whereas Figure 3d, e, and f show plots pertinent to η2, η3, and η4 for M = 512. For M = 512 in Figure 3d, e, and f, close inspection of the leftmost point (at r = 1.1) displays $E_\Delta\eta_m$ crossing over from $\hat{E}\eta_m$ to $E_C\eta_m$ as r decreases to 1, as in Eq (14). Figure 3c also shows $E_\Delta\eta_2$ crossing over from $\hat{E}\eta_2$ to $E_C\eta_2$, albeit more subtly, for M = 128.

Figure 3 is representative of simulation results, with two exceptions. First, the inequality $E_\Delta\eta_m \le E_C\eta_m$ failed occasionally at large values of m, with some neighboring values of r yielding comparable but probabilistically independent numerical exceptions. Second, for fixed r, $\hat{E}\eta_m$, $E_\Delta\eta_m$ and $E_C\eta_m$ all typically decreased with increasing m (as in Figure 3d, e, and f). A notable exception occurred for $\hat{E}\eta_m$ and small r (e.g., r = 1.1), however, because the founder $I_{0,1}$ of the branching process then gives rise to a long, unbranched initial lineage. The lineage eventually leads to the most recent common ancestor of the sample, so its mutations occur in all samples, making $\hat{E}\eta_M$ noticeably larger than $\hat{E}\eta_{M-1}$. As approximations, neither $E_\Delta\eta_M$ nor $E_C\eta_m$ appears to account adequately for mutations in a long, unbranched initial lineage.

Finally, the magnitude of the ratio $\mu^2\sigma^2(A_m) / (\mu E A_m)$ of the evolutionary variance to mutational variance in $\sigma^2(\eta_m)$ from Eq (17) was surprisingly robust over different sample sizes M and m = 1,2,…,M. It decreased as a function of r and went from about 0.5 at r = 1.1 to about 0.1 at r = 1.5. Thus, in the present context of μ = 0.0551 (HIV gp120), even at r = 1.1, if the evolutionary variance were neglected and the standard deviation $\sigma(\eta_m)$ replaced by a Poisson approximation $\sqrt{E\eta_m} = \sqrt{\mu E A_m}$ determined numerically from Eq (7), the resulting underestimate is at worst $1 - 1/\sqrt{1 + 0.5} \approx 18\%$.
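The 18% figure follows from Eq (17): if ρ denotes the ratio of evolutionary to mutational variance, the true standard deviation is the Poisson value multiplied by the square root of 1 + ρ. A short sketch of the arithmetic, with a hypothetical function name:

```python
import math

def poisson_sd_underestimate(variance_ratio):
    """Relative underestimate of sigma(eta_m) when the evolutionary variance is
    neglected; variance_ratio is evolutionary variance / mutational variance."""
    return 1.0 - 1.0 / math.sqrt(1.0 + variance_ratio)

print(f"{poisson_sd_underestimate(0.5):.1%}")  # 18.4% (ratio of about 0.5 at r = 1.1)
print(f"{poisson_sd_underestimate(0.1):.1%}")  # 4.7% (ratio of about 0.1 at r = 1.5)
```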

Figure 4 provides a different view of many phenomena in Figure 3. First, for M ≥ 64 and small values of m, the top dotted curves for $E_\Delta\eta_m$ (particularly the black curve for r = 1.1) display the crossover from $\hat{E}\eta_m$ to $E_C\eta_m$, by overestimating $\hat{E}\eta_m$. Second, in contrast to the corresponding dotted curves, the top two solid curves (for r = 1.1 and r ≈ 1.46) in Figure 4a for M = 8 show an increase from $\hat{E}\eta_{M-1}$ to $\hat{E}\eta_M$. As in Figure 3, the increase reflects a long, unbranched initial lineage from founder $I_{0,1}$. Finally, for M ≥ 64, the two bottom dotted curves (for r ≈ 6.12 and r ≈ 8.14) in particular display some numerical instability at very small values of ηm. Premature numerical truncation of Eq (7) as G → ∞ does not appear to be the cause, so the instability may be inherent in the delta approximation, possibly due to the exponentiation of large values of r.

Figure 4. The expected SFS $\hat{E}\eta_m$ and $E_\Delta\eta_m$ vs. m for different sample sizes M.

In Figure 4, the Y-axes are logarithmic with the same range, and the X-axes all have categories m = 2,3,…,8. All results displayed use the HIV value μ = 0.0551. Each subfigure displays simulation results for a different sample size M. The solid curves display simulation sample means $\hat{E}\eta_m$ (m = 2,3,…,8) corresponding to different reproduction numbers r = R0 in a geometric series with ratio $1.331 = 1.1^3$. From top to bottom, the curves correspond to r = 1.1 (black), r ≈ 1.46 (purple), r ≈ 1.95 (dark blue), r ≈ 2.59 (light blue), r ≈ 3.45 (green), r ≈ 4.59 (orange), r ≈ 6.12 (light red), and r ≈ 8.14 (dark red). The dotted curves correspond to Eq (7) for the Delta SFS $E_\Delta\eta_m$ (implicitly connecting points for m = 2,3,…,8).
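As a quick check on the caption, the eight reproduction numbers can be regenerated from the first value 1.1 and the common ratio 1.331, which is 1.1 cubed. The short Python sketch below is illustrative only; the variable name is hypothetical.

```python
# Reproduction numbers in Figure 4: a geometric series starting at 1.1
# with common ratio 1.1**3 = 1.331, i.e. r_j = 1.1**(3*j + 1) for j = 0..7.
r_values = [1.1 * (1.1 ** 3) ** j for j in range(8)]

for r in r_values:
    print(f"{r:.2f}")
```

Rounded to two decimals, the values agree with those listed in the caption, from 1.1 up to about 8.14.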

5. Discussion

The success of an invasive population often hinges on the fate of the first arrivals. Sometimes (e.g., in some viral infections), initial invaders may be very few and reproduce in near-synchronous bursts, blunting the accuracy of some population models in approximating the SFS (the Introduction and Figure 1 define the SFS verbally; Eq (2), mathematically). The Results section shows that in at least one important case, the Gamma model of HIV gp120 (a Bellman-Harris process), the simpler Delta model (a Galton-Watson process) can often yield analytic approximations indistinguishable by eye from the expected SFS. In particular, Figure 3 (cf. particularly m = 2 for M = 128, or m = 2,3,4 for M = 512) demonstrates that the SFS can carry information about an initial viral basic reproduction number r = R0. The estimated R0 can serve as a quantitative measure of therapeutic progress in human and animal trials of viral therapies, particularly if the trials sample more viral sequences than they do at present.

Several technical difficulties present themselves, however. HIV researchers already recognize that multiple founders impede sequence analysis (Love et al. 2016). In addition, the invading viral population may pass through many environments, causing R0 to vary. If the technical difficulties prove superable, however, the present theory suggests a novel practical use for extant sequence data already sampled from patient and animal trials: the SFS can benchmark subtle therapeutic progress by estimating the initial R0, even when infection occurred and the trial had insufficient statistical power to infer therapeutic efficacy from infection data alone.

Many assumptions here are undemanding. For example, the difference between sampling with and without replacement can be neglected for light sampling (Freedman 1977). In applications to HIV, each infected cell in the sampled generation produces $10^3$ to $10^4$ viral particles, and gp120 RNA averages μ ≈ 0.0551 mutations per replication, so many gp120 samples have identical sequences anyway.

Similarly, the assumption that viral sequences are sampled simultaneously is excessively stringent. For practical purposes, the most recent common ancestor of most sampled pairs often occurs early in the population’s expansion, an assumption holding for many expanding populations, regardless of whether sampling is simultaneous.

The Delta model therefore provides robust, accurate approximations to the expected SFS for the Gamma model of HIV reproduction (see Figure 3 and Figure 4). In that context, moreover, the approximations are superior to approximations based on a continuous-time birth-and-death process or the Kingman coalescent for an exponentially expanding population. (Some other results on the SFS (Champagnat et al. 2012, Champagnat and Henry 2016), though not directly relevant to the Gamma model, are worth noting here.) The superiority should be unsurprising. The Gamma model mimics tight bursts of HIV replication every 2 days, much like a Delta model with synchronous generations every 2 days. In contrast, the continuous-time birth-and-death and the coalescent processes are Markovian, so by their nature, they do not mimic coordinated cell lysis as viruses burst forth from an infected cell. In addition, a small population such as a small initial viral population is likely to degrade approximations using the coalescent (Stadler et al. 2015): “…in most cases, the coalescent approximation works very well down to small population sizes (a few hundred individuals)” (Eriksson et al. 2010).

Figure 3 also shows that as R0 approaches 1 (e.g., R0 = 1.1), Eq (7) for the SFS $E_\Delta\eta_m$ from the Delta model no longer closely approximates the SFS $E\eta_m$ for the Gamma model, but instead crosses over from $E\eta_m$ to Eq (13) for the SFS $E_C\eta_m$ derived from a coalescent or continuous-time birth-and-death process. The Appendix gives a mathematical proof of the crossover, which has foundations in viewing the continuous-time birth-and-death process as the appropriate limit of Galton-Watson processes.

For HIV gp120, μ ≈ 0.0551, so $\mu^2$ is much less than μ. In Eq (17), one might suspect that consequently, the mutational variance dominates the evolutionary variance. The last paragraph of Section 4 (Results) bears out the suspicion. In fact, one incisive model of gp120 phylogeny in HIV is deterministic and explicitly neglects the evolutionary variance (Lee et al. 2009). The linear truncation in Section 2 (Theory) justifies the deterministic model (as well as making sense of it for non-integer r). The present paper adds a minor caveat to the deterministic model, however, particularly in its application to a study involving whole viral genomic sequences instead of a single protein. To explain, the ratio of the lengths of the HIV genome and of gp120 is 9200 / 2550 ≈ 3.6. The calculation at the end of Section 4 (Results) indicates that at r = 1.1, the neglect of evolutionary variance underestimates $\sigma(\eta_m)$ by about $1 - 1/\sqrt{1+3.6(0.5)} \approx 40\%$, enough to start impacting error estimates, and therefore scientific conclusions. The largest viruses have genome length around 1 Mb, where indiscriminate neglect of evolutionary variance may lead to error.
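The arithmetic behind the 40% figure can be written out as follows. The sketch assumes that Eq (17) has the form $\sigma^2(\eta_m) = \mu E A_m + \mu^2\sigma^2(A_m)$, which is what the variance ratio quoted in the Results implies, and that lengthening the sequence multiplies μ by 3.6 without changing the distribution of the branch length $A_m$. The symbol ρ is introduced here for illustration only.

```latex
\[
\rho(\mu) = \frac{\mu^2\sigma^2(A_m)}{\mu E A_m} = \mu\,\frac{\sigma^2(A_m)}{E A_m},
\qquad
\frac{\sqrt{\mu E A_m}}{\sigma(\eta_m)} = \frac{1}{\sqrt{1+\rho(\mu)}} .
\]
\[
\rho(3.6\,\mu) = 3.6\,\rho(\mu) \approx 3.6 \times 0.5 = 1.8,
\qquad
1 - \frac{1}{\sqrt{1+1.8}} = 1 - \frac{1}{\sqrt{2.8}} \approx 0.40 .
\]
```

Because ρ is linear in μ under these assumptions, the relative underestimate grows with sequence length, which is why the caveat matters more for whole genomes than for gp120 alone.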

To summarize, this article has presented an approximation for the expected site frequency spectrum in a Galton-Watson process with mutation. In many parameter regimes, the approximation is superior to approximations from a continuous-time birth-and-death process or a coalescent process. Although the (superable) practical problems described above prevent immediate application of the theory presented here, the present article indicates the possibility of using sequence data collected after a virus has become detectable in blood to infer the initial reproduction number R0, with the aim of examining the efficacy of therapies for preventing or mitigating initial viral infection.

Acknowledgements:

I thank Vruj Patel, along with Drs. Junyong Park, DoHwan Park, Mileidy Gonzalez and Anthony DeVico, for useful conversations. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.

Appendix

The Skeleton Process

The Harris-Sevastyanov transformation (Harris 1948, Harris 1963) gives the pgf $P(s) = E s^{Z_1}$ of the skeleton process:

$$P(s) = \frac{\tilde{P}((1-q)s+q) - q}{1-q}. \tag{18}$$

To keep the article self-contained, we note that Eq (18) has the following heuristic justification (see, e.g., (Schuh 1982) or Part D.12 Decomposition of the Supercritical Branching Process, p.47 et seq. in (Athreya and Ney 2004)). The pgf $\tilde{P}(q+(1-q)s) = E[(q+(1-q)s)^{\tilde{Z}_1}]$ deletes the founder’s doomed daughters independently with the correct probability q. The founder is either doomed (with probability q) or immortal (with probability 1 − q). If she is immortal, she is part of the skeleton process, which has pgf P(s). Thus,

$$\tilde{P}((1-q)s+q) = q + (1-q)P(s), \tag{19}$$

justifying Eq (18). Differentiating Eq (19) at s = 1 shows that $P'(1) = \tilde{P}'(1) = r$, i.e., the original and skeleton processes have the same basic reproduction number.

Note that the Harris-Sevastyanov transformation reduces the variance of the offspring distribution, probably contributing to the accuracy of deterministic approximations in this article:

$$\sigma^2(Z_1) = P''(1) + r - r^2 = \tilde{P}''(1)(1-q) + r - r^2 = \sigma^2(\tilde{Z}_1) - q\tilde{P}''(1) \le \sigma^2(\tilde{Z}_1). \tag{20}$$
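Eqs (18) to (20) can be checked numerically for a concrete offspring law. The sketch below is illustrative only: it assumes a Poisson offspring distribution with mean r = 1.5 (neither the distribution nor the value is taken from the article), finds the extinction probability q as the smallest fixed point of the original pgf, builds the skeleton pgf from Eq (18), and estimates means and variances from finite-difference derivatives at s = 1. All function names are hypothetical.

```python
import math

def extinction_probability(pgf, tol=1e-14, max_iter=100_000):
    """Smallest fixed point of the pgf in [0, 1], by iterating q <- pgf(q) from 0."""
    q = 0.0
    for _ in range(max_iter):
        q_next = pgf(q)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q

def skeleton_pgf(pgf, q):
    """Eq (18): pgf of the skeleton process built from the original pgf."""
    return lambda s: (pgf((1.0 - q) * s + q) - q) / (1.0 - q)

def mean_and_variance(pgf, h=1e-4):
    """Offspring mean and variance from central differences of the pgf at s = 1."""
    d1 = (pgf(1.0 + h) - pgf(1.0 - h)) / (2.0 * h)
    d2 = (pgf(1.0 + h) - 2.0 * pgf(1.0) + pgf(1.0 - h)) / h**2
    return d1, d2 + d1 - d1**2

r = 1.5  # illustrative reproduction number (assumption)
original = lambda s: math.exp(r * (s - 1.0))  # Poisson(r) offspring pgf (assumption)
q = extinction_probability(original)
skeleton = skeleton_pgf(original, q)

mean_orig, var_orig = mean_and_variance(original)
mean_skel, var_skel = mean_and_variance(skeleton)
print(f"q = {q:.4f}")
print(f"means: original {mean_orig:.4f}, skeleton {mean_skel:.4f}")
print(f"variances: original {var_orig:.4f}, skeleton {var_skel:.4f}")
```

For the Poisson case the second derivative of the original pgf at 1 is r², so Eq (20) predicts that the skeleton variance is smaller than the original variance by q·r², while the two means coincide.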

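Proof of the Limit in Eq (14) for 2 ≤ k ≤ M

Consider the function $f(x) = x^{k-2}(1-x)^{M-k}$ and partition [0,1] into subintervals with the points $x_g(r) = r^{-g}$ (g = 1,2,3,…). Let $x_0(r) = 1$. Almost immediately, the theory of approximating integrals by Riemann sums shows that

$$\int_0^1 f(x)\,dx = \lim_{r \to 1} \sum_{g=1}^{\infty} f(x_g(r))\,(x_{g-1}(r) - x_g(r)) = \lim_{r \to 1} \sum_{g=1}^{\infty} f(r^{-g})\,r^{-g}(r-1). \tag{21}$$

Substitution of f(x) into Eq (21) yields

$$\lim_{r \to 1} \sum_{g=1}^{\infty} (r^{-g})^{k-1}(1-r^{-g})^{M-k}(r-1) = \int_0^1 x^{k-2}(1-x)^{M-k}\,dx = \frac{(k-2)!\,(M-k)!}{(M-1)!}, \tag{22}$$

after using standard results for the Beta integral in the middle expression. Thus,

$$\lim_{r \to 1} \frac{\mu\binom{M}{k}\sum_{g=1}^{\infty}(r^{-g})^{k-1}(1-r^{-g})^{M-k}}{\mu M\,\dfrac{1}{r-1}\,\dfrac{1}{k(k-1)}} = \lim_{r \to 1} \frac{\binom{M}{k}\sum_{g=1}^{\infty}(r^{-g})^{k-1}(1-r^{-g})^{M-k}(r-1)}{M\,\dfrac{1}{k(k-1)}} = \frac{\binom{M}{k}\dfrac{(k-2)!\,(M-k)!}{(M-1)!}}{M\,\dfrac{1}{k(k-1)}} = 1. \tag{23}$$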
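The limit in Eq (23) can also be observed numerically. The following Python sketch evaluates the ratio inside the limit for a few values of r approaching 1; the mutation rate μ cancels and is omitted. The function name and the truncation rule for the infinite sum are assumptions made for this illustration, not part of the article's numerical method.

```python
import math

def delta_to_coalescent_ratio(r, M, k, tol=1e-16):
    """Ratio inside the limit of Eq (23), for r > 1 and 2 <= k <= M.

    The sum over generations g is truncated once the bound (r**-g)**(k-1)
    on each remaining term falls below tol (a stopping rule chosen for this sketch).
    """
    total, g = 0.0, 1
    while True:
        x = r ** (-g)
        total += x ** (k - 1) * (1.0 - x) ** (M - k)
        if x ** (k - 1) < tol:
            break
        g += 1
    numerator = math.comb(M, k) * total
    denominator = M / ((r - 1.0) * k * (k - 1))
    return numerator / denominator

for r in (1.5, 1.1, 1.01, 1.001):
    print(r, delta_to_coalescent_ratio(r, M=8, k=2))
```

The printed ratios should approach 1 as r decreases toward 1, consistent with the crossover from the Delta approximation to the coalescent approximation described in the Discussion.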

References

  1. Arratia R, Goldstein L and Gordon L (1989a). Poisson approximation and the Chen-Stein method. Statistical Science 5: 403–434.
  2. Arratia R, Goldstein L and Gordon L (1989b). Two moments suffice for Poisson approximations, the Chen-Stein method. Annals of Probability 17: 9–25.
  3. Athreya KB and Ney PE (2004). Branching Processes. Mineola, New York, Dover.
  4. Bellman R and Harris TE (1948). On the Theory of Age-Dependent Stochastic Branching Processes. Proceedings of the National Academy of Sciences of the United States of America 34: 601–604.
  5. Champagnat N and Henry B (2016). Moments of the frequency spectrum of a splitting tree with neutral Poissonian mutations. Electronic Journal of Probability 21.
  6. Champagnat N, Lambert A and Richardson M (2012). Birth and Death Processes with Neutral Mutations. International Journal of Stochastic Analysis.
  7. Chen HY, Di Mascio M, Perelson AS, et al. (2007). Determination of virus burst size in vivo using a single-cycle SIV in rhesus macaques. Proceedings of the National Academy of Sciences of the United States of America 104: 19079–19084.
  8. De Boer RJ, Ribeiro RM and Perelson AS (2010). Current Estimates for HIV-1 Production Imply Rapid Viral Clearance in Lymphoid Tissues. PLoS Computational Biology 6.
  9. Durrett R (2008). Probability Models for DNA Sequence Evolution. New York, Springer Science + Business Media, LLC.
  10. Durrett R (2013). Population genetics of neutral mutations in exponentially growing cancer cell populations. Annals of Applied Probability 23: 230–250.
  11. Durrett R and Limic V (2001). On the quantity and quality of single nucleotide polymorphisms in the human genome. Stochastic Processes and Their Applications 93: 1–24.
  12. Eriksson A, Mehlig B, Rafajlovic M, et al. (2010). The Total Branch Length of Sample Genealogies in Populations of Variable Size. Genetics 186: 601–611.
  13. Fiebig EW, Wright DJ, Rawal BD, et al. (2003). Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS 17: 1871–1879.
  14. Freedman D (1977). Remark on difference between sampling with and without replacement. Journal of the American Statistical Association 72: 681–681.
  15. Giorgi EE, Funkhouser B, Athreya G, et al. (2010). Estimating time since infection in early homogeneous HIV-1 samples using a poisson model. BMC Bioinformatics 11.
  16. Gordon SN, Liyanage NPM, Doster MN, et al. (2016). Boosting of ALVAC-SIV Vaccine-Primed Macaques with the CD4-SIVgp120 Fusion Protein Elicits Antibodies to V2 Associated with a Decreased Risk of SIVmac251 Acquisition. Journal of Immunology 197: 2726–2737.
  17. Griffiths RC and Pakes AG (1988). An infinite-alleles version of the simple branching-process. Advances in Applied Probability 20: 489–524.
  18. Griffiths RC and Tavare S (1998). The age of a mutation in a general coalescent tree. Stochastic Models: 273–295.
  19. Haaland RE, Hawkins PA, Salazar-Gonzalez J, et al. (2009). Inflammatory Genital Infections Mitigate a Severe Genetic Bottleneck in Heterosexual Transmission of Subtype A and C HIV-1. PLoS Pathogens 5.
  20. Hammersley JM and Handscomb DC (1964). Monte Carlo Methods. London, Chapman and Hall.
  21. Harris TE (1948). Branching Processes. Annals of Mathematical Statistics 19: 474–494.
  22. Harris TE (1963). The Theory of Branching Processes. Berlin, Springer-Verlag.
  23. Kahn JO and Walker BD (1998). Acute human immunodeficiency virus type 1 infection. New England Journal of Medicine 339: 33–39.
  24. Keele BF, Giorgi EE, Salazar-Gonzalez JF, et al. (2008). Identification and characterisation of transmitted and early founder virus envelopes in primary HIV-1 infection. Proceedings of the National Academy of Sciences of the United States of America 105: 7552–7557.
  25. Kimura M (1969). Number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893–903.
  26. Kimura M and Crow JF (1964). Number of alleles that can be maintained in finite population. Genetics 49: 725–738.
  27. Kosaka PM, Pini V, Calleja M, et al. (2017). Ultrasensitive detection of HIV-1 p24 antigen by a hybrid nanomechanical-optoplasmonic platform with potential for detecting HIV-1 at first week after infection. PLoS ONE 12.
  28. Lee HY, Giorgi EE, Keele BF, et al. (2009). Modeling sequence evolution in acute HIV-1 infection. Journal of Theoretical Biology 261: 341–360.
  29. Love TMT, Park SY, Giorgi EE, et al. (2016). SPMM: estimating infection duration of multivariant HIV-1 infections. Bioinformatics 32: 1308–1315.
  30. Markowitz M, Louie M, Hurley A, et al. (2003). A novel antiviral intervention results in more accurate assessment of human immunodeficiency virus type 1 replication dynamics and T-Cell decay in vivo. Journal of Virology 77: 5037–5038.
  31. Nolen TL, Hudgens MG, Senb PK, et al. (2015). Analysis of repeated low-dose challenge studies. Statistics in Medicine 34: 1981–1992.
  32. Ohtsuki H and Innan H (2017). Forward and backward evolutionary processes and allele frequency spectrum in a cancer cell population. Theoretical Population Biology 117: 43–50.
  33. Patel P, Borkowf CB, Brooks JT, et al. (2014). Estimating per-act HIV transmission risk: a systematic review. AIDS 28: 1509–1519.
  34. Pegu P, Vaccari M, Gordon S, et al. (2013). Antibodies with High Avidity to the gp120 Envelope Protein in Protection from Simian Immunodeficiency Virus SIVmac251 Acquisition in an Immunization Regimen That Mimics the RV-144 Thai Trial. Journal of Virology 87: 1708–1719.
  35. Regoes RR, Longini IM, Feinberg MB, et al. (2005). Preclinical assessment of HIV vaccines and microbicides by repeated low-dose virus challenges. PLoS Medicine 2: 798–807.
  36. Ribeiro RM, Qin L, Chavez LL, et al. (2010). Estimation of the Initial Viral Growth Rate and Basic Reproductive Number during Acute HIV-1 Infection. Journal of Virology 84: 6096–6102.
  37. Schuh HJ (1982). A note on the Harris-Sevastyanov transformation for supercritical branching-processes. Journal of the Australian Mathematical Society Series a-Pure Mathematics and Statistics 32: 215–222.
  38. Seo TK, Thorne JL, Hasegawa M, et al. (2002). Estimation of effective population size of HIV-1 within a host: A pseudomaximum-likelihood approach. Genetics 160: 1283–1293.
  39. Stadler T, Vaughan TG, Gavryushkin A, et al. (2015). How well can the exponential-growth coalescent approximate constant-rate birth-death population dynamics? Proceedings of the Royal Society B-Biological Sciences 282.
  40. Stafford MA, Corey L, Cao YZ, et al. (2000). Modeling plasma virus concentration during primary HIV infection. Journal of Theoretical Biology 203: 285–301.
  41. Strbo N, Vaccari M, Pahwa S, et al. (2013). Cutting Edge: Novel Vaccination Modality Provides Significant Protection against Mucosal Infection by Highly Pathogenic Simian Immunodeficiency Virus. Journal of Immunology 190: 2495–2499.
  42. Ver Hoef JM (2012). Who Invented the Delta Method? The American Statistician 66: 124–127.
