Abstract
In Bayesian phylogenetics, the coalescent process provides an informative framework for inferring changes in the effective size of a population from a phylogeny (or tree) of sequences sampled from that population. Popular coalescent inference approaches such as the Bayesian Skyline Plot, Skyride, and Skygrid all model these population size changes with a discontinuous, piecewise-constant function but then apply a smoothing prior to ensure that their posterior population size estimates transition gradually with time. These prior distributions implicitly encode extra population size information that is not available from the observed coalescent data or tree. Here, we present a novel statistic, the coalescent information ratio $\Omega$, to quantify and disaggregate the relative contributions of the coalescent data and prior assumptions to the resulting posterior estimate precision. Our statistic also measures the additional mutual information introduced by such priors. Using $\Omega$, we show that, because it is surprisingly easy to overparametrize piecewise-constant population models, common smoothing priors can lead to overconfident and potentially misleading inference, even under robust experimental designs. We propose $\Omega$ as a useful tool for detecting when effective population size estimates are overly reliant on prior assumptions and for improving quantification of the uncertainty in those estimates. [Coalescent processes; effective population size; information theory; phylodynamics; prior assumptions; skyline plots.]
The coalescent process models how changes in the effective size of a target population influence the phylogenetic patterns of sequences sampled from that population. First derived by Kingman (1982) under the assumption of a constant-sized population, the coalescent process has since been extended to account for temporal variation in the population size (Griffiths and Tavare 1994), structured demographics (Beerli and Felsenstein 1999), and multilocus sampling (Li and Durbin 2011). Inference under these models aims to statistically recover the unknown effective population size (or demographic) history from the reconstructed phylogeny (or tree) and has provided insights into infectious disease epidemiology, population genetics, and molecular ecology (Pybus et al. 2003; Wakeley 2008; Shapiro et al. 2004). Here, we focus on coalescent processes that describe the genealogies of serially sampled individuals from populations with deterministically varying size. These are widely applied to study the phylodynamics of infectious diseases (Griffiths and Tavare 1994; Rodrigo and Felsenstein 1999).
Early approaches to inferring effective population size from coalescent phylogenies used pre-defined parametric models (e.g., exponential or logistic growth functions) to represent temporal demographic changes (Kuhner et al. 1998; Pybus et al. 2003). While these formulations required only a few variables and provided interpretable estimates, selecting the most appropriate parametric description could be challenging and risk underfitting complex trends (Minin et al. 2008). This motivated the introduction of the classic skyline plot (Pybus et al. 2000), which, by proposing an independent, piecewise-constant demographic change at every coalescent event (i.e., at the branching times in the phylogeny), maximized flexibility and removed parametric restrictions. However, this flexibility came at the cost of increased estimation noise and potential overfitting of changes in effective population size (Ho and Shapiro 2011).
Efforts to redress these issues within a piecewise-constant framework subsequently spawned a family of skyline plot-based methods (Ho and Shapiro 2011). Among these, the most popular and commonly used are the Bayesian Skyline Plot (BSP) (Drummond et al. 2005), the Skyride (Minin et al. 2008), and the Skygrid (Gill et al. 2013) approaches. All three attempted to regulate the sharp fluctuations of the inferred piecewise-constant demographic function by enforcing a priori assumptions about the smoothness (i.e., the level of autocorrelation among piecewise-constant segments) of real population dynamics. This was seen as a biologically sensible compromise between noise regulation and model flexibility (Parag and Donnelly 2020; Strimmer and Pybus 2001).
The BSP limited overfitting by (i) predefining fewer piecewise demographic changes than coalescent events and (ii) smoothing noise by asserting a priori that the population size after a change-point was exponentially distributed around the population size before it. This method was questioned by Minin et al. (2008) for making strong smoothing and change-point assumptions, which stimulated the development of the Skyride, a method that embeds the flexible classic skyline plot within a tunable Gaussian smoothing field. The Skygrid, which extends the Skyride to multiple loci and allows arbitrary change-points (the BSP and Skyride change-times coincide with coalescent events), also uses this prior. The Skyride and Skygrid methods aimed to better trade off prior influence with noise reduction and, while somewhat effective, are still imperfect because they can fail to recover genuinely abrupt demographic changes such as bottlenecks (Faulkner et al. 2019).
As a result, studies continue to explore and address the nontrivial problem of optimizing this tradeoff, either by searching for less-restrictive and more adaptive priors (Faulkner et al. 2019) or by deriving new data-driven skyline change-point grouping strategies (Parag and Donnelly 2020). The evolution of coalescent model inference thus reflects a desire to understand and fine-tune how prior assumptions and observed phylogenetic data interact to yield reliable posterior population size estimates. Surprisingly, and in contrast to this desire, no study has yet tried to directly and rigorously measure the relative influence of the priors and data on these estimates.
Here, we develop and present a novel information-theoretic statistic, $\Omega$, to formally disaggregate and quantify the contributions of both priors and data to the uncertainty around the posterior demographic estimates of popular skyline-based coalescent methods. Using $\Omega$, we show how widely used smoothing priors can result in overconfident population size inferences (i.e., estimates with unjustifiably small credible intervals) and provide practical guidelines against such circumstances. We illustrate the utility of this approach on well-characterized data sets describing the population size of HCV in Egypt (Pybus et al. 2003) and ancient Beringian steppe bison (Shapiro et al. 2004).
To our knowledge, $\Omega$, which in theory can be adapted to any prior-data comparison problem, is new not only to the field of phylogenetics but also across statistics and data science. While inference that is strongly driven by prior assumptions can be beneficial, for example when a prior encodes expert knowledge or salient dynamics, having a measure of the relative information introduced by data and prior distributions can improve the reproducibility and interpretability of analyses. Our statistic will help to detect when prior assumptions are inadvertently and overly influencing demographic estimates and will hopefully serve as a diagnostic tool that future methods can employ to optimize and validate their prior-data tradeoffs.
Materials and Methods
Coalescent Inference
We provide an overview of the coalescent process and statistical inference under skyline plot-based demographic models. The coalescent is a stochastic process that describes the ancestral genealogy of sampled individuals or lineages from a target population (Kingman 1982). Under the coalescent, a tree or phylogeny of relationships among these individuals is reconstructed backwards in time, with coalescent events defined as the points where pairs of lineages merge (i.e., coalesce) into their ancestral lineage. This tree, $T$, is rooted at the time to the most recent common ancestor (TMRCA) of the sample, measured into the past. The tips of $T$ correspond to sampled individuals.
The rate at which coalescent events occur (i.e., the rate of branching in the tree $T$) is determined by, and hence informative about, the effective size of the target population. We assume that a total of $n$ samples are taken from the target population at distinct sampling times, which are independent of and uninformative about population size changes (Drummond et al. 2005). We do not specify the sample generating process because, by this independence assumption, it does not affect our analysis (Parag and Pybus 2019). We let $c_i$ be the time of the $i$th coalescent event in $T$, with $1 \le i \le n-1$ and $c_{n-1}$ equal to the TMRCA ($n$ samples can coalesce $n-1$ times before reaching the TMRCA).
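We use $l(t)$ to count the number of lineages in the tree $T$ at time $t \ge 0$ into the past; then $l(t)$ decrements by 1 at every coalescent event time $c_i$ and increases at sampling times. Here, $t = 0$ is the present. The effective population size or demographic function at $t$ is $N(t)$, so that the coalescent rate underlying $T$ is $\binom{l(t)}{2} N(t)^{-1}$ (Kingman 1982). While $N(t)$ can be described using appropriate parametric formulations (Parag and Pybus 2017), it is more common to represent $N(t)$ by some tractable $p$-dimensional piecewise-constant approximation (Ho and Shapiro 2011). Thus, we can write $N(t) = \sum_{j=1}^{p} N_j \, \mathbb{1}(\epsilon_{j-1} \le t < \epsilon_j)$, with $p \ge 1$ as the number of piecewise-constant segments. Here, $N_j$ is the constant population size of the $j$th segment, which is delimited by times $[\epsilon_{j-1}, \epsilon_j)$, with $\epsilon_0 = 0$ and $\epsilon_p$ equal to the TMRCA, and $\mathbb{1}(\cdot)$ is an indicator function. The rate of producing new coalescent events is then $\binom{l(t)}{2} \sum_{j=1}^{p} N_j^{-1} \, \mathbb{1}(\epsilon_{j-1} \le t < \epsilon_j)$. Kingman's coalescent model is obtained by setting $p = 1$ (constant population of $N_1$).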
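The rate expression above can be illustrated with a minimal sketch. The function and variable names (`population_size`, `coalescent_rate`, `change_times`, `sizes`) are hypothetical and not taken from any existing package; the sketch assumes only the standard Kingman rate, i.e., the number of lineage pairs divided by the piecewise-constant population size at that time.

```python
from math import comb

def population_size(t, change_times, sizes):
    """Piecewise-constant N(t): sizes[j] applies on [change_times[j], change_times[j + 1])."""
    for j in range(len(sizes)):
        if change_times[j] <= t < change_times[j + 1]:
            return sizes[j]
    raise ValueError("t lies outside the skyline segments")

def coalescent_rate(t, lineages, change_times, sizes):
    """Rate of the next coalescent event: (lineages choose 2) / N(t)."""
    return comb(lineages, 2) / population_size(t, change_times, sizes)

# Illustrative two-segment skyline (values are arbitrary assumptions).
change_times = [0.0, 1.0, 3.0]   # segment boundaries, measured into the past
sizes = [100.0, 50.0]            # population size in each segment

rate_recent = coalescent_rate(0.5, 4, change_times, sizes)  # 6 pairs / 100
rate_older = coalescent_rate(2.0, 3, change_times, sizes)   # 3 pairs / 50
```

In this toy setting, fewer lineages in a smaller population give the same rate as more lineages in a larger one, which is why lineage counts and population size must be considered together.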
When reconstructing the population size history of infectious diseases, it is often of interest to infer $N(t)$ from $T$ (Ho and Shapiro 2011), which forms our coalescent data generating process. If $\boldsymbol{N} = [N_1, \ldots, N_p]$ denotes the vector of demographic parameters to be estimated, then the coalescent data log-likelihood can be obtained from Parag and Pybus (2019) and Snyder and Miller (1991) as
$$\log P(T \mid \boldsymbol{N}) = \sum_{j=1}^{p} \left( \alpha_j - m_j \log N_j - \beta_j N_j^{-1} \right), \tag{1}$$
with $\alpha_j$ and $\beta_j$ as constants that depend on the times and lineage counts of the coalescent events that fall within the segment duration $[\epsilon_{j-1}, \epsilon_j)$, and $m_j$ as the number of coalescent events in that segment. Equation 1 is equivalent to the standard serially sampled skyline log-likelihood in Drummond et al. (2005), except that we do not restrict $N(t)$ to change only at coalescent event times.
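As a hedged illustration of how a log-likelihood of this kind can be evaluated, the sketch below sums contributions over the intervals of a tree. The input layout is an assumption made here for illustration: each interval is a tuple of (duration, number of lineages, segment index, whether the interval ends in a coalescent event). Every interval contributes minus the pair count times its duration divided by the segment population size, and every coalescent event adds the log of its instantaneous rate. Grouping terms by segment gives the per-segment constants that the text describes.

```python
from math import comb, log

def skyline_log_likelihood(intervals, sizes):
    """Piecewise-constant coalescent log-likelihood.

    intervals: iterable of (duration, lineages, segment, ends_in_coalescence).
    sizes: population size for each skyline segment.
    """
    total = 0.0
    for duration, lineages, segment, ends_in_coalescence in intervals:
        pairs = comb(lineages, 2)
        # Waiting-time term: no coalescence over this interval.
        total -= pairs * duration / sizes[segment]
        # Event term: only for intervals that end in a coalescent event.
        if ends_in_coalescence:
            total += log(pairs / sizes[segment])
    return total

# Toy 3-tip tree split across two segments (all values are illustrative).
intervals = [
    (0.2, 3, 0, True),    # 3 lineages, coalescence in segment 0
    (0.3, 2, 0, False),   # 2 lineages, segment boundary reached, no event
    (0.5, 2, 1, True),    # 2 lineages, final coalescence in segment 1
]
loglik = skyline_log_likelihood(intervals, [1.0, 2.0])
```

Intervals that end at a sampling time or a segment boundary are handled by setting the last tuple entry to `False`, so the same function covers serially sampled trees under these assumptions.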
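In Bayesian phylogenetic inference, skyline-based methods such as the BSP, Skyride, and Skygrid combine this likelihood with a prior distribution $P(\boldsymbol{N})$, which encodes a priori beliefs about the demographic function. This yields a population size posterior, from Bayes law, which depends on both the prior and coalescent data-likelihood as:
$$P(\boldsymbol{N} \mid T) = \frac{P(T \mid \boldsymbol{N}) \, P(\boldsymbol{N})}{\int P(T \mid \boldsymbol{N}) \, P(\boldsymbol{N}) \, d\boldsymbol{N}}. \tag{2}$$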
Here, we assume that the phylogeny, , is known without error. In some instances, only sampled sequence data, , are available and a distribution over must be reconstructed from under a model of molecular evolution with parameters . Equation 2 becomes embedded in the more complex expression , which then involves inferring both the tree and population size (Drummond et al. 2002).
While we do not consider this extension here, we note that the results presented remain applicable and relevant. This follows because the output of the more complex Bayesian analysis above (i.e., when sequence data are used directly) is a posterior distribution over tree space. We can sample from this posterior and treat each sampled tree effectively as a fixed tree. Consequently, we expect that any summary statistic that we derive here under the assumption of a fixed tree will be usable in studies that incorporate genealogical uncertainty, by computing the distribution of that statistic over this covering set of sampled posterior trees.
Information and Estimation Theory
We review and extend some concepts from information and estimation theory, applying them to skyline-based coalescent inference. We consider a general parametrization of the effective population size, $\psi = [\psi_1, \ldots, \psi_p]$, where $\psi_j = g(N_j)$ for all $j$ and $g(\cdot)$ is a differentiable function. Popular skyline-based methods usually choose the identity function (e.g., BSP) or the natural logarithm (e.g., the Skyride and Skygrid) for $g$. Equations 1 and 2 are then reformulated with $\log P(T \mid \psi)$ as the coalescent data log-likelihood and $P(\psi)$ as the demographic prior. The Bayesian posterior, $P(\psi \mid T)$, combines this likelihood and prior and hence is influenced by both the coalescent data and prior beliefs. We can formalize these influences using information theory.
The expected Fisher information, $\mathcal{I}(\psi)$, is a $p \times p$ matrix with $(i,j)$th element $\mathcal{I}(\psi)_{ij} = -\mathbb{E}\left[\partial^2 \log P(T \mid \psi) / \partial \psi_i \, \partial \psi_j\right]$ (Lehmann and Casella 1998). The expectation is taken over the coalescent tree branches. As observed by Parag and Pybus (2019), $\mathcal{I}(\psi)$ quantifies how precisely we can estimate the demographic parameters, $\psi$, from the coalescent data, $T$. Precision is defined as the inverse of variance (Lehmann and Casella 1998). The BSP, Skyride, and Skygrid parametrizations all yield $\mathcal{I}(\boldsymbol{N}) = \mathrm{diag}(m_1 N_1^{-2}, \ldots, m_p N_p^{-2})$ and $\mathcal{I}(\log \boldsymbol{N}) = \mathrm{diag}(m_1, \ldots, m_p)$, both of which are $p \times p$ diagonal matrices (Parag and Pybus 2019). These matrices provide several useful insights that we will exploit in later sections. First, $\mathcal{I}(\psi)$ is orthogonal (diagonal), meaning that the coalescent process over the $j$th segment can be treated as deriving from an independent Kingman coalescent with constant population size $N_j$ (Parag and Pybus 2017). Second, the number of coalescent events in that segment, $m_j$, controls the Fisher information available about $N_j$. Last, working under $\log N_j$ removes any dependence of this Fisher information component on the unknown parameter $N_j$ (Parag and Pybus 2019).
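The last two observations can be checked with a short worked derivation. This is a sketch under stated assumptions rather than the authors' own derivation: it assumes that the log-likelihood contribution of segment $j$ has the form $\alpha_j - m_j \log N_j - \beta_j N_j^{-1}$, that $m_j$ is treated as fixed, and it uses the symbols $\psi_j = \log N_j$, $\ell_j$, $\alpha_j$, and $\beta_j$ introduced here for illustration.

$$
\begin{aligned}
\ell_j(\psi_j) &= \alpha_j - m_j \psi_j - \beta_j e^{-\psi_j}, \\
\frac{\partial \ell_j}{\partial \psi_j} &= -m_j + \beta_j e^{-\psi_j}
\quad\Longrightarrow\quad \mathbb{E}[\beta_j] = m_j e^{\psi_j}
\ \text{(zero expected score)}, \\
-\mathbb{E}\!\left[\frac{\partial^2 \ell_j}{\partial \psi_j^2}\right] &= \mathbb{E}[\beta_j]\, e^{-\psi_j} = m_j .
\end{aligned}
$$

Under these assumptions the Fisher information about $\log N_j$ equals the event count $m_j$ and does not involve $N_j$, whereas repeating the calculation in terms of $N_j$ directly gives $m_j N_j^{-2}$.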
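The prior distribution, $P(\psi)$, that is placed on the demographic parameters can alter and impact both estimate bias and precision. We can gauge prior-induced bias by comparing the maximum likelihood estimate (MLE), $\hat{\psi} = \arg\max_{\psi} \log P(T \mid \psi)$, with the maximum a posteriori estimate (MAP), $\tilde{\psi} = \arg\max_{\psi} \log P(\psi \mid T)$ (van Trees 1968). The difference $\hat{\psi} - \tilde{\psi}$ measures this bias. We can account for prior-induced precision by computing Fisher-type matrices for the prior and posterior, with $(i,j)$th elements $\mathcal{P}(\psi)_{ij} = -\partial^2 \log P(\psi) / \partial \psi_i \, \partial \psi_j$ and $\mathcal{J}(\psi)_{ij} = -\mathbb{E}\left[\partial^2 \log P(\psi \mid T) / \partial \psi_i \, \partial \psi_j\right]$ (Tichavsky et al. 1998; Huang and Zhang 2018). Combining these gives
$$\mathcal{J}(\psi) = \mathcal{I}(\psi) + \mathcal{P}(\psi). \tag{3}$$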
Equation 3 describes how the posterior Fisher information matrix, $\mathcal{J}(\psi)$, relates to the standard Fisher information $\mathcal{I}(\psi)$ and the prior second derivative $\mathcal{P}(\psi)$. We make the common regularity assumptions (see Huang and Zhang 2018 for details) that ensure $\mathcal{J}(\psi)$ is positive definite and that all Fisher matrices exist. These assumptions are valid for exponential families such as the piecewise-constant coalescent (Lehmann and Casella 1998; Parag and Pybus 2019). Equation 3 will prove fundamental to resolving the relative impact of the prior and data on the best precision achievable using the posterior $P(\psi \mid T)$. We also define expectations on these matrices with respect to the prior as $\mathcal{J}$, $\mathcal{I}$, and $\mathcal{P}$, with $\mathcal{I} = \mathbb{E}_{\psi}[\mathcal{I}(\psi)] = \int \mathcal{I}(\psi) \, P(\psi) \, d\psi$, for example. These matrices are now constants instead of functions of $\psi$. Equation 3 also holds for these constant matrices (Tichavsky et al. 1998).
These Fisher information matrices set theoretical upper bounds on the precision attainable by all possible statistical inference methods. For any unbiased estimate of $\psi$, $\hat{\psi}$, the Cramer–Rao bound (CRB) states that $\mathbb{E}\left[(\hat{\psi} - \psi)(\hat{\psi} - \psi)^{\top}\right] \ge \mathcal{I}(\psi)^{-1}$, with $\top$ indicating transpose. If we relax the unbiased estimation requirement and include prior (distribution) information, then the Bayesian or posterior Cramer–Rao lower bound (BCRB) controls the best estimate precision (van Trees 1968). If $\tilde{\psi}$ is any estimator of $\psi$, then the BCRB states that $\mathbb{E}\left[(\tilde{\psi} - \psi)(\tilde{\psi} - \psi)^{\top}\right] \ge \mathcal{J}^{-1}$. This bound is not dependent on $\psi$ due to the extra expectation over the prior (Tichavsky et al. 1998).
The CRB describes how precisely we can estimate demographic parameters using just the coalescent data and is achieved (asymptotically) with equality for skyline (piecewise-constant) coalescent models (Parag and Pybus 2019). The BCRB, instead, defines the precision limit for the combined contributions of the data and the prior. The CRB is a frequentist bound that assumes a true fixed $\psi$, while the BCRB is a Bayesian bound that treats $\psi$ as a random parameter. The expectation over the prior connects the two formalisms (Ben-Haim and Eldar 2009). Given their importance in delimiting precision, the $\mathcal{I}$ and $\mathcal{J}$ Fisher matrices will be central to our analysis, which focuses on resolving and quantifying the individual contributions of the data versus prior assumptions.
Results
The Coalescent Information Ratio, $\Omega$
We propose and derive the coalescent information ratio, $\Omega$, as a statistic for evaluating the relative contributions of the prior and coalescent data to the posterior estimates obtained as solutions to Bayesian skyline inference problems (see Materials and Methods section). Consider such a problem in which the $n$-tip phylogeny $T$ is used to estimate the $p$-element demographic parameter vector $\psi$. Let $\hat{\psi}$ be the MLE of $\psi$ given the coalescent data $T$. Asymptotically, the uncertainty around this MLE can be described with a multivariate Gaussian distribution with covariance matrix $\mathcal{I}^{-1}$. The Fisher information, $\mathcal{I}$, then defines a confidence ellipsoid that circumscribes the total uncertainty from this distribution. Parag and Pybus (2019) found this ellipsoid to be central to understanding the statistical properties of skyline-based estimates.
The volume of this ellipsoid is $V_1 = C \det[\mathcal{I}]^{-1/2}$, with $C$ as some $p$-dependent constant. Decreasing $V_1$ increases the best estimate precision attainable from the data (Lehmann and Casella 1998). In a Bayesian framework, the asymptotic posterior distribution of $\psi$ also follows a multivariate Gaussian distribution, with covariance matrix $\mathcal{J}^{-1}$. We can therefore construct an analogous ellipsoid from $\mathcal{J}$ with volume $V_2 = C \det[\mathcal{J}]^{-1/2}$ that measures the uncertainty around the MAP estimate (Tichavsky et al. 1998). This volume includes the effect of both prior and data on estimate precision. Accordingly, we propose the ratio
$$\Omega := \frac{V_2}{V_1} = \sqrt{\frac{\det[\mathcal{I}]}{\det[\mathcal{J}]}} \tag{4}$$
as a novel and natural statistic for dissecting the relative impact of the data and prior distribution on posterior estimate precision.
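The volume ratio can be computed numerically from a data Fisher matrix and a prior second-derivative matrix. The sketch below assumes that the ratio is the posterior ellipsoid volume divided by the data-only ellipsoid volume, that each volume scales with the inverse square root of the corresponding determinant, and that the posterior matrix is the sum of the data and prior matrices. The function names are hypothetical, and the example matrices are arbitrary illustrations.

```python
from math import sqrt

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    size = len(a)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
    return det

def information_ratio(fisher_data, fisher_prior):
    """sqrt(det[data] / det[data + prior]); assumed form of the ratio."""
    fisher_posterior = [
        [d + q for d, q in zip(row_d, row_q)]
        for row_d, row_q in zip(fisher_data, fisher_prior)
    ]
    det_data = determinant(fisher_data)
    if det_data <= 0.0:
        raise ValueError("data Fisher matrix is singular: ratio undefined")
    return sqrt(det_data / determinant(fisher_posterior))

# Two segments with 10 coalescent events each (log-population scale).
fisher_data = [[10.0, 0.0], [0.0, 10.0]]
flat_prior = [[0.0, 0.0], [0.0, 0.0]]
smoothing_prior = [[5.0, -5.0], [-5.0, 5.0]]  # nearest-neighbor coupling

omega_flat = information_ratio(fisher_data, flat_prior)
omega_smooth = information_ratio(fisher_data, smoothing_prior)
```

A singular data matrix raises an error in this sketch because the ratio is then undefined, mirroring the non-identifiable case.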
From Equation 4, we observe that $0 \le \Omega \le 1$, with $\Omega = 1$ signifying that the information from our prior distribution is negligible in comparison to that from the data and $\Omega \to 0$ indicating the converse. Importantly, we find
$$\Omega = \frac{1}{\sqrt{2}} \implies \det[\mathcal{J}] = 2 \det[\mathcal{I}]. \tag{5}$$
At this threshold value, $\mathcal{P}$ contributes at least as much information as the data. Moreover, $\Omega \to 1$ as the amount of data grows, since the prior contribution becomes negligible with increasing data, and $\Omega$ is undefined when $\psi$ is unidentifiable from $T$ (i.e., when $\mathcal{I}$ is singular; Rothenburg 1971). Consequently, we posit that a smaller $\Omega$ implies the prior provides a greater contribution to estimate precision.
We define $\Omega$ as an information ratio due to its close connection to both the Fisher and mutual information. The mutual information between $\psi$ and $T$, $I(T; \psi)$, measures how much information (in bits, for example) $T$ contains about $\psi$ (Cover and Thomas 2006). This is distinct from but related to $\mathcal{I}(\psi)$, which quantifies the precision of estimating $\psi$ from $T$ (Brunel and Nadal 1998). Recent work by Huang and Zhang (2018) on the connection between the Fisher and mutual information has yielded two key approximations to $I(T; \psi)$. These can be obtained by substituting either $\mathcal{I}(\psi)$ or $\mathcal{J}(\psi)$ for $\mathcal{K}$ in
$$I_{\mathcal{K}} = H(\psi) + \frac{1}{2} \, \mathbb{E}_{\psi}\left[\log \det\left(\frac{\mathcal{K}}{2 \pi e}\right)\right], \tag{6}$$
with $H(\psi)$ as the differential entropy of $\psi$ (Cover and Thomas 2006).
For a flat prior or many observations, , as the prior contributes little or no information (Brunel and Nadal 1998). For sharper priors, as the prior contribution is significant—using would lead to large errors (Huang and Zhang 2018). Equation 6 is predicated on (i) regularity assumptions for the distributions used (i.e., that the second derivatives exist), (ii) conditional dependence of the observed data given and (iii) that the likelihood is peaked around its most probable value (Lehmann and Casella 1998; Brunel and Nadal 1998; Huang and Zhang 2018). The skyline-based inference problems that we consider here automatically satisfy (i) and (ii) as these models belong to an exponential family. Condition (iii) is satisfied for moderate to large trees (and asymptotically) (Lehmann and Casella 1998; Parag and Pybus 2019).
Using the above approximations, we derive the interesting expression
(7) |
which suggests that our ratio directly measures the excess mutual information introduced by the prior, providing a substantive link between how sharper estimate precision is attained with extra mutual information. Observe that both sides of Equation (7) diminish when . Because the mutual information and its approximations (see Equation (6)) are invariant to invertible parameter transformations (Huang and Zhang 2018), our coalescent information ratio does not depend on whether we infer , its inverse, or its logarithm.
Moreover, we can use normalizing transformations to make valid at even small tree sizes. In (Slate, 1994), several such transformations for exponentially distributed models like the coalescent are derived. Among them, the logarithmic transform can achieve approximately normal log-likelihoods for about seven observations and above (). Thus, , which is also optimal for experimental design (Parag and Pybus 2019), ensures the validity of on small trees. This is the parametrization adopted by the Skyride and Skygrid methods (Minin et al. 2008). Other (cubic-root) parametrizations under which would be valid at even smaller also exist (Slate 1994).
Equations 4–7 are not restricted to coalescent inference problems and are generally applicable to statistical models that involve exponential families (Lehmann and Casella 1998). We now specify for skyline-based models, which all possess piecewise-constant population sizes and orthogonal matrices (Parag and Pybus 2019). These properties permit the expansion (Ipsen and Rehman 2008):
where are the diagonal elements of with , and is the sub-matrix formed by deleting the rows and columns of .
This allows us to formulate a prior signal-to-noise ratio
(8) |
which quantifies the relative excess Fisher information (the ``signal'') that is introduced by the prior. This ratio signifies when the prior contribution overwhelms that of the data i.e., . Having derived theoretically meaningful metrics for resolving prior-data precision contributions, we next investigate their ramifications.
The Kingman Conjugate Prior
Kingman's coalescent process (Kingman 1982), which describes the phylogeny of a constant-sized population $N_1$, is the foundation of all skyline model formulations. Specifically, a $p$-dimensional skyline model is analogous to having $p$ Kingman coalescent models, the $j$th of which is valid over $[\epsilon_{j-1}, \epsilon_j)$ and describes the genealogy under population size $N_j$. Here, we use Kingman's coalescent to validate and clarify the utility of $\Omega$ as a measure of relative data-prior precision contributions.
We assume an -tip Kingman coalescent tree, and initially work with the inverse parametrization, . We scale at by as in (Parag and Pybus, 2017) so that for with . If defines the space of values, and has prior distribution , then, by (Snyder and Miller, 1991), its posterior distribution is
where is a constant and is the scaled TMRCA of .
The likelihood function embedded within is proportional to a shape-rate parametrized gamma distribution, with known shape . The conjugate prior for is also gamma (Fink 1997) i.e., with shape and rate . The posterior distribution is then with counting coalescent events in (Robert 2007). Transforming to implies . This is an inverse gamma distribution with mean , shape and inverse rate . If describes the space of possible values and then
We can interpret the parameters of the gamma posterior distribution as involving a prior contribution of coalescent events from a virtual tree, , with scaled TMRCA . This is then combined with the actual coalescent data, which contributes coalescent events from , with scaled TMRCA of (Robert 2007). This offers a clear breakdown of how our posterior estimate precision is derived from prior and likelihood contributions and suggests that if has more tips than then we are depending more on the prior than the data. We now calculate to determine if we can formalize this intuition.
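The Fisher information values of $\log N_1$ are $\mathcal{I} = n - 1$ and $\mathcal{J} = n - 1 + a$, where $a$ is the shape of the conjugate gamma prior. The information ratio and mutual information difference, $\Delta I$, which hold for all parametrizations, then follow from Equations 4, 7, and 8 as
$$\Omega = (1 + r)^{-1/2} \approx 1 - \frac{r}{2}, \qquad \Delta I = \frac{1}{2} \log(1 + r) \approx \frac{r}{2}, \tag{9}$$
with $r = a / (n - 1)$ as the effective signal-to-noise ratio. The approximations shown are valid when $r \ll 1$. Interestingly, when $a = n - 1$ so that $r = 1$, we get $\Omega = 1/\sqrt{2}$ (see Equation (5)). This exactly quantifies the relative impact of real and virtual observations described previously. At this point, we are being equally informed by both the conjugate prior and the likelihood. Prior over-reliance can be defined by the threshold condition of $r > 1$.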
The expression of $\Delta I$ confirms our interpretation of $r$ as an effective signal-to-noise ratio controlling the extra mutual information introduced by the conjugate prior. This can be seen by comparison with the standard Shannon mutual information expressions from information theory (Cover and Thomas 2006). At small $r$, where the data dominate, we find that the prior linearly detracts from $\Omega$ and linearly increases $\Delta I$. We also observe that $b$, the gamma rate parameter, has no effect on estimate precision or mutual information.
Our information ratio therefore provides a systematic decomposition of the posterior population size estimate precision and generalizes the virtual observation idea to any prior distribution. In essence, the prior is contributing an effective sample size, which for the conjugate Kingman prior is its shape parameter, $a$. We summarize these points in Figure 1, which shows the conjugate prior and two posteriors together with their corresponding $\Omega$ values.
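The virtual-observation reading of the conjugate prior can be sketched numerically. The snippet assumes, as a reconstruction rather than a quotation, that on the log-population scale the data contribute a Fisher information equal to the number of coalescent events (tips minus one) and the gamma prior contributes its shape parameter, so that the signal-to-noise ratio is shape divided by events. The function name and arguments are hypothetical.

```python
from math import log, sqrt

def kingman_information_ratio(n_tips, prior_shape):
    """Return (r, omega, delta_info) for a Kingman tree with a gamma prior.

    Assumes data information = n_tips - 1 and prior information = prior_shape,
    both on the log-population scale.
    """
    events = n_tips - 1
    r = prior_shape / events                 # prior signal-to-noise ratio
    omega = 1.0 / sqrt(1.0 + r)              # sqrt(data info / posterior info)
    delta_info = 0.5 * log(1.0 + r)          # extra mutual information (nats)
    return r, omega, delta_info

# Prior worth as many virtual events as the tree provides real ones.
r_equal, omega_equal, delta_equal = kingman_information_ratio(11, 10.0)

# Weak prior relative to a large tree: first-order approximations apply.
r_small, omega_small, delta_small = kingman_information_ratio(101, 1.0)
```

In the weak-prior case the ratio stays close to one minus half the signal-to-noise ratio, which is the linear regime described in the text; the gamma rate parameter never enters the calculation.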
Skyline Smoothing Priors
In this section, we tailor $\Omega$ for the BSP, Skyride, and Skygrid coalescent inference methods. These popular skyline-based approaches couple a piecewise-constant demographic coalescent data likelihood with a smoothing prior to produce population size estimates that change more continuously with time. The smoothing prior achieves this by assuming informative relationships between $N_j$ and its neighboring parameters $(N_{j-1}, N_{j+1})$. Such a priori correlation implicitly introduces additional demographic information that is not available from the coalescent data $T$. While these priors can embody sensible biological assumptions, we show that they may also engender overconfident statements or obscure parameter non-identifiability. We propose $\Omega$ as a simple but meaningful analytic for diagnosing these problems.
We first define uniquely objective (i.e., uninformative) reference skyline priors, which we denote . Finding objective priors for multivariate statistical models is generally nontrivial, but (Berger et al., 2015) state that if has form then . Here, and are some functions and symbolizes the vector excluding . Following this, we obtain the objective priors
with , as normalization constants. Given its optimal properties (Parag and Pybus 2019), we only consider , and drop explicit notational references to it. Under this parametrization, and its expectation with respect to the prior are equal, that is . In addition, the reference prior in this case is , with as a matrix of zeros. This yields by Equation (4). A uniform prior over log-population space is hence uniquely objective for skyline inference.
Other prior distributions, which are subjective by this definition, necessarily introduce extra information and contribute to the posterior estimate precision. This contribution will result in $\Omega < 1$. The two most widely used subjective skyline plot smoothing priors are:
(i) the Sequential Markov Prior (SMP) used in the BSP (Drummond et al. 2005), and
(ii) the Gaussian Markov Random Field (GMRF) prior employed in both the Skyride and Skygrid methods (Minin et al. 2008; Gill et al. 2013).
As the SMP and GMRF both propose nearest neighbor autocorrelations among elements of $\psi$, tridiagonal posterior Fisher information matrices result. We represent these as $\mathcal{J}_{\mathrm{SMP}}$ and $\mathcal{J}_{\mathrm{GMRF}}$, respectively.
The SMP is defined as: (Drummond et al. 2005). It assumes that with a prior mean of . An objective prior is used for . To adapt this for , we define for . In the Appendix, we show how this expression yields Equation A1 and hence the transformed prior . We then take relevant derivatives to obtain , which for the minimally representative case is written as:
(10) |
The matrices simply extend the tridiagonal pattern of Equation (10).
An issue with the SMP is its dependence on the unknown ``true'' demographic parameter values. As a result, we cannot evaluate (or control) a priori how much information is contributed by this smoothing prior. Rapidly declining populations could feature , for example, which would result in prior over-reliance. Conversely, exponentially growing populations would be more data-dependent. This likely reflects the asymmetry in using sequential exponential distributions. The only control we have on smoothing implicitly emerges from choosing the number of segments, . Some recent implementations of the BSP include an alternative log-normal prior that links with (Bouckaert et al. 2019), which is conceptually similar to the GMRF below.
The possibly strong or inflexible prior assumptions under the BSP motivated the development of the GMRF for the Skyride and Skygrid methods (Minin et al. 2008). The GMRF works directly with $\psi = \log \boldsymbol{N}$ and models the autocorrelation between neighboring segments with multivariate Gaussian distributions. The GMRF prior (Minin et al. 2008) is defined as $P(\psi \mid \tau) = Z^{-1} \tau^{(p-2)/2} \exp\left(-\frac{\tau}{2} \sum_{j=1}^{p-1} \delta_j^{-1} (\psi_{j+1} - \psi_j)^2\right)$. In this model, $Z$ is a normalization constant, $\tau$ is a smoothing parameter, to which a gamma prior is often applied, and the $\delta_j$ values adjust for the duration of the piecewise-constant skyline segments. Usually, either (i) $\delta_j$ is chosen based on the inter-coalescent midpoints in $T$ or (ii) a uniform GMRF is assumed with $\delta_j = 1$ for every $j$.
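Similarly, we calculate $\mathcal{J}_{\mathrm{GMRF}}$ for $p = 3$, with smoothing parameter $\tau$, segment weights $\delta_j$, and $m_j$ coalescent events in segment $j$, as:
$$\mathcal{J}_{\mathrm{GMRF}} = \begin{bmatrix} m_1 + \frac{\tau}{\delta_1} & -\frac{\tau}{\delta_1} & 0 \\ -\frac{\tau}{\delta_1} & m_2 + \frac{\tau}{\delta_1} + \frac{\tau}{\delta_2} & -\frac{\tau}{\delta_2} \\ 0 & -\frac{\tau}{\delta_2} & m_3 + \frac{\tau}{\delta_2} \end{bmatrix}. \tag{11}$$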
The Appendix provides the general derivation for any $p$. As $\tau$ is arbitrary and the $\delta_j$ depend only on $T$, the GMRF is insensitive to the unknown parameter values. This property makes it more desirable than the SMP and gives us some control (via $\tau$) of the level of smoothing introduced. Nevertheless, the next section demonstrates that this model still tends to over-smooth demographic estimates.
We diagonalize and to obtain matrices of form . Here is an orthogonal transformation matrix (i.e., ) and with as the eigenvalue of . Since , we can use Equation 4 to find that . This equality reveals that acts as a prior perturbed version of . When objective reference priors are used we recover and . We can use the matrix to gain insight into how the GMRF and SMP encode population size correlations. The principal components of our posterior demographic estimates (which are obtained from ) are the vectors forming the axes of the uncertainty ellipsoid described by .
These principal component vectors take the form when we apply the reference prior . Thus, as we would expect, our uncertainty ellipses are centered on the parameters we wish to infer. However, if we use the GMRF prior these axes are instead transformed to . These new axes are linear combinations of and elucidate how smoothing priors share information (i.e., introduce autocorrelations) about across its elements. These geometrical changes also hint at how smoothing priors influence the statistical properties of our coalescent inference problem.
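To solidify these ideas, we provide a visualization and a worked example. We consider the simple $p = 2$ case, where the posterior Fisher information matrices $\mathcal{J}_{\mathrm{GMRF}}$ and $\mathcal{J}_{\mathrm{SMP}}$ for the GMRF and SMP both take the form:
$$\mathcal{J} = \begin{bmatrix} m_1 + a & -a \\ -a & m_2 + a \end{bmatrix}, \tag{12}$$
with $a = \tau / \delta_1$ for the GMRF and $a = e^{\psi_2 - \psi_1} = N_2 / N_1$ for the SMP. The signal-to-noise ratio is $r = a (m_1 + m_2) / (m_1 m_2)$ (see Equation 9), and performance clearly depends on how the $n - 1$ coalescent events in $T$ are apportioned between the two population size segments.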
We can lower bound the contribution of these priors to under any settings by using the robust coalescent design from (Parag and Pybus, 2019). This stipulates that we define our skyline segments such that in order to optimize estimate precision under . At this robust point, we also find that (or ) is attained. Figure 2 gives the uncertainty ellipses for this robust model at . These are constructed in coordinates centered about population size means as with controlling the confidence level.
Here is either or . Because is diagonal the data-informed confidence ellipse has principal axes aligned with . The covariance among population size segments in , which is induced by the smoothing prior, skews these principal axes. We can see this by diagonalizing at and for every to obtain:
(13) |
Applying , we find that the axes of our uncertainty ellipse (as visible in Figure 2) have changed from to . Sums and differences of log-populations are now the parameters that can be most naturally estimated under the SMP and GMRF. The reduction in the area of the ellipses of Figure 2 is a proxy for .
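The rotation of the uncertainty ellipse can be reproduced with a small eigen-decomposition. This sketch assumes a two-segment posterior Fisher matrix with equal event counts on the diagonal plus a symmetric nearest-neighbor coupling term (here called `coupling`, a hypothetical name for the prior strength), and uses a closed-form solver for symmetric 2x2 matrices so that only the standard library is needed. The numerical values are arbitrary.

```python
from math import hypot, sqrt

def symmetric_2x2_eigen(matrix):
    """Eigenvalues (ascending) and unit eigenvectors of [[p, b], [b, q]]."""
    (p, b), (_, q) = matrix
    if b == 0.0:
        # Already diagonal: the axes are the original coordinates.
        return sorted([(p, (1.0, 0.0)), (q, (0.0, 1.0))])
    mean = (p + q) / 2.0
    radius = hypot((p - q) / 2.0, b)
    pairs = []
    for eigenvalue in (mean - radius, mean + radius):
        vx, vy = b, eigenvalue - p          # solves (p - eigenvalue) * vx + b * vy = 0
        norm = hypot(vx, vy)
        pairs.append((eigenvalue, (vx / norm, vy / norm)))
    return pairs

events_per_segment = 10.0   # equal split of coalescent events (robust design)
coupling = 5.0              # assumed prior strength on the log scale

posterior = [
    [events_per_segment + coupling, -coupling],
    [-coupling, events_per_segment + coupling],
]
(low_value, low_axis), (high_value, high_axis) = symmetric_2x2_eigen(posterior)

# Ellipse area scales with 1 / sqrt(det), so the area ratio is the information ratio.
area_ratio = sqrt(events_per_segment ** 2 / (low_value * high_value))
```

With equal event counts, the two axes have components of equal magnitude: one corresponds to the sum of the two log-populations and keeps the data-only information, while the other corresponds to their difference and absorbs all of the information added by the coupling term.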
The Dangers of Smoothing
Having defined ratios for measuring the contribution of smoothing priors to the precision of estimates, we now use them to explore and expose the conditions under which prior over-reliance is likely to occur in practice. We assume that skyline segments are chosen to satisfy the robust design for (Parag and Pybus 2019), with as the total number of skyline segments. We previously proved that robust designs, at , minimize dependence on the prior (maximize ). While this is not the case for , in Figure A1 of the Appendix, we illustrate that the maximal point is generally well approximated by this robust setting. The values computed here are therefore conservative for most settings. Other experimental designs rely more on the prior.
As in Equation 5, we use the threshold to diagnose when the coalescent data (likelihood) and prior are equally influencing demographic posterior estimate precision. At the total Fisher information doubles since . We previously uncovered the importance of this threshold in the Kingman conjugate prior problem, where it signified an equality between the number of pseudo and real samples contributed by the prior and data, respectively. As (see Equation 8), this setting is also meaningful because it achieves a unit signal-to-noise ratio for any skyline-based model.
We first reconsider the case of Equation 12, where controls the prior contribution to . Here suggests , which implies that we are overly reliant on smoothing when is larger than of the total observed coalescent events. This occurs when or , for the SMP and GMRF, respectively. The improved precision due to the prior at this threshold is shown in Figure 2. The relative ellipse area (and hence ) will shrink further as we deviate from robust designs.
As the number of skyline segments, , increases, smoothing becomes more influential and can promote misleading conclusions. For the cases, we will only examine the GMRF, since the SMP has the undesirable property of dependence on the unknown values. To better expose the impact of the smoothing parameter , we will assume a uniform GMRF () so that then only depends on and . We compute and hence , at various . For example, we find that
under the robust design. Interestingly, the order of the polynomial dependence of (and hence ) on increases with . We find that this trend holds for any design. We will use the term robust for when is calculated under a robust design.
Figure 3 plots the robust against and for the uniform GMRF. A key feature of Figure 3 is the steep -dependent decay of relative to the threshold, which exposes how easily we can be unduly reliant on the prior, as increases. Given a phylogeny , increasing the complexity of a skyline-based model enhances the dependence of our posterior estimate precision on the smoothing prior. This pattern is intuitive as fewer coalescent events now inform each demographic parameter (Parag and Pybus 2019). However, decays with surprising speed. For example, at (the lowest curve in Figure 3), we get for and . Usually, has a gamma-prior with mean of 1 (Minin et al. 2008). We show the corresponding mutual information increases due to these GMRF priors in Figure A2 of the Appendix.
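The decay with skyline dimension can be reproduced qualitatively with a short numerical sketch. It rests on assumptions that are ours rather than the text's: the ratio is taken to be the square root of the determinant ratio of the data-only and prior-informed Fisher information (the ellipse-area interpretation used for Figure 2), the uniform GMRF term is the smoothing parameter times a path-graph Laplacian, and events are split equally among segments. The names `uniform_gmrf_precision`, `information_ratio`, `m`, `p`, and `tau` are hypothetical.

```python
import numpy as np

def uniform_gmrf_precision(p, tau):
    """tau times the path-graph Laplacian: symmetric, tridiagonal, zero row sums."""
    lap = np.zeros((p, p))
    for j in range(p - 1):
        lap[j, j] += 1.0
        lap[j + 1, j + 1] += 1.0
        lap[j, j + 1] = lap[j + 1, j] = -1.0
    return tau * lap

def information_ratio(m, p, tau):
    """sqrt(det(data) / det(data + prior)) with m / p events per segment.

    Assumption: this determinant ratio stands in for the statistic in the text;
    m / p is treated as a continuous quantity for simplicity.
    """
    data_info = (m / p) * np.eye(p)
    post_info = data_info + uniform_gmrf_precision(p, tau)
    # slogdet avoids overflow or underflow of determinants at large p.
    _, logdet_data = np.linalg.slogdet(data_info)
    _, logdet_post = np.linalg.slogdet(post_info)
    return float(np.exp(0.5 * (logdet_data - logdet_post)))

m, tau = 100, 1.0  # illustrative values only
segments = (1, 2, 5, 10, 20, 50)
ratios = [information_ratio(m, p, tau) for p in segments]
for p, r in zip(segments, ratios):
    print(f"p={p:3d}  ratio={r:.3f}")
```

Under these assumptions the ratio equals 1 for a single segment and falls steadily as more segments share the same number of events, in line with the trend described for Figure 3.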
While Figure 3 might seem specific to the uniform GMRF, it is broadly applicable to the BSP, Skyride, and Skygrid methods. We now outline the implications of Figure 3 for each of these skyline-based approaches.
(1) Bayesian Skyline Plot. This method uses the SMP, which depends on the unknown values. However, the results of Figure 3 remain valid if we set to , which results in the smallest non-data contribution to Equation 10. This follows as and have similar forms. While this choice underestimates the impact of the SMP, it still cautions against high- skylines and confirms suspected BSP issues related to poor estimation precision when skylines are too complex, or the coalescent data are not sufficiently informative (Ho and Shapiro 2011). However, good use of the BSP grouping parameter (Drummond et al. 2005), which sets , could alleviate these problems.
(2) Skyride. When this method uses the uniform GMRF, all results apply exactly. In its full implementation, the Skyride employs a time-aware GMRF that sets based on and estimates from the data (Minin et al. 2008). However, even with these adjustments, the GMRF can over-smooth and fail to recover population size changes (Ho and Shapiro 2011; Faulkner et al. 2019). Our results provide a theoretical grounding for this observation. The Skyride constrains and then smooths this noisy piecewise model. Consequently, it constructs a skyline that is too complex by our measures (the lowest curve in Figure 3 is at ). By rescaling the smoothing parameter to , the curves in Figure 3 upper bound the true values of the time-aware GMRF.
(3) Skygrid. This method uses a scaled GMRF. For a tree with TMRCA , the Skygrid assumes new population size segments every time units (Gill et al. 2013). As a result, every and the time-aware GMRF becomes uniform with rescaled smoothing parameter . Therefore, the conclusions of Figure 3 hold exactly for the Skygrid, provided the horizontal axis is scaled by . This setup reduces the rate of decay but the curves still caution strongly against using skylines with . Unfortunately, as its default formulation sets to 1 less than the number of sampled taxa (or lineages) (Gill et al. 2013), the Skygrid is also vulnerable to prior over-reliance.
The popular skyline-based coalescent inference methods therefore all tend to over-smooth, resulting in population size estimates that can be overconfident or misleading. This issue can be even more severe than Figure 3 suggests, since in current practice is often close to and non-robust designs are generally employed. Further, skylines are only statistically identifiable if every segment has at least 1 coalescent event (Parag and Pybus 2019; Parag et al. 2020). Consequently, if is set, smoothing priors can even mask identifiability problems. We recommend that must be guaranteed and, in the next section, derive a model rejection guideline for finding , the suggested minimum number of coalescent events per skyline segment, and for diagnosing prior over-reliance.
Prior Informed Model Rejection
We previously demonstrated how commonly used smoothing priors can dominate the posterior estimate precision when coalescent inference involves complex, highly parametrized (large-) skyline models. Since data are more influential than the prior when , we can use this threshold to define a simple -rejection policy to guard against prior over-reliance. Assume that the matrix resulting from our prior of interest is symmetric and positive definite. This holds for the GMRF and SMP. The standard arithmetic–geometric mean inequality, , then applies with denoting the matrix trace. Since , we can expand this inequality and substitute in Equation 4 to get .
Since this inequality applies to all , we can maximize its right-hand side to get a tighter lower bound on . This bound, termed , is achieved at the robust design and is given by
(14) |
We define as a conservative model rejection criterion with implying that . If is the largest satisfying these inequalities (see Equation 14, indicates argument), then any skyline with more than segments is likely to be overly dependent on the prior and should be rejected under the current coalescent data or tree.
Alternatively, we recommend that skylines using a smoothing prior (with matrix ) should have at least events per segment to avoid prior reliance. The condition in Equation 14 ensures skyline identifiability (Parag and Pybus 2019) and generally (i.e., ). The dependence of on means that additions to the diagonals of necessarily increase the precision contribution from the prior. This insight supports our previous analysis, which used from the uniform GMRF to bound the performance of the SMP and time-aware GMRF. In the Appendix (see Equation A2) we derive analogous rejection bounds based on the excess mutual information, , from Equation 7. There we find that acts like an information-theoretic bandwidth, controlling the prior-contributed mutual information.
Equation 14, which forms a key contribution of this work, can be computed and is valid for any smoothing prior of interest. For the uniform GMRF where , we get . Note that here whenever or , as expected (i.e., there is no smoothing at these values). In Figure A4 of the Appendix, we confirm that is a good lower bound of . We enumerate across and , for an observed tree with , to get Figure 4, which recommends using no more than segments (). In Figure A5, we plot curves for various and , defining boundaries beyond which skyline estimates will be overly dependent on the GMRF.
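A trace-based rejection rule of this kind can be sketched in a few lines. The bound below is our own reconstruction from the arithmetic–geometric mean argument and may differ in detail from Equation 14. It assumes equal event counts per segment, a uniform GMRF whose prior matrix is the smoothing parameter times a path-graph Laplacian (trace `2 * tau * (p - 1)`), and a ratio defined as the square root of the determinant ratio. All function and variable names are hypothetical.

```python
import numpy as np

def trace_lower_bound(m, p, prior_trace):
    """AM-GM lower bound on sqrt(det(D) / det(D + P)) when D = (m / p) * I.

    Uses det(D + P) <= (trace(D + P) / p) ** p with trace(D) = m.
    """
    return (1.0 + prior_trace / m) ** (-p / 2.0)

def uniform_gmrf_trace(p, tau):
    """Trace of tau times the path-graph Laplacian on p segments."""
    return 2.0 * tau * (p - 1)

def max_segments(m, tau, threshold=0.5):
    """Largest p in 1..m whose lower bound stays at or above the threshold."""
    best = 1
    for p in range(1, m + 1):
        if trace_lower_bound(m, p, uniform_gmrf_trace(p, tau)) >= threshold:
            best = p
    return best

def path_laplacian(p):
    """Symmetric tridiagonal matrix with zero row sums (all zeros for p = 1)."""
    lap = np.zeros((p, p))
    for j in range(p - 1):
        lap[j, j] += 1.0
        lap[j + 1, j + 1] += 1.0
        lap[j, j + 1] = lap[j + 1, j] = -1.0
    return lap

def exact_ratio(m, p, tau):
    """Exact determinant-based ratio, for comparison with the bound."""
    data_info = (m / p) * np.eye(p)
    _, ld_data = np.linalg.slogdet(data_info)
    _, ld_post = np.linalg.slogdet(data_info + tau * path_laplacian(p))
    return float(np.exp(0.5 * (ld_data - ld_post)))

m, tau = 100, 1.0  # illustrative values only
p_max = max_segments(m, tau)
print("largest p with bound >= 1/2:", p_max)
print("implied minimum events per segment:", m / p_max)
```

Because the bound needs only the trace of the prior matrix, it avoids any determinant computation; the exact ratio is included only to confirm that the bound sits below it in this setting.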
In the Appendix, we further analyze Equation 14 for the uniform GMRF to discover that is bounded by curves with exponents linear in and quadratic in (see Equation A3). This explains how the influence of smoothing increases with skyline complexity and yields a simple transformation , which can negate prior over-reliance. For comparison, the Skyride implements . The marked improvement, relative to Figure 3, is striking in Figure A3. Other revealing prior-specific insights can be obtained from Equation 14, reaffirming its importance as a model rejection statistic.
Our model rejection tool of Equation 14 can serve as a useful diagnostic for skyline over-parametrization, and as a precaution against prior over-reliance. However, we do not propose as the sole measure of optimal skyline complexity because, while warns against the prior being too influential relative to the data, it does not guarantee any absolute estimate precision. For example, a small pair might produce the same as a larger pair. Choosing an optimal in a data-justified manner is an open problem that is still under active study (Parag and Donnelly 2020). We next illustrate how , via its more easily computed approximation, , can be practically applied to detect and reject over-smoothed skyline plot models, using data sets that are commonly employed to evaluate the performance of coalescent demographic inference.
Illustrative Examples: Egyptian HCV and Beringian Bison
We validate the practical utility of (and hence ), as a diagnostic of prior over-dependence, by investigating changes in effective population size inferred from the well-studied Egyptian HCV-4 (Pybus et al. 2003) and Beringian steppe bison (Shapiro et al. 2004) data sets. The first consists of 63 partial sequences of HCV genotype 4 and was previously analyzed by Pybus et al. (2003) using a coalescent model with a parametric demographic function that featured periods of constant population size separated by a phase of exponential growth. The second data set comprises 152 modern and partial mtDNA sequences and was investigated by Shapiro et al. (2004), where skyline plot models confirmed a demographic history of exponential growth then decline (boom-bust) with an additional bottleneck dynamic (Drummond et al. 2005). These two data sets have since been re-examined under various alternate models by Minin et al. (2008), Gill et al. (2013), Parag et al. (2020), and several other studies.
We simulated 100 trees with and tips, using the software package MASTER (Vaughan and Drummond 2013), according to inferred HCV and bison population size trends, respectively. The HCV population size trend that we simulated from is provided by Pybus et al. (2003). We inferred the population size trend of the bison data set using the BSP (with sequential Markovian prior) in accordance with published analyses (Drummond et al. 2005). We used 20 population groups and the optimal design from Parag and Pybus (2019) to ensure that we captured complex bison population dynamics reliably. As our focus is on exploring the behavior of skylines and given a particular underlying population size trend and not the uncertainty associated with that trend, we used the posterior mean (HCV) or median (bison) of these inferred trends for simulating trees and do not consider genealogical uncertainty.
The simulated set of coalescent trees from each data set provides an approximate measure of the coalescent variance that could arise from the inferred underlying population size trends. We then estimated from every simulated tree using various skyline models with time-aware GMRF smoothing priors, as in Minin et al. (2008). We varied the relative contributions of the coalescent data and GMRF to our posterior log-population size estimates by changing either the skyline dimension, , or the GMRF smoothing parameter . As is fixed for a given data set and robust designs are applied, increasing the number of coalescent events in each segment, , reduces .
We analyzed every tree over all combinations of across a wide range of . For comparison, we also generated purely data-informed estimates of , for the same , by replacing the subjective GMRF with a uniform, objective prior. We computed from Equation 14 for these settings in Figure 5 and observe that, as expected, it decreases with both and (i.e., increases with ). Practical analyses of these data sets using Skyride or Skygrid approaches would choose or infer a value and set . However, Figure 5 shows and hence events per skyline parameter are often necessary to achieve . This raises questions about the validity of the common practice of applying these methods using their default settings.
Figure 5 confirms that the recommended maximum skyline dimension falls and hence the minimum allowable number of coalescent events per segment grows as the smoothing parameter increases. We demonstrate the qualitative difference in skyline-based estimates between values on either side of the criterion for a single simulated HCV and bison tree in Figure 6. In panels A and C, we present the Skyride estimate, which uses and implements , at the chosen values (0.05 and 1). Contrastingly, in B and D, we illustrate an equivalent skyline with a different , which achieves at this same , according to our metric (see the and curves at and in panels A and B of Figure 5, respectively). We overlay the corresponding skyline (with the same ) obtained with an objective uniform prior, to visualize the uncertainty engendered from the coalescent data alone.
At (panels A and C of Figure 6), the uniform prior produces a skyline that infers more rapid demographic fluctuations through time than that estimated with the GMRF prior. Further, the 95% HPD intervals from the uniform prior (red) are substantially wider than those from the GMRF prior (blue) in both examples, highlighting the marked contribution of the time-aware GMRF prior to posterior estimate precision. While this smoothed trajectory looks reliable we argue that, because (and hence ), it is difficult to justify using the data alone and that the prior is responsible for too much of the estimate precision. In contrast, at and (panels B and D of Figure 6), which apply , both prior distributions yield more similar skylines, implying that GMRF smoothing has not substantially inflated posterior estimate precision.
Under these settings, we have fewer demographic fluctuations than for because 4 and 2 times more coalescent events are informing each parameter or skyline segment, respectively. We achieve smaller uncertainty than with a uniform prior (which is overfitted) but without excessively relying on the GMRF smoothing, which at is likely underfitting. The metric and hence criterion help us better balance data, noise, and our prior assumptions. In contextualizing these results it is important to note that skyline plots provide harmonic mean and not point estimates of population size (Pybus et al. 2000). Consequently, we are inferring sequences of means from our coalescent data, which a priori may not need to conform to a smooth pattern.
The HCV example shows that for times beyond years there are so few events that it is more sensible to estimate a single mean (panel B), which we are confident in across this period, as opposed to several less certain and overfitted means (panel A). In contrast, for the bison example, the bottleneck over years is over-smoothed (panel C), despite many coalescent events occurring in that region. The simple correction of extending our harmonic mean over 2 events (panel D) restores the necessary fall in population size. Deciding on how to balance uncertainty with model complexity is non-trivial and, as shown in these examples, caution is needed to avoid misleading conclusions. We posit that (and hence ) can help formalize this decision-making and improve our quantification of the uncertainty across skyline plots.
Having confirmed as a credible measure of relative uncertainty, we briefly explore how it relates to more easily ascertained measures of uncertainty. For each simulated coalescent tree in the HCV example above, we computed (via Equation 4) and two ancillary statistics based on the 95% highest posterior density (HPD) intervals of the estimates. These are the median HPD ratio and the relative HPD product (across the skyline segments) , which are formulated as:
with med indicating the median value of a set. Here is the 95% HPD interval of under a GMRF with smoothing parameter and is the equivalent HPD when the objective uniform prior is applied instead.
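The two ancillary statistics can be computed directly from per-segment HPD intervals. The sketch below follows the verbal description only: it assumes each statistic compares the width of the 95% HPD interval under the GMRF with the width under the uniform prior, segment by segment, taking the median of these ratios in one case and their product in the other. The direction of the ratio (GMRF over uniform), the function names, and the toy intervals are our assumptions and are not taken from the analyses in the text.

```python
import numpy as np

def hpd_widths(intervals):
    """Widths of (lower, upper) HPD intervals, one row per skyline segment."""
    intervals = np.asarray(intervals, dtype=float)
    return intervals[:, 1] - intervals[:, 0]

def median_hpd_ratio(gmrf_intervals, uniform_intervals):
    """Median over segments of the GMRF-to-uniform HPD width ratio."""
    ratios = hpd_widths(gmrf_intervals) / hpd_widths(uniform_intervals)
    return float(np.median(ratios))

def relative_hpd_product(gmrf_intervals, uniform_intervals):
    """Product over segments of the GMRF-to-uniform HPD width ratio."""
    ratios = hpd_widths(gmrf_intervals) / hpd_widths(uniform_intervals)
    return float(np.prod(ratios))

# Toy intervals for three segments (made up purely for illustration).
gmrf_hpd = [(0.0, 1.0), (1.0, 2.0), (2.0, 2.5)]
uniform_hpd = [(-0.5, 1.5), (0.5, 3.0), (1.5, 3.5)]

print("median HPD ratio:", median_hpd_ratio(gmrf_hpd, uniform_hpd))
print("relative HPD product:", relative_hpd_product(gmrf_hpd, uniform_hpd))
```

The product aggregates the width reduction over every segment, so it is sensitive to the number of segments, whereas the median summarizes a typical segment and largely ignores the skyline dimension.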
The 95% HPD interval is closely connected to the inverse of the Fisher information matrices that define and, further, describes the most visually conspicuous representation of the uncertainty present in skyline plot estimates. Comparing to these ancillary statistics, which evaluate the median and total 95% uncertainty of a skyline plot, allows us to contextualize against more relatable (though different) and obvious visualizations of posterior performance. We present these comparisons in Figure A6 of the Appendix. There we find that all statistics monotonically decay with (that is, as the time-aware GMRF becomes more informative). The sharpness of this decay is highly sensitive to . Larger means that more coalescent data are informing each estimated parameter (smaller ).
The reduced decay with supports our assertion that acts as an exponent controlling prior over-reliance (see Figure 3). The gentler decay of (relative to and ), which largely does not account for , confirms that we could be misled in our understanding of the impact of smoothing if we neglected skyline dimension. In contrast, and , which both measure, in some sense, the relative volumes of uncertainty across the entire skyline plot due to the data alone and the data and prior, fall more significantly and consistently. At (), which is the most common setting in the Skyride and Skygrid methods, both statistics are markedly below and posterior estimates will often be too dependent on the prior. This high- behavior is also indicative of model overparametrization (Parag and Donnelly 2020). Our metric therefore relates sensibly to visible and common proxies of uncertainty.
Discussion
Popular approaches to coalescent inference, such as the BSP, Skyride, and Skygrid methods, all rely on combining a piecewise-constant population size likelihood function with prior assumptions that enforce continuity. This combination, which is meant to maximize descriptive flexibility without sacrificing the smoothness that is expected to be exhibited by real population size curves over time, has led to many insights in phylodynamics (Ho and Shapiro 2011). However, it has also spawned concerns related to over-smoothing and lack of methodological transparency (Minin et al. 2008; Faulkner et al. 2019). In this work, we attempted to address these concerns by deriving metrics for diagnosing and clarifying the existing assumptions present in current best practice.
Detecting and correcting for underfitting or over-smoothing is crucial if reliable and meaningful assessments of the effective population size changes of a species or pathogen of interest are to be made from sequence data. Abrupt changes in effective population size are not only biologically plausible but may also signal key events that have shaped the demographic histories of populations (Pyron and Burbrink 2013). In ecology, identifying rapid extinctions and bottlenecks in diversity might signify the impact of environmental change or anthropogenic influences (e.g., hunting or changes in land use) (Stiller et al. 2010; Thomas et al. 2019). Similarly, in epidemiology, sharp fluctuations in the prevalence of an infection might support hypotheses about emergence in novel populations, seasonality, the effect of interventions, vaccines, or drug treatments. Further, rapid exponential growth of any population may, when observed over a longer timescale, appear as a near-stepwise transition in population size.
Underfitting or over-smoothing these changes would limit understanding of the dynamics of the study population and could affect conclusions about the potential causative factors that influenced those dynamics. However, recognizing when commonly used methods for inferring these demographic trends are over-smoothing is difficult. By capitalizing on (mutual) information theory and (Fisher) information geometry, we formulated the novel coalescent information ratio, , which provides a rigorous means of solving this over-smoothing problem. This ratio describes both the proportion of the asymptotic uncertainty around our posterior estimates that is due solely to the data and the additional mutual information that the prior assumptions introduce.
We derived analytic expressions for for the BSP, Skyride, and Skygrid estimators of effective population size, which combine piecewise skyline likelihoods with either SMP or GMRF smoothing priors. We also showed that has an exact and intuitive interpretation as the ratio of real coalescent events to the sum of real and virtual (prior-contributed) ones in a Kingman coalescent model. Using as a threshold delimiting when the prior contributes as much information as the coalescent data, we found that it is easy to become overly dependent on prior assumptions as the skyline dimension, , increases (for a fixed tree size). This central result emerges from the drastic reduction in the number of coalescent events informing on any population size parameter as rises. Per parameter, the BSP and Skyride use only a few or one event respectively (Minin et al. 2008; Drummond et al. 2005), while the Skygrid may have no events informing some parameters (Gill et al. 2013).
These issues can be obscured by current Bayesian implementations, which can still produce apparently reasonable population size estimates, at least visually, as illustrated in our simulated HCV and bison case studies. Our simulations indicate that analyses that combine maximally parametrized skylines (one event per segment or parameter) with GMRF smoothing can lead to errors in population size inference. For trees simulated according to the HCV demographic scenario, estimates were likely overfitted in the far past, inflating HPDs, but over-smoothed towards the present. The resulting skyline uncertainty contrasted with that from the original (Pybus et al. 2003) and later (Parag and Pybus 2017) analyses. In the bison example, we found evidence for underfitting. The inferred skyline there emphasized a smoother boom-bust trend with concentrated HPDs. However, this underestimated the depth of a bottleneck during which coalescent events were concentrated.
These mismatches between data and smoothing can be difficult to diagnose and problematic, not just for prior over-dependence. Low coalescent event counts, for example, can lead to poor statistical identifiability (Rothenberg 1971), which might manifest in spurious MCMC mixing. Consequently, we proposed a practical rejection criterion for ensuring that coalescent data are the main source of inferential information. This criterion, which was based on an approximation to , provided a way of regularizing skyline complexity. When applied to our examples, it recommended a 4-event skyline grouping that resulted in demographic reconstructions that were more consistent with the above-mentioned HCV studies. It also suggested a simple 2-event grouping that recovered the bison bottleneck dynamic without generating too much estimate noise.
This criterion bounds the maximum recommended skyline dimension for a given data set (tree) size and provides a usable means of defining the minimum number of coalescent events, , which we should allocate to each skyline segment to guard against too much prior influence. Since requires computing only the sum of the diagonals of the prior Fisher matrix, it can serve as a simple rule-of-thumb for sensibly balancing the prior-data tradeoff in skyline plots (e.g., in the BSP, the grouping parameter might be set to a value above to ensure well-regularized estimates). As we found to be lower-bounded by more visible measures of skyline uncertainty, such as the product of relative HPD widths, useful approximations to and may also be computed from these measures.
Our metric also provides insight into how we can alleviate the dramatic impact of skyline complexity on prior over-reliance. When specialized to the GMRF, for example, it reveals that we can negate over-smoothing by scaling the smoothing parameter with a quadratic of . Moreover, it shows that only by increasing the information available from the sampled phylogeny can we reasonably allow for more complex piecewise-constant functions under a given prior. Recent methods, such as the epoch sampling skyline plot (Parag et al. 2020), which can double the Fisher information extracted from a given phylogeny by exploiting the informativeness of sampling times, would support higher dimensional skylines. Such approaches have the potential to increase the contribution of the data without elevating the influence of the smoothing prior.
While in this article we have applied to non-parametric, skyline inference problems in population genetics, ecology and infectious disease epidemiology, its general formulation in Equation 4 is more widely applicable. It can also be applied to coalescent inference problems where specific parametric models (e.g., exponential/logistic growth) are used, in order to disentangle the contributions of observed data and the prior distributions over these parameters, though numerical solutions will likely be necessary. More generally, our approach is valid for any statistical problem, provided the Hessian matrices necessary for deriving the prior and data Fisher information terms are valid and computable. This is not limited to prior-data tradeoffs. Similar ratio metrics should be derivable by comparing Fisher information terms from different sources (e.g., to test whether one source of data is more informative than another).
Thus, we have devised and validated a rigorous means of better understanding, diagnosing and preventing prior over-dependence. We hope that our statistic, which clarifies and quantifies the often inscrutable impact of the prior and data, will help researchers make more active and considered design decisions when adapting popular skyline-based techniques. Our work also aligns with recent studies, which have started to re-examine both model selection and prior definition (Parag and Donnelly 2020; Faulkner et al. 2019) in an attempt to derive more reliable effective population size estimates from coalescent trees. While we believe that data-driven conclusions are generally the most justifiable we note that, in the context of skyline plots, this can be open to interpretation and the choice of prior is far from trivial.
Acknowledgments
We thank Louis du Plessis for his useful comments and insights on this project.
Appendix
Smoothing Prior Fisher Information Matrices
Here, we derive the prior-informed Fisher information matrices for the SMP and GMRF smoothing priors. We start by finding the log-population size transformed version of the SMP smoothing prior. We then calculate its Hessian to get , and so obtain the general form of Equation 10. The SMP is given in (Drummond et al., 2005) as . We define so that its inverse . These expressions are in vector form so . We want the transformed prior . Applying the multivariate change of variables formula gives , with as the Jacobian of . This implies that . Substituting gives the SMP log-prior:
(A1) |
We can then obtain , with . The diagonals of are: for , and . The non-zero off-diagonal terms are: and . The result is a symmetric tridiagonal matrix that has zero row and column sums. The matrix is then added to the Fisher information matrix (with as the number of coalescent events informing on the parameter), to get .
We now compute , which is given in the main text as Equation (11). For the GMRF (Minin et al. 2008) and so . Taking second derivatives we get diagonal terms of the Hessian, , as: for , and . The nonzero off diagonal terms are: and . The GMRF also gives a symmetric tridiagonal with row and column sums of zero. Adding to the diagonal matrix yields .
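The structure described above (a symmetric tridiagonal prior matrix with zero row and column sums, added to a diagonal matrix of event counts) can be assembled as follows. This is a minimal sketch under our own assumptions: the GMRF log-prior is taken to be a weighted sum of squared differences between adjacent log-population sizes, with one non-negative weight per adjacent pair (all ones for the uniform case). The names `gmrf_prior_information`, `posterior_information`, `tau`, and `weights` are hypothetical.

```python
import numpy as np

def gmrf_prior_information(tau, weights):
    """Prior matrix for a GMRF with one weight per adjacent pair of segments.

    weights[j] couples segments j and j + 1; all ones gives the uniform case.
    The result is symmetric, tridiagonal, and has zero row and column sums.
    """
    w = np.asarray(weights, dtype=float)
    p = w.size + 1
    mat = np.zeros((p, p))
    idx = np.arange(p - 1)
    mat[idx, idx] += w          # contribution to the earlier segment's diagonal
    mat[idx + 1, idx + 1] += w  # contribution to the later segment's diagonal
    mat[idx, idx + 1] -= w      # super-diagonal
    mat[idx + 1, idx] -= w      # sub-diagonal
    return tau * mat

def posterior_information(event_counts, tau, weights):
    """Diagonal data term (events per segment) plus the GMRF prior term."""
    data_info = np.diag(np.asarray(event_counts, dtype=float))
    return data_info + gmrf_prior_information(tau, weights)

# Uniform GMRF on four segments with five events each (illustrative values).
prior = gmrf_prior_information(tau=2.0, weights=[1.0, 1.0, 1.0])
post = posterior_information([5, 5, 5, 5], tau=2.0, weights=[1.0, 1.0, 1.0])
print(prior)
print("row sums:", prior.sum(axis=1))
print("smallest eigenvalue of the sum:", np.linalg.eigvalsh(post).min())
```

The prior term alone is singular because its rows sum to zero, but adding positive event counts on the diagonal makes the sum positive definite, which is why every segment needs at least one event.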
Further Smoothing Results
In the main text, we asserted that the computed at the robust point of (Parag and Pybus 2019) generally upper bounds the achievable values at other settings. Here we provide evidence for this assertion. While strictly (except for ), we numerically find that . We show this for the GMRF under uniform smoothing in Figure A1. This makes sense as while (for fixed smoothing parameters) and , there is no reason to believe that this also maximizes their ratio. The sawtooth curves in Figure A1 reflect changes in the other values, given a fixed .
Hence, we used the robust design point in our calculation of the curves for the GMRF in Figure 3. The corresponding additional mutual information () curves for this case are provided in Figure A2. These show how larger values of the smoothing parameter, , directly lead to increases in the relative mutual information contribution from the prior. Observe that is highly sensitive to the skyline complexity, , thus clarifying how estimates from overparametrized skyline plots can be dominated by prior information.
Interestingly, we can largely negate the impact of skyline complexity by making a function of . In the main text we explained how the Skyride implicitly implements the scaling . While this reduces some of the effect of shown in Figure 3, it still leads to decaying curves that can, for a given , be deceptively dependent on smoothing. Here we propose the key transformation , as a means of reducing our smoothing in line with our skyline complexity. This transformation was inspired by the dependence of a lower bound on , which we derive in Equation A3 later in the Appendix. Its striking impact on the spread of curves from Figure 3 is given in Figure A3.
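The effect of tying the smoothing parameter to the skyline dimension can be illustrated numerically. This sketch assumes, for illustration only, that the smoothing parameter is divided by the square of the number of segments (one possible quadratic scaling; the precise form intended may differ). It also assumes a determinant-based ratio, a uniform GMRF, and equal event counts per segment. All names are hypothetical.

```python
import numpy as np

def path_laplacian(p):
    """Symmetric tridiagonal matrix with zero row sums (all zeros for p = 1)."""
    lap = np.zeros((p, p))
    for j in range(p - 1):
        lap[j, j] += 1.0
        lap[j + 1, j + 1] += 1.0
        lap[j, j + 1] = lap[j + 1, j] = -1.0
    return lap

def information_ratio(m, p, tau):
    """sqrt(det(data) / det(data + prior)) with m / p events per segment."""
    data_info = (m / p) * np.eye(p)
    _, ld_data = np.linalg.slogdet(data_info)
    _, ld_post = np.linalg.slogdet(data_info + tau * path_laplacian(p))
    return float(np.exp(0.5 * (ld_data - ld_post)))

m, tau = 100, 1.0  # illustrative values only
segments = (2, 5, 10, 25, 50)
unscaled = [information_ratio(m, p, tau) for p in segments]
rescaled = [information_ratio(m, p, tau / p**2) for p in segments]
for p, a, b in zip(segments, unscaled, rescaled):
    print(f"p={p:3d}  unscaled={a:.3f}  rescaled={b:.3f}")
```

Under these assumptions the unscaled ratio decays quickly with the number of segments, while the rescaled ratio stays above `exp(-tau / m)` for every dimension, which is consistent with the flattening of the curves described for Figure A3.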
Further Model Selection Bounds
In the main text, we derived lower bounds on , which led to the model rejection parameter, (see Equation 14). Here, we extend and support those results. In Figure A4, we first show that the bound of Equation 14 is a good measure of the true value, for a skyline with uniform GMRF smoothing. We used this bound to define a maximum , , above which the skyline would be over-parametrized and susceptible to prior induced overconfidence. We explore over and for this GMRF in Figure A5 and observe that becomes more restrictive with fewer observed data (coalescent events) or increased smoothing. This supports as a useful measure of prior-data contribution.
Lower bounds on imply upper bounds on the excess mutual information, (see Equation 7). We manipulate Equation 14 (under a robust design) to obtain the first inequality in Equation A2, with as follows
(A2) |
This expression reveals that is akin to a signal bandwidth, by comparison with standard Shannon–Hartley theory (Cover and Thomas 2006) and is therefore a key controlling factor in defining how much additional information the prior will introduce. This supports our proposed rejection criterion.
Under the parametrization, and are symmetric, positive definite matrices. For such matrices we can apply a theorem from (Huang and Zhang, 2018), which states that , with . At the robust point, we get , which leads to the second inequality in Equation A2. Thus, our bound is tighter than that in (Huang and Zhang, 2018), and useful for broader, future mathematical analyses of . This inequality also clarifies why is often important for characterizing performance here.
We can also use the bound of Huang and Zhang (2018) to derive alternate (but slacker) lower bounds on . This gives the first inequality in Equation A3. Applying this to the uniform GMRF gives the second inequality:
(A3) |
Interestingly, Equation A3 shows that the dependence of on the smoothing parameter is at most only linear, while the dependence on complexity can be quadratic. This provides further theoretical backing for the use of to reject models and emphasizes how smoothing can play a deceptively prominent role in the resulting estimate precision produced under complex (high-dimensional) skyline plots.
Ancillary Uncertainty Statistics
In the Egyptian-HCV simulated example, we defined two 95% HPD-based ancillary statistics for characterizing the visual uncertainty present in a skyline plot demographic estimate. In Figure A6, we plot these statistics and for various and values under a time-aware GMRF. We discuss the implications of Figure A6 in the main text but observe here that trends between the more common (and more easily visualized) HPD-based measures and our novel statistic are largely consistent.
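To make the HPD ingredient of these ancillary statistics concrete, the following is a minimal sketch of how a 95% HPD interval can be computed from posterior samples of a single quantity. It assumes a unimodal posterior and uses the shortest-interval definition; the function name `hpd_interval` and the synthetic samples are illustrative only, and the sketch does not reproduce the two statistics defined in the main text.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))  # number of samples the interval must cover
    if k < 1 or k > n:
        raise ValueError("mass must lie in (0, 1] and samples must be non-empty")
    # Width of every window of k consecutive sorted samples.
    widths = x[k - 1:] - x[: n - k + 1]
    i = int(np.argmin(widths))  # left end of the narrowest window
    return x[i], x[i + k - 1]

# Synthetic posterior draws (an assumption, for illustration only).
rng = np.random.default_rng(0)
draws = rng.lognormal(mean=0.0, sigma=0.5, size=5000)
lo, hi = hpd_interval(draws, mass=0.95)
print(lo, hi, hi - lo)
```

The interval width `hi - lo` is the kind of visually apparent uncertainty that HPD-based measures summarize for one segment of a skyline.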
Funding
This study was funded by the UK Medical Research Council (MRC) and the UK Department for International Development (DFID) under the MRC/DFID Concordat agreement and is also part of the EDCTP2 programme supported by the European Union [grant reference MR/R015600/1]. This work was also supported by the Oxford Martin School.
Supplementary Material
Data available from the Dryad Digital Repository: https://datadryad.org/stash/dataset/doi:10.5061/dryad.1jwstqjs2.
References
- Beerli P., Felsenstein J. 1999. Maximum likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152:763–773.
- Ben-Haim Z., Eldar Y. 2009. A lower bound on the Bayesian MSE based on the optimal bias function. IEEE Trans. Information Theory 55(11):5179–5196.
- Berger J., Bernardo J., Sun D. 2015. Overall objective priors. Bayesian Anal. 10(1):189–221.
- Bouckaert R., Vaughan T., Barido-Sottani J., Duchêne S., Fourment M., Gavryushkina A., Heled J., Jones G., Kühnert D., De Maio N., Matschiner M., Mendes F., Müller N., Ogilvie H., du Plessis L., Popinga A., Rambaut A., Rasmussen D., Siveroni I., Suchard M., Wu C., Xie D., Zhang C., Stadler T., Drummond A. 2019. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15(4):e1006650.
- Brunel N., Nadal J. 1998. Mutual information, Fisher information, and population coding. Neural Comput. 10:1731–1757.
- Cover T., Thomas J. 2006. Elements of information theory. 2nd ed. New Jersey: Wiley.
- Drummond A., Nicholls G., Rodrigo A., Solomon W. 2002. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161:1307–1320.
- Drummond A., Rambaut A., Shapiro B., Pybus O. 2005. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 22:1185–1192.
- Faulkner J., Magee A., Shapiro B., Minin V. 2019. Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories. Biometrics 76:677–690.
- Fink D. 1997. A compendium of conjugate priors. Technical Report, Montana State University.
- Gill M., Lemey P., Faria N., Rambaut A., Shapiro B., Suchard M. 2013. Improving Bayesian population dynamics inference: a coalescent-based model for multiple loci. Mol. Biol. Evol. 30(3):713–724.
- Griffiths R., Tavare S. 1994. Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. B 344:403–410.
- Ho S., Shapiro B. 2011. Skyline-plot methods for estimating demographic history from nucleotide sequences. Mol. Ecol. Resour. 11:423–434.
- Huang W., Zhang K. 2018. Information-theoretic bounds and approximations in neural population coding. Neural Comput. 30(4):885–944.
- Ipsen I., Rehman R. 2008. Perturbation bounds for determinants and characteristic polynomials. SIAM J. Matrix Anal. Appl. 30(2):762–776.
- Kingman J. 1982. On the genealogy of large populations. J. Appl. Probab. 19:27–43.
- Kuhner M., Yamato J., Felsenstein J. 1998. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics 149:429–434.
- Lehmann E., Casella G. 1998. Theory of point estimation. 2nd ed. New York:Springer.
- Li H., Durbin R. 2011. Inference of human population history from individual whole-genome sequences. Nature 475(7357):493–496.
- Minin V., Bloomquist E., Suchard M. 2008. Smooth Skyride through a rough Skyline: Bayesian coalescent-based inference of population dynamics. Mol. Biol. Evol. 25(7):1459–1471.
- Parag K., Donnelly C. 2020. Adaptive estimation for epidemic renewal and phylogenetic Skyline models. Syst. Biol. 69(6):1163–1179.
- Parag K., Pybus O. 2017. Optimal point process filtering and estimation of the Coalescent process. J. Theor. Biol. 421:153–167.
- Parag K., Pybus O. 2019. Robust design for coalescent model inference. Syst. Biol. 68(5):730–743.
- Parag K., du Plessis L., Pybus O. 2020. Jointly inferring the dynamics of population size and sampling intensity from molecular sequences. Mol. Biol. Evol. 37(8):2414–2429.
- Pybus O., Rambaut A., Harvey P. 2000. An integrated framework for the inference of viral population history from reconstructed genealogies. Genetics 155:1429–1437.
- Pybus O., Drummond A., Nakano T., Robertson B., Rambaut A. 2003. The epidemiology and iatrogenic transmission of hepatitis C virus in Egypt: a Bayesian coalescent approach. Mol. Biol. Evol. 20(3):381–387.
- Pyron R., Burbrink F. 2013. Phylogenetic estimates of speciation and extinction rates for testing ecological and evolutionary hypotheses. Trends Ecol. Evol. 28(12):729–736.
- Robert C. 2007. The Bayesian choice. New York:Springer Science and Business Media.
- Rodrigo A., Felsenstein J. 1999. Coalescent approaches to HIV-1 population. The evolution of HIV. Baltimore:Johns Hopkins University Press.
- Rothenberg T. 1971. Identification in parametric models. Econometrica 39(3):577–591.
- Shapiro B., Drummond A., Rambaut A., Wilson M., Matheus P., Sher A., Pybus O., Gilbert M., Barnes I., Binladen J., Willerslev E., Hansen A., Baryshnikov G., Burns J., Davydov S., Driver J., Froese D., Harington C., Keddie G., Kosintsev P., Kunz M., Martin L., Stephenson R., Storer J., Tedford R., Zimov S., Cooper A. 2004. Rise and fall of the Beringian steppe bison. Science 306(5701):1561–1565.
- Slate E. 1994. Parameterizations for natural exponential families with quadratic variance functions. J. Am. Stat. Assoc. 89(428):1471–1481.
- Snyder D., Miller M. 1991. Random point processes in time and space. 2nd ed. New York:Springer.
- Stiller M., Baryshnikov G., Bocherens H., d’Anglade A., Hilpert B., Munzel S., Pinhasi R., Rabeder G., Rosendahl W., Trinkaus E., Hofreiter M., Knapp M. 2010. Withering away-25,000 years of genetic decline preceded cave bear extinction. Mol. Biol. Evol. 27(5):975–978.
- Strimmer K., Pybus O. 2001. Exploring the demographic history of DNA sequences using the generalized skyline plot. Mol. Biol. Evol. 18(12):2298–2305.
- Thomas J., Carvalho G., Haile J., Rawlence N., Martin M., Ho S., Sigfusson A., Josefsson V., Frederiksen M., Linnebjerg J., Castruita J., Niemann J., Sinding M., Sandoval-Velasco M., Soares A., Lacy R., Barilaro C., Best J., Brandis D., Cavallo C., Elorza M., Garrett K., Groot M., Johansson F., Lifjeld J., Nilson G., Serjeanston D., Sweet P., Fuller E., Hufthammer A., Meldgaard M., Fjeldsa J., Shapiro B., Hofreiter M., Stewart J., Gilbert M., Knapp M. 2019. Demographic reconstruction from ancient DNA supports rapid extinction of the great auk. eLife 8:e47509.
- Tichavsky P., Muravchik C., Nehorai A. 1998. Posterior Cramer-Rao bounds for discrete-time nonlinear filtering. IEEE Trans. Signal Process. 46(5):1386–1395.
- van Trees H. 1968. Detection, estimation, and modulation theory, Part I. New Jersey:Wiley.
- Vaughan T., Drummond A. 2013. A stochastic simulator of birth–death master equations with application to phylodynamics. Mol. Biol. Evol. 30(6):1480–1493.
- Wakeley J. 2008. Coalescent theory: an introduction. Colorado:Roberts and Company Publishers.