Philosophical Transactions of the Royal Society B: Biological Sciences. 2025 Feb 20;380(1919):20230310. doi: 10.1098/rstb.2023.0310

Leveraging graphical model techniques to study evolution on phylogenetic networks

Benjamin Teo 1, Paul Bastide 2, Cécile Ané 1,3
PMCID: PMC11867149  PMID: 39976402

Abstract

The evolution of molecular and phenotypic traits is commonly modelled using Markov processes along a phylogeny. This phylogeny can be a tree, or a network if it includes reticulations, representing events such as hybridization or admixture. Computing the likelihood of data observed at the leaves becomes costly as the size and complexity of the phylogeny grow. Efficient algorithms exist for trees, but cannot be applied to networks. We show that a vast array of models for trait evolution along phylogenetic networks can be reformulated as graphical models, for which efficient belief propagation algorithms exist. We provide a brief review of belief propagation on general graphical models, then focus on linear Gaussian models for continuous traits. We show how belief propagation techniques can be applied for exact or approximate (but more scalable) likelihood and gradient calculations, and prove novel results for efficient parameter inference of some models. We highlight the possible fruitful interactions between graphical models and phylogenetic methods. For example, approximate likelihood approaches have the potential to greatly reduce computational costs for phylogenies with reticulations.

This article is part of the theme issue ‘“A mathematical theory of evolution”: phylogenetic models dating back 100 years’.

Keywords: belief propagation, cluster graph, admixture graph, trait evolution, Brownian motion, linear Gaussian

1. Introduction

Stochastic processes are used to model the evolution of traits over time along a phylogeny, a graph representing the historical relationships between species, populations or individuals of interest, in which internal nodes represent divergence (e.g. speciation) or merging (e.g. introgression) events. In this work, we consider traits that may be multivariate, discrete and/or continuous, with a focus on continuous traits. Trait evolution models are used to infer evolutionary dynamics [1,2] and historical correlation between traits [3–5], predict unobserved traits at ancestral nodes or extant leaves [6,7], or estimate phylogenies from rich datasets [8–12].

Calculating the likelihood is no easy task because the traits at ancestral nodes are unobserved and need to be integrated out. This problem is very well studied for phylogenetic trees, with efficient solutions for both discrete and continuous traits [13–18]. Admixture graphs and phylogenetic networks with reticulations are now gaining traction due to growing empirical evidence for gene flow, hybridization and admixture [11,19,20]. Yet many methods for these networks could be improved with more efficient likelihood calculations.

The vast majority of phylogenetic models make a Markov assumption, in that the trait distribution at all nodes can be expressed by a set of local models. At the root, this model describes the prior distribution of the ancestral trait. For each node in the phylogeny, a local transition model describes the trait distribution at this node conditional on the trait(s) at its parent node(s). As each local model can be specified individually with its own set of parameters, the overall evolutionary model can be very flexible, including possible shifts in rates, constraints and modes of evolution across different clades [21,22]. Other models do not make a Markov assumption, such as threshold models [23,24], or models that combine a backwards-in-time coalescent process for gene trees and forward-in-time mutation process along gene trees [25]. We show here that some of these models can still be expressed as a product of local conditional distributions, over a graph that is more complex than the initial phylogeny.

These evolutionary models are special cases of graphical models, also known as Bayesian networks, which have been heavily studied [26]. The likelihood calculation task has received a lot of attention, including algorithms for efficient approximations when the network is too complex to calculate the likelihood exactly [26]. Another well-studied task is that of predicting the state of unobserved variables (ancestral states in phylogenetics) conditional on the observed data. We argue here that the field of phylogenetics could greatly benefit from applying and expanding knowledge from graphical models for the study and use of phylogenetic networks.

In §2, we review the challenge brought by phylogenetic models in which only tip data are observed and current techniques for efficient likelihood calculations. In §3, we focus on general Gaussian models for the evolution of a continuous trait, possibly multivariate to capture evolutionary correlations between traits. On reticulate phylogenies, these models need to describe the trait of admixed populations conditional on their parental populations. Turning to graphical models in §4, we describe their general formulation and show that many phylogenetic models can be expressed as special cases, from known examples to less obvious examples (using the coalescent process on species trees or species networks). We then provide a short review of belief propagation, a core technique to perform inference on graphical models, first in its general form and then specialized for continuous traits in linear Gaussian models. In §5, we describe loopy belief propagation, a technique to perform approximate inference in graphical models when exact inference does not scale. As far as we know, loopy belief propagation has never been used in phylogenetics. Section 6 describes leveraging belief propagation (BP) for parameter inference: fast calculations of the likelihood and its gradient can be used in any likelihood-based framework, frequentist or Bayesian. Finally, §7 discusses future challenges for the application and extension of graphical model techniques in phylogenetics. These techniques offer a range of avenues to expand the phylogeneticists’ toolbox for fitting evolutionary models on phylogenetic networks, from approximate inference methods that are more scalable, to algorithms for fast gradient computation for better parameter inference.

2. Complexity of the phylogenetic likelihood calculation

(a). The pruning algorithm

Felsenstein’s pruning algorithm [13,27] launched the era of model-based phylogenetic inference, now rich with complex models to account for a large array of biological processes including DNA and protein substitution models, variation in their substitution rates across genomic loci, lineages and time, and evolutionary models for continuous traits and geographic distributions. The pruning algorithm gave the key to calculating the likelihood of these models along a phylogenetic tree in a practically feasible way. The basis of this algorithm, which extends to tasks beyond likelihood calculation, was discovered in other areas and given other names, such as the sum–product algorithm, forward–backward algorithm, message passing and BP. In particular, the field of statistical human genetics saw the early development of such algorithms for models on pedigrees [28–30], including pedigrees with loops [31,32]. See section 9.8 in Koller & Friedman [26] for a review of the literature on variable elimination algorithms.

The pruning algorithm, which is a form of BP, computes the full likelihood of all the observed taxa by traversing the phylogenetic tree once, taking advantage of the Markov property, by which the evolution of the trait of interest along a daughter lineage is independent of its past evolution, given knowledge of the parent’s state. The idea is to traverse the tree and calculate the likelihood of the descendant leaves of an ancestral species conditional on its state, from similar likelihoods calculated for each of its children. If the trait is discrete with four states, for example (as for DNA), then this entails keeping track of four likelihood values at each ancestral species. If the trait is continuous with a Gaussian distribution, e.g. from a Brownian motion (BM) or an Ornstein–Uhlenbeck (OU) process [33], then the likelihood at an ancestral species is a function of its state that can be concisely parametrized by quantities akin to the posterior mean and variance conditional on descendant leaves. Felsenstein’s independent contrasts (IC) [34] also capture these partial posterior quantities and can be viewed as a special implementation of BP for likelihood calculation.

BP is used ubiquitously for the analysis of discrete traits, such as for DNA substitution models (e.g. in RAxML [35], IQ-TREE [36], MrBayes [37]) or for discrete morphological traits in comparative methods (e.g. in phytools [38], BayesTraits [39], corHMM [40,41], RevBayes [42]). For discrete traits, there is simply no feasible alternative. On a tree with 20 taxa and 19 ancestral species, the naive calculation of the likelihood at a given DNA site would require the calculation and summation of $4^{19}$, or about 274 billion, likelihoods: one for each nucleotide assignment at the 19 ancestral species. This calculation would need to be repeated for each site in the alignment, then repeated all over again during the search for a well-fitting phylogenetic tree.
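The contrast between the naive sum and the pruning recursion can be sketched on a toy example. The code below is only an illustration under stated assumptions: the three-taxon tree, its branch lengths, the tip states and the choice of the Jukes–Cantor substitution model are all hypothetical and are not taken from the text. It computes the site likelihood once by enumerating every assignment of states to ancestral nodes and once by a single post-order pass.

```python
import itertools
import math

STATES = range(4)  # nucleotides A, C, G, T encoded as 0..3

def jc69(t):
    """Jukes-Cantor transition probabilities for a branch of length t (assumed model)."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in STATES] for i in STATES]

# Hypothetical rooted tree: internal node -> list of (child, branch length).
CHILDREN = {"root": [("u", 0.1), ("C", 0.3)], "u": [("A", 0.2), ("B", 0.2)]}
TIPS = {"A": 0, "B": 0, "C": 2}  # hypothetical observed states at one site

def partial(node):
    """Post-order pass: entry s is P(tip data below node | state of node = s)."""
    if node in TIPS:
        return [1.0 if s == TIPS[node] else 0.0 for s in STATES]
    like = [1.0] * 4
    for child, t in CHILDREN[node]:
        P, below = jc69(t), partial(child)
        for s in STATES:
            like[s] *= sum(P[s][c] * below[c] for c in STATES)
    return like

def pruning_likelihood():
    """Likelihood with a uniform root distribution, using one tree traversal."""
    return sum(0.25 * value for value in partial("root"))

def brute_force_likelihood():
    """Same likelihood, summing over all 4^(number of internal nodes) assignments."""
    internal = list(CHILDREN)
    total = 0.0
    for assignment in itertools.product(STATES, repeat=len(internal)):
        state = dict(TIPS, **dict(zip(internal, assignment)))
        prob = 0.25
        for parent, kids in CHILDREN.items():
            for child, t in kids:
                prob *= jc69(t)[state[parent]][state[child]]
        total += prob
    return total
```

With two ancestral nodes the brute-force sum has only 16 terms, but the number of terms grows as 4 to the power of the number of ancestral nodes, whereas the work done by `partial` grows linearly with the number of nodes.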

(b). Continuous traits on trees: the lazy way

For continuous traits under a Gaussian model (including the Brownian motion), BP is not used as ubiquitously because a multivariate Gaussian distribution is fully captured by its mean and covariance matrix: the multivariate Gaussian formula can serve as an alternative. For example, for one trait $Y$ with ancestral state $\mu$ at the root of the phylogeny, the phylogenetic covariance $\Sigma$ between the taxa at the leaves can be obtained from the branch lengths in the tree. Under a BM, the covariance $\mathrm{cov}(Y_i, Y_j)$ between taxa $i$ and $j$ is $\Sigma_{ij} = \sigma^2 t_{ij}$, where $t_{ij}$ is the length between the root and their most recent common ancestor. The likelihood of the observed traits at the $n$ leaves can then be calculated using matrix and vector multiplication techniques as

$$(2\pi)^{-n/2}\,\det(\Sigma)^{-1/2}\exp\left(-\tfrac{1}{2}(Y-\mu)^{\top}\Sigma^{-1}(Y-\mu)\right). \tag{2.1}$$

This alternative to BP has the disadvantage of requiring the inversion of the covariance matrix $\Sigma$, a task whose computing time typically grows as $m^3$ for a matrix of size $m \times m$. It also has the disadvantage that $\Sigma$ needs to be calculated and stored in memory in the first place. For multivariate observations of $p$ traits on each of $n$ taxa, the covariance matrix has size $m = pn$, so the typical calculation cost of (2.1) is then $O(p^3 n^3)$, which can quickly become very large. For example, with only 30 taxa and 10 traits, $\Sigma$ is a 300×300 matrix. Studies with large $p$ and/or large $n$ are now frequent, especially from geometric morphometric data with $p$ typically over 100 (e.g. [43]) or with expression data on $p>1000$ genes, which also require more complex models to account for variation (e.g. within species, between organs, between batches [44,45]). Studies with a large number $n$ of taxa are now frequent (e.g. $n>5000$ in birds and mammals [46,47]) and virus phylogenies can be massive (e.g. $n>500000$ SARS-CoV-2 strains [48]). Viral continuous traits previously studied include virulence traits (e.g. $n>1000$ and $p=3$ traits in HIV [49,50]) and geographic data for phylodynamics (e.g. $n=801$ and $p=2$ continuous coordinates to describe the spread of the West Nile virus [2]).
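The matrix-based calculation of (2.1) can be sketched as follows for a single trait under a BM. This is a minimal illustration, not any package's implementation: the shared-time matrix `SHARED_TIME`, the trait values `Y` and the helper names are hypothetical, and a hand-written Cholesky factorization stands in for a linear algebra library. The cubic cost mentioned above comes from the factorization step.

```python
import math

def bm_covariance(shared_time, sigma2):
    """Phylogenetic covariance under a BM: Sigma_ij = sigma2 * t_ij."""
    return [[sigma2 * t for t in row] for row in shared_time]

def cholesky(a):
    """Lower-triangular L with L L^T = a; the O(m^3) step of the dense approach."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def gaussian_loglik(y, mu, cov):
    """Logarithm of (2.1) for trait values y, root state mu and covariance cov."""
    n = len(y)
    L = cholesky(cov)
    z = []  # forward substitution: solve L z = y - mu
    for i in range(n):
        z.append((y[i] - mu - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + sum(v * v for v in z))

# Hypothetical ultrametric tree ((A:0.2,B:0.2):0.3,C:0.5): t_ij is the time
# from the root to the most recent common ancestor of taxa i and j.
SHARED_TIME = [[0.5, 0.3, 0.0],
               [0.3, 0.5, 0.0],
               [0.0, 0.0, 0.5]]
Y = [1.0, 1.4, -0.3]  # hypothetical trait values for A, B, C
```

For instance, `gaussian_loglik(Y, 0.5, bm_covariance(SHARED_TIME, 2.0))` evaluates the log-likelihood for a variance rate of 2 and a root state of 0.5, both arbitrary values chosen for illustration.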

In these cases with large data size $np$, the matrix-based alternative to BP is prone to numerical inaccuracy and numerical instability in addition to the increased computational time, because it is hard to accurately invert a large matrix. Even when the matrix is of moderate size, numerical inaccuracy can arise when the matrix is ‘ill-conditioned’. These problems were identified under OU models on phylogenetic trees that have closely related sister taxa, or under early burst (EB) models with strong morphological diversification early on during the group radiation, and much slowed-down evolution later on [51–53].

For some simple models, the large $np \times np$ covariance matrix can be decomposed as a Kronecker product of a $p \times p$ trait covariance and an $n \times n$ phylogenetic covariance. This decomposition can reduce the complexity of calculating the likelihood. However, this decomposition is not available under many models, such as the multivariate Brownian motion with shifts in the evolutionary rates (e.g. [54]) or the multivariate Ornstein–Uhlenbeck model with non-scalar rate or selection matrices [21,55].
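A brief worked identity may clarify why this decomposition helps. Assume, as a notational choice that the text does not fix, that the data are ordered taxon by taxon so that $\Sigma = C \otimes R$, with $C$ the $n \times n$ phylogenetic covariance and $R$ the $p \times p$ trait covariance. Standard properties of the Kronecker product then give

$$\Sigma^{-1} = C^{-1} \otimes R^{-1}, \qquad \det \Sigma = (\det C)^{p}\,(\det R)^{n},$$

so the inversions and determinants needed in (2.1) involve only an $n \times n$ and a $p \times p$ matrix, at a cost on the order of $n^3 + p^3$ rather than $p^3 n^3$.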

(c). Belief propagation for continuous traits on trees

To bypass the complexity of matrix inversion, Felsenstein pioneered IC to test for phylogenetic correlation between traits, assuming a BM model on a tree [34]. Many authors then used BP approaches to handle Gaussian models beyond the BM [14,15,17]. Notably, Ho & Ané [16] describe a fast algorithm that can be used for non-Gaussian models as well. Most recently, Mitov et al. [18] highlighted that BP can be applied to a large class of Gaussian models including the BM and the OU process with shifts and variation of rates and selection regimes across branches. Software packages that use these fast BP algorithms include phylolm [16], Rphylopars [17], BEAST [56] or the most recent versions of hOUwie [41] and mvSLOUCH [53].

All the methods cited above only use the first post-order tree traversal of BP to compute the likelihood. A second pre-order traversal allows, in the Gaussian case, for the computation of the distribution of all internal nodes conditionally on the model and on the trait values at the tips. These distributions can then be used for, e.g. ancestral state reconstruction [7], expectation–maximization algorithms for shift detection in the optimal values of an OU [57], or the computation of the gradient of the likelihood in the BM [4,58] or general Gaussian model [59]. Such BP techniques have also been used for taking gradients of the likelihood with respect to branch lengths in sequence evolution models [60,61] or for phylogenetic factor analysis [62,63].
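The post-order traversal can be sketched for a univariate BM on a tree with a fixed root state. The code below is an illustrative sketch under stated assumptions, not the algorithm of any package cited above: the tree encoding, the function names and the numerical values are hypothetical. Each node sends its parent a message summarized by three numbers, and no covariance matrix is ever built or inverted.

```python
import math

def log_normal(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def message(node, children, tips, sigma2):
    """Return (m, v, logc) such that the likelihood of the tips below `node`,
    as a function of the state x at `node`, is exp(logc) * Normal(x; m, v).
    A tip returns v = 0 because its state is observed exactly."""
    if node in tips:
        return tips[node], 0.0, 0.0
    m, v, logc = None, None, 0.0
    for child, length in children[node]:
        mc, vc, lc = message(child, children, tips, sigma2)
        vc += sigma2 * length  # integrate out the child's state along its edge
        logc += lc
        if m is None:
            m, v = mc, vc
        else:
            # Product of two Gaussian functions of x: a constant times a Gaussian.
            logc += log_normal(m, mc, v + vc)
            m, v = (m * vc + mc * v) / (v + vc), v * vc / (v + vc)
    return m, v, logc

def bm_loglik(root, children, tips, sigma2, mu):
    """Log-likelihood of the tip data when the root state is fixed at mu."""
    m, v, logc = message(root, children, tips, sigma2)
    return logc + log_normal(mu, m, v)

# Hypothetical tree ((A:0.2,B:0.2):0.3,C:0.5) and trait values.
CHILDREN = {"root": [("u", 0.3), ("C", 0.5)], "u": [("A", 0.2), ("B", 0.2)]}
TIPS = {"A": 1.0, "B": 1.4, "C": -0.3}
```

The pair (m, v) at each node plays the role of the partial posterior mean and variance mentioned earlier, and the running constant `logc` accumulates the likelihood, so the cost is linear in the number of nodes.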

(d). From trees to networks

So far, Felsenstein’s pruning algorithm and related BP approaches have mostly been restricted to phylogenetic trees. There is now ample evidence that reticulation is ubiquitous in all domains of life from biological processes such as lateral gene transfer, hybridization, introgression and gene flow between populations. Networks are recognized to be better than trees for representing the phylogenetic history of species and populations in many groups. Although current studies using networks have few taxa, typically between 10 and 20 (e.g. [12]), they tend to have increasingly more tips as network inference methods become more scalable (e.g. n=39 languages in [10]). As viruses are known to be affected by recombination, we also expect future virus studies to use large network phylogenies [64], so that BP will become essential for network studies too. In this work, we describe approaches currently used for trait evolution on phylogenetic networks. We argue that the field of evolutionary biology would benefit from applying BP approaches to networks more systematically. Transferring knowledge from the mature and rich literature on BP would advance evolutionary biology research when phylogenetic networks are used.

(e). Current network approaches for discrete traits

For discrete traits on general networks, very few approaches use BP techniques as far as we know. For DNA data, for example, PhyLiNC [65] and NetRAX [66] extend the typical tree-based model to general networks, assuming no incomplete lineage sorting. That is, each site is assumed to evolve along one of the trees displayed in the network, chosen according to inheritance probabilities at reticulate edges. PhyLiNC assumes independent (unlinked) sites. NetRAX assumes independent loci, which may have a single site each. Each locus may have its own set of branch lengths and substitution model parameters. Both methods calculate the likelihood of a network $N$ via extracting its displayed trees and then applying BP on each tree. Similarly, comparative methods for binary and multi-state traits implemented in PhyloNetworks also extract displayed trees and then apply BP on each displayed tree [67]. While these approaches use BP on each displayed tree, a network with $h$ reticulations can have up to $2^h$ displayed trees. This leads to a computational bottleneck when the number of reticulations increases.
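The displayed-tree strategy can be sketched as a mixture over parent choices at the hybrid nodes. The code below is a hypothetical illustration and does not reproduce the interface of PhyLiNC, NetRAX or PhyloNetworks: the input format and the callback `tree_likelihood`, which stands for a tree-based BP routine, are assumptions made for the example.

```python
import itertools

def displayed_trees(hybrid_parents):
    """Enumerate parent choices at hybrid nodes.

    hybrid_parents maps each hybrid node to {parent: inheritance probability}.
    Yields (choice, probability), where choice maps each hybrid node to the
    single parent retained in the corresponding displayed tree.
    """
    hybrids = sorted(hybrid_parents)
    options = [sorted(hybrid_parents[h].items()) for h in hybrids]
    for combo in itertools.product(*options):
        prob = 1.0
        for _, gamma in combo:
            prob *= gamma
        yield {h: parent for h, (parent, _) in zip(hybrids, combo)}, prob

def network_likelihood(hybrid_parents, tree_likelihood):
    """Mixture over displayed trees: sum of P(tree) * likelihood(tree)."""
    return sum(prob * tree_likelihood(choice)
               for choice, prob in displayed_trees(hybrid_parents))

# Hypothetical network with two reticulations, hence 2^2 = 4 parent choices.
EXAMPLE = {"h1": {"a": 0.4, "b": 0.6}, "h2": {"c": 0.9, "d": 0.1}}
```

The loop in `network_likelihood` runs once per combination of parent choices, which is the source of the exponential cost in the number of reticulations; distinct choices may also lead to the same tree topology, which is why the count is an upper bound.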

BP approaches have also been used for models with incomplete lineage sorting, modelled by the coalescent [68]. Notably, SNAPP models the evolution of unlinked biallelic markers along a species tree, accounting for incomplete lineage sorting [25]. This method was recently made faster with SNAPPER [69] and extended to phylogenetic networks with SnappNet [70]. The coalescent process introduces the challenge that each site may evolve along any tree, depending on past coalescent events. SNAPP introduced a way to bypass the difficulties of handling coalescent histories and hence decrease computation time. After we describe BP for general graphical models, we recast this innovation as BP on a graphical model formulation of the problem.

BP was also used to calculate the likelihood of the joint sample frequency spectrum (SFS). To account for incomplete lineage sorting on a tree, Kamm et al. [71] use the continuous-time Moran model to reduce computational complexity and assume that each site undergoes at most one mutation. In momi2, Kamm et al. [72] extend the approach to phylogenetic networks by assuming a pulse of admixture at reticulations. The associated graphical model is much simpler than that required by SNAPP or SnappNet, thanks to the assumption of no recurrent mutation.

(f). Current network approaches for continuous traits

Compared with the rich toolkit available for the analysis of continuous traits on trees, the toolkit for phylogenetic networks is still limited. PhyloNetworks includes comparative methods on networks [73], implemented in Julia [74]. These methods extend phylogenetic ANOVA to networks for a continuous response trait predicted by any number of continuous or categorical traits, with residual variation being phylogenetically correlated. So far, the models available in PhyloNetworks include the BM, Pagel’s λ, possible within-species variation and shifts at reticulations to model transgressive evolution [75,76]. However, all calculations are based on working with the full covariance matrix, without BP. TreeMix [77], ADMIXTOOLS [11,78], poolfstat [79] and AdmixtureBayes [12] use allele frequency as a continuous trait. They model its evolution along a network, or admixture graph, using a Gaussian model in which the evolutionary rate variance is affected by the ancestral allele frequency [80,81]. Again, these methods work with the phylogenetic covariance matrix, rather than BP approaches. They also consider subsets of up to four taxa at a time via f2, f3 and f4 statistics, which simplifies the likelihood calculation. To identify selection and adaptation on a network, PolyGraph [82] and GRoSS [83] assume a similar model and use the full covariance matrix. In summary, BP has yet to be used for continuous trait evolution on networks.

3. Continuous trait evolution on a phylogenetic network

We now present phylogenetic models for the evolution of continuous traits, to which we apply BP later. We generalize the framework in Mitov et al. [18] and Bastide et al. [59] from trees to networks, and we extend the network model in Bastide et al. [75] from the BM to more general evolutionary models. We consider a multivariate X consisting of p continuous traits and model their correlation over time. Our model ignores the potential effects of incomplete lineage sorting on X, a reasonable assumption for highly polygenic traits.

(a). Linear Gaussian models

Most random processes used to model continuous trait evolution on a phylogenetic tree are extensions of the BM to capture processes such as evolutionary trends, adaptation and variation in rates across lineages, for example. In its most general form, the linear Gaussian evolutionary model on a tree (referred to as the GLInv family in [18]) assumes that the trait $X_v$ at node $v$ has the following distribution conditional on its parent $\mathrm{pa}(v)$:

$$X_v \mid X_{\mathrm{pa}(v)} \sim \mathcal{N}\left(q_v X_{\mathrm{pa}(v)} + \omega_v,\; V_v\right) \tag{3.1}$$

where the actualization matrix $q_v$, the trend vector $\omega_v$ and the covariance matrix $V_v$ are appropriately sized and do not depend on trait values $X_{\mathrm{pa}(v)}$. When the tree is replaced by a network, a node $v$ can have multiple parents $\mathrm{pa}(v)$. In this case, we can write $X_{\mathrm{pa}(v)}$ as the vector formed by stacking the elements of $\{X_u;\ u \in \mathrm{pa}(v)\}$ vertically, with length equal to the number of traits times the number of parents of $v$. In the following, we show that (3.1), already used on trees, can easily be extended to networks to describe both evolutionary models along one lineage and a merging rule at reticulation events.

(b). Evolutionary models along one lineage

For a tree node $v$ with parent node $u$, we need to describe the evolutionary process along one lineage, graphically modelled by the tree edge $e=(u,v)$. It is well known that a wide range of evolutionary models can fit in the general form (3.1) [18,59]. For instance, the BM with variance rate $\Sigma$ (a variance–covariance matrix for a multivariate trait) is described by (3.1) where $q_v$ is the $p \times p$ identity matrix $I_p$, there is no trend ($\omega_v=0$), and the variance is proportional to the edge length $\ell(e)$: $V_v=\ell(e)\Sigma$.

Allowing for rate variation amounts to letting the variance rate vary across edges Σ=Σ(e). For example, the early burst (EB) model assumes that the variance rate at any given point in the phylogeny depends on the time t from the root to that point, as:

$$\Sigma(t)=\Sigma_0\, e^{bt}.$$

For this t to be well-defined on a reticulate network, the network needs to be time-consistent (distinct paths from the root to a node all share the same length). The rate b is a rate of variance decay if it is negative, to be expected during adaptive radiations, with a burst of variation near the root (hence early burst) before a slow-down of trait evolution [84]. When b>0, this model is called ‘accelerating rate’ (AC) [85]. Clavel & Morlon [86] used a flexible extension of this model (on a tree), replacing t by one or more covariates that are known functions of time, such as the average global temperature and other environmental variables:

$$\Sigma(t)=\tilde{\sigma}\left(t, T_1(t), \ldots, T_k(t)\right).$$

Then, the variance accumulated along edge e=(u,v) is given by

$$V_v=\int_{t(u)}^{t(v)}\Sigma(t)\,dt.$$

In the particular case of the EB model, we get

$$V_v=\Sigma_0\, e^{b\,t(u)}\left(e^{b\,\ell(e)}-1\right)/b.$$
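The closed form for the EB model can be checked numerically against the integral of the variance rate along an edge. The following sketch treats a univariate trait; the function names and parameter values are hypothetical and chosen only for illustration.

```python
import math

def eb_edge_variance(sigma0_sq, b, t_start, length):
    """Closed-form variance accumulated under the EB model along an edge
    that starts at time t_start (measured from the root) and has the given length."""
    if b == 0.0:
        return sigma0_sq * length  # the EB model reduces to a BM when b = 0
    return sigma0_sq * math.exp(b * t_start) * math.expm1(b * length) / b

def eb_edge_variance_numeric(sigma0_sq, b, t_start, length, steps=10000):
    """Midpoint-rule approximation of the integral of sigma0_sq * exp(b t)."""
    h = length / steps
    return sum(sigma0_sq * math.exp(b * (t_start + (i + 0.5) * h)) * h
               for i in range(steps))
```

With a negative rate such as `b = -1.5`, an edge near the root accumulates more variance than an edge of the same length further from the root, which is the early burst pattern described above. The variances of two consecutive edges also add up to the variance of the merged edge, as expected from the integral form.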

Allowing for shifts in the trait value, perhaps due to jumps or cladogenesis, amounts to including $\omega_v \neq 0$ for some $v$.

Adaptive evolution is typically modelled by the OU process, which includes a parameter $A_e$ for the strength of selection along edge $e$. This selection strength is often assumed constant across edges, and is typically denoted as $\alpha$ for a univariate trait. The OU process also includes a primary optimum value $\theta_e$, which may vary across edges when we are interested in detecting shifts in the adaptive regime across the phylogeny. Under the OU model, the trait evolves along edge $e$ with random drift and a tendency towards $\theta_e$:

$$dX^{(e)}(t)=A_e\left(\theta_e-X^{(e)}(t)\right)dt+R_e\,dB(t)$$

where $B$ is a standard BM and the drift variance is $\Sigma_e=R_eR_e^{\top}$. Then, conditional on the value at the start of $e$, the end value $X_v$ is linear Gaussian as in (3.1) with actualization $q_v=e^{-\ell(e)A_e}$, trend $\omega_v=\left(I-e^{-\ell(e)A_e}\right)\theta_e$ and variance

$$V_v=\int_0^{\ell(e)}e^{-sA_e}\,\Sigma_e\,e^{-sA_e^{\top}}\,ds=S_e-e^{-\ell(e)A_e}\,S_e\,e^{-\ell(e)A_e^{\top}}$$

where $S_e$ is the stationary variance matrix. These equations simplify greatly if $A_e$ and $\Sigma_e$ commute, such as if $A_e$ is scalar of the form $\alpha_e I_p$, including when the process is univariate. In this case,

$$V_v=\left(1-e^{-2\alpha\ell(e)}\right)\Sigma_e/(2\alpha).$$

Shifts in adaptive regimes can be modelled by shifts in any of the parameters $\theta_e$, $A_e$ or $\Sigma_e$ across edges.
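For a univariate trait, the OU quantities entering (3.1) can be sketched in a few lines. The code below is an illustrative sketch with hypothetical function names and parameter values; it also composes the transitions of two consecutive edges, which under constant parameters should agree with the transition along the merged edge.

```python
import math

def ou_transition(alpha, theta, sigma2, length):
    """Return (q, omega, V) of (3.1) for a univariate OU process along one edge."""
    q = math.exp(-alpha * length)                      # actualization
    omega = (1.0 - q) * theta                          # trend towards the optimum
    V = -math.expm1(-2.0 * alpha * length) * sigma2 / (2.0 * alpha)
    return q, omega, V

def compose(first, second):
    """Transition over two consecutive edges, with `first` closer to the root."""
    q1, w1, V1 = first
    q2, w2, V2 = second
    return q2 * q1, q2 * w1 + w2, q2 * q2 * V1 + V2

# Hypothetical parameter values: selection strength, optimum and drift variance.
ALPHA, THETA, SIGMA2 = 0.8, 3.0, 1.5
```

As the selection strength approaches zero, the variance returned by `ou_transition` approaches `sigma2 * length`, the BM variance along the edge.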

Finally, variation within species, including measurement error, can be easily modelled by grafting one or more edges at each species node to model the fact that the measurement taken from an individual may differ from the true species mean. The model for within-species variation, then, should also follow (3.1), by which an individual value is assumed to be normally distributed with a mean that depends linearly on the species mean, and a variance independent of the species mean, although this variance can vary across species. Most typically, observations from species $v$ are modelled using $q=I_p$, $\omega=0$ and some phenotypic variance to be estimated, which may or may not be tied to the evolutionary variance parameter from the phylogenetic model across species. This additional observation layer can also be used for factor analysis, where the unobserved latent trait evolving on the network has a smaller dimension than the observed traits. In that case, $q$ is a rectangular matrix that represents the loading matrix [62,63].

(c). Evolutionary models at reticulations

For a continuous trait and a hybrid node $h$, Bastide et al. [75] and Pickrell & Pritchard [77] assumed that $X_h$ is a weighted average of its immediate parents, using their state immediately before the reticulation event. Specifically, if $h$ has parent edges $e_1,\ldots,e_m$, and if we denote by $X_{e_k}$ the state at the end of edge $e_k$ right before the reticulation event ($1\le k\le m$), then the weighted-average model assumes that

$$X_h=\sum_{e_k \text{ parent of } h}\gamma(e_k)\,X_{e_k}. \tag{3.2}$$

This model is a reasonable null model for polygenic traits, reflecting the typical observation that hybrid species show intermediate phenotypes. In this model, the biological process underlying the reticulation event (such as gene flow versus hybrid speciation) does not need to be known. Only the proportion of the genome inherited from each parent, $\gamma(e_k)$, needs to be known. Compared with the evolutionary timescale of the phylogeny, the reticulation event is assumed to be instantaneous.

To describe this process as a graphical model, we may add a degree-2 node at the end of each hybrid edge $e$ to store the value $X_e$, so as to separate the description of the evolutionary process along each edge from the description of the process at a reticulation event. With these extra degree-2 nodes, the weighted-average model (3.2) corresponds to the linear Gaussian model (3.1) with no trend ($\omega_h=0$), no variance ($V_h=0$) and with actualization $q_h=[\gamma(e_1)I_p \;\cdots\; \gamma(e_m)I_p]$ made of scalar diagonal blocks.
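The effect of the weighted-average rule on variances and covariances can be sketched for a univariate BM with a fixed root state. The recursion below is a minimal illustration under stated assumptions, not the implementation of any package cited in the text: the input format, the function name and the toy network are hypothetical. Nodes are visited in topological order, and each node's covariances are obtained from those of its parents.

```python
def network_bm_covariance(nodes, parents, sigma2):
    """Covariances between node states under a BM with weighted-average hybrids.

    nodes: node names in topological order (every parent before its children).
    parents: {node: [(parent, edge length, inheritance weight gamma)]};
             the root has an empty list, a tree node has one parent with gamma = 1.
    Returns a dict keyed by pairs of node names.
    """
    cov = {}
    for i, v in enumerate(nodes):
        pa = parents[v]
        # Covariance with any earlier node: weighted sum over the parents of v.
        for w in nodes[:i]:
            cov[v, w] = cov[w, v] = sum(g * cov[u, w] for u, _, g in pa)
        # Variance: contribution of the parents plus independent edge increments.
        cov[v, v] = (sum(g1 * g2 * cov[u1, u2]
                         for u1, _, g1 in pa for u2, _, g2 in pa)
                     + sum(g * g * sigma2 * length for _, length, g in pa))
    return cov

# Hypothetical network: a and b descend from the root, h is their hybrid.
NODES = ["root", "a", "b", "h"]
PARENTS = {"root": [],
           "a": [("root", 1.0, 1.0)],
           "b": [("root", 1.0, 1.0)],
           "h": [("a", 0.5, 0.4), ("b", 0.5, 0.6)]}
```

In this toy network with a unit variance rate, the hybrid node has variance 0.78, lower than the variance of 1.5 it would have as a tree node below either parent, because averaging two independent lineages reduces variance.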

Several extensions of this hybrid model can be considered. Bastide et al. [75] modelled transgressive evolution with a shift $\omega_h\neq 0$ for the hybrid population to differ from the weighted average of its immediate parents, even possibly taking a value outside their range. Jhwueng & O’Meara [87] considered transgressive shifts at each hybrid node as random variables with a common variance, corresponding to a model with $\omega_h=0$ but non-zero variance $V_h$.

More generally, we may consider models in which the hybrid value is any linear combination of its immediate parents $q_v X_{\mathrm{pa}(v)}$ as in (3.1). A biologically relevant model could consider $q_v$ to be diagonal, with, on the diagonal, parental weights $\gamma(e,j)$ that may depend on the trait $j$ instead of being shared across all $p$ traits.

We may also consider both a fixed transgressive shift $\omega_h\neq 0$ and an additional hybrid variance $V_h$. For both of these components to be identifiable in the typical case when we observe a unique realization of the trait evolution, the model would need extra assumptions to induce sparsity. For example, we may assume that $V_h$ is shared across all reticulations and is given an informative prior to capture small variations around the parental weighted average. We may also need a sparse model on the set of $\omega_h$ parameters, e.g. letting $\omega_h\neq 0$ only at a few candidate reticulations $h$, chosen based on external domain knowledge.

For a continuous trait known to be controlled by a single gene, we may prefer a model similar to the discrete trait model presented later in example 4.2, by which $X_h$ takes the value $X_e$ of one of its immediate parents with probability $\gamma(e)$. This model would no longer be linear Gaussian unless we condition on which parent is being inherited at each reticulation. Such conditioning would reduce the phylogeny to one of its displayed trees, but it would require other techniques to integrate over all parental assignments to each hybrid population, such as Markov chain Monte Carlo or expectation–maximization.

(d). Evolutionary models with interacting populations

Models have been proposed in which the evolution of $X^{(e)}(t)$ along one edge $e$ depends on the state on other edges existing at the same time $t$ [88–91]. These models can describe ‘phenotype matching’ that may arise from ecological interactions (mutualism, competition) or demographic interactions (migration), in which traits across species or populations converge to or diverge from one another. To express this coevolution, we consider the set $\mathcal{E}(t)$ of edges contemporary to one another at time $t$ and divide the phylogeny into epochs: time intervals $[\tau_i,\tau_{i+1}]$ during which the set $\mathcal{E}(t)$ of interacting lineages is constant, denoted as $\mathcal{E}_i$. Within each epoch $i$ (i.e. $t\in[\tau_i,\tau_{i+1}]$), the vector of all traits $(X^{(e)}(t))_{e\in\mathcal{E}_i}$ is modelled by a linear stochastic differential equation. Since its mean is linear in, and its variance independent of, the starting value $(X^{(e)}(\tau_i))_{e\in\mathcal{E}_i}$, these models are linear Gaussian [89,90]. In fact, they can be expressed by (3.1) on a supergraph of the original phylogeny, in which an edge $(u,v)$ is added if $u$ is at the start $\tau_i$ of some epoch $i$, $v$ is at the end $\tau_{i+1}$, and if the mean of $X_v$ conditional on all traits at time $\tau_i$ has a non-zero coefficient for $X_u$. The specific form of $q_v$, $\omega_v$ and $V_v$ in (3.1) depends on the specific interaction model, and may be more complex than the merging rule (3.2).

4. A short review of graphical models and belief propagation

Implementing BP techniques on general networks is more complex than on trees and involves the construction of an auxiliary graph known as a clique tree or cluster graph (§4b). To explain why, we review here the main ideas of graphical models and belief propagation for likelihood calculation.

(a). Graphical models

A probabilistic graphical model is a graph representation of a probability distribution. Each node in the graph represents a random variable, typically univariate but possibly multivariate. We focus here on graphical models with directed edges on a directed acyclic graph (DAG). Edges represent dependencies between variables, where the direction is typically used to represent causation. The graph expresses conditional independencies satisfied by the joint distribution of all the variables at all nodes in the graph.

Given the directional nature of evolution and inheritance, models for trait evolution on a phylogeny are often readily formulated as directed graphical models. Höhna et al. [92] demonstrate the utility of representing phylogenetic models as graphical models for exposing assumptions and for interpretation and implementation. They present a range of examples common in evolutionary biology, with a focus on how graphical models facilitate greater modularity and transparency. Directed graphical models have also been used to parametrize distributions on tree topologies for accurate approximations of posterior distributions and for variational inference [93–95]. In this issue, similar DAGs are used to store phylogenetic trees efficiently, for parsimony-based inference [96]. Here we focus instead on the computational gains that BP allows in graphical models.

A directed graphical model consists of a DAG $G$ and a set of conditional distributions, one for each node in $G$. At a node $v$ with parent nodes $\mathrm{pa}(v)$, the distribution of variable $X_v$ conditional on its set of parent variables $X_{\mathrm{pa}(v)}=\{X_u;\ u\in\mathrm{pa}(v)\}$ is given by a factor $\phi_v$, which is a function whose scope is the set of variables from $v$ and $\mathrm{pa}(v)$. For each node $v$, the set formed by this node and its parents, $\{v\}\cup\mathrm{pa}(v)$, is called a node family. If $V$ denotes the vertex set of $G$, then the set of factors $\{\phi_v,\ v\in V\}$ defines the joint density of the graphical model as

$$p_\theta(X_v;\ v\in V)=\prod_{v\in V}\phi_v\left(X_v \mid X_u,\theta;\ u\in\mathrm{pa}(v)\right) \tag{4.1}$$

where we add the possible dependence of factors on model parameters θ. This factor formulation implies that, conditional on its parents, Xv is independent of any non-descendant node (e.g. ‘grandparents’) [26].

Example 4.1. Brownian motion (BM) on a tree. Consider the phylogenetic tree $T$ in figure 1a. The graphical model for the node states of $T$ under a BM, whose parameters $\theta$ are the trait evolutionary variance rate $\sigma^2$, the ancestral state at the root $x_\rho$ and edge lengths $\ell_i$, has the same topology as $T$. On a tree, each node family consists of a node $v$ and its single parent, or the root $\rho$ by itself. The root factor $\phi_\rho$ may be deterministic, as when $x_\rho$ is a fixed parameter of the model, or it may be a prior distribution on $x_\rho$.

Figure 1.

Example graphical model on a phylogenetic tree with factors defined by the BM. The joint distribution of all variables at all nodes is given by the product of factors $\prod_v\phi_v$ as in (4.1), where $\phi_v$ is the distribution of $x_v$ conditional on its parent variable $x_{\mathrm{pa}(v)}$: $\mathcal{N}(x_{\mathrm{pa}(v)},\sigma^2\ell_v)$ under the BM. (a) Phylogenetic tree $T$. The graphical model uses the same graph $G=T$. (b) Clique tree $U$ for the graphical model (see definition 4.1 in §4b(i)). Its nodes are clusters of variables in $T$ (ellipses). Each edge is labelled by a sepset (squares, see definition 4.1 (ii)): a subset of variables shared by adjacent clusters.

Example 4.2. Discrete trait on a network. A rooted phylogenetic network is a DAG with a single root and taxon-labelled leaves (or tips). A node with at most one parent is called a tree node and its incoming edge is a tree edge. A node with multiple parents is called a hybrid node and represents a population (or species more generally) with mixed ancestry. An edge $e=(u,h)$ going into a hybrid node $h$ is called a hybrid edge. It is assigned an inheritance probability $\gamma(e)>0$ that represents the proportion of the genome in $h$ that was inherited from the parent population $u$ (via edge $e$). At each hybrid node $h$ we must therefore have $\sum_{u\in\mathrm{pa}(h)}\gamma((u,h))=1$. The phylogenetic network $N$ in figure 2a has one hybrid node $x_5$ whose genetic makeup comes from $x_4$ with proportion 0.4 and from $x_6$ with proportion 0.6.

Figure 2.

(a) Phylogenetic network N with hybrid edges shown in blue and annotated with γ values. The graphical model uses the same graph G=N. N displays two trees, depending on which hybrid edge is retained. One tree, with sister taxa 1 and 2, has probability 0.4. The other tree, with sister taxa 2 and 3, is displayed with probability 0.6. The distribution of the hybrid node x5 depends on both its parents and induces a factor cluster {x4,x5,x6} of size 3 in U and U* (see the family-preserving property in definition 4.1 (i)). (b) Clique tree U for the graphical model. (c) Cluster graph U* (see definition 4.1) for the same graphical model in which {x4,x6,xρ} in U is replaced by smaller clusters {x4,xρ}, {x6,xρ} and {xρ} that induce a cycle. Leaf clusters are not shown.

For a discrete trait $X$, the traditional model of evolution on a tree can be extended to a network $N$ as follows. Along each edge $e$, $X$ evolves according to a Markov process with some transition rate matrix $Q$ for an amount of time $\ell(e)$ that depends on the edge. At a tree node, the state of $X$ at the end of its parent edge is passed as the starting value to each daughter lineage, as in the traditional tree model. At reticulations, we follow previous authors to model the value $x_h$ at a hybrid node $h$ [65–67,97]. Let $x_e$ denote the state at the end of edge $e$, going forward in time. If $h$ has $m$ parent edges $e_1,\dots,e_m$, then $x_h$ is assumed to take value $x_{e_k}$ with probability $\gamma(e_k)$. This model reflects the idea that the trait is controlled by unknown genes, but the proportion of genes inherited from each parent is known. Incomplete lineage sorting, which can lead to hemiplasy for a trait [98], is unaccounted for. Similar to Example 4.1, the graphical model uses the topology of the network $N$.

To describe the factors of this graphical model and simplify notations, consider the case when $X$ is binary with states 0 and 1. For a tree node $v$, the factor $\phi_v$ can be represented by the $2\times2$ matrix $\exp(\ell(e)Q)$, where $e$ is the parent edge of $v$. For a hybrid node $h$ with $m$ parents $p_1,\dots,p_m$ and edges $e_k=(p_k,h)$ with $\gamma(e_k)=\gamma_k$, the factor $\phi_h$ has scope $(X_h,X_{p_1},\dots,X_{p_m})$ and can be described by a $2\times2^m$ matrix that stores the conditional probabilities $\mathbb{P}(X_h=j\mid X_{p_1}=i_1,\dots,X_{p_m}=i_m)$. This is a $2\times4$ matrix in the typical case when $h$ is admixed from $m=2$ parental populations. With $m=2$ and with parental values $(X_{p_1},X_{p_2})$ arranged in the order $((0,0),(0,1),(1,0),(1,1))$, then

$$\phi_h=\begin{pmatrix}1&\gamma_1&\gamma_2&0\\0&\gamma_2&\gamma_1&1\end{pmatrix}.$$
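The hybrid factor above can be tabulated for any number of parents. The following is a minimal sketch, not the authors' implementation: the function name `hybrid_factor` is hypothetical, and it assumes, as in the displayed matrix, that the hybrid node copies the state of parent $p_k$ directly with probability $\gamma_k$ (no state change along hybrid edges).

```python
from itertools import product

def hybrid_factor(gammas):
    """Tabulate P(X_h = j | parental states) for a binary trait.

    Rows are indexed by j in (0, 1). Columns follow the lexicographic order
    of parental states, e.g. (0,0), (0,1), (1,0), (1,1) when m = 2.
    Assumption: X_h copies the state of parent k with probability gammas[k].
    """
    if abs(sum(gammas) - 1.0) > 1e-12:
        raise ValueError("inheritance probabilities must sum to 1")
    parent_states = list(product((0, 1), repeat=len(gammas)))
    return [
        [sum(g for g, s in zip(gammas, states) if s == j) for states in parent_states]
        for j in (0, 1)
    ]

# Hybrid node x5 of figure 2a: 0.4 from x4 and 0.6 from x6
phi_h = hybrid_factor([0.4, 0.6])
```

With `gammas = [0.4, 0.6]` the rows are (1, 0.4, 0.6, 0) and (0, 0.6, 0.4, 1), matching the matrix above with $\gamma_1=0.4$ and $\gamma_2=0.6$. Each column sums to 1 because it is a conditional distribution of $X_h$.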

Example 4.3. Binary trait with incomplete lineage sorting (ILS). More complex evolutionary processes, such as ILS, can result in a graph G for the graphical model that is constructed from but not identical to the phylogeny. Such is the case for the evolution of a genetic marker whose gene tree is generated according to the coalescent model along the species phylogeny [68,99,100]. For a marker with two alleles, say ‘green’ and ‘red’, the data consist of the number of red alleles in a sample of individuals from each species. In electronic supplementary material, §A, we formulate the likelihood calculations by Bryant et al. [25] on a species tree and by Rabier et al. [70] on a species network as belief propagation. For this evolutionary model, the graph G of the associated graphical model differs significantly from the original phylogeny (illustrated in electronic supplementary material, figures S1 and S2).

Example 4.4. Discrete trait determined by an unobserved continuous trait. The threshold model uses a latent (unobserved) continuous trait, or ‘liability’, evolving as a Brownian motion to determine an observed discrete trait. The discrete trait changes state when the liability crosses a threshold [23,101]. As the liability is unobserved, the discrete trait has ‘memory’ and is not Markovian [24]: the probability to transition from one state to another depends on the amount of time spent in the current state. However, we can express the model as a graphical model suitable for BP by modelling the liability at all nodes in the phylogeny, and by adding an observation layer to the graph G for the value of the discrete trait at each tip. This layer adds a pendant edge to connect an observed trait to its corresponding liability. Thresholding adds significant complexity to likelihood calculations. Existing algorithms, on trees, use approximations [24,102] or resort to sampling the latent liability in a Bayesian context [3,6,103], including Hamiltonian techniques that exploit the gradient [4,5].

Representing discrete traits as thresholded liabilities makes it easy to model correlations between continuous and discrete traits in multivariate datasets with both types of traits [3]. Using graphical models, such models can leverage BP techniques and hence be extended to phylogenetic networks and general Gaussian processes.

(b). Belief propagation

BP is a framework for efficiently computing various integrals of the factored density pθ by grouping nodes and their associated variables into clusters and integrating them out according to rules along a clique tree (also known as a junction tree, join tree or tree decomposition) or along a cluster graph, more generally.

(i). Cluster graphs and clique trees

Definition 4.1 (cluster graph and clique tree). Let $\Phi=\{\phi_v,\,v\in V\}$ be the factors of a graphical model on graph $G$ and let $U=(\mathcal{V},\mathcal{E})$ be an undirected graph whose nodes $C_i\in\mathcal{V}$, called clusters, are sets of variables in the scope of $\Phi$. $U$ is a cluster graph for $\Phi$ if it satisfies the following properties:

  (i) (family-preserving) There exists a map $\alpha:\Phi\to\mathcal{V}$ such that for each factor $\phi_v$, its scope (node family for node $v$ in the graphical model) is a subset of the cluster $\alpha(\phi_v)$.

  (ii) (edge-labelled) Each edge $\{C_i,C_j\}$ in $\mathcal{E}$ is labelled with a non-empty sepset $S_{i,j}$ (‘separating set’) such that $S_{i,j}\subseteq C_i\cap C_j$.

  (iii) (running intersection) For each variable $x$ in the scope of $\Phi$, $\mathcal{E}_x\subseteq\mathcal{E}$, the set of edges with $x$ in their sepsets, forms a tree that spans $\mathcal{V}_x\subseteq\mathcal{V}$, the set of clusters that contain $x$.

If $U$ is acyclic, then $U$ is called a clique tree and we refer to its nodes as cliques. In this case, properties (ii) and (iii) imply that $S_{i,j}=C_i\cap C_j$.

A clique tree U is shown in figure 1b for the BM model from example 4.1, on the tree T in figure 1a. To check the running intersection property for x5, for example, we extract the graph defined by edges with x5 in their sepsets (squares). There are two such edges. They induce a subtree of U that connects all three clusters (ellipses) containing x5, as desired. More generally, when the graphical model is defined on a tree T, a corresponding clique tree U is easily constructed, where cliques in U correspond to edges in T and edges in U correspond to nodes in T. Multiple clique trees can be constructed for a given graphical model. In this example, the clique {xρ} (shown at the top) could be suppressed because it is a subset of adjacent cliques.

For the network N in figure 2a and the evolution of a discrete trait in example 4.2, one possible clique tree U is shown in figure 2b. Note that x5,x4 and x6 have to appear together in at least one of the clusters for the clique tree to be family-preserving (property (i)), because x4 and x6 are partners with a common child x5 whose distribution depends on both of their states.
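The three properties of definition 4.1 can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and the network topology (leaf 1 below $x_4$, leaf 2 below $x_5$, leaf 3 below $x_6$) is an assumption inferred from the description of figure 2, since the figure itself is not reproduced here.

```python
def sepsets_are_valid(clusters, sepsets):
    """Edge-labelled property: each sepset is non-empty and inside both clusters."""
    return all(bool(s) and s <= (clusters[a] & clusters[b])
               for (a, b), s in sepsets.items())

def is_family_preserving(families, clusters):
    """Each node family must be contained in at least one cluster."""
    return all(any(fam <= c for c in clusters.values()) for fam in families)

def has_running_intersection(clusters, sepsets):
    """For each variable x, edges with x in their sepset must form a tree
    spanning the clusters that contain x (assumes valid sepsets)."""
    for x in set().union(*clusters.values()):
        nodes = {name for name, c in clusters.items() if x in c}
        edges = [e for e, s in sepsets.items() if x in s]
        if len(edges) != len(nodes) - 1:   # a tree on n nodes has n - 1 edges
            return False
        reached, frontier = set(), [next(iter(nodes))]
        while frontier:                    # connectivity check by graph search
            u = frontier.pop()
            if u not in reached:
                reached.add(u)
                frontier += [b if a == u else a for a, b in edges if u in (a, b)]
        if reached != nodes:
            return False
    return True

# Assumed clique tree for the network of example 4.2
clusters = {
    "C1": {"x1", "x4"}, "C2": {"x2", "x5"}, "C3": {"x3", "x6"},
    "C5": {"x4", "x5", "x6"}, "Crho": {"x4", "x6", "xrho"},
}
sepsets = {
    ("C1", "C5"): {"x4"}, ("C2", "C5"): {"x5"}, ("C3", "C5"): {"x6"},
    ("C5", "Crho"): {"x4", "x6"},
}
families = [{"xrho"}, {"x4", "xrho"}, {"x6", "xrho"}, {"x4", "x5", "x6"},
            {"x1", "x4"}, {"x2", "x5"}, {"x3", "x6"}]

ok = (sepsets_are_valid(clusters, sepsets)
      and is_family_preserving(families, clusters)
      and has_running_intersection(clusters, sepsets))

# Dropping x4 from the sepset between the two large cliques breaks property (iii)
broken = dict(sepsets)
broken[("C5", "Crho")] = {"x6"}
broken_ok = has_running_intersection(clusters, broken)
```

Here `ok` is true, while `broken_ok` is false: without $x_4$ in the sepset between the two large cliques, the clusters containing $x_4$ are no longer connected through edges that carry $x_4$.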

We first focus on clique trees, which provide a structure for the exact likelihood calculation. In §5, we discuss the advantages of cluster graphs to approximate the likelihood at a lower computational cost.

(ii). Evidence

To calculate the likelihood of the data, or the marginal distribution of the traits at some node conditional on the data, we inject evidence into the model in one of two equivalent ways. For each observed value $\mathrm{x}_{v,t}$ of the $t$th trait $x_{v,t}$ at node $v$, we add to the model the indicator function $\mathbb{1}_{\{\mathrm{x}_{v,t}\}}(x_{v,t})$ as an additional factor. Equivalently, we can plug in the observed value $\mathrm{x}_{v,t}$ in place of the variable $x_{v,t}$ in all factors where $x_{v,t}$ appears, and then drop $x_{v,t}$ from the scope of all these factors. This second approach is more tractable than the first because it avoids the degenerate zero-variance Dirac distribution. However, it requires careful bookkeeping of the scope and of the re-parametrization of each factor when data are missing, that is, when some traits, but not all, are observed at some nodes. Below, we assume that the factors and their scopes have been modified to absorb evidence from the data.

(iii). Belief update message passing

There are multiple equivalent algorithms to perform BP. We focus here on the belief update algorithm. It assigns a belief to each cluster and to each sepset in the cluster graph. After running the algorithm, each belief should provide the marginal probability of the variables in its scope and of the observed data, with all other variables integrated out as desired to calculate the likelihood. The belief of cluster Ci, denoted as βi, is initialized as the product of all factors assigned to that cluster:

$$\beta_i^{(\mathrm{initial})}=\psi_i=\prod_{\phi;\,\alpha(\phi)=C_i}\phi\quad\text{for cluster } C_i \tag{4.2}$$

The belief of an edge between cluster i and j, denoted as μi,j, is initialized to the constant function 1. These beliefs are then updated iteratively by passing messages. Passing a message from Ci to Cj along an edge with sepset Si,j corresponds to passing information about the marginal distribution of the variables in Si,j as shown in algorithm 1. If U is a clique tree, then all beliefs converge to the true marginal probability of their variables and of the observed data, after traversing U only twice: once to pass messages from leaf cliques towards some root clique, and then back from the root clique to the leaf cliques. If our goal is to calculate the likelihood, then one traversal is sufficient. Once the root clique has received messages from all its neighbouring cliques, we can marginalize over all its variables (similar to step 1) to obtain the probability of the observed data only, which is the likelihood. The second traversal is necessary to obtain the marginal probability of all variables, such as if one is interested in the posterior distribution of ancestral states conditional on the observed data.

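Algorithm 1. Belief update: message passing from cluster $C_i$ to cluster $C_j$ along an edge with sepset $S_{i,j}$.

  1. Compute the message $\tilde\mu_{i\to j}=\int\beta_i\,d(C_i\setminus S_{i,j})$.

  2. Update the belief of $C_j$: $\beta_j\leftarrow\beta_j\,\tilde\mu_{i\to j}/\mu_{i,j}$.

  3. Update the belief of $S_{i,j}$: $\mu_{i,j}\leftarrow\tilde\mu_{i\to j}$.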

Some equivalent formulations of BP only store sepset messages, and avoid storing cluster beliefs. This strategy requires less memory but more computing time if U is traversed multiple times.
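To make the rootward traversal concrete, here is a minimal sketch of belief update message passing for a binary trait on the network of example 4.2, checked against a brute-force sum over all ancestral states. Everything numeric is a hypothetical choice made for illustration: the symmetric two-state rate matrix, the edge lengths, the uniform root prior and the tip data. The topology (leaf 1 below $x_4$, leaf 2 below $x_5$, leaf 3 below $x_6$) is assumed from the description of figure 2, and evidence is absorbed by plugging observed values into the leaf factors.

```python
from itertools import product
from math import exp

def transition(t, rate=1.0):
    """exp(tQ) for the symmetric rate matrix Q = [[-rate, rate], [rate, -rate]]."""
    stay = 0.5 * (1.0 + exp(-2.0 * rate * t))
    return [[stay, 1.0 - stay], [1.0 - stay, stay]]

def table(scope, fn):
    """A belief is (scope, values over all binary assignments of the scope)."""
    return scope, {a: fn(*a) for a in product((0, 1), repeat=len(scope))}

def value(belief, assign):
    scope, tab = belief
    return tab[tuple(assign[v] for v in scope)]

def marginalize(belief, keep):
    """Sum out every variable of the belief that is not in `keep`."""
    scope, tab = belief
    out = {}
    for a, val in tab.items():
        key = tuple(x for v, x in zip(scope, a) if v in keep)
        out[key] = out.get(key, 0.0) + val
    return tuple(v for v in scope if v in keep), out

def pass_message(beliefs, sepset_beliefs, sepsets, i, j):
    edge = frozenset((i, j))
    msg = marginalize(beliefs[i], sepsets[edge])      # step 1: marginalization
    old = sepset_beliefs.get(edge)                     # None stands for constant 1
    scope, tab = beliefs[j]
    new = {}
    for a, val in tab.items():                         # step 2: belief update
        assign = dict(zip(scope, a))
        new[a] = val * value(msg, assign) / (value(old, assign) if old else 1.0)
    beliefs[j] = (scope, new)
    sepset_beliefs[edge] = msg                         # step 3: sepset update

# Hypothetical parameters and data
gamma4, gamma6 = 0.4, 0.6
obs = {"x1": 0, "x2": 1, "x3": 1}
lengths = {"x1": 0.3, "x2": 0.2, "x3": 0.5, "x4": 0.4, "x6": 0.7}
P = {v: transition(t) for v, t in lengths.items()}    # P[v][parent state][child state]
prior = [0.5, 0.5]

def phi5(x5, x4, x6):
    return gamma4 * (x5 == x4) + gamma6 * (x5 == x6)

beliefs = {
    "C1": table(("x4",), lambda x4: P["x1"][x4][obs["x1"]]),
    "C2": table(("x5",), lambda x5: P["x2"][x5][obs["x2"]]),
    "C3": table(("x6",), lambda x6: P["x3"][x6][obs["x3"]]),
    "C5": table(("x5", "x4", "x6"), phi5),
    "Crho": table(("x4", "x6", "xrho"),
                  lambda x4, x6, xr: prior[xr] * P["x4"][xr][x4] * P["x6"][xr][x6]),
}
sepsets = {
    frozenset(("C1", "C5")): {"x4"}, frozenset(("C2", "C5")): {"x5"},
    frozenset(("C3", "C5")): {"x6"}, frozenset(("C5", "Crho")): {"x4", "x6"},
}
sepset_beliefs = {}
for i, j in [("C1", "C5"), ("C2", "C5"), ("C3", "C5"), ("C5", "Crho")]:
    pass_message(beliefs, sepset_beliefs, sepsets, i, j)

# One rootward traversal suffices for the likelihood: sum the root clique belief
likelihood_bp = sum(beliefs["Crho"][1].values())

likelihood_brute = sum(
    prior[xr] * P["x4"][xr][x4] * P["x6"][xr][x6] * phi5(x5, x4, x6)
    * P["x1"][x4][obs["x1"]] * P["x2"][x5][obs["x2"]] * P["x3"][x6][obs["x3"]]
    for x4, x5, x6, xr in product((0, 1), repeat=4)
)
```

The two values agree up to rounding. The message from `C5` to `Crho` is a $2\times2$ table over $(x_4,x_6)$, which illustrates why sepsets on a network can be larger than on a tree.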

In example 4.1 on a tree (figure 1a), the conditional distribution of $x_v$ at a non-root node $v$ corresponds to a factor $\phi_v$ for the BM model along edge $(\mathrm{pa}(v),v)$ in $T$. This factor is assigned to clique $C_v=\{\mathrm{pa}(v),v\}$ in $U$ to initialize the belief $\beta_v$ of $C_v$. If $v$ is a leaf in $T$, then $\beta_v$ is further multiplied by the indicator function at the value $\mathrm{x}_v$ observed at $v$, such that the belief of clique $C_v$ can be expressed as a function of the leaf’s parent state only: $\phi_v(x_{\mathrm{pa}(v)})=\mathbb{P}(\mathrm{x}_v\mid x_{\mathrm{pa}(v)})$. The prior distribution $\phi(x_\rho)$ at the root $\rho$ of $T$ (which can be an indicator function if the root value is fixed as a model parameter) can be assigned to any clique containing $\rho$. In figure 1, $U$ includes a clique $C_\rho=\{x_\rho\}$ drawn at the top, to which we assign the root prior $\phi_\rho(x_\rho)$ and which we will use as the root of $U$. Since $U$ is a clique tree, BP converges after traversing $U$ twice: from the tips to $C_\rho$ and then back to the tips. IC [27,34] implements the first ‘rootwards’ traversal of BP. For example, the belief of clique $\{x_5,x_\rho\}$ after receiving messages (steps 1–3) from both of its daughter cliques is the function

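$$\beta_5(x_5,x_\rho)=\exp\left(-\frac{(x_\rho-x_5)^2}{2\ell_5}-\frac{(x_5-x_5^*)^2}{2v_5^*}+g_5^*\right)$$

where

$$x_5^*=\frac{\ell_2x_1+\ell_1x_2}{\ell_1+\ell_2},\quad v_5^*=\frac{\ell_1\ell_2}{\ell_1+\ell_2}\quad\text{and}\quad g_5^*=-\frac{(x_2-x_1)^2}{2(\ell_1+\ell_2)}-\log\left((2\pi)^{3/2}\sqrt{\ell_1\ell_2\ell_5}\right)$$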

are quantities calculated for IC: $x_5^*$ corresponds to the estimated ancestral state at node 5, $v_5^*$ corresponds to the extra length added to $\ell_5$ when pruning the daughters of node 5 and $g_5^*$ captures the contrast $(x_2-x_1)/\sqrt{\ell_1+\ell_2}$ below node 5. At this stage of BP, $\beta_5(x_5,x_\rho)$ can be interpreted as $\mathbb{P}(x_1,x_2,x_5\mid x_\rho)$ such that the message $\tilde\mu_{5\to\rho}(x_\rho)$ sent from $\{x_5,x_\rho\}$ to the root clique $C_\rho$ is the partial likelihood $\mathbb{P}(x_1,x_2\mid x_\rho)$ after $x_5$ is integrated out. The first pass is complete when $C_\rho$ has received messages from all its neighbours. Its final belief is then $\beta_\rho(x_\rho)=\mathbb{P}(x_1,\dots,x_4\mid x_\rho)\,\phi_\rho(x_\rho)$. If $x_\rho$ is a fixed model parameter, then this is the likelihood. Otherwise, we get the likelihood by integrating out $x_\rho$ in $\beta_\rho(x_\rho)$.
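The closed form of $\beta_5$ can be checked numerically against the product of the three BM factors it summarizes. The sketch below is a sanity check under stated assumptions, not the authors' code: the displayed expressions contain no $\sigma^2$, so the variance rate is taken to be 1 (edge lengths in units of variance), and all numerical values and function names are hypothetical.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def ic_summary(x1, x2, l1, l2, l5):
    """Quantities x5*, v5*, g5* for the cherry (1, 2) below node 5."""
    x5_star = (l2 * x1 + l1 * x2) / (l1 + l2)
    v5_star = l1 * l2 / (l1 + l2)
    g5_star = (-(x2 - x1) ** 2 / (2.0 * (l1 + l2))
               - log((2.0 * pi) ** 1.5 * sqrt(l1 * l2 * l5)))
    return x5_star, v5_star, g5_star

def belief5(x5, xrho, x1, x2, l1, l2, l5):
    """Belief of clique {x5, xrho} after absorbing both leaf messages."""
    xs, vs, gs = ic_summary(x1, x2, l1, l2, l5)
    return exp(-(xrho - x5) ** 2 / (2.0 * l5) - (x5 - xs) ** 2 / (2.0 * vs) + gs)

# Hypothetical tip values, edge lengths and evaluation point
x1, x2, l1, l2, l5 = 1.2, 0.7, 0.5, 0.8, 0.3
x5, xrho = 0.9, 0.1

closed_form = belief5(x5, xrho, x1, x2, l1, l2, l5)
# P(x1 | x5) P(x2 | x5) P(x5 | xrho) under a BM with unit variance rate
direct = normal_pdf(x1, x5, l1) * normal_pdf(x2, x5, l2) * normal_pdf(x5, xrho, l5)
```

The two numbers coincide up to rounding for any choice of `x5` and `xrho`, which is the sense in which $\beta_5(x_5,x_\rho)$ equals $\mathbb{P}(x_1,x_2,x_5\mid x_\rho)$ at this stage.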

In example 4.2 on a network (figure 2), we label the cliques in $U$ as follows: $C_v=\{x_v,x_{\mathrm{pa}(v)}\}$ for leaves $v=1,2,3$, $C_5=\{x_5,x_4,x_6\}$ for hybrid node $v=5$ and its parents, and $C_\rho=\{x_4,x_6,x_\rho\}$. To initialize beliefs, we assign $\phi_v$ to $C_v$ for $v=1,2,3,5$, and $\phi_4$, $\phi_6$ are both assigned to $C_\rho$. Unlike in example 4.1, a clique may correspond to more than a single edge in $N$. This is expected at a hybrid node $h$, because the factor describing its conditional distribution needs to contain $h$ and both of its parents. But for $U$ to be a clique tree, the root clique $C_\rho$ also has to contain the factors from two edges in $N$. Also, unlike for trees, sepsets may contain more than a single node. Here, the two large cliques are separated by $\{x_4,x_6\}$ so they will send messages $\tilde\mu(x_4,x_6)$ about the joint distribution of these two variables. In this binary trait setting, these messages and sepset beliefs can be stored as $2\times2$ arrays, and the three-node clique beliefs can be stored as arrays of $2^3$ values. As they involve more variables than when $G$ is a tree (in which case BP would store only two values at each sepset), storing and updating them requires more computing time and memory.

More generally, we see that the computational complexity of BP scales with the size of the cliques and sepsets. This complexity may become prohibitive on a more complex phylogenetic network, even for a simple binary trait without ILS, if the size of the largest cluster in U is too large—a topic that we explore later.

Example 4.3 illustrates the fact that beliefs cannot always be interpreted as partial (or full) likelihoods at every step of BP, unlike in examples 4.1 and 4.2. For example, consider the tip clique C1 containing the total number of alleles and the number of red alleles in species 1, and the number of their ancestral alleles (n total, of which r are red) just before the speciation event that led to species 1 (electronic supplementary material, figure S1(c)). At the first iteration of BP, the first message sent by C1 is the quantity denoted by FT(n,r) in Bryant et al. [25]. It is not a partial likelihood because it is not the likelihood of some partial subset of the data conditional on some ancestral values (n and r). Intuitively, this is because nodes with data below variables in C1 in G are not all below C1 in the clique tree. Information from these data will flow towards C1 at later steps. The beauty of BP is that after a second traversal of the clique tree, C1’s belief is guaranteed to converge to the likelihood of the full data, conditional on the state of the clique variables. See electronic supplementary material, §A.1, for details.

(iv). Clique tree construction

For a given graphical model on G, there are many possible clique trees and cluster graphs. For running BP, it is advantageous to have small clusters and small sepsets. Indeed, clusters and sepsets with fewer variables require less memory to store beliefs and less computing time to run steps 1 (integration) and 2 (belief update). Ideally, we would like to find the best clique tree, whose largest clique is of the smallest size possible. For a general graph G, finding this best clique tree is hard but good heuristics exist [26].

The first step is to create the moralized graph Gm from G. This is done by connecting all nodes that share a common child, and then undirecting all edges. We can then triangulate Gm, that is, build a new graph H by adding edges to Gm such that H is chordal (any cycle includes a chord). The width of H is the size of its largest clique minus 1. The treewidth of Gm is the smallest width of all its possible triangulations H. Finding H of minimum width is hard, though efficient heuristics exist (e.g. greedy minimum-fill [104,105]). The nodes of U are then defined as the maximal cliques of H [106]. Finally, the edges of U are formed such that U becomes a tree and such that the sum of the sepset sizes is maximum, by finding a maximum spanning tree using Kruskal’s algorithm or Prim’s algorithm [107]. All these steps have polynomial complexity.
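The construction steps described above (moralization, triangulation by greedy minimum-fill, extraction of maximal cliques, maximum spanning tree by Kruskal's algorithm) can be sketched in a few lines. This is an illustrative implementation with hypothetical function names, applied to the network of example 4.2 with a topology assumed from the description of figure 2. It is a heuristic and is not guaranteed to find a clique tree of minimum width on general graphs.

```python
from itertools import combinations

def moralize(parents):
    """Undirected moral graph: link each node to its parents, and parents together."""
    adj = {v: set() for v in parents}
    for v, pa in parents.items():
        for u in pa:
            adj[u].add(v)
            adj[v].add(u)
        for a, b in combinations(pa, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def min_fill_cliques(adj):
    """Greedy minimum-fill elimination; returns maximal cliques of the triangulation."""
    adj = {v: set(nb) for v, nb in adj.items()}
    cliques = []
    while adj:
        def fill(v):  # number of edges needed to make the neighbours of v a clique
            return sum(1 for a, b in combinations(adj[v], 2) if b not in adj[a])
        v = min(sorted(adj), key=fill)
        clique = adj[v] | {v}
        for a, b in combinations(adj[v], 2):   # add fill-in edges
            adj[a].add(b)
            adj[b].add(a)
        for u in adj[v]:                       # eliminate v
            adj[u].discard(v)
        del adj[v]
        if not any(clique <= c for c in cliques):
            cliques.append(clique)
    return cliques

def clique_tree_edges(cliques):
    """Kruskal's algorithm: keep the largest sepsets that do not close a cycle."""
    root = list(range(len(cliques)))
    def find(i):
        while root[i] != i:
            i = root[i]
        return i
    candidates = sorted(
        ((len(cliques[i] & cliques[j]), i, j)
         for i, j in combinations(range(len(cliques)), 2)),
        key=lambda e: -e[0],
    )
    edges = []
    for weight, i, j in candidates:
        ri, rj = find(i), find(j)
        if weight > 0 and ri != rj:
            root[ri] = rj
            edges.append((i, j, cliques[i] & cliques[j]))
    return edges

# Assumed topology of the network in figure 2a
parents = {"xrho": [], "x4": ["xrho"], "x6": ["xrho"], "x5": ["x4", "x6"],
           "x1": ["x4"], "x2": ["x5"], "x3": ["x6"]}
cliques = min_fill_cliques(moralize(parents))
edges = clique_tree_edges(cliques)
width = max(len(c) for c in cliques) - 1
```

On this small network the moralized graph is already chordal, so no fill-in edge is added. The procedure returns five cliques of size at most 3 (width 2), including $\{x_4,x_5,x_6\}$ and $\{x_4,x_6,x_\rho\}$ joined by the sepset $\{x_4,x_6\}$.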

(c). Belief propagation for Gaussian models

Before discussing BP on cluster graphs that are not clique trees, we focus on BP updates for the evolutionary models presented in §3. On a phylogenetic network $N$, the joint distribution of all present and ancestral species $(X_v)_{v\in N}$ is multivariate Gaussian precisely when it comes from a graphical model on $N$ whose factors $\phi_v$ are linear Gaussian [26]. The factor at node $v$ is linear Gaussian if, conditional on its parents, $X_v$ is Gaussian with a mean that is linear in the parental values and a variance independent of parental values, hence the term GLInv used by Mitov et al. [18]. In other words, for the joint process to be Gaussian, each factor $\phi_v(x_v\mid x_{\mathrm{pa}(v)})$ should be of the form (3.1).

Such models have been called Gaussian Bayesian networks or graphical Gaussian networks, and are special cases of Gaussian processes (on a graph). These Gaussian models are convenient for BP because linear Gaussian factors have a convenient parametrization that allows for a compact representation of beliefs and belief update operations. Namely, the factor giving the conditional distribution $\phi_v(x_v\mid x_{\mathrm{pa}(v)})$ from (3.1) can be expressed in a canonical form as the exponential of a quadratic form:

$$C(x;K,h,g)=\exp\left(-\tfrac12x^\top Kx+h^\top x+g\right). \tag{4.3}$$

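For example, if we think of $\phi_v(x_v\mid x_{\mathrm{pa}(v)})$ as a function of $x_v$ primarily, we may use the parametrization $C(x_v;K,h,g)$ with

$$K=V_v^{-1},\quad h=V_v^{-1}\left(q_vx_{\mathrm{pa}(v)}+\omega_v\right)\quad\text{and}\quad g=-\tfrac12\left(\log|2\pi V_v|+\left\|q_vx_{\mathrm{pa}(v)}+\omega_v\right\|^2_{V_v^{-1}}\right)$$

where $\|y\|^2_M$ denotes $y^\top My$. We can also express $\phi_v$ as a canonical form over its full scope

$$\phi_v(x_v\mid x_{\mathrm{pa}(v)})=C\left(\begin{bmatrix}x_v\\x_{\mathrm{pa}(v)}\end{bmatrix};K_v,h_v,g_v\right)$$

with

$$K_v=\begin{bmatrix}V_v^{-1}&-V_v^{-1}q_v\\-q_v^\top V_v^{-1}&q_v^\top V_v^{-1}q_v\end{bmatrix}=\begin{bmatrix}I\\-q_v^\top\end{bmatrix}V_v^{-1}\begin{bmatrix}I&-q_v\end{bmatrix},\quad h_v=\begin{bmatrix}V_v^{-1}\omega_v\\-q_v^\top V_v^{-1}\omega_v\end{bmatrix},\quad g_v=-\tfrac12\left(\log|2\pi V_v|+\|\omega_v\|^2_{V_v^{-1}}\right). \tag{4.4}$$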

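If $v$ is a leaf with fully observed data, then we need to plug in the data $\mathrm{x}_v$ into $\phi_v$ and consider this factor as a function of $x_{\mathrm{pa}(v)}$ only. We can express $\phi_v(\mathrm{x}_v\mid x_{\mathrm{pa}(v)})$ as the canonical form $C(x_{\mathrm{pa}(v)};K,h,g)$ with

$$K=q_v^\top V_v^{-1}q_v,\quad h=q_v^\top V_v^{-1}(\mathrm{x}_v-\omega_v)\quad\text{and}\quad g=-\tfrac12\left(\log|2\pi V_v|+\|\mathrm{x}_v-\omega_v\|^2_{V_v^{-1}}\right).$$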

If data are partially observed at leaf v, the same principle applies. We can plug-in the observed traits into ϕv and express ϕv as a canonical form over its reduced scope: xpa(υ) and any unobserved xv,t. Some quadratic terms captured by Kv on the full scope become linear or constant terms after plugging-in the data, and some linear terms captured by hv on the full scope become constant terms in the canonical form on the reduced scope.

An important property of this canonical form is its closure under the belief update operations: marginalization (step 1) and factor product (step 2). Indeed, the product of two canonical forms with the same scope satisfies

$$C(x;K_1,h_1,g_1)\,C(x;K_2,h_2,g_2)=C(x;K_1+K_2,h_1+h_2,g_1+g_2).$$

Now consider marginalizing a factor $C(x;K,h,g)$ to a subvector $x^*$ of $x$, by integrating out the elements $x\setminus x^*$ of $x$. Let $K_S$ and $K_I$ be the submatrices of $K$ that correspond to $x^*$ (Scope of marginal or Sepset) and $x\setminus x^*$ (variables to be Integrated out), and let $K_{S,I}=K_{I,S}^\top$ be the cross-terms. If $K_I$ is invertible, then:

$$\int C(x;K,h,g)\,d(x\setminus x^*)=C(x^*;K^*,h^*,g^*)$$

where $K^*=K_S-K_{S,I}K_I^{-1}K_{I,S}$, $h^*=h_S-K_{S,I}K_I^{-1}h_I$ with $h_S$ and $h_I$ defined as the subvectors of $h$ corresponding to $x^*$ and $x\setminus x^*$ respectively, and $g^*=g+\left(\log|2\pi K_I^{-1}|+\|h_I\|^2_{K_I^{-1}}\right)/2$.
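The product and marginalization rules can be exercised on the cherry of example 4.1. The sketch below is a simplified illustration for scalar variables only (each entry of $K$ is a number, and one variable is integrated out at a time). It assumes a BM with unit variance rate, and all names and numerical values are hypothetical. The message obtained by integrating out $x_5$ is compared with the bivariate Gaussian density of $(x_1,x_2)$ given $x_\rho$.

```python
from math import exp, log, pi, sqrt

def canon_product(f1, f2):
    """Product of two canonical forms with the same scope: parameters add up."""
    (K1, h1, g1), (K2, h2, g2) = f1, f2
    n = len(h1)
    K = [[K1[a][b] + K2[a][b] for b in range(n)] for a in range(n)]
    return K, [h1[a] + h2[a] for a in range(n)], g1 + g2

def canon_marginalize(f, i):
    """Integrate out scalar variable number i (requires K[i][i] > 0)."""
    K, h, g = f
    keep = [a for a in range(len(h)) if a != i]
    Ki = K[i][i]
    K_new = [[K[a][b] - K[a][i] * K[i][b] / Ki for b in keep] for a in keep]
    h_new = [h[a] - K[a][i] * h[i] / Ki for a in keep]
    g_new = g + 0.5 * (log(2.0 * pi / Ki) + h[i] ** 2 / Ki)
    return K_new, h_new, g_new

def canon_eval(f, x):
    K, h, g = f
    n = len(h)
    quad = sum(x[a] * K[a][b] * x[b] for a in range(n) for b in range(n))
    return exp(-0.5 * quad + sum(h[a] * x[a] for a in range(n)) + g)

def edge_factor(length):
    """BM factor N(x_child; x_parent, length) over scope (x_child, x_parent)."""
    k = 1.0 / length
    return [[k, -k], [-k, k]], [0.0, 0.0], -0.5 * log(2.0 * pi * length)

def leaf_factor(x_obs, length):
    """Leaf factor with data plugged in, extended to scope (x5, xrho) with zeros."""
    k = 1.0 / length
    g = -0.5 * (log(2.0 * pi * length) + x_obs ** 2 / length)
    return [[k, 0.0], [0.0, 0.0]], [x_obs * k, 0.0], g

# Hypothetical data and edge lengths
x1, x2, l1, l2, l5 = 1.2, 0.7, 0.5, 0.8, 0.3
belief = canon_product(canon_product(edge_factor(l5), leaf_factor(x1, l1)),
                       leaf_factor(x2, l2))
message = canon_marginalize(belief, 0)       # integrate out x5, keep xrho

def bivariate_density(d1, d2, s11, s12, s22):
    det = s11 * s22 - s12 ** 2
    quad = (s22 * d1 ** 2 - 2.0 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return exp(-0.5 * quad) / (2.0 * pi * sqrt(det))

xrho = 0.1
from_bp = canon_eval(message, [xrho])
direct = bivariate_density(x1 - xrho, x2 - xrho, l1 + l5, l5, l2 + l5)
```

The message precision `message[0][0][0]` equals $1/(\ell_5+v_5^*)$ with $v_5^*=\ell_1\ell_2/(\ell_1+\ell_2)$, in agreement with the pruning step of IC. A multivariate implementation would replace the scalar division by a linear solve with $K_I$.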

If the factors of a Gaussian network are non-deterministic, then each belief can be parametrized by its canonical form, and the above equations can be applied to update the cluster and sepset beliefs for BP (algorithm 1). For cluster $C_i$, let $(K_i,h_i,g_i)$ parametrize its belief $\beta_i$. For sepset $S_{i,j}$, let $(K_{i,j},h_{i,j},g_{i,j})$ parametrize its belief $\mu_{i,j}$. Also, for step 1 of BP, let $(K_{i\to j},h_{i\to j},g_{i\to j})$ parametrize the message $\tilde\mu_{i\to j}$ sent from $C_i$ to $C_j$. Then BP updates can be expressed as shown below.

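  1. Compute the message: $K_{i\to j}=K_S-K_{S,I}K_I^{-1}K_{I,S}$, $h_{i\to j}=h_S-K_{S,I}K_I^{-1}h_I$ and $g_{i\to j}=g_i+\left(\log|2\pi K_I^{-1}|+\|h_I\|^2_{K_I^{-1}}\right)/2$.

  2. Update the belief of $C_j$: $K_j\leftarrow K_j+\mathrm{ext}(K_{i\to j}-K_{i,j})$, $h_j\leftarrow h_j+\mathrm{ext}(h_{i\to j}-h_{i,j})$ and $g_j\leftarrow g_j+g_{i\to j}-g_{i,j}$.

  3. Update the belief of $S_{i,j}$: $K_{i,j}\leftarrow K_{i\to j}$, $h_{i,j}\leftarrow h_{i\to j}$ and $g_{i,j}\leftarrow g_{i\to j}$.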

In step 1, $K_S$ and $K_I$ are the submatrices of $K_i$ that correspond to $S_{i,j}$ and $C_i\setminus S_{i,j}$. Similarly, $h_S$ and $h_I$ are subvectors of $h_i$. In step 2, $\mathrm{ext}(K_{i\to j}-K_{i,j})$ extends $K_{i\to j}-K_{i,j}$ to the same scope as $K_j$ by padding it with zero rows and zero columns for $C_j\setminus S_{i,j}$. Similarly, $\mathrm{ext}(h_{i\to j}-h_{i,j})$ extends $h_{i\to j}-h_{i,j}$ to scope $C_j$ with 0 entries on rows for $C_j\setminus S_{i,j}$.

If the phylogeny is a tree, performing these updates from the tips to the root corresponds to the recursive eqns (9), (10) and (11) of Mitov et al. [18], and to the propagation formulas (A.3)–(A.8) of Bastide et al. [59], who both considered the general linear Gaussian model (3.1).

At any point, a belief $C(x;K,h,g)$ gives a local estimate of the conditional mean ($K^{-1}h$) and conditional variance ($K^{-1}$) of trait $X$ given data $Y$, for $K\succ0$. An exact belief, such that $C(x;K,h,g)\propto p_\theta(x\mid Y)$, gives exact conditional estimates, that is: $E(X\mid Y)=K^{-1}h$ and $\mathrm{var}(X\mid Y)=K^{-1}$.

5. Scalable approximate inference with loopy belief propagation

The previous examples focused on clique trees and the exact calculation of the likelihood. We now turn to the use of cluster graphs with cycles, or loopy cluster graphs, such as in figure 2c or 3c,d. BP on a loopy cluster graph, abbreviated as loopy BP, can approximate the likelihood and posterior distributions of ancestral values and can be more computationally efficient than BP on a clique tree.

Figure 3.

(a) Admixture graph $N$ from Lazaridis et al. [108, fig. 3] with $h=4$ reticulations (hybrid edges are coloured). $N$ has one non-trivial biconnected component (blob) $B$, induced by all its internal nodes except for the root. $B$ contains all four reticulations so $N$ has level $\ell=4$. (b–d) Various cluster graphs for the moralized blob $B^m$: (b) clique tree, (c) join-graph structuring with the maximum cluster size set to three, (d) LTRIP using the set of node families in $B$. Here sepsets (not shown) are the intersection of their incident clusters, and are small with one node only in (c) and (d). Purple boxes and edges, clusters and sepsets that contain node 8; red text, hybrid families.

(a). Calibration

Updating beliefs on a loopy cluster graph uses algorithm 1 in the same way as on a clique tree. A cluster graph is said to be calibrated when its normalized beliefs have converged (i.e. are unchanged by algorithm 1 along any edge). For calibration, neighbouring clusters Ci and Cj must have beliefs that are marginally consistent over the variables in their sepset Si,j:

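$$\int\beta_i\,d(C_i\setminus S_{i,j})=\tilde\mu_{i\to j}\propto\mu_{i,j}\propto\tilde\mu_{j\to i}=\int\beta_j\,d(C_j\setminus S_{i,j}).$$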

On a clique tree, calibration can be guaranteed at the end of a finite sequence of messages passed. Clique and sepset beliefs are then proportional to the posterior distribution over their variables, and can be integrated to compute the common normalization constant $\kappa=\kappa_i\,(=\int\beta_i\,dC_i)=\kappa_{j,k}\,(=\int\mu_{j,k}\,dS_{j,k})$, which equals the likelihood. For loopy BP, calibration is not guaranteed. If it is attained, then we can similarly view cluster and sepset beliefs as unnormalized approximations of the posterior distribution over their variables, though the $\kappa_i$s and $\kappa_{j,k}$s may differ, grow unboundedly, and generally do not equal or estimate the likelihood. Gaussian models enjoy the remarkable property that, if calibration can be attained on a cluster graph, then the approximate posterior means (ancestral values) are guaranteed to be exact. By contrast, the posterior variances are generally inexact, and are typically underestimated [109–111], although we found them overestimated in our phylogenetic examples below (figure 6).

Successful calibration depends on various aspects, such as the features of the loops in the cluster graph, the factors in the model, and the scheduling of messages. For beliefs to converge, a proper message schedule requires that a message is passed along every sepset, in each direction, infinitely often (until stopping criteria are met) [111]. Multiple scheduling schemes have been devised to help reach calibration more often and more accurately. These can be data-independent (e.g. choosing a list of trees nested in the cluster graph that together cover all clusters and edges, then iteratively traversing each tree in both directions [110]) or adaptive (e.g. prioritizing messages between clusters that are further from calibration [112–115]).

(b). Likelihood approximation

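To approximate the log-likelihood $\mathrm{LL}(\theta)=\log\int p_\theta(x)\,dx$ from calibrated beliefs on cluster graph $U^*=(\mathcal{V}^*,\mathcal{E}^*)$, denoted together as $q=\{\beta_i,\mu_{i,j};\,C_i\in\mathcal{V}^*,\{C_i,C_j\}\in\mathcal{E}^*\}$, we can use the factored energy functional [26]:

$$\tilde F(p_\theta,q)=\sum_{C_i\in\mathcal{V}^*}E_{\beta_i}(\log\psi_i)+\sum_{C_i\in\mathcal{V}^*}H(\beta_i)-\sum_{\{C_i,C_j\}\in\mathcal{E}^*}H(\mu_{i,j}). \tag{5.1}$$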

Recall that ψi is the product of factors ϕv assigned to cluster Ci. Here Eβi denotes the expectation with respect to βi normalized to a probability distribution. H(βi) and H(μi,j) denote the entropy of the distributions defined by normalizing βi and μi,j, respectively. F~(pθ,q) has the advantage of involving local integrals that can be calculated easily: each over the scope of a single cluster or sepset. The justification for F~(pθ,q) comes from two approximations. First, following the expectation-maximization (EM) decomposition, LL(θ) can be approximated by the evidence lower bound (ELBO) used for variational inference [116]. For any distribution q over the full set of variables, which are here the unobserved (latent) variables after absorbing evidence from the data, we have

$$\mathrm{LL}(\theta)\ge\mathrm{ELBO}(p_\theta,q)=E_q(\log p_\theta)+H(q).$$

The gap $\mathrm{LL}(\theta)-\mathrm{ELBO}(p_\theta,q)$ is the Kullback–Leibler divergence between $q$, and $p_\theta$ normalized to the distribution of the unobserved variables conditional on the observed data. The first approximation comes from minimizing this gap over a class of distributions $q$ that does not necessarily include the true conditional distribution. The second approximation comes from pretending that for a given distribution $q$ with a belief factorization

$$q\propto\frac{\prod_{C_i\in\mathcal{V}^*}\beta_i}{\prod_{\{C_i,C_j\}\in\mathcal{E}^*}\mu_{i,j}},$$

its marginal over a given cluster (or sepset) is equal to the normalized belief of that cluster (or sepset), simplifying $E_q(\log\psi_i)$ to $E_{\beta_i}(\log\psi_i)$ and simplifying $-E_q(\log\beta_i)$ to $H(\beta_i)$. This simplification leads to the more tractable $\tilde F(p_\theta,q)$, in which each integral is of lower dimension, within the scope of a single cluster or sepset.

(c). Scalability versus accuracy: choice of cluster graph complexity

(i). Scalability, treewidth and phylogenetic network complexity

At the cost of exactness, loopy cluster graphs can offer greater computational scalability than clique trees because they allow for smaller cluster sizes, which reduces the complexity associated with belief updates. For example, consider a Gaussian model for $p$ traits: $\dim(x_v)=p$ at all nodes $v$ in the network. For a clique tree $U$ with $m$ cliques and maximum clique size $k$, passing a message between neighbour cliques has complexity $O(p^3k^3)$ and calibrating $U$ has complexity $O(mp^3k^3)$. Now consider a cluster graph $U^*$ with $m^*$ clusters, $O(m^*)$ edges and maximum cluster size $k^*<k$. Then passing a message between neighbour clusters of $U^*$ has complexity $O(p^3{k^*}^3)$ so it is faster than on $U$. But calibrating $U^*$ now requires more belief updates because each edge needs to be traversed more than twice. If each edge is traversed in both directions $b$ times to reach convergence, then calibrating $U^*$ has complexity $O(bm^*p^3{k^*}^3)$. So if $U^*$ has smaller clusters than $U$ and if $(k/k^*)^3>bm^*/m$, then loopy BP on $U^*$ runs faster than BP on $U$. Loopy BP could be particularly advantageous for complex networks whose clique trees have large clusters.
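The trade-off can be explored with a back-of-the-envelope calculation. The functions below are a rough sketch with hypothetical names: they use the orders of magnitude given above with all hidden constants set to 1, so the outputs are only indicative and are not runtime predictions.

```python
def exact_cost(m, k, p):
    """Order of magnitude for calibrating a clique tree: m * p^3 * k^3."""
    return m * (p * k) ** 3

def loopy_cost(m_star, k_star, p, b):
    """Order of magnitude for loopy BP with b traversals: b * m* * p^3 * k*^3."""
    return b * m_star * (p * k_star) ** 3

def break_even_traversals(m, k, m_star, k_star):
    """Number of traversals b at which both costs are equal: (k / k*)^3 * m / m*."""
    return (k / k_star) ** 3 * m / m_star

# Hypothetical sizes: a clique tree with 50 cliques of size up to 10,
# versus a cluster graph with 120 clusters of size up to 3, for p = 2 traits
m, k, m_star, k_star, p = 50, 10, 120, 3, 2
b_max = break_even_traversals(m, k, m_star, k_star)
cheaper_at_15 = loopy_cost(m_star, k_star, p, 15) < exact_cost(m, k, p)
cheaper_at_16 = loopy_cost(m_star, k_star, p, 16) < exact_cost(m, k, p)
```

With these made-up sizes the break-even point lies between 15 and 16 traversals, so loopy BP would only pay off if calibration is reached in about 15 traversals or fewer. The break-even point does not depend on $p$, since $p^3$ appears in both costs.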

Cluster graph construction determines the balance between scalability and approximation quality. At one end of the spectrum, the most scalable and least accurate are the factor graphs, also known as Bethe cluster graphs [117]. A factor graph has one cluster per factor ϕv and one cluster per variable, and so has the smallest possible maximum clique size k* and each sepset reduced to a single variable. Various algorithms have been proposed for constructing cluster graphs along the spectrum (e.g. LTRIP [118]) (figure 3). Notably, join-graph structuring [119] spans the whole spectrum because it is controlled by a user-defined maximum cluster size k*, which can be varied from its smallest possible value to a value large enough to obtain a clique tree.

At the other end of the spectrum, the best maximum clique size k is 1+tw(Gm), where tw(Gm) is the treewidth of the moralized graph. Loopy BP becomes interesting when tw(Gm) is large, making exact BP costly. Unfortunately, determining the treewidth of a general graph is NP-hard [120,121]. Heuristics such as greedy minimum-fill or nested dissection [122,123] can be used to obtain clique trees whose maximum clique size k is near the optimum 1+tw(Gm).

Different cluster graph algorithms could potentially be applied to the different biconnected components, or blobs [124] (e.g. LTRIP for one blob, clique tree for another), perhaps based on a blob’s attributes that are easy to compute. To choose between loopy versus exact BP, or between different cluster graph constructions more generally, one could use traditional complexity measures of phylogenetic networks as potential predictors of cost-effectiveness. For example, the reticulation number $h$ is straightforward to compute. In a binary network, where all internal non-root nodes have degree 3, $h$ is simply the number of hybrid nodes. More generally $h=|\{\text{hybrid edges}\}|-|\{\text{hybrid nodes}\}|$ [125]. The level of a network is the maximum reticulation number within a blob [126]. The network’s level ought to predict treewidth better than $h$ because a graph’s treewidth equals the maximum treewidth of its blobs [127], and moralizing the network does not affect its nodes’ blob membership. These phylogenetic complexity measures do not predict treewidth perfectly [128] except in simple cases, as shown below and proved in electronic supplementary material, §B.
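As a small illustration of how cheap the reticulation number is to compute, the sketch below counts hybrid edges and hybrid nodes in a network stored as a dictionary from each node to its list of parents. The function name and the data structure are hypothetical choices, and the first example uses the topology assumed for figure 2a.

```python
def reticulation_number(parents):
    """h = (number of hybrid edges) - (number of hybrid nodes)."""
    hybrid_nodes = [v for v, pa in parents.items() if len(pa) > 1]
    hybrid_edges = sum(len(parents[v]) for v in hybrid_nodes)
    return hybrid_edges - len(hybrid_nodes)

# Binary network with one hybrid node (x5 has parents x4 and x6)
binary_net = {"xrho": [], "x4": ["xrho"], "x6": ["xrho"], "x5": ["x4", "x6"],
              "x1": ["x4"], "x2": ["x5"], "x3": ["x6"]}
h_binary = reticulation_number(binary_net)

# Non-binary case: a single hybrid node with three parents contributes 3 - 1 = 2
three_parent_net = {"r": [], "a": ["r"], "b": ["r"], "c": ["r"],
                    "y": ["a", "b", "c"]}
h_three = reticulation_number(three_parent_net)
```

In the binary network, `h_binary` equals the number of hybrid nodes (here 1), whereas the three-parent hybrid gives `h_three` equal to 2. Computing the level would additionally require a decomposition into biconnected components, which is not attempted here.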

Proposition 5.1. Let N be a binary phylogenetic network with h hybrid nodes and level ℓ, and let t be the treewidth of the moralized network Nm obtained from N. For simplicity, assume that N has no parallel edges and no degree-2 nodes other than the root.

(A0) If ℓ=0 then h=0 and t=1.

(A1) If ℓ=1 then h≥1 and t=2.

(A2) Let v1 be a hybrid node with non-adjacent parents u1,u2. If v1 has a descendant hybrid node v2 such that one of its parents is not a descendant of either u1 or u2, then ℓ≥2 and t≥3.
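The reticulation number h is simple to obtain from an edge list, as the following Python sketch suggests. It takes a hybrid node to be any node with at least two parent edges and counts hybrid edges minus hybrid nodes. The edge-list format and the function name are assumptions made for illustration.

```python
from collections import Counter

def reticulation_number(edges):
    """h = |hybrid edges| - |hybrid nodes| for a list of (parent, child) edges."""
    indegree = Counter(child for _, child in edges)
    hybrid_nodes = [v for v, d in indegree.items() if d >= 2]
    hybrid_edges = [(u, v) for u, v in edges if indegree[v] >= 2]
    return len(hybrid_edges) - len(hybrid_nodes)

# hypothetical binary network with a single hybrid node "h"
binary_edges = [("r", "u1"), ("r", "u2"), ("u1", "h"), ("u2", "h"),
                ("u1", "a"), ("u2", "b"), ("h", "c")]
# hypothetical non-binary case: one hybrid node "x" with three parents
three_parent_edges = [("r", "p1"), ("r", "p2"), ("r", "p3"),
                      ("p1", "x"), ("p2", "x"), ("p3", "x")]
print(reticulation_number(binary_edges))
print(reticulation_number(three_parent_edges))
```

In the binary example h equals the number of hybrid nodes (one), whereas the three-parent hybrid contributes 3−1=2 to h.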

Level-1 networks have received much attention in phylogenetics because they are identifiable under various models under some mild restrictions [9,129–131]. Several inference methods limit the search to level-1 networks [9,132–134]. Since moralized level-1 networks have treewidth 2, exact BP is guaranteed to be efficient on them.

Beyond level-1, a network has a hybrid ladder (also called stack [135]) if a hybrid node v1 has a hybrid child node v2. By proposition 5.1, a hybrid ladder has the potential to increase the treewidth of the moralized network and decrease BP scalability, if the remaining conditions in (A2) are met. Related results in Chaplick et al. [136] are for undirected graphs that do not require prior moralization, and contain ladders defined as regular 2×L grids. Their Observation 1, that a graph containing a non-disconnecting grid ladder of length L≥2 has treewidth at least 3, relies on a similar argument as for (A2). However, structures leading to the conditions in (A2) are more general, even before moralization. It may be interesting to extend some of the results from Chaplick et al. [136] to moralized hybrid ladders in rooted networks.

In figure 4 (right), N2 has a hybrid ladder that does not meet all conditions of (A2) and has t=2. Outerplanar networks have a treewidth of at most 2 [127], and if bicombining (hybrid nodes have exactly two parents), remain outerplanar after moralization. Networks in which no hybrid node is the descendant of another hybrid node in the same blob are called galled networks [137]. They provide more tractability to solve the cluster containment problem [138]. Here, galled networks would then never meet the assumptions of (A2) and it would be interesting to study their treewidth after moralization.

Figure 4.

Two binary networks with a hybrid ladder and h=ℓ=2. N1 satisfies (A2) of proposition 5.1 and N1m has treewidth t=3. N2 does not meet (A2) (see red/purple annotations) and N2m has treewidth t=2. Stacking more hybrid ladders in the same way above a and b increases h and ℓ but leaves N2m outerplanar, keeping t=2.

We performed an empirical investigation of how h and ℓ can predict the treewidth t of the moralized network. Figure 5 shows that t correlates with h and ℓ on networks estimated from real data using various inference methods and on networks simulated under the biologically realistic birth–death-hybridization model [149,150], especially for complex networks. For networks with hundreds of tips ([151] lists several studies of this size), large maximum clique sizes k≥30 are not uncommon. By contrast, a Bethe cluster graph would have maximum cluster size k*=3, so that (k/k*)³≥10³ would provide a large computational gain for loopy BP to be considered.

Figure 5.

We observe a positive sublinear relationship between a maximum clique size upperbound (from the greedy minfill heuristic) and the number of hybrids h (a) or network level ℓ (b) on a combined sample of 11 empirical networks and 2509 simulated birth–death-hybridization networks. The empirical networks were sampled from Maier et al. [11, figs. 3(a–c) (left), 4(a–c) (left)] (reported as estimated by [139–144]), Lazaridis et al. [108, fig. 3], Nielsen et al. [12, fig. 3 (left)], Sun et al. [145, fig. 4(c)], Müller et al. [146, fig. 1(a)], Neureiter et al. [10, fig. 5(a)]; fit by these authors using ADMIXTOOLS [11,78], admixturegraph [147], OrientAGraph [148], contacTrees [10], Recombination [12], AdmixtureBayes [12]. The simulated networks were obtained by subsampling 10 networks per parameter scenario simulated by Justison & Heath [149], then filtering out networks of treewidth 1 (trees, possibly with parallel hybrid edges). The graphs are similar at high values because most networks have most of their hybrids contained in one large blob, leading to h≈ℓ. For example, |h−ℓ|≤2 in 99% of the networks with h>10.

(ii). Approximation quality with loopy belief propagation

We simulated data on a complex graph (40 tips, 361 hybrids) [146, fig. 1(a)] and a simpler graph (12 tips, 12 hybrids) [139, extended data fig. 4], then compared estimates from exact and loopy BP. For both networks, edges of length 0 were assigned the minimum non-zero edge length after suppressing any non-root degree-2 nodes. Trait values x=(x1,…,xn) at the tips were simulated from a BM with rate σ²=1 and xρ=0 at the root. Figure 6 shows the exact and approximate log-likelihood and conditional mean and variance of xρ assuming a BM with rate σ²=1 but improper prior xρ∼N(0,∞), using a greedy minimum-fill clique tree U and a cluster graph U*. Using a factor graph, calibration failed for the complex network (electronic supplementary material, §C, figure S3), so we used join-graph structuring to build U*. U can be calibrated in one iteration and the calculated quantities are exact (horizontal lines). By contrast, U* requires multiple passes and gives approximations. Calibration required more iterations on the complex network (h=361) than on the simpler network (h=12), as expected. But for both networks, the factored energy (5.1) approximated the log-likelihood very well. The distribution of the root state xρ conditional on the data seems more difficult to approximate. The conditional mean was correctly estimated but required more iterations than the log-likelihood approximation on the complex network. The conditional variance was severely overestimated on the complex network and very slightly overestimated on the simpler network. As desired, the average computing time per belief update was lower on U*, although modestly so due to the clique tree U having many small clusters of size similar to those in U* (electronic supplementary material, figure S4).
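For readers who want to reproduce this kind of simulation on a small scale, the Python sketch below draws a univariate BM along a rooted network. It assumes the weighted-average merging rule referred to as (3.2) in this article: at a hybrid node, each parent value plus the BM increment along the parent edge is weighted by that edge's inheritance probability γ. The data structures, the function name and the toy network are hypothetical, and this is not the code used to produce figure 6.

```python
import random

def simulate_bm(nodes, parents, sigma2=1.0, root_value=0.0, rng=None):
    """Simulate a univariate BM on a rooted network.

    nodes:   node names in topological order (parents before children)
    parents: {child: [(parent, edge_length, gamma), ...]} with gammas summing to 1
    """
    rng = rng or random.Random()
    x = {}
    for v in nodes:
        if v not in parents:                      # the root has no parent edge
            x[v] = root_value
            continue
        # weighted average over parent edges of (parent value + BM increment)
        x[v] = sum(
            gamma * (x[u] + rng.gauss(0.0, (sigma2 * length) ** 0.5))
            for u, length, gamma in parents[v]
        )
    return x

# hypothetical network: hybrid node "h" with parents u1 (gamma 0.3) and u2 (gamma 0.7)
nodes = ["r", "u1", "u2", "h", "a", "b", "c"]
parents = {
    "u1": [("r", 1.0, 1.0)],
    "u2": [("r", 1.0, 1.0)],
    "h": [("u1", 0.0, 0.3), ("u2", 0.0, 0.7)],   # hybrid edges of length 0
    "a": [("u1", 1.0, 1.0)],
    "b": [("u2", 1.0, 1.0)],
    "c": [("h", 1.0, 1.0)],
}
values = simulate_bm(nodes, parents, sigma2=1.0, root_value=0.0, rng=random.Random(1))
tips = {v: values[v] for v in ("a", "b", "c")}
print(tips)
```

Because both hybrid edges have length 0 in this toy network, the value at h is exactly the weighted average of its parents' values.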

Figure 6.

Accuracy of loopy BP. Approximation of the conditional distribution of the root state Xρ (left and centre) and log-likelihood (right) using a greedy minimum-fill clique tree U and a join-graph structuring cluster graph U* for two networks of varying complexity [139,146] as measured by their number of tips (n), level (ℓ), number of hybrids (h), maximum clique size (k) and maximum cluster size (k*). For U, estimates are exact after one iteration and shown as horizontal red lines. For U*, estimates are shown over 20 (first row), 50 or 200 (second row) iterations. Each iteration consists of two passes through each spanning tree in a minimal set that jointly covers U*. In each plot, the two curves correspond to two different regularizations of initial beliefs (electronic supplementary material, §E, dotted: algorithm R1, solid: algorithm R2).

6. Leveraging belief propagation for efficient parameter inference

(a). Belief propagation for fast likelihood computation

In some particularly simple models, such as the BM on a tree, fast algorithms such as IC [34] or phylolm [16] can directly calculate the best-fitting parameters that maximize the restricted likelihood (REML), in a single tree traversal avoiding numerical optimization. For more general models, such closed-form estimates are not available. One product of BP is the likelihood of any fixed set of model parameters. BP can hence be simply used as a fast algorithm for likelihood computation, which can then be exploited by any statistical estimation technique, in a Bayesian or frequentist framework. Most of the tools cited in §2c use either direct numerical optimization of the likelihood [41,53,152] or sampling techniques such as Markov Chain Monte Carlo (MCMC) [1,14] for parameter inference.

BP also outputs the trait distribution at internal, unobserved nodes conditioned on the observed data at the tips. In addition to providing a tool for efficient ancestral state reconstruction, these conditional means and variances can be used for parameter inference, with approaches based on latent variable models such as EM [57], or Gibbs sampling schemes [3]. Although not currently used in the field of evolutionary biology to our knowledge, approaches based on approximate EM algorithms [153] and relying on loopy BP could also be used.

The linear Gaussian framework can also be useful for traits that rely on latent Gaussian liabilities, such as the threshold model for discrete traits (example 4.4), factor analysis [62], and phylogenetic structural equation models [151]. In a Bayesian context, the latent liabilities can be sampled at the tips, and then the conditional likelihood can be computed efficiently with Gaussian BP. A similar approach was successful on trees [3–5]. The BP framework can generalize such methods to phylogenetic networks.

(b). Belief propagation for fast gradient computation

As we show below, the conditional means and variances at ancestral nodes can be used to efficiently compute the gradient of the likelihood [154]. The gradient of the likelihood can help speed up inference in many different statistical frameworks [155]. In a phylogenetic context, gradients have been used to improve maximum likelihood estimation [60,61], Bayesian estimation through Hamiltonian Monte Carlo (HMC) approaches [4,58,59], or variational Bayes approximations [156]. Although automatic differentiation can be used on trees for some models [157], direct computations of the gradient using BP-like algorithms have been shown to be more efficient in some contexts [158]. After recalling Fisher’s identity to calculate gradients after BP calibration, we illustrate its use on the BM model (univariate or multivariate) where it allows for the derivation of a new analytical formula for the REML parameter estimates.

(i). Gradient computation with Fisher’s identity

In a phylogenetic context, latent variables are usually internal nodes, while observed variables are leaves. We write $Y=\{X_{v,j} : \text{trait } j \text{ observed at } v\in V\}$ for the set of observed variables. Fisher’s identity provides a way to link the gradient of the log-likelihood of the data $\mathrm{LL}(\theta)=\log p_\theta(Y)$ at parameter $\theta$, with the distribution of all the variables conditional on the observations Y. We refer to Cappé et al. [159, ch. 10] or Barber [155, ch. 11] for general introductions on Markov models with latent variables. Under broad assumptions, Fisher’s identity states (see proposition 10.1.6 in [159], or §11.6 in [155]):

$$\nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\big|_{\theta'=\theta} = \mathbb{E}_{\theta}\left[\nabla_{\theta'}\left[\log p_{\theta'}(X_v; v\in V)\right]\big|_{\theta'=\theta} \,\Big|\, Y\right],$$

where $\nabla_{\theta'}[f(\theta')]|_{\theta'=\theta}$ denotes the gradient of f with respect to the generic parameters $\theta'$, evaluated at $\theta'=\theta$, and $\mathbb{E}_{\theta}[\,\cdot\,|Y]$ the expectation conditional on the observed data under the model parametrized by $\theta$, which is precisely where the output from BP can be used. Plugging in the factor decomposition from the graphical model (4.1), we get

$$\nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\big|_{\theta'=\theta} = \sum_{v\in V}\mathbb{E}_{\theta}\left[\nabla_{\theta'}\left[\log \phi_v(X_v \mid X_u, \theta'; u\in\mathrm{pa}(v))\right]\big|_{\theta'=\theta} \,\Big|\, Y\right]. \qquad (6.1)$$

While (6.1) applies to the full vector of all model parameters, it can also be applied to take the gradient with respect to a single parameter θ of interest, keeping the other parameters fixed. For instance, we can focus on one rate matrix Σ of a BM model, or one primary optimum of an OU model. Special care needs to be taken for gradients with respect to structured matrices, such as variance matrices that need to be symmetric (e.g. [59]) or with a sparse inverse under structural equation modelling for high-dimensional traits [160].

For models where the conditional expectation of the factor in (6.1) has a simple form, this formula is the key to an efficient gradient computation. In particular, for discrete traits as in example 4.2, the expectation becomes a sum of a manageable number of terms, local to a cluster, weighted by the normalized cluster belief after calibration [26, ch. 19].

(ii). Gradient computation for linear Gaussian models

For linear Gaussian models (3.1), log-factors can be written as quadratic forms (4.3), so their derivatives have analytical formulas (see electronic supplementary material, §D). The conditional expectation in (6.1) then only depends on the joint first and second-order moments of the variables (Xv,Xpa(v)) in a cluster, which are known as soon as the beliefs are calibrated. When the graph is a tree, Bastide et al. [59] exploited this formula to derive gradients in the general linear Gaussian case. However, they did not use the complete factor decomposition (4.1), but instead an ad hoc decomposition that only works when the graph is a (binary) tree, and exploits the split partitions defined by the tree. By contrast, the present approach gives a recipe for the efficient gradient computation of any linear Gaussian model on any network, as soon as beliefs are calibrated.
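As a sanity check of (6.1) in the simplest linear Gaussian setting, the Python sketch below considers a toy path that is not taken from the article: a root with known value mu, one internal node z at distance t1 below it, and one observed tip y at distance t2 below z, under a BM with rate sigma2. The conditional mean and variance of z given y, which calibrated beliefs would provide on a real network, are written here in closed form. All names are hypothetical. The gradient obtained from these conditional moments is compared with a finite-difference gradient of the marginal log-likelihood.

```python
import math

def loglik(y, mu, sigma2, t1, t2):
    """Marginal log-density of the tip: y ~ N(mu, sigma2 * (t1 + t2))."""
    var = sigma2 * (t1 + t2)
    return -0.5 * math.log(2 * math.pi * var) - (y - mu) ** 2 / (2 * var)

def fisher_gradient(y, mu, sigma2, t1, t2):
    """Gradient in (mu, sigma2) via Fisher's identity, from moments of z given y."""
    # conditional mean and variance of the internal node z given the tip y
    m = mu + t1 / (t1 + t2) * (y - mu)
    s = sigma2 * t1 * t2 / (t1 + t2)
    # conditional second moments of the increments along the two edges
    e1 = (m - mu) ** 2 + s            # E[(z - mu)^2 | y]
    e2 = (y - m) ** 2 + s             # E[(y - z)^2 | y]
    # expected gradients of the two log-factors, summed as in (6.1)
    d_mu = (m - mu) / (sigma2 * t1)
    d_sigma2 = -1.0 / sigma2 + (e1 / t1 + e2 / t2) / (2 * sigma2 ** 2)
    return d_mu, d_sigma2

def finite_difference_gradient(y, mu, sigma2, t1, t2, eps=1e-6):
    """Central finite differences of the marginal log-likelihood."""
    d_mu = (loglik(y, mu + eps, sigma2, t1, t2)
            - loglik(y, mu - eps, sigma2, t1, t2)) / (2 * eps)
    d_sigma2 = (loglik(y, mu, sigma2 + eps, t1, t2)
                - loglik(y, mu, sigma2 - eps, t1, t2)) / (2 * eps)
    return d_mu, d_sigma2

args = (1.3, 0.2, 0.8, 0.5, 1.5)      # hypothetical y, mu, sigma2, t1, t2
print(fisher_gradient(*args))
print(finite_difference_gradient(*args))
```

The two printed gradients should agree up to finite-difference error. On a network, only the source of the conditional moments changes: they are read from the calibrated belief of a cluster containing each node family.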

In the special case where the process is a homogeneous BM (univariate or multivariate) on a network with a weighted-average merging rule (3.2), a constant rate Σ, no missing data at the tips and, if present, within-species variation that is proportional to Σ, then the gradient with respect to Σ takes a particularly simple form. Setting this gradient to zero, we find an analytical formula for the REML estimate of Σ and for the ML estimate of the ancestral mean μρ (electronic supplementary material, §D.3). In a phylogenetic regression setting, a similar formula can be found for the ML estimate of coefficients (electronic supplementary material, §D.4). Efficient algorithms such as IC and phylolm already exist to compute these quantities on a tree in a single traversal. Here, our formulas need two traversals but remain linear in the number of tips, and because they rely on a general BP formulation, they apply to networks with reticulations. Fisher’s identity and BP hence offer a general method for gradient computation and could lead to analytical formulas for other simple models. Such efficient formulas could alleviate numerical instabilities observed in software such as mvSLOUCH, which experienced a significant failure rate for the BM on trees with a large number of traits [161].

(iii). Hessian computation with Louis’s identity

Second-order derivatives can improve both maximum likelihood and Bayesian statistical inference methods. Efficient Hessian computation methods have been developed for branch lengths in sequence evolution models [60,61], and recently for continuous trait model parameters on trees [162]. Using BP, the Hessian of the log-likelihood with respect to the parameters can also be obtained as a conditional expectation of the Hessian of the complete log-likelihood:

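$$\left\{\nabla^2_{\theta'}\left[\log p_{\theta'}(Y)\right] + \nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\left[\nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\right]^{\top}\right\}\Big|_{\theta'=\theta} = \mathbb{E}_{\theta}\left[\left\{\nabla^2_{\theta'}\left[\log p_{\theta'}(X_v; v\in V)\right] + \nabla_{\theta'}\left[\log p_{\theta'}(X_v; v\in V)\right]\left[\nabla_{\theta'}\left[\log p_{\theta'}(X_v; v\in V)\right]\right]^{\top}\right\}\Big|_{\theta'=\theta} \,\Big|\, Y\right].$$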

This so-called Louis identity [159] also simplifies under the factor decomposition (4.1), and leads to tractable formulas in simple Gaussian or discrete cases.
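A one-parameter toy case, which is not taken from the article, may help to see how the identity works. Assume a latent variable $Z\sim\mathcal{N}(\mu,1)$ and an observation $Y\mid Z\sim\mathcal{N}(Z,1)$, with $\mu$ the only parameter, so that marginally $Y\sim\mathcal{N}(\mu,2)$ and $Z\mid Y\sim\mathcal{N}((\mu+Y)/2,\,1/2)$.

```latex
\begin{align*}
% left-hand side: marginal log-likelihood, log p_mu(Y) = -(Y-mu)^2/4 + const
\partial^2_{\mu}\log p_{\mu}(Y) + \bigl(\partial_{\mu}\log p_{\mu}(Y)\bigr)^2
  &= -\tfrac{1}{2} + \tfrac{(Y-\mu)^2}{4}, \\
% right-hand side: complete log-likelihood, log p_mu(Z,Y) = -(Z-mu)^2/2 - (Y-Z)^2/2 + const
\mathbb{E}_{\mu}\Bigl[\partial^2_{\mu}\log p_{\mu}(Z,Y)
  + \bigl(\partial_{\mu}\log p_{\mu}(Z,Y)\bigr)^2 \,\Big|\, Y\Bigr]
  &= -1 + \mathbb{E}_{\mu}\bigl[(Z-\mu)^2 \mid Y\bigr] \\
  &= -1 + \tfrac{(Y-\mu)^2}{4} + \tfrac{1}{2}
   = -\tfrac{1}{2} + \tfrac{(Y-\mu)^2}{4}.
\end{align*}
```

Both sides agree, and the right-hand side only requires the conditional mean and variance of the latent variable, which is the kind of output that BP provides after calibration.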

(c). Belief propagation for direct Bayesian parameter inference

Likelihood or gradient-based approaches require careful analytical computations to get exact formulas in any new model within the class of linear Gaussian graphical models, depending on the parameters of interest [59]. One way to alleviate this problem is to use a Bayesian framework, and expand the graphical model to include both the phylogenetic network and the evolutionary parameters, which are seen as random variables themselves (e.g. as in [92]). Then, inferring parameters amounts to learning their conditional distribution in this larger graphical model. In this setting, the output of interest from BP is not the likelihood but the distribution of random variables (evolutionary parameters primarily) conditional on the observed data.

Exact computation may not be possible in this extended graphical model, because it is typically not linear Gaussian and the graph’s treewidth can be much larger than that of the phylogenetic network when one parameter (e.g. the evolutionary rate) affects multiple node families. Therefore, approximations may need to be used. For example, ‘black box’ optimization techniques rely on variational approaches to reach a tractable approximation of the posterior distribution of model parameters [116]. The conditional distribution of unobserved variables, provided by BP, facilitates the noisy approximation of the variational gradient that can be used to speed up the optimization of the variational Bayes approximation.

7. Challenges and extensions

(a). Degeneracy

While our implementation provides a proof-of-concept, various technical challenges still need to be solved. Much of the literature on BP focuses on factor graphs, which failed to converge for one of our example phylogenetic networks. More work is needed to better understand the convergence and accuracy of alternative cluster graphs, and on other choices that can substantially affect loopy BP’s efficiency, such as scheduling. Below, we focus on implementation challenges due to degeneracies.

For the message μ̃i→j to be well-defined in step 1 of Gaussian BP, the belief of the sending cluster must have a precision matrix K in (4.3) with a full-rank submatrix with respect to the variables to be integrated out (KI in algorithm 2). This condition can fail under realistic phylogenetic models, due to two different types of degeneracy.

The first type arises from deterministic factors: when Vv=0 in (3.1) and Xv is determined by the states at parent nodes Xpa(v) without noise, e.g. when all of v’s parent branches have length 0 in standard phylogenetic models. This is expected at hybridization events when both parents have sampled descendants in the phylogeny because the parents and hybrid need to be contemporary of one another. This situation is also common in admixture graphs [11] due to a lack of identifiability of hybrid edge lengths from f statistics, leading to a ‘zipped-up’ estimated network in which the estimable composite length parameter is assigned to the hybrid’s child edge [131]. With this degeneracy, Xv has infinite precision given its parents, that is, K has some infinite values. The complications are technical, but not numerical. For example, one can use a generalized canonical form that includes a Dirac distribution to capture the deterministic equation of Xv given Xpa(v) from (3.1). Then BP operations need to be extended to these generalized canonical forms, as done in Schoeman et al. [163] (illustrated in electronic supplementary material, §F). One could also modify the network by contracting internal tree edges of length 0. At hybrid nodes, adding a small variance to Vv would be an approximate yet biologically realistic approach.

The second type of degeneracy arises when the precision submatrix KI is finite but not of full rank. In phylogenetic models, this is frequent at initialization (4.2). For example, consider a cluster of three nodes: a hybrid v and its two parents. By (4.4) we see that rank(Kv)≤p. So at initialization with belief ϕv, KI is degenerate if we seek to integrate out |I|=2 nodes, which would occur if the cluster is adjacent to a sepset containing only one parent of v. This situation is typical of factor graphs. Initial beliefs would also be degenerate with K=0 for any cluster that is not assigned any factor by (4.2). This may occur if there are more clusters than node families, or if the graph has nested redundant clusters (e.g. from join-graph structuring). In some cases, a schedule may avoid these degeneracies, guaranteeing a well-defined message at each BP update. On a clique tree, a schedule based on a post-order traversal has this guarantee, provided that all p traits are observed at all leaves. But generally, it is unclear how to find such a schedule. Another approach is to simply skip a BP update if its message is ill-defined, though there is no guarantee that the sending cluster will eventually have a well-behaved belief to pass the message later. A robust option is to regularize cluster beliefs, right after initialization (4.2) or during BP, by increasing some diagonal elements of K to make KI of full rank. To maintain the probability model, this cluster belief regularization is balanced by a similar modification to a corresponding sepset. Electronic supplementary material, §E, describes two such approaches that appear to work well in practice, although theoretical guarantees have not been established.
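The rank deficiency in the hybrid example can be checked numerically. The Python sketch below assumes a single trait (p=1) and a factor of the form Xv | Xu1, Xu2 ∼ N(γ1 Xu1 + γ2 Xu2, V), whose precision matrix over (v, u1, u2) is the rank-1 matrix a aᵀ/V with a=(1, −γ1, −γ2). It then adds a small constant eps to two diagonal entries to show how a diagonal increase restores full rank of KI. The value of eps and the function names are arbitrary choices made for illustration. This sketch does not reproduce algorithms R1 and R2 of the electronic supplementary material, and it omits the balancing modification of the sepset belief.

```python
def hybrid_factor_precision(gamma1, gamma2, variance):
    """Precision matrix of the factor for a hybrid v with parents u1, u2 (one trait).

    Variables are ordered (v, u1, u2); the matrix is a a^T / variance with
    a = (1, -gamma1, -gamma2), so it has rank 1.
    """
    a = (1.0, -gamma1, -gamma2)
    return [[ai * aj / variance for aj in a] for ai in a]

def det2(K, i, j):
    """Determinant of the 2x2 principal submatrix of K on indices i and j."""
    return K[i][i] * K[j][j] - K[i][j] * K[j][i]

K = hybrid_factor_precision(0.4, 0.6, 0.5)

# Sending a message to a sepset that contains only u2 means integrating out
# v and u1, which needs the submatrix K_I on indices (0, 1) to be invertible.
det_before = det2(K, 0, 1)

# Illustrative regularization: increase the two diagonal entries of K_I by eps.
eps = 1e-3
K_reg = [row[:] for row in K]
for i in (0, 1):
    K_reg[i][i] += eps
det_after = det2(K_reg, 0, 1)

print(det_before, det_after)
```

Before regularization the determinant is zero up to rounding, so the message is ill-defined. After the diagonal increase it is strictly positive.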

(b). Loopy belief propagation is promising for discrete traits

We focused on Gaussian models in this paper, for which the ‘lazy’ matrix approach is polynomial. For discrete trait models, the computational gains from loopy BP can be much greater because alternative approaches are not polynomial in general networks. For a trait with c states (c=2 for a binary trait as in example 4.2), passing a message has complexity O(c^k) where k is the sending cluster size. Thus, cluster graphs with small clusters can bring exponential computational gains. Even exact BP can bring significant computational gains to existing approaches that rely on other means to reduce complexity. For example, the model without ILS used in Allen-Savietta and Lutteropp et al. [65,66] is a mixture model, so the network likelihood can be calculated as a weighted average of tree likelihoods for which exact BP takes linear time. This approach scales exponentially with h because there are typically O(2^h) trees displayed in a network. By contrast, the complexity of BP on a clique tree of maximum clique size k is O(nc^k), thus parametrized by the treewidth t of the moralized network instead of h (t=k−1 for an optimal clique tree). Given our empirical evidence that t grows more slowly than h or the network’s level in biologically realistic networks (figure 5), exact BP could achieve significant computational gains and loopy BP substantially more.
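The orders of magnitude above can be compared with a few lines of Python. The sketch ignores all constants and uses hypothetical values for n, h, c and k, so the ratio it prints is only indicative of how the two approaches scale.

```python
def displayed_trees_cost(n, h):
    """Order of magnitude for averaging over about 2**h displayed trees, linear time each."""
    return n * 2 ** h

def clique_tree_cost(n, c, k):
    """Order of magnitude for exact BP on a clique tree: n * c**k."""
    return n * c ** k

# hypothetical network: 40 tips, 20 hybrids, binary trait, maximum clique size 6
n, h, c, k = 40, 20, 2, 6
ratio = displayed_trees_cost(n, h) // clique_tree_cost(n, c, k)
print(ratio)
```

With these values the ratio is 2^(20−6), and it doubles with every additional hybrid node as long as the maximum clique size k stays fixed.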

A BP approach is already used in momi2 by Kamm et al. [72], who use a clique tree built from a node ordering by age from youngest to oldest, to get conditional likelihoods of the derived allele count under a Moran model (without mutation). The mutation-with-ILS model in SnappNet can also be reframed as a graphical model on a graph expanded from the phylogenetic network (as shown in example 4.3 and electronic supplementary material, §A). Accordingly, the BP-like algorithm in Rabier et al. [70] has complexity controlled by the network’s scanwidth, a parameter introduced by Berry et al. [164]. Using regular BP on more optimal clique trees and loopy BP on cluster graphs may help speed up computations even more.

Also related is the algorithm in Scornavacca & Weller [128], which uses a clique tree to solve a parsimony problem. In this non-probabilistic setting, it is unclear how cluster graphs could be leveraged to speed up algorithms as they do in loopy BP.

To deal with computational intractability, the most widely used probabilistic methods to infer networks from DNA sequences are based on composite likelihoods [8,9] or summary statistics like f statistics [11,12], leading to a lack of identifiability for parts of the network topology and some of its parameters [9,129,131,165–167]. These identifiability issues should be alleviated if using the full data becomes tractable thanks to exact or loopy BP.

Contributor Information

Benjamin Teo, Email: bteo@wisc.edu.

Paul Bastide, Email: paul.bastide@umontpellier.fr.

Cécile Ané, Email: cecile.ane@wisc.edu.

Ethics

This work did not require ethical approval from a human subject or animal welfare committee.

Data accessibility

Code to reproduce figures is archived at [168] and available as a GitHub repository [169]. It uses a Julia package for Gaussian BP on phylogenetic networks, version 0.0.1, which is archived at [170] and available as a GitHub repository [171].

Supplementary material is available online [172].

Declaration of AI use

We have not used AI-assisted technologies in creating this article.

Authors’ contributions

B.T.: conceptualization, data curation, formal analysis, investigation, methodology, software, visualization, writing—original draft, writing—review and editing; P.B.: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, software, visualization, writing—original draft, writing—review and editing; C.A.: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, software, visualization, writing—original draft, writing—review and editing.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

This work was supported in part by the National Science Foundation (DMS 2023239 to C.A.) and by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. C.A. visited P.B. at the University of Montpellier thanks to support from the I-SITE MUSE through the Key Initiative 'Data and Life Sciences'.

References

  • 1. Pybus OG, et al. 2012. Unifying the spatial epidemiology and molecular evolution of emerging epidemics. Proc. Natl Acad. Sci. USA 109, 15066–15071. (doi:10.1073/pnas.1206598109)
  • 2. Dellicour S, et al. 2020. Epidemiological hypothesis testing using a phylogeographic and phylodynamic framework. Nat. Commun. 11, 5620. (doi:10.1038/s41467-020-19122-z)
  • 3. Cybis GB, Sinsheimer JS, Bedford T, Mather AE, Lemey P, Suchard MA. 2015. Assessing phenotypic correlation through the multivariate phylogenetic latent liability model. Ann. Appl. Stat. 9, 969–991. (doi:10.1214/15-AOAS821)
  • 4. Zhang Z, Nishimura A, Bastide P, Ji X, Payne RP, Goulder P, Lemey P, Suchard MA. 2021. Large-scale inference of correlation among mixed-type biological traits with phylogenetic multivariate probit models. Ann. Appl. Stat. 15, 230–251. (doi:10.1214/20-aoas1394)
  • 5. Zhang Z, Nishimura A, Trovão NS, Cherry JL, Holbrook AJ, Ji X, Lemey P, Suchard MA. 2023. Accelerating Bayesian inference of dependency between mixed-type biological traits. PLoS Comput. Biol. 19, e1011419. (doi:10.1371/journal.pcbi.1011419)
  • 6. Revell LJ. 2014. Ancestral character estimation under the threshold model from quantitative genetics. Evolution 68, 743–759. (doi:10.1111/evo.12300)
  • 7. Lartillot N. 2014. A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics 30, 488–496. (doi:10.1093/bioinformatics/btt707)
  • 8. Yu Y, Nakhleh L. 2015. A maximum pseudo-likelihood approach for phylogenetic networks. BMC Genom. 16, S10. (doi:10.1186/1471-2164-16-S10-S10)
  • 9. Solís-Lemus C, Ané C. 2016. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 12, e1005896. (doi:10.1371/journal.pgen.1005896)
  • 10. Neureiter N, Ranacher P, Efrat-Kowalsky N, Kaiping GA, Weibel R, Widmer P, Bouckaert RR. 2022. Detecting contact in language trees: a Bayesian phylogenetic model with horizontal transfer. Humanit. Soc. Sci. Commun. 9, 205. (doi:10.1057/s41599-022-01211-7)
  • 11. Maier R, Flegontov P, Flegontova O, Işıldak U, Changmai P, Reich D. 2023. On the limits of fitting complex models of population history to F-statistics. eLife 12, e85492. (doi:10.7554/eLife.85492)
  • 12. Nielsen SV, Vaughn AH, Leppälä K, Landis MJ, Mailund T, Nielsen R. 2023. Bayesian inference of admixture graphs on Native American and Arctic populations. PLoS Genet. 19, e1010410. (doi:10.1371/journal.pgen.1010410)
  • 13. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376. (doi:10.1007/BF01734359)
  • 14. FitzJohn RG. 2012. Diversitree: comparative phylogenetic analyses of diversification in R. Methods Ecol. Evol. 3, 1084–1092. (doi:10.1111/j.2041-210x.2012.00234.x)
  • 15. Freckleton RP. 2012. Fast likelihood calculations for comparative analyses. Methods Ecol. Evol. 3, 940–947. (doi:10.1111/j.2041-210x.2012.00220.x)
  • 16. Ho LT, Ané C. 2014. A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Syst. Biol. 63, 397–408. (doi:10.1093/sysbio/syu005)
  • 17. Goolsby EW, Bruggeman J, Ané C. 2017. Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within‐species variation. Methods Ecol. Evol. 8, 22–27. (doi:10.1111/2041-210x.12612)
  • 18. Mitov V, Bartoszek K, Asimomitis G, Stadler T. 2020. Fast likelihood calculation for multivariate Gaussian phylogenetic models with shifts. Theor. Popul. Biol. 131, 66–78. (doi:10.1016/j.tpb.2019.11.005)
  • 19. Moran BM, Payne C, Langdon Q, Powell DL, Brandvain Y, Schumer M. 2021. The genomic consequences of hybridization. eLife 10, e69016. (doi:10.7554/eLife.69016)
  • 20. Kong S, Pons JC, Kubatko L, Wicke K. 2022. Classes of explicit phylogenetic networks and their biological and mathematical significance. J. Math. Biol. 84, 47. (doi:10.1007/s00285-022-01746-y)
  • 21. Clavel J, Escarguel G, Merceron G. 2015. mvMORPH: an R package for fitting multivariate evolutionary models to morphometric data. Methods Ecol. Evol. 6, 1311–1319. (doi:10.1111/2041-210x.12420)
  • 22. Bartoszek K, Tredgett Clarke J, Fuentes‐González J, Mitov V, Pienaar J, Piwczyński M, Puchałka R, Spalik K, Voje KL. 2024. Fast mvSLOUCH: multivariate Ornstein–Uhlenbeck‐based models of trait evolution on large phylogenies. Methods Ecol. Evol. 15, 1507–1515. (doi:10.1111/2041-210x.14376)
  • 23. Felsenstein J. 2005. Using the quantitative genetic threshold model for inferences between and within species. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 1427–1434. (doi:10.1098/rstb.2005.1669)
  • 24. Goldberg EE, Foo J. 2020. Memory in trait macroevolution. Am. Nat. 195, 300–314. (doi:10.1086/705992)
  • 25. Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. 2012. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol. Biol. Evol. 29, 1917–1932. (doi:10.1093/molbev/mss086)
  • 26. Koller D, Friedman N. 2009. Probabilistic graphical models: principles and techniques. Cambridge, MA: MIT Press.
  • 27. Felsenstein J. 1973. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am. J. Hum. Genet. 25, 471–492.
  • 28. Hilden J. 1970. GENEX—an algebraic approach to pedigree probability calculus. Clin. Genet. 1, 319–348. (doi:10.1111/j.1399-0004.1970.tb02252.x)
  • 29. Elston RC, Stewart J. 1971. A general model for the genetic analysis of pedigree data. Hum. Hered. 21, 523–542. (doi:10.1159/000152448)
  • 30. Heuch I, Li FHF. 1972. PEDIG—a computer program for calculation of genotype probabilities using phenotype information. Clin. Genet. 3, 501–504. (doi:10.1111/j.1399-0004.1972.tb01488.x)
  • 31. Cannings C, Thompson EA, Skolnick MH. 1978. Probability functions on complex pedigrees. Adv. Appl. Probab. 10, 26–61. (doi:10.2307/1426718)
  • 32. Lange K. 2002. Statistics for biology and health, 2nd edn. New York, NY: Springer.
  • 33. Hansen TF. 1997. Stabilizing selection and the comparative analysis of adaptation. Evolution 51, 1341–1351. (doi:10.1111/j.1558-5646.1997.tb01457.x)
  • 34. Felsenstein J. 1985. Phylogenies and the comparative method. Am. Nat. 125, 1–15. (doi:10.1086/284325)
  • 35. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313. (doi:10.1093/bioinformatics/btu033)
  • 36. Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274. (doi:10.1093/molbev/msu300)
  • 37. Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574. (doi:10.1093/bioinformatics/btg180)
  • 38. Revell LJ. 2012. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol. Evol. 3, 217–223. (doi:10.1111/j.2041-210x.2011.00169.x)
  • 39. Pagel M, Meade A, Barker D. 2004. Bayesian estimation of ancestral character states on phylogenies. Syst. Biol. 53, 673–684. (doi:10.1080/10635150490522232)
  • 40. Boyko JD, Beaulieu JM. 2021. Generalized hidden Markov models for phylogenetic comparative datasets. Methods Ecol. Evol. 12, 468–478. (doi:10.1111/2041-210x.13534)
  • 41. Boyko JD, O’Meara BC, Beaulieu JM. 2023. A novel method for jointly modeling the evolution of discrete and continuous traits. Evol. Int. J. Org. Evol. 77, 836–851. (doi:10.1093/evolut/qpad002)
  • 42. Höhna S, Landis MJ, Heath TA, Boussau B, Lartillot N, Moore BR, Huelsenbeck JP, Ronquist F. 2016. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Syst. Biol. 65, 726–736. ( 10.1093/sysbio/syw021) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Hedrick BP. 2023. Dots on a screen: the past, present, and future of morphometrics in the study of nonavian dinosaurs. Anat. Rec. 306, 1896–1917. ( 10.1002/ar.25183) [DOI] [PubMed] [Google Scholar]
  • 44. Dunn CW, Luo X, Wu Z. 2013. Phylogenetic analysis of gene expression. Integr. Comp. Biol. 53, 847–856. ( 10.1093/icb/ict068) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Shafer MER. 2019. Cross-species analysis of single-cell transcriptomic data. Front. Cell Dev. Biol. 7, 175. ( 10.3389/fcell.2019.00175) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Jetz W, Thomas GH, Joy JB, Hartmann K, Mooers AO. 2012. The global diversity of birds in space and time. Nature 491, 444–448. ( 10.1038/nature11631) [DOI] [PubMed] [Google Scholar]
  • 47. Upham NS, Esselstyn JA, Jetz W. 2019. Inferring the mammal tree: species-level sets of phylogenies for questions in ecology, evolution, and conservation. PLoS Biol. 17, e3000494. ( 10.1371/journal.pbio.3000494) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. De Maio N, Kalaghatgi P, Turakhia Y, Corbett-Detig R, Minh BQ, Goldman N. 2023. Maximum likelihood pandemic-scale phylogenetics. Nat. Genet. 55, 746–752. ( 10.1038/s41588-023-01368-0) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Blanquart F, et al. 2017. Viral genetic variation accounts for a third of variability in HIV-1 set-point viral load in Europe. PLoS Biol. 15, e2001855. ( 10.1371/journal.pbio.2001855) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Hassler G, Tolkoff MR, Allen WL, Ho LST, Lemey P, Suchard MA. 2022. Inferring phenotypic trait evolution on large trees with many incomplete measurements. J. Am. Stat. Assoc. 117, 678–692. ( 10.1080/01621459.2020.1799812) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Adams DC, Collyer ML. 2018. Multivariate phylogenetic comparative methods: evaluations, comparisons, and recommendations. Syst. Biol. 67, 14–31. ( 10.1093/sysbio/syx055) [DOI] [PubMed] [Google Scholar]
  • 52. Jhwueng DC, O’Meara BC. 2020. On the matrix condition of phylogenetic tree. Evol. Bioinform. Online 16, 1176934320901721. ( 10.1177/1176934320901721) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Bartoszek K, Fuentes-González J, Mitov V, Pienaar J, Piwczyński M, Puchałka R, Spalik K, Voje KL. 2023. Model selection performance in phylogenetic comparative methods under multivariate Ornstein–Uhlenbeck models of trait evolution. Syst. Biol. 72, 275–293. ( 10.1093/sysbio/syac079) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Caetano DS, Harmon LJ. 2019. Estimating correlated rates of trait evolution with uncertainty. Syst. Biol. 68, 412–429. ( 10.1093/sysbio/syy067) [DOI] [PubMed] [Google Scholar]
  • 55. Bartoszek K, Pienaar J, Mostad P, Andersson S, Hansen TF. 2012. A phylogenetic comparative method for studying multivariate adaptation. J. Theor. Biol. 314, 204–215. ( 10.1016/j.jtbi.2012.08.005) [DOI] [PubMed] [Google Scholar]
  • 56. Hassler GW, Magee A, Zhang Z, Baele G, Lemey P, Ji X, Fourment M, Suchard MA. 2023. Data integration in Bayesian phylogenetics. Annu. Rev. Stat. Its Appl. 10, 353–377. ( 10.1146/annurev-statistics-033021-112532) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Bastide P, Ané C, Robin S, Mariadassou M. 2018. Inference of adaptive shifts for multivariate correlated traits. Syst. Biol. 67, 662–680. ( 10.1093/sysbio/syy005) [DOI] [PubMed] [Google Scholar]
  • 58. Fisher AA, Ji X, Zhang Z, Lemey P, Suchard MA. 2021. Relaxed random walks at scale. Syst. Biol. 70, 258–267. ( 10.1093/sysbio/syaa056) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Bastide P, Ho LST, Baele G, Lemey P, Suchard MA. 2021. Efficient Bayesian inference of general Gaussian models on large phylogenetic trees. Ann. Appl. Stat. 15, S1419. ( 10.1214/20-aoas1419) [DOI] [Google Scholar]
  • 60. Felsenstein J, Churchill GA. 1996. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13, 93–104. ( 10.1093/oxfordjournals.molbev.a025575) [DOI] [PubMed] [Google Scholar]
  • 61. Ji X, Zhang Z, Holbrook A, Nishimura A, Baele G, Rambaut A, Lemey P, Suchard MA. 2020. Gradients do grow on trees: a linear-time O(N)-dimensional gradient for statistical phylogenetics. Mol. Biol. Evol. 37, 3047–3060. ( 10.1093/molbev/msaa130) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Tolkoff MR, Alfaro ME, Baele G, Lemey P, Suchard MA. 2018. Phylogenetic factor analysis. Syst. Biol. 67, 384–399. ( 10.1093/sysbio/syx066) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Hassler GW, Gallone B, Aristide L, Allen WL, Tolkoff MR, Holbrook AJ, Baele G, Lemey P, Suchard MA. 2022. Principled, practical, flexible, fast: a new approach to phylogenetic factor analysis. Methods Ecol. Evol. 13, 2181–2197. ( 10.1111/2041-210X.13920) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Ignatieva A, Hein J, Jenkins PA. 2022. Ongoing recombination in SARS-CoV-2 revealed through genealogical reconstruction. Mol. Biol. Evol. 39, msac028. ( 10.1093/molbev/msac028) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Allen-Savietta C. 2020. Estimating phylogenetic networks from concatenated sequence alignments. PhD thesis, University of Wisconsin-Madison. https://digital.library.wisc.edu/1711.dl/K5RXYJLMUO4OM86. [Google Scholar]
  • 66. Lutteropp S, Scornavacca C, Kozlov AM, Morel B, Stamatakis A. 2022. NetRAX: accurate and fast maximum likelihood phylogenetic network inference. Bioinformatics 38, 3725–3733. ( 10.1093/bioinformatics/btac396) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Karimi N, Grover CE, Gallagher JP, Wendel JF, Ané C, Baum DA. 2020. Reticulate evolution helps explain apparent homoplasy in floral biology and pollination in baobabs (Adansonia; Bombacoideae; Malvaceae). Syst. Biol. 69, 462–478. ( 10.1093/sysbio/syz073) [DOI] [PubMed] [Google Scholar]
  • 68. Kingman JFC. 1982. On the genealogy of large populations. J. Appl. Probab. 19, 27–43. ( 10.2307/3213548) [DOI] [Google Scholar]
  • 69. Stoltz M, Baeumer B, Bouckaert R, Fox C, Hiscott G, Bryant D. 2021. Bayesian inference of species trees using diffusion models. Syst. Biol. 70, 145–161. ( 10.1093/sysbio/syaa051) [DOI] [PubMed] [Google Scholar]
  • 70. Rabier CE, Berry V, Stoltz M, Santos JD, Wang W, Glaszmann JC, Pardi F, Scornavacca C. 2021. On the inference of complex phylogenetic networks by Markov Chain Monte-Carlo. PLoS Comput. Biol. 17, e1008380. ( 10.1371/journal.pcbi.1008380) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Kamm JA, Terhorst J, Song YS. 2017. Efficient computation of the joint sample frequency spectra for multiple populations. J. Comput. Graph. Stat. 26, 182–194. ( 10.1080/10618600.2016.1159212) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Kamm J, Terhorst J, Durbin R, Song YS. 2020. Efficiently inferring the demographic history of many populations with allele count data. J. Am. Stat. Assoc. 115, 1472–1487. ( 10.1080/01621459.2019.1635482) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Solís-Lemus C, Bastide P, Ané C. 2017. PhyloNetworks: a package for phylogenetic networks. Mol. Biol. Evol. 34, 3292–3298. ( 10.1093/molbev/msx235) [DOI] [PubMed] [Google Scholar]
  • 74. Bezanson J, Edelman A, Karpinski S, Shah VB. 2017. Julia: a fresh approach to numerical computing. SIAM Rev. 59, 65–98. ( 10.1137/141000671) [DOI] [Google Scholar]
  • 75. Bastide P, Solís-Lemus C, Kriebel R, William Sparks K, Ané C. 2018. Phylogenetic comparative methods on phylogenetic networks with reticulations. Syst. Biol. 67, 800–820. ( 10.1093/sysbio/syy033) [DOI] [PubMed] [Google Scholar]
  • 76. Teo B, Rose J, Bastide P, Ané C. 2023. Accounting for within-species variation in continuous trait evolution on a phylogenetic network. Bull. Soc. Syst. Biol. 2, 1–29. ( 10.18061/bssb.v2i3.8977) [DOI] [Google Scholar]
  • 77. Pickrell JK, Pritchard JK. 2012. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8, e1002967. ( 10.1371/journal.pgen.1002967) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. 2012. Ancient admixture in human history. Genetics 192, 1065–1093. ( 10.1534/genetics.112.145037) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Gautier M, Vitalis R, Flori L, Estoup A. 2022. f‐Statistics estimation and admixture graph construction with Pool‐Seq or allele count data using the R package poolfstat. Mol. Ecol. Resour. 22, 1394–1416. ( 10.1111/1755-0998.13557) [DOI] [PubMed] [Google Scholar]
  • 80. Soraggi S, Wiuf C. 2019. General theory for stochastic admixture graphs and F-statistics. Theor. Popul. Biol. 125, 56–66. ( 10.1016/j.tpb.2018.12.002) [DOI] [PubMed] [Google Scholar]
  • 81. Lipson M. 2020. Applying f‐statistics and admixture graphs: theory and examples. Mol. Ecol. Resour. 20, 1658–1667. ( 10.1111/1755-0998.13230) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Racimo F, Berg JJ, Pickrell JK. 2018. Detecting polygenic adaptation in admixture graphs. Genetics 208, 1565–1584. ( 10.1534/genetics.117.300489) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Refoyo-Martínez A, da Fonseca RR, Halldórsdóttir K, Árnason E, Mailund T, Racimo F. 2019. Identifying loci under positive selection in complex population histories. Genome Res. 29, 1506–1520. ( 10.1101/gr.246777.118) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Harmon LJ, et al. 2010. Early bursts of body size and shape evolution are rare in comparative data. Evol. Int. J. Org. Evol. 64, 2385–2396. ( 10.1111/j.1558-5646.2010.01025.x) [DOI] [PubMed] [Google Scholar]
  • 85. Blomberg SP, Garland T, Ives AR. 2003. Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evol. Int. J. Org. Evol. 57, 717–745. ( 10.1111/j.0014-3820.2003.tb00285.x) [DOI] [PubMed] [Google Scholar]
  • 86. Clavel J, Morlon H. 2017. Accelerated body size evolution during cold climatic periods in the Cenozoic. Proc. Natl Acad. Sci. USA 114, 4183–4188. ( 10.1073/pnas.1606868114) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87. Jhwueng DC, O’Meara B. 2015. Trait evolution on phylogenetic networks. bioRxiv. ( 10.1101/023986) [DOI]
  • 88. Drury J, Clavel J, Manceau M, Morlon H. 2016. Estimating the effect of competition on trait evolution using maximum likelihood inference. Syst. Biol. 65, 700–710. ( 10.1093/sysbio/syw020) [DOI] [PubMed] [Google Scholar]
  • 89. Manceau M, Lambert A, Morlon H. 2017. A unifying comparative phylogenetic framework including traits coevolving across interacting lineages. Syst. Biol. 66, 551–568. ( 10.1093/sysbio/syw115) [DOI] [PubMed] [Google Scholar]
  • 90. Bartoszek K, Glémin S, Kaj I, Lascoux M. 2017. Using the Ornstein–Uhlenbeck process to model the evolution of interacting populations. J. Theor. Biol. 429, 35–45. ( 10.1016/j.jtbi.2017.06.011) [DOI] [PubMed] [Google Scholar]
  • 91. Duchen P, Hautphenne S, Lehmann L, Salamin N. 2020. Linking micro and macroevolution in the presence of migration. J. Theor. Biol. 486, 110087. ( 10.1016/j.jtbi.2019.110087) [DOI] [PubMed] [Google Scholar]
  • 92. Höhna S, Heath TA, Boussau B, Landis MJ, Ronquist F, Huelsenbeck JP. 2014. Probabilistic graphical model representation in phylogenetics. Syst. Biol. 63, 753–771. ( 10.1093/sysbio/syu039) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93. Zhang C, Matsen IV FA. 2018. Generalizing tree probability estimation via Bayesian networks. In Advances in neural information processing systems (eds Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R), vol. 31. San Diego, CA: Curran Associates, Inc. [Google Scholar]
  • 94. Zhang C, Matsen IV FA. 2024. A variational approach to Bayesian phylogenetic inference. J. Mach. Learn. Res. 25, 1–56. http://jmlr.org/papers/v25/22-0348.html [Google Scholar]
  • 95. Jun SH, Nasif H, Jennings-Shaffer C, Rich DH, Kooperberg A, Fourment M, Zhang C, Suchard MA, Matsen FA. 2023. A topology-marginal composite likelihood via a generalized phylogenetic pruning algorithm. Algorithms Mol. Biol. 18, 10. ( 10.1186/s13015-023-00235-1) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96. Dumm W, Ralph D, DeWitt W, Vora A, Araki T, Victora G, Matsen F. 2024. Leveraging DAGs to improve context-sensitive and abundance-aware tree estimation. Phil. Trans. R. Soc. B 380, 20230315. ( 10.1098/rstb.2023.0315) [DOI] [PubMed] [Google Scholar]
  • 97. Strimmer K, Moulton V. 2000. Likelihood analysis of phylogenetic networks using directed graphical models. Mol. Biol. Evol. 17, 875–881. ( 10.1093/oxfordjournals.molbev.a026367) [DOI] [PubMed] [Google Scholar]
  • 98. Avise JC, Robinson TJ. 2008. Hemiplasy: a new term in the lexicon of phylogenetics. Syst. Biol. 57, 503–507. ( 10.1080/10635150802164587) [DOI] [PubMed] [Google Scholar]
  • 99. Rannala B, Yang Z. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164, 1645–1656. ( 10.1093/genetics/164.4.1645) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100. Fogg J, Allman ES, Ané C. 2023. PhyloCoalSimulations: a simulator for network multispecies coalescent models, including a new extension for the inheritance of gene flow. Syst. Biol. 72, 1171–1179. ( 10.1093/sysbio/syad030) [DOI] [PubMed] [Google Scholar]
  • 101. Wright S. 1934. The results of crosses between inbred strains of guinea pigs, differing in number of digits. Genetics 19, 537–551. ( 10.1093/genetics/19.6.537) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102. Hiscott G, Fox C, Parry M, Bryant D. 2016. Efficient recycled algorithms for quantitative trait models on phylogenies. Genome Biol. Evol. 8, 1338–1350. ( 10.1093/gbe/evw064) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103. Felsenstein J. 2012. A comparative method for both discrete and continuous characters using the threshold model. Am. Nat. 179, 145–156. ( 10.1086/663681) [DOI] [PubMed] [Google Scholar]
  • 104. Rose DJ. 1972. A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. In Graph theory and computing (ed. Read RC), pp. 183–217. New York, NY: Academic Press. ( 10.1016/B978-1-4832-3187-7.50018-0) [DOI] [Google Scholar]
  • 105. Fishelson M, Geiger D. 2003. Optimizing exact genetic linkage computations. In Proc. 7th Annu. Int. Conf. Research in Computational Molecular Biology, Berlin, Germany, pp. 114–121. ( 10.1145/640075.640089). https://dl.acm.org/doi/proceedings/10.1145/640075. [DOI] [PubMed] [Google Scholar]
  • 106. Blair JRS, Peyton B. 1993. An introduction to chordal graphs and clique trees. In Graph theory and sparse matrix computation (eds George JA, Gilbert JR, Liu JWH), pp. 1–29. New York, NY: Springer. ( 10.1007/978-1-4613-8369-7_1) [DOI] [Google Scholar]
  • 107. Cormen TH, Leiserson CE, Rivest RL, Stein C. 2009. Introduction to algorithms, 3rd edn. Cambridge, MA: MIT Press. ( 10.5555/1614191) [DOI] [Google Scholar]
  • 108. Lazaridis I, et al. 2014. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513, 409–413. ( 10.1038/nature13673) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109. Weiss Y, Freeman W. 1999. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Adv. Neural Inf. Process. Syst. 12. https://proceedings.neurips.cc/paper_files/paper/1999/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf [DOI] [PubMed] [Google Scholar]
  • 110. Wainwright MJ, Jaakkola TS, Willsky AS. 2003. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. Inf. Theory 49, 1120–1146. ( 10.1109/tit.2003.810642) [DOI] [Google Scholar]
  • 111. Malioutov DM, Johnson JK, Willsky AS. 2006. Walk-sums and belief propagation in Gaussian graphical models. J. Mach. Learn. Res. 7, 2031–2064. http://jmlr.org/papers/v7/malioutov06a.html [Google Scholar]
  • 112. Elidan G, McGraw I, Koller D. 2006. Residual belief propagation: informed scheduling for asynchronous message passing. In Proc. 22nd Conf. on Uncertainty in Artificial Intelligence, Cambridge, MA, pp. 165–173. Arlington, VA: AUAI Press. [Google Scholar]
  • 113. Sutton C, McCallum A. 2007. Improved dynamic schedules for belief propagation. In Proc. 23rd Conf. on Uncertainty in Artificial Intelligence, Vancouver, Canada, pp. 376–383. Arlington, VA: AUAI Press. [Google Scholar]
  • 114. Knoll C, Rath M, Tschiatschek S, Pernkopf F. 2015. Message scheduling methods for belief propagation. pp. 295–310. Porto, Portugal: Springer International Publishing. ( 10.1007/978-3-319-23525-7_18) [DOI] [Google Scholar]
  • 115. Aksenov V, Alistarh D, Korhonen JH. 2020. Scalable belief propagation via relaxed scheduling. Adv. Neural Inf. Process. Syst. 33, 22361–22372. https://proceedings.neurips.cc/paper_files/paper/2020/file/fdb2c3bab9d0701c4a050a4d8d782c7f-Paper.pdf [Google Scholar]
  • 116. Ranganath R, Gerrish S, Blei D. 2014. Black box variational inference. In Proc. 17th Int. Conf. Artificial Intelligence and Statistics, vol. 33, pp. 814–822, Reykjavik, Iceland: PMLR. [Google Scholar]
  • 117. Yedidia JS, Freeman WT, Weiss Y. 2005. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Inf. Theory 51, 2282–2312. ( 10.1109/tit.2005.850085) [DOI] [Google Scholar]
  • 118. Streicher S, du Preez J. 2017. Graph coloring: comparing cluster graphs to factor graphs. In Proc. ACM Multimedia 2017 Workshop on South African Academic Participation, Mountain View, CA, pp. 35–42. New York, NY: Association for Computing Machinery. ( 10.1145/3132711.3132717) [DOI] [Google Scholar]
  • 119. Mateescu R, Kask K, Gogate V, Dechter R. 2010. Join-Graph propagation algorithms. J. Artif. Intell. Res. 37, 279–328. ( 10.1613/jair.2842) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120. Arnborg S, Corneil DG, Proskurowski A. 1987. Complexity of finding embeddings in a k-tree. SIAM J. Algebr. Discret. Methods 8, 277–284. ( 10.1137/0608024) [DOI] [Google Scholar]
  • 121. Bodlaender HL, Koster AMCA. 2010. Treewidth computations I. Upper bounds. Inf. Comput. 208, 259–275. ( 10.1016/j.ic.2009.03.008) [DOI] [Google Scholar]
  • 122. Strasser B. 2017. Computing tree decompositions with FlowCutter (PACE 2017 Submission). arXiv ( 10.48550/arXiv.1709.08949) [DOI] [Google Scholar]
  • 123. Hamann M, Strasser B. 2018. Graph bisection with pareto optimization. ACM J. Exp. Algorithm. 23, 1–34. ( 10.1145/3173045) [DOI] [Google Scholar]
  • 124. Gusfield D, Bansal V, Bafna V, Song YS. 2007. A decomposition theory for phylogenetic networks and incompatible characters. J. Comput. Biol. 14, 1247–1272. ( 10.1089/cmb.2006.0137) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125. van Iersel L, Kelk S, Rupp R, Huson D. 2010. Phylogenetic networks do not need to be complex: using fewer reticulations to represent conflicting clusters. Bioinformatics 26, i124–131. ( 10.1093/bioinformatics/btq202) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 126. Gambette P, Berry V, Paul C. 2009. The structure of level-k phylogenetic networks. In Annu. Symp. Combinatorial Pattern Matching, pp. 289–300. Berlin, Heidelberg: Springer. ( 10.1007/978-3-642-02441-2_26) [DOI] [Google Scholar]
  • 127. Bodlaender HL. 1998. A partial k-arboretum of graphs with bounded treewidth. Theor. Comput. Sci. 209, 1–45. ( 10.1016/s0304-3975(97)00228-4) [DOI] [Google Scholar]
  • 128. Scornavacca C, Weller M. 2022. Treewidth-based algorithms for the small parsimony problem on networks. Algorithms Mol. Biol. 17, 15. ( 10.1186/s13015-022-00216-w) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 129. Baños H. 2019. Identifying species network features from gene tree quartets under the coalescent model. Bull. Math. Biol. 81, 494–534. ( 10.1007/s11538-018-0485-4) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 130. Gross E, van Iersel L, Janssen R, Jones M, Long C, Murakami Y. 2021. Distinguishing level-1 phylogenetic networks on the basis of data generated by Markov processes. J. Math. Biol. 83, 32. ( 10.1007/s00285-021-01653-8) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131. Xu J, Ané C. 2022. Identifiability of local and global features of phylogenetic networks from average distances. J. Math. Biol. 86, 12. ( 10.1007/s00285-022-01847-8) [DOI] [PubMed] [Google Scholar]
  • 132. Oldman J, Wu T, van Iersel L, Moulton V. 2016. TriLoNet: piecing together small networks to reconstruct reticulate evolutionary histories. Mol. Biol. Evol. 33, 2151–2162. ( 10.1093/molbev/msw068) [DOI] [PubMed] [Google Scholar]
  • 133. Allman ES, Baños H, Rhodes JA. 2019. NANUQ: a method for inferring species networks from gene trees under the coalescent model. Algorithms Mol. Biol. 14, 24. ( 10.1186/s13015-019-0159-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 134. Kong S, Swofford DL, Kubatko LS. 2024. Inference of phylogenetic networks from sequence data using composite likelihood. Syst. Biol. ( 10.1093/sysbio/syae054) [DOI] [PubMed]
  • 135. Semple C, Simpson J. 2018. When is a phylogenetic network simply an amalgamation of two trees? Bull. Math. Biol. 80, 2338–2348. ( 10.1007/s11538-018-0463-x) [DOI] [PubMed] [Google Scholar]
  • 136. Chaplick S, Kelk S, Meuwese R, Mihalák M, Stamoulis G. 2023. Snakes and ladders: a treewidth story. In Graph-theoretic concepts in computer science (eds Paulusma D, Ries B), pp. 187–200. Cham, Switzerland: Springer Nature. ( 10.1007/978-3-031-43380-1_14) [DOI] [Google Scholar]
  • 137. Huson DH, Rupp R, Scornavacca C. 2010. Phylogenetic networks: concepts, algorithms and applications. Cambridge, UK: Cambridge University Press. ( 10.1017/CBO9780511974076) [DOI] [Google Scholar]
  • 138. Huson DH, Rupp R, Berry V, Gambette P, Paul C. 2009. Computing galled networks from real data. Bioinformatics 25, i85–93. ( 10.1093/bioinformatics/btp217) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 139. Lipson M, et al. 2020. Ancient West African foragers in the context of African population history. Nature 577, 665–670. ( 10.1038/s41586-020-1929-1) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 140. Bergström A, et al. 2020. Origins and genetic legacy of prehistoric dogs. Science 370, 557–564. ( 10.1126/science.aba9572) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 141. Librado P, et al. 2021. The origins and spread of domestic horses from the Western Eurasian steppes. Nature 598, 634–640. ( 10.1038/s41586-021-04018-9) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 142. Hajdinjak M, et al. 2021. Initial Upper Palaeolithic humans in Europe had recent Neanderthal ancestry. Nature 592, 253–257. ( 10.1038/s41586-021-03335-3) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 143. Wang CC, et al. 2021. Genomic insights into the formation of human populations in East Asia. Nature 591, 413–419. ( 10.1038/s41586-021-03336-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 144. Sikora M, et al. 2019. The population history of northeastern Siberia since the Pleistocene. Nature 570, 182–188. ( 10.1038/s41586-019-1279-z) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 145. Sun X, et al. 2023. Ancient DNA reveals genetic admixture in China during tiger evolution. Nat. Ecol. Evol. 7, 1914–1929. ( 10.1038/s41559-023-02185-8) [DOI] [PubMed] [Google Scholar]
  • 146. Müller NF, Kistler KE, Bedford T. 2022. A Bayesian approach to infer recombination patterns in coronaviruses. Nat. Commun. 13, 4186. ( 10.1038/s41467-022-31749-8) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 147. Leppälä K, Nielsen SV, Mailund T. 2017. admixturegraph: an R package for admixture graph manipulation and fitting. Bioinformatics 33, 1738–1740. ( 10.1093/bioinformatics/btx048) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 148. Molloy EK, Durvasula A, Sankararaman S. 2021. Advancing admixture graph estimation via maximum likelihood network orientation. Bioinformatics 37, i142–i150. ( 10.1093/bioinformatics/btab267) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 149. Justison JA, Heath TA. 2024. Exploring the distribution of phylogenetic networks generated under a birth-death-hybridization process. Bull. Soc. Syst. Biol. 2, 1–22. ( 10.18061/bssb.v2i3.9285) [DOI] [Google Scholar]
  • 150. Justison JA, Solis‐Lemus C, Heath TA. 2023. SiPhyNetwork: an R package for simulating phylogenetic networks. Methods Ecol. Evol. 14, 1687–1698. ( 10.1111/2041-210x.14116) [DOI] [Google Scholar]
  • 151. Thorson JT, et al. 2023. Identifying direct and indirect associations among traits by merging phylogenetic comparative methods and structural equation models. Methods Ecol. Evol. 14, 1259–1275. ( 10.1111/2041-210x.14076) [DOI] [Google Scholar]
  • 152. Mitov V, Bartoszek K, Stadler T. 2019. Automatic generation of evolutionary hypotheses using mixed Gaussian phylogenetic models. Proc. Natl Acad. Sci. USA 116, 16921–16926. ( 10.1073/pnas.1813823116) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 153. Heskes T, Zoeter O, Wiegerinck W. 2003. Approximate expectation maximization. In Advances in neural information processing systems (eds Thrun S, Saul LK, Schölkopf B), vol. 16. Cambridge, MA: MIT Press. See https://proceedings.neurips.cc/paper_files/paper/2003/file/8208974663db80265e9bfe7b222dcb18-Paper.pdf. [Google Scholar]
  • 154. Salakhutdinov R, Roweis S, Ghahramani Z. 2003. Optimization with EM and expectation-conjugate-gradient. In Proc. 20th Int. Conf. Machine Learning (ICML-03), pp. 672–679. Washington, DC: AAAI Press. [Google Scholar]
  • 155. Barber D. 2012. Bayesian reasoning and machine learning. Cambridge, UK: Cambridge University Press. ( 10.1017/CBO9780511804779) [DOI] [Google Scholar]
  • 156. Fourment M, Darling AE. 2019. Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics. PeerJ 7, e8272. ( 10.7717/peerj.8272) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 157. Swanepoel C, Fourment M, Ji X, Nasif H, Suchard MA, Matsen IV FA, Drummond A. 2022. TreeFlow: probabilistic programming and automatic differentiation for phylogenetics. arXiv e-print. ( 10.48550/arXiv.2211.05220) [DOI]
  • 158. Fourment M, Swanepoel CJ, Galloway JG, Ji X, Gangavarapu K, Suchard MA, Matsen IV FA. 2023. Automatic differentiation is no panacea for phylogenetic gradient computation. Genome Biol. Evol. 15, evad099. ( 10.1093/gbe/evad099) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 159. Cappé O, Moulines E, Rydén T. 2005. Inference in hidden Markov models. New York, NY: Springer. ( 10.1007/0-387-28982-8) [DOI] [Google Scholar]
  • 160. Thorson JT, van der Bijl W. 2023. phylosem: A fast and simple R package for phylogenetic inference and trait imputation using phylogenetic structural equation models. J. Evol. Biol. 36, 1357–1364. ( 10.1111/jeb.14234) [DOI] [PubMed] [Google Scholar]
  • 161. Bartoszek K, Fuentes-González J, Mitov V, Pienaar J, Piwczyński M, Puchałka R, Spalik K, Voje KL. 2024. Analytical advances alleviate model misspecification in non-Brownian multivariate comparative methods. Evol. Int. J. Org. Evol. 78, 389–400. ( 10.1093/evolut/qpad185) [DOI] [PubMed] [Google Scholar]
  • 162. Kiang WHC. 2024. Exact expressions for the log-likelihood’s Hessian in multivariate continuous-time continuous-trait Gaussian evolution along a phylogeny. arXiv. ( 10.48550/arXiv.2405.07394) [DOI]
  • 163. Schoeman JC, van Daalen CE, du Preez JA. 2022. Degenerate Gaussian factors for probabilistic inference. Int. J. Approx. Reason. 143, 159–191. ( 10.1016/j.ijar.2022.01.008) [DOI] [Google Scholar]
  • 164. Berry V, Scornavacca C, Weller M. 2020. Scanning phylogenetic networks is NP-hard. In SOFSEM 2020: Theory and Practice of Computer Science: 46th Int. Conf. Current Trends in Theory and Practice of Informatics, pp. 519–530. Limassol, Cyprus: Springer International Publishing. ( 10.1007/978-3-030-38919-2_42) [DOI] [Google Scholar]
  • 165. Ané C, Fogg J, Allman ES, Baños H, Rhodes JA. 2024. Anomalous networks under the multispecies coalescent: theory and prevalence. J. Math. Biol. 88, 29. ( 10.1007/s00285-024-02050-7) [DOI] [PubMed] [Google Scholar]
  • 166. Allman ES, Baños H, Garrote-Lopez M, Rhodes JA. 2024. Identifiability of level-1 species networks from gene tree quartets. Bull. Math. Biol. 86, 110. ( 10.1007/s11538-024-01339-4) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 167. Rhodes JA, Baños H, Xu J, Ané C. 2025. Identifying circular orders for blobs in phylogenetic networks. Adv. Appl. Math. 163, 102804. ( 10.1016/j.aam.2024.102804) [DOI] [Google Scholar]
  • 168. Teo B, Bastide P, Ané C. 2024. Code/data supplement for: Leveraging graphical model techniques to study evolution on phylogenetic networks. Zenodo ( 10.5281/zenodo.14247650) [DOI] [PubMed]
  • 169. Teo B, Bastide P, Ané C. 2024. Code to reproduce simulations and figures in: Leveraging graphical model techniques to study evolution on phylogenetic networks. GitHub. See https://github.com/bstkj/graphicalmodels_for_phylogenetics_code. [DOI] [PubMed]
  • 170. Ané C, Teo B, Bastide P. 2024. PhyloGaussianBeliefProp.jl. Zenodo. ( 10.5281/zenodo.14250995) [DOI]
  • 171. Teo B, Bastide P, Ané C. 2024. Julia package for the analysis of Gaussian models on phylogenetic networks and admixture graphs using belief propagation (aka message passing). See https://github.com/JuliaPhylo/PhyloGaussianBeliefProp.jl.
  • 172. Teo B, Bastide P, Ané C. 2025. Supplementary material from: Leveraging graphical model techniques to study evolution on phylogenetic networks. Figshare. ( 10.6084/m9.figshare.c.7663112) [DOI] [PubMed]

Associated Data

Data Availability Statement

Code to reproduce the figures is archived at [168] and available as a GitHub repository [169]. It uses version 0.0.1 of a Julia package for Gaussian BP on phylogenetic networks, which is archived at [170] and available as a GitHub repository [171].

Supplementary material is available online [172].


Articles from Philosophical Transactions of the Royal Society B: Biological Sciences are provided here courtesy of The Royal Society