PACo: A Novel Procrustes Application to Cophylogenetic Analysis

Juan Antonio Balbuena; Raúl Míguez-Lozano; Isabel Blasco-Costa

doi:10.1371/journal.pone.0061048

2013 Apr 8;8(4):e61048.

Editor: Corrie S Moreau

PMCID: PMC3620278 PMID: 23580325

Abstract

We present Procrustean Approach to Cophylogeny (PACo), a novel statistical tool to test for congruence between phylogenetic trees, or between phylogenetic distance matrices of associated taxa. Unlike previous tests, PACo evaluates the dependence of one phylogeny upon the other. This makes it especially appropriate to test the classical coevolutionary model that assumes that parasites that spend part of their life in or on their hosts track the phylogeny of their hosts. The new method does not require fully resolved phylogenies and allows for multiple host-parasite associations. PACo produces a Procrustes superimposition plot enabling a graphical assessment of the fit of the parasite phylogeny onto the host phylogeny and a goodness-of-fit statistic, whose significance is established by randomization of the host-parasite association data. The contribution of each individual host-parasite association to the global fit is measured by means of jackknife estimation of their respective squared residuals and confidence intervals associated to each host-parasite link. We carried out different simulations to evaluate the performance of PACo in terms of Type I and Type II errors with respect to two similar published tests. In most instances, PACo performed at least as well as the other tests and showed higher overall statistical power. In addition, the jackknife estimation of squared residuals enabled more elaborate validations about the nature of individual links than the ParaFitLink1 test of the program ParaFit. In order to demonstrate how it can be used in real biological situations, we applied PACo to two published studies using a script written in the public-domain statistical software R.

Introduction

The phenomenal growth in sequence information in recent decades has propelled the development of phylogenetic approaches to ecology and evolution. Aimed at understanding coevolutionary and cospeciation processes, cophylogeny focuses on species associations (organisms tracking organisms, such as parasites and hosts or pollinators and flowering plants) [1], [2], molecular systematics (organisms or genes tracking genes) [3], [4] and historical biogeography (organisms tracking areas) [5], [6]. Cophylogenetic studies stem from the observation that the diversification patterns over evolutionary time of tightly associated organisms, such as parasites and their hosts, are seldom independent [2]. Thus some degree of topological similarity, often termed congruence [7], between the phylogenies of the associated taxa is expected to occur. Congruence quantifies the extent to which each node in a given tree maps to a corresponding position in the other tree, and perfect congruence can be interpreted as evidence for cospeciation, which may or may not result from coevolutionary mechanisms [8], [9]. Such perfect congruence is rarely, if ever, observed in nature because, in addition to cospeciation, three other types of evolutionary events can act concurrently, namely host-switching (the parasite is able to colonize a new unrelated host), duplication (independent speciation of the parasite), and lineage sorting (failure to speciate or disappearance of a parasite lineage on a host lineage) [10], [11]. (For simplicity, the evolutionary events are presented and discussed herein in the context of host-parasite systems, but they can be readily adapted and generalized to any other cophylogenetic scenario.) Thus, the historical reconstruction of the associations between two given sets of organisms is not straightforward because it needs to evaluate and disentangle the relative roles played by all four evolutionary processes.

The numerous methods of cophylogenetic analysis currently available can be broadly classified into two categories: event-based methods and global-fit methods [12]. The former are aimed at finding the most probable coevolutionary history of the associated taxa. Numerous approaches have been proposed, based on character optimization, e.g. Brooks’ Parsimony Analysis [13], tree reconciliation of the associated taxa, e.g. COMPONENT [14] and PACT [6], or assignment of relative costs to the evolutionary events, e.g. TreeMap [15], Jungles [16], Tarzan [17] and Jane [18]. Event-based methods have strong appeal because they promise to deliver the coevolutionary history of the associated taxa. However, the challenges faced in their application are important. First, well resolved phylogenies are required to obtain reliable results and even with a small number of taxa the number of equally parsimonious solutions can be exceedingly high [12], [19]. Second, event-cost methods are strongly dependent on a good estimation of the set of costs considered [20]. Third, given that not all the topological congruence between trees is necessarily a result of cospeciation [21], the precise reconstruction of coevolutionary history often requires additional data, such as the ages of the nodes, assumptions on the probability of the different events, consideration of the geological history of the areas involved and experimental evidence, such as reciprocal transplant experiments [8], [22].

For their part, global-fit methods are used to quantify the degree of congruence between two given topologies, and identify the associations contributing to the cophylogenetic structure. Although they do not explicitly evaluate evolutionary scenarios, the amount of phylogenetic congruence can be related to the importance of coevolution in the system studied [12]. In addition, there is a clear need for methods of this kind because they afford large-scale cophylogenetic analyses for which the application of event-based counterparts becomes computationally prohibitive [23], [24]. To some extent, the approach taken by global-fit methods is similar to statistical tests for congruence between two given trees. A large variety of approaches have been proposed for this problem, e.g. [25], [26], [27], [28], including a Procrustes-based technique [29] similar to the one described herein. Even methods based on maximum likelihood and Bayesian inference have been specifically designed to study the cophylogeny of hosts and parasites [30]. However, the applicability of these methods to cophylogenetic studies is limited because they are primarily intended for one-to-one associations, something that rarely occurs in nature [24], [31], [32], [33].

Among the several global-fit methods currently available, e.g. [7], [34], [35], ParaFit [7] has been the most widely used, e.g. [3], [24], [36], [37]. ParaFit is an application to a phylogenetic context of the 4th-corner problem [38], testing whether or not the topological position of parasites in a tree is independent of the phylogenetic position of the associated hosts. The test requires three data matrices as input. The first one is a presence/absence matrix describing the host-parasite associations, whereas the other two contain information on the phylogenetic trees of hosts and parasites. Usually they consist of pairwise patristic or genetic distances, which are transformed into principal coordinate (PCo) matrices. The host PCo matrix is transposed and the three matrices (transposed host PCo, host-parasite association and parasite PCo matrices) are combined into a new one, whose trace is used to obtain a global goodness-of-fit statistic of congruence between the two trees. The significance of the statistic is established by randomization of the host-parasite association matrix. ParaFit also provides two statistics (ParaFitLink1 and 2) for testing individual host-parasite links using similar randomization procedures [7].

A second, more recent, test was proposed by Hommola et al. [34]; for convenience it will be hereafter referred to as HCT, for Hommola et al. Cospeciation Test. HCT is a generalization of the Mantel test that correlates the host and parasite phylogenetic distance matrices while accommodating multiple hosts associated with a single parasite and vice versa. The method is based on composing a host and a parasite vector using the patristic or genetic distances between the taxa and computing a correlation coefficient between the vectors. Unlike ParaFit, this method does not evaluate the contribution of individual host-parasite links to the global cophylogenetic structure. In addition, HCT differs from ParaFit in the randomization procedure used to test the significance of the global-fit statistic. In HCT the null hypothesis is that the host and parasite phylogenies are unrelated. Accordingly, the labels of the host and parasite phylogenies are randomly and separately permuted, while the tree topologies and host-parasite association matrix remain unchanged. In ParaFit, the null hypothesis states that the parasite species are randomly associated with the leaves of the host phylogenetic tree, and significance is established by randomization of the host-parasite matrix.
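The correlation at the core of HCT, as described above, can be illustrated with a minimal sketch. It assumes that the host and parasite vectors are built from every unordered pair of links and that the Pearson coefficient is used; the function name and the toy distances are hypothetical, and the permutation step is omitted.

```python
import numpy as np

def link_distance_correlation(host_dist, para_dist, assoc):
    """Correlate host and parasite distances over all pairs of host-parasite links."""
    hosts, paras = np.nonzero(assoc)          # one (host, parasite) index pair per link
    i, j = np.triu_indices(len(hosts), k=1)   # every unordered pair of links
    host_vector = host_dist[hosts[i], hosts[j]]
    para_vector = para_dist[paras[i], paras[j]]
    return np.corrcoef(host_vector, para_vector)[0, 1]

# Toy data (hypothetical): identical distance matrices and one-to-one links
d = np.array([[0.0, 2.0, 6.0], [2.0, 0.0, 6.0], [6.0, 6.0, 0.0]])
r = link_distance_correlation(d, d, np.eye(3, dtype=int))
```

Because a host or parasite involved in several links simply contributes several entries to the vectors, multiple associations need no special handling in this formulation.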

In this paper we introduce PACo (Procrustean Approach to Cophylogeny), a new test based on Procrustes analysis. Procrustes analysis is an extremely flexible technique used for displaying two or more multivariate datasets in their optimal superimposition [38]. Our method provides a superimposition plot enabling a graphical comparison of the fit of the host-parasite associations. In addition, residual analysis makes it possible to evaluate the contribution of each individual host-parasite association to the global fit. Like ParaFit and HCT, PACo is a distance-based test that can be carried out with any pair of distance or dissimilarity matrices, i.e., fully resolved host and parasite phylogenies are not required, and allows for multiple host-parasite associations and different numbers of hosts and parasites. For this purpose, rows of the host and parasite matrices are replicated to account for the multiple host-parasite links. PACo is also similar to ParaFit in that it uses the same three data matrices as input, converts the phylogenies to PCo coordinates, and makes it possible to assess the contribution of individual host-parasite associations to the global topological congruence.

An important conceptual difference from the previous tests is that both ParaFit and HCT compare the host and parasite distance matrices and test for random association between the host and parasite taxa, whereas PACo explicitly tests the dependence of the parasite phylogeny upon the host phylogeny, because in the Procrustean superimposition the parasite matrix is rotated and scaled to fit the host matrix. Accordingly, the permutational procedure used to test the global significance of the fit also differs: hosts are assigned randomly to parasites. PACo is appropriate to establish whether the classical view of host-parasite cospeciation, which assumes that parasite speciation is driven by host speciation [2], [39], holds in a given host-parasite system. Thus the null hypothesis tested is different from that of ParaFit and HCT, although sufficiently similar to justify a comparison of the three methods.

In the present study, we carried out several simulation experiments to compare the performance of the new test with that of ParaFit and HCT in terms of Type I and Type II errors. An additional recent test for congruence between phylogenetic trees [35] requires ultrametric trees and, given the complexity of the algorithm, detailed comparison with PACo deserves separate attention. Thus, the present study is restricted to the analysis of additive trees. We show herein that, in most cases, PACo performs at least as well as ParaFit and HCT and that, in some instances, it produces better Type I errors than ParaFit and higher statistical power than ParaFit and HCT. Finally, the use of the new test is demonstrated with a script written in the public-domain statistical software R applied to two case studies [32], [40] that illustrate how the residuals of the Procrustean fit can give further insight into the nature of individual taxa associations.

Materials and Methods

PACo Analysis

The present test builds on three pieces of information: two phylogenetic trees corresponding to hosts and parasites, and a binary matrix (A) coding the host-parasite associations (Fig. 1). Let h and p be the numbers of host and parasite species in the respective phylograms. A is then an h × p matrix, where 1 denotes presence of a given parasite species in a given host species, and 0 denotes its absence (Fig. 1). [Note the arbitrary assignment of hosts to rows and parasites to columns. Although the original ParaFit test of Legendre et al. [7] and HCT use A′, we opted to adopt the same input format required for the parafit function of the ape package of R [41] to ease comparison and integration with our R script implementing PACo.] The R code and instructions needed to implement PACo in R are given in File S1. In addition, an annotated code version, the input file examples and R code for the simulations described below can be downloaded at http://www.uv.es/cophylpaco/index.html.

Figure 1. (1) The phylogenetic information encapsulated by the host-parasite (H-P) tanglegram gives way to two distance matrices of hosts and parasites, and a binary matrix of H-P links. (2) The distance matrices are transformed by Principal Coordinates. (3) The H-P link matrix (A) is converted into an identity matrix to account for multiple host-parasite associations. (4) Rows in the Principal Coordinate matrices are duplicated (arched arrows) following the order dictated by the identity matrix. (5) The extended Principal Coordinate matrices (X and Y) are centred by mean column vectors and subjected to Procrustes analysis, where the parasite configuration is rotated and scaled to fit the host configuration. The fit can be visualised in a Procrustes superimposition plot. (6) The analysis yields a global goodness-of-fit statistic ($m^2_{XY}$), whose significance can be established by a randomization procedure, and individual link residuals that can be further analysed to establish the contribution of each H-P link to the global fit.

Figure 1 provides an overview of how PACo works. First, the host and parasite phylogenies are transformed into their respective distance matrices between species. This can be achieved by computing either patristic or genetic distances, or any dissimilarity measure between the species involved. The host and parasite distance matrices are, in turn, transformed into their respective matrices of principal coordinates (PCo), with h and p rows, and h –1 and p –1 columns, the latter representing each of the PCo axes. The PCo matrices can be viewed as representations of the host and parasite phylogenies in a Euclidean hyperspace, although they may contain noisy information with respect to the true phylogeny [7], [42].
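The transformation of a distance matrix into principal coordinates can be sketched as follows. This is a minimal illustration of classical principal coordinates analysis, not the script in File S1: the function name and toy distances are hypothetical, and axes with null or negative eigenvalues are simply dropped here, which is an assumption of this sketch rather than the correction used in the simulations.

```python
import numpy as np

def pcoa(dist):
    """Classical principal coordinates of a symmetric distance matrix."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centring @ (dist ** 2) @ centring   # double-centred matrix
    eigval, eigvec = np.linalg.eigh(gower)
    order = np.argsort(eigval)[::-1]                   # largest eigenvalue first
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval > 1e-10                              # drop null and negative axes
    return eigvec[:, keep] * np.sqrt(eigval[keep])

# Toy patristic distances among three taxa (hypothetical values)
d = np.array([[0.0, 2.0, 6.0], [2.0, 0.0, 6.0], [6.0, 6.0, 0.0]])
coords = pcoa(d)

# Euclidean distances between the coordinates reproduce the input distances
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

For these three taxa the result has h − 1 = 2 columns, in line with the dimensions stated above.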

PACo contemplates a given parasite occurring in more than one host species and, conversely, a host harbouring more than one parasite species (Fig. 1). Since Procrustes analysis requires the same number of observations in both ordinations, A is transformed into an identity matrix by duplicating multiple associations, which in turn are used to replicate, in the right order, the rows of hosts harbouring more than one parasite (PCo hosts) and the corresponding parasites occurring in more than one host (PCo parasites, see Fig. 1). It has been shown in studies using the Mantel test that the replication of taxa produces incorrect Type I rates [34]. Although we had insufficient a priori information on the behaviour of Procrustes analysis with duplicated data points, we show below through simulations that no systematic biases in P values were produced and that the Type I errors were mostly correct. This is probably so because the replicated taxa in the corresponding PCo matrices are treated as independent observations occupying identical positions in the hyperspace. Next, the expanded matrices of PCo coordinates of hosts (X) and parasites (Y), with column vectors centred on their respective means, are compared by means of Procrustes analysis using least-squares superimposition. Whereas the X configuration is kept fixed, the Y counterpart is scaled, centred, mirrored (if necessary) and rotated to minimize the squared differences between the two configurations [43], [44]. If X and Y do not contain the same number of columns, the narrower matrix is completed with the appropriate number of zero columns. The Procrustean fit of Y onto X can be visualised in an ordination plot (Fig. 1) and yields a residual sum of squares $m^2_{XY}$, which is computed as follows:

$$m^2_{XY} = \mathrm{tr}(XX') - \frac{[\mathrm{tr}(W)]^2}{\mathrm{tr}(YY')} \qquad (1)$$

where W is the diagonal matrix of singular values obtained by singular value decomposition of X′Y = VWU′ [38]. Given that $m^2_{XY}$ is inversely related to the topological congruence between the two ordinations, it represents a measure of the fit of the parasite phylogeny onto the host phylogeny. Note that the statistic is asymmetric, i.e. $m^2_{XY} \neq m^2_{YX}$. (This is not to be confused with the nature of the Procrustean fit, which itself can be symmetric or asymmetric [43].) It is possible to obtain a symmetric statistic by normalizing the column vectors of X and Y [44], [45]. This approach yields a dimensionless residual sum of squares, which is appropriate in an ecological context [45] where the original variables have different units. Herein, we adopted the asymmetric $m^2_{XY}$ because the PCo axes taken all together preserve the original dissimilarities among the taxa [46] and thus it provides a goodness-of-fit statistic with squared units of the original dissimilarity measure of the host phylogeny. In addition, some of our preliminary analyses using the symmetric sum of squares yielded biased Type I errors, perhaps due to the influence of the replicated taxa on the estimated variances computed for normalization of the column vectors of X and Y.
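The row replication and the residual sum of squares described above can be sketched in a few lines. This is an illustration under stated assumptions, not the File S1 script: the function names and toy coordinates are hypothetical, and the statistic is computed as the standard least-squares Procrustes residual sum of squares for a scaled and rotated Y fitted onto a fixed X.

```python
import numpy as np

def expand_by_links(host_pco, para_pco, assoc):
    """Repeat PCo rows so that row k of X and row k of Y describe the k-th link."""
    hosts, paras = np.nonzero(assoc)      # assoc is h x p, hosts in rows
    return host_pco[hosts], para_pco[paras]

def procrustes_ss(x, y):
    """Residual sum of squares after scaling and rotating y to fit x."""
    x = x - x.mean(axis=0)                # centre columns on their means
    y = y - y.mean(axis=0)
    width = max(x.shape[1], y.shape[1])   # pad the narrower matrix with zero columns
    x = np.pad(x, ((0, 0), (0, width - x.shape[1])))
    y = np.pad(y, ((0, 0), (0, width - y.shape[1])))
    w = np.linalg.svd(x.T @ y, compute_uv=False)   # singular values of X'Y
    return np.sum(x ** 2) - w.sum() ** 2 / np.sum(y ** 2)

# Hypothetical coordinates: 3 hosts, 3 parasites, one link per pair
host_pco = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
quarter_turn = np.array([[0.0, -1.0], [1.0, 0.0]])
para_pco = 3.0 * host_pco @ quarter_turn + 5.0     # rotated, scaled, shifted copy
x, y = expand_by_links(host_pco, para_pco, np.eye(3, dtype=int))
perfect_fit = procrustes_ss(x, y)                  # numerically zero

# A differently shaped configuration illustrates the asymmetry of the statistic
other = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 1.0]])
forward = procrustes_ss(host_pco, other)
backward = procrustes_ss(other, host_pco)
```

In the first example the parasite configuration is an exact rotated and rescaled copy of the host configuration, so the residual vanishes. In the second, the two directions of fit give different values because the statistic keeps the units of whichever configuration is held fixed.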

Goodness-of-fit Test

The global fit of the regression of the parasite phylogeny onto the host phylogeny can be tested taking $m^2_{XY}$ as a test statistic whose significance is established by a randomization procedure. Since A encapsulates the associations between hosts and parasites, it is the element that can be randomized under different criteria for hypothesis testing [7], [38], [45]. Given that in PACo we specifically test whether the parasite phylogeny depends on the host phylogeny, hosts are randomly allocated to parasites (i.e., each row in A is permuted independently). Thus, the null hypothesis (H₀) is that the host ordination does not predict the parasite ordination and so the parasite clades are randomly associated to the host clades. Conversely, the alternative hypothesis (H₁) implies that at least some part of the parasite ordination is constrained by that of the hosts and, thus, the host-parasite associations are to some extent mirrored in phylogenetic congruence.

Testing H₀ against H₁ with PACo involves the following steps [27]:

1. Set the desired significance level α.
2. Compute the observed $m^2_{XY}$ using Equation 1.
3. Obtain a randomized host-parasite association matrix Z by permuting the rows of A independently. Compute the new statistic as in step 2, with Z instead of A.
4. Repeat step 3 a large number of times and keep each randomized statistic for further reference.
5. Estimate the one-tailed probability P of the data under H₀ as the proportion of randomized statistics ≤ the observed $m^2_{XY}$. If P ≤ α, H₀ can be rejected and the analysis provides evidence for significant dependence of the parasite phylogeny on the host phylogeny.
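The steps above can be sketched as follows, starting from PCo coordinates. This is a minimal illustration, not the File S1 script: all function names and the toy data are hypothetical, the two configurations are assumed to have the same number of columns, and P is taken as the plain proportion described in the last step.

```python
import numpy as np

def procrustes_ss(x, y):
    """Residual sum of squares after scaling and rotating y to fit x."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    spread = np.sum(y ** 2)
    if spread == 0:                       # all links point to one parasite: no fit
        return np.sum(x ** 2)
    w = np.linalg.svd(x.T @ y, compute_uv=False)
    return np.sum(x ** 2) - w.sum() ** 2 / spread

def link_ss(host_pco, para_pco, assoc):
    hosts, paras = np.nonzero(assoc)      # replicate rows, one per link
    return procrustes_ss(host_pco[hosts], para_pco[paras])

def paco_permutation_test(host_pco, para_pco, assoc, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    observed = link_ss(host_pco, para_pco, assoc)
    hits = 0
    for _ in range(n_perm):
        # permute each row of A independently
        z = np.array([rng.permutation(row) for row in assoc])
        if link_ss(host_pco, para_pco, z) <= observed:
            hits += 1
    return observed, hits / n_perm

# Hypothetical, nearly congruent configurations with one-to-one links
rng = np.random.default_rng(2)
host_pco = rng.normal(size=(8, 2))
para_pco = host_pco + rng.normal(scale=0.01, size=(8, 2))
observed, p_value = paco_permutation_test(host_pco, para_pco, np.eye(8, dtype=int))
```

With configurations this close to each other, almost no random allocation of hosts fits as well as the observed links, so the estimated P falls well below the usual significance levels.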

Simulations

In any hypothesis test, two kinds of errors can be committed: H₀ can be rejected when H₀ is true (Type I error) or H₀ can be accepted when H₀ is false (Type II error) [47]. In order to estimate and compare both the Type I and Type II error rates obtained with ParaFit, HCT and PACo, we carried out several simulation experiments. For each simulation, exactly the same data (i.e., hosts and parasite phylogenetic trees, and A) were used, thus rendering the results directly comparable between the three tests. All simulations were carried out with R 2.14.1 [48]. Random additive phylogenetic trees were generated with the function rtree of the ape package [41] with branch lengths drawn randomly from the uniform distribution. The ParaFit global test [7] was carried out with the parafit function of ape, and HCT and PACo were implemented, respectively, with a script by K. Hommola (available at http://www1.maths.leeds.ac.uk/~kerstin/. Accessed 2013 March 11.) and our script based on the procrustes function of the vegan package [43] (File S1). Given that phylogenetic distances are often non-Euclidean [42], the transformation to PCo coordinates may produce negative eigenvalues, whose axes cannot be represented on the real space. To avoid this problem, the Cailliez correction [49] was used as default in the simulations with both PACo and ParaFit. Although this approach may inflate the total sum of squares [44], [50], it did not result in any substantial decrease in the Type I error as shown in the results below.

Type I error

For a test to be correct, the probability of committing a Type I error should not exceed the nominal significance level of the test α. In order to estimate the Type I errors of the three tests, we simulated data under H₀. In each simulation, we generated a pair of random host and parasite trees and a corresponding A containing a random sample (without replication) of all possible host-parasite links. The following parameter combinations were used in the simulations:

(a) 10 hosts, 10 parasites, and 10, 15, 20 and 25 host-parasite random links.
(b) 10 hosts, 15 parasites, and 10, 15, 20 and 25 links.
(c) 15 hosts, 10 parasites, and 10, 15, 20 and 25 links.
(d) 20 hosts, 20 parasites, and 20, 25, 30 and 35 links.

To our knowledge this is the first time that Type I errors of ParaFit and HCT are evaluated with larger phylogenies (>15 taxa) as in (d), which is of practical interest given the current availability of phylogenies of this size range.

For each parameter combination, 10,000 simulations were generated and the P values were calculated based on 999 permutations for each method in each simulation. For each set of simulations, the correctness of the Type I errors was evaluated by two procedures: (1) Type I error rates were computed for the commonly used 0.01 and 0.05 significance levels, together with their 95% confidence intervals based on 1,000 bootstrap samples of the 10,000 simulations. (2) To evaluate the overall accuracy of the error rates for any significance level, plots of the empirical cumulative distribution function of the P values resulting from each parameter combination were composed. When H₀ is true, correctly formed P values must follow a uniform distribution (i.e., y = x) [24].
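The two checks described above can be sketched as follows. The P values here are stand-ins drawn from a uniform distribution, not simulation results; the function names are hypothetical, and the percentile method is assumed for the bootstrap interval because the text does not specify which bootstrap interval was used.

```python
import numpy as np

def type1_rate_with_ci(p_values, alpha=0.05, n_boot=1000, seed=0):
    """Rejection rate at level alpha with a percentile bootstrap 95% interval."""
    rng = np.random.default_rng(seed)
    p_values = np.asarray(p_values)
    rate = np.mean(p_values <= alpha)
    boot = rng.choice(p_values, size=(n_boot, p_values.size), replace=True)
    boot_rates = np.mean(boot <= alpha, axis=1)
    lower, upper = np.percentile(boot_rates, [2.5, 97.5])
    return rate, lower, upper

def max_ecdf_deviation(p_values):
    """Largest gap between the empirical CDF of the P values and the line y = x."""
    p = np.sort(np.asarray(p_values))
    ecdf = np.arange(1, p.size + 1) / p.size
    return np.max(np.abs(ecdf - p))

# Stand-in P values for a correct test under a true null hypothesis
p_values = np.random.default_rng(1).uniform(size=2000)
rate, lower, upper = type1_rate_with_ci(p_values, alpha=0.05)
deviation = max_ecdf_deviation(p_values)
```

A correct test should give an interval that covers the nominal α and a small maximum deviation from the diagonal.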

Type II error

We assessed the Type II error rate through the statistical power of the test, which is measured as the probability of rejecting a false H₀. The power of the three tests was estimated and compared through simulations where H₀ was made to be false by construction. Three types of simulations, adapted from Legendre et al. [7], were performed:

Random links added. In each simulation, a single random tree was generated to represent identical phylogenies for host and parasites. Then A was formed by associating each host species to the parasite species at the corresponding position on the tree. These host-parasite systems could be viewed as representing ideal coevolutionary scenarios. Next a given number of random host-parasite links was added to A without replacing the existing links. Simulations were carried out with 10 hosts and 10 parasites and with 20 hosts and 20 parasites, with a number of added random links equal to 0%, 20%, 40%, 60%, 80% and 100% of the number of coevolutionary links.
Coevolutionary links replaced. The host and parasite trees, and A were generated as in the previous set of simulations. Then a given number of coevolutionary links in A was replaced (without replication of existing links) by an equal number of randomly located links. The following parameter combinations were explored: 10 hosts and 10 parasites, and 20 hosts and 20 parasites, replacing 0%, 20%, 40%, 60%, 80% and 100% of the number of coevolutionary links.
Partly congruent trees. In this set of simulations, a portion of the host and parasite trees was identical, whereas the remainder was generated at random. Then, coevolutionary links were created between host and parasites placed in the common part of the tree, whereas hosts and parasites in the random part of the tree were related by random links. Simulations were carried out with 10 hosts, 10 parasites and 10 host-parasite links, and with 20 hosts, 20 parasites and 20 host-parasite links, with varied proportions of coevolutionary links: 100%, 80%, 60%, 40%, 20% and 0% of the total number of links.

We applied the three tests to 10,000 simulations for each of these parameter combinations. Statistical power was estimated, based on 999 permutations for each method in each simulation, as the rejection rate of the false H₀ at the 0.01 and 0.05 significance levels.
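As an illustration of the first simulation type, the sketch below builds an association matrix for identical host and parasite trees and adds random links to it, and shows how power is computed from a set of P values. The function names are hypothetical and the P values in the example are made up for demonstration only.

```python
import numpy as np

def add_random_links(n_taxa, fraction, seed=0):
    """One-to-one coevolutionary links plus a given fraction of extra random links."""
    rng = np.random.default_rng(seed)
    assoc = np.eye(n_taxa, dtype=int)           # host i carries parasite i
    empty = np.flatnonzero(assoc == 0)          # flat indices of cells without a link
    n_extra = int(round(fraction * n_taxa))
    chosen = rng.choice(empty, size=n_extra, replace=False)
    assoc.flat[chosen] = 1                      # existing links are left untouched
    return assoc

def power(p_values, alpha=0.05):
    """Rejection rate of a false null hypothesis at level alpha."""
    return float(np.mean(np.asarray(p_values) <= alpha))

assoc = add_random_links(n_taxa=10, fraction=0.4)   # 10 coevolutionary + 4 random links
example_power = power([0.001, 0.020, 0.300, 0.040], alpha=0.05)   # made-up P values
```

Sampling only from empty cells mirrors the requirement that random links are added without replacing the existing ones.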

Contribution of Individual Links

PACo is amenable to statistically testing the significance of the individual links. For instance, an analogue to ParaFitLink1 of Legendre et al. [7] can be devised by replacing with 0 the value 1 representing the $i$th host-parasite link in A. A new sum of squared residuals can then be estimated without the $i$th link and the significance of the difference between the new statistic and $m^2_{XY}$ can be established by random permutations. However, we did not pursue this approach because multiple testing of the host-parasite links requires adjusting the α levels to account for the increased Type I error rates. Although there are procedures to correct for this effect [51], they come at the cost of reducing statistical power [52], [53], [54]. Since finding the appropriate adjustment of α can be very complex, we propose a strategy based on assessing the biological relevance [55] of each host-parasite link contributing to the global fit. Given that $m^2_{XY}$ is the sum of the squared residuals of the individual links, $e_i^2$, the latter provide a direct measure of host-parasite link importance. The $e_i^2$'s, together with their 95% confidence intervals, can be estimated using a jackknife method [47] as follows:

1. Compute $e_i^2$ for each of the n links.
2. For i = 1 to i = n,
   1. replace the value 1 in A corresponding to the $i$th link with 0, to yield a new host-parasite association matrix A(-i).
   2. For j = 1 to j = n; if j ≠ i then
      1. estimate the n − 1 squared residuals $e_{j(-i)}^2$ with PACo using A(-i);
      2. compute the jackknifed pseudovalues as $\varphi_{ij} = n \cdot e_j^2 - (n-1) \cdot e_{j(-i)}^2$.
3. Set the jackknifed estimate of $e_j^2$ and its standard error $S_e$ as the arithmetic mean and standard error of the $\varphi_{ij}$'s, respectively.
4. Compute the approximate 95% confidence interval of each jackknifed estimate as CI = estimate ± $t \cdot S_e$, where $t$ is the two-tailed critical value of Student's t distribution at the 0.05 level with n − 1 degrees of freedom.
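The jackknife procedure above can be sketched as follows, starting from PCo coordinates with equal numbers of columns. This is an illustration under stated assumptions rather than the File S1 script: the function names and toy data are hypothetical, the standard error is computed from the n − 1 pseudovalues available for each link, and the critical value `t_crit` is passed in by the user (1.96, the normal approximation, is used as a placeholder default).

```python
import numpy as np

def squared_residuals(host_pco, para_pco, assoc):
    """Squared Procrustes residual of each link after fitting parasites onto hosts."""
    hosts, paras = np.nonzero(assoc)
    x = host_pco[hosts] - host_pco[hosts].mean(axis=0)
    y = para_pco[paras] - para_pco[paras].mean(axis=0)
    u, w, vt = np.linalg.svd(y.T @ x)
    rotation = u @ vt                          # optimal rotation/reflection of y
    scale = w.sum() / np.sum(y ** 2)           # optimal scaling of y
    return np.sum((x - scale * y @ rotation) ** 2, axis=1)

def jackknife_residuals(host_pco, para_pco, assoc, t_crit=1.96):
    links = np.argwhere(assoc > 0)             # same row-major order as np.nonzero
    n = len(links)
    full = squared_residuals(host_pco, para_pco, assoc)
    pseudo = np.full((n, n), np.nan)           # pseudo[i, j]: link j with link i removed
    for i, (h, p) in enumerate(links):
        reduced = assoc.copy()
        reduced[h, p] = 0
        others = np.arange(n) != i
        partial = squared_residuals(host_pco, para_pco, reduced)
        pseudo[i, others] = n * full[others] - (n - 1) * partial
    estimate = np.nanmean(pseudo, axis=0)
    std_err = np.nanstd(pseudo, axis=0, ddof=1) / np.sqrt(n - 1)
    return estimate, estimate - t_crit * std_err, estimate + t_crit * std_err

# Hypothetical data: six congruent links plus one extra link between host 0 and parasite 1
rng = np.random.default_rng(1)
host_pco = rng.normal(size=(6, 2))
para_pco = host_pco + rng.normal(scale=0.05, size=(6, 2))
assoc = np.eye(6, dtype=int)
assoc[0, 1] = 1
estimate, lower, upper = jackknife_residuals(host_pco, para_pco, assoc)
```

Links with small jackknifed squared residuals contribute little to $m^2_{XY}$ and are therefore the ones that support congruence most strongly.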

This approach is illustrated in the application to the case studies below.

Application to Case Studies

We use data from two published studies to illustrate how PACo can be applied to real biological situations. The first one concerns the cophylogeny of pocket gophers and their chewing lice based on mitochondrial cytochrome oxidase I sequences [40] (Fig. 2). This model represents a classic example of host-parasite cospeciation [1], [39] that has been much used to test new methods in cophylogeny [7], [30], [34], [56]. The analyses were carried out with patristic and raw HKY85 [57] genetic distances, which were computed as explained in File S2.

Figure 2. Blue lines represent host-parasite associations observed in nature. ***Gopher species abbreviations:*** Ccas: *Cratogeomys castanops*; Cmer: *C. merriami*; GburA: *Geomys bursarius halli*; GburB: *G. bursarius majusculus*; Gbre: *G. breviceps*; Gpers: *G. personatus*; Ocav: *Orthogeomys cavator*; Oche: *O. cherriei*; Ohet: *O. heterodus*; Ohis: *O. hispidus*; Ound: *O. underwoodii*; Pbul: *Pappogeomys bulleri*; Ztri: *Zygogeomys trichopus*; Tbot: *Thomomys bottae*; Ttal: *T. talpoides*. ***Louse species abbreviations:*** Gact: *Geomydoecus actuosi*; Gcha: *G. chapini*; Gche: *G. cherriei*; Gcos: *G. costaricensis*; Gewi: *G. ewingi*; Gexp: *G. expansus*; Ggeo: *G. geomydis*; Gnad: *G. nadleri*; Gokl: *G. oklahomensis*; Gpan: *G. panamensis*; Gpero: *G. perotensis*; Gset: *G. setzeri*; Gtex: *G. texanus*; Gtho: *G. thomomyus*; Gtri: *G. trichopi*; Tbar: *Thomomydoecus barbarae*; Tmin: *T. minor*.

The second study involves 51 monogenean species of Dactylogyrus associated to 20 species of freshwater fishes [32]. This is clearly a more complex scenario with 60 host-parasite associations (Fig. 3), where the authors identified a relatively high number of intra-host parasite duplications together with some cospeciation and host-switching events [32]. We performed the analyses with phylogenetic patristic distances, which, for Dactylogyrus spp., were inferred from the published tree (Figure 2 in Šimková et al. [32]). For the fish species, in order to include Romanogobio albipinnatus, whose sequence was unavailable at the time of the original publication [32], we computed the patristic distances from a newly produced phylogeny based on cytochrome b sequences. (See the new phylogeny and details about its construction in File S2).

Figure 3. Lineages 1–3 of *Dactylogyrus* correspond to those recognized by Šimková et al. [32]. *Fish species abbreviations:* Aalb: *Alburnus alburnus*; Aasp: *Aspius aspius*; Abra: *Abramis brama*; Bbal: *Ballerus ballerus*; Bbar: *Barbus barbus*; Bbjo: *Blicca bjoerkna*; Bsap: *Ballerus sapa*; Caur: *Carassius auratus*; Ccar: *Cyprinus carpio*; Cide: *Ctenopharyngodon idella*; Cnas: *Chondrostoma nasus*; Gcer: *Gymnocephalus cernua*; Ggob: *Gobio gobio*; Lidu: *Leuciscus idus*; Ppar: *Pseudorasbora parva*; Ppho: *Phoxinus phoxinus*; Ralb: *Romanogobio albipinnatus*; Rrut: *Rutilus rutilus*; Scep: *Squalius cephalus*; Sery: *Scardinius erythrophthalmus*. *Dactylogyrus – specific-name abbreviations:* achm: *achmerovi*; alat: *alatus*; amph: *amphibothrium*; anch: *anchoratus*; auri: *auriculatus*; bore: *borealis*; caba: *caballeroi*; carp: *carpathicus*; chon: *chondrostomi*; chra: *chranilowi*; corn: *cornoides*; coru: *cornu*; cruc: *crucifer*; cryp: *cryptomeres*; difd: *difformoides*; diff: *difformis*; dist: *distinguendus*; dulk: *dulkeiti*; dyki: *dyki*; erge: *ergensi*; exte: *extensus*; falc: *falcatus*; fall: *fallax*; fini: *finitimus*; folk: *folkmanovae*; form: *formosus*; frat: *fraternus*; hemi: *hemiamphibothrium*; inex: *inexpectatus*; inte: *intermedius*; izju: *izjumovae*; lame: *lamellatus*; mall: *malleus*; mino: *minor*; nano: *nanoides*; nanu: *nanus*; parv: *parvus*; prop: *propinquus*; pros: *prostae*; ramu: *ramulosus*; rari: *rarissimus*; ruti: *rutili*; simi: *similis*; sphy: *sphyrna*; squa: *squameus*; tuba: *tuba*; vast: *vastator*; vist: *vistulae*; vran: *vranoviensis*; wund: *wunderi*; zand: *zandti*.

In both case studies, the trees and host-parasite associations were tested globally with PACo, ParaFit and HCT, and the contribution of individual host-parasite links was evaluated by jackknifed estimates of the squared residuals (File S1) and ParaFitLink1 [7]. In the fish-Dactylogyrus model, the associations were too numerous to produce a clear global superimposition plot, so our emphasis was placed instead on the analysis of individual host-parasite links. ParaFitLink1 was carried out with CopyCat [24], which incorporates optimized algorithms for PCo and ParaFit to facilitate analyses with large datasets [23]. In order to obtain precise P values, all tests were performed with 100,000 permutations.