Abstract
Organismal traits can evolve in a coordinated way, with correlated patterns of gains and losses reflecting important evolutionary associations. Discovering these associations can reveal important information about the functional and ecological linkages among traits. Phylogenetic profiles treat individual genes as traits distributed across sets of genomes and can provide a fine-grained view of the genetic underpinnings of evolutionary processes in a set of genomes. Phylogenetic profiling has been used to identify genes that are functionally linked and to identify common patterns of lateral gene transfer in microorganisms. However, comparative analysis of phylogenetic profiles and other trait distributions should take into account the phylogenetic relationships among the organisms under consideration. Here, we propose the Community Coevolution Model (CCM), a new coevolutionary model to analyze the evolutionary associations among traits, with a focus on phylogenetic profiles. In the CCM, traits are considered to evolve as a community with interactions, and the transition rate for each trait depends on the current states of other traits. Surpassing other comparative methods for pairwise trait analysis, CCM has the additional advantage of being able to examine multiple traits as a community to reveal more dependency relationships. We also develop a simulation procedure to generate phylogenetic profiles with correlated evolutionary patterns that can be used as benchmark data for evaluation purposes. A simulation study demonstrates that CCM is more accurate than other methods including the Jaccard Index and three tree-aware methods. The parameterization of CCM makes the interpretation of the relations between genes more direct, which leads to Darwin’s scenario being identified easily based on the estimated parameters. We show that CCM is more efficient and fits real data better than other methods resulting in higher likelihood scores with fewer parameters. 
An examination of 3786 phylogenetic profiles across a set of 659 bacterial genomes highlights linkages between genes with common functions, including many patterns that would not have been identified under a nonphylogenetic model of common distribution. We also applied the CCM to 44 proteins in the well-studied Mitochondrial Respiratory Complex I and recovered associations that mapped well onto the structural associations that exist in the complex. [Coevolution; evolutionary rates; gene network; graphical models; phylogenetic profiles; phylogeny.]
Comparative studies can provide useful insights into selection and adaptation of organismal traits in concert with their evolutionary history (Sanford et al. 2002). The types of traits that can be assessed in this framework are broad and can include morphology, behavior, physiology, and ecology (Peiman and Robinson 2017). For example, a study of traits in plants showed positive correlations between woodiness and tannin frequency, and a negative correlation between tannin frequency and alkaloid frequency, which they hypothesize is related to chemical defense (Silvertown and Dodd 1996). Many examples in animals have been observed as well, such as the association between coloration and behavior in snakes (Brodie III 1992). Comparative-genomic techniques can be used to identify homologous genes that underpin a multitude of traits (Koonin et al. 2000; Haubold and Wiehe 2004). Genes can exhibit similar patterns of presence and absence (Pellegrini et al. 1999) across a set of genomes for reasons such as participation in a common biochemical pathway, physical linkage, or colocalization on a mobile genetic element such as a plasmid (Fraser et al. 2004; Bowers et al. 2004; Cong et al. 2019). Examination of these patterns can reveal important information about related functions (e.g., participation in a common biochemical pathway) and common pathways of lateral gene transfer (LGT). A well-established approach to represent presence and absence patterns among genes is the construction of phylogenetic profiles, binary vectors that summarize the presence and absence of genes across a set of genomes, effectively treating each gene as a separate trait (Pellegrini 2012; Niu et al. 2017; Moi et al. 2020).
The success of phylogenetic profiling depends on the use of appropriate measures to express the distance and similarity between profiles. Approaches include the Hamming distance (Pellegrini et al. 1999), mutual information (Huynen et al. 2000), Pearson correlation (Sadreyev et al. 2015), and the hypergeometric test (Wu et al. 2003). Although effective, these approaches do not take phylogenetic effects into account. Since closely related genomes are more likely to share similar gene content, they are likely to have an outsized influence on profile comparisons relative to their phylogenetic diversity. Thus, the genomes connected by the phylogenetic tree are not independent (Pyron et al. 2015), which will violate the assumptions of these methods, and therefore skew results. Large genomic data sets are often imbalanced due to high relative abundance or oversampling of certain genomes; for example, the over-representation of pathogen isolates in the set of complete prokaryotic genome sequences (Alcaraz et al. 2010).
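For concreteness, three of the tree-naive measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the 0/1 list encoding of profiles are assumptions made here, not part of any cited implementation.

```python
import math

def hamming(a, b):
    """Number of genomes in which the two binary profiles disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared presences divided by genomes where at least one gene is present."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

def mutual_information(a, b):
    """Mutual information (in bits) between two binary profiles."""
    n = len(a)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            pxy = sum(1 for u, v in zip(a, b) if u == x and v == y) / n
            px = sum(1 for u in a if u == x) / n
            py = sum(1 for v in b if v == y) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi
```

None of these functions sees the tree: every genome contributes equally, which is exactly why closely related genomes can dominate the score.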
Several heuristic approaches have been developed to account for phylogenetic effects in profiles. For example, Jothi et al. (2007) and Sadreyev et al. (2015) used a null distribution of the similarity scores inferred by sampling the genomes to estimate the impacts of phylogenetic correlation; while von Mering et al. (2003) corrected for biases in the number of sequenced genomes by collapsing the genomes within the same clade into one single node if a specific gene pair has the same phyletic pattern in these genomes. Cokus et al. (2007) first ordered the genomes within profiles and enumerated runs of consecutive matches so that the co-occurrences concentrated in part of the tree would be considered as only one run. The shared underlying idea of these methods is the application of a weighting scheme to the genomes in order to counteract phylogenetic effects. These methods can be computationally efficient and feasible for large-scale analysis (Szklarczyk et al. 2021; Tremblay et al. 2021), but they are ad hoc approaches that do not properly model the underlying evolutionary processes.
In contrast with weighting approaches, evolutionary models aim to explain the distribution of genes by modeling the correlation patterns of gain and loss on a phylogenetic tree. Model-based approaches include CoPAP, which uses a stochastic mapping approach to detect coevolving gene families (Cohen et al. 2012, 2013), the CLustering by Inferred Models of Evolution (CLIME) algorithm that was developed to model gene evolution in eukaryotes (Li et al. 2014), and the Count software package for the analysis of numerical profiles using phylogenetic birth-and-death models (Csűrös 2010). However, Count was not specifically developed for binary traits; CLIME assumes that each gene has a single gain event in evolution, an assumption that is unsuitable for prokaryotes, which have high rates of gene transfer (Vos et al. 2015); and CoPAP assumes that the gain rate and loss rate vary independently among genes rather than explicitly modeling the interactions during evolution. Pagel (1994) developed a likelihood-based coevolutionary method that specifically tests the evolutionary correlations between pairs of binary traits. In Pagel’s method, each pair of binary traits is evaluated under both independent and dependent models, and a likelihood ratio test is applied to infer whether there is significant evidence that the two traits evolved dependently. Although Pagel’s correlation model performed well in previous studies of detecting functionally linked genes (Barker and Pagel 2005; Liu et al. 2018), the method is computationally expensive and cannot directly infer the direction (positive/negative) of the correlation. In addition, Maddison and FitzJohn (2015) and Uyeda et al. (2018) raised a more general concern about phylogenetic comparative methods such as Pagel’s correlation model: these methods may overestimate the evidence for correlation when the patterns are caused by singular events, a situation they refer to as Darwin’s scenario.
Darwin’s scenario occurs when two traits have a single origin on the same lineage and are then inherited by nearly all species in the descendant clade, resulting in (almost) perfect codistribution. This “within-clade pseudoreplication” could lead to dubious conclusions such as a significant association between fur and middle ear bones (Maddison and FitzJohn 2015).
Furthermore, most of the existing methods, such as Pagel’s correlation method and distance-based methods, can only be applied to pairs of genes. However, studying phylogenetic profiles in higher-order groups (such as triplets and quadruplets) can offer a more sensitive approach to detecting complex patterns of correlation (Wu et al. 2005). Direct-coupling analysis is a class of methods often used to infer direct relationships between residues in biological sequences; it can deal with conditional dependency by taking the inverse of the covariance matrix, but it is mainly designed for continuous data and is phylogeny-naive, so it depends strongly on the sampling of the genomes (Morcos et al. 2011; Baldassi et al. 2014).
Here, we propose the Community Coevolution Model (CCM), a new method that directly infers the strength and direction of the interactions among genes during the evolutionary process. For a pair of genes, CCM is more efficient in that it fits only one model instead of three (one separate model for each gene and one dependent model) in Pagel’s method and is approximately five times faster than Pagel’s method when tested on phylogenetic trees with 500 tips (performed on a server running Linux with a 2.67 GHz CPU and 18 GB RAM). Although maximum-likelihood estimation (MLE) is still a time-consuming procedure compared to other heuristic methods based on standard metrics, our method provides more biological insights such as the evolutionary rates, significance levels and directions (positive/negative) of interactions, and more importantly our method can be extended to model multiple genes as a community to discover more-complex associations. We also develop a simulation procedure to generate phylogenetic profiles with adjustable extents of evolutionary interactions that can be used as benchmark data for evaluating comparative methods.
Materials and Methods
The Community Coevolution Model
In our CCM, we consider whether sets of two or more genes have potential associations on a given phylogenetic tree, in particular whether the transition (gain or loss) of any gene within the community is affected by the current states of other members. Associations between genes can be positive if genes tend to be gained and lost together, and negative if the gain of some genes in a set appears to be associated with the loss of others in the same set. A gene set that shows evidence of associations is termed a “community” in our model.
We formulate the transition rate for one gene as a function of its intrinsic rate of gain and loss, and the association factor depending on the current states of all other genes in the community,
![]() |
(1) |
To further specify our model, we use the following notation:
- $n$ is the total number of genes in the community;
- $S = \{-1, 1\}^n$ is the state space of a community consisting of $n$ genes;
- $X \in S$ is a vector of size $n$. $X_i$ denotes the state of the specific $i$th gene when the community state is at $X$. We define $X_i = -1$ when the $i$th gene is absent and 1 when the $i$th gene is present;
- $B = (\beta_{ij})$ is a symmetric $n \times n$ matrix whose off-diagonal entries are the coefficients of interaction between every pair of genes and diagonal entries indicate half the difference between the gain and loss rates of each gene;
- $\alpha = (\alpha_1, \ldots, \alpha_n)$ is a vector of size $n$ containing the intrinsic rate which is the mean of gain and loss rates for each gene.
The instantaneous transition rate $r_i(X)$ for a specific gene $i$ when the whole gene community is in state $X$ is defined in the log scale as

$$\log r_i(X) = \alpha_i - \beta_{ii} X_i - \sum_{j \neq i} \beta_{ij} X_i X_j. \qquad (2)$$

Positive $\beta_{ij}$ means the $i$th and $j$th genes are positively associated, thus if the current states of $i$th and $j$th genes are the same ($X_i X_j = 1$), the last term tends to reduce the rate of change for gene $i$; and vice versa for negative values of $\beta_{ij}$.

By taking the exponential of equation (2), we have the final model as

$$r_i(X) = \exp\left(\alpha_i - \beta_{ii} X_i\right) \cdot \exp\Bigl(-\sum_{j \neq i} \beta_{ij} X_i X_j\Bigr), \qquad (3)$$

where the first part represents the intrinsic gain/loss rates of gene $i$ and the latter part represents the influence from the community.
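The rate in equation (3) can be illustrated with a short Python sketch. It assumes that absent genes are coded as -1 and present genes as +1, that `alpha` holds the intrinsic rates, and that `B` is the symmetric interaction matrix; the function name and argument layout are hypothetical and chosen here for illustration.

```python
import math

def transition_rate(i, state, alpha, B):
    """Rate at which gene i changes state, given the community state.

    state : sequence of -1 (absent) / +1 (present), one entry per gene
    alpha : intrinsic rates on the log scale, one entry per gene
    B     : symmetric matrix; diagonal = gain/loss asymmetry,
            off-diagonal = pairwise interaction coefficients
    """
    log_rate = alpha[i] - B[i][i] * state[i]
    for j in range(len(state)):
        if j != i:
            # Matching states with a positive coefficient lower the rate.
            log_rate -= B[i][j] * state[i] * state[j]
    return math.exp(log_rate)
```

With a positive interaction coefficient, a gene whose partner is present is gained faster and lost more slowly than the same gene whose partner is absent, which is the behavior described in the text.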
We model the gene state changes along a phylogenetic tree as a continuous-time Markov process and assume the instantaneous rate for all transitions involving more than one gene is 0. The transition rate matrix $Q$, where element $Q_{X,Y}$ denotes the rate of the community departing from state $X$ and arriving in state $Y$, can be constructed in accordance with the following rule,

$$Q_{X,Y} = \begin{cases} r_i(X) & \text{if } Y = X - 2X_i e_i \text{ for some } i, \\ -\sum_{i=1}^{n} r_i(X) & \text{if } Y = X, \\ 0 & \text{otherwise}, \end{cases} \qquad (4)$$

where $e_i$ denote the standard basis vectors of all 0’s except the $i$th element as 1 so that $Y = X - 2X_i e_i$ indicates that only the $i$th gene changes state.
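A minimal sketch of how such a rate matrix could be assembled is given below. It assumes the -1/+1 state coding, computes the single-gene rates inline following equation (3), and fills the diagonal with the negative row sum, which is the usual convention for a continuous-time Markov chain. The function name and the ordering of states are illustrative choices, not taken from the authors' software.

```python
import itertools
import math

def build_rate_matrix(alpha, B):
    """Return (states, Q) for a community of len(alpha) genes."""
    n = len(alpha)
    states = list(itertools.product((-1, 1), repeat=n))  # 2**n community states
    index = {s: k for k, s in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for s in states:
        row = index[s]
        for i in range(n):
            # Only single-gene changes have a nonzero rate.
            flipped = s[:i] + (-s[i],) + s[i + 1:]
            coupling = sum(B[i][j] * s[j] for j in range(n) if j != i)
            rate = math.exp(alpha[i] - s[i] * (B[i][i] + coupling))
            Q[row][index[flipped]] = rate
            Q[row][row] -= rate  # each row sums to zero
    return states, Q
```

The matrix has $2^n$ rows and columns, so its size doubles with every gene added to the community.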
Constructing the likelihood function given the tree
Figure 1.
a) A phylogenetic tree with four tips, labeled with the state at each node and the length of each branch. b) An illustration of the simulation process on one branch, showing the community states and the times at which the process transitions out of the current state. The process ends when the total transition time is beyond the branch length. c) A random realization of two groups of correlated profiles of Size 3 generated by our simulation procedure. The interaction coefficient is set to be 1.5 within a group and 0 between groups. Each row is a profile and each black bar denotes presence of the gene.
In addition to the Markov assumption, we also assume that transitions on separate branches are independent. This means that the distribution of the state at the end of a given branch depends only on the starting state of that branch. Constructing the likelihood function can be computationally expensive because we need to sum over all possible combinations of the states at the internal nodes, but the cost can be reduced by applying Felsenstein’s pruning algorithm (Felsenstein 1973). The pruning algorithm is a dynamic-programming approach that takes advantage of the nested structure of the tree and computes the likelihood for the given tree recursively. By applying the pruning algorithm, the likelihood function $L$ of the tree in Figure 1a can be formatted as
![]() |
(5) |
In this way, the likelihoods for subtrees can be reused and the computational complexity is reduced to linear in the number of leaves in the tree. Then, the negative log-likelihood function $-\log L$ is minimized to acquire the maximum likelihood (ML) estimates of the parameters by using a quasi-Newton optimizer (nlminb in R, version 4.0.2) (Paradis et al. 2004).
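The pruning recursion described above can be sketched for an arbitrary rooted tree as follows (this is not the specific four-tip tree of Figure 1a). The nested-list tree encoding, the function names, the small pure-Python matrix exponential, and the use of an explicit root prior are all assumptions made for this illustration.

```python
import math

def mat_mul(A, B):
    """Plain matrix product of two square lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def expm(Q, t, squarings=10, terms=12):
    """exp(Q * t) by a truncated Taylor series with scaling and squaring."""
    k = len(Q)
    A = [[Q[i][j] * t / 2 ** squarings for j in range(k)] for i in range(k)]
    result = [[float(i == j) for j in range(k)] for i in range(k)]
    term = [row[:] for row in result]
    for m in range(1, terms + 1):
        term = [[x / m for x in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(k)] for i in range(k)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def partial_likelihood(node, Q):
    """Likelihood of the data below `node`, conditional on each state at `node`.

    A leaf is an int (index of the observed state); an internal node is a
    list of (child, branch_length) pairs.
    """
    k = len(Q)
    if isinstance(node, int):
        return [1.0 if s == node else 0.0 for s in range(k)]
    like = [1.0] * k
    for child, t in node:
        P = expm(Q, t)
        child_like = partial_likelihood(child, Q)  # reused for every parent state
        for s in range(k):
            like[s] *= sum(P[s][u] * child_like[u] for u in range(k))
    return like

def tree_likelihood(tree, Q, root_prior):
    """Sum the root partial likelihoods, weighted by an assumed root prior."""
    return sum(p * l for p, l in zip(root_prior, partial_likelihood(tree, Q)))
```

For example, `tree_likelihood([(0, 0.3), (0, 0.2)], Q, [0.5, 0.5])` evaluates a two-tip tree in which both tips are observed in the first state. Each subtree is visited once, which is what keeps the cost linear in the number of leaves.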
Inference and Regularization of the ML Estimates
Due to the complexity of the likelihood function, it is necessary to assess whether the optimizer is going to provide acceptable estimates. We examined two ways to obtain the standard error of estimates: first, the parametric bootstrap method, which simulates a large number of profiles using the estimated parameters and calculates the standard deviation of the estimates across the bootstrap samples; second, the analytical approach based on likelihood theory, which utilizes the numerically approximated Hessian matrix $H$ of the objective function, with the standard error given as $\sqrt{\operatorname{diag}(H^{-1})}$. Using this estimated standard error, a Z-test is conducted to obtain the P-value for the null hypothesis of no interaction, $\beta_{ij} = 0$. The bootstrap method is obviously more time-consuming, but we can use the bootstrap to assess the accuracy of the likelihood asymptotics for finite samples. The performance of these two methods is compared in the Results section.
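As a small worked illustration of the Z-test step, the sketch below converts an estimate and its standard error into a Z-score and a P-value. The two-sided form of the test and the function name are assumptions made here; the text does not state which tail convention was used.

```python
import math

def wald_p_value(estimate, standard_error):
    """Z-score and two-sided P-value for the null hypothesis estimate == 0."""
    z = estimate / standard_error
    # Two-sided tail probability of a standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

Under this convention, an estimate that is 1.96 standard errors away from zero yields a P-value of roughly 0.05.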
As tree and community size increases, the likelihood function can become extremely complicated. A potential problem with the MLE procedure is overshooting, which happens when the parameter estimators diverge substantially from the true values due to a flat likelihood surface. To avoid the overshooting problem, we apply
-regularization on the parameters and have the penalized objective function
![]() |
(6) |
where
is the tuning parameter. Unlike setting a boundary for parameters or allowing a large error tolerance, which could cause an early stop of the optimization process to avoid overshooting, the regularization approach leads to more stable estimations by adding smoothness to the surface of the likelihood function.
The
-regularization is not meant for model sparsity, but only for dealing with computational issues of likelihood singularity in some occasional cases, and therefore it is not needed in most analyses which avoids unnecessary bias. When it is needed, a reasonable
is desired to avoid the overshooting problem but without introducing too much bias into the estimation. The condition number, which is the ratio of the largest eigenvalue to the smallest of the Hessian matrix, describes the rate of convergence of the optimization (Thacker 1989), and can be used as a guide to find a proper
. To provide a rule of thumb, we find that a condition number below
generally indicates a successful convergence without the overshooting problem.
Simulation Procedures
To simulate the coevolutionary relationships among genes, we use the framework of CCM, in which the transition rates of one gene depend on the states of other genes in the community. The procedure for simulating the evolution of a gene community of size $n$ on one branch can be summarized as below:

Input: the starting state of the community $X$, a user-defined coefficient matrix $B$, user-defined intrinsic rates $\alpha$, and branch length $t$.

1. Substitute the current states and user-defined parameters into equation (3) to calculate the current transition rate $r_i$ for each gene $i$.
2. Sample the transition time for each gene from the exponential distribution, $\tau_i \sim \mathrm{Exp}(r_i)$.
3. Find the gene $k$ with the minimum transition time, $\tau_k = \min_i \tau_i$.
4. If $\tau_k \le t$, update the state of gene $k$ in $X$ with the opposite state and if $\tau_k > t$, do not update the gene state. Then update the branch length $t \leftarrow t - \tau_k$.
5. Repeat steps 1–4 until $t \le 0$.

Output: the new state of the community $X$.
An illustration of the evolutionary process on one branch is shown in Figure 1b. The end state of a branch then becomes the starting state for the adjacent descendant branches. The same procedure is applied to all branches sequentially from the root to the tips. Figure 1c shows a simulation example of six genes in two groups of Size 3, using an interaction matrix with within-group interaction coefficients equal to 1.5, indicating strong relationships, and between-group interaction coefficients equal to 0, indicating independent evolution.
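The single-branch procedure can be sketched in Python as below. The sketch assumes the -1/+1 state coding and the rate form of equation (3); the function name, the argument order, and the explicit random-number generator argument are hypothetical choices made for this example.

```python
import math
import random

def simulate_branch(state, alpha, B, branch_length, rng):
    """Evolve a community state along one branch and return the end state."""
    state = list(state)
    n = len(state)
    remaining = branch_length
    while True:
        # Step 1: current transition rate of every gene.
        rates = [
            math.exp(alpha[i] - state[i] * (B[i][i] + sum(
                B[i][j] * state[j] for j in range(n) if j != i)))
            for i in range(n)
        ]
        # Step 2: one exponential waiting time per gene.
        waits = [rng.expovariate(r) for r in rates]
        # Step 3: the gene with the shortest waiting time changes first.
        k = min(range(n), key=waits.__getitem__)
        # Step 4: stop if the next event falls beyond the end of the branch.
        if waits[k] > remaining:
            return tuple(state)
        state[k] = -state[k]
        remaining -= waits[k]
```

To simulate a whole tree under these assumptions, the returned state of each branch would be passed as the starting state of its child branches, working from the root toward the tips.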
Analysis of Genomes from Class Clostridia
We applied our method to the draft assembly of the bacterium “Lachnospiraceae bacterium 3-1-57FAA-CT1” (abbreviated as LZ), which was isolated from a biopsy retrieved from the transverse colon of a female Crohn’s Disease patient at the time of colonoscopy (Liu et al. 2018). LZ is of interest because its genome size is very large compared to most of its immediate neighbors (6505 protein-coding genes as compared with a median of 3124 in our complete data set of Clostridia), and identifying sets of genes with shared patterns of gain and loss may yield insights into its ecological role in the host. A total of 658 completed and draft genomes from class Clostridia were retrieved from the National Center for Biotechnology Information (NCBI) for the comparative analysis of LZ. The phylogenetic tree was built through the AMPHORA2 pipeline (Wu and Scott 2012) and RAxML-HPC (Stamatakis 2006) using their concatenated, conserved protein sequences. Another set of eight outgroup genomes from class Bacilli and phyla Actinobacteria and Proteobacteria was used for rooting. The phylogenetic profiles were constructed by comparing the complete set of LZ (6505 predicted genes) against all other genomes using RAPSearch (Ye et al. 2011). Before our analysis, we first filtered out the genes that are very rare (present in
genomes) or very common (present in
genomes) and obtained the final data set of 3786 profiles. The Markov Clustering algorithm (MCL) was also used to obtain clusters of genes. MCL is a graph-based clustering method that simulates random walks within the graph to reflect the cluster structure based on the idea that random walks are more likely to stay in one natural cluster than to move across clusters (van Dongen and Abreu-Goodger 2012).
Results
Results on Simulated Data
Evaluation of model estimates
We first evaluated the performance of CCM on simulated data. We used the parameters estimated from one pair of flagellar genes in the real data set, with profiles shown in Figure 2a, to simulate 100 pairs of genes. The results are shown in Figure 2b. We see that the estimates are distributed around the true parameter values. Furthermore, we compared the estimates of the standard error based on the Hessian matrix and the parametric bootstrap; Table 1 shows that the results of the two methods are consistent.
Figure 2.
Estimation of the parameters using simulated pairs: a) Two phylogenetic profiles from the real data set; b) Estimated parameter values from CCM based on 100 simulated pairs using the parameters estimated from the two profiles in (a). The “*” represents the true parameters used in simulation. Evaluation of the interaction in pairs: c) The distributions of the estimated coefficients of interaction of the “no interaction” group and the “with interaction” group; d) the ROC curves of detecting the significant linkages by Jaccard Index, Pagel’s correlation method, clade-adjusted mutual information and hypergeometric, run-adjusted hypergeometric and our CCM model.
Table 1.
Comparison of estimated standard error using the parametric bootstrap and analytical Hessian methods based on the simulations in Figure 2
| | | | | | |
|---|---|---|---|---|---|
| Bootstrap SE | 0.282 | 0.215 | 0.107 | 0.099 | 0.090 |
| Hessian SE | 0.382 | 0.255 | 0.104 | 0.105 | 0.085 |
Detection of significant interactions between genes
We next used our simulation approach to examine the ability of the CCM to distinguish genes with associations from those that do not interact. To evaluate the performance of CCM, we simulated 500 gene pairs with no interaction (interaction coefficient of 0) as negatives and 500 gene pairs with interactions (interaction coefficient uniformly drawn between 0.2 and 0.5) as positives. Figure 2c shows the distributions of the estimated coefficients of interaction in two groups: the mean value for the “no interaction” group is 0.0046 (± 0.0807) and for the “interaction” group is 0.3497 (± 0.1173). We also compared the performance of Pagel’s correlation test method, the Jaccard Index, and two heuristic methods: the hypergeometric test with consecutive-runs adjustment (Cokus et al. 2007) and mutual information with clade adjustment (von Mering et al. 2003). We evaluated both clade-adjusted and runs-adjusted methods with four different metrics (Hamming, Jaccard Index, Hypergeometric test, and Mutual Information) using simulated data, and we found that the hypergeometric test works best. Thus, we also included the clade-adjusted Hypergeometric test for comparison since it performed better than the method proposed in the original paper (clade-adjusted mutual information). From the ROC curve shown in Figure 2d, our CCM method obtained the highest AUC score of 0.9521, followed by Pagel’s correlation model (0.8968), run-adjusted Hypergeometric (0.838), clade-adjusted Hypergeometric (0.7665), and clade-adjusted Mutual Information (0.7215). The Jaccard Index, being a nonphylogenetic method, had the lowest AUC score (0.609).
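For readers who want to reproduce this kind of comparison on their own scores, the AUC can be computed directly as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The sketch below is a generic illustration with a hypothetical function name; it is not the evaluation code used for Figure 2d.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve from two lists of association scores."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))
```

A method that ranks every interacting pair above every non-interacting pair scores 1.0, and a method that ranks them at random scores about 0.5.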
Identifying Darwin’s scenario of codistribution
Under Darwin’s scenario, there is a single concurrent origin for two genes, leading to perfect codistribution across all species within a clade, as in the example shown in Figure 3a. As Darwin’s scenario provides little evidence for coevolution, it is of interest to distinguish such scenarios from a replicated coevolution scenario that has multiple disjoint instances of a given trait. Both scenarios are considered significantly correlated by CCM due to their perfect codistribution, but the replicated coevolution scenario yields stronger significance scores and has much higher intrinsic rates. We demonstrate this difference using 100 simulated data sets. In each simulation, we randomly generate a 100-tip tree, one pair of genes whose co-occurrences are concentrated in one random clade chosen uniformly across all clades (Darwin’s scenario), and another pair of genes that have the same number of co-occurrences spread across the tree (the replicated codistribution scenario). Although both scenarios produce significant results (P-value
), replicated co-occurrence tends to achieve greater significance scores (Fig. 3b). Very few of the Darwin scenarios would be deemed significant in a real data analysis with correction for multiple testing. Another distinguishing feature between these scenarios is the estimated intrinsic rate
, with gene pairs simulated under Darwin’s scenario having much lower estimates of
(
) than under replicated coevolution scenario (
) as shown in Figure 3c.
Figure 3.
Comparison between Darwin’s scenario and replicated co-occurrence: a) An example of Darwin’s scenario (Pair 1) and replicated co-occurrence (Pair 2); b) The distributions of the Z-scores (
) for the two scenarios; c) The distributions of the estimated intrinsic rates for the two scenarios.
Modeling multiple genes as a community to reduce pairwise false-positive links
Most comparative methods (e.g., Pagel’s method and all the methods based on distance or similarity metrics) use pairwise comparisons. By modeling more than two genes as a community, the CCM can screen out false-positive links that can be caused by genes that show pairwise evidence for coevolution but are conditionally independent when other genes are taken into consideration. For example, consider a community of three genes where Gene 3 is directly related to both Genes 1 and 2, but there is no direct connection between Genes 1 and 2. Pairwise methods will often falsely identify a significant connection between Genes 1 and 2. We simulated 100 triplets of genes with this structure where Gene 3 is moderately linked to Gene 1 and is strongly linked to Gene 2 (interaction coefficient = 0.8), but Genes 1 and 2 are independent conditional on the presence of Gene 3 (interaction coefficient of 0). As shown in Figure 4a, CCM correctly estimates the true parameters. We also ran Pagel’s model over the same simulation data in pairs (three pairs for each group of three genes, so 300 pairwise comparisons in total). From Figure 4b, which shows the distribution of estimated P-values on the conditionally independent linkage between Genes 1 and 2, we can see that our method resulted in the desired uniform distribution of P-values while Pagel’s method shows
of estimates had P-value
.
Figure 4.
Evaluation of the conditionally independent links in the simulated triplets: a) Estimated parameter values from CCM based on 100 simulated triplets. The sign “*” indicates the true parameters. b) The P-values of the conditionally independent pairs (Genes 1 and 2) inferred by our CCM model and Pagel’s model. Simulation of four association-network structures: c) line, d) partially connected, e) star, and f) fully connected. The networks on the left demonstrate the structures and the box-plots on the right show the estimated coefficients of interactions within the community. All the edges have an interaction coefficient of 0.5.
Recovery of community structures
To evaluate the ability of CCM to recover the relationships among more genes within a community, we simulated four basic network topologies of a five-node community: i) a line structure; ii) a star structure where Node 3 acts as the hub; iii) a partially connected network where Node 3 acts as the bridge that connects two subgroups; and iv) a fully connected network. For each structure, we simulated 100 data sets with an interaction coefficient of 0.5 for all links. As shown in Fig. 4c–f, CCM successfully reveals the linkages within the community. Unlike pairwise methods, which tend to produce a densely connected network due to false positives among the conditionally independent pairs, our community model provides clear insights into the importance of the members (e.g., hub genes) and the complex dependency structures within the community.
As community size increases, the dimension of the Q matrix grows exponentially ($2^n \times 2^n$ for a community of $n$ genes), which dramatically slows down the evaluation of the log-likelihood values. Thus, our current method can only handle a small number of genes in a community. Table S1 of the Supplementary material available on Dryad at https://doi.org/10.5061/dryad.p8cz8w9rd provides the approximate running time of our program as a reference for different community and tree sizes. For large groups, one strategy is to run all-versus-all pairwise comparisons first to construct a gene-interaction network, which is usually very densely linked at this stage. We then run all the triplets within the network to remove the conditionally independent links. We can continue to examine all the subnetworks of Size 4 or 5 to further prune the network to the desired sparsity.
Effect of tuning parameters
We also evaluated the influence of the tuning parameters on the MLEs. We simulated 100 gene pairs with random parameters and expected that the overshooting problem may happen to some of the pairs. These problematic cases will have parameter estimates far from the true values, like the outliers in Figure S1 of the Supplementary material available on Dryad. Then, we added the regularization term and increased the tuning parameter gradually. For each simulated pair, the mean squared error (MSE) of the estimators (five parameters for two genes) is reported. From Figure S1a of the Supplementary material available on Dryad, we can see that the tuning parameters mainly have a large impact on those outliers. The condition number plot (Fig. S1b of the Supplementary material available on Dryad) shows that when we increase the tuning parameter until the condition number of the Hessian matrix falls below 200, the overshooting problems with those outliers are effectively solved.
Results on Prokaryotic Data
Model comparisons
The computational cost of running a single pairwise profile comparison using Pagel’s model was around 15 s (performed on a server running Linux with a 2.67 GHz CPU and 18 GB RAM). Since the entire data set requires $\binom{3786}{2}$ (about 7.2 million) such comparisons, an exhaustive evaluation of Pagel’s method is infeasible. Instead, we focused on a complete pairwise evaluation of 50 adjacent genes to compare against our coevolutionary model. The software we used to implement Pagel’s approach is BayesTraitsV3 (Meade and Pagel 2017). A comparison of the negative logarithm of the P-values inferred by the two methods yielded a correlation coefficient of 0.741 as shown in Figure 5a (P-values have been adjusted for multiple correlated tests using the Benjamini–Yekutieli (BY) method (Benjamini and Yekutieli 2005)). Applying a log P-value threshold of
, we found that both methods agreed on the significance or nonsignificance of 1188 (96.98%) comparisons. Twenty-nine (2.37%) comparisons gave a significant result with the Pagel test but not with the coevolutionary model, while the opposite result was seen in the remaining 8 (0.65%) pairwise comparisons. After examining the discordant pairs in the top-left corner of Figure 5a, we found that a common issue for Pagel’s model is that most of these pairs reached the default maximum rate of 100 (the mean branch length of the tree has been scaled to 0.1 as suggested by the authors (Meade and Pagel 2017)), which indicates that Pagel’s dependent model may overestimate the likelihood of correlated evolution because of overshooting and therefore detect more false-positive links. One example of the transition rates estimated by the two methods is compared in Table 2. Pagel’s dependent model yields an implausible transition rate matrix in which the transition rate from (0,0) to (0,1) is abnormally large (98.575) and the transition rate from (1,0) to (0,0) is 0, which suggests that Pagel’s eight-parameter dependent model may be overparametrized and may therefore overestimate the likelihood of dependent evolution.
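The Benjamini–Yekutieli adjustment mentioned above can be sketched as follows. This is a generic step-up implementation written for illustration, with a hypothetical function name; it is not the code used to produce Figure 5a.

```python
def benjamini_yekutieli(pvalues):
    """Return BY-adjusted P-values in the same order as the input."""
    m = len(pvalues)
    # Harmonic-sum correction that makes the procedure valid under dependence.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted
```

The extra factor `c_m` is what distinguishes this adjustment from the Benjamini–Hochberg procedure and makes it more conservative when tests are correlated, as pairwise comparisons sharing genes are.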
Figure 5.
a) The comparison of significance of pairwise linkages by two methods: the horizontal axis is the −log(P-value) of CCM and the vertical axis is the −log(P-value) of Pagel’s approach; the correlation between the P-values for the two methods is 0.741. b,c) The comparison of the goodness-of-fit of models to data between the independent model (four parameters), dependent model (eight parameters), and our CCM model (five parameters). The independent model and dependent model are the two components of Pagel’s approach required for the likelihood ratio test.
Table 2.
The transition rate matrices inferred by the Independent and Dependent models of Pagel’s method and by the CCM. The gene pair (GI:511537597 and GI:496550319) in this table is considered strongly correlated by Pagel’s method (P-value = 0.00011), but not by the CCM (interaction coefficient = 0.0832; P-value = 0.283).
| From \ To | 0, 0 | 0, 1 | 1, 0 | 1, 1 |
|---|---|---|---|---|
| (a) Independent model (likelihood): 283.183 | | | | |
| 0, 0 | – | 2.143 | 0.506 | 0 |
| 0, 1 | 0.304 | – | 0 | 0.506 |
| 1, 0 | 1.866 | 0 | – | 2.143 |
| 1, 1 | 0 | 1.866 | 0.304 | – |
| (b) Dependent model (likelihood): 271.4941 | | | | |
| 0, 0 | – | 98.575 | 0.375 | 0 |
| 0, 1 | 10.190 | – | 0 | 0.447 |
| 1, 0 | 0 | 0 | – | 2.139 |
| 1, 1 | 0 | 2.267 | 0.0003 | – |
| (c) CCM (likelihood): 283.015 | | | | |
| 0, 0 | – | 2.158 | 0.441 | 0 |
| 0, 1 | 0.335 | – | 0 | 0.521 |
| 1, 0 | 2.188 | 0 | – | 2.548 |
| 1, 1 | 0 | 1.853 | 0.284 | – |
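As a rough check of how the Pagel P-value in the Table 2 caption relates to the two Pagel fits, the sketch below recomputes a likelihood ratio test from the tabulated values. It assumes that the numbers labeled "likelihood" are negative log-likelihoods and that the statistic is referred to a chi-square distribution with four degrees of freedom (eight minus four parameters). Both are our assumptions, and the helper name `chi2_sf_even_df` is ours.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For even df the tail has the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Values copied from Table 2 (assumed to be negative log-likelihoods).
neg_loglik_independent = 283.183  # four parameters
neg_loglik_dependent = 271.4941   # eight parameters

lrt_stat = 2.0 * (neg_loglik_independent - neg_loglik_dependent)
p_value = chi2_sf_even_df(lrt_stat, df=8 - 4)
print(f"LRT statistic = {lrt_stat:.4f}, P-value = {p_value:.5f}")
```

Under these assumptions the statistic is about 23.38 and the P-value rounds to 0.00011, in line with the value reported for Pagel’s method in the caption.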
We further evaluated the goodness-of-fit of the two methods to the real data by comparing the likelihood scores. As shown in Figure 5b,c, the CCM obtains a significantly lower negative log-likelihood than both Pagel’s independent model and Pagel’s dependent model (P-value = 0.00217 for the dependent model), which suggests that our model generally fits the real data better, even though the CCM (five parameters) has fewer parameters than Pagel’s dependent model (eight parameters).
Gene clustering based on significant pairwise linkages
To discover sets of genes that collectively show evidence of correlated gains and losses, we performed a full pairwise comparison over the genes of LZ using CCM. In total, 1918 genes are annotated with gene ontology (GO) biological process terms, which were used to evaluate the functional similarity of linked genes and for the GO enrichment analysis. All GO annotations of genes were retrieved from the UniProt database (UniProt Consortium 2015). We used Wang’s graph-based method (Wang et al. 2007) to measure the semantic similarity of GO terms; it produces a score between 0 and 1 for a given pair of GO terms, with higher values representing greater functional similarity (Wang et al. 2007; Yu et al. 2010). Figure S2 of the Supplementary material available on Dryad shows that the most significant linkages under our model are between closely functionally related genes. Our results confirm the strong relationship between evolutionary similarity and functional similarity between genes.
To obtain the clusters of genes with highly correlated evolution, we first applied a strict threshold (on both the coefficient of interaction and the Z score) to the linkages to obtain a gene network consisting of 1401 vertices and 19,391 highly significant edges (Fig. 6a). We then applied Markov clustering with inflation parameter 1.5 to the network to guide the labeling of genes into clusters in the largest component (Fig. 6b). We report the GO enrichment analysis for all clusters of size at least five in Table S2 of the Supplementary material available on Dryad.
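To make the clustering step concrete, the following is a minimal dense-matrix sketch of the Markov clustering idea (alternating expansion and inflation of a column-stochastic matrix). It is a teaching sketch written for this note, not the MCL software and not the authors' pipeline; the function names, the self-loop convention, and the toy graph are assumptions.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency, inflation=1.5, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a dense adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then turn the matrix into a random-walk matrix.
    flow = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        expanded = matmul(flow, flow)  # expansion: two-step random walk
        inflated = normalize_columns(  # inflation: strengthen strong flows
            [[v ** inflation for v in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tol:
            break
    # Rows with a nonzero diagonal act as attractors; the nonzero columns
    # of such a row form a cluster. Identical clusters are reported once.
    clusters = []
    for i in range(n):
        if flow[i][i] > 1e-6:
            members = sorted(j for j in range(n) if flow[i][j] > 1e-6)
            if members not in clusters:
                clusters.append(members)
    return clusters


# Toy graph: two disjoint triangles (hypothetical, not the gene network).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(two_triangles, inflation=1.5))
```

A network with 1401 vertices and 19,391 edges would normally call for a sparse representation and pruning of small entries, which this dense sketch omits.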
Figure 6.
Visualization of the gene network: a) The gene network obtained from the full pairwise comparisons and labeled with the MCL clustering results. Black vertices indicate the genes annotated with GO (BP) terms and gray vertices denote unannotated genes. b) A detailed structure inside the largest component in (a). Each pie chart denotes the percentage of the annotated genes within each cluster. Only clusters above a minimum size are labeled for a clean visualization.
We do not expect profile similarity and clustering to align perfectly with participation in a common biological process, especially when biological processes are annotated at very low levels of specificity (e.g., “transmembrane transport”). Nonetheless, we expect that many genes with common functions (such as transmembrane transport, transcription, and carbohydrate metabolic process) will show similar distributions across genomes, reflecting processes such as hitchhiking on frequently transferred mobile elements and coincidental loss of genes that collectively confer no selective benefit. The flagellum cluster (Cluster 5) and amino-acid biosynthetic cluster (Cluster 6) were also discovered and examined in our previous study using Pagel’s correlation method applied on a reduced data set (a 74-tip subtree). It was only possible to analyze a reduced data set because of the computational cost of Pagel’s method, and a phylogenetic analysis was also conducted to find potential evidence for LGTs (Liu et al. 2018). In this study, by applying our method to the full data set (659 species), we discovered another candidate group of flagellar genes (Cluster 16) which are much less common (found in only 45 genomes) compared to the genes in Cluster 5 which are found in 396 genomes (Table S2 of the Supplementary material available on Dryad).
The intrinsic rates inferred by CCM were consistent with the distribution patterns of genes in the phylogenetic profiles. For example, the pattern in Cluster 4 appears to be more consistent with Darwin’s scenario, which agrees with its relatively low intrinsic rate (Fig. S3 of the Supplementary material available on Dryad). Clusters 29 and 33 have the largest estimated intrinsic rates, and both show patchy distributions in the same very shallow clade in the tree. This rapid gain and loss over a relatively short span in the tree is a possible cause of the high rates. Cluster 36 (profiles in Fig. S4b of the Supplementary material available on Dryad) and Cluster 55 have the largest estimated interaction coefficients, and both also show strong functional associations according to their GO annotations. More detailed information about the clusters can be found in Table S2 of the Supplementary material available on Dryad. To complete the analysis, we also provide GO predictions for 823 unannotated genes based on their most strongly interacting genes with known GO annotations; the results are summarized in Table S3 of the Supplementary material available on Dryad.
Examples of inferred evolutionary relationships
The simulation results showed that pairwise comparisons cannot detect conditionally independent linkages, so all-versus-all pairwise comparisons tend to produce densely connected networks. For example, the five genes in Cluster 49 (Fig. S4a of the Supplementary material available on Dryad) are all related to iron–sulfur (Fe–S) assembly (three are annotated with “iron–sulfur cluster assembly,” one is annotated with “cysteine metabolic process,” and one has no GO annotation but has the protein name “FeS assembly ATPase SufC”). The pairwise comparisons suggest that the linkages between all five genes are extremely strong, which would lead to a fully connected network. However, by modeling these five genes as a community, 4 out of 10 total linkages can be removed as conditionally independent linkages.
In other cases, the pairwise interactions are still significant even when we account for conditional dependence. As an example, Cluster 36 consists of six genes that are all annotated with the GO term “alginic acid biosynthetic process.” The pairwise comparisons show that all links between the six genes are highly significant. When these six genes are modeled simultaneously as a community, only 3 out of 15 total linkages fail to remain significant, as shown in Figure S4b of the Supplementary material available on Dryad.
Because the size of the transition matrix, and therefore the computational cost of our method, increases exponentially with the number of genes, it is infeasible to apply our method to large groups of genes. For large clusters, we get around this issue by applying our method to smaller cliques within the network and using the results to detect linkages that are conditionally independent. This differs from directly removing linkages by thresholding: it aims to remove only the “redundant” linkages, conditioning on the presence of other genes, to reveal the refined structure rather than to break the cluster into smaller groups. For example, we started from the original network of Cluster 6, which consists of 32 amino-acid related genes and 381 highly significant linkages obtained from all-versus-all pairwise comparisons (Fig. 7a). We then applied CCM to all the triplets within this network, and some strong linkages became only weakly significant in the presence of the third gene. We removed 272 such edges (identified by thresholds on both the P-value and the interaction coefficient) and obtained the refined network of 109 edges (Fig. 7b). For comparison, we directly deleted the same number of edges from the original graph by increasing the threshold (Fig. 7c). This results in a very different network structure consisting of multiple densely connected components, rather than the more sparsely connected network obtained using our method.
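The triplet-based refinement can be sketched as follows. This is a hypothetical outline only: the function `prune_edges`, the callback `fit_triplet` (a stand-in for fitting a three-gene community model), the restriction to fully connected triplets, the direction of the two cutoffs, and the cutoff values in the usage example are all assumptions, since the exact criteria are not reproduced here.

```python
from itertools import combinations


def prune_edges(edges, fit_triplet, p_cutoff, coef_cutoff):
    """Drop edges that look conditionally independent in some triplet.

    edges       : iterable of 2-tuples of gene names
    fit_triplet : callable (a, b, c) -> {frozenset({x, y}): (coef, p_value)}
                  giving each pair's interaction estimate within the triplet
    An edge is removed when, in at least one triplet, its P-value exceeds
    p_cutoff and its absolute coefficient falls below coef_cutoff.
    """
    edge_set = {frozenset(e) for e in edges}
    genes = sorted({g for e in edge_set for g in e})
    weak = set()
    for trio in combinations(genes, 3):
        pairs = [frozenset(p) for p in combinations(trio, 2)]
        if not all(p in edge_set for p in pairs):
            continue  # only consider triplets that form a 3-clique
        result = fit_triplet(*trio)
        for pair in pairs:
            coef, p_value = result[pair]
            if p_value > p_cutoff and abs(coef) < coef_cutoff:
                weak.add(pair)
    return edge_set - weak


def toy_fit(a, b, c):
    """Placeholder results: the A-C link weakens once B is included."""
    strong, weakened = (1.0, 1e-8), (0.01, 0.6)
    return {
        frozenset(p): (weakened if frozenset(p) == frozenset(("A", "C")) else strong)
        for p in combinations((a, b, c), 2)
    }


kept = prune_edges([("A", "B"), ("B", "C"), ("A", "C")], toy_fit,
                   p_cutoff=0.05, coef_cutoff=0.1)
print(sorted(sorted(e) for e in kept))
```

In the toy case the A–C edge is dropped while A–B and B–C are kept, which mirrors how a chain-like dependency can replace a fully connected triangle. For scale, a 32-gene cluster has 32 × 31 × 30 / 6 = 4960 possible triplets.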
Figure 7.
Network analysis of the amino-acid gene cluster. a) The original network (Cluster 6) consists of 32 vertices and 381 highly significant edges based on the all-versus-all pairwise comparisons. b) Application of the CCM to every triplet from network (a), followed by removal of the conditionally independent edges (identified by thresholds on the P-value and the interaction coefficient). The resulting network consists of 32 vertices and 109 edges. c) Direct deletion of edges from (a) by thresholding to retain the same number of edges as in (b). The cluster is disconnected into two components and two singletons. A force-directed layout algorithm is used for the network visualization.
Analysis of Mitochondrial Respiratory Complex I
Eukaryotic genes are less susceptible to LGT (Keeling and Palmer 2008; Sibbald et al. 2020), and we may therefore expect significant differences in the performance of CCM between prokaryotic and eukaryotic data. To evaluate the performance of CCM on eukaryotic data, we applied it to a well-studied protein complex consisting of a total of 44 human genes encoding mitochondrial respiratory complex I (Balsa et al. 2012; Li et al. 2014; Guo et al. 2017). The data sets we used are published phylogenetic profiles and a species tree consisting of 138 diverse eukaryotes and a prokaryote outgroup (Bick et al. 2012; Li et al. 2014). We performed an all-versus-all comparison using CCM to infer the interactions among the 44 genes and illustrate the detailed relationships within the complex with the average-linkage hierarchical dendrogram shown in Figure 8. We also compared our results with CLIME, an approach to infer evolutionary modules specifically for eukaryotic species, which assumes that each gene has exactly one gain event in evolution followed by zero or more loss events. CLIME groups 20 of the 44 genes into four evolutionary modules (ECMs), with the remainder as singletons with no assigned group, as shown in Figure 8 (the results of CLIME are available at https://gene-clime.org/). Comparing our clustering results to the detailed structure of complex I reported by Guo et al. (2017), we find a single cluster of size 21 encompassing 15 genes that all localize to the matrix arm of complex I, including all 7 core subunits (NDUFV1, NDUFV2, NDUFS1, NDUFS2, NDUFS3, NDUFS7, and NDUFS8). The other main cluster includes 20 subunits, 15 of which localize to the membrane arm. We also analyzed the estimated evolutionary rates and found that the loss rates are significantly larger than the gain rates (Fig. S5 of the Supplementary material available on Dryad), which supports the idea that eukaryotic genes are much less mobile than prokaryotic genes. To further study the structure of complex I, we first obtained a network consisting of 462 significant links that were inferred by full pairwise comparisons using CCM. After pruning the network by removing the conditionally independent links detected from all triplets, we obtained a sparser network consisting of 101 linkages (Fig. S6 of the Supplementary material available on Dryad). We can observe two loosely connected components in this network: one is mainly composed of more densely linked subunits on the matrix arm with higher estimated values for the coefficients of interaction, while the other component is mainly composed of the subunits on the membrane arm. This network representation of the gene-interaction map shows more comprehensive information about the evolutionary cohesiveness of the genes than the pure clustering results in Figure 8.
Figure 8.
Clustering of mitochondrial respiratory complex I genes: the heatmap shows the phylogenetic profiles of 44 genes where black bars indicate presence. The column labels give the information of subunits, names - location (M, Matrix; T, Transmembrane; I, Intermembrane). The symbols below the gene names indicate the four components inferred from CLIME and those without symbols below indicate singletons. The dendrogram on the left indicates the eukaryotic tree and the names of species are given on the right as the row labels; the dendrogram above shows the hierarchical structure constructed with the estimated pairwise interactions by CCM.
Discussion
Identifying associations among traits is an important way to generate hypotheses about linkages between phenotypic, ecological, and genetic attributes. Often these associations need to be further analyzed or tested experimentally to demonstrate whether they have arisen due to selection or other factors (Peiman and Robinson 2017). Phylogenetic profiles are a specialized type of trait representation that has been used for over 20 years as a tool to explore and compare genomes; while they can be treated in a similar fashion to other types of traits, the sequences, genetic linkage information, and functional annotations associated with genes in a profile can be used to shed more light on evolutionary hypotheses. Many studies suggest that phylogenetic relationships among source genomes should be taken into account (Pagel 1994; Cokus et al. 2007; Cohen et al. 2013; Liu et al. 2018). Our previous work demonstrated the utility of Pagel’s model (Pagel 1994) in identifying sets of genes with correlated evolutionary trajectories; however, this approach was computationally expensive and could not infer the direction of the relationship. In this study, we proposed a new coevolution model, CCM, to detect genes with correlated evolutionary histories based on phylogenetic profiles. CCM was able to identify correlated genes as well as the direction of the relationship (e.g., Fig. S4a of the Supplementary material available on Dryad) and ran five times faster than Pagel’s method when tested on phylogenetic trees with 500 tips. The number of pairwise comparisons increases quadratically with the number of genes to be considered, but the independence of each comparison allows calculations to proceed in parallel. Heuristic methods can be used to quickly subdivide genes into large clusters that can then be refined using the CCM. Our model also has the ability to analyze the evolutionary relationships among sets of genes of size greater than 2. Examining such larger sets can provide a sparser gene network and greater insight into the complex relationships between genes.
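The remark on quadratic growth and parallel evaluation can be illustrated with a small sketch. It uses the Jaccard Index on presence/absence profiles as a cheap stand-in for a model fit; the function names, the thread pool, and the toy profiles are assumptions, and a CPU-bound fit such as CCM would more plausibly be distributed over processes or cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def n_pairs(n_genes):
    """Number of unordered gene pairs, n * (n - 1) / 2."""
    return n_genes * (n_genes - 1) // 2


def jaccard(profile_a, profile_b):
    """Jaccard Index of two presence/absence (1/0) profiles."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def pairwise_scores(profiles, score=jaccard, workers=4):
    """Score every gene pair; each pair is an independent task."""
    pairs = list(combinations(sorted(profiles), 2))

    def run(pair):
        return score(profiles[pair[0]], profiles[pair[1]])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(run, pairs))  # map preserves pair order
    return dict(zip(pairs, scores))


# Toy profiles over four genomes (hypothetical values).
profiles = {"g1": [1, 1, 0, 0], "g2": [1, 0, 0, 0], "g3": [0, 0, 1, 1]}
scores = pairwise_scores(profiles)
print(scores)
print(n_pairs(3786))  # pairs among the 3786 profiles analyzed here
```

Because no pair depends on any other, the list of pairs can be split into arbitrary chunks and the results concatenated afterwards.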
Based on CCM, we also developed a simulation procedure that can generate a set of coevolved profiles with interactions along a given phylogenetic tree. The strength of the interactions during evolution is also adjustable. A common way to evaluate comparative methods for detecting genes with correlated evolutionary histories is measuring the functional similarities based on gene annotations such as GO terms (Radivojac et al. 2013) and KEGG pathways (Jothi et al. 2007). However, such evaluation is subject to annotation completeness and the correlated patterns may not always reflect shared function as expressed by GO annotations. Our coevolving simulation procedure provides a way to generate benchmark data for evaluating the comparative methods.
In the simulation study, our method outperformed the nonphylogenetic method (Jaccard Index) and the tree-aware methods (Pagel’s correlation model, run-adjusted methods, and clade-adjusted methods) in detecting the significant links (Fig. 2d). We showed that our method can distinguish between Darwin’s scenario and the replicated co-occurrence scenario (Fig. 3). We also demonstrated that pairwise comparisons cannot detect conditionally independent links and further showed the performance of CCM in recovering the community structures (Fig. 4).
Finally, we applied our method to 3786 profiles across 659 genomes, and the results showed a strong positive relationship between evolutionary similarity and functional similarity (Fig. S3 of the Supplementary material available on Dryad). We also identified gene clusters with enriched functions (Table S2 of the Supplementary material available on Dryad) that can be used to better understand the functional roles of gene groups, and we predicted GO annotations for 823 unannotated genes based on their most strongly interacting genes with known GO annotations (Table S3 of the Supplementary material available on Dryad). We also demonstrated the use of CCM to refine the network obtained from the pairwise comparisons by removing conditionally independent linkages (Fig. 7). In addition to prokaryotic data, CCM was also successfully applied to a eukaryotic data set, the well-studied human complex I, and the recovered associations mapped well onto the structural associations that exist in the complex (Fig. 8; Fig. S6 of the Supplementary material available on Dryad). The results show that CCM, as a general comparative model, can also be applied to eukaryotic data. Although our method is used specifically to analyze phylogenetic profiles in this study, we think it can have wide applications in other fields, such as the study of species phenotypes (Goberna and Verdú 2016), ecological habitats (Fierer et al. 2012), and metagenomic profiles (Aagaard et al. 2012).
The uniqueness of the CCM lies in the careful modeling of each gene’s instantaneous gain and loss rates dependent on the current states of other genes. In addition to improving our ability to identify related genes, the CCM directly models the dependence between related genes in the evolutionary process. The same idea can possibly be generalized to phylogenetic models to jointly estimate the transition rate matrix of each site based on the current states of its neighbor sites or other related sites. The dependence between different genes, or different sites within a single gene, is an underexplored area in phylogeny and molecular evolution, with the majority of models assuming independence of sites. By developing better-fitting models that incorporate the dependence between different genes, we expect to gain insights into the mechanisms driving this dependence.
We also met a challenge in extending our method to directly model larger communities. The state space (of size $2^n$ for a community of $n$ genes) increases exponentially as we include more genes in the community. Currently, we have successfully tested our method on communities of fewer than 10 genes, but two problems arise if we include more genes: the huge memory required to store the transition rate matrix, of dimension $2^n \times 2^n$, and the long computation time for its eigendecomposition. We have found that if we reorder the rows and columns of the transition matrix, a recursive structure emerges: the matrix can be written as a block matrix in which one block is an antidiagonal matrix and another block has the same recursive structure as the full matrix (again composed of an antidiagonal matrix and a block matrix). We can solve the first problem by storing the matrix as a sequence of small “blocks,” but we have not found existing mathematical methods for the eigendecomposition of block matrices with such recursive structures. Our future work will explore possible ways to decompose the matrix more efficiently so that the CCM method is scalable.
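The scaling problem can be made concrete with a back-of-the-envelope sketch. It assumes that each gene is binary (present or absent), so that n genes give 2^n joint states as in the four-state matrices of Table 2, that the matrix is stored densely with 8-byte floating-point entries, and that a dense eigendecomposition costs on the order of N^3 operations for an N × N matrix. The function names are ours.

```python
def n_states(n_genes):
    """Number of joint presence/absence states for n binary genes."""
    return 2 ** n_genes


def dense_matrix_bytes(n_genes, bytes_per_entry=8):
    """Memory needed to store the full 2^n x 2^n matrix densely."""
    size = n_states(n_genes)
    return size * size * bytes_per_entry


def eigen_cost_order(n_genes):
    """Order-of-magnitude operation count (N^3) for a dense eigendecomposition."""
    return n_states(n_genes) ** 3


for n in (2, 5, 10, 15, 20):
    mib = dense_matrix_bytes(n) / 2 ** 20
    print(f"n={n:2d}  states={n_states(n):>8d}  "
          f"dense storage={mib:,.4f} MiB  ~N^3={eigen_cost_order(n):.2e}")
```

Under these assumptions, 10 genes need about 8 MiB, 15 genes about 8 GiB, and 20 genes about 8 TiB of dense storage, which is why a blockwise representation becomes attractive well before the eigendecomposition itself is addressed.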
Contributor Information
Chaoyue Liu, Department of Mathematics and Statistics, Dalhousie University, Halifax, NS B3H 4R2, Canada; Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada.
Toby Kenney, Department of Mathematics and Statistics, Dalhousie University, Halifax, NS B3H 4R2, Canada.
Robert G Beiko, Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada.
Hong Gu, Department of Mathematics and Statistics, Dalhousie University, Halifax, NS B3H 4R2, Canada.
Software availability
The R package evolCCM was written in R v4.0.2 and is available on Github (https://github.com/beiko-lab/evolCCM).
Supplementary material
Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.p8cz8w9rd.
Funding
This work was supported by Natural Sciences and Engineering Research Council of Canada [RGPIN/4945-2014, RGPIN-2017-05108, and RGPIN/05141-2017], Genome Canada, and Research Nova Scotia.
References
- Aagaard K., Riehle K., Ma J., Segata N., Mistretta T.A., Coarfa C., Raza S., Rosenbaum S., Van den Veyver I., Milosavljevic A., Gevers D. 2012. A metagenomic approach to characterization of the vaginal microbiome signature in pregnancy. PLoS One 7(6):e36466.
- Alcaraz L.D., Moreno-Hagelsieb G., Eguiarte L.E., Souza V., Herrera-Estrella L., Olmedo G. 2010. Understanding the evolutionary relationships and major traits of Bacillus through comparative genomics. BMC Genomics 11(1):1–17.
- Baldassi C., Zamparo M., Feinauer C., Procaccini A., Zecchina R., Weigt M., Pagnani A. 2014. Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners. PLoS One 9(3):e92721.
- Balsa E., Marco R., Perales-Clemente E., Szklarczyk R., Calvo E., Landázuri M.O., Enríquez J.A. 2012. NDUFA4 is a subunit of complex IV of the mammalian electron transport chain. Cell Metab. 16(3):378–386.
- Barker D., Pagel M. 2005. Predicting functional gene links from phylogenetic-statistical analyses of whole genomes. PLoS Comput. Biol. 1(1):e3.
- Benjamini Y., Yekutieli D. 2005. False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Am. Stat. Assoc. 100(469):71–81.
- Bick A.G., Calvo S.E., Mootha V.K. 2012. Evolutionary diversity of the mitochondrial calcium uniporter. Science 336(6083):886–886.
- Bowers P.M., Pellegrini M., Thompson M.J., Fierro J., Yeates T.O., Eisenberg D. 2004. Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol. 5(5):1–13.
- Brodie E.D. III. 1992. Correlational selection for color pattern and antipredator behavior in the garter snake Thamnophis ordinoides. Evolution 46(5):1284–1298.
- Cohen O., Ashkenazy H., Burstein D., Pupko T. 2012. Uncovering the co-evolutionary network among prokaryotic genes. Bioinformatics 28(18):i389–i394.
- Cohen O., Ashkenazy H., Levy Karin E., Burstein D., Pupko T. 2013. COPAP: coevolution of presence–absence patterns. Nucleic Acids Res. 41(W1):W232–W237.
- Cokus S., Mizutani S., Pellegrini M. 2007. An improved method for identifying functionally linked proteins using phylogenetic profiles. BMC Bioinformatics 8:1–12.
- Cong Q., Anishchenko I., Ovchinnikov S., Baker D. 2019. Protein interaction networks revealed by proteome coevolution. Science 365(6449):185–189.
- Csűrös M. 2010. Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics 26(15):1910–1912.
- Felsenstein J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Biol. 22(3):240–249.
- Fierer N., Lauber C.L., Ramirez K.S., Zaneveld J., Bradford M.A., Knight R. 2012. Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients. ISME J. 6(5):1007–1017.
- Fraser H.B., Hirsh A.E., Wall D.P., Eisen M.B. 2004. Coevolution of gene expression among interacting proteins. Proc. Natl. Acad. Sci. USA 101(24):9033–9038.
- Goberna M., Verdú M. 2016. Predicting microbial traits with phylogenies. ISME J. 10(4):959–967.
- Guo R., Zong S., Wu M., Gu J., Yang M. 2017. Architecture of human mitochondrial respiratory megacomplex I₂III₂IV₂. Cell 170(6):1247–1257.
- Haubold B., Wiehe T. 2004. Comparative genomics: methods and applications. Naturwissenschaften 91(9):405–421.
- Huynen M., Snel B., Lathe W., Bork P. 2000. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 10(8):1204–1210.
- Jothi R., Przytycka T.M., Aravind L. 2007. Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment. BMC Bioinformatics 8(1):1–17.
- Keeling P.J., Palmer J.D. 2008. Horizontal gene transfer in eukaryotic evolution. Nat. Rev. Genetics 9(8):605–618.
- Koonin E.V., Aravind L., Kondrashov A.S. 2000. The impact of comparative genomics on our understanding of evolution. Cell 101(6):573–576.
- Li Y., Calvo S.E., Gutman R., Liu J.S., Mootha V.K. 2014. Expansion of biological pathways based on evolutionary inference. Cell 158(1):213–225.
- Liu C., Wright B., Allen-Vercoe E., Gu H., Beiko R. 2018. Phylogenetic clustering of genes reveals shared evolutionary trajectories and putative gene functions. Genome Biol. Evol. 10(9):2255–2265.
- Maddison W.P., FitzJohn R.G. 2015. The unsolved challenge to phylogenetic correlation tests for categorical characters. Syst. Biol. 64(1):127–136.
- Meade A., Pagel M. 2017. BayesTraits V3.0.1. Available from: http://www.evolution.reading.ac.uk/BayesTraitsV3.0.1/BayesTraitsV3.0.1.html. Accessed July 2022.
- Moi D., Kilchoer L., Aguilar P.S., Dessimoz C. 2020. Scalable phylogenetic profiling using minhash uncovers likely eukaryotic sexual reproduction genes. PLoS Comput. Biol. 16(7):e1007553.
- Morcos F., Pagnani A., Lunt B., Bertolino A., Marks D.S., Sander C., Zecchina R., Onuchic J.N., Hwa T., Weigt M. 2011. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. USA 108(49):E1293–E1301.
- Niu Y., Moghimyfiroozabad S., Safaie S., Yang Y., Jonas E.A., Alavian K.N. 2017. Phylogenetic profiling of mitochondrial proteins and integration analysis of bacterial transcription units suggest evolution of F1Fo ATP synthase from multiple modules. J. Mol. Evol. 85(5):219–233.
- Pagel M. 1994. Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters. Proc. R. Soc. Lond. B 255(1342):37–45.
- Paradis E., Claude J., Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20(2):289–290.
- Peiman K.S., Robinson B.W. 2017. Comparative analyses of phenotypic trait covariation within and among populations. Am. Nat. 190(4):451–468.
- Pellegrini M. 2012. Using phylogenetic profiles to predict functional relationships. In: Bacterial molecular networks. New York (NY): Springer. p. 167–177.
- Pellegrini M., Marcotte E.M., Thompson M.J., Eisenberg D., Yeates T.O. 1999. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96(8):4285–4288.
- Pyron R.A., Costa G.C., Patten M.A., Burbrink F.T. 2015. Phylogenetic niche conservatism and the evolutionary basis of ecological speciation. Biol. Rev. 90(4):1248–1262.
- Radivojac P., Clark W.T., Oron T.R., Schnoes A.M., Wittkop T., Sokolov A., Graim K., Funk C., Verspoor K., Ben-Hur A., Pandey G. 2013. A large-scale evaluation of computational protein function prediction. Nat. Methods 10(3):221–227.
- Sadreyev I.R., Ji F., Cohen E., Ruvkun G., Tabach Y. 2015. Phylogene server for identification and visualization of co-evolving proteins using normalized phylogenetic profiles. Nucleic Acids Res. 43(W1):W154–W159.
- Sanford G.M., Lutterschmidt W.I., Hutchison V.H. 2002. The comparative method revisited. BioScience 52(9):830–836.
- Sibbald S.J., Eme L., Archibald J.M., Roger A.J. 2020. Lateral gene transfer mechanisms and pan-genomes in eukaryotes. Trends Parasitol. 36(11):927–941.
- Silvertown J., Dodd M. 1996. Comparing plants and connecting traits. Philos. Trans. R. Soc. Lond. B 351(1345):1233–1239.
- Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21):2688–2690.
- Szklarczyk D., Gable A.L., Nastou K.C., Lyon D., Kirsch R., Pyysalo S., Doncheva N.T., Legeay M., Fang T., Bork P., Jensen L.J. 2021. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49(D1):D605–D612.
- Thacker W.C. 1989. The role of the Hessian matrix in fitting models to measurements. J. Geophys. Res.: Oceans 94(C5):6177–6196.
- Tremblay B.J., Lobb B., Doxey A.C. 2021. Phylocorrelate: inferring bacterial gene-gene functional associations through large-scale phylogenetic profiling. Bioinformatics 37(1):17–22.
- UniProt Consortium. 2015. UniProt: a hub for protein information. Nucleic Acids Res. 43(D1):D204–D212.
- Uyeda J.C., Zenil-Ferguson R., Pennell M.W. 2018. Rethinking phylogenetic comparative methods. Syst. Biol. 67(6):1091–1109.
- van Dongen S., Abreu-Goodger C. 2012. Using MCL to extract clusters from networks. In: Bacterial molecular networks. New York (NY): Springer. p. 281–295.
- von Mering C., Huynen M., Jaeggi D., Schmidt S., Bork P., Snel B. 2003. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 31(1):258–261.
- Vos M., Hesselman M.C., Te Beek T.A., van Passel M.W., Eyre-Walker A. 2015. Rates of lateral gene transfer in prokaryotes: high but why? Trends in Microbiology 23(10):598–605.
- Wang J.Z., Du Z., Payattakool R., Yu P.S., Chen C.-F. 2007. A new method to measure the semantic similarity of GO terms. Bioinformatics 23(10):1274–1281.
- Wu J., Kasif S., DeLisi C. 2003. Identification of functional links between genes using phylogenetic profiles. Bioinformatics 19(12):1524–1530.
- Wu J., Mellor J.C., De Lisi C. 2005. Deciphering protein network organization using phylogenetic profile groups. Genome Informatics 16(1):142–149.
- Wu M., Scott A.J. 2012. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics 28(7):1033–1034.
- Ye Y., Choi J.-H., Tang H. 2011. RAPSearch: a fast protein similarity search tool for short reads. BMC Bioinformatics 12(1):1–10.
- Yu G., Li F., Qin Y., Bo X., Wu Y., Wang S. 2010. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26(7):976–978.