Abstract
After transitioning to a new environment, species often exhibit rapid phenotypic innovation. One of the fastest mechanisms for this is duplication followed by specialization of existing genes. When this happens to a member of a gene family, it tends to leave a detectable phylogenetic signature of lineage-specific expansions and contractions. These can be identified by analyzing the gene family across several species and identifying patterns of gene duplication and loss that do not correlate with the known relationships between those species. This signature, termed phylogenetic instability, has been previously linked to adaptations that change the way an organism samples and responds to its environment; conversely, low phylogenetic instability has been previously linked to proteins with endogenous functions. With the increase in genome-level data, there is a need to identify and quantify phylogenetic instability. Here, we present Minimizing Instability in Phylogenetics (MIPhy), a tool that solves this problem by quantifying the incongruence of a gene’s evolutionary history. The motivation behind MIPhy was to produce a tool to aid in interpreting phylogenetic trees. It can predict which members of a gene family are under adaptive evolution, working only from a gene tree and the relationship between the species under consideration. While it does not conduct any estimation of positive selection (the typical indication of adaptive evolution), the results tend to agree. We demonstrate the usefulness of MIPhy by accurately predicting which members of the mammalian cytochrome P450 gene superfamily metabolize xenobiotics and which metabolize endogenous compounds. Our predictions correlate very well with known substrate specificities of the human enzymes. We also analyze the Caenorhabditis collagen gene family and use MIPhy to predict genes that produce an observable phenotype when knocked down in C. elegans, and show that our predictions correlate well with existing knowledge. The software can be downloaded and installed from https://github.com/dave-the-scientist/miphy and is also available as an online web tool at http://www.miphy.wasmuthlab.org.
Keywords: Phylogenetic clustering, Phylogenetic instability, Gene family evolution
Introduction
In the absence of specific selective pressures, the phylogeny of a multi-species gene family will tend to agree with the underlying species tree. However, gene events such as gene duplication/loss, horizontal gene transfer (HGT), and incomplete lineage sorting (ILS)—where a polymorphic locus in an ancestral species results in incongruence with the species tree—may become fixed in a species due to evolutionary processes. These events can result in lineage-specific variations in gene family size and incongruence between the gene family phylogeny and the species tree, properties that have collectively been referred to as “phylogenetic instability” (Thomas, 2007). Attempting to work backwards and determine the sequence of events that led from the species tree to the observed gene family is a process called event-inference reconciliation.
It has been hypothesized that the change in environment during a speciation event may lead to higher levels of phylogenetic instability (Lynch & Conery, 2000; Zhang, 2003; Hurley, Hale & Prince, 2005), especially in genes involved in responding to molecules from the environment (xenobiotics). This has been observed in gene families involved in the immune response (de Bono, Madera & Chothia, 2004; Nei, Gu & Sitnikova, 1997; Su et al., 1999), chemosensory receptors (Niimura & Nei, 2005; Thomas et al., 2005), detoxification (Thomas, 2007), and host-pathogen interactions (Wasmuth et al., 2012). These observations are supported by recent ecological experimental evidence showing that higher rates of evolution allow populations to more rapidly expand into new territory (Szűcs et al., 2017).
Here, we propose using phylogenetic instability to predict the functional roles of the members of a gene family using a new tool, Minimizing Instability in Phylogenetics (MIPhy). Specifically, we aim to identify which family members are under pressure to duplicate and contribute to altered or new functions, with the possibility of new phenotypes. Understanding the effects of these selective pressures is of more than purely theoretical importance; as one example, the rapid evolution of drug resistance remains one of the most significant challenges in managing both human (Saunders & Lon, 2016) and livestock parasites (Kaplan & Vidyashankar, 2012), and the mechanisms underlying these resistant phenotypes are often unknown. We show the usefulness of MIPhy by validating it against two data sets: the cytochrome P450 (cyp) genes from ten species of vertebrates, and the collagens from eight species of free-living nematodes. This tool can be used to prioritize genes for further study, for example, by predicting the origin of some species-specific function or identifying essential genes as new therapeutic targets in pathogens. The process to detect phylogenetically unstable genes is twofold. First, a tree of a large multi-member gene family is split into meaningful clusters, termed minimum instability groups (MIGs), by incorporating an event-inference model of gene evolution. Second, each MIG is independently scored for phylogenetic instability.
Related work
There are several existing algorithms for species/gene tree reconciliation, but none are able to segregate a gene tree into meaningful clusters, quantify the stability of those gene clusters, or score each gene in order to compare and rank the individual family members. CAFE 3 uses a stochastic birth-death model of gene family evolution to infer the size of ancestral families (De Bie et al., 2006; Han et al., 2013). It implements a sampling procedure to determine the statistical significance of those gene families that differ from their expected values, and models the effects of genome assembly and gene annotation errors to provide a more accurate estimate of its evolutionary rates. CAFE 3 uses only the gene family counts without considering the phylogenetic relationships within them, and so would be unable to distinguish inherited paralogs from independently duplicated genes. Further, the algorithm calculates whether an entire gene family is under adaptive evolution, while we are interested in the relative differences between specific clusters of genes within a family. Because of this, it is more suited for large-scale analyses of many gene families at once.
BadiRate is similar to CAFE 3, implementing several additional stochastic models of evolution, and providing three statistical frameworks to calculate significance (Librado, Vieira & Rozas, 2012). While it allows for more detailed analyses of species traits, it still relies on gene count data and so is unsuitable here for the same reasons as CAFE. It also requires a species tree with meaningful branch lengths, the creation of which is in itself a challenging analysis.
NOTUNG (Chen, Durand & Farach-Colton, 2000; Vernot et al., 2008; Stolzer et al., 2012) implements a parsimony-based reconciliation algorithm. It finds the sequence of gene events (gene duplication, gene loss, HGT, and ILS) explaining the differences between the observed gene tree and the underlying species relationships that minimizes a weighted sum. Uniquely amongst other reconciliation methods, it allows for the species or gene tree to be non-binary; as the true history of many species is unclear, polytomies can be useful to describe the current state of knowledge. Important in this consideration is that NOTUNG explicitly models HGT and assumes that ILS is a very rare event, only considering it at polytomies in the gene tree. A recent paper has proposed a similar algorithm, with advances in identifying ILS and HGT (Chan, Ranwez & Scornavacca, 2017). Identifying HGT is a computationally intensive process and is unlikely to play an important role in gene families from multi-cellular organisms, and we assume that incongruence (as produced by ILS, adaptive evolution, or any other mechanism) is a common enough event to allow throughout the tree (Carstens & Knowles, 2007; Mirarab, Bayzid & Warnow, 2016; Scally et al., 2012). RANGER-DTL is another reconciliation method, and has been reported to be 1,000–1,000,000× faster than software like NOTUNG (Bansal, Alm & Kellis, 2012). Unfortunately, this model proved unsuitable as it too does not allow for incongruence events.
There are also several probabilistic reconciliation methods available (Rasmussen & Kellis, 2007, 2011; Ma et al., 2008; Doyon et al., 2010; Doyon, Hamel & Chauve, 2012). While these models make use of more sophisticated models of evolution, they are far more computationally intensive and are only applicable to species for which speciation times and/or ancestral population size estimates are available, which is not the case for most species. PHYLDOG overcomes some of these limitations as it is able to estimate the most likely gene trees, species tree, and evolutionary history of a large number of gene families at once (Boussau et al., 2013). Though it does not explicitly model ILS, the authors state that the algorithms can accommodate it as long as the signal is not too strong. This makes it unsuitable, as we expect gene families involved in direct environmental interactions to have a strong ILS signal. Further, this software is designed to combine the information from many gene families at once, and requires extremely significant computational resources (Chaudhary et al., 2015).
Validation
A previous study conducted a detailed analysis of the vertebrate cyp gene family (Thomas, 2007), and found that enzymes with known xenobiotic substrates (about half of the gene family) exhibited high phylogenetic instability, while those with known endogenous substrates were strikingly phylogenetically stable, with clearly defined orthologous relationships. We validate the accuracy of MIPhy by comparing its predictions to the results of that study. That work relied upon the author’s detailed knowledge of the gene family under study, and so was not quantified. As the genomes of an increasing number of species are being made available, manual analysis of large gene families from hundreds of species will become intractable. Further, it is desirable to use an algorithm that is consistent and deterministic.
Nematode collagens are a large multi-gene family of structural proteins. The Caenorhabditis elegans genome contains 181 collagen genes (The C. elegans Sequencing Consortium, 1998), many of which encode proteins that form a major part of the nematode cuticle, which molts five times in the nematode life-cycle and protects the worm from environmental insult. A combination of high-throughput and targeted gene knock-down studies has shown that 28 of these genes are associated with an observable phenotype, ranging from morphological variants to lethality (reviewed in Page & Johnstone, 2007). Available genome sequences from other Caenorhabditis species reveal both conservation and divergence of genes and their role in biochemical pathways (Stein et al., 2003; Fierst et al., 2015; Gilabert et al., 2016). To validate MIPhy’s predictions for researchers aiming to prioritize genes for functional characterization, we tested whether MIGs with lower phylogenetic instability scores were more likely to contain C. elegans genes associated with phenotypic changes when knocked down.
While an individual can manually cluster a small tree without much trouble, the large size of some gene families and the ever-expanding availability of sequence data mean that this will quickly become intractable. There are several software packages used to automatically cluster a phylogenetic tree, but because of the ill-defined nature of clustering problems in general, the methods generally come to different conclusions on the same data sets. We are aware of no method that is targeted towards multi-species gene families, which means that none make use of problem-specific information such as an event-inference model of gene evolution. The clustering algorithm described here combines the similarity between each gene with the most parsimonious explanation of gene events, to predict the ancestry of each observed member of the gene family.
Methods
Running MIPhy on a large phylogeny
The NCBI genome database (https://www.ncbi.nlm.nih.gov/assembly/organism/) was filtered for all animal genomes that were at a “Chromosome” or “Complete” level of assembly on July 26, 2016, yielding 98 hits. When there were multiple genome assemblies for a single species, only the one with the highest number of annotated proteins was kept. Finally, the Bos indicus, Capra aegagrus, Mus spretus, Nasalis larvatus, and Nomascus leucogenys genomes were discarded as they were judged to contain too few protein sequences to have reliable annotations (all had fewer than 1,500). All protein sequences for the remaining 58 species were concatenated into one file, which was queried with the 628 vertebrate Cyp proteins from Thomas (2007) using blastp (Camacho et al., 2009), resulting in 5,498 hits with an E-value < 10^-10. We note that this is not a particularly rigorous procedure; some of these sequences may not actually be Cyp proteins, and we may have missed some true hits. However, the purpose of this procedure was to generate a very large and representative phylogeny as a test case for MIPhy, not to comment on animal Cyps themselves.
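The E-value filtering step can be illustrated with a minimal Python sketch. It assumes that the blastp results were written in the standard 12-column tabular format (-outfmt 6), in which the subject identifier is the second column and the E-value is the eleventh; the text does not state which output format was actually used, and the function name and example records below are hypothetical.

```python
def significant_subjects(blast_lines, max_evalue=1e-10):
    """Return the set of subject IDs having at least one hit below max_evalue.

    Assumes BLAST tabular output (-outfmt 6): tab-separated columns
    qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    subjects = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:
            subjects.add(fields[1])
    return subjects


# Made-up records for illustration only.
example = [
    "CYP3A4\tprotA\t62.1\t480\t170\t5\t1\t478\t3\t482\t1e-150\t520",
    "CYP3A4\tprotB\t28.0\t90\t60\t2\t10\t99\t5\t94\t0.003\t35.1",
    "CYP2D6\tprotA\t35.5\t450\t270\t9\t4\t450\t8\t455\t2e-60\t210",
]
hits = significant_subjects(example)
```

Collecting subject identifiers into a set means that a protein matched by several query Cyps is counted once.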
The sequences were aligned using Clustal Omega (Sievers et al., 2011) with the command:
clustalo -i INPUT_FILE.fa --threads 10 --log INPUT_FILE-clustalO.log -v --force --use-kimura --iter 10 -o OUT_FILE
The “use-kimura” option specifies that a correction should be applied to the distance between sequences to better estimate their true evolutionary distance (Kimura, 1980). The columns of this alignment with <75% gaps were used to build a phylogenetic tree using RAxML (Stamatakis, 2014) with the command:
raxml -s INPUT_FILE.phylip -T 10 -# 5 -m PROTGAMMAWAG -j -p 12345 -n OUT_FILE
Here, the “-T” option specifies the number of threads used, “-#” specifies the number of independent tree searches, and “-p” is a random number seed that allows the results to be reproduced. The “PROTGAMMAWAG” model was chosen, which fits a gamma model of rate heterogeneity onto the WAG substitution model.
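The column-filtering step applied to the alignment before tree building can be sketched as follows. This is an illustrative Python example, not the script used in the study: the function name, the gap characters, and the toy alignment are assumptions, and only the 75% threshold comes from the text.

```python
def filter_gappy_columns(alignment, max_gap_fraction=0.75, gap_chars="-."):
    """Keep only the alignment columns whose gap fraction is below the threshold.

    alignment: list of equal-length aligned sequences (strings).
    Returns the filtered sequences in the same order.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    keep = [
        col
        for col in range(length)
        if sum(seq[col] in gap_chars for seq in alignment) / n_seqs < max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment: the second column is 75% gaps and is therefore dropped.
toy = ["MK-A", "M--A", "M-GC", "M-TC"]
filtered = filter_gappy_columns(toy)
```

A strict comparison is used so that a column with exactly 75% gaps is removed, matching the “<75% gaps” wording.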
Analysis of nematode collagen genes
From WormBase (Howe et al., 2016), there are 157 genes from C. elegans annotated with the gene class “col”. To these we added the 19 genes listed in Page & Johnstone (2007). A further five were found by searching for the repetitive Gly-X-Y amino acid motif and checking each entry in WormBase. Phenotype data from gene knock-down studies are available from WormBase. The protein sequences of C. angaria, C. brenneri, C. briggsae, C. japonica, C. remanei, C. sinica, and C. tropicalis were downloaded from WormBase (version WS259). We searched the 181 C. elegans collagens against these protein sets using BLASTP (Camacho et al., 2009) and confirmed the presence of the characteristic and repetitive Gly-X-Y amino acid motif. Where a gene had multiple isoforms, we selected the longest for subsequent analysis. In total, 1,349 genes were collected from the eight species.
The diversity of the N- and C-terminus across the collagens, coupled to the variable number of the Gly-X-Y motif, precludes a standard sequence alignment based approach. Therefore, we constructed a distance matrix based on k-mer frequency, using the jD2Stat program (Chan et al., 2014) with the command:
java -Xmx20g -jar jD2Stat_1.0.jar -n 1 -k 8
Here, “-Xmx20g” indicates that we allocated 20 GB of RAM to the program, “-k 8” indicates that we used the default k-mer size of 8, and “-n 1” indicates that we allowed one wildcard character when identifying those k-mers.
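To give a sense of how an alignment-free distance can be derived from k-mer content, the following Python sketch computes the plain D2 statistic (the number of matching k-mer pairs between two sequences) and converts it to a distance. It is a simplified illustration rather than a reimplementation of jD2Stat: it ignores the wildcard option, and the cosine-style normalization used to obtain a distance is an assumption made here for clarity.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(counts_a, counts_b):
    """Plain D2 statistic: sum over shared k-mers of the product of their counts."""
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items() if kmer in counts_b)


def kmer_distance(seq_a, seq_b, k=8):
    """Distance in [0, 1]: 0 for identical k-mer profiles, 1 for no shared k-mers.

    The normalization by the self-scores is an illustrative assumption.
    """
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    norm = sqrt(d2(a, a) * d2(b, b))
    if norm == 0:
        raise ValueError("both sequences must be at least k residues long")
    return 1.0 - d2(a, b) / norm


# Toy collagen-like repeats, using k=3 so the short strings contain several k-mers.
same = kmer_distance("GPPGAPGPP", "GPPGAPGPP", k=3)
different = kmer_distance("AAAA", "CCCC", k=2)
```

Because only k-mer content is compared, sequences with different numbers of Gly-X-Y repeats can be scored without first being aligned.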
We used the neighbor program with default parameters from the phylip suite to reconstruct the phylogenetic tree (Felsenstein, 1989). The species phylogenetic relationships had been previously determined using the ITS-2 genetic barcode (Félix, Braendle & Cutter, 2014). Note that C. sp. 5 has since been renamed as C. sinica (Huang et al., 2014).
When statistically comparing the instability scores of MIGs with and without observable knock-down phenotypes in C. elegans, neither set of scores was normally distributed (Shapiro–Wilk test). We therefore used a one-tailed Mann–Whitney U test to compare them.
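The one-tailed comparison can be sketched in Python as below. This is a minimal illustration that uses the normal approximation without tie or continuity corrections; the exact procedure and software used for the published test are not stated in the text, and the function names are hypothetical.

```python
from math import erf, sqrt


def one_tailed_p_from_u(u, n1, n2):
    """Lower-tail p-value for a Mann-Whitney U statistic (normal approximation)."""
    mean_u = n1 * n2 / 2.0
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # P(Z <= z)


def mann_whitney_u_lower(x, y):
    """Test whether values in x tend to be lower than values in y.

    U counts the (x, y) pairs in which the x value is larger (ties count 0.5),
    so a small U supports the hypothesis that x is shifted towards lower values.
    """
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    return u, one_tailed_p_from_u(u, len(x), len(y))


# Made-up scores for illustration only.
u_toy, p_toy = mann_whitney_u_lower([1, 2, 3], [4, 5, 6])

# Applying the approximation to the reported statistic (U = 991, groups of 25 and 126 MIGs).
p_reported = one_tailed_p_from_u(991, 25, 126)
```

Under this approximation the reported U statistic gives a p-value of roughly 0.002, in line with the value given in the Results.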
Parsimony clustering of the gene tree using a model of gene family evolution
The algorithm described in this work uses a model of gene family evolution derived from the core reconciliation methods of NOTUNG (Chen, Durand & Farach-Colton, 2000; Vernot et al., 2008; Stolzer et al., 2012), with some modifications such as allowing incongruence throughout the tree. We do this as apparent incongruence may arise for many reasons: due to errors in sequencing or gene-finding, incompletely resolved branches in tree-building software, HGT, ILS, or it may be due to selective pressures acting on one or more species. Using our model, each internal node of the gene tree is classified as representing one gene event: duplication, speciation, or incongruence. Gene loss is also considered a gene event, and is quantified at duplication nodes. The algorithm is detailed in Article S1, but summarized here.
MIPhy was designed to identify members of a gene family under adaptive evolution, and so must also cluster the given gene tree into MIGs. This is necessary to isolate “unstable” genes from “stable” genes, and has the effect of assigning all genes from all species in one MIG the same phylogenetic instability score. This score is a function of the model of gene family evolution, and for a given MIG rooted at node g it quantifies all gene events at or below the most recent common ancestor of that group:
$$S(g) = \theta_D D(g) + \theta_I I(g) + \theta_L L(g) + \theta_P P(g)$$
where D(g), I(g), and L(g) are the total duplications, incongruence, and loss events within the MIG, respectively; P(g) is a measure of the “relative spread” of the MIG (how dissimilar the sequences are; set to 0 for this phase); and the θ values are the strictly positive weights applied to each event. Under this definition, the score can be interpreted as a measure of the incongruence experienced by a cluster of genes throughout their evolutionary history.
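As an illustration, the score can be sketched in Python under the assumption that it is a weighted sum of the event counts and the spread term, which is what the description of the θ weights suggests; the exact functional form is given in Article S1. The default weights are those reported in the “Parameter impact” section, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Event weights (theta values); defaults follow the reported MIPhy defaults."""
    duplication: float = 1.0
    loss: float = 1.0
    incongruence: float = 0.5
    spread: float = 1.0


def instability_score(duplications, incongruences, losses, spread=0.0, weights=Weights()):
    """Weighted sum of gene events within a MIG (assumed form of the score).

    spread is the relative spread P(g); it is 0 during the initial clustering phase.
    """
    return (
        weights.duplication * duplications
        + weights.incongruence * incongruences
        + weights.loss * losses
        + weights.spread * spread
    )


# A hypothetical MIG with 2 duplications, 1 incongruence event and 3 losses.
initial_phase = instability_score(2, 1, 3)
refined_phase = instability_score(2, 1, 3, spread=-0.3)
```

A MIG with no duplications, losses, or incongruence scores 0 in the initial phase, which corresponds to a set of clearly defined orthologs.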
Every node in the gene tree is evaluated in a depth-first post-order traversal; if the node is a leaf a new MIG is defined as containing only that node. At each non-leaf node in the tree the score function is used to compare two possibilities: merging all of that node’s descendants into a single MIG, versus allowing the existing MIG patterns to remain. Initially, while traveling from the leaves towards the root, the “merge” choice tends to be most parsimonious. This continually populates the MIG, the final boundaries of which are determined by the point that the “remain” option instead becomes most parsimonious. This is the initial clustering phase, and it generates a preliminary clustering pattern.
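The traversal described above can be sketched in Python. The tree representation (nested tuples with “species|gene” leaf labels) and the merge criterion are placeholders chosen for illustration: the real decision compares instability scores derived from the reconciliation model (Article S1), whereas the toy criterion below simply refuses to merge when a species would appear more than once in the resulting group.

```python
def cluster_postorder(node, should_merge):
    """Cluster a gene tree into groups by a post-order merge-or-remain decision.

    node: a leaf label (str) or a tuple of child nodes.
    should_merge(merged, existing): hypothetical callback that returns True when
    merging all descendants into one group is preferred over keeping the
    existing groups.
    Returns the list of groups (lists of leaf labels) found below node.
    """
    if isinstance(node, str):
        return [[node]]  # each leaf starts as its own group
    existing = [
        group
        for child in node
        for group in cluster_postorder(child, should_merge)
    ]
    merged = [leaf for group in existing for leaf in group]
    return [merged] if should_merge(merged, existing) else existing


def one_copy_per_species(merged, existing):
    """Toy criterion (an assumption): merge only if no species is repeated."""
    species = [leaf.split("|")[0] for leaf in merged]
    return len(species) == len(set(species))


# Two ortholog pairs separated by an ancestral duplication.
tree = (("human|a1", "mouse|a1"), ("human|a2", "mouse|a2"))
groups = cluster_postorder(tree, one_copy_per_species)
```

Children are fully processed before their parent, so the groups grow from the leaves towards the root until the criterion first rejects a merge.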
Cluster refinement
This initial clustering pattern arises from the most parsimonious history of gene events required to reconcile the gene family phylogeny (TG) with the species phylogeny (TS). It indicates which groups of genes, under this model, for the given weights and while disregarding all branch lengths in TG, most probably evolved from a single homologue in an ancestral species. This second phase of the algorithm refines these predictions by incorporating branch length information, specifically the pairwise distance information between the sequences. If a sequence in the gene tree is separated by an uncommonly large phylogenetic distance from its closest MIG, there should be a cost associated with the decision to include it in that MIG.
This is accomplished by the “relative spread” term P(g) in the score function, which measures the spread within a cluster. It is a measure of how “good” a cluster is compared to the others:
$$P(g) = \frac{\sigma(g) - \tilde{\sigma}}{\tilde{\sigma}}$$
where σ(g) is the standard deviation of the points representing the sequences in the MIG rooted by g, and $\tilde{\sigma}$ is the median standard deviation of all MIGs (excluding singleton clusters). The spread quantity is normalized around 0, so P(g) = 1.0 indicates that the spread of MIG g is 100% larger than the median spread, while P(h) = −0.3 indicates that the spread of MIG h is 30% smaller than $\tilde{\sigma}$. Many clustering metrics, including this one, can only be calculated from data in a coordinate space, and so we first transform the phylogenetic tree into a set of points using multi-dimensional scaling (Torgerson, 1952) (see Article S1 for implementation details). Standard deviation is used as a measure of the pairwise branch lengths within a MIG because it is widely used and intuitive, but clustering-specific methods like the Davies–Bouldin index (Davies & Bouldin, 1979) or silhouette (Rousseeuw, 1987) could be easily substituted. As in the initial clustering phase, each node g in TG is again visited in turn. The clustering procedure is repeated, this time using the full score function.
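The relative spread can be illustrated with a short Python sketch. It assumes that the coordinates of each sequence have already been obtained (for example by multi-dimensional scaling) and that the “standard deviation of the points” is the root-mean-square distance of the points from their centroid; both the function names and that definition are assumptions, as the implementation details are in Article S1.

```python
from math import sqrt
from statistics import median


def spread(points):
    """Root-mean-square distance of the points from their centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    squared = sum(sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points)
    return sqrt(squared / n)


def relative_spreads(migs):
    """Map each MIG name to its spread relative to the median spread.

    migs: dict of name -> list of coordinate tuples.
    The median is taken over non-singleton MIGs only. In this sketch a singleton
    has zero spread and therefore receives a value of -1.
    """
    sigmas = {name: spread(points) for name, points in migs.items()}
    multi = [sigma for name, sigma in sigmas.items() if len(migs[name]) > 1]
    if not multi:
        raise ValueError("at least one non-singleton MIG is required")
    med = median(multi)
    if med == 0:
        raise ValueError("median spread is zero; relative spread is undefined")
    return {name: sigma / med - 1.0 for name, sigma in sigmas.items()}


# Toy coordinates: the spreads are 1, 2 and 4, so the median spread is 2.
toy_migs = {
    "tight": [(0.0, 0.0), (2.0, 0.0)],
    "typical": [(0.0, 0.0), (4.0, 0.0)],
    "loose": [(0.0, 0.0), (8.0, 0.0)],
}
p_values = relative_spreads(toy_migs)
```

In this toy case the “loose” group receives a value of 1.0, meaning its spread is 100% larger than the median, while the “tight” group receives −0.5.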
Results
Program input, workflow, and interface
This software requires two input files: the gene tree in Newick format, and an information file that contains the species tree (topology only; no branch lengths) as well as the assignment of each sequence to one species. MIPhy is agnostic to the method used to generate the tree and can be used to analyze those produced from nucleotides, amino acids, or any other features. The cluster analysis algorithm is written in Python and a local daemon server is started along with an HTML document to display the results. This page has interactive controls and communicates directly with the Python server, allowing the user to reanalyze their data and see the effects of modifying any of the parameters in real time.
The visualization page displays the gene tree clustered into MIGs, the current parameter values, summary statistics, and a sortable list of the MIGs (Fig. 1). Selecting a specific sequence or MIG will provide additional details. The page also contains a usage description, and provides options to modify visual elements like font sizes, the tree size, and the color of each element. The tree and legend can be exported and saved as an SVG image file, or the clustering pattern and instability scores from one or more species can be exported and saved as a CSV file.
MIPhy was used to analyze a dataset of annotated vertebrate Cyp proteins, which consists of 628 sequences from 10 species (Thomas, 2007). The algorithm calculated the optimal clustering pattern in 0.2 s on a 2.7 GHz laptop. Loading the results in a web browser required ∼5 s. Modifying parameter weights causes the clustering analysis to be rerun, and redrawing the new results is sped up as only a subset of the page elements need to be modified or recreated (<1 s). To determine how MIPhy will scale to cope with the ever-increasing number of genome sequences, we analyzed a tree of 5,498 Cyp protein sequences from 58 animal species. MIPhy completed the initial clustering phase in 30 s, the optional cluster refinement phase in 7 min, and loaded the results in a web browser in 1.5 min.
Phylogenetic instability of human Cyp proteins
MIPhy was run with default parameters on the Cyp phylogenetic tree from (Thomas, 2007), and the 59 scores from human sequences were extracted and graphed (Fig. 2); the substrate classification, positive selection, and genome clustering results from the same study were overlaid. These scores fell into two broad categories: 31 were unstable with scores in the interval [18.2, 97.5], and 28 were stable with scores in [0.1, 10.9]. Of the stable sequences, 23 had low scores in [0.1, 5.7], and the remaining five had intermediate scores in [7.8, 10.9].
Among the MIGs with intermediate scores, Cyp-11B1 (steroid 11β-hydroxylase) and Cyp-11B2 (aldosterone synthase) appear to have been recently duplicated in the terrestrial vertebrates, and likely played a role in the ancient transition from sea to land (Colombo et al., 2006). Their instability score is elevated because rats appear to have two additional genes in that cluster and no homologs were found in chicken or frog. It is unclear whether they are actually lost in these species or simply absent from the assemblies.
Parameter impact
The default MIPhy weight values are set at 1, 1, 0.5, and 1, for duplications, loss, incongruence, and spread, respectively. These have performed well in testing and analyses. The effects of modifying these values are considered in terms of the clustering pattern—which indicates which sequences are clustered together—and the cluster rankings—which indicates the instability score of each MIG relative to the others. Increasing the weight for gene loss had very little effect; at even triple its default value it only caused four small MIGs out of the 47 from the vertebrate Cyp tree to be merged with their sister groups. Decreasing the gene duplication weight had much the same effect, causing five MIGs to be merged when it was set to 1/3 of the default value. Increasing the weights for duplication and loss together had no effect on the clustering pattern, and very minimal effect on the cluster rankings. Decreasing both weights together had the same effect as increasing the spread weight, which tended to break up larger MIGs. Decreasing the spread weight to zero had minimal impact, only merging two singleton groups with their neighbors. Decreasing the incongruence weight had no effect, and increasing it had little impact until it became very high, at which point it tended to break up groups.
Phylogenetic instability of Caenorhabditis collagens
Across the eight species of Caenorhabditis, we found 1,349 collagen genes (Table S1). The characteristic Gly-X-Y repeat domain can vary greatly in length, presenting a problem for the usual alignment-guided phylogenetics. To overcome this, we used a k-mer based distance matrix (Chan et al., 2014). Default settings were used to cluster the protein phylogeny and subsequently score each cluster’s phylogenetic instability (Fig. 3). A total of 244 MIGs were generated, with 41 MIGs containing proteins from all eight species, 60 MIGs covering any seven of the species, and 151 MIGs containing at least one protein from C. elegans. Twenty-five of these 151 MIGs contained a C. elegans protein encoded by a gene whose knock-down is associated with an observable phenotype. The scores of these 25 MIGs were significantly lower than those of the remaining 126 MIGs (medians = 2.02 and 3.22; U test statistic = 991; p = 0.002).
Discussion
Positive selection, pseudogenization, and the presence of tandem gene arrays are characteristic of rapidly evolving genes, such as those involved in xenobiotic interactions (Thomas, 2007). Even though the MIPhy analysis does not incorporate any of this information, every human Cyp sequence with these characteristics received a high instability score (Fig. 2). These predictions appear to extend to the functional role of the enzymes as well, as MIPhy performed very well at classifying the human Cyp proteins into those primarily acting on xenobiotic or endogenous substrates. All enzymes with known endogenous functions had low scores, while all but two with primarily xenobiotic substrates had high instability scores; these exceptions were Cyp-1A1 and Cyp-1A2. While the latter is one of the most important human enzymes involved in xenobiotic metabolism, it has been suggested that both also have important endogenous roles (Zhou et al., 2009; Kapitulnik & Gonzalez, 1993), which may have shaped their evolutionary history in the vertebrate species studied here.
The predictions can be extended to species for which detailed substrate specificity information is limited. The sequences from terrestrial species in the MIG containing human Cyp-27A1 appear stable, but those of the aquatic or amphibious species do not. This observation suggests that these paralogs may play some role specific to aquatic environments. A similar observation can be made about the cluster containing human Cyp-2W1. It has the second-highest instability score, and of the 43 total sequences there is only one each from human, macaque, mouse, and cow. There are 16 from frog, 10 from zebrafish, and four from pufferfish, which would suggest that these paralogs may also have evolved to metabolize substrates specific to an aquatic environment, and that this capacity was lost in terrestrial species.
The collagens are a large gene family encoding structural proteins. Most members have been investigated in the past using gene knock-down assays in C. elegans, resulting in observed phenotypic changes for approximately 15%. While not all the remaining genes have been investigated, many have, suggesting widespread functional redundancy. Using MIPhy to cluster and score the collagen gene phylogeny showed that we could prioritize genes for detailed functional assays, and that a low phylogenetic instability score was a good predictor of genes with observable knock-down phenotypes. Further, this demonstrated that MIPhy is agnostic to the methods or underlying characters used to construct a gene tree, and so is applicable to a wide range of data.
An additional use of MIPhy is in the naming of genes, specifically towards generating hierarchical naming conventions using an evolutionary framework. Because a sequence identity threshold was used when annotating Cyp proteins, one may reasonably assume that Cyp-3A4 and Cyp-3A5 have related functions, as they are likely closely related. Conversely, no such assumptions may be made about many other gene families, whose members have often been annotated in order of discovery. This can pose a problem with the discovery of novel genes. If two species possess the example genes pqr-21 and pqr-22, and one of them additionally possesses a paralog to pqr-21, this paralog will be named with the next available number; perhaps pqr-42. This single tiered naming system does not accommodate any way to suggest that pqr-21 and pqr-42 are related to each other. We propose that a phylogenetic analysis like MIPhy could be used to cluster such a gene family into sub-families, and that these clusters could be used to inform a multi-tiered naming system that is better able to accommodate newly discovered gene members. This is an issue that is going to arise more often as increasing numbers of species are being sequenced.
The predictions from these analyses would be complementary to a between-genes positive selection analysis, which is the most commonly used measure of adaptive evolution. While a codon-based positive selection test measures the patterns of sequence variation, phylogenetic instability combines the relative sequence variation between species (from the cluster spread and incongruence events) with the most likely history of duplications and losses.
However, MIPhy does have its limitations. It is very sensitive to the given gene tree and does not currently incorporate any measures of uncertainty such as bootstrapping. We recognize that this information could be useful in an analysis of phylogenetic instability, for example by differentiating true gene events from those that may simply be phylogenetic artifacts, but leave this for a future version of the software. There are also exceptions to the assumption that phylogenetic instability is a hallmark of adaptive evolution; the most well-known may be the beta-globin genes that form part of hemoglobin. These genes exhibit sequence polymorphism within and between human populations, as well as lineage-specific expansions and contractions in gene cluster size, and yet continue to play a vital endogenous role (Hill & Wainscoat, 1986; Opazo, Hoffmann & Storz, 2008).
Conclusion
This work presents, to our knowledge, the first algorithm for simultaneous reconciliation and clustering of large gene families. MIPhy’s instability score has proven to be a valuable tool for identifying the members of gene families that exhibit characteristics of adaptive evolution and for predicting collagens that play an important functional role in C. elegans, and it agrees very well with the known substrate specificity of human Cyp enzymes. It is a useful tool to gain an understanding of the evolution of large gene families, and to generate hypotheses about the potential functional roles of both the stable and unstable sequences.
Acknowledgments
We thank everyone who has tested the software during its development. We also wish to thank Dr. Dannie Durand for her work on reconciliation algorithms, and for discussions on the nuances and applications of this work. Finally, we are grateful to Dr. James Thomas for providing us with his past data so that MIPhy could be validated on his published work.
Funding Statement
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) through a Discovery Grant (#06239-2015) to James Wasmuth and a Collaborative Research and Training Experience (CREATE) program in Host-Parasite Interactions (#413888-2012) to James Wasmuth and John Gilleard (and others), and by Alberta Innovates – Technology Futures through a doctoral scholarship to David Curran. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Contributor Information
David M. Curran, Email: dmcurran@ucalgary.ca.
James D. Wasmuth, Email: jwasmuth@ucalgary.ca.
Additional Information and Declarations
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
David M. Curran conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.
John S. Gilleard conceived and designed the experiments, authored or reviewed drafts of the paper, approved the final draft.
James D. Wasmuth conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.
Data Availability
The following information was supplied regarding data availability:
MIPhy is freely available at https://github.com/dave-the-scientist/miphy under a BSD 2-clause license. It is also available as an online tool at http://miphy.wasmuthlab.org.
References
- Bansal, Alm & Kellis (2012). Bansal MS, Alm EJ, Kellis M. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics. 2012;28(12):i283–i291. doi: 10.1093/bioinformatics/bts225.
- Boussau et al. (2013). Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Research. 2013;23(2):323–330. doi: 10.1101/gr.141978.112.
- Camacho et al. (2009). Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST plus: architecture and applications. BMC Bioinformatics. 2009;10(421):421. doi: 10.1186/1471-2105-10-421.
- Carstens & Knowles (2007). Carstens BC, Knowles LL. Estimating species phylogeny from gene-tree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers. Systematic Biology. 2007;56(3):400–411. doi: 10.1080/10635150701405560.
- Chan et al. (2014). Chan CX, Bernard G, Poirion O, Hogan JM, Ragan MA. Inferring phylogenies of evolving sequences without multiple sequence alignment. Scientific Reports. 2014;4(1):6504. doi: 10.1038/srep06504.
- Chan, Ranwez & Scornavacca (2017). Chan YB, Ranwez V, Scornavacca C. Inferring incomplete lineage sorting, duplications, transfers and losses with reconciliations. Journal of Theoretical Biology. 2017;432:1–13. doi: 10.1016/j.jtbi.2017.08.008.
- Chaudhary et al. (2015). Chaudhary R, Boussau B, Burleigh JG, Fernández-Baca D. Assessing approaches for inferring species trees from multi-copy genes. Systematic Biology. 2015;64(2):325–339. doi: 10.1093/sysbio/syu128.
- Chen, Durand & Farach-Colton (2000). Chen K, Durand D, Farach-Colton M. NOTUNG: a program for dating gene duplications and optimizing gene family trees. Journal of Computational Biology. 2000;7(3-4):429–447. doi: 10.1089/106652700750050871.
- Colombo et al. (2006). Colombo L, Valle LD, Fiore C, Armanini D, Belvedere P. Aldosterone and the conquest of land. Journal of Endocrinological Investigation. 2006;29(4):373. doi: 10.1007/bf03344112.
- Davies & Bouldin (1979). Davies DL, Bouldin DW. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979;1(2):224–227. doi: 10.1109/TPAMI.1979.4766909.
- De Bie et al. (2006). De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006;22(10):1269–1271. doi: 10.1093/bioinformatics/btl097.
- De Bono, Madera & Chothia (2004). De Bono B, Madera M, Chothia C. VH gene segments in the mouse and human genomes. Journal of Molecular Biology. 2004;342(1):131–143. doi: 10.1016/j.jmb.2004.06.055.
- Doyon, Hamel & Chauve (2012). Doyon JP, Hamel S, Chauve C. An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012;9(1):26–39. doi: 10.1109/TCBB.2011.64.
- Doyon et al. (2010). Doyon JP, Scornavacca C, Gorbunov KY, Szöllősi GJ, Ranwez V, Berry V. An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Lecture Notes in Computer Science. 2010;6398:93–108. doi: 10.1007/978-3-642-16181-0_9.
- Felsenstein (1989). Felsenstein J. PHYLIP - phylogeny inference package (version 3.2). Cladistics. 1989;5(2):163–166. doi: 10.1111/j.1096-0031.1989.tb00562.x.
- Fierst et al. (2015). Fierst JL, Willis JH, Thomas CG, Wang W, Reynolds RM, Ahearne TE, Cutter AD, Phillips PC. Reproductive mode and the evolution of genome size and structure in Caenorhabditis Nematodes. PLOS Genetics. 2015;11(6):e1005323. doi: 10.1371/journal.pgen.1005323.
- Félix, Braendle & Cutter (2014). Félix MA, Braendle C, Cutter AD. A streamlined system for species diagnosis in Caenorhabditis (Nematoda: Rhabditidae) with name designations for 15 distinct biological species. PLOS ONE. 2014;9(4):4. doi: 10.1371/journal.pone.0094723.
- Gilabert et al. (2016). Gilabert A, Curran DM, Harvey SC, Wasmuth JD. Expanding the view on the evolution of the nematode dauer signalling pathways: refinement through gene gain and pathway co-option. BMC Genomics. 2016;17(1):476. doi: 10.1186/s12864-016-2770-7.
- Han et al. (2013). Han MV, Thomas GWC, Lugo-Martinez J, Hahn MW. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Molecular Biology and Evolution. 2013;30(8):1987–1997. doi: 10.1093/molbev/mst100.
- Hill & Wainscoat (1986). Hill AV, Wainscoat JS. The evolution of the alpha- and beta-globin gene clusters in human populations. Human Genetics. 1986;74(1):16–23. doi: 10.1007/BF00278779.
- Howe et al. (2016). Howe KL, Bolt BJ, Cain S, Chan J, Chen WJ, Davis P, Done J, Down T, Gao S, Grove C, Harris TW, Kishore R, Lee R, Lomax J, Li Y, Muller HM, Nakamura C, Nuin P, Paulini M, Raciti D, Schindelman G, Stanley E, Tuli MA, Van Auken K, Wang D, Wang X, Williams G, Wright A, Yook K, Berriman M, Kersey P, Schedl T, Stein L, Sternberg PW. WormBase 2016: expanding to enable helminth genomic research. Nucleic Acids Research. 2016;44(D1):D774–D780. doi: 10.1093/nar/gkv1217.
- Huang et al. (2014). Huang RE, Ren X, Qiu Y, Zhao Z. Description of Caenorhabditis sinica Sp. N. (Nematoda: Rhabditidae), a nematode species used in comparative biology for C. elegans. PLOS ONE. 2014;9(11):e110957. doi: 10.1371/journal.pone.0110957.
- Hurley, Hale & Prince (2005). Hurley I, Hale ME, Prince VE. Duplication events and the evolution of segmental identity. Evolution & Development. 2005;7(6):556–567. doi: 10.1111/j.1525-142X.2005.05059.x.
- Kapitulnik & Gonzalez (1993). Kapitulnik J, Gonzalez FJ. Marked endogenous activation of the CYP1A1 and CYP1A2 genes in the congenitally jaundiced Gunn rat. Molecular Pharmacology. 1993;43(5):722–725.
- Kaplan & Vidyashankar (2012). Kaplan RM, Vidyashankar AN. An inconvenient truth: global worming and anthelmintic resistance. Veterinary Parasitology. 2012;186(1–2):70–78. doi: 10.1016/j.vetpar.2011.11.048.
- Kimura (1980). Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution. 1980;16(2):111–120. doi: 10.1007/BF01731581.
- Librado, Vieira & Rozas (2012). Librado P, Vieira FG, Rozas J. BadiRate: estimating family turnover rates by likelihood-based methods. Bioinformatics. 2012;28(2):279–281. doi: 10.1093/bioinformatics/btr623.
- Lynch & Conery (2000). Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290(5494):1151–1155. doi: 10.1126/science.290.5494.1151.
- Ma et al. (2008). Ma J, Ratan A, Raney BJ, Suh BB, Zhang L, Miller W, Haussler D. DUPCAR: reconstructing contiguous ancestral regions with duplications. Journal of Computational Biology. 2008;15(8):1007–1027. doi: 10.1089/cmb.2008.0069.
- Mirarab, Bayzid & Warnow (2016). Mirarab S, Bayzid MS, Warnow T. Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Systematic Biology. 2016;65(3):366–380. doi: 10.1093/sysbio/syu063.
- Nei, Gu & Sitnikova (1997). Nei M, Gu X, Sitnikova T. Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proceedings of the National Academy of Sciences of the United States of America. 1997;94(15):7799–7806. doi: 10.1073/pnas.94.15.7799.
- Niimura & Nei (2005). Niimura Y, Nei M. Evolutionary changes of the number of olfactory receptor genes in the human and mouse lineages. Gene. 2005;346:23–28. doi: 10.1016/j.gene.2004.09.027.
- Opazo, Hoffmann & Storz (2008). Opazo JC, Hoffmann FG, Storz JF. Genomic evidence for independent origins of beta-like globin genes in monotremes and therian mammals. Proceedings of the National Academy of Sciences of the United States of America. 2008;105(5):1590–1595. doi: 10.1073/pnas.0710531105.
- Page & Johnstone (2007). Page AP, Johnstone IL. Kramer JM, Moerman DG, editors. The cuticle. WormBook. 2007:1–15. doi: 10.1895/wormbook.1.138.1.
- Rasmussen & Kellis (2007). Rasmussen MD, Kellis M. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Research. 2007;17(12):1932–1942. doi: 10.1101/gr.7105007.
- Rasmussen & Kellis (2011). Rasmussen MD, Kellis M. A Bayesian approach for fast and accurate gene tree reconstruction. Molecular Biology and Evolution. 2011;28(1):273–290. doi: 10.1093/molbev/msq189.
- Rousseeuw (1987). Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20(C):53–65. doi: 10.1016/0377-0427(87)90125-7.
- Saunders & Lon (2016). Saunders D, Lon C. Combination therapies for malaria are failing-what next? Lancet Infectious Diseases. 2016;16(3):274. doi: 10.1016/S1473-3099(15)00525-3.
- Scally et al. (2012). Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth A, Lappalainen T, Mailund T, Marques-Bonet T, McCarthy S, Montgomery SH, Schwalie PC, Tang YA, Ward MC, Xue Y, Yngvadottir B, Alkan C, Andersen LN, Ayub Q, Ball EV, Beal K, Bradley BJ, Chen Y, Clee CM, Fitzgerald S, Graves TA, Gu Y, Heath P, Heger A, Karakoc E, Kolb-Kokocinski A, Laird GK, Lunter G, Meader S, Mort M, Mullikin JC, Munch K, O’Connor TD, Phillips AD, Prado-Martinez J, Rogers AS, Sajjadian S, Schmidt D, Shaw K, Simpson JT, Stenson PD, Turner DJ, Vigilant L, Vilella AJ, Whitener W, Zhu B, Cooper DN, de Jong P, Dermitzakis ET, Eichler EE, Flicek P, Goldman N, Mundy NI, Ning Z, Odom DT, Ponting CP, Quail MA, Ryder OA, Searle SM, Warren WC, Wilson RK, Schierup MH, Rogers J, Tyler-Smith C, Durbin R. Insights into hominid evolution from the gorilla genome sequence. Nature. 2012;483(7388):169–175. doi: 10.1038/nature10842.
- Sievers et al. (2011). Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Molecular Systems Biology. 2011;7(1):539. doi: 10.1038/msb.2011.75.
- Stamatakis (2014). Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–1313. doi: 10.1093/bioinformatics/btu033.
- Stein et al. (2003). Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, Coulson A, D’Eustachio P, Fitch DH, Fulton LA, Fulton RE, Griffiths-Jones S, Harris TW, Hillier LW, Kamath R, Kuwabara PE, Mardis ER, Marra MA, Miner TL, Minx P, Mullikin JC, Plumb RW, Rogers J, Schein JE, Sohrmann M, Spieth J, Stajich JE, Wei C, Willey D, Wilson RK, Durbin R, Waterston RH. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLOS Biology. 2003;1(2):e45. doi: 10.1371/journal.pbio.0000045.
- Stolzer et al. (2012). Stolzer M, Lai H, Xu M, Sathaye D, Vernot B, Durand D. Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics. 2012;28(18):i409–i415. doi: 10.1093/bioinformatics/bts386.
- Su et al. (1999). Su C, Jakobsen I, Gu X, Nei M. Diversity and evolution of T-cell receptor variable region genes in mammals and birds. Immunogenetics. 1999;50(5–6):301–308. doi: 10.1007/s002510050606.
- Szűcs et al. (2017). Szűcs M, Vahsen ML, Melbourne BA, Hoover C, Weiss-Lehman C, Hufbauer RA. Rapid adaptive evolution in novel environments acts as an architect of population range expansion. Proceedings of the National Academy of Sciences of the United States of America. 2017;114(51):201712934. doi: 10.1073/pnas.1712934114.
- The C. elegans Sequencing Consortium (1998). The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282(5396):2012–2018. doi: 10.1126/science.282.5396.2012.
- Thomas (2007). Thomas JH. Rapid birth-death evolution specific to xenobiotic cytochrome P450 genes in vertebrates. PLOS Genetics. 2007;3(5):e67. doi: 10.1371/journal.pgen.0030067.
- Thomas et al. (2005). Thomas JH, Kelley JL, Robertson HM, Ly K, Swanson WJ. Adaptive evolution in the SRZ chemoreceptor families of Caenorhabditis elegans and Caenorhabditis briggsae. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(12):4476–4481. doi: 10.1073/pnas.0406469102.
- Torgerson (1952). Torgerson WS. Multidimensional scaling: I. theory and method. Psychometrika. 1952;17(4):401–419. doi: 10.1007/BF02288916.
- Vernot et al. (2008). Vernot B, Stolzer M, Goldman A, Durand D. Reconciliation with non-binary species trees. Journal of Computational Biology. 2008;15(8):981–1006. doi: 10.1089/cmb.2008.0092.
- Wasmuth et al. (2012). Wasmuth JD, Pszenny V, Haile S, Jansen EM, Gast AT, Sher A, Boyle JP, Boulanger MJ, Parkinson J, Grigg ME. Integrated bioinformatic and targeted deletion analyses of the SRS gene superfamily identify SRS29C as a negative regulator of toxoplasma virulence. mBio. 2012;3(6):e00321–12. doi: 10.1128/mBio.00321-12.
- Zhang (2003). Zhang J. Evolution by gene duplication: an update. Trends in Ecology & Evolution. 2003;18(6):292. doi: 10.1016/S0169-5347(03)00033-8.
- Zhou et al. (2009). Zhou SF, Yang LP, Zhou ZW, Liu YH, Chan E. Insights into the substrate specificity, inhibitors, regulation, and polymorphisms and the clinical impact of human cytochrome P450 1A2. AAPS Journal. 2009;11(3):481–494. doi: 10.1208/s12248-009-9127-y.